
Naive Bayesian Classifier: a very simple Python library implementing a naive Bayes classifier ...


Open-source project name:

Naive Bayesian Classifier

Open-source project URL:

https://gitee.com/mirrors/naive-bayes-classifier

Open-source project introduction:

Naive Bayesian Classifier

Yet another general-purpose Naive Bayesian classifier.

## Installation

You can install this package using the following pip command:

$ sudo pip install naiveBayesClassifier

## Example

"""Suppose you have some texts of news and know their categories.You want to train a system with this pre-categorized/pre-classified texts. So, you have better call this data your training set."""from naiveBayesClassifier import tokenizerfrom naiveBayesClassifier.trainer import Trainerfrom naiveBayesClassifier.classifier import ClassifiernewsTrainer = Trainer(tokenizer.Tokenizer(stop_words = [], signs_to_remove = ["?!#%&"]))# You need to train the system passing each text one by one to the trainer module.newsSet =[    {'text': 'not to eat too much is not enough to lose weight', 'category': 'health'},    {'text': 'Russia is trying to invade Ukraine', 'category': 'politics'},    {'text': 'do not neglect exercise', 'category': 'health'},    {'text': 'Syria is the main issue, Obama says', 'category': 'politics'},    {'text': 'eat to lose weight', 'category': 'health'},    {'text': 'you should not eat much', 'category': 'health'}]for news in newsSet:    newsTrainer.train(news['text'], news['category'])# When you have sufficient trained data, you are almost done and can start to use# a classifier.newsClassifier = Classifier(newsTrainer.data, tokenizer.Tokenizer(stop_words = [], signs_to_remove = ["?!#%&"]))# Now you have a classifier which can give a try to classifiy text of news whose# category is unknown, yet.unknownInstance = "Even if I eat too much, is not it possible to lose some weight"classification = newsClassifier.classify(unknownInstance)# the classification variable holds the possible categories sorted by # their probablity valueprint classification

Note: You will certainly need much more training data than the amount in the example above. A few lines of text like those in the example are nowhere near a sufficient training set.

## What is the Naive Bayes Theorem and Classifier

There is no need to explain everything once again here. Instead, one of the most eloquent explanations is quoted below.

The following explanation is quoted from another Bayes classifier which is written in Go.

BAYESIAN CLASSIFICATION REFRESHER: suppose you have a set of classes (e.g. categories) C := {C_1, ..., C_n}, and a document D consisting of words D := {W_1, ..., W_k}. We wish to ascertain the probability that the document belongs to some class C_j given some set of training data associating documents and classes.

By Bayes' Theorem, we have that

P(C_j|D) = P(D|C_j)*P(C_j)/P(D).

The LHS is the probability that the document belongs to class C_j given the document itself (by which is meant, in practice, the word frequencies occurring in this document), and our program will calculate this probability for each j and spit out the most likely class for this document.
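To make the formula concrete, here is a tiny worked example in Python with invented numbers (the priors and likelihoods below are assumptions for illustration only, not values produced by this library):

```python
# Toy illustration of Bayes' theorem with invented numbers for two classes.
prior = {'health': 0.6, 'politics': 0.4}            # P(C_j), assumed
likelihood = {'health': 0.002, 'politics': 0.0005}  # P(D|C_j), assumed

# P(D) = SUM_j P(D|C_j) * P(C_j)
evidence = sum(likelihood[c] * prior[c] for c in prior)

# P(C_j|D) = P(D|C_j) * P(C_j) / P(D)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
print(posterior)  # {'health': ~0.857, 'politics': ~0.143}
```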

P(C_j) is referred to as the "prior" probability, or the probability that a document belongs to C_j in general, without seeing the document first. P(D|C_j) is the probability of seeing such a document, given that it belongs to C_j. Here, by assuming that words appear independently in documents (this being the "naive" assumption), we can estimate

P(D|C_j) ~= P(W_1|C_j)*...*P(W_k|C_j)

where P(W_i|C_j) is the probability of seeing the given word in a document of the given class. Finally, P(D) can be seen as merely a scaling factor and is not strictly relevant to classification, unless you want to normalize the resulting scores and actually see probabilities. In this case, note that

P(D) = SUM_j(P(D|C_j)*P(C_j))
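Putting the prior, the naive product, and the scaling factor together, a minimal sketch in plain Python might look like the following; the word counts, smoothing, and vocabulary size are illustrative assumptions, not the library's internal implementation:

```python
# Sketch: estimate P(W_i|C_j) from word counts, apply the naive product,
# and normalize by P(D) to obtain posterior probabilities.
word_counts = {
    'health':   {'eat': 3, 'weight': 2, 'exercise': 1},
    'politics': {'russia': 1, 'syria': 1, 'obama': 1},
}
prior = {'health': 4 / 6.0, 'politics': 2 / 6.0}   # from the 6 training texts above

def word_prob(word, category, vocab_size=6, smoothing=1.0):
    # Laplace-smoothed estimate of P(word | category); the smoothing is an assumption.
    counts = word_counts[category]
    total = sum(counts.values())
    return (counts.get(word, 0) + smoothing) / (total + smoothing * vocab_size)

document = ['eat', 'weight']
scores = {}
for category in prior:
    score = prior[category]                        # start with P(C_j)
    for word in document:                          # naive assumption: independent words
        score *= word_prob(word, category)         # multiply in P(W_i|C_j)
    scores[category] = score                       # P(D|C_j) * P(C_j)

evidence = sum(scores.values())                    # P(D)
posteriors = {c: s / evidence for c, s in scores.items()}
print(max(posteriors, key=posteriors.get), posteriors)
```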

One practical issue with performing these calculations is the possibility of float64 underflow when calculating P(D|C_j), as individual word probabilities can be arbitrarily small, and a document can have an arbitrarily large number of them. A typical method for dealing with this case is to transform the probability to the log domain and perform additions instead of multiplications:

log P(C_j|D) ~ log(P(C_j)) + SUM_i(log P(W_i|C_j))

where i = 1, ..., k. Note that by doing this, we are discarding the scaling factor P(D) and our scores are no longer probabilities; however, the monotonic relationship of the scores is preserved by the log function.
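A minimal sketch of the same scoring in the log domain (assuming per-word probabilities are already estimated by some `word_prob` function, as in the sketch above):

```python
import math

def log_score(doc_words, category, prior, word_prob):
    # log P(C_j) + SUM_i log P(W_i|C_j); this is no longer a probability,
    # but the ranking of categories is unchanged because log is monotonic.
    score = math.log(prior[category])
    for word in doc_words:
        score += math.log(word_prob(word, category))
    return score
```

Taking the category with the highest log score therefore picks the same class as taking the argmax of the true posteriors.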

If you are very curious about Naive Bayes Theorem, you may find the following list helpful:

## Improvements

This classifier uses a very simple tokenizer, which is just a module that splits sentences into words. If your training set is large, you can rely on the provided tokenizer; otherwise, you should plug in a better tokenizer specialized for the language of your training texts, as sketched below.
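For example, a slightly better English tokenizer could lowercase the input, split on non-letter characters, and drop stop words. The class below is only a sketch; its `tokenize` method name is an assumption about the interface the Trainer and Classifier expect, so check the library's Tokenizer before swapping it in:

```python
import re

class SimpleEnglishTokenizer(object):
    """Hypothetical replacement tokenizer (not part of this library)."""

    def __init__(self, stop_words=None):
        self.stop_words = set(stop_words or [])

    def tokenize(self, text):
        # Lowercase, split on anything that is not a letter, drop stop words.
        words = re.split(r'[^a-z]+', text.lower())
        return [w for w in words if w and w not in self.stop_words]
```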

TODO

  • inline docs
  • unit-tests

AUTHORS

  • Mustafa Atik @muatik
  • Nejdet Yucesoy @nejdetckenobi
