Document Filtering (Naive Bayes Method) in Python
2015-11-28 16:45
The algorithms we mentioned can solve the more general problem of learning to recognize whether a document belongs in one category or another.
Early attempts to filter spam were all rule-based classifiers; however, once spammers learned those rules, they stopped exhibiting the obvious behaviors in order to get around the filters. To solve this problem, we can train classifiers continually: they start from an initial dataset and keep updating as more messages arrive, with separate instances and datasets for individual users, groups, or sites.
STEP1: Feature Extraction
The classifier that you will be building needs features to use for classifying different items. A feature is anything that you can determine as being either present or absent in the item.
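For text documents, the features are simply the distinct words. A minimal sketch of such an extractor, assuming we skip very short and very long strings (the 2–20 length bounds are a common heuristic, not fixed by the text):

```python
import re

def getwords(doc):
    # Split the document on any run of non-word characters.
    splitter = re.compile(r'\W+')
    words = [s.lower() for s in splitter.split(doc)
             if 2 < len(s) < 20]
    # Each distinct word is a feature that is either present or absent.
    return set(words)

print(getwords('The quick brown fox jumps over the lazy dog'))
```

Returning a set means a word counts once per document, no matter how often it repeats, which matches the "present or absent" definition of a feature above.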
STEP2: Training the Classifier
The classifiers learn how to classify a document by being trained. They are specifically designed to start off very uncertain and to grow more certain as they learn which features are important for making a distinction.
We can easily get the conditional probability, "the probability of A given B", which is also written as Pr(A|B). In our case, we can get
Pr(word|classification)
However, using only the information seen so far makes the classifier incredibly sensitive during early training, especially to words that appear very rarely. To get around this problem, we need to decide on an assumed probability, which will be used when we have very little information about the feature in question. Besides, we need to decide how much weight to give the assumed probability.
We define the new weighted probability as:
probability = (weight × assumed probability + count × observed probability) / (weight + count)
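To see how the weighting behaves, here is a small worked example (the counts and probabilities are made up for illustration):

```python
def weightedprob(basicprob, count, weight=1.0, ap=0.5):
    # Blend the observed probability with the assumed one,
    # weighting the observation by how often the feature was seen.
    return (weight * ap + count * basicprob) / (weight + count)

# After one sighting, the estimate is pulled halfway toward the prior:
print(weightedprob(1.0, 1))   # 0.75
# After many sightings, the observed probability dominates:
print(weightedprob(1.0, 9))   # 0.95
```

With zero sightings the formula returns the assumed probability exactly, which is what protects rare words early in training.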
An Introduction to the Classifiers
A Naive Classifier
An introduction to the naive Bayes classifier is given in 朴素贝叶斯学习笔记 (my naive Bayes study notes).
In our case, we assume that the probability of one word in the document being in a specific category is unrelated to the probability of the other words being in that category. We can easily get:
Pr(C|D) = Pr(D|C) × Pr(C) / Pr(D)
where C denotes the category and D the document.
The next step in building the naive Bayes classifier is actually deciding in which category a new item belongs. In some applications it’s better for the classifier to admit that it doesn’t know the answer than to decide that the answer is the category with a marginally higher probability.
We use a threshold to capture this idea: the item is assigned to the best category C_best only if
Pr(C_best|D) > threshold × Pr(C|D)
holds for every other category C; otherwise the classifier returns a default "unknown" category.
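A minimal sketch of this decision rule, operating on an already-computed probability table (the probabilities and the default threshold of 3.0 are illustrative):

```python
def classify_with_threshold(probs, threshold=3.0, default='unknown'):
    # Pick the category with the highest probability.
    best = max(probs, key=probs.get)
    # Commit to it only if it beats every rival by the threshold factor.
    for cat, p in probs.items():
        if cat == best:
            continue
        if p * threshold > probs[best]:
            return default
    return best

print(classify_with_threshold({'good': 0.9, 'bad': 0.2}))  # 'good'
print(classify_with_threshold({'good': 0.5, 'bad': 0.4}))  # 'unknown'
```

A high threshold makes the classifier conservative: with threshold = 3.0, a marginal winner like 0.5 vs 0.4 is refused, because misclassifying is costlier than admitting uncertainty.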
The Fisher Method
Unlike the naive Bayesian filter, which uses the feature probabilities to create a whole document probability, the Fisher method calculates the probability of a category for each feature in the document, then combines the probabilities and tests to see if the set of probabilities is more or less likely than a random set.
P(C_j|F_i) = P(F_i|C_j) / Σ_k P(F_i|C_k)
where C_j means the jth category and F_i means the ith feature; the sum in the denominator runs over all categories.
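A minimal sketch of this per-feature normalization, assuming the per-category frequencies P(F_i|C) are already known (the numbers below are made up for illustration):

```python
def cprob(fprobs, cat):
    # Normalize the feature's frequency in one category by its
    # frequency across all categories.
    clf = fprobs[cat]
    if clf == 0:
        return 0
    return clf / sum(fprobs.values())

# The word appears in 75% of 'good' documents and 25% of 'bad' ones,
# so given only this word, 'good' gets probability 0.75.
fprobs = {'good': 0.75, 'bad': 0.25}
print(cprob(fprobs, 'good'))  # 0.75
```

Note that this normalizes by frequency rather than raw counts, so a category with few training documents is not drowned out by a larger one.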
import math

class classifier:
    def __init__(self, getfeatures, filename=None):
        self.fc = {}          # feature -> {category: count}
        self.cc = {}          # category -> number of documents
        self.getfeatures = getfeatures
        self.thresholds = {}

    def setthreshold(self, cat, t):
        self.thresholds[cat] = t

    def getthreshold(self, cat):
        if cat not in self.thresholds:
            return 1.0
        return self.thresholds[cat]

    def classify(self, item, default=None):
        probs = {}
        # Find the category with the highest probability.
        maxprob = 0.0
        best = default
        for cat in self.categories():
            probs[cat] = self.prob(item, cat)
            if probs[cat] > maxprob:
                maxprob = probs[cat]
                best = cat
        # Make sure the winner beats every rival by the threshold factor.
        for cat in probs:
            if cat == best:
                continue
            if probs[cat] * self.getthreshold(best) > probs[best]:
                return default
        return best

    def incf(self, f, cat):
        # Increase the count of a feature/category pair.
        self.fc.setdefault(f, {})
        self.fc[f].setdefault(cat, 0)
        self.fc[f][cat] += 1

    def incc(self, cat):
        # Increase the count of a category.
        self.cc.setdefault(cat, 0)
        self.cc[cat] += 1

    def fcount(self, f, cat):
        # Number of times a feature appeared in a category.
        if f in self.fc and cat in self.fc[f]:
            return float(self.fc[f][cat])
        return 0.0

    def catcount(self, cat):
        # Number of items in a category.
        if cat in self.cc:
            return float(self.cc[cat])
        return 0.0

    def totalcount(self):
        return sum(self.cc.values())

    def categories(self):
        return self.cc.keys()

    def train(self, item, cat):
        features = self.getfeatures(item)
        for f in features:
            self.incf(f, cat)
        self.incc(cat)

    ########################################################
    def ffcount(self, f):
        # Total count of a feature across all categories.
        if f in self.fc:
            return sum(self.fc[f].values())
        return 0

    def ttotal(self):
        s = 0.0
        for f in self.fc:
            s += self.ffcount(f)
        return s

    def docprob(self, item):
        features = self.getfeatures(item)
        p = 1
        for f in features:
            p *= (self.ffcount(f) / self.ttotal())
        return p

    ########################################################
    def fprob(self, f, cat):
        # Pr(feature | category).
        if self.catcount(cat) == 0:
            return 0
        return self.fcount(f, cat) / self.catcount(cat)

    def weightedprob(self, f, cat, prf, weight=1.0, ap=0.5):
        # Blend the observed probability with an assumed prior.
        basicprob = prf(f, cat)
        totals = sum([self.fcount(f, c) for c in self.categories()])
        bp = ((weight * ap) + (totals * basicprob)) / (weight + totals)
        return bp


class naivebayes(classifier):
    def docprob(self, item, cat):
        # Pr(document | category): product of the word probabilities.
        features = self.getfeatures(item)
        p = 1
        for f in features:
            p *= self.weightedprob(f, cat, self.fprob)
        return p

    def prob(self, item, cat):
        # Pr(category | document), up to the constant Pr(document).
        catprob = self.catcount(cat) / self.totalcount()
        docprob = self.docprob(item, cat)
        return catprob * docprob


class fisherclassifier(classifier):
    def cprob(self, f, cat):
        # Pr(category | feature), normalized across categories.
        clf = self.fprob(f, cat)
        if clf == 0:
            return 0
        freqsum = sum([self.fprob(f, c) for c in self.categories()])
        return clf / freqsum

    def fisherprob(self, item, cat):
        # Multiply the per-feature probabilities together, then apply
        # the inverse chi-square function to -2 * ln(product).
        p = 1
        features = self.getfeatures(item)
        for f in features:
            p *= self.weightedprob(f, cat, self.cprob)
        fscore = -2 * math.log(p)
        return self.invchi2(fscore, len(features) * 2)

    def invchi2(self, chi, df):
        m = chi / 2.0
        sum1 = term = math.exp(-m)
        for i in range(1, df // 2):
            term *= m / i
            sum1 += term
        return min(sum1, 1.0)
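To see the pieces working together, below is a condensed, self-contained sketch of the naive Bayes path (feature extraction, training, weighted probabilities, classification); it mirrors the class above, and the training sentences are just toy examples:

```python
import re

def getwords(doc):
    # Distinct lowercase words as present/absent features.
    return set(w.lower() for w in re.split(r'\W+', doc) if 2 < len(w) < 20)

class naivebayes:
    def __init__(self):
        self.fc, self.cc = {}, {}

    def train(self, item, cat):
        for f in getwords(item):
            self.fc.setdefault(f, {}).setdefault(cat, 0)
            self.fc[f][cat] += 1
        self.cc[cat] = self.cc.get(cat, 0) + 1

    def fcount(self, f, cat):
        return float(self.fc.get(f, {}).get(cat, 0))

    def fprob(self, f, cat):
        return self.fcount(f, cat) / self.cc[cat]

    def weightedprob(self, f, cat, weight=1.0, ap=0.5):
        totals = sum(self.fcount(f, c) for c in self.cc)
        return (weight * ap + totals * self.fprob(f, cat)) / (weight + totals)

    def prob(self, item, cat):
        # Pr(category) times the product of per-word weighted probabilities.
        p = self.cc[cat] / sum(self.cc.values())
        for f in getwords(item):
            p *= self.weightedprob(f, cat)
        return p

cl = naivebayes()
cl.train('Nobody owns the water.', 'good')
cl.train('the quick rabbit jumps fences', 'good')
cl.train('buy pharmaceuticals now', 'bad')
cl.train('make quick money at the online casino', 'bad')

# 'rabbit' has only appeared in good documents, so 'good' wins.
print(cl.prob('quick rabbit', 'good') > cl.prob('quick rabbit', 'bad'))
```

Note that `weightedprob` here folds the `prf` callback of the full class into a direct call to `fprob`, since this sketch only follows the naive Bayes path, not the Fisher one.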
REFERENCE
《Programming Collective Intelligence》