Naive Bayes classifier
2014-03-10 10:29
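Today I'm starting on the first one: the naive Bayes classifier.
Language: Python
I haven't really used Python before, so I'm taking this chance to learn it too, hehe, fighting...
Since I've only just started learning, I mainly followed "Naive Bayes classifier in 50 lines" and learned along the way...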
# coding=utf-8
'''
@author: qwzhong1988(qwzhong1988@163.com)
'''
import collections
import math


class Model():
    def __init__(self):
        # List holding the feature names in column order,
        # e.g. ['outlook', 'temperature']
        self.featureNameList = []
        # Dict mapping each feature to all of its possible values; the class
        # label is treated as just another feature,
        # e.g. {'outlook': ['sunny', 'overcast', 'rain']}
        self.features = {}
        # (2-D) list holding the actual data rows
        self.featureVectors = []
        # Dict of the counts needed for the joint probabilities,
        # e.g. {('No', 'outlook', 'Sunny'): 3} means that play=No together
        # with outlook=Sunny occurs 3 times in the training set.
        # defaultdict(int) initializes the value of any missing key to 0.
        self.featureCounts = collections.defaultdict(int)
        # Dict of the counts needed for the prior probabilities,
        # e.g. {'No': 5} means that play=No occurs 5 times in the training set
        self.labelCounts = collections.defaultdict(int)

    # Read a file in .arff format
    def __readFile(self, arffFile):
        with open(arffFile, 'r') as fr:
            for line in fr:
                line = line.strip()
                if not line or line.startswith('%'):
                    # Skip blank lines and ARFF comments
                    continue
                if not line.startswith('@'):
                    # An actual data row
                    self.featureVectors.append(line.split(','))
                elif line.upper().startswith('@ATTRIBUTE'):
                    # An attribute declaration: the name, then its values in braces
                    name = line.split()[1]
                    values = line[line.find('{') + 1:line.find('}')].split(',')
                    self.featureNameList.append(name)
                    self.features[name] = [v.strip() for v in values]

    # Train the model, i.e. collect the counts behind every probability
    def train(self, arffFile):
        # Read the file first
        self.__readFile(arffFile)
        # Count over the actual data
        for instance in self.featureVectors:
            label = instance[-1]
            # Count for the prior probability
            self.labelCounts[label] += 1
            for i in range(len(instance) - 1):
                # Count for the joint probability
                self.featureCounts[(label, self.featureNameList[i], instance[i])] += 1

    # The classifier
    def __classify(self, instance):
        probPerLabel = {}
        total = sum(self.labelCounts.values())
        for label in self.labelCounts:
            cprob = 0.0
            for i in range(len(instance)):
                name = self.featureNameList[i]
                # Add-one smoothing: the denominator grows by the number of
                # values this feature can take, so the smoothed conditional
                # probabilities of one feature still sum to 1
                cprob += math.log((self.featureCounts[(label, name, instance[i])] + 1.0)
                                  / (self.labelCounts[label] + len(self.features[name])))
            # There are two views on the prior probability:
            # one is that it should be smoothed as well;
            # the other is that it need not be, because every class has to
            # appear in the training data anyway.
            # Also, in tasks such as word segmentation the prior of a sentence
            # is usually assumed to follow some distribution (e.g. uniform, in
            # which case the prior drops out of the computation).
            # So whether to smooth the prior is not worth dwelling on.
            probPerLabel[label] = cprob + math.log(float(self.labelCounts[label]) / total)
        # The key parameter of max is an anonymous function! The maximum
        # could also be tracked directly in the loop above.
        return max(probPerLabel, key=lambda classlabel: probPerLabel[classlabel])

    # Test set: make sure it does not contain the last column of the training set
    def test(self, arffFile):
        with open(arffFile, 'r') as fr:
            for line in fr:
                line = line.strip()
                if line and not line.startswith(('@', '%')):
                    instance = line.split(',')
                    print('class: %s' % self.__classify(instance))


if __name__ == '__main__':
    model = Model()
    model.train('tennis.arff')
    model.test('tennis1.arff')
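To make the scoring step concrete, here is a minimal standalone sketch of the smoothed log score for a single instance, using the 14 training rows of tennis.arff. It assumes the common add-one convention in which each denominator grows by the number of declared values of the feature being scored; the names `ROWS`, `NUM_VALUES` and `log_scores` are illustrative and not part of the original code.

```python
import math
from collections import Counter

# Training rows of tennis.arff: outlook, temperature, humidity, wind, play
ROWS = [row.split(',') for row in """Sunny,Hot,High,Weak,No
Sunny,Hot,High,Strong,No
Overcast,Hot,High,Weak,Yes
Rain,Mild,High,Weak,Yes
Rain,Cool,Normal,Weak,Yes
Rain,Cool,Normal,Strong,No
Overcast,Cool,Normal,Strong,Yes
Sunny,Mild,High,Weak,No
Sunny,Cool,Normal,Weak,Yes
Rain,Mild,Normal,Weak,Yes
Sunny,Mild,Normal,Strong,Yes
Overcast,Mild,High,Strong,Yes
Overcast,Hot,Normal,Weak,Yes
Rain,Mild,High,Strong,No""".splitlines()]

# Number of declared values per feature column (from the @ATTRIBUTE lines)
NUM_VALUES = [3, 3, 3, 2]


def log_scores(instance, rows=ROWS, num_values=NUM_VALUES):
    """Return {label: log P(label) + sum_i log P(x_i | label)}, add-one smoothed."""
    label_counts = Counter(row[-1] for row in rows)
    joint_counts = Counter(
        (row[-1], i, value) for row in rows for i, value in enumerate(row[:-1])
    )
    total = len(rows)
    scores = {}
    for label, n_label in label_counts.items():
        # Unsmoothed prior, as in the classifier above
        score = math.log(n_label / total)
        for i, value in enumerate(instance):
            # Smoothed conditional probability of this feature value
            score += math.log(
                (joint_counts[(label, i, value)] + 1) / (n_label + num_values[i])
            )
        scores[label] = score
    return scores


scores = log_scores(['Sunny', 'Cool', 'High', 'Strong'])
prediction = max(scores, key=scores.get)
print(prediction)  # No
```

For this instance the score for No corresponds to (5/14)(4/8)(2/8)(5/8)(4/7), which is larger than the product for Yes, (9/14)(3/12)(4/12)(4/12)(4/11). Working in log space only turns the products into sums; it does not change which label wins.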
Data format:
tennis.arff
@RELATION TENNIS
@ATTRIBUTE outlook {sunny, overcast, rain}
@ATTRIBUTE temperature {hot, mild, cool}
@ATTRIBUTE humidity {high, normal, low}
@ATTRIBUTE wind {weak, strong}
@ATTRIBUTE play {yes, no}
@DATA
Sunny,Hot,High,Weak,No
Sunny,Hot,High,Strong,No
Overcast,Hot,High,Weak,Yes
Rain,Mild,High,Weak,Yes
Rain,Cool,Normal,Weak,Yes
Rain,Cool,Normal,Strong,No
Overcast,Cool,Normal,Strong,Yes
Sunny,Mild,High,Weak,No
Sunny,Cool,Normal,Weak,Yes
Rain,Mild,Normal,Weak,Yes
Sunny,Mild,Normal,Strong,Yes
Overcast,Mild,High,Strong,Yes
Overcast,Hot,Normal,Weak,Yes
Rain,Mild,High,Strong,No
tennis1.arff
@RELATION TENNIS
@ATTRIBUTE outlook {sunny, overcast, rain}
@ATTRIBUTE temperature {hot, mild, cool}
@ATTRIBUTE humidity {high, normal, low}
@ATTRIBUTE wind {weak, strong}
@ATTRIBUTE play {yes, no}
@DATA
Sunny,Hot,High,Weak
Sunny,Hot,High,Strong
Overcast,Hot,High,Weak
Rain,Mild,High,Weak
Rain,Cool,Normal,Weak
Rain,Cool,Normal,Strong
Overcast,Cool,Normal,Strong
Sunny,Mild,High,Weak
Sunny,Cool,Normal,Weak
Rain,Mild,Normal,Weak
Sunny,Mild,Normal,Strong
Overcast,Mild,High,Strong
Overcast,Hot,Normal,Weak
Rain,Mild,High,Strong
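As a rough illustration of how files in this layout can be split into attribute declarations and data rows, here is a small sketch. The function name `parse_arff` is hypothetical, the sample text is a shortened stand-in in the same layout rather than the real file, and the sketch only handles nominal attributes whose values are listed in braces.

```python
def parse_arff(text):
    """Split ARFF text into (attributes, rows).

    attributes: list of (name, [values]) in declaration order
    rows: list of lists of strings, one list per data line
    """
    attributes, rows = [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith('%'):
            continue  # skip blank lines and ARFF comments
        lower = line.lower()
        if lower.startswith('@attribute'):
            # Attribute name, then the nominal values between the braces
            name = line.split()[1]
            values = line[line.index('{') + 1:line.index('}')].split(',')
            attributes.append((name, [v.strip() for v in values]))
        elif lower.startswith('@data'):
            in_data = True  # everything after this marker is a data row
        elif in_data:
            rows.append([v.strip() for v in line.split(',')])
    return attributes, rows


# Shortened sample in the same layout as tennis.arff (not the full file)
sample = """@RELATION TENNIS
@ATTRIBUTE outlook {sunny, overcast, rain}
@ATTRIBUTE wind {weak, strong}
@ATTRIBUTE play {yes, no}
@DATA
Sunny,Weak,No
Overcast,Strong,Yes
"""
attributes, rows = parse_arff(sample)
print(attributes)
print(rows)
```

Note that the declared values are lowercase while the data rows are capitalized. The sketch keeps both exactly as written, so a declared value and a data value only match if they are compared case-insensitively.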