您的位置:首页 > 编程语言 > Python开发

决策树--Python实现

2015-08-03 09:52 447 查看

信息增益

导入模块:
from math import log
import operator
计算给定数据集的香农熵:def calcShannonEnt(dataSet):numEntries = len(dataSet)lableCounts = {}for featVec in dataSet:currentLable = featVec[-1]if currentLable not in lableCounts.keys():lableCounts[currentLable] = 0lableCounts[currentLable] += 1shannonEnt = 0for key in lableCounts:prob = float(lableCounts[key])/numEntriesshannonEnt -= prob * log(prob,2)return shannonEnt创建简单的数据集:def createDataSet():dataSet = [[1,1,0,'fight'],[1,0,1,'fight'],[1,0,1,'fight'],[1,0,1,'fight'],[0,0,1,'run'],[0,1,0,'fight'],[0,1,1,'run']]lables = ['weapon','bullet','blood']return dataSet,lables
字段说明
[1,1,0,'fight']
数值武器类型子弹血量
0步枪
1机枪
行为类别
fight战斗
run逃跑
按行打印数据集
def printData(myData):for item in myData:print '%s' %(item)
用Python命令提示符输入下列命令:>>> import tree>>> myDat,lable = tree.createDataSet()>>> tree.printData(myDat)[1, 1, 0, 'fight'][1, 0, 1, 'fight'][1, 0, 1, 'fight'][1, 0, 1, 'fight'][0, 0, 1, 'run'][0, 1, 0, 'fight'][0, 1, 1, 'run']>>> tree.calcShannonEnt(myDat)0.863120568566631得到香农熵为
0.863120568566631
熵越高,则混合的数据也越多。为数据集添加新分类
surrender
>>> myDat[0][-1] = 'surrender'>>> tree.printData(myDat)[1, 1, 0, 'surrender'][1, 0, 1, 'fight'][1, 0, 1, 'fight'][1, 0, 1, 'fight'][0, 0, 1, 'run'][0, 1, 0, 'fight'][0, 1, 1, 'run']>>> tree.calcShannonEnt(myDat)1.3787834934861756
得到香农熵为
1.3787834934861756

划分数据集

按照给定特征划分数据集:
def splitDataSet(dataSet,axis,value):retDataSet = []for featVec in dataSet:if featVec[axis] == value:reducedFeatVec = featVec[:axis]reducedFeatVec.extend(featVec[axis+1:])retDataSet.append(reducedFeatVec)return retDataSet
输入Python命令,分别提取武器类型为
1(机枪)
0(步枪)
的行为:>>> reload(tree)<module 'tree' from 'tree.py'>>>> myDat,lable = tree.createDataSet()>>> tree.printData(myDat)[1, 1, 0, 'fight'][1, 0, 1, 'fight'][1, 0, 1, 'fight'][1, 0, 1, 'fight'][0, 0, 1, 'run'][0, 1, 0, 'fight'][0, 1, 1, 'run']>>> tree.splitDataSet(myDat,0,1)[[1, 0, 'fight'], [0, 1, 'fight'], [0, 1, 'fight'], [0, 1, 'fight']]>>> tree.splitDataSet(myDat,0,0)[[0, 1, 'run'], [1, 0, 'fight'], [1, 1, 'run']]选择最好的数据集划分方式:def chooseBestFeatureToSplit(dataSet):numFeatures = len(dataSet[0]) - 1baseEntropy = calcShannonEnt(dataSet)bestInfoGain = 0bestFeature = -1for i in range(numFeatures):featList = [example[i] for example in dataSet]uniqueVals = set(featList)newEntropy = 0for value in uniqueVals:subDataSet = splitDataSet(dataSet,i,value)prob = len(subDataSet)/float(len(dataSet))newEntropy += prob * calcShannonEnt(subDataSet)infoGain = baseEntropy - newEntropyif (infoGain > bestInfoGain):bestInfoGain = infoGainbestFeature = ireturn bestFeature
chooseBestFeatureToSplit
调用的数据需要满足的要求:数据必须是一种由列表元素组成的列表所有列表元素都要具有相同的数据长度数据的最后一列是当前数据的类别标签在开始划分数据集之前,先计算整个数据集的原始香农熵,保存最初的无序度量值,用于与划分完之后的数据集计算的熵值进行比较,从而计算信息增益。遍历当前特征中的所有唯一属性值,对每个特征划分一次数据集,然后计算数据集的新熵值,并对所有唯一特征值得到的熵求和。最后,比较所有特征中的信息增益,返回最好特征划分的索引值。
>>> reload(tree)<module 'tree' from 'tree.pyc'>>>> myDat,lable = tree.createDataSet()>>> tree.printData(myDat)[1, 1, 0, 'fight'][1, 0, 1, 'fight'][1, 0, 1, 'fight'][1, 0, 1, 'fight'][0, 0, 1, 'run'][0, 1, 0, 'fight'][0, 1, 1, 'run']>>> tree.chooseBestFeatureToSplit(myDat)0.4695652111150.005977711423770.169584429670
在划分数据集之前之后信息发生的变化称为信息增益
特征信息增益
武器类型0.469565211115
子弹数量0.00597771142377
血量0.16958442967
运行结果告诉我们,第0个特征,也就是
武器类型
是最好的用于划分数据集的特征。

递归构建决策树

创建�的函数代码:
def createTree(dataSet,lables):classList = [example[-1] for example in dataSet]if classList.count(classList[0]) == len(classList):return classList[0]if len(dataSet[0]) == 1:return majorityCnt(classList)bestFeat = chooseBestFeatureToSplit(dataSet)bestFeatLable = lables[bestFeat]myTree = {bestFeatLable:{}}del(lables[bestFeat])featValues = [example[bestFeat] for example in dataSet]uniqueVals = set(featValues)for value in uniqueVals:subLables = lables[:]myTree[bestFeatLable][value] = createTree(splitDataSet(dataSet,bestFeat,value),subLables)return myTree
递归结束的条件:遍历完所有划分数据集的属性每个分支下的素有实力都有相同的分类所有的类标签完全相同,则返回该类标签。如果使用完了所有特征,仍然不能将数据集划分成仅包含唯一类别的分组,则通过
majorityCnt
挑选出出现次数最多的类别标签作为返回值。选出出现次数最多的分类名称:
def majorityCnt(classList):classCount = {}for vote in classList:if vote not in classCount.keys():classCount[vote] = 0classCount[vote] += 1sortedClassCount = sorted(classCount.iteritems(),key = operator.itemgetter(1),reverse=True)return sortedClassCount[0][0]
为测试代码的实际输出结果,在Python命令提示符中输入下列命令:>>> reload(tree)<module 'tree' from 'tree.pyc'>>>> myDat,lable = tree.createDataSet()>>> tree.printData(myDat)[1, 1, 0, 'fight'][1, 0, 1, 'fight'][1, 0, 1, 'fight'][1, 0, 1, 'fight'][0, 0, 1, 'run'][0, 1, 0, 'fight'][0, 1, 1, 'run']>>> tree.createTree(myDat,lable)0.4695652111150.005977711423770.169584429670.2516291673880.918295834054{'weapon': {0: {'blood': {0: 'fight', 1: 'run'}}, 1: 'fight'}}
createTree
返回的嵌套字典包含了很多代表树结构的信息,从左边开始,第一个关键字
weapon
是第一个划分数据集的特征名称,该关键字的值也是另一个数据字典。第二个关键字是
weapon
特征划分的数据集,这些关键字的值是
weapon
节点的子节点。这些值可能是类标签,也可能是另一个数据字典。如果值是类标签,则该节点是叶子节点;如果值是另一个数据字典,则子节点是一个判断节点,这种格式结构不断重复就构成了整棵�。该例子中包含了3个叶子节点和2个判断节点。
内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签: