
Answer to Exercise 4.3 of Zhou Zhihua's Machine Learning (《机器学习》)

2016-07-06 23:57
Most of the code in this post comes from Machine Learning in Action by Peter Harrington,

and also draws on Will-Lin's blog post "机器学习算法的Python实现(2):ID3决策树" (a Python implementation of the ID3 decision tree).
Finally, a number of my own comments on the functions have been added.

4.3 Write a program that implements a decision-tree learning algorithm which selects splits by information entropy, and use it to generate a decision tree for the data in Table 4.3.

Data:

编号,色泽,根蒂,敲声,纹理,脐部,触感,密度,含糖率,好瓜
1,青绿,蜷缩,浊响,清晰,凹陷,硬滑,0.697,0.46,是
2,乌黑,蜷缩,沉闷,清晰,凹陷,硬滑,0.774,0.376,是
3,乌黑,蜷缩,浊响,清晰,凹陷,硬滑,0.634,0.264,是
4,青绿,蜷缩,沉闷,清晰,凹陷,硬滑,0.608,0.318,是
5,浅白,蜷缩,浊响,清晰,凹陷,硬滑,0.556,0.215,是
6,青绿,稍蜷,浊响,清晰,稍凹,软粘,0.403,0.237,是
7,乌黑,稍蜷,浊响,稍糊,稍凹,软粘,0.481,0.149,是
8,乌黑,稍蜷,浊响,清晰,稍凹,硬滑,0.437,0.211,是
9,乌黑,稍蜷,沉闷,稍糊,稍凹,硬滑,0.666,0.091,否
10,青绿,硬挺,清脆,清晰,平坦,软粘,0.243,0.267,否
11,浅白,硬挺,清脆,模糊,平坦,硬滑,0.245,0.057,否
12,浅白,蜷缩,浊响,模糊,平坦,软粘,0.343,0.099,否
13,青绿,稍蜷,浊响,稍糊,凹陷,硬滑,0.639,0.161,否
14,浅白,稍蜷,沉闷,稍糊,凹陷,硬滑,0.657,0.198,否
15,乌黑,稍蜷,浊响,清晰,稍凹,软粘,0.36,0.37,否
16,浅白,蜷缩,浊响,模糊,平坦,硬滑,0.593,0.042,否
17,青绿,蜷缩,沉闷,稍糊,稍凹,硬滑,0.719,0.103,否


Code:

# coding: utf-8
from numpy import *
import pandas as pd
import codecs
import operator
from math import log
import json
from treePlotter import *

'''
Input: a data set.  Output: its Shannon entropy.
For every sample in the data set we count the label distribution. For example, with 10 samples of class A
and 15 of class B, class A accounts for 0.4 and class B for 0.6, so the entropy of the data set is
-0.4*log(0.4,2) - 0.6*log(0.6,2). Entropy measures impurity: the larger it is, the more random the data;
for two classes it reaches its maximum value of 1 when A and B are equally likely.
'''
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)

    labelCounts = {}
    for featVec in dataSet:  # count the occurrences of each unique label
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys(): labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob * log(prob, 2)  # log base 2
    return shannonEnt
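
# A minimal sanity-check sketch for calcShannonEnt (not part of the original flow, added for illustration):
# the 2:3 class split from the docstring above should give about 0.971 bits, and a 50/50 split exactly 1.0.
def _demoCalcShannonEnt():
    toySet = [[1, 'A'], [1, 'A'], [0, 'B'], [0, 'B'], [0, 'B']]   # 2 samples of class A, 3 of class B
    print calcShannonEnt(toySet)                                  # ~0.971 = -0.4*log2(0.4) - 0.6*log2(0.6)
    print calcShannonEnt([[1, 'A'], [0, 'B']])                    # 1.0, the two-class maximum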

'''
Input: data set, index of the splitting feature, value of that feature.  Output: the resulting data subset.
This function partitions the data set: it keeps the samples whose feature `axis` equals `value`.
(Note: the returned subset has one feature fewer than before, because the splitting feature is removed.)
'''
def splitDataSet(dataSet, axis, value):
    returnMat = []
    for data in dataSet:
        if data[axis] == value:
            returnMat.append(data[:axis] + data[axis+1:])
    return returnMat
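
# A small illustrative check for splitDataSet (hypothetical toy data, not from the watermelon set):
# selecting the rows whose feature 0 equals 'x' should also drop column 0 from the result.
def _demoSplitDataSet():
    toySet = [['x', 1, 'A'], ['y', 2, 'B'], ['x', 3, 'A']]
    print splitDataSet(toySet, 0, 'x')   # [[1, 'A'], [3, 'A']] -- one feature fewer than before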
'''
Similar to the function above, but for continuous features instead of discrete ones.
It splits the data set on a continuous feature; `direction` decides which side is kept:
the samples with values greater than `value`, or those less than or equal to `value`.
'''
def splitContinuousDataSet(dataSet, axis, value, direction):
    retDataSet = []
    for featVec in dataSet:
        if direction == 0:
            if featVec[axis] > value:
                retDataSet.append(featVec[:axis] + featVec[axis + 1:])
        else:
            if featVec[axis] <= value:
                retDataSet.append(featVec[:axis] + featVec[axis + 1:])
    return retDataSet
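
# A short check of the direction convention above (a sketch with made-up numbers):
# direction 0 keeps the samples strictly greater than value, direction 1 keeps those <= value.
def _demoSplitContinuous():
    toySet = [[0.2, 'A'], [0.5, 'B'], [0.8, 'A']]
    print splitContinuousDataSet(toySet, 0, 0.5, 0)   # [['A']]         -> only 0.8 > 0.5
    print splitContinuousDataSet(toySet, 0, 0.5, 1)   # [['A'], ['B']]  -> 0.2 and 0.5 are <= 0.5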

'''
The core of the decision-tree algorithm: how do we decide on the best split?
A tree that uses information gain as the splitting criterion is ID3;
a tree that uses the gain ratio is C4.5. This exercise asks for information gain, i.e. an ID3 tree.
Given the training set, compute the entropy before splitting, find how many features there are, iterate over
every feature and compute the information gain it would bring, then pick the feature with the largest gain.
Two cases are handled: discrete and continuous features.
1. Discrete features: iterate over all values the feature takes in the training set. If there are n values,
   compute the entropy of each of the n subsets and add them up (weighted) to obtain the entropy after the split.
2. Continuous features: slightly more involved. First determine the candidate split points by bisection,
   giving (number of values - 1) candidates. For each candidate split, compute the new entropy and hence the
   gain, and keep the split with the largest gain.
Once the feature with the largest gain has been found: if it is discrete, we can simply split on its original
values; if it is continuous, we only have a split point, which is not enough -- the data must also be binarized
against that split point.
'''
def chooseBestFeatureToSplit(dataSet, labels):
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    bestSplitDict = {}
    for i in range(numFeatures):
        # i is the index of the current feature; featList holds this feature's value for every sample
        featList = [example[i] for example in dataSet]
        # continuous and discrete features have to be handled separately
        if type(featList[0]).__name__ == 'float' or type(featList[0]).__name__ == 'int':
            # generate the n-1 candidate split points
            sortfeatList = sorted(featList)
            splitList = []
            for j in range(len(sortfeatList) - 1):
                splitList.append((sortfeatList[j] + sortfeatList[j + 1]) / 2.0)
            bestSplitEntropy = 10000
            # for each candidate split point, compute the entropy after splitting and record the best split
            for value in splitList:
                newEntropy = 0.0
                subDataSet0 = splitContinuousDataSet(dataSet, i, value, 0)
                subDataSet1 = splitContinuousDataSet(dataSet, i, value, 1)

                prob0 = len(subDataSet0) / float(len(dataSet))
                newEntropy += prob0 * calcShannonEnt(subDataSet0)
                prob1 = len(subDataSet1) / float(len(dataSet))
                newEntropy += prob1 * calcShannonEnt(subDataSet1)
                if newEntropy < bestSplitEntropy:
                    bestSplitEntropy = newEntropy
                    bestSplit = value
            # record the current feature's best split point in a dictionary
            bestSplitDict[labels[i]] = bestSplit
            infoGain = baseEntropy - bestSplitEntropy

        # handle discrete features
        else:
            uniqueVals = set(featList)
            newEntropy = 0.0
            # compute the entropy of each split: select the subset whose i-th feature equals value
            for value in uniqueVals:
                subDataSet = splitDataSet(dataSet, i, value)
                prob = len(subDataSet) / float(len(dataSet))
                newEntropy += prob * calcShannonEnt(subDataSet)
            infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    # if the best splitting feature is continuous, binarize it at the recorded split point,
    # i.e. "is the value <= bestSplitValue"; e.g. 密度 becomes 密度<=0.3815.
    # Once the label is changed, the original float values are replaced by 0 and 1 accordingly.
    if type(dataSet[0][bestFeature]).__name__ == 'float' or type(dataSet[0][bestFeature]).__name__ == 'int':
        bestSplitValue = bestSplitDict[labels[bestFeature]]
        labels[bestFeature] = labels[bestFeature] + '<=' + str(bestSplitValue)
        for i in range(shape(dataSet)[0]):
            if dataSet[i][bestFeature] <= bestSplitValue:
                dataSet[i][bestFeature] = 1
            else:
                dataSet[i][bestFeature] = 0
    return bestFeature
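
# A hand-rolled information-gain check for a single discrete feature, built only from the helpers above
# (a hedged sketch, not part of the original script). With the 17-melon data listed earlier and axis=3,
# i.e. the 纹理 column, the gain works out to roughly 0.38, which is why 纹理 ends up at the root.
def _demoInfoGain(dataSet, axis):
    baseEntropy = calcShannonEnt(dataSet)
    newEntropy = 0.0
    for value in set([example[axis] for example in dataSet]):
        subDataSet = splitDataSet(dataSet, axis, value)
        newEntropy += len(subDataSet) / float(len(dataSet)) * calcShannonEnt(subDataSet)
    return baseEntropy - newEntropy   # information gain of the discrete feature at index `axis`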

'''
Input: a list of class labels.  Output: the majority class in the list, i.e. a majority vote.
The function returns the key with the largest count in the dictionary, that is, the most frequent value in the input list.
'''
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys(): classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)

    return sortedClassCount[0][0]

'''
Main routine: build the decision tree recursively.
params:
dataSet: the data used to build the (sub)tree. It starts out as data_full and shrinks as splitting proceeds.
         Before the first split the root holds all 17 melons; the first bestFeat chosen is 纹理, which takes
         the values 清晰, 模糊 and 稍糊, splitting the melons into 清晰 (9), 稍糊 (5) and 模糊 (3). At that
         point the used feature is removed, leaving one feature fewer for the next split.
labels: the features still available for splitting
data_full: the complete data set
labels_full: the complete feature list

Since the tree is built recursively, stopping conditions are needed. There are three:
1. All samples at the current node belong to the same class. ----- handled at note 1 below
2. The feature set is empty, i.e. every usable splitting feature has been consumed while the node still
   contains more than one class. The node then becomes a leaf labelled with the majority class of the
   remaining samples (whichever class is most frequent). ----- handled at note 2 below
3. The sample set at the current node is empty. For example, suppose a node still holds 10 melons and we
   split on size, which takes the values large, medium and small, but the 10 melons are 8 large and 2 small.
   Because the training set used to grow the tree contains no medium-sized sample here, the resulting tree,
   when it later meets a medium-sized melon (an unseen sample), uses the parent node's 8-large/2-small
   distribution as a prior and in effect treats the medium melon like the majority (large) case.
'''

def createTree(dataSet, labels, data_full, labels_full):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):   # note 1
        return classList[0]
    if len(dataSet[0]) == 1:                              # note 2
        return majorityCnt(classList)
    # general case: find the best splitting feature
    bestFeat = chooseBestFeatureToSplit(dataSet, labels)
    bestFeatLabel = labels[bestFeat]

    myTree = {bestFeatLabel: {}}
    featValues = [example[bestFeat] for example in dataSet]
    '''
    At first it seemed odd that uniqueValsFull is needed. The reason: suppose the root split on 纹理 divides
    the data into 清晰, 模糊 and 稍糊. In the 模糊 subset the next splitting feature might be 触感, but that
    subset may only contain 软粘 melons, so the node built here would only have a 软粘 branch even though the
    full training set also contains 硬滑, leaving the tree incomplete. uniqueValsFull therefore collects every
    value of the feature seen in the full training set: each value used in this branch is removed from the set,
    and any value left over gets a leaf decided by the parent node's majority vote.
    Even so, if a value never appears in the training set at all (say no melon with 触感 = 一般), a test sample
    with that value should still be predicted with the parent node's majority class.
    '''
    uniqueVals = set(featValues)
    if type(dataSet[0][bestFeat]).__name__ == 'str':
        currentlabel = labels_full.index(labels[bestFeat])
        featValuesFull = [example[currentlabel] for example in data_full]
        uniqueValsFull = set(featValuesFull)
    del(labels[bestFeat])
    '''
    Build one subtree for each value of bestFeat. For 纹理 the tree looks like {"纹理": {?}}, where ? holds the
    different values 清晰, 模糊 and 稍糊; for each value a subtree is built, giving roughly
    {"纹理": {"模糊": {0}, "稍糊": {1}, "清晰": {2}}}. Each of the subtrees 0/1/2 is built from the subset of
    samples with that value, with one feature fewer than before.
    '''
    for value in uniqueVals:
        subLabels = labels[:]
        if type(dataSet[0][bestFeat]).__name__ == 'str':
            uniqueValsFull.remove(value)
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value),
                                                  subLabels, data_full, labels_full)

    if type(dataSet[0][bestFeat]).__name__ == 'str':
        for value in uniqueValsFull:
            myTree[bestFeatLabel][value] = majorityCnt(classList)
    return myTree

if __name__ == "__main__":
    # read the csv data file
    input_path = "data/西瓜数据集3.csv"
    file = codecs.open(input_path, "r", 'utf-8')

    filedata = [line.strip('\n').split(',') for line in file]
    filedata = [[float(i) if '.' in i else i for i in row] for row in filedata]  # convert decimals from string to float
    dataSet = [row[1:] for row in filedata[1:]]

    labels = []
    for label in filedata[0][1:-1]:
        labels.append(label)

    myTree = createTree(dataSet, labels, dataSet, labels)
    print myTree
    createPlot(myTree)
    print json.dumps(myTree, ensure_ascii=False, indent=4)


The plotted decision tree:



The result, saved in JSON format:

{
    "纹理": {
        "清晰": {
            "密度<=0.3815": {
                "0": "是",
                "1": "否"
            }
        },
        "模糊": "否",
        "稍糊": {
            "触感": {
                "软粘": "是",
                "硬滑": "否"
            }
        }
    }
}
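
To actually classify a new melon with the nested dictionary above, a small helper that walks the tree is needed; the code above does not include one, so the following is only a minimal sketch. It assumes a sample is given as a dict mapping feature names to values (e.g. {'纹理': '清晰', '密度': 0.697}) and that continuous features were binarized in place the way chooseBestFeatureToSplit does, so their branch keys are 1 (<= split point) and 0 (> split point):

# A minimal classification sketch (not part of the original code): walk the nested dict produced by createTree.
def classify(tree, sample):
    if not isinstance(tree, dict):           # reached a leaf: return the class label
        return tree
    featLabel = list(tree.keys())[0]
    branches = tree[featLabel]
    if '<=' in featLabel:                    # binarized continuous feature, e.g. '密度<=0.3815'
        name, threshold = featLabel.split('<=')
        key = 1 if sample[name] <= float(threshold) else 0
    else:                                    # discrete feature: follow the branch for the sample's value
        key = sample[featLabel]
    return classify(branches[key], sample)

# e.g. classify(myTree, {'纹理': '清晰', '密度': 0.697}) gives '是' for the tree shown above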

The plotting code (treePlotter):

# -*- coding: utf-8 -*-
'''
Created on Oct 14, 2010

@author: Peter Harrington
modified by deng in 2016.03.09
'''
import matplotlib.pyplot as plt

decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")

def getNumLeafs(myTree):
    numLeafs = 0
    firstStr = myTree.keys()[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':  # a dict means an internal node, otherwise it is a leaf
            numLeafs += getNumLeafs(secondDict[key])
        else:   numLeafs += 1
    return numLeafs

def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = myTree.keys()[0]
    secondDict = myTree[firstStr]

    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':  # a dict means an internal node, otherwise it is a leaf
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:   thisDepth = 1
        if thisDepth > maxDepth: maxDepth = thisDepth
    return maxDepth

def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)

def plotMidText(cntrPt, parentPt, txtString):
    xMid = (parentPt[0]-cntrPt[0])/2.0 + cntrPt[0]
    yMid = (parentPt[1]-cntrPt[1])/2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)

def plotTree(myTree, parentPt, nodeTxt):  # the first key tells you what feature was split on
    numLeafs = getNumLeafs(myTree)   # this determines the x width of this tree
    depth = getTreeDepth(myTree)
    firstStr = myTree.keys()[0]      # the text label for this node
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0/plotTree.totalD
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':  # a dict means an internal node, otherwise it is a leaf
            plotTree(secondDict[key], cntrPt, str(key))        # recursion
        else:   # it's a leaf node: plot the leaf
            plotTree.xOff = plotTree.xOff + 1.0/plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0/plotTree.totalD
# if you do get a dictionary you know it's a tree, and the first element will be another dict

def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)    # no ticks
    #createPlot.ax1 = plt.subplot(111, frameon=False)  # ticks for demo purposes
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5/plotTree.totalW; plotTree.yOff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()

'''def createPlot():
    # figure name and background colour
    fig = plt.figure("tree", facecolor='white')
    fig.clf()  # clear the current figure window
    createPlot.ax1 = plt.subplot(111, frameon=False)  # ticks for demo purposes
    plotNode('a decision node', (0.5, 0.1), (0.1, 0.5), decisionNode)
    plotNode('a leaf node', (0.8, 0.1), (0.3, 0.8), leafNode)
    plt.show()'''

def retrieveTree(i):
    listOfTrees = [{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}},
                   {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}
                   ]
    return listOfTrees[i]

#createPlot(thisTree)