
Combining Different Algorithms into an Ensemble (Ensemble Methods): Study Notes on OpenCV and Python (Part 22)

2019-03-08 20:28

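Ensemble methods bind several different models together to solve one common problem. Combining models usually gives a modest improvement over a single classifier. (The intuition: with enough classifiers, at least one of them is likely to be right for any given sample.)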

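Bagging methods: average the votes of several classifiers to reach the final decision. Example: random forest
Boosting methods: each classifier tries to correct the errors of another classifier. Example: AdaBoost
Random forest: an ensemble made up of many decision trees.
Adaptive boosting: AdaBoost.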

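An ensemble method has two main parts:

A set of models
A set of decision rules that determine how the results of these models are combined into a single output

If enough classifiers are used, one of them is likely to be right; the decision rules are what tell us which answer to trust.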
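
The two parts can be illustrated with a minimal sketch. The "set of models" is stood in for by three rows of made-up predictions (hypothetical values, not taken from any real classifier), and the decision rule is a plain majority vote:

```python
import numpy as np

# Hypothetical labels and predictions, made up for illustration:
# each row holds one classifier's predictions for five samples.
y_true = np.array([0, 1, 1, 0, 1])
predictions = np.array([
    [0, 1, 1, 0, 0],   # classifier 1, wrong on the last sample
    [0, 1, 0, 0, 1],   # classifier 2, wrong on the third sample
    [1, 1, 1, 0, 1],   # classifier 3, wrong on the first sample
])

def majority_vote(predictions):
    """Decision rule: for each sample, return the most frequent label."""
    return np.array([np.bincount(column).argmax() for column in predictions.T])

combined = majority_vote(predictions)
individual_accuracy = (predictions == y_true).mean(axis=1)
ensemble_accuracy = (combined == y_true).mean()
print('individual accuracies:', individual_accuracy)
print('ensemble accuracy:', ensemble_accuracy)
```

Each classifier is wrong on a different sample, so the vote recovers every label. Real classifiers tend to make correlated errors, so the gain is usually smaller than in this toy case.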

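Types of ensemble methods:

Averaging methods: they build models in parallel and then use averaging or voting to form a combined estimator.
Boosting methods: they build models sequentially, and each added model is meant to improve the score of the combined estimator.
Stacking methods: also called blending methods, they use the weighted outputs of several classifiers as the input to the next layer of the model, much like one panel of experts passing its decisions on to the next panel.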

1. Understanding ensemble methods

1.1 Averaging ensembles:

Bagging methods come in many different flavors. However, they typically differ only in the way they draw random subsets of the training set:

Pasting methods draw random subsets of the samples without replacement
Bagging methods draw random subsets of the samples with replacement
Random subspace methods draw random subsets of the features but train on all data samples
Random patches methods draw random subsets of both samples and features
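
The four sampling schemes can be sketched with NumPy index draws. The dataset shape and subset sizes below are hypothetical, chosen only to keep the output small:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 10, 6      # hypothetical dataset shape
n_rows, n_cols = 5, 3              # hypothetical subset sizes

# Pasting: a subset of the samples, drawn without replacement
pasting_rows = rng.choice(n_samples, size=n_rows, replace=False)

# Bagging: a subset of the samples, drawn with replacement (repeats are possible)
bagging_rows = rng.choice(n_samples, size=n_rows, replace=True)

# Random subspace: all samples, but only a subset of the features
subspace_rows = np.arange(n_samples)
subspace_cols = rng.choice(n_features, size=n_cols, replace=False)

# Random patches: subsets of both the samples and the features
patch_rows = rng.choice(n_samples, size=n_rows, replace=False)
patch_cols = rng.choice(n_features, size=n_cols, replace=False)

print('pasting rows: ', pasting_rows)
print('bagging rows: ', bagging_rows)
print('subspace cols:', subspace_cols)
print('patch rows/cols:', patch_rows, patch_cols)
```

Each base estimator of an ensemble would get its own draw of row and column indices and be trained on `X[rows][:, cols]`.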

The BaggingClassifier class provides a number of options to customize the ensemble:

n_estimators: This specifies the number of base estimators in the ensemble.

max_samples: This denotes the number (or fraction) of samples to draw from the dataset to train each base estimator. We can set bootstrap=True to sample with replacement (effectively implementing bagging), or we can set bootstrap=False to implement pasting.

max_features: This denotes the number (or fraction) of features to draw from the feature matrix to train each base estimator. We can set max_samples=1.0 and max_features<1.0 to implement the random subspace method. Alternatively, we can set both max_samples<1.0 and max_features<1.0 to implement the random patches method.
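
As a rough sketch, the four flavors map onto BaggingClassifier settings as follows. The fractions of 0.5, the k-NN base estimator and the breast cancer dataset are illustrative choices; using bootstrap=False for the random subspace and random patches variants is an assumption, since the text above does not fix that parameter for them:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

# One option set per flavor (the 0.5 fractions are arbitrary)
flavors = {
    'pasting': dict(max_samples=0.5, bootstrap=False),
    'bagging': dict(max_samples=0.5, bootstrap=True),
    'random subspace': dict(max_samples=1.0, bootstrap=False, max_features=0.5),
    'random patches': dict(max_samples=0.5, bootstrap=False, max_features=0.5),
}

scores = {}
for name, options in flavors.items():
    model = BaggingClassifier(KNeighborsClassifier(n_neighbors=5),
                              n_estimators=10, random_state=3, **options)
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f'{name}: {scores[name]:.3f}')
```

The scores depend on the split and the random seed, so no single flavor should be expected to win in general.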

A more sophisticated bagging ensemble is the random forest.

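1.2 Boosting ensembles:

Boosting models use several individual learners in sequence, improving the performance of the ensemble iteratively.

The typical weak learner is a decision stump, a decision tree with a single decision node. In each iteration the training set is adjusted so that the next classifier concentrates on the data points the previous classifier got wrong. Each iteration extends the ensemble by one new tree.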

Like the BaggingClassifier class, the GradientBoostingClassifier class is configured with the following parameters:

n_estimators: This denotes the number of base estimators in the ensemble. A large number of estimators typically results in better performance.

loss: This denotes the loss function (or cost function) to be optimized. Setting loss='deviance' (renamed 'log_loss' in scikit-learn 1.1) implements logistic regression for classification with probabilistic outputs. Setting loss='exponential' actually results in AdaBoost, which is covered later in these notes.

learning_rate: This denotes the fraction by which to shrink the contribution of each tree. There is a trade-off between learning_rate and n_estimators.

max_depth: This denotes the maximum depth of the individual trees in the ensemble.

criterion: This denotes the function to measure the quality of a node split.

min_samples_split: This denotes the number of samples required to split an internal node.

max_leaf_nodes: This denotes the maximum number of leaf nodes allowed in each individual tree.

The class accepts further parameters beyond these.
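
The trade-off between learning_rate and n_estimators can be explored with a small experiment. This is a sketch under assumed settings (three arbitrary learning rates, 100 trees, the breast cancer dataset); `staged_predict` reports the predictions after each added tree:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

results = {}
for learning_rate in (1.0, 0.1, 0.01):
    model = GradientBoostingClassifier(n_estimators=100,
                                       learning_rate=learning_rate,
                                       random_state=3)
    model.fit(X_train, y_train)
    # staged_predict yields the test predictions after 1, 2, ..., 100 trees
    staged_accuracy = [accuracy_score(y_test, pred)
                       for pred in model.staged_predict(X_test)]
    results[learning_rate] = staged_accuracy
    print(f'learning_rate={learning_rate}: '
          f'10 trees -> {staged_accuracy[9]:.3f}, '
          f'100 trees -> {staged_accuracy[-1]:.3f}')
```

A smaller learning rate shrinks each tree's contribution, so it typically needs more trees to reach the same accuracy.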

# Understanding ensembles
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
# load_boston (Boston housing) was removed in scikit-learn 1.2; the bundled
# diabetes regression dataset is used in its place below
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingRegressor

# --- Averaging ensembles ---

# Build a bagging classifier
bagging = BaggingClassifier(KNeighborsClassifier(), n_estimators=10)       # an ensemble of 10 k-NN classifiers
# Bag 10 k-NN classifiers with k=5, each trained on 50% of the samples
bag_knn = BaggingClassifier(KNeighborsClassifier(n_neighbors=5),
                            n_estimators=10, max_samples=0.5,
                            bootstrap=True, random_state=3)                # parameters are explained in the text above
dataset = load_breast_cancer()                                             # breast cancer dataset
X = dataset.data
y = dataset.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)  # split into training and test sets
bag_knn.fit(X_train, y_train)                                              # train
print('Accuracy of the bagging classifier:', bag_knn.score(X_test, y_test))
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print('Accuracy of a single classifier:', knn.score(X_test, y_test))

# Build a bagging regressor
bag_tree = BaggingRegressor(DecisionTreeRegressor(), max_features=0.5,
                            n_estimators=10, random_state=3)
dataset = load_diabetes()                                                  # regression dataset (see the import note)
X = dataset.data
y = dataset.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)
bag_tree.fit(X_train, y_train)
print('R^2 score of the bagging regressor:', bag_tree.score(X_test, y_test))   # score() returns R^2 for regressors

# --- Boosting ensembles ---

# Build a boosting classifier
boost_class = GradientBoostingClassifier(n_estimators=10, random_state=3)  # a boosting classifier made of 10 decision trees
dataset = load_breast_cancer()
X = dataset.data
y = dataset.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)
boost_class.fit(X_train, y_train)
print('Accuracy of the boosting classifier:', boost_class.score(X_test, y_test))

# Build a boosting regressor
boost_reg = GradientBoostingRegressor(n_estimators=10, random_state=3)     # analogous to the classifier
dataset = load_diabetes()
X = dataset.data
y = dataset.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)
boost_reg.fit(X_train, y_train)
print('R^2 score of the boosting regressor (10 trees):', boost_reg.score(X_test, y_test))

# A boosting ensemble typically uses around 100 weak learners
boost_reg1 = GradientBoostingRegressor(n_estimators=100, random_state=3)
boost_reg1.fit(X_train, y_train)
print('R^2 score of the boosting regressor (100 trees):', boost_reg1.score(X_test, y_test))

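1.3 Stacking ensembles:

All the ensembles we have seen so far share one design idea: fit several different classifiers to the data and merge their predictions into a final prediction with some simple decision rule (averaging or boosting).

Stacking ensembles instead build the ensemble in layers. The individual learners are organized into several layers, and the output of each layer is used as training data for the models in the next layer. In this way, stacking can successfully blend hundreds or even thousands of different models.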
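
A two-layer stack can be sketched with scikit-learn's StackingClassifier (available from version 0.22). The choice of base learners, the logistic-regression second layer and the breast cancer dataset are assumptions made for illustration, not part of the original notes:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

# First layer: three different learners (an arbitrary selection)
first_layer = [
    ('knn', KNeighborsClassifier(n_neighbors=5)),
    ('gnb', GaussianNB()),
    ('tree', DecisionTreeClassifier(max_depth=3, random_state=3)),
]

# Second layer: a model trained on the cross-validated outputs of the first layer
stack = StackingClassifier(estimators=first_layer,
                           final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
stack_score = stack.score(X_test, y_test)
print('Accuracy of the stacked ensemble:', stack_score)
```

The second-layer model learns how much to trust each first-layer learner instead of relying on a fixed rule such as averaging.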

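2. Combining decision trees into a random forest

A well-known variant of bagged decision trees is the random forest. It is essentially a collection of decision trees, each trained on a slightly different subset of the data features.

A single tree of unlimited depth may predict the data reasonably well, but it overfits easily.

How a random forest works: build a large number of trees, each trained on a random subset of the data samples and features. Because the procedure is random, every tree in the forest overfits the data in a slightly different way. Averaging the predictions of all the individual trees then reduces the effect of overfitting.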

An example of np.r_ and np.c_:

import numpy as np

a = np.array([[1, 2, 3]])
b = np.array([[4, 5, 6]])
c = np.c_[a,b]
print('1:',a,a.shape)
print('2:',np.r_[a,b])
print('3:',c)
print('4:',np.c_[c,a])

1: [[1 2 3]] (1, 3)
2: [[1 2 3]
 [4 5 6]]
3: [[1 2 3 4 5 6]]
4: [[1 2 3 4 5 6 1 2 3]]

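np.r_ stacks the rows of two matrices on top of each other; the matrices must have the same number of columns. This is similar to concat() in pandas.

np.c_ places the columns of two matrices side by side; the matrices must have the same number of rows. This is similar to concat() with axis=1 in pandas.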
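
For 2-D inputs like the ones above, the same results can also be obtained with np.vstack and np.hstack. A short check, reusing the arrays a and b from the example:

```python
import numpy as np

a = np.array([[1, 2, 3]])
b = np.array([[4, 5, 6]])

rows = np.vstack([a, b])   # stack vertically, like np.r_[a, b]
cols = np.hstack([a, b])   # stack horizontally, like np.c_[a, b]

print(np.array_equal(rows, np.r_[a, b]))   # True
print(np.array_equal(cols, np.c_[a, b]))   # True
print(rows.shape, cols.shape)              # (2, 3) (1, 6)
```

The equivalence holds for 2-D arrays; with 1-D arrays the two families behave differently (np.c_ turns each 1-D array into a column, for example).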

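Disadvantages of random forests:

Each tree is trained on only a subset of the data or a subset of the features, which weakens the individual decision trees.
They are hard to interpret.

Advantages of random forests:

Training and evaluation are very fast. The underlying decision-tree data structure is relatively simple, and every tree in the forest is independent, which makes training them in parallel straightforward.
Multiple trees allow probabilistic classification. Through voting, the ensemble can estimate the probability that a data point belongs to a particular class.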
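
Both advantages can be seen in scikit-learn's RandomForestClassifier, used here as a stand-in for the OpenCV forest further below. In this sketch (breast cancer dataset and 50 trees are arbitrary choices), n_jobs=-1 trains the trees in parallel, and the forest's class probabilities are compared with the average of the individual trees' probabilities:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=21)

# n_jobs=-1 uses all available cores; the trees are independent of each other
forest = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=21)
forest.fit(X_train, y_train)

# Class probabilities predicted by the whole forest
forest_proba = forest.predict_proba(X_test)

# The same probabilities, computed by averaging over the individual trees
tree_proba = np.mean([tree.predict_proba(X_test) for tree in forest.estimators_],
                     axis=0)

print('probabilities for the first test sample:', forest_proba[0])
print('matches the per-tree average:', np.allclose(forest_proba, tree_proba))
```

In scikit-learn the combination is an average of per-tree probabilities rather than a count of hard votes; other implementations may combine the trees differently.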

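Implementing extremely randomized trees:

A decision tree usually chooses the threshold for each feature so that the purity of every node split is maximized. Extremely randomized trees, in contrast, choose these thresholds at random and then pick the best of the randomly generated thresholds as the splitting rule.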
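
scikit-learn offers this variant as ExtraTreesClassifier. The comparison below is a minimal sketch; the breast cancer dataset and the 100 trees are assumptions chosen so the example runs without downloading data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=21)

models = {
    # searches for the best threshold among the candidate features
    'random forest': RandomForestClassifier(n_estimators=100, random_state=21),
    # draws thresholds at random and keeps the best of those draws
    'extremely randomized trees': ExtraTreesClassifier(n_estimators=100, random_state=21),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f'{name}: {scores[name]:.3f}')
```

Which of the two scores higher depends on the dataset; the random thresholds mainly make the individual trees cheaper to build and more diverse.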

# Face recognition with a random forest
from sklearn.datasets import fetch_olivetti_faces
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import cv2
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
np.random.seed(21)

# Load the face dataset (64x64-pixel images)
dataset = fetch_olivetti_faces()
X = dataset.data
y = dataset.target.astype(np.int32)                # cast the labels to int32 for OpenCV's ml module
idx_rand = np.random.randint(len(X), size=8)       # indices of 8 randomly chosen images

# Show the 8 original images
plt.figure(figsize=(14, 8))
for p, i in enumerate(idx_rand):                   # i is the image index, p counts from 0
    plt.subplot(2, 4, p + 1)
    plt.imshow(X[i, :].reshape((64, 64)), cmap='gray')
    plt.axis('off')
plt.suptitle('8 randomly chosen faces')
plt.show()

# Preprocess the data (so that all images share the same mean gray level)
n_samples, n_features = X.shape
X -= X.mean(axis=0)                                # center every feature (a column of X) around 0
X -= X.mean(axis=1).reshape(n_samples, -1)         # center every data point (a row of X) around 0
# Show the same 8 images after preprocessing
plt.figure(figsize=(14, 8))
for p, i in enumerate(idx_rand):
    plt.subplot(2, 4, p + 1)
    plt.imshow(X[i, :].reshape((64, 64)), cmap='gray')
    plt.axis('off')
# plt.savefig('olivetti-pre.png')
plt.suptitle('The same 8 faces after preprocessing')
plt.show()

# Train and test the random forest
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=21)
rtree = cv2.ml.RTrees_create()                                   # build a random forest with OpenCV
num_trees = 50                                                   # build 50 decision trees, since there are 40 classes
eps = 0.01
criteria = (cv2.TERM_CRITERIA_MAX_ITER + cv2.TERM_CRITERIA_EPS, num_trees, eps)
rtree.setTermCriteria(criteria)
rtree.setMaxCategories(len(np.unique(y)))                        # maximum number of categories
rtree.setMinSampleCount(2)                                       # minimum number of samples needed to split a node
rtree.setMaxDepth(1000)                                          # requested maximum tree depth
rtree.train(X_train, cv2.ml.ROW_SAMPLE, y_train)                 # train
# getMaxDepth() returns the depth limit in effect, not the depth the trees actually reached
print('Maximum tree depth in effect:', rtree.getMaxDepth())
_, y_hat = rtree.predict(X_test)
print('Accuracy of the OpenCV random forest:', accuracy_score(y_test, y_hat))

tree = DecisionTreeClassifier(random_state=21, max_depth=25)     # a single scikit-learn decision tree for comparison
tree.fit(X_train, y_train)
print('Accuracy of a single decision tree:', tree.score(X_test, y_test))
# Change the parameters so that the forest consists of 100 decision trees
num_trees = 100
eps = 0.01
criteria = (cv2.TERM_CRITERIA_MAX_ITER + cv2.TERM_CRITERIA_EPS, num_trees, eps)
rtree.setTermCriteria(criteria)
rtree.train(X_train, cv2.ml.ROW_SAMPLE, y_train)
_, y_hat = rtree.predict(X_test)
print('Accuracy with 100 decision trees:', accuracy_score(y_test, y_hat))

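3. Implementing AdaBoost

When the trees in the forest all have depth 1 (also known as decision stumps) and boosting is used instead of bagging, the resulting algorithm is called AdaBoost.

AdaBoost adjusts the dataset in each iteration as follows:

Select a decision stump.
Increase the weight of every data point the stump misclassifies, and decrease the weight of every data point it classifies correctly.

This iterative reweighting makes each new classifier in the ensemble prioritize the cases that were labeled incorrectly. As a result, the model adapts by targeting the highly weighted data points. Finally, the stumps are combined into the final classifier.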
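
The reweighting loop can be written out by hand. The following is a minimal sketch of the classic discrete AdaBoost update for two classes, not the code used by OpenCV or scikit-learn; the 20 rounds, the breast cancer dataset and the +1/-1 relabeling are assumptions made for illustration:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
y = np.where(y == 1, 1, -1)                    # relabel the two classes as +1 / -1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=456)

n_rounds = 20                                  # arbitrary number of stumps
weights = np.full(len(X_train), 1.0 / len(X_train))   # start with uniform weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Select a decision stump that does well on the currently weighted data
    stump = DecisionTreeClassifier(max_depth=1, random_state=456)
    stump.fit(X_train, y_train, sample_weight=weights)
    pred = stump.predict(X_train)

    # Weighted error of this stump, clipped to avoid division by zero
    err = np.clip(weights[pred != y_train].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)      # the stump's say in the final vote

    # Raise the weights of misclassified points, lower the others, renormalize
    weights *= np.exp(-alpha * y_train * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

def predict(X):
    """Combine the stumps by a weighted vote."""
    votes = sum(alpha * stump.predict(X) for alpha, stump in zip(alphas, stumps))
    return np.where(votes >= 0, 1, -1)

train_accuracy = (predict(X_train) == y_train).mean()
test_accuracy = (predict(X_test) == y_test).mean()
print('training accuracy:', train_accuracy)
print('test accuracy:', test_accuracy)
```

Because y and the prediction are both +1 or -1, their product is -1 exactly for misclassified points, which is what makes the single exponential expression raise those weights and lower the rest.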

# Implementing AdaBoost
import cv2
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# AdaBoost in OpenCV (a pretrained Haar cascade face detector)
img_bgr = cv2.imread('data/lena.jpg', cv2.IMREAD_COLOR)
img_gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
filename = 'data/haarcascade_frontalface_default.xml'    # a pretrained Haar cascade classifier
face_cascade = cv2.CascadeClassifier(filename)           # Haar cascade classifier
faces = face_cascade.detectMultiScale(img_gray, 1.1, 5)  # the detector works on grayscale images
color = (255, 0, 0)
thickness = 2
for (x, y, w, h) in faces:                               # draw a rectangle around every detected face
    cv2.rectangle(img_bgr, (x, y), (x + w, y + h), color, thickness)
# Show the image
plt.figure(figsize=(10, 6))
plt.imshow(cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB))
plt.show()

# AdaBoost in scikit-learn
ada = AdaBoostClassifier(n_estimators=100, random_state=456)  # an ensemble of 100 decision stumps (default depth 1)
cancer = load_breast_cancer()                                 # breast cancer dataset
X = cancer.data
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=456)  # default train/test split ratio
ada.fit(X_train, y_train)
print('Accuracy of AdaBoost with 100 decision stumps:', ada.score(X_test, y_test))

# Compare with a random forest
forest = RandomForestClassifier(n_estimators=100, max_depth=1, random_state=456)  # 100 decision stumps (depth 1)
forest.fit(X_train, y_train)
print('Accuracy of a random forest of 100 decision stumps (depth 1):', forest.score(X_test, y_test))
forest = RandomForestClassifier(n_estimators=100, random_state=456)               # depth not limited
forest.fit(X_train, y_train)
print('Accuracy of a random forest of 100 decision trees (unlimited depth):', forest.score(X_test, y_test))

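4. Combining different models into a voting classifier

How a voting classifier works: the individual learners in the ensemble do not have to be of the same type. However each classifier arrives at its prediction, we finally apply a decision rule that combines the votes of all the individual classifiers. This is also called a voting scheme.

Understanding the different voting schemes:

Hard voting (majority voting): every individual classifier votes for one class, and the class with the most votes wins. In statistical terms, the label predicted by the ensemble is the mode of the individually predicted labels.
Soft voting: every individual classifier provides the probability that a given data point belongs to a given target class. These predictions are combined in a sum weighted by the importance of each classifier, and the target label with the largest weighted probability wins.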
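
Soft voting can be sketched with VotingClassifier(voting='soft'). The weights below are hypothetical importance values, and the scaling pipeline around the logistic regression is an added assumption to help it converge. The last lines recompute the weighted probabilities by hand to show the decision rule:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=13)

weights = [2, 1, 1]   # hypothetical importance of each classifier
vote = VotingClassifier(
    estimators=[('lr', make_pipeline(StandardScaler(), LogisticRegression())),
                ('gnb', GaussianNB()),
                ('rfc', RandomForestClassifier(random_state=13))],
    voting='soft', weights=weights)
vote.fit(X_train, y_train)
soft_score = vote.score(X_test, y_test)
print('Accuracy of the soft-voting classifier:', soft_score)

# The same decision rule by hand: weighted average of the class
# probabilities, then pick the class with the largest value
probas = [clf.predict_proba(X_test) for clf in vote.estimators_]
weighted = np.average(probas, axis=0, weights=weights)
manual = vote.classes_[weighted.argmax(axis=1)]
print('manual rule matches predict():', np.array_equal(manual, vote.predict(X_test)))
```

Soft voting requires every classifier in the ensemble to provide predict_proba; hard voting only needs predicted labels.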

# Combine different models (logistic regression, Gaussian naive Bayes, random forest) into a voting classifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier

# Load the dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=13)
# Instantiate the classifiers (logistic regression, Gaussian naive Bayes, random forest)
model1 = LogisticRegression(random_state=13)
model2 = GaussianNB()
model3 = RandomForestClassifier(random_state=13)
# Combine them into an ensemble classifier (hard voting)
vote = VotingClassifier(estimators=[('lr', model1),
                                    ('gnb', model2),
                                    ('rfc', model3)], voting='hard')
vote.fit(X_train, y_train)
print('Accuracy of the ensemble classifier:', vote.score(X_test, y_test))
model1.fit(X_train, y_train)
print('Accuracy of the logistic regression classifier:', model1.score(X_test, y_test))
model2.fit(X_train, y_train)
print('Accuracy of the Gaussian naive Bayes classifier:', model2.score(X_test, y_test))
model3.fit(X_train, y_train)
print('Accuracy of the random forest classifier:', model3.score(X_test, y_test))
