[Machine Learning] Random Forest
2015-12-13 01:13
Contents:
一. Review of random forest
二. Models random forest can be used on
三. OOB (out-of-bag) estimation error
四. Variable importance
五. Proximity plot
六. R code for random forest
(一). Review of random forest
The random forest algorithm is a classifier built on two primary methods: bagging and the random subspace method. First, a review of the two main ensemble algorithms, bagging and boosting:
bagging builds many approximately unbiased models using bootstrap samples and the same set of features; averaging or voting over these models gives the final prediction. It is used to reduce variance.
boosting starts with a weak learner and gradually improves it by refitting the data, giving higher weights to misclassified samples. The final classifier is built by weighted voting.
Random forest is a substantial modification of bagging applied to CART trees. It adds a second source of randomness beyond bagging, which only samples cases with replacement (the bootstrap): when splitting each node of each CART tree, random forest also samples a subset of features without replacement and tests only this subset to find the best-performing feature on which to split the data at that node.
As we all know, averaging B i.i.d. random variables, each with variance $\sigma^2$, gives a variable with variance $\sigma^2/B$; but if these B variables are only identically distributed (as in bagging), with pairwise correlation $\rho$, the variance of the average is

$$\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$$
Proof:

$$\mathrm{Var}\left[\frac{1}{B}\sum_{i=1}^{B} X_i\right] = \frac{1}{B^2}\left[\sum_{i=1}^{B}\mathrm{Var}(X_i) + \sum_{i\neq j}\mathrm{Cov}(X_i, X_j)\right] = \frac{1}{B^2}\left[B\sigma^2 + B(B-1)\rho\sigma^2\right] = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$$
($\rho$ is the correlation of each pair of bootstrap samples.) So as B increases, the variance of the average decreases toward $\rho\sigma^2$. As can be seen, if we can reduce the correlation $\rho$ without increasing $\sigma^2$ too much, we can reduce the variance of the average. Random forest achieves this by random selection of input features: before each split, select $m < p$ of the input variables as candidates for splitting. For regression, $m$ is usually set to $\lfloor p/3 \rfloor$; for classification, $m$ is set to $\lfloor\sqrt{p}\rfloor$.
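A quick numerical check of this formula in R (the values of $\rho$, $\sigma^2$, and B below are arbitrary, for illustration only):

avg_var <- function(rho, sigma2, B) rho * sigma2 + (1 - rho) * sigma2 / B  # the formula above
avg_var(rho = 0.5, sigma2 = 1, B = c(1, 10, 100, 1000))  # decreases toward rho * sigma2 = 0.5
avg_var(rho = 0.1, sigma2 = 1, B = 1000)                 # a smaller rho gives a lower floor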
So now we can state that:
bootstrapping samples is used to reduce the variance of each individual tree;
random selection of a subset of features is used to reduce the correlation between each pair of bootstrap samples.
Pseudo-code for random forest (an R sketch follows the list):
For b = 1 to B:
(a) Draw a bootstrap sample $Z^*$ of size N from the training data.
(b) Grow a random-forest tree $T_b$ on the bootstrapped data by recursively repeating the following steps at each terminal node of the tree, until the minimum node size $n_{min}$ is reached:
1. Select m variables at random from the p variables.
2. Pick the best variable/split-point among the m.
3. Split the node into two daughter nodes.
Output the ensemble of trees $\{T_b\}_{b=1}^{B}$.
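Below is a minimal sketch of this loop in R, for illustration only (grow_forest and predict_forest are made-up helper names). A true random forest re-samples the m features at every node; rpart does not expose that hook, so this simplification samples the m features once per tree.

library(rpart)  # CART trees

grow_forest <- function(X, y, B = 100, m = floor(sqrt(ncol(X)))) {
  # X: data frame of predictors; y: factor of class labels
  lapply(1:B, function(b) {
    idx   <- sample(nrow(X), replace = TRUE)  # (a) bootstrap sample of size N
    feats <- sample(ncol(X), m)               # random subset of m features (once per tree here)
    fit   <- rpart(y ~ ., data = data.frame(X[idx, feats, drop = FALSE], y = y[idx]))
    list(fit = fit, feats = feats)
  })
}

predict_forest <- function(forest, X_new) {
  votes <- sapply(forest, function(t)
    as.character(predict(t$fit, data.frame(X_new[, t$feats, drop = FALSE]), type = "class")))
  apply(votes, 1, function(v) names(which.max(table(v))))  # majority vote over the trees
}

For example, forest <- grow_forest(iris[, 1:4], iris$Species) followed by predict_forest(forest, iris[1:5, 1:4]) runs the whole pipeline on iris.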
(二). Models random forest can be used on
Notice that random forest performs well only with nonlinear base models, such as decision trees; it is not suitable for linear models. Since bagging is an additive ensemble technique, an average of linear models is still linear. Note that fitting a linear model is a convex problem, so we can find the best possible solution; since bagging again produces a linear model, it cannot beat that best possible solution. Here we give an example using the sample mean, which is linear: suppose $x_1, x_2, \ldots, x_N$ are i.i.d. with mean $\mu$ and variance $\sigma^2$. Let $\bar{x}^*_i$ be a bootstrap realization of the sample mean ($i \in 1{:}B$). The bagged model of the sample mean is:

$$\frac{1}{B}\sum_{i=1}^{B}\bar{x}^*_i$$
Here we claim that:
$\mathrm{Var}(\bar{x}^*_i) = \frac{2N-1}{N^2}\sigma^2$
$\mathrm{Cor}(\bar{x}^*_i, \bar{x}^*_j) = \frac{N}{2N-1}$
Proof:

$E[\bar{X}^*_1] = E_X\, E[\bar{X}^*_1 \mid X] = E_X[\bar{X}] = \mu$

$E[(\bar{X}^*_1)^2] = E_X\, E[(\bar{X}^*_1)^2 \mid X] = E_X\left[\frac{1}{N}\left(\frac{1}{N}\sum_{i=1}^{N} X_i^2 - \bar{X}^2\right) + \bar{X}^2\right] = \frac{2N-1}{N^2}\sigma^2 + \mu^2$

$\mathrm{Var}(\bar{X}^*_1) = E[(\bar{X}^*_1)^2] - \left(E[\bar{X}^*_1]\right)^2 = \frac{2N-1}{N^2}\sigma^2$

$E[\bar{X}^*_1 \bar{X}^*_2] = E_X\, E[\bar{X}^*_1 \bar{X}^*_2 \mid X] = E_X\left[\left(E[\bar{X}^*_1 \mid X]\right)^2\right] = E_X[\bar{X}^2] = \frac{1}{N}\sigma^2 + \mu^2$

$\mathrm{Cov}(\bar{X}^*_1, \bar{X}^*_2) = E[\bar{X}^*_1 \bar{X}^*_2] - E[\bar{X}^*_1]\, E[\bar{X}^*_2] = \frac{1}{N}\sigma^2$

$\mathrm{Cor}(\bar{X}^*_1, \bar{X}^*_2) = \frac{\mathrm{Cov}(\bar{X}^*_1, \bar{X}^*_2)}{\sqrt{\mathrm{Var}(\bar{X}^*_1)\,\mathrm{Var}(\bar{X}^*_2)}} = \frac{N}{2N-1}$
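These two claims are easy to verify with a quick Monte-Carlo simulation in R (standard normal data with N = 20 is an arbitrary choice):

set.seed(2)
N <- 20; sims <- 50000
bs_means <- replicate(sims, {
  x <- rnorm(N)                            # a fresh i.i.d. (0, 1) sample
  c(mean(sample(x, N, replace = TRUE)),    # two bootstrap sample means
    mean(sample(x, N, replace = TRUE)))    # drawn from the same data
})
var(bs_means[1, ]); (2 * N - 1) / N^2              # both close to 0.0975
cor(bs_means[1, ], bs_means[2, ]); N / (2 * N - 1) # both close to 0.5128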
Then, using the formula

$$\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2,$$

we get the variance of the bagged estimator:

$$\frac{1}{N}\sigma^2 + \frac{N-1}{BN^2}\sigma^2 > \frac{1}{N}\sigma^2 = \mathrm{Var}(\bar{X}),$$

which says that bagging a linear model cannot reduce its variance.
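A simulation makes the same point (the choices N = 50, B = 200, $\sigma^2 = 1$, and 5000 repetitions are arbitrary):

set.seed(1)
N <- 50; B <- 200; sims <- 5000
bagged_means <- replicate(sims, {
  x <- rnorm(N)                                            # i.i.d. (0, 1) data
  mean(replicate(B, mean(sample(x, N, replace = TRUE))))   # the bagged sample mean
})
var(bagged_means)   # about 1/N + (N-1)/(B*N^2), i.e. slightly ABOVE ...
1 / N               # ... the variance of the plain sample mean, sigma^2/N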
(三). OOB (out-of-bag) estimation error
An important property of random forest is the use of out-of-bag samples: the samples that are not drawn in the bootstrap sample used to build a given tree. The OOB error works as follows: for each observation $z_i = (x_i, y_i)$, construct its random forest predictor by averaging only those trees corresponding to bootstrap samples that do not contain $z_i$.
As a formula, the process of computing this error is: suppose B trees $\{T_1, T_2, \ldots, T_B\}$ have been built, and for each individual sample $x_i$ there exists a subset of trees $S_i$ built without it. We predict the label for $x_i$ using only these trees, $\hat{y}_i = \arg\max_k \sum_{T_b \in S_i} 1\{T_b(x_i) = k\}$, and the OOB error is the average over all samples:
$$\mathrm{OOB}_{\mathrm{error}} = \frac{1}{N}\sum_{i=1}^{N} 1\{\hat{y}_i \neq y_i\}$$
The OOB error is close to the N-fold cross-validation error, so we do not need to perform cross-validation alongside tree building; once the OOB error stabilizes, training can be terminated. [The OOB error can thus be used to select the number of trees to build.]
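The randomForest package reports the OOB error directly; a small sketch on iris:

library(randomForest)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)
rf$err.rate[rf$ntree, "OOB"]  # the final OOB error estimate
plot(rf)  # OOB error vs. number of trees: once the curve flattens, extra trees add little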
(四). Variable importance
There are two ways to evaluate variable importance.
Gini importance: at each split in each tree, the improvement in the split criterion is the importance measure attributed to the splitting variable, and it is accumulated over all the trees separately for each variable. As a formula: in each decision tree $T_b$, $b \in 1{:}B$, the squared importance measure of variable $\ell$ is defined as
$$\mathrm{VI}_\ell^2(T_b) = \sum_{t=1}^{J-1} \hat{i}_t^2\, I\{v(t) = \ell\}$$
where $v(t)$ is the index of the variable used to split node $t$ into two daughter nodes, the sum runs over the $J-1$ internal nodes of the tree, and $\hat{i}_t^2$ is the improvement in the split criterion, such as squared-error risk or Gini impurity. Averaging over all B trees then gives the total squared importance measure of variable $\ell$:
$$\mathrm{VI}_\ell^2 = \frac{1}{B}\sum_{b=1}^{B}\mathrm{VI}_\ell^2(T_b) = \frac{1}{B}\sum_{b=1}^{B}\sum_{t=1}^{J-1}\hat{i}_t^2\, I\{v(t) = \ell\}$$
Permutation importance: uses the OOB samples to construct a different importance measure. When $T_b$ is built, the OOB samples are passed down the tree and the prediction accuracy is recorded, $C_b = \frac{1}{|\mathrm{OOB}(b)|}\sum_{i \in \mathrm{OOB}(b)} I\{\hat{y}_i = y_i\}$. Then the values of the $\ell$-th variable are randomly permuted in the OOB samples, and the prediction accuracy is computed again, $C_{b\ell} = \frac{1}{|\mathrm{OOB}(b)|}\sum_{i \in \mathrm{OOB}(b)} I\{\hat{y}_{i,\pi_\ell} = y_i\}$. The decrease $\mathrm{VI}_\ell(T_b) = C_b - C_{b\ell}$ is the permutation importance of the $\ell$-th variable in tree $T_b$. Finally, the importance measure of the $\ell$-th variable is the average of $\mathrm{VI}_\ell(T_b)$ over the B trees:
$$\mathrm{VI}_\ell = \frac{1}{B}\sum_{b=1}^{B}\mathrm{VI}_\ell(T_b)$$
Summary: Gini importance measures how often variable $\ell$ is used to split a node and how useful those splits are. Permutation importance measures how useful variable $\ell$ is for predicting held-out (OOB) data. Both can be obtained from the randomForest package, as shown below.
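A short sketch using randomForest, which computes both measures when fitted with importance = TRUE:

library(randomForest)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)
importance(rf, type = 1)  # permutation importance (mean decrease in accuracy)
importance(rf, type = 2)  # Gini importance (mean decrease in node impurity)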
(五). Proximity plot
In growing a random forest, an N×N proximity matrix is accumulated for the training data. For every tree, the proximity of any pair of OOB samples sharing a terminal node is increased by one. That is to say: for any pair of training samples, the proximity represents the number of trees in which they fall into the same terminal node, among the trees built without them.
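A common way to draw the proximity plot is to treat $1 - \mathrm{proximity}$ as a distance and project it to two dimensions with classical multidimensional scaling; a minimal sketch:

library(randomForest)
rf <- randomForest(Species ~ ., data = iris, proximity = TRUE)
mds <- cmdscale(1 - rf$proximity, k = 2)  # classical MDS on 1 - proximity
plot(mds, col = as.integer(iris$Species),
     xlab = "MDS dimension 1", ylab = "MDS dimension 2",
     main = "Random forest proximity plot (iris)")

The package also offers MDSplot(rf, iris$Species) as a one-line shortcut for the same idea.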
(六). R code for random forest
Here is a simple R example for random forest using the randomForest package:

rm(list = ls(all = TRUE))             # remove all objects
# install.packages("randomForest")    # install the randomForest package
library(randomForest)                 # load the package
data(iris)                            # data to analyze
n <- nrow(iris)                       # number of samples
p <- ncol(iris) - 1                   # number of predictor variables (column 5 is the label)
test_inx <- sample(1:n, n / 5)        # hold out n/5 samples as test data
iris_train <- iris[-test_inx, ]       # training data
iris_test <- iris[test_inx, ]         # test data
iris_rf <- randomForest(iris_train[, -5], iris_train[, 5],
                        ntree = 1000, mtry = floor(sqrt(p)),
                        replace = TRUE, importance = TRUE,
                        proximity = TRUE)   # fit the model on the training data
print(iris_rf)                        # view the result (includes the OOB error)
iris_rf$importance                    # variable importance
varImpPlot(iris_rf)                   # plot variable importance
iris_predict <- predict(iris_rf, iris_test[, -5], type = "response")  # predict test labels
table(observed = iris_test[, 5], predicted = iris_predict)            # confusion table
error <- sum(iris_predict != iris_test[, 5]) / length(test_inx)       # test prediction error
error