Repost: Feature Selection with XGBoost
2017-11-15 16:48
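XGBoost is a powerful weapon in data mining competitions and often achieves better results than other machine learning algorithms. Data preprocessing, feature engineering, and parameter tuning all have a strong effect on its performance. This post describes feature selection with XGBoost: ranking features by importance so that the more useful ones can be fed into the XGBoost model.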
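The dataset comes from the Kaggle Allstate Claims Severity competition. The training set, shown below, has 116 categorical features (cat1-cat116) and 14 continuous features (cont1-cont14). The categorical features are stored as strings and must first be converted to numbers: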
```
   id cat1 cat2 cat3 cat4 cat5 cat6 cat7 cat8 cat9   ...       cont6  \
0   1    A    B    A    B    A    A    A    A    B   ...    0.718367
1   2    A    B    A    A    A    A    A    A    B   ...    0.438917
2   5    A    B    A    A    B    A    A    A    B   ...    0.289648
3  10    B    B    A    B    A    A    A    A    B   ...    0.440945
4  11    A    B    A    B    A    A    A    A    B   ...    0.178193

      cont7    cont8    cont9   cont10    cont11    cont12    cont13  \
0  0.335060  0.30260  0.67135  0.83510  0.569745  0.594646  0.822493
1  0.436585  0.60087  0.35127  0.43919  0.338312  0.366307  0.611431
2  0.315545  0.27320  0.26076  0.32446  0.381398  0.373424  0.195709
3  0.391128  0.31796  0.32128  0.44467  0.327915  0.321570  0.605077
4  0.247408  0.24564  0.22089  0.21230  0.204687  0.202213  0.246011
```
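The string-to-integer conversion of the categorical columns can be tried in isolation first. The following is a minimal sketch on a made-up toy table (the values are illustrative and are not taken from the competition data); it uses `pd.factorize` with `sort=True` and shifts the codes so that they start at 1.

```python
import pandas as pd

# Toy frame standing in for a few Allstate columns (values are made up).
toy = pd.DataFrame({
    "cat1": ["A", "B", "A", "B"],
    "cat2": ["B", "B", "A", "B"],
    "cont1": [0.72, 0.44, 0.29, 0.18],
})

# Only the columns whose names start with 'cat' are encoded.
cat_cols = [c for c in toy.columns if c.startswith("cat")]
for col in cat_cols:
    # sort=True assigns codes in sorted label order; +1 makes codes start at 1
    codes, labels = pd.factorize(toy[col].values, sort=True)
    toy[col] = codes + 1

print(toy)
```

With the default settings, `pd.factorize` assigns the code -1 to missing values, so after the shift a missing category ends up as 0.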
The XGBoost feature selection code is as follows:
```python
import operator
import os

import matplotlib.pyplot as plt
import pandas as pd
import xgboost as xgb


def create_feature_map(features):
    """Write an XGBoost feature map: one 'index<TAB>name<TAB>type' line per feature."""
    with open('xgb.fmap', 'w') as outfile:
        for i, feat in enumerate(features):
            # 'q' marks the feature as quantitative
            outfile.write('{0}\t{1}\tq\n'.format(i, feat))


if __name__ == '__main__':
    train = pd.read_csv("../input/train.csv")

    # Encode the categorical features as integers
    cat_sel = [n for n in train.columns if n.startswith('cat')]
    for column in cat_sel:
        train[column] = pd.factorize(train[column].values, sort=True)[0] + 1

    params = {
        'min_child_weight': 100,
        'eta': 0.02,
        'colsample_bytree': 0.7,
        'max_depth': 12,
        'subsample': 0.7,
        'alpha': 1,
        'gamma': 1,
        'verbosity': 0,  # replaces the deprecated 'silent': 1
        'seed': 12
    }
    rounds = 10

    y = train['loss']
    X = train.drop(['loss', 'id'], axis=1)
    xgtrain = xgb.DMatrix(X, label=y)
    # verbose_eval is an argument of xgb.train, not a booster parameter
    bst = xgb.train(params, xgtrain, num_boost_round=rounds, verbose_eval=True)

    features = [x for x in train.columns if x not in ['id', 'loss']]
    create_feature_map(features)

    # Split counts per feature, sorted in ascending order
    importance = bst.get_fscore(fmap='xgb.fmap')
    importance = sorted(importance.items(), key=operator.itemgetter(1))

    df = pd.DataFrame(importance, columns=['feature', 'fscore'])
    df['fscore'] = df['fscore'] / df['fscore'].sum()  # normalize to sum to 1
    os.makedirs("../input/feat_sel", exist_ok=True)
    df.to_csv("../input/feat_sel/feat_importance.csv", index=False)

    # df.plot creates its own figure, so no separate plt.figure() call is needed
    df.plot(kind='barh', x='feature', y='fscore', legend=False, figsize=(6, 10))
    plt.title('XGBoost Feature Importance')
    plt.xlabel('relative importance')
    plt.show()
```
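The script stops at ranking and plotting the features; the selection step itself is left to the reader. One simple way to do it, sketched below, is to keep the `k` features with the largest normalized score. The helper `select_top_features`, the cutoff `k`, and the split counts are all hypothetical and only illustrate the shape of the data; they are not part of the original post.

```python
import pandas as pd


def select_top_features(importance_df, k):
    """Return the names of the k features with the largest fscore (hypothetical helper)."""
    ranked = importance_df.sort_values("fscore", ascending=False)
    return ranked["feature"].head(k).tolist()


# Made-up split counts, in the dict shape that bst.get_fscore() returns.
raw = {"cont7": 30, "cat80": 50, "cont2": 15, "cat12": 5}

# Same steps as the script: sort ascending, build a frame, normalize to sum to 1.
df = pd.DataFrame(sorted(raw.items(), key=lambda kv: kv[1]),
                  columns=["feature", "fscore"])
df["fscore"] = df["fscore"] / df["fscore"].sum()

top = select_top_features(df, k=2)
print(top)
```

The resulting list could then be used to subset the training columns before retraining, for example `X[top]`. Note that `get_fscore` counts how often each feature is used in a split, so features that never appear in any tree are absent from the table and can never be selected this way.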