基于Xgboost的移动业务室内室外预测
2018-01-05 15:13
351 查看
本文作为公司内举办的“移动业务室内室外预测”比赛团队的解决方案介绍,主要包含特征工程和xgboost使用的代码说明。最后附上一些思考和改进思路。
数据规模 10000+ 特征 14 ,可见原始数据的规模并不大,可以通过单机来完成分析工作。目标字段
(12760, 15)
5 rows × 319 columns
5 rows × 319 columns
1. 训练模型
2. 计算训练精度
3. 计算训练AUC值
4. 通过xgboost包的交叉验证,更新弱分类器个数n_estimators为最佳。
5. 绘制feature importance图
2.使用xgboost调参前,应该尽可能的缩小特征规模(也就是做好特征选择),并且考虑用后台执行(notebook经常会出现页面假死)
3.xgboost调参的基本流程总结如下:
- 初始化学习率 得到estimator的数量
- 对 max_depth和min_weight进行grid search,他们对最终结果会有很大影响。
- 调整gamma参数
- 调整subsample和colsample_bytree参数
- 正则化参数调优:alpha和lambda
- 进一步降低学习率,增加树的数量,得到最佳参数。
比赛背景
利用大数据中心整理的一份包括基站、通话时间、通话时长,标签为室内、室外可公开的数据集(数量达12000条),运用机器学习方法预测用户分布是处于室内或者室外,帮助移动业务相关参数配置分析和优化。问题定位
特征字段含义解释
字段 | 含义 |
---|---|
reportcellkey | 小区cellID,基站编号和小区编号的组合 |
strongestnbpci | 小区PCI,范围0~503 |
aoa | angle of arrival入射角度 |
ta_calc | 时延 |
rsrp | 测试功率值 |
rsrq | 测量质量值 |
ta | 时延 |
tadltvalue | 下行时延 |
mrtime | MR上报时间 |
imsi | IMSI用户,做了脱敏 |
ndskey | NDS数据服务器的归属key |
uerecordid | 用户记录ID |
starttime | 业务开始时间 |
endtime | 业务结束时间 |
positionmark_real | 室内室外,1:室内,2:室外 |
Import Libraries
import pandas as pd import numpy as np import matplotlib.pyplot as plt %matplotlib inline # from scipy.stats import mode
导入数据
data_train = pd.read_csv("IndoorOutdoorPredict.csv",header=0) data_test = pd.read_csv("test_data.csv",header=0) data_train.head()
reportcellkey | strongestnbpci | aoa | ta_calc | rsrp | rsrq | ta | tadltvalue | mrtime | imsi | ndskey | uerecordid | starttime | endtime | positionmark_real | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 240877_2 | 226 | 235 | 3.1250 | -97 | -6.5 | 3 | 34 | 2017/2/23 20:41 | 4.600000e+14 | null | 003ACED20452DA84 | 2017/2/23 20:35 | 2017/2/23 20:56 | 0 |
1 | 240877_2 | 225 | 242 | 1.7500 | -101 | -7 | 2 | 28 | 2017/2/23 21:10 | 4.600000e+14 | null | 003ACEDE02959A4A | 2017/2/23 21:00 | 2017/2/23 21:11 | 0 |
2 | 64638_1 | 330 | 88 | 1.5625 | -93 | -6 | 2 | 25 | 2017/2/21 15:22 | 4.600000e+14 | null | 000FC7E0F5E63D49 | 2017/2/21 15:09 | 2017/2/21 15:25 | 0 |
3 | 240877_2 | null | 248 | 3.1875 | -95 | -6.5 | 3 | 35 | 2017/2/23 21:08 | 4.600000e+14 | null | 003ACEDBC3068F92 | 2017/2/23 20:58 | 2017/2/23 21:17 | 0 |
4 | 254892_2 | null | 220 | 22.6250 | null | null | 22 | 42 | 2017/2/22 11:05 | 4.600000e+14 | null | 003E3AC0CDAC941A | 2017/2/22 10:47 | 2017/2/22 11:08 | 0 |
positionmark_real是预测用户位于室内还是室外,是一个二分类问题。
数据预处理
检查数据类型
data_train.dtypes
reportcellkey object strongestnbpci object aoa object ta_calc float64 rsrp object rsrq object ta int64 tadltvalue int64 mrtime object imsi float64 ndskey object uerecordid object starttime object endtime object positionmark_real int64 dtype: object
检查缺失值
import numpy as np data_train.replace('null', np.nan, inplace=True) data_test.replace('null', np.nan, inplace=True) data_train.apply(lambda x: sum(x.isnull())),data_test.apply(lambda x: sum(x.isnull()))
(reportcellkey 0 strongestnbpci 2657 aoa 116 ta_calc 0 rsrp 225 rsrq 225 ta 0 tadltvalue 0 mrtime 0 imsi 0 ndskey 10698 uerecordid 0 starttime 0 endtime 0 positionmark_real 0 dtype: int64, reportcellkey 0 strongestnbpci 508 aoa 26 ta_calc 0 rsrp 40 rsrq 40 ta 0 tadltvalue 0 mrtime 0 imsi 0 ndskey 2062 starttime 0 endtime 0 positionmark_real 2062 dtype: int64)
data_train.shape,data_test.shape
((10698, 15), (2062, 14))
drop掉uerecordid列
通过观察,测试集没有该列,所以这列特征不能用。data_train.drop('uerecordid',axis=1,inplace=True)
处理ndskey列
该列全部为null 直接drop掉data_train.drop('ndskey',axis=1,inplace=True) data_test.drop('ndskey',axis=1,inplace=True)
处理 aoa列
该列缺失值不是很多,尝试用均值填充data_train[['aoa']]=data_train[['aoa']].astype('float32') data_test[['aoa']]=data_test[['aoa']].astype('float32') data_train['aoa'].describe(),data_test['aoa'].describe()
(count 10582.000000 mean 177.791718 std 96.962294 min 0.000000 25% 103.000000 50% 206.000000 75% 248.000000 max 359.000000 Name: aoa, dtype: float64, count 2036.000000 mean 178.873276 std 97.544271 min 0.000000 25% 105.000000 50% 209.000000 75% 248.000000 max 359.000000 Name: aoa, dtype: float64)
data_train['aoa'].replace(np.nan,177.791718,inplace=True) data_test['aoa'].replace(np.nan,178.873276,inplace=True)
处理rsrq rsrp
转换类型为float32 并用均值填充缺失值data_train[['rsrp']] = data_train[['rsrp']].astype('float32') data_train[['rsrq']] = data_train[['rsrq']].astype('float32') data_test[['rsrp']] = data_test[['rsrp']].astype('float32') data_test[['rsrq']] = data_test[['rsrq']].astype('float32') data_train['rsrp'].describe(),data_train['rsrq'].describe(),data_test['rsrp'].describe(),data_test['rsrq'].describe()
(count 10473.000000 mean -92.272797 std 12.115891 min -131.000000 25% -102.000000 50% -94.000000 75% -84.000000 max -48.000000 Name: rsrp, dtype: float64, count 10473.000000 mean -8.449012 std 2.264931 min -20.000000 25% -10.000000 50% -8.500000 75% -6.500000 max -3.000000 Name: rsrq, dtype: float64, count 2022.000000 mean -92.339760 std 12.019746 min -126.000000 25% -101.000000 50% -94.000000 75% -83.000000 max -56.000000 Name: rsrp, dtype: float64, count 2022.000000 mean -8.559594 std 2.271767 min -20.000000 25% -10.000000 50% -8.500000 75% -7.000000 max -4.000000 Name: rsrq, dtype: float64)
data_train['rsrp'].replace(np.nan,-92.272797,inplace=True) data_train['rsrq'].replace(np.nan,-8.449012,inplace=True) data_test['rsrp'].replace(np.nan,-92.339760,inplace=True) data_test['rsrq'].replace(np.nan,-8.559594,inplace=True)
处理strongestnbpci列
小区PCI当做类别处理 缺失值尝试填充众数pd.value_counts(data_train['strongestnbpci']),pd.value_counts(data_test['strongestnbpci'])
(226 684 216 457 88 352 274 285 271 282 268 275 5 263 37 255 227 204 419 198 9 196 59 183 225 148 228 141 132 134 2 133 211 124 417 122 367 121 402 117 0 113 20 110 404 109 38 103 330 100 35 95 391 94 11 75 273 74 4 74 ... 277 1 179 1 279 1 454 1 361 1 416 1 128 1 385 1 60 1 205 1 431 1 430 1 14 1 373 1 13 1 257 1 439 1 236 1 500 1 465 1 30 1 39 1 144 1 107 1 143 1 178 1 351 1 289 1 319 1 405 1 Name: strongestnbpci, Length: 322, dtype: int64, 226 137 216 72 88 71 271 56 268 56 37 54 274 52 227 46 9 45 5 40 59 35 402 35 419 31 228 30 225 29 2 28 211 24 132 24 367 23 417 22 20 21 0 19 35 16 273 15 38 15 404 14 330 14 3 14 391 12 33 11 ... 384 1 199 1 368 1 247 1 245 1 362 1 305 1 447 1 303 1 333 1 62 1 487 1 99 1 255 1 220 1 392 1 471 1 13 1 96 1 166 1 188 1 69 1 185 1 299 1 297 1 337 1 418 1 173 1 288 1 101 1 Name: strongestnbpci, Length: 216, dtype: int64)
data_train['strongestnbpci'].replace(np.nan,'226',inplace=True) data_test['strongestnbpci'].replace(np.nan,'216',inplace=True)
处理IMSI列
该列取值差异大,没有相同取值,对构建模型无用,droppd.value_counts(data_train['imsi']),pd.value_counts(data_test['imsi'])
(4.600000e+14 10698 Name: imsi, dtype: int64, 4.600000e+14 2062 Name: imsi, dtype: int64)
data_train.drop('imsi',axis=1,inplace=True) data_test.drop('imsi',axis=1,inplace=True)
处理starttime,endtime,mrtime列
转换为时间类型,读入时会默认为字符串类型。data_train["starttime_"]=pd.to_datetime(data_train["starttime"]) data_train["endtime_"]=pd.to_datetime(data_train["endtime"]) data_train["mrtime_"]=pd.to_datetime(data_train["mrtime"]) data_test["starttime_"]=pd.to_datetime(data_test["starttime"]) data_test["endtime_"]=pd.to_datetime(data_test["endtime"]) data_test["mrtime_"]=pd.to_datetime(data_test["mrtime"])
观察reportcellkey特征
pd.value_counts(data_train['reportcellkey']) # 可作为类别特征
240768_4 1046 240877_2 996 260284_2 665 263135_128 618 254892_2 600 240770_2 391 254892_0 306 240772_1 295 240688_3 280 240795_2 239 240792_6 239 240877_1 226 254850_1 221 64648_1 213 254823_2 201 254850_2 198 756794_128 174 64638_1 153 254792_1 149 240770_5 149 240792_5 146 254890_2 115 240831_4 115 254823_0 110 240771_2 95 240768_0 85 920496_130 81 254831_1 75 254850_4 65 240818_2 55 ... 240825_1 1 240768_2 1 5274_129 1 240803_2 1 240846_2 1 254791_8 1 240643_5 1 240643_7 1 240643_0 1 240643_2 1 254791_2 1 263115_129 1 263115_128 1 5275_129 1 240642_2 1 240695_3 1 5318_144 1 240697_1 1 240847_4 1 264004_128 1 240878_2 1 263139_128 1 756794_130 1 240687_2 1 254791_11 1 240644_10 1 64648_2 1 240694_3 1 240694_5 1 240847_1 1 Name: reportcellkey, Length: 298, dtype: int64
data_train.dtypes
reportcellkey object strongestnbpci object aoa float32 ta_calc float64 rsrp float32 rsrq float32 ta int64 tadltvalue int64 mrtime object starttime object endtime object positionmark_real int64 starttime_ datetime64[ns] endtime_ datetime64[ns] mrtime_ datetime64[ns] dtype: object
data = pd.concat([data_train,data_test]) data.shape
(12760, 15)
将datetime类型的feature转换为更有物理含义的特征
data['start_minute']=data['starttime_'].apply(lambda x:x.minute) data['start_day']=data['starttime_'].apply(lambda x:x.day) data['start_hour']=data['starttime_'].apply(lambda x:x.hour) data['end_hour']=data['endtime_'].apply(lambda x:x.hour) data['end_minute']=data['endtime_'].apply(lambda x:x.minute) data['end_day']=data['endtime_'].apply(lambda x:x.day) data['mrtime_hour']=data['mrtime_'].apply(lambda x:x.hour) data['mrtime_minute']=data['mrtime_'].apply(lambda x:x.minute) data['mrtime_day']=data['mrtime_'].apply(lambda x:x.day) data['time_delta']=data['endtime_'] - data_train['starttime_'] data['time_delta_in_seconds']= data['time_delta'].apply(lambda x: x.seconds)
Drop 掉无用特征
# data = data.drop(['starttime','endtime','starttime_','endtime_','mrtime','mrtime_','time_delta'],axis=1) data = data.drop(['starttime','endtime','starttime_','endtime_','mrtime','mrtime_'],axis=1)
Numerical Coding:
from sklearn.preprocessing import LabelEncoder le = LabelEncoder() var_to_encode = ['reportcellkey'] for col in var_to_encode: data[col] = le.fit_transform(data[col])
One-Hot Coding
对于不好处理的类别型列 reportcellkey 进行One-Hot编码data = pd.get_dummies(data, columns=var_to_encode) data.columns
Index([u'strongestnbpci', u'aoa', u'ta_calc', u'rsrp', u'rsrq', u'ta', u'tadltvalue', u'positionmark_real', u'start_minute', u'start_day', ... u'reportcellkey_292', u'reportcellkey_293', u'reportcellkey_294', u'reportcellkey_295', u'reportcellkey_296', u'reportcellkey_297', u'reportcellkey_298', u'reportcellkey_299', u'reportcellkey_300', u'reportcellkey_301'], dtype='object', length=319)
Generate final dataset
data.to_csv('data_is_ready_formal_version_full.csv',index=False) data_train.to_csv('data_is_ready_formal_version_train.csv',index=False) data_test.to_csv('data_is_ready_formal_version_test.csv',index=False)
data.head()
strongestnbpci | aoa | ta_calc | rsrp | rsrq | ta | tadltvalue | positionmark_real | start_minute | start_day | … | reportcellkey_292 | reportcellkey_293 | reportcellkey_294 | reportcellkey_295 | reportcellkey_296 | reportcellkey_297 | reportcellkey_298 | reportcellkey_299 | reportcellkey_300 | reportcellkey_301 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 226 | 235.0 | 3.1250 | -97.000000 | -6.500000 | 3 | 34 | 0.0 | 35 | 23 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 225 | 242.0 | 1.7500 | -101.000000 | -7.000000 | 2 | 28 | 0.0 | 0 | 23 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 330 | 88.0 | 1.5625 | -93.000000 | -6.000000 | 2 | 25 | 0.0 | 9 | 21 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 226 | 248.0 | 3.1875 | -95.000000 | -6.500000 | 3 | 35 | 0.0 | 58 | 23 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 226 | 220.0 | 22.6250 | -92.272797 | -8.449012 | 22 | 42 | 0.0 | 47 | 22 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
模型训练与预测
Import Libraries
import pandas as pd import numpy as np import xgboost as xgb from xgboost.sklearn import XGBClassifier from xgboost import plot_tree from sklearn import cross_validation, metrics from sklearn.grid_search import GridSearchCV import matplotlib.pylab as plt %matplotlib inline from matplotlib.pylab import rcParams rcParams['figure.figsize'] = 20,10
Load Data:
The data has gone through following pre-processing:data = pd.read_csv('data_is_ready_formal_version_full.csv') data.head()
strongestnbpci | aoa | ta_calc | rsrp | rsrq | ta | tadltvalue | positionmark_real | start_minute | start_day | … | reportcellkey_292 | reportcellkey_293 | reportcellkey_294 | reportcellkey_295 | reportcellkey_296 | reportcellkey_297 | reportcellkey_298 | reportcellkey_299 | reportcellkey_300 | reportcellkey_301 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 226 | 235.0 | 3.1250 | -97.000000 | -6.500000 | 3 | 34 | 0.0 | 35 | 23 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 225 | 242.0 | 1.7500 | -101.000000 | -7.000000 | 2 | 28 | 0.0 | 0 | 23 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 330 | 88.0 | 1.5625 | -93.000000 | -6.000000 | 2 | 25 | 0.0 | 9 | 21 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 226 | 248.0 | 3.1875 | -95.000000 | -6.500000 | 3 | 35 | 0.0 | 58 | 23 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 226 | 220.0 | 22.6250 | -92.272797 | -8.449012 | 22 | 42 | 0.0 | 47 | 22 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
data.shape
(12760, 319)
data.columns
Index([u'strongestnbpci', u'aoa', u'ta_calc', u'rsrp', u'rsrq', u'ta', u'tadltvalue', u'positionmark_real', u'start_minute', u'start_day', ... u'reportcellkey_292', u'reportcellkey_293', u'reportcellkey_294', u'reportcellkey_295', u'reportcellkey_296', u'reportcellkey_297', u'reportcellkey_298', u'reportcellkey_299', u'reportcellkey_300', u'reportcellkey_301'], dtype='object', length=319)
target = 'positionmark_real' data['positionmark_real'].value_counts()
0.0 7740 1.0 2958 Name: positionmark_real, dtype: int64
data_train = data[:10698] data_test = data[10698:]
# 初始化为0 后续替换为真实预测值 data_test['positionmark_real']=0
/root/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy from ipykernel import kernelapp as app
定义函数用来训练模型和交叉验证
函数包含以下功能:1. 训练模型
2. 计算训练精度
3. 计算训练AUC值
4. 通过xgboost包的交叉验证,更新弱分类器个数n_estimators为最佳。
5. 绘制feature importance图
def modelfit(alg,dtrain,dtest, predictors, cv_folds=5, early_stopping_rounds=50): xgb_param = alg.get_xgb_params() xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values) cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,metrics=['auc'], early_stopping_rounds=early_stopping_rounds) print cvresult.shape[0] alg.set_params(n_estimators=cvresult.shape[0]) #Fit the algorithm on the data alg.fit(dtrain[predictors], dtrain['positionmark_real'],eval_metric='auc') #Predict training set: dtrain_predictions = alg.predict(dtrain[predictors]) dtrain_predprob = alg.predict_proba(dtrain[predictors])[:,1] dtest_predictions = alg.predict(dtest[predictors]) #Print model report: print "\nModel Report" print "Accuracy : %.4g" % metrics.accuracy_score(dtrain['positionmark_real'].values, dtrain_predictions) print "AUC Score (Train): %f" % metrics.roc_auc_score(dtrain['positionmark_real'], dtrain_predprob) feat_imp = pd.Series(alg.booster().get_fscore()).sort_values(ascending=False) feat_imp.plot(kind='bar', title='Feature Importances') plt.ylabel('Feature Importance Score') return dtest_predictions,alg
Find the number of estimators for a high learning rate
predictors = [x for x in data.columns if x != target] xgb1 = XGBClassifier( learning_rate =0.1, n_estimators=1000, max_depth=5, min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8, objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27) data_test[target],model = modelfit(xgb1, data_train, data_test, predictors)
Will train until cv error hasn't decreased in 50 rounds. Stopping. Best iteration: 87 88 Model Report Accuracy : 1 AUC Score (Train): 1.000000
经验与不足
1.特征工程中缺少更加精确的特征选择步骤,导致代码执行的效率不够高。一个比较完整的特征工程如下图所示,特别应该引起注意的是特征选择步骤。2.使用xgboost调参前,应该尽可能的缩小特征规模(也就是做好特征选择),并且考虑用后台执行(notebook经常会出现页面假死)
3.xgboost调参的基本流程总结如下:
- 初始化学习率 得到estimator的数量
- 对 max_depth和min_weight进行grid search,他们对最终结果会有很大影响。
- 调整gamma参数
- 调整subsample和colsample_bytree参数
- 正则化参数调优:alpha和lambda
- 进一步降低学习率,增加树的数量,得到最佳参数。
相关文章推荐
- 基于xgboost的房价预测
- 基于R的xgboost
- xgboost用于文本分类预测,java接口
- 用XGBoost做时间序列预测—forecastxgb包
- Windows下安装XGBoost(python基于Anaconda2)
- R语言 - 基于[狗熊会]“百度搜索竞价广告创意研究”数据的定序回归、随机森林与xgboost - v0.5
- 利用随机森林,xgboost,logistic回归,预测泰坦尼克号上面的乘客的获救概率
- 强移动计算中心——基于可穿戴设备的一个技术预测
- 天池离线赛 - 移动推荐算法(四):基于LR, RF, GBDT等模型的预测
- 《BREW进阶与精通——3G移动增值业务的运营、定制与开发》连载之30---基于BREW的PTT服务
- 《BREW进阶与精通――3G移动增值业务的运营、定制与开发》连载之30---基于BREW的PTT服务
- 天池离线赛 - 移动推荐算法(二):基于简单规则的预测
- 用XGBoost做时间序列预测—forecastxgb包
- 《BREW进阶与精通——3G移动增值业务的运营、定制与开发》连载之31---LBS基于BREW的位置服务
- 基于Xgboost的不均衡数据分类
- 算法-基于成交量的Adaboost股价涨跌预测模型
- 《BREW进阶与精通――3G移动增值业务的运营、定制与开发》连载之31---LBS基于BREW的位置服务
- 在anaconda中安装xgboost,基于win10 32/64位系统
- XGboost 实战糖尿病预测
- 基于决策树的分类回归(随机森林,xgboost, gbdt)