
[Kaggle] Titanic Walkthrough

2018-03-22 17:44
Kaggle: https://www.kaggle.com/c/titanic. These are some quick notes on the competition.



Submission accuracy: 0.83
Code walkthrough:
1. Reading the data
# Read the training set
train = pd.read_csv('/Users/Cheney/Downloads/kaggle(方老师)/train.csv')
# Read the test set
test = pd.read_csv('/Users/Cheney/Downloads/kaggle(方老师)/test.csv')

2. Feature selection
Choose the features in the data to train on. From an analysis of the problem, 'PassengerId' is redundant, and 'Name', 'Ticket', and 'Cabin' have no obvious effect on a passenger's survival, so they are not selected. The remaining seven columns are the training features; 'Survived' is the label.

# Feature selection
X_train = train[['Pclass','Sex','Age','Embarked','SibSp','Parch','Fare']]
X_test = test[['Pclass','Sex','Age','Embarked','SibSp','Parch','Fare']]
# Label
y_train = train['Survived']

3. Filling missing values
First fill the training set. The 'Embarked' column is filled with 'S' because 'S' is the most frequent value in that column, so a missing value is most likely 'S'; the 'Age' column is filled with its mean. Note that fillna returns a new Series rather than modifying in place, so the result must be assigned back (the original snippet discarded it).

# Fill missing Embarked values in the training set
X_train['Embarked'] = X_train['Embarked'].fillna('S')
# Fill missing Age values in the training set
X_train['Age'] = X_train['Age'].fillna(X_train['Age'].mean())

Then fill the test set. 'Embarked' and 'Age' are handled the same way as in the training set, but the test set also has missing values in 'Fare', which are likewise filled with the mean.

# Fill missing values in the test set
X_test['Embarked'] = X_test['Embarked'].fillna('S')
X_test['Age'] = X_test['Age'].fillna(X_test['Age'].mean())
X_test['Fare'] = X_test['Fare'].fillna(X_test['Fare'].mean())
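
Before and after filling, it is worth confirming which columns actually contain NaNs. A minimal check (my addition, not part of the original post):

# Count missing values per column; after filling, these should all be 0
print(X_train.isnull().sum())
print(X_test.isnull().sum())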

4. Use DictVectorizer for categorical feature extraction: it converts the list of dicts into a numpy array, one-hot encoding the string-valued columns.

# Feature extraction with DictVectorizer
dict_vec = DictVectorizer(sparse=False)
X_train = dict_vec.fit_transform(X_train.to_dict(orient='records'))
X_test = dict_vec.transform(X_test.to_dict(orient='records'))
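
To see what the vectorizer did, you can print its learned feature names; string columns such as Sex and Embarked are expanded into one-hot indicator columns. A quick inspection (my addition; the exact column order may vary):

# Show the one-hot encoded feature names learned by DictVectorizer
print(dict_vec.feature_names_)
# e.g. ['Age', 'Embarked=C', 'Embarked=Q', 'Embarked=S', 'Fare',
#       'Parch', 'Pclass', 'Sex=female', 'Sex=male', 'SibSp']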

5. Model choice
I chose XGBoost, which performs well in most Kaggle competitions; I also used it for the two in-class competitions my lab advisor assigned, and it controls overfitting well. The parameters I set:

booster: gbtree (a tree-based model)
objective: multi:softmax (multiclass classification with softmax; returns the predicted class)
num_class: 2 (two classes)
learning_rate: 0.1 (shrinking each step's weights makes the model more robust; of the values I tried, 0.1 gave the highest accuracy)
max_depth: 2 (also used to avoid overfitting; the larger max_depth is, the more specific and local the patterns the model learns)
silent: 0 (print progress so we can better understand the model)

All other parameters are left at their defaults.

# Model: XGBoost (xgb_model is defined but unused; training below uses the native xgb.train API)
xgb_model = xgb.XGBClassifier()

# Parameter settings
params = dict(booster='gbtree',
              objective='multi:softmax',
              num_class=2,
              learning_rate=0.1,
              max_depth=2,
              silent=0)
6. Number of boosting rounds
# Convert params to a list of (key, value) pairs for xgb.train, and set the maximum number of rounds
plst = list(params.items())
num_rounds = 1000

7. Split the training data into a training set and a validation set; I hold out 20% of the data for validation. (The post imported train_test_split from sklearn.cross_validation, which has since been removed; sklearn.model_selection provides the same function.)
# Split into training and validation sets (80/20)
train_x, val_X, train_y, val_y = train_test_split(X_train, y_train, test_size=0.2, random_state=1)

8. Building the DMatrix objects
# Wrap the arrays in xgb DMatrix objects
xgb_val = xgb.DMatrix(val_X, label=val_y)
xgb_train = xgb.DMatrix(train_x, label=train_y)
xgb_test = xgb.DMatrix(X_test)
9. Training the model
When the configured number of rounds is large, early_stopping_rounds stops training once the validation accuracy has not improved for 100 consecutive rounds.

# watchlist lets us monitor performance on the training and validation sets
watchlist = [(xgb_train, 'train'), (xgb_val, 'val')]

# Train; stop early if the validation metric does not improve within early_stopping_rounds rounds
model = xgb.train(plst, xgb_train, num_rounds, watchlist, early_stopping_rounds=100)
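
If early stopping triggers, the Booster records where it stopped; these attributes can be checked after training (an optional sanity check, not in the original post):

# Best round found by early stopping and its validation score
print('best iteration:', model.best_iteration)
print('best score:', model.best_score)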

10. Prediction
# Predict on the test set, using only the trees up to the best round
preds = model.predict(xgb_test, ntree_limit=model.best_ntree_limit)
11. Output
The original snippet numbered the rows with range(1, len(X_test) + 1) and a 'Label' header, but Kaggle expects the test set's PassengerId values (892 onward) under a 'PassengerId,Survived' header, so the output is corrected here.
# Write the submission file in Kaggle's expected format
np.savetxt('/Users/Cheney/Downloads/kaggle(方老师)/xgbc_res.csv',
           np.c_[test['PassengerId'], preds],
           delimiter=',', header='PassengerId,Survived', comments='', fmt='%d')
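
Equivalently, the submission can be written with pandas, which makes the two expected columns explicit (a sketch assuming the standard Kaggle Titanic submission format):

# Alternative: build the submission as a DataFrame and write it out
submission = pd.DataFrame({'PassengerId': test['PassengerId'],
                           'Survived': preds.astype(int)})
submission.to_csv('/Users/Cheney/Downloads/kaggle(方老师)/xgbc_res.csv', index=False)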

Complete code:

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split

# Read the training set
train = pd.read_csv('/Users/Cheney/Downloads/kaggle(方老师)/train.csv')
# Read the test set
test = pd.read_csv('/Users/Cheney/Downloads/kaggle(方老师)/test.csv')

# Feature selection
X_train = train[['Pclass','Sex','Age','Embarked','SibSp','Parch','Fare']]
X_test = test[['Pclass','Sex','Age','Embarked','SibSp','Parch','Fare']]

# Label
y_train = train['Survived']

# Fill missing Embarked values in the training set
X_train['Embarked'] = X_train['Embarked'].fillna('S')
# Fill missing Age values in the training set
X_train['Age'] = X_train['Age'].fillna(X_train['Age'].mean())

# Fill missing values in the test set
X_test['Embarked'] = X_test['Embarked'].fillna('S')
X_test['Age'] = X_test['Age'].fillna(X_test['Age'].mean())
X_test['Fare'] = X_test['Fare'].fillna(X_test['Fare'].mean())

# Feature extraction with DictVectorizer
dict_vec = DictVectorizer(sparse=False)
X_train = dict_vec.fit_transform(X_train.to_dict(orient='records'))
X_test = dict_vec.transform(X_test.to_dict(orient='records'))

# Model: XGBoost (xgb_model is defined but unused; training uses the native xgb.train API)
xgb_model = xgb.XGBClassifier()

# Parameter settings
params = dict(booster='gbtree',
              objective='multi:softmax',
              num_class=2,
              learning_rate=0.1,
              max_depth=2,
              silent=0)

# Convert params to a list of (key, value) pairs and set the maximum number of rounds
plst = list(params.items())
num_rounds = 1000

# Split into training and validation sets (80/20)
train_x, val_X, train_y, val_y = train_test_split(X_train, y_train, test_size=0.2, random_state=1)

# Wrap the arrays in xgb DMatrix objects
xgb_val = xgb.DMatrix(val_X, label=val_y)
xgb_train = xgb.DMatrix(train_x, label=train_y)
xgb_test = xgb.DMatrix(X_test)

# watchlist lets us monitor performance on the training and validation sets
watchlist = [(xgb_train, 'train'), (xgb_val, 'val')]

# Train; stop early if the validation metric does not improve for 100 rounds
model = xgb.train(plst, xgb_train, num_rounds, watchlist, early_stopping_rounds=100)

# Predict on the test set, using only the trees up to the best round
preds = model.predict(xgb_test, ntree_limit=model.best_ntree_limit)

# Write the submission file in Kaggle's expected format
np.savetxt('/Users/Cheney/Downloads/kaggle(方老师)/xgbc_res.csv',
           np.c_[test['PassengerId'], preds],
           delimiter=',', header='PassengerId,Survived', comments='', fmt='%d')
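
As a possible refinement (my suggestion, not part of the original solution), xgb.cv can pick the number of boosting rounds by cross-validation instead of a single 80/20 split; a minimal sketch under the same params:

# Cross-validated estimate of a good number of boosting rounds
cv_results = xgb.cv(params, xgb.DMatrix(X_train, label=y_train),
                    num_boost_round=1000, nfold=5,
                    early_stopping_rounds=100, seed=1)
print('suggested rounds:', len(cv_results))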

Tags: kaggle titanic xgboost