
Getting Started with Kaggle: Predicting Titanic Survival

2017-12-17 16:33
This is an introductory competition on Kaggle, meant to walk us through the general workflow of a machine learning project.

Competition link: Titanic: Machine Learning from Disaster

Problem statement: the sinking of the Titanic killed a great many people, and the shortage of lifeboats was the main reason the death toll was so high. Although surviving involved some luck, some groups of passengers were more likely to survive than others. We are given information about a subset of the passengers, including whether each of them survived. Based on this information, we must predict whether the remaining passengers survived.

First, import the modules we need, such as numpy, pandas and sklearn:

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import Imputer


Then read the training data with pandas and look at the first 5 rows:

train_data = pd.read_csv('./data/train.csv')
train_data.head()


   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th…  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
We can see that the dataset has 12 columns. According to the competition description, they mean the following:

- PassengerId: passenger number
- Survived: whether the passenger survived; 0 = did not survive, 1 = survived
- Pclass: ticket class, a proxy for the passenger's socio-economic status; 1 = upper, 2 = middle, 3 = lower class
- Sex: male or female
- Age: age in years; fractional if less than 1
- SibSp: number of siblings and spouses aboard with the passenger
- Parch: number of parents and children aboard with the passenger
- Ticket: ticket number
- Fare: ticket fare
- Cabin: cabin number
- Embarked: port of embarkation
Next, look at the descriptive statistics of the data:

train_data.describe()


       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
The descriptive statistics tell us there are 891 records in total, and that the Age column has 177 missing values (only 714 non-null entries). The average age on board was around 29; the youngest passenger was under six months old and the oldest was 80. Fares also varied widely: some people boarded without paying anything, the highest fare was about 512, and the average was around 32. This suggests the fare may be strongly related to survival.
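The missing-value count can be read directly off the `count` row: 891 − 714 = 177 ages are missing. A minimal sketch of how to get such counts programmatically (on a toy frame here, since the real CSV is not bundled with this post):

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_data; the real file lives at ./data/train.csv
df = pd.DataFrame({'Age':  [22.0, np.nan, 26.0, np.nan, 35.0],
                   'Fare': [7.25, 71.28, 7.93, 53.10, 8.05]})

# isnull().sum() recovers per-column missing counts directly;
# on the real data it would report 177 for Age
missing = df.isnull().sum()
print(missing['Age'])   # 2 missing in this toy frame
```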

Now let's do some exploratory data analysis. First, the relationship between sex and survival:

sns.barplot(x='Sex',y='Survived',data=train_data)


[Bar plot: mean survival rate by sex]

We can see that the mean survival rate for men is under 0.2, while for women it is around 0.75, so sex has a major effect on survival. Next, let's check whether the port of embarkation affects the survival rate:

sns.barplot(x='Embarked',y='Survived',hue='Sex',data=train_data)


[Bar plot: mean survival rate by embarkation port, split by sex]

The chart shows that passengers who embarked at C had a higher mean survival rate than those who embarked elsewhere, and that men who embarked at Q had the lowest. So the port of embarkation has some influence on survival and can be used as a feature. Next, the effect of socio-economic status on the mean survival rate:

sns.pointplot(x='Pclass',y='Survived',hue='Sex',data=train_data,
              palette={'male':'blue','female':'pink'},
              markers=['*','o'],linestyles=['-','--'])


[Point plot: mean survival rate by Pclass, split by sex]

The chart shows that both men and women in the upper class survived at higher rates than in the other two classes. For women, the upper and middle classes both out-survived the lower class; for men, the middle and lower classes show little difference, while the upper class clearly out-survived both. So Pclass can also be used as a feature for predicting survival. I remember the line from the film Titanic, "Women and children first", so I suspect age has a large effect on survival as well. Let's look at survival counts by age for each sex:

grid = sns.FacetGrid(train_data,col='Survived',row='Sex',
                     size=2.2,aspect=1.6)
grid.map(plt.hist,'Age',alpha=.5,bins=20)
grid.add_legend()
[Facet grid: Age histograms for each Survived × Sex combination]
train_data.Sex.value_counts()


male      577
female    314
Name: Sex, dtype: int64


There were 577 men and 314 women on board, so men outnumbered women by a factor of about 1.8. In the four charts above, it is clear that among passengers aged 20 to 60, far more men than women died. The "women and children first" line also shows up in the charts: children aged 0–5 had a relatively high survival rate.
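The "children first" effect can also be quantified rather than eyeballed, by bucketing ages and averaging survival per bucket. A rough sketch on toy data (the bin edges here are my own choice, not from the competition):

```python
import pandas as pd

# Toy stand-in for train_data[['Age', 'Survived']]
df = pd.DataFrame({'Age':      [2, 4, 25, 30, 55, 70],
                   'Survived': [1, 1,  0,  1,  0,  0]})

# Hypothetical bin edges; pd.cut labels each age with its interval
df['AgeBand'] = pd.cut(df['Age'], bins=[0, 5, 18, 35, 60, 80])
rate = df.groupby('AgeBand', observed=True)['Survived'].mean()
print(rate.iloc[0])   # 1.0: both toy children in (0, 5] survived
```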

sns.barplot(x='SibSp',y='Survived',data=train_data)


[Bar plot: mean survival rate by SibSp]

Passengers with 1 or 2 siblings or spouses aboard were more likely to survive than those with none, which makes sense: in an emergency, people regroup with those close to them. But as the number grows, the survival rate keeps dropping; with a larger group, if one person runs into trouble, the whole group ends up waiting for them. The analysis of Parch below is much the same.

sns.barplot(x='Parch',y='Survived',data=train_data)


[Bar plot: mean survival rate by Parch]

Now let's do some preprocessing. First we need to fill in the missing ages; here we use the median age to fill the gaps:

train_data.Age = train_data.Age.fillna(train_data.Age.median())
train_data.describe()


       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  891.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.361582    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   13.019697    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   22.000000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   35.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
Then encode sex as a number: 1 for male, 0 for female:

train_data.Sex.unique()


array(['male', 'female'], dtype=object)


train_data.loc[train_data.Sex == 'male','Sex'] = 1
train_data.loc[train_data.Sex == 'female','Sex'] = 0
train_data.head()


   PassengerId  Survived  Pclass                                               Name  Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    1  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th…    0  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina    0  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)    0  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    1  35.0      0      0            373450   8.0500   NaN        S
Next, handle the port of embarkation. Embarked has missing values, so we need to fill them in first. Since S is by far the most common port, the missing entries are most likely S as well, so we fill the gaps with 'S'. Then we convert the column to numeric values:

train_data.Embarked.unique()


array(['S', 'C', 'Q', nan], dtype=object)


train_data.Embarked.value_counts()


S    644
C    168
Q     77
Name: Embarked, dtype: int64


train_data.Embarked = train_data.Embarked.fillna('S')
train_data.loc[train_data.Embarked == 'S','Embarked'] = 0
train_data.loc[train_data.Embarked == 'C','Embarked'] = 1
train_data.loc[train_data.Embarked == 'Q','Embarked'] = 2
train_data.head()


   PassengerId  Survived  Pclass                                               Name  Sex   Age  SibSp  Parch            Ticket     Fare Cabin  Embarked
0            1         0       3                            Braund, Mr. Owen Harris    1  22.0      1      0         A/5 21171   7.2500   NaN         0
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th…    0  38.0      1      0          PC 17599  71.2833   C85         1
2            3         1       3                             Heikkinen, Miss. Laina    0  26.0      0      0  STON/O2. 3101282   7.9250   NaN         0
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)    0  35.0      1      0            113803  53.1000  C123         0
4            5         0       3                           Allen, Mr. William Henry    1  35.0      0      0            373450   8.0500   NaN         0
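The pairs of `.loc` assignments used above for Sex and Embarked can equivalently be written with `Series.map`, which applies a dict lookup to every element. A minimal sketch on toy values:

```python
import pandas as pd

# Toy stand-in for the two categorical columns
df = pd.DataFrame({'Sex':      ['male', 'female', 'male'],
                   'Embarked': ['S', 'C', None]})

# One dict per column does the same work as the .loc == ... assignments
df['Sex'] = df['Sex'].map({'male': 1, 'female': 0})
df['Embarked'] = df['Embarked'].fillna('S').map({'S': 0, 'C': 1, 'Q': 2})
print(df['Sex'].tolist())        # [1, 0, 1]
print(df['Embarked'].tolist())   # [0, 1, 0]
```

A side benefit of `map` is that any value missing from the dict becomes NaN instead of silently passing through, which surfaces unexpected categories early.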
With the preprocessing done, we can train a model on the selected features. Since this is a binary classification problem, we use logistic regression:

features = ['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']

alg = LogisticRegression()
# without shuffle=True, random_state has no effect (and newer scikit-learn
# rejects it), so plain KFold with contiguous, in-order folds is used here
kf = KFold(n_splits=5)
predictions = list()
for train, test in kf.split(train_data):
    k_train = train_data[features].iloc[train,:]
    k_label = train_data.Survived.iloc[train]
    alg.fit(k_train,k_label)
    k_predictions = alg.predict(train_data[features].iloc[test,:])
    predictions.append(k_predictions)

# folds are contiguous and in order, so the concatenated out-of-fold
# predictions line up with train_data.Survived row for row
predictions = np.concatenate(predictions,axis=0)
accuracy_score(train_data.Survived,predictions)


0.79349046015712688


We can see that the cross-validated accuracy of logistic regression is close to 80%, which is not bad.
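The loop above pools the out-of-fold predictions and scores them once; `cross_val_score` does the per-fold version of the same idea in a single call (the averaged per-fold accuracy can differ slightly from the pooled score). A sketch on synthetic data rather than the real features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for train_data[features] / train_data.Survived
rng = np.random.RandomState(1)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + 0.3 * rng.normal(size=100) > 0).astype(int)

# One call replaces the manual fit/predict loop; scoring defaults to
# the estimator's accuracy for classifiers
scores = cross_val_score(LogisticRegression(), X, y, cv=KFold(n_splits=5))
print(len(scores), scores.mean())
```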

Next, let's preprocess the data we need predictions for in the same way, then fit on the training data and predict:

test_data = pd.read_csv('./data/test.csv')
test_data.head()


   PassengerId  Pclass                                          Name     Sex   Age  SibSp  Parch   Ticket     Fare Cabin Embarked
0          892       3                              Kelly, Mr. James    male  34.5      0      0   330911   7.8292   NaN        Q
1          893       3              Wilkes, Mrs. James (Ellen Needs)  female  47.0      1      0   363272   7.0000   NaN        S
2          894       2                     Myles, Mr. Thomas Francis    male  62.0      0      0   240276   9.6875   NaN        Q
3          895       3                              Wirz, Mr. Albert    male  27.0      0      0   315154   8.6625   NaN        S
4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0      1      1  3101298  12.2875   NaN        S
test_data.describe()


       PassengerId      Pclass         Age       SibSp       Parch        Fare
count   418.000000  418.000000  332.000000  418.000000  418.000000  417.000000
mean   1100.500000    2.265550   30.272590    0.447368    0.392344   35.627188
std     120.810458    0.841838   14.181209    0.896760    0.981429   55.907576
min     892.000000    1.000000    0.170000    0.000000    0.000000    0.000000
25%     996.250000    1.000000   21.000000    0.000000    0.000000    7.895800
50%    1100.500000    3.000000   27.000000    0.000000    0.000000   14.454200
75%    1204.750000    3.000000   39.000000    1.000000    0.000000   31.500000
max    1309.000000    3.000000   76.000000    8.000000    9.000000  512.329200
test_data.Age = test_data.Age.fillna(test_data.Age.mean())
test_data.head()


   PassengerId  Pclass                                          Name     Sex   Age  SibSp  Parch   Ticket     Fare Cabin Embarked
0          892       3                              Kelly, Mr. James    male  34.5      0      0   330911   7.8292   NaN        Q
1          893       3              Wilkes, Mrs. James (Ellen Needs)  female  47.0      1      0   363272   7.0000   NaN        S
2          894       2                     Myles, Mr. Thomas Francis    male  62.0      0      0   240276   9.6875   NaN        Q
3          895       3                              Wirz, Mr. Albert    male  27.0      0      0   315154   8.6625   NaN        S
4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0      1      1  3101298  12.2875   NaN        S
test_data.loc[test_data.Sex == 'male', 'Sex'] = 1
test_data.loc[test_data.Sex == 'female', 'Sex'] = 0
test_data.head()


   PassengerId  Pclass                                          Name  Sex   Age  SibSp  Parch   Ticket     Fare Cabin Embarked
0          892       3                              Kelly, Mr. James    1  34.5      0      0   330911   7.8292   NaN        Q
1          893       3              Wilkes, Mrs. James (Ellen Needs)    0  47.0      1      0   363272   7.0000   NaN        S
2          894       2                     Myles, Mr. Thomas Francis    1  62.0      0      0   240276   9.6875   NaN        Q
3          895       3                              Wirz, Mr. Albert    1  27.0      0      0   315154   8.6625   NaN        S
4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)    0  22.0      1      1  3101298  12.2875   NaN        S
test_data.Embarked = test_data.Embarked.fillna('S')
test_data.loc[test_data.Embarked == 'S','Embarked'] = 0
test_data.loc[test_data.Embarked == 'C','Embarked'] = 1
test_data.loc[test_data.Embarked == 'Q','Embarked'] = 2
test_data.describe()


       PassengerId      Pclass         Age       SibSp       Parch        Fare
count   418.000000  418.000000  418.000000  418.000000  418.000000  417.000000
mean   1100.500000    2.265550   30.272590    0.447368    0.392344   35.627188
std     120.810458    0.841838   12.634534    0.896760    0.981429   55.907576
min     892.000000    1.000000    0.170000    0.000000    0.000000    0.000000
25%     996.250000    1.000000   23.000000    0.000000    0.000000    7.895800
50%    1100.500000    3.000000   30.272590    0.000000    0.000000   14.454200
75%    1204.750000    3.000000   35.750000    1.000000    0.000000   31.500000
max    1309.000000    3.000000   76.000000    8.000000    9.000000  512.329200
test_data[features] = Imputer().fit_transform(test_data[features])
alg = LogisticRegression()
kf = KFold(n_splits=5)  # random_state dropped: it has no effect without shuffle=True
for train, test in kf.split(train_data):
    k_train = train_data[features].iloc[train,:]
    k_label = train_data.Survived.iloc[train]
    alg.fit(k_train,k_label)
    # predictions is overwritten each iteration, so only the model
    # fitted on the last fold's training split produces the final output
    predictions = alg.predict(test_data[features])
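A caveat for newer environments: `Imputer` was removed from scikit-learn in version 0.22; its replacement is `SimpleImputer`, which lives in `sklearn.impute` rather than `sklearn.preprocessing` but has the same fit/transform usage:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# SimpleImputer(strategy='mean') fills NaNs with the column mean,
# matching the old Imputer's default behaviour
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [5.0, 6.0]])
X_filled = SimpleImputer(strategy='mean').fit_transform(X)
print(X_filled[0, 1])   # 5.0, the mean of [4.0, 6.0]
```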


Once the predictions are done, we write the result to a CSV file, submit it, and check the score.

df = DataFrame([test_data.PassengerId,Series(predictions)],index=['PassengerId','Survived'])
df.T.to_csv('./data/gender_submission.csv',index=False)
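The build-then-transpose trick above produces exactly the two columns Kaggle expects; it is worth a quick sanity check on toy values before submitting:

```python
import pandas as pd

# Same construction as above, on toy values: each Series becomes a row,
# and transposing turns the row labels into column names
df = pd.DataFrame([pd.Series([892, 893, 894]), pd.Series([0, 1, 0])],
                  index=['PassengerId', 'Survived'])
sub = df.T
print(list(sub.columns))   # ['PassengerId', 'Survived']
print(len(sub))            # 3 rows, one per test passenger
```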




Code for this post on GitHub: Litt1e0range/Kaggle/Titanic
Tags: machine learning, Kaggle