
Titanic, Multi-Model Edition: The Data Analysis Explained in Detail (Hands-On for Machine Learning Beginners)

2017-03-28 17:09
Source: the original English notebook.

Figures: see the original English version.

Related analyses worth reading:

1. Uses the XGBoost algorithm. It skips feature analysis, but it walks through the whole analyze-then-predict workflow quickly, which makes the more complex notebooks easier to follow afterwards.

2. A very detailed and easy-to-understand feature analysis.

3. Uses a heatmap to analyze the pairwise correlations between features, plus a multi-layer stacking ensemble (a sketch of the heatmap and pairplot ideas follows this list).

4. If you want to understand how to read a pairplot, there is an analysis here: which features separate the predicted classes well, and which pairs of features are strongly correlated with each other.

5. Uses cross-validation to check model accuracy (a sketch appears in the model-evaluation section below).

6. Another very detailed write-up.
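To make items 3 and 4 concrete, here is a minimal sketch of both plots (my addition, not from the original notebook). It assumes train_df has already been loaded as in the code below; everything else is standard pandas/seaborn usage.

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Item 3: correlation heatmap over the numeric features.
numeric = train_df.select_dtypes(include=[np.number])
sns.heatmap(numeric.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()

# Item 4: pairplot colored by survival, to see which features separate
# the classes well and which are strongly correlated with each other.
sns.pairplot(numeric.dropna(), hue='Survived')
plt.show()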

# Workflow stages
#   1. Define the problem
#   2. Acquire the training and test sets
#   3. Wrangle the data
#   4. Analyze the data
#   5. Model to solve the problem
#   6. Visualize and report
#   7. Submit the results

# 1. Define the problem
# See the official competition page

# Imports
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization libraries
import seaborn as sns    # seaborn is built on top of matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

# 2. Acquire the data
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
combine = [train_df, test_df]

#  3. Wrangle the data & 4. Analyze the data

# Feature type analysis:
# print(train_df.columns.values)
# train_df.info()
# print('_'*40)
# test_df.info()
# available features:
# ['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch' 'Ticket' 'Fare' 'Cabin' 'Embarked']
# Seven features are int or float (six in the test set, which lacks Survived); five are strings (object).

# train_df.head()
# train_df.tail()
# Categorical features: these help us classify the samples and choose the right plots. Are they nominal, ordinal, ratio, or interval based?
# Categorical: Survived, Sex, and Embarked. Ordinal: Pclass.

# Numerical features: these also guide the choice of plots. Are they discrete, continuous, or timeseries based?
# Continuous: Age, Fare. Discrete: SibSp, Parch.

# Mixed-type features:
# Ticket is a mix of numeric and alphanumeric values; Cabin is alphanumeric.

# Features that may contain errors:
# The Name feature may contain errors, since names appear in several forms: titles, round brackets, quotes.

# Features with blank values:
# Training set: Cabin > Age > Embarked contain nulls, in that order.
# Test set: Cabin > Age are incomplete.

# Distribution analysis:
# Numerical features:
# train_df.describe()
# The sample holds 891 passengers, about 40% of the 2,224 actually aboard the Titanic.
# Survived is a categorical feature with values 0 or 1.
# The sample survival rate is 38%, versus an actual rate of about 32%.
# More than 75% of passengers did not travel with parents or children.
# Nearly 30% of passengers had siblings and/or a spouse aboard.
# Fewer than 1% of passengers paid a fare as high as $512.
# Fewer than 1% of passengers were elderly (age 65-80).
# Categorical features:
# train_df.describe(include=['O'])
# Names are unique across the dataset.
# Sex has two values, 65% male.
# Cabin contains duplicate values: several passengers shared a cabin.
# Embarked takes three values; most passengers boarded at port S.
# Ticket has a duplicate rate of about 22%.

# Assumptions based on the data analysis
# Before committing to a solution, identify which features correlate with Survival.
# Completing:
# Age is definitely relevant and should be completed.
# Embarked is relevant, either directly or through its correlation with other important features.
# Correcting:
# Ticket may be dropped, given its duplicate rate of about 22%.
# Cabin may be dropped, since it is highly incomplete in both the training and test sets.
# PassengerId is clearly irrelevant.
# Name is written in too many non-standard forms and may not contribute directly to the outcome.
# Creating:
# We may want to create a new Family feature from the number of parents, children, and siblings aboard.
# We may want to create a new Title feature extracted from Name.
# We may want to create a new feature from Age that turns a continuous numerical feature into an ordinal categorical one.
# We may also want to create a new Fare range feature.
# Classifying:
# Women were more likely to survive.
# Children (Age < ?) were more likely to survive.
# First-class passengers were more likely to survive.

# Analysis by pivoting features
# To confirm our observations and assumptions, we can quickly analyze feature correlations by pivoting features against each other.
# Pclass: for Pclass=1 the survival rate is well above 50%, a significant correlation.
# train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)
# Sex: for Sex=female the survival rate is above 74%.
# train_df[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)
# SibSp and Parch show no clear pattern on their own; it may be best to derive a new feature from them (or from a combination of features).
# train_df[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)
# train_df[["Parch", "Survived"]].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)

# Analysis by visualizing data:
# We keep confirming our assumptions with visualizations, starting with histograms.
#@ Histograms of numerical features:
# Age histogram:
# g = sns.FacetGrid(train_df, col='Survived')
# g.map(plt.hist, 'Age', bins=20)
# Observations:
# Infants (Age < 4) had a high survival rate.
# The oldest passenger (Age = 80) survived.
# A large number of 15-25 year-olds did not survive.
# Most passengers were 15-35 years old.
# Decisions:
# Include Age in the model.
# Complete the null values of Age.
# Band Age into groups.
#@ Histograms combining numerical and ordinal features:
# Pclass and Age:
# grid = sns.FacetGrid(train_df, col='Pclass', hue='Survived')
# grid = sns.FacetGrid(train_df, col='Survived', row='Pclass', size=2.2, aspect=1.6)
# grid.map(plt.hist, 'Age', alpha=.5, bins=20)
# grid.add_legend();
# Observations:
# Pclass=3 had the most passengers, but most of them did not survive.
# Most infant passengers in Pclass=1 and Pclass=2 survived.
# Most passengers in Pclass=1 survived.
# Decision:
# Include Pclass in the model.
#@ Correlating categorical features:
# Categorical plot:
# grid = sns.FacetGrid(train_df, col='Embarked')
# grid = sns.FacetGrid(train_df, row='Embarked', size=2.2, aspect=1.6)
# grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep')
# grid.add_legend()
# Observations:
# Female passengers had a much higher survival rate than males.
# The exception is Embarked=C, where males had the higher survival rate. This likely reflects a correlation between Pclass and Embarked, and in turn between Pclass and Survived, rather than a direct link between Embarked and Survived.
# Males had a higher survival rate in Pclass=3 than in Pclass=2 for ports C and Q.
# For Pclass=3, the survival rate varies with the port of embarkation.
# Decisions:
# Include Sex in the model.
# Complete Embarked and add it to the model.
#@ Correlating categorical and numerical features:
# Consider Embarked (categorical non-numeric), Sex (categorical non-numeric), Fare (numeric continuous), and Survived (categorical numeric) together:
# grid = sns.FacetGrid(train_df, col='Embarked', hue='Survived', palette={0: 'k', 1: 'w'})
# grid = sns.FacetGrid(train_df, row='Embarked', col='Survived', size=2.2, aspect=1.6)
# grid.map(sns.barplot, 'Sex', 'Fare', alpha=.5, ci=None)
# grid.add_legend()
# Observations:
# Passengers who paid higher fares were more likely to survive.
# The port of embarkation correlates with survival rate.
# Decision:
# Include a Fare band feature in the model.

# Wrangling the data:
#@ Dropping irrelevant features:
# Based on the assumptions confirmed above, drop the Cabin and Ticket features:
# print("Before", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)

# train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
# test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)
# combine = [train_df, test_df]

# print("After", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)

#@ Creating new features from existing ones:
# Before dropping Name and PassengerId, check whether Title correlates with Survival.
# Extract Title with a regular expression:
# for dataset in combine:
#     dataset['Title'] = dataset.Name.str.extract(' ([A-Za-z]+)\.', expand=False)
# pd.crosstab(train_df['Title'], train_df['Sex'])
# Map the extracted titles onto a few common categories:
# for dataset in combine:
#     dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col',\
#     'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
#     dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
#     dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
#     dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
# train_df[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()
# Convert Title to an ordinal feature:
# title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
# for dataset in combine:
#     dataset['Title'] = dataset['Title'].map(title_mapping)
#     dataset['Title'] = dataset['Title'].fillna(0)

# train_df.head()
# Drop the Name and PassengerId features:
# train_df = train_df.drop(['Name', 'PassengerId'], axis=1)
# test_df = test_df.drop(['Name'], axis=1)
# combine = [train_df, test_df]
# train_df.shape, test_df.shape
# When we plot Title, Age, and Survived together (see the sketch below), we observe:
# Observations:
# Most titles band the Age groups similarly.
# Survival varies slightly across the Title and Age bands.
# Certain titles mostly survived (Mme, Lady, Sir); others mostly did not (Don, Rev, Jonkheer).
# Decision:
# Include Title in the model.
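# A minimal sketch (my addition, not in the original notebook) of the Title/Survived
# plot described above, using the ordinal Title column created earlier:
# title_survival = train_df[['Title', 'Survived']].groupby('Title').mean()
# title_survival.plot(kind='bar', legend=False)
# plt.ylabel('Survival rate')
# plt.show()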

#@ Converting string features to numerical ones:
# Convert Sex:
# for dataset in combine:
#     dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
# train_df.head()
#@ Completing null values:
# There are three ways to complete a continuous numerical feature:
# 1. Simple: generate random numbers between the mean and the mean plus/minus the standard deviation.
# 2. More accurate: guess missing values from correlated features. Here we guess Age using the median for each combination of Pclass and Gender.
# 3. Combine 1 and 2: for each Pclass/Gender combination, generate random numbers between the median and the median plus/minus the standard deviation.
# Methods 1 and 3 inject random noise into the model, so repeated runs can give different results. We therefore use method 2:
# grid = sns.FacetGrid(train_df, col='Pclass', hue='Gender')
# grid = sns.FacetGrid(train_df, row='Pclass', col='Sex', size=2.2, aspect=1.6)
# grid.map(plt.hist, 'Age', alpha=.5, bins=20)
# grid.add_legend()
# Prepare an empty array to hold the guessed Age values:
# guess_ages = np.zeros((2,3))
# guess_ages
# Iterate over Sex (0, 1) and Pclass (1, 2, 3) to guess Age for each combination:
# for dataset in combine:
#     for i in range(0, 2):
#         for j in range(0, 3):
#             guess_df = dataset[(dataset['Sex'] == i) & \
#                                 (dataset['Pclass'] == j+1)]['Age'].dropna()
#             # age_mean = guess_df.mean()
#             # age_std = guess_df.std()
#             # age_guess = rnd.uniform(age_mean - age_std, age_mean + age_std)

#             age_guess = guess_df.median()

#             # Round the guessed age to the nearest 0.5
#             guess_ages[i,j] = int( age_guess/0.5 + 0.5 ) * 0.5
#     for i in range(0, 2):
#         for j in range(0, 3):
#             dataset.loc[ (dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1),\
#                     'Age'] = guess_ages[i,j]
#     dataset['Age'] = dataset['Age'].astype(int)
# train_df.head()
#  Create Age bands and check how they correlate with Survival:
# train_df['AgeBand'] = pd.cut(train_df['Age'], 5)
# train_df[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)
# Replace Age with ordinals based on these bands:
# for dataset in combine:
#     dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
#     dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
#     dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
#     dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
#     dataset.loc[ dataset['Age'] > 64, 'Age'] = 4
# train_df.head()
#  Now drop the AgeBand feature:
# train_df = train_df.drop(['AgeBand'], axis=1)
# combine = [train_df, test_df]
# train_df.head()

#@ Creating new features by combining existing ones:
# Create FamilySize from SibSp and Parch, which will let us drop both:
# for dataset in combine:
#     dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
# train_df[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)
# We can then create an IsAlone feature:
# for dataset in combine:
#     dataset['IsAlone'] = 0
#     dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1
# train_df[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean()
# Drop SibSp, Parch, and FamilySize in favor of IsAlone:
# train_df = train_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
# test_df = test_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
# combine = [train_df, test_df]
# train_df.head()
# We can also create an artificial Age*Pclass feature:
# for dataset in combine:
#     dataset['Age*Class'] = dataset.Age * dataset.Pclass
# train_df.loc[:, ['Age*Class', 'Age', 'Pclass']].head(10)

#@ Completing a categorical feature:
# Embarked has two missing values; fill them with the most common port:
# freq_port = train_df.Embarked.dropna().mode()[0]
# freq_port
# Fill:
# for dataset in combine:
#     dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)
# train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)
#@ Converting categorical features to numeric:
# for dataset in combine:
#     dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
# train_df.head()

#@ Quickly completing and converting a numeric feature:
# Fill the single missing Fare value in the test set with the median (no new feature, no elaborate guessing; just one value):
# test_df['Fare'].fillna(test_df['Fare'].dropna().median(), inplace=True)
# test_df.head()
# Create a FareBand feature:
# train_df['FareBand'] = pd.qcut(train_df['Fare'], 4)
# train_df[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)
# Convert Fare to ordinals based on FareBand:
# for dataset in combine:
#     dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
#     dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
#     dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
#     dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
#     dataset['Fare'] = dataset['Fare'].astype(int)
# train_df = train_df.drop(['FareBand'], axis=1)
# combine = [train_df, test_df]
# train_df.head(10)

# The wrangled data:
# test_df.head(10)

#  Model, predict, and solve:
# This is a supervised learning problem, framed as classification and regression. Candidate models include:
# Logistic Regression
# KNN (k-Nearest Neighbors)
# Support Vector Machines
# Naive Bayes classifier
# Decision Tree
# Random Forest
# Perceptron
# Artificial neural network (not trained below; a sketch follows the data split)
# RVM (Relevance Vector Machine)

# X_train = train_df.drop("Survived", axis=1)
# Y_train = train_df["Survived"]
# X_test  = test_df.drop("PassengerId", axis=1).copy()
# X_train.shape, Y_train.shape, X_test.shape
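# The original stops short of the artificial neural network listed above. A minimal
# sketch using scikit-learn's MLPClassifier (my addition; the layer sizes and
# iteration count are illustrative, not tuned):
# from sklearn.neural_network import MLPClassifier
# mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=1)
# mlp.fit(X_train, Y_train)
# acc_mlp = round(mlp.score(X_train, Y_train) * 100, 2)
# acc_mlp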

# Logistic regression models the relationship between a categorical dependent variable (the outcome) and one or more independent variables (the predictors) using the logistic function.
# # Logistic Regression
# logreg = LogisticRegression()
# logreg.fit(X_train, Y_train)
# Y_pred = logreg.predict(X_test)
# acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
# acc_log
# 80.36

# Logistic regression can also validate our assumptions: the sign of each feature's coefficient tells us whether it pushes the survival probability up or down.
# coeff_df = pd.DataFrame(train_df.columns.delete(0))
# coeff_df.columns = ['Feature']
# coeff_df["Correlation"] = pd.Series(logreg.coef_[0])
# coeff_df.sort_values(by='Correlation', ascending=False)

# Support Vector Machines (SVM):
# # Support Vector Machines
# svc = SVC()
# svc.fit(X_train, Y_train)
# Y_pred = svc.predict(X_test)
# acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
# acc_svc
# 83.84

# k-Nearest Neighbors (KNN):
# knn = KNeighborsClassifier(n_neighbors = 3)
# knn.fit(X_train, Y_train)
# Y_pred = knn.predict(X_test)
# acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
# acc_knn
# 84.74

# Naive Bayes classifier:
# # Gaussian Naive Bayes
# gaussian = GaussianNB()
# gaussian.fit(X_train, Y_train)
# Y_pred = gaussian.predict(X_test)
# acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
# acc_gaussian
# 72.28

# Perceptron:
# # Perceptron
# perceptron = Perceptron()
# perceptron.fit(X_train, Y_train)
# Y_pred = perceptron.predict(X_test)
# acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)
# acc_perceptron
# 78.0

# # Linear SVC
# linear_svc = LinearSVC()
# linear_svc.fit(X_train, Y_train)
# Y_pred = linear_svc.predict(X_test)
# acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
# acc_linear_svc
# 79.01

# # Stochastic Gradient Descent
# sgd = SGDClassifier()
# sgd.fit(X_train, Y_train)
# Y_pred = sgd.predict(X_test)
# acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
# acc_sgd
# 77.33

# Decision Tree:
# # Decision Tree
# decision_tree = DecisionTreeClassifier()
# decision_tree.fit(X_train, Y_train)
# Y_pred = decision_tree.predict(X_test)
# acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
# acc_decision_tree
# 86.76

# Random Forest:
# # Random Forest
# random_forest = RandomForestClassifier(n_estimators=100)
# random_forest.fit(X_train, Y_train)
# Y_pred = random_forest.predict(X_test)
# random_forest.score(X_train, Y_train)
# acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
# acc_random_forest

# Model evaluation (all scores below are training-set accuracy):
# models = pd.DataFrame({
#     'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
#             'Random Forest', 'Naive Bayes', 'Perceptron',
#             'Stochastic Gradient Decent', 'Linear SVC',
#             'Decision Tree'],
#     'Score': [acc_svc, acc_knn, acc_log,
#             acc_random_forest, acc_gaussian, acc_perceptron,
#             acc_sgd, acc_linear_svc, acc_decision_tree]})
# models.sort_values(by='Score', ascending=False)
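# Training-set accuracy rewards overfitting (note the decision tree and random
# forest on top). As item 5 in the list at the top suggests, k-fold cross-validation
# gives a fairer comparison; a minimal sketch (my addition, not in the original):
# from sklearn.model_selection import cross_val_score
# for name, model in [('Logistic Regression', LogisticRegression()),
#                     ('Random Forest', RandomForestClassifier(n_estimators=100))]:
#     scores = cross_val_score(model, X_train, Y_train, cv=10, scoring='accuracy')
#     print(name, round(scores.mean() * 100, 2), '+/-', round(scores.std() * 100, 2))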

# Saving the submission:
# Note: Y_pred here holds the predictions of the last model fitted above, the random forest.
# submission = pd.DataFrame({
#         "PassengerId": test_df["PassengerId"],
#         "Survived": Y_pred
#     })
# # submission.to_csv('../output/submission.csv', index=False)