Titanic, multi-model version: a detailed walkthrough of the data analysis, hands-on practice for machine learning beginners
2017-03-28 17:09
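For the figures, see the original English notebook.
Other analyses worth reading:
1. Uses the XGBoost algorithm without any feature analysis, but lets you grasp the whole analysis-and-prediction workflow quickly, which makes the more complex notebooks easier to follow afterwards.
2. The feature analysis is very detailed and easy to understand.
3. Uses a heatmap to analyze the correlations among features, and a multi-layer stacking model.
4. If you want to understand what a pairplot shows, this one explains it: which features separate the classes more easily, and which features are strongly correlated with each other.
5. Uses cross-validation to check model accuracy.
6. A very detailed write-up.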
# Workflow stages
# 1. Problem definition
# 2. Acquire the training and test sets
# 3. Wrangle the data
# 4. Analyze the data
# 5. Model to solve the problem
# 6. Visualize
# 7. Submit the results

# 1. Problem definition
# Official site

# Imports
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization libraries
import seaborn as sns  # seaborn is built on matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

# 2. Read the data
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
combine = [train_df, test_df]

# 3. Wrangle the data, 4. Analyze the data

# Feature types:
# print(train_df.columns.values)
# train_df.info()
# print('_'*40)
# test_df.info()
# available features:
# ['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch' 'Ticket' 'Fare' 'Cabin' 'Embarked']
# Seven features are int or float (only six in the test set); five are strings.
# train_df.head()
# train_df.tail()
# categorical features: help split the samples into groups and pick suitable plots. Are they nominal, ordinal, ratio, or interval based?
# Categorical: Survived, Sex, and Embarked. Ordinal: Pclass.
# numerical features: help pick suitable plots. Are they discrete, continuous, or time-series based?
# Continuous: Age, Fare. Discrete: SibSp, Parch.
# mixed features: mixed data types
# Ticket is numeric or alphanumeric; Cabin is alphanumeric
# features that may contain errors:
# Name may contain errors, because names are written in several ways: titles, round brackets, quotes
# features with blank values:
# training set: Cabin > Age > Embarked contain null values (most to fewest)
# test set: Cabin > Age are incomplete

# Distribution of the data:
# numerical features:
# train_df.describe()
# 891 samples in total, about 40% of the passengers on the Titanic
# Survived is a categorical feature with values 0 or 1
# about 38% of the sampled passengers survived, against an actual survival rate of 32%
# more than 75% of passengers travelled without parents or children
# nearly 30% of passengers had siblings or a spouse aboard
# fewer than 1% of passengers paid a fare as high as $512
# fewer than 1% of passengers were elderly, aged 65-80
# categorical features:
# train_df.describe(include=['O'])
# Name values are unique
# Sex is 65% male
# Cabin has duplicate values: several passengers shared a cabin
# Embarked takes three values; most passengers boarded at port S
# Ticket has a 22% duplicate rate

# Assumptions based on the data analysis
# Before committing to an approach, find the features that correlate with Survival
# Completing:
# Age is definitely related to survival, so its missing values should be completed
# Embarked is related to survival, or to another important feature
# Correcting:
# Ticket may be irrelevant, given its 22% duplicate rate
# Cabin may be irrelevant, because it has too many null values in both the training and test sets
# PassengerId is clearly irrelevant
# Name is written in too many non-standard ways and may not affect the outcome directly
# Creating:
# a new feature Family, based on the number of parents, children, siblings and spouses aboard
# a new feature based on the title extracted from Name
# a new feature that turns the continuous Age into an ordinal categorical feature
# a new Fare range feature
# Classifying:
# women were probably more likely to survive
# children (Age < ?) were probably more likely to survive
# first-class passengers were probably more likely to survive

# Analysis by pivoting features
# To check these observations and assumptions quickly, pivot individual features against Survived
# Pclass: Pclass=1 has a mean survival rate above 0.5
# train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)
# Sex: Sex=female has a survival rate of about 74%
# train_df[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)
# SibSp and Parch: little correlation and no clear pattern; it may be better to derive a feature or set of features from them
# train_df[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)
# train_df[["Parch", "Survived"]].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)

# Analysis by visualizing the data:
# Keep confirming the assumptions with plots, starting with histograms.

#@ Histograms of numerical features:
# Age histogram:
# g = sns.FacetGrid(train_df, col='Survived')
# g.map(plt.hist, 'Age', bins=20)
# Observations:
# children under 4 had a high survival rate
# the oldest passenger (Age = 80) survived
# a large number of 15-25 year olds did not survive
# most passengers were between 15 and 35 years old
# Decisions:
# include Age in the model
# fill the null values in Age
# band Age into groups

#@ Combined histograms (numerical and ordinal features):
# Pclass and Age:
# # alternative layout:
# # grid = sns.FacetGrid(train_df, col='Pclass', hue='Survived')
# (seaborn >= 0.9 names this argument `height`; older versions call it `size`)
# grid = sns.FacetGrid(train_df, col='Survived', row='Pclass', height=2.2, aspect=1.6)
# grid.map(plt.hist, 'Age', alpha=.5, bins=20)
# grid.add_legend();
# Observations:
# Pclass=3 had the most passengers, but most of them died
# most children in Pclass=1 and Pclass=2 survived
# most Pclass=1 passengers survived
# Decisions:
# include Pclass in the model

#@ Correlating categorical features:
# Point plots per port of embarkation
# # alternative layout:
# # grid = sns.FacetGrid(train_df, col='Embarked')
# grid = sns.FacetGrid(train_df, row='Embarked', height=2.2, aspect=1.6)
# grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep')
# grid.add_legend()
# Observations:
# women were more likely to survive
# the exception is Embarked=C, where men survived more often; Pclass and Embarked appear to be linked, and Pclass is inversely related to Survived
# for ports C and Q, men had a better survival rate in Pclass=3 than in Pclass=2
# for Pclass=3, survival rates differ by port
# Decisions:
# include Sex in the model
# complete Embarked and add it to the model

#@ Correlating categorical and numerical features:
# Consider Embarked (categorical non-numeric), Sex (categorical non-numeric), Fare (numeric continuous) together with Survived (categorical numeric)
# # alternative layout:
# # grid = sns.FacetGrid(train_df, col='Embarked', hue='Survived', palette={0: 'k', 1: 'w'})
# grid = sns.FacetGrid(train_df, row='Embarked', col='Survived', height=2.2, aspect=1.6)
# grid.map(sns.barplot, 'Sex', 'Fare', alpha=.5, ci=None)
# grid.add_legend()
# Observations:
# passengers who paid higher fares were more likely to survive
# the port of embarkation correlates with the survival rate
# Decisions:
# consider the Fare feature

# Wrangling the data:

#@ Dropping irrelevant features:
# Following the assumptions and checks above, drop Cabin and Ticket:
# print("Before", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)
# train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
# test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)
# combine = [train_df, test_df]
# "After", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape

#@ Creating a new feature from existing ones:
# Before dropping Name and PassengerId, check whether Title relates to Survival
# Extract Title with a regular expression:
# for dataset in combine:
#     dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
# pd.crosstab(train_df['Title'], train_df['Sex'])
# Merge the extracted titles into a few common ones:
# for dataset in combine:
#     dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col',
#         'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
#     dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
#     dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
#     dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
# train_df[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()
# Convert Title to ordinal values:
# title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
# for dataset in combine:
#     dataset['Title'] = dataset['Title'].map(title_mapping)
#     dataset['Title'] = dataset['Title'].fillna(0)
# train_df.head()
# Drop Name and PassengerId (PassengerId is kept in the test set for the submission):
# train_df = train_df.drop(['Name', 'PassengerId'], axis=1)
# test_df = test_df.drop(['Name'], axis=1)
# combine = [train_df, test_df]
# train_df.shape, test_df.shape
# Plotting Title, Age and Survived shows:
# Observations:
# most titles correspond closely to age groups
# survival varies slightly across the Title and Age bands
# some titles mostly survived (Mme, Lady, Sir), others did not (Don, Rev, Jonkheer)
# Decisions:
# add Title to the model

#@ Converting string features to numeric ones:
# Convert Sex:
# for dataset in combine:
#     dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
# train_df.head()

#@ Filling null values:
# There are three ways to complete a continuous numerical feature:
# 1. Simple: draw a random number within one standard deviation of the mean
# 2. More accurate: guess the missing value from correlated features; here, use the median Age for each combination of Pclass and Gender
# 3. Combine 1 and 2: for each combination of Pclass and Gender, draw a random number within one standard deviation of the median
# Methods 1 and 3 introduce random noise into the model, so repeated runs may give different results. Method 2 is used here:
# # alternative layout:
# # grid = sns.FacetGrid(train_df, col='Pclass', hue='Sex')
# grid = sns.FacetGrid(train_df, row='Pclass', col='Sex', height=2.2, aspect=1.6)
# grid.map(plt.hist, 'Age', alpha=.5, bins=20)
# grid.add_legend()
# Create an empty array to hold the guessed Age values:
# guess_ages = np.zeros((2,3))
# guess_ages
# Loop over Sex and Pclass to compute the guessed Age values:
# for dataset in combine:
#     for i in range(0, 2):
#         for j in range(0, 3):
#             guess_df = dataset[(dataset['Sex'] == i) &
#                                (dataset['Pclass'] == j+1)]['Age'].dropna()
#             # age_mean = guess_df.mean()
#             # age_std = guess_df.std()
#             # age_guess = rnd.uniform(age_mean - age_std, age_mean + age_std)
#             age_guess = guess_df.median()
#             # Round the guessed age to the nearest 0.5
#             guess_ages[i,j] = int( age_guess/0.5 + 0.5 ) * 0.5
#     for i in range(0, 2):
#         for j in range(0, 3):
#             dataset.loc[ (dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1),
#                          'Age'] = guess_ages[i,j]
#     dataset['Age'] = dataset['Age'].astype(int)
# train_df.head()
# Create Age bands and check how they relate to Survival:
# train_df['AgeBand'] = pd.cut(train_df['Age'], 5)
# train_df[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)
# Replace Age with ordinal values based on these bands:
# for dataset in combine:
#     dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
#     dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
#     dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
#     dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
#     dataset.loc[ dataset['Age'] > 64, 'Age'] = 4
# train_df.head()
# AgeBand can now be removed:
# train_df = train_df.drop(['AgeBand'], axis=1)
# combine = [train_df, test_df]
# train_df.head()

#@ Creating new features by combining existing ones:
# Create FamilySize from Parch and SibSp, which lets us drop Parch and SibSp
# for dataset in combine:
#     dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
# train_df[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)
# Another new feature, IsAlone:
# for dataset in combine:
#     dataset['IsAlone'] = 0
#     dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1
# train_df[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean()
# Drop SibSp, Parch and FamilySize:
# train_df = train_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
# test_df = test_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
# combine = [train_df, test_df]
# train_df.head()
# Likewise, an Age*Class feature can be created:
# for dataset in combine:
#     dataset['Age*Class'] = dataset.Age * dataset.Pclass
# train_df.loc[:, ['Age*Class', 'Age', 'Pclass']].head(10)

#@ Completing a categorical feature:
# Embarked has two missing values; fill them with the most common port:
# freq_port = train_df.Embarked.dropna().mode()[0]
# freq_port
# Fill:
# for dataset in combine:
#     dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)
# train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)

#@ Converting a categorical feature to numeric:
# for dataset in combine:
#     dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
# train_df.head()

#@ Quickly completing and converting a numeric feature:
# Fill the missing Fare value in the test set with the median (a single fill value; no new feature and no further analysis to guess it):
# test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].dropna().median())
# test_df.head()
# Create FareBand:
# train_df['FareBand'] = pd.qcut(train_df['Fare'], 4)
# train_df[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)
# Convert Fare to ordinal values based on FareBand:
# for dataset in combine:
#     dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
#     dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
#     dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
#     dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
#     dataset['Fare'] = dataset['Fare'].astype(int)
# train_df = train_df.drop(['FareBand'], axis=1)
# combine = [train_df, test_df]
# train_df.head(10)
# The fully processed data:
# test_df.head(10)

# Model, predict and solve:
# This is a supervised learning problem (classification and regression). Candidate models:
# Logistic Regression
# KNN or k-Nearest Neighbors
# Support Vector Machines
# Naive Bayes classifier
# Decision Tree
# Random Forest
# Perceptron
# Artificial neural network
# RVM or Relevance Vector Machine
# X_train = train_df.drop("Survived", axis=1)
# Y_train = train_df["Survived"]
# X_test = test_df.drop("PassengerId", axis=1).copy()
# X_train.shape, Y_train.shape, X_test.shape

# Logistic regression: uses the logistic function to model the relationship between a categorical dependent variable (the outcome) and one or more independent variables (the features).
# # Logistic Regression
# logreg = LogisticRegression()
# logreg.fit(X_train, Y_train)
# Y_pred = logreg.predict(X_test)
# acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
# acc_log
# Output: 80.359999999999999
# Logistic regression can also check our assumptions: the sign of each coefficient shows whether a feature pushes the prediction towards or away from survival
# coeff_df = pd.DataFrame(train_df.columns.delete(0))
# coeff_df.columns = ['Feature']
# coeff_df["Correlation"] = pd.Series(logreg.coef_[0])
# coeff_df.sort_values(by='Correlation', ascending=False)

# Support vector machine (SVM):
# # Support Vector Machines
# svc = SVC()
# svc.fit(X_train, Y_train)
# Y_pred = svc.predict(X_test)
# acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
# acc_svc
# Output: 83.840000000000003

# k-nearest neighbors (KNN):
# knn = KNeighborsClassifier(n_neighbors = 3)
# knn.fit(X_train, Y_train)
# Y_pred = knn.predict(X_test)
# acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
# acc_knn
# Output: 84.739999999999995

# Naive Bayes classifier:
# # Gaussian Naive Bayes
# gaussian = GaussianNB()
# gaussian.fit(X_train, Y_train)
# Y_pred = gaussian.predict(X_test)
# acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
# acc_gaussian
# Output: 72.280000000000001

# Perceptron:
# # Perceptron
# perceptron = Perceptron()
# perceptron.fit(X_train, Y_train)
# Y_pred = perceptron.predict(X_test)
# acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)
# acc_perceptron
# Output: 78.0

# # Linear SVC
# linear_svc = LinearSVC()
# linear_svc.fit(X_train, Y_train)
# Y_pred = linear_svc.predict(X_test)
# acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
# acc_linear_svc
# Output: 79.010000000000005

# # Stochastic Gradient Descent
# sgd = SGDClassifier()
# sgd.fit(X_train, Y_train)
# Y_pred = sgd.predict(X_test)
# acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
# acc_sgd
# Output: 77.329999999999998

# Decision tree:
# # Decision Tree
# decision_tree = DecisionTreeClassifier()
# decision_tree.fit(X_train, Y_train)
# Y_pred = decision_tree.predict(X_test)
# acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
# acc_decision_tree
# Output: 86.760000000000005

# Random forest:
# # Random Forest
# random_forest = RandomForestClassifier(n_estimators=100)
# random_forest.fit(X_train, Y_train)
# Y_pred = random_forest.predict(X_test)
# random_forest.score(X_train, Y_train)
# acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
# acc_random_forest

# Model evaluation:
# models = pd.DataFrame({
#     'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
#               'Random Forest', 'Naive Bayes', 'Perceptron',
#               'Stochastic Gradient Descent', 'Linear SVC',
#               'Decision Tree'],
#     'Score': [acc_svc, acc_knn, acc_log,
#               acc_random_forest, acc_gaussian, acc_perceptron,
#               acc_sgd, acc_linear_svc, acc_decision_tree]})
# models.sort_values(by='Score', ascending=False)

# Save the results:
# submission = pd.DataFrame({
#     "PassengerId": test_df["PassengerId"],
#     "Survived": Y_pred
# })
# # submission.to_csv('../output/submission.csv', index=False)
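The scores in the model evaluation above come from `model.score(X_train, Y_train)`, that is, accuracy on the same data the models were fitted to, which tends to flatter flexible models such as decision trees. Below is a minimal sketch of a cross-validated comparison, assuming scikit-learn's `cross_val_score`. The `X_train` / `Y_train` here are synthetic stand-ins from `make_classification` (not the Titanic data) so that the block runs on its own, and `candidates`, `rows` and `cv_scores` are names introduced only for this illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features (assumption: 8 numeric columns).
# With the real data, use the X_train / Y_train built in the notebook instead.
X_train, Y_train = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

rows = []
for name, model in candidates.items():
    # 5-fold cross-validation: every fold is scored on rows the model did not see.
    fold_scores = cross_val_score(model, X_train, Y_train, cv=5, scoring='accuracy')
    # Training-set accuracy, computed the same way as in the notebook.
    train_acc = model.fit(X_train, Y_train).score(X_train, Y_train)
    rows.append({'Model': name,
                 'TrainAcc': round(train_acc * 100, 2),
                 'CVAcc': round(fold_scores.mean() * 100, 2)})

cv_scores = pd.DataFrame(rows).sort_values(by='CVAcc', ascending=False)
print(cv_scores)
```

Comparing the `TrainAcc` and `CVAcc` columns gives a rough sense of how much each model overfits; ranking by `CVAcc` is usually a safer basis for choosing which predictions to submit than ranking by training accuracy.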