# Titanic Data Analysis Report (Python)

2016-05-07 10:55

## 1.1 Data Loading and Descriptive Statistics

Load the required Python libraries and the data.

import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.graphics.api as smg
import patsy
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
from scipy import stats
import seaborn as sns


train = pd.read_csv("D:/学习/数据挖掘与机器学习/Titanic/train.csv")


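The dataset has 12 fields: PassengerId (passenger ID), Survived (whether the passenger survived), Pclass (passenger class), Name (passenger name), Sex (passenger sex), Age (passenger age), SibSp (number of siblings and spouses), Parch (number of parents and children), Ticket (ticket number), Fare (ticket fare), Cabin (cabin number), and Embarked (port of embarkation).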

The data cover 891 passengers. Age is missing for 177 of them (only 714 non-null values), the port of embarkation for 2, and the cabin for 687.
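
Missing-value counts like the ones quoted above are typically read off `isna().sum()`. The sketch below only illustrates the call on a tiny made-up frame (`demo` is hypothetical and is not the Kaggle data); applied to `train`, the same expression would give the per-column counts.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame standing in for train.csv.
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
})

# Number of missing entries in each column.
missing = demo.isna().sum()
print(missing)
```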

train.head()


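| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | NaN | S |

train.info()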


train.describe()


| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |

## 1.2 Univariate Exploration

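### 1.2.1 Age and Fare

The histograms below show the distributions of passenger age and fare in the training set. Most passengers were between 20 and 40 years old, and the age distribution is roughly bell-shaped. Most fares were low, between 0 and 100, while a small number of passengers paid considerably more.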

fig,ax = plt.subplots(nrows=1,ncols=2,figsize=(15,5))
train["Age"].hist(ax=ax[0])
ax[0].set_title("Hist plot of Age")
train["Fare"].hist(ax=ax[1])
ax[1].set_title("Hist plot of Fare")


### 1.2.2 Survival

The bar chart below shows how many passengers did and did not survive. Most passengers (about 62%) did not survive.

fig,ax = plt.subplots(figsize=(7,5))
train["Survived"].value_counts().plot(kind="bar")
ax.set_xticklabels(("Not Survived","Survived"),  rotation= "horizontal" )
ax.set_title("Bar plot of Survived ")
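
The share of survivors can also be stated as a number rather than read off the bar chart. This is a minimal sketch that rebuilds the `Survived` column from the totals in this report's crosstabs (549 not survived, 342 survived) instead of loading the CSV; on the real data, `train["Survived"].value_counts(normalize=True)` is the equivalent call.

```python
import pandas as pd

# Totals taken from the crosstabs in this report: 549 zeros, 342 ones.
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")

# normalize=True turns the counts into proportions.
shares = survived.value_counts(normalize=True)
print(shares.round(4))
```

The proportion for class 1 agrees with the mean of `Survived` (0.383838) in the `describe()` output.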


### 1.2.3 Sex

The bar chart below shows the distribution of passenger sex. Most passengers were male.

fig,ax = plt.subplots(figsize=(7,5))
train["Sex"].value_counts().plot(kind="bar")
ax.set_xticklabels(("male","female"),rotation= "horizontal"  )
ax.set_title("Bar plot of Sex ")


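### 1.2.4 Passenger Class

The bar chart below shows the distribution of Pclass. Most passengers travelled in third class, while first and second class had about 200 passengers each.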

fig,ax = plt.subplots(figsize=(7,5))
train["Pclass"].value_counts().plot(kind="bar")
ax.set_xticklabels(("Class3","Class1","Class2"),rotation= "horizontal"  )
ax.set_title("Bar plot of Pclass ")


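### 1.2.5 Cabin

The cabin data are processed as follows. Missing values are set to Unknown. The first letter of each cabin number may identify the cabin section, so that character is extracted, written back to Cabin, and treated as the cabin code.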

train["Cabin"] = train["Cabin"].fillna("Unknown")
# Keep only the first character of each value; "Unknown" becomes "U".
# Assigning the whole column avoids the row-by-row chained assignment
# that raises SettingWithCopyWarning.
train["Cabin"] = train["Cabin"].str[0]
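
To make the effect of this step concrete, here is a small sketch on a hypothetical Series of raw cabin values (the values are illustrative; `cabin` and `deck` are names chosen for this example only).

```python
import numpy as np
import pandas as pd

# Hypothetical sample of raw Cabin values, including missing ones.
cabin = pd.Series(["C85", np.nan, "C123", "E46", np.nan])

# Fill the gaps, then keep the first character of every string.
deck = cabin.fillna("Unknown").str[0]
print(deck.tolist())  # ['C', 'U', 'C', 'E', 'U']
```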

The bar chart below shows the distribution of cabin codes. For most passengers the cabin is unknown.

fig,ax = plt.subplots(figsize=(7,5))
train.Cabin.value_counts().plot(kind="bar")
ax.set_title("Bar plot of Cabin ")


### 1.2.6 Siblings and Spouses

The bar chart below shows how many siblings and spouses each passenger had aboard. Most passengers had none, and about 200 had exactly one.

fig,ax = plt.subplots(figsize=(7,5))
train["SibSp"].value_counts().plot(kind="bar")
ax.set_title("Bar plot of SibSp ")


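### 1.2.7 Parents and Children

The bar chart below shows how many parents and children each passenger had aboard. Most passengers had none, just over 100 had one parent or child aboard, and about 80 had two.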

fig,ax = plt.subplots(figsize=(7,5))
train["Parch"].value_counts().plot(kind="bar")
ax.set_title("Bar plot of Parch ")


### 1.2.8 Port of Embarkation

The bar chart below shows the distribution of embarkation ports. Most passengers boarded at Southampton, fewer than 200 at Cherbourg, and fewer than 100 at Queenstown.

fig,ax = plt.subplots(figsize=(7,5))
train["Embarked"].value_counts().plot(kind="bar")
ax.set_xticklabels(("Southampton","Cherbourg","Queenstown"),rotation= "horizontal"  )
ax.set_title("Bar plot of Embarked ")


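## 1.3 Multivariate Exploration

### 1.3.1 Sex and Survival

The cross-tabulation and bar chart below relate sex to survival. Women were far more likely to survive, while the survival rate for men was very low.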

pd.crosstab(train["Sex"],train["Survived"])


| Sex | Survived = 0 | Survived = 1 |
|---|---|---|
| female | 81 | 233 |
| male | 468 | 109 |

pd.crosstab(train["Sex"],train["Survived"]).plot(kind="bar")
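
Raw counts make the two groups hard to compare because there are many more men than women. A row-normalised version of the Sex table gives survival rates directly. The sketch below re-enters the counts by hand rather than loading the data; with the real frame, `pd.crosstab(train["Sex"], train["Survived"], normalize="index")` should give the same rates.

```python
import pandas as pd

# Counts copied from the Sex x Survived crosstab.
counts = pd.DataFrame(
    {0: [81, 468], 1: [233, 109]},
    index=pd.Index(["female", "male"], name="Sex"),
)

# Divide each row by its total to get the share in each outcome.
rates = counts.div(counts.sum(axis=1), axis=0)
print(rates.round(3))
```

About 74% of women survived against about 19% of men.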


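### 1.3.2 Passenger Class and Survival

The cross-tabulation and bar chart below relate passenger class to survival. First-class passengers were the most likely to survive (about 63%), second-class passengers survived a little under half the time (about 47%), and third-class passengers had a low survival rate (about 24%).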

pd.crosstab(train["Pclass"],train["Survived"])


| Pclass | Survived = 0 | Survived = 1 |
|---|---|---|
| 1 | 80 | 136 |
| 2 | 97 | 87 |
| 3 | 372 | 119 |

pd.crosstab(train["Pclass"],train["Survived"]).plot(kind="bar")
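
One possible way to check that the differences between classes are larger than chance would produce is a chi-square test of independence on the Pclass table. The report itself does not run this test; the sketch below is an addition that uses `scipy.stats.chi2_contingency` on the counts entered by hand.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: Pclass 1, 2, 3; columns: Survived 0, 1 (counts from the crosstab).
observed = np.array([[80, 136],
                     [97, 87],
                     [372, 119]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2={chi2:.1f}, dof={dof}, p={p_value:.2e}")
```

A very small p-value here would support the reading that survival and passenger class are not independent.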


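### 1.3.3 Siblings/Spouses and Survival

The cross-tabulation and bar chart below relate the number of siblings and spouses aboard to survival. Passengers with one or two siblings or spouses aboard had a higher survival rate than the rest.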

pd.crosstab(train["SibSp"],train["Survived"])


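| SibSp | Survived = 0 | Survived = 1 |
|---|---|---|
| 0 | 398 | 210 |
| 1 | 97 | 112 |
| 2 | 15 | 13 |
| 3 | 12 | 4 |
| 4 | 15 | 3 |
| 5 | 5 | 0 |
| 8 | 7 | 0 |

pd.crosstab(train["SibSp"],train["Survived"]).plot(kind="bar")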


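### 1.3.4 Parents/Children and Survival

The cross-tabulation and bar chart below relate the number of parents and children aboard to survival. Passengers with one or two parents or children aboard had a higher survival rate than the rest.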

pd.crosstab(train["Parch"],train["Survived"])


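| Parch | Survived = 0 | Survived = 1 |
|---|---|---|
| 0 | 445 | 233 |
| 1 | 53 | 65 |
| 2 | 40 | 40 |
| 3 | 2 | 3 |
| 4 | 4 | 0 |
| 5 | 4 | 1 |
| 6 | 1 | 0 |

pd.crosstab(train["Parch"],train["Survived"]).plot(kind="bar")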


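### 1.3.5 Port of Embarkation and Survival

The cross-tabulation and bar chart below relate the port of embarkation to survival. Passengers who boarded at Cherbourg had the highest survival rate (about 55%).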

pd.crosstab(train["Embarked"],train["Survived"])


| Embarked | Survived = 0 | Survived = 1 |
|---|---|---|
| C | 75 | 93 |
| Q | 47 | 30 |
| S | 427 | 217 |

pd.crosstab(train["Embarked"],train["Survived"]).plot(kind="bar")


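### 1.3.6 Cabin and Survival

The cross-tabulation and bar chart below relate the cabin code to survival. Passengers with a recorded cabin had a higher survival rate than those whose cabin is missing.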

pd.crosstab(train["Cabin"],train["Survived"])


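| Cabin | Survived = 0 | Survived = 1 |
|---|---|---|
| A | 8 | 7 |
| B | 12 | 35 |
| C | 24 | 35 |
| D | 8 | 25 |
| E | 8 | 24 |
| F | 5 | 8 |
| G | 2 | 2 |
| T | 1 | 0 |
| U | 481 | 206 |

pd.crosstab(train["Cabin"],train["Survived"]).plot(kind="bar")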
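
The known-versus-unknown comparison can be made explicit by collapsing all lettered cabin codes into one group. This sketch works from the crosstab counts entered by hand; the group labels `known` and `unknown` are names chosen for this example.

```python
import numpy as np
import pandas as pd

# Counts copied from the Cabin x Survived crosstab.
counts = pd.DataFrame(
    {0: [8, 12, 24, 8, 8, 5, 2, 1, 481],
     1: [7, 35, 35, 25, 24, 8, 2, 0, 206]},
    index=list("ABCDEFGTU"),
)

# Every code except U (unknown) goes into the "known" group.
group = np.where(counts.index == "U", "unknown", "known")
merged = counts.groupby(group).sum()

# Survival rate per group: survivors divided by the group total.
rates = merged[1] / merged.sum(axis=1)
print(rates.round(3))
```

Roughly two thirds of passengers with a recorded cabin survived, against about 30% of those without one, which is the pattern the two-level Cabin variable in section 1.4.2 is meant to capture.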


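### 1.3.7 Age and Survival

The box plot below compares the ages of survivors and non-survivors. Judging from the box plot, there is no clear relationship between the two.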

fig,ay = plt.subplots()
Age1 = train.Age[train.Survived == 1].dropna()
Age0 = train.Age[train.Survived == 0].dropna()
plt.boxplot((Age1,Age0),labels=('Survived','Not Survived'))
ay.set_ylim([-5,70])
ay.set_title("Boxplot of Age")


### 1.3.8 Fare and Survival

The box plot below compares the fares of survivors and non-survivors. Overall, survivors paid higher fares.

fig,ay = plt.subplots()
Fare1 = train.Fare[train.Survived == 1]
Fare0 = train.Fare[train.Survived == 0]
plt.boxplot((Fare1,Fare0),labels=('Survived','Not Survived'))
ay.set_ylim([-10,150])
ay.set_title("Boxplot of Fare")


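### 1.3.9 Fare and Passenger Class

The box plot below compares fares across passenger classes. The higher the class, the higher the fare, so the two variables are clearly strongly correlated.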

fig,ay = plt.subplots()
Farec1 = train.Fare[train.Pclass == 1]
Farec2 = train.Fare[train.Pclass == 2]
Farec3 = train.Fare[train.Pclass == 3]
plt.boxplot((Farec1,Farec2,Farec3),labels=("Pclass1","Pclass2","Pclass3"))
ay.set_ylim([-10,180])
ay.set_title("Boxplot of Fare and Pclass")


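## 1.4 Data Processing

### 1.4.1 Missing Values

Missing ages are filled with the mean age, and missing embarkation ports are filled with the most frequent port (the mode).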

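# Fill missing ages with the mean age, rounded to 29.7.
train["Age"] = train["Age"].fillna(round(train["Age"].mean(), 1))
# Fill missing ports with the mode, "S" (Southampton).
train["Embarked"] = train["Embarked"].fillna("S")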
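
The same mean and mode imputation can be sketched without hard-coding the fill values. The frame below is hypothetical and only shows the pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame with gaps in Age and Embarked.
demo = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Embarked": ["S", None, "S"],
})

# Numeric column: fill with the column mean.
demo["Age"] = demo["Age"].fillna(demo["Age"].mean())
# Categorical column: fill with the most frequent value (the mode).
demo["Embarked"] = demo["Embarked"].fillna(demo["Embarked"].mode().iloc[0])
print(demo)
```

Computing the fill values from the data keeps the step correct if the input file changes, at the cost of a less explicit notebook cell.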


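### 1.4.2 Binning

Based on the analysis above and the relationships between the variables, age is cut into seven bands: 0-5, 5-15, 15-20, 20-35, 35-50, 50-60, and 60-100 years. Parch is grouped into three levels: 0, 1 or 2, and more than 2. SibSp is grouped into the same three levels. Cabin is reduced to two levels: missing and not missing.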

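# Assign with [] so that age becomes a real column (train.age = ... only sets an attribute).
train["age"] = pd.cut(train["Age"], [0, 5, 15, 20, 35, 50, 60, 100])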
pd.crosstab(train.age,train.Survived).plot(kind="bar")
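
It is worth being clear about which band a boundary age falls into. By default `pd.cut` builds right-closed intervals, so an age of exactly 5 belongs to (0, 5] and not to (5, 15]. The ages below are hypothetical; the bin edges are the ones used in this report.

```python
import pandas as pd

bins = [0, 5, 15, 20, 35, 50, 60, 100]      # same edges as in the report
ages = pd.Series([0.42, 5, 5.5, 29.7, 80])  # hypothetical ages

# Right-closed intervals: 5 falls in (0, 5], 5.5 falls in (5, 15].
bands = pd.cut(ages, bins)
print(bands.astype(str).tolist())

# Integer position of each age's band, counted from 0.
codes = bands.cat.codes.tolist()
print(codes)  # [0, 0, 1, 3, 6]
```

An age of exactly 0 would fall outside the first interval and become NaN, which is not an issue here because the minimum age in the data is 0.42.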


# Use .loc for in-place recoding; chained indexing such as train.Parch[mask] = 1
# triggers SettingWithCopyWarning and may write to a copy.
train.loc[(train.Parch > 0) & (train.Parch <= 2), "Parch"] = 1
train.loc[train.Parch > 2, "Parch"] = 2
train.loc[(train.SibSp > 0) & (train.SibSp <= 2), "SibSp"] = 1
train.loc[train.SibSp > 2, "SibSp"] = 2
train.loc[train.Cabin != "U", "Cabin"] = "K"

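
The three-level grouping of Parch and SibSp could also be written as a single `pd.cut` call, which avoids depending on the order of successive assignments. This is an alternative sketch, not the code used in the report, and the input values are hypothetical.

```python
import numpy as np
import pandas as pd

parch = pd.Series([0, 1, 2, 3, 6])  # hypothetical raw counts

# (-1, 0] -> level 0, (0, 2] -> level 1, (2, inf) -> level 2
grouped = pd.cut(parch, bins=[-1, 0, 2, np.inf], labels=[0, 1, 2]).astype(int)
print(grouped.tolist())  # [0, 1, 1, 2, 2]
```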
### 1.4.3 Dummy Variables

Create dummy variables for Pclass, Sex, Embarked, Parch, SibSp, Cabin, and the age bands.

dummy_Pclass = pd.get_dummies(train.Pclass, prefix='Pclass')
dummy_Sex = pd.get_dummies(train.Sex, prefix='Sex')
dummy_Embarked = pd.get_dummies(train.Embarked, prefix='Embarked')
dummy_Parch = pd.get_dummies(train.Parch, prefix='Parch')
dummy_SibSp = pd.get_dummies(train.SibSp, prefix='SibSp')
dummy_Age = pd.get_dummies(train.age, prefix='Age')
dummy_Cabin = pd.get_dummies(train.Cabin, prefix='Cabin')
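
Each call produces one indicator column per level, named with the given prefix. In section 1.5 the column slices start at the second level (for example at `"Sex_male"`), which drops the first level so that it acts as the reference category. The sketch below shows this on hypothetical values and compares it with `drop_first=True`, which should have the same effect.

```python
import pandas as pd

sex = pd.Series(["male", "female", "male"])  # hypothetical values

dummies = pd.get_dummies(sex, prefix="Sex")
print(list(dummies.columns))  # ['Sex_female', 'Sex_male']

# Dropping the first level leaves Sex_male; female becomes the baseline.
reduced = pd.get_dummies(sex, prefix="Sex", drop_first=True)
print(list(reduced.columns))  # ['Sex_male']
```

Keeping all levels together with a constant term would make the design matrix perfectly collinear, which is why one level per variable is left out.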


## 1.5 Model Building

### 1.5.1 Building the Training Set

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve,roc_auc_score,classification_report


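Split off the training set: the first 623 passengers (row indices 0-622) are used for training. PassengerId and Name are dropped, and a constant column named intercept is added.

The dependent variable is whether the passenger survived. The independent variables are fare, sex, port of embarkation, number of parents and children, number of siblings and spouses, age, and cabin. All of them except fare are dummy variables. Because Fare and Pclass are strongly correlated, Pclass is left out.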

train_y = train[:623]["Survived"]
cols_to_keep = ["Fare"]
# .loc replaces the .ix indexer, which was removed in pandas 1.0; each slice drops the first dummy level.
train_x = train[:623][cols_to_keep].join(dummy_Sex.loc[:, "Sex_male":]).join(dummy_Embarked.loc[:, "Embarked_Q":]).join(dummy_Parch.loc[:, "Parch_1":]).join(dummy_SibSp.loc[:, "SibSp_1":]).join(dummy_Age.loc[:, "Age_(5, 15]":]).join(dummy_Cabin.loc[:, "Cabin_U":])
train_x['intercept'] = 1.0
train_x.tail()


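| | Fare | Sex_male | Embarked_Q | Embarked_S | Parch_1 | Parch_2 | SibSp_1 | SibSp_2 | Age_(5, 15] | Age_(15, 20] | Age_(20, 35] | Age_(35, 50] | Age_(50, 60] | Age_(60, 100] | Cabin_U | intercept |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 618 | 39.0000 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 619 | 10.5000 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 620 | 14.4542 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 621 | 52.5542 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 622 | 15.7417 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |

### 1.5.2 Model Fitting

Fit a logistic regression model to the training set.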

clf = LogisticRegression()
clf.fit(train_x,train_y)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
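
A fitted logistic regression is usually read through its coefficients: `exp(coefficient)` is the factor by which the odds of the positive class change when a feature increases by one unit. The sketch below illustrates this on a hypothetical toy dataset with a single binary feature; it is not the model fitted above, and the resulting number says nothing about the Titanic data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one binary feature and a binary outcome.
# Feature 0: 1 of 4 positive; feature 1: 3 of 4 positive.
X = np.array([[0], [0], [0], [0], [1], [1], [1], [1]])
y = np.array([0, 0, 0, 1, 1, 1, 1, 0])

model = LogisticRegression().fit(X, y)

# exp(coefficient) = multiplicative change in the odds of y = 1
# when the feature goes from 0 to 1.
odds_ratio = float(np.exp(model.coef_[0, 0]))
print(round(odds_ratio, 2))
```

Because the repr above shows `fit_intercept=True`, scikit-learn already estimates an intercept of its own, so the hand-added `intercept` column is probably redundant here.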

### 1.5.3 Model Evaluation

Split off the test set: the remaining 268 passengers (row indices 623-890) are used for testing.

test_y = train[623:]["Survived"]
cols_to_keep = ["Fare"]
# Same column selection as for the training set, with .loc in place of the removed .ix indexer.
test_x = train[623:][cols_to_keep].join(dummy_Sex.loc[:, "Sex_male":]).join(dummy_Embarked.loc[:, "Embarked_Q":]).join(dummy_Parch.loc[:, "Parch_1":]).join(dummy_SibSp.loc[:, "SibSp_1":]).join(dummy_Age.loc[:, "Age_(5, 15]":]).join(dummy_Cabin.loc[:, "Cabin_U":])
test_x['intercept'] = 1.0
test_x.head()


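| | Fare | Sex_male | Embarked_Q | Embarked_S | Parch_1 | Parch_2 | SibSp_1 | SibSp_2 | Age_(5, 15] | Age_(15, 20] | Age_(20, 35] | Age_(35, 50] | Age_(50, 60] | Age_(60, 100] | Cabin_U | intercept |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 623 | 7.8542 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 624 | 16.1000 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 625 | 32.3208 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 626 | 12.3500 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
| 627 | 77.9583 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |

Run the model on the test set.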

clf.predict(test_x)


array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0,
0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0,
1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1,
0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0], dtype=int64)


clf.predict_proba(test_x)


array([[ 0.86039834,  0.13960166],
[ 0.85711962,  0.14288038],
[ 0.74830885,  0.25169115],
[ 0.84502153,  0.15497847],
[ 0.12886629,  0.87113371],
[ 0.86038196,  0.13961804],
[ 0.81492416,  0.18507584],
[ 0.74973913,  0.25026087],
[ 0.88595841,  0.11404159],
[ 0.60609598,  0.39390402],
[ 0.86346251,  0.13653749],
[ 0.61197924,  0.38802076],
[ 0.35569986,  0.64430014],
[ 0.86037046,  0.13962954],
[ 0.79463014,  0.20536986],
[ 0.57570079,  0.42429921],
[ 0.83467975,  0.16532025],
[ 0.85798051,  0.14201949],
[ 0.10986907,  0.89013093],
[ 0.56445448,  0.43554552],
[ 0.84012224,  0.15987776],
[ 0.15820833,  0.84179167],
[ 0.56790934,  0.43209066],
[ 0.85796389,  0.14203611],
[ 0.65552637,  0.34447363],
[ 0.86051808,  0.13948192],
[ 0.35980506,  0.64019494],
[ 0.86038196,  0.13961804],
[ 0.29324608,  0.70675392],
[ 0.86017015,  0.13982985],
[ 0.28622388,  0.71377612],
[ 0.28287555,  0.71712445],
[ 0.8070549 ,  0.1929451 ],
[ 0.86038196,  0.13961804],
[ 0.20682467,  0.79317533],
[ 0.85835972,  0.14164028],
[ 0.53882734,  0.46117266],
[ 0.80220769,  0.19779231],
[ 0.85539757,  0.14460243],
[ 0.69483827,  0.30516173],
[ 0.87933359,  0.12066641],
[ 0.83561804,  0.16438196],
[ 0.8070549 ,  0.1929451 ],
[ 0.85835972,  0.14164028],
[ 0.86042952,  0.13957048],
[ 0.87914067,  0.12085933],
[ 0.11937883,  0.88062117],
[ 0.2853273 ,  0.7146727 ],
[ 0.59808311,  0.40191689],
[ 0.90594814,  0.09405186],
[ 0.85835972,  0.14164028],
[ 0.86346251,  0.13653749],
[ 0.85801214,  0.14198786],
[ 0.86032122,  0.13967878],
[ 0.35349566,  0.64650434],
[ 0.52724747,  0.47275253],
[ 0.22879217,  0.77120783],
[ 0.28601744,  0.71398256],
[ 0.56939306,  0.43060694],
[ 0.85743204,  0.14256796],
[ 0.94208727,  0.05791273],
[ 0.82348607,  0.17651393],
[ 0.74901495,  0.25098505],
[ 0.94336392,  0.05663608],
[ 0.85705258,  0.14294742],
[ 0.85800384,  0.14199616],
[ 0.05587819,  0.94412181],
[ 0.5941366 ,  0.4058634 ],
[ 0.18541791,  0.81458209],
[ 0.84012224,  0.15987776],
[ 0.83358085,  0.16641915],
[ 0.87933975,  0.12066025],
[ 0.88380587,  0.11619413],
[ 0.87914067,  0.12085933],
[ 0.28628812,  0.71371188],
[ 0.48214083,  0.51785917],
[ 0.70716252,  0.29283748],
[ 0.05715212,  0.94284788],
[ 0.65795297,  0.34204703],
[ 0.25709963,  0.74290037],
[ 0.81492001,  0.18507999],
[ 0.83837631,  0.16162369],
[ 0.8727473 ,  0.1272527 ],
[ 0.3942793 ,  0.6057207 ],
[ 0.69435145,  0.30564855],
[ 0.25955289,  0.74044711],
[ 0.76489311,  0.23510689],
[ 0.11637845,  0.88362155],
[ 0.65775927,  0.34224073],
[ 0.63734093,  0.36265907],
[ 0.85975561,  0.14024439],
[ 0.8839741 ,  0.1160259 ],
[ 0.66714496,  0.33285504],
[ 0.07984609,  0.92015391],
[ 0.15579323,  0.84420677],
[ 0.81105308,  0.18894692],
[ 0.86042952,  0.13957048],
[ 0.24260674,  0.75739326],
[ 0.8360098 ,  0.1639902 ],
[ 0.85835972,  0.14164028],
[ 0.87740579,  0.12259421],
[ 0.59721595,  0.40278405],
[ 0.85765731,  0.14234269],
[ 0.72253208,  0.27746792],
[ 0.2862853 ,  0.7137147 ],
[ 0.83015243,  0.16984757],
[ 0.3208549 ,  0.6791451 ],
[ 0.08720221,  0.91279779],
[ 0.79040078,  0.20959922],
[ 0.86346251,  0.13653749],
[ 0.85835972,  0.14164028],
[ 0.85835972,  0.14164028],
[ 0.85711962,  0.14288038],
[ 0.53746947,  0.46253053],
[ 0.24073084,  0.75926916],
[ 0.86038196,  0.13961804],
[ 0.86038196,  0.13961804],
[ 0.65520865,  0.34479135],
[ 0.61675979,  0.38324021],
[ 0.04187545,  0.95812455],
[ 0.83467975,  0.16532025],
[ 0.86037046,  0.13962954],
[ 0.63589211,  0.36410789],
[ 0.7945787 ,  0.2054213 ],
[ 0.35569986,  0.64430014],
[ 0.5923993 ,  0.4076007 ],
[ 0.8149159 ,  0.1850841 ],
[ 0.18626833,  0.81373167],
[ 0.55494501,  0.44505499],
[ 0.85974901,  0.14025099],
[ 0.86038196,  0.13961804],
[ 0.26826856,  0.73173144],
[ 0.72096103,  0.27903897],
[ 0.86042134,  0.13957866],
[ 0.85651789,  0.14348211],
[ 0.86032122,  0.13967878],
[ 0.12575524,  0.87424476],
[ 0.8577608 ,  0.1422392 ],
[ 0.87946251,  0.12053749],
[ 0.83078796,  0.16921204],
[ 0.09214503,  0.90785497],
[ 0.85801214,  0.14198786],
[ 0.1353402 ,  0.8646598 ],
[ 0.81833156,  0.18166844],
[ 0.28627693,  0.71372307],
[ 0.77835462,  0.22164538],
[ 0.86019807,  0.13980193],
[ 0.85974901,  0.14025099],
[ 0.87920886,  0.12079114],
[ 0.18831656,  0.81168344],
[ 0.83358085,  0.16641915],
[ 0.56217041,  0.43782959],
[ 0.85802213,  0.14197787],
[ 0.59346021,  0.40653979],
[ 0.26217247,  0.73782753],
[ 0.81492209,  0.18507791],
[ 0.0820547 ,  0.9179453 ],
[ 0.26297266,  0.73702734],
[ 0.11560723,  0.88439277],
[ 0.65520865,  0.34479135],
[ 0.7961241 ,  0.2038759 ],
[ 0.86071471,  0.13928529],
[ 0.86063609,  0.13936391],
[ 0.35525524,  0.64474476],
[ 0.92489299,  0.07510701],
[ 0.71693682,  0.28306318],
[ 0.6076947 ,  0.3923053 ],
[ 0.8149159 ,  0.1850841 ],
[ 0.85057636,  0.14942364],
[ 0.6376244 ,  0.3623756 ],
[ 0.822631  ,  0.177369  ],
[ 0.86038196,  0.13961804],
[ 0.87740579,  0.12259421],
[ 0.17163368,  0.82836632],
[ 0.35894969,  0.64105031],
[ 0.83357894,  0.16642106],
[ 0.26194935,  0.73805065],
[ 0.85835972,  0.14164028],
[ 0.26062053,  0.73937947],
[ 0.42452126,  0.57547874],
[ 0.71744256,  0.28255744],
[ 0.86074419,  0.13925581],
[ 0.86042952,  0.13957048],
[ 0.71232894,  0.28767106],
[ 0.35504562,  0.64495438],
[ 0.87740579,  0.12259421],
[ 0.11900024,  0.88099976],
[ 0.86038523,  0.13961477],
[ 0.87341934,  0.12658066],
[ 0.85935323,  0.14064677],
[ 0.60934863,  0.39065137],
[ 0.86032122,  0.13967878],
[ 0.67707567,  0.32292433],
[ 0.35952192,  0.64047808],
[ 0.751824  ,  0.248176  ],
[ 0.8796969 ,  0.1203031 ],
[ 0.94539356,  0.05460644],
[ 0.10542774,  0.89457226],
[ 0.86007975,  0.13992025],
[ 0.88191682,  0.11808318],
[ 0.12684285,  0.87315715],
[ 0.93191133,  0.06808867],
[ 0.81531115,  0.18468885],
[ 0.84012224,  0.15987776],
[ 0.69813211,  0.30186789],
[ 0.8149159 ,  0.1850841 ],
[ 0.18808428,  0.81191572],
[ 0.22676529,  0.77323471],
[ 0.71814943,  0.28185057],
[ 0.83357894,  0.16642106],
[ 0.86039834,  0.13960166],
[ 0.85780233,  0.14219767],
[ 0.0849913 ,  0.9150087 ],
[ 0.86007975,  0.13992025],
[ 0.86032122,  0.13967878],
[ 0.84012224,  0.15987776],
[ 0.60672195,  0.39327805],
[ 0.85795222,  0.14204778],
[ 0.85692031,  0.14307969],
[ 0.29681064,  0.70318936],
[ 0.83393869,  0.16606131],
[ 0.85765731,  0.14234269],
[ 0.87931473,  0.12068527],
[ 0.95077511,  0.04922489],
[ 0.83327556,  0.16672444],
[ 0.8180723 ,  0.1819277 ],
[ 0.08871631,  0.91128369],
[ 0.93364059,  0.06635941],
[ 0.90670657,  0.09329343],
[ 0.18814843,  0.81185157],
[ 0.11532953,  0.88467047],
[ 0.3446263 ,  0.6553737 ],
[ 0.30260559,  0.69739441],
[ 0.20902316,  0.79097684],
[ 0.70727922,  0.29272078],
[ 0.49908055,  0.50091945],
[ 0.83357894,  0.16642106],
[ 0.85717847,  0.14282153],
[ 0.83675021,  0.16324979],
[ 0.17163368,  0.82836632],
[ 0.6376244 ,  0.3623756 ],
[ 0.85835972,  0.14164028],
[ 0.39467085,  0.60532915],
[ 0.2731416 ,  0.7268584 ],
[ 0.63987502,  0.36012498],
[ 0.85974901,  0.14025099],
[ 0.72317603,  0.27682397],
[ 0.86038196,  0.13961804],
[ 0.11238481,  0.88761519],
[ 0.67348135,  0.32651865],
[ 0.87880937,  0.12119063],
[ 0.2665907 ,  0.7334093 ],
[ 0.26297533,  0.73702467],
[ 0.85718307,  0.14281693],
[ 0.85796389,  0.14203611],
[ 0.86038196,  0.13961804],
[ 0.10513251,  0.89486749],
[ 0.29535416,  0.70464584],
[ 0.86038196,  0.13961804],
[ 0.35756781,  0.64243219],
[ 0.85935323,  0.14064677],
[ 0.86071471,  0.13928529],
[ 0.50077687,  0.49922313],
[ 0.85835972,  0.14164028],
[ 0.14507275,  0.85492725],
[ 0.26239326,  0.73760674],
[ 0.60648726,  0.39351274],
[ 0.8149159 ,  0.1850841 ]])


preds = clf.predict(test_x)


The confusion matrix of the model is shown below.

confusion_matrix(test_y,preds)


array([[157,  15],
[ 35,  61]])
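
The headline metrics follow directly from these four numbers. As a worked check, the sketch below recomputes accuracy, precision, and recall for the "survived" class from the matrix, using scikit-learn's convention that rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix from the report: rows = actual, columns = predicted.
cm = np.array([[157, 15],
               [35, 61]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # of predicted survivors, the share that survived
recall = tp / (tp + fn)     # of actual survivors, the share that was found
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.813 precision=0.803 recall=0.635
```

The model misses 35 of the 96 actual survivors, so its recall on the positive class is noticeably weaker than its overall accuracy.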


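Compute the model's ROC AUC score and plot the ROC curve. The AUC is 0.88, meaning that a randomly chosen survivor receives a higher predicted probability than a randomly chosen non-survivor about 88% of the time. This is not the same as accuracy, which according to the confusion matrix is (157 + 61) / 268, or about 81%. The model predicts reasonably well.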

pre = clf.predict_proba(test_x)
roc_auc_score(test_y,pre[:,1])


0.88114704457364346
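
An AUC value can be read as a ranking probability: the share of (survivor, non-survivor) pairs in which the survivor gets the higher score. The sketch below computes that share by brute force on hypothetical labels and scores; `pairwise_auc` is a helper written for this illustration, not a library function.

```python
def pairwise_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, scores)
print(auc)  # 0.75
```

Three of the four positive-negative pairs are ordered correctly, hence 0.75. On this reading, a score of 0.88 describes how well the model ranks passengers, not the share of correct predictions.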


fpr,tpr,thresholds = roc_curve(test_y,pre[:,1])
fig,ax = plt.subplots(figsize=(8,5))
plt.plot(fpr,tpr)
ax.set_title("Roc of Logistic Regression")



The classification report for the model's predictions is shown below.

print(classification_report(test_y,preds))


| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.82 | 0.91 | 0.86 | 172 |
| 1 | 0.80 | 0.64 | 0.71 | 96 |
| avg / total | 0.81 | 0.81 | 0.81 | 268 |


Overall, the model fits reasonably well.