
credit card fraud detection

2017-12-17 22:47
Overview

This is an introductory machine-learning post: partly my own study notes, partly entry-level code shared in the hope that it helps a few fellow learners. Corrections and discussion are welcome.

Abstract

This post takes the Kaggle credit card fraud dataset (https://www.kaggle.com/dalpozz/creditcardfraud) as its study object and applies decision trees, principal component analysis (PCA), linear discriminant analysis (LDA), gradient boosting decision trees (GBDT), and XGBoost to identify credit card fraud, with some visualized results.

Main text

libraries go first

print(__doc__)
# import libraries
import pandas as pds
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV  # cross_validation is deprecated
from sklearn import decomposition
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier  # used in section 4
from sklearn import tree
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import metrics
import matplotlib.pyplot as plt
import graphviz
from time import time


Import the data, split it into training and test sets, and record the start time:

data = pds.read_csv('./creditcard.csv')            # renamed so it no longer shadows pandas.DataFrame
featureframe = data.drop(['Time', 'Amount', 'Class'], axis=1)  # keep only the V1-V28 features
targetframe = data['Class']                        # 1 = fraud, 0 = normal
X, Xt, Y, Yt = train_test_split(featureframe, targetframe, test_size=0.20, random_state=0)
X = X.fillna(0)
Xt = Xt.fillna(0)
t0 = time()
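Note that this dataset is highly imbalanced: fraud cases are a tiny fraction of all transactions, which is why sensitivity and specificity are reported alongside accuracy throughout this post. A minimal sketch to check the class balance, reusing targetframe from above:

# the target is extremely imbalanced; fraud (Class == 1) is rare,
# so plain accuracy alone is a misleading metric
print(targetframe.value_counts())
print("fraud ratio: %.4f%%" % (100.0 * targetframe.mean()))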


1 Decision Tree

# training tree
dtc = DecisionTreeClassifier(max_depth=5)  # 5 layers
dtc.fit(X, Y)
pred = dtc.predict(Xt)
print("done in %0.3fs" % (time() - t0))
# post analysis
print("result with all features:")
Accuracy = metrics.accuracy_score(Yt, pred)  # argument order is (y_true, y_pred)
print("Accuracy:", Accuracy)
# confusion_matrix also expects (y_true, y_pred); passing them the other
# way round transposes the matrix and swaps fp with fn
tn, fp, fn, tp = metrics.confusion_matrix(Yt, pred).ravel()
Sensitivity = tp / float(tp + fn)  # true positive rate
print("Sensitivity:", Sensitivity)
print(metrics.confusion_matrix(Yt, pred))
Specificity = tn / float(tn + fp)  # true negative rate
print("Specificity:", Specificity)


1.1 Results:

done in 13.483s (depends on machine configuration)

Accuracy: 0.999490888663

Sensitivity: 0.882978723404

Specificity: 0.999683477527



The tree structure shows that some leaf nodes in the lower-left part are not cleanly separated. The figure below shows an 8-layer tree, in which all leaf nodes achieve a fairly good split.
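The tree diagrams referred to above can be rendered with sklearn's export_graphviz; the original rendering code is not shown in the post, so the following is only a sketch of one way to produce them:

# export the fitted tree to Graphviz DOT format and render it as a PDF
dot_data = tree.export_graphviz(dtc, out_file=None,
                                feature_names=X.columns,
                                class_names=['normal', 'fraud'],
                                filled=True)
graph = graphviz.Source(dot_data)
graph.render('CCFD_decision_tree')  # writes CCFD_decision_tree.pdf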



1.2 Feature importance analysis with the DT

# list feature importances
fec_indx = dtc.feature_importances_
# plot feature importances
fec_name = X.columns   # get feature names
plt.figure(figsize=(12, 6))
plt.title("Feature importances")
plt.bar(np.arange(fec_indx.shape[0]), fec_indx, color='y', align="center")
plt.xticks(np.arange(fec_indx.shape[0]), fec_name, rotation=90)
plt.show()  # plt.show is a function and must be called




1.3 Retrain the DT on the important features only

important_featureframe = data[['V1', 'V4', 'V10', 'V12', 'V14', 'V17', 'V18', 'V21', 'V23', 'V25', 'V26', 'V27']]
i_X, i_Xt, i_Y, i_Yt = train_test_split(important_featureframe, targetframe, test_size=0.20, random_state=0)
imp_dtc = DecisionTreeClassifier(max_depth=8)
imp_dtc.fit(i_X, i_Y)
i_pred = imp_dtc.predict(i_Xt)


done in 10.198s

result with only important features:

i_Accuracy: 0.999385555282

i_Sensitivity: 0.892857142857

i_Specificity: 0.999542881255

Computation time drops and sensitivity improves, which shows the value of feature engineering.
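Instead of hard-coding the column names, the important features can also be picked programmatically from the fitted tree. A minimal sketch; the 0.01 threshold is my own assumption, not a value from the original post:

# select every feature whose importance clears a chosen threshold
threshold = 0.01  # assumed cut-off; tune as needed
important_cols = fec_name[fec_indx > threshold]
print("selected features:", list(important_cols))
important_featureframe = featureframe[important_cols]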

2 PCA

Since the feature-importance analysis selected 12 important features out of 28, PCA here likewise keeps the first 12 principal components.

pca = decomposition.PCA(n_components=12)  # keep the first 12 components
pca_X = X
pca_Xt = Xt
pca.fit(pca_X)                  # fit on the training set only
pca_X = pca.transform(pca_X)    # project both sets onto the components
pca_Xt = pca.transform(pca_Xt)
pca_dtc = DecisionTreeClassifier(max_depth=8)
pca_dtc.fit(pca_X, Y)
pca_pred = pca_dtc.predict(pca_Xt)


done in 12.058s

result with PCA:

pca_Accuracy: 0.999192444086

pca_Sensitivity: 0.83950617284

pca_Specificity: 0.999419841423
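It is worth checking how much of the total variance the 12 retained components actually capture; a minimal sketch using the fitted pca object:

# per-component fraction of explained variance, plus the cumulative total
print(pca.explained_variance_ratio_)
print("total variance captured: %.4f" % pca.explained_variance_ratio_.sum())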

3 LDA

clf = LinearDiscriminantAnalysis()  # note: the name clf is reused for XGBoost in section 5
lda_X = X
lda_Xt = Xt
clf.fit(lda_X, Y)   # closed-form fit, hence the short run time
lda_pred = clf.predict(lda_Xt)


done in 1.601s

result with LDA:

lda_Accuracy: 0.999385555282

lda_Sensitivity: 0.875

lda_Specificity: 0.999419861822

This algorithm is very fast, since LDA is fit in closed form with a single linear projection; its sensitivity is acceptable.

4 GBDT

GBDT did not produce a useful result in this experiment and needs further analysis. The code is as follows:

params = {'n_estimators': 100, 'max_depth': 4, 'min_samples_split': 2,
          'learning_rate': 0.01, 'loss': 'exponential'}
gbdt = GradientBoostingClassifier(**params)  # imported from sklearn.ensemble above
gbdt.fit(X, Y)
y_pre = gbdt.predict(Xt)


done in 612.429s

results with gbdt:

i_Accuracy: 0.998226888101

i_Sensitivity: nan

i_Specificity: 0.998226888101

Confusion matrix:

[[56861, 101], [0, 0]]

Every sample is predicted as class 0: all 101 fraud cases in the test set are missed, so the sensitivity is 0/0, hence nan.
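A likely cause (my assumption, not verified in the original post) is the extreme class imbalance combined with the small learning rate: boosting never moves off the majority class. One possible follow-up is to re-weight the fraud class during fitting; a minimal sketch, where the weight of 100 is an illustrative guess rather than a tuned value:

# up-weight fraud samples so the boosting loss pays attention to them
weights = np.where(Y == 1, 100.0, 1.0)
gbdt_w = GradientBoostingClassifier(**params)
gbdt_w.fit(X, Y, sample_weight=weights)
y_pre_w = gbdt_w.predict(Xt)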

5 XGBoost

As a powerful ensemble algorithm, XGBoost has a great deal worth studying in depth. This post first tunes XGBoost with a grid search; the code is as follows:

params_cv = {'gamma': [0, 0.2, 0.3], 'max_depth': [3, 5, 7], 'n_estimators': [50, 100, 200]}
gridcv = GridSearchCV(estimator=XGBClassifier(objective='binary:logistic'),
                      param_grid=params_cv, scoring='roc_auc', n_jobs=-1)  # roc_auc score; use all cores
gridcv.fit(X, Y)
print(gridcv.cv_results_)   # grid_scores_ was removed from model_selection's GridSearchCV
print(gridcv.best_score_)
print(gridcv.best_params_)
y_pre = gridcv.predict(Xt)  # predicts with the refit best estimator
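cv_results_ is a dict of parallel arrays and is easier to read as a DataFrame; a minimal sketch:

# tabulate mean CV AUC for every parameter combination, best first
cv_table = pds.DataFrame(gridcv.cv_results_)
print(cv_table[['params', 'mean_test_score']]
      .sort_values('mean_test_score', ascending=False)
      .head())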


5.1 Grid search output:

GridSearchCV with XGBClassifier as the estimator is very time-consuming. The output is as follows:

done in 4236.404s (wait and wait)

{'gamma': 0.3, 'max_depth': 3, 'n_estimators': 100}

Accuracy_score: 0.999490888663

sensitivity: 0.909090909091

specificity: 0.999630762739

No wonder parallel computing comes to mind whenever XGBoost is mentioned.

As an aside, here is the code to export the tree structures of the first 10 base learners:

# output the XGB trees in PDF format
# (clf must already be fitted; it is trained in the next block)
for index in range(10):
    graph = xgb.to_graphviz(clf, num_trees=index)  # returns a graphviz object directly
    graph.format = 'pdf'
    graph.render('CCFD_xgb_tree_num_' + str(index))




Now classify with XGBoost using the parameters found by the grid search:

params = {'objective': 'binary:logistic', 'gamma': 0.3, 'max_depth': 3, 'n_estimators': 100}
clf = xgb.XGBModel(**params)
clf.fit(X, Y)
y_pro = clf.predict(Xt)  # XGBModel with binary:logistic outputs probabilities
# threshold the probabilities at 0.5 to get hard labels
# (equivalently: y_pre = (y_pro > 0.5).astype(int))
y_pre = list()
for i in range(y_pro.shape[0]):
    if y_pro[i] > 0.5:
        y_pre.append(1)
    else:
        y_pre.append(0)
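With the report helper sketched in section 1, the figures below can be reproduced in one call:

print("done in %0.3fs" % (time() - t0))
report(Yt, y_pre, "XGBoost (tuned)")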


done in 51.067s

Accuracy_score: 0.999490888663

sensitivity: 0.909090909091

specificity: 0.999630762739