credit card fraud detection
2017-12-17 22:47
Overview
This is an introductory machine learning post: partly a record of my own learning, partly a share of beginner-level code, in the hope that it gives a few readers a little help. Corrections and discussion are welcome.
Abstract
Using the Kaggle credit card fraud dataset (https://www.kaggle.com/dalpozz/creditcardfraud) as the study object, this post applies decision trees, principal component analysis (PCA), linear discriminant analysis (LDA), gradient boosted decision trees (GBDT), and XGBoost to identify credit card fraud, and presents some visualised results.
Main text
Libraries go first:
print(__doc__)

# import libraries
import pandas as pds
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV  # sklearn.cross_validation is deprecated
from sklearn import decomposition
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier  # needed for section 4
from sklearn import tree
from sklearn import metrics
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
import matplotlib.pyplot as plt
import graphviz
from time import time
Import the data, split it into training and test sets, and record the start time:
DataFrame = pds.read_csv('./creditcard.csv')
featureframe = DataFrame.drop(['Time', 'Amount', 'Class'], axis=1)
targetframe = DataFrame['Class']
X, Xt, Y, Yt = train_test_split(featureframe, targetframe,
                                test_size=0.20, random_state=0)
X = X.fillna(0)
Xt = Xt.fillna(0)
t0 = time()
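A small variation, not used in this post (all results below come from the unstratified split above): since only roughly 0.17% of the transactions are fraud, a stratified split keeps the fraud rate identical in the training and test sets.

# Hedged sketch, my own addition: stratified split for the highly imbalanced labels.
X_s, Xt_s, Y_s, Yt_s = train_test_split(featureframe, targetframe,
                                        test_size=0.20, random_state=0,
                                        stratify=targetframe)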
1 Decision tree
# training tree
dtc = DecisionTreeClassifier(max_depth=5)  # 5 layers
dtc.fit(X, Y)
pred = dtc.predict(Xt)
print("done in %0.3fs" % (time() - t0))

# post analysis
print("result with all features:")
Accuracy = metrics.accuracy_score(pred, Yt, normalize=True, sample_weight=None)
print("Accuracy:", Accuracy)
tn, fp, fn, tp = metrics.confusion_matrix(pred, Yt).ravel()
Sensitivity = tp / float(tp + fn)  # Sensitivity
print("Sensitivity:", Sensitivity)
print(metrics.confusion_matrix(pred, Yt))
Specificity = tn / float(tn + fp)  # Specificity
print("Specificity:", Specificity)
1.1 Output:
done in 13.483s (depends on the machine)
Accuracy: 0.999490888663
Sensitivity: 0.882978723404
Specificity: 0.999683477527
From the tree structure we can see that some of the leaf nodes in the lower left are not split very well. The figure below shows an 8-layer tree, in which all leaf nodes are split reasonably well.
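The post refers to the rendered tree structure but does not show the export code. A minimal sketch of how such a figure can be produced with sklearn's export_graphviz; the output file name and class names are my own choices:

from sklearn.tree import export_graphviz
import graphviz

# Hedged sketch: render the fitted 5-layer tree to a PDF for inspection.
dot_data = export_graphviz(dtc, out_file=None,
                           feature_names=list(X.columns),
                           class_names=['normal', 'fraud'],
                           filled=True, rounded=True)
graphviz.Source(dot_data).render('CCFD_dtc_tree')  # writes CCFD_dtc_tree.pdf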
1.2 Feature importance analysis with the decision tree
# list feature importances
fec_indx = dtc.feature_importances_
fec_name = X.columns  # get feature names

# plot feature importances
plt.figure(figsize=(12, 6))
plt.title("Feature importances")
plt.bar(np.arange(fec_indx.shape[0]), fec_indx, color='y', align="center")
plt.xticks(np.arange(fec_indx.shape[0]), fec_name)
plt.show()  # was plt.show (missing parentheses)
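The 12 columns used in section 1.3 below look like they were read off this bar chart by hand; the same selection can be done programmatically with an importance cutoff. A small sketch, where the 0.01 threshold is my own assumption rather than a value from the post:

# Hedged sketch: pick features whose importance exceeds a chosen cutoff.
threshold = 0.01
important_cols = fec_name[fec_indx > threshold]
print(list(important_cols))  # compare with the hand-picked column list in 1.3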
1.3 Decision tree on the important features only
important_featureframe = DataFrame[['V1', 'V4', 'V10', 'V12', 'V14', 'V17',
                                    'V18', 'V21', 'V23', 'V25', 'V26', 'V27']]
i_X, i_Xt, i_Y, i_Yt = train_test_split(important_featureframe, targetframe,
                                        test_size=0.20, random_state=0)
imp_dtc = DecisionTreeClassifier(max_depth=8)
imp_dtc.fit(i_X, i_Y)
i_pred = imp_dtc.predict(i_Xt)
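The block above only trains and predicts; the i_* numbers below were presumably computed with the same metric code as in section 1. A sketch of that evaluation (the PCA, LDA and GBDT sections appear to report their metrics the same way):

# Hedged sketch: evaluation mirroring section 1, producing the i_* metrics below.
print("done in %0.3fs" % (time() - t0))
print("result with only important features:")
print("i_Accuracy:", metrics.accuracy_score(i_pred, i_Yt))
tn, fp, fn, tp = metrics.confusion_matrix(i_pred, i_Yt).ravel()
print("i_Sensitivity:", tp / float(tp + fn))
print("i_Specificity:", tn / float(tn + fp))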
done in 10.198s
result with only important features:
i_Accuracy: 0.999385555282
i_Sensitivity: 0.892857142857
i_Specificity: 0.999542881255
This saves computation time and the sensitivity improves as well, which shows the importance of feature engineering.
2 PCA
Since the feature importance analysis picked 12 important features out of 28, PCA here likewise keeps only the first 12 components.
pca = decomposition.PCA(n_components=12)
pca_X = X
pca_Xt = Xt
pca.fit(pca_X)
pca_X = pca.transform(pca_X)
pca_Xt = pca.transform(pca_Xt)
pca_dtc = DecisionTreeClassifier(max_depth=8)
pca_dtc.fit(pca_X, Y)
pca_pred = pca_dtc.predict(pca_Xt)
done in 12.058s
result with PCA:
pca_Accuracy: 0.999192444086
pca_Sensitivity: 0.83950617284
pca_Specificity: 0.999419841423
3 LDA
clf = LinearDiscriminantAnalysis()
lda_X = X
lda_Xt = Xt
clf.fit(lda_X, Y)
lda_pred = clf.predict(lda_Xt)
done in 1.601s
result with LDA:
lda_Accuracy: 0.999385555282
lda_Sensitivity: 0.875
lda_Specificity: 0.999419861822
This algorithm is very fast; the sensitivity is passable.
4 GBDT
GBDT did not produce a usable result in this example; further analysis is needed. The code is as follows:
params = {'n_estimators': 100, 'max_depth': 4, 'min_samples_split': 2,
          'learning_rate': 0.01, 'loss': 'exponential'}
gbdt = GradientBoostingClassifier(**params)
gbdt.fit(X, Y)
y_pre = gbdt.predict(Xt)
done in 612.429s
results with gbdt:
i_Accuracy: 0.998226888101
i_Sensitivity: nan
i_Specificity: 0.998226888101
Confusion matrix:
[[56861, 101], [0, 0]]
The nan sensitivity follows directly from this matrix: GBDT predicted the negative class for every test sample, so all 101 fraud cases were missed, the second row is all zeros, and the sensitivity computation divides zero by zero.
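One possible follow-up, which is my own guess rather than the author's analysis: the all-negative predictions are consistent with the extreme class imbalance, so up-weighting the fraud class during fitting (and using a less conservative learning rate than 0.01) is a natural thing to try.

# Hedged sketch, not from the original post: re-weight the rare fraud class.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

pos_weight = (Y == 0).sum() / float(max((Y == 1).sum(), 1))
weights = np.where(Y == 1, pos_weight, 1.0)  # up-weight fraud samples
gbdt_w = GradientBoostingClassifier(n_estimators=100, max_depth=4,
                                    learning_rate=0.1, loss='exponential')
gbdt_w.fit(X, Y, sample_weight=weights)
y_pre_w = gbdt_w.predict(Xt)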
5 XGBoost
XGBoost is a powerful ensemble algorithm with a lot that is worth studying in depth. Here I first tune XGBoost with a grid search; the code is as follows:
params_cv = {'gamma': [0, 0.2, 0.3], 'max_depth': [3, 5, 7], 'n_estimators': [50, 100, 200]}
gridcv = GridSearchCV(estimator=XGBClassifier(objective='binary:logistic'),
                      param_grid=params_cv, scoring='roc_auc')  # roc_auc score
gridcv.fit(X, Y)
print(gridcv.cv_results_)  # grid_scores_ does not exist on model_selection.GridSearchCV
print(gridcv.best_score_)
print(gridcv.best_params_)
y_pre = gridcv.predict(Xt)
5.1 网格搜索计算输出结果:
GridSearchCV with XGBClassifier as the estimator is very time-consuming; the output is as follows:
done in 4236.404s (wait and wait)
{'gamma': 0.3, 'max_depth': 3, 'n_estimators': 100}
Accuracy_score: 0.999490888663
sensitivity: 0.909090909091
specificity: 0.999630762739
No wonder parallel computing comes to mind whenever XGBoost is mentioned.
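A minimal sketch of what that looks like in practice, assuming a scikit-learn-style XGBClassifier that accepts n_jobs (older xgboost versions call it nthread); this is my own addition, not part of the original run:

# Hedged sketch: the same grid search with parallel tree construction enabled.
# n_jobs=-1 on the classifier uses all CPU cores; GridSearchCV also accepts n_jobs,
# but setting -1 at both levels can oversubscribe the CPU, so one level is enough.
parallel_cv = GridSearchCV(estimator=XGBClassifier(objective='binary:logistic', n_jobs=-1),
                           param_grid=params_cv, scoring='roc_auc')
# parallel_cv.fit(X, Y)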
As an aside, here is the code that exports the tree structures of the first 10 base learners:
# output the XGB trees in pdf format
# clf here is the fitted XGBoost model from the block below
for index in range(10):
    graph = xgb.to_graphviz(clf, num_trees=index)  # returns a graphviz object
    graph.format = 'pdf'
    graph.render('CCFD_xgb_tree_num_' + str(index))
Now classify with XGBoost using the parameters found by the grid search:
params = {'objective': 'binary:logistic', 'gamma': 0.3, 'max_depth': 3, 'n_estimators': 100}
clf = xgb.XGBModel(**params)
clf.fit(X, Y)
y_pro = clf.predict(Xt)  # raw probabilities

# threshold at 0.5 to turn probabilities into class labels
y_pre = list()
for i in range(y_pro.shape[0]):
    if y_pro[i] > 0.5:
        y_pre.append(1)
    else:
        y_pre.append(0)
done in 51.067s
Accuracy_score: 0.999490888663
sensitivity: 0.909090909091
specificity: 0.999630762739
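As a closing note, a hedged alternative to the manual thresholding above, my own addition rather than part of the original post: the scikit-learn wrapper XGBClassifier applies the 0.5 cutoff itself, so predict() returns class labels directly.

# Hedged sketch: let XGBClassifier handle the thresholding.
clf2 = XGBClassifier(objective='binary:logistic', gamma=0.3, max_depth=3, n_estimators=100)
clf2.fit(X, Y)
y_pre2 = clf2.predict(Xt)                # hard 0/1 labels
y_pro2 = clf2.predict_proba(Xt)[:, 1]    # fraud probabilities, if still needed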