Credit Card Fraud Detection with Logistic Regression
2017-10-30 00:12
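This post uses logistic regression to detect credit card fraud. The data is a competition dataset that has already been anonymized, and all of the work is done in Python in a Jupyter Notebook.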
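Exploring the data

    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    %matplotlib inline

Load the data and look at the first 5 rows:

    data = pd.read_csv("creditcard.csv")
    data.head()

The data has 31 columns: Time, V1 to V28, Amount, and Class. The last column, Class, is our label: 0 marks a normal transaction and 1 marks a fraudulent one. As a first habit, plot the class distribution to see how the fraud cases are spread:

    # Count the rows in each class, ordered by class label (0, then 1)
    count_classes = data['Class'].value_counts().sort_index()
    count_classes.plot(kind='bar')
    plt.title("Fraud class histogram")
    plt.xlabel("Class")
    plt.ylabel("Frequency")

There are roughly 280,000 rows with Class=0 and only a tiny number of fraud rows with Class=1, so the classes are extremely imbalanced.
There are two common ways to deal with this:
1. Oversampling (make the 1s as numerous as the 0s);
2. Undersampling (draw a subset of the 0s whose size matches the number of 1s).
Standardization
Compared with the other features, Amount spans a much larger range of values, so it has apparently not been standardized yet. Standardize that column first:

    from sklearn.preprocessing import StandardScaler

    # Standardize the Amount column. reshape(-1, 1) turns the values into a
    # single column; -1 lets NumPy infer the number of rows.
    data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
    # Drop the two columns that are no longer needed, giving a new dataset
    data = data.drop(['Time', 'Amount'], axis=1)
    data.head()

At this point all of the feature columns have been standardized.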
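To make the standardization step concrete, here is a minimal sketch of the z-score computation that StandardScaler performs on a single column: subtract the mean, then divide by the population standard deviation. The helper name `standardize` and the sample amounts are invented for illustration and are not part of the original notebook.

```python
import numpy as np

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    arr = np.asarray(values, dtype=float)
    # np.std uses ddof=0 by default, the same convention StandardScaler uses
    return (arr - arr.mean()) / arr.std()

# Hypothetical transaction amounts, made up for this example
amounts = [1.0, 2.0, 3.0, 4.0, 5.0]
scaled = standardize(amounts)
print(scaled)
print(scaled.mean(), scaled.std())
```

After scaling, the column has mean 0 and standard deviation 1 (up to floating-point error), which puts Amount on a scale comparable to the other features. The sketch does not guard against a constant column, where the standard deviation would be zero.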
Random undersampling
Undersampling is the simpler of the two, so we start with it. First separate the features from the label:

    X = data.loc[:, data.columns != 'Class']  # features: every column except Class
    y = data.loc[:, data.columns == 'Class']  # label

So that the sample still reflects the original distribution of the normal data, we undersample at random:

    # Random undersampling
    # Count the Class == 1 rows and collect their index values
    number_records_fraud = len(data[data.Class == 1])
    fraud_indices = np.array(data[data.Class == 1].index)
    # Index values of the Class == 0 rows
    normal_indices = data[data.Class == 0].index
    # Draw as many normal rows as there are fraud rows, without replacement
    random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace=False)
    random_normal_indices = np.array(random_normal_indices)  # convert to a NumPy array
    # Concatenate the two sets of index values into one new index array
    under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])
    # Build the undersampled dataset (iloc works here because the index is the default 0..n-1 range)
    under_sample_data = data.iloc[under_sample_indices, :]
    # Split the undersampled data into features and label
    X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
    y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']
    # Show the class proportions
    print("Percentage of normal transactions: ", len(under_sample_data[under_sample_data.Class == 0])/len(under_sample_data))
    print("Percentage of fraud transactions: ", len(under_sample_data[under_sample_data.Class == 1])/len(under_sample_data))
    print("Total number of transactions in resampled data: ", len(under_sample_data))
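The same idea can be tried on a toy label list without pandas. The sketch below is only an illustration: the function name `undersample_indices`, the seed, and the 95/5 class split are assumptions chosen for the example, not values from the dataset.

```python
import random

def undersample_indices(labels, seed=0):
    """Keep every minority index (label 1) plus an equally sized random
    draw of majority indices (label 0), sampled without replacement."""
    rng = random.Random(seed)
    fraud = [i for i, y in enumerate(labels) if y == 1]
    normal = [i for i, y in enumerate(labels) if y == 0]
    sampled_normal = rng.sample(normal, len(fraud))
    return sorted(fraud + sampled_normal)

# Toy labels: 95 normal rows and 5 fraud rows (invented numbers)
labels = [0] * 95 + [1] * 5
kept = undersample_indices(labels)
balanced = [labels[i] for i in kept]
print(len(kept), sum(balanced))  # 10 rows kept, 5 of them fraud
```

The balanced subset is exactly twice the size of the minority class, which is why the undersampled dataset ends up so small compared with the original.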
Splitting the data
Split the dataset into a training set and a test set:

    from sklearn.model_selection import train_test_split

    # Split the full dataset; use the same random_state for both splits
    # 30% goes to the test set; random_state=0 keeps the split reproducible, which helps when tuning parameters
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    print("Number transactions train dataset: ", len(X_train))
    print("Number transactions test dataset: ", len(X_test))
    print("Total number of transactions: ", len(X_train)+len(X_test))

    # Split the undersampled dataset
    X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(
        X_undersample, y_undersample, test_size=0.3, random_state=0)
    print("")
    print("Number transactions train dataset: ", len(X_train_undersample))
    print("Number transactions test dataset: ", len(X_test_undersample))
    print("Total number of transactions: ", len(X_train_undersample)+len(X_test_undersample))
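Evaluating the model
Before building the model, consider which parameters to tune and which metric to use for evaluation.
TP (true positives): positive examples correctly classified as positive
FN (false negatives): positive examples wrongly classified as negative
FP (false positives): negative examples wrongly classified as positive
TN (true negatives): negative examples correctly classified as negative
Because the goal is to find as many of the fraudulent transactions as possible, one metric matters most:
Recall = TP/(TP+FN)
Suppose 10 of 1,000 credit card transactions are fraudulent. Unlike accuracy, recall looks only at those 10: if the model finds 3 of them, recall is 0.3.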
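The 1,000-transaction example can be worked through in a few lines. The sketch assumes, purely for illustration, that the model raises no false alarms (FP = 0), which the text does not state; the point is only to show how accuracy can look excellent while recall is poor.

```python
def recall(tp, fn):
    """Share of actual positives that the model found."""
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# 1,000 transactions, 10 of them fraud; the model finds 3 of the 10.
tp, fn = 3, 7
# Assumption for this example only: no normal transaction is flagged.
fp, tn = 0, 990

print(recall(tp, fn))            # 0.3
print(accuracy(tp, tn, fp, fn))  # 0.993
```

An accuracy of 0.993 hides the fact that 7 of the 10 frauds were missed, which is why recall is the metric to watch on data this imbalanced.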
Modeling
Next comes the modeling. Often we do not know in advance which parameter values are appropriate, so the easiest approach is to write a script that tries each value in turn, then choose a model based on the results.

    from sklearn.linear_model import LogisticRegression
    # sklearn.cross_validation was removed in later scikit-learn releases;
    # KFold and cross_val_score now live in sklearn.model_selection
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.metrics import confusion_matrix, recall_score, classification_report

    # Fit logistic regression with several penalty strengths, use cross-validation
    # to find the best one, and print every intermediate result
    def printing_Kfold_scores(x_train_data, y_train_data):
        fold = KFold(n_splits=5, shuffle=False)  # 5-fold cross-validation
        # C is the inverse of the regularization strength (smaller C = stronger
        # penalty); it is used to control overfitting
        c_param_range = [0.01, 0.1, 1, 10, 100]
        results_table = pd.DataFrame({'C_parameter': c_param_range,
                                      'Mean recall score': np.nan})
        # Outer loop: try each value of C
        for j, c_param in enumerate(c_param_range):
            print('-------------------------------------------')
            print('C parameter: ', c_param)
            print('-------------------------------------------')
            print('')
            recall_accs = []
            # Inner loop: fold.split yields (train_indices, test_indices) for each fold
            for iteration, (train_idx, test_idx) in enumerate(fold.split(x_train_data), start=1):
                # L1-penalized logistic regression; the liblinear solver supports penalty='l1'
                lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')
                # Fit on the training part of this fold
                lr.fit(x_train_data.iloc[train_idx, :], y_train_data.iloc[train_idx, :].values.ravel())
                # Predict on the held-out part of this fold
                y_pred_undersample = lr.predict(x_train_data.iloc[test_idx, :])
                # Compare predictions with the true labels to get recall, and print it
                recall_acc = recall_score(y_train_data.iloc[test_idx, :].values, y_pred_undersample)
                recall_accs.append(recall_acc)
                print('Iteration ', iteration, ': recall score = ', recall_acc)
            # Mean recall across the folds for this value of C
            results_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
            print('')
            print('Mean recall score ', np.mean(recall_accs))
            print('')
        # Finally, pick the C with the highest mean recall
        best_c = results_table.loc[results_table['Mean recall score'].idxmax(), 'C_parameter']
        print('*********************************************************************************')
        print('Best model to choose from cross validation is with C parameter = ', best_c)
        print('*********************************************************************************')
        return best_c

    best_c = printing_Kfold_scores(X_train_undersample, y_train_undersample)
– – – – – – – – – – – – – – – – – – – – – -
C parameter: 0.01
– – – – – – – – – – – – – – – – – – – – – -
Iteration 1 : recall score = 0.958904109589
Iteration 2 : recall score = 0.917808219178
Iteration 3 : recall score = 1.0
Iteration 4 : recall score = 0.972972972973
Iteration 5 : recall score = 0.954545454545
Mean recall score 0.960846151257
– – – – – – – – – – – – – – – – – – – – – -
C parameter: 0.1
– – – – – – – – – – – – – – – – – – – – – -
Iteration 1 : recall score = 0.835616438356
Iteration 2 : recall score = 0.86301369863
Iteration 3 : recall score = 0.915254237288
Iteration 4 : recall score = 0.932432432432
Iteration 5 : recall score = 0.878787878788
Mean recall score 0.885020937099
– – – – – – – – – – – – – – – – – – – – – -
C parameter: 1
– – – – – – – – – – – – – – – – – – – – – -
Iteration 1 : recall score = 0.835616438356
Iteration 2 : recall score = 0.86301369863
Iteration 3 : recall score = 0.966101694915
Iteration 4 : recall score = 0.945945945946
Iteration 5 : recall score = 0.893939393939
Mean recall score 0.900923434357
– – – – – – – – – – – – – – – – – – – – – -
C parameter: 10
– – – – – – – – – – – – – – – – – – – – – -
Iteration 1 : recall score = 0.849315068493
Iteration 2 : recall score = 0.86301369863
Iteration 3 : recall score = 0.966101694915
Iteration 4 : recall score = 0.959459459459
Iteration 5 : recall score = 0.893939393939
Mean recall score 0.906365863087
– – – – – – – – – – – – – – – – – – – – – -
C parameter: 100
– – – – – – – – – – – – – – – – – – – – – -
Iteration 1 : recall score = 0.86301369863
Iteration 2 : recall score = 0.86301369863
Iteration 3 : recall score = 0.966101694915
Iteration 4 : recall score = 0.959459459459
Iteration 5 : recall score = 0.893939393939
Mean recall score 0.909105589115
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Best model to choose from cross validation is with C parameter = 0.01
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
These results show that the best mean recall so far is about 0.96, obtained with C = 0.01.
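Each "Iteration" line in the output above corresponds to one held-out fold. As a rough illustration of how an unshuffled 5-fold split carves up the row positions, here is a small pure-Python sketch. The function `kfold_indices` is a hypothetical stand-in written for this example, not scikit-learn's implementation, although it follows the same convention of giving the first few folds one extra row when the data does not divide evenly.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds receive one additional sample
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(12, n_splits=5))
for iteration, (train_idx, test_idx) in enumerate(folds, start=1):
    print('Iteration', iteration, 'held-out rows:', test_idx)
```

Every row is held out exactly once, so the mean recall averages five scores, each measured on rows the model did not see while fitting.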
Next, plot the confusion matrix for a more intuitive view:

    import itertools

    def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
        """
        Plot the confusion matrix as a heat map with the count in each cell.
        """
        plt.imshow(cm, interpolation='nearest', cmap=cmap)
        plt.title(title)
        plt.colorbar()
        tick_marks = np.arange(len(classes))
        plt.xticks(tick_marks, classes, rotation=0)
        plt.yticks(tick_marks, classes)
        # Use white text on dark cells and black text on light cells
        thresh = cm.max() / 2.
        for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
            plt.text(j, i, cm[i, j],
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")
        plt.tight_layout()
        plt.ylabel('True label')
        plt.xlabel('Predicted label')

    lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
    lr.fit(X_train_undersample, y_train_undersample.values.ravel())
    y_pred_undersample = lr.predict(X_test_undersample)
    # Compute the confusion matrix
    cnf_matrix = confusion_matrix(y_test_undersample, y_pred_undersample)
    np.set_printoptions(precision=2)
    print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
    # Plot the non-normalized confusion matrix
    class_names = [0, 1]
    plt.figure()
    plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
    plt.show()

At a glance the plot shows that the model found 138 of the true fraud cases but let 9 slip through, and 17 normal transactions were wrongly flagged. Recall reaches about 0.93, which looks high. Is this the result we want? Not quite: this confusion matrix was computed on the undersampled test data.
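The three counts quoted above (138 frauds found, 9 missed, 17 normal transactions wrongly flagged) are enough to reproduce the recall figure by hand. The sketch below also computes precision, a metric the article itself does not use but which summarizes the wrongly flagged transactions; treat it as a side calculation rather than part of the original analysis.

```python
# Counts read from the confusion matrix on the undersampled test set
tp = 138  # frauds correctly flagged
fn = 9    # frauds missed
fp = 17   # normal transactions wrongly flagged

recall = tp / (tp + fn)      # share of frauds that were caught
precision = tp / (tp + fp)   # share of flagged transactions that were fraud

print(round(recall, 3), round(precision, 3))  # 0.939 0.89
```

Recall is 138/147, roughly 0.939, which is the "about 0.93" quoted in the text when truncated to two decimals.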
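Next, plot the confusion matrix on the original, full test data and see what happens:

    lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
    lr.fit(X_train_undersample, y_train_undersample.values.ravel())
    y_pred = lr.predict(X_test)
    # Compute the confusion matrix
    cnf_matrix = confusion_matrix(y_test, y_pred)
    np.set_printoptions(precision=2)
    print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
    # Plot the non-normalized confusion matrix
    class_names = [0, 1]
    plt.figure()
    plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
    plt.show()

Here a serious problem shows up: the number of false positives reaches 10,318, which would undoubtedly have a major impact on the business. Why does this happen? The model was trained on the undersampled data, which is very small: it contains few normal rows and equally few fraud rows, so the sample is limited and a result like this is not surprising.
So how can this be fixed?
What if we skip the resampling altogether and train on the original imbalanced data? Can we still get a good result?

    best_c = printing_Kfold_scores(X_train, y_train)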
– – – – – – – – – – – – – – – – – – – – – -
C parameter: 0.01
– – – – – – – – – – – – – – – – – – – – – -
Iteration 1 : recall score = 0.492537313433
Iteration 2 : recall score = 0.602739726027
Iteration 3 : recall score = 0.683333333333
Iteration 4 : recall score = 0.569230769231
Iteration 5 : recall score = 0.45
Mean recall score 0.559568228405
– – – – – – – – – – – – – – – – – – – – – -
C parameter: 0.1
– – – – – – – – – – – – – – – – – – – – – -
Iteration 1 : recall score = 0.567164179104
Iteration 2 : recall score = 0.616438356164
Iteration 3 : recall score = 0.683333333333
Iteration 4 : recall score = 0.584615384615
Iteration 5 : recall score = 0.525
Mean recall score 0.595310250644
– – – – – – – – – – – – – – – – – – – – – -
C parameter: 1
– – – – – – – – – – – – – – – – – – – – – -
Iteration 1 : recall score = 0.55223880597
Iteration 2 : recall score = 0.616438356164
Iteration 3 : recall score = 0.716666666667
Iteration 4 : recall score = 0.615384615385
Iteration 5 : recall score = 0.5625
Mean recall score 0.612645688837
– – – – – – – – – – – – – – – – – – – – – -
C parameter: 10
– – – – – – – – – – – – – – – – – – – – – -
Iteration 1 : recall score = 0.55223880597
Iteration 2 : recall score = 0.616438356164
Iteration 3 : recall score = 0.733333333333
Iteration 4 : recall score = 0.615384615385
Iteration 5 : recall score = 0.575
Mean recall score 0.61847902217
– – – – – – – – – – – – – – – – – – – – – -
C parameter: 100
– – – – – – – – – – – – – – – – – – – – – -
Iteration 1 : recall score = 0.55223880597
Iteration 2 : recall score = 0.616438356164
Iteration 3 : recall score = 0.733333333333
Iteration 4 : recall score = 0.615384615385
Iteration 5 : recall score = 0.575
Mean recall score 0.61847902217
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Best model to choose from cross validation is with C parameter = 10.0
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
As you can see, modeling directly on the extremely imbalanced data gives poor results across the board, so preprocessing the data is essential.
The data sets the upper limit; the parameters set the lower limit.
Let's still look at the confusion matrix for this model:

    lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
    lr.fit(X_train, y_train.values.ravel())
    y_pred = lr.predict(X_test)
    cnf_matrix = confusion_matrix(y_test, y_pred)
    np.set_printoptions(precision=2)
    print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
    class_names = [0, 1]
    plt.figure()
    plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
    plt.show()

The result shows fewer false positives, but many fraudulent transactions go undetected.
So far we have used the default threshold of 0.5 on the sigmoid output. What happens to the results if we choose the threshold ourselves?

    lr = LogisticRegression(C=0.01, penalty='l1', solver='liblinear')
    lr.fit(X_train_undersample, y_train_undersample.values.ravel())
    # This time compute predicted probabilities instead of class labels
    y_pred_undersample_proba = lr.predict_proba(X_test_undersample)
    # Thresholds to try
    thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
    plt.figure(figsize=(10, 10))
    # Compare the predicted probability of fraud against each threshold
    j = 1
    for i in thresholds:
        y_test_predictions_high_recall = y_pred_undersample_proba[:, 1] > i
        # Draw into a 3x3 grid of subplots
        plt.subplot(3, 3, j)
        j += 1
        # Compute the confusion matrix
        cnf_matrix = confusion_matrix(y_test_undersample, y_test_predictions_high_recall)
        np.set_printoptions(precision=2)
        print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
        # Plot the non-normalized confusion matrix
        class_names = [0, 1]
        plot_confusion_matrix(cnf_matrix, classes=class_names, title='Threshold > %s' % i)
Recall metric in the testing dataset: 1.0
Recall metric in the testing dataset: 1.0
Recall metric in the testing dataset: 1.0
Recall metric in the testing dataset: 0.986394557823
Recall metric in the testing dataset: 0.925170068027
Recall metric in the testing dataset: 0.863945578231
Recall metric in the testing dataset: 0.829931972789
Recall metric in the testing dataset: 0.748299319728
Recall metric in the testing dataset: 0.585034013605
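At thresholds from 0.1 to 0.3 recall is 1.0, which means the model is flagging far too aggressively: it catches every fraud only because it treats almost anything as fraud. As the threshold rises, the model flags fewer transactions and recall falls. The threshold has to be chosen according to the actual business requirements, weighing missed frauds against false alarms and picking the model with the lowest overall cost.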
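The trade-off can be seen on a handful of made-up scores. In this sketch the probabilities and labels are invented for illustration, and `recall_and_false_alarms` is a hypothetical helper rather than a scikit-learn function; it applies the same `probability > threshold` rule as the code above.

```python
def recall_and_false_alarms(probs, labels, threshold):
    """Return (recall, number of false positives) at the given threshold."""
    preds = [p > threshold for p in probs]
    tp = sum(1 for pred, y in zip(preds, labels) if pred and y == 1)
    fn = sum(1 for pred, y in zip(preds, labels) if not pred and y == 1)
    fp = sum(1 for pred, y in zip(preds, labels) if pred and y == 0)
    return tp / (tp + fn), fp

# Invented fraud probabilities and true labels (1 = fraud)
probs = [0.05, 0.20, 0.35, 0.45, 0.55, 0.65, 0.80, 0.95]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

results = {t: recall_and_false_alarms(probs, labels, t) for t in (0.1, 0.5, 0.9)}
for threshold, (rec, false_alarms) in results.items():
    print(threshold, rec, false_alarms)
```

On this toy data a threshold of 0.1 gives recall 1.0 with 3 false alarms, 0.5 gives recall 0.75 with 1 false alarm, and 0.9 gives recall 0.25 with none, which mirrors the pattern in the notebook output.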
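Oversampling: the SMOTE sample generation strategy
Since undersampling has its limits and produces so many false positives, what about oversampling? Oversampling raises an immediate question: how do we generate the data?
A standard recipe in machine learning is the SMOTE sample generation strategy.
For each minority-class sample x, compute its distance to the other minority-class samples and find its k nearest neighbours (for example, the nearest 5). For a chosen neighbour, take the difference between that neighbour and x, multiply it by a random number between 0 and 1, and add the result to x to get a new sample. In effect, each new sample is a slightly perturbed version of an existing one. How many new samples are generated depends on how many times the minority class needs to grow: to turn 100 minority samples into 500, the class has to become 5 times as large.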
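The interpolation step described above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the imbalanced-learn implementation: the function name `smote_sample`, its parameters, and the six toy points are all invented for the example.

```python
import numpy as np

def smote_sample(minority, k=5, n_new=10, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    # Pairwise Euclidean distances within the minority class
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))      # pick a minority sample x
        j = neighbours[i, rng.integers(k)]   # pick one of its k neighbours
        gap = rng.random()                   # random number in [0, 1)
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)

# Six toy minority points inside the unit square (invented data)
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
new_points = smote_sample(minority, k=3, n_new=4)
print(new_points.shape)
print(new_points)
```

Because every synthetic point lies on the line segment between two existing minority samples, the new points stay inside the region the minority class already occupies. The sketch requires k to be smaller than the number of minority samples.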
    import pandas as pd
    from imblearn.over_sampling import SMOTE
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split

    credit_cards = pd.read_csv('creditcard.csv')
    columns = credit_cards.columns
    # The last column is Class; drop it to get the feature columns
    features_columns = columns.delete(len(columns) - 1)
    features = credit_cards[features_columns]
    labels = credit_cards['Class']

    features_train, features_test, labels_train, labels_test = train_test_split(
        features, labels, test_size=0.2, random_state=0)

Generate the new data:

    oversampler = SMOTE(random_state=0)
    # fit_resample replaces the older fit_sample name in current imbalanced-learn releases
    os_features, os_labels = oversampler.fit_resample(features_train, labels_train)

Check the result:

    len(os_labels[os_labels == 1])
227454
    os_features = pd.DataFrame(os_features)
    os_labels = pd.DataFrame(os_labels)
    best_c = printing_Kfold_scores(os_features, os_labels)
– – – – – – – – – – – – – – – – – – – – – -
C parameter: 0.01
– – – – – – – – – – – – – – – – – – – – – -
Iteration 1 : recall score = 0.890322580645
Iteration 2 : recall score = 0.894736842105
Iteration 3 : recall score = 0.968794954078
Iteration 4 : recall score = 0.957760411514
Iteration 5 : recall score = 0.958266011585
Mean recall score 0.933976159985
– – – – – – – – – – – – – – – – – – – – – -
C parameter: 0.1
– – – – – – – – – – – – – – – – – – – – – -
Iteration 1 : recall score = 0.890322580645
Iteration 2 : recall score = 0.894736842105
Iteration 3 : recall score = 0.970432665708
Iteration 4 : recall score = 0.960046603137
Iteration 5 : recall score = 0.957650498456
Mean recall score 0.93463783801
– – – – – – – – – – – – – – – – – – – – – -
C parameter: 1
– – – – – – – – – – – – – – – – – – – – – -
Iteration 1 : recall score = 0.890322580645
Iteration 2 : recall score = 0.894736842105
Iteration 3 : recall score = 0.970432665708
Iteration 4 : recall score = 0.960321385784
Iteration 5 : recall score = 0.960750046713
Mean recall score 0.935312704191
– – – – – – – – – – – – – – – – – – – – – -
C parameter: 10
– – – – – – – – – – – – – – – – – – – – – -
Iteration 1 : recall score = 0.890322580645
Iteration 2 : recall score = 0.894736842105
Iteration 3 : recall score = 0.970499059422
Iteration 4 : recall score = 0.960211472725
Iteration 5 : recall score = 0.96009056836
Mean recall score 0.935172104652
– – – – – – – – – – – – – – – – – – – – – -
C parameter: 100
– – – – – – – – – – – – – – – – – – – – – -
Iteration 1 : recall score = 0.890322580645
Iteration 2 : recall score = 0.894736842105
Iteration 3 : recall score = 0.970543321899
Iteration 4 : recall score = 0.960398324925
Iteration 5 : recall score = 0.956903089656
Mean recall score 0.934580831846
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Best model to choose from cross validation is with C parameter = 1.0
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
    lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
    lr.fit(os_features, os_labels.values.ravel())
    y_pred = lr.predict(features_test)
    # Compute the confusion matrix
    cnf_matrix = confusion_matrix(labels_test, y_pred)
    np.set_printoptions(precision=2)
    print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
    # Plot the non-normalized confusion matrix
    class_names = [0, 1]
    plt.figure()
    plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
    plt.show()

Judging from the result, the share of false positives is clearly much smaller than with undersampling. In other words, of the approaches tried here, the oversampling strategy gives the best model.