
Credit Card Fraud Detection with Logistic Regression

2017-10-30
This post uses logistic regression to detect credit card fraud on a competition dataset (already anonymised), working in a Python Jupyter Notebook.

Exploring the data

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline


Load the data and look at the first 5 rows:

data = pd.read_csv("creditcard.csv")
data.head()




The data has 31 columns: Time, V1-V28, Amount and Class. The last column, Class, is our label: 0 means a normal transaction and 1 means fraud. As usual, let's start by plotting the class distribution.

count_classes = pd.value_counts(data['Class'], sort = True).sort_index()
count_classes.plot(kind = 'bar')
plt.title("Fraud class histogram")
plt.xlabel("Class")
plt.ylabel("Frequency")




We can see that there are roughly 280,000 samples with Class = 0, while the fraud samples (Class = 1) are extremely rare; the distribution is severely imbalanced.

There are two common ways to deal with this (a quick look at the actual ratio follows the list):

1. Oversampling (generate more class-1 samples until there are as many as class 0);

2. Undersampling (take a subset of class 0 equal in size to class 1).
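Before choosing between the two, it helps to see just how skewed the data is. A minimal sketch (assuming `data` is the DataFrame loaded above):

# Quick check of the imbalance (assumes `data` from above)
class_counts = data['Class'].value_counts()
print(class_counts)              # absolute count per class
print(class_counts / len(data))  # fraud is only a tiny fraction of all rows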

Standardization

Among the features, Amount has a much larger value range than all the others; it apparently has not been standardized yet. So we standardize this column first:

from sklearn.preprocessing import StandardScaler
# Standardize the Amount column; reshape(-1, 1) turns it into a single column
# (-1 lets numpy infer the number of rows, 1 is the number of columns)
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Time', 'Amount'], axis=1)  # drop the two columns we no longer need
data.head()




At this point all feature columns are on a standardized scale.

Random undersampling

Undersampling is the simpler of the two, so we start with it. First, split out the features and the label:

X = data.loc[:, data.columns != 'Class']  # features (every column whose name is not 'Class')
y = data.loc[:, data.columns == 'Class']  # label


To preserve the original distribution of the normal transactions, we sample them at random:

# Random undersampling
# Count the fraud samples (Class == 1) and grab their indices
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)

# Indices of the normal samples (Class == 0)
normal_indices = data[data.Class == 0].index

random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace = False)  # sample as many normal indices as there are frauds, without replacement
random_normal_indices = np.array(random_normal_indices)  # convert to a numpy array

# Concatenate the two index sets into the new, balanced index
under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])

# Build the undersampled dataset from those indices
under_sample_data = data.iloc[under_sample_indices,:]

# Split the undersampled data into features and label
X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']

# Show the class ratio
print("Percentage of normal transactions: ", len(under_sample_data[under_sample_data.Class == 0])/len(under_sample_data))
print("Percentage of fraud transactions: ", len(under_sample_data[under_sample_data.Class == 1])/len(under_sample_data))
print("Total number of transactions in resampled data: ", len(under_sample_data))




Train/test split

Split the dataset into a training set and a test set:

from sklearn.model_selection import train_test_split

# Split the full dataset; 30% goes to the test set.
# random_state = 0 keeps the split reproducible, so parameter tuning stays comparable.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

print("Number transactions train dataset: ", len(X_train))
print("Number transactions test dataset: ",  len(X_test))
print("Total number of transactions: ", len(X_train)+len(X_test))

# Split the undersampled dataset the same way
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(X_undersample, y_undersample, test_size = 0.3, random_state = 0)
print("")
print("Number transactions train dataset: ", len(X_train_undersample))
print("Number transactions test dataset: ", len(X_test_undersample))
print("Total number of transactions: ", len(X_train_undersample)+len(X_test_undersample))




Evaluation metric

Before building the model, let's decide which parameters to tune and which metric to evaluate against.

TP (true positives): the number of positive samples correctly classified as positive

FN (false negatives): the number of positive samples incorrectly classified as negative

FP (false positives): the number of negative samples incorrectly classified as positive

TN (true negatives): the number of negative samples correctly classified as negative

Since we want to find as many of the fraudulent transactions as possible, the key metric is:

Recall = TP / (TP + FN)

Suppose there are 10 fraudulent records among 1,000 credit card transactions. Unlike accuracy, recall only cares about those 10: if the model finds 3 of them, the recall is 0.3. (The same toy example is reproduced in code below.)
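The toy numbers can be checked with sklearn's metrics; this is purely an illustration, not the competition data:

# Toy example: 10 fraud cases among 1,000 transactions, the model only catches 3
from sklearn.metrics import recall_score, precision_score
import numpy as np

y_true = np.array([1]*10 + [0]*990)   # 10 positives, 990 negatives
y_pred = np.array([1]*3 + [0]*997)    # only the first 3 positives are flagged
print(recall_score(y_true, y_pred))     # 3 / (3 + 7) = 0.3
print(precision_score(y_true, y_pred))  # 3 / 3 = 1.0, which is why recall is the metric we watch here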

Modeling

Now for the modeling. We usually don't know in advance which parameter values will work best, so the easiest approach is to write a routine that tries each candidate automatically and lets us pick based on the results.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score  # sklearn.cross_validation was removed in newer versions
from sklearn.metrics import confusion_matrix, recall_score, classification_report


# Train the model: instantiate a logistic regression with different regularization strengths,
# use cross-validation to find the most suitable parameter, and print every result

def printing_Kfold_scores(x_train_data, y_train_data):

    fold = KFold(n_splits=5, shuffle=False)  # 5-fold cross-validation

    # Regularization strengths to try; C controls the penalty and hence overfitting
    c_param_range = [0.01, 0.1, 1, 10, 100]

    results_table = pd.DataFrame(index=range(len(c_param_range)), columns=['C_parameter', 'Mean recall score'])
    results_table['C_parameter'] = c_param_range

    # each k-fold split yields two index arrays: training indices and validation indices
    j = 0
    # outer loop: iterate over the regularization parameter
    for c_param in c_param_range:
        print('-------------------------------------------')
        print('C parameter: ', c_param)
        print('-------------------------------------------')
        print('')

        recall_accs = []
        # inner loop: iterate over the cross-validation folds
        for iteration, (train_idx, val_idx) in enumerate(fold.split(x_train_data), start=1):

            # Build the logistic regression model with the current penalty strength;
            # the penalty here is L1 (L2 would also work), which requires the liblinear solver
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')

            # Fit on the training folds
            lr.fit(x_train_data.iloc[train_idx, :], y_train_data.iloc[train_idx, :].values.ravel())

            # Predict on the held-out validation fold
            y_pred_undersample = lr.predict(x_train_data.iloc[val_idx, :].values)

            # Compute recall from the predicted and true labels, and print it
            recall_acc = recall_score(y_train_data.iloc[val_idx, :].values, y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration ', iteration, ': recall score = ', recall_acc)

        # Average recall over the folds for this C value
        results_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')

    results_table['Mean recall score'] = results_table['Mean recall score'].astype('float64')
    best_c = results_table.loc[results_table['Mean recall score'].idxmax()]['C_parameter']

    # Finally, pick the C value with the best mean recall
    print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter = ', best_c)
    print('*********************************************************************************')

    return best_c


best_c = printing_Kfold_scores(X_train_undersample, y_train_undersample)


-------------------------------------------
C parameter:  0.01
-------------------------------------------

Iteration  1 : recall score =  0.958904109589
Iteration  2 : recall score =  0.917808219178
Iteration  3 : recall score =  1.0
Iteration  4 : recall score =  0.972972972973
Iteration  5 : recall score =  0.954545454545

Mean recall score  0.960846151257

-------------------------------------------
C parameter:  0.1
-------------------------------------------

Iteration  1 : recall score =  0.835616438356
Iteration  2 : recall score =  0.86301369863
Iteration  3 : recall score =  0.915254237288
Iteration  4 : recall score =  0.932432432432
Iteration  5 : recall score =  0.878787878788

Mean recall score  0.885020937099

-------------------------------------------
C parameter:  1
-------------------------------------------

Iteration  1 : recall score =  0.835616438356
Iteration  2 : recall score =  0.86301369863
Iteration  3 : recall score =  0.966101694915
Iteration  4 : recall score =  0.945945945946
Iteration  5 : recall score =  0.893939393939

Mean recall score  0.900923434357

-------------------------------------------
C parameter:  10
-------------------------------------------

Iteration  1 : recall score =  0.849315068493
Iteration  2 : recall score =  0.86301369863
Iteration  3 : recall score =  0.966101694915
Iteration  4 : recall score =  0.959459459459
Iteration  5 : recall score =  0.893939393939

Mean recall score  0.906365863087

-------------------------------------------
C parameter:  100
-------------------------------------------

Iteration  1 : recall score =  0.86301369863
Iteration  2 : recall score =  0.86301369863
Iteration  3 : recall score =  0.966101694915
Iteration  4 : recall score =  0.959459459459
Iteration  5 : recall score =  0.893939393939

Mean recall score  0.909105589115

*********************************************************************************
Best model to choose from cross validation is with C parameter =  0.01
*********************************************************************************

From the results above, the best mean recall so far is about 0.96.

Next, let's draw a confusion matrix plot to make the result more intuitive:

def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')


import itertools
lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample.values)

# Compute the confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample, y_pred_undersample)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot the non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix,
                      classes=class_names,
                      title='Confusion matrix')
plt.show()




The plot makes things clear at a glance: 138 of the actual fraud cases were caught, 9 slipped through, and 17 normal transactions were wrongly flagged. A recall of about 0.93 looks good, but is this really the result we want? Not quite: this confusion matrix was computed on the undersampled test data only.

Next, let's draw the confusion matrix on the original (full) test set and see how it looks:

lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred = lr.predict(X_test.values)

# Compute the confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot the non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix,
                      classes=class_names,
                      title='Confusion matrix')
plt.show()




Here a serious problem shows up: the number of falsely flagged normal transactions jumps to 10,318, which would have a real impact on the business. Why does this happen? The model was trained on the undersampled data, which contains very few samples of either class; with such a limited sample this kind of behaviour is not surprising. (A precision check below puts a number on the false alarms.)
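Here is a small sketch that also reports precision for the same model on the full test set; it simply reuses `y_test` and `y_pred` from the cell above:

# Sketch: recall vs. precision of the undersampling model on the full test set
from sklearn.metrics import recall_score, precision_score

print("Recall:    ", recall_score(y_test.values.ravel(), y_pred))
print("Precision: ", precision_score(y_test.values.ravel(), y_pred))  # very low, because of the ~10k false positives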

So how do we fix this?

First, what if we had skipped the resampling entirely and trained on the raw data? Would that give a good result?

best_c = printing_Kfold_scores(X_train,y_train)


-------------------------------------------
C parameter:  0.01
-------------------------------------------

Iteration  1 : recall score =  0.492537313433
Iteration  2 : recall score =  0.602739726027
Iteration  3 : recall score =  0.683333333333
Iteration  4 : recall score =  0.569230769231
Iteration  5 : recall score =  0.45

Mean recall score  0.559568228405

-------------------------------------------
C parameter:  0.1
-------------------------------------------

Iteration  1 : recall score =  0.567164179104
Iteration  2 : recall score =  0.616438356164
Iteration  3 : recall score =  0.683333333333
Iteration  4 : recall score =  0.584615384615
Iteration  5 : recall score =  0.525

Mean recall score  0.595310250644

-------------------------------------------
C parameter:  1
-------------------------------------------

Iteration  1 : recall score =  0.55223880597
Iteration  2 : recall score =  0.616438356164
Iteration  3 : recall score =  0.716666666667
Iteration  4 : recall score =  0.615384615385
Iteration  5 : recall score =  0.5625

Mean recall score  0.612645688837

-------------------------------------------
C parameter:  10
-------------------------------------------

Iteration  1 : recall score =  0.55223880597
Iteration  2 : recall score =  0.616438356164
Iteration  3 : recall score =  0.733333333333
Iteration  4 : recall score =  0.615384615385
Iteration  5 : recall score =  0.575

Mean recall score  0.61847902217

-------------------------------------------
C parameter:  100
-------------------------------------------

Iteration  1 : recall score =  0.55223880597
Iteration  2 : recall score =  0.616438356164
Iteration  3 : recall score =  0.733333333333
Iteration  4 : recall score =  0.615384615385
Iteration  5 : recall score =  0.575

Mean recall score  0.61847902217

*********************************************************************************
Best model to choose from cross validation is with C parameter =  10.0
*********************************************************************************

As you can see, training directly on the severely imbalanced data gives poor recall across the board, so preprocessing the data really is necessary.

The data determines the upper bound; the parameters determine the lower bound.

Let's still look at this model's confusion matrix:

lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train, y_train.values.ravel())
y_pred = lr.predict(X_test.values)

cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix,
                      classes=class_names,
                      title='Confusion matrix')
plt.show()




The result shows far fewer false alarms, but now many fraud cases go undetected.

So far we have used the default decision threshold of 0.5 on the sigmoid output. What happens if we set the threshold ourselves?

lr = LogisticRegression(C=0.01, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)  # predict probabilities instead of hard labels
# Candidate thresholds
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

plt.figure(figsize=(10, 10))

# Compare the predicted probability against each threshold
j = 1
for i in thresholds:
    y_test_predictions_high_recall = y_pred_undersample_proba[:, 1] > i

    # Draw a 3x3 grid of subplots
    plt.subplot(3, 3, j)
    j += 1

    # Compute confusion matrix
    cnf_matrix = confusion_matrix(y_test_undersample, y_test_predictions_high_recall)
    np.set_printoptions(precision=2)

    print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

    # Plot non-normalized confusion matrix
    class_names = [0, 1]
    plot_confusion_matrix(cnf_matrix,
                          classes=class_names,
                          title='Threshold >= %s' % i)


Recall metric in the testing dataset: 1.0

Recall metric in the testing dataset: 1.0

Recall metric in the testing dataset: 1.0

Recall metric in the testing dataset: 0.986394557823

Recall metric in the testing dataset: 0.925170068027

Recall metric in the testing dataset: 0.863945578231

Recall metric in the testing dataset: 0.829931972789

Recall metric in the testing dataset: 0.748299319728

Recall metric in the testing dataset: 0.585034013605



With a threshold of 0.1-0.3 the recall is 1, which means the model flags transactions far too aggressively; as the threshold grows, the criterion for flagging fraud becomes more and more lenient. In practice you have to weigh the trade-offs against the actual business requirements and pick the threshold with the lowest overall cost (see the sketch below).
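One rough way to weigh those trade-offs is to print recall and precision side by side for each threshold, reusing `lr`, `X_test_undersample` and `y_test_undersample` from the loop above; which point is "best" still depends on the cost the business assigns to each kind of error:

# Sketch: recall/precision per threshold (assumes lr and the undersampled test set from above)
from sklearn.metrics import recall_score, precision_score

probs = lr.predict_proba(X_test_undersample.values)[:, 1]
y_true = y_test_undersample.values.ravel()
for t in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    preds = probs > t
    print("threshold %.1f  recall %.3f  precision %.3f"
          % (t, recall_score(y_true, preds), precision_score(y_true, preds)))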

Oversampling: generating samples with SMOTE

Undersampling clearly has its limitations, with a false alarm rate that high. What about oversampling?

Oversampling immediately raises a question: how do we generate the extra data?

The standard recipe in machine learning is the SMOTE sample-generation strategy:



Here k is the multiplication factor: if you have 100 minority-class samples and want 500, k is 5. For each minority sample x, compute its distance to the other minority samples and find its k nearest neighbours; for each neighbour, take the difference to x, multiply it by a random number between 0 and 1, and add it to x itself to obtain a new sample. In effect, every synthetic sample is a small perturbation of an existing one (see the sketch below).
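Written out, the interpolation is x_new = x + delta * (x_neighbor - x), with delta drawn uniformly from [0, 1). A minimal numpy sketch of a single SMOTE step (illustrative values only, not imblearn's internal implementation):

# One SMOTE interpolation step (illustrative values)
import numpy as np

x        = np.array([1.0, 2.0])          # a minority-class sample
neighbor = np.array([1.4, 2.6])          # one of its k nearest minority-class neighbours
delta    = np.random.rand()              # random factor in [0, 1)
x_new    = x + delta * (neighbor - x)    # synthetic sample on the segment between the two
print(x_new)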

import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split


credit_cards=pd.read_csv('creditcard.csv')

columns=credit_cards.columns
# The last column is Class; drop it to keep only the feature columns
features_columns=columns.delete(len(columns)-1)

features=credit_cards[features_columns]
labels=credit_cards['Class']


features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.2, random_state=0)


Generate the synthetic samples. Note that SMOTE is fitted on the training split only, so the test set contains no synthetic data:

oversampler = SMOTE(random_state=0)
os_features, os_labels = oversampler.fit_resample(features_train, labels_train)  # fit_sample in older imblearn versions


Check how many fraud samples there are now:

len(os_labels[os_labels==1])


227454
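A quick check on the resampled arrays confirms that the two classes are now balanced:

# Sketch: after SMOTE both classes should have the same number of samples
print(pd.Series(os_labels).value_counts())   # class 0 and class 1 both around 227454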

os_features = pd.DataFrame(os_features)
os_labels = pd.DataFrame(os_labels)
best_c = printing_Kfold_scores(os_features,os_labels)


-------------------------------------------
C parameter:  0.01
-------------------------------------------

Iteration  1 : recall score =  0.890322580645
Iteration  2 : recall score =  0.894736842105
Iteration  3 : recall score =  0.968794954078
Iteration  4 : recall score =  0.957760411514
Iteration  5 : recall score =  0.958266011585

Mean recall score  0.933976159985

-------------------------------------------
C parameter:  0.1
-------------------------------------------

Iteration  1 : recall score =  0.890322580645
Iteration  2 : recall score =  0.894736842105
Iteration  3 : recall score =  0.970432665708
Iteration  4 : recall score =  0.960046603137
Iteration  5 : recall score =  0.957650498456

Mean recall score  0.93463783801

-------------------------------------------
C parameter:  1
-------------------------------------------

Iteration  1 : recall score =  0.890322580645
Iteration  2 : recall score =  0.894736842105
Iteration  3 : recall score =  0.970432665708
Iteration  4 : recall score =  0.960321385784
Iteration  5 : recall score =  0.960750046713

Mean recall score  0.935312704191

-------------------------------------------
C parameter:  10
-------------------------------------------

Iteration  1 : recall score =  0.890322580645
Iteration  2 : recall score =  0.894736842105
Iteration  3 : recall score =  0.970499059422
Iteration  4 : recall score =  0.960211472725
Iteration  5 : recall score =  0.96009056836

Mean recall score  0.935172104652

-------------------------------------------
C parameter:  100
-------------------------------------------

Iteration  1 : recall score =  0.890322580645
Iteration  2 : recall score =  0.894736842105
Iteration  3 : recall score =  0.970543321899
Iteration  4 : recall score =  0.960398324925
Iteration  5 : recall score =  0.956903089656

Mean recall score  0.934580831846

*********************************************************************************
Best model to choose from cross validation is with C parameter =  1.0
*********************************************************************************

lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(os_features, os_labels.values.ravel())
y_pred = lr.predict(features_test.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(labels_test, y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix,
                      classes=class_names,
                      title='Confusion matrix')
plt.show()




Looking at the result, the proportion of false alarms is much smaller than with undersampling. In other words, of the strategies tried here, the oversampling approach gives the best overall model. (A quick precision check follows below.)
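To back up that claim with a number, a final sketch that also prints precision for the SMOTE-based model, reusing `labels_test` and `y_pred` from the cell above:

# Sketch: recall vs. precision of the SMOTE-based model on the held-out test set
from sklearn.metrics import recall_score, precision_score

print("Recall:    ", recall_score(labels_test, y_pred))
print("Precision: ", precision_score(labels_test, y_pred))  # noticeably higher than with undersampling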