
NLP Data Preprocessing (English and Chinese) & Recall, Precision, ROC Curve, AUC, PR Curve


I. Preprocessing the English IMDB dataset

Reference: the TensorFlow tutorial "影评文本分类" (Text classification with movie reviews)

1. Downloading the data

We will use the IMDB dataset, which contains 50,000 movie-review texts from the Internet Movie Database. We split these reviews into a training set (25,000 reviews) and a test set (25,000 reviews). The training and test sets are balanced, meaning they contain an equal number of positive and negative reviews.

This notebook uses tf.keras, a high-level API for building and training models in TensorFlow. For a more advanced text-classification tutorial using tf.keras, see the MLCC Text Classification Guide.

import tensorflow as tf
from tensorflow import keras
import numpy as np

# Download the dataset
imdb = keras.datasets.imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17465344/17464789 [==============================] - 0s 0us/step

The argument num_words=10000 keeps the 10,000 most frequently occurring words in the training data. To keep the data at a manageable size, rare words are discarded.
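As a quick sanity check (not part of the original tutorial), we can confirm that every word index in the loaded data stays below the num_words limit:

# Illustrative check: with num_words=10000, no index in the data should reach 10000
print(max(max(seq) for seq in train_data))  # should print a value below 10000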

2. Exploring the data

Let's take a moment to understand the format of the data. The dataset has already been preprocessed: each example is an array of integers representing the words of the review. Each label is an integer value of 0 or 1, where 0 means a negative review and 1 means a positive review.

print("Training entries: {}, labels: {}".format(len(train_data), len(train_labels)))
Training entries: 25000, labels: 25000

Next, check the number of classes and the class labels:

set(train_labels)

{0, 1}

The review texts have been converted to integers, where each integer represents a specific word in a dictionary. The first review looks like this:

print(train_data[0])
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]

Note that the reviews may be of different lengths:

len(train_data[0]), len(train_data[1])
(218, 189)
3. Converting the integers back to words
# A dictionary mapping words to integer indices
word_index = imdb.get_word_index()

word_index = {k:(v+3) for k,v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

Display the text of the first review:

decode_review(train_data[0])
" this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert  is an amazing actor and now the same being director  father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for  and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also  to the two little boy's that played the  of norman and paul they were just brilliant children are often left out of the  list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"
4. Preparing the data

Pad the arrays so that they all have the same length, then create an integer tensor of shape max_length * num_reviews. We can use an embedding layer capable of handling this shape as the first layer of the network.

train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                         value=word_index["<PAD>"],
                                                         padding='post',
                                                         maxlen=256)

test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=word_index["<PAD>"],
                                                       padding='post',
                                                       maxlen=256)
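After padding, all reviews have the same length. A quick check (not in the original tutorial) confirms this:

# Illustrative check: both reviews are now padded/truncated to maxlen=256
print(len(train_data[0]), len(train_data[1]))  # expected output: 256 256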

II. Preprocessing the Chinese THUCNews dataset

Reference: CNN与RNN中文文本分类-基于TENSORFLOW实现 (CNN and RNN Chinese text classification based on TensorFlow)

1. Reading the data

import pandas as pd
import numpy as np

# Use raw strings so the Windows backslashes are not interpreted as escape sequences
train_data = pd.read_csv(r'D:\Competition_Data\cnews\cnews.train.txt', sep='\t', engine='python',
                         names=['label', 'content'], encoding='UTF-8')
test_data = pd.read_csv(r'D:\Competition_Data\cnews\cnews.test.txt', sep='\t', engine='python',
                        names=['label', 'content'], encoding='UTF-8')
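As an illustrative check (not in the original write-up), we can peek at the parsed columns to make sure each line was split into a label and a content field:

print(train_data.head())   # first few rows: one category label and one news article per row
print(train_data.columns)  # Index(['label', 'content'], dtype='object')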

2. Exploring the data

Check the size of the data:

print("Training entries: {}".format(len(train_data)))
Training entries: 50000

Explore the label categories and the number of samples in each:

train_data['label'].value_counts()
娱乐    5000
时政    5000
科技    5000
游戏    5000
教育    5000
财经    5000
家居    5000
房产    5000
体育    5000
时尚    5000
Name: label, dtype: int64

Explore the length of the texts:

train_data['content_length'] = train_data['content'].apply(lambda x:len(x))
train_data['content_length'].describe()
count    50000.000000
mean       913.317160
std        930.085314
min          8.000000
25%        350.000000
50%        688.000000
75%       1154.000000
max      27467.000000
Name: content_length, dtype: float64

3. Data preprocessing

import sys
from collections import Counter

import numpy as np
import tensorflow.keras as kr

# Flag used to keep the helpers compatible with both Python 2 and Python 3
is_py3 = sys.version_info[0] == 3


def native_content(content):
    """Return text as a native str: decode bytes under Python 2, pass through under Python 3."""
    if not is_py3:
        return content.decode('utf-8')
    return content


def open_file(filename, mode='r'):
    """
    Common file helper that works under both Python 2 and Python 3.
    mode: 'r' or 'w' for read or write
    """
    if is_py3:
        return open(filename, mode, encoding='utf-8', errors='ignore')
    else:
        return open(filename, mode)


def read_file(filename):
    """Read the file contents."""
    contents, labels = [], []
    with open_file(filename) as f:
        for line in f:
            try:
                label, content = line.strip().split('\t')
                if content:
                    contents.append(list(native_content(content)))
                    labels.append(native_content(label))
            except:
                pass
    return contents, labels


def build_vocab(train_dir, vocab_dir, vocab_size=5000):
    """Build the vocabulary from the training set and store it."""
    data_train, _ = read_file(train_dir)

    all_data = []
    for content in data_train:
        all_data.extend(content)

    counter = Counter(all_data)
    count_pairs = counter.most_common(vocab_size - 1)
    words, _ = list(zip(*count_pairs))
    # Add a <PAD> token so that all texts can be padded to the same length
    words = ['<PAD>'] + list(words)
    open_file(vocab_dir, mode='w').write('\n'.join(words) + '\n')


def read_vocab(vocab_dir):
    """Read the vocabulary."""
    # words = open_file(vocab_dir).read().strip().split('\n')
    with open_file(vocab_dir) as fp:
        # Under Python 2, every value is converted to unicode
        words = [native_content(_.strip()) for _ in fp.readlines()]
    word_to_id = dict(zip(words, range(len(words))))
    return words, word_to_id


def read_category():
    """Read the (fixed) list of categories."""
    categories = ['体育', '财经', '房产', '家居', '教育', '科技', '时尚', '时政', '游戏', '娱乐']

    categories = [native_content(x) for x in categories]

    cat_to_id = dict(zip(categories, range(len(categories))))

    return categories, cat_to_id


def to_words(content, words):
    """Convert id-encoded content back to text."""
    return ''.join(words[x] for x in content)


def process_file(filename, word_to_id, cat_to_id, max_length=600):
    """Convert a file to its id representation."""
    contents, labels = read_file(filename)

    data_id, label_id = [], []
    for i in range(len(contents)):
        data_id.append([word_to_id[x] for x in contents[i] if x in word_to_id])
        label_id.append(cat_to_id[labels[i]])

    # Use the pad_sequences provided by keras to pad the texts to a fixed length
    x_pad = kr.preprocessing.sequence.pad_sequences(data_id, max_length)
    y_pad = kr.utils.to_categorical(label_id, num_classes=len(cat_to_id))  # one-hot encode the labels

    return x_pad, y_pad


def batch_iter(x, y, batch_size=64):
    """Generate batches of data."""
    data_len = len(x)
    num_batch = int((data_len - 1) / batch_size) + 1

    indices = np.random.permutation(np.arange(data_len))
    x_shuffle = x[indices]
    y_shuffle = y[indices]

    for i in range(num_batch):
        start_id = i * batch_size
        end_id = min((i + 1) * batch_size, data_len)
        yield x_shuffle[start_id:end_id], y_shuffle[start_id:end_id]
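A minimal usage sketch, not part of the original post, showing how these helpers fit together; the file paths below are assumptions and should point at your local copy of the THUCNews subset:

# Assumed paths, adjust to your local data
train_dir = r'D:\Competition_Data\cnews\cnews.train.txt'
vocab_dir = r'D:\Competition_Data\cnews\cnews.vocab.txt'

build_vocab(train_dir, vocab_dir, vocab_size=5000)   # build and save the character vocabulary
categories, cat_to_id = read_category()              # fixed list of 10 categories
words, word_to_id = read_vocab(vocab_dir)            # load the vocabulary and its index mapping

x_train, y_train = process_file(train_dir, word_to_id, cat_to_id, max_length=600)
for x_batch, y_batch in batch_iter(x_train, y_train, batch_size=64):
    pass  # feed each batch to the model here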

III. Recall, precision, ROC curve, AUC, and the PR curve

Recall and precision:

(Excerpted from the "Watermelon Book", i.e. Zhou Zhihua's Machine Learning)
For a binary classification problem, the examples can be divided, according to the combination of their true class and the class predicted by the learner, into four cases: true positive, false positive, true negative, and false negative. Let TP, FP, TN and FN denote the corresponding numbers of examples; clearly TP + FP + TN + FN = the total number of examples. The classification result can be shown as a confusion matrix:

                     Predicted positive      Predicted negative
Actual positive      TP (true positive)      FN (false negative)
Actual negative      FP (false positive)     TN (true negative)

Precision P and recall R are defined as:
$P = \frac{TP}{TP + FP}$
$R = \frac{TP}{TP + FN}$
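As a tiny worked example (the counts are invented for illustration):

# Hypothetical confusion-matrix counts, purely for illustration
TP, FP, FN, TN = 40, 10, 20, 30

precision = TP / (TP + FP)   # 40 / 50 = 0.8
recall = TP / (TP + FN)      # 40 / 60 ≈ 0.667

print(precision, recall)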

Recall and precision are in tension with each other: generally, when precision is high, recall tends to be low, and when recall is high, precision tends to be low.

PR curve:
In many cases we rank the examples according to the learner's predictions: the examples the learner considers "most likely" to be positive come first, and those it considers "least likely" to be positive come last. Treating the examples one by one, in this order, as positive predictions, we can compute the current recall and precision at each step. Plotting precision on the vertical axis against recall on the horizontal axis gives the precision-recall curve, or "P-R curve" for short (figure taken from https://blog.csdn.net/b876144622/article/details/80009867).

The P-R plot gives an intuitive picture of a learner's recall and precision over the whole sample set. When comparing two learners, if one learner's P-R curve is completely enclosed by another learner's curve, we can assert that the latter performs better than the former.
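A minimal sketch of how a P-R curve can be computed with scikit-learn's precision_recall_curve, using toy labels and scores invented for illustration:

from sklearn.metrics import precision_recall_curve

# Toy labels and predicted scores, purely for illustration
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
for p, r in zip(precision, recall):
    print('precision=%.3f  recall=%.3f' % (p, r))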

ROC

The ROC curve is commonly used to compare models on binary classification problems; it expresses a trade-off between the true positive rate (TPR) and the false positive rate (FPR). Concretely, for a range of classification thresholds we plot TPR on the vertical axis against FPR on the horizontal axis. From the two quantities that define the ROC curve, $TPR = \frac{TP}{P} = \frac{TP}{TP + FN}$ and $FPR = \frac{FP}{N} = \frac{FP}{FP + TN}$, we can see that when a sample is classified as positive, TPR increases if the sample is actually positive, and FPR increases if it is actually negative. The ROC curve can therefore be seen as a "contest" between the positives and the negatives among all samples as the threshold moves. The closer the curve is to the top-left corner, the more positives are ranked ahead of negatives, and the better the model's overall performance.
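A minimal sketch using scikit-learn's roc_curve on the same toy data (illustrative only):

from sklearn.metrics import roc_curve

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print('threshold=%.2f  FPR=%.3f  TPR=%.3f' % (th, f, t))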

AUC

First look at the random-guess line on the ROC plot, the dashed line from [0, 0] to [1, 1]. Every point on this line represents a threshold at which TPR = FPR. By definition, $TPR = \frac{TP}{P}$ is the probability that a positive example is predicted as positive, and $FPR = \frac{FP}{N}$ is the probability that a negative example is predicted as positive. If the two are equal, then regardless of whether a sample is actually positive or negative, the classifier predicts it as positive with the same probability, which is equivalent to random guessing (note that "random" here does not mean the 50/50 randomness of a coin toss).

Point B in the figure above is such a random point: no matter how the sample size or the class distribution changes, it always classifies 75% of the samples as positive.

The area enclosed by the ROC curve (the AUC) can be interpreted as follows: draw a sample A at random from all positives and a sample B at random from all negatives; the AUC is the probability that the classifier rates A as more likely to be positive than B. Points above the random line (such as point A in the figure) are considered better than random guessing: at such points TPR is always greater than FPR, meaning the probability that a positive is classified as positive is greater than the probability that a negative is classified as positive.

Seen from another angle, since the ROC curve is drawn by first sorting all samples by the classifier's predicted probability, the AUC reflects the classifier's ability to rank the samples; in terms of the example above, it is the probability that A is ranked ahead of B. The larger the AUC, the better the ranking ability, i.e. the more positives the classifier ranks ahead of the negatives.
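A small sketch (again on toy data) that checks this interpretation numerically: the fraction of (positive, negative) pairs in which the positive receives the higher score matches roc_auc_score:

from itertools import product
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]

pos = [s for s, y in zip(y_score, y_true) if y == 1]
neg = [s for s, y in zip(y_score, y_true) if y == 0]

# Fraction of positive/negative pairs ranked correctly (ties count as half)
pairs = [(p > n) + 0.5 * (p == n) for p, n in product(pos, neg)]
print(sum(pairs) / len(pairs))          # pairwise ranking probability
print(roc_auc_score(y_true, y_score))   # same value from scikit-learn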
