NLP Data Preprocessing (English & Chinese) & Recall, Precision, ROC Curve, AUC, PR Curve
I. Preprocessing the English IMDB dataset
Reference: the TensorFlow tutorial "Text classification with movie reviews"
1. Download the data
We will use the IMDB dataset, which contains 50,000 movie review texts from the Internet Movie Database. We split these reviews into a training set (25,000 reviews) and a test set (25,000 reviews). The training and test sets are balanced, meaning they contain an equal number of positive and negative reviews.
This notebook uses tf.keras, a high-level API for building and training models in TensorFlow. For a more advanced text-classification tutorial using tf.keras, see the MLCC Text Classification Guide.
```python
import tensorflow as tf
from tensorflow import keras
import numpy as np

# Download the dataset
imdb = keras.datasets.imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
```
```
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17465344/17464789 [==============================] - 0s 0us/step
```
The argument num_words=10000 keeps the 10,000 most frequently occurring words in the training data. Rarer words are discarded to keep the data at a manageable size.
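As a quick sanity check (my own addition, not part of the tutorial), we can confirm that no word index reaches this cap:

```python
# With num_words=10000, every word index in the data should be below 10000.
max_index = max(max(sequence) for sequence in train_data)
print(max_index)  # expected output: 9999
```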
2. Explore the data
Let's take a moment to understand the format of the data. The dataset comes preprocessed: each example is an array of integers representing the words of the review. Each label is an integer value of either 0 or 1, where 0 is a negative review and 1 is a positive review.
print("Training entries: {}, labels: {}".format(len(train_data), len(train_labels)))
Training entries: 25000, labels: 25000
Next, look at the number of classes and the class labels:
```python
set(train_labels)
```

```
{0, 1}
```
The review texts have been converted to integers, where each integer represents a specific word in a dictionary. Here is what the first review looks like:
```python
print(train_data[0])
```

```
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
```
Note that the reviews may differ in length:
```python
len(train_data[0]), len(train_data[1])
```

```
(218, 189)
```
3. Convert the integers back to words
```python
# Get the dictionary mapping words to integer indices
word_index = imdb.get_word_index()

# The first indices are reserved for special tokens
word_index = {k: (v + 3) for k, v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
```
Display the text of the first review:
```python
decode_review(train_data[0])
```

```
" this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert is an amazing actor and now the same being director father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also to the two little boy's that played the of norman and paul they were just brilliant children are often left out of the list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"
```
4. Prepare the data
Pad the arrays so they all have the same length, then create an integer tensor of shape num_reviews * max_length. We can use an embedding layer capable of handling this shape as the first layer of the network.
```python
train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        maxlen=256)

test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=word_index["<PAD>"],
                                                       padding='post',
                                                       maxlen=256)
```
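After padding, every review has length 256 and the data forms a single 2-D tensor. A quick check (my own addition, not from the tutorial):

```python
# All sequences now share the same length
print(len(train_data[0]), len(train_data[1]))  # 256 256
# pad_sequences returns a 2-D numpy array: (num_reviews, max_length)
print(train_data.shape)                        # (25000, 256)
```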
II. Preprocessing the Chinese THUCNews dataset
Reference: "Chinese text classification with CNN and RNN, implemented in TensorFlow"
1. Read the data
```python
import pandas as pd
import numpy as np

# Use raw strings so the backslashes in the Windows paths are not
# interpreted as escape sequences.
train_data = pd.read_csv(r'D:\Competition_Data\cnews\cnews.train.txt',
                         sep='\t', engine='python',
                         names=['label', 'content'], encoding='UTF-8')
test_data = pd.read_csv(r'D:\Competition_Data\cnews\cnews.test.txt',
                        sep='\t', engine='python',
                        names=['label', 'content'], encoding='UTF-8')
```
2. Explore the data
Check how many entries there are:
print("Training entries: {}".format(len(train_data)))
Training entries: 50000
Explore the label categories and the number of samples in each:
```python
train_data['label'].value_counts()
```

```
娱乐    5000
时政    5000
科技    5000
游戏    5000
教育    5000
财经    5000
家居    5000
房产    5000
体育    5000
时尚    5000
Name: label, dtype: int64
```
Explore the lengths of the texts:
```python
train_data['content_length'] = train_data['content'].apply(lambda x: len(x))
train_data['content_length'].describe()
```
```
count    50000.000000
mean       913.317160
std        930.085314
min          8.000000
25%        350.000000
50%        688.000000
75%       1154.000000
max      27467.000000
Name: content_length, dtype: float64
```
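Since `process_file` below pads or truncates every text to `max_length=600`, it is worth checking what fraction of the training texts fit entirely within that budget (my own check, not from the referenced project):

```python
# Fraction of training texts at most 600 characters long.
# Given the median of 688 above, expect somewhat under half.
coverage = (train_data['content_length'] <= 600).mean()
print("{:.1%} of texts fit within max_length=600".format(coverage))
```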
3. Data preprocessing
```python
import sys
from collections import Counter

import numpy as np
from tensorflow import keras as kr  # the referenced repo imports tensorflow.contrib.keras as kr

# Python 2/3 compatibility helpers (defined in the referenced repo's cnews_loader.py)
is_py3 = sys.version_info[0] > 2

def native_content(content):
    """Decode to unicode under Python 2; no-op under Python 3."""
    if not is_py3:
        return content.decode('utf-8')
    return content

def open_file(filename, mode='r'):
    """Open a file, working under both Python 2 and Python 3.
    mode: 'r' or 'w' for read or write
    """
    if is_py3:
        return open(filename, mode, encoding='utf-8', errors='ignore')
    else:
        return open(filename, mode)

def read_file(filename):
    """Read a data file: one tab-separated (label, content) pair per line."""
    contents, labels = [], []
    with open_file(filename) as f:
        for line in f:
            try:
                label, content = line.strip().split('\t')
                if content:
                    contents.append(list(native_content(content)))
                    labels.append(native_content(label))
            except:
                pass
    return contents, labels

def build_vocab(train_dir, vocab_dir, vocab_size=5000):
    """Build a vocabulary from the training set and save it to disk."""
    data_train, _ = read_file(train_dir)

    all_data = []
    for content in data_train:
        all_data.extend(content)

    counter = Counter(all_data)
    count_pairs = counter.most_common(vocab_size - 1)
    words, _ = list(zip(*count_pairs))
    # Add a <PAD> token used to pad all texts to the same length
    words = ['<PAD>'] + list(words)
    open_file(vocab_dir, mode='w').write('\n'.join(words) + '\n')

def read_vocab(vocab_dir):
    """Read the vocabulary file."""
    # words = open_file(vocab_dir).read().strip().split('\n')
    with open_file(vocab_dir) as fp:
        # Under Python 2, convert every value to unicode
        words = [native_content(_.strip()) for _ in fp.readlines()]
    word_to_id = dict(zip(words, range(len(words))))
    return words, word_to_id

def read_category():
    """Return the (fixed) list of categories."""
    categories = ['体育', '财经', '房产', '家居', '教育', '科技', '时尚', '时政', '游戏', '娱乐']
    categories = [native_content(x) for x in categories]
    cat_to_id = dict(zip(categories, range(len(categories))))
    return categories, cat_to_id

def to_words(content, words):
    """Convert an id sequence back to text."""
    return ''.join(words[x] for x in content)

def process_file(filename, word_to_id, cat_to_id, max_length=600):
    """Convert a data file to its id representation."""
    contents, labels = read_file(filename)

    data_id, label_id = [], []
    for i in range(len(contents)):
        data_id.append([word_to_id[x] for x in contents[i] if x in word_to_id])
        label_id.append(cat_to_id[labels[i]])

    # Use Keras's pad_sequences to pad the texts to a fixed length
    x_pad = kr.preprocessing.sequence.pad_sequences(data_id, max_length)
    # Convert the labels to one-hot representation
    y_pad = kr.utils.to_categorical(label_id, num_classes=len(cat_to_id))

    return x_pad, y_pad

def batch_iter(x, y, batch_size=64):
    """Generate shuffled batches of data."""
    data_len = len(x)
    num_batch = int((data_len - 1) / batch_size) + 1

    indices = np.random.permutation(np.arange(data_len))
    x_shuffle = x[indices]
    y_shuffle = y[indices]

    for i in range(num_batch):
        start_id = i * batch_size
        end_id = min((i + 1) * batch_size, data_len)
        yield x_shuffle[start_id:end_id], y_shuffle[start_id:end_id]
```
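A short driver showing how these functions chain together (my own sketch, not from the referenced project; the vocabulary path is hypothetical):

```python
import os

train_dir = r'D:\Competition_Data\cnews\cnews.train.txt'
vocab_dir = r'D:\Competition_Data\cnews\cnews.vocab.txt'  # hypothetical output path

# Build the vocabulary once, then reuse it
if not os.path.exists(vocab_dir):
    build_vocab(train_dir, vocab_dir, vocab_size=5000)

categories, cat_to_id = read_category()
words, word_to_id = read_vocab(vocab_dir)

x_train, y_train = process_file(train_dir, word_to_id, cat_to_id, max_length=600)
print(x_train.shape, y_train.shape)  # (50000, 600) (50000, 10)

for x_batch, y_batch in batch_iter(x_train, y_train, batch_size=64):
    pass  # feed each batch to the model here
```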
III. Recall, Precision, ROC Curve, AUC, and PR Curve
Recall and precision:
(Adapted from the "watermelon book", Zhou Zhihua's Machine Learning.)
For a binary classification problem, the examples can be divided, according to the combination of their true class and the class predicted by the learner, into four cases: true positives, false positives, true negatives, and false negatives. Let TP, FP, TN, and FN denote the corresponding numbers of examples; clearly TP + FP + TN + FN = total number of examples. The classification result can be summarized in a confusion matrix:
| | Predicted positive | Predicted negative |
|---|---|---|
| Actually positive | TP (true positive) | FN (false negative) |
| Actually negative | FP (false positive) | TN (true negative) |
Precision P and recall R are defined as:
$$P = \frac{TP}{TP + FP}$$

$$R = \frac{TP}{TP + FN}$$
Precision and recall are a pair of conflicting measures: in general, when precision is high, recall tends to be low, and when recall is high, precision tends to be low.
PR curve:
In many cases we can rank the examples by the learner's predicted score: the examples the learner considers "most likely" to be positive come first, and those it considers "least likely" come last. Going through the examples in this order and treating each one in turn as predicted positive, we can compute the current recall and precision at every step. Plotting precision on the vertical axis against recall on the horizontal axis gives the precision-recall curve, or "P-R curve" for short (figure taken from https://blog.csdn.net/b876144622/article/details/80009867).
The P-R plot directly shows a learner's recall and precision over the whole sample. When comparing learners, if one learner's P-R curve is entirely enclosed by another's, the latter can be asserted to outperform the former.
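A minimal sketch of drawing a P-R curve with scikit-learn (synthetic scores, my own example):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Made-up labels and predicted positive-class scores
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.3, 0.2, 0.1, 0.1])

# precision_recall_curve sorts by score internally, which is exactly
# the ranking procedure described above
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('P-R curve')
plt.show()
```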
ROC
The ROC curve is commonly used to compare models on binary classification problems; it expresses a trade-off between the true positive rate (TPR) and the false positive rate (FPR). The curve is drawn by computing TPR and FPR under different classification thresholds and plotting TPR on the vertical axis against FPR on the horizontal axis. From the two quantities that define the ROC curve, $TPR = \frac{TP}{P} = \frac{TP}{TP+FN}$ and $FPR = \frac{FP}{N} = \frac{FP}{FP+TN}$, we can see that when a sample is judged positive by the classifier, TPR increases if the sample is actually positive, and FPR increases if it is actually negative. The ROC curve can therefore be viewed as a "contest" between positives and negatives among all samples as the threshold moves. The closer the curve lies to the top-left corner, the more positives are ranked ahead of negatives, and the better the model's overall performance.
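The same synthetic data can be used to trace an ROC curve (again my own sketch, not from the source text):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.3, 0.2, 0.1, 0.1])

# One (FPR, TPR) point per threshold, sweeping from high to low
fpr, tpr, thresholds = roc_curve(y_true, y_score)

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle='--')  # the random-guess diagonal
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC curve')
plt.show()
```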
AUC
First consider the random-guess line in the ROC plot: the dashed line from [0, 0] to [1, 1]. Every point on it satisfies TPR = FPR at its threshold. By definition, $TPR = \frac{TP}{P}$ is the probability that a positive example is predicted positive, and $FPR = \frac{FP}{N}$ is the probability that a negative example is predicted positive. If the two are equal, the classifier predicts a sample as positive with the same probability regardless of whether the sample is actually positive or negative, which amounts to random guessing (note that "random" here does not mean a 50/50 coin flip).
Point B in the figure referenced above is such a random point: no matter how the sample count or class balance changes, it always classifies 75% of the samples as positive.
The area under the ROC curve (i.e., the AUC) can be interpreted as follows: draw a sample A at random from all positives and a sample B at random from all negatives; the AUC is the probability that the classifier assigns A a higher positive score than B. Points above the random line (such as point A in the figure) are considered better than random guessing: at such points TPR is always greater than FPR, meaning a positive is more likely to be judged positive than a negative is.
From another angle, since the ROC curve is drawn after sorting all samples by the classifier's predicted probability, the AUC reflects the classifier's ability to rank samples; in the example above it is the probability that A is ranked ahead of B. The larger the AUC, the better the ranking ability, i.e., the more positives the classifier ranks ahead of negatives.
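This ranking interpretation can be checked numerically (a sketch on the same synthetic data): compare roc_auc_score against the fraction of (positive, negative) pairs in which the positive gets the higher score, counting ties as half.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.3, 0.2, 0.1, 0.1])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Fraction of (positive, negative) pairs ranked correctly; ties count 0.5
pairs = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
auc_by_ranking = pairs / (len(pos) * len(neg))

print(auc_by_ranking)                  # pairwise-ranking estimate
print(roc_auc_score(y_true, y_score))  # identical value
```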