
Keras Text Topic Classification: A Short Summary

2015-12-31 16:01


1. Task Overview

The task is to predict the category of a piece of input text. Due to time constraints, experiments were run only on the 20 Newsgroups dataset.

Below is a brief introduction to the 20 Newsgroups dataset:

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his Newsweeder: Learning to filter netnews paper, though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

The data is organized into 20 different newsgroups, each corresponding to a different topic. Some of the newsgroups are very closely related to each other (e.g. comp.sys.ibm.pc.hardware / comp.sys.mac.hardware), while others are highly unrelated (e.g. misc.forsale / soc.religion.christian). Here is a list of the 20 newsgroups, partitioned (more or less) according to subject matter:

comp.graphics

comp.os.ms-windows.misc

comp.sys.ibm.pc.hardware

comp.sys.mac.hardware

comp.windows.x

rec.autos

rec.motorcycles

rec.sport.baseball

rec.sport.hockey

sci.crypt

sci.electronics

sci.med

sci.space

misc.forsale

talk.politics.misc

talk.politics.guns

talk.politics.mideast

talk.religion.misc

alt.atheism

soc.religion.christian

2. Data Processing

2.1 Preprocessing

The usual processing pipeline is to remove punctuation, remove stop words, and normalize case. Here I use the already-processed text provided at http://web.ist.utl.pt/~acardoso/datasets/, which was produced with the steps below (a small sketch of the basic normalization appears after this list):

1. all-terms: Obtained from the original datasets by applying the following transformations:
   1. Substitute TAB, NEWLINE and RETURN characters by SPACE.
   2. Keep only letters (that is, turn punctuation, numbers, etc. into SPACES).
   3. Turn all letters to lowercase.
   4. Substitute multiple SPACES by a single SPACE.
   5. The title/subject of each document is simply added in the beginning of the document's text.

2. no-short: Obtained from the previous file, by removing words that are less than 3 characters long. For example, removing "he" but keeping "him".

3. no-stop: Obtained from the previous file, by removing the 524 SMART stopwords. Some of them had already been removed, because they were shorter than 3 characters.

4. stemmed: Obtained from the previous file, by applying Porter's Stemmer to the remaining words. Information about stemming can be found here.
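For illustration only (the downloaded files already have these steps applied), here is a minimal sketch of transformations 1-4 in plain Python:

import re

def normalize(text):
    # 1. turn TAB / NEWLINE / RETURN characters into spaces
    text = re.sub(r'[\t\r\n]+', ' ', text)
    # 2. keep only letters (everything else becomes a space)
    text = re.sub(r'[^A-Za-z]', ' ', text)
    # 3. lowercase everything
    text = text.lower()
    # 4. collapse multiple spaces into a single space
    return re.sub(r' +', ' ', text).strip()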

The class labels are then converted to integer labels.

2.2 Converting Text to Features

Two different feature representations are used, one for each model (DNN and LSTM):

a) Words to word indices

A word list is built from the training corpus and sorted by frequency in descending order, with indices starting from 0; every word in a sentence is then replaced by its index. A minimal sketch of this step is shown below.
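A minimal sketch of this conversion (the author's LoadOriData module is not shown, so the helper names here are purely illustrative):

from collections import Counter

def build_vocab(tokenized_docs, nb_words=1000):
    # Count word frequencies over the training corpus and keep the nb_words most
    # frequent words, numbered 0, 1, 2, ... in descending frequency order.
    counts = Counter(w for doc in tokenized_docs for w in doc)
    return {w: i for i, (w, _) in enumerate(counts.most_common(nb_words))}

def doc_to_indices(doc, vocab):
    # Replace each word by its index; words outside the vocabulary are dropped.
    return [vocab[w] for w in doc if w in vocab]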

b) Words to word vectors

Google's word2vec tool is used to generate a vector for each word from the training text. Note that the vocabulary produced by this tool contains an entry named </s>, which comes from newline/carriage-return characters; simply ignore that entry. When an out-of-vocabulary word appears in the test corpus, its vector is filled with all zeros.

In this experiment the vector size is set to 48, i.e. each word is converted to a 48-dimensional word vector.
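A sketch of how the word2vec output can be loaded and a document turned into a 100x48 matrix. This assumes word2vec was run with text output (-binary 0); the function names are illustrative, not the author's actual code:

import numpy as np

def load_word2vec_text(path, dim=48):
    # The text output starts with a "vocab_size dim" header line, followed by
    # one "word v1 v2 ... v_dim" line per word.
    vectors = {}
    with open(path) as f:
        f.readline()  # skip the header line
        for line in f:
            parts = line.rstrip().split(' ')
            if parts[0] == '</s>':   # newline token produced by word2vec; ignore it
                continue
            vectors[parts[0]] = np.asarray(parts[1:1 + dim], dtype='float32')
    return vectors

def doc_to_matrix(words, vectors, maxlen=100, dim=48):
    # Keep the first maxlen words; out-of-vocabulary words stay as all-zero rows.
    mat = np.zeros((maxlen, dim), dtype='float32')
    for t, w in enumerate(words[:maxlen]):
        if w in vectors:
            mat[t] = vectors[w]
    return mat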

3. Experiments

Below is the experiment code for the two models: a DNN and a GRU (an LSTM variant).

a) DNN

from __future__ import absolute_import
from __future__ import print_function
import numpy as np
np.random.seed(1337)  # for reproducibility

from keras.preprocessing import sequence
from keras.optimizers import SGD, RMSprop, Adagrad
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Reshape
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM, GRU
from keras.preprocessing.text import Tokenizer
import pickle
import LoadOriData

batch_size = 32
maxlen = 100
max_features = 1000

print("Loading data...")
X_train, Y_train = LoadOriData.Process('20ng-train-stemmed.txt', nb_words=max_features)
X_test, Y_test = LoadOriData.Process('20ng-test-stemmed.txt', nb_words=max_features)
print(len(X_train), 'train sequences')

tokenizer = Tokenizer(nb_words=max_features)
X_train = tokenizer.sequences_to_matrix(X_train, mode="binary")
X_test = tokenizer.sequences_to_matrix(X_test, mode='binary')
print('X_train shape:', X_train.shape)
#print('Y_train shape:', Y_train.shape)

print('Build model...')
model = Sequential()
model.add(Dense(512, input_shape=(max_features,), activation='tanh'))
model.add(Dropout(0.5))
model.add(Dense(20, activation='softmax'))

# try using different optimizers and different optimizer configs
model.compile(loss='categorical_crossentropy', optimizer='adam', class_mode="categorical")

json_string = model.to_json()
print(json_string)
f = open('20mlp_model.txt', 'w')
f.write(json_string)
f.close()

print("Train...")
model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=5, show_accuracy=True)
model.save_weights('20mlp_weights.h5', overwrite=True)

score, acc = model.evaluate(X_test, Y_test, batch_size=batch_size, verbose=1, show_accuracy=True)
print('Test score:', score)
print('Test accuracy:', acc)

Notes:

tokenizer.sequences_to_matrix converts each sequence of word indices into a 0/1 vector. This experiment uses max_features = 1000, i.e. it only records whether each of the top 1000 words occurs, so the input layer size is 1000.
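A tiny example of what sequences_to_matrix does with mode='binary' (the index sequences here are made up):

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(nb_words=1000)
seqs = [[4, 27, 4, 311], [9, 1, 27]]                 # two documents as word-index sequences
X = tokenizer.sequences_to_matrix(seqs, mode='binary')
# X has shape (2, 1000); in row 0 the columns 4, 27 and 311 are 1, everything else is 0.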

For convenience I already converted the outputs (the class labels) to 0/1 vectors during preprocessing, so the outputs need no further handling here. Keras also ships a utility for this, keras.utils.np_utils: if y_test holds integer class labels, Y_test = np_utils.to_categorical(y_test, nb_classes) gives the 0/1 (one-hot) encoding.
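For example, with made-up labels:

from keras.utils import np_utils

y_test = [3, 0, 19]                             # integer class labels
Y_test = np_utils.to_categorical(y_test, 20)    # shape (3, 20), one 1 per row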

The DNN in this experiment has a 1000*512*20 structure with 50% dropout. Note that the last layer's activation is softmax, the loss is categorical_crossentropy (cross-entropy for class prediction), and class_mode is set to categorical.

b) GRU

from __future__ import absolute_import
from __future__ import print_function
import numpy as np
np.random.seed(1337)  # for reproducibility

from keras.preprocessing import sequence
from keras.optimizers import SGD, RMSprop, Adagrad
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Reshape
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM, GRU
import pickle
import os

batch_size = 32
weights_file = '20lstm_weights.h5'

print("Loading data...")
f = open('train.pkl', 'r')
X_train, Y_train = pickle.load(f)
f.close()
print('X_train shape:', X_train.shape)
print('Y_train shape:', Y_train.shape)

print('Build model...')
model = Sequential()
model.add(GRU(output_dim=128, input_dim=48, activation='tanh', inner_activation='hard_sigmoid', input_length=100))  # try using a GRU instead, for fun
model.add(Dropout(0.5))
model.add(Dense(20, activation='softmax'))

# try using different optimizers and different optimizer configs
model.compile(loss='categorical_crossentropy', optimizer='adam', class_mode="categorical")

json_string = model.to_json()
print(json_string)

print("Train...")
if os.path.exists(weights_file):
    model.load_weights(weights_file)
model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=4, show_accuracy=True)
model.save_weights(weights_file, overwrite=True)

f = open('test.pkl', 'r')
X_test, Y_test = pickle.load(f)
f.close()
score, acc = model.evaluate(X_test, Y_test, batch_size=batch_size, verbose=1, show_accuracy=True)
print('Test score:', score)
print('Test accuracy:', acc)

Notes:

train.pkl and test.pkl are generated from the word2vec results, keeping the first 100 words of each document, to produce X and Y. X uses 48-dimensional vectors with a sequence length of 100, and out-of-vocabulary words are filled with all zeros; Y is 20-dimensional (the total number of classes), with the dimension of the true class set to 1 and the rest to 0. A sketch of this assembly step is shown below.
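A sketch of how these pickles could be assembled (illustrative only: docs is assumed to be a list of (word_list, integer_label) pairs, and doc_to_matrix is the helper sketched in section 2.2):

import pickle
import numpy as np

# Stack one (100, 48) matrix per document -> shape (n_docs, 100, 48)
X_train = np.array([doc_to_matrix(words, vectors) for words, label in docs], dtype='float32')
# One-hot encode the labels -> shape (n_docs, 20)
Y_train = np.zeros((len(docs), 20), dtype='float32')
for i, (words, label) in enumerate(docs):
    Y_train[i, label] = 1.0
with open('train.pkl', 'wb') as f:
    pickle.dump((X_train, Y_train), f)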

model.evaluate directly computes the loss and the accuracy (the latter is only meaningful for classification tasks).

The LSTM weights consist of 12 parameters: U_c, U_f, U_i, U_o, W_c, W_f, W_i, W_o, b_c, b_f, b_i, b_o; their values can be inspected with the eval() function. The GRU weights are U_h, U_r, U_z, W_h, W_r, W_z, b_h, b_r, b_z.

For the LSTM and GRU equations, see http://colah.github.io/posts/2015-08-Understanding-LSTMs/ . Note that each W matrix in that article is the combination of Keras's U and W matrices; the article effectively merges the two in its formulas.
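For reference, here are the GRU update equations written with the split parameterization (W acting on the input x_t, U acting on the previous state h_{t-1}); the single matrices in the blog post, which act on the concatenation [h_{t-1}, x_t], correspond to the (W, U) pairs here. The exact gating convention can differ slightly between implementations, so treat this as a reference rather than the precise Keras code:

z_t  = sigmoid(W_z x_t + U_z h_{t-1} + b_z)           (update gate)
r_t  = sigmoid(W_r x_t + U_r h_{t-1} + b_r)           (reset gate)
h~_t = tanh(W_h x_t + U_h (r_t * h_{t-1}) + b_h)      (candidate state)
h_t  = (1 - z_t) * h_{t-1} + z_t * h~_t               (* is element-wise multiplication)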

4. Results and Conclusions

4.1 DNN

After 5 epochs: 92% accuracy on the training set, 70% on the test set.

In [156]: run 20ng_mlp.py

Loading data...

11293 train sequences

X_train shape: (11293, 1000)

Build model...

{"layers": [{"b_constraint": null, "name": "Dense", "activity_regularizer": null, "W_constraint": null, "input_shape": [1000], "init": "glorot_uniform", "activation": "tanh", "input_dim": null, "b_regularizer": null, "W_regularizer": null, "output_dim": 512}, {"p": 0.5, "name": "Dropout"}, {"b_constraint": null, "name": "Dense", "activity_regularizer": null, "W_constraint": null, "init": "glorot_uniform", "activation": "softmax", "input_dim": null, "b_regularizer": null, "W_regularizer": null, "output_dim": 20}], "loss": "categorical_crossentropy", "theano_mode": null, "name": "Sequential", "class_mode": "categorical", "optimizer": {"beta_1": 0.9, "epsilon": 1e-08, "beta_2": 0.999, "lr": 0.001, "name": "Adam"}}

Train...

Epoch 1/5
11293/11293 [==============================] - 9s - loss: 1.4124 - acc: 0.6726
Epoch 2/5
11293/11293 [==============================] - 9s - loss: 0.6080 - acc: 0.8343
Epoch 3/5
11293/11293 [==============================] - 10s - loss: 0.4347 - acc: 0.8773
Epoch 4/5
11293/11293 [==============================] - 9s - loss: 0.3388 - acc: 0.9059
Epoch 5/5
11293/11293 [==============================] - 9s - loss: 0.2772 - acc: 0.9212
7528/7528 [==============================] - 1s

Test score: 1.12070538341

Test accuracy: 0.701248671626

4.2 GRU

Trained for 24 epochs in total (20 epochs earlier; only the log of the last 4 epochs is shown here).

Training accuracy 86%, test accuracy 75%.

In [1]: run 20ng_lstm.py

Loading data...

X_train shape: (11293, 100, 48)

Y_train shape: (11293, 20)

Build model...

/usr/local/lib/python2.7/dist-packages/Theano-0.7.0-py2.7.egg/theano/scan_module/scan_perform_ext.py:133: RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility
  from scan_perform.scan_perform import *

{"layers": [{"truncate_gradient": -1, "name": "GRU", "inner_activation": "hard_sigmoid", "output_dim": 128, "input_shape": [100, 48], "init": "glorot_uniform", "inner_init": "orthogonal", "input_dim": 48, "return_sequences": false, "activation": "tanh", "input_length": 100}, {"p": 0.5, "name": "Dropout"}, {"b_constraint": null, "name": "Dense", "activity_regularizer": null, "W_constraint": null, "init": "glorot_uniform", "activation": "softmax", "input_dim": null, "b_regularizer": null, "W_regularizer": null, "output_dim": 20}], "loss": "categorical_crossentropy", "theano_mode": null, "name": "Sequential", "class_mode": "categorical", "optimizer": {"beta_1": 0.9, "epsilon": 1e-08, "beta_2": 0.999, "lr": 0.001, "name": "Adam"}}

Train...

Epoch 1/4
11293/11293 [==============================] - 124s - loss: 0.4315 - acc: 0.8681
Epoch 2/4
  56/11293 [=========================>....] - ETA: 15s - loss: 0.4299 - acc: 0.8693
Epoch 3/4
11293/11293 [==============================] - 118s - loss: 0.4081 - acc: 0.8756
Epoch 4/4
11293/11293 [==============================] - 130s - loss: 0.3950 - acc: 0.8837
7528/7528 [==============================] - 21s

Test score: 0.863923724031

Test accuracy: 0.758368756642

4.3 Summary

1. Each LSTM/GRU epoch takes roughly 11x as long as a DNN epoch, and more epochs are needed than for the DNN, but its out-of-sample accuracy beats the DNN.

2. Increasing the LSTM/GRU window length (the sequence length) improves per-epoch accuracy but increases running time.

3. The GRU performs better than the LSTM: higher accuracy and faster to run. Due to time constraints no logs were kept from those runs, but the results showed the GRU outperforming the LSTM. For an introduction to the GRU, see http://colah.github.io/posts/2015-08-Understanding-LSTMs/

5. Miscellaneous

How to inspect the output of an intermediate layer:

Taking the DNN model as an example:

model2 = Sequential()
model2.add(Dense(512, input_shape=(max_features,), activation='tanh', weights=model.layers[0].get_weights()))
model2.compile(loss='categorical_crossentropy', optimizer='adam', class_mode="categorical")

Then TT = model2.predict(X_test, batch_size=...) gives the output after the first layer.

How Dropout works, using a rate of 0.5 as an example (a toy illustration follows):

a) During training it randomly disables half of the neurons, i.e. each unit is dropped with probability 0.5.

b) At prediction time, the layer output (W*X+B) is scaled by the keep probability (1 - dropout rate), which for a rate of 0.5 is a factor of 0.5.
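A toy illustration of the two phases (just the idea, not Keras internals):

import numpy as np

rng = np.random.RandomState(0)
h = np.array([1.0, 2.0, 3.0, 4.0])            # some layer output W*X+B
p = 0.5                                       # dropout rate
mask = rng.binomial(1, 1 - p, size=h.shape)   # training: drop each unit with probability p
h_train = h * mask
h_test = h * (1 - p)                          # prediction: scale by the keep probability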

The model architecture can be saved with model.to_json(), writing out the resulting string.

The model weights can be saved with model.save_weights('20mlp_weights.h5', overwrite=True), which writes an HDF5 file.
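A sketch of loading everything back (assuming model_from_json is available in the Keras version used here):

from keras.models import model_from_json

# Rebuild the architecture from the saved JSON string, then restore the weights.
model = model_from_json(open('20mlp_model.txt').read())
model.load_weights('20mlp_weights.h5')
# Compile again before calling evaluate() or predict().
model.compile(loss='categorical_crossentropy', optimizer='adam', class_mode='categorical')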

Alternatively, model.get_weights() returns the weights of every layer that has weight parameters (a Dropout layer has none). For example, for the DNN it returns an array of length 4: array[0] and array[1] are the W and b of the 1000*512 layer, and array[2] and array[3] are the W and b of the 512*20 layer.
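A quick shape check of that statement:

weights = model.get_weights()
print(len(weights))                          # 4
print(weights[0].shape, weights[1].shape)    # (1000, 512) (512,)
print(weights[2].shape, weights[3].shape)    # (512, 20) (20,)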