
[Competition notes] Kaggle Toxic Comment [multiple binary classifications in Keras, good comment corpora, using pre-trained word embeddings]

2018-01-02 10:42

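Abstract

I have recently been following a Kaggle competition, Toxic Comment:

https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

The goal is to decide whether a text comment is toxic.

Toxicity is further broken down into six categories, and a single comment can belong to any number of them:

['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

This post shares the new tricks I picked up along the way.

Keras can do several binary classifications at once

The Bi-LSTM baseline [0.051] performs all six binary classifications in a single model. I had no idea this was possible!

The code is as follows: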

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from keras.models import Model
from keras.layers import Dense, Embedding, Input
from keras.layers import LSTM, Bidirectional, GlobalMaxPool1D, Dropout
from keras.preprocessing import text, sequence
from keras.callbacks import EarlyStopping, ModelCheckpoint

max_features = 20000
maxlen = 100

train = pd.read_csv('../data/train/train.csv')
test = pd.read_csv('../data/test/test.csv')
subm = pd.read_csv('../data/sample_submission.csv/sample_submission.csv')
train = train.sample(frac=1)

list_sentences_train = train["comment_text"].fillna("CVxTz").values
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = train[list_classes].values
list_sentences_test = test["comment_text"].fillna("CVxTz").values

tokenizer = text.Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)

X_t = sequence.pad_sequences(list_tokenized_train, maxlen=maxlen)
X_te = sequence.pad_sequences(list_tokenized_test, maxlen=maxlen)

def get_model():
    embed_size = 128
    inp = Input(shape=(maxlen, ))
    x = Embedding(max_features, embed_size)(inp)
    x = Bidirectional(LSTM(50, return_sequences=True, dropout=0.1))(x)
    x = GlobalMaxPool1D()(x)
    x = Dropout(0.1)(x)
    x = Dense(50, activation="relu")(x)
    x = Dropout(0.1)(x)
    # Six sigmoid units: one independent probability per toxicity label.
    x = Dense(6, activation="sigmoid")(x)
    model = Model(inputs=inp, outputs=x)
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    return model

model = get_model()
batch_size = 32
epochs = 3

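file_path = "weights_base.best.hdf5"
# 'val_acc' is the name older Keras releases use; newer releases log
# metrics=['accuracy'] as 'val_accuracy', so adjust the monitor name if needed.
# checkpoint = ModelCheckpoint(file_path, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
checkpoint = ModelCheckpoint(file_path, monitor='val_acc', verbose=1, save_best_only=True, mode='max')

# With epochs=3, patience=20 can never trigger; it only matters for longer runs.
# early = EarlyStopping(monitor="val_loss", mode="min", patience=20)
early = EarlyStopping(monitor="val_acc", mode="max", patience=20)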

callbacks_list = [checkpoint, early] #early
model.fit(X_t, y, batch_size=batch_size, epochs=epochs, validation_split=0.1, callbacks=callbacks_list)
# model.fit(X_t, y, batch_size=batch_size, epochs=epochs, validation_split=0.1)

model.load_weights(file_path)
y_test = model.predict(X_te)

# Reuse the sample submission already loaded into subm above instead of
# reading it again from a second, different path.
sample_submission = subm.copy()
sample_submission[list_classes] = y_test

sample_submission.to_csv("baseline.csv", index=False)
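
To make the "six binary classifications at once" idea concrete, here is a minimal sketch in plain Python of what the Dense(6, activation="sigmoid") output combined with binary_crossentropy computes for one comment. It is an illustration, not Keras's actual implementation: the helper names and the example probabilities are made up, and it assumes the loss is averaged over the six labels.

```python
import math

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def binary_log_loss(y_true, p, eps=1e-7):
    """Log loss for a single binary label; p is clipped to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

def multilabel_loss(y_true_row, p_row):
    """Average of the per-label binary log losses for one comment."""
    losses = [binary_log_loss(t, p) for t, p in zip(y_true_row, p_row)]
    return sum(losses) / len(losses)

# Hypothetical comment tagged toxic + obscene + insult.
y_true = [1, 0, 1, 0, 1, 0]
# Made-up sigmoid outputs, one probability per label.
y_pred = [0.9, 0.1, 0.8, 0.05, 0.7, 0.02]

per_label = {name: binary_log_loss(t, p)
             for name, t, p in zip(LABELS, y_true, y_pred)}
total = multilabel_loss(y_true, y_pred)

for name, value in per_label.items():
    print(f"{name:14s} {value:.4f}")
print(f"{'mean':14s} {total:.4f}")
```

Each label contributes its own term, so a wrong prediction for 'threat' does not change the loss contributed by 'toxic'. That independence is what lets one network act as six classifiers.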


Good comment corpora

Comment

YouTube Comments (excellent for supplementing the threat and identity_hate columns)

Reddit Comments (roughly a terabyte of data, divided by year)

Toxic word dictionary

http://www.bannedwordlist.com/

https://www.cs.cmu.edu/~biglou/resources/bad-words.txt

https://www.freewebheaders.com/full-list-of-bad-words-banned-by-google/

https://www.frontgatemedia.com/a-list-of-723-bad-words-to-blacklist-and-how-to-use-facebooks-moderation-tool/

https://kaggle2.blob.core.windows.net/forum-message-attachments/4810/badwords.txt

https://gist.github.com/ryanlewis/a37739d710ccdb4b406d

Pre-trained word embeddings

Google’s word2vec embedding: [Word2Vec] [DownloadLink]

Glove word vectors: [Glove]

Facebook’s fastText embeddings: [FastText]

[DeepMoji]: To understand how language is used to express emotions

Wikipedia

Wikipedia database reports: https://en.wikipedia.org/wiki/Wikipedia:Database_reports

Wikimedia logs: https://meta.wikimedia.org/w/index.php?title=Special%3ALog

Other

https://github.com/conversationai/perspectiveapi

https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction

Google NLP Model: https://cloud.google.com/natural-language/

Using pre-trained word embeddings

https://github.com/MoyanZitto/keras-cn/blob/master/docs/legacy/blog/word_embedding.md

Usage is as follows:

GloVe

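import os
import numpy as np

GLOVE_DIR = r'D:\glove.6B'  # raw string so the backslash is kept literally
EMBEDDING_DIM = 100  # must match the 100-dimensional glove.6B.100d.txt file
word_index = tokenizer.word_index  # assumes the tokenizer fitted in the baseline above

# Each line of the GloVe file is a word followed by its vector components.
embeddings_index = {}
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Found %s word vectors.' % len(embeddings_index))

# Row i of the matrix holds the vector of the word whose tokenizer index is i.
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector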
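
The same steps can be tried without downloading GloVe. The sketch below uses a tiny in-memory stand-in for the vector file and a made-up word_index, so the file contents, the three-dimensional vectors and the helper names load_vectors and build_matrix are all assumptions for illustration only.

```python
import io
import numpy as np

EMBEDDING_DIM = 3  # toy size; glove.6B.100d.txt would need 100

# Hypothetical stand-in for a GloVe text file: "word v1 v2 ... vN" per line.
fake_glove = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "idiot 0.4 0.5 0.6\n"
)

def load_vectors(handle):
    """Parse GloVe-style lines into a {word: vector} dict."""
    index = {}
    for line in handle:
        values = line.split()
        if not values:
            continue  # skip blank lines
        index[values[0]] = np.asarray(values[1:], dtype="float32")
    return index

def build_matrix(word_index, embeddings_index, dim):
    """Row i holds the vector of the word with index i; unknown words stay zero."""
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[i] = vector
    return matrix

# Made-up tokenizer output; index 0 is reserved for padding.
word_index = {"the": 1, "idiot": 2, "zzzunknown": 3}

embeddings_index = load_vectors(fake_glove)
embedding_matrix = build_matrix(word_index, embeddings_index, EMBEDDING_DIM)
print(embedding_matrix)
```

Row 0 (padding) and row 3 (a word missing from the vector file) stay all-zero, which is why the matrix needs len(word_index) + 1 rows.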


Google Word2Vec

import numpy as np
from gensim.models.keyedvectors import KeyedVectors

w2v_bin = r'D:\GoogleNews-vectors-negative300.bin'  # raw string keeps the backslash literal
EMBEDDING_DIM = 300  # the GoogleNews vectors are 300-dimensional
# Named w2v rather than model so it does not shadow the Keras model above.
w2v = KeyedVectors.load_word2vec_format(w2v_bin, binary=True)

embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = w2v[word] if word in w2v else None
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
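
Because every word missing from the pre-trained vectors ends up as an all-zero row, it can be worth checking how much of the vocabulary is actually covered before training. The helper below is a hypothetical sketch, not part of gensim or Keras. It only relies on the `in` membership test, so it should work with a plain dict such as embeddings_index or with the loaded KeyedVectors object. The sample words are made up.

```python
def embedding_coverage(word_index, vectors):
    """Return (fraction of vocabulary found, sorted list of missing words)."""
    missing = sorted(w for w in word_index if w not in vectors)
    found = len(word_index) - len(missing)
    fraction = found / len(word_index) if word_index else 0.0
    return fraction, missing

# Made-up vocabulary and vectors: comment text is full of slang and typos
# that a news-trained embedding is unlikely to contain.
word_index = {"hello": 1, "world": 2, "u": 3, "sux": 4}
vectors = {"hello": [0.1], "world": [0.2]}

coverage, missing = embedding_coverage(word_index, vectors)
print(f"coverage: {coverage:.0%}, missing: {missing}")
```

A low coverage figure suggests trying a different embedding or normalising the text before tokenizing.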


Finally, use it in Keras

# MAX_SEQUENCE_LENGTH is the padded sequence length (maxlen in the baseline above).
Embedding(len(word_index) + 1,
          EMBEDDING_DIM,
          weights=[embedding_matrix],
          input_length=MAX_SEQUENCE_LENGTH,
          trainable=False)
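
Conceptually, a frozen Embedding layer initialised this way is just a row lookup into embedding_matrix. The NumPy sketch below mimics that lookup with a made-up 3 x 2 matrix and two hand-written padded sequences. It illustrates the idea and is not how Keras implements the layer.

```python
import numpy as np

# Toy embedding matrix: 3 word indices, 2 dimensions each.
embedding_matrix = np.array([
    [0.0, 0.0],   # index 0: padding, left as zeros
    [1.0, 2.0],   # index 1
    [3.0, 4.0],   # index 2
], dtype="float32")

# Two sequences of length 3, left-padded with 0 the way pad_sequences
# pads by default.
X = np.array([[0, 1, 2],
              [0, 0, 1]])

# Integer-array indexing replaces every word index with its vector.
embedded = embedding_matrix[X]
print(embedded.shape)  # (batch, sequence length, embedding dim)
```

With trainable=False these rows are never updated during training, so the model learns only the layers on top of the pre-trained vectors.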


Categorical_crossentropy VS Binary_crossentropy

Quoting the explanation from the first-place competitor:

In this case, it should be binary_crossentropy and not categorical_crossentropy. categorical_crossentropy assumes that all the probabilities of classes sum to 1 (a multi-class scenario where every sample has exactly 1 class). In this competition, we have a multi-label scenario, because a sample can have any number of classes (or none at all), so binary_crossentropy independently optimises each class.
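
The difference is easy to see on the output activations that usually accompany each loss: softmax with categorical_crossentropy, sigmoid with binary_crossentropy. The plain-Python sketch below uses six made-up logits for a comment that is toxic, obscene and insulting at once. The numbers are illustrative assumptions, not model output.

```python
import math

def sigmoid(z):
    """Independent probability for one label."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Probabilities that compete with each other and sum to 1."""
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits: high for toxic, obscene and insult, low for the rest.
logits = [3.0, -2.0, 2.5, -4.0, 2.0, -3.0]

sigmoid_probs = [sigmoid(z) for z in logits]
softmax_probs = softmax(logits)

# Count how many labels each activation can flag with probability > 0.5.
high_sigmoid = sum(p > 0.5 for p in sigmoid_probs)
high_softmax = sum(p > 0.5 for p in softmax_probs)

print("sigmoid:", [round(p, 3) for p in sigmoid_probs], "sum =", round(sum(sigmoid_probs), 3))
print("softmax:", [round(p, 3) for p in softmax_probs], "sum =", round(sum(softmax_probs), 3))
```

Sigmoid can assign a high probability to all three true labels, while softmax forces them to share a total of 1, so at most one label can ever exceed 0.5.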