[Competition Notes] Kaggle Toxic Comment [multiple binary classifications in Keras, quality comment corpora, using pre-trained word vectors]
2018-01-02 10:42
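Summary
I have recently been following a Kaggle competition, Toxic Comment: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
The goal is to decide whether a text comment is toxic.
Toxic comments are further broken down into six categories:
['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
This post mainly shares the new tricks I picked up along the way.
Keras can do several binary classifications at once
The Bi-LSTM baseline [0.051] performs six binary classifications simultaneously; I had no idea this was possible. The code is as follows: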
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from keras.models import Model
from keras.layers import Dense, Embedding, Input
from keras.layers import LSTM, Bidirectional, GlobalMaxPool1D, Dropout
from keras.preprocessing import text, sequence
from keras.callbacks import EarlyStopping, ModelCheckpoint

max_features = 20000  # vocabulary size kept by the tokenizer
maxlen = 100  # every comment is padded or truncated to this many tokens

train = pd.read_csv('../data/train/train.csv')
test = pd.read_csv('../data/test/test.csv')
subm = pd.read_csv('../data/sample_submission.csv/sample_submission.csv')
train = train.sample(frac=1)  # shuffle the training rows

list_sentences_train = train["comment_text"].fillna("CVxTz").values
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = train[list_classes].values  # shape (n_samples, 6), one 0/1 column per label
list_sentences_test = test["comment_text"].fillna("CVxTz").values

tokenizer = text.Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)
X_t = sequence.pad_sequences(list_tokenized_train, maxlen=maxlen)
X_te = sequence.pad_sequences(list_tokenized_test, maxlen=maxlen)

def get_model():
    embed_size = 128
    inp = Input(shape=(maxlen, ))
    x = Embedding(max_features, embed_size)(inp)
    x = Bidirectional(LSTM(50, return_sequences=True, dropout=0.1))(x)
    x = GlobalMaxPool1D()(x)
    x = Dropout(0.1)(x)
    x = Dense(50, activation="relu")(x)
    x = Dropout(0.1)(x)
    # Six sigmoid units: one independent binary classifier per label.
    x = Dense(6, activation="sigmoid")(x)
    model = Model(inputs=inp, outputs=x)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

model = get_model()
batch_size = 32
epochs = 3
file_path = "weights_base.best.hdf5"

# Newer Keras versions report this metric as 'val_accuracy' instead of 'val_acc'.
# checkpoint = ModelCheckpoint(file_path, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
checkpoint = ModelCheckpoint(file_path, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
# early = EarlyStopping(monitor="val_loss", mode="min", patience=20)
early = EarlyStopping(monitor="val_acc", mode="max", patience=20)  # never triggers with epochs=3
callbacks_list = [checkpoint, early]

model.fit(X_t, y, batch_size=batch_size, epochs=epochs, validation_split=0.1, callbacks=callbacks_list)
# model.fit(X_t, y, batch_size=batch_size, epochs=epochs, validation_split=0.1)

model.load_weights(file_path)  # restore the best checkpoint before predicting
y_test = model.predict(X_te)

# Reuse the sample submission loaded above instead of reading it again from a second path.
subm[list_classes] = y_test
subm.to_csv("baseline.csv", index=False)
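The baseline relies on `sequence.pad_sequences` to turn every comment into exactly `maxlen` token ids. With its default arguments, Keras pads and truncates at the front of the sequence. The helper below is only an illustrative pure-Python sketch of that default behaviour for a single sequence; `pad_pre` is a hypothetical name, not a Keras function.

```python
def pad_pre(seq, maxlen, value=0):
    """Sketch of pad_sequences' defaults (padding='pre', truncating='pre')
    for one sequence of token ids. Assumes maxlen >= 1."""
    truncated = seq[-maxlen:]  # keep only the last `maxlen` ids
    # Left-pad with `value` until the sequence is exactly `maxlen` long.
    return [value] * (maxlen - len(truncated)) + truncated


short = pad_pre([5, 9, 2], maxlen=5)
long_ = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [0, 0, 5, 9, 2]
print(long_)  # [3, 4, 5, 6, 7]
```

With `maxlen = 100`, a long comment therefore keeps only its last 100 tokens, which may be worth keeping in mind when tuning that value.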
Quality comment corpora
YouTube Comments (excellent for supplementing the threat and identity_hate columns)
Reddit Comments (roughly a terabyte of data, divided by year)
Toxic word dictionary
http://www.bannedwordlist.com/
https://www.cs.cmu.edu/~biglou/resources/bad-words.txt
https://www.freewebheaders.com/full-list-of-bad-words-banned-by-google/
https://www.frontgatemedia.com/a-list-of-723-bad-words-to-blacklist-and-how-to-use-facebooks-moderation-tool/
https://kaggle2.blob.core.windows.net/forum-message-attachments/4810/badwords.txt
https://gist.github.com/ryanlewis/a37739d710ccdb4b406d
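One possible way to use such word lists is to derive simple count features per comment. The sketch below assumes a list has already been loaded into a Python set; the three-word `BAD_WORDS` set and the `bad_word_features` helper are hypothetical placeholders, not part of any of the resources above.

```python
import re

# Hypothetical stand-in for a downloaded word list (one term per line).
BAD_WORDS = {"idiot", "stupid", "moron"}


def bad_word_features(comment, bad_words=BAD_WORDS):
    """Count how many lowercase tokens of `comment` appear in `bad_words`."""
    tokens = re.findall(r"[a-z']+", comment.lower())
    hits = sum(1 for token in tokens if token in bad_words)
    ratio = hits / len(tokens) if tokens else 0.0
    return {"n_tokens": len(tokens), "n_bad": hits, "bad_ratio": ratio}


features = bad_word_features("You are a STUPID, stupid idiot!")
print(features)  # {'n_tokens': 6, 'n_bad': 3, 'bad_ratio': 0.5}
```

Features like these could be fed to a model alongside the token sequence; whether they help on this dataset would need to be checked on a validation split.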
Pre-trained word embeddings
Google’s word2vec embedding: [Word2Vec] [DownloadLink]
Glove word vectors: [Glove]
Facebook’s fastText embeddings: [FastText]
[DeepMoji]: To understand how language is used to express emotions
Wikipedia
Wikipedia database reports: https://en.wikipedia.org/wiki/Wikipedia:Database_reports
Wikimedia logs: https://meta.wikimedia.org/w/index.php?title=Special%3ALog
Other
https://github.com/conversationai/perspectiveapi
https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction
Google NLP Model: https://cloud.google.com/natural-language/
Using pre-trained word vectors
Reference: https://github.com/MoyanZitto/keras-cn/blob/master/docs/legacy/blog/word_embedding.md
Usage is as follows:
GLOVE
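import os
import numpy as np

GLOVE_DIR = r'D:\glove.6B'  # raw string keeps the backslash literal
EMBEDDING_DIM = 100  # must match the dimension of glove.6B.100d.txt

# Parse the GloVe text file: each line is a word followed by its vector.
embeddings_index = {}
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Found %s word vectors.' % len(embeddings_index))

# word_index is the fitted tokenizer's vocabulary (tokenizer.word_index).
# Row 0 stays all-zero because index 0 is reserved for padding.
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector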
Google Word2Vec
import numpy as np
from gensim.models.keyedvectors import KeyedVectors

w2v_bin = r'D:\GoogleNews-vectors-negative300.bin'  # raw string keeps the backslash literal
EMBEDDING_DIM = 300  # the GoogleNews vectors are 300-dimensional

model = KeyedVectors.load_word2vec_format(w2v_bin, binary=True)

# word_index is the fitted tokenizer's vocabulary (tokenizer.word_index).
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = model[word] if word in model else None
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
Finally, use the matrix in Keras:
Embedding(len(word_index) + 1, EMBEDDING_DIM, weights=[embedding_matrix], input_length=MAX_SEQUENCE_LENGTH, trainable=False)
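To make the matrix-building step easier to follow without downloading any vectors, here is a small self-contained sketch. It assumes a made-up two-word, 3-dimensional file in GloVe's text format and a made-up `word_index`; plain Python lists stand in for the NumPy array, and none of the values come from a real embedding.

```python
import io

# Hypothetical 3-dimensional vectors in GloVe's text format: "word v1 v2 v3".
fake_glove = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
EMBEDDING_DIM = 3

embeddings_index = {}
for line in fake_glove:
    values = line.split()
    embeddings_index[values[0]] = [float(v) for v in values[1:]]

# Shaped like tokenizer.word_index: ids start at 1, id 0 is left for padding.
word_index = {"the": 1, "cat": 2, "zzyzx": 3}

# One row per id, plus row 0 for padding; unknown words keep an all-zero row.
embedding_matrix = [[0.0] * EMBEDDING_DIM for _ in range(len(word_index) + 1)]
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

for row in embedding_matrix:
    print(row)
```

Row `i` of the matrix is what the `Embedding` layer returns for token id `i`, which is why the row count is `len(word_index) + 1`.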
Categorical_crossentropy VS Binary_crossentropy
Quoting the first-place explanation: In this case, it should be binary_crossentropy and not categorical_crossentropy. categorical_crossentropy assumes that all the probabilities of classes sum to 1 (a multi-class scenario where every sample has exactly 1 class). In this competition, we have a multi-label scenario, because a sample can have any number of classes (or none at all), so binary_crossentropy independently optimises each class.
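The difference can be seen numerically with a small pure-Python sketch. It assumes a made-up comment that carries three of the six labels and made-up logits. The helpers are simplified stand-ins written for illustration, not the Keras implementations.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of six independent per-label log losses."""
    losses = []
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1.0 - p)))
    return sum(losses) / len(losses)


# Hypothetical comment labelled toxic, obscene and insult at the same time.
y_true = [1, 0, 1, 0, 1, 0]
logits = [4.0, -4.0, 4.0, -4.0, 4.0, -4.0]

sig = [sigmoid(z) for z in logits]
soft = softmax(logits)

print("sigmoid :", [round(p, 3) for p in sig])   # each positive label close to 1
print("softmax :", [round(p, 3) for p in soft])  # positives share a total of 1
print("binary cross-entropy on sigmoid outputs:", binary_crossentropy(y_true, sig))
```

With softmax, the three positive labels have to split a total probability of 1, so none of them can exceed roughly one third. The sigmoid outputs can all approach 1, or all approach 0 for a clean comment, which matches the multi-label setup described in the quote.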