Sentiment Analysis with Neural Networks (Part 1)
2018-03-15 10:52
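This article shows how to classify sentiment with a neural network. It is based on week two of Udacity's Deep Learning Foundations course. The original project classified English text; I have adapted it to Chinese.
The first step is Chinese word segmentation, for which I use jieba.
import jieba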
seg = '使用结巴来对中文进行分词'
seg_list = jieba.cut(seg)
print("/ ".join(seg_list))
使用/ 结巴/ 来/ 对/ 中文/ 进行/ 分词
The basis for sentiment classification
One idea is to count how often each word occurs in the positive reviews and in the negative reviews. In theory, some words should lean clearly toward one class or the other. Let's check.
import pandas as pd
import numpy as np
neg = pd.read_excel('data/neg.xls', header=None)
pos = pd.read_excel('data/pos.xls', header=None)
neg.head()
   0
0  做为一本声名在外的流行书,说的还是广州的外企,按道理应该和我的生存环境差不多啊。但是一看之下...
1  作者有明显的自恋倾向,只有有老公养不上班的太太们才能像她那样生活。很多方法都不实用,还有抄袭...
2  作者完全是以一个过来的自认为是成功者的角度去写这个问题,感觉很不客观。虽然不是很喜欢,但是,...
3  作者提倡内调,不信任化妆品,这点赞同。但是所列举的方法太麻烦,配料也不好找。不是太实用。
4  作者的文笔一般,观点也是和市面上的同类书大同小异,不推荐读者购买。
pos.head(6)
   0
0  做父母一定要有刘墉这样的心态,不断地学习,不断地进步,不断地给自己补充新鲜血液,让自己保持一...
1  作者真有英国人严谨的风格,提出观点、进行论述论证,尽管本人对物理学了解不深,但是仍然能感受到...
2  作者长篇大论借用详细报告数据处理工作和计算结果支持其新观点。为什么荷兰曾经县有欧洲最高的生产...
3  作者在战几时之前用了"拥抱"令人叫绝.日本如果没有战败,就有会有美军的占领,没胡官僚主义的延...
4  作者在少年时即喜阅读,能看出他精读了无数经典,因而他有一个庞大的内心世界。他的作品最难能可贵...
5  作者有一种专业的谨慎,若能有幸学习原版也许会更好,简体版的书中的印刷错误比较多,影响学者理解...
pos['mark'] = 1
neg['mark'] = 0  # label the training corpus
pn = pd.concat([pos, neg], ignore_index=True)  # merge the two corpora
neglen = len(neg)
poslen = len(pos)  # corpus sizes
cw = lambda x: list(jieba.cut(x))  # segmentation function
pn['words'] = pn[0].apply(cw)
# shuffle the rows
pn = pn.reindex(np.random.permutation(pn.index))
pn.head()
from collections import Counter
positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()
len(pn['words'])
pn['words'][1][:10]
['作者', '真有', '英国人', '严谨', '的', '风格', ',', '提出', '观点', '、']
Now we count how many times each word occurs.
for i in range(len(pn['words'])):
    if pn['mark'][i] == 1:
        for word in pn['words'][i]:
            positive_counts[word] += 1
            total_counts[word] += 1
    else:
        for word in pn['words'][i]:
            negative_counts[word] += 1
            total_counts[word] += 1
positive_counts.most_common(10)
[(',', 63862), ('的', 48811), ('。', 25667), ('了', 14110), ('是', 10775), ('我', 9578), ('很', 8270), (',', 6682), (' ', 6354), ('也', 6307)]
negative_counts.most_common(10)
[(',', 42831), ('的', 28859), ('。', 16847), ('了', 13476), (',', 8462), ('是', 7994), ('我', 7841), (' ', 7528), ('!', 7084), ('不', 5821)]
pos_neg_ratios = Counter()
for term, cnt in list(total_counts.most_common()):
    if(cnt > 100):
        pos_neg_ratio = positive_counts[term] / float(negative_counts[term] + 1)
        pos_neg_ratios[term] = pos_neg_ratio
list(reversed(pos_neg_ratios.most_common()))[0:30]
[('上当', 0.014285714285714285), ('不买', 0.037267080745341616), ('最差', 0.04580152671755725), ('抵制', 0.057034220532319393), ('退货', 0.0707070707070707), ('死机', 0.07075471698113207), ('太差', 0.0728476821192053), ('退', 0.07920792079207921), ('极差', 0.08421052631578947), ('论语', 0.08849557522123894), ('恶心', 0.0896551724137931), ('很差', 0.09166666666666666), ('招待所', 0.09243697478991597), ('投诉', 0.09433962264150944), ('垃圾', 0.10138248847926268), ('没法', 0.125), ('几页', 0.12903225806451613), ('糟糕', 0.13333333333333333), ('脏', 0.13978494623655913), ('维修', 0.14606741573033707), ('晕', 0.15151515151515152), ('严重', 0.16), ('不值', 0.16161616161616163), ('浪费', 0.16793893129770993), ('失望', 0.16888045540796964), ('差', 0.16926503340757237), ('页', 0.18562874251497005), ('郁闷', 0.19730941704035873), ('根本', 0.20512820512820512), ('后悔', 0.20574162679425836)]
for word, ratio in pos_neg_ratios.most_common():
    if (ratio > 1):
        pos_neg_ratios[word] = np.log(ratio)
    else:
        pos_neg_ratios[word] = -np.log((1 / (ratio + 0.01)))
pos_neg_ratios.most_common(10)
[('结局', 3.7954891891721947), ('命运', 3.1986731175506815), ('成长', 3.0002674287193822), ('人们', 2.9885637840753785), ('快乐', 2.968080742223481), ('人类', 2.8332133440562162), ('自由', 2.6996819514316934), ('小巧', 2.57191802677763), ('世界', 2.5416019934645457), ('幸福', 2.5403579543242145)]
Looking through the list, we find words that carry sentiment, such as 好 (good), 不错 (not bad), and 喜欢 (like).
list(reversed(pos_neg_ratios.most_common()))[:10]
[('上当', -3.7178669909871886), ('不买', -3.0519411931108684), ('最差', -2.8859540494394644), ('抵制', -2.7025520357679857), ('退货', -2.5169290903564066), ('死机', -2.5163389039584163), ('太差', -2.4907515123361046), ('退', -2.4167854452210129), ('极差', -2.3622233593137767), ('论语', -2.3177436534248872)]
We now have a rough picture: after segmentation, reviews labeled positive and reviews labeled negative do differ somewhat, and some words occur more often in positive reviews than in negative ones.
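To make the scoring concrete, here is a minimal sketch of the same signed log-ratio on a tiny corpus. The toy reviews and the helper name `log_ratio` are made up for illustration; only the formula mirrors the code above. Note that the `+ 1` in the denominator means a word seen once in positive reviews and never in negative ones still gets a ratio of exactly 1, so it lands near zero rather than strongly positive.

```python
from collections import Counter
import math

# Hypothetical toy corpus: (tokens, label), with 1 = positive and 0 = negative.
toy_reviews = [
    (["很", "好", "喜欢"], 1),
    (["不错", "喜欢"], 1),
    (["太差", "退货"], 0),
    (["很", "失望", "退货"], 0),
]

positive_counts, negative_counts = Counter(), Counter()
for tokens, label in toy_reviews:
    target = positive_counts if label == 1 else negative_counts
    target.update(tokens)

def log_ratio(word):
    """Signed log of the smoothed positive/negative count ratio."""
    ratio = positive_counts[word] / float(negative_counts[word] + 1)
    if ratio > 1:
        return math.log(ratio)
    return -math.log(1 / (ratio + 0.01))

scores = {w: log_ratio(w) for w in set(positive_counts) | set(negative_counts)}
for word, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(word, round(score, 3))
```

Words that only occur in negative reviews (here 退货) get the most negative score, a word that occurs in both classes (很) sits closer to zero, and a word that occurs only in positive reviews more than once (喜欢) comes out positive.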
vocab = set(total_counts.keys())
vocab_size = len(vocab)
Numbering the words
The idea now is simply to number the vocab_size distinct words. Any passage can then be represented as a vector of length vocab_size in which each position holds the count of the corresponding word.
layer_0 = np.zeros((1, vocab_size))
word2index = {}
for i, word in enumerate(vocab):
    word2index[word] = i
def update_input_layer(reviews):
    global layer_0
    # clear out previous state, reset the layer to be all 0s
    layer_0 *= 0
    for word in reviews:
        layer_0[0][word2index[word]] += 1
update_input_layer(pn['words'][5])
print(layer_0)
[[ 0. 0. 0. ..., 0. 0. 0.]]
The complete code follows:
import time
import sys
import numpy as np
# Let's tweak our network from before to model these phenomena
class SentimentNetwork:
    def __init__(self, reviews, labels, hidden_nodes=10, learning_rate=0.1):
        '''
        Parameters:
        reviews: tokenized reviews (one list of words each), used for training
        labels: labels used for training
        hidden_nodes (int): number of nodes in the hidden layer
        learning_rate (float): learning step size
        '''
        # set our random number generator
        # np.random.seed(1)
        self.pre_process_data(reviews, labels)
        self.init_network(len(self.review_vocab), hidden_nodes, 1, learning_rate)
    def pre_process_data(self, reviews, labels):
        '''
        Preprocess the data: collect every word that occurs in reviews
        and build word2index.
        '''
        # collect every word that occurs in reviews
        review_vocab = set()
        for review in reviews:
            for word in review:
                review_vocab.add(word)
        self.review_vocab = list(review_vocab)
        # collect every label that occurs in labels (here there are only two: 1 and 0)
        # label_vocab = set()
        # for label in labels:
        #     label_vocab.add(label)
        # self.label_vocab = list(label_vocab)
        self.review_vocab_size = len(self.review_vocab)
        # self.label_vocab_size = len(self.label_vocab)
        # build word2index, assigning each word its own index
        self.word2index = {}
        for i, word in enumerate(self.review_vocab):
            self.word2index[word] = i
        # self.label2index = {}
        # for i, label in enumerate(self.label_vocab):
        #     self.label2index[label] = i
    def init_network(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        # Set number of nodes in input, hidden and output layers.
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes
        # Initialize weights
        self.weights_0_1 = np.zeros((self.hidden_nodes, self.input_nodes))
        self.weights_1_2 = np.random.normal(0.0, self.output_nodes**-0.5,
                                            (self.output_nodes, self.hidden_nodes))
        self.learning_rate = learning_rate
        self.layer_0 = np.zeros((input_nodes, 1))
    def update_input_layer(self, review):
        '''
        Convert a review to numbers and store the result in self.layer_0,
        the input layer.
        '''
        # clear out previous state, reset the layer to be all 0s
        self.layer_0 *= 0
        for word in review:
            # words outside the training vocabulary are skipped
            if word in self.word2index:
                self.layer_0[self.word2index[word]][0] = 1
    # def get_target_for_label(self,label):
    #     if(label == 'POSITIVE'):
    #         return 1
    #     else:
    #         return 0
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))
    def sigmoid_output_2_derivative(self, output):
        return output * (1 - output)
    def train(self, training_reviews, training_labels):
        assert(len(training_reviews) == len(training_labels))
        correct_so_far = 0
        start = time.time()
        for i in range(len(training_reviews)):
            review = training_reviews[i]
            label = training_labels[i]
            ### Forward pass ###
            # Input layer
            self.update_input_layer(review)
            layer_0 = self.layer_0
            # Hidden layer (no activation function)
            layer_1 = self.weights_0_1.dot(self.layer_0)
            # Output layer
            layer_2 = self.sigmoid(self.weights_1_2.dot(layer_1))
            ### Backward pass ###
            # Output error: difference between the actual output and the target
            layer_2_error = layer_2 - label
            layer_2_delta = layer_2_error * self.sigmoid_output_2_derivative(layer_2)
            # Backpropagated error
            layer_1_error = self.weights_1_2.T.dot(layer_2_delta)  # errors propagated to the hidden layer
            layer_1_delta = layer_1_error  # hidden layer gradients - no nonlinearity so it's the same as the error
            # Update the weights
            self.weights_1_2 -= layer_2_delta.dot(layer_1.T) * self.learning_rate  # update hidden-to-output weights with gradient descent step
            self.weights_0_1 -= layer_1_delta.dot(layer_0.T) * self.learning_rate  # update input-to-hidden weights with gradient descent step
            if(np.abs(layer_2_error) < 0.5):
                correct_so_far += 1
            reviews_per_second = i / float(time.time() - start)
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(training_reviews)))[:4]
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5]
                             + " #Correct:" + str(correct_so_far)
                             + " #Trained:" + str(i+1)
                             + " Training Accuracy:" + str(correct_so_far * 100 / float(i+1))[:4]
                             + "%")
            if(i % 2500 == 0):
                print("")
    def test(self, testing_reviews, testing_labels):
        correct = 0
        start = time.time()
        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            if(pred == testing_labels[i]):
                correct += 1
            reviews_per_second = i / float(time.time() - start)
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(testing_reviews)))[:4]
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5]
                             + "% #Correct:" + str(correct)
                             + " #Tested:" + str(i+1)
                             + " Testing Accuracy:" + str(correct * 100 / float(i+1))[:4]
                             + "%")
    def run(self, review):
        # Input layer
        # print(review)
        self.update_input_layer(review)
        # print(self.layer_0.shape)
        # print(self.weights_0_1.shape)
        # print(np.dot(self.weights_0_1,self.layer_0))
        # Hidden layer
        layer_1 = self.weights_0_1.dot(self.layer_0)
        # Output layer
        layer_2 = self.sigmoid(self.weights_1_2.dot(layer_1))
        # print(layer_2)  # debugging note: the output was stuck at 0.5
        if(layer_2[0] > 0.5):
            return 1
        else:
            return 0
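The weight updates in train can be read as ordinary gradient descent. The derivation below is only a sketch. It assumes the loss is the squared error on a single review, which the code never states but which matches the updates it performs. The symbols are shorthand introduced here: x for layer_0, h for layer_1, \hat{y} for layer_2, y for label, \eta for learning_rate, and W_{01}, W_{12} for weights_0_1 and weights_1_2.

```latex
\begin{aligned}
h &= W_{01}\,x, \qquad \hat{y} = \sigma(W_{12}\,h), \qquad E = \tfrac{1}{2}\,(\hat{y} - y)^2 \\
\delta_2 &= \frac{\partial E}{\partial (W_{12}h)} = (\hat{y} - y)\,\hat{y}\,(1 - \hat{y}) \\
\delta_1 &= \frac{\partial E}{\partial h} = W_{12}^{\top}\,\delta_2 \\
W_{12} &\leftarrow W_{12} - \eta\,\delta_2\,h^{\top} \\
W_{01} &\leftarrow W_{01} - \eta\,\delta_1\,x^{\top}
\end{aligned}
```

Because the hidden layer has no activation function, \delta_1 is just the backpropagated error, which is why the code sets layer_1_delta equal to layer_1_error.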
reviews = pn['words'].values
labels = pn['mark'].values
# train on everything except the last 1000 reviews
mlp = SentimentNetwork(reviews[:-1000], labels[:-1000], learning_rate=0.01)
mlp.train(reviews[:-1000],labels[:-1000])
Progress:0.0% Speed(reviews/sec):0.0 #Correct:0 #Trained:1 Training Accuracy:0.0%
Progress:12.4% Speed(reviews/sec):91.76 #Correct:1911 #Trained:2501 Training Accuracy:76.4%
Progress:24.8% Speed(reviews/sec):103.7 #Correct:3993 #Trained:5001 Training Accuracy:79.8%
Progress:37.3% Speed(reviews/sec):108.4 #Correct:6117 #Trained:7501 Training Accuracy:81.5%
Progress:49.7% Speed(reviews/sec):110.7 #Correct:8285 #Trained:10001 Training Accuracy:82.8%
Progress:62.1% Speed(reviews/sec):112.5 #Correct:10450 #Trained:12501 Training Accuracy:83.5%
Progress:74.6% Speed(reviews/sec):113.0 #Correct:12654 #Trained:15001 Training Accuracy:84.3%
Progress:87.0% Speed(reviews/sec):113.4 #Correct:14849 #Trained:17501 Training Accuracy:84.8%
Progress:99.4% Speed(reviews/sec):113.0 #Correct:17055 #Trained:20001 Training Accuracy:85.2%
Progress:99.9% Speed(reviews/sec):113.0 #Correct:17147 #Trained:20105 Training Accuracy:85.2%
mlp.test(reviews[-1000:],labels[-1000:])
Progress:15.2% Speed(reviews/sec):745.0% #Correct:136 #Tested:153 Testing Accuracy:88.8%
Progress:32.6% Speed(reviews/sec):806.8% #Correct:292 #Tested:327 Testing Accuracy:89.2%
Progress:52.3% Speed(reviews/sec):862.9% #Correct:461 #Tested:524 Testing Accuracy:87.9%
Progress:70.9% Speed(reviews/sec):877.4% #Correct:628 #Tested:710 Testing Accuracy:88.4%
Progress:88.2% Speed(reviews/sec):874.0% #Correct:782 #Tested:883 Testing Accuracy:88.5%
Progress:99.9% Speed(reviews/sec):876.2% #Correct:886 #Tested:1000 Testing Accuracy:88.6%
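One detail in the class is easy to miss. The standalone update_input_layer shown earlier adds 1 for every occurrence of a word, whereas the method inside SentimentNetwork sets the entry to 1 whenever the word is present and skips words that are not in the vocabulary. The sketch below contrasts the two encodings on a made-up three-word vocabulary. The names `encode_counts` and `encode_binary` are hypothetical and exist only for this illustration.

```python
import numpy as np

# Hypothetical three-word vocabulary and one tokenized review.
word2index = {"好": 0, "很": 1, "差": 2}
review = ["很", "很", "好", "未知"]  # "未知" is not in the vocabulary

def encode_counts(tokens):
    """Count encoding: each entry holds how often the word occurs."""
    vec = np.zeros((len(word2index), 1))
    for word in tokens:
        if word in word2index:
            vec[word2index[word]][0] += 1
    return vec

def encode_binary(tokens):
    """Binary encoding: each entry is 1 if the word occurs at all."""
    vec = np.zeros((len(word2index), 1))
    for word in tokens:
        if word in word2index:
            vec[word2index[word]][0] = 1
    return vec

print(encode_counts(review).ravel())
print(encode_binary(review).ravel())
```

With binary encoding, very frequent tokens such as punctuation and 的 contribute no more to the hidden layer than any other word that is present, which is one plausible reason to prefer it here.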
Summary
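That is everything for this installment on sentiment analysis. To recap:
1. We started by observing that words occur with different frequencies in reviews of different polarity, and concluded that the word counts of a segmented passage can be used to judge its overall sentiment.
2. We then encoded the segmented words so that each passage becomes a vector.
3. Next we built the neural network (the usual routine).
4. We then looked at how to make the computation faster, concluding that very low-frequency words can be dropped, along with less representative words that appear in both positive and negative reviews.
5. Finally we examined what the trained network's weights mean and found that words can be grouped by their weights, so that words expressing the same sentiment naturally cluster together.
One limitation of the approach above is that the input is only a simple encoding of the words. It ignores word order, and it ignores the fact that different words can carry the same meaning. The next article will use an RNN and word2vec to improve on this.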
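Step 4 of the recap mentions dropping very rare words and words that appear about equally in both classes, but the code for it is not shown in this article. The following is only a sketch of how such a filter could look. The function name `build_vocab` and the thresholds `min_count` and `polarity_cutoff` are assumptions made for illustration; the score reuses the signed log-ratio from earlier in the article.

```python
from collections import Counter
import math

def build_vocab(tokenized_reviews, labels, min_count=2, polarity_cutoff=0.5):
    """Keep words seen at least min_count times whose |log-ratio| >= polarity_cutoff."""
    pos, neg, total = Counter(), Counter(), Counter()
    for tokens, label in zip(tokenized_reviews, labels):
        (pos if label == 1 else neg).update(tokens)
        total.update(tokens)
    vocab = set()
    for word, cnt in total.items():
        if cnt < min_count:
            continue  # too rare to be informative
        ratio = pos[word] / float(neg[word] + 1)
        score = math.log(ratio) if ratio > 1 else -math.log(1 / (ratio + 0.01))
        if abs(score) >= polarity_cutoff:
            vocab.add(word)  # clearly leans toward one class
    return vocab

# Hypothetical toy data: 1 = positive, 0 = negative.
reviews = [
    ["很", "好", "喜欢"],
    ["很", "喜欢", "喜欢"],
    ["很", "差", "退货"],
    ["很", "退货", "退货", "罕见"],
]
labels = [1, 1, 0, 0]
vocab = build_vocab(reviews, labels)
print(sorted(vocab))
```

On this toy data the filter drops 很, which occurs equally often in both classes, along with every word seen only once. A smaller vocabulary means a smaller input layer and fewer weights to update per review.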