【Deep Learning】Recurrent Neural Networks (RNN): Derivation and Implementation
2017-07-17 20:45
This post is based mainly on the wildml blog. All of the code is plain Python, with no deep learning framework involved, so it is very helpful for understanding how an RNN actually works.
1. Language Model
If a sentence consists of m words, the probability of generating the sentence is:

P(w_1, w_2, …, w_m) = ∏_{i=1}^{m} P(w_i | w_1, …, w_{i−1})
That is, we assume the probability of generating the next word depends only on the words that come before it in the sentence. For example, for "How are you", the generation probability can be written as:
P(How are you) = P(How) · P(are | How) · P(you | How, are)
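As a tiny worked example (the probabilities below are made up, purely for illustration), the chain rule above multiplies out directly:

# Hypothetical conditional probabilities for "How are you"
p_how = 0.05                 # P(How)
p_are_given_how = 0.30       # P(are | How)
p_you_given_how_are = 0.60   # P(you | How, are)
print("P(How are you) = %f" % (p_how * p_are_given_how * p_you_given_how_are))  # 0.009000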
2. Data Preprocessing
Preprocessing the corpus removes low-frequency words in order to keep the vocabulary size under control. Here we keep the 8000 most frequent words and replace every other word with a single placeholder token (UNKNOWN_TOKEN); after preprocessing, each word is assigned an index. To let the model learn which words tend to begin and end sentences, we also introduce two special tokens, SENTENCE_START and SENTENCE_END. The code:
import csv
import itertools
import nltk
import numpy as np

vocabulary_size = 8000
unknown_token = "UNKNOWN_TOKEN"
sentence_start_token = "SENTENCE_START"
sentence_end_token = "SENTENCE_END"

# Read the data and append SENTENCE_START and SENTENCE_END tokens
print "Reading CSV file..."
with open('data/reddit-comments-2015-08.csv', 'rb') as f:
    reader = csv.reader(f, skipinitialspace=True)
    reader.next()
    # Split full comments into sentences
    sentences = itertools.chain(*[nltk.sent_tokenize(x[0].decode('utf-8').lower()) for x in reader])
    # Append SENTENCE_START and SENTENCE_END
    sentences = ["%s %s %s" % (sentence_start_token, x, sentence_end_token) for x in sentences]
print "Parsed %d sentences." % (len(sentences))

# Tokenize the sentences into words
tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]

# Count the word frequencies
word_freq = nltk.FreqDist(itertools.chain(*tokenized_sentences))
print "Found %d unique word tokens." % len(word_freq.items())

# Get the most common words and build index_to_word and word_to_index vectors
vocab = word_freq.most_common(vocabulary_size-1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = dict([(w,i) for i,w in enumerate(index_to_word)])

print "Using vocabulary size %d." % vocabulary_size
print "The least frequent word in our vocabulary is '%s' and appeared %d times." % (vocab[-1][0], vocab[-1][1])

# Replace all words not in our vocabulary with the unknown token
for i, sent in enumerate(tokenized_sentences):
    tokenized_sentences[i] = [w if w in word_to_index else unknown_token for w in sent]

print "\nExample sentence: '%s'" % sentences[0]
print "\nExample sentence after Pre-processing: '%s'" % tokenized_sentences[0]

# Create the training data
X_train = np.asarray([[word_to_index[w] for w in sent[:-1]] for sent in tokenized_sentences])
y_train = np.asarray([[word_to_index[w] for w in sent[1:]] for sent in tokenized_sentences])

Here's an actual training example from our text:

x: SENTENCE_START what are n't you understanding about this ? !
   [0, 51, 27, 16, 10, 856, 53, 25, 34, 69]
y: what are n't you understanding about this ? ! SENTENCE_END
   [51, 27, 16, 10, 856, 53, 25, 34, 69, 1]
3. Network Structure
The structure of a recurrent neural network is shown below.
[Figure: an RNN, unrolled over time steps]
An RNN has a notion of state. In the figure, t denotes the time step, x_t is the input at step t, s_t is the hidden-layer output at step t, and o_t is the output. The distinctive part is that the hidden layer receives two inputs: the current input x_t and the previous step's hidden output s_{t−1}. W, U, and V are the parameters. As formulas, the structure above is:

s_t = tanh(U·x_t + W·s_{t−1})
o_t = softmax(V·s_t)
There are many ways to initialize the parameters. Setting everything to 0 leads to symmetric calculations (every hidden unit computes the same function). The right choice actually depends on the activation function; for the tanh we use here, a recommended scheme is to initialize uniformly in [−1/√n, 1/√n], where n is the number of incoming connections from the previous layer.
class RNNNumpy:

    def __init__(self, word_dim, hidden_dim=100, bptt_truncate=4):
        # Assign instance variables
        self.word_dim = word_dim
        self.hidden_dim = hidden_dim
        self.bptt_truncate = bptt_truncate
        # Randomly initialize the network parameters
        self.U = np.random.uniform(-np.sqrt(1./word_dim), np.sqrt(1./word_dim), (hidden_dim, word_dim))
        self.V = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (word_dim, hidden_dim))
        self.W = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (hidden_dim, hidden_dim))
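To confirm the initialization matches the recipe above, here is a quick check (my own sketch; vocabulary_size = 8000 comes from the preprocessing step):

np.random.seed(10)
model = RNNNumpy(vocabulary_size)
print(model.U.shape)  # (100, 8000), i.e. (hidden_dim, word_dim)
# Every entry of U lies within [-1/sqrt(n), 1/sqrt(n)] with n = word_dim
print(abs(model.U).max() <= 1. / np.sqrt(model.word_dim))  # True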
4. Forward Propagation
The forward-propagation code is as follows:
def forward_propagation(self, x):
    # The total number of time steps
    T = len(x)
    # During forward propagation we save all hidden states in s because we need them later.
    # We add one additional element for the initial hidden state, which we set to 0.
    s = np.zeros((T + 1, self.hidden_dim))
    s[-1] = np.zeros(self.hidden_dim)
    # The outputs at each time step. Again, we save them for later.
    o = np.zeros((T, self.word_dim))
    # For each time step...
    for t in np.arange(T):
        # Note that we are indexing U by x[t]. This is the same as multiplying U with a one-hot vector.
        s[t] = np.tanh(self.U[:,x[t]] + self.W.dot(s[t-1]))
        o[t] = softmax(self.V.dot(s[t]))
    return [o, s]
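The softmax used above is not defined in this excerpt; a minimal, numerically stable version (my own sketch, not code from the referenced tutorial) would be:

def softmax(z):
    # Shift by the maximum for numerical stability; this does not change the result
    e = np.exp(z - np.max(z))
    return e / np.sum(e)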
The prediction function runs forward propagation and returns, for each position, the index with the highest score:

def predict(self, x):
    # Perform forward propagation and return index of the highest score
    o, s = self.forward_propagation(x)
    return np.argmax(o, axis=1)
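As a quick sanity check, we can run the untrained model on one training sentence (reusing the model instantiated above; X_train comes from the preprocessing step):

o, s = model.forward_propagation(X_train[10])
print(o.shape)          # (sentence_length, 8000): one distribution per word position
predictions = model.predict(X_train[10])
print(predictions[:5])  # the first few predicted word indices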
5. Loss Function
We use cross-entropy as the loss function. With N training examples (words in the text), the loss can be written as:

L(y, ŷ) = −(1/N) · Σ_{n∈N} y_n · log(ŷ_n)
The loss-computation code:
def calculate_total_loss(self, x, y):
    L = 0
    # For each sentence...
    for i in np.arange(len(y)):
        o, s = self.forward_propagation(x[i])
        # We only care about our prediction of the "correct" words
        correct_word_predictions = o[np.arange(len(y[i])), y[i]]
        # Add to the loss based on how off we were
        L += -1 * np.sum(np.log(correct_word_predictions))
    return L

def calculate_loss(self, x, y):
    # Divide the total loss by the number of training examples
    N = np.sum([len(y_i) for y_i in y])
    return self.calculate_total_loss(x, y) / N
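A useful sanity check before training: with random parameters, each of the C = 8000 words is predicted with probability roughly 1/C, so the expected loss is about log C ≈ 8.99, and the untrained model's loss should be close to that value (a sketch, limited to 100 examples to keep it cheap):

print(np.log(vocabulary_size))  # ~8.987: expected loss for random predictions
print(model.calculate_loss(X_train[:100], y_train[:100]))  # should be close to the above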
6. Learning the Parameters with BPTT
BPTT (Backpropagation Through Time) is quite intuitive: it works like ordinary backpropagation, except that the propagation path contains a loop and the parameters along the path are shared. The loss is the cross-entropy; the loss at step t and the total loss are:

L_t(y_t, ŷ_t) = −y_t · log(ŷ_t)
L(y, ŷ) = Σ_t L_t(y_t, ŷ_t) = −Σ_t y_t · log(ŷ_t)
where y_t is the true value and ŷ_t is the prediction. Unrolled over time, the error at each output flows back through all earlier time steps, which can be drawn as a graph. One derivation step worth spelling out: because the outputs go through a softmax with a cross-entropy loss, the gradient with respect to the pre-softmax scores z_t is ∂L_t/∂z_t = ŷ_t − y_t; since y_t is one-hot, the code below implements this by copying o and subtracting 1 at the index of the correct word.
The BPTT gradient code:
def bptt(self, x, y):
    T = len(y)
    # Perform forward propagation
    o, s = self.forward_propagation(x)
    # We accumulate the gradients in these variables
    dLdU = np.zeros(self.U.shape)
    dLdV = np.zeros(self.V.shape)
    dLdW = np.zeros(self.W.shape)
    delta_o = o
    delta_o[np.arange(len(y)), y] -= 1.
    # For each output backwards...
    for t in np.arange(T)[::-1]:
        dLdV += np.outer(delta_o[t], s[t].T)
        # Initial delta calculation: dL/dz
        delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2))
        # Backpropagation through time (for at most self.bptt_truncate steps)
        for bptt_step in np.arange(max(0, t-self.bptt_truncate), t+1)[::-1]:
            # print "Backpropagation step t=%d bptt step=%d " % (t, bptt_step)
            # Add to gradients at each previous step
            dLdW += np.outer(delta_t, s[bptt_step-1])
            dLdU[:,x[bptt_step]] += delta_t
            # Update delta for next step dL/dz at t-1
            delta_t = self.W.T.dot(delta_t) * (1 - s[bptt_step-1] ** 2)
    return [dLdU, dLdV, dLdW]
7. The Vanishing Gradient Problem
The value ranges of tanh and sigmoid and of their derivatives are shown below.
[Figure: tanh and sigmoid curves together with their derivatives]
The derivatives take values in [0, 1], so a few applications of the chain rule shrink the gradient exponentially; after only a few steps of propagation the gradient becomes extremely weak. LSTM, which overcomes this problem, has recently become a popular solution.
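A tiny numeric illustration (my own sketch): repeatedly multiplying tanh-derivative factors that are below 1 drives the gradient toward zero exponentially fast.

a = 0.8              # a typical hidden activation value
factor = 1 - a ** 2  # tanh'(z) expressed via the activation: 0.36
for k in [1, 5, 10, 20]:
    print("after %2d steps: %.2e" % (k, factor ** k))
# After 20 steps the factor is ~1.3e-9: the gradient has effectively vanished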
8. Gradient Checking
Gradient checking is extremely useful. The idea is that the gradient at a point equals the slope at that point, and the slope can be estimated numerically as a limit:

∂L/∂θ = lim_{h→0} (L(θ + h) − L(θ − h)) / (2h)
By comparing the estimated slope with the computed gradient, we can tell whether the gradient computation is correct. Note that this check is expensive, since we have on the order of a million parameters. The gradient-checking code:
import operator  # operator.attrgetter is used below

def gradient_check(self, x, y, h=0.001, error_threshold=0.01):
    # Calculate the gradients using backpropagation. We want to check if these are correct.
    bptt_gradients = self.bptt(x, y)
    # List of all parameters we want to check.
    model_parameters = ['U', 'V', 'W']
    # Gradient check for each parameter
    for pidx, pname in enumerate(model_parameters):
        # Get the actual parameter value from the model, e.g. model.W
        parameter = operator.attrgetter(pname)(self)
        print "Performing gradient check for parameter %s with size %d." % (pname, np.prod(parameter.shape))
        # Iterate over each element of the parameter matrix, e.g. (0,0), (0,1), ...
        it = np.nditer(parameter, flags=['multi_index'], op_flags=['readwrite'])
        while not it.finished:
            ix = it.multi_index
            # Save the original value so we can reset it later
            original_value = parameter[ix]
            # Estimate the gradient using (f(x+h) - f(x-h))/(2*h)
            parameter[ix] = original_value + h
            gradplus = self.calculate_total_loss([x],[y])
            parameter[ix] = original_value - h
            gradminus = self.calculate_total_loss([x],[y])
            estimated_gradient = (gradplus - gradminus)/(2*h)
            # Reset parameter to original value
            parameter[ix] = original_value
            # The gradient for this parameter calculated using backpropagation
            backprop_gradient = bptt_gradients[pidx][ix]
            # Calculate the relative error: (|x - y|/(|x| + |y|))
            relative_error = np.abs(backprop_gradient - estimated_gradient)/(np.abs(backprop_gradient) + np.abs(estimated_gradient))
            # If the error is too large fail the gradient check
            if relative_error > error_threshold:
                print "Gradient Check ERROR: parameter=%s ix=%s" % (pname, ix)
                print "+h Loss: %f" % gradplus
                print "-h Loss: %f" % gradminus
                print "Estimated_gradient: %f" % estimated_gradient
                print "Backpropagation gradient: %f" % backprop_gradient
                print "Relative Error: %f" % relative_error
                return
            it.iternext()
        print "Gradient check for parameter %s passed." % (pname)
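Because the full model is far too large to check exhaustively, the original wildml post runs the check on a much smaller model, which we follow here:

# Use a tiny vocabulary so the check stays affordable; a large bptt_truncate
# makes the backprop gradients exact rather than truncated
grad_check_vocab_size = 100
np.random.seed(10)
model = RNNNumpy(grad_check_vocab_size, 10, bptt_truncate=1000)
model.gradient_check([0, 1, 2, 3], [1, 2, 3, 4])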
9. SGD Implementation
The update rule is W = W − λ·ΔW, where ΔW is the gradient and λ the learning rate. The code:
# Performs one step of SGD.
def numpy_sgd_step(self, x, y, learning_rate):
    # Calculate the gradients
    dLdU, dLdV, dLdW = self.bptt(x, y)
    # Change parameters according to gradients and learning rate
    self.U -= learning_rate * dLdU
    self.V -= learning_rate * dLdV
    self.W -= learning_rate * dLdW
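The outer training loop is not included in this excerpt; a minimal sketch (the learning-rate value is hypothetical, and the full-vocabulary model is re-created since the gradient check above replaced it) might look like:

learning_rate = 0.005  # hypothetical value; tune for your data
model = RNNNumpy(vocabulary_size)
for epoch in range(10):
    # One SGD pass over the whole training set
    for i in range(len(y_train)):
        model.numpy_sgd_step(X_train[i], y_train[i], learning_rate)
    print("epoch %d: loss %f" % (epoch, model.calculate_loss(X_train, y_train)))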
10. Text Generation
Generation is simply the model being applied: we just run the prediction step repeatedly:
def generate_sentence(model):
    # We start the sentence with the start token
    new_sentence = [word_to_index[sentence_start_token]]
    # Repeat until we get an end token
    while not new_sentence[-1] == word_to_index[sentence_end_token]:
        # Forward propagation returns [o, s]; the next-word distribution is the last row of o
        o, s = model.forward_propagation(new_sentence)
        next_word_probs = o[-1]
        sampled_word = word_to_index[unknown_token]
        # We don't want to sample unknown words
        while sampled_word == word_to_index[unknown_token]:
            samples = np.random.multinomial(1, next_word_probs)
            sampled_word = np.argmax(samples)
        new_sentence.append(sampled_word)
    sentence_str = [index_to_word[x] for x in new_sentence[1:-1]]
    return sentence_str

num_sentences = 10
senten_min_length = 7

for i in range(num_sentences):
    sent = []
    # We want long sentences, not sentences with one or two words
    while len(sent) < senten_min_length:
        sent = generate_sentence(model)
    print " ".join(sent)
References:
[1] http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/