
Neural Networks and Deep Learning: Reading Notes

2018-02-12 23:23

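Preface

It has been a long time since I read a book carefully, and I have never really written reading notes. When I came across Michael Nielsen's "Neural Networks and Deep Learning" I liked it very much, and it made me want to read a book properly and write a proper set of reading notes.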

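Using Neural Networks to Recognize Handwritten Digits

Handwritten digit recognition is a classic introductory example in machine learning. I have seen it many times, implemented in many different ways. Michael Nielsen also uses it as the running example of the whole book, introducing neural networks and deep learning step by step.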

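Perceptrons

In 1943, Warren McCulloch and Walter Pitts published the paper "A Logical Calculus of the Ideas Immanent in Nervous Activity", which first proposed the MP model of the neuron (Figure 1). Drawing on what was then known about the biology of nerve cells, the model describes a neuron as a logic device. The MP model is a mathematical model of a neuron: it forms a weighted sum of all input signals and compares the sum with a threshold to decide the neuron's output. The MP model showed in principle that artificial neural networks can compute any arithmetic or logical function, and it laid the foundation for later neural network research. Influenced by McCulloch and Pitts, Frank Rosenblatt built on the MP model in 1958 and designed an artificial neuron model named the "Perceptron", which aims to optimize a separating hyperplane by minimizing a misclassification loss, so that new instances can be predicted accurately. The perceptron neuron model is shown in Figure 2.

A perceptron with three inputs $x_1$, $x_2$, $x_3$ assigns each input a corresponding weight $w_1$, $w_2$, $w_3$ that expresses how important that input is to the output. The neuron's output, 0 or 1, depends on whether the weighted sum $\sum_j w_j x_j$ is below or above a ***threshold***, as written in Equation 1-1.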

$output = \left\{\begin{matrix} 0 & \text{if } \sum_{j}w_{j}x_{j} \leq threshold \\ 1 & \text{if } \sum_{j}w_{j}x_{j} > threshold \end{matrix}\right.$

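This description of the perceptron is somewhat cumbersome and can be simplified: write the sum as a dot product $w \cdot x = \sum_j w_j x_j$, move the threshold to the same side of the inequality as the sum, and replace it with the bias $b = -threshold$. The perceptron model can then be written as:

$output = \left\{\begin{matrix} 0 & \text{if } w \cdot x + b \leq 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{matrix}\right.$

The weight $w_i$ expresses the importance of the corresponding input $x_i$. The bias $b$ measures how strongly the result leans toward 1: the larger $b$ is, the more easily the output becomes 1.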
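The bias form of the perceptron can be sketched in a few lines of Python. This is only an illustration: the function name `perceptron` is made up for this note, and the NAND weights (-2, -2) with bias 3 are example values chosen here, not something taken from the notes above.

```python
def perceptron(weights, bias, inputs):
    """Return 1 if w . x + b > 0, otherwise 0."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z > 0 else 0


# Illustrative values (an assumption of this sketch): with weights -2, -2
# and bias 3 the perceptron behaves like a NAND gate.
nand_weights, nand_bias = [-2, -2], 3
nand_table = {
    (a, b): perceptron(nand_weights, nand_bias, (a, b))
    for a in (0, 1)
    for b in (0, 1)
}
print(nand_table)
```

Raising the bias makes every entry of the table more likely to be 1, which is the "leaning toward 1" described above.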

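Rosenblatt introduced a perceptron learning rule. Whenever the perceptron's output differs from the correct (labelled) output of a training example: (1) if the output should be 0 but is actually 1, decrease the weights of the inputs whose value is 1; (2) if the output should be 1 but is actually 0, increase the weights of the inputs whose value is 1.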
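The learning rule can be sketched as below. This is a minimal sketch under stated assumptions rather than Rosenblatt's exact procedure: the learning rate `lr`, the number of epochs, the bias update (treating the bias as a weight on a constant input of 1) and the AND training set are all choices made for this illustration.

```python
def predict(weights, bias, x):
    """Perceptron output: 1 if w . x + b > 0, otherwise 0."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z > 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    """Apply the error-driven update to every sample, epoch after epoch."""
    n_inputs = len(samples[0][0])
    weights, bias = [0.0] * n_inputs, 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)  # -1, 0 or +1
            # error = -1 (output 1, should be 0): lower weights of inputs equal to 1.
            # error = +1 (output 0, should be 1): raise weights of inputs equal to 1.
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error  # assumption: bias is updated like a weight on input 1
    return weights, bias


# Hypothetical training set: the logical AND of two binary inputs.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
and_weights, and_bias = train_perceptron(and_samples)
and_predictions = [predict(and_weights, and_bias, x) for x, _ in and_samples]
print(and_weights, and_bias, and_predictions)
```

Inputs equal to 0 contribute nothing to the update, which is why the rule only touches the weights of inputs whose value is 1.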

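This leads to a problem: when $w \cdot x + b$ is close to 0, a tiny change in an input $x_i$ can flip the output.

In a network of perceptrons, the first column is the input layer. The second layer is a layer of perceptrons that receives the inputs and makes simple decisions by weighing them. The perceptrons in the next layer take the outputs of the previous layer as their inputs and make more abstract decisions according to their weights. The last layer is the output layer. Every perceptron has only one output; that output can simply serve as an input to several neurons in the next layer, so in a network diagram all arrows leaving the same neuron carry the same output.

Perceptrons are exciting and disappointing at once. A perceptron can learn, but its learning ability is too weak: a single perceptron can only solve linearly separable problems and cannot solve even the most basic nonlinear problem, XOR.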
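The XOR limitation can be illustrated with a brute-force search. The sketch below tries every combination of two weights and a bias on a coarse grid (the grid itself is an arbitrary choice made here) and checks whether any single perceptron reproduces a given truth table. A grid search is not a proof, only a quick sanity check; the helper names are hypothetical.

```python
from itertools import product


def perceptron(w1, w2, b, x1, x2):
    """Single two-input perceptron with a hard threshold at 0."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0


def representable(truth_table, grid):
    """True if some (w1, w2, b) on the grid reproduces the whole truth table."""
    for w1, w2, b in product(grid, repeat=3):
        if all(perceptron(w1, w2, b, x1, x2) == y
               for (x1, x2), y in truth_table.items()):
            return True
    return False


grid = [k / 2 for k in range(-10, 11)]  # -5.0, -4.5, ..., 5.0
AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

and_ok = representable(AND, grid)
xor_ok = representable(XOR, grid)
print("AND representable:", and_ok)
print("XOR representable:", xor_ok)
```

The search finds weights for AND but none for XOR, which matches the fact that no single line separates the XOR outputs in the input plane.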

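Sigmoid Neurons

A learning algorithm sounds great, but how do we design one for a neural network, and would a network of perceptrons do? In a perceptron network, a small change to a weight or bias can flip a perceptron's output completely, which makes the behaviour of the rest of the network unpredictable. It is therefore hard to train such a network by gradually adjusting its weights and biases. The sigmoid neuron was introduced for this reason: compared with a perceptron, a small change in its weights and bias causes only a small change in its output.

Like a perceptron, a sigmoid neuron has several inputs with corresponding weights, but its output is $\sigma(w \cdot x + b)$, where $\sigma$ is the sigmoid function (also referred to as the activation function), defined as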

$\sigma(z) \equiv \frac{1}{1+e^{-z}}$

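For a neuron with inputs $x_j$, weights $w_j$ and bias $b$, the output can be written out explicitly, which is easier to read:

$\sigma(w \cdot x + b) = \frac{1}{1+\exp(-\sum_j w_j x_j - b)}$

Sigmoid neurons look very different from perceptrons, but the two are in fact similar. When $z = w \cdot x + b$ is a large positive number, the output is close to 1, just as for a perceptron; when $z = w \cdot x + b$ is a very negative number, the output is close to 0. A sigmoid neuron can therefore be seen as a smoothed version of a perceptron.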
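A small numerical check makes the comparison concrete. This is an illustrative sketch; the function names and the sample values of $z$ are choices made here.

```python
import math


def sigmoid(z):
    """The sigmoid function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_neuron(weights, bias, inputs):
    """Output of a sigmoid neuron: sigma(w . x + b)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


def step(z):
    """Perceptron activation for comparison."""
    return 1 if z > 0 else 0


# Far from zero the two activations nearly agree; near zero they differ.
for z in (-20.0, -1.0, 0.0, 1.0, 20.0):
    print(f"z={z:6.1f}  step={step(z)}  sigmoid={sigmoid(z):.6f}")
```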

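The activation function of a perceptron is a step function, shown in Figure 3.

The shape of the sigmoid function is shown in Figure 4. The key point is that small changes in the weights and the bias, $\Delta w_j$ and $\Delta b$, produce only a small change $\Delta output$ in the output, as written in Equation 1-4.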

$\Delta output \approx \sum_j \frac{\partial output}{\partial w_j}\Delta w_j + \frac{\partial output}{\partial b}\Delta b$

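In fact, the exact form of $\sigma$ is not important; what matters is its shape. Choosing a different $\sigma$ only changes the values of the partial derivatives. Unlike a perceptron, a sigmoid neuron can output not only 0 and 1 but any number between 0 and 1. The two also behave differently when $w \cdot x + b = 0$: the sigmoid neuron outputs 0.5, while the perceptron outputs 0.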

Exercises

Problem 1

Sigmoid neurons simulating perceptrons, part I

Suppose we take all the weights and biases in a network of perceptrons, and multiply them by a positive constant, c>0. Show that the behaviour of the network doesn’t change.

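The weighted input of each perceptron is $w \cdot x + b$. Multiplying the weights and the bias by a positive constant $c > 0$ gives $c(w \cdot x + b)$, which has the same sign, so the perceptron's output does not change. Since the outputs of the first layer are unchanged, the inputs to the next layer are unchanged too, and the same argument applies layer by layer. The behaviour of the whole perceptron network therefore does not change.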
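A quick check of the scaling argument on a single perceptron, using made-up weights, bias and inputs (all values are assumptions of this sketch):

```python
def perceptron(weights, bias, inputs):
    """Return 1 if w . x + b > 0, otherwise 0."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z > 0 else 0


# Made-up example values.
weights, bias = [0.7, -1.2, 0.4], 0.3
inputs_list = [(0, 0, 0), (1, 0, 1), (0, 1, 0), (1, 1, 1)]

baseline = [perceptron(weights, bias, x) for x in inputs_list]

# Scale weights and bias by several positive constants c and recompute.
scaled = {
    c: [perceptron([c * w for w in weights], c * bias, x) for x in inputs_list]
    for c in (0.01, 3, 1000)
}

print("baseline:", baseline)
for c, outputs in scaled.items():
    print(f"c={c}:", outputs)
```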

Problem 2

Sigmoid neurons simulating perceptrons, part II

Suppose we have the same setup as the last problem - a network of perceptrons. Suppose also that the overall input to the network of perceptrons has been chosen. We won’t need the actual input value, we just need the input to have been fixed. Suppose the weights and biases are such that w⋅x+b≠0 for the input x to any particular perceptron in the network. Now replace all the perceptrons in the network by sigmoid neurons, and multiply the weights and biases by a positive constant c>0. Show that in the limit as c→∞ the behaviour of this network of sigmoid neurons is exactly the same as the network of perceptrons. How can this fail when w⋅x+b=0 for one of the perceptrons?

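Let the weighted input be $z = w \cdot x + b$ and the activation function be $\sigma(z) = \frac{1}{1+e^{-z}}$. Multiplying the weights and the bias by a positive constant $c > 0$ gives $z' = cz = c(w \cdot x + b)$. As $c \to \infty$: if $z$ is positive, however small, then $z' \to \infty$ and the output tends to 1; if $z$ is negative, however small in magnitude, then $z' \to -\infty$ and the output tends to 0. Both cases match the perceptron. But if $z = w \cdot x + b = 0$, then $z' = 0$ for every $c$ and the output stays at 0.5, which does not match the perceptron's output of 0.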
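The limit can be observed numerically. In this sketch the values of $z$ and $c$ are arbitrary illustrative choices, kept small enough that `math.exp` does not overflow.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def step(z):
    """Perceptron output for weighted input z."""
    return 1 if z > 0 else 0


scales = (1, 10, 100, 1000)

# For each fixed z, record sigma(c * z) as c grows.
limits = {z: [sigmoid(c * z) for c in scales] for z in (0.3, -0.3, 0.0)}

for z, values in limits.items():
    formatted = ", ".join(f"{v:.6f}" for v in values)
    print(f"z={z:5.1f}  perceptron={step(z)}  sigmoid(c*z) for c in {scales}: {formatted}")
```

For $z = \pm 0.3$ the sigmoid outputs converge to the perceptron outputs, while for $z = 0$ they remain 0.5 regardless of $c$.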

Problem 3

There is a way of determining the bitwise representation of a digit by adding an extra layer to the three-layer network above. The extra layer converts the output from the previous layer into a binary representation, as illustrated in the figure below. Find a set of weights and biases for the new output layer. Assume that the first 3 layers of neurons are such that the correct output in the third layer (i.e., the old output layer) has activation at least 0.99, and incorrect outputs have activation less than 0.01.

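I found it hard to design the weights and biases of the new layer by hand. The new layer converts 10 outputs into 4 outputs. One option is to use machine learning instead: take a set of 10-element output vectors together with their corresponding binary values, and learn the weights from them.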
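That said, a hand-built choice does seem workable. The sketch below is one possible construction, not the book's official answer: output neuron $k$ gets weight 10 from every digit whose binary representation has bit $k$ set, weight 0 from the others, and bias -5. The constants 10 and -5, the helper names and the worst-case test vectors are all assumptions made for this illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


W = 10.0   # hypothetical weight for "this digit has bit k set"
B = -5.0   # hypothetical bias shared by the four new neurons

# weights[k][d] connects old output neuron d (digit d) to new neuron k (bit k,
# with bit 0 the least significant bit).
weights = [[W if (d >> k) & 1 else 0.0 for d in range(10)] for k in range(4)]


def to_bits(old_output):
    """Map the 10 old activations to 4 bits, least significant bit first."""
    bits = []
    for k in range(4):
        z = sum(w * a for w, a in zip(weights[k], old_output)) + B
        bits.append(1 if sigmoid(z) > 0.5 else 0)
    return bits


def decode(bits):
    """Turn the 4 bits back into an integer."""
    return sum(bit << k for k, bit in enumerate(bits))


def worst_case_output(digit):
    """Old-layer activations at the edge of the stated assumption."""
    return [0.99 if d == digit else 0.0099 for d in range(10)]


decoded = [decode(to_bits(worst_case_output(d))) for d in range(10)]
print(decoded)
```

With these constants, a set bit receives a weighted input of at least $10 \times 0.99 - 5 = 4.9$, while an unset bit receives less than $10 \times 5 \times 0.01 - 5 = -4.5$ (at most five digits share a bit), so the two cases stay well separated.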

Problem 4

Prove the assertion of the last paragraph. Hint: If you’re not already familiar with the Cauchy-Schwarz inequality, you may find it helpful to familiarize yourself with it.

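Problem 5

I explained gradient descent when C is a function of two variables, and when it’s a function of more than two variables. What happens when C is a function of just one variable? Can you provide a geometric interpretation of what gradient descent is doing in the one-dimensional case?

When $C$ is a function of a single variable $v$, the update is $\Delta v = -\eta C'(v)$: the point moves along the curve in the direction opposite to the slope of the tangent line, that is, downhill, with a step proportional to the steepness of the slope.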
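A one-dimensional example makes the picture concrete. The cost $C(v) = (v - 3)^2$, the learning rate and the starting points are illustrative assumptions, and `gradient_descent_1d` is a hypothetical helper name.

```python
def gradient_descent_1d(dcost, v, eta=0.1, steps=100):
    """Repeatedly apply v -> v - eta * C'(v)."""
    for _ in range(steps):
        v -= eta * dcost(v)
    return v


def dcost(v):
    """Derivative of the example cost C(v) = (v - 3)^2."""
    return 2.0 * (v - 3.0)


# Left of the minimum the slope is negative, so v increases;
# right of the minimum the slope is positive, so v decreases.
v_from_left = gradient_descent_1d(dcost, v=-5.0)
v_from_right = gradient_descent_1d(dcost, v=10.0)
print(v_from_left, v_from_right)
```

Both runs end near $v = 3$, the bottom of the parabola.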

Problem 6

An extreme version of gradient descent is to use a mini-batch size of just 1. That is, given a training input, $x$, we update our weights and biases according to the rules $w_k \rightarrow w_k' = w_k - \eta \partial C_x / \partial w_k$ and $b_l \rightarrow b_l' = b_l - \eta \partial C_x / \partial b_l$. Then we choose another training input, and update the weights and biases again. And so on, repeatedly. This procedure is known as online, on-line, or incremental learning. In online learning, a neural network learns from just one training input at a time (just as human beings do). Name one advantage and one disadvantage of online learning, compared to stochastic gradient descent with a mini-batch size of, say, 20.

Compared with a mini-batch size of 20, online learning updates the parameters faster, since every single training input triggers an update. The disadvantage is that it is much more sensitive to data quality: noise in individual examples strongly perturbs the direction of each update.
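The noise argument can be illustrated with a toy experiment. Everything here is an assumption made for the sketch: the synthetic data, the per-example cost $C_x = (w - x)^2 / 2$ (whose gradient is $w - x$), and the helper names. It measures how much the gradient estimate fluctuates for batch sizes 1 and 20.

```python
import random
import statistics

random.seed(0)  # fixed seed so the experiment is repeatable

# Synthetic one-dimensional "training inputs".
data = [random.gauss(2.0, 1.0) for _ in range(10_000)]


def batch_gradient(w, batch):
    """Average gradient of C_x = (w - x)^2 / 2 over a mini-batch."""
    return sum(w - x for x in batch) / len(batch)


def gradient_spread(batch_size, trials=2000, w=0.0):
    """Standard deviation of the gradient estimate across random mini-batches."""
    estimates = [
        batch_gradient(w, random.sample(data, batch_size)) for _ in range(trials)
    ]
    return statistics.pstdev(estimates)


spread_online = gradient_spread(1)
spread_batch20 = gradient_spread(20)
print(f"spread with batch size 1:  {spread_online:.3f}")
print(f"spread with batch size 20: {spread_batch20:.3f}")
```

The spread with a batch of 20 is several times smaller, at the price of 20 gradient evaluations per update instead of one.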

Problem 7

Write out Equation (22) in component form, and verify that it gives the same result as the rule (4) for computing the output of a sigmoid neuron.

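The component form of $a' = \sigma(wa + b)$ is $a'_j = \sigma(\sum_k w_{jk} a_k + b_j)$. For each neuron $j$ this is exactly the sigmoid neuron rule, with weights $w_{jk}$, inputs $a_k$ and bias $b_j$. The two forms differ only in notation and give the same result.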
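The equivalence can be checked on a tiny layer. The weight matrix, biases and activations below are made-up values, and both helper functions are hypothetical names used only for this sketch: one applies the single-neuron rule neuron by neuron, the other computes the whole layer as a matrix-vector product followed by an elementwise sigmoid.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(weights, bias, inputs):
    """Single sigmoid neuron: sigma(sum_k w_k a_k + b)."""
    return sigmoid(sum(w * a for w, a in zip(weights, inputs)) + bias)


def layer_output(weight_matrix, biases, inputs):
    """Whole layer: elementwise sigma applied to (w a + b)."""
    z = [
        sum(weight_matrix[j][k] * inputs[k] for k in range(len(inputs))) + biases[j]
        for j in range(len(weight_matrix))
    ]
    return [sigmoid(zj) for zj in z]


# Made-up layer with 2 neurons and 3 inputs.
w = [[0.2, -0.5, 1.0],
     [1.5, 0.3, -0.7]]
b = [0.1, -0.2]
a = [0.9, 0.1, 0.4]

per_neuron = [neuron_output(w[j], b[j], a) for j in range(len(w))]
whole_layer = layer_output(w, b, a)
print(per_neuron)
print(whole_layer)
```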

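The code given in this chapter is not optimal, but it helps a great deal in understanding how neural network training works and how much hyperparameter settings matter: poorly chosen hyperparameters lead to poor training results.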
