
deeplearning.ai Course Notes

2017-12-29 10:05

Neural Networks and Deep Learning

Introduction to deep learning

Neural Networks Basics

Logistic Regression as a Neural Network

Computation graph

A neural network is computed in two passes: forward propagation, which computes the network's output, and back propagation, which computes the gradients (derivatives). The computation graph explains why the computation is organized this way.



The computation graph is a convenient way to visualize the step-by-step computation of a neural network.
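As a minimal sketch of the idea (the function J = 3(a + bc) here is a small illustrative example, not taken from these notes), forward propagation computes J left to right, and backpropagation applies the chain rule right to left:

```python
# Forward pass: compute J = 3 * (a + b * c) node by node,
# caching the intermediate values of the computation graph.
a, b, c = 5.0, 3.0, 2.0
u = b * c          # first node
v = a + u          # second node
J = 3 * v          # output node: J = 33.0

# Backward pass: chain rule from J back to the inputs.
dJ_dv = 3.0                # J = 3v
dJ_du = dJ_dv * 1.0        # v = a + u  =>  dv/du = 1
dJ_da = dJ_dv * 1.0        # dv/da = 1
dJ_db = dJ_du * c          # u = b * c  =>  du/db = c
dJ_dc = dJ_du * b          # du/dc = b

print(J, dJ_da, dJ_db, dJ_dc)   # 33.0 3.0 6.0 9.0
```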

Shallow neural networks

Learn to build a neural network with one hidden layer, using forward propagation and backpropagation.

Learning Objectives

Understand hidden units and hidden layers

Be able to apply a variety of activation functions in a neural network.

Build your first forward and backward propagation with a hidden layer

Apply random initialization to your neural network

Become fluent with Deep Learning notations and Neural Network Representations

Build and train a neural network with one hidden layer.

Activation functions

Pros and cons of activation functions

One downside of the sigmoid and tanh functions is that when z is very large or very small, the derivative (slope) of the function becomes very small, approaching zero, which slows down gradient descent.

ReLU is currently the most widely used activation function, although the tanh function is sometimes used as well. One downside of ReLU is that its derivative is 0 when z is negative, but in practice this is not a problem. An advantage shared by ReLU and Leaky ReLU is that over much of the range of z, the derivative (slope) of the activation function is far from 0. As a result, a neural network using plain ReLU will usually learn much faster than one using tanh or sigmoid, mainly because there is less of the slope-approaching-zero effect that slows learning down (a derivative approaching 0 reduces the learning speed). Although the slope of ReLU is 0 for half the range of z, in practice most hidden units will have z values greater than 0, so learning can still proceed quickly.



Pros and cons of the different activation functions:

Don't use the sigmoid activation function, except in the output layer of a binary classification problem.

The tanh function is almost always better than sigmoid.

The default, most commonly used activation function is ReLU (see the sketch below).
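A minimal NumPy sketch of these activation functions and the derivative behavior described above (the function names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)            # approaches 0 for large |z|: slow learning

def tanh_derivative(z):
    return 1 - np.tanh(z) ** 2    # also saturates for large |z|

def relu(z):
    return np.maximum(0, z)       # slope 1 for z > 0, slope 0 for z < 0

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)  # small nonzero slope for z < 0

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid_derivative(z))      # nearly 0 at both ends
print(relu(z))
```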

Why do you need non-linear activation functions?

If you use a linear activation function (also called the identity activation function), then the output of the neural network is just a linear function of the input.

If you use a linear activation function, or equivalently no activation function, then no matter how many layers your neural network has, all it computes is a linear function, and you might as well remove all the hidden layers. A linear hidden layer is useless, because the composition of two linear functions is still a linear function; unless you introduce non-linearity, the network cannot compute more interesting functions no matter how many hidden layers it has. The one place a linear activation function, g(z) = z, is typically used is the output layer when you are using machine learning to solve a regression problem.

Deep Neural Networks

Improving Deep Neural Networks

About this Course

This course will teach you the “magic” of getting deep learning to work well. Rather than the deep learning process being a black box, you will understand what drives performance, and be able to more systematically get good results. You will also learn TensorFlow.

After 3 weeks, you will:

Understand industry best-practices for building deep learning applications.

Be able to effectively use the common neural network "tricks", including initialization, L2 and dropout regularization, Batch normalization, and gradient checking

Be able to implement and apply a variety of optimization algorithms, such as mini-batch gradient descent, Momentum, RMSprop and Adam, and check for their convergence.

Understand new best-practices for the deep learning era of how to set up train/dev/test sets and analyze bias/variance

Be able to implement a neural network in TensorFlow.

This is the second course of the Deep Learning Specialization.

Practical aspects of Deep Learning

Learning Objectives

Recall that different types of initializations lead to different results

Recognize the importance of initialization in complex neural networks.

Recognize the difference between train/dev/test sets

Diagnose the bias and variance issues in your model

Learn when and how to use regularization methods such as dropout or L2 regularization.

Understand experimental issues in deep learning such as Vanishing or Exploding gradients and learn how to deal with them

Use gradient checking to verify the correctness of your backpropagation implementation

Regularizing your neural network

What we want you to remember from this module:

- Regularization will help you reduce overfitting.

- Regularization will drive your weights to lower values.

- L2 regularization and Dropout are two very effective regularization techniques.

Regularization

Deep Learning models have so much flexibility and capacity that overfitting can be a serious problem, if the training dataset is not big enough. Sure it does well on the training set, but the learned network doesn’t generalize to new examples that it has never seen!

The standard way to avoid overfitting is called L2 regularization. It consists of appropriately modifying your cost function, from:

$$J = -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log(a^{[L](i)}) + (1-y^{(i)}) \log(1-a^{[L](i)}) \right) \tag{1}$$

To:

$$J_{\text{regularized}} = \underbrace{-\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log(a^{[L](i)}) + (1-y^{(i)}) \log(1-a^{[L](i)}) \right)}_{\text{cross-entropy cost}} + \underbrace{\frac{\lambda}{2m} \sum_l \sum_k \sum_j \left(W_{k,j}^{[l]}\right)^2}_{\text{L2 regularization cost}} \tag{2}$$
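As a sketch, equation (2) in NumPy, assuming the weights are stored in a dict keyed "W1", "b1", ..., "WL", "bL" as in the course's programming exercises:

```python
import numpy as np

def compute_cost_with_regularization(AL, Y, parameters, lambd):
    """Cross-entropy cost plus the L2 penalty of equation (2)."""
    m = Y.shape[1]
    cross_entropy = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    L = len(parameters) // 2   # parameters holds W1, b1, ..., WL, bL
    l2 = sum(np.sum(np.square(parameters["W" + str(l)]))
             for l in range(1, L + 1))
    return cross_entropy + (lambd / (2 * m)) * l2
```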

Why do we regularize only the parameter w and not include a term for b? You could do so, but it is usually omitted: w is typically a very high-dimensional parameter vector, especially when there is a high-variance problem, so almost all of the parameters are in w rather than in b, which is just a single number. Adding the b term would make little difference in practice, because b is just one parameter among a huge number, so it is usually not worth the effort, though you can include it if you want.

L1 regularization vs. L2 regularization



L1 regularization uses not the L2 norm (the Euclidean norm, commonly used to compute vector length: the square root of the sum of squared vector elements) but lambda/m times the L1 norm of the parameter vector w (the sum of the absolute values of its elements), written with a subscript 1. Whether you use m or 2m in the denominator doesn't matter; it is just a scaling constant. If you use L1 regularization, w ends up sparse, meaning the w vector contains many zeros. Some people argue this helps compress the model: because some parameters are zero, less memory is needed to store it. In practice, however, making the model sparse through L1 regularization brings little benefit, so at least for the goal of compressing the model, it doesn't help much.

Why regularization reduces overfitting?

L2-regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the square values of the weights in the cost function you drive all the weights to smaller values. It becomes too costly for the cost to have large weights! This leads to a smoother model in which the output changes more slowly as the input changes.

Dropout Regularization

Dropout is a widely used regularization technique that is specific to deep learning.

It randomly shuts down some neurons in each iteration.

Figure 2 : Drop-out on the second hidden layer.
At each iteration, you shut down (= set to zero) each neuron of a layer with probability 1 − keep_prob, or keep it with probability keep_prob (50% here). The dropped neurons don't contribute to the training in either the forward or backward propagation of the iteration.

Figure 3 : Drop-out on the first and third hidden layers.
1st layer: we shut down on average 40% of the neurons. 3rd layer: we shut down on average 20% of the neurons.

When you shut some neurons down, you actually modify your model. The idea behind drop-out is that at each iteration, you train a different model that uses only a subset of your neurons. With dropout, your neurons thus become less sensitive to the activation of one other specific neuron, because that other neuron might be shut down at any time.

What you should remember about dropout:

- Dropout is a regularization technique.

- You only use dropout during training. Don’t use dropout (randomly eliminate nodes) during test time.

- Apply dropout both during forward and backward propagation.

- During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob is other values than 0.5. (See the sketch below.)
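A sketch of inverted dropout for one layer's activations, following the points above (the function names and returned mask are conventions of this sketch, not a library API):

```python
import numpy as np

def dropout_forward(A, keep_prob):
    """Inverted dropout on activations A during training."""
    D = np.random.rand(*A.shape) < keep_prob  # mask: True with prob keep_prob
    A = A * D                                 # shut down the dropped neurons
    A = A / keep_prob                         # scale up: keep the expected value
    return A, D                               # cache D for the backward pass

def dropout_backward(dA, D, keep_prob):
    """Apply the same mask and scaling to the gradients."""
    return (dA * D) / keep_prob
```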

Optimization algorithms

Learning Objectives

Remember different optimization methods such as (Stochastic) Gradient Descent, Momentum, RMSProp and Adam

Use random minibatches to accelerate the convergence and improve the optimization

Know the benefits of learning rate decay and apply it to your optimization

Mini-batch gradient descent

Applying machine learning is a highly empirical, iterative process. You need to train many models to find one that really works well, so being able to train models quickly is a real advantage. Making things harder, deep learning does not shine with small data: we can train neural networks on huge amounts of data, and training on huge data is slow. So you will find that fast, good optimization algorithms can really improve the efficiency of you and your team.

We saw earlier that vectorization lets you efficiently compute over all m examples without an explicit for loop over the whole training set. But if m is very large, it can still be slow. For example, if m is 5 million or 50 million or more, then to run gradient descent you must process your entire training set before taking one small step of gradient descent, and then process all 5 million examples again to take another small step. The algorithm can actually be sped up if you let gradient descent start making progress before you finish processing the entire giant 5-million-example training set. Specifically, you can split your training set into smaller, tiny training sets called mini-batches.





One pass through the training set is called an epoch. With batch gradient descent, one pass through the training set yields only one gradient descent step, whereas with mini-batch gradient descent, one pass through the training set, i.e., one epoch, yields 5,000 gradient descent steps (with 5,000 mini-batches).

When you have a large training set, mini-batch gradient descent runs much faster than batch gradient descent, and it is what almost everyone in deep learning uses when training on a large data set.
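A minimal sketch of how the mini-batches might be built, assuming examples are stored as columns of X and Y (the course's convention); the function name is hypothetical:

```python
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """Shuffle (X, Y), then slice into mini-batches of mini_batch_size columns."""
    np.random.seed(seed)
    m = X.shape[1]                             # number of training examples
    permutation = np.random.permutation(m)     # shuffle X and Y consistently
    X_shuffled, Y_shuffled = X[:, permutation], Y[:, permutation]
    mini_batches = []
    for k in range(0, m, mini_batch_size):
        mini_batches.append((X_shuffled[:, k:k + mini_batch_size],
                             Y_shuffled[:, k:k + mini_batch_size]))
    return mini_batches                        # the last batch may be smaller
```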

Understanding mini-batch gradient descent







A variant is Stochastic Gradient Descent (SGD), which is equivalent to mini-batch gradient descent where each mini-batch has just 1 example. In Stochastic Gradient Descent, you use only 1 training example before updating the gradients. When the training set is large, SGD can be faster. But the parameters will “oscillate” toward the minimum rather than converge smoothly. Here is an illustration of this:



In practice, you'll often get faster results if you use neither the whole training set nor only one training example to perform each update. Mini-batch gradient descent uses an intermediate number of examples for each step. With mini-batch gradient descent, you loop over the mini-batches instead of looping over individual training examples.



The difference between gradient descent, mini-batch gradient descent and stochastic gradient descent is the number of examples you use to perform one update step.

You have to tune a learning rate hyperparameter α.

With a well-tuned mini-batch size, it usually outperforms either gradient descent or stochastic gradient descent (particularly when the training set is large).

Exponentially weighted averages



A few points to note. When beta is large, the curve you get is smoother, because you are averaging the temperature over more days, so the curve fluctuates less. On the other hand, the curve shifts to the right, because you are averaging over a larger window; averaging over a larger window makes the exponentially weighted average formula adapt more slowly when the temperature changes, which creates some latency. The reason is that when beta = 0.98, the previous value carries a lot of weight and the current value's weight is only 0.02, so when the temperature rises or falls, the exponentially weighted average adapts more slowly for large beta. Now try the other extreme, say beta = 0.5: by the formula on the right, this amounts to averaging over only about two days. Plotting it gives the yellow line. Averaging over only two days of temperature, i.e., over a very small window, the result is much noisier and more susceptible to outliers, but it adapts more quickly to temperature changes. This formula implements what in statistics is called an exponentially weighted moving average; we will simply call it an exponentially weighted average.

Understanding exponentially weighted averages

One of the advantages of this exponentially weighted average formula is that it takes very little memory. You just need to keep one real number in computer memory, and you keep on overwriting it with this formula based on the latest value you got. It's really for this reason, the efficiency, that it is used: it takes basically one line of code, and storage and memory for a single real number, to compute this exponentially weighted average.
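A sketch of that one-line update, keeping only the single number v in memory:

```python
def exponentially_weighted_average(thetas, beta=0.9):
    """Running exponentially weighted average of a stream of values."""
    v = 0.0
    averages = []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta   # the one-line update
        averages.append(v)
    return averages

print(exponentially_weighted_average([10, 12, 11, 13], beta=0.9))
```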

Gradient descent with momentum

The main idea of momentum, or gradient descent with momentum, is to compute an exponentially weighted average of the gradients and then use that average to update the weights.

Because mini-batch gradient descent makes a parameter update after seeing just a subset of examples, the direction of the update has some variance, and so the path taken by mini-batch gradient descent will “oscillate” toward convergence. Using momentum can reduce these oscillations.

Momentum takes into account the past gradients to smooth out the update. We will store the ‘direction’ of the previous gradients in the variable vv. Formally, this will be the exponentially weighted average of the gradient on previous steps. You can also think of vv as the “velocity” of a ball rolling downhill, building up speed (and momentum) according to the direction of the gradient/slope of the hill.



The momentum update rule is, for l = 1, ..., L:



where L is the number of layers, β is the momentum and α is the learning rate.
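A sketch of this update rule, assuming the parameters, gradients, and velocities live in dicts keyed "W1"/"dW1" and so on, as in the course's exercises:

```python
def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    """One momentum step: v is the exponentially weighted average of gradients."""
    L = len(parameters) // 2                  # number of layers
    for l in range(1, L + 1):
        v["dW" + str(l)] = beta * v["dW" + str(l)] + (1 - beta) * grads["dW" + str(l)]
        v["db" + str(l)] = beta * v["db" + str(l)] + (1 - beta) * grads["db" + str(l)]
        parameters["W" + str(l)] -= learning_rate * v["dW" + str(l)]
        parameters["b" + str(l)] -= learning_rate * v["db" + str(l)]
    return parameters, v
```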

How do you choose β?

The larger the momentum β is, the smoother the update, because we take the past gradients into account more. But if β is too big, it could also smooth out the updates too much.

Common values for β range from 0.8 to 0.999. If you don't feel inclined to tune this, β = 0.9 is often a reasonable default.

Tuning the optimal β for your model might require trying several values to see what works best in terms of reducing the value of the cost function J.

Momentum takes past gradients into account to smooth out the steps of gradient descent. It can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent.

You have to tune a momentum hyperparameter β and a learning rate α.

RMSprop



You have learned how to speed up gradient descent with momentum. There is another algorithm called RMSprop, which stands for Root Mean Square prop, that can also speed up gradient descent. Let's see how it works. Recall the earlier example: when running gradient descent, you can get huge oscillations in the vertical direction even while it tries to make progress in the horizontal direction. To illustrate, suppose the vertical axis is the parameter b and the horizontal axis is the parameter W (it could of course be W1 and W2 or other parameters; we use b and W for ease of understanding). You want to slow down learning in the b direction, i.e., the vertical direction, while speeding up, or at least not slowing down, learning in the horizontal direction. That is what RMSprop does. A side benefit is that you can then use a larger learning rate alpha and learn faster, without worrying about diverging in the vertical direction.

In the horizontal direction, the W direction in this example, we want learning to be fast, while in the vertical direction, the b direction, we want to damp the oscillations. For the two terms S_dW and S_db, we want S_dW to be relatively small, so that we divide by a small number in the horizontal direction, and S_db to be relatively large, so that we divide by a large number, slowing the updates in the vertical direction. Indeed, if you look at the derivatives, they are much larger in the vertical direction than in the horizontal direction: the slope is very large in the b direction, so for these derivatives db is large and dW is relatively small, because the function is much steeper in the vertical (b) direction than in the horizontal (W) direction. Thus db squared is relatively large, so S_db is relatively large, while dW squared is smaller, so S_dW is smaller. The result is that the vertical updates are divided by a large number, which helps damp the oscillations, while the horizontal updates are divided by a small number.
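A minimal sketch of the RMSprop update for the W and b of this example (the variable names follow the lecture; the function itself is hypothetical):

```python
import numpy as np

def rmsprop_update(W, b, dW, db, S_dW, S_db,
                   beta=0.999, learning_rate=0.01, epsilon=1e-8):
    S_dW = beta * S_dW + (1 - beta) * dW ** 2   # running average of dW^2 (small)
    S_db = beta * S_db + (1 - beta) * db ** 2   # running average of db^2 (large)
    W = W - learning_rate * dW / (np.sqrt(S_dW) + epsilon)  # divide by small number
    b = b - learning_rate * db / (np.sqrt(S_db) + epsilon)  # divide by large number
    return W, b, S_dW, S_db
```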

Adam optimization algorithm

Adam is one of the most effective optimization algorithms for training neural networks. It combines ideas from RMSProp (described in lecture) and Momentum.

How does Adam work?

It calculates an exponentially weighted average of past gradients, and stores it in variables v (before bias correction) and v^corrected (with bias correction).

It calculates an exponentially weighted average of the squares of the past gradients, and stores it in variables s (before bias correction) and s^corrected (with bias correction).

It updates parameters in a direction based on combining information from “1” and “2”.

The update rule is, for l = 1, ..., L (a code sketch follows the list of symbols below):



where:

- t counts the number of steps taken by Adam

- L is the number of layers

- β1 and β2 are hyperparameters that control the two exponentially weighted averages.

- α is the learning rate

- ε is a very small number to avoid dividing by zero
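A sketch of the full Adam update for a single parameter w, combining the two averages and the bias corrections listed above:

```python
import numpy as np

def adam_update(w, dw, v, s, t, learning_rate=0.001,
                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam step; t >= 1 is the step count used for bias correction."""
    v = beta1 * v + (1 - beta1) * dw            # momentum-like average
    s = beta2 * s + (1 - beta2) * dw ** 2       # RMSprop-like average
    v_corrected = v / (1 - beta1 ** t)          # bias correction
    s_corrected = s / (1 - beta2 ** t)
    w = w - learning_rate * v_corrected / (np.sqrt(s_corrected) + epsilon)
    return w, v, s
```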

Hyperparameter tuning, Batch Normalization and Programming Frameworks

Hyperparameter tuning

Tuning process

Hyperparameters are not all equally important; e.g., the learning rate α is more important than β1 in Adam.

In traditional machine learning, with relatively few hyperparameters, we used to search over a regular grid of points;

but in deep learning, with many hyperparameters, rather than a regular grid, it is better to sample points at random. The reason is that you don't know in advance which hyperparameter will matter most for your problem, so randomly sampling hyperparameter settings is more sensible: it lets you explore more potential values for each hyperparameter.

Using an appropriate scale to pick hyperparameters

When choosing a scale for hyperparameters, you should in principle sample uniformly at random within each scale range, e.g., within 0.0001~0.001, 0.001~0.01, 0.01~0.1, and 0.1~1.

In general, to sample on a log scale over the range 10^a to 10^b, sample r uniformly from [a, b] and set α = 10^r.

Likewise, when using exponentially weighted averages, the hyperparameter beta should be chosen using the same approach (sampling 1 − beta on a log scale).
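A short sketch of log-scale sampling, for example α in [0.0001, 1] and β in [0.9, 0.999]:

```python
import numpy as np

# Learning rate: sample uniformly on a log scale between 10^-4 and 10^0.
r = np.random.uniform(-4, 0)
alpha = 10 ** r

# Beta: sample 1 - beta on a log scale instead, since beta lives near 1.
r = np.random.uniform(-3, -1)   # 1 - beta in [0.001, 0.1]
beta = 1 - 10 ** r              # beta in [0.9, 0.999]
```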

Hyperparameters tuning in practice: Pandas vs. Caviar



With limited computational resources, use the first approach: babysit a single model and keep improving it day by day;

with ample computational resources, use the second approach: train many models in parallel and pick the best one.

Batch Normalization

Normalizing activations in a network

One of the most important innovations in the rise of deep learning is an algorithm called Batch Normalization, proposed by Sergey Ioffe and Christian Szegedy. It makes hyperparameter search much easier, makes your neural network more robust and far less sensitive to the choice of hyperparameters, and makes it much easier to train very deep networks.

Implementing Batch Norm



The values γ and β here are learned by your model, so you can use gradient descent, or a similar algorithm such as gradient descent with momentum or Adam, to update γ and β, just as you update the weights of the neural network.

What batch norm does is normalize not only the input layer but also some of the hidden layers: you apply this normalization to the values z of some hidden units. One difference from normalizing the input layer is that the hidden-layer values are not necessarily normalized to mean 0 and variance 1. For example, if your activation function is sigmoid, you don't want the normalized values all clustered in the middle; you may want them to have a larger variance, so as to better exploit the non-linearity of the sigmoid function rather than having all values in the nearly linear middle region. That is why, by setting γ and β, you can control z^(i) to lie in whatever range you want. What it really achieves is to give your hidden units a controllable mean and variance, governed by the two parameters γ and β, which the algorithm is free to set: the resulting mean and variance can be 0 and 1, or any other values controlled by γ and β.
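A minimal sketch of the batch norm forward computation on a layer's pre-activations Z (examples stored as columns), with learnable γ and β:

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, epsilon=1e-8):
    """Normalize Z over the mini-batch, then rescale with learnable gamma, beta."""
    mu = np.mean(Z, axis=1, keepdims=True)
    var = np.var(Z, axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + epsilon)  # mean 0, variance 1
    return gamma * Z_norm + beta                # controllable mean and variance
```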

Batch Norm at test time

The usual approach is, during training, to keep an exponentially weighted average of the mean and variance across the training mini-batches. When training ends, the resulting averaged mean and variance are plugged directly into the Batch Norm formula to make predictions on test examples.

Multi-class classification

Softmax Regression

Softmax regression is a generalization of logistic regression that lets you predict among multiple classes, not just two.



To summarize the computation from z^L to a^L: the whole process goes from exponentiating, to computing a temporary variable, to normalizing. We can summarize this process as a softmax activation function: a^L = g(z^L), where the unusual thing about this activation function g is that it takes a 4x1 vector as input and outputs a 4x1 vector. Previously, our activation functions took a single number as input; for example, sigmoid and ReLU take a real number in and output a real number. What is different about softmax is that it normalizes its outputs, and both its input and output are vectors.
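A minimal sketch of the softmax activation on the 4x1 case described above:

```python
import numpy as np

def softmax(z):
    """Vector in, vector out: exponentiate, then normalize to sum to 1."""
    t = np.exp(z - np.max(z))   # subtract max for numerical stability
    return t / np.sum(t)

z = np.array([5.0, 2.0, -1.0, 3.0])  # the 4x1 input vector z^L
a = softmax(z)                       # a^L: four class probabilities
print(a, a.sum())                    # sums to 1.0
```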

Structuring Machine Learning Projects

About this Course

You will learn how to build a successful machine learning project. If you aspire to be a technical leader in AI, and know how to set direction for your team’s work, this course will show you how.

Much of this content has never been taught elsewhere, and is drawn from my experience building and shipping many deep learning products. This course also has two “flight simulators” that let you practice decision-making as a machine learning project leader. This provides “industry experience” that you might otherwise get only after years of ML work experience.

After 2 weeks, you will:

Understand how to diagnose errors in a machine learning system, and

Be able to prioritize the most promising directions for reducing error

Understand complex ML settings, such as mismatched training/test sets, and comparing to and/or surpassing human-level performance

Know how to apply end-to-end learning, transfer learning, and multi-task learning

ML Strategy (1)

Learning Objectives

Understand why Machine Learning strategy is important

Apply satisficing and optimizing metrics to set up your goal for ML projects

Choose a correct train/dev/test split of your dataset

Understand how to define human-level performance

Use human-level performance to define your key priorities in ML projects

Take the correct ML Strategic decision based on observations of performances and dataset

Introduction to ML Strategy

Why ML Strategy

Ways to improve an ML system:



The ML Strategy course covers:

1. A number of strategies, that is, ways of analyzing a machine learning problem that point you in the direction of the most promising things to try.

2. Andrew Ng's own experience building and shipping a large number of deep learning products.

Andrew Ng points out that in the deep learning era, machine learning strategy is changing, because the things you can now do with deep learning algorithms are different from what was possible with the previous generation of machine learning algorithms.

Orthogonalization

Background: One of the challenges with building machine learning systems is that there are so many things you could try, so many things you could change. Including, for example, so many hyperparameters you could tune.

Orthogonalization or orthogonality is a system design property that assures that modifying an instruction or a component of an algorithm will not create or propagate side effects to other components of the system. It becomes easier to verify the algorithms independently from one another, it reduces testing and development time.

When a supervised learning system is designed, these are the 4 assumptions that need to be true and orthogonal.

Fit training set well in cost function

If it doesn’t fit well, the use of a bigger neural network or switching to a better optimization algorithm might help.

Fit development set well on cost function

If it doesn’t fit well, regularization or using bigger training set might help.

Fit test set well on cost function

If it doesn't fit well, the use of a bigger development set might help.

Performs well in real world

If it doesn’t perform well, the development test set is not set correctly or the cost function is not evaluating the right thing.

What orthogonalization means in machine learning: figure out exactly what's wrong, and then have exactly one knob, or a specific set of knobs, that solves that particular problem limiting the performance of the machine learning system.

Setting up your goal

Single number evaluation metric

The benefit of a single number evaluation metric: it lets you quickly tell if the new thing you just tried is working better or worse than your last idea.

Examples of evaluation metrics:

Precision

Of all the images we predicted y=1, what fraction actually have cats?

Recall

Of all the images that actually have cats, what fraction did we correctly identify as having cats?

The problem with using precision/recall as the evaluation metric is that you may not be sure which classifier is better when, as in this case, one has better precision and the other better recall. The F1-score, a harmonic mean, combines both precision and recall.

F1-Score = 2 / (1/p + 1/r)

F1-Score is not the only evaluation metric that can be used; the average, for example, could also be an indicator of which classifier to use.

Satisficing and Optimizing metric

There are different metrics to evaluate the performance of a classifier, called evaluation metrics. They can be categorized as satisficing and optimizing metrics. It is important to note that these evaluation metrics must be evaluated on a training set, a development set or on the test set.

Example: Cat vs Non-cat

Classifier | Accuracy | Running time
---------- | -------- | ------------
A          | 90%      | 80 ms
B          | 92%      | 95 ms
C          | 95%      | 1,500 ms

In this case, accuracy and running time are the evaluation metrics. Accuracy is the optimizing metric, because you want the classifier to correctly detect a cat image as accurately as possible. The running time, which is set to be under 100 ms in this example, is the satisficing metric, which means that the metric has to meet the expectation set.

The general rule is:

With N metrics: choose 1 as the optimizing metric and treat the remaining N − 1 as satisficing metrics.

Summary: If there are multiple things you care about, set one as the optimizing metric, which you want to do as well as possible on, and one or more as satisficing metrics, where you'll be satisfied as long as they do better than some threshold. You then have an almost automatic way of quickly looking at multiple classifiers and picking the best one.

Train /dev /test distributions

Setting up the training, development and test sets has a huge impact on productivity. It is important to choose the development and test sets from the same distribution, taken randomly from all the data. However, it is not a problem for the training distribution to differ from the dev/test distribution.

Guideline

Choose a development set and test set to reflect data you expect to get in the future and consider important to do well.

Size of the dev and test sets

Old way of splitting data

We had smaller data sets, therefore we had to use a greater percentage of data to develop and test ideas and models.



Modern era – Big data

Now, because a large amount of data is available, we don't have to compromise as much and can use a greater portion to train the model.



Guidelines

Set up the size of the test set to give a high confidence in the overall performance of the system.

The test set helps evaluate the performance of the final classifier, and could be less than 30% of the whole data set.

The development set has to be big enough to evaluate different ideas.

When to change dev /test sets and metrics

If doing well on your metric + dev/test set does not correspond to doing well on your application, change your metric and/or dev/test set.

Guideline

Define correctly an evaluation metric that helps better rank order classifiers

Optimize the evaluation metric

Why human-level performance?

Today, machine learning algorithms can compete with human-level performance, since they have become more productive and more feasible in a lot of applications. Also, the workflow of designing and building a machine learning system is much more efficient than before.

Moreover, some of the tasks that humans do are close to "perfection", which is why machine learning tries to mimic human-level performance.

The graph below shows the performance of humans and machine learning over time.



Machine learning progress slows once it surpasses human-level performance. One of the reasons is that human-level performance can be close to the Bayes optimal error, especially for natural perception problems.

Bayes optimal error is defined as the best possible error. In other words, no function mapping from x to y can surpass that level of accuracy.

Also, when the performance of machine learning is worse than the performance of humans, you can improve it with different tools. These tools are harder to use once it surpasses human-level performance.

These tools are:

Get labeled data from humans

Gain insight from manual error analysis: Why did a person get this right?

Better analysis of bias/variance.

Avoidable bias

By knowing what the human-level performance is, it is possible to tell whether a model is performing well on the training set or not.

Example: Cat vs Non-Cat

Classification error (%)

                  | Scenario A | Scenario B
Humans            | 1          | 7.5
Training error    | 8          | 8
Development error | 10         | 10

In this case, the human-level error serves as a proxy for Bayes error, since humans are good at identifying images. You want to improve performance on the training set, but you can't do better than the Bayes error; otherwise the training set is overfitting. By knowing the Bayes error, it is easier to focus on whether bias or variance avoidance tactics will improve the performance of the model.

Scenario A

There is a 7% gap between the performance on the training set and the human-level error. It means that the algorithm isn't fitting the training set well, since the target is around 1%. To resolve the issue, we use bias reduction techniques such as training a bigger neural network or running the training longer.

Scenario B

The training set is doing well, since there is only a 0.5% difference with the human-level error. The difference between the training error and the human-level error is called avoidable bias. The focus here is to reduce the variance, since the difference between the training error and the development error is 2%. To resolve the issue, we use variance reduction techniques such as regularization or a bigger training set.

Understanding human-level performance

Summary of bias/variance with human-level performance

Human-level error is used as a proxy for Bayes error.

If the difference between human-level error and the training error is bigger than the difference between the training error and the development error, the focus should be on bias reduction techniques.

If the difference between the training error and the development error is bigger than the difference between the human-level error and the training error, the focus should be on variance reduction techniques.

Surpassing human-level performance

There are many problems where machine learning significantly surpasses human-level performance, especially with structured data:

Online advertising

Product recommendations

Logistics (predicting transit time)

Loan approvals

Improving your model performance

The two fundamental assumptions of supervised learning:

There are 2 fundamental assumptions of supervised learning. The first one is to have a low avoidable bias which means that the training set fits well. The second one is to have a low or acceptable variance which means that the training set performance generalizes well to the development set and test set.

If the difference between human-level error and the training error is bigger than the difference between the training error and the development error, the focus should be on bias reduction technique which are training a bigger model, training longer or change the neural networks architecture or try various hyperparameters search.

If the difference between training error and the development error is bigger than the difference between the human-level error and the training error, the focus should be on variance reduction technique which are bigger data set, regularization or change the neural networks architecture or try various hyperparameters search.

Summary



ML Strategy (2)

Learning Objectives

Understand what multi-task learning and transfer learning are

Recognize bias, variance and data-mismatch by looking at the performances of your algorithm on train/dev/test sets

Error Analysis

Carrying out error analysis

Summary

To carry out error analysis, you should find a set of mislabeled examples in your dev set, look at the mislabeled examples for false positives and false negatives, and count up the number of errors that fall into various different categories.

During this process, you might be inspired to generate new categories of errors. But by counting up the fraction of examples that are mislabeled in different ways, often this will help you prioritize. Or give you inspiration for new directions to go in.

Example



Creating a spreadsheet of error categories makes the error analysis work clearer and more organized.

Cleaning up incorrectly labeled data

DL algorithms are quite robust to random errors in the training set, but less robust to systematic errors.

Guideline

Apply same process to your dev and test sets to make sure they continue to come from the same distribution.

Consider examining examples your algorithm got right as well as ones it got wrong.

Train and dev/test data may now come from slightly different distributions.

Build your first system quickly, then iterate

Depending on the area of application, the guideline below will help you prioritize when you build your system.

Guideline

Set up development/ test set and metrics

Set up a target

Build an initial system quickly

Training set: Fit the parameters quickly

Development set: Tune the parameters

Test set: Assess the performance

Use Bias/Variance analysis & Error analysis to prioritize next steps

Mismatched training and dev/test set

Example: Cat vs Non-cat

In this example, we want to create a mobile application that will classify and recognize pictures of cats taken and uploaded by users.

There are two sources of data used to develop the mobile app. The first source is small: 10,000 pictures uploaded from the mobile application. Since they are from amateur users, the pictures are not professionally shot, not well framed, and blurrier. The second source is the web, from which you downloaded 200,000 pictures where cat pictures are professionally framed and in high resolution.

The problem is that you have two different distributions:

small data set from pictures uploaded by users. This distribution is important for the mobile app.

bigger data set from the web.

The guideline used is that you have to choose a development set and test set to reflect data you expect to get in the future and consider important to do well.

The data is split as follow:



The advantage of this way of splitting up is that the target is well defined.

The disadvantage is that the training distribution is different from the development and test set distributions. However, this way of splitting the data gives better performance in the long term.

Bias and Variance with mismatched data distributions

When the training set is from a different distribution than the development and test sets, the method to analyze bias and variance changes.



Scenario A

If the development data came from the same distribution as the training set, then there would be a large variance problem: the algorithm is not generalizing well from the training set.

However, since the training data and the development data come from different distributions, this conclusion cannot be drawn. There isn't necessarily a variance problem. The problem might be that the development set contains images that are more difficult to classify accurately.

When the training set, development and test set distributions are different, two things change at the same time. First of all, the algorithm was trained on the training set but not on the development set. Second of all, the distribution of data in the development set is different.

It's difficult to know which of these two changes produces this 9% increase in error between the training set and the development set. To resolve this issue, we define a new subset called the training-development set. This new subset has the same distribution as the training set, but it is not used for training the neural network.

Scenario B

The error between the training set and the training-development set is 8%. In this case, since the training set and the training-development set come from the same distribution, the only difference between them is that the neural network saw the data in the training set but not in the training-development set. The neural network is not generalizing well to data from the same distribution that it hadn't seen before.

Therefore, we really have a variance problem.

Scenario C

In this case, we have a data mismatch problem, since the two data sets come from different distributions.

Scenario D

In this case, the avoidable bias is high since the difference between Bayes error and training error is 10 %.

Scenario E

In this case, there are 2 problems. The first one is that the avoidable bias is high, since the difference between Bayes error and training error is 10%; the second one is a data mismatch problem.

Scenario F

Development should never be done on the test set. However, the difference between the development set and the test set gives the degree of overfitting to the development set.

General formulation



Addressing data mismatch

This is a general guideline to address data mismatch:

Perform manual error analysis to understand the error differences between training and development/test sets. Development should never be done on the test set, to avoid overfitting.

Make training data or collect data similar to the development and test sets. To make the training data more similar to your development set, you can use artificial data synthesis. However, be aware that you might accidentally be simulating data from only a tiny subset of the space of all possible examples.

Learning from multiple tasks

Transfer learning

Transfer learning refers to using the knowledge of a neural network trained on one task for another application.

When to use transfer learning:

• Task A and B have the same input x

• A lot more data for Task A than Task B

• Low level features from Task A could be helpful for Task B

Example 1: Cat recognition - radiology diagnosis

The following neural network is trained for cat recognition, but we want to adapt it for radiology diagnosis. The neural network will learn about the structure and the nature of images. This initial phase of training on image recognition is called pre-training, since it will pre-initialize the weights of the neural network. Updating all the weights afterwards is called fine-tuning.

For cat recognition

Input x: image

Output y: 1 = cat, 0 = no cat



Radiology diagnosis

Input x: Radiology images (CT scans, X-rays)

Output y: Radiology diagnosis (1 = malignant tumor, 0 = benign tumor)



Guideline

• Delete last layer of neural network

• Delete weights feeding into the last output layer of the neural network

• Create a new set of randomly initialized weights for the last layer only

• New data set (x, y)

Multi-task learning

Multi-task learning refers to having one neural network do simultaneously several tasks.

When to use multi-task learning

Training on a set of tasks that could benefit from having shared lower-level features

Usually: Amount of data you have for each task is quite similar

Can train a big enough neural network to do well on all tasks

Example: Simplified autonomous vehicle

The vehicle has to detect several things simultaneously: pedestrians, cars, road signs, traffic lights, cyclists, etc. We could have trained four separate neural networks instead of training one to do four tasks. However, in this case, the performance of the system is better when one neural network is trained to do four tasks than when training four separate neural networks, since some of the earlier features in the neural network can be shared between the different types of objects.

The input x^(i) is the image with multiple labels

The output y^(i) has 4 labels, which represent:



Also, the cost can be computed such that it is not influenced by the fact that some entries are not labeled.
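A sketch of such a masked cost; marking unlabeled entries with -1 is an assumption of this sketch, not the course's notation:

```python
import numpy as np

def multitask_cost(Y_hat, Y):
    """Logistic loss summed over the 4 tasks, skipping unlabeled (-1) entries."""
    mask = (Y != -1)                                        # labeled entries only
    losses = -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))
    losses = np.where(mask, losses, 0.0)                    # unlabeled -> no cost
    return np.sum(losses) / Y.shape[1]                      # average over examples
```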



End-to-end deep learning

What is end-to-end deep learning?

End-to-end deep learning is the simplification of a processing or learning systems into one neural network.

Example - Speech recognition model



End-to-end deep learning cannot be used for every problem, since it needs a lot of labeled data. It is used mainly in audio transcription, image captioning, image synthesis, machine translation, steering in self-driving cars, etc.

Whether to use end-to-end deep learning

Before applying end-to-end deep learning, you need to ask yourself the following question: do you have enough data to learn a function of the complexity needed to map x to y?

Pro:

Let the data speak

By having a pure machine learning approach, the neural network will learn from x to y. It will be able to find which statistics are in the data, rather than being forced to reflect human preconceptions.

Less hand-designing of components needed

It simplifies the design work flow.

Cons:

Large amount of labeled data

It cannot be used for every problem as it needs a lot of labeled data.

Excludes potentially useful hand-designed component

Data and hand-designed components or features are the 2 main sources of knowledge for a learning algorithm. If the data set is small, then a hand-designed system is a way to inject manual knowledge into the algorithm.

Convolutional Neural Networks

About this Course

This course will teach you how to build convolutional neural networks and apply them to image data. Thanks to deep learning, computer vision is working far better than just two years ago, and this is enabling numerous exciting applications ranging from safe autonomous driving, to accurate face recognition, to automatic reading of radiology images.

You will:

- Understand how to build a convolutional neural network, including recent variations such as residual networks.

- Know how to apply convolutional networks to visual detection and recognition tasks.

- Know to use neural style transfer to generate art.

- Be able to apply these algorithms to a variety of image, video, and other 2D or 3D data.

This is the fourth course of the Deep Learning Specialization.

Foundations of Convolutional Neural Networks

Learn to implement the foundational layers of CNNs (pooling, convolutions) and to stack them properly in a deep network to solve multi-class image classification problems.

Learning Objectives

Understand the convolution operation

Understand the pooling operation

Remember the vocabulary used in convolutional neural network (padding, stride, filter, …)

Build a convolutional neural network for image multi-class classification

Computer Vision

Computer vision is one of the areas that’s been advancing rapidly thanks to deep learning. Two reasons make people excited about deep learning for computer vision:

Rapid advances in computer vision are enabling brand new applications that were impossible just a few years ago.

The computer vision research community has been so creative and inventive in coming up with new neural network architectures and algorithms that it has inspired a lot of cross-fertilization into other areas as well.

Edge Detection Example

The convolution operation is one of the fundamental building blocks of a convolutional neural network.

Using edge detection as the motivating example in this module:



In the example above, when the 3x3 filter moves to the positions marked by the red and blue circles, the result of the convolution is shown in the lower right. A lighter region appears in the middle of the output image, which shows that the vertical edge has been successfully detected.

The convolution operation gives you a convenient way to specify how to find these vertical edges in an image.

More Edge Detection

In this module, you’ll learn the difference between positive and negative edges, that is, the difference between light to dark versus dark to light edge transitions. And you’ll also see other types of edge detectors, as well as how to have an algorithm learn, rather than have us hand code an edge detector as we’ve been doing so far.

Different filters allow you to find vertical and horizontal edges:



With the rise of deep learning, one of the things we learned is that when you really want to detect edges in some complicated image, maybe you don’t need to have computer vision researchers handpick these nine numbers. Maybe you can just learn them and treat the nine numbers of this matrix as parameters, which you can then learn using back propagation. And the goal is to learn nine parameters so that when you take the image, the six by six image, and convolve it with your three by three filter, that this gives you a good edge detector.

Rather than just vertical and horizontal edges, maybe deep learning can learn to detect edges that are at 45 degrees or 70 degrees or 73 degrees or at whatever orientation it chooses. And so by just letting all of these numbers be parameters and learning them automatically from data, we find that neural networks can actually learn low level features, can learn features such as edges, even more robustly than computer vision researchers are generally able to code up these things by hand. But underlying all these computations is still this convolution operation, Which allows back propagation to learn whatever three by three filter it wants and then to apply it throughout the entire image, at this position, at this position, at this position, in order to output whatever feature it’s trying to detect. Be it vertical edges, horizontal edges, or edges at some other angle or even some other filter that we might not even have a name for in English.

The idea you can treat these nine numbers as parameters to be learned has been one of the most powerful ideas in computer vision.



Padding

In order to build deep neural networks one modification to the basic convolutional operation that you need to really use is padding.

Drawbacks of the basic convolution operation:

1. If every time you apply a convolutional operator your image shrinks, e.g. from six by six down to four by four, then you can only do this a few times before your image starts getting really small, maybe shrinking down to one by one. You may not want your image to shrink every time you detect edges or other features in it.

2. If you look at a pixel at the corner or the edge, that little pixel is used in only one of the outputs, because it touches only one three by three region. Whereas a pixel in the middle has many three by three regions that overlap it. So pixels on the corners or on the edges are used much less in the output, and you're throwing away a lot of the information near the edge of the image.



In order to fix both of these problems, you can pad the image before applying the convolutional operation. So in this case, say you pad the image with an additional border of one pixel all around the edges.

The main benefits of padding are the following:

It allows you to use a CONV layer without necessarily shrinking the height and width of the volumes. This is important for building deeper networks, since otherwise the height/width would shrink as you go to deeper layers. An important special case is the “same” convolution, in which the height/width is exactly preserved after one layer.

It helps us keep more of the information at the border of an image. Without padding, very few values at the next layer would be affected by pixels as the edges of an image.

In terms of how much to pad, it turns out there are two common choices, called Valid convolutions and Same convolutions.



By convention in computer vision, f is usually odd. There are two reasons for that:

If f were even, you would need some asymmetric padding.

When you have an odd-dimension filter, such as three by three or five by five, it has a central position, and sometimes in computer vision it's nice to have a distinguished pixel you can call the central pixel, so you can talk about the position of the filter.

Strided Convolutions

Strided convolutions is another piece of the basic building block of convolutions as used in Convolutional Neural Networks.

Example:

Let’s say you want to convolve this seven by seven image with this three by three filter, except that instead of doing the usual way, we are going to do it with a stride of two. What that means is instead of stepping the blue box over by one step, we are going to step over by two steps.



Summary of convolutions:



Reminder:

The formulas relating the output shape of the convolution to the input shape is:

nH = ⌊(nH_prev − f + 2 × pad) / stride⌋ + 1

nW = ⌊(nW_prev − f + 2 × pad) / stride⌋ + 1

nC = number of filters used in the convolution
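These formulas as a small helper, checked against the strided-convolution example above (a 7x7 image, 3x3 filter, stride 2, no padding gives a 3x3 output):

```python
def conv_output_shape(n_H_prev, n_W_prev, f, pad, stride, n_filters):
    """Output volume of a convolution, per the formulas above."""
    n_H = (n_H_prev - f + 2 * pad) // stride + 1   # integer division = the floor
    n_W = (n_W_prev - f + 2 * pad) // stride + 1
    return n_H, n_W, n_filters

print(conv_output_shape(7, 7, f=3, pad=0, stride=2, n_filters=1))  # (3, 3, 1)
```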

Technical note on cross-correlation vs. convolution

Convolution in math textbook:



In math or signal processing textbooks, there is one other possible inconsistency in the notation: the way convolution is defined includes one extra step before doing the element-wise product and summing. To convolve the six by six matrix with the three by three filter, you first flip the three by three filter on the horizontal as well as the vertical axis, and then apply the flipped filter to the target matrix.

To summarize, by convention in machine learning, we usually do not bother with this flipping operation and technically, this operation is maybe better called cross-correlation but most of the deep learning literature just calls it the convolution operator.

Convolutions Over Volume

Convolution can be implemented not only over just 2D images, but over three dimensional volumes.



The three by three by three filter has 27 numbers, or 27 parameters: that's three cubed. What you do is take each of these 27 numbers and multiply them with the corresponding numbers from the red, green, and blue channels of the image: take the first nine numbers and multiply them with the red channel, the next nine with the green channel, and the next nine with the blue channel, multiplying with the corresponding 27 numbers covered by the yellow cube shown on the left. Then add up all those numbers, and this gives you the first number in the output; to compute the next output, you take this cube and slide it over by one.

Multiple filters



The idea of convolution on volumes, turns out to be really powerful. Only a small part of it is that you can now operate directly on RGB images with three channels. But even more important is that you can now detect two features, like vertical, horizontal edges, or maybe several hundreds of different features. And the output will then have a number of channels equal to the number of filters you are detecting.

One Layer of a Convolutional Network



Summary of notation

If layer l is a convolution layer:



Simple Convolutional Network Example



Types of layer in a convolutional network

Convolution

Pooling

Fully connected

Pooling Layers

Other than convolutional layers, ConvNets often also use pooling layers to reduce the size of the representation and speed up the computation, as well as to make some of the features they detect a bit more robust.

The pooling (POOL) layer reduces the height and width of the input. It helps reduce computation, as well as helps make feature detectors more invariant to its position in the input. The two types of pooling layers are:

Max-pooling layer: slides an (f, f) window over the input and stores the max value of the window in the output.

Average-pooling layer: slides an (f, f) window over the input and stores the average value of the window in the output.

These pooling layers have no parameters for backpropagation to train. However, they have hyperparameters such as the window size f, which specifies the height and width of the f×f window you compute a max or average over.
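A minimal sketch of max pooling on a single 2D slice; for a 3D input you would apply it to each channel independently:

```python
import numpy as np

def max_pool(A, f=2, stride=2):
    """Slide an f x f window over A with the given stride, keeping the max."""
    n_H = (A.shape[0] - f) // stride + 1
    n_W = (A.shape[1] - f) // stride + 1
    out = np.zeros((n_H, n_W))
    for i in range(n_H):
        for j in range(n_W):
            window = A[i * stride:i * stride + f, j * stride:j * stride + f]
            out[i, j] = np.max(window)   # keep the strongest feature response
    return out
```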



The intuition behind max pooling:

If the feature is detected anywhere in the window, keep a high number. But if the feature is not detected, say it doesn't exist in the upper right-hand quadrant, then the max of all those numbers is still quite small.

Example of max pooling:



If you have a 3D input, then the output will have the same number of channels.

There is another type of pooling that isn't used very often but is worth mentioning briefly: average pooling.



Instead of taking the max within each window, average pooling takes the average.

Summary of pooling



CNN Example





Why Convolutions?

Two main advantages of convolutional layers over just using fully connected layers:



parameter sharing

A feature detector (such as a vertical edge detector) that’s useful in one part of the image is probably useful in another part of the image.

sparsity of connections

In each layer, each output value depends only on a small number of inputs.



Deep convolutional models: case studies

Learn about the practical tricks and methods used in deep CNNs straight from the research papers.

Learning Objectives

Understand multiple foundational papers of convolutional neural networks

Analyze the dimensionality reduction of a volume in a very deep network

Understand and Implement a Residual network

Build a deep neural network using Keras

Implement a skip-connection in your network

Clone a repository from github and use transfer learning

Case studies

Why look at case studies?


A good way to get intuition on how to build conv nets is to read or to see other examples of effective conv nets.

A neural network architecture that works well on one computer vision task often works well on other tasks.

Outline

Classic networks:

LeNet-5

AlexNet

VGG

ResNet

Inception

Classic Networks

LeNet - 5



The goal of LeNet-5 was to recognize handwritten digits.

AlexNet



AlexNet convinced a lot of the computer vision community to take a serious look at deep learning to convince them that deep learning really works in computer vision. And then it grew on to have a huge impact not just in computer vision but beyond computer vision as well.

VGG - 16



A remarkable thing about the VGG-16 net is that they said, instead of having so many hyperparameters, the VGG network really simplified this neural network architectures. The architecture is really quite uniform.

ResNets

The problem of very deep neural networks

Last week, you built your first convolutional neural network. In recent years, neural networks have become deeper, with state-of-the-art networks going from just a few layers (e.g., AlexNet) to over a hundred layers.

The main benefit of a very deep network is that it can represent very complex functions. It can also learn features at many different levels of abstraction, from edges (at the lower layers) to very complex features (at the deeper layers). However, using a deeper network doesn’t always help. A huge barrier to training them is vanishing gradients: very deep networks often have a gradient signal that goes to zero quickly, thus making gradient descent unbearably slow. More specifically, during gradient descent, as you backprop from the final layer back to the first layer, you are multiplying by the weight matrix on each step, and thus the gradient can decrease exponentially quickly to zero (or, in rare cases, grow exponentially quickly and “explode” to take very large values).

During training, you might therefore see the magnitude (or norm) of the gradient for the earlier layers decrease to zero very rapidly as training proceeds:



You are now going to solve this problem by building a Residual Network!

Building a Residual Network

In ResNets, a “shortcut” or a “skip connection” allows the gradient to be directly backpropagated to earlier layers:



The image on the left shows the “main path” through the network. The image on the right adds a shortcut to the main path. By stacking these ResNet blocks on top of each other, you can form a very deep network.

We also saw in lecture that having ResNet blocks with the shortcut also makes it very easy for one of the blocks to learn an identity function. This means that you can stack on additional ResNet blocks with little risk of harming training set performance. (There is also some evidence that the ease of learning an identity function, even more than skip connections helping with vanishing gradients, accounts for ResNets' remarkable performance.)

Two main types of blocks are used in a ResNet, depending mainly on whether the input/output dimensions are same or different.

The identity block is the standard block used in ResNets, and corresponds to the case where the input activation (say a^[l]) has the same dimension as the output activation (say a^[l+2]). To flesh out the different steps of what happens in a ResNet's identity block, here is an alternative diagram showing the individual steps:



The upper path is the “shortcut path.” The lower path is the “main path.” In this diagram, we have also made explicit the CONV2D and ReLU steps in each layer. To speed up training we have also added a BatchNorm step.

The ResNet “convolutional block” is the other type of block. You can use this type of block when the input and output dimensions don’t match up. The difference with the identity block is that there is a CONV2D layer in the shortcut path:



The CONV2D layer in the shortcut path is used to resize the input x to a different dimension, so that the dimensions match up in the final addition needed to add the shortcut value back to the main path. (This plays a similar role as the matrix W_s discussed in lecture.) For example, to reduce the activation dimensions' height and width by a factor of 2, you can use a 1x1 convolution with a stride of 2. The CONV2D layer on the shortcut path does not use any non-linear activation function. Its main role is to apply a (learned) linear function that reduces the dimension of the input, so that the dimensions match up for the later addition step.

What you should remember:

- Very deep “plain” networks don’t work in practice because they are hard to train due to vanishing gradients.

- The skip-connections help to address the Vanishing Gradient problem. They also make it easy for a ResNet block to learn an identity function.

- There are two main types of blocks: the identity block and the convolutional block.

- Very deep Residual Networks are built by stacking these blocks together.

Why ResNets Work

why do ResNets work so well?

Doing well on the training set is usually a prerequisite to doing well on your hold-out, dev, or test sets. So being able to at least train a ResNet to do well on the training set is a good first step toward that.

If you make a network deeper, it can hurt your ability to train the network to do well on the training set. But this is not true, or at least is much less true, when you are training a ResNet.

Example:



W is really the key term to pay attention to here. If w^[l+2] is equal to zero, and let's say b is also equal to zero, then these terms go away because they're equal to zero, and then g(a^[l]) is just equal to a^[l], because we assumed we're using the ReLU activation function: all the activations are non-negative, and g applied to a non-negative quantity just gives it back. So what this shows is that the identity function is easy for a residual block to learn, and it's easy to get a^[l+2] equal to a^[l] because of this skip connection. What that means is that adding these two layers to your neural network doesn't really hurt its ability to do as well as the simpler network without these two extra layers, because it's quite easy for it to learn the identity function and just copy a^[l] to a^[l+2] despite the addition of these two layers. This is why adding a residual block somewhere in the middle or at the end of a big neural network doesn't hurt performance. But of course our goal is not just to not hurt performance but to help performance, and you can imagine that if all of these hidden units actually learned something useful, then maybe you can do even better than learning the identity function. What goes wrong in very deep plain networks, without the residual skip connections, is that as you make the network deeper and deeper, it's actually very difficult for it to choose parameters that learn even the identity function, which is why a lot of layers end up making the result worse rather than better.

The main reason the residual network works is that it's so easy for these extra layers to learn the identity function that you're pretty much guaranteed they don't hurt performance, and a lot of the time you may get lucky and they even help performance.

Networks in Networks and 1x1 Convolutions

What does a 1 × 1 convolution do?



The 1 × 1 convolution will look at each of the 36 different positions here, take the element-wise product between the 32 numbers on the left and the 32 numbers in the filter, and then apply a ReLU non-linearity after that.

This idea is often called a 1 x 1 convolution, but it's sometimes also called Network in Network.

Using 1×1 convolutions



A 1 x 1 convolution is a way to shrink nC, whereas pooling layers shrink nH and nW, the height and width.

Inception Network Motivation

Motivation for inception network



The problem of computational cost





To summarize, if you are building a layer of a neural network and you don't want to have to decide between a 1 by 1, 3 by 3, or 5 by 5 convolution or a pooling layer, the inception module lets you do them all and concatenate the results. That leads to the problem of computational cost, and what you saw here was how using a 1 by 1 convolution you can create a bottleneck layer, thereby reducing the computational cost significantly. Now you might be wondering: does shrinking down the representation size so dramatically hurt the performance of your neural network? It turns out that as long as you implement the bottleneck layer within reason, you can shrink down the representation size significantly without seeming to hurt performance, while saving a lot of computation. These are the key ideas of the inception module.

Inception Network

Inception module



Inception network (GoogleNet)



To summarize, if you understand the Inception module, then you understand the Inception network, which is largely the Inception module repeated a bunch of times throughout the network.

Practical advices for using ConvNets

Using Open-Source Implementation

It turns out that a lot of these neural networks are difficult or finicky to replicate, because many details about hyperparameter tuning, such as learning rate decay, make some difference to the performance.

Therefore, it’s sometimes difficult to replicate someone else’s published work just from reading their paper. Fortunately, a lot of deep learning researchers routinely open source their work on the Internet, such as on GitHub.

One of the advantages of doing so also is that sometimes these networks take a long time to train, and someone else might have used multiple GPUs and a very large dataset to pretrain some of these networks. And that allows you to do transfer learning using these networks.

Transfer Learning

If you're building a computer vision application, rather than training the weights from scratch, from random initialization, you often make much faster progress if you download weights that someone else has already trained on a network architecture, use that as pre-training, and transfer it to a new task you might be interested in.

In practice, because the open datasets on the internet are so big, and the weights you can download have been trained by someone else for weeks on so much data, you'll find that for a lot of computer vision applications you do much better if you download someone else's open-source weights and use them as the initialization for your problem. Among all the different applications of deep learning, computer vision is one where transfer learning is something you should almost always do; it is very worth seriously considering unless you have an exceptionally large dataset and a very large computation budget to train everything from scratch yourself.
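As a sketch of this workflow in Keras, assuming ResNet50 with ImageNet weights as the downloaded network and a hypothetical 100-class task; with a small dataset you freeze the pretrained layers and train only your own head:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Download pretrained weights and drop the original softmax head.
base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                      pooling="avg", input_shape=(224, 224, 3))
base.trainable = False   # freeze: use the pretrained layers as a fixed encoder

model = models.Sequential([
    base,
    layers.Dense(100, activation="softmax"),  # your own head (100 classes assumed)
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
# With more data, unfreeze some (or all) of `base` and fine-tune end to end.
```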

Data Augmentation

Most computer vision tasks could use more data, so data augmentation is one of the techniques often used to improve the performance of computer vision systems.

Common augmentation method



Color shifting



Implementing distortions during training



Similar to other parts of training a deep neural network, the data augmentation process has a few hyperparameters, such as how much color shifting you implement and exactly what parameters you use for random cropping. So, as elsewhere in computer vision, a good place to start might be someone else's open-source implementation of data augmentation. But if you want to capture more invariances than their implementation does, it might be reasonable to tune the augmentation hyperparameters yourself.
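A minimal sketch of such distortions with `tf.image`; the crop size and distortion strengths are hypothetical hyperparameters, and the random crop assumes the source image is larger than 224 × 224:

```python
import tensorflow as tf

def augment(image):
    image = tf.image.random_flip_left_right(image)             # mirroring
    image = tf.image.random_crop(image, size=(224, 224, 3))    # random cropping
    image = tf.image.random_brightness(image, max_delta=0.2)   # color shifting
    image = tf.image.random_saturation(image, 0.7, 1.3)
    return image

# Typically applied on the fly in the input pipeline, in parallel with training:
# dataset = dataset.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
```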

State of Computer Vision

Deep learning has been successfully applied to computer vision, natural language processing, speech recognition, online advertising, logistics, many, many, many problems. There are a few things that are unique about the application of deep learning to computer vision, about the status of computer vision. In this video, I will share with you some of my observations about deep learning for computer vision and I hope that that will help you better navigate the literature, and the set of ideas out there, and how you build these systems yourself for computer vision.

Data vs. hand-engineering



Tips for doing well on benchmarks/wining competitions



Use open source code



Object detection

Learn how to apply your knowledge of CNNs to one of the toughest but hottest fields of computer vision: object detection.

Learning Objectives

Understand the challenges of Object Localization, Object Detection and Landmark Finding

Understand and implement non-max suppression

Understand and implement intersection over union

Understand how we label a dataset for an object detection application

Remember the vocabulary of object detection (landmark, anchor, bounding box, grid, …)

Object Localization

Object detection is one of the areas of computer vision that’s just exploding and is working so much better than just a couple of years ago. In order to build up to object detection, you first learn about object localization.

The problem discussed here is classification with localization. Not only does the algorithm have to label the image as, say, a car; it is also responsible for putting a bounding box, that is, drawing a red rectangle, around the position of the car in the image. That's called the classification with localization problem, where the term localization refers to figuring out where in the picture the car you've detected is.





The above loss function is just for simplicity. In practice you could use a log-likelihood loss for $c_1, c_2, c_3$ with a softmax output, something like squared error for the bounding-box coordinates, and a logistic-regression loss for $p_c$; although even if you use squared error for everything, it will probably work okay.
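For concreteness, here is a sketch of the target label described above, with illustrative numbers; the three classes (e.g., pedestrian, car, motorcycle) and the exact layout follow the usual lecture convention, so treat them as assumptions:

```python
import numpy as np

# Target label: y = [p_c, b_x, b_y, b_h, b_w, c1, c2, c3]
#   p_c        : is there any object at all? (1 = yes, 0 = background)
#   b_x, b_y   : bounding-box center, as fractions of the image size
#   b_h, b_w   : bounding-box height and width, as fractions of the image size
#   c1, c2, c3 : class indicator (e.g., pedestrian, car, motorcycle)
y_car = np.array([1, 0.5, 0.7, 0.3, 0.4, 0, 1, 0])  # a car; numbers illustrative

# For background, p_c = 0 and the other components are "don't cares";
# NaN stands in for the "?" entries and the loss is computed on p_c alone.
y_background = np.array([0] + [np.nan] * 7)
```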

Landmark Detection

Landmarks are important points in an image, whose X and Y coordinates you want the neural network to output and thereby recognize.



To train a network to detect landmarks, you need a labeled training set, and the labels have to be consistent across different images. But if you can hire labelers, or label a big enough dataset yourself, then a neural network can output all of these landmarks, which can be used for other interesting effects such as estimating the pose of a person or recognizing someone's emotion from a picture. A sketch of such an output head appears below.
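A minimal sketch of such an output head in Keras; the 64-landmark count and the 1024-d feature input are illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

num_landmarks = 64
features = layers.Input(shape=(1024,))                  # hypothetical ConvNet features
x = layers.Dense(256, activation="relu")(features)
coords = layers.Dense(2 * num_landmarks, name="xy")(x)  # (x, y) per landmark
face = layers.Dense(1, activation="sigmoid", name="face_present")(x)
model = Model(inputs=features, outputs=[coords, face])
```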

Object Detection





Special applications: Face recognition & Neural style transfer

Discover how CNNs can be applied to multiple fields, including art generation and face recognition. Implement your own algorithm to generate art and recognize faces!

Face Recognition

What you should remember:

- Face verification solves an easier 1:1 matching problem; face recognition addresses a harder 1:K matching problem.

- The triplet loss is an effective loss function for training a neural network to learn an encoding of a face image.

- The same encoding can be used for verification and recognition. Measuring distances between two images’ encodings allows you to determine whether they are pictures of the same person.

What is face recognition?

Face verification vs. face recognition

Verification

Input image, name/ID

Output whether the input image is that of the claimed person

1:1 matching problem.

Recognition

Has a database of K persons

Get an input image

Output ID if the image is any of the K persons (or “not recognized”)

1:K matching problem

One Shot Learning

You need to be able to recognize a person even though you may have only one sample of them in your database.

You can't train a CNN with a softmax output (one class per person) because:

You don't have enough samples

If a new person joins, you need to retrain the network





Siamese Network

A Siamese network is a good way to take two faces as input and tell you how similar or how different they are.



By using a 128-neuron fully connected layer as its last layer, the model ensures that the output is an encoding vector of size 128. You then use the encodings to compare two face images as follows:



Figure 2:
By computing a distance between two encodings and thresholding, you can determine if the two pictures represent the same person

So, an encoding is a good one if:

The encodings of two images of the same person are quite similar to each other

The encodings of two images of different persons are very different
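A minimal sketch of this comparison, assuming the encodings are 128-dimensional NumPy vectors; the threshold of 0.7 is a hypothetical value you would tune on a dev set of matched and mismatched pairs:

```python
import numpy as np

def verify(encoding_1, encoding_2, threshold=0.7):
    """Return True if the two encodings likely show the same person."""
    distance = np.linalg.norm(encoding_1 - encoding_2)  # ||f(x1) - f(x2)||_2
    return distance < threshold
```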

Triplet Loss

The triplet loss function formalizes this, and tries to “push” the encodings of two images of the same person (Anchor and Positive) closer together, while “pulling” the encodings of two images of different persons (Anchor, Negative) further apart.



Figure 3:
In the next part, we will call the pictures from left to right: Anchor (A), Positive (P), Negative (N)

For an image $x$, we denote its encoding $f(x)$, where $f$ is the function computed by the neural network.



Training will use triplets of images $(A, P, N)$:

A is an “Anchor” image–a picture of a person.

P is a “Positive” image–a picture of the same person as the Anchor image.

N is a “Negative” image–a picture of a different person than the Anchor image.

These triplets are picked from our training dataset. We will write $(A^{(i)}, P^{(i)}, N^{(i)})$ to denote the $i$-th training example.

You'd like to make sure that an image $A^{(i)}$ of an individual is closer to the Positive $P^{(i)}$ than to the Negative image $N^{(i)}$ by at least a margin $\alpha$:

$$\|f(A^{(i)}) - f(P^{(i)})\|_2^2 + \alpha < \|f(A^{(i)}) - f(N^{(i)})\|_2^2$$

You would thus like to minimize the following “triplet cost”:

$$\mathcal{J} = \sum_{i=1}^{m} \Big[ \underbrace{\|f(A^{(i)}) - f(P^{(i)})\|_2^2}_{(1)} - \underbrace{\|f(A^{(i)}) - f(N^{(i)})\|_2^2}_{(2)} + \alpha \Big]_+ \tag{3}$$

Here, we are using the notation "$[z]_+$" to denote $\max(z, 0)$.

Notes:

- The term (1) is the squared distance between the anchor “A” and the positive “P” for a given triplet; you want this to be small.

- The term (2) is the squared distance between the anchor "A" and the negative "N" for a given triplet; you want this to be relatively large, so it makes sense to have a minus sign preceding it.

- $\alpha$ is called the margin. It is a hyperparameter that you should pick manually. We will use $\alpha = 0.2$.

Most implementations also normalize the encoding vectors to have unit norm (i.e., $\|f(\mathrm{img})\|_2 = 1$).
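A sketch of this cost in TensorFlow, assuming `anchor`, `positive`, and `negative` are batches of already-computed encodings of shape (m, 128); summing versus averaging over the batch is a minor implementation choice:

```python
import tensorflow as tf

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Triplet cost over a batch of (A, P, N) encodings."""
    pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1)  # term (1)
    neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1)  # term (2)
    basic_loss = pos_dist - neg_dist + alpha
    return tf.reduce_sum(tf.maximum(basic_loss, 0.0))  # [z]_+ = max(z, 0)
```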

How do we choose triplets to train on?

If A/P are very similar, and A/N are very different, training is very easy.

To train a good network, select triplets where A and N are pretty similar, i.e., "hard" triplets where d(A, P) is close to d(A, N).

Some big companies have already trained networks on large amounts of photos, so you may just want to reuse their weights.

Face Verification and Binary Classification

The triplet loss is one good way to learn the parameters of a ConvNet for face recognition. There's another way to learn these parameters: take a pair of neural networks, the Siamese network, have both of them compute the embeddings, and feed these embeddings into a logistic regression unit that makes a prediction, where the target output is one if both images are of the same person and zero if they are of different persons. This is a way to treat face recognition as just a binary classification problem.

Rather than feeding in the raw encodings, the input to the final logistic regression unit is the element-wise differences between the encodings. This is a pretty useful way to learn to predict zero or one, i.e., whether these are the same person or different persons.
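A minimal sketch of this architecture in Keras; `encoder` is assumed to be a pretrained Siamese model mapping a face image to a 128-d encoding, and the 96 × 96 × 3 input shape is illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_verifier(encoder, input_shape=(96, 96, 3)):
    """Logistic-regression head on the element-wise encoding difference."""
    img_1 = layers.Input(shape=input_shape)
    img_2 = layers.Input(shape=input_shape)
    f1, f2 = encoder(img_1), encoder(img_2)                        # shared weights
    diff = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([f1, f2])  # |f(x1) - f(x2)|
    y_hat = layers.Dense(1, activation="sigmoid")(diff)            # 1 = same person
    return Model(inputs=[img_1, img_2], outputs=y_hat)
```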

One computational trick can help deployment significantly: instead of having to compute the embedding of a database image every single time, you can pre-compute it. When a new employee walks in, you use the upper branch of the network to compute the encoding of the new image and compare it against the pre-computed encodings to make a prediction. Because you don't need to store the raw images, and because with a very large database of employees you don't need to recompute their encodings every time, this pre-computation can save a significant amount of computation. It works both for the Siamese-network architecture where you treat face recognition as a binary classification problem, and when you learn encodings using the triplet loss function described in the last module.





To treat face verification as supervised learning, you create a training set of pairs of images, where the target label is one when the pair shows the same person and zero when the pair shows different persons, and you use these pairs to train the Siamese network with backpropagation.

Sequence Models

About the Course

This course will teach you how to build models for natural language, audio, and other sequence data. Thanks to deep learning, sequence algorithms are working far better than just two years ago, and this is enabling numerous exciting applications in speech recognition, music synthesis, chatbots, machine translation, natural language understanding, and many others.

You will:

Understand how to build and train Recurrent Neural Networks (RNNs), and commonly-used variants such as GRUs and LSTMs.

Be able to apply sequence models to natural language problems, including text synthesis.

Be able to apply sequence models to audio applications, including speech recognition and music synthesis.

This is the fifth and final course of the Deep Learning Specialization.

deeplearning.ai is also partnering with the NVIDIA Deep Learning Institute (DLI) in Course 5, Sequence Models, to provide a programming assignment on Machine Translation with deep learning. You will have the opportunity to build a deep learning project with cutting-edge, industry-relevant content.

Natural Language Processing & Word Embeddings

Natural language processing with deep learning is an important combination. Using word vector representations and embedding layers you can train recurrent neural networks with outstanding performances in a wide variety of industries. Examples of applications are sentiment analysis, named entity recognition and machine translation.

Introduction to Word Embeddings

Word embeddings are a way of representing words that lets your algorithms automatically understand analogies, such as "man is to woman as king is to queen", and many other examples. Through these ideas of word embeddings, you'll be able to build NLP applications even with relatively small labeled training sets.

Word Representation



One of the weaknesses of the one-hot representation is that it treats each word as a thing unto itself and doesn't allow an algorithm to easily generalize across words.



So, instead of a one-hot representation, we can learn a featurized representation: a set of features and values for each of these words.

To get word embeddings, we learn high-dimensional feature vectors like these, which give a better representation than one-hot vectors for representing different words.



One common algorithm for visualizing high-dimensional data is t-SNE. With it, you can easily see similar words grouped together.
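A minimal sketch with scikit-learn; the random vectors stand in for learned 300-dimensional embeddings, so the output positions here are meaningless except as a usage example:

```python
import numpy as np
from sklearn.manifold import TSNE

words = ["man", "woman", "king", "queen", "apple", "orange"]
embeddings = np.random.randn(len(words), 300)  # stand-in for learned vectors

# Project to 2-D; with real embeddings, similar words land near each other.
points = TSNE(n_components=2, perplexity=3).fit_transform(embeddings)
for word, (x, y) in zip(words, points):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```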