
COURSE 1 Neural Networks and Deep Learning


Week1

What is a neural network?

It is a powerful learning algorithm inspired by how the brain works.

Example 1 - Single-neuron neural network

You are given data about the sizes of houses on the real estate market and want to fit a function that predicts their price. It is a linear regression problem because the price, as a function of size, is a continuous output.

We know that prices can never be negative, so we use a function that starts at zero, called the Rectified Linear Unit (ReLU).



The input is the size of the house (x)

The output is the price (y)

The “neuron” implements the function ReLU (blue line)
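As a minimal sketch of this single-neuron predictor (the weight w and bias b below are made-up illustrative values, not learned ones):

```python
import numpy as np

def relu(z):
    # Rectified Linear Unit: 0 for negative inputs, z otherwise
    return np.maximum(0, z)

def predict_price(size, w, b):
    # One "neuron": a linear function of the size, clipped at zero by ReLU
    return relu(w * size + b)

# Hypothetical weight and bias, for illustration only
print(predict_price(size=2000, w=150.0, b=-50000.0))  # 250000.0
```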



Example 2 – Multi-neuron neural network

The price of a house can be affected by other features, such as its size, the number of bedrooms, the zip code, and the wealth of the neighborhood. The role of the neural network is to predict the price, and it will automatically generate the hidden units. We only need to provide the inputs x and the output y.



Supervised learning for Neural Networks

In supervised learning, we are given a data set and already know what our correct output should look like,

having the idea that there is a relationship between the input and the output.

Supervised learning problems are categorized into “regression” and “classification” problems. In a

regression problem, we are trying to predict results within a continuous output, meaning that we are

trying to map input variables to some continuous function. In a classification problem, we are instead

trying to predict results in a discrete output. In other words, we are trying to map input variables into

discrete categories.

There are different types of neural networks: for example, the Convolutional Neural Network (CNN), often used for image applications, and the Recurrent Neural Network (RNN), used for one-dimensional sequence data such as translating English to Chinese, or data with a temporal component such as a text transcript. Autonomous driving uses a hybrid neural network architecture.

Neural Network examples



Structured vs unstructured data

Structured data refers to things that has a defined meaning such as price, age whereas unstructured

data refers to thing like pixel, raw audio, text.



Why is deep learning taking off?

Deep learning is taking off due to the large amount of data made available by the digitization of society, faster computation, and innovation in the development of neural network algorithms.



Two things have to be in place to reach a high level of performance:

Being able to train a big enough neural network

A huge amount of labeled data

The process of training a neural network is iterative.



It can take a long time to train a neural network, which affects your productivity. Faster computation helps you iterate and improve algorithms more quickly.

Week2

Binary Classification

In a binary classification problem, the result is a discrete value output, for example 0 or 1.

Notation

a training example:

$$(x, y), \quad x \in \mathbb{R}^{n_x}, \; y \in \{0, 1\}$$

m training examples:

$$\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\}, \quad m = m_{\text{train}} = \text{number of training examples}$$

matrix:

$$X = [x^{(1)}, x^{(2)}, \ldots, x^{(m)}] \in \mathbb{R}^{n_x \times m}, \quad Y = [y^{(1)}, y^{(2)}, \ldots, y^{(m)}] \in \mathbb{R}^{1 \times m}$$

goal:

Given $x$, $\hat{y} = P(y = 1 \mid x)$, where $0 \le \hat{y} \le 1$

Logistic Regression

parameters

The input feature vector:

$x \in \mathbb{R}^{n_x}$, where $n_x$ is the number of features

The training label:

$y \in \{0, 1\}$

The weights:

$w \in \mathbb{R}^{n_x}$, where $n_x$ is the number of features

The threshold (bias):

$b \in \mathbb{R}$

The output:

$$\hat{y} = \sigma(w^T x + b)$$

Sigmoid function:

$$s = \sigma(w^T x + b) = \sigma(z) = \frac{1}{1 + e^{-z}}$$

Loss (error) function:

$$\mathcal{L}(\hat{y}, y) = -\big(y \log(\hat{y}) + (1 - y)\log(1 - \hat{y})\big)$$

Cost function:

$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\Big(y^{(i)}\log(\hat{y}^{(i)}) + (1 - y^{(i)})\log(1 - \hat{y}^{(i)})\Big)$$
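A minimal numpy sketch of the sigmoid and the cost computed over all m examples (the column-wise layout X of shape (n_x, m) and Y of shape (1, m) follows the notation above; function names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(w, b, X, Y):
    # X: (n_x, m) inputs stacked column-wise, Y: (1, m) labels, w: (n_x, 1), b: scalar
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)   # y_hat for every example, shape (1, m)
    return -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
```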

Gradient Descent

Want to find w and b that minimize J(w, b)

Process

Repeat

$$w := w - \alpha \frac{\partial J(w, b)}{\partial w}$$
$$b := b - \alpha \frac{\partial J(w, b)}{\partial b}$$

Logistic Regression Gradient Descent

Recap

$$z = w^T x + b$$
$$\hat{y} = a = \sigma(z)$$
$$\mathcal{L}(a, y) = -\big(y \log(a) + (1 - y)\log(1 - a)\big)$$

Gradient Descent

$$dz = \frac{\partial \mathcal{L}}{\partial z} = \frac{\partial \mathcal{L}}{\partial a} \cdot \frac{\partial a}{\partial z} = a - y \qquad \left(\text{since } \frac{\partial a}{\partial z} = a(1 - a)\right)$$
$$dw_1 = \frac{\partial \mathcal{L}}{\partial w_1} = x_1 \cdot dz, \quad dw_2 = \frac{\partial \mathcal{L}}{\partial w_2} = x_2 \cdot dz, \quad \ldots \quad db = \frac{\partial \mathcal{L}}{\partial b} = dz$$

Process

$$w_1 := w_1 - \alpha\, dw_1, \quad w_2 := w_2 - \alpha\, dw_2, \quad \ldots \quad b := b - \alpha\, db$$

Gradient Descent on m examples

Recap

$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(a^{(i)}, y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\Big(y^{(i)}\log(a^{(i)}) + (1 - y^{(i)})\log(1 - a^{(i)})\Big)$$
$$a^{(i)} = \hat{y}^{(i)} = \sigma(z^{(i)}) = \sigma(w^T x^{(i)} + b)$$

Descent

$$dz^{(i)} = \frac{\partial \mathcal{L}}{\partial z^{(i)}} = a^{(i)} - y^{(i)}$$
$$dw_1 = \frac{1}{m}\sum_{i=1}^{m} \frac{\partial \mathcal{L}}{\partial w_1} = \frac{1}{m}\sum_{i=1}^{m} x_1^{(i)}\, dz^{(i)}, \quad dw_2 = \frac{1}{m}\sum_{i=1}^{m} \frac{\partial \mathcal{L}}{\partial w_2} = \frac{1}{m}\sum_{i=1}^{m} x_2^{(i)}\, dz^{(i)}, \quad \ldots$$
$$db = \frac{1}{m}\sum_{i=1}^{m} \frac{\partial \mathcal{L}}{\partial b} = \frac{1}{m}\sum_{i=1}^{m} dz^{(i)}$$

Pseudocode
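A Python sketch of the per-example gradient descent loop described above, for a single step (names and shapes are assumptions consistent with the notation in this section):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_step_loop(w, b, X, Y, alpha):
    # X: (n_x, m), Y: (1, m), w: (n_x, 1), b: scalar, alpha: learning rate
    n_x, m = X.shape
    J, dw, db = 0.0, np.zeros((n_x, 1)), 0.0
    for i in range(m):                         # loop over the m training examples
        x_i = X[:, i:i+1]                      # i-th example, shape (n_x, 1)
        z_i = np.dot(w.T, x_i).item() + b
        a_i = sigmoid(z_i)
        J += -(Y[0, i] * np.log(a_i) + (1 - Y[0, i]) * np.log(1 - a_i))
        dz_i = a_i - Y[0, i]                   # dL/dz for example i
        dw += x_i * dz_i                       # accumulate dL/dw = x * dz
        db += dz_i
    J, dw, db = J / m, dw / m, db / m          # average over the m examples
    w = w - alpha * dw                         # gradient descent update
    b = b - alpha * db
    return w, b, J
```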



Vectorization

Logistic Regression Derivatives



Vectorizing Logistic Regression

$$X = [x^{(1)}, x^{(2)}, \ldots, x^{(m)}], \quad Y = [y^{(1)}, y^{(2)}, \ldots, y^{(m)}]$$
$$Z = [z^{(1)}, z^{(2)}, \ldots, z^{(m)}] = w^T X + b, \quad A = [a^{(1)}, a^{(2)}, \ldots, a^{(m)}] = \sigma(Z)$$

Implementing Logistic Regression
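A hedged, vectorized numpy sketch of one possible implementation (the function name, learning rate, and iteration count are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, Y, alpha=0.01, num_iterations=1000):
    # X: (n_x, m) inputs, Y: (1, m) labels
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0
    for _ in range(num_iterations):
        Z = np.dot(w.T, X) + b            # (1, m): all z^(i) at once
        A = sigmoid(Z)                    # (1, m): all predictions a^(i)
        dZ = A - Y                        # (1, m)
        dw = np.dot(X, dZ.T) / m          # (n_x, 1)
        db = np.sum(dZ) / m               # scalar
        w = w - alpha * dw                # gradient descent update
        b = b - alpha * db
    return w, b
```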



Broadcasting in Python

General Principle

$$(m, n) \;[+\,-\,*\,/]\; (1, n) \;\to\; (m, n) \;[+\,-\,*\,/]\; (m, n)$$
$$(m, n) \;[+\,-\,*\,/]\; (m, 1) \;\to\; (m, n) \;[+\,-\,*\,/]\; (m, n)$$
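A quick numpy demonstration of the (m, n) with (1, n) case (the numbers are arbitrary):

```python
import numpy as np

A = np.array([[56.0,   0.0,  4.4, 68.0],
              [ 1.2, 104.0, 52.0,  8.0],
              [ 1.8, 135.0, 99.0,  0.9]])   # shape (3, 4)

col_sums = A.sum(axis=0, keepdims=True)     # shape (1, 4)
percentages = 100 * A / col_sums            # (3, 4) / (1, 4): the (1, 4) row is
                                            # broadcast (copied) down the 3 rows
print(percentages)
```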

Week3

Neural Networks Overview



Neural Network Representation



Computing a Neural Network’s Output

$$z^{[1]} = W^{[1]} x + b^{[1]} = W^{[1]} a^{[0]} + b^{[1]}, \quad a^{[1]} = \sigma(z^{[1]})$$
$$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}, \quad a^{[2]} = \sigma(z^{[2]})$$
$$\ldots$$

Vectorizing across multiple examples

$a^{[2](i)}$: example $i$, layer 2
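The vectorized forward-propagation equations can be reconstructed from the per-example formulas above, with the matrices $X$, $Z^{[l]}$, $A^{[l]}$ stacking the $m$ examples column-wise as in Week 2:

$$Z^{[1]} = W^{[1]} X + b^{[1]}, \quad A^{[1]} = g^{[1]}(Z^{[1]})$$
$$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}, \quad A^{[2]} = g^{[2]}(Z^{[2]})$$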



Activation functions

sigmoid: $a = \dfrac{1}{1 + e^{-z}}$, $\quad a' = a(1 - a)$

tanh: $a = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$, $\quad a' = 1 - a^{2}$

ReLU: $a = \max(0, z)$, $\quad a' = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z \ge 0 \end{cases}$

leaky ReLU: $a = \max(0.01z, z)$, $\quad a' = \begin{cases} 0.01 & \text{if } z < 0 \\ 1 & \text{if } z \ge 0 \end{cases}$
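A minimal numpy sketch of these activations and their derivatives (function names are my own; the z = 0 case follows the convention in the list above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    a = sigmoid(z)
    return a * (1 - a)                    # a' = a(1 - a)

def tanh_prime(z):
    return 1 - np.tanh(z) ** 2            # a' = 1 - a^2

def relu(z):
    return np.maximum(0, z)

def relu_prime(z):
    return (z >= 0).astype(float)         # 0 if z < 0, 1 if z >= 0

def leaky_relu(z):
    return np.maximum(0.01 * z, z)

def leaky_relu_prime(z):
    return np.where(z < 0, 0.01, 1.0)     # 0.01 if z < 0, 1 if z >= 0
```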



Why do you need non-linear activation functions?

Suppose

$$z^{[1]} = W^{[1]} x + b^{[1]}, \quad a^{[1]} = g^{[1]}(z^{[1]}) = z^{[1]}$$
$$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}, \quad a^{[2]} = g^{[2]}(z^{[2]}) = z^{[2]}$$

Then

$$a^{[1]} = z^{[1]} = W^{[1]} x + b^{[1]}$$
$$a^{[2]} = z^{[2]} = W^{[2]} a^{[1]} + b^{[2]} = W^{[2]}(W^{[1]} x + b^{[1]}) + b^{[2]} = (W^{[2]} W^{[1]}) x + (W^{[2]} b^{[1]} + b^{[2]})$$

It is similar to

$$a^{[2]} = W' x + b'$$

If you use linear activation functions (sometimes called identity activation functions), then the network simply outputs a linear function of the input. We will talk later about deep networks with many hidden layers, and it turns out that if you use a linear activation function, or equivalently no activation function at all, then no matter how many layers your network has, all it is ever computing is a linear function of the input.

Gradient Descent for Neural Networks

Backpropagation

$$dZ^{[2]} = A^{[2]} - Y \quad \text{(sigmoid output unit with cross-entropy loss)}$$
$$dW^{[2]} = \frac{1}{m}\, dZ^{[2]} A^{[1]T}$$
$$db^{[2]} = \frac{1}{m}\, \texttt{np.sum}(dZ^{[2]}, \text{axis}=1, \text{keepdims}=\text{True})$$
$$dZ^{[1]} = W^{[2]T} dZ^{[2]} \circ g^{[1]\prime}(Z^{[1]})$$
$$dW^{[1]} = \frac{1}{m}\, dZ^{[1]} X^{T}$$
$$db^{[1]} = \frac{1}{m}\, \texttt{np.sum}(dZ^{[1]}, \text{axis}=1, \text{keepdims}=\text{True})$$
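A hedged numpy sketch of these formulas for a two-layer network with a tanh hidden layer and a sigmoid output unit (the cache/parameter dictionary layout is an assumption):

```python
import numpy as np

def backward_propagation(X, Y, cache, parameters):
    # X: (n_x, m), Y: (1, m); cache holds the forward activations A1, A2
    m = X.shape[1]
    A1, A2 = cache["A1"], cache["A2"]
    W2 = parameters["W2"]

    dZ2 = A2 - Y                                     # sigmoid output + cross-entropy loss
    dW2 = np.dot(dZ2, A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = np.dot(W2.T, dZ2) * (1 - A1 ** 2)          # g^[1]'(Z1) = 1 - A1^2 for tanh
    dW1 = np.dot(dZ1, X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m

    return {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}
```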

Random Initialization

If the weights are initialized to zero, all hidden units in a layer compute the same function and receive the same gradient, so they update symmetrically. In that case, no matter how many units a hidden layer has, it effectively behaves as if it had only one. Initializing the weights to small random values breaks this symmetry.
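A minimal sketch of random initialization for a network with one hidden layer (the layer-size names n_x, n_h, n_y and the 0.01 scale are assumptions):

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y):
    W1 = np.random.randn(n_h, n_x) * 0.01   # small random values break the symmetry
    b1 = np.zeros((n_h, 1))                 # biases can safely start at zero
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}
```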

Week4

Building Blocks of Deep Neural Networks



Propagation

Forward Propagation for Layer l

Input

$$a^{[l-1]}$$

Cache

$$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$$

Output

$$a^{[l]} = g^{[l]}(z^{[l]})$$

Vectorized

Input

$$A^{[l-1]}$$

Cache

$$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$$

Output

$$A^{[l]} = g^{[l]}(Z^{[l]})$$

Backward Propagation for Layer l

Input

$$da^{[l]}$$

Local

$$dz^{[l]} = da^{[l]} \circ g^{[l]\prime}(z^{[l]})$$

Output

$$dW^{[l]} = dz^{[l]}\, a^{[l-1]T}$$
$$db^{[l]} = dz^{[l]}$$
$$da^{[l-1]} = W^{[l]T}\, dz^{[l]}$$

Vectorized

Input

$$dA^{[l]}$$

Local

$$dZ^{[l]} = dA^{[l]} \circ g^{[l]\prime}(Z^{[l]})$$

Output

$$dW^{[l]} = \frac{1}{m}\, dZ^{[l]} A^{[l-1]T}$$
$$db^{[l]} = \frac{1}{m}\, \texttt{np.sum}(dZ^{[l]}, \text{axis}=1, \text{keepdims}=\text{True})$$
$$dA^{[l-1]} = W^{[l]T}\, dZ^{[l]}$$
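A hedged numpy sketch of these per-layer building blocks (the cache layout and the choice of ReLU/sigmoid for g^[l] are assumptions):

```python
import numpy as np

def linear_activation_forward(A_prev, W, b, activation):
    # Forward step for layer l; the cache is kept for the backward pass
    Z = np.dot(W, A_prev) + b
    if activation == "relu":
        A = np.maximum(0, Z)
    else:                                       # sigmoid
        A = 1.0 / (1.0 + np.exp(-Z))
    return A, (A_prev, W, Z)

def linear_activation_backward(dA, cache, activation):
    # Backward step for layer l: dA^[l] -> dA^[l-1], dW^[l], db^[l]
    A_prev, W, Z = cache
    m = A_prev.shape[1]
    if activation == "relu":
        dZ = dA * (Z >= 0)                      # dZ = dA ∘ g'(Z)
    else:                                       # sigmoid
        s = 1.0 / (1.0 + np.exp(-Z))
        dZ = dA * s * (1 - s)
    dW = np.dot(dZ, A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = np.dot(W.T, dZ)
    return dA_prev, dW, db
```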

Parameters vs Hyperparameters

Parameters

$$W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}, \ldots$$

Hyperparameters

Hyperparameters are the settings that control how the parameters W and b end up being learned:

learning rate $\alpha$

number of iterations

number of hidden layers $L$

number of hidden units $n^{[1]}, n^{[2]}, \ldots$

choice of activation function

momentum term

mini-batch size

various forms of regularization parameters