
Neural networks and Deep Learning

2015-10-27
The main thing that changes when we use a different activation function is the particular values of the partial derivatives.



The smoothness of the sigmoid function means that small changes Δwj in the weights and Δb in the bias will produce a small change Δoutput in the output from the neuron.

Δoutput is a linear function of the changes Δwj and Δb in the weights and bias: Δoutput ≈ Σj (∂output/∂wj) Δwj + (∂output/∂b) Δb. This linearity makes it easy to choose small changes in the weights and biases to achieve any desired small change in the output.

However, there are other models of artificial neural networks in which feedback loops are possible. These models are called recurrent neural networks. The idea in these models is to have neurons which fire for some limited duration of time, before becoming quiescent. That firing can stimulate other neurons, which may fire a little while later, also for a limited duration. That causes still more neurons to fire, and so over time we get a cascade of neurons firing. Loops don't cause problems in such a model, since a neuron's output only affects its input at some later time, not instantaneously.

Why do we introduce the quadratic cost? After all, aren't we primarily interested in the number of images correctly classified by the network? Why not try to maximize that number directly, rather than minimizing a proxy measure like the quadratic cost? The problem with that is that the number of images correctly classified is not a smooth function of the weights and biases in the network. For the most part, making small changes to the weights and biases won't cause any change at all in the number of training images classified correctly. That makes it difficult to figure out how to change the weights and biases to get improved performance. If we instead use a smooth cost function like the quadratic cost, it turns out to be easy to figure out how to make small changes in the weights and biases so as to get an improvement in the cost.
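As an illustration (a minimal NumPy sketch assuming one-hot labels, not code from the book), the quadratic cost varies smoothly with the network's outputs, while classification accuracy is a step function that usually doesn't move at all under small changes to the weights and biases:

    import numpy as np

    def quadratic_cost(a, y):
        # Quadratic cost over a batch: a, y have shape (n_samples, n_outputs).
        n = a.shape[0]
        return np.sum((a - y) ** 2) / (2 * n)

    def accuracy(a, y):
        # Fraction of correct argmax predictions (assumes one-hot labels y).
        # Small changes in the weights usually leave this value unchanged,
        # which is why we minimize the smooth cost instead of maximizing accuracy.
        return np.mean(np.argmax(a, axis=1) == np.argmax(y, axis=1))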





Improving the way neural networks learn:

A better choice of the cost function, known as the cross-entropy cost function; four so-called "regularization" methods (L1 and L2 regularization, dropout, and artificial expansion of the training data), which make our networks better at generalizing beyond the training data; a better method for initializing the weights in the network; and a set of heuristics to help choose good hyper-parameters for the network.

How can we address the learning slowdown? It turns out that we can solve the problem by replacing the quadratic cost with a different cost function, known as the cross-entropy.
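For a sigmoid output layer the cross-entropy cost is C = -(1/n) Σx [y ln(a) + (1 - y) ln(1 - a)]. A minimal NumPy sketch (not the book's code):

    import numpy as np

    def cross_entropy_cost(a, y, eps=1e-12):
        # a, y: arrays of shape (n_samples, n_outputs); eps guards against log(0).
        a = np.clip(a, eps, 1 - eps)
        n = a.shape[0]
        return -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a)) / n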


 

The cross-entropy is positive, and tends toward zero as the neuron gets better at computing the desired output, y, for all training inputs, x. These are both properties we'd intuitively expect for a cost function.

But the cross-entropy cost function has the benefit that, unlike the quadratic cost, it avoids the problem of learning slowing down. To see this, let's compute the partial derivative of the cross-entropy cost with respect to the weights.



It tells us that the rate at which the weights learn is controlled by sigmoid(z) - y, that is, by the error in the output: ∂C/∂wj = (1/n) Σx xj (sigmoid(z) - y).

The larger the error, the faster the neuron will learn. This is just what we'd intuitively expect. In particular, it avoids the learning slowdown caused by the sigmoid'(z) term in the analogous equation for the quadratic cost. 
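A minimal single-neuron sketch of the two gradients (an illustrative setup with scalar input x, weight w, and bias b, not the book's code); the quadratic cost carries an extra sigmoid'(z) factor, which the cross-entropy cancels:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime(z):
        return sigmoid(z) * (1 - sigmoid(z))

    def grad_quadratic(x, w, b, y):
        # dC/dw for C = (a - y)^2 / 2 with a = sigmoid(w*x + b):
        # the sigmoid'(z) factor is what causes the learning slowdown.
        z = w * x + b
        return (sigmoid(z) - y) * sigmoid_prime(z) * x

    def grad_cross_entropy(x, w, b, y):
        # dC/dw for the cross-entropy: the sigmoid'(z) factor cancels,
        # so learning is driven directly by the error sigmoid(z) - y.
        z = w * x + b
        return (sigmoid(z) - y) * x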


 

The cross-entropy is nearly always the better choice, provided the output neurons are sigmoid neurons.



Softmax:

The idea of softmax is to define a new type of output layer for our neural networks.
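Concretely, the softmax output activations are a_j = exp(z_j) / sum_k exp(z_k). A minimal NumPy sketch (not the book's code):

    import numpy as np

    def softmax(z):
        # Subtracting max(z) before exponentiating is a standard numerical-stability trick.
        e = np.exp(z - np.max(z))
        return e / np.sum(e)

    print(softmax(np.array([2.0, 1.0, 0.1])))  # three positive numbers that sum to 1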



the output activations are guaranteed to always sum up to 1.

In other words, the output from the softmax layer can be thought of as a probability distribution.

The fact that a softmax layer outputs a probability distribution is rather pleasing. In many problems it's convenient to be able to interpret the output activation aj as the network's estimate of the probability that the correct output is j.

And so with a sigmoid output layer we don't have such a simple interpretation of the output activations.

A nice thing about sigmoid layers is that the output aj is a function of only the corresponding weighted input, aj = sigmoid(zj). Explain why this is not the case for a softmax layer: any particular output activation aj depends on all the weighted inputs.

How does a softmax layer let us address the learning slowdown problem? To understand that, let's define the log-likelihood cost function.
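For a single training input with correct class y, the log-likelihood cost is C = -ln(a_y), where a_y is the softmax activation the network assigns to the correct class. A minimal sketch (not the book's code):

    import numpy as np

    def log_likelihood_cost(a, y_index):
        # a: softmax output activations (they sum to 1); y_index: index of the true class.
        # The cost is near zero when the network assigns high probability to the truth,
        # and large when it assigns low probability.
        return -np.log(a[y_index])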



In fact, it's useful to think of a softmax output layer with log-likelihood cost as being quite similar to a sigmoid output layer with cross-entropy cost.

There's enough freedom in the model that it can describe almost any data set of the given size, without capturing any genuine insights into the underlying phenomenon. When that happens the model will work well for the existing data, but will fail to generalize to new situations.





Finally, at around epoch 280 the classification accuracy pretty much stops improving. Later epochs merely see small stochastic fluctuations near the value of the accuracy at epoch 280. Contrast this with the earlier graph, where the cost associated to the training data continues to smoothly drop. If we just look at that cost, it appears that our model is still getting "better". But the test accuracy results show the improvement is an illusion.

We say the network is overfitting or overtraining beyond epoch 280.



1,000 training examples, 400 epochs.

We can see that the cost on the test data improves until around epoch 15, but after that it actually starts to get worse, even though the cost on the training data is continuing to get better. This is another sign that our model is overfitting.

The obvious way to detect overfitting is to keep track of accuracy on the test data as our network trains. If we see that the accuracy on the test data is no longer improving, then we should stop training.
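A minimal sketch of that strategy (the network, train_one_epoch, and evaluate arguments are hypothetical placeholders supplied by the caller, not the book's API):

    def train_with_early_stopping(network, training_data, held_out_data,
                                  train_one_epoch, evaluate,
                                  max_epochs=400, patience=10):
        # Stop once accuracy on held-out data has not improved for `patience` epochs.
        best_acc, epochs_since_best = 0.0, 0
        for epoch in range(max_epochs):
            train_one_epoch(network, training_data)
            acc = evaluate(network, held_out_data)
            if acc > best_acc:
                best_acc, epochs_since_best = acc, 0
            else:
                epochs_since_best += 1
            if epochs_since_best >= patience:
                break
        return network, best_acc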



5,000 training examples, 30 epochs.

As you can see, the accuracy on the test and training data remain much closer together than when we were using 1,000 training examples.

Overfitting is still going on, but it's been greatly reduced. Our network is generalizing much better from the training data to the test data.

In general, one of the best ways of reducing overfitting is to increase the size of the training data.

With enough training data it is difficult for even a very large network to overfit. Unfortunately, training data can be expensive or difficult to acquire, so this is not always a practical option.

Increasing the amount of training data is one way of reducing overfitting. Another possible approach is to reduce the size of our network. However, large networks have the potential to be more powerful than small networks, and so this is an option we'd only adopt reluctantly.





Intuitively, the effect of regularization is to make it so the network prefers to learn small weights.

Now, it's really not at all obvious why making this kind of compromise should help reduce overfitting.

We first need to figure out how to apply our stochastic gradient descent learning algorithm in a regularized neural network.
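For L2 regularization the learning rule for the weights becomes w -> (1 - ηλ/n)w - η ∂C0/∂w, where C0 is the unregularized cost and n is the size of the training set; the rule for the biases is unchanged. A minimal sketch of one update (not the book's code):

    import numpy as np

    def sgd_step_l2(w, b, grad_w, grad_b, eta, lmbda, n):
        # grad_w, grad_b: gradients of the unregularized cost, averaged over the mini-batch.
        # n is the total number of training examples; the biases are not regularized.
        w = (1 - eta * lmbda / n) * w - eta * grad_w
        b = b - eta * grad_b
        return w, b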

I've described regularization as a way to reduce overfitting and to increase classification accuracies.

Empirically, when doing multiple runs of our MNIST networks, but with different (random) weight initializations, I've found that the unregularized runs will occasionally get "stuck", apparently caught in local minima of the cost function.

The smallness of the weights means that the behaviour of the network won't change too much if we change a few random inputs here and there. That makes it difficult for a regularized network to learn the effects of local noise in the data.

Instead, a regularized network learns to respond to types of evidence which are seen often across the training set. By contrast a network with large weights may change its behaviour quite a bit in response to small changes in
the input.

L2 regularization doesn't constrain the biases. We don't usually include bias terms when regularizing.

Three other approaches to reducing overfitting: L1 regularization, dropout, and artificially increasing the training set size.





In L1 regularization, the weights shrink by a constant amount toward 0. In L2 regularization, the weights shrink by an amount which is proportional to w. And so when a particular weight has a large magnitude, |w|, L1 regularization shrinks the weight much less than L2 regularization does. By contrast, when |w| is small, L1 regularization shrinks the weight much more than L2 regularization. The net result is that L1 regularization tends to concentrate the weight of the network in a relatively small number of high-importance connections, while the other weights are driven towards zero.
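A minimal sketch of the two update rules side by side (not the book's code), making the constant-versus-proportional shrinkage explicit:

    import numpy as np

    def l1_step(w, grad_w, eta, lmbda, n):
        # L1: w -> w - (eta*lmbda/n)*sgn(w) - eta*grad_w; constant shrinkage toward 0.
        return w - (eta * lmbda / n) * np.sign(w) - eta * grad_w

    def l2_step(w, grad_w, eta, lmbda, n):
        # L2: w -> (1 - eta*lmbda/n)*w - eta*grad_w; shrinkage proportional to w.
        return (1 - eta * lmbda / n) * w - eta * grad_w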

Dropout:

Ordinarily, we'd train by forward-propagating x through the network, and then backpropagating to determine the contribution to the gradient. With dropout, this process is modified. We start by randomly (and temporarily) deleting half the hidden neurons in the network, while leaving the input and output neurons untouched.

We forward-propagate the input x through the modified network, and then backpropagate the result, also through the modified network. After doing this over a mini-batch of examples, we update the appropriate weights and biases.
We then repeat the process, first restoring the dropout neurons, then choosing a new random subset of hidden neurons to delete, estimating the gradient for a different mini-batch, and updating the weights and biases in the network.

When we actually run the full network that means that twice as many hidden neurons will be active. To compensate for that, we halve the weights outgoing from the hidden neurons.
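A minimal sketch of dropout applied to one layer's activations (an assumed NumPy setup, not the book's code). Zeroing an activation plays the role of temporarily deleting that neuron; scaling by 1 - p_drop at test time is equivalent to halving the outgoing weights when p_drop = 0.5:

    import numpy as np

    def dropout(a, p_drop=0.5, training=True, rng=None):
        # a: activations of a hidden layer.
        rng = rng or np.random.default_rng()
        if training:
            mask = rng.random(a.shape) >= p_drop   # keep each neuron with probability 1 - p_drop
            return a * mask
        return a * (1 - p_drop)                    # all neurons active: rescale to compensate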

This kind of averaging scheme is often found to be a powerful (though expensive) way of reducing overfitting. The reason is that the different networks may overfit in different ways, and averaging may help eliminate that kind
of overfitting.

We can expand our training data by making many small rotations of all the MNIST training images, and then using the expanded training data to improve our network's performance.

 Expanding the training data, using not just rotations, but also translating and skewing the images.

They also experimented with what they called "elastic distortions", a special type of image distortion intended to emulate the random oscillations found in hand muscles.

Weight initialization:

If the weights in later hidden layers are initialized using normalized Gaussians, then the activations will tend to be very close to 0 or 1, and learning will proceed very slowly.

Suppose we have a neuron with Nin input weights. Then we shall initialize those weights as Gaussian random variables with mean 0 and standard deviation 1/sqrt(Nin).

We will continue to choose the bias as a Gaussian with mean 0 and standard deviation 1.
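A minimal sketch of this initialization scheme for a list of layer sizes such as [784, 30, 10] (an illustrative NumPy version, not the book's code):

    import numpy as np

    def init_params(sizes, rng=None):
        # Weights: Gaussian, mean 0, std 1/sqrt(n_in); biases: Gaussian, mean 0, std 1.
        rng = rng or np.random.default_rng()
        weights = [rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in))
                   for n_in, n_out in zip(sizes[:-1], sizes[1:])]
        biases = [rng.normal(0.0, 1.0, size=(n_out, 1)) for n_out in sizes[1:]]
        return weights, biases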

To understand the reason for the oscillations, recall that stochastic gradient descent is supposed to step us gradually down into a valley of the cost function.

A more complete explanation is as follows: gradient descent uses a first-order approximation to the cost function as a guide to how to decrease the cost. For large learning rates, higher-order terms in the cost function become more important, and may dominate the behaviour, causing gradient descent to break down. This is especially likely as we approach minima and quasi-minima of the cost function, since near such points the gradient becomes small, making it easier for higher-order terms to dominate behaviour.

You can optionally refine your estimate, to pick out the largest value of the learning rate for which the cost decreases during the first few epochs (there's no need for this to be super-accurate).

Variations on stochastic gradient descent:

Hessian technique:





This approach to minimizing a cost function is known as the Hessian technique or Hessian optimization.

There are theoretical and empirical results showing that Hessian methods converge on a minimum in fewer steps than standard gradient descent. In particular, by incorporating information about second-order changes in the cost function, it's possible for the Hessian approach to avoid many pathologies that occur in gradient descent.
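The idea is to minimize the second-order Taylor approximation ΔC ≈ ∇C·Δw + (1/2) Δw^T H Δw, which gives the step Δw = -H^{-1} ∇C. A minimal sketch (not the book's code):

    import numpy as np

    def hessian_step(w, grad, hessian, eta=1.0):
        # Solve H * dw = grad rather than forming H^{-1} explicitly;
        # eta is an optional learning rate used to damp the step in practice.
        return w - eta * np.linalg.solve(hessian, grad)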

Momentum-based gradient descent:

Intuitively, the advantage Hessian optimization has is that it incorporates not just information about the gradient, but also information about how the gradient is changing.

The momentum technique modifies gradient descent in two ways that make it more similar to the physical picture.

First, it introduces a notion of "velocity" for the parameters we're trying to optimize. The gradient acts to change the velocity, not (directly) the "position", in much the same way as physical forces change the velocity, and only indirectly affect position. Second, the momentum method introduces a kind of friction term, which tends to gradually reduce the velocity.



When μ = 1, as we've seen, there is no friction, and the velocity is completely driven by the gradient. By contrast, when μ = 0, there's a lot of friction, the velocity can't build up, and the equations reduce to the usual equations for gradient descent.
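The momentum update rules are v -> μv - η∇C followed by w -> w + v. A minimal sketch (not the book's code):

    import numpy as np

    def momentum_step(w, v, grad, eta, mu):
        # mu in [0, 1] controls the friction: mu = 0 is plain gradient descent,
        # mu = 1 means no friction and the velocity is driven entirely by the gradient.
        v = mu * v - eta * grad
        w = w + v
        return w, v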

Other models of artificial neuron:

In principle, a network built from sigmoid neurons can compute any function. In practice, however, networks built using other model neurons sometimes outperform sigmoid networks.

tanh(w*x+b) (hyperbolic tangent function)

It turns out that this is very closely related to the sigmoid neuron.
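Concretely, the two activation functions are related by sigmoid(z) = (1 + tanh(z/2)) / 2, so tanh is just a rescaled and shifted version of the sigmoid, with outputs in the range -1 to 1. A quick NumPy check (not from the book):

    import numpy as np

    z = np.linspace(-5.0, 5.0, 101)
    sigmoid = 1.0 / (1.0 + np.exp(-z))
    # Verify the identity sigmoid(z) = (1 + tanh(z/2)) / 2 numerically.
    assert np.allclose(sigmoid, (1 + np.tanh(z / 2)) / 2)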



Similar to sigmoid neurons, a network of tanh neurons can, in principle, compute any function mapping to the range -1 to 1.

Rectified linear neuron or rectified linear unit:

max(0, w*x+b)



Like the sigmoid and tanh neurons, rectified linear units can be used to compute any function, and they can be trained using ideas such as backpropagation and stochastic gradient descent.

When we look closely, we'll discover that the different layers in our deep network are learning at vastly different speeds.

In particular, when later layers in the network are learning well, early layers often get stuck during training, learning almost nothing at all. We'll discover there are fundamental reasons the learning slowdown occurs, connected to our use of gradient-based learning techniques.

As we delve into the problem more deeply, we'll learn that the opposite phenomenon can also occur: the early layers may be learning well, but later layers can become stuck. In fact, we'll find that there's an intrinsic instability associated to learning by gradient descent in deep, many-layer neural networks. This instability tends to result in either the early or the later layers getting stuck during training.

In at least some deep neural networks, the gradient tends to get smaller as we move backward through the hidden layers. This means that neurons in the earlier layers learn much more slowly than neurons in later layers. The phenomenon
is known as the vanishing gradient problem.

In fact, it is not inevitable, although the alternative is not very attractive, either: sometimes the gradient gets much larger in earlier layers! This is the exploding gradient problem. More generally, it turns out that the
gradient in deep neural networks is unstable, tending to either explode or vanish in earlier layers.

The unstable gradient problem: the fundamental problem here isn't so much the vanishing gradient problem or the exploding gradient problem. It's that the gradient in early layers is the product of terms from all the later layers.

Deep learning:

We'll explore many powerful techniques: convolutions, pooling, the use of GPUs to do far more training than we did with our shallow networks, the algorithmic expansion of our training data (to reduce overfitting), the use of the dropout technique (also to reduce overfitting), and the use of ensembles of networks.

Introducing convolutional networks:

Fully-connected layers do not take into account the spatial structure of the images. For instance, they treat input pixels which are far apart and close together on exactly the same footing. Such concepts of spatial structure must instead be inferred from the training data.

convolutional neural networks:

These networks use a special architecture which is particularly well-adapted to classifying images. Using this architecture makes convolutional networks fast to train.

Convolutional neural networks use three basic ideas: local receptive fields, shared weights, and pooling.

Local receptive fields: In the fully-connected layers shown earlier, the inputs were depicted as a vertical line of neurons. In a convolutional net, it'll help to think instead of the input as a 28*28 square of neurons, whose values correspond to the 28*28 pixel intensities.



That region in the input image is called the local receptive field for the hidden neuron. It's a little window on the input pixels. Each connection learns a weight. And the hidden neuron learns an overall bias as well. You can
think of that particular hidden neuron as learning to analyze its particular local receptive field.

We then slide the local receptive field across the entire input image. For each local receptive field, there is a different hidden neuron in the first hidden layer.



Then we slide the local receptive field over by one pixel to the right, to connect to a second hidden neuron:



And so on, building up the first hidden layer. Note that if we have a 28*28 input image and 5*5 local receptive fields, then there will be 24*24 neurons in the hidden layer, since the receptive field can be placed in 28 - 5 + 1 = 24 positions across (and down) the image.

In fact, sometimes a different stride length is used. For instance, we might move the local receptive field 2 pixels to the right(or down).

I've said that each hidden neuron has a bias and 5*5 weights connected to its local receptive field. What I did not yet mention is that we're going to use the same weights and bias for each of the 24*24 hidden neurons. In other words, for the j,kth hidden neuron, the output is sigmoid(b + Σ(l=0..4) Σ(m=0..4) w[l,m] * a[j+l, k+m]), where b is the shared bias, w[l,m] is the 5*5 array of shared weights, and a[x,y] is the input activation at position x,y.



This means that all the neurons in the first hidden layer detect exactly the same feature, just at different locations in the input image. To see why this makes sense, suppose the weights and bias are such that the hidden neuron
can pick out, say, a vertical edge in a particular local receptive field. 
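A minimal NumPy sketch of computing one such feature map with a shared 5*5 kernel and bias (an illustrative implementation, not the book's code):

    import numpy as np

    def conv_feature_map(image, kernel, bias):
        # image: 28x28 input; kernel: 5x5 shared weights; bias: shared scalar.
        # Returns a 24x24 map of sigmoid activations, one per local receptive field.
        kh, kw = kernel.shape
        h = image.shape[0] - kh + 1   # 28 - 5 + 1 = 24
        w = image.shape[1] - kw + 1
        out = np.empty((h, w))
        for j in range(h):
            for k in range(w):
                z = bias + np.sum(kernel * image[j:j+kh, k:k+kw])
                out[j, k] = 1.0 / (1.0 + np.exp(-z))
        return out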

For this reason, we sometimes call the map from the input layer to the hidden layer a feature map. We call the weights defining the feature map the shared weights. The shared weights and bias are often said to define a kernel or filter.



The 20 images correspond to 20 different feature maps.

 It is clear there is spatial structure here beyond what we'd expect at random: many of the features have clear sub-regions of light and dark. That shows our network really is learning things related to the spatial structure.

A big advantage of sharing weights and biases is that it greatly reduces the number of parameters involved in a convolutional network. Intuitively, it seems likely that the use of translation invariance by the convolutional layer
will reduce the number of parameters it needs to get the same performance as the fully-connected model.

Pooling layers:

convolutional neural networks also contain pooling layers. Pooling layers are usually used immediately after convolutional layers. What the pooling layers do is simplify the information in the output from the convolutional layer.

In detail, a pooling layer takes each feature map output from the convolutional layer and prepares a condensed feature map. For instance, each unit in the pooling layer may summarize a region of (say) 2*2 neurons in the previous
layer. As a concrete example, one common procedure for pooling is known as max-pooling. In max-pooling, a pooling unit simply outputs the maximum activation in the 2*2 input region.
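A minimal NumPy sketch of 2*2 max-pooling (not the book's code):

    import numpy as np

    def max_pool_2x2(feature_map):
        # Condense e.g. a 24x24 feature map to 12x12 by taking the maximum
        # activation in each non-overlapping 2x2 block.
        h, w = feature_map.shape
        blocks = feature_map[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
        return blocks.max(axis=(1, 3))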

As mentioned above, the convolutional layer usually involves more than a single feature map. We apply max-pooling to each feature map separately. So if there were, say, three feature maps, the combined convolutional and max-pooling layers would look like:



We can think of max-pooling as a way for the network to ask whether a given feature is found anywhere in a region of the image. It then throws away the exact positional information. The intuition is that once a feature has been
found, its exact location isn't as important as its rough location relative to other features.

Another common approach is known as L2 pooling. Here, instead of taking the maximum activation of a 2*2 region of neurons, we take the square root of the sum of the squares of the activations in the 2*2 region.

The final layer of connections in the network is a fully-connected layer. That is, this layer connects every neuron from the max-pooled layer to every one of 10 output neurons.

What does it even mean to apply a second convolutional-pooling layer? In fact, you can think of the second convolutional-pooling layer as having as input 12*12 "images", whose "pixels" represent the presence (or absence) of particular localized features in the original input image. So you can think of this layer as having as input a version of the original input image. That version is abstracted and condensed, but still has a lot of spatial structure, and so it makes sense to use a second convolutional-pooling layer.

More informally: the feature detectors in the second convolutional-pooling layer have access to all the features from the previous layer, but only within their particular local receptive field.

Using the tanh activation function:

tanh networks train a little faster, but the final accuracies are very similar. Can you explain why the tanh network might train faster? Can you get a similar training speed with the sigmoid, perhaps by changing the learning rate?

Using rectified linear units:

What makes the rectified linear activation function better than the sigmoid or tanh function? In an ideal world we'd have a theory telling us which activation function to pick for which application. But at present we're a long
way from such a world.

Expanding the training data:

Another way we may hope to improve our results is by algorithmically expanding the training data. A simple way of expanding the training data is to displace each training image by a single pixel, either up one pixel, down one pixel, left one pixel, or right one pixel.
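A minimal sketch of that single-pixel expansion for 28*28 images represented as NumPy arrays (an illustrative version, not the book's expansion script):

    import numpy as np

    def shift_image(image, dy, dx):
        # Displace a 28x28 image by (dy, dx) pixels, padding the exposed edge with zeros.
        shifted = np.zeros_like(image)
        src = image[max(0, -dy):28 - max(0, dy), max(0, -dx):28 - max(0, dx)]
        shifted[max(0, dy):28 - max(0, -dy), max(0, dx):28 - max(0, -dx)] = src
        return shifted

    def expand(images):
        # Each image plus its four one-pixel translations: five times as much data.
        shifts = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
        return [shift_image(img, dy, dx) for img in images for (dy, dx) in shifts]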

In fact, expanding the data turned out to considerably reduce the effect of overfitting.

elastic distortion: a way of emulating the random oscillations hand muscles undergo when a person is writing.

The idea of convolutional layers is to behave in an invariant way across images. It may seem surprising, then, that our network can learn more when all we've done is translate the input data.

Recall that the basic idea of dropout is to remove individual activations at random while training the network. This makes the model more robust to the loss of individual pieces of evidence, and thus less likely to rely on particular idiosyncrasies of the training data. Dropout reduced overfitting, and so we learned faster.

Using an ensemble of networks:

An easy way to improve performance still further is to create several neural networks, and then get them to vote to determine the best classification.

Why we only applied dropout to the fully-connected layers:

In principle we could apply a similar procedure to the convolutional layers. But, in fact, there's no need:

the convolutional layers have considerable inbuilt resistance to overfitting. The reason is that the shared weights mean that convolutional filters are forced to learn from across the entire image. This makes them less likely to pick up on local idiosyncrasies in the training data. And so there is less need to apply other regularizers, such as dropout.

Why are we able to train?

In particular, we saw that the gradient tends to be quite unstable: as we move from the output layer to earlier layers the gradient tends to vanish (the vanishing gradient problem) or explode (the exploding gradient problem). How have we avoided those results?

1. Using convolutional layers greatly reduces the number of parameters in those layers, making the learning problem much easier.

2. Using more powerful regularization techniques(notably dropout and convolutional layers) to reduce overfitting, which is otherwise more of a problem in more complex networks.

3. Using rectified linear units instead of sigmoid neurons, to speed up training (empirically, often by a factor of 3-5).

You'll run into many ideas we haven't discussed: recurrent neural networks, Boltzmann machines, generative models, transfer learning, reinforcement learning, and so on...

Recurrent neural networks(RNNs):

In the feedforward nets we've been using there is a single input which completely determines the activations of all the neurons through the remaining layers. Suppose we allow the elements in the networks to keep changing in a
dynamic way. For instance, the behaviour of hidden neurons might not just be determined by the activations in previous hidden layers, but also by the activations at earlier times. Indeed, a neuron's activation might be determined in part by its own activation
at an earlier time. Or perhaps the activations of hidden and output neurons won't be determined just by the current input to the network, but also by earlier inputs.

Neural networks with this kind of time-varying behaviour are known as recurrent neural networks or RNNs.

The broad idea is that RNNs are neural networks in which there is some notion of dynamic change over time.

And, not surprisingly, they're particularly useful in analysing data or processes that change over time. Such data and processes arise naturally in problems such as speech or natural language, for example.

Long short-term memory units(LSTMs):

The gradient gets smaller and smaller as it is propagated back through layers. This makes learning in early layers extremely slow. The problem actually gets worse in RNNs, since gradients aren't just propagated backward through layers, they're propagated backward through time.