Neural Networks for Machine Learning: Lecture 14 Quiz
2016-02-18 09:44
Question 1
Why is a Deep Belief Network not a Boltzmann Machine?
- All edges in a DBN are directed.
- Some edges in a DBN are directed.
- A DBN is not a probabilistic model of the data.
- A DBN does not have hidden units.
Question 2
Brian looked at the direction of the arrows in a DBN and was surprised to find that the data is at the "output". "Where is the input?!", he exclaimed. "How will I give input to this model and get all those cool features?" In this context, which of the following statements are true? Check all that apply.
- In order to get features h given some data v, he must perform inference to find out P(h|v). There is an easy exact way of doing this: just traverse the arrows in the opposite direction.
- A DBN is a generative model of the data, which means that its arrows define a way of generating data from a probability distribution, so there is no "input".
- A DBN is a generative model of the data and cannot be used to generate features for any given input. It can only be used to get features for data that was generated by the model.
- In order to get features h given some data v, he must perform inference to find out P(h|v). There is an easy approximate way of doing this: just traverse the arrows in the opposite direction.
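The phrase "traverse the arrows in the opposite direction" can be made concrete with a small sketch. It assumes the DBN was built by stacking RBMs, so that each layer's bottom-up weights are the transpose of its generative weights, and it treats the posterior of each layer as factorial (independent sigmoid units), which is an approximation rather than exact inference. The function name `upward_pass` and the toy weights are hypothetical, chosen only for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def upward_pass(v, layers):
    """Approximate P(h|v) one layer at a time with a single bottom-up pass.

    layers is a list of (W, a) pairs, where W[i][j] connects unit i in the
    layer below to unit j in the layer above and a[j] is the bias of unit j.
    Each layer's posterior is assumed to be factorial (an approximation).
    """
    activations = v
    for W, a in layers:
        activations = [
            sigmoid(a[j] + sum(activations[i] * W[i][j]
                               for i in range(len(activations))))
            for j in range(len(a))
        ]
    return activations


# Hypothetical toy network: 3 visible units -> 2 hidden units -> 1 hidden unit.
layers = [
    ([[0.5, -1.0], [1.5, 0.25], [-0.5, 0.75]], [0.0, 0.1]),
    ([[1.0], [-2.0]], [0.2]),
]
features = upward_pass([1.0, 0.0, 1.0], layers)
print(features)  # one probability per top-level hidden unit
```

The returned values are activation probabilities, not samples; sampling binary states at each layer would be an equally valid variant of the same pass.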
Question 3
Suppose you wanted to learn a neural net classifier. You have data and labels. All you care about is predicting the labels accurately for a test set. How can pretraining help in getting better accuracy, even though it does not use any information about the labels?
- Pretraining will learn exactly the same features that a simple neural net would learn because, after all, they are training on the same data set. But pretraining does not use the labels and hence it can prevent overfitting.
- There is an assumption that pretraining will learn features that will be useful for discrimination and it would be difficult to learn these features using just the labels.
- The objective function used during pretraining is the same as the one used during fine-tuning. So pretraining provides more updates towards solving the same optimization problem.
- Pretraining will always learn features that will be useful for discrimination, no matter what the discriminative task is.
Question 4
Why does pretraining help more when the network is deep?
- During backpropagation in very deep nets, the lower-level layers get very small gradients, making it hard to learn good low-level features. Since pretraining starts those low-level features off at a good point, there is a big win.
- The backpropagation algorithm cannot give accurate gradients for very deep networks. So it is important to have good initializations, especially for the lower layers.
- As nets get deeper, the contrastive divergence objective used during pretraining gets closer to the classification objective.
- Deeper nets have more parameters than shallow ones and they overfit easily. Therefore, initializing them sensibly is important.
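How quickly gradients can shrink with depth is easy to check numerically. The sketch below is a deliberately simplified assumption, not a real network: a chain of scalar sigmoid units that all share one weight `w`. The helper name `input_gradient` and the default values are hypothetical. Because the derivative of the sigmoid is at most 0.25, each layer multiplies the gradient by at most `0.25 * w`.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def input_gradient(depth, w=1.0, x=0.5):
    """Derivative of the output of a chain of `depth` scalar sigmoid units
    with respect to the chain's input, computed with the chain rule."""
    grad = 1.0
    activation = x
    for _ in range(depth):
        activation = sigmoid(w * activation)
        # d sigmoid(w * u) / du = w * s * (1 - s), where s is the unit's output
        grad *= w * activation * (1.0 - activation)
    return grad


for depth in (1, 5, 10, 20):
    print(depth, input_gradient(depth))
```

With `w = 1.0` the printed gradient falls by several orders of magnitude between depth 1 and depth 20, which is the effect the first option describes for the lower layers.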
Question 5
The energy function for binary RBMs is

$$E(v,h) = -\sum_i v_i b_i - \sum_j h_j a_j - \sum_{i,j} v_i W_{ij} h_j$$

When modeling real-valued data (i.e., when $v$ is a real-valued vector, not a binary one), we change it to

$$E(v,h) = \sum_i \frac{(v_i - b_i)^2}{2\sigma_i^2} - \sum_j h_j a_j - \sum_{i,j} \frac{v_i}{\sigma_i} W_{ij} h_j$$

Why can't we still use the same old one?
- If we continue to use the same one, then in general there will be infinitely many $v$'s and $h$'s such that $E(v,h)$ is arbitrarily low (approaching $-\infty$). The probability distribution resulting from such an energy function is not useful for modeling real data.
- Probability distributions over real-valued data can only be modeled by having a conditional Gaussian distribution over them. So we have to use a quadratic term.
- If we use the old one, the real-valued vectors would end up being constrained to be binary.
- If the model assigns an energy $e_1$ to state $(v_1, h)$ and $e_2$ to state $(v_2, h)$, then it would assign energy $(e_1 + e_2)/2$ to state $((v_1 + v_2)/2, h)$. This does not make sense for the kind of distributions we usually want to model.
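The two energy functions can be compared directly by evaluating them on the same inputs. The sketch below implements both formulas from this question as written; the function names `binary_energy` and `gaussian_energy` and the toy parameter values are hypothetical, chosen only so the numbers are easy to follow. It scales a real-valued visible vector and prints how each energy responds.

```python
def binary_energy(v, h, b, a, W):
    """E(v,h) = -sum_i v_i b_i - sum_j h_j a_j - sum_ij v_i W_ij h_j."""
    visible = sum(vi * bi for vi, bi in zip(v, b))
    hidden = sum(hj * aj for hj, aj in zip(h, a))
    interaction = sum(v[i] * W[i][j] * h[j]
                      for i in range(len(v)) for j in range(len(h)))
    return -visible - hidden - interaction


def gaussian_energy(v, h, b, a, W, sigma):
    """E(v,h) = sum_i (v_i - b_i)^2 / (2 sigma_i^2) - sum_j h_j a_j
                - sum_ij (v_i / sigma_i) W_ij h_j."""
    quadratic = sum((vi - bi) ** 2 / (2.0 * si ** 2)
                    for vi, bi, si in zip(v, b, sigma))
    hidden = sum(hj * aj for hj, aj in zip(h, a))
    interaction = sum((v[i] / sigma[i]) * W[i][j] * h[j]
                      for i in range(len(v)) for j in range(len(h)))
    return quadratic - hidden - interaction


# Hypothetical toy parameters: 2 visible units, 1 hidden unit.
b, a, W, sigma = [1.0, 0.5], [0.0], [[1.0], [2.0]], [1.0, 1.0]
h = [1]
for scale in (1.0, 10.0, 100.0):
    v = [scale, scale]
    print(scale, binary_energy(v, h, b, a, W),
          gaussian_energy(v, h, b, a, W, sigma))
```

For these parameters the first energy is linear in the scale of $v$ and keeps decreasing, while the quadratic term in the second eventually dominates and pushes the energy back up.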