
Coursera | Andrew Ng (02-week3-3.2): Using an Appropriate Scale to Pick Hyperparameters

2018-01-21
This series only adds personal study notes and supplementary derivations to selected points from the original course. If you find errors, corrections are welcome. Having worked through Andrew Ng's course, I wrote it up as text for easier review. Since I am studying English, the series is primarily in English, and I suggest readers also work mainly in English, with Chinese as a supplement, as preparation for reading academic papers in this field later on. - ZJ

Coursera course | deeplearning.ai | NetEase Cloud Classroom (网易云课堂)

Please credit the author and source when reposting: ZJ, WeChat official account 「SelfImprovementLab」

Zhihu: https://zhuanlan.zhihu.com/c_147249273

CSDN: http://blog.csdn.net/junjun_zhao/article/details/79115798

3.2 Using an appropriate scale to pick hyperparameters

(Subtitle source: NetEase Cloud Classroom)



In the last video, you saw how sampling at random over the range of hyperparameters can allow you to search the space of hyperparameters more efficiently. But it turns out that sampling at random doesn't mean sampling uniformly at random over the range of valid values. Instead, it's important to pick the appropriate scale on which to explore the hyperparameters. In this video, I want to show you how to do that.

Let's say that you're trying to choose the number of hidden units, n^[l], for a given layer l, and you think a good range of values is somewhere from 50 to 100. In that case, picking values at random along the number line from 50 to 100 is a perfectly reasonable way to search over this particular hyperparameter. Or if you're trying to decide on the number of layers in your neural network, which we're calling capital L, and you think the total number of layers should be somewhere between 2 and 4, then sampling uniformly at random among 2, 3, and 4 is reasonable; even a grid search, where you explicitly evaluate the values 2, 3, and 4, would be reasonable. So these were a couple of examples where sampling uniformly at random over the range you're contemplating is a reasonable thing to do. But this is not true for all hyperparameters.
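As a minimal sketch (my own, not from the lecture), uniform sampling for these integer-valued hyperparameters could look like this, assuming NumPy:

```python
import numpy as np

# Hypothetical illustration: sample integer-valued hyperparameters uniformly.
# n_hidden: number of hidden units for a layer, anywhere in [50, 100].
n_hidden = np.random.randint(50, 101)   # upper bound is exclusive

# num_layers: number of layers L, uniformly from the small discrete set {2, 3, 4}.
num_layers = np.random.choice([2, 3, 4])

print(n_hidden, num_layers)
```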

Let's look at another example. Say you're searching for the hyperparameter alpha, the learning rate, and you suspect that 0.0001 might be on the low end, or maybe it could be as high as 1. Now if you draw the number line from 0.0001 to 1 and sample values uniformly at random over it, about 90% of the values you sample will be between 0.1 and 1. So you're using 90% of your resources to search between 0.1 and 1, and only 10% of your resources to search between 0.0001 and 0.1. That doesn't seem right.

Instead, it seems more reasonable to search for this hyperparameter on a log scale. Instead of a linear scale, you'd have 0.0001 here, and then 0.001, 0.01, 0.1, and 1, and you'd sample uniformly at random on this logarithmic scale. Now you have more resources dedicated to searching between 0.0001 and 0.001, between 0.001 and 0.01, and so on.
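A quick numerical sanity check of the 90% claim (my own sketch, not part of the lecture):

```python
import numpy as np

# Draw many samples uniformly at random on the linear scale [0.0001, 1]
# and measure what fraction lands in [0.1, 1].
samples = np.random.uniform(0.0001, 1, size=1_000_000)
frac_large = np.mean(samples >= 0.1)
print(f"fraction in [0.1, 1]: {frac_large:.3f}")  # ~0.900
```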

So in Python, the way you implement this is to set

r = -4 * np.random.rand()

and then set the randomly chosen value of alpha to 10^r. After the first line, r will be a random number between -4 and 0, so alpha will be between 10^-4 and 10^0; 10^-4 is the left end of the scale, and 1 = 10^0 is the right end.

In the more general case, suppose you're trying to sample between 10^a and 10^b on the log scale. In this example, the left end is 10^a, and you can figure out what a is by taking log base 10 of 0.0001, which tells you a = -4. The value on the right is 10^b, and you can figure out what b is by taking log base 10 of 1, which tells you b = 0. What you do then is sample r uniformly at random between a and b; in this case, r would be between -4 and 0. And you set alpha, your randomly sampled hyperparameter value, to 10^r.

So just to recap: to sample on the log scale, take the log of the low value to figure out a, take the log of the high value to figure out b, sample r uniformly at random between a and b, and set the hyperparameter to 10^r. That's how you implement sampling on a logarithmic scale.
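Putting that recap into code, a small helper along these lines could implement the general recipe; the function name and interface here are my own, not from the course:

```python
import numpy as np

def sample_log_uniform(low, high):
    """Sample uniformly at random on a log10 scale between low and high.

    E.g. sample_log_uniform(0.0001, 1) draws r uniformly from [-4, 0]
    and returns 10**r.
    """
    a = np.log10(low)             # exponent of the low end, e.g. log10(0.0001) = -4
    b = np.log10(high)            # exponent of the high end, e.g. log10(1) = 0
    r = np.random.uniform(a, b)   # r ~ Uniform[a, b]
    return 10 ** r

alpha = sample_log_uniform(0.0001, 1)
print(alpha)
```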

Finally, one other tricky case is sampling the hyperparameter beta, used for computing exponentially weighted averages. Let's say you suspect that beta should be somewhere between 0.9 and 0.999; maybe that's the range of values you want to search over. Remember that when computing exponentially weighted averages, using 0.9 is like averaging over the last 10 values, kind of like taking the average temperature over 10 days, whereas using 0.999 is like averaging over the last 1,000 values.

So, similar to what we saw on the last slide, if you want to search between 0.9 and 0.999, it doesn't make sense to sample uniformly at random on the linear scale between 0.9 and 0.999. The best way to think about this is to explore the range of values for 1 - beta instead, which ranges from 0.1 down to 0.001. So we'll sample 1 - beta between 0.1 and 0.001. Using the method from the previous slide, 0.1 is 10^-1 and 0.001 is 10^-3. Notice that on the previous slide we had the small value on the left and the large value on the right, but here that's reversed: the large value is on the left and the small value is on the right. So what you do is sample r uniformly at random from -3 to -1, set 1 - beta = 10^r, and so beta = 1 - 10^r. This becomes your randomly sampled value of the hyperparameter, chosen on the appropriate scale. And hopefully this makes sense: this way, you spend as much of your resources exploring the range 0.9 to 0.99 as you spend exploring 0.99 to 0.999.
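Written out as code, a sketch of the recipe just described might be:

```python
import numpy as np

# Sample beta for exponentially weighted averages on a log scale,
# targeting beta in [0.9, 0.999], i.e. 1 - beta in [0.001, 0.1].
r = np.random.uniform(-3, -1)   # r ~ Uniform[-3, -1]
beta = 1 - 10 ** r              # 1 - beta = 10^r
print(beta)
```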

So if you want a more formal mathematical justification for why we're doing this, that is, why it's such a bad idea to sample on a linear scale: when beta is close to 1, the sensitivity of the results you get changes, even with very small changes to beta. If beta goes from 0.9 to 0.9005, it's no big deal; your results will hardly change. But if beta goes from 0.999 to 0.9995, this will have a huge impact on exactly what your algorithm is doing. In the first pair of cases, you're averaging over roughly 10 values either way, but in the second pair you've gone from an exponentially weighted average over about the last 1,000 examples to the last 2,000 examples. That's because the formula 1/(1 - beta) is very sensitive to small changes in beta when beta is close to 1. So what this whole sampling process does is cause you to sample more densely in the region where beta is close to 1, or alternatively, where 1 - beta is close to 0, so that you distribute your samples more efficiently and explore the space of possible outcomes more efficiently.

I hope this helps you select the right scale on which to sample the hyperparameters. In case you don't end up making the right scaling decision for some hyperparameter, don't worry too much about it. Even if you sample on a uniform scale where some other scale would have been superior, you might still get okay results, especially if you use a coarse-to-fine search, so that in later iterations you focus in on the most useful range of hyperparameter values to sample. I hope this helps you in your hyperparameter search. In the next video, I also want to share with you some thoughts on how to organize your hyperparameter search process, which I hope will make your workflow a bit more efficient.
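To make the sensitivity concrete, here is a small check of the 1/(1 - beta) effective-window numbers quoted above (again my own sketch):

```python
# The effective averaging window of an exponentially weighted average
# is approximately 1 / (1 - beta).
for beta in (0.9, 0.9005, 0.999, 0.9995):
    print(f"beta={beta}: averages over ~{1 / (1 - beta):.0f} values")

# beta=0.9:    ~10 values
# beta=0.9005: ~10 values   (hardly any change)
# beta=0.999:  ~1000 values
# beta=0.9995: ~2000 values (doubles with a tiny change in beta)
```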

Key takeaways:

Using an appropriate scale to pick hyperparameters

Uniform random sampling on an appropriate scale

When choosing hyperparameters, some of them can be sampled uniformly at random over a range, such as the number of hidden units per layer or the number of hidden layers. But for other hyperparameters, uniform random sampling over the whole range is inappropriate; instead, values should be sampled uniformly at random within sub-ranges of different scales. Take the learning rate α as an example, chosen from the range 0.0001 to 1:

As illustrated in the lecture, if values are sampled uniformly at random over the range 0.0001 to 1, then 90% of the samples fall between 0.1 and 1, and only 10% fall between 0.0001 and 0.1, which is clearly unreasonable.

So when sampling, take uniform random values within sub-ranges at different scales, e.g. within 0.0001∼0.001, 0.001∼0.01, 0.01∼0.1, and 0.1∼1.

Code implementation:

import numpy as np

r = -4 * np.random.rand()    # r in [-4, 0]
learning_rate = 10 ** r      # 10^r, so learning_rate in [0.0001, 1]


In general, to sample proportionally over the range 10^a ∼ 10^b, draw r uniformly at random from [a, b] and set α = 10^r.

Similarly, when using exponentially weighted averages, the hyperparameter β should also be sampled with the method above.
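A short sketch of that β sampling, under the same assumption as above that β lies in [0.9, 0.999]:

```python
import numpy as np

r = np.random.uniform(-3, -1)   # r in [-3, -1]
beta = 1 - 10 ** r              # beta in [0.9, 0.999]
```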


