
Coursera | Andrew Ng (03-week1-1.3) - Single Number Evaluation Metric

2018-01-24 09:20
This series adds personal study notes and supplementary derivations on top of the original course where relevant; corrections and feedback are welcome. After taking Andrew Ng's course, I organized it into text for easier review. Since I am continuously studying English, the series is primarily in English, and I suggest readers also focus on the English with the Chinese as a supplement, as preparation for reading academic papers in related fields later on. - ZJ

Coursera course | deeplearning.ai | NetEase Cloud Classroom

Please credit the author and source when reposting: ZJ, WeChat public account 「SelfImprovementLab」

Zhihu: https://zhuanlan.zhihu.com/c_147249273

CSDN: http://blog.csdn.net/junjun_zhao/article/details/79142314

1.3 Single number evaluation metric

(Subtitle source: NetEase Cloud Classroom)



Whether you're tuning hyperparameters, trying out different ideas for learning algorithms, or just trying out different options for building your machine learning system, you'll find that your progress will be much faster if you have a single real number evaluation metric that lets you quickly tell whether the new thing you just tried is working better or worse than your last idea. So when teams are starting on a machine learning project, I often recommend that you set up a single real number evaluation metric for your problem. Let's look at an example. You've heard me say before that applied machine learning is a very empirical process: we often have an idea, code it up, run the experiment to see how it did, and then use the outcome of the experiment to refine our ideas, and then keep going around this loop as we keep on improving the algorithm.




So let's say that for your cat classifier, you had previously built some classifier A, and by changing the hyperparameters, the training set, or something else, you've now trained a new classifier, B. One reasonable way to evaluate the performance of your classifiers is to look at precision and recall. The exact details of precision and recall don't matter too much for this example, but briefly: precision is, of the examples that your classifier recognizes as cats, what percentage actually are cats? So if classifier A has 95% precision, this means that when classifier A says something is a cat, there's a 95% chance it really is a cat. Recall is, of all the images that really are cats, what percentage were correctly recognized by your classifier? So if classifier A has 90% recall, this means that of all the images in, say, your dev set that really are cats, classifier A accurately pulled out 90% of them. Don't worry too much about the definitions of precision and recall. It turns out that there's often a tradeoff between precision and recall, and you care about both: you want that, when the classifier says something is a cat, there's a high chance it really is a cat, but of all the images that are cats, you also want it to pull out a large fraction of them as cats. So it might be reasonable to try to evaluate the classifiers in terms of precision and recall.
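As a concrete illustration of these definitions (not part of the lecture), here is a minimal Python sketch that computes precision and recall from binary 0/1 label lists; the label values below are made up purely for the usage example.

```python
# Minimal sketch: precision and recall for a binary "cat vs. not cat"
# classifier, computed from lists of 0/1 labels (1 = cat).

def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 = cat."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Of everything predicted to be a cat, how much really is a cat?
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    # Of everything that really is a cat, how much did we find?
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Tiny usage example with made-up labels:
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1]
p, r = precision_recall(y_true, y_pred)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.75, recall=0.75
```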




The problem with using precision and recall as your evaluation metric is that if classifier A does better on recall, which it does here, and classifier B does better on precision, then you're not sure which classifier is better. And if you're trying out a lot of different ideas and a lot of different hyperparameters, you want to rather quickly try out not just two classifiers but maybe a dozen classifiers, and quickly pick out the, quote, best ones, so you can keep on iterating from there. With two evaluation metrics, it is difficult to know how to quickly pick one of the two, or one of the ten. So what I recommend is, rather than using two numbers, precision and recall, to pick a classifier, you find a new evaluation metric that combines precision and recall. In the machine learning literature, the standard way to combine precision and recall is something called an F1 score. The details of the F1 score aren't too important, but informally, you can think of it as the average of precision, P, and recall, R. Formally, the F1 score is defined by this formula: 2 / (1/P + 1/R). In mathematics, this function is called the harmonic mean of precision P and recall R. Less formally, you can think of it as a way of averaging precision and recall, only instead of taking the arithmetic mean, you take the harmonic mean, which is defined by this formula, and it has some advantages in terms of trading off precision and recall. In this example, you can then see right away that classifier A has a better F1 score, and assuming the F1 score is a reasonable way to combine precision and recall, you can quickly select classifier A over classifier B.
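The selection step can be sketched in a few lines of Python. Classifier A's 95% precision and 90% recall come from the lecture; classifier B's numbers below are placeholders, since the original comparison table (an image) is not reproduced in this text.

```python
# Sketch: combine precision and recall into one F1 score (harmonic mean),
# then pick the classifier with the higher F1.

def f1_score(p, r):
    """F1 = 2 / (1/p + 1/r), the harmonic mean of precision and recall."""
    if p == 0 or r == 0:
        return 0.0
    return 2 / (1 / p + 1 / r)

classifiers = {
    "A": (0.95, 0.90),  # (precision, recall) from the lecture
    "B": (0.98, 0.85),  # placeholder values for illustration
}

for name, (p, r) in classifiers.items():
    print(f"Classifier {name}: F1 = {f1_score(p, r):.3f}")

best = max(classifiers, key=lambda k: f1_score(*classifiers[k]))
print(f"Best by F1: classifier {best}")
```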




So what I've found for a lot of machine learning teams is that having a well-defined dev set, which is how you're measuring precision and recall, plus a single number evaluation metric (sometimes I'll call it a single real number evaluation metric) allows you to quickly tell whether classifier A or classifier B is better. Having a dev set plus a single number evaluation metric really does speed up iterating; it speeds up this iterative process of improving your machine learning algorithm.




Let's look at another example. Let's say you're building a cat app for cat lovers in four major geographies: the US, China, India, and other, the rest of the world. And let's say that your two classifiers achieve different errors on data from these four different geographies. So algorithm A achieves 3% error on pictures submitted by US users, and so on. It might be reasonable to keep track of how well your classifiers do in these different markets or geographies, but by tracking four numbers, it's very difficult to look at them and quickly decide whether algorithm A or algorithm B is superior.




And if you're testing a lot of different classifiers, then it's just difficult to look at all these numbers and quickly pick one. So what I recommend in this example is, in addition to tracking your performance in the four different geographies, to also compute the average. Assuming that average performance is a reasonable single real number evaluation metric, by computing the average you can quickly tell that algorithm C has the lowest average error, and you might then go ahead with that one; you have to pick an algorithm to keep on iterating from. Your workflow with machine learning is often: you have an idea, you implement it and try it out, and you want to know whether your idea helped. So what we've seen in this video is that having a single number evaluation metric can really improve your efficiency, or the efficiency of your team, in making those decisions. Now, we're not yet done with the discussion of how to effectively set up evaluation metrics. In the next video, I'm going to share with you how to set up optimizing as well as satisficing metrics. So let's take a look at the next video.
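A small sketch of this averaging step is below. Only algorithm A's 3% US error is stated in the lecture; every other number is a placeholder for illustration, since the original per-region table is an image not reproduced here.

```python
# Sketch: collapse per-region error rates into one number per algorithm
# by averaging, then pick the algorithm with the lowest average error.

errors = {
    #      US    China  India  Other  (placeholder values except A's US 3%)
    "A": [0.03, 0.07, 0.05, 0.09],
    "B": [0.05, 0.06, 0.05, 0.10],
    "C": [0.04, 0.05, 0.04, 0.08],
}

averages = {algo: sum(errs) / len(errs) for algo, errs in errors.items()}
for algo, avg in sorted(averages.items(), key=lambda kv: kv[1]):
    print(f"Algorithm {algo}: average error = {avg:.3f}")

best = min(averages, key=averages.get)
print(f"Pick algorithm {best} (lowest average error) and keep iterating.")
```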




Summary of key points:



Single number evaluation metric

When training a machine learning model, whether you are tuning hyperparameters or trying a better optimization algorithm, setting up a single number evaluation metric for the problem lets you evaluate models better and faster.

Example 1

Below are the precision, recall, and F1 score of two separately trained classifiers.



From the table above, classifier A does better if precision is the metric, while classifier B does better if recall is the metric. So when there are two or more evaluation metrics, it is hard to decide whether A or B is better.

By combining precision and recall into a single composite metric, the F1 score, we can easily judge that classifier A performs better.

Metric definitions:

In a binary classification problem, comparing the true labels $y$ with the predictions $\hat{y}$ gives the following table of outcomes:



Precision:

$$\text{Precision} = \frac{\text{True positive}}{\text{Number of predicted positive}} \times 100\% = \frac{\text{True positive}}{\text{True positive} + \text{False positive}}$$

In the cat classification problem, precision is: of all images the model predicts to be cats, the fraction that actually are cats.

Recall:

$$\text{Recall} = \frac{\text{True positive}}{\text{Number of actually positive}} \times 100\% = \frac{\text{True positive}}{\text{True positive} + \text{False negative}}$$

In the cat classification problem, recall is: of all images that truly are cats, the fraction that the model correctly identifies.

F1 Score:

$$F_1\text{ Score} = \frac{2}{\frac{1}{P} + \frac{1}{R}}$$
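As a quick check with the numbers from the lecture (classifier A: $P = 95\%$, $R = 90\%$), the F1 score works out to roughly 92.4%:

$$F_1 = \frac{2}{\frac{1}{0.95} + \frac{1}{0.90}} \approx 0.924$$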

Example 2

Below, for a different problem, are the classification error rates of several classifiers across different countries:



The models perform differently across regions. Using the average error over regions converts this into a single number evaluation metric, which makes it easy to identify the best-performing model.

References:

[1] 大树先生. 吴恩达 Coursera 深度学习课程 DeepLearning.ai 提炼笔记 (3-1) -- 机器学习策略 (1). (Notes on Andrew Ng's Coursera Deep Learning course, DeepLearning.ai, Course 3 Week 1: ML Strategy (1).)

