Learning the Theory behind AlphaGo ----- Part One
2017-10-21 17:18
Mastering the Game of Go without Human Knowledge
David Silver*, Julian Schrittwieser*, Karen Simonyan*, Ioannis Antonoglou, Aja Huang, Arthur
Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy
Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, Demis Hassabis.
DeepMind, 5 New Street Square, London EC4A 3TW.
*These authors contributed equally to this work.
A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go.
The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here, we introduce an algorithm based solely on reinforcement learning, without human data, guidance, or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo's own move selections and also the winner of AlphaGo's games. This neural network improves the strength of tree search, resulting in higher-quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100-0 against the previously published, champion-defeating AlphaGo.
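The self-play loop sketched in the abstract trains the network toward two targets at once: the outcome of the game and the move distribution produced by the search. A minimal sketch of that training loss follows; the function name, array shapes, and constants are illustrative assumptions, not DeepMind's implementation (the paper's weight-decay term is omitted).

```python
import numpy as np

def zero_loss(p, v, pi, z):
    """AlphaGo Zero-style loss for one position: squared error between
    the predicted value v and the eventual game outcome z, plus the
    cross-entropy between the search probabilities pi (from MCTS visit
    counts) and the network's move probabilities p."""
    value_loss = (z - v) ** 2
    policy_loss = -np.sum(pi * np.log(p + 1e-12))  # epsilon avoids log(0)
    return value_loss + policy_loss

# Toy example with 3 legal moves: the search favoured move 0,
# and the game was eventually won (z = +1).
p  = np.array([0.5, 0.3, 0.2])   # network's move probabilities
pi = np.array([0.7, 0.2, 0.1])   # MCTS visit-count distribution
v, z = 0.4, 1.0                  # predicted value, actual outcome
loss = zero_loss(p, v, pi, z)
```

Minimising this loss pushes the network's raw policy toward the (stronger) search policy and its value estimate toward the true result, which is exactly the "network learns from its own search" dynamic the abstract describes.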
Much progress towards artificial intelligence has been made using supervised learning systems that are trained to replicate the decisions of human experts. However, expert data is often expensive, unreliable, or simply unavailable. Even when reliable data is available, it may impose a ceiling on the performance of systems trained in this manner. In contrast, reinforcement learning systems are trained from their own experience, in principle allowing them to exceed human capabilities and to operate in domains where human expertise is lacking. Recently, there has been rapid progress towards this goal, using deep neural networks trained by reinforcement learning. These systems have outperformed humans in computer games such as Atari [6,7] and 3D virtual environments [8-10]. However, the most challenging domains in terms of human intellect, such as the game of Go, widely viewed as a grand challenge for artificial intelligence, require precise and sophisticated lookahead in vast search spaces. Fully general methods have not previously achieved human-level performance in these domains.
AlphaGo was the first program to achieve superhuman performance in Go. The published version [12], which we refer to as AlphaGo Fan, defeated the European champion Fan Hui in October 2015. AlphaGo Fan utilised two deep neural networks: a policy network that outputs move probabilities, and a value network that outputs a position evaluation. The policy network was trained initially by supervised learning to accurately predict human expert moves, and was subsequently refined by policy-gradient reinforcement learning. The value network was trained to predict the winner of games played by the policy network against itself. Once trained, these networks were combined with Monte-Carlo Tree Search (MCTS) [13-15] to provide a lookahead search, using the policy network to narrow down the search to high-probability moves, and using the value network (in conjunction with Monte-Carlo rollouts using a fast rollout policy) to evaluate positions in the tree. A subsequent version, which we refer to as AlphaGo Lee, used a similar approach (see Methods), and defeated Lee Sedol, the winner of 18 international titles, in March 2016.
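The role of the policy network in "narrowing down the search to high-probability moves" can be illustrated with the kind of prior-guided selection rule used inside AlphaGo's MCTS: each move is scored by its running value estimate plus an exploration bonus proportional to the network's prior. This is a minimal sketch under assumed data structures and a hypothetical `c_puct` constant, not the production search.

```python
import math

def select_move(stats, c_puct=1.0):
    """One MCTS selection step in the AlphaGo style: choose the child
    maximising Q(s,a) + U(s,a), where U grows with the policy prior
    P(s,a) and shrinks as the visit count N(s,a) accumulates.
    `stats` maps each move to {'P': prior, 'N': visits, 'W': total value}."""
    total_n = sum(s['N'] for s in stats.values())
    best, best_score = None, -float('inf')
    for move, s in stats.items():
        q = s['W'] / s['N'] if s['N'] > 0 else 0.0       # mean value so far
        u = c_puct * s['P'] * math.sqrt(total_n) / (1 + s['N'])
        if q + u > best_score:
            best, best_score = move, q + u
    return best

# A high-prior, rarely visited move ('a') is explored ahead of a
# heavily visited move with a comparable value estimate ('b').
stats = {
    'a': {'P': 0.6, 'N': 1,  'W': 0.1},
    'b': {'P': 0.1, 'N': 10, 'W': 1.5},
}
```

Because the bonus term decays as `1/(1+N)`, moves the policy network deems unlikely are rarely expanded, which is what keeps the search tractable in Go's vast move space.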
Our program, AlphaGo Zero, differs from AlphaGo Fan and AlphaGo Lee [12] in several important aspects. First and foremost, it is trained solely by self-play reinforcement learning, starting from random play, without any supervision or use of human data. Second, it only uses the black and white stones from the board as input features. Third, it uses a single neural network, rather than separate policy and value networks. Finally, it uses a simpler tree search that relies upon this single neural network to evaluate positions and sample moves, without performing any Monte-Carlo rollouts. To achieve these results, we introduce a new reinforcement learning algorithm that incorporates lookahead search inside the training loop, resulting in rapid improvement and precise and stable learning. Further technical differences in the search algorithm, training procedure and network architecture are described in Methods.
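The third difference above, one network instead of two, means a single forward pass yields both a move distribution and a position value. The sketch below illustrates that shared-trunk, two-headed shape with a one-layer toy trunk on a 9x9 board; all sizes and parameter names are illustrative stand-ins for the paper's deep residual network.

```python
import numpy as np

def two_headed_net(board, params):
    """A single network with a shared trunk and two heads, as in
    AlphaGo Zero: one pass returns move probabilities over all board
    points and a scalar value in [-1, 1]."""
    h = np.tanh(board @ params['trunk'])          # shared features
    logits = h @ params['policy']                  # policy head
    p = np.exp(logits - logits.max())
    p /= p.sum()                                   # softmax over moves
    v = float(np.tanh(h @ params['value']))        # value head
    return p, v

rng = np.random.default_rng(0)
n_points = 9 * 9                                   # toy 9x9 board
params = {
    'trunk':  rng.normal(size=(n_points, 32)) * 0.1,
    'policy': rng.normal(size=(32, n_points)) * 0.1,
    'value':  rng.normal(size=(32, 1)) * 0.1,
}
board = np.zeros(n_points)                         # empty board
p, v = two_headed_net(board, params)
```

Sharing the trunk lets the two objectives regularise each other, and it is also what makes the simplified search possible: the tree can query one network for both the priors and the leaf evaluation, with no rollout policy at all.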