
Learning the Theory Behind AlphaGo ----- Part One

2017-10-21 17:18
Mastering the Game of Go without Human Knowledge

David Silver*, Julian Schrittwieser*, Karen Simonyan*, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, Demis Hassabis.

DeepMind, 5 New Street Square, London EC4A 3TW.

*These authors contributed equally to this work.

  A long-standing goal of artificial intelligence is an algorithm that learns,
tabula rasa, superhuman proficiency in challenging domains. Recently,
AlphaGo became the first program to defeat a world champion in the game of Go.

  A long-standing goal of artificial intelligence is an algorithm that can learn, entirely from scratch (tabula rasa), to reach superhuman ability in challenging domains. Recently, AlphaGo became the first computer Go program to defeat a world champion.

   The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here, we introduce an algorithm based solely on reinforcement learning, without human data, guidance, or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo's own move selections and also the winner of AlphaGo's games. This neural network improves the strength of tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100-0 against the previously published, champion-defeating AlphaGo.
       
       AlphaGo's tree search evaluated positions and selected moves with deep neural networks, which were trained by supervised learning from the moves of human experts and then by reinforcement learning from games of self-play. The algorithm introduced here relies on reinforcement learning alone: no human data, no guidance, and no domain knowledge beyond the rules of the game. AlphaGo becomes its own teacher. A neural network is trained to predict AlphaGo's own move choices and the winners of AlphaGo's games; that network makes the tree search stronger, which in turn produces better move selection and stronger self-play in the next iteration. Starting from this blank slate, the new program, AlphaGo Zero, reached superhuman performance and beat the previously published, champion-defeating AlphaGo by 100 games to 0.
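
       To make the idea above concrete, here is a minimal sketch of that self-play loop: the network guides the search, the search results become training data, and the retrained network plays the next round of games. SelfPlayNet, run_mcts and play_one_game are hypothetical stand-ins used only for illustration, not the paper's actual implementation (which uses a deep residual network and a full Go engine).

```python
import random
from typing import List, Tuple

class SelfPlayNet:
    """Stand-in for the single network f(s) -> (move probabilities, value)."""

    def predict(self, state) -> Tuple[List[float], float]:
        # A real network would evaluate the board position; here we return a
        # uniform policy over 362 actions (361 points + pass) and a neutral value.
        return [1.0 / 362] * 362, 0.0

    def train(self, examples) -> None:
        # A real implementation would fit the network so that its policy head
        # matches the search probabilities and its value head predicts the winner.
        pass


def run_mcts(net: SelfPlayNet, state) -> List[float]:
    # Placeholder for the lookahead search guided by the network; it should
    # return an improved ("searched") distribution over moves.
    policy, _ = net.predict(state)
    return policy


def play_one_game(net: SelfPlayNet):
    """Self-play a toy game, recording (state, search policy, winner) triples."""
    history, state = [], "empty board"
    for _ in range(10):                      # toy game length
        pi = run_mcts(net, state)
        history.append((state, pi))
        # A real game would sample a move from pi and apply it to the board.
    winner = random.choice([+1, -1])         # placeholder game outcome
    return [(s, pi, winner) for s, pi in history]


net = SelfPlayNet()
for iteration in range(3):                   # each iteration: self-play, then retrain
    examples = [ex for _ in range(5) for ex in play_one_game(net)]
    net.train(examples)                      # the network becomes its own teacher
```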

   Much progress towards artificial intelligence has been made using supervised learning systems that are trained to replicate the decisions of human experts. However, expert data is often expensive, unreliable, or simply unavailable. Even when reliable data is available it may impose a ceiling on the performance of systems trained in this manner. In contrast, reinforcement learning systems are trained from their own experience, in principle allowing them to exceed human capabilities, and to operate in domains where human expertise is lacking. Recently, there has been rapid progress towards this goal, using deep neural networks trained by reinforcement learning. These systems have outperformed humans in computer games such as Atari [6,7] and 3D virtual environments [8–10]. However, the most challenging domains in terms of human intellect – such as the game of Go, widely viewed as a grand challenge for artificial intelligence – require precise and sophisticated lookahead in vast search spaces. Fully general methods have not previously achieved human-level performance in these domains.
  
       Much of the progress in artificial intelligence has come from supervised learning systems trained to replicate the decisions of human experts. Expert data, however, is often expensive, unreliable, or simply unavailable, and even when reliable data exists it may place a ceiling on the performance of systems trained this way. Reinforcement learning systems, by contrast, learn from their own experience, so in principle they can exceed human ability and operate in domains where human expertise is lacking. Deep neural networks trained by reinforcement learning have recently made rapid progress towards this goal, surpassing humans in computer games such as Atari and in 3D virtual environments. But the domains most challenging for human intellect, such as the game of Go, widely regarded as a grand challenge for artificial intelligence, demand precise and sophisticated lookahead over vast search spaces, and fully general methods had not previously reached human-level performance there.

   AlphaGo was the first program to achieve superhuman performance in Go. The published version [12], which we refer to as AlphaGo Fan, defeated the European champion Fan Hui in October 2015. AlphaGo Fan utilised two deep neural networks: a policy network that outputs move probabilities, and a value network that outputs a position evaluation. The policy network was trained initially by supervised learning to accurately predict human expert moves, and was subsequently refined by policy-gradient reinforcement learning. The value network was trained to predict the winner of games played by the policy network against itself. Once trained, these networks were combined with a Monte-Carlo Tree Search (MCTS) [13–15] to provide a lookahead search, using the policy network to narrow down the search to high-probability moves, and using the value network (in conjunction with Monte-Carlo rollouts using a fast rollout policy) to evaluate positions in the tree. A subsequent version, which we refer to as AlphaGo Lee, used a similar approach (see Methods), and defeated Lee Sedol, the winner of 18 international titles, in March 2016.

     AlphaGo was the first program to play Go at a superhuman level. The published version, which we call AlphaGo Fan, defeated the European champion Fan Hui in October 2015. AlphaGo Fan used two deep neural networks: a policy network that outputs move probabilities, and a value network that outputs an evaluation of the position. The policy network was first trained by supervised learning to predict the moves of human experts and was then refined by policy-gradient reinforcement learning; the value network was trained to predict the winner of games the policy network played against itself. Once trained, the two networks were combined with Monte-Carlo Tree Search (MCTS) to provide lookahead: the policy network narrows the search to high-probability moves, while the value network, together with Monte-Carlo rollouts driven by a fast rollout policy, evaluates positions in the tree. A later version, which we call AlphaGo Lee, used a similar approach (see Methods) and defeated Lee Sedol, winner of 18 international titles, in March 2016.
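
     A rough sketch of how this division of labour looks in code may help: the policy network prunes the candidate moves, and each leaf is scored by mixing the value network's estimate with the result of a fast rollout. The function names and the 0.5 mixing weight here are illustrative assumptions, not the published implementation.

```python
import random

def policy_network(state):
    # Stand-in: prior move probabilities over a toy action set.
    return {"A": 0.6, "B": 0.3, "C": 0.1}

def value_network(state) -> float:
    # Stand-in: estimated chance of winning from `state`.
    return 0.55

def fast_rollout(state) -> float:
    # Stand-in: play to the end of the game with a cheap policy and
    # return the outcome (+1 for a win, -1 for a loss).
    return random.choice([+1.0, -1.0])

def expand(state, top_k: int = 2):
    """Use the policy network to narrow the search to high-probability moves."""
    priors = policy_network(state)
    return sorted(priors, key=priors.get, reverse=True)[:top_k]

def evaluate_leaf(state, mix: float = 0.5) -> float:
    """Blend the value-network estimate with a Monte-Carlo rollout outcome."""
    return (1 - mix) * value_network(state) + mix * fast_rollout(state)

print(expand("some position"))          # ['A', 'B']
print(evaluate_leaf("some position"))   # blended leaf evaluation
```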

     Our program, AlphaGo Zero, differs from AlphaGo Fan and AlphaGo Lee [12] in several important aspects. First and foremost, it is trained solely by self-play reinforcement learning, starting from random play, without any supervision or use of human data. Second, it only uses the black and white stones from the board as input features. Third, it uses a single neural network, rather than separate policy and value networks. Finally, it uses a simpler tree search that relies upon this single neural network to evaluate positions and sample moves, without performing any Monte-Carlo rollouts. To achieve these results, we introduce a new reinforcement learning algorithm that incorporates lookahead search inside the training loop, resulting in rapid improvement and precise and stable learning. Further technical differences in the search algorithm, training procedure and network architecture are described in Methods.

     Our program, AlphaGo Zero, differs from AlphaGo Fan and AlphaGo Lee in several important ways. First and foremost, it is trained purely by self-play reinforcement learning, starting from random play, with no supervision and no human data. Second, its only input features are the black and white stones on the board. Third, it uses a single neural network instead of separate policy and value networks. Finally, it uses a simpler tree search that relies on this single network to evaluate positions and to sample moves, with no Monte-Carlo rollouts at all. To achieve this, the authors introduce a new reinforcement learning algorithm that puts lookahead search inside the training loop, giving rapid improvement and precise, stable learning. Further technical differences in the search algorithm, the training procedure and the network architecture are described in the Methods section.
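
     The following toy sketch illustrates two of these differences: the input is nothing but planes of black and white stones, and a single network with a shared trunk produces both move probabilities and a value. The fully connected trunk, the layer sizes and the helper names are assumptions made for brevity; the real AlphaGo Zero network is a deep residual convolutional network whose input planes also encode recent history and the colour to play.

```python
import numpy as np

def board_to_planes(black: np.ndarray, white: np.ndarray) -> np.ndarray:
    """Encode a 19x19 position as stacked binary stone planes."""
    return np.stack([black, white])                   # shape (2, 19, 19)

class DualHeadNet:
    """One shared trunk feeding both a policy head and a value head."""

    def __init__(self, n_moves: int = 362):           # 361 board points + pass
        rng = np.random.default_rng(0)
        self.trunk = rng.normal(size=(2 * 19 * 19, 64))
        self.policy_head = rng.normal(size=(64, n_moves))
        self.value_head = rng.normal(size=(64, 1))

    def forward(self, planes: np.ndarray):
        h = np.tanh(planes.reshape(-1) @ self.trunk)        # shared representation
        logits = h @ self.policy_head
        p = np.exp(logits - logits.max())
        p /= p.sum()                                        # softmax over moves
        v = float(np.tanh(h @ self.value_head)[0])          # value in (-1, 1)
        return p, v

net = DualHeadNet()
empty = np.zeros((19, 19))
p, v = net.forward(board_to_planes(empty, empty))
print(p.shape, v)                                           # (362,) and a scalar
```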