The Engine

The “Engine” is a customized version of leela-zero.

This is a fairly faithful reimplementation of the system described in the Alpha Go Zero paper “Mastering the Game of Go without Human Knowledge“. For all intents and purposes, it is an open source AlphaGo Zero.

AlphaGo 101

This may help a bit:

In a nutshell, AlphaGo (and most recent Go / Baduk engines that involves some form of ‘deep learning’) adopts two key ideas:

  • Monte-Carlo Tree Search
  • Neural net based policy / value evaluation

A game of Baduk can be represented as a tree search problem. Each node on the tree is a possible ‘state’ of the board, while each edge represents a move that can be made on the board. The goal of any Baduk-playing engine is to figure out which selection leads to a win regardless of how the opponent plays.

However, the size of these trees are infeasible to search exhaustively, so the only way to do this efficiently is to prune unpromising paths, that is, don’t bother trying to play very bad moves. This is similar to having a human ‘reading’ the board – a skilled person reads the patterns from the board and tries only the promising-looking moves. This takes time and experience.

This is where the neural net comes into play – the neural net gets the game state as its input, and picks a list of promising moves and how likely the player might win. This is trained by feeding a lot of games into the neural net, and then ‘train’ it to learn which pattern of boards are promising and which aren’t. This greatly reduces the size of the tree that needs to be searched – So efficient that searching around 5000 nodes is enough to beat most mere mortals (which takes roughly 10 seconds on modern GPUs).

For most NN applications, getting a high-quality set of data is the hard part, but for Baduk and other board games, the engine itself can generate an infinite amount of data by playing the game itself.

So, what’s running here?

On top of leela-zero, some of my own patches are added:

  • The “distributed” patch ( which offloads the NN evaluation to a separate process
  • The “batched” patch ( which does parallel NN evaluation
  • The “tensorcore” patch ( which implements the inline assembly code for tensor cores

Note that none of the patches at this moment tweaks the algorithm itself – these are merely performance optimizations. One interesting patch is the “distributed” patch which is worth further explanation:

The “distributed” patch

The “distributed” patch splits the MCTS process and the NN process into two separate binaries. These can exist on different machines across the network. Initially the goal was to create a Baduk-playing engine that spans across a network of machines, that is, simply throw insane amount of compute power to increase the strength of the engine. Eventually, it turned into an engine that ran across 16 GPUs and 160 parallel tree search threads, across one MCTS client machine and four NN machines each with 4 Tesla V100 GPUs.

However, this can be done the other way around – have one NN machine serve multiple MCTS client machines. GPUs are inherently massively parallel machines, so doing multiple NN evaluations are much more efficient than evaluating one move at a time. In the cbaduk case, the goal is to be scalable but still be interesting enough to play against, so it makes sense to have one GPU play multiple games concurrently without a moderately shallow search tree.

So, what happens here is:

  • Each “engine” – the leela-zero process connects to a NN server instead of doing the evaluation itself. One engine plays one game at a time.
  • The NN server services multiple engines’ NN evaluation requests and returns the results back to the engine.

The good thing of this architecture is that not only it is efficient, but also this provides a good fallback mechanism. If an engine loses contact with a NN server, it can migrate to another functional server while still re-attempting to get contact with the server. If an engine crashes, the NN server still has enough job to handle because there are other engines running and the job schedulers will be re-routing them to the right machines.

(Well, this doesn’t mean that I am really running massive number of machines now – currently it’s just 1~3 NN servers GPUs with 3 engines, 1 job queue, and 1 web frontend machine. But it should be scalable at least in theory. 🙂 )

Leave a Reply

Your email address will not be published. Required fields are marked *