9×9 / 13×13, Save/restore

Two (or maybe three, depending on how you count) new features were added:

  • 9×9 boards
  • 13×13 boards
  • Save/restore

The 9×9 and 13×13 NNs were trained by myself from scratch, using a small cluster of three NVIDIA GTX 1080 GPUs. 9×9 took two weeks to stabilize, while the 13×13 was running for the last two months and still improving constantly (so there is a lot of room for improvement) but it’s at least good enough to beat me comfortably. 🙂

Save/restore is simply encoding the current board state into a single URL string so that nothing needs to be saved on the servers. A 200-move game gets encoded into roughly 400 bytes for now, but this should also need some optimization. The current implementation is simply gzip-compressing the board state (encoded in JSON), and then add a bit of encryption for additional obscurity.

Training the 9×9 / 13×13 nets

Leela-zero uses a distributed framework (http://github.com/gcp/leela-zero-server/) for training the nets across a wide geographical region, but this isn’t really needed if anybody wants to do it himself/herself in a small number of machines. For me, all it took was a shared NFS drive across a couple of machines, a small amount of C++ coding, plus a couple of shell scripts.

The first modification was a custom GTP command ‘autotrain’. Just typing the command plays 25 self-plays and records the play results. This replaces autogtp. The added benefit is that time is spent more on playing the game rather than preparing the engine for startup. This helps a lot because the games are much shorter on smaller boards. The code looks something like this:
https://github.com/ihavnoid/leelaz-ninenine

Once the C++ code changes are here, it’s merely picking up the latest net from the training pipeline, and then running again and again.

#!/bin/bash

suffix=$1
gpunum=$2

for((i=0;i<50000;i=i+1)) ; do
    resign_rate=$((i%10))
    timestamp=$(date +%y%m%d_%H%M%S)_${suffix}
    latest_weight=$(ls -1c training/tf/*.txt | head -1)
    leelaz_cmd="src/leelaz -q -m 8 -n -d -r $resign_rate -t 5 --batchsize 5 -v 800 --noponder --gtp --gpu $gpunum"
    sleep 5
    echo leelaz_cmd : $leelaz_cmd
    echo latest_weight : $latest_weight
    echo timestamp : $timestamp
    echo autotrain traindata_${timestamp} 25 \n quit | ${leelaz_cmd} -w $latest_weight
done

For the training code, it picks up the latest 1500 bundles (each with 25 games, so total 37500 games) and trains itself for 20k steps with batch size of 256, save net, and re-scan for the latest 1500 bundles, and so forth. I just left it running for the last month or two, playing 10 games every day vs. the previous day’s net. Other than that, everything else is stock leela-zero.

The 13×13 progress seems to be good:

Day  W/L
----------
30   8-2
31   5-5
32   5-5
33   6-4
34   5-5
35   4-6
36   5-5
37   7-3
38   7-3
39   8-2
40   7-3
41   7-3
42   8-2
43   7-3
44   9-1
45   5-5

I am going to leave it running for another month and see how it goes, unless something goes wrong. 🙂

The Engine

The “Engine” is a customized version of leela-zero.

This is a fairly faithful reimplementation of the system described in the Alpha Go Zero paper “Mastering the Game of Go without Human Knowledge“. For all intents and purposes, it is an open source AlphaGo Zero.

https://github.com/gcp/leela-zero

AlphaGo 101

This may help a bit:

https://medium.com/applied-data-science/alphago-zero-explained-in-one-diagram-365f5abf67e0

In a nutshell, AlphaGo (and most recent Go / Baduk engines that involves some form of ‘deep learning’) adopts two key ideas:

  • Monte-Carlo Tree Search
  • Neural net based policy / value evaluation

A game of Baduk can be represented as a tree search problem. Each node on the tree is a possible ‘state’ of the board, while each edge represents a move that can be made on the board. The goal of any Baduk-playing engine is to figure out which selection leads to a win regardless of how the opponent plays.

However, the size of these trees are infeasible to search exhaustively, so the only way to do this efficiently is to prune unpromising paths, that is, don’t bother trying to play very bad moves. This is similar to having a human ‘reading’ the board – a skilled person reads the patterns from the board and tries only the promising-looking moves. This takes time and experience.

This is where the neural net comes into play – the neural net gets the game state as its input, and picks a list of promising moves and how likely the player might win. This is trained by feeding a lot of games into the neural net, and then ‘train’ it to learn which pattern of boards are promising and which aren’t. This greatly reduces the size of the tree that needs to be searched – So efficient that searching around 5000 nodes is enough to beat most mere mortals (which takes roughly 10 seconds on modern GPUs).

For most NN applications, getting a high-quality set of data is the hard part, but for Baduk and other board games, the engine itself can generate an infinite amount of data by playing the game itself.

So, what’s running here?

On top of leela-zero, some of my own patches are added:

  • The “distributed” patch (https://github.com/ihavnoid/leela-zero/tree/distributed) which offloads the NN evaluation to a separate process
  • The “batched” patch (https://github.com/ihavnoid/leela-zero/tree/batch-full) which does parallel NN evaluation
  • The “tensorcore” patch (https://github.com/ihavnoid/leela-zero/tree/tensorcore) which implements the inline assembly code for tensor cores

Note that none of the patches at this moment tweaks the algorithm itself – these are merely performance optimizations. One interesting patch is the “distributed” patch which is worth further explanation:

The “distributed” patch

The “distributed” patch splits the MCTS process and the NN process into two separate binaries. These can exist on different machines across the network. Initially the goal was to create a Baduk-playing engine that spans across a network of machines, that is, simply throw insane amount of compute power to increase the strength of the engine. Eventually, it turned into an engine that ran across 16 GPUs and 160 parallel tree search threads, across one MCTS client machine and four NN machines each with 4 Tesla V100 GPUs.

However, this can be done the other way around – have one NN machine serve multiple MCTS client machines. GPUs are inherently massively parallel machines, so doing multiple NN evaluations are much more efficient than evaluating one move at a time. In the cbaduk case, the goal is to be scalable but still be interesting enough to play against, so it makes sense to have one GPU play multiple games concurrently without a moderately shallow search tree.

So, what happens here is:

  • Each “engine” – the leela-zero process connects to a NN server instead of doing the evaluation itself. One engine plays one game at a time.
  • The NN server services multiple engines’ NN evaluation requests and returns the results back to the engine.

The good thing of this architecture is that not only it is efficient, but also this provides a good fallback mechanism. If an engine loses contact with a NN server, it can migrate to another functional server while still re-attempting to get contact with the server. If an engine crashes, the NN server still has enough job to handle because there are other engines running and the job schedulers will be re-routing them to the right machines.

(Well, this doesn’t mean that I am really running massive number of machines now – currently it’s just 1~3 NN servers GPUs with 3 engines, 1 job queue, and 1 web frontend machine. But it should be scalable at least in theory. 🙂 )