Blog

Compressing the NN evaluation results : part 3

(Here is part 1 and part 2)

Part 2 describes a proper entropy coding algorithm. With that algorithm in place, now it’s time to create a proper file format for storage.

File format

First, it is good to start any file with some magic number, so that loading the wrong type of file can be detected easily – and for the NN evaluation cache, feeding the wrong file type would be especially troublesome.

For the neural net cache file, I decided to start the first four bytes with the magic number ‘\xfe’, ‘L’, ‘N’, ‘C’. The first character is a non-printable character so that any random text file won’t match.

After the magic number come any number of NNCache entries, each with the following layout:

Field name                    Size (bytes)
------------------------------------------
Hash value                    8
Value                         4
Policy (pass)                 4
Length of compressed policy   1
Compressed policy             (from above)

In between every 1000 entries, a ‘guide’ entry is inserted, which is encoded as sixteen 0xff bytes. This serves as a ‘recovery point’ in case part of the cache file is corrupted. The sixteen 0xff byte values are chosen because they are illegal in the compressed policy – an all-ones bit stream would decode into an endless run of X31 symbols. If the decompressing code finds that something is wrong with the decoded results, it searches for the next ‘guide’ entry and resumes decompressing from there.

Because of this, the following two cases cannot be saved to disk. In the unlikely event that one of them happens, we simply do not store the entry in the cache:

  • Hash value ending up being 0xffff_ffff_ffff_ffff
  • Compressed policy ending up being larger than 255 bytes

Cache Operation

The cache can be loaded with one of two commands:

  • lz-read_cache : Open cache file read-only, and use entries when available
  • lz-dump_cache : Open cache file read-write. Use entries when available, and if it’s a cache miss, append the evaluated result.

Whenever a cache file is loaded (in either mode), the whole content is parsed to build a hash table for random access. The hash table is simply a std::unordered_map with the keys being the hash values, and the values being the file offsets of the entries.

Once the table is built, accessing a cache entry is simply searching the table and then reading the cache file:

    auto iter2 = m_outfile_map.find(hash);
    if(iter2 == m_outfile_map.end()) {
        return false;  // Not found.
    }

    // At this point the entry should exist; open the file and seek to it.
    std::ifstream ifs;
    ifs.open(m_filename);
    ifs.seekg(iter2->second);

    try {
        // throws an exception on a parse error
        CompressedEntry e;
        e.read(ifs, hash);  // create compressed entry from stream
        e.get(result);      // decompress
    } catch (...) {
        return false;
    }

Memory management

One problem is that the table can get quite big if we want to do a lot of precomputation, even if we are only trying to precompute the first 20 common moves. 1GB of precomputed results ends up being around 25 million entries, and putting 25 million entries in an unordered_map is quickly going to eat up GBs of memory. This is definitely going to hurt on low-end machines.

For now, the only way I can make things work is to limit the table size so that we don’t exceed the cache budget (which is 10% of memory by default). If the table starts getting too large, entries will be randomly dropped, and those entries will no longer be searchable in that session.
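A sketch of what that random dropping could look like (`enforce_budget` is a made-up helper, not the actual implementation):

```cpp
#include <cstdint>
#include <ios>
#include <random>
#include <unordered_map>

// Once the index exceeds its budget, drop arbitrary entries until it
// fits again.  The evicted positions simply become cache misses for the
// rest of the session.
void enforce_budget(std::unordered_map<std::uint64_t, std::streamoff> & index,
                    std::size_t max_entries, std::mt19937_64 & rng) {
    while (index.size() > max_entries) {
        // Pick a random bucket with at least one element and erase its head.
        auto bucket = rng() % index.bucket_count();
        if (index.bucket_size(bucket) == 0) continue;
        index.erase(index.begin(bucket)->first);
    }
}
```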

This definitely is going to be an ongoing concern. A couple of possible ideas:

  • Come up with a more efficient data structure, which is fine-tuned to work with already-randomized 64-bit hashes
  • Disk-backed index tables – that is, keep the table itself in disk and access as needed

None of these is simple, nor is there a good solution for any of them. I think all I can do at this point is leave the scalability problem as future work.

Want to try it out?

Code is here : https://github.com/ihavnoid/leela-zero/pull/new/disk-nncache

Here is a compressed net file created from the latest 40b net (ffe8ba44 – can get the net itself from here):

https://drive.google.com/open?id=181DracV9FpQ9kSvXc5v_ziL-6vZ8U-ME (914MB)

To try:

  • Run leela-zero as usual:
./leelaz -w ffe8ba44*.gz
  • Once you get the GTP prompt, it’s probably a good idea to increase the amount of memory assigned to the cache before loading the cache file:
lz-setoption name Percentage of memory for cache value 40
  • Then, load the cache
lz-read_cache ffe8ba44.lzc

You can see the first couple of moves will be blazing fast. Enjoy!

Server crash (ugh)

After a couple of days of being stuck at work, family issues, and generally not feeling well, I left cbaduk.net alone for a while… and found that the server had simply crashed.

Ugh.

The Google Cloud console says the machine is running, but it is not responding to anything. No ssh, no web, nothing.

Without much choice, I rebooted the machine, opened a terminal, and started to have a look – and the syslog showed that the machine had crashed after running out of memory.

Jan 19 21:49:41 instance-1 kernel: [3073981.782861] sshd invoked oom-killer: gfp_mask=0x15080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), nodemask=(null), order=1, oom_score_adj=0
Jan 19 21:49:41 instance-1 kernel: [3073981.782863] sshd cpuset=/ mems_allowed=0
Jan 19 21:49:41 instance-1 kernel: [3073981.782869] CPU: 0 PID: 22855 Comm: sshd Not tainted 4.15.0-1025-gcp #26-Ubuntu
Jan 19 21:49:41 instance-1 kernel: [3073981.782870] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Jan 19 21:49:41 instance-1 kernel: [3073981.782870] Call Trace:
Jan 19 21:49:41 instance-1 kernel: [3073981.782879] dump_stack+0x8e/0xcb
Jan 19 21:49:41 instance-1 kernel: [3073981.782882] dump_header+0x71/0x285
Jan 19 21:49:41 instance-1 kernel: [3073981.782884] oom_kill_process+0x220/0x440
Jan 19 21:49:41 instance-1 kernel: [3073981.782886] out_of_memory+0x2d1/0x4f0
......

The whole server was running on an f1-micro instance (the smallest instance on Google Compute) which has a grand total of 0.6GB of RAM, running nginx+php+mysql.

*sigh*

My first web server was an Athlon 1.4GHz with 256MB of RAM which I managed to shove into the computer room back when I was in college, running Red Hat 9. It served a couple of webpages with php and mysql, while also running all the code I needed for my term projects… and now 600MB won’t even get me to the point of running a website with near-zero visitors.

(Yeah yeah, I know there are people much older than me that will say that their first server was something with 256MB in hard drives…)

Well, I do appreciate how easy things are now. I launched another server, installed mysql, and then migrated the database from the web server to another separate f1-micro instance. Took around 15 minutes.

If this were 2001, it would probably have taken me another day to find and buy parts (assuming I had the money to buy something), figure out how to get another IP, assemble all the parts, and then struggle to get the required rpm files.

Compressing the NN evaluation results : part 2

In part 1, I wrote an ad-hoc algorithm which applied two commonly-used techniques – variable length coding and run length coding. The ad-hoc approach seemed to be going nowhere, so I eventually settled on something more general.

Two steps:

  • Encode the values 0 ~ 2047 using a run length coding method into some symbols, and then:
  • Do a simple variable length coding (something like Huffman Coding) to encode those symbols

Since we know that the values are mostly zeros, it is pretty clear that we need a run length encoding of zeros. Beyond that, the fewer tricks the better. I settled on the following symbols:

  • V0 ~ V63 : A single value of 0 ~ 63
  • Z0 ~ Z15 : 2 ~ 17 consecutive zeros
  • X0 ~ X31 : For any symbol Xn, if the previous symbol was a V-type symbol, add 64(n+1) to that value. If the previous symbol was a Z-type symbol, append 16(n+1) zeros.

For example,

129 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2

gets encoded into:

V1 X2 Z0 X0 V1 V2

The next step is to encode these symbols as individual bits. To decide how many bits to use for each symbol, I measured the frequency of each symbol, and hand-assigned a variable number of bits per symbol, ending up with:

Symbol       Bit encoding
-------------------------
V0           0100
V1           000
V2 ~ V3      X1100
V4 ~ V7      XX0010
V8 ~ V15     XXX1010
V16 ~ V31    XXXX0110
V32 ~ V63    XXXXX1110
Z0           0001
Z1           1001
Z2 ~ Z3      X0101
Z4 ~ Z7      XX1101
Z8 ~ Z15     XXX0011
X0           1011
X1           00111
X2 ~ X3      X10111
X4 ~ X7      XX01111
X8 ~ X15     XXX011111
X16 ~ X31    XXXX111111

(The Xs in the bit encoding are the LSBs of the symbol number – for example, V18 is 00100110. One notable thing is that V1 is more frequent than V0, which is because we already encoded the majority of zeros using the Zn-type symbols.)

Since we are dealing with little endian (the least significant bits come first), if we represent the example above as a 64-bit integer, we end up getting:

01100 000 1011 0001 010111 000 = 0x0c162b8
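The packing above can be reproduced with a small LSB-first bit writer. This is an illustrative sketch (`BitWriter` is a made-up class, not the actual leela-zero code); pushing the six symbol codes from the example produces exactly the bytes of 0x0c162b8:

```cpp
#include <cstdint>
#include <vector>

// Minimal LSB-first bit writer: the first bit written becomes the least
// significant bit of the first byte, matching the little-endian packing
// described above.
class BitWriter {
public:
    // Append 'nbits' bits of 'bits', least significant bit first.
    void push(std::uint32_t bits, int nbits) {
        for (int i = 0; i < nbits; i++) {
            if (m_bitpos == 0) m_bytes.push_back(0);
            m_bytes.back() |= static_cast<std::uint8_t>(((bits >> i) & 1) << m_bitpos);
            m_bitpos = (m_bitpos + 1) % 8;
        }
    }
    const std::vector<std::uint8_t> & bytes() const { return m_bytes; }
private:
    std::vector<std::uint8_t> m_bytes;
    int m_bitpos = 0;
};
```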

So, how good is this? I got 261.7 bits (32.7 bytes) on average, which is much better than the ad-hoc method!

The next step is to apply this in leela-zero, and then verify that it doesn’t lose strength.

Compressing the NN evaluation results : part 1

I opened this discussion: https://github.com/gcp/leela-zero/issues/2130

I was thinking a bit today and I think I will try implementing something that might be useful - disk-backed NNCache. Two usage cases that I can find useful would be:
 - Use the cache file as an opening book - keep the NN eval results for the first 30 or so moves
 - Pre-compute analysis : we can precompute the evaluation results for an sgf file overnight, and then use the precomputed NN eval results later for analyzing the game. The benefit is that the NN eval results will still be useful for playing variations of the sgf file.
With compression I think we can store tens of millions of eval results per GB - there are a couple of easy compression ideas since the NN eval result consists of a lot of zero or near-zero values.

In the cbaduk.net context, the main purpose is to cache the common opening sequences. Most of the compute load for leela-zero is the neural net computation, so anything that makes that part more efficient is an improvement.

So, what kind of problem are we facing?

The input of an NN evaluation is the current board state – the last eight states of the board, plus whether the current player is black or white. The output of the net consists of 363 floating-point numbers:

  • (Policy) 362 for the probability of playing each move (0.0~1.0) – promising moves get higher numbers while unlikely-to-be-useful moves get near-zero values.
  • (Value) 1 for the probability of winning (-1.0~1.0)

363 floating-point numbers × 4 bytes per number is roughly 1.5kB, which means caching a million moves takes 1.5GB. A single GTX 1080 can generate about 300 moves per second on a 40×256 net (the most complex nets), which results in roughly 30GB every single day.

The dictionary probably has to grow into the billions: even if we only count 2 variations per move, the number of total combinations quickly reaches 2^30 (≈1 billion), which is 1.5TB – not only is it expensive to store, it also requires a lot of disk access. We would probably end up slower than a GPU if we kept it all on a single disk.

Luckily, we can cheat a bit:

  • We are dealing with probabilities, and most of the 362 moves will be silly moves which will end up something close to zero.
  • The NN results do not need to be precise. It’s not like we will pick one move over another because its probability changed from 85% to 60% – both are large enough to be picked most of the time.

To put it simply, we can do a lossy compression.

How lossy can we go?

To see how lossy we can go, I devised an experiment – A 480-game match between four engines:

  • (base) The baseline engine
  • (quantize) An engine that quantizes the policy output into 256 discrete steps
x = static_cast<int>(x * 256);
x = x / 256;
  • (quantize-sqrt) An engine that quantizes the policy output into 256 discrete steps after taking its square root. The effect is that the smaller numbers retain more accuracy while the larger numbers lose accuracy.
x = sqrt(x);
x = static_cast<int>(x * 256);
x = x / 256;
x = x * x;
  • (quantize-pow2) An engine that quantizes the policy output into 256 steps after squaring it. Here the larger numbers retain accuracy while the smaller numbers lose it.
x = x * x;
x = static_cast<int>(x * 256);
x = x / 256;
x = sqrt(x);

…and here is the result:

                  A      B      C      D
A base              -  38-42  33-47  58-22
B quantize      42-38      -  43-37  64-16
C quantize-sqrt 47-33  37-43      -  65-15
D quantize-pow2 22-58  16-64  15-65      -

What we can see here is that simply quantizing the policy values into 256 steps is still good enough. However, for some reason, we lose strength when we lose information on the smaller values (quantize-pow2), while the larger values don’t seem to matter much.

From now on, I will quantize the numbers into 2048 steps – 8x more steps than what we found to be safe, but this is to avoid any unintended consequences, e.g., results going haywire when applied to a different net or a different board size.
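As a concrete sketch, quantizing into 2048 steps is a one-liner (with clamping added here to guard the x == 1.0 edge case; the function name is illustrative):

```cpp
#include <algorithm>
#include <cstdint>

// Quantize a policy probability (0.0 ~ 1.0) into one of 2048 steps.
inline std::uint16_t quantize(float x) {
    int v = static_cast<int>(x * 2048.0f);
    // Clamp so x == 1.0 maps to 2047 instead of the out-of-range 2048.
    return static_cast<std::uint16_t>(std::max(0, std::min(v, 2047)));
}
```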

What are we compressing?

To compress things effectively, the first thing to do is to see what kind of data is common and what is uncommon. To do so, I created NN evaluation dumps from two self-play games, consisting of 487,883 evaluations. Then I quantized them into 2048 steps to see which values are the most common:

Value      Count         %
0      165614777  93.7724
1        3765003   2.13177
2         948082   0.536811
3         561482   0.317915
4         401759   0.227479
5         311689   0.176481
6         252915   0.143202
7         209827   0.118806
8         177416   0.100454
9         151993   0.0860596

Yes, 94 out of 100 values get quantized to zero. This makes perfect sense, since in any game most of the possible moves do not make sense. There are usually only a handful of moves that make any sense, hence most of them will have extremely low probability. This means that the single most important part of the scheme is how we encode the large number of zeros.

Step 1 : Variable Length Coding

Since small values are likely and large values are very unlikely, we can use a different number of bytes to record them. So, here is the simple way:

  • Value 0 ~ 247 : write the value as a single byte as-is
  • Value 248 ~ 2047 : the top 3 bits get recorded in the low bits of the first byte (offset by 248), and the bottom 8 bits get recorded in the next byte

So, basically something like this:

std::vector<std::uint8_t> ret;
for(auto x : in_array) {
    int v = static_cast<int>(x * 2048.0);   // quantize into 0 ~ 2047
    if(v < 248) {
        ret.push_back(v);                   // small value: one byte as-is
    } else {
        ret.push_back(248 + v / 256);       // top 3 bits, offset by 248
        ret.push_back(v % 256);             // bottom 8 bits
    }
}
return ret;

This resulted in an average of 363.921 bytes per encoding. Not bad considering that we started with 362 floating-point numbers (1448 bytes). This also shows that most of the values on the nets are less than 248 – to be exact, only 1.921 out of 362 values are 248 or larger (or 0.12 before quantization).

Step 2 : Run Length Coding

We know that most of the values are zeros. So, rather than writing ‘zero zero zero zero …’, we can record zeros like ‘ten zeros’. So, we modify our encoding a bit more:

  • Value 0 ~ 239 : write the value as a single byte as-is
  • Value 240 ~ 2047 : the top 3 bits get recorded in the low bits of the first byte (offset by 248), and the bottom 8 bits in the next byte. The first byte’s five MSBs are all ones.
  • If we get N consecutive values of M (M between 0 and 7), the sequence gets recorded as (240+M, N). That is, seven consecutive zeros are recorded as (240, 7)

Here’s the code, a bit more complicated:

std::vector<uint8_t> ret;

int prev_value = -1;
int prev_cnt = 0;
int runcount = 0;   // statistics: number of runs emitted
int runsum = 0;     // statistics: total values covered by runs

for(auto x : xarray) {
    int v = static_cast<int>(x * 2048.0);

    if(prev_value == v && prev_cnt < 255) {
        prev_cnt++;
    } else {
        if(prev_cnt > 0) {
            if(prev_cnt == 1) {
                ret.push_back(static_cast<uint8_t>(prev_value));
            } else {
                ret.push_back(static_cast<uint8_t>(prev_value + 240));
                ret.push_back(static_cast<uint8_t>(prev_cnt));
                runcount++;
                runsum += prev_cnt;
            }
        }
        prev_value = -1;
        prev_cnt = 0;
        if(v < 8) {
            prev_value = v;
            prev_cnt = 1;
        } else if(v < 240) {
            ret.push_back(v);
        } else {
            ret.push_back(248 + v / 256);
            ret.push_back(v % 256);
        }
    }
}
if(prev_cnt > 0) {
    if(prev_cnt == 1) {
        ret.push_back(static_cast<uint8_t>(prev_value));
    } else {
        ret.push_back(static_cast<uint8_t>(prev_value + 240));
        ret.push_back(static_cast<uint8_t>(prev_cnt));
        runcount++;
        runsum += prev_cnt;
    }
}
return ret;

This time we got 52.215 bytes per encoding. Considering that roughly 16 out of 17 values are zero, we are going to have a lot of runs of consecutive zeros, which all get reduced to 2 bytes.
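For completeness, here is a matching decoder sketch for this run-length scheme (`decode` is a hypothetical helper; the post only shows the encoder):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Inverse of the encoder above:
//  - byte b < 240: a literal quantized value
//  - byte 240+M (M = 0..7) followed by N: N consecutive values of M
//  - byte >= 248: a value v >= 240, split as (248 + v/256, v%256)
std::vector<int> decode(const std::vector<std::uint8_t> & in) {
    std::vector<int> out;
    for (std::size_t i = 0; i < in.size(); i++) {
        int b = in[i];
        if (b < 240) {
            out.push_back(b);                      // literal value
        } else if (b < 248) {
            int n = in[++i];                       // run length
            out.insert(out.end(), n, b - 240);     // n copies of m
        } else {
            int lo = in[++i];                      // bottom 8 bits
            out.push_back((b - 248) * 256 + lo);   // large value
        }
    }
    return out;
}
```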

Step 3 : Huffman Coding

We can see that some values (of the 256 possible values of a byte) are going to be used frequently while others are going to be used infrequently. An ideal compression algorithm should result in every possible value being used fairly equally. So, what does the output distribution look like?

Value      Count         %
...
--------------------------
14        323995   1.27182
13        327487   1.28553
10        334606   1.31348
9         338045   1.32698
11        341306   1.33978
15        351303   1.37902
8         370494   1.45435
18        381949   1.49932
16        389688   1.5297
249       396717   1.55729
7         416312   1.63421
241       447475   1.75654
6         470702   1.84771
17        481525   1.8902
5         569659   2.23616
4         726237   2.8508
0         785929   3.08512
3         949443   3.72699
2        1722779   6.76267
1        2550610  10.0123
240      6550212  25.7125

Not good at all. 25% of the bytes are 240 (runs of zeros) and 10% are single ones. There are quite a few 241s (runs of ones) and 249s (values of 256 or larger), but other than that, it’s just small numbers (0~18).

At this point, it seems that we can either keep adding more tricks or just use a more general solution – which I will continue in another post.

std::vector with a not-so-movable type

Consider this:

class some_class {
    public:
        some_class(some_class && x) = delete;
    // ....
};

Let’s say that we want some class type to prohibit move operations, and for some reason, we want to keep a std::vector<some_class> of these. Unfortunately this doesn’t work as-is.

C++ 101 : what is a ‘move’ operation?

Consider this:

std::vector<int> get_power_vect(int size) {
    std::vector<int> ret;
    ret.reserve(size);
    for(int x=0; x<size; x++) {
        ret.push_back(x);
    }
    return ret;
}

In classic C++, this kind of code resulted in a horrible *facepalm*. The final return statement creates a sizable vector, returns a copy of the vector to the caller, and then immediately destroys the function’s stack, destroying the original vector.

This resulted in a lot of old C++ code doing something like this:

void get_power_vect(int size, std::vector<int> & ret) {
    ...
}

This works without performance overhead, but there are a couple of problems:

  • It isn’t clear whether an argument (especially a reference type) is an input or an output. Either it has to be clearly documented (which we know people are bad at), or somebody has to spend time reading the implementation.
  • The function has to think about what the input vector might look like – e.g., what happens if the vector ret was non-empty?

C++11 changed this by introducing ‘rvalue’ references, which end up being ‘moved’ instead of ‘copied’. I won’t go into too much detail, but to put it simply, a variable gets moved instead of copied if the source variable is going to be destroyed right away.

So, the statement ‘return ret’ results in the callee’s vector contents being moved into the caller’s context, so none of that silly copy operation happens. Thus, modern C++ recommends returning even huge STL data structures, because this results in only a couple of pointer copy operations.

A move constructor looks something like this:

some_class::some_class(some_class && x) {...}

See the ‘&&’ notation – that is the ‘rvalue reference’ type, which (to put it simply) means that the variable is an rvalue which will be destroyed right after this call, so you don’t need to do any deep copying.

So, what can possibly go wrong?

A move operation is a godsend in many cases, but it is definitely going to be confusing in others. The biggest problem is that it can screw up pointers and references. Consider this:

std::vector<int> x;
std::vector<int> * xp = &x;
...
std::vector<int> y = std::move(x); // force a move operation.  x is now in
                                   // an unspecified state
...
return xp->size() + 3; // xp still points at the moved-from x;
                       // the result is meaningless

So, once we do a move, any pointers to the original object end up pointing at a gutted object. This may be easy to avoid when the object is confined to a small block of code, but if uses of your class span hundreds of files… then good luck avoiding all this trouble.

Thus, if there is an object type that should never be moved around, even accidentally, because there will be many pointers or references to that object everywhere, then it’s better to just prohibit move operations (or even copy operations) by deleting them:

some_class(some_class && x) = delete;

But… but… but…

The main problem I had was that prohibiting move operations means the type is no longer usable with many STL data structures (notably std::vector). In many cases I ended up having to keep a vector of std::unique_ptr simply because these things were not movable.

But why does std::vector require a move operation?

Remember that std::vector is an ‘elastic’ array. Add something – it gets constructed at the end of the buffer. The problem is that if you run out of buffer, std::vector has to allocate a new buffer and then move each individual object into it. Insert something at the front – then std::vector shifts everything by one (involving moves) and then creates the new object in the newly vacated spot.

But if we don’t need those operations, we should be able to use std::vector. It’s perfectly valid to create a vector of size 100 (or some arbitrary size) and then keep it at that size until it gets destroyed. The only problem is that the compiler can’t statically know whether a move will ever be required. So it disallows it.

Solution : make it a runtime error

I ended up with a very simple solution: redefine the move operation to be a runtime error. Something like this:

some_class(some_class && x) {
    throw std::runtime_error("Move operations not allowed");
}

This results in code that compiles but crashes if a move operation is ever invoked. Except… it changes a compile-time error into a runtime error, which is much harder to debug. Doing this on a sparsely-used data structure might be okay, but it’s going to end badly if there are something like 1000 places using that class.

So, an alternative – simply create a child class that defines the move constructor as a runtime error, and use that type inside std::vector.

template <typename T>
class assert_fail_move : public T {
    using T::T;
public:
    assert_fail_move(assert_fail_move && x) {
        throw std::runtime_error("Did not expect move operations to happen");
    }
};

template <typename T> using unmovable_vector =  
                            std::vector<assert_fail_move<T>>;

Example usage:

// Test example
class testclass {
private:
    int _w;
    int _x;
public:
    testclass(int w, int x) {
        _w = w; _x = x;
    }
    testclass() {
        _w = 0; _x = 0;
    }
    testclass(testclass && x) = delete;
    int sumsum() const {
        return _w + _x;
    }
    void setval(int w, int x) {
        _w = w;
        _x = x;
    }
};

void print(const testclass * x) {
    std::cout << x->sumsum() << std::endl;
}
void print(testclass & x) {
    std::cout << x.sumsum() << std::endl;
}
int main() {
    unmovable_vector<testclass> v(10);
    for(int i=0; i<10; i++) {
        v[i].setval(i, i+3);
    }

    unmovable_vector<testclass> v2;
    v2.reserve(10);
    // will result in a runtime error if we reserved too little
    for(int i=0; i<10; i++) {
        v2.emplace_back(i, i+3);
    }
    for(auto & x : v) {
        print(&x);
    }
    for(auto & x : v2) {
        print(x);
    }
    return 0;
}

9×9 / 13×13, Save/restore

Two (or maybe three, depending on how you count) new features were added:

  • 9×9 boards
  • 13×13 boards
  • Save/restore

The 9×9 and 13×13 NNs were trained by myself from scratch, using a small cluster of three NVIDIA GTX 1080 GPUs. The 9×9 net took two weeks to stabilize, while the 13×13 net has been training for the last two months and is still improving constantly (so there is a lot of room for improvement), but it’s at least good enough to beat me comfortably. 🙂

Save/restore simply encodes the current board state into a single URL string so that nothing needs to be saved on the servers. A 200-move game gets encoded into roughly 400 bytes for now, but this could use some optimization too. The current implementation simply gzip-compresses the board state (encoded as JSON), and then adds a bit of encryption for additional obscurity.
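As a rough shell sketch of the idea (without the encryption layer, and with made-up JSON field names): gzip the JSON board state and base64-encode it into a URL-embeddable string, then reverse the process to restore.

```shell
# Hypothetical board state (field names are made up for this sketch).
state='{"moves":["dd","pp","dp","pd"],"board":19}'

# Encode: gzip the JSON, then base64 it into a single-line string.
encoded=$(printf '%s' "$state" | gzip -c | base64 | tr -d '\n')

# Decode: reverse the two steps to recover the original JSON.
decoded=$(printf '%s' "$encoded" | base64 -d | gzip -dc)
echo "$decoded"
```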

Training the 9×9 / 13×13 nets

Leela-zero uses a distributed framework (http://github.com/gcp/leela-zero-server/) for training nets across a wide geographical region, but this isn’t really needed if you want to do it yourself on a small number of machines. For me, all it took was a shared NFS drive across a couple of machines, a small amount of C++ coding, plus a couple of shell scripts.

The first modification was a custom GTP command ‘autotrain’. Typing the command plays 25 self-play games and records the results. This replaces autogtp. The added benefit is that more time is spent playing the game rather than preparing the engine for startup, which helps a lot because the games are much shorter on smaller boards. The code looks something like this:
https://github.com/ihavnoid/leelaz-ninenine

Once the C++ code changes are in place, it’s merely a matter of picking up the latest net from the training pipeline, and then running it again and again.

#!/bin/bash

suffix=$1
gpunum=$2

for((i=0;i<50000;i=i+1)) ; do
    resign_rate=$((i%10))
    timestamp=$(date +%y%m%d_%H%M%S)_${suffix}
    latest_weight=$(ls -1c training/tf/*.txt | head -1)
    leelaz_cmd="src/leelaz -q -m 8 -n -d -r $resign_rate -t 5 --batchsize 5 -v 800 --noponder --gtp --gpu $gpunum"
    sleep 5
    echo leelaz_cmd : $leelaz_cmd
    echo latest_weight : $latest_weight
    echo timestamp : $timestamp
    echo -e "autotrain traindata_${timestamp} 25\nquit" | ${leelaz_cmd} -w $latest_weight
done

The training code picks up the latest 1500 bundles (each with 25 games, so 37,500 games in total), trains for 20k steps with a batch size of 256, saves the net, re-scans for the latest 1500 bundles, and so forth. I just left it running for the last month or two, playing 10 games every day against the previous day’s net. Other than that, everything else is stock leela-zero.

The 13×13 progress seems to be good:

Day  W/L
----------
30   8-2
31   5-5
32   5-5
33   6-4
34   5-5
35   4-6
36   5-5
37   7-3
38   7-3
39   8-2
40   7-3
41   7-3
42   8-2
43   7-3
44   9-1
45   5-5

I am going to leave it running for another month and see how it goes, unless something goes wrong. 🙂

The Engine

The “Engine” is a customized version of leela-zero.

This is a fairly faithful reimplementation of the system described in the Alpha Go Zero paper “Mastering the Game of Go without Human Knowledge”. For all intents and purposes, it is an open source AlphaGo Zero.

https://github.com/gcp/leela-zero

AlphaGo 101

This may help a bit:

https://medium.com/applied-data-science/alphago-zero-explained-in-one-diagram-365f5abf67e0

In a nutshell, AlphaGo (and most recent Go / Baduk engines that involve some form of ‘deep learning’) adopts two key ideas:

  • Monte-Carlo Tree Search
  • Neural net based policy / value evaluation

A game of Baduk can be represented as a tree search problem. Each node of the tree is a possible ‘state’ of the board, while each edge represents a move that can be made on the board. The goal of any Baduk-playing engine is to figure out which move leads to a win regardless of how the opponent plays.

However, these trees are infeasible to search exhaustively, so the only way to do this efficiently is to prune unpromising paths – that is, don’t bother trying very bad moves. This is similar to a human ‘reading’ the board – a skilled player reads the patterns on the board and tries only the promising-looking moves. This takes time and experience.

This is where the neural net comes into play – the neural net takes the game state as its input, and predicts a list of promising moves and how likely the current player is to win. It is trained by feeding a lot of games into the neural net, ‘training’ it to learn which board patterns are promising and which aren’t. This greatly reduces the size of the tree that needs to be searched – so efficient that searching around 5000 nodes is enough to beat most mere mortals (which takes roughly 10 seconds on modern GPUs).

For most NN applications, getting a high-quality data set is the hard part, but for Baduk and other board games, the engine can generate an effectively infinite amount of data by playing against itself.

So, what’s running here?

On top of leela-zero, some of my own patches are added:

  • The “distributed” patch (https://github.com/ihavnoid/leela-zero/tree/distributed) which offloads the NN evaluation to a separate process
  • The “batched” patch (https://github.com/ihavnoid/leela-zero/tree/batch-full) which does parallel NN evaluation
  • The “tensorcore” patch (https://github.com/ihavnoid/leela-zero/tree/tensorcore) which implements the inline assembly code for tensor cores

Note that none of the patches at this moment tweaks the algorithm itself – these are merely performance optimizations. One interesting patch is the “distributed” patch which is worth further explanation:

The “distributed” patch

The “distributed” patch splits the MCTS process and the NN process into two separate binaries. These can live on different machines across the network. Initially the goal was to create a Baduk-playing engine that spans a network of machines – simply throw an insane amount of compute power at it to increase the strength of the engine. Eventually, it turned into an engine that ran 160 parallel tree search threads across 16 GPUs – one MCTS client machine and four NN machines, each with 4 Tesla V100 GPUs.

However, this can be done the other way around – have one NN machine serve multiple MCTS client machines. GPUs are inherently massively parallel, so doing multiple NN evaluations at once is much more efficient than evaluating one position at a time. In the cbaduk case, the goal is to be scalable but still interesting enough to play against, so it makes sense to have one GPU play multiple games concurrently, each with a moderately shallow search tree.

So, what happens here is:

  • Each “engine” – the leela-zero process – connects to an NN server instead of doing the evaluation itself. One engine plays one game at a time.
  • The NN server services multiple engines’ NN evaluation requests and returns the results back to the engine.

The good thing about this architecture is that not only is it efficient, it also provides a good fallback mechanism. If an engine loses contact with an NN server, it can migrate to another functional server while still re-attempting to reach the original. If an engine crashes, the NN server still has enough work to handle because there are other engines running, and the job schedulers will re-route jobs to the right machines.

(Well, this doesn’t mean that I am really running a massive number of machines now – currently it’s just 1~3 NN server GPUs, 3 engines, 1 job queue, and 1 web frontend machine. But it should be scalable, at least in theory. 🙂 )

Web Frontend and the job queue

The web frontend is essentially a single PHP script that forwards jobs to play a move, and then returns the next board state back as another chunk of HTML5 and JavaScript. The job queue is a simple C++ server that accepts connections from frontends and engines: connections from engines that wait for new jobs (port 13334), and connections from frontends that submit new jobs (port 13333). What the job queue does is match frontend connections with engine connections.
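The matching itself can be modeled as two queues – waiting frontend connections and idle engine connections – paired off whenever both are non-empty. A minimal sketch, with connections reduced to plain IDs and the class and method names made up for illustration:

```cpp
#include <deque>
#include <optional>
#include <utility>

// Minimal model of the job queue's core: pair a waiting frontend
// connection with an idle engine connection whenever both exist.
class JobQueue {
public:
    void add_frontend(int fd) { frontends_.push_back(fd); }  // port 13333
    void add_engine(int fd)   { engines_.push_back(fd); }    // port 13334

    // Returns a (frontend, engine) pair if a match is possible,
    // removing both connections from their queues.
    std::optional<std::pair<int, int>> match() {
        if (frontends_.empty() || engines_.empty()) return std::nullopt;
        int f = frontends_.front(); frontends_.pop_front();
        int e = engines_.front();   engines_.pop_front();
        return std::make_pair(f, e);
    }

private:
    std::deque<int> frontends_;  // connections waiting for an engine
    std::deque<int> engines_;    // engines waiting for a job
};
```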

The javascript client sends a POST request to the web frontend, which contains the full sequence of the game (so that the server can be completely stateless – yes, the client keeps the state of the game). What happens next is:

  1. The frontend picks a random job queue.
  2. The POST request is packetized into an internal protocol format.
  3. The web frontend opens a TCP connection to the job queue, and then sends a dummy packet (a single zero).
  4. The job queue accepts the connection, and then puts the connection into a queue. As engines are ready, the queue is drained and assigned to available engines.
  5. When a connection is assigned an engine, the job queue responds with a dummy packet, signaling the web frontend that an engine is ready to handle the job.
  6. The web frontend sends the game state, which gets forwarded to the engine. The engine does the ‘thinking’ and then returns the next move – either the board coordinate, a pass, or a ‘resign’ response.
  7. The job queue forwards the response back to the web frontend, which gets forwarded back to the client. The connection is closed.
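The frontend’s half of steps 3 and 5–7 can be sketched as follows. This is a hypothetical reconstruction, with the TCP connection abstracted behind a `Transport` of send/receive callables (where a receive returning nothing stands in for a timeout) so the flow is self-contained:

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// A packet is a sequence of 16-bit integers (see the protocol description).
using Packet = std::vector<std::uint16_t>;

// Abstract transport standing in for the TCP connection to the job queue.
struct Transport {
    std::function<void(const Packet&)> send;
    std::function<std::optional<Packet>()> recv;  // nullopt on timeout
};

// Frontend side of the exchange: send the dummy packet, wait for the
// ready signal, then send the game state and read back the move.
std::optional<Packet> request_move(Transport& t, const Packet& game) {
    t.send(Packet{0});                // step 3: dummy packet (a single zero)
    auto ready = t.recv();            // step 5: wait until an engine is assigned
    if (!ready) return std::nullopt;  // timed out: abort, answer 'error-retry'
    t.send(game);                     // step 6: send the game state
    return t.recv();                  // step 7: the engine's move (or timeout)
}
```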

There is a 15-second timeout on each step. If anything goes wrong, the whole sequence is aborted and the frontend responds to the client with an ‘error-retry’ response. The job queue cleans up any aborted sockets, and the workers also clean themselves up if the socket is pulled abruptly.

Frontend connections terminate as soon as one job is handled (this is PHP – I don’t think we can share socket resources between different sessions), while worker connections are persistent and kept alive as long as the worker is functional. In case a worker crashes unexpectedly, it is re-launched immediately.

The protocol itself is ultra-simple, and used for both frontend-to-jobqueue communication and jobqueue-to-engine communication. A request/response consists of multiple 16-bit integers where the first integer is always the packet size.

A request packet contains the move sequence (board coordinates) of the whole game. To prevent integer overflows, the server simply truncates any game longer than 1000 moves to 1000; it is the client’s responsibility to avoid games longer than that.

The response is either a zero-length packet (for the dummy handshake) or a packet carrying a single integer, the move. We could add other encodings (e.g., komi or other strength settings), but that is something for the future.
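As a concrete sketch of the request framing – this is my reading of the description, and in particular whether the size field counts the header itself is an assumption (here it counts every integer in the packet, header included):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical encoder for the request format: a stream of 16-bit
// integers, the first being the packet size (here: total integer count,
// including the size field itself -- an assumption).
constexpr std::size_t kMaxMoves = 1000;

std::vector<std::uint16_t> encode_request(std::vector<std::uint16_t> moves) {
    if (moves.size() > kMaxMoves)  // server-side overflow guard: truncate
        moves.resize(kMaxMoves);
    std::vector<std::uint16_t> packet;
    packet.reserve(moves.size() + 1);
    packet.push_back(static_cast<std::uint16_t>(moves.size() + 1));
    packet.insert(packet.end(), moves.begin(), moves.end());
    return packet;
}
```

Under that assumption, encoding a three-move game yields a four-integer packet: {4, m1, m2, m3}.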

The next part covers the engine itself – a customized version of leela-zero.

Hi!

First, a quick introduction… I am a regular contributor to leela-zero (http://github.com/gcp/leela-zero/), and this project is to see if it is possible to implement a scalable baduk-on-a-cloud website at a reasonable cost (that is, without needing to charge the users a fortune).

Why?

To me, the main motivation for doing these kinds of things is to keep learning new things.  I figured out that the only way to learn how to do things is to spend your time actually writing code.  Additionally, the only way to make sure I can write things is… to actually write things.

Naturally, writing a blog is also something that people can learn by… trying.  Hence the whole effort is to learn how to write things and maintain a website that describes how things work.

What?

The main constraints that I have are:

  • This can only be a part-time thing.  I have a full-time job and a family I need to take care of.  Probably all I can spend is a couple of hours a week.
  • Since there are no commitments, this also means I can’t spend much $$$ buying equipment – the best I can do is use whatever stuff I have, plus maybe a few servers from some public cloud provider.

So a few design decisions:

  • Try spending compute resources on the local desktop (equipment I already have).  GPUs on the cloud are expensive (a Tesla V100 runs around 80 cents per hour even in preemptible mode), and having a single GPU running on the cloud is going to burn hundreds of dollars every month.  I can’t blame any of the cloud providers, because these cards cost around $10K – hence whatever can be offloaded to somewhere else… will be offloaded.
  • Make it scale – yes, this is what I am trying to learn.  Making things scale includes implementing redundancy and having the appropriate failover mechanisms.  The main goal is to eliminate as many single points of failure as possible.
  • Write things from scratch whenever appropriate – there are many great open-source projects that provide pretty much anything.  The whole point is to learn how to write code, not learn how to use somebody else’s code.
  • Don’t keep anything valuable on the cloud – it’s not like I can commit that I will take good care of other people’s private data.  This is going to turn into a helluva mess with all sorts of bad practices everywhere.  If things go wrong, the last resort I have is to shut it down (and maybe leave a short apology).  I can’t commit to cleaning up whatever disaster might happen.

So?


cbaduk is going to run on a four-tier architecture.  Each component sends the request to the next tier and gets the response back – naturally, the request is the board state of the player, and the response is the next move.

  • The logic of the game itself will be written in javascript and run on the client web browser.  The servers’ main responsibility is to ‘think’ – that is, play baduk.
  • The web frontend is responsible for communicating with the client – serving HTML files and providing data in some form of code the web browser can run.
  • The job queue distributes the gameplay to the relevant engines.
  • The engine plays baduk and generates the next move.
  • The NN server does the neural net evaluation for the engine.  Normally the NN server is part of the game engine, but for compute efficiency it is better to do evaluations in parallel.

There are multiple instances of each component, and each component will attempt to query one of the instances of the next tier.  For example, if one job queue is unresponsive, the web frontend will fall back to an alternative job queue.  When the job queue finds that one of the engines is not responsive, it will reroute the job to a different engine.
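The fallback amounts to trying each candidate instance in order and only giving up when all of them fail. A minimal sketch with hypothetical names – a real frontend would express “failure” as a connect or read timeout rather than a callable returning nothing:

```cpp
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Try each backend instance in order; the first one that answers wins.
// A backend is modeled as a callable returning an answer, or nullopt on
// failure (timeout, refused connection, ...).
using Backend = std::function<std::optional<std::string>(const std::string&)>;

std::optional<std::string> query_with_failover(
        const std::vector<Backend>& backends, const std::string& request) {
    for (const auto& b : backends) {
        if (auto reply = b(request))   // fall through to the next on failure
            return reply;
    }
    return std::nullopt;  // every instance failed: report an error upstream
}
```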

(Well, eventually… for now the whole system consists of one web frontend, one job queue, and a couple of engines).

The next couple of posts will cover each component.

Happy hacking 🙂