Compressing the NN evaluation results : part 1

I opened this discussion: https://github.com/gcp/leela-zero/issues/2130

I was thinking a bit today, and I will try implementing something that might be useful - a disk-backed NNCache. Two use cases I can see:
 - Use the cache file as an opening book - keep the NN eval results for the first 30 or so moves
 - Pre-computed analysis : precompute the evaluation results for an sgf file overnight, then use the precomputed NN eval results later when analyzing the game. The benefit is that the NN eval results will still be useful when playing out variations of the sgf file.
With compression I think we can store tens of millions of eval results per GB - there are a couple of easy compression ideas, since a NN eval result consists mostly of zero or near-zero values.

In the cbaduk.net context, the main purpose is to cache the common opening sequences. Most of the compute load for Leela Zero is the neural net computation, so anything that makes that part more efficient is an improvement.

So, what kind of problem are we facing?

The input of a NN evaluation is the current board state – the last eight positions of the board, plus whether the current player is black or white. The output of the net consists of 363 floating-point numbers (see the sketch after this list):

  • (Policy) 362 numbers for the probability of playing each move (0.0~1.0) – promising moves get higher values while unlikely-to-be-useful moves get near-zero values.
  • (Value) 1 number for how likely the current player is to win (-1.0~1.0)
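
As a rough mental model, the data we want to cache per position looks like the sketch below. This is just an illustration for this post – the struct and field names are made up, not Leela Zero's actual types:

#include <array>

// Illustrative layout of one cached evaluation (hypothetical names).
struct NNEvalResult {
    std::array<float, 362> policy;  // 361 board points + 1 pass move, each in [0.0, 1.0]
    float value;                    // how likely the current player is to win, in [-1.0, 1.0]
};
// 363 floats at 4 bytes each: 1452 bytes per evaluation before compression.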

363 floating-point numbers × 4 bytes per number is roughly 1.5kB, which means caching a million moves takes about 1.5GB. A single GTX 1080 can generate about 300 evaluations per second on a 40×256 net (the most complex nets), which means this adds up to roughly 30GB every single day.

The dictionary probably has to grow into the billions: even if we only count 2 variations per move, the total number of combinations quickly reaches 2^30 (≈1 billion), which is 1.5TB – not only is it expensive to store, but it will also require a lot of disk access. We would probably end up slower than a GPU if we kept it all on a single disk.

Luckily, we can cheat a bit:

  • We are dealing with probabilities, and most of the 362 moves will be silly moves that end up with values close to zero.
  • The NN results do not need to be precise. It’s not like we will pick one move over another because its probability changed from 85% to 60% – both are large enough to be picked almost every time.

To put it simply, we can do a lossy compression.

How lossy can we go?

To see how lossy we can go, I devised an experiment – a 480-game match between four engines:

  • (base) The baseline engine
  • (quantize) An engine that quantizes the policy output into 256 discrete steps
x = static_cast<int>(x * 256);
x = x / 256;
  • (quantize-sqrt) An engine that quantizes the policy output into 256 discrete steps after square-rooting it. Here the smaller numbers retain more accuracy while the larger numbers lose some.
x = sqrt(x);
x = static_cast<int>(x * 256);
x = x / 256;
x = x * x;
  • (quantize-pow2) An engine that quantizes the policy output into 256 steps after squaring it. Here the larger numbers retain accuracy while the smaller numbers lose it.
x = x * x;
x = static_cast<int>(x * 256);
x = x / 256;
x = sqrt(x);

…and here is the result:

                   A       B       C       D
A base             -     38-42   33-47   58-22
B quantize       42-38     -     43-37   64-16
C quantize-sqrt  47-33   37-43     -     65-15
D quantize-pow2  22-58   16-64   15-65     -

What we can see here is that simply quantizing the policy values into 256 steps is still good enough. However, for some reason, we lose strength when we lose information in the smaller values (quantize-pow2), while losing precision in the larger values (quantize-sqrt) doesn’t seem to matter much.

From now on, I will quantize the numbers into 2048 steps – this is 8x more steps than what we found to be safe, but this is to avoid any unintended consequences, e.g., results going haywire when applied to a different net or a different board size.

What are we compressing?

To compress things effectively, the first thing to do is to see what kind of data is common and what is uncommon. To do so, I created NN evaluation dumps from two self-play games, comprising 487,883 evaluations. Then I quantized them into 2048 steps to see which values are the most common:

Value       Count          %
    0   165614777   93.7724
    1     3765003    2.13177
    2      948082    0.536811
    3      561482    0.317915
    4      401759    0.227479
    5      311689    0.176481
    6      252915    0.143202
    7      209827    0.118806
    8      177416    0.100454
    9      151993    0.0860596

Yes, 94 out of 100 values get quantized to zero. This makes perfect sense, since in any game most moves do not make sense. There are usually only a handful of moves worth considering, so most moves get an extremely low probability. This means that the single most important part of the scheme is how we encode the large number of zeros.
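
The tallying behind the table above is straightforward. Here is a minimal sketch, assuming the dumped evaluations have already been loaded into memory as 362-element float arrays (the dump/load code is omitted):

#include <array>
#include <cstdint>
#include <vector>

// Count how often each quantized value (0..2047) appears across all dumped
// policy outputs.
std::array<std::uint64_t, 2048> histogram(const std::vector<std::array<float, 362>>& dumps) {
    std::array<std::uint64_t, 2048> counts{};
    for (const auto& policy : dumps) {
        for (float x : policy) {
            int v = static_cast<int>(x * 2048.0f);
            if (v > 2047) v = 2047;  // guard the x == 1.0 edge case
            counts[v]++;
        }
    }
    return counts;
}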

Step 1 : Variable Length Coding

Since small values are very likely and large values are very unlikely, we can use a different number of bytes to record them. So, here is the simple scheme:

  • Value 0 ~ 247 : write the value as a single byte as-is
  • Value 248 ~ 2047 : the first byte is 248 plus the top 3 bits of the value, and the next byte holds the bottom 8 bits

So, basically something like this:

// 'in_array' holds the 362 policy values, each in [0.0, 1.0].
std::vector<std::uint8_t> ret;
for(auto x : in_array) {
    int v = static_cast<int>(x * 2048.0);
    if(v > 2047) v = 2047;  // keep the x == 1.0 edge case within 11 bits
    if(v < 248) {
        // common case: one byte
        ret.push_back(v);
    } else {
        // rare case: escape byte (248..255) holds the top 3 bits,
        // the next byte holds the bottom 8 bits
        ret.push_back(248 + v / 256);
        ret.push_back(v % 256);
    }
}
return ret;

This resulted in an average of 363.921 bytes per encoding. Not bad considering that we started from 362 floating-point numbers (1448 bytes). This also shows that most of the values on the nets are smaller than 248 – to be exact, only 1.921 values out of 362 on average are 248 or larger (that is, above roughly 0.12 before quantization).
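
For completeness, the matching decoder is just the reverse – a sketch, assuming the same 248 cut-off and 2048-step quantization as above:

#include <cstdint>
#include <vector>

std::vector<float> decode(const std::vector<std::uint8_t>& in) {
    std::vector<float> out;
    for(std::size_t i = 0; i < in.size(); i++) {
        int v;
        if(in[i] < 248) {
            // single-byte value
            v = in[i];
        } else {
            // escape: low 3 bits of the first byte + the next byte form the value
            v = (in[i] - 248) * 256 + in[i + 1];
            i++;
        }
        out.push_back(v / 2048.0f);
    }
    return out;
}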

Step 2 : Run Length Coding

We know that most of the values are zeros. So, rather than writing ‘zero, zero, zero, zero, …’, we can record them as ‘ten zeros’. So, we modify our encoding a bit:

  • Value 0 ~ 239 : write the value as a single byte as-is
  • Value 240 ~ 2047 : the first byte is 248 plus the top 3 bits of the value (so its five MSBs are all ones), and the next byte holds the bottom 8 bits
  • If we get N consecutive values of M (M between 0 and 7), the run gets recorded as (240+M, N). That is, seven consecutive zeros are recorded as (240, 7)

Here’s the code, a bit more complicated:

// 'in_array' holds the 362 policy values, each in [0.0, 1.0].
std::vector<uint8_t> ret;

int prev_value = -1;  // value of the run currently being collected (-1 = none)
int prev_cnt = 0;     // length of that run
int runcount = 0;     // statistics: number of runs emitted
int runsum = 0;       // statistics: values covered by those runs

for(auto x : in_array) {
    int v = static_cast<int>(x * 2048.0);
    if(v > 2047) v = 2047;  // keep the x == 1.0 edge case within 11 bits

    if(prev_value == v && prev_cnt < 255) {
        // same value again: extend the current run
        prev_cnt++;
    } else {
        // flush the pending run, if any
        if(prev_cnt > 0) {
            if(prev_cnt == 1) {
                ret.push_back(static_cast<uint8_t>(prev_value));
            } else {
                ret.push_back(static_cast<uint8_t>(prev_value + 240));
                ret.push_back(static_cast<uint8_t>(prev_cnt));
                runcount++;
                runsum += prev_cnt;
            }
        }
        prev_value = -1;
        prev_cnt = 0;
        if(v < 8) {
            // small values (0..7) may form runs, so don't emit them yet
            prev_value = v;
            prev_cnt = 1;
        } else if(v < 240) {
            ret.push_back(v);
        } else {
            ret.push_back(248 + v / 256);
            ret.push_back(v % 256);
        }
    }
}
// flush the final pending run
if(prev_cnt > 0) {
    if(prev_cnt == 1) {
        ret.push_back(static_cast<uint8_t>(prev_value));
    } else {
        ret.push_back(static_cast<uint8_t>(prev_value + 240));
        ret.push_back(static_cast<uint8_t>(prev_cnt));
        runcount++;
        runsum += prev_cnt;
    }
}
return ret;

This time we got 52.215 bytes per encoding. Considering that only about one out of 17 values is non-zero, we get a lot of long runs of consecutive zeros, each of which collapses into just 2 bytes.
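
Again for reference, a decoder for this format might look like the sketch below (same conventions as the encoder above):

#include <cstdint>
#include <vector>

std::vector<float> decode_rle(const std::vector<uint8_t>& in) {
    std::vector<float> out;
    for(std::size_t i = 0; i < in.size(); i++) {
        int b = in[i];
        if(b < 240) {
            // plain single-byte value 0..239
            out.push_back(b / 2048.0f);
        } else if(b < 248) {
            // run marker: (240 + M, N) means N consecutive values of M
            int value = b - 240;
            int count = in[i + 1];
            i++;
            for(int k = 0; k < count; k++) {
                out.push_back(value / 2048.0f);
            }
        } else {
            // escape: low 3 bits of the first byte + the next byte form the value
            int v = (b - 248) * 256 + in[i + 1];
            i++;
            out.push_back(v / 2048.0f);
        }
    }
    return out;
}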

Step 3 : Huffman Coding

We can see that some of the 256 possible byte values are used very frequently while others are used rarely. An ideally compressed output would have every possible byte value appear roughly equally often. So, what does the output distribution look like?

Value       Count          %
...
---------------------------
   14      323995    1.27182
   13      327487    1.28553
   10      334606    1.31348
    9      338045    1.32698
   11      341306    1.33978
   15      351303    1.37902
    8      370494    1.45435
   18      381949    1.49932
   16      389688    1.5297
  249      396717    1.55729
    7      416312    1.63421
  241      447475    1.75654
    6      470702    1.84771
   17      481525    1.8902
    5      569659    2.23616
    4      726237    2.8508
    0      785929    3.08512
    3      949443    3.72699
    2     1722779    6.76267
    1     2550610   10.0123
  240     6550212   25.7125

Not good at all. 25% of the bytes are 240 (runs of zeros) and 10% are single ones. There are quite a few 241s (runs of ones) and 249s (escape bytes for values in the 256~511 range), but other than that, it’s mostly small numbers (0~16).
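
One way to estimate how much headroom an entropy coder like Huffman still has is to compute the Shannon entropy of this byte distribution – anything below 8 bits per byte is potential gain. A small sketch (I won't quote a number here, since the table above is truncated):

#include <array>
#include <cmath>
#include <cstdint>

// Shannon entropy in bits per output byte, given how often each of the 256
// possible byte values appears. A Huffman or arithmetic coder can approach
// this bound; 8.0 would mean no further gain is possible.
double bits_per_byte(const std::array<std::uint64_t, 256>& counts) {
    std::uint64_t total = 0;
    for(auto c : counts) total += c;
    double entropy = 0.0;
    for(auto c : counts) {
        if(c == 0) continue;
        double p = static_cast<double>(c) / total;
        entropy -= p * std::log2(p);
    }
    return entropy;
}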

At this point, it seems we can either keep adding more tricks or just use a more general solution – which I will continue in another post.

The Engine

The “Engine” is a customized version of leela-zero.

This is a fairly faithful reimplementation of the system described in the AlphaGo Zero paper “Mastering the Game of Go without Human Knowledge“. For all intents and purposes, it is an open-source AlphaGo Zero.

https://github.com/gcp/leela-zero

AlphaGo 101

This may help a bit:

https://medium.com/applied-data-science/alphago-zero-explained-in-one-diagram-365f5abf67e0

In a nutshell, AlphaGo (and most recent Go / Baduk engines that involve some form of ‘deep learning’) adopts two key ideas:

  • Monte-Carlo Tree Search
  • Neural net based policy / value evaluation

A game of Baduk can be represented as a tree-search problem. Each node of the tree is a possible ‘state’ of the board, while each edge represents a move that can be made from that state. The goal of any Baduk-playing engine is to figure out which move leads to a win regardless of how the opponent plays.

However, these trees are far too large to search exhaustively, so the only way to do this efficiently is to prune unpromising paths – that is, don’t bother trying very bad moves. This is similar to a human ‘reading’ the board: a skilled player reads the patterns on the board and tries only the promising-looking moves. This takes time and experience.

This is where the neural net comes into play – the neural net takes the game state as its input, picks a list of promising moves, and estimates how likely the current player is to win. It is trained by feeding it a lot of games so that it learns which board patterns are promising and which aren’t. This greatly reduces the size of the tree that needs to be searched – so much so that searching around 5,000 nodes is enough to beat most mere mortals (which takes roughly 10 seconds on modern GPUs).
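
To make the interplay a bit more concrete, the search typically picks which child to descend into with a PUCT-style score that blends the NN’s move prior with the win rate observed so far, as in the AlphaGo Zero paper. A simplified sketch (the node layout here is illustrative, not leela-zero’s actual code):

#include <cmath>
#include <vector>

// Illustrative tree node: the NN prior plus accumulated search statistics.
struct Node {
    float prior = 0.0f;     // policy output for the move leading to this node
    int visits = 0;         // N(s,a)
    float value_sum = 0.0f; // sum of evaluations, so Q(s,a) = value_sum / visits
    std::vector<Node> children;
};

// PUCT-style selection: high-prior or high-winrate children get explored
// first, so low-prior "silly" moves are rarely visited at all.
Node* select_child(Node& parent, float c_puct = 1.0f) {
    Node* best = nullptr;
    float best_score = -1e9f;
    float sqrt_total = std::sqrt(static_cast<float>(parent.visits) + 1.0f);
    for(auto& child : parent.children) {
        float q = child.visits > 0 ? child.value_sum / child.visits : 0.0f;
        float u = c_puct * child.prior * sqrt_total / (1.0f + child.visits);
        if(q + u > best_score) {
            best_score = q + u;
            best = &child;
        }
    }
    return best;
}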

For most NN applications, getting a high-quality data set is the hard part, but for Baduk and other board games, the engine can generate an effectively unlimited amount of data by playing against itself.

So, what’s running here?

On top of leela-zero, some of my own patches are added:

  • The “distributed” patch (https://github.com/ihavnoid/leela-zero/tree/distributed) which offloads the NN evaluation to a separate process
  • The “batched” patch (https://github.com/ihavnoid/leela-zero/tree/batch-full) which does parallel NN evaluation
  • The “tensorcore” patch (https://github.com/ihavnoid/leela-zero/tree/tensorcore) which implements the inline assembly code for tensor cores

Note that none of the patches tweaks the algorithm itself at this moment – these are merely performance optimizations. One interesting patch that is worth further explanation is the “distributed” patch:

The “distributed” patch

The “distributed” patch splits the MCTS process and the NN process into two separate binaries, which can run on different machines across a network. Initially the goal was to create a Baduk-playing engine that spans a cluster of machines – that is, to simply throw an insane amount of compute power at the problem to increase the engine’s strength. Eventually it turned into an engine running 160 parallel tree-search threads against 16 GPUs: one MCTS client machine and four NN machines, each with 4 Tesla V100 GPUs.

However, this can also be done the other way around – have one NN machine serve multiple MCTS client machines. GPUs are inherently massively parallel, so running multiple NN evaluations at once is much more efficient than evaluating one position at a time. In the cbaduk case, the goal is to be scalable but still interesting enough to play against, so it makes sense to have one GPU play multiple games concurrently, each with a moderately shallow search tree.

So, what happens here is (sketched in code after the list):

  • Each “engine” – the leela-zero process – connects to an NN server instead of doing the evaluation itself. One engine plays one game at a time.
  • The NN server services the NN evaluation requests of multiple engines and returns the results to them.
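
In rough code, the split might look like the sketch below. The names here are hypothetical and only meant to illustrate the division of labor – the actual patch has its own wire format and batching logic:

#include <array>

// Hypothetical interface: the engine ships the encoded board state over the
// network and gets the policy/value back, instead of running the net locally.
struct NNRequest  { std::array<float, 18 * 19 * 19> input_planes; };  // 18 input planes, 19x19 board
struct NNResponse { std::array<float, 362> policy; float value; };

class NNConnection {
public:
    // Blocks until the server has batched this request with others and
    // returned the result; on failure the engine can reconnect to another
    // server and retry.
    NNResponse evaluate(const NNRequest& request);
};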

The good thing about this architecture is that it is not only efficient but also provides a good fallback mechanism. If an engine loses contact with an NN server, it can migrate to another functional server while still retrying the original one. If an engine crashes, the NN server still has plenty of work to do, because other engines are running and the job scheduler re-routes games to the right machines.

(Well, this doesn’t mean that I am really running a massive number of machines right now – currently it’s just 1~3 NN server GPUs with 3 engines, 1 job queue, and 1 web frontend machine. But it should be scalable, at least in theory. 🙂 )