This was updated a while ago, but I didn't really update the blog due to being just… lazy. 🙂 Yes, cbaduk.net plays handicap games, and it has been playing handicap games for the last three months! There were a lot of internal changes along the way – sometimes it got worse, sometimes it played blatantly stupid moves, and now it seems that I have something.
The basic idea is to have a third head, as I wrote previously on https://github.com/leela-zero/leela-zero/issues/2331
- The problem is that when a player is in a losing situation, it should introduce uncertainty and try to build a larger territory even if there is no realistic chance of winning.
- Thus, it seems that we need another output plane – in this case, board occupancy seemed to be a useful feature. So, I added two outputs, each used for predicting the end state of the board. To be specific, the third head looks like this:
```python
# endstate head
conv_st = self.conv_block(flow, filter_size=1,
                          input_channels=self.RESIDUAL_FILTERS,
                          output_channels=2,
                          name="endstate_head")
h_conv_st_flat = tf.reshape(conv_st, [-1, 2 * 19 * 19])
W_fc4 = weight_variable("w_fc_4", [2 * 19 * 19, (19 * 19) * 2])
b_fc4 = bias_variable("b_fc_4", [(19 * 19) * 2])
self.add_weights(W_fc4)
self.add_weights(b_fc4)
h_fc4 = tf.add(tf.matmul(h_conv_st_flat, W_fc4), b_fc4)
```
- To train this head, we need to play each game to the very end – so instead of resigning, the game enters an 'acceleration mode' (10 playouts per move) once the losing side passes the resignation threshold.
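As a rough illustration (the threshold value, playout numbers, and function name here are my own assumptions, not cbaduk.net's actual code), the switch from resigning to accelerated play could look like:

```python
RESIGN_THRESHOLD = 0.05  # assumed resignation threshold (5% winrate)
ACCEL_PLAYOUTS = 10      # tiny budget, just enough to finish the game

def next_move_budget(winrate, normal_playouts, accelerated):
    """Return (playouts, accelerated): once the side to move drops below
    the resignation threshold, stop resigning and finish the game with a
    minimal playout budget so the final board state can be recorded."""
    if accelerated or winrate < RESIGN_THRESHOLD:
        return ACCEL_PLAYOUTS, True
    return normal_playouts, False
```

Once a game enters acceleration mode it stays there, so both sides play cheap moves until the board's final occupancy can be scored.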
- The 'endstate' plane is used as an auxiliary signal for the value head – to be specific, the winrate is 80% from the value output and 20% from the endstate net, using this formula:
```
endstate_winrate = tanh( avg_delta * confidence / 10.0 )
avg_delta = average( number_of_my_stones - number_of_opponent_stones + komi_bias )
confidence = average( (v - 0.5) * (v - 0.5) for v in endstate_plane )
```
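A minimal sketch of that blend, assuming the endstate plane holds per-point ownership probabilities in [0, 1] and that both winrate estimates are expressed in [-1, 1] – this is my reading of the formula, not the actual implementation:

```python
import math

def endstate_winrate(endstate_plane, komi_bias=0.0):
    """Winrate estimate from the endstate head alone."""
    n = len(endstate_plane)
    # expected score per point: v > 0.5 leans toward my stones,
    # v < 0.5 toward the opponent's; komi_bias shifts the balance
    avg_delta = sum(2.0 * v - 1.0 for v in endstate_plane) / n + komi_bias
    # confidence: mean squared distance from "completely unsure" (0.5)
    confidence = sum((v - 0.5) ** 2 for v in endstate_plane) / n
    return math.tanh(avg_delta * confidence / 10.0)

def blended_winrate(value_head, endstate_plane, komi_bias=0.0):
    """80/20 blend of the value head and the endstate estimate,
    assuming both live on the same [-1, 1] scale."""
    return 0.8 * value_head + 0.2 * endstate_winrate(endstate_plane, komi_bias)
```

Note how a board full of 0.5 ("no idea who owns anything") yields zero confidence, so a chaotic position pulls the estimate toward even, exactly the behavior described below.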
That is, the winrate is calculated as the product of the expected score and the uncertainty – so the engine will prefer playing a chaotic game rather than giving the opponent clear territory.
The idea seemed quite good, except that it needs a terribly large amount of compute. Leela Zero has already spent millions of dollars' worth of compute resources to get to its current strength, and all I have is three NVIDIA GTX 1080s. Spending my compute resources on all that was going nowhere. Progress was slow, and even after months it would play somewhat sensibly and then… play blatantly stupid moves. It seemed that I couldn't really afford to do all that training.
Instead, I tried a different approach… what if I just collected Leela Zero self-play data and recreated the endstate plane from it? To elaborate,
- Parse the self-play data and create a list of moves and policy data for each move,
- Feed those moves and policy data to Leela Zero (via a custom command) and enter acceleration mode immediately after the end of the input.
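The replay step above could be driven over GTP; here is a toy sketch of building the command sequence for one game. The custom command name `replay_endstate` is made up for illustration, and handicap placement and the policy-data channel are omitted:

```python
def gtp_replay_script(moves, accel_command="replay_endstate"):
    """Build a GTP command sequence that replays one recorded game and
    then asks the engine to finish it in acceleration mode.
    `moves` are GTP coordinates like "D4"; Black moves first."""
    cmds = ["boardsize 19", "clear_board"]
    for i, mv in enumerate(moves):
        color = "B" if i % 2 == 0 else "W"
        cmds.append(f"play {color} {mv}")
    cmds.append(accel_command)  # hypothetical custom command
    return cmds
```

Piping such scripts into the engine in batches is one way the 100K games could be churned through without any self-play at all.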
This strategy resulted in processing 100K games in roughly 40 hours. Yes, it isn't perfect, since the original games were played with a komi of 7.5, but it is probably much better than any amount of compute I can afford.
So, how does it play?
Give it a try! I played a couple of six-stone handicap games (obviously I was playing Black), and it makes aggressive attacks, trying to capture Black's stones. In most of the games I ended up having at least one dragon slaughtered and then losing by a dozen points. 🙂