First a quick introduction… I am a regular contributor to leela-zero (http://github.com/gcp/leela-zero/), and this project is to see if it is possible to implement a scalable baduk-on-a-cloud website at a reasonable cost (that is, without needing to charge the users a fortune).
To me, the main motivation for doing this kind of thing is to keep learning new things. I figured out that the only way to learn how to do things is to spend time actually writing code. Additionally, the only way to make sure I can write things is… to actually write things.
Naturally, writing a blog is also something that people can learn by… trying. Hence the whole effort is to learn how to write things and maintain a website that describes how things work.
The main constraints that I have are:
- This can only be a part-time thing. I have a full-time job and a family I need to take care of. Probably all I can spend is a couple of hours a week.
- Since there are no commitments, this also means I can’t spend much $$$ buying equipment – the best I can do is use whatever stuff I have, plus maybe a few servers from some public cloud provider.
So a few design decisions:
- Try to spend compute resources on the local desktop (equipment I already have). GPUs on the cloud are expensive (a Tesla V100 runs around 80 cents per hour even in preemptible mode), and keeping a single GPU running on the cloud is going to burn hundreds of dollars every month. I can’t blame any of the cloud providers because these cards cost $10K – hence whatever can be offloaded to somewhere else… will be offloaded.
- Make it scale – yes, this is what I am trying to learn. Making things scale includes implementing redundancy and having the appropriate failover mechanisms. The main goal is to eliminate as many single points of failure as possible.
- Write things from scratch whenever appropriate – there are many great open-source projects that provide pretty much anything. The whole point is to learn how to write code, not learn how to use somebody else’s code.
- Don’t keep anything valuable on the cloud – it’s not like I can commit to taking good care of other people’s private data. This is going to turn into a helluva mess with all sorts of bad practices everywhere. If things go wrong, the last resort I have is to shut it down (and maybe leave a short apology). I can’t commit to cleaning up whatever disaster might happen.
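To put the "hundreds of dollars every month" figure from the GPU bullet above into numbers, here is a quick back-of-the-envelope calculation. The 80 cents per hour comes from the text; the 30-day month and the around-the-clock usage are my own simplifying assumptions.

```python
# Figure from the post: about $0.80/hour for a preemptible Tesla V100.
HOURLY_RATE_USD = 0.80
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # assumption: a 30-day month, GPU never switched off


def monthly_gpu_cost(hourly_rate: float, gpus: int = 1) -> float:
    """Cost of keeping `gpus` GPUs running non-stop for one month."""
    return hourly_rate * HOURS_PER_DAY * DAYS_PER_MONTH * gpus


print(f"${monthly_gpu_cost(HOURLY_RATE_USD):.2f} per month")
```

That works out to roughly $576 per month for one always-on GPU, which is why the heavy compute stays on the desktop.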
cbaduk is going to run on a four-tier architecture. Each component sends the request to the next tier and gets the response back – naturally, the request is the board state of the player, and the response is the next move.
- The web frontend is responsible for communicating with the client – it serves HTML files and provides data in some form of code the web browser can run.
- The job queue distributes the gameplay to the relevant engines.
- The engine plays baduk and generates the next move.
- The NN server does the neural net evaluation for the engine. Normally the NN server is part of the game engine, but for compute efficiency it is better to do evaluations in parallel.
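The four tiers listed above can be sketched as a chain of plain function calls: board state goes in at the top, the next move comes back out. This is only an illustration, not the actual cbaduk code. Every name here (`BoardState`, `web_frontend`, `job_queue`, `engine`, `nn_server`) is hypothetical, the "neural net" is a placeholder that rates every empty point equally, and in the real system each arrow would be a network call rather than a local one.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Point = Tuple[int, int]


@dataclass
class BoardState:
    size: int
    stones: List[Point] = field(default_factory=list)  # occupied points


def nn_server(board: BoardState) -> Dict[Point, float]:
    """Tier 4: placeholder evaluation, uniform over all empty points."""
    empty = [(x, y) for y in range(board.size) for x in range(board.size)
             if (x, y) not in board.stones]
    return {p: 1.0 / len(empty) for p in empty}


def engine(board: BoardState) -> Point:
    """Tier 3: ask the NN server and pick the highest-rated point."""
    policy = nn_server(board)
    return max(policy, key=policy.get)


def job_queue(board: BoardState) -> Point:
    """Tier 2: hand the job to an engine (only one engine in this sketch)."""
    return engine(board)


def web_frontend(request: dict) -> dict:
    """Tier 1: translate the browser's request and send back the reply."""
    board = BoardState(size=request["size"],
                       stones=[tuple(p) for p in request["stones"]])
    x, y = job_queue(board)
    return {"move": [x, y]}


reply = web_frontend({"size": 9, "stones": [[0, 0], [4, 4]]})
print(reply)
```

Keeping each tier behind a single function-shaped interface is what makes it possible to later run several copies of a tier and swap one for another.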
There are multiple instances of each component, and each component will attempt to query one of the multiple instances of the next tier. For example, if one job queue is unresponsive, the web frontend will fall back to an alternative job queue. When the job queue finds that one of the engines is not responsive, it will reroute the job to a different engine.
(Well, eventually… for now the whole system consists of one web frontend, one job queue, and a couple of engines).
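The fallback behavior described above could look something like the sketch below: try each instance of the next tier in turn, and move on when one does not answer. This is a minimal illustration under my own assumptions, not the real implementation. The helper `call_with_failover`, the `AllInstancesFailed` exception, and the two fake engines are all hypothetical names, and "unresponsive" is modeled simply as the call raising `ConnectionError` or `TimeoutError`.

```python
from typing import Callable, List, Sequence


class AllInstancesFailed(Exception):
    """Raised when every instance of the next tier was unresponsive."""


def call_with_failover(instances: Sequence[Callable[[str], str]],
                       request: str) -> str:
    """Send `request` to the first instance that answers."""
    errors: List[Exception] = []
    for instance in instances:
        try:
            return instance(request)
        except (ConnectionError, TimeoutError) as exc:
            # This instance is treated as down; fall back to the next one.
            errors.append(exc)
    raise AllInstancesFailed(f"{len(errors)} instance(s) failed: {errors}")


# Two fake engines standing in for remote ones (hypothetical).
def dead_engine(board: str) -> str:
    raise TimeoutError("engine A did not answer")


def live_engine(board: str) -> str:
    return "D4"


move = call_with_failover([dead_engine, live_engine], "board state")
print(move)
```

The same helper would serve both hops mentioned above: the web frontend choosing a job queue, and the job queue choosing an engine.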
The next couple of posts will cover each component.
Happy hacking 🙂