
Comments (3)

lightvector commented on August 15, 2024

There's no explicit way to do this right now in terms of memory bytes. Thanks for the suggestion about a memory limit, though; I will consider adding one in a future release. But it might be tricky due to the complexities mentioned below, so I am not immediately optimistic about getting to it soon.

Right now, you can control the memory by setting the sizes of things in the config. The thing that should consume the bulk of the memory under any normal settings is the NN cache, which is controlled by the nnCacheSizePowerOfTwo parameter in the .cfg file.
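For illustration, the setting in a KataGo .cfg file looks like this (the value 18 is just an example, not a recommendation for your hardware):

```
# NN cache holds at most 2^18 = 262144 neural net results
nnCacheSizePowerOfTwo = 18
```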

So for example, if it were set to 18, then at most 2 ** 18 = 262144 neural net results will be cached. Neural net results also accumulate in the MCTS tree (roughly one per visit), but these are shared with the entries in the cache, so this mostly only matters if you run searches with so many visits that you exceed the cache size, at which point the MCTS tree starts accumulating lots of results that no longer fit in the cache.

Each neural net result consists of a policy (19 * 19 + 1 floats = 362 floats = 1448 bytes) plus maybe a dozen other values, perhaps another 50 bytes, so roughly 1500 bytes in total. With some math you can figure out a "safe" size for the NN cache given how much memory you want to use. But note also that if you are requesting ownership predictions (such as via kata-analyze's ownership true), the memory usage will roughly double, because each neural net result will then also need to store another 361 floats for the ownership map.
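To make that concrete, here is a rough back-of-the-envelope sketch (my own illustration; the per-entry byte counts are the approximations from above, not exact figures from KataGo's source):

```python
# Rough NN cache memory estimate for KataGo, using the approximate
# per-entry sizes discussed above. These are ballpark figures only.

def nn_cache_bytes(power_of_two, ownership=False):
    """Estimate total NN cache memory for a given nnCacheSizePowerOfTwo."""
    entries = 2 ** power_of_two
    policy_bytes = (19 * 19 + 1) * 4   # 362 float32s for the policy = 1448 bytes
    misc_bytes = 50                    # assorted other outputs (rough guess)
    per_entry = policy_bytes + misc_bytes
    if ownership:
        per_entry += 19 * 19 * 4       # 361 more float32s for the ownership map
    return entries * per_entry

# nnCacheSizePowerOfTwo = 18 works out to a few hundred MB without
# ownership, and roughly double that with ownership enabled.
print(nn_cache_bytes(18))
print(nn_cache_bytes(18, ownership=True))
```

Running the estimate in reverse also works: given a memory budget, pick the largest power of two whose estimate fits.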

Lastly, mind the note in the main readme at https://github.com/lightvector/KataGo about memory fragmentation. When performing self-play for many hours or days while also writing out training data, I found while training KataGo that even though the actual semantic memory usage of the self-play workers stayed small and bounded (as confirmed by debugging tools such as Valgrind), the physical memory use grew almost arbitrarily large over time, because the default malloc implementation in gcc/g++ did a poor job of avoiding memory fragmentation. Switching to a better memory allocator, TCMalloc, fixed this and allowed the self-play machines to run for weeks with hundreds of game threads without issues.

Outside of self-play, I have not always used TCMalloc; for example, some months ago I was running bulk match tests on AWS where I hadn't installed the TCMalloc libraries yet, and it was fine. So I think (although I'm not 100% sure) that memory fragmentation problems are limited to using the default glibc malloc while doing bulk self-play, perhaps specifically while also recording and writing out training data.

Let me know if this helps!

from katago.

bvandenbon commented on August 15, 2024

nnCacheSizePowerOfTwo will do the trick!
Thank you for the clear explanation.


bvandenbon commented on August 15, 2024

And I think it's sufficient really.
There's no need to add an additional setting. This one is fine.

