
Comments (3)

lightvector commented on August 15, 2024

There's no explicit way to do this right now in terms of memory bytes. Thanks for the suggestion about a memory limit, though; I will consider adding one in a future release. But it might be tricky due to the complexities mentioned below, so I am not immediately optimistic about getting to it soon.

Right now, you can control the memory by setting the sizes of things in the config. The thing that should consume the bulk of the memory under any normal settings is the NN cache, which is controlled by the nnCacheSizePowerOfTwo parameter in the .cfg file.
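For illustration, the setting in a KataGo .cfg file looks like this (the value 18 is just an example, not a recommendation for your hardware):

```
# NN cache holds at most 2^18 = 262144 neural net results
nnCacheSizePowerOfTwo = 18
```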

So for example, if it were set to 18, then at most 2 ** 18 = 262144 neural net results will be cached. Neural net results also accumulate in the MCTS tree (roughly one per visit), but these are shared with the entries in the cache, so this mostly only matters if you run searches with so many visits that you exceed the cache size, at which point the MCTS tree starts accumulating lots of results that no longer fit in the cache.

Each neural net result consists of a policy (19 * 19 + 1 floats = 362 floats = 1448 bytes) plus maybe a dozen other values, perhaps another 50 bytes, so roughly 1500 bytes in total. With some math you can figure out a "safe" size for the NN cache given how much memory you want to use. But note also that if you are requesting ownership predictions (such as via kata-analyze's ownership true), the memory usage will roughly double, because each neural net result will then also need to store another 361 floats for the ownership map.
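To make that concrete, here is a rough back-of-the-envelope sketch (my own illustration; the per-entry byte counts are the approximations from above, not exact figures from KataGo's source):

```python
# Rough NN cache memory estimate for KataGo, using the approximate
# per-entry sizes discussed above. These are ballpark figures only.

def nn_cache_bytes(power_of_two, ownership=False):
    """Estimate total NN cache memory for a given nnCacheSizePowerOfTwo."""
    entries = 2 ** power_of_two
    policy_bytes = (19 * 19 + 1) * 4   # 362 float32s for the policy = 1448 bytes
    misc_bytes = 50                    # assorted other outputs (rough guess)
    per_entry = policy_bytes + misc_bytes
    if ownership:
        per_entry += 19 * 19 * 4       # 361 more float32s for the ownership map
    return entries * per_entry

# nnCacheSizePowerOfTwo = 18 works out to a few hundred MB without
# ownership, and roughly double that with ownership enabled.
print(nn_cache_bytes(18))
print(nn_cache_bytes(18, ownership=True))
```

Running the estimate in reverse also works: given a memory budget, pick the largest power of two whose estimate fits.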

Lastly, mind the note in the main readme at https://github.com/lightvector/KataGo about memory fragmentation. When performing self-play for many hours or days while also writing out training data, I found while training KataGo that even though the actual semantic memory usage of the self-play workers stayed small and bounded (as confirmed by debugging tools such as Valgrind), the physical memory use grew almost arbitrarily large over time, because the default malloc implementation in gcc/g++ did a poor job of avoiding memory fragmentation. Switching to a better memory allocator, TCMalloc, fixed this and allowed the self-play machines to run for weeks with hundreds of game threads without issues.

Outside of self-play, I have not always used TCMalloc; for example, some months ago I was running bulk match tests on AWS where I hadn't installed the TCMalloc libraries yet, and it was fine. So I think (although I'm not 100% sure) that memory fragmentation problems are limited to using the default glibc malloc while doing bulk self-play, perhaps specifically while also recording and writing out training data.

Let me know if this helps!

from katago.

bvandenbon commented on August 15, 2024

nnCacheSizePowerOfTwo will do the trick!
Thank you for the clear explanation.


bvandenbon commented on August 15, 2024

And I think it's sufficient really.
There's no need to add an additional setting. This one is fine.

