Comments (6)

amj avatar amj commented on August 18, 2024

At a minimum, I'd like a way for workers to back off (go to sleep) once there are enough games for the given model, and a way for the trainer to wait until there are enough new games before training. Literally anything would be an improvement on the current method.

from minigo.

brilee avatar brilee commented on August 18, 2024

Two problems:

  • The trainer dies; workers generate an obscene number of games for just one generation.
  • Workers get pre-empted; the trainer keeps training on the same data repeatedly, potentially overfitting.

Solutions:

  • (For the trainer dying): cap the number of games per generation. (Q: you would want to know the sum of games played and games currently in progress for a given generation, which is not easy to do. How can we prevent overshooting the number of games? Does this even matter? If the number of workers << the number of games desired, then there's only so much we can overshoot by.)
  • (For workers dying): only initiate training if there are sufficiently many new games.
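
The two checks above, and the overshoot bound, might look roughly like the sketch below. This is not minigo's actual code: the function names and parameters are hypothetical, and it assumes each worker checks the finished-game count only before starting a game, so that at most one in-flight game per worker can land after the cap is reached.

```python
def should_play(games_finished: int, cap: int) -> bool:
    """Worker-side check (hypothetical): start a new game only while
    the current generation is still under its cap."""
    return games_finished < cap


def should_train(new_games: int, min_new_games: int) -> bool:
    """Trainer-side check (hypothetical): train only once enough new
    games have arrived since the last training run."""
    return new_games >= min_new_games


def worst_case_total(cap: int, num_workers: int) -> int:
    """Upper bound on games for one generation, assuming every worker
    may still finish one game that was in flight when the cap was hit."""
    return cap + num_workers


if __name__ == "__main__":
    # With 1000 workers and a 10,000-game cap, the overshoot is bounded
    # by 1000 games, i.e. 10% of the cap.
    print(worst_case_total(10_000, 1000))
```

Under that assumption the overshoot never exceeds the worker count, which is why it matters little when workers << games desired.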

brilee avatar brilee commented on August 18, 2024

Another thought: if we have 1000 workers, each completing a game every 5 +/- 3 minutes, then why not just output checkpoints every 5 minutes? After all, if a worker is going to start a new game, why not just grab the latest weights? Having 10,000 games in a generation seems unnecessarily stale. In a regime where each worker plays only 1 game per generation on average, the above solutions may work less well.
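
To make the staleness argument concrete, here is a back-of-the-envelope calculation using the numbers from this comment. It is only a sketch: it uses the mean of 5 minutes per game and ignores the +/- 3 minute spread, and the function names are made up for illustration.

```python
def games_per_minute(num_workers: int, minutes_per_game: float) -> float:
    """Fleet-wide throughput, assuming every worker is always playing."""
    return num_workers / minutes_per_game


def minutes_to_fill(generation_size: int, num_workers: int,
                    minutes_per_game: float) -> float:
    """How long the fleet needs to produce one full generation of games."""
    return generation_size / games_per_minute(num_workers, minutes_per_game)


def games_per_checkpoint(checkpoint_minutes: float, num_workers: int,
                         minutes_per_game: float) -> float:
    """Games produced between two checkpoints written at a fixed interval."""
    return checkpoint_minutes * games_per_minute(num_workers, minutes_per_game)


if __name__ == "__main__":
    rate = games_per_minute(1000, 5.0)             # games per minute
    fill = minutes_to_fill(10_000, 1000, 5.0)      # minutes per 10k generation
    per_ckpt = games_per_checkpoint(5.0, 1000, 5.0)
    print(rate, fill, per_ckpt, per_ckpt / 1000)   # last value: games per worker
```

With these inputs, a 10,000-game generation takes about 50 minutes to fill (roughly 10 games per worker on the same weights), while 5-minute checkpoints yield about 1000 games each, or about 1 game per worker.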

amj avatar amj commented on August 18, 2024

Partially addressed by #75, which prevents the workers from going crazy if the training job falls down. (Still waiting to see if that actually works in the real world.)

For training in proportion to some targeted amount of new data, I'm OK punting on this until we have TPU training running.

amj avatar amj commented on August 18, 2024

I'm OK targeting 10k games/generation for this one, not worrying about getting fewer, and putting a hard cap of 15k per generation. @brilee, any further thoughts?
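
One way a worker could honor the 15k hard cap is to sleep with exponential backoff until the newest generation has room again. The sketch below is an assumption-laden illustration, not what minigo does: `count_games` is a hypothetical callable returning the number of games recorded for the latest model, and the delay values (30 s initial, doubling, 600 s maximum) are arbitrary.

```python
import time
from typing import Callable, Iterator

HARD_CAP = 15_000  # hard cap per generation, from the comment above


def backoff_delays(initial: float = 30.0, factor: float = 2.0,
                   maximum: float = 600.0) -> Iterator[float]:
    """Yield sleep durations that grow geometrically up to `maximum`."""
    delay = initial
    while True:
        yield min(delay, maximum)
        delay = min(delay * factor, maximum)


def wait_for_capacity(count_games: Callable[[], int],
                      cap: int = HARD_CAP,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Block until the latest generation has fewer than `cap` games.

    `count_games` is a hypothetical hook; in practice it would list the
    game records for the newest model. Returns how many times we slept.
    """
    sleeps = 0
    for delay in backoff_delays():
        if count_games() < cap:
            return sleeps
        sleep(delay)
        sleeps += 1
    return sleeps  # not reached: backoff_delays() never ends
```

Passing `sleep` as a parameter keeps the loop testable without real waiting; a worker would call `wait_for_capacity(...)` before starting each game.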

brilee avatar brilee commented on August 18, 2024

We have some crude backoff mechanisms now.
