
Comments (15)

brilee commented on July 19, 2024

For go.N, this one needs to be easily changeable because unit tests run on N=9 but selfplay can run on 9 or 19.

For the number of selfplay workers: this one gets set in Kubernetes configuration files, so it's doomed to manual fiddling, unlike the Python configs, which we have some hope of taming.

from minigo.

artasparks commented on July 19, 2024

I was thinking about having a YAML configuration for the hyperparams. That would make it easy to change / grok.


brilee commented on July 19, 2024

YAML doesn't allow us to do multiplication/division, which I think we'd need. In terms of flags, we've been using argparse + manual threading of values. We could continue doing that or use tf.flags or equivalent - thoughts on that front?
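The argparse approach described here can be sketched roughly as follows (the flag names are illustrative, not minigo's actual flags); note how a derived value like a game-length cap is trivial in Python but has no equivalent in plain YAML:

```python
import argparse

# Hypothetical flags for illustration; minigo's real flag names may differ.
parser = argparse.ArgumentParser(description="selfplay worker")
parser.add_argument("--board-size", type=int, default=19,
                    help="go.N: 9 for unit tests, 19 for real selfplay")
parser.add_argument("--readouts", type=int, default=800,
                    help="MCTS readouts per move")
args = parser.parse_args(["--board-size", "9"])

# Derived values are easy in code but impossible in a static YAML file:
max_game_length = args.board_size * args.board_size * 2
```

The "manual threading" part is then just passing `args` (or individual values) down through function calls.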


amj commented on July 19, 2024

I've been thinking about the difference between "operating" parameters, which configure the pipeline, vs. the more traditional "hyperparameters" that control, e.g., the NN architecture. Conceptually, the operating params are about the physical realities of memory, workers, directories, etc., while the hyperparams are the various knobs that are replaceable as constants/functions/etc.

that divides it as:

Operating:

  • go.N
  • number of workers
  • directory locations
  • type of machine/accelerator
  • root parallelism / game batch size
  • number of concurrent files
  • shuffle buffer size
  • number of examples to gather per shard (this should not be a hyperparam)
  • here's a big one: what version of code is deployed ;P

"Traditional" hyperparameters:

  • the neural network params as above
  • the MCTS params as above
  • the training params as above
  • the target number of games per generation

So the 'migration' of training rate vs. data-creation rate is a good example of how my worse-is-better implementation of orchestration causes problems: by not having a way for the selfplay jobs to self-limit, or for the training job to assert that it has enough new data to start, I've turned what should be a hyperparam (how many new pieces of data until we try to train again) into a set of operating parameters (how long to sleep in between, how many workers to run in parallel, etc.). Ideally we'd also add some guardrails so the max/min number of games per generation can be set and actually "enforced".

I will say that I really like the 'makeshift hyperparams dict' because it's legible and easy to read. I also like that some parameters can be computed at runtime from other parameters, e.g., the d-noise, the game depth, the move threshold. I think moving to YAML for hyperparameter conf would lose us that flexibility.
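A minimal sketch of what such a dict might look like, with entries computed at runtime from the board size (the names and formulas here are illustrative, not minigo's actual values):

```python
# Board size: 9 in unit tests, 19 in production selfplay.
N = 9

# A "makeshift hyperparams dict" where some knobs derive from others.
# All names and constants below are hypothetical examples.
hparams = {
    "N": N,
    # Dirichlet noise concentration scaled to the number of board points,
    # in the style of AlphaZero-like implementations:
    "dirichlet_noise_alpha": 0.03 * 361 / (N * N),
    # Cap on game depth, derived from board size:
    "max_game_length": int(N * N * 1.5),
    # Move number after which play becomes deterministic:
    "softpick_move_cutoff": (N * N) // 12,
}
```

This is exactly the flexibility a static YAML file would lose: the derived entries update automatically when `N` changes.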

However, I think the operating parameters could very well be set as YAML conf (or some other config language), without any loss of flexibility or expressiveness.


amj commented on July 19, 2024

re: tf.flags vs. argparse: I've really liked argparse so far, better than the g3-ish flags of tf.flags.


amj commented on July 19, 2024

Another way to think about the difference: "operating" parameters change how/where/how fast the work gets done, but they don't change the actual "work" of training the net itself.

I.e., changing the root parallelism or the number of workers doesn't actually change how long training to a given strength should take. But changing the no. of layers or the no. of readouts could result in really different numbers of steps-to-a-given-strength.


amj commented on July 19, 2024

While we're at it, it'd be nice to have models loaded with different hyperparameters not die unhelpfully.

E.g., our 19x19 models share k=128 filters; if we change k and try to load them: kaboom, explosions, and no helpful message.


artasparks commented on July 19, 2024

I was thinking about this over the weekend -- if we had a separate config file we could serialize, we could store it in the same directory as the data.

In fact, we could create different versions of the hyperparam data for when we change variables midway through (like the learning rate).

E.g.,
000000-bootstrap-hyperparams.json
....
000140-flapper-jack-hyperparams.json
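A rough sketch of that serialization scheme (the helper names are hypothetical, but the filename pattern matches the examples above):

```python
import json
import os

def save_hparams(model_dir, model_name, hparams):
    """Serialize hyperparams next to the model data, e.g.
    000140-flapper-jack-hyperparams.json in the models directory."""
    path = os.path.join(model_dir, f"{model_name}-hyperparams.json")
    with open(path, "w") as f:
        json.dump(hparams, f, indent=2, sort_keys=True)
    return path

def load_hparams(model_dir, model_name):
    """Read back the hyperparams a given model was trained with."""
    path = os.path.join(model_dir, f"{model_name}-hyperparams.json")
    with open(path) as f:
        return json.load(f)
```

Loading the stored file before restoring a checkpoint would also give a place to fail with a *helpful* message when, say, k doesn't match.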


brilee commented on July 19, 2024

So here's my current thinking on hyperparams.

There are a few channels to pass in hyperparams, broadly speaking: cmdline flags, environment variables, magic files on GCS that are fetched, and a persistent server that accepts basic HTTP PUT/POST commands to update current process params.

Furthermore, we have three contexts in which we'd like to overwrite hyperparams: training/eval, selfplay, and unit / integration testing.

I think having an HTTP server is overkill, and selfplay workers aren't that persistent anyway. Magic files on GCS have some precedent: selfplay workers already scan the GCS models directory to figure out what the latest checkpoint is. Env variables are a natural-ish way that Docker prefers for configuring workers (but really annoying to set otherwise), and changing Dockerfiles also means committing spurious changes to the git repo every time a hyperparam is updated. Flags are pretty easy for a human operator to set, but really annoying to update in Docker containers (which are hardcoded to run a given command).

So I think we should use some combination of flags (to override hyperparams during training) and magic files on GCS (to override hyperparams for selfplay). And for unit tests / integration tests, a context manager in Python can provide scoped overrides of hyperparams.
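The scoped-override idea for tests might look something like this (the global `HPARAMS` dict and the function name are hypothetical, for illustration only):

```python
import contextlib

# Hypothetical module-level hyperparams, stand-in for minigo's real config.
HPARAMS = {"N": 19, "readouts": 800}

@contextlib.contextmanager
def override_hparams(**overrides):
    """Temporarily override hyperparams for a unit/integration test,
    restoring the originals on exit even if the test raises."""
    saved = {k: HPARAMS[k] for k in overrides}
    HPARAMS.update(overrides)
    try:
        yield HPARAMS
    finally:
        HPARAMS.update(saved)
```

A test would then do `with override_hparams(N=9): ...` and the production values are restored automatically afterwards.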


amj commented on July 19, 2024

@brilee I like flags + GCS files as a solution -- how do you divide which params go where? Any general intuition?

(Updating flags for Docker containers is not so bad; I have my 'running head' CL, I make and push the image from that, the job picks up the new container, no problem.)

If selfplay workers became persistent (e.g., cloud TPU), would we want to do an HTTP server? It seems like the evaluation cluster is going to have some overhead re: message queues, a web frontend, etc. Which means if we're going to have the infrastructure already in place for that, it could be worth it to consider what a full 'config server' solution looks like.


georgedahl commented on July 19, 2024

How about argparse for operating configuration and https://www.tensorflow.org/api_docs/python/tf/contrib/training/HParams for metaparameters? HParams can be stored as human readable pbtxt.


amj commented on July 19, 2024

@georgedahl that sounds like a good division; perhaps the hparams are stored at just the root of the bucket?


brilee commented on July 19, 2024

@georgedahl I'm wondering what hparams would buy us over a liberal sprinkling of tf.flags?


georgedahl commented on July 19, 2024

Easy saving in the metagraph with its registered to_proto function, easy to use with hyperparameter tuning services that support it (cloud ML engine probably has one), similarity to other TF codebases.
An object to pass around that doesn't carry every other flag with it, and that basically acts like a dict while also supporting dot notation.

Other than that, very little. I don't think these benefits are necessarily overwhelming, but they aren't nothing either.
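For readers unfamiliar with it, the dict-plus-dot-notation behavior described above can be approximated in a few lines of plain Python (this is a toy stand-in, not `tf.contrib.training.HParams` itself):

```python
class HParams:
    """Toy sketch of an HParams-like object: behaves like a dict of
    hyperparameters but also supports dot notation and string parsing."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def values(self):
        """Return the hyperparameters as a plain dict."""
        return dict(self.__dict__)

    def parse(self, spec):
        """Override values from a "name=value,name=value" string,
        coercing each value to the existing parameter's type."""
        for pair in spec.split(","):
            name, value = pair.split("=")
            self.__dict__[name] = type(self.__dict__[name])(value)
        return self
```

The real class adds serialization (e.g. to pbtxt via its registered to_proto function), which is the part that integrates with the metagraph and tuning services.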


brilee commented on July 19, 2024

Being finished as part of #316.

