
dora's Introduction

Dora The Explorer, a friendly experiment manager

tests badge linter badge

Dora logo, picturing a schematic Dora in front of a computer.

A demo of a Dora grid search

Table of Contents

Installation

# For bleeding edge
pip install -U git+https://github.com/facebookincubator/submitit@main#egg=submitit
pip install -U git+https://git@github.com/facebookresearch/dora#egg=dora-search

# For stable release
pip install -U dora-search

What's up?

See the changelog for details on releases.

  • 2022-06-09: version 0.1.10: added HiPlot support! Updated PL support, many small fixes.
  • 2022-02-28: version 0.1.9
  • 2021-12-10: version 0.1.8: see the changelog, many small changes.
  • 2021-11-08: version 0.1.7: support for job arrays added.
  • 2021-10-20: version 0.1.6 released, bug fixes.
  • 2021-09-29: version 0.1.5 released.
  • 2021-09-07: added support for a git_save option. This will ensure that the project git is clean and make a clone from which the experiment will run. This does not apply to dora run for easier debugging (but you can force it with --git_save).
  • 2021-06-21: added support for Hydra 1.1. Be very careful if you update to Hydra 1.1, there are some non-backward-compatible changes in the way group configs are parsed, see the Hydra release notes for more information.

(FB Only) If you are using Dora and want to receive updates on bug fixes and new versions, ping me (@defossez) on Workchat.

Introduction

Dora is an experiment launching tool which provides the following features:

  • Grid search management: automatic scheduling and canceling of the jobs to match what is specified in the grid search files. Grid search files are pure Python, and can contain arbitrary loops, conditions etc.
  • Deduplication: experiments are assigned a signature based on their arguments. If you ask twice for the same experiment to be run, it won't be scheduled twice, but merged into the same run. If your code handles checkpointing properly, any previous run will be automatically resumed.
  • Monitoring: Dora supports basic monitoring from inside the terminal. You can customize the metrics to display in the monitoring table, and easily track progress, and compare runs in a grid search.

Some Dora concepts:

  • A Grid is a Python file with an explorer function, wrapped in a dora.Explorer. The explorer function takes a dora.Launcher as argument. Call the dora.Launcher repeatedly with different sets of hyper-parameters to schedule different experiments.
  • An XP is a specific experiment. Each experiment is defined by the arguments passed to the underlying experimental code, and is assigned a signature based on those arguments, for easy deduplication.
  • A signature is the unique XP identifier, derived from its arguments. You can use the signature to uniquely identify the XP across runs, and easily access logs, checkpoints, etc.
  • A Sheep is the association of a Slurm/Submitit job, and an XP. Given an XP, it is always possible to retrieve the last Slurm job that was associated with it.

Making your code compatible with Dora

In order to derive the XP signature, Dora must know about the configuration schema your project follows, as well as the parsed arguments for a run. Dora supports two backends for that: argparse and Hydra. On top of that, Dora provides a smooth integration with PyTorch Lightning for projects that use it.

In all cases, you must have a specific Python package (which we will call myproj here), with a train module in it (i.e. a myproj.train module, stored in the myproj/train.py file).

The train.py file must contain a main function that is properly decorated, as explained hereafter.

Argparse support

Here is a template for the train.py file:

import argparse
from dora import argparse_main, get_xp

parser = argparse.ArgumentParser("mycode.train")
...


@argparse_main(
    dir="./where_to_store_logs_and_checkpoints",
    parser=parser,
    exclude=["list_of_args_to_ignore_in_signature, e.g.", "num_workers",
             "can_be_pattern_*", "log_*"],
    use_underscore=True,  # flags are --batch_size vs. --batch-size
    git_save=False,  # if True, scheduled experiments will run from a separate clone of the repo.
)
def main():
    # No need to reparse args, you can directly access them from the current XP
    # object.
    xp = get_xp()
    xp.cfg # parsed arguments
    xp.sig  # signature for the current run
    xp.folder  # folder for the current run, please put your checkpoint relative
               # to this folder, so that it is automatically resumed!
    xp.link  # link object, can send back metrics to Dora

    # If you load a previous checkpoint, you should always make sure
    # that the Dora Link is consistent with what is in the checkpoint with
    # history = checkpoint['history']
    # xp.link.update_history(history)

    for t in range(10):
        xp.link.push_metrics({"loss": 1/(t + 1)})
    ...
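
Checkpointing itself is left to your code. Here is a minimal resume sketch (assuming xp.folder behaves like a pathlib.Path; the checkpoint.pt file name and the state dict keys are hypothetical choices of this sketch):

import torch
from dora import get_xp


def restore(model: torch.nn.Module) -> None:
    xp = get_xp()
    checkpoint_path = xp.folder / "checkpoint.pt"  # hypothetical checkpoint file name
    if checkpoint_path.exists():
        state = torch.load(checkpoint_path)
        model.load_state_dict(state["model"])
        # Keep the Dora Link consistent with the metrics that were already pushed.
        xp.link.update_history(state["history"])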

Hydra support

The template for train.py:

from dora import hydra_main, get_xp


@hydra_main(
    config_path="./conf",  # path where the config is stored, relative to the parent of `mycode`.
    config_name="config"  # a file `config.yaml` should exist there.
)
def main(cfg):
    xp = get_xp()
    xp.cfg # parsed configuration
    xp.sig  # signature for the current run
    # Hydra run folder will automatically be set to xp.folder!

    xp.link  # link object, can send back metrics to Dora
    # If you load a previous checkpoint, you should always make sure
    # that the Dora Link is consistent with what is in the checkpoint with
    # history = checkpoint['history']
    # xp.link.update_history(history)

    for t in range(10):
        xp.link.push_metrics({"loss": 1/(t + 1)})
    ...

You can customize Dora's behavior from the config.yaml file, e.g.:

my_config: plop
num_workers: 40
logs:
    interval: 10
    level: info

dora:
    exclude: ["num_workers", "logs.*"]
    dir: "./outputs"
    git_save: true  # set git_save option for the project.

PyTorch Lightning support

Deprecated: Due to a lack of internal use for PL, this only works with fairly old versions of PL. We are not planning on updating the support for PL.

Dora supports PyTorch Lightning (PL) out of the box. Dora will automatically capture logged metrics (make sure to use per_epoch=True), and handles distribution (you should not pass gpus=... or num_nodes=... to PL).

import dora.lightning


@dora.argparse_main(...)
def main():
    xp = dora.get_xp()
    args = xp.cfg
    # Replace the PyTorch Lightning `Trainer(...)` with the following:
    trainer = dora.lightning.get_trainer(...)
    # Or when using argparse parsing:
    trainer = dora.lightning.trainer_from_argparse_args(args)

See examples/pl/train.py for a full example including automatic reloading of the last checkpoint, logging etc.

Important: Dora deactivates the default PL behavior of dumping a mid-epoch checkpoint upon preemption, as this leads to non-deterministic behavior (as PL would skip this epoch upon restart). Dora assumes you save checkpoints from time to time (e.g. every epoch). To get back the old behavior, pass no_unfinished_epochs=False to get_trainer. See examples/pl/train.py for an example of how to implement checkpointing in a reliable manner.

Distributed training support (non PyTorch Lightning)

Dora supports distributed training, and makes a few assumptions for you. You should initialize distributed training through Dora, by calling in your main function:

import dora.distrib
dora.distrib.init()

Note: This is not required for PyTorch Lightning users (see the PL section above); everything will be set up automatically for you :)
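
For reference, here is a minimal sketch of combining this with plain PyTorch DDP (the dora.distrib.world_size() helper used below is an assumption of this sketch, not documented above):

import torch
from torch.nn.parallel import DistributedDataParallel

import dora.distrib


def setup_distributed(model: torch.nn.Module) -> torch.nn.Module:
    # Initialize the process group from the environment prepared by Dora/Submitit.
    dora.distrib.init()
    if dora.distrib.world_size() > 1:
        # Wrap the model so that gradients are synchronized across workers.
        model = DistributedDataParallel(model)
    return model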

Git Save

You can set the git_save option on your project, see hereafter on how to do it for either argparse or Hydra based projects. When this option is set, Dora makes an individual clone of your project repository for each experiment that is scheduled. The job will then run from that clean clone. This both keeps track of the exact code that was used for an experiment, and prevents code changes from impacting pending or requeued jobs. If you reschedule a failed or cancelled job, the clone will however be updated with the current code.

In order to use this option, your code should be able to run from a fresh clone of the repository. If you need to access resources that are specified with a path relative to the original repo, use dora.to_absolute_path(). Note that this is similar to hydra.utils.to_absolute_path(). In fact, you can safely replace the Hydra version with this one: even when git_save is not set, the Dora one automatically falls back to the Hydra one (if Hydra is used).
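
For instance (a sketch, with a hypothetical relative resource path):

import dora

# Resolves the path relative to the original repository root, even when the job
# runs from the per-XP clone created by git_save.
vocab_path = dora.to_absolute_path("data/vocab.json")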

The repository must be completely clean before scheduling remote jobs, and all files should be either tracked or git-ignored. This is very restrictive, but it makes implementing this feature much simpler and safer. Also, this forces good practice :) Only the dora run command can be used on a dirty repository, to allow for easy debugging. For the dora launch and dora grid commands, you can also use the --no_git_save option to temporarily deactivate this feature.

The clone for each experiment is located inside the code/ subfolder inside the XP folder (which you can get with the dora info command for instance).

The dora command

Dora will install a dora command that is the main way to interact with it. The dora command defines 4 sub-commands, detailed in the following sections:

  • dora run: run training code locally (e.g. for debugging).
  • dora launch: launch remote jobs, useful for one-off experiments.
  • dora info: get information on a specific job/XP, logs etc.
  • dora grid: launch an entire grid search defined in a grid file. Only missing XPs will be scheduled. It will also report statuses and the latest metrics.

In order for Dora to find your code, you must pass your training package (i.e. mycode) as dora -P mycode [run|launch|grid|info]. This flag can be skipped if mycode is in the current working directory and is the only folder with a train.py file in it, in which case Dora will find it automatically. You can also export DORA_PACKAGE=mycode to avoid having to give the -P flag explicitly.

You can use a different script name than train.py with -M, --main_module, or by setting the DORA_MAIN_MODULE environment variable.

Attention: those flags should be specified BEFORE the run | launch | info | grid part: dora -P mycode run, not dora run -P mycode.
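
For instance, for an argparse-based project whose package is named mycode (a hypothetical name), the following two invocations are equivalent:

dora -P mycode run -- [TRAINING_ARGS ...]

export DORA_PACKAGE=mycode
dora run -- [TRAINING_ARGS ...]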

Examples

See the examples folder for a few examples using argparse, Hydra and PyTorch Lightning, in order to test the commands described here. To play with them, first install Dora (pip install . from the top-level of the repo), then cd examples, and use dora -P example_name ... to let Dora know which example to use!

dora run: Running XP locally

You can run an XP locally with

dora run [TRAINING_ARGS ...]

Warning: for the argparse backend, you must insert -- between the dora args and your own training args, i.e.:

dora run -- [TRAINING_ARGS ...]

dora run supports the following flags:

  • -d: distributed training using all available GPUs. The master worker's output goes to the shell, and the other workers' outputs are redirected to log files in the XP folder.
  • -f sig: this will inject the hyper-parameters from the XP with the given sig on top of the ones provided on the command line. Useful to locally resume a remote job that failed.
  • --git_save: clone the repo inside the XP folder and execute from there. This is mostly for debugging, and in general is not needed.

Multi node training without Slurm

If you do not have a Slurm cluster but still want to do multi node training, you can adapt the following example for two nodes (do not include the -d flag here, as torchrun will be responsible for launching the processes):

torchrun --master-addr NODE_1_ADDR --master-port MASTER_PORT --node_rank 0 --nnodes 2 --nproc-per-node 8 -m dora run [DORA RUN ARGS]
torchrun --master-addr NODE_1_ADDR --master-port MASTER_PORT --node_rank 1 --nnodes 2 --nproc-per-node 8 -m dora run [DORA RUN ARGS]

If you have a Slurm cluster available, you should usually prefer using the dora launch or dora grid command for managing multi node jobs.

dora launch: Launching XP remotely

Warning: This command is not recommended for serious workflows. First, it doesn't allow for advanced tuning of the Slurm config, and in almost all cases it is preferable to use the dora grid command, even for a single job, as the grid system allows for better tracking and bookkeeping of the experiments you launch on the cluster.

Dora supports scheduling experiments on Slurm. If you need to schedule many of them, a grid file is probably a better fit.

dora launch [--dev] [-g NUMBER_OF_GPUS] [TRAINING_ARGS ...]

Dora will automatically select the appropriate number of nodes and tasks per node based on the number of GPUs required, and will scale the required memory accordingly. For instance, with -g 16, Dora will schedule on 2 nodes with 8 GPUs each.

This will schedule the job, then immediately tail its log and monitor its progress, just as if it were running locally. If you want the remote job to be killed when you kill the local process, add the -a, --attach flag. To avoid tailing the log, just pass --no_tail.

If a job already exists for the given XP, Dora will not schedule a second one, but will reuse the existing job.

If a previous run has failed or was canceled, Dora will not automatically start a new one, to give you a chance to inspect the logs. If you want to reschedule a run, use the -r, --retry flag.

Other flags:

  • -f SIG: injects the arguments from the XP matching this signature, on top of the ones provided on the command line.
  • -R, --replace: replace any running job (i.e. cancels, and schedules a new one).
  • -D, --replace_done: also reschedule a job even if a previous one completed successfully.
  • -p, --partition PARTITION: partition to use.
  • -c, --comment COMMENT: comment for the job (e.g. if priority is used).
  • --clear: cancel any previous job, clear the XP folder (i.e. delete checkpoints) and reschedule.
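
For instance, a hypothetical one-off job asking for 8 GPUs on a dev partition, rescheduling it if a previous attempt failed or was cancelled:

dora launch -r -g 8 -p dev [TRAINING_ARGS ...]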

dora info: Inspecting an XP

You can get information on an XP with the dora info command:

dora info [TRAINING_ARGS ...]
dora info -f SIGNATURE
dora info -j SLURM_ID

You can either specify the XP by listing all of its training arguments, by passing its signature, or even the latest Slurm id associated with it. The info command supports a number of flags:

  • -l: print the entire log for the main task (this only works for remote jobs, not XPs run locally with dora run).
  • -t: tail the log for the main task.

dora grid: Managing a grid search

The main benefit of Dora is its ability to handle arbitrarily complex grid searches. Each grid is defined by a grid file, inside a grids package (i.e. mycode.grids.my_grid). The grid file defines an explorer function, decorated by an Explorer class. The Explorer class defines various metadata, in particular which metrics to display when calling the grid command. The explorer function takes a dora.Launcher as an argument, and should repeatedly call it to schedule experiments.

Here is an example grid search file, e.g. mycode.grids.mygrid:

from itertools import product
from dora import Explorer, Launcher

@Explorer
def explorer(launcher: Launcher):
    launcher(batch_size=128)  # Schedule an experiment with the given batch size.
    # For an argparse based project, this will get converted to the `--batch_size=128` flag
    # You can pass `use_underscore=False` to `argparse_main` to get instead `--batch-size=128`.

    sub = launcher.bind(lr=0.01)  # bind some parameter value, in a new launcher
    sub.slurm_(gpus=8)  # all jobs scheduled with `sub` will use 8 gpus.

    sub()  # Job with lr=0.01 and 8 gpus.
    sub.bind_(epochs=40)  # in-place version of bind()
    sub.slurm(partition="dev")(batch_size=64)  # lr=0.01, 8 gpus, dev, bs=64 and epochs=40.

    launcher.slurm_(gpus=16)  # Now using 16 gpus per job, i.e. 2 full nodes.
    # A nice thing about native Python: you can define arbitrary sets of XPs!
    for lr, bs in product([0.1, 0.01, 0.001], [16, 32, 64]):
        if bs > 32 and lr < 0.01:
            # this is just too extreme, let's skip
            continue
        launcher(lr=lr, batch_size=bs)

    # Job arrays are also supported.
    # The only limitation is that all jobs in an array must use exactly
    # the same slurm config.
    with launcher.job_array():
        for seed in range(1, 100):
            launcher(seed=seed)

You can then call

dora grid mygrid

This will do three things:

  • Any XP defined in the explorer function will be scheduled, if not already running or completed.
  • Any XP that was previously defined in the grid file, but is no longer referenced, will be cancelled. If you comment out one line in the grid file, the corresponding job will automatically be killed.
  • A table containing job status and metadata as well as the latest metrics will be printed every 5 minutes.

The Launcher API

Here is a more comprehensive description of what the Launcher object can do.

  • launcher.bind_(...): remembers the given parameters (command line options for argparse-based projects, or overrides for Hydra-based ones) for future scheduling, i.e. all experiments later scheduled with that launcher will have those parameters set.
  • sub = launcher.bind(...): same as bind_, but returns a new "sub" launcher, i.e. the launcher object is not changed; only experiments scheduled with sub will use the given params. sub also inherits all the params already bound to its parent launcher (i.e. previous calls to launcher.bind_). Creating a sub-launcher is especially recommended inside loops, to avoid leaking params to the next loop iteration.
  • launcher(...): schedules an experiment with the given params, plus all the ones that have been aggregated through the various calls to bind_ and to bind. This is equivalent to launcher.bind(...)().
  • launcher.slurm_(key=value, ...) and launcher.slurm(key=value, ...): same as bind_ and bind but for the Slurm config (number of GPUs, etc.). For a list of possible options, check out SlurmConf.

Now let us describe the format for passing parameter overrides or command line flags to launcher.bind_(), launcher.bind() or launcher():

  • Simple parameters (i.e. not nested) can be passed as kwargs, for instance if you have a --batch_size flag, you can do launcher.bind(batch_size=64).
  • Command line flags can be explicitly passed as a list of strings, for instance launcher.bind(['--lr=1e-4']).
  • A dictionary of overrides can be passed, for instance launcher.bind({'batch_size': 64}). Note that this also allows for nested keys in Hydra: launcher.bind({'model.channels': 256}). With Hydra, you can also define new keys with {'+model.activation': 'relu'}. You must not remove keys though.
  • Finally you can combine all of those (for a Hydra project here):
launcher.bind(['optim.lr=1e-4'], {'model.channels': 256, 'seed': 42}, {'+model.activation': 'relu'}, batch_size=64)

Flags

The dora grid command supports the following flags:

  • --init: init the given XPs so that their signature can be referenced. Launching or running an XP whose sig was not initialized will fail with FATAL: Could not find an existing run with sig yoursig.

  • -r, --retry: failed or cancelled XP within one grid file will be rescheduled.

  • -R, --replace: any running XP will be replaced by a new job.

  • -D, --replace_done: any XP in the grid that previously completed will be rescheduled.

  • -C, --cancel: cancel all XPs in a grid.

  • --clear: cancel any previous jobs, clear all XP folders (i.e. delete checkpoints) and reschedule. This will ask for confirmation first, because it is quite dangerous.

  • -i, --interval INTERVAL: the table monitoring all jobs will be updated every INTERVAL minutes, until all jobs are finished or failed.

  • -T, --trim IDX: trim all the metrics to the number of epochs of the XP with the given index inside the grid, i.e. pretend that all XPs have at most as many epochs as the XP with the given index.

  • -L, --trim_last: trim all XPs to the least advanced XP i.e. if the least advanced XP has only 3 epochs, show the metrics at epoch 3 for all XPs.

  • -f, --folder IDX: only print the folder of the XP with the given index.

  • -l, --log IDX: print the full log of the XP with the given index.

  • -t, --tail IDX: tail the log of the XP with the given index.

  • --no_monitoring: only show the table once and return.

  • --dry_run: only simulate actions.
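
For instance, a hypothetical invocation that reschedules failed or cancelled XPs and refreshes the monitoring table every 10 minutes:

dora grid mygrid -r -i 10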

Patterns

You can also pass patterns to the grid command, for instance

dora grid mygrid bs=64

will only show XPs which have bs=64 in their name. You can see the name by launching the grid normally. Names are heavily shortened to avoid running out of space; in particular, nested structures will have all of their components but the leaf shortened. You can negate a query with !, for instance dora grid mygrid '!bs=64' (use quotes because ! will otherwise be interpreted by the shell). Multiple patterns are interpreted as a logical AND between them.

Note that with the latest version (be sure to update), the --clear and -C, --cancel flags will only apply to the XPs matching the patterns. Similarly, only XPs matching those patterns will be scheduled.

Explorer class

The Explorer class lets you customize which metrics to report, and with what precision. It also gives you a chance to reorganize metrics or further post-process them (for instance, extracting the max, min, etc.). See Customize metrics displayed in Explorer hereafter for more details.

By convention, files starting with _ inside the grids folder are ignored by Dora, and are a good place to put utility code such as your custom Explorer classes. For an example with detailed comments, check out the Explorer classes for BrainMagick.

HiPlot support

Dora supports HiPlot out of the box. Make sure it is installed (pip install hiplot), then you can run

python -m hiplot dora.hiplot.load --port=XXXX

In the prompt, you can type any number of grid names or XP sigs, separated by spaces. You can customize the metrics displayed by inheriting from dora.hiplot.HiPlotExplorer in a class inside yourproject.grids._hiplot. Then, select your explorer with the command explorer=MyExplorer inside the HiPlot prompt (along with the grid names and XP sigs, in any order).

The Dora API

Dora provides an API, including the possibility to run grid searches directly from an IPython notebook. See the Dora API.

DecoratedMain

The most useful class is the DecoratedMain, which is the decorated main function in your project. You can use it to retrieve an XP object from a list of argv, or a signature:

from myproj.train import main

xp = main.get_xp_from_sig('ae43f645')
xp2 = main.get_xp_from_argv(xp.argv + ['batch_size=32'])
xp2_name = main.get_name(xp2)  # See API in dora.names.NameMixin
with xp2.enter():
    # You can pretend to be in an XP with this.
    ...

Advanced customization

If you want to do some advanced customization of the behavior of DecoratedMain (e.g. custom naming for XPs, or unusual flag parsing), feel free to inherit from dora.main.ArgparseMain or dora.hydra.HydraMain and use your custom class instead.

Grid API

You can schedule and manage grids from the Dora API rather than the command line. This is useful to manage XPs from a notebook, for instance! See dora.grid.run_grid. Flags are passed as an instance of dora.grid.RunGridArgs. Submission rules (e.g. cancel, retry etc.) are passed as a dora.conf.SubmitRules.

import dora
import dora.grid

from myproj.train import main


@dora.Explorer
def explorer(launcher):
    launcher.slurm_(gpus=2, partition='learnlab')
    launcher.bind_({
        'epochs': 200,
        'model.depth': 10,
        'batch_size': 128
    })
    launcher()
    for p in [0.005, 0.01, 0.05, 0.1]:
        sub = launcher.bind({'task.penalty': p})
        sub()
        sub(lr=1e-4)


args = dora.grid.RunGridArgs(jupyter=True, monitor=False, interval=1)
rules = dora.conf.SubmitRules(retry=True)  # Should we reschedule failed jobs?
# The run_grid function returns a list of sheeps
# each sheep has 2 attributes: sheep.xp and sheep.job_id.
sheeps = dora.grid.run_grid(main, explorer, grid_name='jupy', rules=rules, args=args)
args.monitor = True
args.jupyter = True
# The jupyter flag will make the grid API use the display API to clear the cell
# output and update it regularly. This one will not return until all jobs
# are done or failed.
# In the following, `grid_name` should be unique. It will be used
# to determine which experiments were previously scheduled with that grid
# and should potentially be cancelled if no longer needed.
dora.grid.run_grid(main, explorer, grid_name='jupy', rules=rules, args=args)
# You can retrieve the short names by using `main.get_names()`
short_names, ref_name = main.get_names([sheep.xp for sheep in sheeps])

Sharing XPs

At the moment, checkpoints and metrics cannot be directly shared (you can always manually copy an XP folder into someone else's XPs folder). However, there are now two ways to easily share an XP's hyper-params using its signature. This is useful if you want someone else to quickly reproduce your XP!

Dora import/export command

Given a list of signatures, you can export their hyper-params to a compact textual format with dora export:

dora export SIG1 [OTHER_SIG ...]

Copy paste the given string and your teammate can import it with

dora import
# now paste to stdin

The command will show you the list of imported XPs. Once an XP is imported, you can simply run it or query hyper params with dora run -f SIG, dora info -f SIG etc. From a grid file, you can programmatically retrieve the hyper-params from that XP, e.g.

from myproject.train import main

xp = main.get_xp_from_sig(SIG)
launcher.bind_(xp.argv)

Secondary shared XPs repository

You can now configure a secondary shared XPs repository where only mappings from SIG -> hyper params are stored. With Hydra you can add

dora:
    dir: outputs
    shared: /shared_folder/shared_xps

Other teammates can then reference, in any Dora command, the SIG of an XP launched by another team member.

Advanced configuration

Setting SLURM default parameters

Slurm configuration is detailed in dora/conf.py. It is a bit different from the usual Slurm config, as it tries to make it as easy as possible to change the number of GPUs without requiring you to manually compute the number of nodes, tasks per node, etc.

Slurm config flags

  • gpus (int): number of GPUs to schedule. The number of nodes and tasks per node will be automatically inferred.
  • mem_per_gpu (float): amount of memory in GB to schedule per GPU.
  • time (int): maximum duration for the job in minutes.
  • cpus_per_gpu (int): number of CPUs per GPU; this will set cpus_per_task automatically, based on the number of GPUs and one_task_per_node, unless cpus_per_task is explicitly provided.
  • cpus_per_task (int or None): number of CPUs per task.
  • partition (str): partition name.
  • comment (str): comment for the job.
  • setup (List[str]): list of shell commands to execute before the actual command. Use it for module load commands, for instance.
  • max_num_timeout (int): maximum number of requeues.
  • one_task_per_node (bool): if True, schedules a single task per node, otherwise schedules one task per GPU (default is False).

Default Slurm config

You can pass an instance of SlurmConfig to argparse_main that will be used as the default config for all dora launch commands and grid files. Grid files can override any field defined on SlurmConfig with the launcher.slurm (return new launcher) and launcher.slurm_ (in-place) methods.
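
For argparse-based projects, this could look like the following sketch (the slurm= keyword argument and the dora.conf import path are assumptions based on the description above, not verified API):

import argparse

from dora import argparse_main
from dora.conf import SlurmConfig  # assumption: SlurmConfig lives in dora.conf (see dora/conf.py)

parser = argparse.ArgumentParser("mycode.train")
...

# Default Slurm resources used by `dora launch` and grid files; the field names
# follow the list above. The `slurm=` keyword is an assumption of this sketch.
slurm = SlurmConfig(gpus=4, mem_per_gpu=30, partition="devlab", cpus_per_gpu=10)


@argparse_main(dir="./outputs", parser=parser, slurm=slurm)
def main():
    ...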

For Hydra, the default Slurm configuration is taken from the slurm entry in the yaml file, for instance:

my_param: whatever
batch_size: 42

dora:
    dir: outputs

slurm:
    partition: devlab
    mem_per_gpu: 30  # this is in GB

Customize metrics displayed in Explorer

Metrics are formatted with the treetable package, which is not heavily documented, but it should be easy enough to see how it works from this example:

from dora import Explorer
import treetable as tt


class MyExplorer(Explorer):
    test_metrics = ['sisnr', 'pesq']

    def get_grid_metrics(self):
        """Return the structure of the metrics that should be displayed in the tracking table.
        """
        # This will return a list of `tt.group`, each group will
        # be in separate parts of the table.
        return [
            tt.group("train", [
                tt.leaf("epoch"),
                tt.leaf("loss", ".3f"),  # The second argument of tt.leaf is a formatting string.
             ], align=">"),  # Align can be left (<) or right (>) and will apply on all
                             # leaves inside the group.
            tt.group("valid", [
                tt.leaf("best", ".3f"),
                tt.leaf("loss", ".3f"),
             ], align=">"),
            tt.group("test", [
                tt.leaf(name, ".3f")
                for name in self.test_metrics
             ], align=">")
        ]
        # In practice you can even have deeply nested groups etc. but honestly
        # you probably don't need that.

    def process_history(self, history):
        """This process the history obtained from the Dora Link
        into the right format for the `get_grid_metrics()` layout.
        This should return a dictionary, with one key per group, each
        being a sub-dict with one key per metric.

        It is fine for a key to be absent, things won't crash, and it will
        just not show anything there in the table.
        """
        train = {
            'epoch': len(history),
        }
        valid = {}
        test = {}
        best = float('inf')
        for metrics in history:
            train.update(metrics['train'])
            valid.update(metrics['valid'])
            # Let's say you forgot to compute the best valid loss; you can
            # fill it in here on the fly.
            best = min(valid['loss'], best)
            valid['best'] = best

            # Another reason for this for loop is that some metrics may not be
            # defined at every epoch. Let's say you compute test metrics
            # only every 10 epochs; this will then automatically report the
            # test metrics of the last epoch that evaluated them.
            if 'test' in metrics:
                test.update(metrics['test'])
        return {"train": train, "valid": valid, "test": test}

License

Dora is released under the MIT license as found in the LICENSE file.

Contributing

Before submitting any change, please run make to run unit tests and code linting.

dora's People

Contributors

adefossez, alexisthual, jadecopet, kingjr, louismartin, michaelramamonjisoa



dora's Issues

Can I train on multiple machines?

❓ Questions

I am new to Dora. I see that I can run distributed training. But is it possible to deploy learning on multiple machines? I don’t see the possibility of adding master_addr, master_port, rank. Maybe you haven’t done it yet. Perhaps I did not notice it. But it would be very cool to have such a possibility! I would be very grateful for help and tips in this matter!

Slurm Configuration

❓ Questions

I'm trying to train Demucs on a 4090 from Jupyter notebook.
I'm able to initialize the model, and retrieve its parameters from checkpoint, train the solver, and save it again.
I'm having trouble running a grid xp search though. Any help would be appreciated.

Below is what I am running, with my own custom main class, and I get the error below. I looked into the grids directory and 3909beea is there, but it cannot be accessed. There might be a problem with slurmconf on the GPU but I am not sure.

run_grid(main = train, explorer = explorer, grid_name = 'home/robertthomas/Documents/Melody-stems/demucs/demucs/grids/mdx.py', slurm = xp.cfg.slurm)

Error:

Grid: Error when trying to load old sheep 3909beea: Could not find experiment with signature 3909beea
An error happened when trying to load from /home/robertthomas/Documents/Melody-stems/demucs/outputs/grids/home/robertthomas/Documents/Melody-stems/demucs/demucs/grids/mdx.py/3909beea/job.pkl, this file will be ignored: FileNotFoundError(2, 'No such file or directory')

Missing Python3.10 Build

pip doesn't seem to be able to install dora-search on Python 3.10 at the moment, I assume because no build is present for it. Would it be possible to get one made? Thanks.

Initializing Dora xp/using Dora

I've been trying to initialize a dora experiment on two 4090s. Specifically, I am training the HTDemucs model from FB. I've run into this issue when uploading a new dataset, but now it seems whenever I initialize dora or run any command with dora, I get the following error. @adefossez

File "/home/robertthomas/.local/bin/dora", line 5, in <module>
    from dora.__main__ import main
  File "/home/robertthomas/.local/lib/python3.10/site-packages/dora/__init__.py", line 66, in <module>
    from .explore import Explorer, Launcher
  File "/home/robertthomas/.local/lib/python3.10/site-packages/dora/explore.py", line 27, in <module>
    from .shep import Shepherd, Sheep
  File "/home/robertthomas/.local/lib/python3.10/site-packages/dora/shep.py", line 25, in <module>
    from .distrib import get_distrib_spec
  File "/home/robertthomas/.local/lib/python3.10/site-packages/dora/distrib.py", line 14, in <module>
    import torch
  File "/home/robertthomas/.local/lib/python3.10/site-packages/torch/__init__.py", line 1465, in <module>
    from . import _meta_registrations
  File "/home/robertthomas/.local/lib/python3.10/site-packages/torch/_meta_registrations.py", line 7, in <module>
    from torch._decomp import _add_op_to_registry, global_decomposition_table, meta_table
  File "/home/robertthomas/.local/lib/python3.10/site-packages/torch/_decomp/__init__.py", line 169, in <module>
    import torch._decomp.decompositions
  File "/home/robertthomas/.local/lib/python3.10/site-packages/torch/_decomp/decompositions.py", line 10, in <module>
    import torch._prims as prims
  File "/home/robertthomas/.local/lib/python3.10/site-packages/torch/_prims/__init__.py", line 33, in <module>
    from torch._subclasses.fake_tensor import FakeTensor, FakeTensorMode
  File "/home/robertthomas/.local/lib/python3.10/site-packages/torch/_subclasses/__init__.py", line 3, in <module>
    from torch._subclasses.fake_tensor import (
  File "/home/robertthomas/.local/lib/python3.10/site-packages/torch/_subclasses/fake_tensor.py", line 13, in <module>
    from torch._guards import Source
  File "/home/robertthomas/.local/lib/python3.10/site-packages/torch/_guards.py", line 14, in <module>
    import sympy  # type: ignore[import]
  File "/home/robertthomas/.local/lib/python3.10/site-packages/sympy/__init__.py", line 30, in <module>
    from sympy.core.cache import lazy_function
  File "/home/robertthomas/.local/lib/python3.10/site-packages/sympy/core/__init__.py", line 4, in <module>
    from .sympify import sympify, SympifyError
  File "/home/robertthomas/.local/lib/python3.10/site-packages/sympy/core/sympify.py", line 8, in <module>
    from sympy.core.random import choice
  File "/home/robertthomas/.local/lib/python3.10/site-packages/sympy/core/random.py", line 25, in <module>
    from sympy.utilities.iterables import is_sequence
  File "/home/robertthomas/.local/lib/python3.10/site-packages/sympy/utilities/__init__.py", line 4, in <module>
    from .iterables import (flatten, group, take, subsets,
  File "/home/robertthomas/.local/lib/python3.10/site-packages/sympy/utilities/iterables.py", line 16, in <module>
    from sympy.utilities.misc import as_int
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 879, in exec_module
  File "<frozen importlib._bootstrap_external>", line 1012, in get_code
  File "<frozen importlib._bootstrap_external>", line 672, in _compile_bytecode
ValueError: bad marshal data (invalid reference)
terminate called after throwing an instance of 'c10::Error'
  what():  Number of tensor lists has to match the depth.
Exception raised from multi_tensor_apply at ../aten/src/ATen/native/cuda/MultiTensorApply.cuh:92 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb90af9e4d7 in /home/robertthomas/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7fb90af68434 in /home/robertthomas/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x17b292d (0x7fb8ac1b292d in /home/robertthomas/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10::impl::OperatorEntry::updateDispatchTableEntry_(c10::Dispatcher const&, c10::DispatchKey) + 0xe0 (0x7fb8d26cc500 in /home/robertthomas/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10::impl::OperatorEntry::updateDispatchTable_(c10::Dispatcher const&, c10::DispatchKey) + 0xb5 (0x7fb8d26cc655 in /home/robertthomas/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10::impl::OperatorEntry::deregisterKernel_(c10::Dispatcher const&, c10::optional<c10::DispatchKey>, std::_List_iterator<c10::impl::AnnotatedKernel>) + 0x3ff (0x7fb8d26cdc5f in /home/robertthomas/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10::Dispatcher::deregisterImpl_(c10::OperatorHandle const&, c10::OperatorName const&, c10::optional<c10::DispatchKey>, std::_List_iterator<c10::impl::AnnotatedKernel>) + 0x59 (0x7fb8d26bf779 in /home/robertthomas/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0xe81705 (0x7fb8ab881705 in /home/robertthomas/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #8: <unknown function> + 0x45495 (0x7fb923c45495 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: on_exit + 0 (0x7fb923c45610 in /lib/x86_64-linux-gnu/libc.so.6)
frame #10: <unknown function> + 0x29d97 (0x7fb923c29d97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #11: __libc_start_main + 0x80 (0x7fb923c29e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #12: _start + 0x25 (0x55c151d9dba5 in /usr/bin/python3)

Aborted (core dumped)

Any advice or clues on how to debug are welcomed. Thank you

dora install question

❓ Questions

I installed Dora with pip install -U dora-search. However, the dora command was not installed. How can I get the dora command? Thanks a lot.

Now I want to debug dora. Is dora parsing from the train.py file?

❓ Questions

I learned about Dora through the audiocraft project, but Dora is quite complex and I'm trying to figure out how the audiocraft program works. I see that there is a train.py file in the audiocraft package; may I ask if dora calls into that file directly?

[low importance] dora grid picks up additional packages

🐛 Bug Report

Basically, the grid.py file in dora uses pkgutil.walk_packages(). This outputs local packages in the grids/ dir, but also global ones installed by pip. So if I had a /grids/alphafold/ dir and also did "pip install alphafold" in my environment, it would unintentionally pick up all modules inside the installed alphafold package.

The fix is probably to replace pkgutil.walk_packages() in grid.py with something that only picks up local modules.

Your Environment

  • Python version:
  • Operating system:

Feature request: Specify grids by absolute path

I know we discussed this already but I'm creating this issue for future reference in case we want to implement it in the future.

It would be nice to be able to specify grids with absolute paths, e.g. dora grid grids/my_grid.py, in addition to grid names.

The advantages that I see are:

  • You can place your grids anywhere you want and run grids from everywhere
  • More intuitive in my opinion
  • Benefits for free from shell path autocompletion which is super helpful once you start having many grids!

Can not work on multi machines with multi gpus

❓ Questions

I am trying to run the program on multiple machines with multiple GPUs, but at runtime the code only finds the machines and uses a single GPU on each one. Do I need to add additional configurations to run it properly?

/home/fay.cyf/shiyi.zxh/miniconda3/envs/musicgen/lib/python3.9/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Dora directory: /home/fay.cyf/weixipin.wxp/audiocraft/audiocraft_root
[08-18 02:46:54][audiocraft.solvers.builders][INFO] - Loading audio data split valid: /home/fay.cyf/weixipin.wxp/audiocraft/egs/example
[08-18 02:46:54][audiocraft.solvers.builders][INFO] - Loading audio data split evaluate: /home/fay.cyf/weixipin.wxp/audiocraft/egs/example
[08-18 02:46:54][audiocraft.solvers.builders][INFO] - Loading audio data split generate: /home/fay.cyf/weixipin.wxp/audiocraft/egs/example
[08-18 02:46:54][root][INFO] - Getting pretrained compression model from HF facebook/encodec_32khz
[08-18 02:46:53][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[08-18 02:46:53][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[08-18 02:46:53][dora.distrib][INFO] - Distributed init: 0/2 (local 0) from env
[08-18 02:46:53][flashy.solver][INFO] - Instantiating solver MusicGenSolver for XP 9521b0af
[08-18 02:46:53][flashy.solver][INFO] - All XP logs are stored in /home/fay.cyf/weixipin.wxp/audiocraft/audiocraft_root/xps/9521b0af
/home/fay.cyf/shiyi.zxh/miniconda3/envs/musicgen/lib/python3.9/site-packages/flashy/loggers/tensorboard.py:47: UserWarning: tensorboard package was not found: use pip install tensorboard
  warnings.warn("tensorboard package was not found: use pip install tensorboard")
[08-18 02:46:53][audiocraft.solvers.builders][INFO] - Loading audio data split train: /home/fay.cyf/weixipin.wxp/audiocraft/egs/data
[08-18 02:47:34][flashy.solver][INFO] - Compression model has 4 codebooks with 2048 cardinality, and a framerate of 50
[08-18 02:47:34][audiocraft.modules.conditioners][INFO] - T5 will be evaluated with autocast as float32
[08-18 02:47:51][audiocraft.optim.dadam][INFO] - Using decoupled weight decay
[08-18 02:47:53][flashy.solver][INFO] - Model hash: e7554e7f9d6cc2dea51bd31aa3e89765bc73d1dd
[08-18 02:47:53][flashy.solver][INFO] - Initializing EMA on the model with decay = 0.99 every 10 updates
[08-18 02:47:53][flashy.solver][INFO] - Model size: 420.37 M params
[08-18 02:47:53][flashy.solver][INFO] - Base memory usage, with model, grad and optim: 6.73 GB
[08-18 02:47:53][flashy.solver][INFO] - Restoring weights and history.
[08-18 02:47:53][flashy.solver][INFO] - Loading a pretrained model. Ignoring 'load_best' and 'ignore_state_keys' params.
[08-18 02:48:00][flashy.solver][INFO] - Checkpoint source is not the current xp: Load state_dict from best state.
[08-18 02:48:00][flashy.solver][INFO] - Ignoring keys when loading best []
damo-pod7-0129:1:1 [0] NCCL INFO Bootstrap : Using bond0:33.57.143.227<0>
damo-pod7-0129:1:1 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
damo-pod7-0129:1:1 [0] NCCL INFO cudaDriverVersion 11070
NCCL version 2.14.3+cuda11.7
[08-18 02:48:06][flashy.solver][INFO] - Model hash: 776d041cbbcb8973c4968782a79f9bb63b53a727
[08-18 02:48:04][flashy.solver][INFO] - Re-initializing EMA from best state
[08-18 02:48:04][flashy.solver][INFO] - Initializing EMA on the model with decay = 0.99 every 10 updates
[08-18 02:48:03][flashy.solver][INFO] - Loading state_dict from best state.

How to run with torchrun?

❓ Questions

Torchrun is the standard recommended way to run multi-GPU, multi-machine training. How can one launch projects that are written to use dora using torchrun?

Moving a running xp from one grid to another (e.g. when refactoring) cancels the XP

I had a grid grid_a with too many experiments running, so I refactored some of its experiments into a new grid file grid_b.
While running the new grid, dora grid grid_b worked as expected and found the already-running experiments, but when I ran dora grid grid_a again it cancelled all the experiments that were now in grid_b.

It would be nice to have a way to track this scenario and only garbage collect experiments that are not linked to a grid.
And also it would be nice to ask the user for confirmation when cancelling experiments.

[Improvement] Better grid API and experience

  • Better grid API for retrieving XPs.
  • Faster explorer evaluation for Hydra.
  • Potentially an extra "confirm_cancel" config?
  • A "no_auto_cancel" flag for messing around.
  • Allow to "abandon" XPs when they should be canceled, for transferring ownership. But maybe it is better to have a mechanism to transfer ownership, e.g. the last grid to run an XP owns it.

Define Dora `outputs` dir relative to where decorated main is defined

dir: Path = Path("./outputs") # where everything will be stored

My decorated main is in myrepo/mypackage/train.py
I launched a grid search from myrepo/ with dora grid ... which created an outputs dir in myrepo/outputs/.

Now I want to analyze my runs programmatically from a notebook stored in myrepo/notebooks/mynotebook.ipynb.
However, if I run this in the notebook:

from mypackage.train import main

print(main.dora.dir)

Then it prints

'mypackage/notebooks/outputs'

Hence not the same as where my experiments are stored.
Would it make sense to define the dora outputs dir relative to where the decorated main is defined? E.g. when calling the decorator we would set dora.dir as Path(__file__).parent / "outputs" or something.

[Feature request] Allow running {first-xp or entirety} of a grid, locally.

I've been using Dora recently and it's been great.
One thing that would help my usage is an easy way to run e.g. the first XP of a grid locally, for debugging purposes.
This would be helpful for large, complex sweeps, to quickly squash issues without waiting for xps to schedule.

(So far, as a workaround, I've been printing launcher._argv and then doing dora run ${launcher._argv}.)

Python Debugger and dora

❓ Questions

Currently using the dora CLI to initiate training.

dora run -d solver=<some/solver> dset=<path/to/data> ...

What is the syntax for running the same command but using the dora python package?

Support for custom resolvers with Hydra

❓ Questions

Hi,
With hydra and more generally omegaconf, it is possible to register new resolvers to apply custom functions directly within the YAML configuration. I can for example do the following:

import hydra
from omegaconf import OmegaConf

def effective_lr(base_lr: float, batch_size: int) -> float:
    return base_lr * batch_size / 256

OmegaConf.register_new_resolver("effective_lr", effective_lr)

@hydra.main(...)
def main(cfg):
    ...

which enables me to put directly in my yaml file:

data:
    batch_size: 128
    ....
model:
    ...
    optimizer:
        _target_: torch.optim.Adam
        _partial_: true
        lr: ${effective_lr:${model.base_lr},${data.batch_size}}
    base_lr: 0.0003
    ...

However, when using Dora's hydra_main, this doesn't work anymore. Indeed when I use e.g. dora run, the first thing executed is main.get_xp and for some reason this function under the hood resolves the whole config, thus raising an InterpolationError since the custom resolver hasn't been registered yet.

The only workaround I found to resolve this issue consists in directly overriding hydra_main:

from typing import Callable, Dict, Optional

from omegaconf import DictConfig, OmegaConf

from dora import hydra_main


def my_hydra_main(config_name: str, config_path: str, extra_resolvers: Dict[str, Callable] = None, **kwargs):
    """Wrap your main function with this.
    You can pass extra kwargs, e.g. `version_base` introduced in 1.2.
    """
    extra_resolvers = extra_resolvers or {}
    for name, resolver in extra_resolvers.items():
        OmegaConf.register_new_resolver(name, resolver)
    return hydra_main(config_name=config_name, config_path=config_path, **kwargs)


@my_hydra_main(version_base="1.3",
            config_path="../configs",
            config_name="train.yaml",
            extra_resolvers={"effective_lr": effective_lr})
def main(cfg: DictConfig) -> Optional[float]:
    ...

This works quite well, however I'd have 3 questions:

  1. Is it necessary to parse the config directly when calling main.get_xp() in Dora, since anyway the cfg arg of the DecoratedMain is not resolved? If yes, why?
  2. Is there already a way to register custom resolvers while using Dora that I may have missed?
  3. If no, should I consider doing a PR to replace hydra_main by my_hydra_main in next version? Since it only adds an optional arg that is not used by the original hydra_main, hopefully it shouldn't break anything.

Thanks a lot!

Cannot import name 'hydra_main' from 'dora' on Colab or Kaggle environment

🐛 Bug Report

I wanted to run training inside colab notebook using code that utilizes dora. But after installation hydra_main seems to be missing.

from dora import hydra_main gives me following exception

<ipython-input-8-0d1fd6af93ab> in <cell line: 1>()
----> 1 from dora import hydra_main

ImportError: cannot import name 'hydra_main' from 'dora' (/usr/local/lib/python3.9/dist-packages/dora/__init__.py)

I installed it with
pip install -U dora-search

and tried few different versions.

Your Environment

Both kaggle or colab notebooks.

Or maybe this is expected and one can't use dora inside colab?

Cannot install due to requirement of "sklearn"

🐛 Bug Report

When I try to pip install dora I encountered this error:

Collecting sklearn (from dora->-r requirements.txt (line 4))
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/a4/0b/d1c703256cf293be77b7db44dbef62251fe02a97d0bef981f7120b0b0c0f/sklearn-0.0.post11.tar.gz (3.6 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [18 lines of output]
      The 'sklearn' PyPI package is deprecated, use 'scikit-learn'
      rather than 'sklearn' for pip commands.
      
      Here is how to fix this error in the main use cases:
      - use 'pip install scikit-learn' rather than 'pip install sklearn'
      - replace 'sklearn' by 'scikit-learn' in your pip requirements files
        (requirements.txt, setup.py, setup.cfg, Pipfile, etc ...)
      - if the 'sklearn' package is used by one of your dependencies,
        it would be great if you take some time to track which package uses
        'sklearn' instead of 'scikit-learn' and report it to their issue tracker
      - as a last resort, set the environment variable
        SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True to avoid this error
      
      More information is available at
      https://github.com/scikit-learn/sklearn-pypi-package
      
      If the previous advice does not cover your use case, feel free to report it at
      https://github.com/scikit-learn/sklearn-pypi-package/issues/new
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

Environment

  • Python version: Python 3.10.12
  • Operating system: Linux version 5.4.0-91-generic (buildd@lcy01-amd64-017) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021

May the requirement be modified? I found a temporary workaround:
SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True pip install dora

No training when using 2 nodes and torchrun

Thank you for adding the ability to use multiple nodes without Slurm! When I run training on one machine using torchrun, everything works without problems. But when I run it on two machines, the training freezes. The machines connect to each other and the model is loaded, but the training process does not continue; it gets stuck somewhere. Now I'm trying to figure out how to solve this problem. My launch command:

torchrun --master-addr [ip] \
--master-port [port] \
--node_rank 0 \
--nnodes 2 \
--nproc-per-node 2 \
-m dora run [ARGS]

`dora grid ... -t 0` crashes if the job hasn't logged anything yet

🐛 Bug Report

When I run dora grid with the -t parameter to monitor the logs, the command stops if the job has not logged anything yet (e.g. it is still pending).
It might be worth continuing to monitor even when the log does not exist yet.

$ dora grid my_grid -t 0
WARNING: Log .../dora_outputs/xps/d3843038/latest/61700001_0_log.out does not exist

World size by dora_distrib.world_size() is equal to 1 when I have two GPU's

❓ Questions

I am training my model on two NVIDIA 4090s, whenever the following code is run:

world_size = dora_distrib.world_size()
print(world_size)

world_size is equal to 1 even though torch.cuda.device_count() returns 2.
I tried wrapping my model in DDP and DataParallel but to no avail.

Would appreciate someone shining light on why this is happening.

Thanks

How to add the --export=ALL option to srun?

❓ Questions

Hi all,

when running grid searches, I run into RuntimeError: Could not figure out which environment the job is runnning in. Known environments: slurm, local, debug.. I have managed to manually fix this by adding the --export=ALL option to the srun command in the generated job scripts, and have seen that this can be done automatically with submitit (slurm_srun_args=["--export=ALL"]). I cannot find a way to do this with dora, are there any tricks to enabling --export=ALL with dora?

Best,
Mattias

Run a grid experiment for the first time

❓ Questions

When debugging an implementation, I don't want to launch a full grid but a single experiment of this grid instead.

Therefore, I list all experiments of my grid with dora grid mygrid --dry_run, select an experiment id mysig from this grid and launch it with dora run -f mysig.
Unfortunately, this raises FATAL: Could not find an existing run with sig mysig.
I get the same error with dora launch -f mysig.

In order to circumvent this, what I do is I launch all experiments of my grid with dora grid mygrid and cancel them right after with dora grid mygrid --cancel.

Is there a more direct way to launch an experiment for the first time?

Why is only the log file of rank > 0 created?

❓ Questions

I'm trying to debug a problem with nccl, but I can only see the worker_1.log output file.

I have two GPUs and I'm trying to see the logs, but I can only see the log of one of them. Is this check really necessary?

TIA

No stop command?

❓ Questions

Sometimes there's something wrong with the experiment and I need to stop it. I did it by killing all the processes created by dora; is this the correct way?

Dora outputs dir broken when git_save: true

🐛 Bug Report

Since I set git_save: true, the Dora output dir is nested inside the saved code dir. I.e. my checkpoints are saved here:
dora_outputs/codes/73345a0f0c11882824e3c9d2d354a1c1b82098d6/dora_outputs/xps/6658f617/lightning_logs/
instead of here: dora_outputs/xps/6658f617/lightning_logs/.


Your Environment

  • Python version: 3.9
  • Operating system: Ubuntu 20+

error with pytorch_lightning

The pytorch_lightning example leads to the following error:

jeanremi@devfair0166:~/opt/dora/examples$ export DORA_PACKAGE=pl
jeanremi@devfair0166:~/opt/dora/examples$ dora run
Traceback (most recent call last):
  File "/private/home/jeanremi/.conda/envs/ame/bin/dora", line 8, in <module>
    sys.exit(main())
  File "/private/home/jeanremi/.conda/envs/ame/lib/python3.8/site-packages/dora/__main__.py", line 205, in main
    args.action(args, main)
  File "/private/home/jeanremi/.conda/envs/ame/lib/python3.8/site-packages/dora/run.py", line 69, in run_action
    main()
  File "/private/home/jeanremi/.conda/envs/ame/lib/python3.8/site-packages/dora/main.py", line 62, in __call__
    return self._main()
  File "/private/home/jeanremi/.conda/envs/ame/lib/python3.8/site-packages/dora/main.py", line 68, in _main
    return self.main()
  File "/private/home/jeanremi/opt/dora/examples/pl/train.py", line 85, in main
    trainer = trainer_from_argparse_args(
  File "/private/home/jeanremi/.conda/envs/ame/lib/python3.8/site-packages/dora/lightning.py", line 238, in trainer_from_argparse_args
    return get_trainer(*intercept.args, **intercept.kwargs)
  File "/private/home/jeanremi/.conda/envs/ame/lib/python3.8/site-packages/dora/lightning.py", line 185, in get_trainer
    env = DoraEnvironment()
TypeError: Can't instantiate abstract class DoraEnvironment with abstract methods creates_processes_externally

Callbacks default is None

callbacks = kwargs.pop("callbacks", [])

When calling trainer = dora.lightning.get_trainer() I get

Traceback (most recent call last):
  File "/private/home/louismartin/dev/tap/tap/train.py", line 60, in main
    train(cfg=cfg)
  File "/private/home/louismartin/dev/tap/tap/train.py", line 26, in train
    trainer = dora.lightning.get_trainer()
  File "/private/home/louismartin/dev/dora/dora/lightning.py", line 175, in get_trainer
    callbacks.append(DoraCheckpointSync())
AttributeError: 'NoneType' object has no attribute 'append'

This is because the default value for callbacks in the Trainer.__init__ signature (retrieved here kwargs = inspect.getcallargs(init, [None] + list(args), **kwargs)) is None.
Hence callbacks = kwargs.pop("callbacks", []) returns None instead of [].

pytorch-lightning==1.5.10

[Feature request] Export grid tree table to LaTeX/csv

❓ Questions

Hi all,

I think the Tree Table feature is great for monitoring, but also for presenting final results. Unfortunately, there doesn't seem to be an easy way to export to LaTeX, or to flatten the tree and save it as CSV.

Is this feature already secretly implemented or should I get coding :)

Best,
Mattias
