facebookresearch / dora

Dora is an experiment management framework. It expresses grid searches as pure python files as part of your repo. It identifies experiments with a unique hash signature. Scale up to hundreds of experiments without losing your sanity.

License: MIT License

Makefile 0.25% Python 99.75%

dora's Issues

How to run with torchrun?

❓ Questions

Torchrun is the standard recommended way to run multi-GPU, multi-machine training. How can one use torchrun to launch projects that are written to use dora?

Now I want to debug dora. Is dora parsing from the train.py file?

❓ Questions

I came across dora through the audiocraft project, but dora is quite complex, and I'm trying to figure out how the audiocraft program works. I see that there is a train.py file in the audiocraft package. Is dora accessed directly from that program?

Support for custom resolvers with Hydra

❓ Questions

Hi,
With hydra and more generally omegaconf, it is possible to register new resolvers to apply custom functions directly within the YAML configuration. I can for example do the following:

import hydra
from omegaconf import OmegaConf

def effective_lr(base_lr: float, batch_size: int) -> float:
    return base_lr * batch_size / 256

OmegaConf.register_new_resolver("effective_lr", effective_lr)

@hydra.main(...)
def main(cfg):
    ...

which enables me to put directly in my yaml file:

data:
    batch_size: 128
    ....
model:
    ...
    optimizer:
        _target_: torch.optim.Adam
        _partial_: true
        lr: ${effective_lr:${model.base_lr},${data.batch_size}}
    base_lr: 0.0003
    ...

However, when using Dora's hydra_main, this doesn't work anymore. Indeed when I use e.g. dora run, the first thing executed is main.get_xp and for some reason this function under the hood resolves the whole config, thus raising an InterpolationError since the custom resolver hasn't been registered yet.

The only workaround I found to resolve this issue consists in directly overriding hydra_main:

from typing import Callable, Dict, Optional

from dora import hydra_main
from omegaconf import DictConfig, OmegaConf


def my_hydra_main(config_name: str, config_path: str,
                  extra_resolvers: Optional[Dict[str, Callable]] = None, **kwargs):
    """Wrap your main function with this.
    You can pass extra kwargs, e.g. `version_base` introduced in 1.2.
    """
    # Register the custom resolvers before Dora gets a chance to resolve the config.
    extra_resolvers = extra_resolvers or {}
    for name, resolver in extra_resolvers.items():
        OmegaConf.register_new_resolver(name, resolver)
    return hydra_main(config_name=config_name, config_path=config_path, **kwargs)


@my_hydra_main(version_base="1.3",
               config_path="../configs",
               config_name="train.yaml",
               extra_resolvers={"effective_lr": effective_lr})
def main(cfg: DictConfig) -> Optional[float]:
    ...

This works quite well, however I'd have 3 questions:

Is it necessary to parse the config directly when calling main.get_xp() in Dora, since anyway the cfg arg of the DecoratedMain is not resolved? If yes, why?
Is there already a way to register custom resolvers while using Dora that I may have missed?
If no, should I consider doing a PR to replace hydra_main by my_hydra_main in next version? Since it only adds an optional arg that is not used by the original hydra_main, hopefully it shouldn't break anything.

Thanks a lot!

Can we train with dora on multiple machines without SLURM?

Is it possible to use dora with horovod or pytorch DDP? If so, is there any documentation/codebase available?

[Improvement] use relative symlinks

Use relative symlinks everywhere to allow easily moving the output folder.
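
One way this could look, as a minimal sketch rather than Dora's actual implementation: compute the link target relative to the directory that holds the link, so the link keeps working when the whole output folder is moved. The helper name `relative_symlink` is an assumption made for illustration.

```python
import os
from pathlib import Path


def relative_symlink(target: Path, link: Path) -> None:
    """Create `link` pointing at `target` through a relative path.

    Hypothetical helper, not part of Dora.
    """
    link.parent.mkdir(parents=True, exist_ok=True)
    # Express the target relative to the folder that will contain the link.
    relative_target = os.path.relpath(target, start=link.parent)
    link.symlink_to(relative_target)
```

With a link at `outputs/grids/mygrid/abc123` and a target at `outputs/xps/abc123` (a made-up layout), the stored target would be `../../xps/abc123`, which stays valid after renaming `outputs`.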

Run a grid experiment for the first time

❓ Questions

When debugging an implementation, I don't want to launch a full grid but a single experiment of this grid instead.

Therefore, I list all experiments of my grid with dora grid mygrid --dry_run, select an experiment id mysig from this grid and launch it with dora run -f mysig.
Unfortunately, this raises FATAL: Could not find an existing run with sig mysig.
I get the same error with dora launch -f mysig.

In order to circumvent this, what I do is I launch all experiments of my grid with dora grid mygrid and cancel them right after with dora grid mygrid --cancel.

Is there a more direct way to launch an experiment for the first time?

[Improvement] Multi method target / refactor decorated main

Also getting rid of "argv" abstraction everywhere ?

Support "light" Hydra mode that only uses the config parsing.

Listing grid names show other modules

pkgutil.walk_packages doesn't do what I thought it did and shows too many results when listing grids.

Callbacks default is None

dora/dora/lightning.py

Line 173 in a23ce11

callbacks = kwargs.pop("callbacks", [])

When calling trainer = dora.lightning.get_trainer() I get

Traceback (most recent call last):
  File "/private/home/louismartin/dev/tap/tap/train.py", line 60, in main
    train(cfg=cfg)
  File "/private/home/louismartin/dev/tap/tap/train.py", line 26, in train
    trainer = dora.lightning.get_trainer()
  File "/private/home/louismartin/dev/dora/dora/lightning.py", line 175, in get_trainer
    callbacks.append(DoraCheckpointSync())
AttributeError: 'NoneType' object has no attribute 'append'

This is because the default value for callbacks in the Trainer.__init__ signature (retrieved here kwargs = inspect.getcallargs(init, [None] + list(args), **kwargs)) is None.
Hence callbacks = kwargs.pop("callbacks", []) returns None instead of [].

pytorch-lightning==1.5.10
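
A possible fix, sketched under assumptions: once the call arguments are bound against the signature, `callbacks` is always present, so the `[]` fallback of `pop` is never used. Falling back with `or []` handles the `None` case. `FakeTrainer` and `collect_callbacks` are hypothetical stand-ins so the snippet runs without pytorch-lightning, and `inspect.signature(...).bind` is used here in place of the `getcallargs` call quoted above.

```python
import inspect


class FakeTrainer:
    """Hypothetical stand-in for a Trainer whose `callbacks` defaults to None."""

    def __init__(self, callbacks=None, max_epochs=1):
        self.callbacks = callbacks
        self.max_epochs = max_epochs


def collect_callbacks(*args, **kwargs):
    """Return the callback list a wrapper would pass on, plus one extra entry."""
    bound = inspect.signature(FakeTrainer.__init__).bind(None, *args, **kwargs)
    bound.apply_defaults()  # fills in callbacks=None when the caller omitted it
    call_kwargs = dict(bound.arguments)
    # The key always exists after apply_defaults, so guard against None explicitly.
    callbacks = list(call_kwargs.pop("callbacks", None) or [])
    callbacks.append("DoraCheckpointSync")  # placeholder for the real callback object
    return callbacks
```

This sketch only covers `None` and list values; a single callback object passed directly would need separate handling.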

How to add the --export=ALL option to srun?

❓ Questions

Hi all,

when running grid searches, I run into RuntimeError: Could not figure out which environment the job is runnning in. Known environments: slurm, local, debug.. I have managed to manually fix this by adding the --export=ALL option to the srun command in the generated job scripts, and have seen that this can be done automatically with submitit (slurm_srun_args=["--export=ALL"]). I cannot find a way to do this with dora, are there any tricks to enabling --export=ALL with dora?

Best,
Mattias

Python Debugger and dora

❓ Questions

Currently using the dora CLI to initiate training.

dora run -d solver=<some/solver> dset=<path/to/data> ...

What is the syntax for running the same command but using the dora python package?

[Feature request] Allow running {first-xp or entirety} of a grid, locally.

I've been using Dora recently and it's been great.
One thing that would help my usage is an easy way to run e.g. the first xp of a grid locally, for debugging purposes.
This would be helpful for large, complex sweeps, to quickly squash issues without waiting for xps to schedule.

(So far, as a workaround, I've been printing launcher._argv and then doing dora run ${launcher._argv}.)
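
The workaround above can be sketched as a small helper that turns an argument list into a shell-safe `dora run` command. The example overrides are made up, and `launcher._argv` is a private attribute taken from the issue text, so this is an illustration rather than a supported API.

```python
import shlex
from typing import List


def build_run_command(argv: List[str]) -> str:
    """Quote each override so the command can be pasted into a shell."""
    return shlex.join(["dora", "run", *argv])


# Hypothetical overrides, standing in for the contents of launcher._argv.
print(build_run_command(["lr=0.001", "model.name=big net"]))
```

Quoting matters as soon as an override contains spaces or shell metacharacters, which a plain `${launcher._argv}` expansion would split.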

[low importance] dora grid picks up additional packages

🐛 Bug Report

Basically the grid.py file in dora uses pkgutil.walk_packages(). This outputs local packages in the grids/ dir, but also global ones installed by pip. So if I had a /grids/alphafold/ dir, and also did "pip install alphafold" on my environment, it will pick all modules inside the installed alphafold package unintentionally.

The fix is probably to replace pkgutil.walk_packages() in grid.py with something else that picks up only local modules.
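
One possible replacement, as a sketch and not necessarily the fix that was adopted: walk the grids directory on disk instead of asking the import system, so only files that really live under that folder are listed. `list_local_grids` is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def list_local_grids(grids_dir: Path) -> List[str]:
    """Return dotted names for the .py files found under `grids_dir` only."""
    names = []
    for path in grids_dir.rglob("*.py"):
        if path.name.startswith("_"):
            continue  # skip __init__.py and private helpers
        relative = path.relative_to(grids_dir).with_suffix("")
        names.append(".".join(relative.parts))
    return sorted(names)
```

Because nothing is imported by name, an installed package that happens to share a name with a subfolder of grids/ is never consulted.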

Your Environment

Python version:
Operating system:

No training when using 2 nodes and torchrun

Thank you for adding the ability to use multi-nodes without slurm! When I run training on one machine using torchrun, everything works without problems. But when I run it on two machines, the training freezes. The machines connect to each other, the model is loaded, but the training process does not continue. Learning gets stuck somewhere. Now I'm trying to figure out how to solve this problem. My launch command:

torchrun --master-addr [ip] \
--master-port [port] \
--node_rank 0 \
--nnodes 2 \
--nproc-per-node 2 \
-m dora run [ARGS]

Initializing Dora xp/using Dora

I've been trying to initialize a dora experiment on two 4090s. Specifically, I am training the HTDemucs model from FB. I've run into this issue when uploading a new dataset, but now it seems whenever I initialize dora or run any command with dora, I get the following error. @adefossez

File "/home/robertthomas/.local/bin/dora", line 5, in <module>
    from dora.__main__ import main
  File "/home/robertthomas/.local/lib/python3.10/site-packages/dora/__init__.py", line 66, in <module>
    from .explore import Explorer, Launcher
  File "/home/robertthomas/.local/lib/python3.10/site-packages/dora/explore.py", line 27, in <module>
    from .shep import Shepherd, Sheep
  File "/home/robertthomas/.local/lib/python3.10/site-packages/dora/shep.py", line 25, in <module>
    from .distrib import get_distrib_spec
  File "/home/robertthomas/.local/lib/python3.10/site-packages/dora/distrib.py", line 14, in <module>
    import torch
  File "/home/robertthomas/.local/lib/python3.10/site-packages/torch/__init__.py", line 1465, in <module>
    from . import _meta_registrations
  File "/home/robertthomas/.local/lib/python3.10/site-packages/torch/_meta_registrations.py", line 7, in <module>
    from torch._decomp import _add_op_to_registry, global_decomposition_table, meta_table
  File "/home/robertthomas/.local/lib/python3.10/site-packages/torch/_decomp/__init__.py", line 169, in <module>
    import torch._decomp.decompositions
  File "/home/robertthomas/.local/lib/python3.10/site-packages/torch/_decomp/decompositions.py", line 10, in <module>
    import torch._prims as prims
  File "/home/robertthomas/.local/lib/python3.10/site-packages/torch/_prims/__init__.py", line 33, in <module>
    from torch._subclasses.fake_tensor import FakeTensor, FakeTensorMode
  File "/home/robertthomas/.local/lib/python3.10/site-packages/torch/_subclasses/__init__.py", line 3, in <module>
    from torch._subclasses.fake_tensor import (
  File "/home/robertthomas/.local/lib/python3.10/site-packages/torch/_subclasses/fake_tensor.py", line 13, in <module>
    from torch._guards import Source
  File "/home/robertthomas/.local/lib/python3.10/site-packages/torch/_guards.py", line 14, in <module>
    import sympy  # type: ignore[import]
  File "/home/robertthomas/.local/lib/python3.10/site-packages/sympy/__init__.py", line 30, in <module>
    from sympy.core.cache import lazy_function
  File "/home/robertthomas/.local/lib/python3.10/site-packages/sympy/core/__init__.py", line 4, in <module>
    from .sympify import sympify, SympifyError
  File "/home/robertthomas/.local/lib/python3.10/site-packages/sympy/core/sympify.py", line 8, in <module>
    from sympy.core.random import choice
  File "/home/robertthomas/.local/lib/python3.10/site-packages/sympy/core/random.py", line 25, in <module>
    from sympy.utilities.iterables import is_sequence
  File "/home/robertthomas/.local/lib/python3.10/site-packages/sympy/utilities/__init__.py", line 4, in <module>
    from .iterables import (flatten, group, take, subsets,
  File "/home/robertthomas/.local/lib/python3.10/site-packages/sympy/utilities/iterables.py", line 16, in <module>
    from sympy.utilities.misc import as_int
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 879, in exec_module
  File "<frozen importlib._bootstrap_external>", line 1012, in get_code
  File "<frozen importlib._bootstrap_external>", line 672, in _compile_bytecode
ValueError: bad marshal data (invalid reference)
terminate called after throwing an instance of 'c10::Error'
  what():  Number of tensor lists has to match the depth.
Exception raised from multi_tensor_apply at ../aten/src/ATen/native/cuda/MultiTensorApply.cuh:92 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb90af9e4d7 in /home/robertthomas/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7fb90af68434 in /home/robertthomas/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x17b292d (0x7fb8ac1b292d in /home/robertthomas/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10::impl::OperatorEntry::updateDispatchTableEntry_(c10::Dispatcher const&, c10::DispatchKey) + 0xe0 (0x7fb8d26cc500 in /home/robertthomas/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10::impl::OperatorEntry::updateDispatchTable_(c10::Dispatcher const&, c10::DispatchKey) + 0xb5 (0x7fb8d26cc655 in /home/robertthomas/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10::impl::OperatorEntry::deregisterKernel_(c10::Dispatcher const&, c10::optional<c10::DispatchKey>, std::_List_iterator<c10::impl::AnnotatedKernel>) + 0x3ff (0x7fb8d26cdc5f in /home/robertthomas/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10::Dispatcher::deregisterImpl_(c10::OperatorHandle const&, c10::OperatorName const&, c10::optional<c10::DispatchKey>, std::_List_iterator<c10::impl::AnnotatedKernel>) + 0x59 (0x7fb8d26bf779 in /home/robertthomas/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0xe81705 (0x7fb8ab881705 in /home/robertthomas/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #8: <unknown function> + 0x45495 (0x7fb923c45495 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: on_exit + 0 (0x7fb923c45610 in /lib/x86_64-linux-gnu/libc.so.6)
frame #10: <unknown function> + 0x29d97 (0x7fb923c29d97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #11: __libc_start_main + 0x80 (0x7fb923c29e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #12: _start + 0x25 (0x55c151d9dba5 in /usr/bin/python3)

Aborted (core dumped)

Any advice or clues on how to debug are welcomed. Thank you

Is there any way to use the Debugger of VSCode while using "dora run"?

❓ Questions

Is it possible to use the VSCode debugger with the dora run command to debug Python code?

Moving a running xp from one grid to another (e.g. when refactoring) cancels the XP

I had a grid grid_a with too many experiments running so I refactored some of its experiments in a new grid file grid_b.
While running the new grid dora grid grid_b worked as expected and found the already running experiment, when I ran dora grid grid_a again it cancelled all the experiments that were now in grid_b.

It would be nice to have a way to track this scenario and only garbage collect experiments that are not linked to a grid.
And also it would be nice to ask the user for confirmation when cancelling experiments.

Can not work on multi machines with multi gpus

❓ Questions

I am trying to run the program on multiple machines with multiple GPUs, but the code can only find multiple machines and use only one GPU on each machine during runtime. Do I need to add additional configurations to run it properly?

/home/fay.cyf/shiyi.zxh/miniconda3/envs/musicgen/lib/python3.9/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Dora directory: /home/fay.cyf/weixipin.wxp/audiocraft/audiocraft_root
[08-18 02:46:54][audiocraft.solvers.builders][INFO] - Loading audio data split valid: /home/fay.cyf/weixipin.wxp/audiocraft/egs/example
[08-18 02:46:54][audiocraft.solvers.builders][INFO] - Loading audio data split evaluate: /home/fay.cyf/weixipin.wxp/audiocraft/egs/example
[08-18 02:46:54][audiocraft.solvers.builders][INFO] - Loading audio data split generate: /home/fay.cyf/weixipin.wxp/audiocraft/egs/example
[08-18 02:46:54][root][INFO] - Getting pretrained compression model from HF facebook/encodec_32khz
[08-18 02:46:53][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[08-18 02:46:53][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[08-18 02:46:53][dora.distrib][INFO] - Distributed init: 0/2 (local 0) from env
[08-18 02:46:53][flashy.solver][INFO] - Instantiating solver MusicGenSolver for XP 9521b0af
[08-18 02:46:53][flashy.solver][INFO] - All XP logs are stored in /home/fay.cyf/weixipin.wxp/audiocraft/audiocraft_root/xps/9521b0af
/home/fay.cyf/shiyi.zxh/miniconda3/envs/musicgen/lib/python3.9/site-packages/flashy/loggers/tensorboard.py:47: UserWarning: tensorboard package was not found: use pip install tensorboard
  warnings.warn("tensorboard package was not found: use pip install tensorboard")
[08-18 02:46:53][audiocraft.solvers.builders][INFO] - Loading audio data split train: /home/fay.cyf/weixipin.wxp/audiocraft/egs/data
[08-18 02:47:34][flashy.solver][INFO] - Compression model has 4 codebooks with 2048 cardinality, and a framerate of 50
[08-18 02:47:34][audiocraft.modules.conditioners][INFO] - T5 will be evaluated with autocast as float32
[08-18 02:47:51][audiocraft.optim.dadam][INFO] - Using decoupled weight decay
[08-18 02:47:53][flashy.solver][INFO] - Model hash: e7554e7f9d6cc2dea51bd31aa3e89765bc73d1dd
[08-18 02:47:53][flashy.solver][INFO] - Initializing EMA on the model with decay = 0.99 every 10 updates
[08-18 02:47:53][flashy.solver][INFO] - Model size: 420.37 M params
[08-18 02:47:53][flashy.solver][INFO] - Base memory usage, with model, grad and optim: 6.73 GB
[08-18 02:47:53][flashy.solver][INFO] - Restoring weights and history.
[08-18 02:47:53][flashy.solver][INFO] - Loading a pretrained model. Ignoring 'load_best' and 'ignore_state_keys' params.
[08-18 02:48:00][flashy.solver][INFO] - Checkpoint source is not the current xp: Load state_dict from best state.
[08-18 02:48:00][flashy.solver][INFO] - Ignoring keys when loading best []
damo-pod7-0129:1:1 [0] NCCL INFO Bootstrap : Using bond0:33.57.143.227<0>
damo-pod7-0129:1:1 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
damo-pod7-0129:1:1 [0] NCCL INFO cudaDriverVersion 11070
NCCL version 2.14.3+cuda11.7
[08-18 02:48:06][flashy.solver][INFO] - Model hash: 776d041cbbcb8973c4968782a79f9bb63b53a727
[08-18 02:48:04][flashy.solver][INFO] - Re-initializing EMA from best state
[08-18 02:48:04][flashy.solver][INFO] - Initializing EMA on the model with decay = 0.99 every 10 updates
[08-18 02:48:03][flashy.solver][INFO] - Loading state_dict from best state.

[Question] how can I launch dora with volta32gb GPUs? Is there a command line flag for it?

Why only the log file of rank > 0 is created?

❓ Questions

I'm trying to debug a problem with nccl, but I can only see the worker_1.log output file.

I have two GPUs and I'm trying to see the logs, but I can only see the log of one of them. Is this check really necessary?

TIA

Cannot import name 'hydra_main' from 'dora' on Colab or Kaggle environment

🐛 Bug Report

I wanted to run training inside a Colab notebook using code that utilizes dora, but after installation hydra_main seems to be missing.

from dora import hydra_main gives me the following exception:

<ipython-input-8-0d1fd6af93ab> in <cell line: 1>()
----> 1 from dora import hydra_main

ImportError: cannot import name 'hydra_main' from 'dora' (/usr/local/lib/python3.9/dist-packages/dora/__init__.py)

I installed it with
pip install -U dora-search

and tried a few different versions.

Your Environment

Both Kaggle and Colab notebooks.

Or maybe this is expected and one can't use dora inside Colab?

[Improvement] Core launching API / mini dora with concurrent.futures like API

Allow for a mini job launching only API with deduplication, supporting easily any function with kwargs.

`dora grid ... -t 0` crashes if the job hasn't logged anything yet

🐛 Bug Report

When I run dora grid with the -t parameter to monitor the logs, the command stops if the job has not logged anything yet (e.g. it's still pending).
It might be worth continuing to monitor even when the log does not exist yet.

$ dora grid my_grid -t 0
WARNING: Log .../dora_outputs/xps/d3843038/latest/61700001_0_log.out does not exist
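
A minimal sketch of the suggested behaviour, assuming a simple polling loop would be acceptable: wait for the log file to appear before tailing it, instead of stopping right away. `wait_for_log` and its parameters are hypothetical and not part of Dora.

```python
import time
from pathlib import Path


def wait_for_log(path: Path, timeout: float = 60.0, poll: float = 1.0) -> bool:
    """Poll until `path` exists; return False if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    while not path.exists():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
    return True
```

A caller could start tailing once this returns True and print the existing warning only when it returns False.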

`grids/` without `__init__.py` crashes

dora/dora/grid.py

Line 99 in a0a6558

grid_file = Path(grids.__file__).parent / grid_filename

grids.__file__ is None if there is no __init__.py.
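
A sketch of one way to tolerate this, assuming the package is then loaded as a namespace package: such packages have no usable `__file__` but still expose their folders through `__path__`. `package_dir` is a hypothetical helper, and the `SimpleNamespace` objects below stand in for real imported modules.

```python
from pathlib import Path
from types import SimpleNamespace


def package_dir(module) -> Path:
    """Return the folder of a package, with or without an __init__.py."""
    file = getattr(module, "__file__", None)
    if file is not None:
        return Path(file).parent
    # Namespace packages: fall back to the first entry of __path__.
    paths = list(getattr(module, "__path__", []))
    if not paths:
        raise ValueError("cannot locate the package directory")
    return Path(paths[0])


# Stand-ins for a regular package and a namespace package.
regular = SimpleNamespace(__file__="/repo/grids/__init__.py", __path__=["/repo/grids"])
namespace = SimpleNamespace(__file__=None, __path__=["/repo/grids"])
print(package_dir(regular), package_dir(namespace))
```

A namespace package can span several folders, so taking the first entry is a simplification.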

Deleting History of a Cancelled Experiment

❓ Questions

Hi @adefossez,

Thank you for developing such a useful package. I have encountered an issue with rerunning a cancelled experiment (XP). Specifically, I'm using dora grid baseline --clear with the expectation that the experiment would start from scratch. Initially, it appears to work as the history.json file is deleted. However, as the process continues, the previous history reappears before the current training starts, causing the training to resume from the last cancellation point.

This issue does not occur if the previous experiment completes successfully; in those cases, the --clear option works as expected. Could you advise on how to ensure that a cancelled experiment restarts completely from scratch when rerun?

Thank you for your assistance.

error with pytorch_lightning

the pytorch_lightning example leads to the following error

jeanremi@devfair0166:~/opt/dora/examples$ export DORA_PACKAGE=pl
jeanremi@devfair0166:~/opt/dora/examples$ dora run

Traceback (most recent call last):
  File "/private/home/jeanremi/.conda/envs/ame/bin/dora", line 8, in <module>
    sys.exit(main())
  File "/private/home/jeanremi/.conda/envs/ame/lib/python3.8/site-packages/dora/__main__.py", line 205, in main
    args.action(args, main)
  File "/private/home/jeanremi/.conda/envs/ame/lib/python3.8/site-packages/dora/run.py", line 69, in run_action
    main()
  File "/private/home/jeanremi/.conda/envs/ame/lib/python3.8/site-packages/dora/main.py", line 62, in __call__
    return self._main()
  File "/private/home/jeanremi/.conda/envs/ame/lib/python3.8/site-packages/dora/main.py", line 68, in _main
    return self.main()
  File "/private/home/jeanremi/opt/dora/examples/pl/train.py", line 85, in main
    trainer = trainer_from_argparse_args(
  File "/private/home/jeanremi/.conda/envs/ame/lib/python3.8/site-packages/dora/lightning.py", line 238, in trainer_from_argparse_args
    return get_trainer(*intercept.args, **intercept.kwargs)
  File "/private/home/jeanremi/.conda/envs/ame/lib/python3.8/site-packages/dora/lightning.py", line 185, in get_trainer
    env = DoraEnvironment()
TypeError: Can't instantiate abstract class DoraEnvironment with abstract methods creates_processes_externally

Define Dora `outputs` dir relative to where decorated main is defined

dora/dora/conf.py

Line 147 in 7e5d894

dir: Path = Path("./outputs") # where everything will be stored

My decorated main is in myrepo/mypackage/train.py
I launched a grid search from myrepo/ with dora grid ... which created an outputs dir in myrepo/outputs/.

Now I want to analyze my runs programmatically from a notebook stored in myrepo/notebooks/mynotebook.ipynb.
However, if I run this in the notebook:

from mypackage.train import main

print(main.dora.dir)

Then it prints

'mypackage/notebooks/outputs'

Hence not the same as where my experiments are stored.
Would it make sense to define the dora outputs dir relative to where the decorated main is defined? E.g. when calling the decorator we would set dora.dir as Path(__file__).parent / "outputs" or something.
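
The idea can be sketched as follows, as an illustration of the proposal and not of Dora's current behaviour: anchor the folder on a file path instead of the current working directory. `outputs_dir_for` is a hypothetical helper, and the result is independent of the working directory only when `anchor_file` is absolute.

```python
from pathlib import Path


def outputs_dir_for(anchor_file: str, name: str = "outputs") -> Path:
    """Return the `name` folder located next to `anchor_file`."""
    # A relative anchor would still be resolved against the working directory.
    return Path(anchor_file).resolve().parent / name
```

Called as `outputs_dir_for(__file__)` from the module that defines the decorated main, this would give the same folder whether the code is imported from a shell in the repo root or from a notebook elsewhere.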

dora install question

❓ Questions

I installed dora with this command: pip install -U dora-search. However, the dora command was not installed. How can I get the dora command? Thanks a lot.

Slurm Configuration

❓ Questions

I'm trying to train Demucs on a 4090 from Jupyter notebook.
I'm able to initialize the model, and retrieve its parameters from checkpoint, train the solver, and save it again.
I'm having trouble running a grid xp search though. Any help would be appreciated.

Below is what I am running, with my own custom main class, and I get this error. I looked into the grids directory and 3909beea is there, but it cannot be accessed. There might be a problem with the Slurm configuration on the GPU machine, but I am not sure.

run_grid(main = train, explorer = explorer, grid_name = 'home/robertthomas/Documents/Melody-stems/demucs/demucs/grids/mdx.py', slurm = xp.cfg.slurm)

Error:

Grid: Error when trying to load old sheep 3909beea: Could not find experiment with signature 3909beea
An error happened when trying to load from /home/robertthomas/Documents/Melody-stems/demucs/outputs/grids/home/robertthomas/Documents/Melody-stems/demucs/demucs/grids/mdx.py/3909beea/job.pkl, this file will be ignored: FileNotFoundError(2, 'No such file or directory')

Cannot install due to requirement of "sklearn"

🐛 Bug Report

When I tried to pip install dora, I encountered this error:

Collecting sklearn (from dora->-r requirements.txt (line 4))
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/a4/0b/d1c703256cf293be77b7db44dbef62251fe02a97d0bef981f7120b0b0c0f/sklearn-0.0.post11.tar.gz (3.6 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [18 lines of output]
      The 'sklearn' PyPI package is deprecated, use 'scikit-learn'
      rather than 'sklearn' for pip commands.
      
      Here is how to fix this error in the main use cases:
      - use 'pip install scikit-learn' rather than 'pip install sklearn'
      - replace 'sklearn' by 'scikit-learn' in your pip requirements files
        (requirements.txt, setup.py, setup.cfg, Pipfile, etc ...)
      - if the 'sklearn' package is used by one of your dependencies,
        it would be great if you take some time to track which package uses
        'sklearn' instead of 'scikit-learn' and report it to their issue tracker
      - as a last resort, set the environment variable
        SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True to avoid this error
      
      More information is available at
      https://github.com/scikit-learn/sklearn-pypi-package
      
      If the previous advice does not cover your use case, feel free to report it at
      https://github.com/scikit-learn/sklearn-pypi-package/issues/new
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

Environment

Python version: Python 3.10.12
Operating system: Linux version 5.4.0-91-generic (buildd@lcy01-amd64-017) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021

Could the requirement be modified? I found a temporary workaround:
SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True pip install dora

[Improvement] Better grid API and experience

better grid api for retrieving xps.

faster explorer evaluation for hydra.

potentially extra config "confirm_cancel" ?

"no_auto_cancel" flag for messing around.

allow to "abandon" xps when it should be canceled for transferring ownership. but maybe better to have a mechanism to transfer ownership, e.g. last grid to run XP owns it.

[Feature request] Export grid tree table to LaTeX/csv

❓ Questions

Hi all,

I think the Tree Table feature is great for monitoring and also for presenting final results. Unfortunately, there doesn't seem to be an easy way to export it to LaTeX, or to flatten the tree and save it as CSV.

Is this feature already secretly implemented, or should I get coding? :)

Best,
Mattias

No stop command?

❓ Questions

Sometimes there's something wrong with an experiment and I need to stop it. I did this by killing all the processes created by dora. Is this the correct way?

Can I train on multiple machines?

❓ Questions

I am new to Dora. I see that I can run distributed training, but is it possible to run training across multiple machines? I don't see a way to set master_addr, master_port, or rank. Maybe this hasn't been implemented yet, or perhaps I missed it. It would be very cool to have this possibility! I would be very grateful for help and tips on this matter.

Store dora stuff in `dora_outputs` instead of `outputs`?

dora/dora/conf.py

Line 147 in 7e5d894

dir: Path = Path("./outputs") # where everything will be stored

Store dora stuff in dora_outputs instead of outputs for more clarity?
I was not sure what package created the outputs dir in my repo.

Missing Python3.10 Build

pip doesn't seem to be able to install dora-search on Python 3.10 at the moment, I assume because no build is available for it. Would it be possible to get one made? Thanks.

[Feature request] copy the source code to xps dir upon launch (only .py and .yaml files is a good start)

[Feature request]. Ability to schedule with SLURM job_arrays.

Thanks again for the great tool. Recently, I'm running Dora with many (1k - 10k) small single-gpu experiments that run for a few hours each. Is there a way to launch these jobs via slurm jobarray? Otherwise scheduler load is too high.

Feature request: Specify grids by absolute path

I know we discussed this already but I'm creating this issue for future reference in case we want to implement it in the future.

It would be nice to be able to specify grids with absolute paths, e.g. dora grid grids/my_grid.py, in addition to grid names.

The advantages that I see are:

You can place your grids anywhere you want and run grids from everywhere
More intuitive in my opinion
Benefits for free from shell path autocompletion which is super helpful once you start having many grids!

World size from dora_distrib.world_size() is equal to 1 when I have two GPUs

❓ Questions

I am training my model on two NVIDIA 4090s, whenever the following code is run:

world_size = dora_distrib.world_size()
print(world_size)

world_size is equal to 1 even though torch.cuda.device_count() returns 2.
I tried wrapping my model in DDP and DataParallel but to no avail.

Would appreciate someone shining light on why this is happening.

Thanks
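
A general PyTorch point that may be relevant here, offered as a sketch and not as a description of Dora's internals: the distributed world size counts the processes that were launched, not the GPUs that are visible, and launchers such as torchrun announce it through the `WORLD_SIZE` environment variable. The helper name below is hypothetical.

```python
import os
from typing import Mapping, Optional


def world_size_from_env(env: Optional[Mapping[str, str]] = None) -> int:
    """Return the process count announced by the launcher, or 1 if none is set."""
    env = os.environ if env is None else env
    return int(env.get("WORLD_SIZE", "1"))
```

Under this reading, a single process started by hand would report 1 even on a two-GPU machine, since nothing sets `WORLD_SIZE` for it.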

Dora outputs dir broken when git_save: true

🐛 Bug Report

Since I used git_save: true the dora output dir is nested into the saved code dir. I.e. my checkpoints are saved here:
dora_outputs/codes/73345a0f0c11882824e3c9d2d354a1c1b82098d6/dora_outputs/xps/6658f617/lightning_logs/
instead of here dora_outputs/xps/6658f617/lightning_logs/.

Your Environment

Python version: 3.9
Operating system: Ubuntu 20+
