
sisyphus's Introduction

Sisyphus

A workflow manager in Python.

Installation

Requirements

Sisyphus requires a Python >=3.6 installation with the following additional libraries:

pip3 install -r requirements.txt

Optional if the curses-based user interface should be used:

pip3 install urwid

Optional if the web interface should be used:

pip3 install flask

Optional to compile documentation:

pip3 install Sphinx
pip3 install sphinxcontrib-mermaid

Optional if the virtual file system should be used:

pip3 install fusepy
sudo addgroup $USER fuse  # depending on your system

Optional for nicer tracebacks:

pip3 install better_exchook

Setup

  • After the requirements are installed, install sis. This can be done with
    pip3 install git+https://github.com/rwth-i6/sisyphus.git
    
  • The current directory (pwd), when you run sis, should have a file settings.py (see example dir).
  • Create a directory work in the current dir. All data created while running the jobs will be stored there.
  • Create a directory output in the current dir. All the registered output will end up here.
  • Create a directory alias in the current dir.
  • Run sis --config some_config.py m.

Documentation

Can be found here: sisyphus-workflow-manager.readthedocs.io.

Example

A short toy workflow example is given in the example directory.

To run Sisyphus on the example workflow, change into the example directory and run ../sis manager

A large realistic workflow will soon be added.

Development

If you want to commit to the repository, make sure to run the unit tests and check for PEP 8 by running tox. All needed tools can be installed by running:

pip3 install -r requirements-dev.txt

To automatically check for PEP 8 errors before committing run:

pre-commit install

The unit tests and flake8 checks on all relevant files are run via tox:

tox

Known Problems:

  • blocks may not work correctly when used together with the async workflow

License

All Source Code in this Project is subject to the terms of the Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at http://mozilla.org/MPL/2.0/.

sisyphus's People

Contributors

abdallahnasir, albertz, arnenx, atticus1806, christopherbrix, christophmluscher, critias, curufinwe, dthulke, icemole, jacktemaki, jotix16, jvhoffbauer, michelwi, nazish-qamar, neolegends, simbe195, vieting


sisyphus's Issues

Path copy_append and join_right keep hash_overwrite

Example code:

path = tk.Path("foo", hash_overwrite="X")
path2 = path.join_right("bar")

path2.hash_overwrite == "X"  # ?
sis_hash(path) == sis_hash(path2)  # ?

I assume this is not intended? At least I was very confused about this.

Don't try to cleanup symlinked jobs

We started to share more common job dirs. In my case, they are symlinked.

[2023-11-19 18:36:36,651] INFO: clean up: work/i6_core/datasets/librispeech/LibriSpeechCreateBlissCorpusJob.OjEfOC2QXh8l                        
[2023-11-19 18:36:36,652] WARNING: Could not clean up work/i6_core/datasets/librispeech/LibriSpeechCreateBlissCorpusJob.wjSkfzJS1Ge2: [Errno 13] Permission denied: '/u/zeyer/setups/combined/2021-05-31/work/i6_core/datasets/librispeech/LibriSpeechCreateBlissCorpusJob.wjSkfzJS1Ge2/work'   
[2023-11-19 18:36:36,653] WARNING: Could not clean up work/i6_core/datasets/librispeech/LibriSpeechCreateBlissCorpusJob.N4devEBOAvgK: [Errno 13] Permission denied: '/u/zeyer/setups/combined/2021-05-31/work/i6_core/datasets/librispeech/LibriSpeechCreateBlissCorpusJob.N4devEBOAvgK/work'   
[2023-11-19 18:36:36,655] WARNING: Could not clean up work/i6_core/datasets/librispeech/LibriSpeechCreateBlissCorpusJob.h78m0D0Rx6uQ: [Errno 13] Permission denied: '/u/zeyer/setups/combined/2021-05-31/work/i6_core/datasets/librispeech/LibriSpeechCreateBlissCorpusJob.h78m0D0Rx6uQ/work'   
[2023-11-19 18:36:36,662] WARNING: Could not clean up work/i6_core/datasets/librispeech/LibriSpeechCreateBlissCorpusJob.OjEfOC2QXh8l: [Errno 13] Permission denied: '/u/zeyer/setups/combined/2021-05-31/work/i6_core/datasets/librispeech/LibriSpeechCreateBlissCorpusJob.OjEfOC2QXh8l/work'   
[2023-11-19 18:36:36,662] WARNING: Could not clean up work/i6_core/datasets/librispeech/LibriSpeechCreateBlissCorpusJob.2GMuOxuirZVL: [Errno 13] Permission denied: '/u/zeyer/setups/combined/2021-05-31/work/i6_core/datasets/librispeech/LibriSpeechCreateBlissCorpusJob.2GMuOxuirZVL/work'   

With symlinks for example like:
work/i6_core/datasets/librispeech/LibriSpeechCreateBlissCorpusJob.OjEfOC2QXh8l -> /work/common/asr/librispeech/data/sisyphus_work_dir/i6_core/datasets/librispeech/LibriSpeechCreateBlissCorpusJob.OjEfOC2QXh8l

My suggestion is to never clean up job dirs which are symlinks. E.g. we could add such a check in Job._sis_cleanable.

Do you agree? Or other suggestions? Should this be an option or just the new behavior? If this is supposed to be an option, what should be its default?
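A minimal sketch of such a check (purely illustrative; job_dir_is_symlink is a made-up helper, and the Job._sis_cleanable integration is only indicated in comments):

import os

def job_dir_is_symlink(job_dir: str) -> bool:
    """Return True if the job directory itself is a symlink, i.e. shared from elsewhere."""
    return os.path.islink(job_dir)

# Inside a Job._sis_cleanable-style check, one could then do something like:
#   if job_dir_is_symlink(self._sis_path(abspath=True)):
#       return False  # never offer symlinked job dirs for cleanup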

cleanup performed on symlinks in work dir

Hey,

I get this error while running the sisyphus manager.

[2022-09-16 18:44:21,738] INFO: clean up: work/i6_core/datasets/librispeech/DownloadLibriSpeechCorpusJob.8iqSB1tz3OMD
tar (child): finished.tar.gz: Cannot open: Permission denied
tar (child): Error is not recoverable: exiting now 
[2022-09-16 18:44:21,874] WARNING: Could not clean up work/i6_core/datasets/librispeech/DownloadLibriSpeechCorpusJob.8iqSB1tz3OMD: Command '['tar', '-czf', 'finished.tar.gz', 'job.save', 'input', 'finished', 'last_user', 'log.run.1', 'engine', 'finished.run.1', 'usage.run.1', 'submit_log.run']' died with <Signals.SIGPIPE: 13>.
[2022-09-16 18:44:21,875] WARNING: Could not clean up work/i6_core/tools/download/DownloadJob.0UXAqd5DuQG7: Command '['tar', '-czf', 'finished.tar.gz', 'last_user', 'submit_log.run', 'finished.run.1', 'usage.run.1', 'log.run.1', 'input', 'job.save', 'engine', 'finished']' died with <Signals.SIGPIPE: 13>.

The sisyphus job dir is a symlink to a location I have no write access to.
I think there is a wrong assumption here: you actually do not want to clean up a symlinked sisyphus job, since you cannot reliably say that it is under your control.

config file not loaded correctly

Hi,
I have a problem with the naming of config files. It used to be that all filenames were allowed, but now they are limited to names that are valid Python modules.
This breaks my setup, since my file names begin with numbers and contain periods and hyphens.

./sis --config 01_test.py m

[2020-03-13 17:19:58,740] ERROR: Main thread unhandled exception:

AssertionError Traceback (most recent call last)
~/dev/sisyphus_2020-03-13/sisyphus/main.py in main()
204 t.join()
205 else:
--> 206 args.func(args)
args.func = <function manager at 0x7f24078134c0>
args = Namespace(argv=[], clear_errors_once=False, clear_interrupts_once=False, config_files=['01_test.py'], filesystem=None, func=<function manager at 0x7f24078134c0>, http_port=None, ignore_once=False, interactive=False, log_level=20, run=False, ui=False)
207 except BaseException as exc:
208 if not isinstance(exc, SystemExit):

~/dev/sisyphus_2020-03-13/sisyphus/manager.py in manager(args=Namespace(argv=[], clear_errors_once=False, clea...ractive=False, log_level=20, run=False, ui=False))
70
71 start = time.time()
---> 72 load_configs(args.config_files)
global load_configs = <function load_configs at 0x7f24083e0700>
args.config_files = ['01_test.py']
73 load_time = time.time() - start
74 if load_time < 5:

~/dev/sisyphus_2020-03-13/sisyphus/loader.py in load_configs(filenames=['01_test.py'])
80
81 for filename in filenames:
---> 82 load_config_file(filename)
global load_config_file = <function load_config_file at 0x7f24083e0670>
filename = '01_test.py'
83
84

~/dev/sisyphus_2020-03-13/sisyphus/loader.py in load_config_file(config_name='01_test.py')
25
26 filename = filename.replace(os.path.sep, '.') # allows to use tab completion for file selection
---> 27 assert all(part.isidentifier() for part in filename.split('.')), "Config name is invalid: %s" % filename
global all = undefined
global part.isidentifier = undefined
global part = undefined
filename.split = <built-in method split of str object at 0x7f240d6c11f0>
filename = '01_test.py'
28 module_name, function_name = filename.rsplit('.', 1)
29 try:

AssertionError: Config name is invalid: 01_test.py

./sis --config tttt.py m

[2020-03-13 18:45:42,605] ERROR: Main thread unhandled exception:

ModuleNotFoundError Traceback (most recent call last)
~/dev/sisyphus_2020-03-13/sisyphus/main.py in main()
204 t.join()
205 else:
--> 206 args.func(args)
args.func = <function manager at 0x7f0ba4f09550>
args = Namespace(argv=[], clear_errors_once=False, clear_interrupts_once=False, config_files=['tttt.py'], filesystem=None, func=<function manager at 0x7f0ba4f09550>, http_port=None, ignore_once=False, interactive=False, log_level=20, run=False, ui=False)
207 except BaseException as exc:
208 if not isinstance(exc, SystemExit):

~/dev/sisyphus_2020-03-13/sisyphus/manager.py in manager(args=Namespace(argv=[], clear_errors_once=False, clea...ractive=False, log_level=20, run=False, ui=False))
70
71 start = time.time()
---> 72 load_configs(args.config_files)
global load_configs = <function load_configs at 0x7f0ba5223c10>
args.config_files = ['tttt.py']
73 load_time = time.time() - start
74 if load_time < 5:

~/dev/sisyphus_2020-03-13/sisyphus/loader.py in load_configs(filenames=['tttt.py'])
85
86 for filename in filenames:
---> 87 load_config_file(filename)
global load_config_file = <function load_config_file at 0x7f0ba526e550>
filename = 'tttt.py'
88
89

~/dev/sisyphus_2020-03-13/sisyphus/loader.py in load_config_file(config_name='tttt.py')
33 try:
34 print(module_name, function_name)
---> 35 config = importlib.import_module(module_name)
config = undefined
global importlib.import_module = <function import_module at 0x7f0baaf054c0>
module_name = 'tttt'
36 except SyntaxError:
37 import sys

/work/tools/asr/python/3.8.0_tf_1.15-generic+cuda10.1/lib/python3.8/importlib/init.py in import_module(name='tttt', package=None)
125 break
126 level += 1
--> 127 return _bootstrap._gcd_import(name[level:], package, level)
global _bootstrap._gcd_import = <function _gcd_import at 0x7f0bab00d430>
name = 'tttt'
level = 0
package = None
128
129

/work/tools/asr/python/3.8.0_tf_1.15-generic+cuda10.1/lib/python3.8/importlib/_bootstrap.py in _gcd_import(name='tttt', package=None, level=0)

/work/tools/asr/python/3.8.0_tf_1.15-generic+cuda10.1/lib/python3.8/importlib/_bootstrap.py in find_and_load(name='tttt', import=)

/work/tools/asr/python/3.8.0_tf_1.15-generic+cuda10.1/lib/python3.8/importlib/_bootstrap.py in find_and_load_unlocked(name='tttt', import=)

ModuleNotFoundError: No module named 'tttt'

./sis --config test.py m

[2020-03-13 17:24:40,466] INFO: Loaded config: test.py
[2020-03-13 17:24:40,526] INFO: Experiment directory: /u/luescher/setups/punctuation-prediction/2019-07-29--iwslt-paper Call: /u/luescher/dev/sisyphus_2020-03-13/sis --config test.py m
Print verbose overview (v), update aliases and outputs (u), start manager (y), or exit (n)? n
[2020-03-13 17:24:44,375] INFO: Manager loop stopped

./sis --config config.py m

[2020-03-13 17:25:55,969] INFO: Loaded config: config.py
[2020-03-13 17:25:55,969] INFO: Config loaded (time needed: 6.11)
[2020-03-13 17:25:56,435] INFO: Experiment directory: /u/luescher/setups/punctuation-prediction/2019-07-29--iwslt-paper Call: /u/luescher/dev/sisyphus_2020-03-13/sis --config config.py m
[2020-03-13 17:25:56,437] INFO: runnable: Job<work/corpus/CorpusToStm.S1yirGGByxOE> [Extract STM from Corpus]

This is all the same config, just with different file names. This behavior does not affect files located in recipes. I would like to start single config files located at the top level, not only config.py. Is it possible to get the old behavior back?

Clarification: canonical way to define custom global settings

@michelwi wrote in rwth-i6/i6_core#32:

Actually RETURNN_PYTHON_EXE is not a setting that is defined in sisyphus.global_settings so we are here in fact abusing the settings mechanics of sisyphus and should not have set this to begin with.

I want to clarify: Are we abusing this? If yes, how should we do it instead? I.e. where/how should we define some global settings (which we intentionally do not want as arguments to jobs)?

Let's not argue too much about RETURNN_PYTHON_EXE specifically here. This is just an example. It is actually an example where some people prefer it one way (having it as an explicit job arg) and others the other way (using a global default, which is just a recent version), so it stays an argument whose default None falls back to the custom global setting. Maybe we will often have it like this, where both variants make sense. Maybe there are also other cases where we really want custom global settings which are never an argument.

My take here is that we can simply extend the scope of sisyphus.global_settings to also support such use cases. I don't think we need another separate mechanism for this. So basically no change needed then, just maybe some clarification in the doc that this is a valid use case.
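For illustration, a minimal sketch of this pattern (the job class and setting name are just examples from this discussion; only sisyphus.global_settings is an actual module):

import sisyphus.global_settings as gs
from sisyphus import Job

class MyReturnnJob(Job):
    def __init__(self, returnn_python_exe=None):
        # an explicit job argument wins; the default None falls back to the custom global setting
        self.returnn_python_exe = returnn_python_exe or getattr(gs, "RETURNN_PYTHON_EXE", None)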

Symlinks for Imported Jobs

When importing the work dir from someone else, you might end up with links to /u/<username>/.... In the case that you work with an apptainer image, this can be problematic because /u/<username>/ is not mounted. We could resolve this by only linking real paths upon the import of the work directory.

Probably it would make sense to add the actual install command to the installation section

For newbies it is sometimes hard to infer what might be obvious to people who have been working with such packages for a long time.
Thus I would propose adding some more detail about how to install the actual package to the installation section, e.g. something like this:

git clone https://github.com/rwth-i6/sisyphus.git
pip install -e ~/code/sisyphus/

I can also prepare a pull request if you agree to this.

Inconsistencies in memory specification

In class Task memory is sometimes stored as int and sometimes as string (e.g. '2G'). This can cause problems when an old memory value is read or compared to a new one. In the case of update_rqmt:

sisyphus/sisyphus/task.py

Lines 402 to 417 in 4693518

def update_rqmt(self, initial_rqmt, submit_history, task_id):
    """ Update task requirements of interrupted job """
    initial_rqmt = initial_rqmt.copy()
    initial_rqmt['mem'] = tools.str_to_GB(initial_rqmt['mem'])
    initial_rqmt['time'] = tools.str_to_hours(initial_rqmt['time'])
    usage_file = self._job._sis_path(gs.PLOGGING_FILE + '.' + self.name(), task_id, abspath=True)
    try:
        last_usage = literal_eval(open(usage_file).read())
    except (SyntaxError, IOError):
        # we don't know anything if no usage file is writen or is invalid, just reuse last rqmts
        return initial_rqmt
    new_rqmt = self._update_rqmt(initial_rqmt=initial_rqmt, last_usage=last_usage)
    new_rqmt = gs.check_engine_limits(new_rqmt, self)
    return new_rqmt

For initial_rqmt, mem is converted to an int, but the file on disk can still be in string format. tools.str_to_GB should be called either on last_usage or possibly on self._rqmt directly.
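A self-contained illustration of the mismatch (str_to_gb below is a simplified stand-in for tools.str_to_GB, not the real implementation):

def str_to_gb(value):
    """Normalize a memory value that may be an int/float (GB) or a string like '2G'."""
    if isinstance(value, str) and value.upper().endswith("G"):
        return float(value[:-1])
    return float(value)

print(str_to_gb("2G") == str_to_gb(2))  # True only once both sides are normalized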

Sisyphus is too slow

What I measure (what is most relevant for me): startup time of the sis manager, up to the prompt.

My current benchmark: i6_experiments.users.zeyer.experiments.chris_hybrid_2021.test_run. This creates a graph of 526 jobs.

I run this benchmark on a local computer with extremely fast FS, to separate the influence of slow FS at this point.

I would expect that the startup time takes a few millisecs, not more.

It took around 21 seconds the first time I tried this.
Now, via #84, #85, #87, it takes around 14 seconds for me.
For #85, I use GRAPH_WORKER = 1.

So, this issue here is to discuss why this specific case is still so slow, and what we can potentially do to improve it.
I guess other cases will be different, so maybe we should discuss them separately, to not mix things up. When the FS is slow, I think we can also improve a lot. I would still expect that this runs in maybe 2-3 seconds max. But let's discuss that separately.


For profiling, I currently use austin, and then I visualize the output in VSCode. Not sure what you would recommend instead. I run:

echo n | sudo austin ./sis m config/exp2022_06_15_hybrid.py > sample7.austin

Or just timing:

time echo n | ./sis m config/exp2022_06_15_hybrid.py

`worker_helper`: use `sys.executable` to run subproc

Currently we have:

        # Make sure Sisyphus is called again with the same python executable.
        # Adding the executable in front of the call could cause problems with the worker_wrapper
        if shutil.which(os.path.basename(sys.executable)) != sys.executable:
            os.environ["PATH"] = os.path.dirname(sys.executable) + ":" + os.environ["PATH"]
...
        call = sys.argv
...
        subprocess.check_call(call, stdout=logfile, stderr=logfile)

This failed for me:

    line: subprocess.check_call(call, stdout=logfile, stderr=logfile)
    locals:
      subprocess = <global> <module 'subprocess' from '/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/subprocess.py'>
      subprocess.check_call = <global> <function check_call at 0x7f67aed9f600>
      call = <local> ['sis', 'worker', '--engine', 'short', 'work/i6_core/returnn/training/ReturnnTrainingJob.Mw6ETRkehAUq', 'create_files', '1'], len = 7
      stdout = <not found>

Because I started Sisyphus this way:

/work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11 sis m recipe/i6_experiments/users/zeyer/experiments/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30.py

Where sis is a file in the current directory. It is not a file anywhere in PATH. So that's why running sis m ... directly does not work.

A workaround is to start Sisyphus this way:

/work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11 ./sis m recipe/i6_experiments/users/zeyer/experiments/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30.py

Then call becomes ["./sis", ...], which works. (Even though there could be cases where this would be wrong as well, as this hacking of PATH is not always correct. E.g. the Sis executable uses the shebang #!/usr/bin/env python3, and python3 might still be a different Python executable and environment.)

Anyway, I wonder: instead of having this PATH hack, why not use sys.executable directly, as is commonly done elsewhere? E.g. we have this:

#: Which command should be called to start sisyphus, can be used to replace the python binary
SIS_COMMAND = [sys.executable, sys.argv[0]]
# if this first argument is -m it's missing the module name
if sys.argv[0] == "-m":
    SIS_COMMAND += ["sisyphus"]

So, I don't really understand the comment "Adding the executable in front of the call could cause problems with the worker_wrapper". What problems? I think we should do this anyway to have it correct here and to avoid this hack.
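A sketch of the proposed alternative (an assumption about the change, not the current worker_helper code): build the call from the running interpreter plus the started script, just like SIS_COMMAND, instead of patching PATH.

import sys

# Re-invoke the same script with the same interpreter; in worker_helper this
# would be the worker call, so no PATH manipulation is needed.
call = [sys.executable, sys.argv[0]] + sys.argv[1:]
print(call)  # e.g. ['/work/.../bin/python3.11', './sis', 'worker', ...]
# subprocess.check_call(call, stdout=logfile, stderr=logfile) would then use this call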

More then one matching SLURM task

[2023-12-18 15:13:42,425] INFO: queue(5) runnable(1) running(10) waiting(2191)                                                                          
[2023-12-18 15:13:42,448] INFO: Submit to queue: work/i6_core/returnn/training/ReturnnTrainingJob.Mxt0Fq5EAPMJ run [1]                                  
[2023-12-18 15:13:42,469] INFO: Submitted with job_id: ['3767058'] i6_core.returnn.training.ReturnnTrainingJob.Mxt0Fq5EAPMJ.run                         
[2023-12-18 15:13:43,234] INFO: Experiment directory: /u/zeyer/setups/combined/2021-05-31      Call: ./sis m recipe/i6_experiments/users/zeyer/experimen
ts/exp2023_04_25_rf/i6.py                                                                                                                               
...
[2023-12-18 15:13:43,408] INFO: queue(6) running(10) waiting(2191)
[2023-12-18 16:07:58,088] INFO: Experiment directory: /u/zeyer/setups/combined/2021-05-31      Call: ./sis m recipe/i6_experiments/users/zeyer/experimen
ts/exp2023_04_25_rf/i6.py
[2023-12-18 16:07:58,114] INFO: running: Job<alias/exp2023_04_25_rf/chunked_aed_import/chunk-C20-R15-H2-bs22k/train work/i6_core/returnn/training/Return
nTrainingJob.yr9RPZ4KpDXG> {ep 32/2000} 
[2023-12-18 16:07:58,129] INFO: running: Job<alias/exp2023_04_25_rf/chunked_ctc/chunk-C20-R15-H2-11gb-f32-bs8k-wrongLr-accgrad1-mgpu4-p100/train work/i6
_core/returnn/training/ReturnnTrainingJob.nGxCOVlNDroS> {ep 13/500} 
[2023-12-18 16:07:58,145] INFO: running: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/v6-11gb-f32-bs15k-accgrad1-mgpu4-pavg100-lrlin1e
_5_295k/train work/i6_core/returnn/training/ReturnnTrainingJob.6KHl6rvV9hvx> {ep 24/500} 
[2023-12-18 16:07:58,164] INFO: running: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/v6-11gb-f32-bs15k-accgrad1-mgpu4-pavg100-wd1e_4-
lrlin1e_5_295k-run2/train work/i6_core/returnn/training/ReturnnTrainingJob.6PFOVJ9fX451> {ep 24/500} 
[2023-12-18 16:07:58,186] INFO: running: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/v6-11gb-f32-bs15k-accgrad1-mgpu4-pavg100-wd1e_4-
lrlin1e_5_295k-run3/train work/i6_core/returnn/training/ReturnnTrainingJob.gddacwOynWbN> {ep 24/500} 
[2023-12-18 16:07:58,206] INFO: running: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/v6-11gb-f32-bs15k-accgrad1-mgpu4-pavg100-wd1e_5-
lrlin1e_5_295k/train work/i6_core/returnn/training/ReturnnTrainingJob.IeBoLNDUwgWo> {ep 23/500} 
[2023-12-18 16:07:58,235] INFO: running: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/v6-11gb-f32-bs15k-accgrad1-mgpu4-pavg1000-wd1e_4
-lrlin1e_5_295k/train work/i6_core/returnn/training/ReturnnTrainingJob.RrU9BMvLf8gm> {ep 196/500} 
[2023-12-18 16:07:58,292] INFO: running: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/v6-11gb-f32-bs15k-accgrad1-mgpu4-pavg500-wd1e_4-
lrlin1e_5_295k/train work/i6_core/returnn/training/ReturnnTrainingJob.ltAtV1rNBhC6> {ep 202/500} 
[2023-12-18 16:07:58,383] INFO: running: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/v6-11gb-f32-bs15k-accgrad1-mgpu4-wd1e_4-lrlin1e_
5_295k/train work/i6_core/returnn/training/ReturnnTrainingJob.PzLC0RKyvMLA> {ep 491/500} 
[2023-12-18 16:07:58,600] INFO: running: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/v6-11gb-f32-bs15k-accgrad4-mgpu2/train work/i6_c
ore/returnn/training/ReturnnTrainingJob.Iy77vquwmeqX> {ep 1162/2000} 
[2023-12-18 16:07:58,600] INFO: runnable: Job<alias/exp2023_04_25_rf/chunked_ctc/chunk-C20-R15-H2-bs22k/train work/i6_core/returnn/training/ReturnnTrain
ingJob.jMkDWpu5R0VV>
[2023-12-18 16:07:58,600] INFO: runnable: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6-lrlin1e_5_100k/train work/i6_core/
returnn/training/ReturnnTrainingJob.l2dwBB9n7TqS>
[2023-12-18 16:07:58,600] INFO: runnable: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6-lrlin1e_5_20k/train work/i6_core/r
eturnn/training/ReturnnTrainingJob.1Bar8z0wWtjq>
[2023-12-18 16:07:58,600] INFO: runnable: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6-lrlin1e_5_50k/train work/i6_core/r
eturnn/training/ReturnnTrainingJob.por8G6TNlO9E>
[2023-12-18 16:07:58,600] INFO: runnable: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6-lrlin1e_5_600k/train work/i6_core/
returnn/training/ReturnnTrainingJob.VAf79gjGIxyn>
[2023-12-18 16:07:58,600] INFO: runnable: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6-warmup100k/train work/i6_core/retu
rnn/training/ReturnnTrainingJob.Mxt0Fq5EAPMJ>
[2023-12-18 16:07:58,600] INFO: runnable(6) running(10) waiting(2191)
[2023-12-18 16:07:58,603] INFO: Submit to queue: work/i6_core/returnn/training/ReturnnTrainingJob.Mxt0Fq5EAPMJ run [1]                                  
[2023-12-18 16:07:58,617] INFO: Submitted with job_id: ['3767268'] i6_core.returnn.training.ReturnnTrainingJob.Mxt0Fq5EAPMJ.run                         
[2023-12-18 16:07:58,893] INFO: Submit to queue: work/i6_core/returnn/training/ReturnnTrainingJob.l2dwBB9n7TqS run [1]                                  
[2023-12-18 16:07:58,896] INFO: Submit to queue: work/i6_core/returnn/training/ReturnnTrainingJob.1Bar8z0wWtjq run [1]
[2023-12-18 16:07:58,896] INFO: Submit to queue: work/i6_core/returnn/training/ReturnnTrainingJob.VAf79gjGIxyn run [1]
[2023-12-18 16:07:58,904] INFO: Submit to queue: work/i6_core/returnn/training/ReturnnTrainingJob.por8G6TNlO9E run [1]
[2023-12-18 16:07:58,907] INFO: Submit to queue: work/i6_core/returnn/training/ReturnnTrainingJob.jMkDWpu5R0VV run [1]
[2023-12-18 16:07:58,911] INFO: Submitted with job_id: ['3767269'] i6_core.returnn.training.ReturnnTrainingJob.l2dwBB9n7TqS.run
[2023-12-18 16:07:58,911] INFO: Submitted with job_id: ['3767270'] i6_core.returnn.training.ReturnnTrainingJob.VAf79gjGIxyn.run
[2023-12-18 16:07:58,912] INFO: Submitted with job_id: ['3767271'] i6_core.returnn.training.ReturnnTrainingJob.1Bar8z0wWtjq.run
[2023-12-18 16:07:58,918] INFO: Submitted with job_id: ['3767272'] i6_core.returnn.training.ReturnnTrainingJob.por8G6TNlO9E.run
[2023-12-18 16:07:58,920] INFO: Submitted with job_id: ['3767273'] i6_core.returnn.training.ReturnnTrainingJob.jMkDWpu5R0VV.run
[2023-12-18 16:07:59,951] WARNING: More then one matching SLURM task, use first match < ('i6_core.returnn.training.ReturnnTrainingJob.l2dwBB9n7TqS.run',
 1) > matches: [('3765376', 'PENDING'), ('3767269', 'PENDING')]
[2023-12-18 16:07:59,954] WARNING: More then one matching SLURM task, use first match < ('i6_core.returnn.training.ReturnnTrainingJob.VAf79gjGIxyn.run',
 1) > matches: [('3765374', 'PENDING'), ('3767270', 'PENDING')]
[2023-12-18 16:07:59,956] WARNING: More then one matching SLURM task, use first match < ('i6_core.returnn.training.ReturnnTrainingJob.jMkDWpu5R0VV.run',
 1) > matches: [('3764451', 'PENDING'), ('3767273', 'PENDING')]
[2023-12-18 16:07:59,976] WARNING: More then one matching SLURM task, use first match < ('i6_core.returnn.training.ReturnnTrainingJob.1Bar8z0wWtjq.run',
 1) > matches: [('3765375', 'PENDING'), ('3767271', 'PENDING')]
[2023-12-18 16:07:59,977] WARNING: More then one matching SLURM task, use first match < ('i6_core.returnn.training.ReturnnTrainingJob.Mxt0Fq5EAPMJ.run',
 1) > matches: [('3767058', 'PENDING'), ('3767268', 'PENDING')]
[2023-12-18 16:07:59,996] WARNING: More then one matching SLURM task, use first match < ('i6_core.returnn.training.ReturnnTrainingJob.por8G6TNlO9E.run',
 1) > matches: [('3765373', 'PENDING'), ('3767272', 'PENDING')]
[2023-12-18 16:08:01,587] INFO: Experiment directory: /u/zeyer/setups/combined/2021-05-31      Call: ./sis m recipe/i6_experiments/users/zeyer/experimen
ts/exp2023_04_25_rf/i6.py

This is the same message as in #156, but I think the bug (problem) here is different: from the log, it seems there is no "Error to submit job". Also, the problem in #156 is already fixed by #157, and I have seen multiple times that #157 is indeed working as intended.

Maybe it is relevant that when I interrupted the manager here, I got the crash from #164.

Black

Do we want to enforce black here, just like in most other i6 repos? Same settings, 120 char line limit?

More then one matching SLURM task

[2023-11-23 12:24:05,584] ERROR: Error to submit job                                                                                           
[2023-11-23 12:24:05,584] ERROR: SBATCH command: sbatch -J i6_core.returnn.forward.ReturnnForwardJobV2.9mzWqrCWqG0E.run -o work/i6_core/returnn
/forward/ReturnnForwardJobV2.9mzWqrCWqG0E/engine/%x.%A.%a --mail-type=None --mem=6G --gres=gpu:1 --cpus-per-task=2 --time=240 --export=all -p g
pu_11gb -x cn-505 -a 1-1:1 --wrap=/work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11 ./sis worker --engine long work/i6_core/return
n/forward/ReturnnForwardJobV2.9mzWqrCWqG0E run                                                                                                 
[2023-11-23 12:24:05,585] ERROR: Output: Submitted batch job 3149172                                                                           
[2023-11-23 12:24:05,585] ERROR: Error: sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying                           
0 1 4 [b'Submitted', b'batch', b'job'] [b'Submitted', b'batch', b'job']                                                                        
[2023-11-23 12:24:05,586] ERROR: Error to submit job                                                                                           
[2023-11-23 12:24:05,586] ERROR: SBATCH command: sbatch -J i6_core.returnn.forward.ReturnnForwardJobV2.xsJCcIe5eUm7.run -o work/i6_core/returnn
/forward/ReturnnForwardJobV2.xsJCcIe5eUm7/engine/%x.%A.%a --mail-type=None --mem=6G --gres=gpu:1 --cpus-per-task=2 --time=240 --export=all -p g
pu_11gb -x cn-505 -a 1-1:1 --wrap=/work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11 ./sis worker --engine long work/i6_core/return
n/forward/ReturnnForwardJobV2.xsJCcIe5eUm7 run                                                                                                 
[2023-11-23 12:24:05,586] ERROR: Output: Submitted batch job 3149173                                                                           
[2023-11-23 12:24:05,586] ERROR: Error: sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying                           
0 1 4 [b'Submitted', b'batch', b'job'] [b'Submitted', b'batch', b'job']                                                                        
[2023-11-23 12:24:05,587] ERROR: Error to submit job                                                                                           
[2023-11-23 12:24:05,587] ERROR: SBATCH command: sbatch -J i6_core.returnn.forward.ReturnnForwardJobV2.vinrAG5lsR8a.run -o work/i6_core/returnn
/forward/ReturnnForwardJobV2.vinrAG5lsR8a/engine/%x.%A.%a --mail-type=None --mem=6G --gres=gpu:1 --cpus-per-task=2 --time=240 --export=all -p g
pu_11gb -x cn-505 -a 1-1:1 --wrap=/work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11 ./sis worker --engine long work/i6_core/return
n/forward/ReturnnForwardJobV2.vinrAG5lsR8a run                                                                                                 
[2023-11-23 12:24:05,587] ERROR: Output: Submitted batch job 3149174                                                                           
[2023-11-23 12:24:05,587] ERROR: Error: sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying                           
...
[2023-11-23 12:24:48,575] WARNING: More then one matching SLURM task, use first match < ('i6_core.returnn.forward.ReturnnForwardJobV2.q6p2w4D9C
SYW.run', 1) > matches: [('3149158', 'PENDING'), ('3149194', 'PENDING')]                                 
[2023-11-23 12:24:48,580] WARNING: More then one matching SLURM task, use first match < ('i6_core.returnn.forward.ReturnnForwardJobV2.IAkRxmOdW
HHA.run', 1) > matches: [('3149156', 'PENDING'), ('3149197', 'PENDING')]                                 
[2023-11-23 12:24:48,597] WARNING: More then one matching SLURM task, use first match < ('i6_core.returnn.forward.ReturnnForwardJobV2.lkURo50Hf
4d2.run', 1) > matches: [('3149155', 'PENDING'), ('3149198', 'PENDING')]                                 
[2023-11-23 12:24:48,636] WARNING: More then one matching SLURM task, use first match < ('i6_core.returnn.forward.ReturnnForwardJobV2.iaKfugRIW
bb7.run', 1) > matches: [('3149157', 'PENDING'), ('3149200', 'PENDING')]                                 
[2023-11-23 12:24:48,665] WARNING: More then one matching SLURM task, use first match < ('i6_core.returnn.forward.ReturnnForwardJobV2.jDBAArsaJ
aWc.run', 1) > matches: [('3149159', 'PENDING'), ('3149199', 'PENDING')]                                 
[2023-11-23 12:24:48,675] WARNING: More then one matching SLURM task, use first match < ('i6_core.returnn.forward.ReturnnForwardJobV2.QaE4jVxus
6xu.run', 1) > matches: [('3149161', 'PENDING'), ('3149196', 'PENDING')]                                 
[2023-11-23 12:24:48,708] WARNING: More then one matching SLURM task, use first match < ('i6_core.returnn.forward.ReturnnForwardJobV2.YOMu5k2SC
FZi.run', 1) > matches: [('3149160', 'PENDING'), ('3149195', 'PENDING')]                                 

I assume that Sisyphus somehow did not get the full output status, then incorrectly assumed that the jobs don't exist, and resubmitted them.

The fix would be to never submit jobs when the current job status is unknown.

@Atticus1806 also reported this.

worker_helper removes worker_wrapper on localengine

I use (on bare metal) a virtualenv to run sisyphus. Then I use a worker_wrapper function to run some tasks with a singularity image. Inside the image the virtualenv does not exist any more, but the default python within the image has all required modules.

original call:

['/path/to/virtualenv/bin/python', '/home/wmichel/software/sisyphus/sis', 'worker', 'work/path/to/MyJob.a0OOf4GEhHZ8', 'my_task']

after worker_wrapper:

['singularity', 'run', '/path/to/singularity_image.sif', 'python3', '/home/wmichel/software/sisyphus/sis', 'worker', 'work/path/to/MyJob.a0OOf4GEhHZ8', 'my_task']

Now the task my_task is a mini_task, i.e. the local engine is used.
The manager finds the runnable task:

DEBUG: Found new task: TaskQueueInstance(call=['singularity', 'run', '/path/to/singularity_image.sif', 'python3', '/home/wmichel/software/sisyphus/sis', 'worker', '--engine', 'short', 'work/path/to/MyJob.a0OOf4GEhHZ8', 'my_task', '1', '--redirect_output'], logpath='work/path/to/MyJob.a0OOf4GEhHZ8/engine', rqmt={'cpu': 1, 'mem': 1.0, 'time': 1.0, 'engine': 'short', 'engine_info': 'my_pc.local', 'engine_name': 'local'}, name='path/to/MyJob.a0OOf4GEhHZ8.my_task', task_name='my_task', task_id=1)

But then after the lines

sisyphus/sisyphus/worker.py

Lines 186 to 189 in 5f1373a

argv = sys.argv[sys.argv.index(gs.CMD_WORKER):]
del argv[argv.index('--redirect_output')]
call = gs.SIS_COMMAND + argv

the wrapped singularity call is reset to the original

['/path/to/virtualenv/bin/python', '/home/wmichel/software/sisyphus/sis', 'worker', '--engine', 'short', 'work/path/to/MyJob.a0OOf4GEhHZ8/engine', 'my_task', '1']

which fails, as the task requires the singularity image.

So, are these lines needed? Can we remove them, or add another worker_wrapper call?

HTTP-Server templates are missing from pip installation

In my singularity container I installed sisyphus via pip (directly from github). But the templates seem to be excluded, i.e. I get:
ls: cannot access '/usr/local/lib/python3.8/dist-packages/sisyphus/templates': No such file or directory

This results in:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 2070, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1515, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1513, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1499, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/usr/local/lib/python3.8/dist-packages/sisyphus/http_server.py", line 78, in decorated_function
    resp = make_response(f(*args, **kwargs))
  File "/usr/local/lib/python3.8/dist-packages/sisyphus/http_server.py", line 108, in output_view
    return render_template("outputs.html", outputs=outputs)
  File "/usr/local/lib/python3.8/dist-packages/flask/templating.py", line 148, in render_template
    ctx.app.jinja_env.get_or_select_template(template_name_or_list),
  File "/usr/local/lib/python3.8/dist-packages/jinja2/environment.py", line 1081, in get_or_select_template
    return self.get_template(template_name_or_list, parent, globals)
  File "/usr/local/lib/python3.8/dist-packages/jinja2/environment.py", line 1010, in get_template
    return self._load_template(name, globals)
  File "/usr/local/lib/python3.8/dist-packages/jinja2/environment.py", line 969, in _load_template
    template = self.loader.load(self, name, self.make_globals(globals))
  File "/usr/local/lib/python3.8/dist-packages/jinja2/loaders.py", line 126, in load
    source, filename, uptodate = self.get_source(environment, name)
  File "/usr/local/lib/python3.8/dist-packages/flask/templating.py", line 59, in get_source
    return self._get_source_fast(environment, template)
  File "/usr/local/lib/python3.8/dist-packages/flask/templating.py", line 95, in _get_source_fast
    raise TemplateNotFound(template)
jinja2.exceptions.TemplateNotFound: outputs.html
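A possible direction for a fix (a sketch only; the actual project packaging may differ, and the templates directory path is an assumption): declare the templates as package data so that pip ships them.

from setuptools import setup, find_packages

setup(
    name="sisyphus",
    packages=find_packages(),
    # ship the HTTP server templates with the installed package
    package_data={"sisyphus": ["templates/*"]},
    include_package_data=True,
)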

Issue with auto-completion in PyCharm

Since upgrading from PyCharm 2019.1 to 2020.1, it no longer correctly resolves the constructor calls; instead, each Job constructor is treated as the __call__ function of the class JobSingleton, so that there is an unknown return type and unknown parameters, which is slightly annoying. Does anybody else have the same issue? And is there a way to fix this? I am not sure how the metaclasses are handled, and what alternatives there are.

Error in cleaner: sis_graph.targets set changed size during iteration

Since updating to the current master (from a version in November), I recently get the error message below, when I try to run the cleaner.
I have no sisyphus sessions running anywhere, so I think this is very odd.

In [1]: tk.cleaner(clean_job_dir=True, clean_work_dir=True, mode='remove')
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-1-7338018b81e2> in <module>()
----> 1 tk.cleaner(clean_job_dir=True, clean_work_dir=True, mode='remove')

~/src/sisyphus/sisyphus/toolkit.py in cleaner(clean_job_dir, clean_work_dir, mode, keep_value, only_remove_current_graph)
    446     # the output of unfinished jobs or belong to the output
    447     needed = set()
--> 448     for target in sis_graph.targets:
    449         for path in target.required:
    450             active_paths[os.path.abspath(os.path.join(path.get_path()))] = path

RuntimeError: Set changed size during iteration

I don't have time to investigate this right now, so I was wondering if anyone has seen this before or has any idea what could cause this.
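For reference, the generic pattern that avoids this class of error (not a confirmed fix for the cleaner itself) is to iterate over a snapshot of the set:

# self-contained illustration: mutating a set is safe while looping over a snapshot
targets = {"out/a", "out/b"}
for target in list(targets):       # list(...) takes a snapshot
    targets.add(target + ".copy")  # adding to the original set no longer breaks the loop
print(sorted(targets))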

Deadlock in for_all_nodes

zeyer@indigo ~> py-spy dump --pid 17326
Process 17326: python3 ./python-tf.bin demo-import-tensorflow.py
Python v3.8.0 (/work/tools/asr/python/3.8.0/bin/python3.8)

Thread 17326 (idle): "MainThread"
    for_all_nodes (sisyphus/graph.py:569)
    set_job_targets (sisyphus/graph.py:686)
    __init__ (sisyphus/manager.py:218)
    manager (sisyphus/manager.py:98)
    main (sisyphus/__main__.py:220)
    <module> (sis:14)
    _run_code (runpy.py:85)
    _run_module_code (runpy.py:95)
    run_path (runpy.py:262)
    child_run (preloaded/utils.py:45)
    server_handle_child (preloaded/fork_server.py:138)
    server_main (preloaded/fork_server.py:99)
    startup_via_fork_server (preloaded/startup.py:68)
    <module> (python-tf.bin:11)
Thread 17351 (idle): "Thread-1"
    run (sisyphus/localengine.py:204)
    wrapped_func (sisyphus/tools.py:308)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 17352 (idle): "Thread-2"
    run (sisyphus/localengine.py:204)
    wrapped_func (sisyphus/tools.py:308)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)

I removed threads which are just waiting for tasks (multiprocessing) but the full log is here.

The deadlock in for_all_nodes is basically in this loop:

        # Check if all jobs are finished
        while len(visited) != finished:
            time.sleep(0.1)

I have a job with update(), not sure if this is relevant here.

I still need to look more deeply into it but maybe you already have an idea?

Suggestion: Database for cache, `_sis_all_path_available` etc

Related to #167, I think it would be nice to have a database where we cache information across runs, such as which jobs have finished and which files are available. Reading that DB would be much faster than checking lots of files on the FS.

I would probably use sqlite.
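A minimal sketch of what such a cache could look like (table layout and file name are assumptions, not an existing Sisyphus feature):

import sqlite3

con = sqlite3.connect("sis_file_cache.db")
con.execute("CREATE TABLE IF NOT EXISTS job_state (job_dir TEXT PRIMARY KEY, finished INTEGER)")
con.execute("INSERT OR REPLACE INTO job_state VALUES (?, ?)",
            ("work/i6_core/tools/download/DownloadJob.0UXAqd5DuQG7", 1))
con.commit()
# the manager could consult this instead of stat-ing thousands of files on a slow FS
print(con.execute("SELECT finished FROM job_state WHERE job_dir = ?",
                  ("work/i6_core/tools/download/DownloadJob.0UXAqd5DuQG7",)).fetchone())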

Sisyphus fails to hash numpy bools


I believe this check fails because it's not using isinstance but checking for the type itself, which leads the code to try to invoke get_object_state, which then fails:

elif type(obj) in (int, float, bool, str, complex):
    byte_list.append(repr(obj).encode())

The numpy bool behaves very similarly to a normal Python bool, therefore I would consider this behavior very surprising, and I believe it should be fixed. Maybe checking for isinstance(obj, bool) as the only check using isinstance is valid? I think this would be a possible fix here. Or we just add all numpy types to the list that is checked there.
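An illustration of the mismatch (assumes numpy is installed):

import numpy as np

x = np.bool_(True)
print(x == True)                                    # True: compares like a Python bool
print(type(x) in (int, float, bool, str, complex))  # False: the exact type() check misses it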

Manager does not stop if there are interrupted jobs without a resume function

Currently the manager does not terminate if there are still interrupted jobs. This is OK, as we might be able to resubmit those. But when those jobs don't have a resume function set, the manager still keeps running. In my particular setup I don't want the jobs to be resubmitted, as they are part of an automated testing pipeline, and failure to finish in the allotted time is a bug that I want to be able to catch.

@critias do you have time to look into this? Thanks!

Structuring the output directory?

Dear maintainers,
I remember that when I was running Sisyphus during my stay at i6, the outputs were structured by the respective config, e.g. output/01_gmm/wer.txt. When I run vanilla Sisyphus (+customized engine) at our lab, this does not happen, i.e. I'd get output/wer.txt.

Is there some flag (which I perhaps just copied from someone's settings, oblivious to its effect) to achieve the structuring?

Log file for finished output

I would often miss the "Finished output: ..." log message from the Sis manager on stdout, as it usually prints quite a lot when a few experiments are running.

A log file with those (human readable timestamp + what output was finished) would be very useful.

I assume this would need to be implemented, so this is a feature request. Maybe you could also already suggest a place where and how to implement it; then I could do a PR myself.
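A minimal sketch of what I have in mind (file name and format are just suggestions, not an existing Sisyphus feature):

import datetime

def log_finished_output(output_path, log_file="finished_outputs.log"):
    """Append a human-readable timestamp plus the finished output path."""
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(log_file, "a") as f:
        f.write(f"[{timestamp}] Finished output: {output_path}\n")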

Path creator can be string or not?

assert not isinstance(creator, str)

assert not isinstance(self.creator, str)

assert(not isinstance(self.creator, str)), "This should only occur during running of worker"

But:

if isinstance(self.creator, str):

elif isinstance(self.creator, str):

if self.creator is None or isinstance(self.creator, str):

If there is really some case where this can happen (one which avoids __init__?), I think this should be better documented in the code with a comment.

`FileExistsError` in `Job._sis_setup_directory`

...
[2023-12-05 04:29:29,576] WARNING: interrupted_resumable: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6/train work/
i6_core/returnn/training/ReturnnTrainingJob.jyQaF3P8Ieol>               
[2023-12-05 04:29:29,576] INFO: interrupted_resumable(1) retry_error(4) running(8) waiting(663)                                                  
[2023-12-05 04:31:04,825] ERROR: Exception in thread <_MainThread(MainThread, started 140708433694720)>:                                         
EXCEPTION                                                                                                                                        
Traceback (most recent call last):
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/tools.py", line 311, in default_handle_exception_interrupt_main_thread.<locals>.wrapped_func
    line: return func(*args, **kwargs)                                                                                                           
    locals:                                                             
      func = <local> <function Manager.run at 0x7ff93ae65940>                                                                                    
      args = <local> (<Manager(Thread-2, initial)>,)                                                                                             
      kwargs = <local> {}                                                                                                                        
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/manager.py", line 617, in Manager.run                   
    line: self.resume_jobs()                                                                                                                     
    locals:                                                              
      self = <local> <Manager(Thread-2, initial)>                                                                                                
      self.resume_jobs = <local> <bound method Manager.resume_jobs of <Manager(Thread-2, initial)>>                                              
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/manager.py", line 405, in Manager.resume_jobs                                
    line: self.thread_pool.map(f, self.jobs.get(gs.STATE_INTERRUPTED_RESUMABLE, []))
    locals:                                                                                                                                      
      self = <local> <Manager(Thread-2, initial)>
      self.thread_pool = <local> <multiprocessing.pool.ThreadPool state=RUN pool_size=10>                                                        
      self.thread_pool.map = <local> <bound method Pool.map of <multiprocessing.pool.ThreadPool state=RUN pool_size=10>>                         
      f = <local> <function Manager.resume_jobs.<locals>.f at 0x7ff92c3c0180>                                                                    
      self.jobs = <local> defaultdict(<class 'set'>, {'waiting': {Job<work/i6_core/returnn/search/SearchWordsToCTMJob.sHh83NBWaNtR>, Job<work/i6_
core/recognition/scoring/ScliteJob.TThbUgE8qjSd>, Job<work/i6_core/returnn/forward/ReturnnForwardJobV2.uX3OywkERCr7>, Job<work/i6_core/returnn/se
arch/SearchRemoveLabelJob.cakEqUo..., len = 5, _[0]: {len = 0}           
      self.jobs.get = <local> <built-in method get of collections.defaultdict object at 0x7ff8ac589c60>                                          
      gs = <global> <module 'sisyphus.global_settings' from '/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/global_settings.py'>    
      gs.STATE_INTERRUPTED_RESUMABLE = <global> 'interrupted_resumable', len = 21                                                                
  File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/pool.py", line 367, in Pool.map                        
    line: return self._map_async(func, iterable, mapstar, chunksize).get()                                                                       
    locals:                                                                                                                                      
      self = <local> <multiprocessing.pool.ThreadPool state=RUN pool_size=10>                                                                    
      self._map_async = <local> <bound method Pool._map_async of <multiprocessing.pool.ThreadPool state=RUN pool_size=10>>                       
      func = <local> <function Manager.resume_jobs.<locals>.f at 0x7ff92c3c0180>                                                                 
      iterable = <local> {Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6/train work/i6_core/returnn/training/Returnn
TrainingJob.jyQaF3P8Ieol>}, len = 1                                                                                                              
      mapstar = <global> <function mapstar at 0x7ff93b797d80>                                                                                    
      chunksize = <local> None                                                                                                                   
      get = <not found>                                                 
  File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/pool.py", line 774, in ApplyResult.get
    line: raise self._value
    locals:
      self = <local> <multiprocessing.pool.MapResult object at 0x7ff8ac510a90>
      self._value = <local> FileExistsError(17, 'File exists')
  File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    line: result = (True, func(*args, **kwds))
    locals:
      result = <local> None
      func = <local> None
      args = <local> None
      kwds = <local> None
  File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/pool.py", line 48, in mapstar
    line: return list(map(*args))
    locals:
      list = <builtin> <class 'list'>
      map = <builtin> <class 'map'>
      args = <local> (<function Manager.resume_jobs.<locals>.f at 0x7ff92c3c0180>, (Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_3
0/base-24gb-v6/train work/i6_core/returnn/training/ReturnnTrainingJob.jyQaF3P8Ieol>,))
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/manager.py", line 399, in Manager.resume_jobs.<locals>.f
    line: job._sis_setup_directory(force=True)
    locals:
      job = <local> Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6/train work/i6_core/returnn/training/ReturnnTraini
ngJob.jyQaF3P8Ieol>
      job._sis_setup_directory = <local> <bound method Job._sis_setup_directory of Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30
/base-24gb-v6/train work/i6_core/returnn/training/ReturnnTrainingJob.jyQaF3P8Ieol>>
      force = <not found>
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/job.py", line 284, in Job._sis_setup_directory
    line: os.symlink(src=os.path.abspath(str(creator._sis_path())), dst=link_name, target_is_directory=True)
    locals:
      os = <global> <module 'os' (frozen)>
      os.symlink = <global> <built-in function symlink>
      src = <not found>
      os.path = <global> <module 'posixpath' (frozen)>
      os.path.abspath = <global> <function abspath at 0x7ff93be74f40>
      str = <builtin> <class 'str'>
      creator = <local> Job<work/i6_core/text/label/subword_nmt/train/ReturnnTrainBpeJob.vTq56NZ8STWt>
      creator._sis_path = <local> <bound method Job._sis_path of Job<work/i6_core/text/label/subword_nmt/train/ReturnnTrainBpeJob.vTq56NZ8STWt>>
      dst = <not found>
      link_name = <local> 'work/i6_core/returnn/training/ReturnnTrainingJob.jyQaF3P8Ieol/input/i6_core_text_label_subword_nmt_train_ReturnnTrainBpeJob.vTq56NZ8STWt', len = 136
      target_is_directory = <not found>
FileExistsError: [Errno 17] File exists: '/u/zeyer/setups/combined/2021-05-31/work/i6_core/text/label/subword_nmt/train/ReturnnTrainBpeJob.vTq56N
Z8STWt' -> 'work/i6_core/returnn/training/ReturnnTrainingJob.jyQaF3P8Ieol/input/i6_core_text_label_subword_nmt_train_ReturnnTrainBpeJob.vTq56NZ8S
TWt'
[2023-12-05 04:31:05,077] WARNING: Main thread exit. Still running non-daemon threads: {<LocalEngine(Thread-1, started 140708182750784)>}

This is the first time I see this. Probably a very rare issue.

After a restart of the manager, I don't see the problem anymore.
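For reference, a generic sketch of how such a race could be tolerated (not the actual Sisyphus code): accept an already existing symlink if it points to the intended target.

import os

def symlink_if_missing(src, link_name):
    try:
        os.symlink(src, link_name, target_is_directory=True)
    except FileExistsError:
        # another thread may have created the link concurrently; accept it only if it matches
        if not (os.path.islink(link_name) and os.readlink(link_name) == src):
            raise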

Asserting manager vs worker

Is there an intended way in Sisyphus to check if code is run via the manager or via the worker? I would like to add some assert calls to specific code parts that should never be called from the manager, but only from inside jobs.

slurm RPC count

Dear sisyphus developers,

I noticed that some HPC users using the sisyphus engine are performing a large number of Slurm RPCs (squeue) at once on the RWTH HPC Cluster.
At large scales this can cause issues for the slurmctld, where the daemon hangs and the stability of the system is endangered.
Would it be possible to change sisyphus/engine.py so that there is a delay (configurable by the user if you wish) between calls of self.task_state(task), instead of the current uncapped for loop (there is currently no throttling)?
This function calls squeue in the background if I understand correctly.
When Slurm commands are called in an automated fashion, the slurmctld will have issues handling uncapped RPCs (This is an issue of Slurm/SchedMD).
A delay of 3 to 5 seconds is preferred (perhaps configurable for the user).
Example:

for task in job._sis_tasks():
    if not task.finished():
        time.sleep(5.0)
        return self.task_state(task)

Or alternatively have self.task_state(task) check all ids at once with a single squeue call?
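For illustration, querying all job ids with a single squeue call could look like this (a sketch only, not existing engine code):

import subprocess

job_ids = ["3149155", "3149156", "3149157"]
out = subprocess.check_output(
    ["squeue", "--noheader", "--format=%i %T", "--jobs", ",".join(job_ids)],
    text=True,
)
states = dict(line.split() for line in out.splitlines() if line)
print(states)  # e.g. {'3149155': 'PENDING', ...}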
This is a very important change that would help us maintain a good uptime on the slurmctld.
This would also avoid us having to warn every user using your engine.

Thank you for your help on this matter.

Sisyphus Autoclean with Symlinks

When importing jobs from other work directories as symlinks, sisyphus tries to autoclean them, causing a lot of warnings and spamming the console. Is there a workaround or fix besides turning off autoclean?

Configurable hash of functions and classes

Currently, the hash of functions and classes is defined via (obj.__module__, obj.__qualname__) (see sis_hash_helper).

This has the same problem as the one being solved in rwth-i6/i6_experiments#157, namely that the module name is usually some string like i6_experiments.users.zeyer.experiments.exp2023_04_25_rf._moh_att_2023_06_30_import.map_param_func_v2, i.e. it includes the user name and maybe other things which are not really relevant, so moving the class or function later on will change the hash. rwth-i6/i6_experiments#157 solves this via an explicit unhashed_package_root option.

I want to have similar logic for any function or class in Sisyphus. I almost never want the existing hashing logic. Maybe the exception is for functions or classes in i6_core, which would hopefully not be moved anymore.

For backwards compatibility, the existing logic probably should stay though? Even by default? Or just disallow it?

I have many ideas how to solve this. Here are some:

  • We can introduce some __unhashed_package_root__ (and maybe also __extra_hash__) global variables for a module. When Sisyphus sis_hash_helper stumbles upon some function or class, it would go reversely through the module hierarchy (i6_experiments.users.zeyer.experiments.exp2023_04_25_rf._moh_att_2023_06_30_import, i6_experiments.users.zeyer.experiments.exp2023_04_25_rf, i6_experiments.users.zeyer.experiments, i6_experiments.users.zeyer, ...) and check for these global variables; if found, it would replace the module name up to that point by __extra_hash__ (or the empty string). (A rough sketch of this follows after the list.)

  • Similarly as above, but just to disallow hashing in any case: __disallow_hash_root__: bool or so. That way, you do not accidentally end up with such function or class in the hash. The user must explicitly use some wrapper object (this also needs to be implemented yet, maybe in i6_core or i6_exp/common).

  • A whitelist of allowed module prefixes, and anything else would fail with an exception (if the whitelist is defined, otherwise, as currently, all would be allowed).
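A rough sketch of the first idea (all special names like __unhashed_package_root__ are proposals from this issue, not existing Sisyphus features, and the exact semantics are up for discussion):

import importlib

def hashed_name(obj):
    """Sketch: return the name used for hashing, stripping any marked unhashed prefix."""
    parts = obj.__module__.split(".")
    # walk reversely through the module hierarchy, from the full module name upwards
    for i in range(len(parts), 0, -1):
        mod = importlib.import_module(".".join(parts[:i]))
        if getattr(mod, "__unhashed_package_root__", False):
            rest = ".".join(parts[i:])
            return f"{rest}.{obj.__qualname__}" if rest else obj.__qualname__
    return f"{obj.__module__}.{obj.__qualname__}"  # current behavior as fallback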

Use config file from recipes directly

Consider i6_experiments.users....experiments....

In many cases, I don't need to have a separate config file. Or such a config file would basically only do an import, maybe call some run() function, and nothing else. However, in principle, I would want to directly do this:

sis m recipe/i6_experiments/...

But Sisyphus complains that it must be in the config directory.

Can we add support for that?

AssertionError: Only runnable jobs can list needed tasks

I just got this:

...
[2024-01-09 22:54:58,572] ERROR: error: Job<work/i6_core/returnn/forward/ReturnnForwardJobV2.uJVC0MQcnILG>                                                      
[2024-01-09 22:54:58,572] ERROR: error: Job<work/i6_core/returnn/forward/ReturnnForwardJobV2.ugQXmf07iqFo>                                                      
[2024-01-09 22:54:58,572] ERROR: error: Job<work/i6_core/returnn/forward/ReturnnForwardJobV2.vs5I2fwLg7W0>                                                      
...
[2024-01-09 22:54:59,027] INFO: error(51) queue(5) running(7) waiting(838)                                                                                      
Clear jobs in error state? [y/N] y
[2024-01-09 22:55:22,023] WARNING: Clearing: Job<work/i6_core/returnn/forward/ReturnnForwardJobV2.uFPYnu3DCyrE>
[2024-01-09 22:55:22,159] INFO: Move: work/i6_core/returnn/forward/ReturnnForwardJobV2.uFPYnu3DCyrE to work/i6_core/returnn/forward/ReturnnForwardJobV2.uFPYnu3DCyrE.cleared.0002
[2024-01-09 22:55:22,244] WARNING: Clearing: Job<work/i6_core/returnn/forward/ReturnnForwardJobV2.ronBYRCWAKNU>
[2024-01-09 22:55:22,264] INFO: Move: work/i6_core/returnn/forward/ReturnnForwardJobV2.ronBYRCWAKNU to work/i6_core/returnn/forward/ReturnnForwardJobV2.ronBYRCWAKNU.cleared.0002
...
[2024-01-09 22:55:23,374] WARNING: Clearing: Job<work/i6_core/returnn/forward/ReturnnForwardJobV2.m77LwAXPRniE>
[2024-01-09 22:55:23,418] INFO: Move: work/i6_core/returnn/forward/ReturnnForwardJobV2.m77LwAXPRniE to work/i6_core/returnn/forward/ReturnnForwardJobV2.m77LwAXPRniE.cleared.0002
[2024-01-09 22:55:23,437] ERROR: Exception in thread <_MainThread(MainThread, started 140556651528192)>:
EXCEPTION
Traceback (most recent call last):
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/tools.py", line 311, in default_handle_exception_interrupt_main_thread.<locals>.wrapped_func
    line: return func(*args, **kwargs)
    locals:
      func = <local> <function Manager.run at 0x7fd5e3f81800>
      args = <local> (<Manager(Thread-2, initial)>,)
      kwargs = <local> {}
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/manager.py", line 574, in Manager.run
    line: if not self.startup():
    locals:
      self = <local> <Manager(Thread-2, initial)>
      self.startup = <local> <bound method Manager.startup of <Manager(Thread-2, initial)>>
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/manager.py", line 535, in Manager.startup
    line: maybe_clear_state(gs.STATE_ERROR, self.clear_errors_once, clear_error)
    locals:
      maybe_clear_state = <local> <function Manager.startup.<locals>.maybe_clear_state at 0x7fd577fc39c0>
      gs = <global> <module 'sisyphus.global_settings' from '/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/global_settings.py'>
      gs.STATE_ERROR = <global> 'error'
      self = <local> <Manager(Thread-2, initial)>
      self.clear_errors_once = <local> False
      clear_error = <local> <function Manager.startup.<locals>.clear_error at 0x7fd5760ce2a0>
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/manager.py", line 532, in Manager.startup.<locals>.maybe_clear_state
    line: action()
    locals:
      action = <local> <function Manager.startup.<locals>.clear_error at 0x7fd5760ce2a0>
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/manager.py", line 516, in Manager.startup.<locals>.clear_error
    line: self.clear_states(state=gs.STATE_ERROR)
    locals:
      self = <local> <Manager(Thread-2, initial)>
      self.clear_states = <local> <bound method Manager.clear_states of <Manager(Thread-2, initial)>>
      state = <global> 'input_missing', len = 13
      gs = <global> <module 'sisyphus.global_settings' from '/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/global_settings.py'>
      gs.STATE_ERROR = <global> 'error'
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/manager.py", line 249, in Manager.clear_states
    line: job._sis_move()
    locals:
      job = <local> Job<work/i6_core/returnn/forward/ReturnnForwardJobV2.m77LwAXPRniE>
      job._sis_move = <local> <bound method Job._sis_move of Job<work/i6_core/returnn/forward/ReturnnForwardJobV2.m77LwAXPRniE>>
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/job.py", line 801, in Job._sis_move
    line: for t in self._sis_tasks():
    locals:
      t = <not found>
      self = <local> Job<work/i6_core/returnn/forward/ReturnnForwardJobV2.m77LwAXPRniE>
      self._sis_tasks = <local> <bound method Job._sis_tasks of Job<work/i6_core/returnn/forward/ReturnnForwardJobV2.m77LwAXPRniE>>
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/job.py", line 822, in Job._sis_tasks
    line: assert False, "Only runnable jobs can list needed tasks"
    locals:
       no locals
AssertionError: Only runnable jobs can list needed tasks
[2024-01-09 22:55:23,809] WARNING: Main thread exit. Still running non-daemon threads: {<LocalEngine(Thread-1, started 140556329707072)>}

After a restart of the Sisyphus manager, it again asked for clearing jobs, I again typed y, and then there was no error anymore.

extend file caching

I was thinking about extending the file caching in Sisyphus. I would like to make an interface available so that the user can read and write using local file caching.

The idea would be to extend Sisyphus to support three user-defined functions. One is already available via file_caching. The other two would support copying between remote and local paths.

The Sisyphus AbstractPath object would then need additional functions to facilitate remote/local copy operations.

This would leave the actual usage of file caching in the hands of the user because this would need to be implemented inside the jobs. This gives the user more direct control over the file caching.
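
To make this concrete, a rough sketch of how the three settings.py hooks could look; only file_caching exists in Sisyphus today, the other two names (copy_to_cache, copy_from_cache) are placeholders from this proposal:

import os
import shutil

def file_caching(path):
    # existing hook: rewrite a path string for cached read access
    return path

def copy_to_cache(remote_path, local_cache_dir):
    # proposed hook: copy an input from the shared filesystem to a node-local cache
    local_path = os.path.join(local_cache_dir, os.path.basename(remote_path))
    shutil.copy(remote_path, local_path)
    return local_path

def copy_from_cache(local_path, remote_path):
    # proposed hook: copy an output from the node-local cache back to the shared filesystem
    shutil.copy(local_path, remote_path)

Inside a job, the AbstractPath object could then offer something like path.cached() / path.flush() (names again just placeholders) that delegate to these hooks.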

@curufinwe @critias Do you think this is a good approach? Or do you think it would be better to abstract everything away? So hide inside Sisyphus? I have extended the API to support the above idea but wanted to discuss the general approach before making a pull request.

`Task(parallel=n)` has surprising behavior

Consider the scenario yield Task("run", args=list(range(100)), parallel=5). The current behavior is that Sisyphus will submit at most 5 runners into the queue in parallel, and each of these runners will then run 20 sub jobs (within a single queue submission).

I think this is very surprising for the following reasons:

  1. The rqmt needs to take this batching into account, and accordingly the job time needs to be increased. This makes the limited parallelism no longer an "implementation detail".
  2. You do not get finished-markers for the individual subjobs, but only once all sub jobs in one submitted Q runner have finished. This means failing sub jobs can not be re-run individually, but only on a per-batch basis.

I would have expected Sisyphus to individually submit runners running a single sub job each, but hold their submission to the Q back as needed (just like with jobs whose dependencies haven't finished running yet). Maybe there is a good reason why it is done this way, and I'd love to know.

Mark job as "create but do not execute"

This is more a feature request than an issue. I often want to run my jobs until I reach a specific job, where I want to manually check all inputs/code etc. before actual execution. Also, sometimes I want to delete broken jobs but not directly run them again, while still keeping them in the graph.

So what do you think about having a kind of "job.wait()" function that adds a "wait" file in the job folder, marking that the job should be created but no task should run yet? Am I the only one who would be using this?
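
Hypothetical usage of the proposed feature (neither job.wait() nor the "wait" marker file exist in Sisyphus yet; the names are just a suggestion):

from sisyphus import Job, Task, tk

class DemoJob(Job):
    def __init__(self, value):
        self.value = value
        self.out = self.output_path("out.txt")

    def tasks(self):
        yield Task("run", mini_task=True)

    def run(self):
        with open(self.out.get_path(), "w") as f:
            f.write(str(self.value))

job = DemoJob(42)
job.wait()                        # proposed API: would touch <job_dir>/wait
tk.register_output("demo_out", job.out)
# proposed manager behavior: the job (and its dependents) stay in the graph,
# but no task is submitted while the "wait" file exists; removing the file
# would release the job for execution.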

Crash after user interrupt

Sometimes, but not always (maybe 20% of the cases?), when I hit Ctrl+C, I get this crash:

^C[2023-12-18 18:53:21,090] INFO: Got user interrupt signal stop engine and exit
[2023-12-18 18:53:21,090] WARNING: Main thread exit. Still running non-daemon threads: {<LocalEngine(Thread-1, started 140176269506112)>}
[2023-12-18 18:53:21,665] ERROR: Exception in thread <DummyProcess(Thread-12 (worker), started daemon 140175636158016)>:
[2023-12-18 18:53:21,666] ERROR: Exception in thread <DummyProcess(Thread-18 (worker), started daemon 140175107679808)>:
[2023-12-18 18:53:21,734] ERROR: Exception in thread <DummyProcess(Thread-14 (worker), started daemon 140175619372608)>:
[2023-12-18 18:53:21,734] ERROR: Exception in thread <DummyProcess(Thread-7 (worker), started daemon 140176156243520)>:
[2023-12-18 18:53:21,734] ERROR: Exception in thread <DummyProcess(Thread-6 (worker), started daemon 140176164636224)>:
[2023-12-18 18:53:21,776] ERROR: Exception in thread <DummyProcess(Thread-15 (worker), started daemon 140175610979904)>:
[2023-12-18 18:53:21,817] ERROR: Exception in thread <DummyProcess(Thread-3 (worker), started daemon 140176189814336)>:
[2023-12-18 18:53:21,858] ERROR: Exception in thread <DummyProcess(Thread-9 (worker), started daemon 140176139458112)>:
[2023-12-18 18:53:21,858] ERROR: Exception in thread <DummyProcess(Thread-4 (worker), started daemon 140176181421632)>:
[2023-12-18 18:53:21,859] ERROR: Exception in thread <DummyProcess(Thread-13 (worker), started daemon 140175627765312)>:
EXCEPTION
Traceback (most recent call last):
(Exclude vars because we are exiting.) 
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/tools.py", line 311, in default_handle_exception_interrupt_main_thread.<locals>.wrap
ped_func
EXCEPTION
Traceback (most recent call last):
[2023-12-18 18:53:21,859] ERROR: Exception in thread <DummyProcess(Thread-11 (worker), started daemon 140175644550720)>:
EXCEPTION
Traceback (most recent call last):
EXCEPTION
    line: return func(*args, **kwargs)
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/graph.py", line 570, in SISGraph.for_all_nodes.<locals>.runner_helper
    line: runner(path.creator)
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/graph.py", line 547, in SISGraph.for_all_nodes.<locals>.runner
EXCEPTION
(Exclude vars because we are exiting.) 
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/tools.py", line 311, in default_handle_exception_interrupt_main_thread.<locals>.wrap
ped_func
    line: return func(*args, **kwargs)
EXCEPTION
EXCEPTION
Traceback (most recent call last):
Traceback (most recent call last):
(Exclude vars because we are exiting.) 
EXCEPTION
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/graph.py", line 570, in SISGraph.for_all_nodes.<locals>.runner_helper
    line: runner(path.creator)
EXCEPTION
Traceback (most recent call last):
Traceback (most recent call last):
(Exclude vars because we are exiting.) 
(Exclude vars because we are exiting.) 
...
    line: self._check_running()                                                                                                                         
  File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/pool.py", line 353, in Pool._check_running                     
    line: raise ValueError("Pool not running")                                                                                                          ValueError: Pool not running                                                                                                                            
    line: self._check_running()
  File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/pool.py", line 353, in Pool._check_running
Exception ignored in atexit callback: <function shutdown at 0x7f7d659ae5c0>                                                                             
  File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/pool.py", line 458, in Pool.apply_async                        
    line: self._check_running()                                                                                                                         
EXCEPTION                                                                                                                                               
Traceback (most recent call last):                                                                                                                      
EXCEPTION                                                                                                                                               
Traceback (most recent call last):                                                                                                                      
(Exclude vars because we are exiting.)                                                                                                                  
    line: raise ValueError("Pool not running")                                                                                                          
  File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/pool.py", line 353, in Pool._check_running                     
Exception ignored in sys.unraisablehook: <built-in function unraisablehook>
(Exclude vars because we are exiting.)
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/tools.py", line 311, in default_handle_exception_interrupt_main_thread.<locals>.wrapped_func                                                                                                                                                
KeyboardInterrupt
Fatal Python error: _enter_buffered_busy: could not acquire lock for <_io.BufferedWriter name='<stderr>'> at interpreter shutdown, possibly due to daemon threads
Python runtime state: finalizing (tstate=0x00007f7d668932d8)                                                                                            
                                                                                                                                                        
Current thread 0x00007f7d66080000 (most recent call first):                                                                                             
  <no Python frame>                                                                                                                                     
                                                                                                                                                        
Extension modules: psutil._psutil_linux, psutil._psutil_posix, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, h5py._errors, h5py.defs, h5py._objects, h5py.h5, h5py.utils, h5py.h5t, h5py.h5s, h5py.h5ac, h5py.h5p, h5py.h5r, h5py._proxy, h5py._conv, h5py.h5z, h5py.h5a, h5py.h5d, h5py.h5ds, h5py.h5g, h5py.h5i, h5py.h5f, h5py.h5fd, h5py.h5pl, h5py.h5o, h5py.h5l, h5py._selector, markupsafe._speedups, _cffi_backend (total: 41)
fish: Job 2, '/work/tools/users/zeyer/py-envs…' terminated by signal SIGABRT (Abort)     

WARNING:root:Settings file 'settings.py' does not exist, ignoring it

With our new i6 setups, a RETURNN config file often directly imports model definitions or other things from the i6_experiments repository. That can look like this:

...
import os
import sys

sys.path.insert(0, "/u/zeyer/setups/combined/2021-05-31/recipe")
sys.path.insert(1, "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus")

...

from i6_experiments.users.zeyer.experiments.exp2023_04_25_rf.conformer_import_moh_att_2023_06_30 import (
    from_scratch_model_def as _model_def,
)
from i6_experiments.users.zeyer.experiments.exp2023_04_25_rf.train import (
    _returnn_v2_get_model as get_model,
)
from i6_experiments.users.zeyer.experiments.exp2023_04_25_rf.conformer_import_moh_att_2023_06_30 import (
    from_scratch_training as _train_def,
)
from i6_experiments.users.zeyer.experiments.exp2023_04_25_rf.train import (
    _returnn_v2_train_step as train_step,
)

As the i6_experiments code mixes Sisyphus-level logic with the model definitions imported here, this ends up doing an import sisyphus inside of RETURNN. That's why you see both sys.path.insert calls there.

But sisyphus is just imported here but otherwise not really used. However, it still leads to this warning:

WARNING:root:Settings file 'settings.py' does not exist, ignoring it ([Errno 2] No such file or directory: 'settings.py').

I think we should not print this warning unless the settings are actually used.
The way this is designed in Sisyphus makes this tricky, though.
However, a simple thing we could do is check sys.modules['__main__'].__file__ to see whether the main script is actually a Sisyphus executable (and not, e.g., RETURNN), and only print the warning in that case.
Or what do you think?
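
A minimal sketch of such a check; the entry point names "sis"/"sisyphus" and the exact place in global_settings.py where this would go are assumptions:

import logging
import os
import sys


def _is_sisyphus_main():
    """Heuristic: is the running main script a Sisyphus executable?"""
    main_mod = sys.modules.get("__main__")
    main_file = getattr(main_mod, "__file__", None)
    return main_file is not None and os.path.basename(main_file) in ("sis", "sisyphus")


def warn_missing_settings(settings_file="settings.py"):
    # only warn when Sisyphus itself is the main program,
    # not when sisyphus is merely imported, e.g. inside a RETURNN config
    if _is_sisyphus_main() and not os.path.exists(settings_file):
        logging.warning("Settings file '%s' does not exist, ignoring it", settings_file)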

Or maybe you say this is really intended, even for a plain import sisyphus, wherever you do that?
One workaround on the RETURNN side (or wherever else you might have an import sisyphus):
Add something like this before the import:

os.environ["SIS_GLOBAL_SETTINGS_FILE"] = ""

So we could add this to the RETURNN config. But it's also somewhat ugly.

update_engine_rqmt in settings can not be pickled

I defined an update_engine_rqmt function in my settings.py, which is in the root of my setup.
Now I get this pickling error:

...
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/manager.py", line 406, in Manager.run_jobs.<locals>.f
    line: job._sis_setup_directory() 
    locals: 
      job = <local> Job<work/i6_core/returnn/search/ReturnnSearchJobV2.ooVjHv7YzCd9> 
      job._sis_setup_directory = <local> <bound method Job._sis_setup_directory of Job<work/i6_core/returnn/search/ReturnnSearchJobV2.ooVjHv7YzCd9>> 
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/job.py", line 290, in Job._sis_setup_directory 
    line: pickle.dump(self, f) 
    locals: 
      pickle = <global> <module 'pickle' from '/work/tools/asr/python/3.8.0/lib/python3.8/pickle.py'> 
      pickle.dump = <global> <built-in function dump> 
      self = <local> Job<work/i6_core/returnn/search/ReturnnSearchJobV2.ooVjHv7YzCd9> 
      f = <local> <gzip on 0x7fbcdc4bd190> 
PicklingError: Can't pickle <function update_engine_rqmt at 0x7fbcde8b90d0>: import of module 'settings.py' failed 

This is because update_global_settings_from_file and update_global_settings_from_text do not properly import the module but instead manually do compile and exec. In particular, the __name__ in the globals used for the exec is set incorrectly. Otherwise I think it might have worked already (for the case of settings.py in the root dir, at least).

I wonder, is setting update_engine_rqmt in the user settings not supported? Or how would that work?

And also, should we maybe fix update_global_settings_from_text a bit, that it would at least set __name__ correctly for this case?
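
A sketch of how the settings file could be imported as a proper module instead of exec'ing its source, so that functions defined in it (e.g. update_engine_rqmt) get a resolvable __module__ and can be pickled; how this would be wired into update_global_settings_from_file is an assumption:

import importlib.util
import sys


def load_settings_module(path="settings.py", module_name="settings"):
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module  # needed so pickle can re-import it later
    spec.loader.exec_module(module)
    return module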

Better manager logging

Hi all! When I work with sisyphus, I usually have many interdependencies between files and functions, and the jobs my manager creates are on average two to three calls below the main configuration file. As a result, finding out which job failed, and the code surrounding the failing job, is a frustrating task, since sisyphus doesn't provide a stack trace of where the failure comes from, as opposed to running the bare setup.

I thought of this, and I came to the conclusion that I could probably debug faster if I got at least the path and line which made my job break. For instance, instead of what we have currently:

[2023-06-08 13:22:20,259] ERROR: error: Job<job-dir>

I think logging the information as follows would be much more efficient for the programmer:

[2023-06-08 13:22:20,259] ERROR: Job<job-dir> (<py-that-defines-job>:<specific-line>)

I don't know if this is already implemented, but I checked sisyphus's help and found nothing related, only log_level. My question would be: is this implemented? And if not, would it be much effort to implement?
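
A sketch of one way this could be done: remember where in the config/recipe code a job was created, so that the manager could later log "<job> (<file>:<line>)". Hooking this into Job.__init__ and the error report is an assumption, not existing behavior, and the "/recipe/"-"/config/" heuristic is just an example:

import traceback


def remember_creation_site(job):
    for frame in reversed(traceback.extract_stack()):
        if "/recipe/" in frame.filename or "/config/" in frame.filename:
            job._creation_site = f"{frame.filename}:{frame.lineno}"
            break


# later, when reporting an error:
# logging.error("error: %s (%s)", job, getattr(job, "_creation_site", "unknown"))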

Sisyphus sometimes hangs for a while (several minutes) at startup

I see this:

$ python3.11 ./sis m recipe/i6_experiments/users/zeyer/experiments/exp2023_04_25_rf/i6.py                                                                                                                                       
[2024-01-04 09:59:01,199] INFO: Loaded config: recipe/i6_experiments/users/zeyer/experiments/exp2023_04_25_rf/i6.py (loaded module: i6_experiments.users.zeyer.experiments.exp2023_04_25_rf.i6)                                                                                                                                 
[2024-01-04 09:59:01,199] INFO: Config loaded (time needed: 10.46)                                                                                              
[2024-01-04 09:59:21,744] INFO: Finished updating job states                    
[2024-01-04 09:59:21,749] INFO: Experiment directory: /u/zeyer/setups/combined/2021-05-31      Call: ./sis m recipe/i6_experiments/users/zeyer/experiments/exp2023_04_25_rf/i6.py
[2024-01-04 09:59:21,749] ERROR: error: Job<work/i6_core/returnn/forward/ReturnnForwardJobV2.9gyOjplgn71p>                                                      
[2024-01-04 09:59:21,749] ERROR: error: Job<work/i6_core/returnn/forward/ReturnnForwardJobV2.QQVFmfUHS3vO>
...
[2024-01-04 09:59:22,478] INFO: running: Job<alias/exp2023_04_25_rf/espnet/v6-11gb-f32-bs8k-accgrad1-mgpu4-pavg100-wd1e_4-lrlin1e_5_558k-EBranchformer-testDynGradAccumV2/train work/i6_core/returnn/training/ReturnnTrainingJob.eOr3dSwsdCDH> {ep 2/500}
[2024-01-04 09:59:22,478] INFO: error(8) queue(4) running(11) waiting(1230)     
Clear jobs in error state? [y/N]        
Print verbose overview (v), update aliases and outputs (u), start manager (y), or exit (n)? y   

# hang here now...

[2024-01-04 10:10:50,891] INFO: Experiment directory: /u/zeyer/setups/combined/2021-05-31      Call: ./sis m recipe/i6_experiments/users/zeyer/experiments/exp2023_04_25_rf/i6.py
[2024-01-04 10:10:50,891] ERROR: error: Job<work/i6_core/returnn/forward/ReturnnForwardJobV2.9gyOjplgn71p>
...

Using py-spy, I see that the main thread hangs here:

Thread 2408628 (idle): "MainThread"
    for_all_nodes (sisyphus/graph.py:579)
    get_jobs_by_status (sisyphus/graph.py:478)
    update_jobs (sisyphus/manager.py:241)
    run (sisyphus/manager.py:593)
    wrapped_func (sisyphus/tools.py:311)
    manager (sisyphus/manager.py:138)
    main (sisyphus/__main__.py:234)
    <module> (sis:14)

Looking at the code, I should see something related to for_all_nodes with get_unfinished_jobs in the other threads, but I don't see anything like that.

In the other threads, I see lots of instances in for_all_nodes via the JobCleaner, though. That job cleanup check seems to take a while. Maybe that blocks the other for_all_nodes? But shouldn't I see at least something in the other threads about anything get_jobs_by_status related?

See the full py-spy dump log here: https://gist.github.com/albertz/b4a3fa7c5140dfecb5d67ebe30bf0eff

WAIT_PERIOD_JOB_FS_SYNC might be ignored due to Node time differences

We are currently investigating a problem where potentially the following task was started before the previous task had finished writing its files.
From our understanding this should be prevented by WAIT_PERIOD_JOB_FS_SYNC=30 in the global settings. But after having a look at the code, we found that if one of the tasks is a minitask and the other runs on a cluster node, the system time difference between both nodes might be bigger than 30 seconds, which would then effectively bypass this constant.
While we do not really have a solution here, we just wanted to make you aware of the problem. Do you maybe have an idea how to solve this?
cc. @christophmluscher @dthulke JanRosendahl

`Task.run`, wait for inputs

Some of my recog jobs fail because they don’t find the checkpoint:

[2023-10-22 14:05:41,761] INFO: Inputs:
[2023-10-22 14:05:41,761] INFO: /u/zeyer/setups/combined/2021-05-31/tools/returnn
[2023-10-22 14:05:41,761] INFO: /work/tools/users/zeyer/linuxbrew/opt/[email protected]/bin/python3.11
[2023-10-22 14:05:41,761] INFO: /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/oggzip/BlissToOggZipJob.NSdIHfk1iw2M/output/out.ogg.zip
[2023-10-22 14:05:41,761] INFO: /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.jYUaQ42BWrqA/output/models/epoch.240.pt
[2023-10-22 14:05:41,761] WARNING: Input path does not exist: /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.jYUaQ42BWrqA/output/models/epoch.240.pt
...
[2023-10-22 14:05:41,762] INFO: ------------------------------------------------------------
[2023-10-22 14:05:41,762] INFO: Starting subtask for arg id: 0 args: []
[2023-10-22 14:05:41,762] INFO: ------------------------------------------------------------
[2023-10-22 14:05:41,777] INFO: Run time: 0:00:00 CPU: 215.80% RSS: 70MB VMS: 213MB 
reformatted ...
[2023-10-22 14:05:41,950] ERROR: Job failed, traceback:
...
  File "/u/zeyer/setups/combined/2021-05-31/recipe/i6_core/returnn/forward.py", line 307, in ReturnnForwardJobV2.create_files
    line: assert os.path.exists( 
              _get_model_path(self.model_checkpoint).get_path()
          ), f"Provided model checkpoint does not exists: {self.model_checkpoint}"
...
AssertionError: Provided model checkpoint does not exists: /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.jYUaQ42BWrqA/output/models/epoch.240.pt

then it ends up in this state:

interrupted_not_resumable: Job<work/i6_core/returnn/forward/ReturnnForwardJobV2.3UMtLSktznsC>

My initial suggestion was to add some wait time in the ReturnnForwardJobV2 (see rwth-i6/i6_core#464), but @michelwi suggested to implement this more generic, so this issue here is to discuss about this.

I also found this code:

sisyphus/sisyphus/task.py

Lines 140 to 146 in cbf529c

# each input must be at least X seconds old
# if an input file is too young it's may not synced in a network filesystem yet
try:
    input_age = time.time() - os.stat(i.get_path()).st_mtime
    time.sleep(max(0, gs.WAIT_PERIOD_MTIME_OF_INPUTS - input_age))
except FileNotFoundError:
    logging.warning("Input path does not exist: %s" % i.get_path())

Maybe we should actually change this code to not just print this warning ("Input path does not exist") but instead wait in this case?

I really don't like the WAIT_PERIOD_JOB_FS_SYNC. I want to get rid of any arbitrary sleeps before doing anything. See also #146. Ideally, if sth is ready, it should directly execute the next task without any delay.

Then, the task itself can check if inputs are available, and if not, wait a bit. Thus, in the ideal case, it would directly run, and only if not available, it would wait a bit.
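
A minimal sketch of that behavior: start immediately, and only wait if an input is not visible yet (e.g. due to NFS lag). The timeout and poll interval values are arbitrary assumptions:

import logging
import os
import time


def wait_for_inputs(input_paths, timeout=120.0, poll_interval=5.0):
    start = time.time()
    missing = [p for p in input_paths if not os.path.exists(p)]
    while missing and time.time() - start < timeout:
        logging.warning("Waiting for input paths: %s", missing)
        time.sleep(poll_interval)
        missing = [p for p in input_paths if not os.path.exists(p)]
    return not missing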

Originally posted by @albertz in rwth-i6/i6_core#464 (comment), after discussion with @michelwi

Passing function as input parameter to Job

When passing a function as input to a Job, I get the error AttributeError: 'ConvertPytorchToReturnnJob' object has no attribute '_sis_alias_prefixes' at runtime of the job. This behavior is independent of actually using the function in the job or excluding the function from the hash computation. Is passing a function as input to a sisyphus job for some reason not supported, or is there a different way of passing a function as a job's input? The function passed to the job was a simple dummy function: def test(): return 0

If required I can provide a minimal example.
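
A hypothetical minimal reconstruction of the setup (the original ConvertPytorchToReturnnJob is not shown here; job and output names are placeholders):

from sisyphus import Job, Task


def test():
    return 0


class PassFunctionJob(Job):
    def __init__(self, func):
        self.func = func                       # a plain function as job input
        self.out = self.output_path("out.txt")

    def tasks(self):
        yield Task("run", mini_task=True)

    def run(self):
        with open(self.out.get_path(), "w") as f:
            f.write(str(self.func()))


job = PassFunctionJob(test)  # reported to fail at job runtime with the AttributeError above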

typo in var name

log_file.write('\n' + ('#'*80) + '\nRETRY OR CONTINUE TASK\n' + ('#'*80) + '\n\n')

log_file is the string with the log file name and logfile is the actual file object. The wrong one seems to be called.

/u/luescher/dev/sisyphus_20181119/sisyphus/worker.py in worker_helper(args=Namespace(config_files=[], engine='short', force...redirect_output=True, task_id=1, task_name='run'))
    180         with open(log_file, 'a') as logfile:
    181             if is_not_first:
--> 182                 log_file.write('\n' + ('#'*80) + '\nRETRY OR CONTINUE TASK\n' + ('#'*80) + '\n\n')
        log_file.write = undefined
    183             subprocess.check_call(call, stdout=logfile, stderr=logfile)
    184         return

AttributeError: 'str' object has no attribute 'write'
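
The fix is presumably just to write to the file object instead of the file-name string, i.e. the surrounding worker.py code with only the buggy line changed:

with open(log_file, 'a') as logfile:
    if is_not_first:
        logfile.write('\n' + ('#' * 80) + '\nRETRY OR CONTINUE TASK\n' + ('#' * 80) + '\n\n')
    subprocess.check_call(call, stdout=logfile, stderr=logfile)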

Wait time after (or between) mini jobs

My recognition + scoring + summarizing results pipeline contains a number of very simple mini jobs which run one after another. The wait time after a mini job is annoying. I don't want to have any wait time between two consecutive mini jobs.

I'm not sure how others think about this (excluding those who just don't care). What are the reasons for having this wait time? I'm specifically talking about mini jobs, i.e. jobs run via the LocalEngine.

I haven't looked too deep into the code yet. Some things I consider changing:

  • Make an exception for LocalEngine and ignore WAIT_PERIOD_JOB_FS_SYNC, WAIT_PERIOD_JOB_CLEANUP and some others for this.
  • Interrupt the time.sleep(gs.WAIT_PERIOD_BETWEEN_CHECKS) when some LocalEngine job finishes.

Maybe such changes are not wanted by everyone, so then this should be optional and configurable somehow?
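
As a stop-gap, these wait periods can at least be shortened globally in settings.py (the values below are just an example; this affects all engines, not only the LocalEngine, which is exactly why a per-engine exception would be nicer):

# settings.py
WAIT_PERIOD_BETWEEN_CHECKS = 5   # seconds between manager scheduling rounds
WAIT_PERIOD_JOB_FS_SYNC = 1      # seconds to wait for filesystem sync after a job finished
WAIT_PERIOD_JOB_CLEANUP = 1      # seconds to wait before cleaning up a job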
