getzlab / canine
A modular, high-performance computing solution to run jobs using SLURM
Home Page: https://getzlab.github.io/canine/
License: BSD 3-Clause "New" or "Revised" License
After the main batch completes, the orchestrator should make a second attempt at salvaging output files from failed jobs. Jobs which have too many node failures stop getting queued, but may have produced some output files.
I think we could do a couple of things here, either:
- parse the `sacct` dataframe and perform 2nd-chance delocalization on all jobs with a `NODE_FAIL` Slurm state, or
- use `localizer.build_manifest()` to check which jobs have any output files before files are delocalized. Then we can run 2nd-chance delocalization on any jobs which do not appear in the manifest before running `localizer.delocalize()`.
I think this will probably end up looking like:

```python
orchestrator.wait_for_jobs_to_finish()
for job in select_second_chance_jobs():
    backend.invoke(f'SLURM_ARRAY_TASK_ID={job} . setup.sh && delocalize.py')
outputs = localizer.delocalize()
```
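A minimal sketch of what `select_second_chance_jobs` could look like, assuming the orchestrator can already fetch each task's `sacct` state (the mapping shape and function name are assumptions):

```python
def select_second_chance_jobs(acct_states: dict) -> list:
    """Pick array task IDs whose Slurm state indicates a node failure,
    so a second-chance delocalization pass can try to salvage their
    partial outputs. `acct_states` maps task ID -> sacct State string."""
    return [
        task_id
        for task_id, state in acct_states.items()
        if state.startswith("NODE_FAIL")
    ]
```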
We currently `scancel` each shard individually:
Lines 553 to 556 in 5a255c3
For a large job, this is quite slow. We should either use a batch syntax for `scancel` (if supported), or use the `--array` functionality of `sbatch` to avoid dispatching shards individually in the first place.
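If the batch is dispatched as a single array job, `scancel` can already cancel it in one call using its array-expression syntax. A small sketch (the helper name is hypothetical):

```python
def build_cancel_command(job_id, task_ids=None) -> list:
    """Build a single scancel invocation for an array job. Cancelling the
    parent job ID kills every task in the array at once; a subset of tasks
    can be cancelled with the jobid_[list] expression."""
    if task_ids is None:
        return ["scancel", str(job_id)]  # whole array in one call
    spec = ",".join(str(t) for t in sorted(task_ids))
    return ["scancel", f"{job_id}_[{spec}]"]
```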
Line 103 in 01ce0a7
Canine orchestrator should dump an output df to the outputs directory, which should assist in job avoidance and output detection
| JobID | Output Name | Output Files (relpaths) |
|---|---|---|
| 0 | foo | 0/foo/bar |
| 0 | baz | 0/baz/bam |
| 1 | foo | 1/foo/bar |

etc...
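A sketch of building that dataframe, assuming the orchestrator holds outputs as a `{job_id: {output_name: [relpaths]}}` mapping (the input shape and function name are assumptions):

```python
import pandas as pd

def build_output_df(outputs: dict) -> pd.DataFrame:
    """Flatten the per-job output mapping into the long-format manifest
    proposed above: one row per (job, output name, file)."""
    rows = [
        (job_id, name, path)
        for job_id, named_outputs in outputs.items()
        for name, paths in named_outputs.items()
        for path in paths
    ]
    return pd.DataFrame(
        rows, columns=["JobID", "Output Name", "Output Files (relpaths)"]
    )
```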
Is your feature request related to a problem? Please describe.
A lot of dbGaP data is available on buckets via signed URLs that can be generated by having a Terra account linked to the appropriate provider, and using the DRSHub API. Currently, for data hosted by the GDC, we are using the GDC API, which is slow and prone to crashing.
Describe the solution you'd like
Use the above API to get a signed URL, and if it's available, use it rather than the GDC API.
Additional context
Manual testing to confirm it works outside of the Broad network:
% curl --request POST --url "https://drshub.dsde-prod.broadinstitute.org/api/v4/drs/resolve" \
--header "authorization: Bearer $(gcloud auth print-access-token)" \
--header 'content-type: application/json' \
--data '{ "url": "drs://dg.4dfc:7d1726dc-1261-4db9-adea-3adbcb2ffa28", "fields": ["size", "name", "accessUrl", "hashes"] }'
{
"size" : 9434312,
"name" : "7d1726dc-1261-4db9-adea-3adbcb2ffa28/7812f73a-4603-44b4-84c4-cefd5fc654a8_wgs_gdc_realn.bai",
"accessUrl" : {
"url" : "https://gdc-tcga-phs000178-controlled.storage.googleapis.com/7d1726dc-1261-4db9-adea-3adbcb2ffa28/7812f73a-4603-44b4-84c4-cefd5fc654a8_wgs_gdc_realn.bai?x-goog-algorithm=GOOG4-RSA-SHA256&x-goog-credential=dheiman-666%40dcf-prod.iam.gserviceaccount.com%2F20240214%2Fauto%2Fstorage%2Fgoog4_request&x-goog-date=20240214T144326Z&x-goog-expires=3600&x-goog-signedheaders=host&x-goog-signature=6e20a8b207483ba7dfdcf62bcefd15e357157fe379942054e4e6380e30537bacfbc372c4cd0142102941137df831bc27eb699c7b8dad99fbd76b9b9473af38419f69830a32cc1ccee512057c35fc9c512c82576a15c9182489d61511b46f384e1a52a895786cc0f6d8d4e71883e0b24c6d5e90c778d65d8a860b8abb511b7016b2f00cff5ee489aef44bd1c03130e57b20002ed9542c933e01e9ab8211a605a6a6a0b16ae5a8e072b1ad9477c0e7ccb6bb92464f242014d4b387e4f6e3cc5e9d28f455d757f4e43a74dbd9292ddaa1e74558462f87fff7e9e328b87adcc90fe8d007c95abf5ed770cdad1012ea2fcf8c72a22a9c57eb52e6ab16d76111986bd7",
"headers" : null
},
"hashes" : {
"md5" : "5290f15d8e95e2660fe6d15a5f4e9dd9"
}
}
% curl -OJ -H 'Connection: keep-alive' --keepalive-time 2 "https://gdc-tcga-phs000178-controlled.storage.googleapis.com/7d1726dc-1261-4db9-adea-3adbcb2ffa28/7812f73a-4603-44b4-84c4-cefd5fc654a8_wgs_gdc_realn.bai?x-goog-algorithm=GOOG4-RSA-SHA256&x-goog-credential=dheiman-666%40dcf-prod.iam.gserviceaccount.com%2F20240214%2Fauto%2Fstorage%2Fgoog4_request&x-goog-date=20240214T144326Z&x-goog-expires=3600&x-goog-signedheaders=host&x-goog-signature=6e20a8b207483ba7dfdcf62bcefd15e357157fe379942054e4e6380e30537bacfbc372c4cd0142102941137df831bc27eb699c7b8dad99fbd76b9b9473af38419f69830a32cc1ccee512057c35fc9c512c82576a15c9182489d61511b46f384e1a52a895786cc0f6d8d4e71883e0b24c6d5e90c778d65d8a860b8abb511b7016b2f00cff5ee489aef44bd1c03130e57b20002ed9542c933e01e9ab8211a605a6a6a0b16ae5a8e072b1ad9477c0e7ccb6bb92464f242014d4b387e4f6e3cc5e9d28f455d757f4e43a74dbd9292ddaa1e74558462f87fff7e9e328b87adcc90fe8d007c95abf5ed770cdad1012ea2fcf8c72a22a9c57eb52e6ab16d76111986bd7"
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 9213k 100 9213k 0 0 10.6M 0 --:--:-- --:--:-- --:--:-- 10.6M
% ls -la
total 18992
drwxr-xr-x 15 dheiman staff 480 Feb 14 09:40 .
drwxr-xr-x@ 19 dheiman staff 608 Jan 11 14:27 ..
-rw-r--r--@ 1 dheiman staff 9434312 Feb 14 09:40 7812f73a-4603-44b4-84c4-cefd5fc654a8_wgs_gdc_realn.bai
% md5 7812f73a-4603-44b4-84c4-cefd5fc654a8_wgs_gdc_realn.bai
MD5 (7812f73a-4603-44b4-84c4-cefd5fc654a8_wgs_gdc_realn.bai) = 5290f15d8e95e2660fe6d15a5f4e9dd9
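Sketched in Python, only the request construction is shown here; the endpoint and field names are taken from the manual test above, and actually posting the request requires the same Terra-linked bearer token from `gcloud auth print-access-token`:

```python
DRSHUB_ENDPOINT = "https://drshub.dsde-prod.broadinstitute.org/api/v4/drs/resolve"

def build_drshub_request(drs_url: str) -> dict:
    """Build the JSON body for a DRSHub resolve call. When the response
    contains a non-null accessUrl.url, that signed URL can be used for
    download instead of going through the GDC API."""
    return {
        "url": drs_url,
        "fields": ["size", "name", "accessUrl", "hashes"],
    }
```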
Is your feature request related to a problem? Please describe.
When localizing data from the GDC, weird failures can corrupt the file, and retries can't fix it.
Describe the solution you'd like
If the md5sum check fails, delete the file: it has been corrupted, and the only way for subsequent retries to succeed is to start from scratch.
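A minimal sketch of the check-then-delete behavior (the function name is an assumption; the chunked read keeps memory bounded for large files):

```python
import hashlib
import os

def verify_or_delete(path: str, expected_md5: str) -> bool:
    """Compare a localized file against its expected md5. On mismatch the
    file is deleted, so the next retry starts from scratch instead of
    repeatedly failing on a corrupted download."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() == expected_md5:
        return True
    os.remove(path)
    return False
```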
Currently, Canine waits until it actually starts delocalization to check that the delocalization location is valid (e.g., does not already exist, is writable, etc.)
I think it would be better to have these checks happen before the job is dispatched, to avoid the possibility of having a long, expensive job run to completion only to fail at the very end because it couldn't be delocalized.
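A sketch of such a pre-dispatch check (the exact validations Canine performs at delocalization time may differ; names here are assumptions):

```python
import os

def check_delocalization_target(path: str, overwrite: bool = False) -> None:
    """Validate the delocalization destination before dispatching the job,
    so a long, expensive job can't run to completion only to fail at the
    very end."""
    if os.path.exists(path) and not overwrite:
        raise FileExistsError(f"Delocalization target already exists: {path}")
    parent = os.path.dirname(path) or "."
    if os.path.isdir(parent) and not os.access(parent, os.W_OK):
        raise PermissionError(f"Cannot write to delocalization target: {parent}")
```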
On non-transient backends, add an option to clear the slurmctld configuration with:
```
kill $(pgrep slurmctld)
slurmctld -c
slurmctld -f /path/to/slurm.conf
```
Add a pipeline option to allow canine to report pipeline status to an external system
Here, the default compute zone is hardcoded as `us-central1-a`:

Instead, this should be the zone of the current GCE instance. Otherwise, severe performance issues can arise when cluster instances get created across different zones.
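One way to do this is to ask the GCE metadata server, falling back to the old hardcoded default when not running on GCE. A sketch using only the stdlib:

```python
import urllib.request

METADATA_ZONE_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/zone"
)

def parse_zone(metadata_value: str) -> str:
    """The metadata server returns e.g. 'projects/<num>/zones/us-central1-f';
    keep only the final path component."""
    return metadata_value.rsplit("/", 1)[-1]

def current_gce_zone(default: str = "us-central1-a") -> str:
    """Zone of the current GCE instance, or `default` when off-GCE."""
    request = urllib.request.Request(
        METADATA_ZONE_URL, headers={"Metadata-Flavor": "Google"}
    )
    try:
        with urllib.request.urlopen(request, timeout=2) as resp:
            return parse_zone(resp.read().decode())
    except OSError:  # URLError, timeouts, no metadata server
        return default
```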
I've noticed that on linux, there are still lingering issues with the dummy backend, particularly during cleanup. Since the container runs as root, any files written into the bind mount from the container are owned by root and usually write protected. This causes issues when the host tries to clean up the staging directory after killing the cluster.
I think the easiest workaround would be a two-part approach: `chmod -R` in the controller right before cleanup. I like that this doesn't involve the use of `sudo`, which may not be available to a user.
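The controller-side step might be as simple as a recursive `chmod` over the bind-mounted staging directory before teardown; a sketch that just builds the command (how it gets invoked inside the container is an assumption):

```python
import shlex

def cleanup_chmod_command(staging_dir: str) -> str:
    """Command to run as root inside the controller container right before
    cleanup, so the (non-root) host user can delete the bind-mounted
    staging directory without sudo. a+rwX marks only directories and
    already-executable files as executable."""
    return f"chmod -R a+rwX {shlex.quote(staging_dir)}"
```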
Right now, common files are processed in several steps, which is slow for big jobs. It would be significantly faster if the localizer pulled them out of the job spec as they are discovered.
However, the current version indexes common files based on their input location, which allows for auto-detection of common files anywhere within the job spec.
It looks like the version is meant to be the same as the latest release. If so, it may be preferable to use `setuptools_scm`. That way the version will be linked directly to the commit of the latest release tag, and there won't be a need to set it manually in the code (between releases, the commit hash will be appended to the version to make it clear that the current code doesn't match the latest release).
The Job avoidance tests are currently skipped because the system was not fully implemented when unit tests were added.
As mentioned in the GDAN meeting, some people are interested in adding an option to save inputs to node-local storage instead of over the NFS. This is particularly useful for large input files which are only needed once and may clog NFS bandwidth and storage.
In terms of implementation, I think this makes sense as an override which follows the behavior of `Delayed`, except that the file is downloaded to local storage (not over NFS). I'm leaning towards calling this override `Local`, but it is very similar to `Localize`, so it may not be the best choice.
I'm going to try to get to this today or tomorrow, and should have a PR open soon
Currently, if no outputs are found, the only clue is this getting printed to stdout:

```
Expecting value: line 1 column 1 (char 0)
```

presumably from `delocalizer.py`.
We need to make output-related error messages clearer.
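For instance, wherever the delocalizer's JSON output gets parsed, the bare `json` error could be replaced with a message naming the job and the likely cause. A sketch (function and messages are assumptions):

```python
import json

def load_output_manifest(text: str, job_id: str) -> dict:
    """Parse the delocalizer's JSON output, turning the cryptic
    'Expecting value: line 1 column 1 (char 0)' into an actionable error."""
    if not text.strip():
        raise ValueError(
            f"Job {job_id} produced an empty output manifest: "
            "no outputs were found during delocalization"
        )
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(
            f"Malformed output manifest for job {job_id}: {exc}"
        ) from exc
```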
Is there a reason that we copy the directory instead of making a symlink of the directory?
Allows canine to boot a Slurm cluster from a pre-created image. Requirements:
- `/etc/fstab` properly configured in the image to mount the shares. The canine staging directory must be mounted through that share.
- `slurm.conf`
Allow user to specify custom names for array jobs, instead of naming by index.
The Firecloud adapter should default to providing entity names as custom job names, if not provided by the user.
Localization.sh loses its execute bit when transferred over a Google bucket. We can probably just add `chmod 755` to setup.sh to make sure it is restored.
A canine rodisk can be attached with the correct mount point without being usable (error not caught during mount?).
canine/canine/localization/base.py
Lines 1321 to 1324 in 8235701
This happened to me on a disk marked as attached and mounted by two instances, but one is non-readable:
Instance with successful mount:

```
root@jb-2-worker3105:/tmp# ls /mnt/rodisks/canine-scratch-wpa4buy-ypljbha-ewdz3z3pxih0g-0
mybam.bam dupe_metrics.txt
```

Instance without successful mount for the same disk:

```
root@jb-2-worker3103:/tmp# ls /mnt/rodisks/canine-scratch-wpa4buy-ypljbha-ewdz3z3pxih0g-0
ls: reading directory '/mnt/rodisks/canine-scratch-wpa4buy-ypljbha-ewdz3z3pxih0g-0': Input/output error
```
Catching an exit status of 1 from `ls $CANINE_RODISK_DIR` might be a good option after `mountpoint -q`, per https://superuser.com/a/864375 (we don't know how subsequent tasks may react to missing files...).
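A sketch of that extra readability probe; `os.path.ismount` stands in for `mountpoint -q`, and the `ls` exit status catches the Input/output error case shown above:

```python
import os
import subprocess

def rodisk_usable(mount_dir: str) -> bool:
    """True only if mount_dir is both registered as a mount point and
    actually readable; ls exits nonzero on I/O errors even when the
    mount itself looks fine."""
    if not os.path.ismount(mount_dir):
        return False
    result = subprocess.call(
        ["ls", mount_dir],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result == 0
```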
If the user specifies an invalid image/project, or does not have permission to view the specified image, the backend will hang forever at "Waiting for NFS to be ready …"
Right now, the remote half of delocalization (`delocalize.py`) accepts and expands shell variables in the context of the current job; however, matching files will not match in canine's `localizer.delocalize()` because the same variables are not set there. We do have the information to fix this, as the localizer is responsible for setting the environment variables in the first place, but we would have to be careful about using the right path contexts. This would still only allow for variables which were job inputs; other shell variables would not expand properly.
A single 4 TB persistent disk on a 4 core NFS has a maximum throughput of ~400 MB/s. The instance itself will have a network throughput of 8 gigabits per second (1000 MB/s). Recall that the NFS server also runs jobs, so for a large batch submission where we are sure that we will continuously max out a 16 core instance, its throughput would be 32 Gb/s (4000 MB/s).
Thus, for a large number of concurrent workflows, we will need to have multiple NFS servers/disks per NFS server. All disks would be mounted to the controller node/each worker node; each task would be randomly assigned to a disk. (Since Slurm allows us to preferentially assign tasks to specific nodes, it would make sense to put the most I/O intensive jobs on the NFS nodes.)
We will have to very carefully optimize how many jobs a single disk could accommodate. As currently implemented, our pipelines are CPU-bound, not I/O-bound. Localization will likely be the most I/O intensive step; I’ve ballparked localization maxing out at 2-4 concurrent samples per 4 TB drive, assuming sustained bucket throughput of 100-200 MB/s per file. Perhaps we could do something clever with staggering workflow starts so that there aren’t too many concurrent localization steps.
@julianhess and I have both encountered the Dummy backend hanging forever on start after booting all containers. The containers seem to be starting, but for some reason slurmctld isn't actually running.
Also enforce that the NFS localizer staging directory is the same as the local directory.
Currently, Canine can progress quite far if a default project is not set up, and crash with a cryptic stack trace:
https://files.slack.com/files-pri/T1YPV3RLL-FUTTHHZ88/image.png
We should explicitly check whether `self.config["project"] is None` and raise an informative exception.
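A sketch of the fail-fast check (the function name and message are assumptions):

```python
def require_project(config: dict) -> str:
    """Raise an informative error early instead of letting Canine crash
    later with a cryptic stack trace when no project is configured."""
    project = config.get("project")
    if project is None:
        raise ValueError(
            "No GCP project configured. Set 'project' in the Canine "
            "configuration, or set a gcloud default with "
            "'gcloud config set project <PROJECT>'"
        )
    return project
```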
Slurm can track disk usage as a consumable resource, but it only checks for available space before launching a task. This is problematic — if there are 100 GBs remaining on the NFS disk and 20 tasks launching concurrently each require 10 GB, the NFS disk will ultimately fill, since for each task, Slurm will see 100 GB free at launch, and will not continuously monitor each task's disk consumption.
Qing had the idea of using `fallocate` to reserve the full amount of space each job will use ahead of time (as specified by the user). We would then iteratively shrink the file created by `fallocate` by the amount the job's `workspace` directory grows in size.
If the user underestimated the total amount of space used by a job, the job should be killed. Because the monitoring process is backgrounded, we would need to trap a signal sent from it.
Because the overhead of monitoring disk usage can be potentially high (e.g., a workspace directory with many subfolders or many little files), this should be disabled by default and only get activated if the user explicitly requests disk space as a consumable resource. This also means that we should set the default disk CRES in Slurm to 0. Finally, we should caution users that they should only reserve disk space for tasks that have nontrivial output sizes (e.g., localization, aligners, etc.)
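One iteration of the proposed monitor could look like the sketch below: the reservation file (created up front with `fallocate -l`) shrinks as the workspace grows, so reserved plus used space stays at the user's estimate. Names and the truncate-based shrink are assumptions; a real monitor would loop and signal the job once usage exceeds the estimate.

```python
import os

def shrink_reservation(reserve_file: str, workspace: str, total_bytes: int):
    """Shrink the pre-allocated reservation file by however much the job's
    workspace directory has grown. Returns (used, remaining) in bytes."""
    used = sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, files in os.walk(workspace)
        for name in files
    )
    remaining = max(total_bytes - used, 0)
    os.truncate(reserve_file, remaining)
    return used, remaining
```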
Orchestrator will tail stdout/err for a random job, and switch jobs as they finish. Thanks to Phil Montgomery for suggesting this.
Right now the docs and readme are kind of overwhelming. Let's restructure to give the briefest overview, then point readers to other resources.
Canine is growing in complexity and it's about time we add CI. I would like it to be comprehensive, but I'm not sure about how to fudge it. Right now I'm thinking the easiest way would be to develop a dummy backend which would allow for testing of all other layers of the canine library except backends.
```
Exception: Key-exchange timed out waiting for key negotiation
Traceback (most recent call last):
  File "/home/sanand/.conda/envs/scrna/lib/python3.7/site-packages/paramiko/transport.py", line 2088, in run
    self._channel_handler_table[ptype](chan, m)
  File "/home/sanand/.conda/envs/scrna/lib/python3.7/site-packages/paramiko/channel.py", line 1187, in _handle_close
    self.transport._send_user_message(m)
  File "/home/sanand/.conda/envs/scrna/lib/python3.7/site-packages/paramiko/transport.py", line 1860, in _send_user_message
    "Key-exchange timed out waiting for key negotiation"
paramiko.ssh_exception.SSHException: Key-exchange timed out waiting for key negotiation
```
Must either decrease the re-key frequency to avoid ever re-keying, or simply catch the error and refresh the client. Probably both.
Orchestrator should return output in a better format (big dataframe) rather than a bunch of separate outputs
When starting workflow runs through the Prefect dashboard, each workflow runs in a new Python process. Currently this works fine if we run only one workflow at a time; however, to run workflows concurrently, I think we need a way to reuse the backend across different processes.
I think it would be similar to what `canine-transient` does: we can manage the lifetime of a canine backend in a separate process, then reuse that backend from a different process by providing an IP, NFS name, configuration, or similar. Do you see difficulties implementing this, or can this already be achieved by using RemoteBackend somehow? Any other suggestions?
Oftentimes, we want inputs to correspond to arrays of command-line arguments (not just single strings). For example, I might want to run
```
foo --arg1 bar baz --arg2 doo
```
with the following configuration dict:
```yaml
inputs:
  arg1: bar baz
  arg2: doo
script:
  - foo --arg1 $arg1 --arg2 $arg2
```
However, this is incompatible with Canine, since any inputs that contain spaces will be wrapped in single quotes by the stringification here:
and thus Canine will dispatch my job like so:

```
foo --arg1 'bar baz' --arg2 doo
```

which will break many tools out there that expect `--arg1` to be a space-delimited list of multiple arguments, e.g. anything using Python's `argparse` module:

```python
argparse.ArgumentParser().add_argument("arg1", nargs="+")
```
This will also break things if inputs are being used to put together arbitrary chunks of command line. Qing ran into the following:
> I have `gatk: gatk --java-options -Xmx10g`, which Canine will translate into `export gatk="'gatk --java-options ....'"`, and it cannot find `'gatk`
I think this would be best controlled in the override section: if the localization type is `None`, then `prepare_job_inputs` should strip any wrapping single quotes added by stringification here:
This is consistent with how the "null" localization type is already described in the Canine docs, i.e.
`null`: Forces the input to be treated as a plain string. No handling whatsoever will be applied to the input.
What do you think?
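The stripping step itself is tiny; a sketch of what `prepare_job_inputs` could apply to `null`-typed inputs (where exactly it hooks in is an assumption):

```python
def strip_stringification_quotes(value: str) -> str:
    """Undo the wrapping single quotes added by stringification, so that
    shell word-splitting applies to null-typed inputs again."""
    if len(value) >= 2 and value[0] == value[-1] == "'":
        return value[1:-1]
    return value
```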
Currently, `delocalization.py` will create an absolute symlink from the target to the destination:
However, this means that a Canine directory cannot easily be moved, since files in `outputs` are absolute symlinks to files in `jobs`.
If `target` and `dest` are on the same filesystem, I would replace `os.path.abspath(target)` with `os.path.relpath(target, os.path.dirname(dest))`.
As an alternative, this could be done after all the absolute symlinks are created, with something like `find outputs/ ! -type d -exec symlinks -c {} \;`. The advantage of this approach is that `symlinks -c` will check whether the files are on the same filesystem; the disadvantage is that `symlinks` is not a standard utility and would have to be installed as part of Canine.
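The relative-link version needs no external utility; a sketch of the proposed replacement:

```python
import os

def make_relative_symlink(target: str, dest: str) -> None:
    """Create dest as a symlink to target using a path relative to dest's
    directory, so the whole Canine output directory can be moved without
    breaking the links in outputs/."""
    os.symlink(os.path.relpath(target, os.path.dirname(dest)), dest)
```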
Right now, the adapter is a tiny part of the code that almost feels like an appendix to the localizer. Maybe look into merging the two, which may also help with #24.
`gsutil cat $file > fifo`, where `fifo` lives on an NFS filesystem, will not gracefully resume if the NFS server gets preempted.
Aaron says:
> I'm not opposed to having the localizer set a `$CANINE_DOCKER_ARGS` variable that we fill with stuff relevant to dockers, like cgroup stuff or bind-mounting the staging dir. I think we can use that to solve this issue (by putting the fifo somewhere on the host disk and adding a bind-mount to `$CANINE_DOCKER_ARGS`) and solve the issue with `local` overrides, by also adding a bind mount for the mount point to `$CANINE_DOCKER_ARGS`.