getzlab / canine
A modular, high-performance computing solution to run jobs using SLURM
Home Page: https://getzlab.github.io/canine/
License: BSD 3-Clause "New" or "Revised" License
After the main batch completes, the orchestrator should make a second attempt at salvaging output files from failed jobs. Jobs which have too many node failures stop getting queued, but may have produced some output files.
I think we could do a couple of things here, either:
- parse the `sacct` dataframe and perform 2nd-chance delocalization on all jobs with a `NODE_FAIL` Slurm state, or
- use `localizer.build_manifest()` to check which jobs have any output files before files are delocalized. Then we can run 2nd-chance delocalization on any jobs which do not appear in the manifest before running `localizer.delocalize()`.
I think this will probably end up looking like:

```python
orchestrator.wait_for_jobs_to_finish()
for job in select_second_chance_jobs():
    backend.invoke(f'SLURM_ARRAY_TASK_ID={job} . setup.sh && delocalize.py')
outputs = localizer.delocalize()
```
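A minimal sketch of what `select_second_chance_jobs` could look like, assuming the orchestrator can already fetch each task's `sacct` state (the mapping shape and function name are assumptions):

```python
def select_second_chance_jobs(acct_states: dict) -> list:
    """Pick array task IDs whose Slurm state indicates a node failure,
    so a second-chance delocalization pass can try to salvage their
    partial outputs. `acct_states` maps task ID -> sacct State string."""
    return [
        task_id
        for task_id, state in acct_states.items()
        if state.startswith("NODE_FAIL")
    ]
```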
We currently `scancel` each shard individually:
Lines 553 to 556 in 5a255c3
For a large job, this is quite slow. We should either use a batch syntax for `scancel` (if supported), or use the `--array` functionality of `sbatch` to avoid dispatching shards individually in the first place.
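If the batch is dispatched as a single array job, `scancel` can already cancel it in one call using its array-expression syntax. A small sketch (the helper name is hypothetical):

```python
def build_cancel_command(job_id, task_ids=None) -> list:
    """Build a single scancel invocation for an array job. Cancelling the
    parent job ID kills every task in the array at once; a subset of tasks
    can be cancelled with the jobid_[list] expression."""
    if task_ids is None:
        return ["scancel", str(job_id)]  # whole array in one call
    spec = ",".join(str(t) for t in sorted(task_ids))
    return ["scancel", f"{job_id}_[{spec}]"]
```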
Line 103 in 01ce0a7
Canine orchestrator should dump an output df to the outputs directory, which should assist in job avoidance and output detection
| JobID | Output Name | Output Files (relpaths) |
|---|---|---|
| 0 | foo | 0/foo/bar |
| 0 | baz | 0/baz/bam |
| 1 | foo | 1/foo/bar |

etc...
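A sketch of building that dataframe, assuming the orchestrator holds outputs as a `{job_id: {output_name: [relpaths]}}` mapping (the input shape and function name are assumptions):

```python
import pandas as pd

def build_output_df(outputs: dict) -> pd.DataFrame:
    """Flatten the per-job output mapping into the long-format manifest
    proposed above: one row per (job, output name, file)."""
    rows = [
        (job_id, name, path)
        for job_id, named_outputs in outputs.items()
        for name, paths in named_outputs.items()
        for path in paths
    ]
    return pd.DataFrame(
        rows, columns=["JobID", "Output Name", "Output Files (relpaths)"]
    )
```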
Is your feature request related to a problem? Please describe.
A lot of dbGaP data is available on buckets via signed URLs that can be generated by having a Terra account linked to the appropriate provider, and using the DRSHub API. Currently, for data hosted by the GDC, we are using the GDC API, which is slow and prone to crashing.
Describe the solution you'd like
Use the above API to get a signed URL, and if it's available, use it rather than the GDC API.
Additional context
Manual testing to confirm it works outside of the Broad network:
% curl --request POST --url "https://drshub.dsde-prod.broadinstitute.org/api/v4/drs/resolve" \
--header "authorization: Bearer $(gcloud auth print-access-token)" \
--header 'content-type: application/json' \
--data '{ "url": "drs://dg.4dfc:7d1726dc-1261-4db9-adea-3adbcb2ffa28", "fields": ["size", "name", "accessUrl", "hashes"] }'
{
"size" : 9434312,
"name" : "7d1726dc-1261-4db9-adea-3adbcb2ffa28/7812f73a-4603-44b4-84c4-cefd5fc654a8_wgs_gdc_realn.bai",
"accessUrl" : {
"url" : "https://gdc-tcga-phs000178-controlled.storage.googleapis.com/7d1726dc-1261-4db9-adea-3adbcb2ffa28/7812f73a-4603-44b4-84c4-cefd5fc654a8_wgs_gdc_realn.bai?x-goog-algorithm=GOOG4-RSA-SHA256&x-goog-credential=dheiman-666%40dcf-prod.iam.gserviceaccount.com%2F20240214%2Fauto%2Fstorage%2Fgoog4_request&x-goog-date=20240214T144326Z&x-goog-expires=3600&x-goog-signedheaders=host&x-goog-signature=6e20a8b207483ba7dfdcf62bcefd15e357157fe379942054e4e6380e30537bacfbc372c4cd0142102941137df831bc27eb699c7b8dad99fbd76b9b9473af38419f69830a32cc1ccee512057c35fc9c512c82576a15c9182489d61511b46f384e1a52a895786cc0f6d8d4e71883e0b24c6d5e90c778d65d8a860b8abb511b7016b2f00cff5ee489aef44bd1c03130e57b20002ed9542c933e01e9ab8211a605a6a6a0b16ae5a8e072b1ad9477c0e7ccb6bb92464f242014d4b387e4f6e3cc5e9d28f455d757f4e43a74dbd9292ddaa1e74558462f87fff7e9e328b87adcc90fe8d007c95abf5ed770cdad1012ea2fcf8c72a22a9c57eb52e6ab16d76111986bd7",
"headers" : null
},
"hashes" : {
"md5" : "5290f15d8e95e2660fe6d15a5f4e9dd9"
}
}
% curl -OJ -H 'Connection: keep-alive' --keepalive-time 2 "https://gdc-tcga-phs000178-controlled.storage.googleapis.com/7d1726dc-1261-4db9-adea-3adbcb2ffa28/7812f73a-4603-44b4-84c4-cefd5fc654a8_wgs_gdc_realn.bai?x-goog-algorithm=GOOG4-RSA-SHA256&x-goog-credential=dheiman-666%40dcf-prod.iam.gserviceaccount.com%2F20240214%2Fauto%2Fstorage%2Fgoog4_request&x-goog-date=20240214T144326Z&x-goog-expires=3600&x-goog-signedheaders=host&x-goog-signature=6e20a8b207483ba7dfdcf62bcefd15e357157fe379942054e4e6380e30537bacfbc372c4cd0142102941137df831bc27eb699c7b8dad99fbd76b9b9473af38419f69830a32cc1ccee512057c35fc9c512c82576a15c9182489d61511b46f384e1a52a895786cc0f6d8d4e71883e0b24c6d5e90c778d65d8a860b8abb511b7016b2f00cff5ee489aef44bd1c03130e57b20002ed9542c933e01e9ab8211a605a6a6a0b16ae5a8e072b1ad9477c0e7ccb6bb92464f242014d4b387e4f6e3cc5e9d28f455d757f4e43a74dbd9292ddaa1e74558462f87fff7e9e328b87adcc90fe8d007c95abf5ed770cdad1012ea2fcf8c72a22a9c57eb52e6ab16d76111986bd7"
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 9213k 100 9213k 0 0 10.6M 0 --:--:-- --:--:-- --:--:-- 10.6M
% ls -la
total 18992
drwxr-xr-x 15 dheiman staff 480 Feb 14 09:40 .
drwxr-xr-x@ 19 dheiman staff 608 Jan 11 14:27 ..
-rw-r--r--@ 1 dheiman staff 9434312 Feb 14 09:40 7812f73a-4603-44b4-84c4-cefd5fc654a8_wgs_gdc_realn.bai
% md5 7812f73a-4603-44b4-84c4-cefd5fc654a8_wgs_gdc_realn.bai
MD5 (7812f73a-4603-44b4-84c4-cefd5fc654a8_wgs_gdc_realn.bai) = 5290f15d8e95e2660fe6d15a5f4e9dd9
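Sketched in Python, only the request construction is shown here; the endpoint and field names are taken from the manual test above, and actually posting the request requires the same Terra-linked bearer token from `gcloud auth print-access-token`:

```python
DRSHUB_ENDPOINT = "https://drshub.dsde-prod.broadinstitute.org/api/v4/drs/resolve"

def build_drshub_request(drs_url: str) -> dict:
    """Build the JSON body for a DRSHub resolve call. When the response
    contains a non-null accessUrl.url, that signed URL can be used for
    download instead of going through the GDC API."""
    return {
        "url": drs_url,
        "fields": ["size", "name", "accessUrl", "hashes"],
    }
```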
Is your feature request related to a problem? Please describe.
When localizing data from the GDC, weird failures can corrupt the file, and retries can't fix it.
Describe the solution you'd like
If the md5sum check fails, delete the file: it has been corrupted, and the only way for subsequent retries to succeed is to start from scratch.
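A minimal sketch of the check-then-delete behavior (the function name is an assumption; the chunked read keeps memory bounded for large files):

```python
import hashlib
import os

def verify_or_delete(path: str, expected_md5: str) -> bool:
    """Compare a localized file against its expected md5. On mismatch the
    file is deleted, so the next retry starts from scratch instead of
    repeatedly failing on a corrupted download."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() == expected_md5:
        return True
    os.remove(path)
    return False
```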
Currently, Canine waits until it actually starts delocalization to check that the delocalization location is valid (e.g., does not already exist, is writable, etc.)
I think it would be better to have these checks happen before the job is dispatched, to avoid the possibility of having a long, expensive job run to completion only to fail at the very end because it couldn't be delocalized.
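A sketch of such a pre-dispatch check (the exact validations Canine performs at delocalization time may differ; names here are assumptions):

```python
import os

def check_delocalization_target(path: str, overwrite: bool = False) -> None:
    """Validate the delocalization destination before dispatching the job,
    so a long, expensive job can't run to completion only to fail at the
    very end."""
    if os.path.exists(path) and not overwrite:
        raise FileExistsError(f"Delocalization target already exists: {path}")
    parent = os.path.dirname(path) or "."
    if os.path.isdir(parent) and not os.access(parent, os.W_OK):
        raise PermissionError(f"Cannot write to delocalization target: {parent}")
```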
On non-transient backends, add an option to clear the slurmctld configuration with:
```
kill $(pgrep slurmctld)
slurmctld -c
slurmctld -f /path/to/slurm.conf
```
Add a pipeline option to allow canine to report pipeline status to an external system
Here, the default compute zone is hardcoded as `us-central1-a`:

Instead, this should be the zone of the current GCE instance. Otherwise, severe performance issues can arise when cluster instances get created across different zones.
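One way to do this is to ask the GCE metadata server, falling back to the old hardcoded default when not running on GCE. A sketch using only the stdlib:

```python
import urllib.request

METADATA_ZONE_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/zone"
)

def parse_zone(metadata_value: str) -> str:
    """The metadata server returns e.g. 'projects/<num>/zones/us-central1-f';
    keep only the final path component."""
    return metadata_value.rsplit("/", 1)[-1]

def current_gce_zone(default: str = "us-central1-a") -> str:
    """Zone of the current GCE instance, or `default` when off-GCE."""
    request = urllib.request.Request(
        METADATA_ZONE_URL, headers={"Metadata-Flavor": "Google"}
    )
    try:
        with urllib.request.urlopen(request, timeout=2) as resp:
            return parse_zone(resp.read().decode())
    except OSError:  # URLError, timeouts, no metadata server
        return default
```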
I've noticed that on linux, there are still lingering issues with the dummy backend, particularly during cleanup. Since the container runs as root, any files written into the bind mount from the container are owned by root and usually write protected. This causes issues when the host tries to clean up the staging directory after killing the cluster.
I think the easiest workaround would be a two-part approach: `chmod -R` in the controller right before cleanup. I like that this doesn't involve the use of `sudo`, which may not be available to a user.
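The controller-side step might be as simple as a recursive `chmod` over the bind-mounted staging directory before teardown; a sketch that just builds the command (how it gets invoked inside the container is an assumption):

```python
import shlex

def cleanup_chmod_command(staging_dir: str) -> str:
    """Command to run as root inside the controller container right before
    cleanup, so the (non-root) host user can delete the bind-mounted
    staging directory without sudo. a+rwX marks only directories and
    already-executable files as executable."""
    return f"chmod -R a+rwX {shlex.quote(staging_dir)}"
```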
Right now, common files are processed in several steps, which is slow for big jobs. It would be significantly faster if the localizer pulled them out of the job spec as they are discovered.
However, the current version indexes common files based on their input location, which allows for auto-detection of common files anywhere within the job spec.
It looks like the version is meant to be the same as the latest release. If so, it may be preferable to use `setuptools_scm`. That way the version will be linked directly to the commit of the latest release tag, and there won't be a need to set it manually in the code (between releases, the commit hash will be appended to the version to make it clear that the current code doesn't match the latest release).
The Job avoidance tests are currently skipped because the system was not fully implemented when unit tests were added.
As mentioned in the GDAN meeting, some people are interested in adding an option to save inputs to node-local storage instead of over the NFS. This is particularly useful for large input files which are only needed once and may clog NFS bandwidth and storage.
In terms of implementation, I think this makes sense as an override which follows the behavior of `Delayed`, except that the file is downloaded to local storage (not over NFS). I'm leaning towards calling this override `Local`, but it is very similar to `Localize`, so it may not be the best choice.
I'm going to try to get to this today or tomorrow, and should have a PR open soon
Currently, if no outputs are found, the only clue is this getting printed to stdout:

```
Expecting value: line 1 column 1 (char 0)
```

presumably from `delocalizer.py`.
We need to make output-related error messages clearer.
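For instance, wherever the delocalizer's JSON output gets parsed, the bare `json` error could be replaced with a message naming the job and the likely cause. A sketch (function and messages are assumptions):

```python
import json

def load_output_manifest(text: str, job_id: str) -> dict:
    """Parse the delocalizer's JSON output, turning the cryptic
    'Expecting value: line 1 column 1 (char 0)' into an actionable error."""
    if not text.strip():
        raise ValueError(
            f"Job {job_id} produced an empty output manifest: "
            "no outputs were found during delocalization"
        )
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(
            f"Malformed output manifest for job {job_id}: {exc}"
        ) from exc
```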
Is there a reason that we copy the directory instead of making a symlink of the directory?
Allows canine to boot a Slurm cluster from a pre-created image. Requirements:
- `/etc/fstab` properly configured in the image to mount the shares. The canine staging directory must be mounted through that share.
- `slurm.conf`
Allow user to specify custom names for array jobs, instead of naming by index.
The Firecloud adapter should default to providing entity names as custom job names, if not provided by the user.
Localization.sh loses its execute bit when transferred over a Google bucket. We can probably just add `chmod 755` to setup.sh to make sure it is restored.
A canine rodisk can be attached with the correct mount point without being usable (error not caught during mount?).
canine/canine/localization/base.py
Lines 1321 to 1324 in 8235701
This happened to me on a disk marked as attached and mounted by two instances, but one is non-readable:
Instance with successful mount:

```
root@jb-2-worker3105:/tmp# ls /mnt/rodisks/canine-scratch-wpa4buy-ypljbha-ewdz3z3pxih0g-0
mybam.bam dupe_metrics.txt
```

Instance without successful mount for the same disk:

```
root@jb-2-worker3103:/tmp# ls /mnt/rodisks/canine-scratch-wpa4buy-ypljbha-ewdz3z3pxih0g-0
ls: reading directory '/mnt/rodisks/canine-scratch-wpa4buy-ypljbha-ewdz3z3pxih0g-0': Input/output error
```
Catching an exit status of 1 from `ls $CANINE_RODISK_DIR` might be a good option after `mountpoint -q`, per https://superuser.com/a/864375 (we don't know how subsequent tasks may react to missing files...).
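A sketch of that extra readability probe; `os.path.ismount` stands in for `mountpoint -q`, and the `ls` exit status catches the Input/output error case shown above:

```python
import os
import subprocess

def rodisk_usable(mount_dir: str) -> bool:
    """True only if mount_dir is both registered as a mount point and
    actually readable; ls exits nonzero on I/O errors even when the
    mount itself looks fine."""
    if not os.path.ismount(mount_dir):
        return False
    result = subprocess.call(
        ["ls", mount_dir],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result == 0
```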
If the user specifies an invalid image/project, or does not have permission to view the specified image, the backend will hang forever at "Waiting for NFS to be ready …"
Right now, the remote half of delocalization (`delocalize.py`) accepts and expands shell variables in the context of the current job; however, matching files will not match in canine's `localizer.delocalize()` because the same variables are not set there. We do have the information to fix this, as the localizer is responsible for setting the environment variables in the first place, but we would have to be careful about using the right path contexts. This would still only allow for variables which were job inputs; other shell variables would not expand properly.
A single 4 TB persistent disk on a 4 core NFS has a maximum throughput of ~400 MB/s. The instance itself will have a network throughput of 8 gigabits per second (1000 MB/s). Recall that the NFS server also runs jobs, so for a large batch submission where we are sure that we will continuously max out a 16 core instance, its throughput would be 32 Gb/s (4000 MB/s).
Thus, for a large number of concurrent workflows, we will need to have multiple NFS servers/disks per NFS server. All disks would be mounted to the controller node/each worker node; each task would be randomly assigned to a disk. (Since Slurm allows us to preferentially assign tasks to specific nodes, it would make sense to put the most I/O intensive jobs on the NFS nodes.)
We will have to very carefully optimize how many jobs a single disk could accommodate. As currently implemented, our pipelines are CPU-bound, not I/O-bound. Localization will likely be the most I/O intensive step; I’ve ballparked localization maxing out at 2-4 concurrent samples per 4 TB drive, assuming sustained bucket throughput of 100-200 MB/s per file. Perhaps we could do something clever with staggering workflow starts so that there aren’t too many concurrent localization steps.
@julianhess and I have both encountered the Dummy backend hanging forever on start after booting all containers. The containers seem to be starting, but for some reason slurmctld isn't actually running.
Also enforce that the NFS localizer staging directory is the same as the local directory.
Currently, Canine can progress quite far if a default project is not set up, and crash with a cryptic stack trace:
https://files.slack.com/files-pri/T1YPV3RLL-FUTTHHZ88/image.png
We should explicitly check whether `self.config["project"] is None` and raise an informative exception.
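A sketch of the fail-fast check (the function name and message are assumptions):

```python
def require_project(config: dict) -> str:
    """Raise an informative error early instead of letting Canine crash
    later with a cryptic stack trace when no project is configured."""
    project = config.get("project")
    if project is None:
        raise ValueError(
            "No GCP project configured. Set 'project' in the Canine "
            "configuration, or set a gcloud default with "
            "'gcloud config set project <PROJECT>'"
        )
    return project
```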
Slurm can track disk usage as a consumable resource, but it only checks for available space before launching a task. This is problematic — if there are 100 GBs remaining on the NFS disk and 20 tasks launching concurrently each require 10 GB, the NFS disk will ultimately fill, since for each task, Slurm will see 100 GB free at launch, and will not continuously monitor each task's disk consumption.
Qing had the idea of using `fallocate` to reserve the full amount of space each job will use ahead of time (as specified by the user). We would then iteratively shrink the file created by `fallocate` by the amount the job's `workspace` directory grows in size.
If the user underestimated the total amount of space used by a job, the job should be killed. Because the monitoring process is backgrounded, we would need to trap a signal sent from it.
Because the overhead of monitoring disk usage can be potentially high (e.g., a workspace directory with many subfolders or many little files), this should be disabled by default and only get activated if the user explicitly requests disk space as a consumable resource. This also means that we should set the default disk CRES in Slurm to 0. Finally, we should caution users that they should only reserve disk space for tasks that have nontrivial output sizes (e.g., localization, aligners, etc.)
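One iteration of the proposed monitor could look like the sketch below: the reservation file (created up front with `fallocate -l`) shrinks as the workspace grows, so reserved plus used space stays at the user's estimate. Names and the truncate-based shrink are assumptions; a real monitor would loop and signal the job once usage exceeds the estimate.

```python
import os

def shrink_reservation(reserve_file: str, workspace: str, total_bytes: int):
    """Shrink the pre-allocated reservation file by however much the job's
    workspace directory has grown. Returns (used, remaining) in bytes."""
    used = sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, files in os.walk(workspace)
        for name in files
    )
    remaining = max(total_bytes - used, 0)
    os.truncate(reserve_file, remaining)
    return used, remaining
```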
Orchestrator will tail stdout/err for a random job, and switch jobs as they finish. Thanks to Phil Montgomery for suggesting this.
Right now the docs and readme are kind of overwhelming. Let's restructure to give the briefest overview, then point readers to other resources.
Canine is growing in complexity and it's about time we add CI. I would like it to be comprehensive, but I'm not sure about how to fudge it. Right now I'm thinking the easiest way would be to develop a dummy backend which would allow for testing of all other layers of the canine library except backends.
```
Exception: Key-exchange timed out waiting for key negotiation
Traceback (most recent call last):
  File "/home/sanand/.conda/envs/scrna/lib/python3.7/site-packages/paramiko/transport.py", line 2088, in run
    self._channel_handler_table[ptype](chan, m)
  File "/home/sanand/.conda/envs/scrna/lib/python3.7/site-packages/paramiko/channel.py", line 1187, in _handle_close
    self.transport._send_user_message(m)
  File "/home/sanand/.conda/envs/scrna/lib/python3.7/site-packages/paramiko/transport.py", line 1860, in _send_user_message
    "Key-exchange timed out waiting for key negotiation"
paramiko.ssh_exception.SSHException: Key-exchange timed out waiting for key negotiation
```
Must either decrease the re-key frequency to avoid ever re-keying, or simply catch the error and refresh the client. Probably both.
Orchestrator should return output in a better format (big dataframe) rather than a bunch of separate outputs
When starting workflow runs through the Prefect dashboard, each workflow runs in a new Python process. Currently this works fine if we run only one workflow at a time; however, to run workflows concurrently, I think we need a way to reuse the backend across different processes.
I think it would be similar to what `canine-transient` does: we can manage the lifetime of a canine backend in a separate process, then reuse that backend from a different process by providing an IP, NFS name, configuration, or similar. Do you see difficulties implementing this, or can this already be achieved by using RemoteBackend somehow? Any other suggestions?
Oftentimes, we want inputs to correspond to arrays of command-line arguments (not just single strings). For example, I might want to run
```
foo --arg1 bar baz --arg2 doo
```
with the following configuration dict:
```yaml
inputs:
  arg1: bar baz
  arg2: doo
script:
  - foo --arg1 $arg1 --arg2 $arg2
```
However, this is incompatible with Canine, since any inputs that contain spaces will be wrapped in single quotes by the stringification here:
and thus Canine will dispatch my job like so:

```
foo --arg1 'bar baz' --arg2 doo
```

which will break many tools out there that expect `--arg1` to be a space-delimited list of multiple arguments, e.g. anything using Python's `argparse` module:

```python
argparse.ArgumentParser().add_argument("arg1", nargs="+")
```
This will also break things if inputs are being used to put together arbitrary chunks of command line. Qing ran into the following:
> I have `gatk: gatk --java-options -Xmx10g`, which Canine will translate into `export gatk="'gatk --java-options ....'"`, and it cannot find `'gatk`
I think this would be best controlled in the override section: if the localization type is `None`, then `prepare_job_inputs` should strip any wrapping single quotes added by stringification here:
This is consistent with how the "null" localization type is already described in the Canine docs, i.e.
`null`: Forces the input to be treated as a plain string. No handling whatsoever will be applied to the input.
What do you think?
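The stripping step itself is tiny; a sketch of what `prepare_job_inputs` could apply to `null`-typed inputs (where exactly it hooks in is an assumption):

```python
def strip_stringification_quotes(value: str) -> str:
    """Undo the wrapping single quotes added by stringification, so that
    shell word-splitting applies to null-typed inputs again."""
    if len(value) >= 2 and value[0] == value[-1] == "'":
        return value[1:-1]
    return value
```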
Currently, `delocalization.py` will create an absolute symlink from the target to the destination:
However, this means that a Canine directory cannot easily be moved, since files in `outputs` are absolute symlinks to files in `jobs`.
If `target` and `dest` are on the same filesystem, I would replace `os.path.abspath(target)` with `os.path.relpath(target, os.path.dirname(dest))`.
As an alternative, this could be done after all the absolute symlinks are created, with something like `find outputs/ ! -type d -exec symlinks -c {} \;`. The advantage of this approach is that `symlinks -c` will check whether the files are on the same filesystem; the disadvantage is that `symlinks` is not a standard utility and would have to be installed as part of Canine.
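The relative-link version needs no external utility; a sketch of the proposed replacement:

```python
import os

def make_relative_symlink(target: str, dest: str) -> None:
    """Create dest as a symlink to target using a path relative to dest's
    directory, so the whole Canine output directory can be moved without
    breaking the links in outputs/."""
    os.symlink(os.path.relpath(target, os.path.dirname(dest)), dest)
```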
Right now, the adapter is a tiny part of the code that almost feels like an appendix to the localizer. Maybe look into merging the two, which may also help with #24.
`gsutil cat $file > fifo`, where `fifo` lives on an NFS filesystem, will not gracefully resume if the NFS server gets preempted.
Aaron says:
> I'm not opposed to having the localizer set a `$CANINE_DOCKER_ARGS` variable that we fill with stuff relevant to dockers, like cgroup stuff or bind-mounting the staging dir. I think we can use that to solve this issue (by putting the fifo somewhere on the host disk and adding a bind-mount to `$CANINE_DOCKER_ARGS`) and solve the issue with `local` overrides, by also adding a bind mount for the mount point to `$CANINE_DOCKER_ARGS`.