shawhahnlab / umbra
Python package and executable for Linux for managing Illumina sequencing runs
License: GNU Affero General Public License v3.0
If a network blip causes an upload to fail (e.g., OSError 32, EPIPE), the task fails outright. Instead, BoxUploader should default to a few internal retries for known transient problems before letting the exception propagate.
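A minimal sketch of what that retry behavior could look like, as a standalone helper (the `with_retries` name and the exact set of retryable exception types are assumptions, not part of the existing code):

```python
import time

def with_retries(func, retries=3, delay=1.0,
                 retryable=(BrokenPipeError, ConnectionError)):
    """Call func(), retrying on known transient errors before giving up.

    Only the exception types in `retryable` trigger a retry (BrokenPipeError
    is the OSError subclass for errno 32 / EPIPE); anything else propagates
    immediately, as does the final failure.
    """
    for attempt in range(retries):
        try:
            return func()
        except retryable:
            if attempt == retries - 1:
                raise
            time.sleep(delay * (attempt + 1))  # simple linear backoff
```

BoxUploader could then wrap each upload call, e.g. `with_retries(lambda: uploader.upload(path))`, keeping the retry policy in one place.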
If load_csv is given a text file with a Unicode byte order mark (BOM) prefix, the magic bytes are left in place in the returned data, breaking parsing later. Opening with an explicit encoding like open(..., encoding="utf-8-sig") seems to work: it strips the BOM if present and is a no-op if not.
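A small self-contained demonstration of the "utf-8-sig" fix (the real load_csv has more to it; this is just the encoding behavior in isolation):

```python
import codecs
import csv
import tempfile

def load_csv_rows(path):
    # "utf-8-sig" strips a leading BOM if present, and behaves exactly
    # like "utf-8" when there is no BOM.
    with open(path, encoding="utf-8-sig", newline="") as f_in:
        return list(csv.reader(f_in))

# Write a CSV file that starts with the UTF-8 BOM, as e.g. Excel does:
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(codecs.BOM_UTF8 + b"Sample,Project\r\nS1,P1\r\n")

rows = load_csv_rows(tmp.name)
# rows[0][0] is "Sample", with no "\ufeff" prefix left in the data.
```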
If a run fails, it never shows up as completed in IlluminaProcessor, the report, or anywhere else. This information should be available through a simple property of the Run class and reflected clearly in the reporting. Illumina doesn't summarize this status clearly anywhere in the run directory as far as I can tell, though; the best we may be able to do is parse the error status out of the files in the Logs directory.
There currently isn't any easy way to see how the metadata in /seq/status corresponds to the processing directories. There should be a column in the report CSV for this.
Currently the processing status YAML is saved in the zip file, but not the sample sheet or metadata spreadsheet. These should all be stored together in a metadata subdirectory.
If too large a file is given to BoxUploader.upload, it will still try to upload it but will eventually fail when the file size limit is reached. The file size should be checked beforehand, so we don't attempt an upload that is doomed to fail and can give a more informative error. The maximum should be available via the API through the max_upload_size attribute of the current-user information.
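A sketch of the pre-check as a standalone helper. The function name is hypothetical, and here the limit is passed in as a plain number rather than fetched from the Box API (in boxsdk it would come from the current user's max_upload_size field):

```python
import os

def check_upload_size(path, max_upload_size):
    """Raise a clear, early error instead of letting the upload fail late.

    max_upload_size is the byte limit reported by the Box API for the
    current user; here it is just a number for illustration.
    """
    size = os.path.getsize(path)
    if size > max_upload_size:
        raise ValueError(
            f"{path} is {size} bytes, over the "
            f"{max_upload_size}-byte upload limit")
    return size
```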
The zip archives created by ProjectData.zip call zipfile.ZipFile without specifying a compression method, and the default is ZIP_STORED, giving no compression. This should be changed to ZIP_DEFLATED, the standard zip compression method.
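The change is a one-argument fix; a quick demonstration that entries written this way really come out deflated (file names here are made up):

```python
import tempfile
import zipfile

with tempfile.NamedTemporaryFile(suffix=".zip", delete=False) as tmp:
    zip_path = tmp.name

# Pass the compression method explicitly instead of relying on the
# ZIP_STORED default. ZIP_DEFLATED needs zlib, which is effectively
# always present in CPython builds.
with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zipper:
    zipper.writestr("metadata/sample_sheet.csv", "Sample,Project\n" * 1000)

with zipfile.ZipFile(zip_path) as zipper:
    info = zipper.getinfo("metadata/sample_sheet.csv")
    # info.compress_size is now well under info.file_size for this
    # highly repetitive content.
```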
ProjectData.send_email incorrectly puts the angle brackets around the name part rather than the mailbox part of the email addresses. (This passes testing because TestProjectDataEmail contains the same mistake.) This causes smtplib to misinterpret the name as an address at localhost, or to reject the message entirely if no cc_addrs is supplied.
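The standard library already handles this formatting correctly; a fix could lean on email.utils.formataddr rather than building the string by hand (the name and address below are made-up examples):

```python
from email.utils import formataddr

# formataddr puts the angle brackets around the mailbox and quotes the
# display name when needed, producing a header smtplib parses correctly.
to_header = formataddr(("Name Lastname", "name@example.com"))
# -> "Name Lastname <name@example.com>"
```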
illumina.util.load_csv assumes UTF-8, but if, say, an ISO/IEC 8859-1 0xCA (Ê) happens to be inserted into the file for some reason, it'll crash with:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 5534: invalid continuation byte
What's the "right" behavior here? Intentionally throw an exception for this? Or allow these bytes to be automatically stripped out with a warning?
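If the second option wins, one possible shape is a lenient decode that warns and substitutes the replacement character instead of crashing (function name and policy are hypothetical, not a decision):

```python
import logging

def decode_lenient(raw_bytes):
    """Decode as UTF-8, but on failure warn and replace undecodable
    bytes with U+FFFD rather than crashing on a stray single-byte-
    encoding character."""
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError as err:
        logging.warning("non-UTF-8 byte in CSV input: %s", err)
        return raw_bytes.decode("utf-8", errors="replace")

# A 0xCA byte mid-stream is replaced instead of raising:
text = decode_lenient(b"sample\xca1,proj")
```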
The destination paths in the processing directory for task logs and metadata (sample sheet / metadata.csv / YAML) should be configurable.
The min_age setting (which ignores run directories until their ctime timestamp becomes old enough) should also be applied to the alignment subdirectories inside run directories. Otherwise we can still get spurious warnings about missing data on the first pass over incompletely written directories.
Currently blank rows in the metadata spreadsheet are read in as usual, with every field set to an empty string, which generates spurious errors. These rows should be filtered out when the spreadsheet is read in.
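Assuming rows come in as dicts of field name to string (as csv.DictReader would produce), the filter is a one-liner; the helper name and column names below are illustrative:

```python
def drop_blank_rows(rows):
    """Drop rows where every field is empty or whitespace-only."""
    return [row for row in rows
            if any(str(val).strip() for val in row.values())]

rows = [
    {"Sample_Name": "S1", "Project": "P1"},
    {"Sample_Name": "", "Project": ""},   # blank spreadsheet row
    {"Sample_Name": "S2", "Project": "P2"},
]
kept = drop_blank_rows(rows)  # the all-empty row is removed
```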
Exceptions raised during ProjectData.process are caught and logged, but should also be emailed to a configurable list of addresses along with plenty of diagnostic output.
If run data is written to the monitored path directly, or copied on a rolling basis, and an IlluminaProcessor.refresh() occurs while a write is incomplete, a log message of "skipped unrecognized run" or "Alignment not recognized" is triggered. This is harmless since the same location is re-checked on the next cycle, but should still be avoided.
Requiring a minimum file-change timestamp age when trying to load run data might be a reasonable approach. These timestamps are filesystem-level, so they are unaffected by how the files were created or copied (e.g., rsync -a, cp -p, etc.).
If an experiment metadata CSV file is malformed it will crash IlluminaProcessor. Instead this should be caught and logged.
Since there isn't a straightforward mapping of sample names to file names, different Illumina programs do that name conversion differently, and duplicate sample names are allowed, we need to switch to using sample numbers and the files that actually exist on disk rather than trying to figure out what the files should be named. This can be done internally in the illumina package to start with, with minimal visible changes, before switching the project metadata to be sample-number-based. This change is a first step toward addressing #50.
Currently it's assumed that all samples defined in metadata.csv for a given project will be present for the matched Run/Alignment, but that might not be the case, either accidentally (typo or incorrect experiment name) or intentionally (shared experiment metadata between multiple runs). This should be checked and logged appropriately: some missing names should trigger a warning, and a completely mismatched set should set the project status to failed.
Currently it assumes a name is associated with each address, like Name Lastname <[email protected]>. A bare [email protected] should work too.
GenerateFASTQ's output file names on disk do not include ' (and probably not ") when those characters appear in the sample sheet. These characters should be excluded in illumina.alignment.sample_files_for_num.
For user-facing simplicity it should be an option to store implicit task dependency outputs within a subdirectory of the processing directory.
In TaskAssemble.prep_contigs_for_geneious the contig numbering is mangled by an incorrect regular expression: re.match("^NODE_([0-9])+_.*", rec.id) should be re.match("^NODE_([0-9]+)_.*", rec.id), with the capture group including all of the digits. As it currently stands, the same one-digit labels are recycled over and over.
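The difference is easy to miss, so here it is in isolation (the contig ID below is a made-up SPAdes-style example):

```python
import re

rec_id = "NODE_12_length_5000_cov_9.5"  # hypothetical contig ID

# Broken: the + repeats the one-digit group, and a repeated group keeps
# only its LAST match, so group(1) is just the final digit.
broken = re.match("^NODE_([0-9])+_.*", rec_id).group(1)  # "2"

# Fixed: the + lives inside the group, so all digits are captured.
fixed = re.match("^NODE_([0-9]+)_.*", rec_id).group(1)   # "12"
```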
I just saw some uploads to Box fail due to a missing method and realized the installed boxsdk was too old to have it. Pinning boxsdk>=2.7.1 (the version I'm using in testing) in setup.py would have prevented this. The same should be done for the other dependencies.
The CSV report file uses Windows-style \r\n line endings instead of Unix-style \n, which trips up tools like grep. This can be changed by passing a lineterminator argument when setting up the writer object in processor.IlluminaProcessor.report.
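A minimal sketch of the writer setup (column names are invented for the example; the real report has its own fields):

```python
import csv
import io

buf = io.StringIO()
# csv.writer defaults to "\r\n" line endings regardless of platform;
# passing lineterminator gives Unix-friendly output.
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["RunId", "Project", "Status"])
writer.writerow(["run1", "proj1", "complete"])
report = buf.getvalue()  # contains no "\r" at all
```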
Replies to automated messages should be directed to a configurable address.
The processing emails may long outlive the zip files they link to. The working directory names are given, but it would be best to include the original Run ID and the sample sheet's experiment entry as well.
In processor.IlluminaProcessor._run_setup, ValueError is caught and logged for invalid runs, but the invalid run object is still returned. None should be returned instead so that the processor won't try to refresh or query anything from the run directory.
It should be safe to formally define the task names as all-lowercase, and if so, we can automatically convert entries in the Task column of each experiment metadata.csv to lowercase and avoid one potential point of confusion.
We have some old runs that have sample sheets with no Experiment defined, so the experiment property returns None. This is fine for the Illumina Run/Alignment objects but crashes IlluminaProcessor.
We should be able to run umbra --version to get the current version via umbra.__version__.
After old runs are skipped once and logged, they can only get older, so we get endless log messages about them. These should be tracked so they're only logged once. If we track and inhibit only the message itself, the processor will still be able to process a given run later on if conditions somehow change.
Signal handling should be reorganized to allow USR1 and USR2 to handle verbosity.
Travis CI is no longer a viable option for automated testing under their new pricing model. From their announcement:
"We will be offering an allotment of OSS minutes that will be reviewed and allocated on a case by case basis. Should you want to apply for these credits please open a request with Travis CI support stating that you’d like to be considered for the OSS allotment."
The framework for install-specific custom processing tasks is already in place in the introspection in the task package. To finish this off an entry in the configuration file will need to specify an additional filesystem path containing custom tasks.
The sequencers do allow a sample sheet to specify the same name for multiple samples, and currently this results in all but the first sample of a given name being "lost" during processing. How should this be handled? Based on Illumina's approach it should probably be supported, weird as it is. Maybe we should use sample number as the basis for tracking samples instead.
Currently, if an exception is raised during task processing, IlluminaProcessor._worker will catch it and send an email to the address defined in to_addrs_on_error in the config. The same should happen if an exception reaches _proc_new_alignment, possibly also including what's currently caught in ProjectData.from_alignment for any problems when loading metadata. (Possibly just remove the except clause there and handle it within IlluminaProcessor.)
If a project with manual processing specified is never actually finished manually, it'll take up one worker indefinitely. This potentially leads to all possible workers clogged with orphaned tasks in the long run. One simple safeguard would be a timeout (potentially quite long, like a week) before giving up, failing processing, and moving on.
Currently the ProjectData class can accept and use a threads attribute (for assembly), but this is not supported by IlluminaProcessor. nthreads_per_project should be a main configuration option.
If scrape_log_for_code encounters a problem while streaming the server log for authentication details, the error isn't currently shown and the call just hangs waiting on the expected text. One way to handle this is to use subprocess.Popen(..., stderr=subprocess.PIPE) and watch for errors on stderr. Possibly there's a better way to catch the error from the process directly.
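A rough sketch of the stderr-watching idea, using a throwaway child process in place of the real log stream (the error text and exit handling are illustrative; a truly streaming reader would need select or a reader thread rather than communicate()):

```python
import subprocess
import sys

# Stand-in for the log-streaming process: a child that writes an error
# to stderr and exits nonzero.
proc = subprocess.Popen(
    [sys.executable, "-c",
     "import sys; sys.stderr.write('auth error'); sys.exit(1)"],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
out, err = proc.communicate(timeout=30)

message = None
if proc.returncode != 0:
    # Surface the child's own error text instead of hanging forever
    # waiting for the expected output.
    message = f"log stream failed: {err.strip()}"
```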
The short name for a project instance associated with a single run is built from the project name, run completion date, and user contact names. I figured that would be enough because we won't have multiple runs finishing on the same day for the same project and the same people, right? Well, we do. In this case one of the instances never processes because the other is already present.
Maybe instead add the flow cell ID of the run into the project work_dir attribute (see _init_work_dir_name).
Running python -m unittest currently takes several minutes to complete locally and upwards of 30 minutes on CircleCI, which is kind of absurd. The tests should be disentangled from one another, superfluous subclasses cleaned up, and mock objects used wherever appropriate.
Our MiniSeq keeps touching its last run output directory after the run completes, which updates the timestamp and prevents the run from processing because of how the min_age option is handled. Possibly we should instead use the CompletionTime entry in CompletedJobInfo.xml for the run.
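If we go the CompletedJobInfo.xml route, pulling the timestamp is straightforward with the standard library. The XML below is a made-up minimal example; the exact element name, nesting, and any namespace should be checked against real sequencer output first:

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down CompletedJobInfo.xml content:
XML = """<?xml version="1.0"?>
<CompletedJobInfo>
  <CompletionTime>2019-07-09T13:22:54.3456789-04:00</CompletionTime>
</CompletedJobInfo>"""

root = ET.fromstring(XML)
# findtext returns None if the element is absent, so a missing entry
# can be handled explicitly rather than raising.
completion_time = root.findtext("CompletionTime")
```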
There's already support for reloading input data from scratch via a signal, but this should be supported with systemctl reload umbra, too.
Related to #91, illumina.util.load_checkpoint only works as expected in the most common case, a completed FASTQGeneration. Looking at Checkpoint.txt for interrupted/failed cases, I see that in general there's both an integer and a keyword (which just happens to be an empty string in the common case). load_checkpoint should be updated to parse out both the integer and the keyword in all cases.
Currently an alignment directory with incomplete output from a processing failure on the sequencer triggers a repeating "Alignment not recognized" error for the run, but only because the wrong ValueError is inadvertently caught. The Checkpoint.txt file in these cases contains both an integer and a keyword, and the keyword gets included when trying to cast to int in umbra.illumina.util.load_checkpoint.
In these cases Checkpoint.txt looks like:
0
Demultiplexing
instead of:
3
(So there's an empty string in the usual case, and presumably other integers and keywords for the intermediates but I haven't seen them.)
Instead, load_checkpoint should be updated to return both the integer and the keyword, and any error entries in CompletedJobInfo.xml should be noted.
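A sketch of the two-part parse, based only on the file contents shown above (the function signature and the empty-string default for the keyword are assumptions about how the real load_checkpoint would be reshaped):

```python
def load_checkpoint(text):
    """Parse Checkpoint.txt content into (number, keyword).

    A completed alignment's file contains just "3"; an interrupted one
    can contain e.g. "0" followed by "Demultiplexing". The keyword
    defaults to "" when the second line is absent.
    """
    lines = text.splitlines()
    number = int(lines[0].strip())
    keyword = lines[1].strip() if len(lines) > 1 else ""
    return number, keyword
```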
The Box SDK logs messages at level INFO via boxsdk.network.default_network that contain the binary contents of uploaded files in their entirety. Since we pass along everything logged at level INFO and higher, this dumps huge amounts of data into syslog with each uploaded file. A simple workaround would be to set the level of that specific logger to something higher, like WARNING.
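The workaround is a single call, targeting just that logger so our own INFO-level logging is untouched:

```python
import logging

# Quiet only the boxsdk logger that dumps raw upload payloads at INFO;
# all other loggers keep their configured levels.
logging.getLogger("boxsdk.network.default_network").setLevel(logging.WARNING)

noisy = logging.getLogger("boxsdk.network.default_network")
# INFO records from this logger are now filtered out; WARNING still passes.
```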
Box's upload method is intended for small (< ~50 MB) files, which is probably why we get sporadic timeouts with larger ones. This should be switched to use the chunked upload features, as per:
https://developer.box.com/en/guides/uploads/chunked/with-sdks/
https://medium.com/box-developer-blog/introducing-the-chunked-upload-api-f82c820ccfcb
https://github.com/box/box-python-sdk/blob/master/boxsdk/object/folder.py
A few configuration-related changes that should happen:
There's a copy task but that's for the entire run directory with no distinction between projects. More useful would be a task to copy the raw fastq.gz files for a specific project.
In ProjectData.from_alignment, sample_paths may be left as None if a FileNotFoundError is raised in alignment.sample_paths(), but when that None is assigned to the ProjectData instance's sample_paths, the setter tries to access attributes of it and crashes. This should be treated as a special case of samples missing relative to what the experiment metadata predicted, raising ProjectError instead of the AttributeError that currently crashes everything.