pavlidislab / rnaseq-pipeline
RNA-seq pipeline for raw sequence alignment and transcript/gene quantification.
License: The Unlicense
It would be convenient to automate the generation of ERCC spike-in counts and store the results for QC purposes.
In the current setup, the AlignSample task can get scheduled on a different node than the matching DownloadSample task, which disrupts the NFS caching.
We could start by pinning align jobs to whichever node was used to download the data. If that works well, we could then switch to using the local scratch filesystem to avoid NFS entirely.
Currently, we use the SRA format for FASTQ headers, which prefixes the SRR run accession to the original string from the sequencer. This format is not compatible with ArrayExpress and local sources, and it will pose a problem if we try to generalize batch information extraction to arbitrary FASTQs rather than just GEO series.
The solution is to add the --origfmt flag to fastq-dump so that the original header is used instead.
This might require some adjustment in how Gemma parses the batch information.
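As a sketch, the download task could build its fastq-dump invocation with the flag included. The surrounding arguments (paired-end splitting, compression, output directory) and the accession are illustrative, not the pipeline's actual command line:

```python
# Hypothetical sketch: building a fastq-dump command with --origfmt so the
# original sequencer header is kept instead of the SRA-prefixed one.
def fastq_dump_args(srr_accession, output_dir):
    return [
        "fastq-dump",
        "--origfmt",      # keep the original FASTQ header from the sequencer
        "--split-files",  # one file per mate for paired-end runs
        "--gzip",
        "--outdir", output_dir,
        srr_accession,
    ]
```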
Relates to: #16
The information that the taxon provides is redundant because we already have the genome and annotation reference identifiers.
For that to work, we have to update Gemma platform identifiers to use explicit Ensembl versions.
Addressing request from PavlidisLab/Gemma#20
For GEO datasets/expression experiments, we would like to obtain:
To start, let's produce a text file in a "METADATA" directory similar to the Data directory.
If we eventually use a job scheduler to distribute the computation, it would be simple to extend ExternalProgramTask to dispatch the work to a job scheduler.
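One way this extension could look, assuming Slurm: prepend the scheduler command to whatever the task would normally run. In the real pipeline this would subclass luigi.contrib.external_program.ExternalProgramTask; the plain mixin below is a simplified stand-in, and the srun options are arbitrary examples.

```python
# Hypothetical sketch: wrap a task's command line with "srun" so the work
# runs on a scheduler-managed node instead of the local machine.
class ScheduledExternalProgramTask:
    """Mixin that prepends a job-scheduler command to the program invocation."""

    scheduler_prefix = ["srun", "--cpus-per-task=8", "--mem=32G"]

    def program_args(self):
        raise NotImplementedError

    def scheduled_args(self):
        # Dispatch the original command through the job scheduler.
        return self.scheduler_prefix + self.program_args()
```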
This is the case for SRR3279144, where the metadata claims it is paired but the output of fastq-dump is single-ended.
Luigi (and also the Python core developers...) is about to drop support for Python 2.7; see spotify/luigi#2876.
This is a rather simple move as we only have to update the Conda environment to use Python 3 and possibly adjust some code in the pipeline.
We need a mechanism for limiting the number of jobs of this kind, since they do not scale with the number of running tasks due to a bandwidth bottleneck.
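Luigi's built-in resource limits could serve here: each download task declares that it holds one unit of a named resource, and the scheduler caps how many such tasks run concurrently. The resource name and limit below are assumptions for illustration.

```python
# Sketch using Luigi's [resources] mechanism to cap concurrent downloads.
# In luigi.cfg one would declare the global limit, e.g.:
#   [resources]
#   sra_http = 4
# In the pipeline, DownloadSample would subclass luigi.Task; the plain class
# below just shows where the declaration goes.
class DownloadSample:
    # The scheduler runs at most 4 tasks holding this resource at once.
    resources = {"sra_http": 1}
```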
This would basically resolve #13 since we would pull the information from SRA/GEO/Gemma instead of requesting it.
For the local source, we would still need a flag to override a default taxon.
The real issue is that the tool does not exit with a non-zero code when it fails due to a download size limit.
This is basically due to memory leakage when a process does not decrement the shared-memory usage counter. It would be better to base the count on the number of processes attached to the segment instead.
Experiments from GEO can sometimes have an accession but no actual SRA data. An example is GSE64018. Currently the pipeline does not treat this as a failure of DownloadGSE, but the QC step will fail (assuming the --nsamples argument of QcGSE is not 0). It would be nicer if DownloadGSE either failed when it cannot find a matching SRA accession in the MINiML file, or at least raised a loud warning so the problem is obvious to the pipeline user.
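The check could look roughly like the following: scan the MINiML document for Relation elements of type SRA and fail loudly when none are found. The element and attribute names follow the MINiML schema as I understand it; treat the parsing details as assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sketch of the check DownloadGSE could perform on the MINiML
# file before declaring success.
MINIML_NS = "{http://www.ncbi.nlm.nih.gov/geo/info/MINiML}"

def find_sra_relations(miniml_xml):
    root = ET.fromstring(miniml_xml)
    targets = [rel.get("target")
               for rel in root.iter(MINIML_NS + "Relation")
               if rel.get("type") == "SRA"]
    if not targets:
        raise RuntimeError("No SRA relation found in MINiML file; "
                           "this series has no SRA data to download.")
    return targets
```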
This will have the advantage of grouping the alignments statistics with the FASTQ quality controls in the final report.
This would remove the bottleneck between data download and downstream processing.
I think it might relate to NFS.
Some experiments may need to be re-downloaded solely to obtain the FASTQ metadata. This could be turned into a Luigi task.
Additional considerations:
There's been discussion about using a subset of CWL to produce a file with structured pipeline metadata.
This process can be entirely automated by introspecting the Luigi task graph and extending ExternalProgramTask to provide additional metadata such as a version number. Ideally, we would move that logic into bioluigi and make all our tasks CWL-friendly.
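The introspection itself could be a simple walk over requires(). The sketch below works on any Luigi-style objects exposing requires() and an optional version attribute; the metadata shape is an assumption, not a CWL schema.

```python
# Hypothetical sketch: walk a Luigi-style task graph via requires() and
# collect per-task metadata that a CWL-like description could be built from.
def collect_metadata(task, seen=None):
    seen = seen if seen is not None else set()
    if id(task) in seen:
        return []
    seen.add(id(task))
    entries = [{"task": type(task).__name__,
                "version": getattr(task, "version", None)}]
    deps = task.requires() or []
    if not isinstance(deps, (list, tuple)):
        deps = [deps]  # requires() may return a single task
    for dep in deps:
        entries.extend(collect_metadata(dep, seen))
    return entries
```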
I've encountered a couple of archives in SRA that replace the : separator with _ in their FASTQ headers. We would have to handle this specific case in ExtractGeoSeriesBatchInfo.
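A possible normalization, assuming the standard seven-field Illumina header layout: when a header contains no colon but splits into exactly seven underscore-separated fields, rewrite it with colons. The field-count heuristic is an assumption; instrument names containing underscores would defeat it.

```python
# Hypothetical sketch for ExtractGeoSeriesBatchInfo: restore ":" separators
# in FASTQ headers where a submitter replaced them with "_".
def normalize_header(header):
    if ":" in header:
        return header  # already uses the standard separator
    fields = header.lstrip("@").split("_")
    if len(fields) == 7:  # instrument:run:flowcell:lane:tile:x:y
        return "@" + ":".join(fields)
    return header  # not recognizably an Illumina header; leave untouched
```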
We need the FASTQ files for two things: alignment tasks and batch information extraction.
At this time, submitting batch information and quantifying gene expression are two separate, independent tasks that share some common dependencies. This makes it a bit difficult to determine the right moment to clear the FASTQ files.
We could add a wrapper task that depends on any task that needs the FASTQ files to exist and that would be considered completed if all its dependencies are met AND the FASTQ files do not exist.
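The proposed wrapper could be sketched as follows: it reports complete only once every consumer of the FASTQs has finished AND the files are gone. In the pipeline this would subclass luigi.Task; here the dependencies are plain objects with a complete() method.

```python
import os

# Hypothetical sketch of the FASTQ cleanup wrapper described above.
class CleanupFastqs:
    def __init__(self, dependencies, fastq_paths):
        self.dependencies = dependencies
        self.fastq_paths = fastq_paths

    def complete(self):
        deps_done = all(dep.complete() for dep in self.dependencies)
        fastqs_gone = not any(os.path.exists(p) for p in self.fastq_paths)
        return deps_done and fastqs_gone

    def run(self):
        # Only reached once all consumers are done; safe to delete.
        for path in self.fastq_paths:
            if os.path.exists(path):
                os.remove(path)
```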
This logic cannot apply to the local source.
Add DownloadLocalSample and DownloadLocalExperiment tasks to cover the case of local samples.
Most of the logic is now in scheduler/tasks.py, so it would be great to clean up the legacy code and get this repository tidy.
While it is not generally needed, some datasets might require adapter trimming, and there are pretty popular solutions out there that automate some of the decisions:
What's nice about Trim Galore! is that we already use FastQC for reporting read quality and it might be possible to reuse the generated report.
In addition to the GEO source, which invokes the SRA source, it would be nice to complement this with an SRA source that works with SRX accessions (see #23).
EXITING because of fatal error: buffer size for SJ output is too small
Solution: increase input parameter --limitOutSJcollapsed
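Since the pipeline wraps STAR through ExternalProgramTask, the fix amounts to adding the parameter to the program arguments. The value 5000000 below is an arbitrary example (STAR's default is 1000000); the other arguments are illustrative.

```python
# Sketch: raising STAR's --limitOutSJcollapsed to enlarge the splice-junction
# output buffer that overflows in the error above.
def star_args(index_dir, fastqs):
    return [
        "STAR",
        "--genomeDir", index_dir,
        "--readFilesIn", *fastqs,
        "--limitOutSJcollapsed", "5000000",  # default is 1000000
    ]
```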
This is necessary for #14 because the shortName entry might conflict with GEO accessions. Using the experiment_id parameter cannot always disambiguate the data source.
It causes the following error in Luigi:
Runtime error:
Traceback (most recent call last):
File "/space/grp/Pipelines/rnaseq-pipeline/venv/lib/python2.7/site-packages/luigi/worker.py", line 184, in run
raise RuntimeError('Unfulfilled %s at run time: %s' % (deps, ', '.join(missing)))
RuntimeError: Unfulfilled dependency at run time: QualityControlSample_GSE117223_GSM3288306_d6aef47967
RSeQC provides a wide range of tools for that purpose.