Comments (6)
The line tagged at https://github.com/JoseEspinosa/viralrecon/blob/fba94cfd88625c7052afe5e275970ad5d33ca2be/main.nf#L328 is the start of a possible implementation for downloading fastq files. It needs to be cleaned up and merged with the current version of the pipeline, and some of the code could be moved into a function. In any case, the logic of the current approach is:
- Check which files are given only by ID in the samplesheet (i.e. `fastq` files that should be downloaded).
- Download these files using nextflow `fromSRA` when possible.
- Validate the files coming from `fromSRA` and check which files are missing, since it seems that the EBI server is not completely up to date with covid-19 data.
- Retry the download of missing or invalid files using `parallel-fastq-dump`.
- Validate the files downloaded by `parallel-fastq-dump`.
- If some file is not valid or not reachable, the pipeline stops and throws an error.
- If all files are downloaded, the channels ("*fromSRA*" and "*fastqdump*") are mixed with the channels generated from the files provided directly by the user.
*Note: The current version of the `parallel-fastq-dump` command subsamples only some reads from the fastq files for testing purposes (`-X` option).*
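The download-then-retry logic listed above can be illustrated outside Nextflow with a small Python sketch. This is not the code in `main.nf`: the function name `fetch_with_fallback` and the `primary`, `fallback` and `is_valid` callables are hypothetical stand-ins for `fromSRA`, `parallel-fastq-dump` and the validation step.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical type: a downloader takes a run ID and returns a local path,
# or None when the file could not be obtained.
Downloader = Callable[[str], Optional[str]]


def fetch_with_fallback(
    run_ids: Iterable[str],
    primary: Downloader,
    fallback: Downloader,
    is_valid: Callable[[str], bool],
) -> Dict[str, str]:
    """Download each run with `primary`, then retry failures with `fallback`.

    Raises RuntimeError if any run is still missing or invalid afterwards.
    """
    fetched: Dict[str, str] = {}
    missing: List[str] = []

    # First pass: primary source, keeping track of what is missing or invalid.
    for run_id in run_ids:
        path = primary(run_id)
        if path is not None and is_valid(path):
            fetched[run_id] = path
        else:
            missing.append(run_id)

    # Second pass: retry only the missing runs with the fallback source.
    failed: List[str] = []
    for run_id in missing:
        path = fallback(run_id)
        if path is not None and is_valid(path):
            fetched[run_id] = path
        else:
            failed.append(run_id)

    # Stop with an error if anything is still unavailable.
    if failed:
        raise RuntimeError("Could not download valid files for: " + ", ".join(failed))
    return fetched
```

The returned mapping plays the role of the mixed "*fromSRA*" and "*fastqdump*" channels: one valid path per requested ID, regardless of which source provided it.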
from viralrecon.
I have put a restriction in the samplesheet validation script that these IDs have to begin with `SRR`:
viralrecon/bin/check_samplesheet.py
Line 70 in c38ba7b
Depending on how we decide to implement this feature, will this have to be extended to also download using `SRP` (SRA study accession) and `SRX` (SRA experiment accession) IDs, to get multiple fastq files too?
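A minimal sketch of such a prefix check, and of how it could be extended to the other accession types, is shown below. It is not the code at line 70 of `check_samplesheet.py`; the function name `accession_type` and the rule that the prefix must be followed only by digits are assumptions made for illustration.

```python
from typing import Optional

# SRA accession prefixes mentioned in this thread.
ACCESSION_TYPES = {"SRR": "run", "SRX": "experiment", "SRP": "study"}


def accession_type(sample_id: str) -> Optional[str]:
    """Return 'run', 'experiment' or 'study', or None if the ID is not recognised.

    Assumption: an accession is a three-letter prefix followed only by digits.
    """
    sample_id = sample_id.strip()
    prefix, digits = sample_id[:3], sample_id[3:]
    if prefix in ACCESSION_TYPES and digits.isascii() and digits.isdigit():
        return ACCESSION_TYPES[prefix]
    return None
```

With the current restriction, a samplesheet entry would be accepted only when `accession_type(sample_id) == "run"`.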
Is there a way we can check that the corresponding samples were actually sequenced on an Illumina machine? The pipeline isn't currently set up to deal with other platforms.
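One possible way to check the platform is to look at the run metadata before downloading. The sketch below assumes the ENA Portal API `filereport` endpoint and its `run_accession` and `instrument_platform` fields, returned as tab-separated text with a header row; the URL, field names and the `ILLUMINA` value should be verified against the ENA documentation before relying on them. The helper names are hypothetical.

```python
import csv
import io
from typing import Dict, List
from urllib.parse import urlencode

# Assumption: ENA Portal API endpoint for per-run file reports.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(run_id: str) -> str:
    """Build the metadata query URL for one run accession."""
    query = urlencode({
        "accession": run_id,
        "result": "read_run",
        "fields": "run_accession,instrument_platform",
    })
    return ENA_FILEREPORT + "?" + query


def platforms_from_tsv(tsv_text: str) -> Dict[str, str]:
    """Parse a tab-separated report into {run accession: platform}."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["run_accession"]: row["instrument_platform"] for row in reader}


def non_illumina_runs(tsv_text: str) -> List[str]:
    """Return the runs whose reported platform is not Illumina."""
    platforms = platforms_from_tsv(tsv_text)
    return sorted(run for run, name in platforms.items() if name.upper() != "ILLUMINA")
```

The response text would be fetched separately (for example with `urllib.request.urlopen(filereport_url(run_id))`), and the pipeline could stop early if `non_illumina_runs` returns anything.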
Given the various ways in which data can be uploaded and extracted from the SRA, I have also been thinking whether we need to have a separate process to validate the fastq files after they have been downloaded. Better that the pipeline fails sooner rather than later if there are any issues. This tool looks promising and is available on Bioconda:
https://github.com/nunofonseca/fastq_utils#fastq_info---validates-and-collects-information-from-single-or-paired-fastq-files
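To give an idea of what such a validation step checks, here is a minimal Python sketch of a structural FASTQ check. It is not `fastq_info` and covers far less; the function name `count_fastq_records` and the strict four-lines-per-record assumption are choices made for this example only.

```python
import gzip
from typing import IO


def _open_text(path: str) -> IO[str]:
    """Open a plain or gzip-compressed file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_fastq_records(path: str) -> int:
    """Return the number of records, raising ValueError on malformed records.

    Assumes exactly four lines per record and no trailing blank lines.
    """
    records = 0
    with _open_text(path) as handle:
        while True:
            header = handle.readline()
            if not header:
                break  # clean end of file
            seq = handle.readline().rstrip("\r\n")
            plus = handle.readline()
            qual = handle.readline().rstrip("\r\n")
            number = records + 1
            if not header.startswith("@"):
                raise ValueError(f"record {number}: header does not start with '@'")
            if not plus.startswith("+"):
                raise ValueError(f"record {number}: separator line does not start with '+'")
            if not seq or len(seq) != len(qual):
                raise ValueError(f"record {number}: sequence and quality lengths differ")
            records += 1
    if records == 0:
        raise ValueError("no records found")
    return records
```

For paired-end data, comparing the counts returned for the two files is a cheap additional check that the mates were downloaded completely.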
from viralrecon.
For the moment we can just go for the `SRR` IDs. If we want to add `SRP` and `SRX` in the future, we can always read the SRA IDs from the study or the experiment and return the `SRR` IDs in a way that the functionality we are implementing can deal with them.
Since we discussed creating a separate process to download these files, I think it makes sense to validate the files within the same process. I'll take a look at this tool; it seems promising, as you say.
from viralrecon.
Code updated to separate `prefetch` and `parallel-fastq-dump`; this way, if the latter fails, the downloaded data is not lost.
Current version can be found here:
https://github.com/JoseEspinosa/viralrecon/blob/62e438282b919f1e26854766d42e4bfa274f9785/main.nf#L426-L620
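The benefit of splitting the two steps can be illustrated with a short Python sketch, independent of the Nextflow code linked above. The function name `download_run`, the `completed` set and the injectable `run` callable are hypothetical; the `parallel-fastq-dump` flags follow that tool's README, and the sketch assumes that `parallel-fastq-dump` picks up the data already fetched by `prefetch`.

```python
import subprocess
from typing import Callable, List, Sequence, Set

Runner = Callable[[Sequence[str]], None]


def _run(cmd: Sequence[str]) -> None:
    """Default runner: execute the command and raise if it fails."""
    subprocess.run(list(cmd), check=True)


def download_run(
    run_id: str,
    outdir: str,
    completed: Set[str],
    threads: int = 4,
    run: Runner = _run,
) -> List[str]:
    """Run the steps not yet in `completed`; return the names of the steps executed."""
    steps = [
        ("prefetch", ["prefetch", run_id]),
        ("dump", [
            "parallel-fastq-dump", "--sra-id", run_id,
            "--threads", str(threads), "--outdir", outdir,
            "--split-files", "--gzip",
        ]),
    ]
    executed: List[str] = []
    for name, cmd in steps:
        if name in completed:
            continue  # already done in an earlier attempt
        run(cmd)  # raises on failure; earlier steps stay recorded
        completed.add(name)
        executed.append(name)
    return executed
```

If the dump step fails, `completed` still contains `"prefetch"`, so a second call with the same set only repeats the conversion. In the pipeline, this bookkeeping is what separate processes provide.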
from viralrecon.
I discovered that `fromSRA` is not faulty as I reported; I actually opened this issue on the nextflow repo. The problem was that if the pipeline is manually interrupted while a file is being staged by nextflow, the file, although truncated, exists in `$NXF_WORK/stage`, so nextflow does not try to obtain the file again. I found it thanks to the `kraken2` processes, since I had interrupted the pipeline manually and then nxf complained about the DB `tar.gz` file being truncated.
In summary, `fromSRA` fails only if the file is not yet on the EBI server; otherwise it works fine. Anyhow, I think it would be nice to validate staged files at the beginning of the pipeline.
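A truncated staged file of the kind described above can be detected by decompressing it to the end. The following Python sketch shows one way to do that for gzip-compressed files such as `fastq.gz` or `tar.gz`; the function name `gzip_is_intact` is hypothetical, and the check only covers the gzip layer, not the content inside it.

```python
import gzip
import zlib


def gzip_is_intact(path: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the whole gzip stream can be decompressed.

    A truncated file raises EOFError while reading; a corrupt or missing
    file raises OSError or zlib.error. All of these are reported as False.
    """
    try:
        with gzip.open(path, "rb") as handle:
            # Read in chunks so large files are not loaded into memory.
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        return False
    return True
```

Running such a check on staged inputs before the first real process would let the pipeline fail right away instead of deep inside a later step.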
from viralrecon.
xref: #50
from viralrecon.