Given that most of the raw data for this pipeline will be available in publicly availa

Pull data directly fromSRA about viralrecon HOT 6 CLOSED

nf-core commented on June 3, 2024

Pull data directly fromSRA

from viralrecon.

Comments (6)

JoseEspinosa commented on June 3, 2024 1

Starting at the line tagged in https://github.com/JoseEspinosa/viralrecon/blob/fba94cfd88625c7052afe5e275970ad5d33ca2be/main.nf#L328 there is a possible implementation for downloading FASTQ files. It still needs to be cleaned up and merged with the current version of the pipeline, and some of the code could be moved into a function. In any case, the logic of the current approach is:

Check which files are given only by ID in the samplesheet (i.e. FASTQ files that need to be downloaded).
Download these files using Nextflow's fromSRA when possible.
Validate the files coming from fromSRA and check which ones are missing, since the EBI server does not seem to be completely up to date with COVID-19 data.
Retry the download of missing or invalid files using parallel-fastq-dump.
Validate the files downloaded by parallel-fastq-dump.
If any file is invalid or unreachable, the pipeline stops and throws an error.
If all files are downloaded, the channels ("*fromSRA*" and "*fastqdump*") are mixed with the channels generated from the files provided directly by the user.

*Note: The current version of the parallel-fastq-dump command extracts only a subset of the reads from the FASTQ files for testing purposes (-X option).
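
The download logic listed above can be sketched outside Nextflow as a plain Python function. This is only an illustration of the control flow, not the code in main.nf: `fetch_with_fallback`, `primary`, `fallback` and `is_valid` are hypothetical names, with `primary` standing in for the fromSRA download and `fallback` for parallel-fastq-dump.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical signatures: a downloader returns a local path (or None when the
# run is not available), and a validator returns True when the file is usable.
Downloader = Callable[[str], Optional[str]]
Validator = Callable[[str], bool]


def fetch_with_fallback(
    run_ids: Iterable[str],
    primary: Downloader,
    fallback: Downloader,
    is_valid: Validator,
) -> Dict[str, str]:
    """Return {run_id: path}; raise if any run cannot be obtained and validated."""
    resolved: Dict[str, str] = {}
    failed: List[str] = []
    for run_id in run_ids:
        path = None
        # Try the primary source first, then the fallback, validating each result.
        for download in (primary, fallback):
            candidate = download(run_id)
            if candidate is not None and is_valid(candidate):
                path = candidate
                break
        if path is None:
            failed.append(run_id)
        else:
            resolved[run_id] = path
    if failed:
        # Mirrors the "pipeline stops and throws an error" step.
        raise RuntimeError("Could not obtain valid files for: " + ", ".join(failed))
    return resolved
```

The returned mapping plays the role of the mixed channels: every requested run ends up with exactly one validated path, regardless of which source delivered it.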

drpatelh commented on June 3, 2024

I have added a restriction to the samplesheet validation script so that these IDs must begin with SRR:

viralrecon/bin/check_samplesheet.py

Line 70 in c38ba7b

if sample[:3] == 'SRR':

Depending on how we decide to implement this feature, should this be extended to also accept SRP (SRA study accession) and SRX (SRA experiment accession) IDs, each of which can resolve to multiple FASTQ files?
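
If the check were extended beyond SRR, one option is to classify the ID by its prefix rather than compare the first three characters. The sketch below is an assumption about how that could look, not what check_samplesheet.py does; `classify_accession` and `LEVELS` are hypothetical names, and the pattern only checks the shape of the ID, not whether the accession exists.

```python
import re

# Prefix -> accession level. Only SRR is accepted by the check quoted above;
# SRX and SRP are the possible extensions discussed in this comment.
LEVELS = {"SRR": "run", "SRX": "experiment", "SRP": "study"}
ACCESSION_PATTERN = re.compile(r"(SRR|SRX|SRP)\d+")


def classify_accession(sample_id):
    """Return 'run', 'experiment' or 'study', or None if the ID is not an SRA accession."""
    match = ACCESSION_PATTERN.fullmatch(sample_id.strip())
    return LEVELS[match.group(1)] if match else None
```

Unlike `sample[:3] == 'SRR'`, this also rejects IDs such as `SRR_sample1` that merely start with the right letters.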

Is there a way we can check that the corresponding samples were actually sequenced on an Illumina machine? The pipeline isn't currently set up to deal with other platforms.
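
One possible approach, assuming run metadata is available as a tab-separated table with one row per run, is to filter on the platform column before any download starts. The column names `run_accession` and `instrument_platform` are assumptions about that table (they would need to be checked against whichever metadata source is used), and `non_illumina_runs` is a hypothetical helper.

```python
import csv
import io


def non_illumina_runs(
    metadata_tsv,
    id_column="run_accession",             # assumed column name
    platform_column="instrument_platform",  # assumed column name
):
    """Return the run IDs whose platform is anything other than Illumina."""
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    return [
        row[id_column]
        for row in reader
        if (row.get(platform_column) or "").strip().upper() != "ILLUMINA"
    ]


# Made-up accessions, for illustration only.
example = (
    "run_accession\tinstrument_platform\n"
    "SRR0000001\tILLUMINA\n"
    "SRR0000002\tOXFORD_NANOPORE\n"
)
unsupported = non_illumina_runs(example)
```

Rows with a missing platform value are reported as non-Illumina, so the check errs on the side of failing early.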

Given the various ways in which data can be uploaded to and extracted from the SRA, I have also been wondering whether we need a separate process to validate the FASTQ files after they have been downloaded. It is better for the pipeline to fail sooner rather than later if there are any issues. This tool looks promising and is available on Bioconda:
https://github.com/nunofonseca/fastq_utils#fastq_info---validates-and-collects-information-from-single-or-paired-fastq-files
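
To make concrete what such a validation step checks, here is a minimal Python stand-in. It is not fastq_info and covers far less: it assumes strict 4-line records and only verifies the `@` header, the `+` separator, and that sequence and quality strings have equal length. `count_fastq_records` is a hypothetical name.

```python
import gzip


def count_fastq_records(path):
    """Count 4-line FASTQ records in `path`; raise ValueError on the first problem."""
    # Plain or gzip-compressed input, decided by file extension.
    opener = gzip.open if str(path).endswith(".gz") else open
    records = 0
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline()
            if not header:
                break  # clean end of file
            sequence = handle.readline().rstrip("\r\n")
            separator = handle.readline()
            quality_line = handle.readline()
            where = f"record {records + 1}"
            if not header.startswith("@"):
                raise ValueError(f"{where}: header does not start with '@'")
            if not separator.startswith("+"):
                raise ValueError(f"{where}: missing '+' separator line")
            if not quality_line:
                raise ValueError(f"{where}: file ends before the quality line")
            if len(quality_line.rstrip("\r\n")) != len(sequence):
                raise ValueError(f"{where}: sequence and quality lengths differ")
            records += 1
    return records
```

For paired-end data, comparing the record counts of the two files would be a natural additional check. A truncated `.gz` file surfaces as an exception from the `gzip` module rather than as a `ValueError`.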

JoseEspinosa commented on June 3, 2024

For the moment we can just go with SRR. If we want to add SRP and SRX in the future, we can always read the run IDs from the study or the experiment and return the SRR IDs, so that the functionality we are implementing now can deal with them.
Since we discussed creating a separate process to download these files, I think it makes sense to validate the files within that same process. I'll take a look at this tool; it seems promising, as you say.

JoseEspinosa commented on June 3, 2024

Code updated to separate prefetch and parallel-fastq-dump. This way, if the latter fails, the downloaded data is not lost.
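
The benefit of the split can be shown with a small Python sketch, independent of the Nextflow processes in the linked code. `download_then_convert`, `fetch` and `convert` are hypothetical names: `fetch` stands in for the prefetch step and `convert` for parallel-fastq-dump, and the `.sra` file on disk acts as the checkpoint between them.

```python
from pathlib import Path


def download_then_convert(accession, workdir, fetch, convert):
    """Run the two stages separately so a conversion failure keeps the download."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    sra_file = workdir / f"{accession}.sra"

    # Stage 1: download only if a previous attempt has not already done so.
    if not sra_file.exists():
        fetch(accession, sra_file)

    # Stage 2: conversion. If this raises, sra_file stays on disk and the next
    # call goes straight to conversion without downloading again.
    convert(sra_file, workdir / "fastq")
    return sra_file
```

In Nextflow terms the same effect comes from having two processes: with `-resume`, a failed conversion task is rerun while the cached download task is reused.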

Current version can be found here:
https://github.com/JoseEspinosa/viralrecon/blob/62e438282b919f1e26854766d42e4bfa274f9785/main.nf#L426-L620

JoseEspinosa commented on June 3, 2024

I discovered that fromSRA is not faulty as I had reported; I had actually opened an issue about it on the Nextflow repo. The problem was that if the pipeline is manually interrupted while a file is being staged by Nextflow, the file still exists in $NXF_WORK/stage, albeit truncated, so Nextflow does not try to obtain it again. I found this thanks to the kraken2 processes: I had interrupted the pipeline manually, and afterwards Nextflow complained that the DB tar.gz file was truncated.
In summary, fromSRA fails only if the file is not yet on the EBI server; otherwise it works fine. Still, I think it would be nice to validate staged files at the beginning of the pipeline.
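
For gzip-compressed files such as .fastq.gz or a .tar.gz database, a truncated staged copy can be detected by decompressing the whole stream once. The function below is a minimal sketch using only the Python standard library; `gzip_is_complete` is a hypothetical name, and the check says nothing about whether the decompressed content is valid FASTQ or a valid tar archive.

```python
import gzip
import os
import zlib


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if `path` is a non-empty gzip file that decompresses to the end."""
    if os.path.getsize(path) == 0:
        return False  # an empty file would otherwise read as an empty stream
    try:
        with gzip.open(path, "rb") as handle:
            # Read and discard the data; a truncated or corrupt stream raises.
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        return False
    return True
```

Reading in chunks keeps memory use flat for large files; the cost is one full decompression pass per file.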

drpatelh commented on June 3, 2024

xref: #50
