
Comments (6)

JoseEspinosa commented on June 3, 2024

From the line tagged at https://github.com/JoseEspinosa/viralrecon/blob/fba94cfd88625c7052afe5e275970ad5d33ca2be/main.nf#L328 there is a possible implementation of how to download fastq files. It needs to be cleaned up and merged with the current version of the pipeline. Also, some of the code could be nicely wrapped in a function. In any case, the logic of the current approach is:

  1. Check which files are given just by ID in the samplesheet (i.e. fastq files that should be downloaded).

  2. Download these files using nextflow fromSRA when possible.

  3. Validate the files coming from fromSRA and check which files are missing, since it seems that the EBI server is not completely up to date with covid-19 data.

  4. Retry the download of missing or invalid files using parallel-fastq-dump.

  5. Validate the files downloaded by parallel-fastq-dump.

  6. If any file is not valid or not reachable, the pipeline stops and throws an error.

  7. If all files are downloaded, the channels ("*fromSRA*" and "*fastqdump*") are mixed with the channels generated from the files provided directly by the user.

*Note: The current version of the parallel-fastq-dump command subsamples only some reads from the fastq files for testing purposes (-X option).
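
The control flow of steps 2 to 6 can be sketched outside of Nextflow as follows. This is only an illustration in Python, not the code in main.nf: `fetch_with_fallback`, `primary`, `fallback` and `is_valid` are hypothetical names, where `primary` stands in for the fromSRA download, `fallback` for parallel-fastq-dump, and `is_valid` for whatever file validation is chosen.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical types: a downloader takes a run ID and returns the paths it
# fetched, or None when the run could not be obtained.
Downloader = Callable[[str], Optional[List[str]]]
Validator = Callable[[List[str]], bool]


def fetch_with_fallback(
    run_ids: Iterable[str],
    primary: Downloader,
    fallback: Downloader,
    is_valid: Validator,
) -> Dict[str, List[str]]:
    """Try `primary` for every run, retry failures with `fallback`,
    and raise if any run is still missing or invalid afterwards."""
    fetched: Dict[str, List[str]] = {}
    retry: List[str] = []

    # Steps 2 and 3: first attempt, then validation of what came back.
    for run_id in run_ids:
        files = primary(run_id)
        if files and is_valid(files):
            fetched[run_id] = files
        else:
            retry.append(run_id)

    # Steps 4 and 5: second attempt for missing or invalid runs only.
    failed: List[str] = []
    for run_id in retry:
        files = fallback(run_id)
        if files and is_valid(files):
            fetched[run_id] = files
        else:
            failed.append(run_id)

    # Step 6: stop with an error if anything is still unavailable.
    if failed:
        raise RuntimeError(
            "Could not obtain valid fastq files for: " + ", ".join(failed)
        )
    return fetched
```

Keeping the downloaders as plain callables means the retry logic can be tested without touching the network, which is roughly what putting this code in a function would buy us.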

from viralrecon.

drpatelh commented on June 3, 2024

I have put a restriction in the samplesheet validation script that these ids have to begin with SRR:

if sample[:3] == 'SRR':

Depending on how we decide to implement this feature, will this have to be extended to download using SRP (SRA study accession) and SRX (SRA experiment accession) to get multiple fastq files too?
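
If the check were extended, one option would be to classify the ID by its prefix rather than only testing for SRR. A minimal sketch, assuming accessions are a three-letter prefix followed by digits; `classify_accession` and `ACCESSION_TYPES` are hypothetical names and are not part of the current validation script:

```python
import re

# Assumption: only the three prefixes mentioned in this thread are handled.
ACCESSION_TYPES = {"SRR": "run", "SRX": "experiment", "SRP": "study"}
_ACCESSION_RE = re.compile(r"(SRR|SRX|SRP)\d+")


def classify_accession(sample):
    """Return 'run', 'experiment' or 'study', or None if `sample`
    does not look like one of the supported SRA accessions."""
    match = _ACCESSION_RE.fullmatch(sample.strip())
    if match is None:
        return None
    return ACCESSION_TYPES[match.group(1)]
```

With this, a samplesheet entry that returns `None` would be treated as a regular sample name, and the `'experiment'` and `'study'` cases could be rejected until they are supported.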

Is there a way we can check that the corresponding samples were actually sequenced on an Illumina machine? The pipeline isn't currently set up to deal with other platforms.

Given the various ways in which data can be uploaded and extracted from the SRA, I have also been thinking whether we need to have a separate process to validate the fastq files after they have been downloaded. Better that the pipeline fails sooner rather than later if there are any issues. This tool looks promising and is available on Bioconda:
https://github.com/nunofonseca/fastq_utils#fastq_info---validates-and-collects-information-from-single-or-paired-fastq-files
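
To make it concrete what "validate" could mean here, below is a rough sketch of the most basic structural checks on a fastq file. It is not how fastq_info works and covers far less; `count_fastq_records` is a hypothetical helper that only checks four-line records, the `@` and `+` markers, and matching sequence and quality lengths.

```python
def count_fastq_records(lines):
    """Return the number of records in an iterable of fastq lines,
    raising ValueError on the first structural problem found."""
    it = iter(lines)
    n = 0
    while True:
        header = next(it, None)
        if header is None:
            return n  # clean end of input
        # A record is exactly four lines: header, sequence, separator, quality.
        record = [header] + [next(it, None) for _ in range(3)]
        if None in record:
            raise ValueError(f"record {n + 1} is truncated")
        header, seq, sep, qual = (s.rstrip("\r\n") for s in record)
        if not header.startswith("@"):
            raise ValueError(f"record {n + 1}: header does not start with '@'")
        if not sep.startswith("+"):
            raise ValueError(f"record {n + 1}: separator does not start with '+'")
        if len(seq) != len(qual):
            raise ValueError(f"record {n + 1}: sequence/quality length mismatch")
        n += 1
```

Any open text handle works as input, for example `count_fastq_records(open(path))` for a plain file or `gzip.open(path, "rt")` for a compressed one.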


JoseEspinosa commented on June 3, 2024

For the moment we can just go for SRR. If we want to add SRP and SRX in the future, we can always read the SRA IDs from the study or the experiment and return the SRR IDs in a form that the functionality we are implementing can handle.
As we discussed creating a separate process to download these files, I think it makes sense to validate the files within the same process. I'll have a look at this tool; it seems promising, as you say.


JoseEspinosa commented on June 3, 2024

Code updated to separate prefetch and parallel-fastq-dump, so that if the latter fails the downloaded data is not lost.

Current version can be found here:
https://github.com/JoseEspinosa/viralrecon/blob/62e438282b919f1e26854766d42e4bfa274f9785/main.nf#L426-L620


JoseEspinosa commented on June 3, 2024

I discovered that fromSRA is not faulty as I had reported; I had actually opened an issue about this on the nextflow repo. The problem was that if the pipeline is manually interrupted while a file is being staged by nextflow, the file, although truncated, still exists in $NXF_WORK/stage, so nextflow does not try to obtain it again. I found this thanks to the kraken2 processes: I had interrupted the pipeline manually, and nxf then complained about the DB tar.gz file being truncated.
In summary, fromSRA only fails if the file is not yet on the EBI server; otherwise it works fine. Anyhow, I think it would be nice to validate staged files at the beginning of the pipeline.
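
For gzipped inputs such as fastq.gz or the DB tar.gz, a truncated staged file can be caught by simply decompressing it to the end. A minimal sketch in Python, assuming the staged files are gzip-compressed; `gzip_is_intact` is a hypothetical helper and not something nextflow provides:

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the gzip file at `path` decompresses to the end,
    False if it is truncated, corrupt, or not a gzip file at all."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read and discard the data in chunks so large files are
            # checked without being loaded into memory.
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # EOFError: stream ended early (truncated download);
        # OSError: bad gzip header or CRC; zlib.error: corrupt data.
        return False
    return True
```

This only checks the compression layer, so a file can pass here and still contain malformed fastq records.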


drpatelh commented on June 3, 2024

xref: #50

