
grabseqs's Introduction

grabseqs

Utility that simplifies bulk downloading of data from next-generation sequencing repositories such as NCBI SRA, MG-RAST, and iMicrobe.

Note: as of 3/20/2020, read downloads for some MG-RAST samples are not working through either the MG-RAST web interface or its API.

Install

Install grabseqs and all dependencies via conda:

conda install grabseqs -c louiejtaylor -c bioconda -c conda-forge

Or with pip (and install the non-Python dependencies yourself):

pip install grabseqs

Note: If you're using SRA data, run vdb-config -i after installing sra-tools and turn off local file caching, unless you want extra copies of the downloaded sequences taking up space (read more here).

Quick start

Download all samples from a single SRA Project:

grabseqs sra SRP#######

Or any combination of projects (S/ERP), runs (S/ERR), and BioProjects (PRJNA):

grabseqs sra SRR######## ERP####### PRJNA######## ERR########

If you'd like to do a dry run and just get a list of samples that will be downloaded, pass -l:

grabseqs sra -l SRP########

Similar syntax works for MG-RAST:

grabseqs mgrast mgp##### mgm#######

And iMicrobe (prefixing the sample numbers with "s" and project numbers with "p"):

grabseqs imicrobe p4 s3
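
When a mixed list of identifiers needs to be split across the three subcommands, the prefixes used in the examples above are enough to sort them. The following is a minimal sketch, not grabseqs' own parsing code: the function names and the regular expressions are assumptions based only on the prefixes shown in this README (S/ERP, S/ERR, PRJNA, mgp/mgm, and iMicrobe's "p"/"s"), and the optional decimal suffix on MG-RAST IDs is also an assumption.

```python
import re

# Illustrative patterns derived from the README examples; other valid
# prefixes may exist and are not covered here.
PATTERNS = [
    ("sra", re.compile(r"^(?:[SE]R[PR]\d+|PRJNA\d+)$")),
    ("mgrast", re.compile(r"^mg[pm]\d+(?:\.\d+)?$")),
    ("imicrobe", re.compile(r"^[ps]\d+$")),
]


def guess_repository(identifier):
    """Return the grabseqs subcommand matching an identifier, or None."""
    for repo, pattern in PATTERNS:
        if pattern.match(identifier):
            return repo
    return None


def group_by_repository(identifiers):
    """Group identifiers by subcommand so each group fits one grabseqs call."""
    groups = {}
    for identifier in identifiers:
        repo = guess_repository(identifier)
        if repo is None:
            raise ValueError("Unrecognized identifier: " + identifier)
        groups.setdefault(repo, []).append(identifier)
    return groups
```

Each resulting group can then be passed to the matching subcommand, for example every "imicrobe" identifier to a single grabseqs imicrobe call.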

Detailed usage

See the grabseqs FAQ for detailed troubleshooting tips!

Fun options:

grabseqs sra -t 10 -m metadata.csv -o proj/ -r 3 SRP#######

(translation: use 10 threads, save metadata to proj/metadata.csv, download to the dir proj/, retry failed downloads 3x, get all samples from SRP#######)
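
For scripted use, the same command can be assembled as an argument list. This is a small sketch using only the flags described above (-t, -m, -o, -r). The helper name build_sra_command is hypothetical and not part of grabseqs.

```python
import shlex


def build_sra_command(ids, threads=None, metadata=None, outdir=None, retries=None):
    """Assemble a `grabseqs sra` argument list from the options shown above."""
    cmd = ["grabseqs", "sra"]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if metadata is not None:
        cmd += ["-m", metadata]
    if outdir is not None:
        cmd += ["-o", outdir]
    if retries is not None:
        cmd += ["-r", str(retries)]
    return cmd + list(ids)


# Placeholder accession, as in the example above; substitute a real one.
cmd = build_sra_command(["SRP#######"], threads=10, metadata="metadata.csv",
                        outdir="proj/", retries=3)
print(shlex.join(cmd))  # shell-quoted form, useful for logging (Python 3.8+)
```

The list form can be handed to subprocess.run(cmd, check=True) directly, which avoids shell quoting issues with unusual directory names.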

If you'd like to pass your own arguments to fasterq-dump to get data in a slightly different format, you can do so like this:

grabseqs sra SRP####### -r 0 --custom_fqdump_args="--split-spot --progress"

Full usage:

grabseqs sra [-h] [-m METADATA] [-o OUTDIR] [-r RETRIES] [-t THREADS]
             [-f] [-l] [--no_parsing] [--parse_run_ids]
             [--use_fastq_dump]
             id [id ...]

positional arguments:
  id                One or more BioProject, ERR/SRR or ERP/SRP number(s)

optional arguments:
  -h, --help        show this help message and exit
  -m METADATA       filename in which to save SRA metadata (.csv format,
                    relative to OUTDIR)
  -o OUTDIR         directory in which to save output. created if it doesn't
                    exist
  -r RETRIES        number of times to retry download
  -t THREADS        threads to use (for fasterq-dump/pigz)
  -f                force re-download of files
  -l                list (but do not download) samples to be grabbed
  --parse_run_ids   parse SRR/ERR identifiers (do not pass straight to fasterq-
                    dump)
  --custom_fqdump_args CUSTOM_FQD_ARGS
                    "string" containing args to pass to fastq-dump
  --use_fastq_dump  use legacy fastq-dump instead of fasterq-dump (no
                    multithreaded downloading)

Downloads .fastq.gz files to OUTDIR (or the working directory if not specified). If the -m flag is passed, saves metadata to OUTDIR with filename METADATA in csv format.
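
After a run finishes, the outputs described above can be collected with a few lines of Python. This is a sketch under stated assumptions: the helper name list_outputs is hypothetical, and the metadata file is assumed to be a CSV with a header row.

```python
import csv
import glob
import os


def list_outputs(outdir=".", metadata=None):
    """Return (sorted .fastq.gz paths, metadata rows) for a finished run."""
    reads = sorted(glob.glob(os.path.join(outdir, "*.fastq.gz")))
    rows = []
    if metadata is not None:
        # METADATA is interpreted relative to OUTDIR, as described above.
        with open(os.path.join(outdir, metadata), newline="") as handle:
            rows = list(csv.DictReader(handle))
    return reads, rows
```

For the earlier example, list_outputs("proj/", "metadata.csv") would look for reads in proj/ and read proj/metadata.csv.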

Similar options are available for downloading from MG-RAST:

grabseqs mgrast [-h] [-m METADATA] [-o OUTDIR] [-r RETRIES]
                [-t THREADS] [-f] [-l]
                rastid [rastid ...]

And iMicrobe:

grabseqs imicrobe [-h] [-m METADATA] [-o OUTDIR] [-r RETRIES]
                  [-t THREADS] [-f] [-l]
                  imicrobeid [imicrobeid ...]

Troubleshooting

See the grabseqs FAQ for detailed troubleshooting tips. If the FAQs don't fix your problem, feel free to open an issue!

Dependencies

  • Python 3 (external packages req'd: requests, requests-html, pandas, fake-useragent)
  • sra-tools>2.9
  • pigz
  • wget

If you use conda (on Linux), these will be installed for you!

Grabseqs runs on Mac or Linux. We've tested on these specific OSes:

Linux (conda or pip):

  • CentOS 6, 7, and 8
  • Debian 9 and 10
  • Ubuntu 16.04, 18.04, and 19.10
  • Red Hat Enterprise 6, 7, and 8
  • SUSE Enterprise 12 and 15

Mac (pip):

  • MacOS 10.14

Grabseqs has been tested and works with the following versions of the Python dependencies (though these are neither minimal nor pinned version numbers):

  • requests 2.22.0
  • requests-html 0.10.0
  • pandas 0.25.3
  • fake-useragent 0.1.11

Citation

If you use grabseqs in your work, please cite:

Louis J Taylor, Arwa Abbas, Frederic D Bushman. "grabseqs: Simple downloading of reads and metadata from multiple next-generation sequencing data repositories." Bioinformatics, (2020), btaa167, https://doi.org/10.1093/bioinformatics/btaa167

Please also cite the researchers who generated the data (and the repository, if appropriate)!


Changelog

Dev version (not yet released)

  • Added a walk-through for adding a new repo using template.py
  • Better handling for invalid SRA accession numbers

0.7.0 (2020-01-29)

  • Allow users to pass custom args to fast(er)q-dump
  • Minor re-writes of download handling code for easier readability

0.6.1 (2019-12-20)

  • Validate compressed files (fix #8 and #34)

0.6.0 (2019-12-12)

  • Gracefully handle incomplete or missing dependencies
  • Major rewrite of test suite

0.5.2 (2019-12-05)

  • Improvements to work with multiple versions of Python 3

0.5.1 (2019-11-23)

  • Hotfix handling outdated versions of sra-tools

0.5.0 (2019-04-11)

  • Metadata available for all sources in .csv format

History

This project grew out of, and incorporates code from, hisss; many thanks to ArwaAbbas for helping make this work!

grabseqs's People

Contributors

arwaabbas, louiejtaylor

grabseqs's Issues

Refactor code

  • Simplify/break up funcs to facilitate testing
  • Abstract most stuff into repo-specific functions from __init__.py
  • Write warnings to stderr, not stdout
  • Re-write tests to be more modular (and make it easier to run specific ones)
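
The "warnings to stderr" item above could look something like the sketch below. The helper name warn and the "Warning: " prefix are assumptions for illustration, not existing grabseqs code.

```python
import sys


def warn(message, stream=None):
    """Write a warning to stderr so stdout stays clean for listings such as -l."""
    if stream is None:
        stream = sys.stderr  # looked up at call time so redirection still works
    print("Warning: " + message, file=stream)
```

Keeping warnings off stdout means the output of a listing run can be piped into other tools without filtering.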

fasterq-dump error:

Thanks for making this tool, it's a real time saver!

I'm attempting to download a list of SRS accessions, which, at the start was working fine, but after a few hours has been consistently erroring:

downloading SRR2192724 using fasterq-dump
2019-04-09T00:56:16 fasterq-dump.2.9.1 err: storage exhausted while creating directory within file system module - error with https open 'https://sra-download.ncbi.nlm.nih.gov/traces/sra33/SRR/002141/SRR2192724'
2019-04-09T00:56:16 fasterq-dump.2.9.1 err: **invalid accession 'SRR2192724'**
pigz: skipping: SRS475922/SRR2192724*fastq does not exist
SRA download for acc SRR2192724 failed, retrying 0 more times.
Traceback (most recent call last):
  File "/localscratch/EisenRa/miniconda2/bin/grabseqs", line 11, in <module>
    sys.exit(main())
  File "/localscratch/EisenRa/miniconda2/lib/python3.6/site-packages/grabseqslib/__init__.py", line 59, in main
    run_fasterq_dump(acc, args.retries, args.threads, args.outdir, args.force, args.fastqdump)
  File "/localscratch/EisenRa/miniconda2/lib/python3.6/site-packages/grabseqslib/sra.py", line 114, in run_fasterq_dump
    raise Exception("download for "+acc+" failed. fast(er)q-dump returned "+str(retcode)+", pigz returned "+str(rgzip)+".")
Exception: download for SRR2192724 failed. fast(er)q-dump returned 0, pigz returned 0.

It claims invalid accession, but the SRA file link is downloadable with wget. Is this some kind of cache error? I've got enough space on the disk.

Commands ran:
while read SRS; do grabseqs sra -t 50 -m -o $SRS -r 3 $SRS; done < SRS.txt
Where SRS.txt = a list of SRS accessions, one per line.

Best wishes,
Raphael

grabseqs and geofetch

Hey, I just came across grabseqs and at first glance, it looks really similar to a tool I've been developing called geofetch. I just wondered if you had any interest in exploring the possibility of working together on this. Or perhaps you'd be interested in the idea of a PEP, which geofetch produces, which is a standardized way to represent the sample metadata that is downloaded from GEO. I haven't delved too deep into grabseqs yet as I just found it, but I thought I'd reach out to see if we could make a connection and alert you to some related projects.

SRA metadata not including BioSample attributes

I thought I was missing some metadata in one of our own previously-submitted SRA datasets because it wasn't showing up in the CSV file, but then the SRA admins pointed out that it does show up on the web interface and the TSV file generated there, just not the version downloaded by grabseqs via the SRA CGI URL.

This is for BioProject PRJNA506241, where you can see the full metadata (columns like dsODN) when viewing it here:

https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA506241

But downloading via this URL gives only the core SRA columns and not the BioSample ones:

http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=PRJNA506241

Looking closer they're almost two completely different sets of columns except for BioSample (and Consent, for some reason). Did something change server-side with this behavior, maybe? I also asked the SRA admins so I'll post an update if I learn anything.
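
To compare the two sources programmatically, the runinfo URL quoted in this issue can be rebuilt from its query parameters and the returned header checked for expected columns. This is only a sketch: the helper names are hypothetical, no request is made here, and a later issue on this page reports that the endpoint may no longer be reachable.

```python
from urllib.parse import urlencode

# Endpoint and parameters copied from the URL quoted in this issue.
RUNINFO_ENDPOINT = "http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi"


def runinfo_url(term):
    """Build the runinfo download URL for a project or run accession."""
    params = {"save": "efetch", "db": "sra", "rettype": "runinfo", "term": term}
    return RUNINFO_ENDPOINT + "?" + urlencode(params)


def missing_columns(header, expected):
    """Return the expected column names that are absent from a CSV header row."""
    present = set(header)
    return [name for name in expected if name not in present]
```

For example, missing_columns(header, ["dsODN"]) would flag the BioSample attribute mentioned above if the downloaded header lacks it.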

Grabseqs error

Hi,
I ran into this problem when using grabseqs:
Traceback (most recent call last):
  File "/media/home/user05/anaconda3/envs/python36/bin/grabseqs", line 11, in <module>
    sys.exit(main())
  File "/media/home/user05/anaconda3/envs/python36/lib/python3.6/site-packages/grabseqslib/__init__.py", line 58, in main
    acclist, metadata_agg = get_sra_acc_metadata(sra_identifier, args.outdir, args.list, not args.SRR_parsing, metadata_agg)
  File "/media/home/user05/anaconda3/envs/python36/lib/python3.6/site-packages/grabseqslib/sra.py", line 52, in get_sra_acc_metadata
    run_col = lines[0].index("Run")
ValueError: 'Run' is not in list
Could you please tell me how to solve this problem? Thanks.

All grabseqs SRA downloads failing

Looks like some changes on the NCBI side led to failures in SRA downloads:

grabseqs sra SRR11733975
Traceback (most recent call last):
  File "/users/cdiener/miniconda3/envs/sra/bin/grabseqs", line 11, in <module>
    sys.exit(main())
  File "/users/cdiener/miniconda3/envs/sra/lib/python3.7/site-packages/grabseqslib/__init__.py", line 58, in main
    metadata_agg = process_sra(args, zip_func)
  File "/users/cdiener/miniconda3/envs/sra/lib/python3.7/site-packages/grabseqslib/sra.py", line 31, in process_sra
    metadata_agg)
  File "/users/cdiener/miniconda3/envs/sra/lib/python3.7/site-packages/grabseqslib/sra.py", line 97, in get_sra_acc_metadata
    run_col = lines[0].index("Run")
ValueError: 'Run' is not in list

This seems to be caused by a hardcoded address to download the SRA manifest that is not reachable anymore.

Add tests for iMicrobe

  • Downloading in .fasta format
  • Downloading in .fastq format
  • -l (listing)
  • Not clobbering an already-downloaded file
  • Forcing download of an already existing file
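
The last two cases in the list above (not clobbering versus forcing) come down to one decision, which can be sketched and tested in isolation. The function name should_download is hypothetical. It only mirrors the documented meaning of -f and is not taken from the grabseqs source.

```python
import os


def should_download(path, force=False):
    """Download when the target is missing, or when the caller forces it (-f)."""
    return force or not os.path.exists(path)
```

A test can create a temporary file, assert that should_download returns False for it by default, and assert that it returns True with force=True.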

Version 0.5.0 breaks numpy

Multiple errors incl

module 'numpy' has no attribute '__version__'

and

ImportError: Something is wrong with the numpy installation. While importing we detected an older version of numpy in ['/.../miniconda2/envs/sunbeam/lib/python3.6/site-packages/numpy']. One method of fixing this is to repeatedly uninstall numpy until none is found, then reinstall this version.

I suspect this is due to duplicated requirements/dependencies between setup.py (pip) and other packages grabbing numpy in their environment.yml (conda) although I'm not sure what the correct workaround is for this...

pigz attempting to zip files that don't exist

Saw this when a test was running:

downloading SRR1913936 using fasterq-dump
spots read      : 11
reads read      : 22
reads written   : 22
pigz: skipping: /home/circleci/grabseqs_unittest/test_tiny_sra_paired/SRR1913936.fastq does not exist
✔ SRA paired sample download test passed

pigz should not attempt to zip a paired sample..?

Or, do I explicitly zip all possible files that come down to hedge against unpaired sequences? Either way, should fix this and #8 at the same time.
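
One way to avoid the "does not exist" message is to compress only the files that are actually on disk. The sketch below assumes unpaired output is named ACC.fastq and paired output ACC_1.fastq / ACC_2.fastq. Both the naming and the helper name existing_fastqs are assumptions, not confirmed grabseqs behavior.

```python
import os


def existing_fastqs(outdir, acc):
    """Return the FASTQ files for one accession that are actually on disk."""
    # Candidate names are an assumption: ACC.fastq for unpaired reads,
    # ACC_1.fastq and ACC_2.fastq for paired reads.
    candidates = [acc + ".fastq", acc + "_1.fastq", acc + "_2.fastq"]
    paths = [os.path.join(outdir, name) for name in candidates]
    return [path for path in paths if os.path.isfile(path)]
```

Passing only this list to the compressor covers paired and unpaired samples without naming files that were never written.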

Make --no_SRR_parsing default

Makes more sense this way--problems have come from people needing to use it
(Keep the option around until v1.0 at least)

grabseqs sra -l PRJDB5400 fails with ValueError: 'Run' is not in list

When I use the command "grabseqs sra -l PRJDB5400", I get the following error:
pigz not found, using gzip
Traceback (most recent call last):
  File "/home/tools/anaconda3/bin/grabseqs", line 8, in <module>
    sys.exit(main())
  File "/home/tools/anaconda3/lib/python3.8/site-packages/grabseqslib/__init__.py", line 58, in main
    metadata_agg = process_sra(args, zip_func)
  File "/home/tools/anaconda3/lib/python3.8/site-packages/grabseqslib/sra.py", line 27, in process_sra
    acclist, metadata_agg = get_sra_acc_metadata(sra_identifier,
  File "/home/tools/anaconda3/lib/python3.8/site-packages/grabseqslib/sra.py", line 97, in get_sra_acc_metadata
    run_col = lines[0].index("Run")
ValueError: 'Run' is not in list

For SRA downloading, make sure pigz zips only the correct sequence files

This is extremely unlikely to be an issue in practice, but if for some reason an individual were to be downloading two accession numbers such that one accession number was a substring of another accession number, pigz might clobber the shorter accession number because of the way you compress files.
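
The substring problem can be reproduced without any downloads. In the sketch below, the loose pattern has the same shape as the ACC*fastq pattern visible in the pigz log earlier on this page, while the strict pattern accepts only the accession plus an optional _1/_2 suffix. The accessions are made up for illustration, and the suffix convention and function names are assumptions.

```python
import fnmatch
import re


def glob_matches(filenames, acc):
    """Loose matching: every name that starts with the accession."""
    return fnmatch.filter(filenames, acc + "*fastq")


def exact_matches(filenames, acc):
    """Strict matching: the accession followed only by an optional _1/_2 suffix."""
    pattern = re.compile(re.escape(acc) + r"(?:_[12])?\.fastq$")
    return [name for name in filenames if pattern.match(name)]


files = ["SRR123.fastq", "SRR1234_1.fastq", "SRR1234_2.fastq"]
print(glob_matches(files, "SRR123"))   # also picks up the SRR1234 files
print(exact_matches(files, "SRR123"))  # only the file for SRR123 itself
```

Anchoring the match on the full accession keeps one sample's compression step from touching another sample's files.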

download a list of SRR

Sometimes people do not need the whole dataset of an SRA project. A flag for downloading a custom list of SRR accessions would be great!

error handling when sra accession doesn't exist / doesn't return runs

Hi -
When I run grabseqs with a project identifier that has no links to any runs, for example:

grabseqs sra -l PRJNAXXXXX

grabseqs dies with

ValueError: 'Run' is not in list

Rightly so, because list.index() raises a ValueError when there is no matching item (see e.g. https://docs.python.org/3/tutorial/datastructures.html)

solution:
in line 98 of sra.py the error should be caught with
except ValueError: raise ValueError("Could not find samples for accession: "+pacc+". If this accession number is valid, try re-running.")

Best wishes -
Anna
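
The fix proposed in this issue can be sketched as a small helper that wraps the failing list.index() call. The error message is the one suggested above. The function name find_run_column is hypothetical, and the placement in sra.py is left to the maintainers.

```python
def find_run_column(header, pacc):
    """Return the index of the "Run" column, or raise a descriptive error."""
    try:
        return header.index("Run")
    except ValueError:
        # Replace the bare "'Run' is not in list" with an actionable message.
        raise ValueError(
            "Could not find samples for accession: " + pacc +
            ". If this accession number is valid, try re-running."
        ) from None
```

With this in place, an accession that returns no runs produces a message naming the accession instead of an unexplained traceback.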
