grabseqs

Utility for simplifying bulk downloads of data from next-generation sequencing repositories such as NCBI SRA, MG-RAST, and iMicrobe.

Note: As of 3/20/2020, read downloads for some MG-RAST samples are not working through the MG-RAST web interface or API.

Install

Install grabseqs and all dependencies via conda:

conda install grabseqs -c louiejtaylor -c bioconda -c conda-forge

Or with pip (and install the non-Python dependencies yourself):

pip install grabseqs

Note: If you're using SRA data, run vdb-config -i after installing sra-tools and turn off local file caching, unless you want extra copies of the downloaded sequences taking up space (read more here).

Quick start

Download all samples from a single SRA Project:

grabseqs sra SRP#######

Or any combination of projects (S/ERP), runs (S/ERR), and BioProjects (PRJNA):

grabseqs sra SRR######## ERP####### PRJNA######## ERR########
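When mixing accession types in one call, it can help to sanity-check the identifiers before starting a long download. The following is a minimal sketch, assuming the prefixes named above (SRP/ERP, SRR/ERR, PRJNA) followed by digits. The function name and the patterns are illustrative and are not part of grabseqs; other accession forms are not covered.

```python
import re

# Assumed patterns, derived only from the prefixes named in this README.
# Other accession forms exist and are not covered by this sketch.
ACCESSION_PATTERNS = {
    "project": re.compile(r"^[SE]RP\d+$"),
    "run": re.compile(r"^[SE]RR\d+$"),
    "bioproject": re.compile(r"^PRJNA\d+$"),
}


def classify_accession(acc):
    """Return the accession kind, or None if no assumed pattern matches."""
    for kind, pattern in ACCESSION_PATTERNS.items():
        if pattern.match(acc):
            return kind
    return None
```

A None result only means the identifier does not match these assumed patterns, not that the repository would reject it.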

If you'd like to do a dry run and just get a list of samples that will be downloaded, pass -l:

grabseqs sra -l SRP########

Similar syntax works for MG-RAST:

grabseqs mgrast mgp##### mgm#######

And iMicrobe (prefixing the sample numbers with "s" and project numbers with "p"):

grabseqs imicrobe p4 s3
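If the iMicrobe project and sample numbers are held as plain integers, the prefixing rule above can be sketched in Python as follows. The helper name imicrobe_ids is hypothetical and not part of grabseqs.

```python
def imicrobe_ids(projects=(), samples=()):
    """Prefix project numbers with 'p' and sample numbers with 's'."""
    return [f"p{n}" for n in projects] + [f"s{n}" for n in samples]


# Build the same command line as the example above.
cmd = ["grabseqs", "imicrobe"] + imicrobe_ids(projects=[4], samples=[3])
print(" ".join(cmd))  # grabseqs imicrobe p4 s3
```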

Detailed usage

See the grabseqs FAQ for detailed troubleshooting tips!

Fun options:

grabseqs sra -t 10 -m metadata.csv -o proj/ -r 3 SRP#######

(translation: use 10 threads, save metadata to proj/metadata.csv, download to the dir proj/, retry failed downloads 3x, get all samples from SRP#######)
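For scripted use, the same options can be assembled programmatically. This is a sketch only: build_sra_command is a hypothetical helper, not a grabseqs API, and it uses only the -t, -m, -o, -r, and -l flags documented in this README.

```python
def build_sra_command(ids, threads=None, metadata=None, outdir=None,
                      retries=None, list_only=False):
    """Assemble an argument list for `grabseqs sra` from keyword options."""
    cmd = ["grabseqs", "sra"]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if metadata is not None:
        cmd += ["-m", metadata]   # filename, saved relative to OUTDIR
    if outdir is not None:
        cmd += ["-o", outdir]
    if retries is not None:
        cmd += ["-r", str(retries)]
    if list_only:
        cmd.append("-l")          # dry run: list samples, do not download
    return cmd + list(ids)
```

The resulting list can be handed to subprocess.run(cmd, check=True), which raises an error if grabseqs exits with a non-zero status.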

If you'd like to do a dry run and only get a list of samples that will be downloaded, pass -l:

grabseqs sra -l SRP########

If you'd like to pass your own arguments to fasterq-dump to get data in a slightly different format, you can do so like this:

grabseqs sra SRP####### -r 0 --custom_fqdump_args="--split-spot --progress"
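When calling grabseqs from a script, the custom arguments have to travel as a single string, as in the command above. A minimal sketch, assuming the --custom_fqdump_args=VALUE form shown above; the helper name is hypothetical, and because this README does not say how grabseqs splits the string, the sketch conservatively rejects arguments that contain whitespace.

```python
def custom_fqdump_option(extra_args):
    """Join fasterq-dump arguments into one --custom_fqdump_args=... token."""
    for arg in extra_args:
        # How grabseqs splits this string is an unknown here, so avoid
        # anything that would be ambiguous after a whitespace split.
        if any(ch.isspace() for ch in arg):
            raise ValueError(f"argument contains whitespace: {arg!r}")
    return "--custom_fqdump_args=" + " ".join(extra_args)


option = custom_fqdump_option(["--split-spot", "--progress"])
```

Passing the option as one list element to subprocess.run avoids an extra layer of shell quoting.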

Full usage:

grabseqs sra [-h] [-m METADATA] [-o OUTDIR] [-r RETRIES] [-t THREADS]
             [-f] [-l] [--no_parsing] [--parse_run_ids]
             [--use_fastq_dump]
             id [id ...]

positional arguments:
  id                One or more BioProject, ERR/SRR or ERP/SRP number(s)

optional arguments:
  -h, --help        show this help message and exit
  -m METADATA       filename in which to save SRA metadata (.csv format,
                    relative to OUTDIR)
  -o OUTDIR         directory in which to save output. created if it doesn't
                    exist
  -r RETRIES        number of times to retry download
  -t THREADS        threads to use (for fasterq-dump/pigz)
  -f                force re-download of files
  -l                list (but do not download) samples to be grabbed
  --parse_run_ids   parse SRR/ERR identifiers (do not pass straight to fasterq-
                    dump)
  --custom_fqdump_args CUSTOM_FQD_ARGS
                    "string" containing args to pass to fastq-dump
  --use_fastq_dump  use legacy fastq-dump instead of fasterq-dump (no
                    multithreaded downloading)

Downloads .fastq.gz files to OUTDIR (or the working directory if not specified). If the -m flag is passed, saves metadata to OUTDIR with filename METADATA in csv format.
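After a large batch finishes, an independent check that every .fastq.gz file in OUTDIR decompresses cleanly can be reassuring. The sketch below uses only the Python standard library; check_fastq_gz is a hypothetical helper and does not reflect how grabseqs validates files internally.

```python
import gzip
import zlib
from pathlib import Path


def check_fastq_gz(outdir="."):
    """Map each *.fastq.gz file name in outdir to True if it reads cleanly."""
    results = {}
    for path in sorted(Path(outdir).glob("*.fastq.gz")):
        try:
            with gzip.open(path, "rb") as handle:
                # Read in 1 MiB chunks so large files are not held in memory.
                while handle.read(1 << 20):
                    pass
            results[path.name] = True
        except (OSError, EOFError, zlib.error):
            # Not gzip data, truncated stream, or corrupted contents.
            results[path.name] = False
    return results
```

Files reported as False are candidates for re-downloading with the -f flag.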

Similar options are available for downloading from MG-RAST:

grabseqs mgrast [-h] [-m METADATA] [-o OUTDIR] [-r RETRIES]
                [-t THREADS] [-f] [-l]
                rastid [rastid ...]

And iMicrobe:

grabseqs imicrobe [-h] [-m METADATA] [-o OUTDIR] [-r RETRIES]
                  [-t THREADS] [-f] [-l]
                  imicrobeid [imicrobeid ...]

Troubleshooting

See the grabseqs FAQ for detailed troubleshooting tips. If the FAQs don't fix your problem, feel free to open an issue!

Dependencies

Python 3 (external packages required: requests, requests-html, pandas, fake-useragent)
sra-tools>2.9
pigz
wget

If you use conda (on Linux), these will be installed for you!
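For pip installs, where the non-Python dependencies are installed by hand, a quick check that the tools are on the PATH might look like the sketch below. The executable names are assumptions (fasterq-dump is taken as the sra-tools program to look for), and missing_tools is a hypothetical helper.

```python
import shutil

# Assumed executable names for the non-Python dependencies listed above.
REQUIRED_TOOLS = ("fasterq-dump", "pigz", "wget")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]
```

An empty list means every tool was found; it does not confirm that the sra-tools version is newer than 2.9.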

Grabseqs runs on Mac or Linux. We've tested on these specific OSes:

Linux (conda or pip):

CentOS 6, 7, and 8
Debian 9 and 10
Ubuntu 16.04, 18.04, and 19.10
Red Hat Enterprise 6, 7, and 8
SUSE Enterprise 12 and 15

Mac (pip):

macOS 10.14

Grabseqs has been tested and works with the following versions of the Python dependencies (though these are neither minimal nor pinned version numbers):

requests 2.22.0
requests-html 0.10.0
pandas 0.25.3
fake-useragent 0.1.11

Citation

If you use grabseqs in your work, please cite:

Louis J Taylor, Arwa Abbas, Frederic D Bushman. "grabseqs: Simple downloading of reads and metadata from multiple next-generation sequencing data repositories." Bioinformatics, (2020), btaa167, https://doi.org/10.1093/bioinformatics/btaa167

Please also cite the researchers who generated the data (and the repository, if appropriate)!

Changelog

Dev version (not yet released)

Added a walk-through for adding a new repo using template.py
Better handling for invalid SRA accession numbers

0.7.0 (2020-01-29)

Allow users to pass custom args to fast(er)q-dump
Minor re-writes of download handling code for easier readability

0.6.1 (2019-12-20)

Validate compressed files (fix #8 and #34)

0.6.0 (2019-12-12)

Gracefully handle incomplete or missing dependencies
Major rewrite of test suite

0.5.2 (2019-12-05)

Improvements to work with multiple versions of Python 3

0.5.1 (2019-11-23)

Hotfix handling outdated versions of sra-tools

0.5.0 (2019-04-11)

Metadata available for all sources in .csv format

History

This project grew out of, and incorporates code from, hisss; many thanks to ArwaAbbas for helping make this work!

fasterq-dump error:

Thanks for making this tool, it's a real time saver!

I'm attempting to download a list of SRS accessions. It was working fine at the start, but after a few hours it has been consistently erroring:

downloading SRR2192724 using fasterq-dump
2019-04-09T00:56:16 fasterq-dump.2.9.1 err: storage exhausted while creating directory within file system module - error with https open 'https://sra-download.ncbi.nlm.nih.gov/traces/sra33/SRR/002141/SRR2192724'
2019-04-09T00:56:16 fasterq-dump.2.9.1 err: **invalid accession 'SRR2192724'**
pigz: skipping: SRS475922/SRR2192724*fastq does not exist
SRA download for acc SRR2192724 failed, retrying 0 more times.
Traceback (most recent call last):
  File "/localscratch/EisenRa/miniconda2/bin/grabseqs", line 11, in <module>
    sys.exit(main())
  File "/localscratch/EisenRa/miniconda2/lib/python3.6/site-packages/grabseqslib/__init__.py", line 59, in main
    run_fasterq_dump(acc, args.retries, args.threads, args.outdir, args.force, args.fastqdump)
  File "/localscratch/EisenRa/miniconda2/lib/python3.6/site-packages/grabseqslib/sra.py", line 114, in run_fasterq_dump
    raise Exception("download for "+acc+" failed. fast(er)q-dump returned "+str(retcode)+", pigz returned "+str(rgzip)+".")
Exception: download for SRR2192724 failed. fast(er)q-dump returned 0, pigz returned 0.

It claims the accession is invalid, but the SRA file link is downloadable with wget. Is this some kind of cache error? I've got enough space on the disk.

Commands run:
while read SRS; do grabseqs sra -t 50 -m -o $SRS -r 3 $SRS; done < SRS.txt
Where SRS.txt = a list of SRS accessions, one per line.

Best wishes,
Raphael
