arq5x / ggd

Python 90.93% Shell 9.07%

ggd's Introduction

GGD: Get Genomics Data

THIS PROJECT HAS MOVED

to: https://github.com/gogetdata/ggd-recipes

ggd's Issues

set exit codes on errors.

There are a lot of messages written to stderr, but ggd should also return a non-zero exit code on failure.
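A minimal sketch of what this could look like in a Python entry point. The names `RecipeError`, `install`, and `main` are hypothetical and are not taken from the ggd code base; the only point is that the error path both writes to stderr and returns a non-zero status.

```python
import sys


class RecipeError(Exception):
    """Hypothetical exception raised when a recipe step fails."""


def install(recipe_name):
    # Stand-in for the real install logic (hypothetical).
    if not recipe_name:
        raise RecipeError("no recipe name given")


def main(argv):
    """Return 0 on success and 1 on failure, so callers can exit with it."""
    try:
        install(argv[0] if argv else "")
    except RecipeError as err:
        # Keep the stderr message, but also signal failure to the shell.
        sys.stderr.write("ggd: error: %s\n" % err)
        return 1
    return 0
```

A script would then end with `sys.exit(main(sys.argv[1:]))`, which lets shell pipelines and `set -e` scripts detect a failed install.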

How about GENCODE retrieval?

Data and links to FTP sites for human and mouse.

Use checksum to ensure recipe was executed successfully.
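One way to do this, sketched with the standard library only. The function names are hypothetical; the choice of SHA-1 follows the sha1 fields used in the recipes elsewhere in these issues.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-1 hex digest of a file without loading it all into memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def recipe_succeeded(path, expected_sha1):
    """True if the file produced by the recipe matches the recorded checksum."""
    return sha1_of_file(path) == expected_sha1.strip().lower()
```

Reading in chunks matters here because reference genomes and similar outputs are often several gigabytes.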

Update config file with installed datasets to detect updates, etc.

make and get sections

make: should have the entire provenance of the recipe.

An optional get: section will have just a URL pointing to the result of running the make: commands, so data hosters can choose to host a made dataset but still share the provenance.

The file from get: should have the same sha1 as the result of make:.

The client will default to get: if it is available and fall back to make: otherwise, unless a flag is specified to force make:.
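The selection rule described above could be sketched as follows. The recipe is assumed to be an already-parsed dict with `make` and optional `get` keys, and the `force_make` parameter stands in for the unnamed flag; both are illustrative assumptions rather than ggd's actual interface.

```python
def choose_section(recipe, force_make=False):
    """Pick which recipe section the client should run.

    `recipe` is assumed to be a dict parsed from the recipe YAML.
    """
    if "make" not in recipe:
        # make: carries the provenance, so it is treated as mandatory here.
        raise ValueError("recipe has no make: section")
    if "get" in recipe and not force_make:
        return "get"    # prefer the pre-made, hosted dataset
    return "make"       # no get: section, or the user asked to rebuild
```

Whichever section runs, the resulting file would still be checked against the same sha1.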

dependencies on other recipes and on software

We will have, e.g., a recipe for the reference sequence.
Then, in dependent recipes, we will use that reference in vt normalize -r $REFERENCE to normalize the variants. We need to specify both the data and the software dependencies.
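A rough sketch of how such dependencies might be resolved, assuming each recipe is a dict with a hypothetical `data_dependencies` list and that software dependencies are plain executable names looked up on `PATH`. Neither field name comes from ggd itself.

```python
import shutil


def install_order(recipes, target):
    """Return recipe names in an order where dependencies come first."""
    order, visiting = [], set()

    def visit(name):
        if name in order:
            return
        if name in visiting:
            raise ValueError("circular dependency involving %s" % name)
        visiting.add(name)
        for dep in recipes[name].get("data_dependencies", []):
            visit(dep)
        visiting.discard(name)
        order.append(name)

    visit(target)
    return order


def missing_software(executables):
    """Return the executables (e.g. 'vt') that cannot be found on PATH."""
    return [exe for exe in executables if shutil.which(exe) is None]
```

For example, with a `variants` recipe that lists `reference` under `data_dependencies`, `install_order` would yield `["reference", "variants"]`, so the reference is in place before `vt normalize` runs.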

Comments on example command

python ggd.py

Install a ggd shell script and use ggd install … instead?

ucsc.human.b38.cpg

The underlying URL of the formula uses slashes to separate components. Why use dots here? Consider instead ucsc/human/b38/cpg.

human

Common names are ambiguous. Use binomial names instead?

b38

Use the full name GRCh38 instead?

source.species.genomebuild.name

Perhaps this format could optionally include the GitHub username and repo, to support installing formulae from locations other than arq5x/ggd-recipes.

user/repo/source/species/genomebuild/name
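To make the proposal concrete, here is a sketch of a parser for the slash-separated form, treating `user/repo` as an optional prefix that defaults to arq5x/ggd-recipes. The `RecipeId` type and `parse_recipe_id` function are hypothetical and only illustrate the suggested naming scheme.

```python
from collections import namedtuple

RecipeId = namedtuple("RecipeId", "user repo source species genomebuild name")

# Default location assumed when no user/repo prefix is given.
DEFAULT_USER, DEFAULT_REPO = "arq5x", "ggd-recipes"


def parse_recipe_id(text):
    """Parse 'source/species/genomebuild/name' with an optional 'user/repo/' prefix."""
    parts = text.strip("/").split("/")
    if len(parts) == 4:
        parts = [DEFAULT_USER, DEFAULT_REPO] + parts
    if len(parts) != 6 or not all(parts):
        raise ValueError("expected 4 or 6 slash-separated components: %r" % text)
    return RecipeId(*parts)
```

With this scheme, `ucsc/human/b38/cpg` and `arq5x/ggd-recipes/ucsc/human/b38/cpg` would name the same recipe.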

allow simpler installation of a single local yaml

Currently, a single local recipe has to be installed with:

ggd install --cookbook file://$(pwd)/ seq

It should be:

ggd install --yaml seq.yaml

or something similar.
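A sketch of how the command line could accept both forms with `argparse`. The `--cookbook` option is taken from the command shown above and `--yaml` is the option proposed in this issue; the parser layout itself is an assumption, not ggd's actual implementation.

```python
import argparse


def build_parser():
    """Build a parser where `install` takes either --cookbook or --yaml."""
    parser = argparse.ArgumentParser(prog="ggd")
    commands = parser.add_subparsers(dest="command")
    install = commands.add_parser("install")
    # A recipe comes either from a cookbook location or from one local file.
    source = install.add_mutually_exclusive_group()
    source.add_argument("--cookbook", help="base URL of a recipe collection")
    source.add_argument("--yaml", help="path to a single local recipe file")
    # The recipe name is only needed when looking it up in a cookbook.
    install.add_argument("recipe", nargs="?")
    return parser
```

With this layout, `ggd install --yaml seq.yaml` parses without a recipe name, while the existing `--cookbook ... seq` form keeps working.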

SRA downloader?

I have an SRA downloader script that downloads all the datasets associated with a GEO project. It supports partial/resumable downloads (like wget -c) and is a pure Python implementation.

What is your view on including such recipes? I probably should have opened this issue in ggd-recipes, but I think the two are related.

https://github.com/saketkc/methylomer/blob/master/geo_downloader.py

versioning

See @chapmanb's use of distutils here: https://github.com/chapmanb/cloudbiolinux/blob/master/cloudbio/biodata/ggd.py

Make species the top level in hierarchy

Hey, this is a cool project. In the figure you show species as the top-level directory, but the recipes aren't following that scheme currently? I can see advantages for both approaches, although I think organizing by species first is worth considering because:

there are several species-specific repositories (e.g. WormBase, FlyBase, etc.). This would keep those folders from cluttering up the top-level directory.
it also helps to make folks aware of what resources are available for their species of interest.

Sorry if this was already in the works!

Also, what about adding a description line to the recipes that could be queried and displayed from the CLI?

Use wildcards to install collections of recipes

E.g. all chipseq datasets for a given cell line.
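One possible approach is shell-style pattern matching over the list of known recipe names. The sketch below uses `fnmatch` from the standard library; the recipe names in the usage note are made up for illustration.

```python
import fnmatch


def match_recipes(pattern, available):
    """Return the recipe names matching a shell-style wildcard pattern, sorted."""
    # fnmatchcase keeps matching case-sensitive on every platform.
    return sorted(name for name in available if fnmatch.fnmatchcase(name, pattern))
```

For instance, `match_recipes("encode.human.b38.chipseq.k562.*", names)` would select every recipe whose name starts with that hypothetical prefix, and each match could then be installed in turn.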

Config file to specify where to install datasets.

template variables

All recipe commands will actually be templates. We need this to handle data dependencies, among other things.
The template variables that will be filled in by GGD are:

${DATE}
${version} # pulled from the attributes section in the yaml
${name} # pulled from the attributes section in the yaml
${GGD_DATA} # path to the data directory (usually ~/ggd_data/); this allows recipes to specify paths to ggd resources that have already been installed
${sha1} # the sha1 for the current entry under commands
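The `${...}` syntax listed above happens to match Python's `string.Template`, so the substitution could be sketched as below. The function name, the ISO date format for `${DATE}`, and the argument layout are assumptions made for illustration.

```python
import datetime
import os
from string import Template


def render_command(command, attributes, sha1, data_dir="~/ggd_data/"):
    """Fill the GGD template variables in one recipe command."""
    values = {
        "DATE": datetime.date.today().isoformat(),  # date format is an assumption
        "version": str(attributes["version"]),      # from the attributes section
        "name": attributes["name"],                 # from the attributes section
        "GGD_DATA": os.path.expanduser(data_dir),
        "sha1": sha1,
    }
    # safe_substitute leaves unknown shell variables such as $REFERENCE untouched.
    return Template(command).safe_substitute(values)
```

Using `safe_substitute` rather than `substitute` is a deliberate choice here: recipe commands are shell, so they may contain `$VAR` or `$(...)` constructs that GGD should pass through unchanged.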

recipe responsible for writing files

So a recipe should look like:

attributes:
    name: hg38_reference
    version: 1
    sha1:
        - efaaea68910ee444b2756062b2ae2b990d5cdb71
        - 8c6e9635f50256e4ecd84bee2bfb1cb27cc8bbd1

recipe:
    full:
        recipe_type: bash
        recipe_cmds:
            - wget -O hg38.decoy.hla.fa ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
            - samtools faidx hg38.decoy.hla.fa
        recipe_outfiles:
            - hg38.decoy.hla.fa
            - hg38.decoy.hla.fa.fai

ggd then knows from the recipe_outfiles section that the output files are hg38.decoy.hla.fa and hg38.decoy.hla.fa.fai.
This replaces the current behavior, where ggd captures the output of each command and assumes that it is the file.

This makes recipes more obvious to write, since the user can implement a full recipe in bash and then translate it into the yaml section. I believe this is also what @chapmanb has implemented.
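Under this proposal, running a recipe section could look roughly like the sketch below: execute each entry of recipe_cmds in a working directory, then confirm that every file listed under recipe_outfiles exists. The `run_recipe` function is hypothetical, and the section is assumed to be an already-parsed dict.

```python
import os
import subprocess


def run_recipe(section, workdir):
    """Run a bash recipe section and return the paths of its declared outputs."""
    for cmd in section["recipe_cmds"]:
        # check_call raises CalledProcessError if a command exits non-zero.
        subprocess.check_call(cmd, shell=True, cwd=workdir)
    outfiles = [os.path.join(workdir, name) for name in section["recipe_outfiles"]]
    missing = [path for path in outfiles if not os.path.exists(path)]
    if missing:
        raise RuntimeError("recipe did not write: %s" % ", ".join(missing))
    return outfiles
```

Because the recipe itself names its outputs, ggd no longer has to guess them from captured command output, and the returned paths can be passed straight to a checksum step.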