gogetdata / ggd-cli

The command-line interface to GGD

License: MIT License

Python 99.80% Shell 0.20%

ggd-cli's People

Contributors

Stargazers

Watchers

Forkers

mikecormier koteko trellixvulnteam

ggd-cli's Issues

Update README to include new commands

Need to explain:
search (all args and why there is a git repo, including keyword search)
list-files (all args and how environment variables relate)
show-env

CL interface

The executable will be ggd, written in Python. The available sub-commands are to be:

ggd from_bash | make a new ggd recipe from a bash script

--species | species the recipe is for. Must be present in ggd-recipes/genomes
--genome_build | version of the genome to use. Must be present in ggd-recipes/genome
--dependency | any data or bioconda dependencies needed to build the recipe
--extra-file | any files that the recipe creates that are not a *.gz and *.gz.tbi pair. May be specified more than once
--summary | a comment describing the recipe
--keyword | a keyword to associate with the recipe. May be specified more than once
--author | a recipe author
script | bash script that contains the commands to build the recipe

ggd build | build and test a ggd recipe locally
ggd install | wrapper for conda install
ggd search | search by name and/or keyword
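
The proposed interface could be prototyped with argparse. This is only a sketch of the sub-commands and options listed above, not the actual ggd implementation: the function name build_parser is hypothetical, and treating --dependency, --extra-file and --keyword as repeatable options that collect into lists is an assumption (the issue only says that for --extra-file and --keyword).

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the sub-commands proposed in this issue."""
    parser = argparse.ArgumentParser(prog="ggd")
    sub = parser.add_subparsers(dest="command")

    fb = sub.add_parser("from_bash", help="make a new ggd recipe from a bash script")
    fb.add_argument("--species", help="species the recipe is for")
    fb.add_argument("--genome_build", help="version of the genome to use")
    # Assumption: repeatable options are collected into lists.
    fb.add_argument("--dependency", action="append", default=[])
    fb.add_argument("--extra-file", action="append", default=[])
    fb.add_argument("--summary", help="a comment describing the recipe")
    fb.add_argument("--keyword", action="append", default=[])
    fb.add_argument("--author", help="a recipe author")
    fb.add_argument("script", help="bash script that contains the build commands")

    # The remaining sub-commands are registered without options in this sketch.
    for name, help_text in (
        ("build", "build and test a ggd recipe locally"),
        ("install", "wrapper for conda install"),
        ("search", "search by name and/or keyword"),
    ):
        sub.add_parser(name, help=help_text)
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Note that argparse stores --extra-file under the attribute name extra_file.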

env-var should be replaced if new version of recipe is being installed.

The env-var for a recipe should be replaced if a new version/build is being installed. ggd install should check if an env-var exists and if so replace it with the new one.

storing bgzip compressed genomes?

hi!
this project looks super interesting!
One issue that would personally concern me if I were to use ggd is that the existing recipes store genomes in an uncompressed form. My concern is that, with potentially many genomes that my lab would have to deal with, the library will take a lot of space; moreover, given that we use network storage, storing data uncompressed will reduce the I/O performance.

Have you considered allowing optional compression of genomes with bgzip? bgzip plays well with faidx/pyfaidx and does not have any downsides, at least as far as we're concerned.

Thank you!
Anton.

show-env testing

Just realized that show-env is missing pattern recognition support. I'm throwing a quick fix together, but it needs to be pushed and tested.

finish search

Like list-files, search uses glob.glob to list files locally available. This introduces a slight internal inconsistency, as results are verified by calling conda search, which treats patterns differently. Specifically, glob.glob supports shell-style wildcards (most usefully the * character to match anything), while conda search requires valid regex patterns (matching anywhere within the recipe name). Because empty strings fail in glob, the script currently converts empty strings to * characters for glob and then removes those * characters for conda search. This is a fairly hacky solution, however, and should be improved on.

Furthermore, if the user inputs a real, valid regex pattern, it will likely fail, because glob will treat the special characters it doesn't know as literals. This might be okay because they can use shell wildcards, but I think at the very least it needs to be well-documented and bulletproofed some more.
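
One possible way to reduce the mismatch is to accept only shell-style wildcards from the user and derive a regex from them with the standard library's fnmatch.translate. The sketch below illustrates that idea; the helper name wildcard_to_regex is hypothetical, and whether conda search accepts the translated expression has not been checked here, so the regex is only used for Python-side matching.

```python
import fnmatch
import re


def wildcard_to_regex(pattern):
    """Compile a shell-style wildcard pattern into a regular expression.

    An empty pattern is treated as "*" (match everything), mirroring the
    empty-string workaround described above.
    """
    if not pattern:
        pattern = "*"
    # fnmatch.translate escapes regex metacharacters such as "." and
    # anchors the expression to the whole string.
    return re.compile(fnmatch.translate(pattern))


if __name__ == "__main__":
    names = ["hg19-cpg-islands", "hg38-cpg-islands", "grch37-esp-variants"]
    rx = wildcard_to_regex("hg19*")
    print([n for n in names if rx.match(n)])
```

With this approach a regex metacharacter typed by the user is matched literally in both places, which at least makes the behavior consistent and easier to document.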

Can "ggd install --file" be integrated into "conda env create -f environment.yml"?

I want to use ggd in a nextflow pipeline. At present nextflow has no support for ggd, but does have support for conda. If "ggd install --file" is integrated into "conda env create -f environment.yml", it will be very convenient to create a reproducible pipeline with data packages managed just as software packages.

list-files should be case-insensitive

list-files depends on Python's glob.glob to list the files present. This simplifies the code considerably, but makes a case-insensitive search more difficult.
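
One workaround that keeps glob.glob is to expand each letter of the pattern into a two-character class, so that "hg19*" becomes "[hH][gG]19*". The following is a minimal sketch of that trick, not the list-files implementation; the helper name is hypothetical, and it assumes the incoming pattern contains no bracket expressions of its own (letters inside an existing [...] would be mangled).

```python
import fnmatch


def case_insensitive_pattern(pattern):
    """Rewrite a glob pattern so that ASCII letters match in either case."""
    out = []
    for ch in pattern:
        if ch.isascii() and ch.isalpha():
            out.append(f"[{ch.lower()}{ch.upper()}]")
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    pat = case_insensitive_pattern("hg19*")
    print(pat)
    # fnmatchcase applies the same wildcard rules as glob, case-sensitively.
    print(fnmatch.fnmatchcase("HG19-cpg-islands", pat))
```

The rewritten pattern can be passed to glob.glob unchanged, since character classes are part of the wildcard syntax glob already supports.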

Need to support changed install structure (see Issue #3) within list-files

Version coordination

Versions are supported now in recipe installation. The version number (which allows a string but not whitespace) is parsed out of the meta.yaml file by check-recipe and included in the installation directory structure. This needs to be thought out more fully, however. The environment variables created by prelink.sh don't all include the version number currently and there may be additional issues caused by inclusion of the version number that I haven't thought of yet.
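
To make the "string but no whitespace" rule concrete, here is a rough sketch of how a version could be pulled out of meta.yaml text with a regular expression. It is not the check-recipe code: the function name and the regex are assumptions, and a real meta.yaml may use templating that this does not handle.

```python
import re

# Matches a line such as `  version: "1"` or `version: 2.0-beta`.
# The captured value may not contain whitespace or quotes.
VERSION_RE = re.compile(
    r"""^\s*version:\s*["']?([^\s"']+)["']?\s*$""", re.MULTILINE
)


def parse_version(meta_text):
    """Return the version string from meta.yaml text, or raise ValueError."""
    match = VERSION_RE.search(meta_text)
    if match is None:
        raise ValueError("no whitespace-free version field found")
    return match.group(1)


if __name__ == "__main__":
    meta = 'package:\n  name: hg19-cpg-islands\n  version: "1"\n'
    print(parse_version(meta))
```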

check-sort-order unable to parse bcf files

The Go language does not have a bcf parser. Therefore, check-sort-order cannot check the sort order of .bcf files.

Brent started a bcf parser in go, but it is not finished: https://github.com/brentp/bcf

For now:

skip check of sort order on bcf files

Future:

Add bcf parsing to check-sort-order

Should installed recipe versions be removed when a different version is installed?

Adding the option of maintaining different versions of recipes means there could be multiple versions of a dataset installed at once. We could use a post-link script (http://conda.pydata.org/docs/building/build-scripts.html) to handle this by removing old versions. We could also maintain those other versions (updating the environment variable to the most recently installed version) and allow them to be accessed via the ggd list-files command.

env_vars.sh cleanup

When installing a recipe a new environment variable is appended to conda_root/etc/conda/activate.d/env_vars.sh and conda_root/etc/conda/deactivate.d/env_vars.sh. Since this is only appended, old copies of vars will be maintained in the file. In the interest of neatness, we should remove old var declarations from these files on adding a new one.
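
A replace-instead-of-append step might look like the sketch below. It assumes the activate script holds one declaration per line in the form `export NAME=value`; that format, the function name upsert_export and the variable names in the example are all assumptions, and the deactivate script would need an analogous treatment.

```python
import re


def upsert_export(script_text, name, value):
    """Return script_text containing exactly one `export name=value` line.

    Any earlier declarations of the same variable are dropped and the new
    declaration is appended at the end.
    """
    pattern = re.compile(r"^export\s+" + re.escape(name) + r"=")
    kept = [line for line in script_text.splitlines() if not pattern.match(line)]
    kept.append(f"export {name}={value}")
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    old = "export ggd_hg19_cpg=/old/path\nexport other_var=/x\n"
    print(upsert_export(old, "ggd_hg19_cpg", "/new/path"), end="")
```

Because the match requires the full variable name followed by "=", a variable whose name merely starts with the same prefix is left untouched.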

Add data version to install dir structure

Need to include the version of the data in the directory structure.

i.e. /scratch/ucgd/lustre/u1072557/a2/share/ggd/Homo_sapiens/hg19/hg19-cpg-islands/1/{files}
instead of
/scratch/ucgd/lustre/u1072557/a2/share/ggd/Homo_sapiens/hg19/hg19-cpg-islands/{files}
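
As an illustration, the versioned layout could be assembled by a small helper like this one. The function name and parameter names are hypothetical; only the share/ggd/species/build/recipe/version ordering is taken from the example paths above.

```python
import posixpath


def install_dir(conda_root, species, genome_build, recipe, version):
    """Build the versioned install directory for a recipe."""
    return posixpath.join(
        conda_root, "share", "ggd", species, genome_build, recipe, str(version)
    )


if __name__ == "__main__":
    print(install_dir("/conda_root", "Homo_sapiens", "hg19", "hg19-cpg-islands", 1))
```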

install does not include required packages

I see:

ggd --version
Traceback (most recent call last):
  File "/home/brentp/miniconda3/bin/ggd", line 7, in <module>
    from ggd.__main__ import main
  File "/home/brentp/miniconda3/lib/python3.7/site-packages/ggd/__main__.py", line 6, in <module>
    from . list_files import add_list_files
  File "/home/brentp/miniconda3/lib/python3.7/site-packages/ggd/list_files.py", line 14, in <module>
    from .search import load_json_from_url, search_packages
  File "/home/brentp/miniconda3/lib/python3.7/site-packages/ggd/search.py", line 11, in <module>
    from fuzzywuzzy import fuzz
ModuleNotFoundError: No module named 'fuzzywuzzy'

after doing the install as suggested in the readme.
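
A quick way to confirm which imports are unavailable in an environment is to probe for them with importlib, as in this sketch. The helper name is hypothetical, and the list of modules to check is up to the caller; only fuzzywuzzy is known from the traceback above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print(missing_modules(["fuzzywuzzy"]))
```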

doc issues

a detailed figure should not be in a "quick-start" section. In quickstart, show a few command-line examples. Save the image for a more detailed section.
the section indicating how to install should be titled "Installation" and made a super-heading closer to the top of the document.
readme for ggd-cli and ggd-recipes should be updated to point to or match what's in gogetdata.github.io
the contents appears under "Contributing"; it should be its own header and should be only 1 level deep (so we see each command, not the example and "Using ..." entries for each command). Or, the contents section may be removed.
not clear how the icons and the columns interact on this page: https://gogetdata.github.io/recipes.html (e.g. why do we need a linux column and a linux icon?). A recipe is either OSX, Linux or noarch, so why not have a single "arch" column?