gogetdata / ggd-cli
The command-line interface to GGD
License: MIT License
Need to explain:
search (all args and why there is a git repo, including keyword search)
list-files (all args and how environment variables relate)
show-env
The executable will be ggd, written in Python. The available sub-commands are to be:
--species | species recipe is for, must be present in ggd-recipes/genomes
--genome_build | version of genome to use, must be present in ggd-recipes/genomes
--dependency | any data or bioconda dependencies needed to build the recipe
--extra-file | any files that the recipe creates that are not a *.gz and *.gz.tbi pair. May be used more than once
--summary | a comment describing the recipe
--keyword | a keyword to associate with the recipe. May be specified more than once.
--author | a recipe author
script | bash script that contains the commands to build
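The argument list above could be wired up with argparse roughly as follows. This is only a sketch: the sub-command name "make-recipe", the function name build_parser, and the choice to make --dependency repeatable are assumptions for illustration, not something the list specifies.

```python
import argparse


def build_parser():
    """Build a parser for the recipe arguments listed above (illustrative only)."""
    # "ggd make-recipe" is a hypothetical program/sub-command name.
    parser = argparse.ArgumentParser(prog="ggd make-recipe")
    parser.add_argument("--species",
                        help="species the recipe is for")
    parser.add_argument("--genome_build",
                        help="version of the genome to use")
    # Assumption: dependencies can be given more than once.
    parser.add_argument("--dependency", action="append", default=[],
                        help="data or bioconda dependency needed to build the recipe")
    # The list says --extra-file and --keyword may be repeated.
    parser.add_argument("--extra-file", dest="extra_files", action="append", default=[],
                        help="file created by the recipe that is not a *.gz/*.gz.tbi pair")
    parser.add_argument("--summary",
                        help="a comment describing the recipe")
    parser.add_argument("--keyword", action="append", default=[],
                        help="keyword to associate with the recipe")
    parser.add_argument("--author",
                        help="a recipe author")
    parser.add_argument("script",
                        help="bash script that contains the commands to build")
    return parser
```

Repeated flags accumulate into lists, so "--keyword cpg --keyword islands" yields both keywords on the parsed namespace.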
The env-var for a recipe should be replaced if a new version/build is being installed. ggd install
should check if an env-var exists and if so replace it with the new one.
hi!
this project looks super interesting!
One issue that would personally concern me if I were to use ggd is that the existing recipes store genomes in an uncompressed form. My concern is that, with potentially many genomes that my lab would have to deal with, the library will take a lot of space; moreover, given that we use network storage, storing data uncompressed will reduce the I/O performance.
Have you considered allowing optional compression of genomes with bgzip? bgzip plays well with faidx/pyfaidx and does not have any downsides, at least as far as we're concerned.
Thank you!
Anton.
Just realized that show-env is missing pattern recognition support. I'm throwing a quick fix together, but it needs to be pushed and tested.
Like list-files, search uses glob.glob to list files locally available. This introduces a slight internal inconsistency, as results are verified by calling conda search, which treats patterns differently. Specifically, glob.glob supports shell-style wildcards (most usefully the * character to match anything), while conda search requires valid regex patterns (matching anywhere within the recipe name). Currently the script converts empty strings to * characters for glob and removes those * characters for conda search, as empty strings fail in glob. This feels like a hack, however, and should be improved on.
Furthermore, if the user inputs a real, valid regex pattern, it will likely fail, because glob will treat the special characters it doesn't know as literals. This might be okay because they can use shell wildcards, but I think at the very least it needs to be well-documented and bulletproofed some more.
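To make the mismatch concrete, here is a small sketch that models the two behaviours described above with the standard library only. It does not call conda; regex_match simply mimics "regex matching anywhere in the name" as this issue describes it, and the recipe names and helper names are made up for illustration.

```python
import fnmatch
import re

# Hypothetical recipe names, for illustration only.
names = ["hg19-cpg-islands", "hg19-repeatmasker", "grch37-cpg-islands"]


def glob_match(pattern, candidates):
    """Shell-style wildcard matching against the whole name, as glob does."""
    return [n for n in candidates if fnmatch.fnmatchcase(n, pattern)]


def regex_match(pattern, candidates):
    """Regex matching anywhere in the name (a stand-in for the conda search behaviour)."""
    return [n for n in candidates if re.search(pattern, n)]


def wildcard_to_regex(pattern):
    """Translate a shell wildcard into an equivalent anchored regex string."""
    return fnmatch.translate(pattern)
```

With these helpers, the bare word "cpg" matches nothing under glob_match but two names under regex_match, and the regex "hg19.*" matches nothing under glob_match because the dot is taken literally. One possible way to keep both sides consistent is to accept only shell wildcards from the user and derive the regex with wildcard_to_regex, rather than stripping * characters by hand.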
I want to use ggd in a nextflow pipeline. At present nextflow has no support for ggd, but does have support for conda. If "ggd install --file" is integrated into "conda env create -f environment.yml", it will be very convenient to create a reproducible pipeline with data packages managed just as software packages.
list-files depends on python's glob.glob to list the files present. This simplifies the code considerably, but makes a case-insensitive search more difficult.
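One possible workaround, sketched below, is to list the directory and filter the names with a case-insensitive regex derived from the shell pattern. The function name iglob_names is hypothetical, and this is not how list-files is currently implemented.

```python
import fnmatch
import os
import re


def iglob_names(directory, pattern):
    """Return names in `directory` matching a shell-style pattern, ignoring case."""
    # fnmatch.translate turns the wildcard into a regex; IGNORECASE relaxes the match.
    regex = re.compile(fnmatch.translate(pattern), re.IGNORECASE)
    return sorted(name for name in os.listdir(directory) if regex.match(name))
```

This keeps the familiar wildcard syntax for users while dropping the dependence on the filesystem's own case rules, at the cost of handling only a single directory level per call.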
Versions are supported now in recipe installation. The version number (which allows a string but not whitespace) is parsed out of the meta.yaml file by check-recipe and included in the installation directory structure. This needs to be thought out more fully, however. The environment variables created by prelink.sh don't all include the version number currently and there may be additional issues caused by inclusion of the version number that I haven't thought of yet.
The Go language does not have a bcf parser. Therefore, check-sort-order cannot check the sort order of .bcf files.
Brent started a bcf parser in go, but it is not finished: https://github.com/brentp/bcf
Adding the option of maintaining different versions of recipes means there could be multiple versions of a dataset installed at once. We could use a post-link script (http://conda.pydata.org/docs/building/build-scripts.html) to handle this by removing old versions. We could also maintain those other versions (updating the environment variable to the most recently installed version) and allow them to be accessed via the ggd list-files command.
When installing a recipe a new environment variable is appended to conda_root/etc/conda/activate.d/env_vars.sh and conda_root/etc/conda/deactivate.d/env_vars.sh. Since this is only appended, old copies of vars will be maintained in the file. In the interest of neatness, we should remove old var declarations from these files on adding a new one.
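A minimal sketch of the cleanup for the activate script, assuming each variable is declared on its own line as "export NAME=value" (the exact line format in env_vars.sh is an assumption here, as is the helper name set_env_var):

```python
import re


def set_env_var(lines, name, value):
    """Drop earlier `export NAME=...` lines for `name`, then append the new declaration."""
    # Require "=" right after the name so variables sharing a prefix are left alone.
    pattern = re.compile(r"^\s*export\s+" + re.escape(name) + r"=")
    kept = [line for line in lines if not pattern.match(line)]
    kept.append(f"export {name}={value}")
    return kept
```

The same filter-then-append idea would apply to the deactivate script, with the pattern adjusted to whatever form the matching cleanup line takes there.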
Need to include the version of the data in the directory structure.
i.e. /scratch/ucgd/lustre/u1072557/a2/share/ggd/Homo_sapiens/hg19/hg19-cpg-islands/1/{files}
instead of
/scratch/ucgd/lustre/u1072557/a2/share/ggd/Homo_sapiens/hg19/hg19-cpg-islands/{files}
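The versioned layout could be built with a helper along these lines. The function name recipe_dir and its parameters are hypothetical; the whitespace check reflects the rule mentioned elsewhere in these issues that a version may be any string without whitespace.

```python
from pathlib import PurePosixPath


def recipe_dir(ggd_root, species, build, recipe, version):
    """Return <ggd_root>/<species>/<build>/<recipe>/<version> as a POSIX path."""
    version = str(version)
    # Versions may be arbitrary strings but must be non-empty and contain no whitespace.
    if not version or any(ch.isspace() for ch in version):
        raise ValueError(f"invalid version: {version!r}")
    return PurePosixPath(ggd_root) / species / build / recipe / version
```

For the example above, calling recipe_dir with the share/ggd root, "Homo_sapiens", "hg19", "hg19-cpg-islands" and version 1 gives the first of the two paths shown.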
I see:
ggd --version
Traceback (most recent call last):
  File "/home/brentp/miniconda3/bin/ggd", line 7, in <module>
    from ggd.__main__ import main
  File "/home/brentp/miniconda3/lib/python3.7/site-packages/ggd/__main__.py", line 6, in <module>
    from . list_files import add_list_files
  File "/home/brentp/miniconda3/lib/python3.7/site-packages/ggd/list_files.py", line 14, in <module>
    from .search import load_json_from_url, search_packages
  File "/home/brentp/miniconda3/lib/python3.7/site-packages/ggd/search.py", line 11, in <module>
    from fuzzywuzzy import fuzz
ModuleNotFoundError: No module named 'fuzzywuzzy'
after doing the install as suggested in the readme.