vinisalazar / bioprov

13.0 2.0 1.0 12.74 MB

A provenance library for bioinformatics workflows 🧬 🔀 📝

Home Page: https://bioprov.readthedocs.io/

License: MIT License

Python 100.00%

w3c-prov biopython biological-data provenance python prov

bioprov's Introduction

Hi, I'm Vini (he/him)

Microbial oceanography and research software engineering. Bridging the gap between metagenomics and oceanographic data.

I use Vinícius W. Salazar in formal documents and texts, and Vini Salazar for everything else.

🦘 PhD candidate @ Melbourne Integrative Genomics-UniMelb
💾 MSc. Systems & Computer Engineering @ Coppe-UFRJ
🌱 BSc. Biological Sciences @ UFSC
🔨 Maintainer and Instructor @ The Carpentries
🐧 Open Source enthusiast
🎓 My publications are available on my Google Scholar profile

If you'd like to contact me, please write to vinicius.salazar [at] unimelb.edu.au.

bioprov's Issues

Package requirements

Hello!

While reviewing your package, I have noticed that the requirements are not clearly stated anywhere (just several packages mentioned here and there). Please add such a section to your documentation and README file. You can just copy-paste the requirements from your setup.py file.

openjournals/joss-reviews#3622

genome_annotation workflow gives unexpected error

Usually when I run a shell command without arguments I expect it to show me all possible flags. Running bioprov kaiju does just that:

usage: kaiju [-h] -i INPUT [-o OUTPUT_DIRECTORY] -db KAIJU_DB -no NODES -na
             NAMES [--kaiju_params KAIJU_PARAMS]
             [--kaiju2table_params KAIJU2TABLE_PARAMS] [-t TAG] [-v]
             [-p THREADS]
kaiju: error: the following arguments are required: -i/--input, -db/--kaiju_db, -no/--nodes, -na/--names

However, the same is not true when running bioprov genome_annotation, which fails with a traceback instead of a usage message:

Traceback (most recent call last):
  File "/home/jvfe/miniconda3/envs/bioprov/bin/bioprov", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "bioprov/bioprov", line 14, in <module>
    main()
  File "bioprov/bioprov.py", line 54, in main
    parser.parse_options(args)
  File "bioprov/workflows/wf_parser.py", line 66, in parse_options
    subparsers[options.subparser_name](options)
  File "bioprov/workflows/wf_parser.py", line 61, in <lambda>
    "genome_annotation": lambda _options: self._genome_annotation(_options),
  File "bioprov/workflows/wf_parser.py", line 33, in _genome_annotation
    main.run_steps(steps)
  File "bioprov/src/workflow.py", line 181, in run_steps
    self.generate_sampleset()
  File "bioprov/src/workflow.py", line 117, in generate_sampleset
    self.sampleset = _generate_sampleset[self.input_type]()
  File "bioprov/src/workflow.py", line 255, in _load_dataframe_input
    assert path.isfile(input_), Warnings()["not_exist"]
  File "/home/jvfe/miniconda3/envs/bioprov/lib/python3.7/genericpath.py", line 30, in isfile
    st = os.stat(path)
TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType

Though bioprov genome_annotation --help does the job:

usage: genome_annotation [-h] [-i INPUT] [-c CPUS] [--verbose] [-t TAG]
                         [--steps STEPS]

Genome annotation with Prodigal, Prokka and the COG database.

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Input file, may be a tab delimited file or a
                        directory. If a file, must contain column 'sample-id'
                        for sample ID and 'assembly' for files. See program
                        help for information. (default: None)
  -c CPUS, --cpus CPUS  Default is set in BioProv config (half of the CPUs).
                        (default: 2)
  --verbose             More verbose output (default: False)
  -t TAG, --tag TAG     A tag for the dataset (default: None)
  --steps STEPS         A comma-delimited string of which steps will be run in
                        the workflow. Possible steps: ['prodigal'] (default:
                        ['prodigal'])

I don't know enough about argparse to think of a fix for this, though checking for len(sys.argv) might be a possible workaround.
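A minimal sketch of that workaround, assuming a stand-in parser: build_parser and parse_or_show_help are hypothetical names, not part of BioProv, and only the -i/--input and --steps options are taken from the help text above. It prints the help when no arguments are given and reports a missing input through argparse instead of letting None reach os.stat:

```python
import argparse
import sys


def build_parser():
    # Hypothetical stand-in for the genome_annotation subparser.
    parser = argparse.ArgumentParser(prog="genome_annotation")
    parser.add_argument("-i", "--input", help="Tab-delimited file or directory.")
    parser.add_argument("--steps", default="prodigal")
    return parser


def parse_or_show_help(argv):
    parser = build_parser()
    if not argv:
        # No arguments at all: show the full help and exit with status 2.
        parser.print_help(sys.stderr)
        raise SystemExit(2)
    options = parser.parse_args(argv)
    if options.input is None:
        # parser.error prints the usage line plus this message, then exits with status 2.
        parser.error("the following arguments are required: -i/--input")
    return options


if __name__ == "__main__":
    print(parse_or_show_help(sys.argv[1:]))
```

Declaring the argument with required=True would be the shorter route, and is presumably why the kaiju subcommand already reports its missing arguments.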

Any reason to use .format() instead of f-strings?

Hi,

I've noticed most (if not all) string interpolation in your package uses the .format() style instead of f-strings. I don't think it's for backwards compatibility, since your package already requires Python>=3.6 and f-strings were introduced in 3.6, so is using .format() a matter of personal preference? I ask because, even though neither has a clear advantage over the other, I find f-strings more concise and easier to read.

If you plan on changing them, running grep -R "[\"']\.format(" . in the root directory would show all occurrences.
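For comparison, here is the same message written both ways. The variable names and the message are made up for illustration and do not come from the BioProv source:

```python
sample = "sample_01"
n_files = 3

# .format() style: placeholders are filled in by position.
with_format = "Sample {} has {} files.".format(sample, n_files)

# f-string style (Python 3.6+): the expressions sit inside the literal.
with_fstring = f"Sample {sample} has {n_files} files."

# Both produce the same string.
assert with_format == with_fstring
```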

Environment.yml fails to build development environment

Bug:

When I try conda env create --file environment.yml I get several installations failing:

Solving environment: failed

ResolvePackageNotFound: 
  - openssl==1.1.1h=haf1e3a3_0
  - certifi==2020.6.20=py37h2987424_2
  - libcxx==11.0.0=h439d374_0
  - setuptools==49.6.0=py37h2987424_2
  - xz==5.2.5=haf1e3a3_1
  - tk==8.6.10=hb0a8c7a_1
  - prodigal==2.6.3=h01d97ff_2
  - python==3.7.8=hc9dea61_1_cpython
  - zlib==1.2.11=h7795811_1010
  - sqlite==3.33.0=h960bd1c_1
  - libffi==3.2.1=hb1e8313_1007
  - readline==8.0=h0678c8f_2
  - ncurses==6.2=hb1e8313_2

Specs:
OS: Linux Debian 10 (Buster)
conda version: 4.8.5

Apparently this is a common issue with conda (unfortunately), here's an example.

I've found that removing the build strings (the hash-like suffixes) from the environment file (i.e. prodigal=2.6.3=h516909a_2 -> prodigal=2.6.3) fixes it and gets me an environment with the same software and versions. The pinned builds are platform-specific, which is probably why they cannot be resolved on Linux. It is a weird and not very practical solution, but I can submit a pull request for it.
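The edit can be automated. The sketch below assumes the usual conda layout of one "- name=version=build" entry per line; both function names are hypothetical helpers, not part of BioProv or conda:

```python
def strip_build_string(spec):
    """Turn 'name=version=build' into 'name=version'; leave other specs untouched."""
    parts = spec.split("=")
    # Exactly three non-empty fields means a build string is present.
    # pip-style pins such as 'pkg==1.0' contain an empty field and are skipped.
    if len(parts) == 3 and all(parts):
        return "=".join(parts[:2])
    return spec


def strip_builds_from_yaml_text(text):
    """Apply strip_build_string to every '- ...' list entry, keeping indentation."""
    out = []
    for line in text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("- "):
            indent = line[: len(line) - len(stripped)]
            out.append(f"{indent}- {strip_build_string(stripped[2:].strip())}")
        else:
            out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    example = "dependencies:\n  - prodigal=2.6.3=h516909a_2\n  - python=3.7.8\n"
    print(strip_builds_from_yaml_text(example), end="")
```

When the original environment is still available, conda env export --no-builds writes the file without build strings in the first place.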

Update environment.yml with new dependencies.

v0.1.13 will include a new dependency, TinyDB, which will be the database system for BioProv.

Must update environment.yml with the following packages:

TinyDB
nb_conda_kernels

Add support for ProvStore API

ProvStore is an open provenance repository for W3C-PROV documents. BioProv should add support for CRUD operations using the ProvStore API.

Submit a PR that will:

[ ] Manage configuration of the ProvStore credentials
[ ] Allow CRUD operations with the ProvStore API

Non-code improvements that would be nice to have

Hi, great work on this package!

I think a few simple things that don't mess with the source code would greatly improve it:

Adding a CONTRIBUTING.md to the root

This would make this repo much more inviting for beginners who want to contribute (such as myself!). I really like biopython's contribution guide, for example.

Adding a link to the documentation in the README.

I find the notebooks you made really nice but I've also noticed you already have some docs on readthedocs, maybe add a badge on the README to refer to them? Some people like this more "traditional" type of documentation hahah. And you could even turn the current notebooks into .rst to have a static version of them over there.

Improve the project's page on PyPI

Currently, your project page only has a very short description, even though your README is much more informative! You could try using it as your long description, see one of my packages for example. It would only require a simple change on:

BioProv/setup.py, lines 16 to 21 in 178bef7:

    long_description=(
        "BioProv is a toolkit for capturing and extracting provenance data from"
        " bioinformatics workflows."
        "\n"
        "To know more about BioProv, please visit the Homepage."
    ),
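A rough sketch of what that change could look like, assuming the README is a Markdown file named README.md sitting next to setup.py. The helper names are hypothetical; long_description and long_description_content_type are the standard setuptools keywords:

```python
from pathlib import Path


def read_long_description(readme_path="README.md"):
    # Assumption: the README is UTF-8 encoded Markdown.
    return Path(readme_path).read_text(encoding="utf-8")


def build_setup_kwargs(readme_path="README.md"):
    """Return only the keyword arguments related to the long description."""
    return {
        "long_description": read_long_description(readme_path),
        # Tells PyPI to render the description as Markdown.
        "long_description_content_type": "text/markdown",
    }
```

In setup.py these would be merged into the existing call, e.g. setup(name=..., **build_setup_kwargs()), replacing the hard-coded string above.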

Add an environment file for development

This one is a bit tougher, since the tests require some external software, but I believe having a conda "dev_environment.yml" file would make it a lot easier for collaborators to work on this, run tests, add features and whatnot. Edit: Thinking back on it, this isn't really necessary as long as the contribution guidelines describe well which packages and software are needed for testing, code styling, etc.

Sorry for this huge wall of text, they are mostly just suggestions, feel free to disregard any of them!

That being said, I'd be happy to draft a PR to tackle the first three suggestions, if you don't mind.

Add GitHub action "create-release"

To make this repository citable through Zenodo, it must have a GitHub release. However, creating GH releases manually is a bit time-consuming. Ideally, whenever a tag gets pushed, GH should create a new release.

Create a PR that will:

Add a GitHub action to create a release when new tags are created.
