tibhannover / bacdiver

Unofficial R client for the DSMZ's Bacterial Diversity Metadatabase (former contact: @katrinleinweber). https://api.bacdive.dsmz.de/client_examples appears to list the official alternatives.

Home Page: https://TIBHannover.GitHub.io/BacDiveR/

License: MIT License

R 98.05% Makefile 1.95%
r microorganism bacterial-database bacteriology webservice-client microbiology biobank r-package rstats bacterial-samples

bacdiver's Introduction

BacDiveR

This R package provided a programmatic interface to the Bacterial Diversity Metadatabase of the DSMZ (German Collection of Microorganisms and Cell Cultures).

As of June 2021, BacDive's "redesign" has rendered this R package inoperable. Apparently, they want you to use one of the clients listed at https://api.bacdive.dsmz.de/client_examples instead.

Old README below

BacDiveR helps you improve your research on bacteria and archaea by providing access to "structured information on [...] their taxonomy, morphology, physiology, cultivation, geographic origin, application, interaction" and more (Söhngen et al. 2016). Specifically, you can:

  • download the BacDive data you need for offline investigation, and

  • document your searches and downloads in .R scripts, .Rmd files, etc.

Thus, BacDiveR can be the basis for a reproducible data analysis pipeline. See TIBHannover.GitHub.io/BacDiveR for more details, /news there for the changelog, and GitHub.com/TIBHannover/BacDiveR for the latest source code.

It was also built to serve as a demonstration object during TIB's "FAIR Data & Software" workshop.

Installation

  1. Because the BacDive Web Service requires registration, please register first and wait for DSMZ staff to grant you access.

  2. Once you have your login credentials, install the latest BacDiveR release from GitHub with: if(!require('devtools')) install.packages('devtools'); devtools::install_github('TIBHannover/BacDiveR').

  3. After installing, follow the instructions on the console to save your login credentials locally and restart R(Studio), or run usethis::edit_r_environ() and ensure it contains the following:

[email protected]
BacDive_password=YOUR_20_char_password
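The environment variables above can be read back in R with Sys.getenv(). Here is a minimal sketch of a credential check; the helper name get_bacdive_credentials() is hypothetical, not part of the package, and only the variable names are taken from the lines above:

```r
# Read the BacDive credentials stored in ~/.Renviron.
# Sys.getenv() returns "" for unset variables, so we can fail
# early with a clear message instead of a cryptic API error.
get_bacdive_credentials <- function() {
  email <- Sys.getenv("BacDive_email")
  password <- Sys.getenv("BacDive_password")
  if (email == "" || password == "") {
    stop("Please set BacDive_email and BacDive_password in ~/.Renviron")
  }
  list(email = email, password = password)
}
```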

In the examples and vignettes, data retrieval will only work if your login credentials are correct in themselves (no typos) and were saved correctly. Console output like "{\"detail\": \"Invalid username/password\"}" or Error: $ operator is invalid for atomic vectors indicates that either the login credentials or the .Renviron file is incorrect.

How to use

There are two main functions: retrieve_data() and retrieve_search_results(). Please click on their names to read their docu, and find real-life examples in the vignettes "BacDive-ing in" and "Pre-Configuring Advanced Searches".

You can also run citation('BacDiveR') in the R console and use its output, because that ensures you cite exactly the version you have installed.

If you want to import this repo's metadata into a reference manager directly, I recommend Zotero and its GitHub translator. Please double-check that the citation refers to the same version number that you ran your analysis with.

When using BibTeX, you may want to try changing the item type to @Software ;-) Support for that is being worked on.

Don't forget to also cite BacDive itself whenever you used their data, regardless of access method.

How to contribute: See CONTRIBUTING.md file.

Known issues: See bugs and ADRs.

Similar tools

These seem to scrape all data instead of retrieving specific datasets.

References

  • Söhngen, Bunk, Podstawka, Gleim, Overmann. 2014. “BacDive — the Bacterial Diversity Metadatabase.” Nucleic Acids Research 42 (D1): D592–D599. doi:10.1093/nar/gkt1058.

  • Söhngen, Podstawka, Bunk, Gleim, Vetcininova, Reimer, Ebeling, Pendarovski, Overmann. 2016. “BacDive – the Bacterial Diversity Metadatabase in 2016.” Nucleic Acids Research 44 (D1): D581–D585. doi:10.1093/nar/gkv983.

  • Reimer, Vetcininova, Carbasse, Söhngen, Gleim, Ebeling, Overmann. 2018. “BacDive in 2019: Bacterial Phenotypic Data for High-Throughput Biodiversity Analysis” Nucleic Acids Research doi:10.1093/nar/gky879.

bacdiver's People

Contributors: axel-klinger, katrinleinweber

bacdiver's Issues

randomise test searches

https://bacdive.dsmz.de/api/bacdive/example uses some specific search terms. If automatic tests use these as well, their internal statistics about popular datasets might be skewed. Maybe they are already, and this fact is accounted for by the DSMZ.

  • ask whether any such statistics are collected

The test search terms could be randomised to avoid this problem: int <- sample(seq(100000, 999999), size = 1); acc <- paste0(paste(sample(LETTERS, size = 2), collapse = ""), int); paste("DSM", round(int / 1000)) or similar.

  • ask for max ranges

This would spread out the "popularity" inflation, but might require fine-tuning the seq ranges. Plus, it would assume continuous numbering on their end.

  • ask whether this is the case
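The randomisation idea above can be sketched as a runnable helper. The function name random_search_terms() and the numeric range are assumptions; the range would need the fine-tuning mentioned above:

```r
# Generate randomised test search terms so automated tests don't
# always hit the same example datasets. The 100000-999999 range is
# a guess and assumes continuous numbering on BacDive's side.
random_search_terms <- function() {
  int <- sample(seq(100000, 999999), size = 1)
  acc <- paste0(paste(sample(LETTERS, size = 2), collapse = ""), int)
  dsm <- paste("DSM", round(int / 1000))
  list(id = int, accession = acc, dsm = dsm)
}
```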

Remove invalid \n in JSON

While implementing #31 and switching from rjson to jsonlite, I noticed that some fields contain insufficiently escaped \n characters. This results in lexical error: invalid character inside string.

@ceb15: Please consider ensuring that those are escaped as \\n, either already in BacDive or (I presume) during JSON serialisation.

(screenshot of the error from 2018-03-20 omitted)

I'll parse them away for now.
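"Parsing them away" could look like the following sketch, which escapes the bare newlines before handing the text to a JSON parser. The helper name sanitise_json() is hypothetical, and the approach assumes the payload contains no structural (between-token) newlines:

```r
# Work around insufficiently escaped newlines inside JSON string
# values: replace each literal line break with the two-character
# escape sequence "\n" before parsing, e.g. with jsonlite::fromJSON().
sanitise_json <- function(json) {
  gsub("\n", "\\n", json, fixed = TRUE)
}
```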

Compare temp data to https://zenodo.org/record/1175609

  • check whether that dataset has different source
    • partially from BacDive => reproduce results
  • write vignette about extracting growth temp from that dataset & through BacDiveR & mention @mengqvist then
    • parse his dataset, try to retrieve same species from BacDive

split retrieve_IDs off from retrieve_data()

Extract to different function? If yes, by scraping IDs from paged URL returns (official examples), or by storing the URLs as intermediate result, plus providing helper functions to narrow that result down to the IDs?

Or, implement as an internal loop-back in retrieve_data(…, searchType = "taxon") based on new parameter taxon_data = TRUE?

  • ask whether only taxon search can return multiple IDs

Write management plans

https://figshare.com/articles/Managing_Research_Software_Development_better_software_better_research/5930662 p24f & http://www.software.ac.uk/software-management-plans

What software will you write?
What will your software do? 
Will your software have a name?
Who are the intended users of your software?
Is it for one type of user or for many?
What expertise is required?
How will you make your software available?
How will your software contribute to research and how will you measure its contribution?

aggregate datasets into useful structure before returning

noticed while working on #16

retrieve_data() currently appends multiple downloads into one continuous list in which the individual datasets can no longer be addressed. We need a data structure that lets the user $-address the datasets and their fields. Ideally, each dataset is referred to by index = bacdive_id. Something like a sparse list-of-lists?

ideas:

  • aggregate JSON strings in character vector, then rjson::fromJSON() them "in-place" or somehow that creates the nested lists "below / as lower hierarchies" of that vector
  • write-out each dataset to a file (kind of a local cache), then maybe concatenate files & re-import as a useful data structure
  • use jsonlite to create 1 dataframe per bacdive_ID, then add those to a list
  • keep on c()ombining downloads, but aggregate into a higher-level list and use an apply variant to extract a field/element from the resulting "megastructure"
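One minimal version of the $-addressable structure asked for above is a named list keyed by BacDive ID. A sketch with made-up data (the field name bacdive_id is taken from the issue text, the helper name aggregate_datasets() is hypothetical):

```r
# Aggregate downloaded datasets into a named list so each one stays
# addressable via its BacDive ID, e.g. datasets[["717"]], instead of
# being flattened into one long, unaddressable list.
aggregate_datasets <- function(downloads) {
  ids <- vapply(downloads, function(d) as.character(d$bacdive_id), character(1))
  stats::setNames(downloads, ids)
}
```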

Make taxon search more prominent?

Assuming the vast majority ("90%") of BacDive users looks up data about a strain, bacdive_ID as the default search may not be as useful.

Maybe rather a retrieve_taxon_data("…", filter_by = c("property_A", "prop_B", "C")) function?
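The filtering part of such a retrieve_taxon_data(..., filter_by = ...) function could work as in this local sketch; the helper name filter_properties() and the dataset structure are assumptions, not BacDive's actual schema:

```r
# Keep only the requested properties from each downloaded dataset,
# as a filter_by parameter might do after retrieval. Requested
# properties that are absent from the dataset are silently dropped.
filter_properties <- function(dataset, filter_by) {
  dataset[intersect(names(dataset), filter_by)]
}
```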
