
neotoma2's Introduction


neotoma2 R Package

The neotoma2 R package represents a set of breaking changes from the original neotoma R package. The neotoma package was deprecated following the end-of-life of the Neotoma Windows Server in 2020 and the migration of the Neotoma backend infrastructure to a PostgreSQL database and JavaScript API.

The neotoma2 package is built on the new Neotoma API and is intended as a starting point for a fully interactive experience with the Neotoma Paleoecology Database, to support both data access and data input through R.

Contributors

This is an open project, and contributions are welcome from any individual. All contributors to this project are bound by a code of conduct. Please review and follow this code of conduct as part of your contribution.

Tips for Contributing

Issues and bug reports are always welcome. Code clean-up and feature additions can be done either through pull requests to project forks or project branches.

Please direct development questions to Simon Goring by email: [email protected].

All products of the Neotoma Paleoecology Database are licensed under an MIT License unless otherwise noted.

How to use this repository

All R functions for the package should be written in the R folder. Any documentation should be added to the .R files using roxygen2 notation. Because we use roxygen2 for documentation in this package, all edits to documentation should take place in the associated function's .R file. The files in the man folder should not be edited manually.
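As a minimal sketch of that convention (the function below is hypothetical and not part of the package), a roxygen2 block sits directly above its function in the .R file, and the man/ pages are regenerated from it with devtools::document():

#' @title Count records in a list
#' @description A hypothetical helper, documented in roxygen2 notation.
#' @param x A list of records.
#' @returns The number of records in \code{x}.
#' @export
count_records <- function(x) {
  length(x)
}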

Class definitions and new methods should be added to the files 01_classDefinitions.R and 02_genericDefinitions.R respectively, to ensure they are properly loaded during the package build process.
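As a hedged illustration (the class, slot, and generic names here are hypothetical, not existing package code), a new S4 class and an accompanying generic would be split across those two files like this:

# In R/01_classDefinitions.R: define the class
setClass("collection",
         slots = c(collid = "numeric",
                   handle = "character"))

# In R/02_genericDefinitions.R: define the generic and a method for it
setGeneric("handle", function(x) standardGeneric("handle"))

setMethod("handle", signature("collection"), function(x) x@handle)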

Development Workflow Overview

The neotoma2 package is built for R. Build tools include elements from the usethis, devtools, and testthat R packages, and building and compilation occur within (and outside of) the RStudio IDE environment.

Installing the package requires the devtools::install_github() function, which pulls this working repository into a user's environment:

devtools::install_github('NeotomaDB/neotoma2', build_vignettes = TRUE)

To see the rendered vignette, you can also visit the following site: https://open.neotomadb.org/neotoma2/inst/doc/neotoma2-package.html

The expectation for this repository is that all commits to the prod branch will support a clean package build. This is supported through GitHub Actions in the .github folder of the repository.
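For local development, a typical pre-commit check cycle looks like the following (a sketch; the GitHub Actions workflows may run additional steps):

# Regenerate man/ pages and NAMESPACE from the roxygen2 comments
devtools::document()

# Run the testthat suite
devtools::test()

# Full R CMD check, as expected to pass before committing to prod
devtools::check()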

Analysis Workflow Overview

There is considerable information in the vignettes for the package, which can be accessed directly.
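Assuming the package was installed with build_vignettes = TRUE, the vignettes can be listed and opened directly from R:

# List the vignettes shipped with the package
vignette(package = "neotoma2")

# Open the rendered vignettes in a browser
browseVignettes("neotoma2")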

Report of Sites Statistics

To see the total number of sites available to query through this package, use the function:

neotoma2::get_stats(start=0, end=1)

System Requirements

This project is built with R > v4.0. The packages needed for proper package use are detailed in the DESCRIPTION file for this repository.

Data Requirements

The neotoma2 R package pulls data from the Neotoma Paleoecology Database. Neotoma maintains a permissive data use policy. Within the data use policy there is a statement on co-authorship which reads:

Normal ethics apply to co-authorship of scientific publications. Paleoecological datasets are labor-intensive and complex: they take years to generate and always have additional attributes and metadata not captured in Neotoma. Neotoma data quality also relies on expert curation by data stewards, each associated with one or more Constituent Databases. Users of data stored in Neotoma’s Constituent Databases should consider inviting the original data contributor, or Constituent Database steward(s), to be a co-author(s) of any resultant publications if that contributor’s data are a major portion of the dataset analyzed, or if a data contributor or steward makes a significant contribution to the analysis of the data or to the interpretation of results. For large-scale studies using many Neotoma records, contacting all contributors or stewards or making them co-authors will not be practical, possible, or reasonable. Under no circumstance should authorship be attributed to data contributors or stewards, individually or collectively, without their explicit consent.

Metrics

This project is to be evaluated using the following metrics:

  • Updated production branch with CRAN changes (code only affected), May 24, 2024
  • Maintenance GitHub 1.0.3 release, February 28, 2024
  • Published JOSS paper, DONE November 28, 2023
  • Submitted paper to JOSS, DONE May 3, 2023
  • Completion of the CRAN 1.0.0 release, DONE April 23, 2023
  • Completion of core functionality for data access, DONE February 10, 2022
  • Completion of core functionality for data presentation, DONE
  • Completion of clear vignettes for major data types or Constituent Databases represented within the Neotoma Database.

neotoma2's People

Contributors

dfcharles, jerrinjacob21, jmrhobbs, sedv8808, simongoring, xuanxu


neotoma2's Issues

JOSS review: (OPTIONAL DEVELOPMENT!) Add parallel processing to speed up get_sites()?

The get_sites() function is used to derive site-specific info. It seems to call an API function. It can take a very long time to get multiple sites at once, especially if the user provides an age filter.

The authors could consider adding parallelization here, since this is just a fetch function. It could be as simple as the following:

library(parallel)
library(data.table)

# Create a cluster of workers and save it to a cluster object
cluster <- makeCluster(detectCores() - 1)

# A vector of site identifiers to fetch (placeholders in this example)
list_of_sites <- c(A, B, C)

# Fetch each site in parallel; the function would wrap the API call
processed_sites <- parLapply(cluster, list_of_sites,
                             function(x) as.data.frame(neotoma2::get_sites(x)))

# Combine the per-site results into a single table
processed_sites <- rbindlist(processed_sites)

stopCluster(cluster)

The above is a crude example, but it can work well. Again, this is just an optional suggestion.

openjournals/joss-reviews#5561

Calls to get_*() functions that use identifiers should auto-clean NA values.

Currently a call to (e.g.) get_datasets() that contains NA values results in a failing call, due to a NULL value being passed. Note that this may overlap with an issue in the Neotoma API.

reprex:

library(neotoma2)
datasetids <- c(1,2,3,4,NA,6)
output <- get_datasets(datasetids)

Returns:

Error in neotoma2::parseURL(base_url, ...) : 
  Internal Server Error (HTTP 500). Failed to Could not connect to the Neotoma API.
                    Check that the path is valid, and check the current
                     status of the Neotoma API services at
                      http://data.neotomadb.org.

Should return a valid sites object.
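Until auto-cleaning is added, a simple user-side workaround (a sketch) is to strip the NA values before the call:

library(neotoma2)
datasetids <- c(1, 2, 3, 4, NA, 6)

# Drop NA identifiers before passing them to the API
output <- get_datasets(as.numeric(na.omit(datasetids)))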

JOSS review: Manuscript

Lines 13-17: As far as I understand the manuscript and the vignettes, the set_sites() function can be used to create sites locally, based on data not yet available in the database. However, it is currently unclear in the manuscript whether the function also allows users to upload these sites to the database ("create new records"). Please rephrase and specify.

Lines 30-33: Please specify a bit further how the package " [...] conforms to a tidyverse approach [...]" and how this relates to the Neotoma data model.

openjournals/joss-reviews#5561

Issue with `get_downloads()` for `sites` objects.

I added some code in the mammal vignette:

mamData <- get_datasets(datasettype="vertebrate fauna", ageold=12000, limit = 99999)
plotLeaflet(mamData)
mamDl <- get_downloads(mamData)

Executing the code as is results in the following:

x is not an allowed argument.
      Choose from the allowed arguments: sitename, altmax, altmin, loc
Your search returned 965 objects.
Printing only 25 objects.
       Use all_data = TRUE for storing the complete set.
 Error: $ operator is invalid for atomic vectors 

I'll take a look and see if I can figure out what's going on.

Request-URI Too Long

I'm trying to download all the pollen data, but I run into the following error:

pollen_datasets <-  neotoma2::get_datasets(datasettype = "pollen",   all_data = TRUE)

neotoma2::get_downloads(pollen_datasets)
Error in parseURL(base_url, ...) : 
  Request-URI Too Long (HTTP 414). Failed to Could not connect to the Neotoma API.
                    Check that the path is valid, and check the current
                     status of the Neotoma API services at
                      http://data.neotomadb.org.

I'm fairly sure this is because get.downloads.numeric pastes all the dataset IDs into a very long string that is too long for a GET request.
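One possible fix (a sketch, not the package's current behaviour) is to split the dataset IDs into batches small enough for a GET request and concatenate the resulting sites objects:

library(neotoma2)

# Assumes getids() returns a table with a datasetid column for these objects
ids <- unique(getids(pollen_datasets)$datasetid)

# Request the downloads in batches of 100 IDs to keep each request URI short
batches <- split(ids, ceiling(seq_along(ids) / 100))
downloads <- lapply(batches, function(b) get_downloads(b, all_data = TRUE))

# Concatenate the per-batch sites objects into one
all_downloads <- Reduce(c, downloads)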

Improve the `doi()` help, to include a working example.

In response to an article review, it's clear that we need to improve the documentation for the doi() function. I have written a GitHub Gist that effectively does what we're looking to do.

We should clean up the code so it comes directly from a set of queries (rather than a raw file):

library(neotoma2)
library(dplyr)
library(readr)

poll <- readr::read_tsv('pollen_counts_europe.csv') %>%
    dplyr::filter(Data_Source == 'Neotoma') %>%
    dplyr::select(Dataset_ID) %>%
    dplyr::distinct() %>%
    unlist()

datasets <- neotoma2::get_datasets(poll, all_data = TRUE)
neotoma2::doi(datasets)

Pollen standardizations

The prior iteration of neotoma had an option for standardizing pollen taxa to a number of different lists. It would be great to have this same function in neotoma2. Are there plans to add this? Thanks!
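For reference, the earlier neotoma package exposed this (to the best of my recollection) through compile_taxa(), e.g. against the P25 list; nothing equivalent exists in neotoma2 yet:

# Prior neotoma package (not neotoma2): standardize taxa against the P25 list
library(neotoma)
dl <- get_download(some_pollen_dataset_id)  # placeholder dataset ID
compiled <- compile_taxa(dl, list.name = "P25")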

Issue loading neotoma2 through devtools

I am getting the following error when trying to load neotoma2:

Error in makePrototypeFromClassDef(properties, ClassDef, immediate, where) :
in making the prototype for class “publication” elements of the prototype failed to match the corresponding slot class: author (class "contacts" )
Error: unable to load R code in package ‘neotoma2’
Execution halted

sites no longer repeated with get_downloads

Fixed the issue of repeated sites in a download; this was hard and probably needs refactoring for efficiency

  • Can do downloads for Brazil_datasets now

Redo the all_data = TRUE in vignette and code

  • Not sure why this keeps being deleted and I keep pushing it. Maybe we are merging the vignette in the wrong way?

Removed pager function

  • Not sure why this keeps being re-added and I keep deleting it. Maybe we are not pulling the delete?

Added `[[<-` methods for collunits and datasets

Fixed the c() method for site objects: it was returning primitive[1] instead of the correct sites object

get_downloads() returns data from the same site n times

When I select data with get_sites() and then get_datasets(), I get what I want: data for the n sites that match my criteria. But when I run get_downloads() on the same set of sites, I get n identical lists containing samples for the same site.

For instance, if I run the following chunk, taken directly from the page neotoma2_R_pack_doc, I get samples for the right number of sites (7), but I get the data for the same site (Lake Valencia) 7 times:

brazil <- '{"type": "Polygon", 
            "coordinates": [[
                [-73.125, -9.102],
                [-56.953, -33.138],
                [-36.563, -7.711],
                [-68.203, 13.923],
                [-73.125, -9.102]
              ]]}'

# We can make the geojson a spatial object if we want to use the
# functionality of the `sf` package.
brazil_sf <- geojsonsf::geojson_sf(brazil)

brazil_records <- get_datasets(loc = brazil_sf) %>%
  neotoma2::filter(datasettype == "pollen" & age_range_young <= 1000 & age_range_old >= 10000) %>%
  get_downloads(verbose = FALSE)

With get_sites() or get_datasets(), I obtain in my environment a list of lists structured this way:
-sites

  • -[[1]]
  • -[[2]]
  • -[[3]]
    etc.

While with get_downloads(), I get:
-sites

  • -site
  • -site
  • -site
    etc.

get_datasets from get_sites results in duplication of sites

With the following script:

aa <- get_sites(sitename='A%', limit = 100)
bb <- get_datasets(aa, limit = 200)

We see that `aa@sites[[99]]` is not the same as `bb@sites[[99]]`. This is because the API currently returns a site for each dataset, so the sites get repeated in the resulting object. This is not the correct API behaviour, and *is* addressed by a change in the API repository.

JOSS review: Could authors add state of the field to the paper?

Can the authors add some text regarding the currently available options (other than your package) that can be used to explore the Neotoma database, and explain why this package is much more accessible?

I'm guessing this is because using R is very user-friendly and users can conveniently add their own sites and methods? Either way, this should be fairly simple to add to the paper itself.

openjournals/joss-reviews#5561

get_datasets should respect the limit

Currently get_datasets first pulls all datasets and only then applies the limit. This makes the call seem really slow. It should only download what is asked for, respecting the limit.
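The expected behaviour (a sketch of intended usage; the offset argument for paging is my assumption here) would be:

library(neotoma2)

# Should fetch only the first 25 pollen datasets rather than everything
first_page <- get_datasets(datasettype = "pollen", limit = 25)

# Later pages would then be requested explicitly
second_page <- get_datasets(datasettype = "pollen", limit = 25, offset = 25)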

Preferred citation for neotoma2

Hello neotoma2 developers 👋
I'd like to reference neotoma2 in an academic paper.
What's the preferred citation in this case?
I've used the default citation("neotoma2") citation generated by R:

@Manual{,
    title = {neotoma2: Working with the Neotoma Paleoecology Database},
    author = {Simon Goring},
    year = {2021},
    note = {R package version 0.0.0.9000},
  }

Is this ok?
You could override this behavior by having an explicit CITATION file in the inst/ folder.

Many thanks!
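For illustration, an inst/CITATION file along these lines (the field values below are placeholders copied from the default citation above) would let citation("neotoma2") return the preferred entry:

# inst/CITATION (sketch; the maintainers would fill in the preferred reference)
citHeader("To cite neotoma2 in publications, please use:")

bibentry(
  bibtype = "Manual",
  title   = "neotoma2: Working with the Neotoma Paleoecology Database",
  author  = person("Simon", "Goring"),
  year    = "2021",
  note    = "R package version 0.0.0.9000"
)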

Installation

I'm having trouble installing via devtools. I'm using devtools::install_github("NeotomaDB/neotoma2", force = TRUE), but I am getting the error below:

* installing *source* package ‘neotoma2’ ...
** using staged installation
** R
** byte-compile and prepare package for lazy loading
Error in charToDate(x) : 
  character string is not in a standard unambiguous format
Error: unable to load R code in package ‘neotoma2’
Execution halted
ERROR: lazy loading failed for package ‘neotoma2’

Config info:

$platform
[1] "x86_64-apple-darwin15.6.0"

$os
[1] "darwin15.6.0"

$system
[1] "x86_64, darwin15.6.0"

$version.string
[1] "R version 3.6.1 (2019-07-05)"

neotoma R package filters out sites for no apparent reason

I am currently trying to download data in bulk with the neotoma R package, for example all available pollen data. The explorer indicates 4000+ available datasets, and get_sites(datasettype = "pollen", all_data = TRUE) returns the same number.

As soon as I use get_datasets or get_downloads, however, the number of datasets is reduced drastically, leaving me with as few as 900 remaining.

Any ideas what the problem could be and how it can be fixed?

Thanks

Calls to `as.data.frame()` seem to be returning confusing column classes.

The as.data.frame() call casts certain columns to non-intuitive column types. This causes problems when using functions like dplyr::inner_join() on tables.

reprex

library(neotoma2)
> ds <- get_datasets(1:10)
> df <- as.data.frame(ds)
> class(df$siteid)
[1] "character"

Expected

The ID columns should be integers. Other columns should be properly typed/classed.
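Until the casting is fixed, a user-side workaround (a sketch, assuming the column names shown in the reprex) is to coerce the ID columns before joining:

library(dplyr)
library(neotoma2)

# Coerce the siteid column to integer so joins behave as expected
df <- as.data.frame(get_datasets(1:10)) %>%
  mutate(siteid = as.integer(siteid))

class(df$siteid)
# [1] "integer"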

Modify the length function to provide lengths of sites, datasets, &cetera.

Currently, because of the structure of a neotoma2 object, the length() function only returns the number of sites.

There are several options here. One is to have something like length(x, 'datasets'), or we could do ndatasets().

I think that the base length() method does not allow other parameters, so that may not be an option.
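A hedged sketch of the ndatasets() option (the helper below is hypothetical and assumes getids() returns a datasetid column for the object):

# Hypothetical helper, not in the package: count the datasets attached
# to a sites object via its id table
ndatasets <- function(x) {
  length(unique(getids(x)$datasetid))
}

ndatasets(get_datasets(1:10))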

JOSS review: Build vignettes when installing package

When installing the package, the vignettes don't build automatically. This should be a relatively simple fix. All that is required is changing the README from:

devtools::install_github('NeotomaDB/neotoma2')

to

devtools::install_github('NeotomaDB/neotoma2', build_vignettes=TRUE)

UPDATE: I confirmed that build_vignettes passes when installing, so this can easily be updated in the README.

openjournals/joss-reviews#5561
