
neotoma2's Introduction


neotoma2 R Package

The neotoma2 R package represents a set of breaking changes from the original neotoma R package. The neotoma package was deprecated following the end-of-life of the Neotoma Windows Server in 2020 and the migration of the Neotoma backend infrastructure to a PostgreSQL database and JavaScript API.

The neotoma2 package is built on the new Neotoma API and is intended as a starting point for a fully interactive experience with the Neotoma Paleoecology Database, to support both data access and data input through R.

Contributors

This is an open project, and contributions are welcome from any individual. All contributors to this project are bound by a code of conduct. Please review and follow this code of conduct as part of your contribution.

Tips for Contributing

Issues and bug reports are always welcome. Code clean-up and feature additions can be done either through pull requests to project forks or project branches.

Please direct development questions to Simon Goring by email: [email protected].

All products of the Neotoma Paleoecology Database are licensed under an MIT License unless otherwise noted.

How to use this repository

All R functions for the package should be written in the R folder. Any documentation should be added to the .R files using roxygen2 notation. Because we use roxygen2 for documentation in this package, all edits to documentation should take place in the associated function's .R file. The files in the man folder should not be edited manually.
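As a minimal sketch of that convention (the function below is hypothetical and not part of the package), a roxygen2 block sits directly above its function in the .R file, and the man/ pages are regenerated from it with devtools::document():

#' @title Count records in a list
#' @description A hypothetical helper, documented in roxygen2 notation.
#' @param x A list of records.
#' @returns The number of records in \code{x}.
#' @export
count_records <- function(x) {
  length(x)
}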

Class definitions and new methods should be added to the files 01_classDefinitions.R and 02_genericDefinitions.R respectively, to ensure they are properly loaded during the package build process.
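As a hedged illustration (the class, slot, and generic names here are hypothetical, not existing package code), a new S4 class and an accompanying generic would be split across those two files like this:

# In R/01_classDefinitions.R: define the class
setClass("collection",
         slots = c(collid = "numeric",
                   handle = "character"))

# In R/02_genericDefinitions.R: define the generic and a method for it
setGeneric("handle", function(x) standardGeneric("handle"))

setMethod("handle", signature("collection"), function(x) x@handle)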

Development Workflow Overview

The neotoma2 package is built for R. Build tools include elements from the usethis, devtools, and testthat R packages, and building and compilation occur within (and outside of) the RStudio IDE environment.

Installing the package requires the devtools::install_github() function, which pulls this working repository into a user's environment:

devtools::install_github('NeotomaDB/neotoma2', build_vignettes = TRUE)

To see the rendered vignette, you can also visit the following site: https://open.neotomadb.org/neotoma2/inst/doc/neotoma2-package.html

The expectation for this repository is that all commits to the prod branch will support a clean package build. This is supported through GitHub Actions in the .github folder of the repository.
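For local development, a typical pre-commit check cycle looks like the following (a sketch; the GitHub Actions workflows may run additional steps):

# Regenerate man/ pages and NAMESPACE from the roxygen2 comments
devtools::document()

# Run the testthat suite
devtools::test()

# Full R CMD check, as expected to pass before committing to prod
devtools::check()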

Analysis Workflow Overview

There is considerable information in the vignettes for the package, which can be accessed directly.
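Assuming the package was installed with build_vignettes = TRUE, the vignettes can be listed and opened directly from R:

# List the vignettes shipped with the package
vignette(package = "neotoma2")

# Open the rendered vignettes in a browser
browseVignettes("neotoma2")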

Report of Sites Statistics

To see the total number of sites available to query through this package, use the function:

neotoma2::get_stats(start=0, end=1)

System Requirements

This project is built with R > v4.0. The packages needed for proper package use are detailed in the DESCRIPTION file for this repository.

Data Requirements

The neotoma2 R package pulls data from the Neotoma Paleoecology Database. Neotoma maintains a permissive data use policy. Within the data use policy there is a statement on co-authorship which reads:

Normal ethics apply to co-authorship of scientific publications. Paleoecological datasets are labor-intensive and complex: they take years to generate and always have additional attributes and metadata not captured in Neotoma. Neotoma data quality also relies on expert curation by data stewards, each associated with one or more Constituent Databases. Users of data stored in Neotoma’s Constituent Databases should consider inviting the original data contributor, or Constituent Database steward(s), to be a co-author(s) of any resultant publications if that contributor’s data are a major portion of the dataset analyzed, or if a data contributor or steward makes a significant contribution to the analysis of the data or to the interpretation of results. For large-scale studies using many Neotoma records, contacting all contributors or stewards or making them co-authors will not be practical, possible, or reasonable. Under no circumstance should authorship be attributed to data contributors or stewards, individually or collectively, without their explicit consent.

Metrics

This project is to be evaluated using the following metrics:

  • Updated production branch with CRAN changes (code only affected), May 24, 2024
  • Maintenance GitHub 1.0.3 release, February 28, 2024
  • Published JOSS paper, DONE November 28, 2023
  • Submitted paper to JOSS, DONE May 3, 2023
  • Completion of the CRAN 1.0.0 release, DONE April 23, 2023
  • Completion of core functionality for data access, DONE February 10, 2022
  • Completion of core functionality for data presentation, DONE
  • Completion of clear vignettes for major data types or Constituent Databases represented within the Neotoma Database.

neotoma2's People

Contributors

dfcharles, jerrinjacob21, jmrhobbs, sedv8808, simongoring, xuanxu


neotoma2's Issues

JOSS review: (OPTIONAL DEVELOPMENT!) Add parallel processing to speed up get_sites()?

The get_sites() function is used to derive site-specific info. It seems to call an API function. It can take a very long time to get multiple sites at once, especially if the user provides an age filter.

The authors could consider adding parallelization here, since this is just a fetch function. It could be as simple as the following:

library(parallel)
library(data.table)

# Create a cluster of workers and save it to a cluster object
cluster <- makeCluster(detectCores() - 1)

# A vector of site identifiers to fetch (placeholders in this example)
list_of_sites <- c(A, B, C)

# Fetch each site in parallel; the function would wrap the API call
processed_sites <- parLapply(cluster, list_of_sites,
                             function(x) as.data.frame(neotoma2::get_sites(x)))

# Combine the per-site results into a single table
processed_sites <- rbindlist(processed_sites)

stopCluster(cluster)

The above is a crude example, but it can work well. Again, this is just an optional suggestion.

openjournals/joss-reviews#5561

Calls to get_*() functions that use identifiers should auto-clean NA values.

Currently a call to (e.g.) get_datasets() that contains NA values results in a failing call, due to a NULL value being passed. Note that this may overlap with an issue in the Neotoma API.

reprex:

library(neotoma2)
datasetids <- c(1,2,3,4,NA,6)
output <- get_datasets(datasetids)

Returns:

Error in neotoma2::parseURL(base_url, ...) : 
  Internal Server Error (HTTP 500). Failed to Could not connect to the Neotoma API.
                    Check that the path is valid, and check the current
                     status of the Neotoma API services at
                      http://data.neotomadb.org.

Should return a valid sites object.
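Until auto-cleaning is added, a simple user-side workaround (a sketch) is to strip the NA values before the call:

library(neotoma2)
datasetids <- c(1, 2, 3, 4, NA, 6)

# Drop NA identifiers before passing them to the API
output <- get_datasets(as.numeric(na.omit(datasetids)))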

JOSS review: Manuscript

Lines 13-17: As far as I understand the manuscript and the vignettes, the set_sites() function can be used to create sites locally, based on data not yet available in the database. However, it is currently unclear in the manuscript whether the function also allows users to upload these sites to the database ("create new records"). Please rephrase and specify.

Lines 30-33: Please specify a bit further how the package " [...] conforms to a tidyverse approach [...]" and how this relates to the Neotoma data model.

openjournals/joss-reviews#5561

Issue with `get_downloads()` for `sites` objects.

I added some code in the mammal vignette:

mamData <- get_datasets(datasettype="vertebrate fauna", ageold=12000, limit = 99999)
plotLeaflet(mamData)
mamDl <- get_downloads(mamData)

Executing the code as is results in the following:

x is not an allowed argument.
      Choose from the allowed arguments: sitename, altmax, altmin, loc
Your search returned 965 objects.
Printing only 25 objects.
       Use all_data = TRUE for storing the complete set.
 Error: $ operator is invalid for atomic vectors 

I'll take a look and see if I can figure out what's going on.

Request-URI Too Long

I'm trying to download all the pollen data, but I run into the following error:

pollen_datasets <-  neotoma2::get_datasets(datasettype = "pollen",   all_data = TRUE)

neotoma2::get_downloads(pollen_datasets)
Error in parseURL(base_url, ...) : 
  Request-URI Too Long (HTTP 414). Failed to Could not connect to the Neotoma API.
                    Check that the path is valid, and check the current
                     status of the Neotoma API services at
                      http://data.neotomadb.org.

I'm fairly sure this is because get.downloads.numeric pastes all the dataset IDs into a very long string that is too long for a GET request.
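One possible fix (a sketch, not the package's current behaviour) is to split the dataset IDs into batches small enough for a GET request and concatenate the resulting sites objects:

library(neotoma2)

# Assumes getids() returns a table with a datasetid column for these objects
ids <- unique(getids(pollen_datasets)$datasetid)

# Request the downloads in batches of 100 IDs to keep each request URI short
batches <- split(ids, ceiling(seq_along(ids) / 100))
downloads <- lapply(batches, function(b) get_downloads(b, all_data = TRUE))

# Concatenate the per-batch sites objects into one
all_downloads <- Reduce(c, downloads)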

Improve the `doi()` help, to include a working example.

In response to an article review, it's clear that we need to improve the documentation for the doi() function. I have written a GitHub Gist that effectively does what we're looking to do.

We should clean up the code so it comes directly from a set of queries (rather than a raw file):

library(neotoma2)
library(dplyr)
library(readr)

poll <- readr::read_tsv('pollen_counts_europe.csv') %>%
    dplyr::filter(Data_Source == 'Neotoma') %>%
    dplyr::select(Dataset_ID) %>%
    dplyr::distinct() %>%
    unlist()

datasets <- neotoma2::get_datasets(poll, all_data = TRUE)
neotoma2::doi(datasets)

Pollen standardizations

The prior iteration of neotoma had an option for standardizing pollen taxa to a number of different lists. It would be great to have this same function in neotoma2. Are there plans to add this? Thanks!
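For reference, the earlier neotoma package exposed this (to the best of my recollection) through compile_taxa(), e.g. against the P25 list; nothing equivalent exists in neotoma2 yet:

# Prior neotoma package (not neotoma2): standardize taxa against the P25 list
library(neotoma)
dl <- get_download(some_pollen_dataset_id)  # placeholder dataset ID
compiled <- compile_taxa(dl, list.name = "P25")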

Issue loading neotoma2 through devtools

I am getting the following error when trying to load neotoma2:

Error in makePrototypeFromClassDef(properties, ClassDef, immediate, where) :
in making the prototype for class “publication” elements of the prototype failed to match the corresponding slot class: author (class "contacts" )
Error: unable to load R code in package ‘neotoma2’
Execution halted

sites no longer repeated with get_downloads

Fixed the issue of repeated sites in a download; this was hard and probably needs refactoring for efficiency

  • Can do downloads for Brazil_datasets now

Redo the all_data = TRUE in vignette and code

  • Not sure why this keeps being deleted and I keep pushing it. Maybe we are merging the vignette in the wrong way?

Removed pager function

  • Not sure why this keeps being re-added and I keep deleting it. Maybe we are not pulling the delete?

Added `[[<-` methods for collunits and datasets

Fixed the c() method for site objects: it was returning primitive[1] instead of the correct sites object

get_downloads() returns data from the same site n times

When I select data with get_sites() and then get_datasets(), I get what I want: data for the n sites that match my criteria. But when I run get_downloads() on the same set of sites, I get n identical lists containing samples for the same site.

For instance, if I run the following chunk, taken directly from the page neotoma2_R_pack_doc, I get samples for the right number of sites (7), but I get the data for the same site (Lake Valencia) 7 times:

brazil <- '{"type": "Polygon", 
            "coordinates": [[
                [-73.125, -9.102],
                [-56.953, -33.138],
                [-36.563, -7.711],
                [-68.203, 13.923],
                [-73.125, -9.102]
              ]]}'

# We can make the geojson a spatial object if we want to use the
# functionality of the `sf` package.
brazil_sf <- geojsonsf::geojson_sf(brazil)

brazil_records <- get_datasets(loc = brazil_sf) %>%
  neotoma2::filter(datasettype == "pollen" & age_range_young <= 1000 & age_range_old >= 10000) %>%
  get_downloads(verbose = FALSE)

With get_sites() or get_datasets(), I obtain in my environment a list of lists structured this way:
-sites

  • -[[1]]
  • -[[2]]
  • -[[3]]
    etc.

While with get_downloads(), I get:
-sites

  • -site
  • -site
  • -site
    etc.

get_datasets from get_sites results in duplication of sites

With the following script:

aa <- get_sites(sitename='A%', limit = 100)
bb <- get_datasets(aa, limit = 200)

We see that `aa@sites[[99]]` is not the same as `bb@sites[[99]]`. This is because the API currently returns a site for each dataset, so the sites get repeated in the resulting object. This is not the correct API behaviour, and *is* addressed by a change in the API repository.

JOSS review: Could authors add state of the field to the paper?

Can the authors add some text regarding the currently available options (other than your package) that can be used to explore the Neotoma database, and explain why this package is much more accessible?

I'm guessing this is because using R is very user-friendly and users can conveniently add their own sites and methods? Either way, this should be fairly simple to add to the paper itself.

openjournals/joss-reviews#5561

get_datasets should respect the limit

Currently get_datasets first pulls all datasets and only then applies the limit. This makes the call seem really slow. It should only download what is asked for, respecting the limit.
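The expected behaviour (a sketch of intended usage; the offset argument for paging is my assumption here) would be:

library(neotoma2)

# Should fetch only the first 25 pollen datasets rather than everything
first_page <- get_datasets(datasettype = "pollen", limit = 25)

# Later pages would then be requested explicitly
second_page <- get_datasets(datasettype = "pollen", limit = 25, offset = 25)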

Preferred citation for neotoma2

Hello neotoma2 developers 👋
I'd like to reference neotoma2 in an academic paper.
What's the preferred citation in this case?
I've used the default citation("neotoma2") citation generated by R:

@Manual{,
    title = {neotoma2: Working with the Neotoma Paleoecology Database},
    author = {Simon Goring},
    year = {2021},
    note = {R package version 0.0.0.9000},
  }

Is this ok?
You could override this behavior by having an explicit CITATION file in the inst/ folder.

Many thanks!
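For illustration, an inst/CITATION file along these lines (the field values below are placeholders copied from the default citation above) would let citation("neotoma2") return the preferred entry:

# inst/CITATION (sketch; the maintainers would fill in the preferred reference)
citHeader("To cite neotoma2 in publications, please use:")

bibentry(
  bibtype = "Manual",
  title   = "neotoma2: Working with the Neotoma Paleoecology Database",
  author  = person("Simon", "Goring"),
  year    = "2021",
  note    = "R package version 0.0.0.9000"
)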

Installation

I'm having trouble installing via devtools. I'm using devtools::install_github("NeotomaDB/neotoma2", force = TRUE), but I am getting the error below:

* installing *source* package ‘neotoma2’ ...
** using staged installation
** R
** byte-compile and prepare package for lazy loading
Error in charToDate(x) : 
  character string is not in a standard unambiguous format
Error: unable to load R code in package ‘neotoma2’
Execution halted
ERROR: lazy loading failed for package ‘neotoma2’

Config info:

$platform
[1] "x86_64-apple-darwin15.6.0"

$os
[1] "darwin15.6.0"

$system
[1] "x86_64, darwin15.6.0"

$version.string
[1] "R version 3.6.1 (2019-07-05)"

neotoma R package filters out sites for no apparent reason

I am currently trying to download data in bulk with the neotoma R package, for example all available pollen data. The explorer indicates 4000+ available datasets, and get_sites(datasettype = "pollen", all_data = TRUE) returns the same number.

As soon as I use get_datasets or get_downloads, however, the number of datasets is reduced drastically, leaving me with as few as 900 remaining.

Any ideas what the problem could be and how it can be fixed?

Thanks

Calls to `as.data.frame()` seem to be returning confusing column classes.

The as.data.frame() call casts certain columns to non-intuitive column types. This causes problems when using functions like dplyr::inner_join() on tables.

reprex

library(neotoma2)
> ds <- get_datasets(1:10)
> df <- as.data.frame(ds)
> class(df$siteid)
[1] "character"

Expected

The ID columns should be integers. Other columns should be properly typed/classed.
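Until the casting is fixed, a user-side workaround (a sketch, assuming the column names shown in the reprex) is to coerce the ID columns before joining:

library(dplyr)
library(neotoma2)

# Coerce the siteid column to integer so joins behave as expected
df <- as.data.frame(get_datasets(1:10)) %>%
  mutate(siteid = as.integer(siteid))

class(df$siteid)
# [1] "integer"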

Modify the length function to provide lengths of sites, datasets, &cetera.

Currently, because of the structure of a neotoma2 object, the length() function only returns the number of sites.

There are several options here. One is to have something like length(x, 'datasets'), or we could do ndatasets().

I think that the base length() method does not allow other parameters, so that may not be an option.
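A hedged sketch of the ndatasets() option (the helper below is hypothetical and assumes getids() returns a datasetid column for the object):

# Hypothetical helper, not in the package: count the datasets attached
# to a sites object via its id table
ndatasets <- function(x) {
  length(unique(getids(x)$datasetid))
}

ndatasets(get_datasets(1:10))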

JOSS review: Build vignettes when installing package

When installing the package, the vignettes don't build automatically. This should be a relatively simple fix. All that is required is changing the README from:

devtools::install_github('NeotomaDB/neotoma2')

to

devtools::install_github('NeotomaDB/neotoma2', build_vignettes=TRUE)

UPDATE: I confirmed that build_vignettes passes when installing, so this can easily be updated in the README.

openjournals/joss-reviews#5561
