
datapack's Introduction

datapack: A Flexible Container to Transport and Manipulate Data and Associated Resources


The datapack R package provides an abstraction for collating heterogeneous collections of data objects and metadata into a bundle that can be transported and loaded into a single composite file. The methods in this package provide a convenient way to load data from common repositories such as DataONE into the R environment, and to document, serialize, and save data from R to data repositories worldwide.

Note that this package ('datapack') is not related to the similarly named rOpenSci package 'DataPackageR'. Documentation from the DataPackageR GitHub repository states that "DataPackageR is used to reproducibly process raw data into packaged, analysis-ready data sets."

Installation Notes

The datapack R package requires the R package redland. If you are installing on Ubuntu, then the Redland C libraries must be installed before the redland and datapack packages can be installed. If you are installing on Mac OS X or Windows, then installing these libraries is not required.

The following instructions illustrate how to install datapack and its requirements.

Installing on Mac OS X

On Mac OS X datapack can be installed with the following commands:

install.packages("datapack")
library(datapack)

The datapack R package should be available for use at this point.

Note: if you wish to build the required redland package from source before installing datapack, please see the redland installation instructions.

Installing on Ubuntu

For Ubuntu, install the required Redland C libraries by entering the following commands in a terminal window:

sudo apt-get update
sudo apt-get install librdf0 librdf0-dev

Then install the R packages from the R console:

install.packages("datapack")
library(datapack)

The datapack R package should be available for use at this point.

Installing on Windows

For Windows, the required redland R package is distributed as a binary release, so it is not necessary to install any additional system libraries.

To install the R packages from the R console:

install.packages("datapack")
library(datapack)

Quick Start

See the full manual for documentation, but once installed, the package can be run in R using:

library(datapack)
help("datapack")

Create a DataPackage and add metadata and data DataObjects to it:

library(datapack)
library(uuid)
dp <- new("DataPackage")
mdFile <- system.file("extdata/sample-eml.xml", package="datapack")
mdId <- paste("urn:uuid:", UUIDgenerate(), sep="")
md <- new("DataObject", id=mdId, format="eml://ecoinformatics.org/eml-2.1.0", filename=mdFile)
dp <- addData(dp, md)

csvfile <- system.file("extdata/sample-data.csv", package="datapack")
sciId <- paste("urn:uuid:", UUIDgenerate(), sep="")
sciObj <- new("DataObject", id=sciId, format="text/csv", filename=csvfile)
dp <- addData(dp, sciObj)
ids <- getIdentifiers(dp)

Add a relationship to the DataPackage that shows that the metadata describes, or "documents", the science data:

dp <- insertRelationship(dp, subjectID=mdId, objectIDs=sciId)
relations <- getRelationships(dp)

Create a Resource Description Framework representation of the relationships in the package:

serializationId <- paste("resourceMap", UUIDgenerate(), sep="")
filePath <- file.path(sprintf("%s/%s.rdf", tempdir(), serializationId))
status <- serializePackage(dp, filePath, id=serializationId, resolveURI="")

Save the DataPackage to a file, using the BagIt packaging format:

bagitFile <- serializeToBagIt(dp) 

Note that the dataone R package can be used to upload a DataPackage to a DataONE Member Node using the uploadDataPackage method. Please see the documentation for the dataone R package, for example:

vignette("upload-data", package="dataone")

Acknowledgements

Work on this package was supported by:

  • NSF-ABI grant #1262458 to C. Gries, M. B. Jones, and S. Collins.
  • NSF-DATANET grants #0830944 and #1430508 to W. Michener, M. B. Jones, D. Vieglais, S. Allard, and P. Cruse.
  • NSF DIBBS grant #1443062 to T. Habermann and M. B. Jones.
  • NSF-PLR grant #1546024 to M. B. Jones, S. Baker-Yeboah, J. Dozier, M. Schildhauer, and A. Budden.
  • NSF-PLR grant #2042102 to M. B. Jones, A. Budden, J. Dozier, and M. Schildhauer.

Additional support was provided for working group collaboration by the National Center for Ecological Analysis and Synthesis, a Center funded by the University of California, Santa Barbara, and the State of California.


datapack's People

Contributors

amoeba, bfgray3, cboettig, gothub, jeanetteclark, jeroen, mbjones, thomasthelen


datapack's Issues

Error with setClassUnion when loading datapackage via devtools

When loading datapackage with devtools::load_all(), I now get an error that appears to come from creating a union class with setClassUnion. Related issues have been reported, including r-lib/devtools#657 and r-lib/devtools#168. The code below illustrates the issue and provides session info.

> library(devtools)
> load_all('.')
Loading datapackage
Error in .walkClassGraph(ClassDef, "contains", where, attr(ext, "conflicts")) : 
  the 'superClass' list for class “character”, includes an undefined class “EnumerationValue”
In addition: Warning message:
character(0) 
Error in setClassUnion("charOrNULL", c("character", "NULL")) (from DataPackage.R#176) : 
  unable to create union class:  could not set members "character"
> getwd()
[1] "/Users/jones/development/datapackage"
> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] devtools_1.8.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.11.6        roxygen2_4.1.1     XML_3.98-1.1       digest_0.6.8      
 [5] bitops_1.0-6       git2r_0.10.1       magrittr_1.5       stringi_0.4-1     
 [9] uuid_0.1-1         hash_2.2.6         tools_3.2.0        stringr_1.0.0     
[13] RCurl_1.95-4.6     rversions_1.0.0    memoise_0.2.1      redland_0.2.0.9000

BagIT example online anywhere?

Hey, I've been looking for an existing dataset available online in the bag format. Does anyone know of any examples anywhere? I've struggled to find them.

-Luke

add methods for adding access policies

Currently SystemMetadata does not contain methods for adding access control rules. Add methods to:

  • initialize accessPolicy with a data frame during construction
  • add a single access control rule with a subject and permission
  • add a set of access rules as a data frame of (subject, permission) pairs

SystemMetadata initialize() sets @dateUploaded to current time

The dateUploaded slot value should be set to the current time when the SystemMetadata object
is uploaded to DataONE, not at object creation time, and should be set by the method or function
that is performing the upload. If the date is set in this way, then we can tell whether the SystemMetadata
has been successfully uploaded.

Group sysmeta access rules by subject

datapack:serializeSystemMetadata() should group access rules by subject, so that a more condensed,
human-readable accessPolicy is created. For example, the following access policy has
repeated blocks for the same subjects, from
https://arcticdata.io/metacat/d1/mn/v2/meta/arctic-data.3958.1:

<accessPolicy>
<allow>
<subject>CN=arctic-data-admins,DC=dataone,DC=org</subject>
<permission>read</permission>
</allow>
<allow>
<subject>CN=arctic-data-admins,DC=dataone,DC=org</subject>
<permission>write</permission>
</allow>
<allow>
<subject>CN=arctic-data-admins,DC=dataone,DC=org</subject>
<permission>changePermission</permission>
</allow>
<allow>
<subject>http://orcid.org/0000-0002-2625-6747</subject>
<permission>read</permission>
</allow>
<allow>
<subject>http://orcid.org/0000-0002-2625-6747</subject>
<permission>write</permission>
</allow>
<allow>
<subject>http://orcid.org/0000-0002-2625-6747</subject>
<permission>changePermission</permission>
</allow>
<allow>
<subject>public</subject>
<permission>read</permission>
</allow>
</accessPolicy>
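
The grouping requested above can be sketched in base R. This is only an illustration under stated assumptions: the data frame `rules` and the variables `grouped`, `allowBlocks`, and `policyXml` are hypothetical names, and the XML is built with plain string pasting rather than with the package's serializer.

```r
# Hypothetical access rules mirroring the policy above, one row per <allow> block.
rules <- data.frame(
  subject = c(rep("CN=arctic-data-admins,DC=dataone,DC=org", 3),
              rep("http://orcid.org/0000-0002-2625-6747", 3),
              "public"),
  permission = c("read", "write", "changePermission",
                 "read", "write", "changePermission",
                 "read"),
  stringsAsFactors = FALSE
)

# Collect the permissions for each subject, keeping first-seen subject order.
subjects <- unique(rules$subject)
grouped <- lapply(subjects, function(s) unique(rules$permission[rules$subject == s]))
names(grouped) <- subjects

# Render one <allow> block per subject (string building only, for illustration).
allowBlocks <- vapply(subjects, function(s) {
  paste0("<allow>\n",
         "<subject>", s, "</subject>\n",
         paste0("<permission>", grouped[[s]], "</permission>", collapse = "\n"),
         "\n</allow>")
}, character(1))

policyXml <- paste0("<accessPolicy>\n",
                    paste(allowBlocks, collapse = "\n"),
                    "\n</accessPolicy>")
cat(policyXml, "\n")
```

With the seven rules above, the result has three allow blocks instead of seven, one per subject.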

Release prep

datapack is ready for any additional evaluation and then release to CRAN.

The following tasks have been completed:

  • build / check() has been rerun on all platforms in cran-comments.md
  • all 1.0.0 github issues closed
  • datapack build tested with current dataone repo

create vignette showing use of datapackage

We need a vignette illustrating typical usage. Also show the relationship to the dataone package. Consider whether this vignette should be generated from a .Rmd file, and whether such a file should also generate the readme from issue #24.

DataPackage constructor reuses internal structures

To illustrate this:

  • create a DataPackage and add a DataObject
  • list objects in the DataPackage
  • delete the DataPackage
  • create a new DataPackage
  • list objects in the new, empty DataPackage
  • the new DataPackage contains objects from the original, deleted package

To recreate

library(datapackage)
testdf <- data.frame(x=1:10,y=11:20)
csvfile <- tempfile(pattern = "file", tmpdir = tempdir(), fileext = ".csv")
write.csv(testdf, csvfile, row.names=FALSE)
cn <- CNode("SANDBOX")
mnId <- "urn:node:mnSandboxUCSB2"
dp <- new("DataPackage")
sciObj <- new("DataObject", format="text/csv", user="uid=slaughter,ou=Account,dc=ecoinformatics,dc=org", mnNodeId=mnId, filename=csvfile)
addData(dp, sciObj)
getIdentifiers(dp)
rm(dp)
newDP <- new("DataPackage")
getIdentifiers(newDP)
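
One common way this symptom arises in S4 is a class prototype that holds a reference object such as an environment. This is an assumption about the general mechanism, not a confirmed diagnosis of this package. The sketch below uses hypothetical classes `SharedStore` and `FreshStore` to show the effect and one possible remedy.

```r
library(methods)

# Hypothetical class whose prototype holds an environment. The environment is
# created once, when the class is defined, and every new object points to it.
setClass("SharedStore", representation(store = "environment"),
         prototype(store = new.env()))

first <- new("SharedStore")
assign("id-1", "some object", envir = first@store)
rm(first)

# The second object sees what was stored through the first, deleted object.
second <- new("SharedStore")
print(ls(second@store))

# Creating the environment inside initialize() gives each object its own store.
setClass("FreshStore", representation(store = "environment"))
setMethod("initialize", "FreshStore", function(.Object, ...) {
  .Object <- callNextMethod(.Object, ...)
  .Object@store <- new.env()
  .Object
})

third <- new("FreshStore")
assign("id-1", "some object", envir = third@store)
fourth <- new("FreshStore")
print(ls(fourth@store))
```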

Link

Hello,

I'm looking for an R library to read DataPackages http://data.okfn.org/doc/data-package
I'm not sure if this package deals with them.
If that's the case, maybe you should put a link to the specification.

Kind regards

Rename and refactor D1Object class

The DataONEorg/D1Object class was moved into datapackage, but needs to be renamed to not use the "D1" name and to remove the java dependencies.

Allow non-memory storage for DataObject data

DataObject instances store their data in memory in the data slot, which works fine for small objects, but not well for larger objects. The size threshold shifts over time, but currently it's reasonable to store objects of 0.5 to 1 GB in memory, while anything larger is unwieldy. To fix this, add the ability to store data on the local filesystem in a temporary cache, or as a reference to a file that the caller provides. This will allow data objects to be created that are much larger than can fit in memory. We'll need a new slot to store the file reference, and modified methods to detect when data are in a file or in memory.
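
A minimal base R sketch of the idea, assuming a plain list stands in for the object: `makeDataHolder` and `holderBytes` are hypothetical names, not part of the package, and the real design would use an S4 slot instead.

```r
# Hypothetical constructor: exactly one of 'data' (raw bytes) or 'filename' is supplied.
makeDataHolder <- function(data = NULL, filename = NA_character_) {
  stopifnot(xor(is.null(data), is.na(filename)))
  list(data = data, filename = filename)
}

# Return the bytes regardless of whether they live in memory or in a file.
holderBytes <- function(holder) {
  if (!is.null(holder$data)) {
    return(holder$data)
  }
  readBin(holder$filename, what = "raw", n = file.size(holder$filename))
}

# The same content, held once in memory and once as a file reference.
content <- charToRaw("x,y\n1,2\n")
inMemory <- makeDataHolder(data = content)

csvfile <- tempfile(fileext = ".csv")
writeBin(content, csvfile)
onDisk <- makeDataHolder(filename = csvfile)

print(identical(holderBytes(inMemory), holderBytes(onDisk)))
```

Callers that only need the bytes go through the accessor, so they do not have to know which storage mode is in use. A file-backed object is only read when its bytes are actually requested.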

Prepare datapackage for rOpenSci drat builds

The datapackage package depends on the redland package, so it is not included in the automatic drat
builds. A related issue in the drat.builder repo mentions a problem with building packages that have
'deep' directory structures (i.e. redland): richfitz/drat.builder#9. Ensure that datapackage builds properly in preparation for the resolution of the drat.builder issue.

standardize on one class initialization mechanism

The class "DataPackage" requires this form for object creation:

pkg <- DataPackage()

Should we standardize on the form used by "DataObject" and only allow object
creation through new?

do <- new("DataObject", ...)

Update S4 generics so minimal signature is used

Use as few arguments in the setGeneric function signature as necessary, to provide flexibility
for users that wish to write their own methods that will be dispatched from our generic.

For our own uses, we need to have adequate parameters to dispatch on, so if we have multiple
methods implemented, we may need at least the first param, and sometimes the second, i.e.
addAccessRule(x, y) requires "x" to be able to dispatch to DataObject or SystemMetadata
implementations, "y" is required to dispatch based on a character string containing one access rule
or a data.frame containing possibly multiple access rules.
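
The dispatch pattern described above can be sketched with the methods package. The class `DemoSysmeta` and the generic `demoAddAccessRule` are hypothetical stand-ins, not the package's own definitions. The generic names only `x` and `y`, and each method picks up anything else it needs through `...`.

```r
library(methods)

# Hypothetical stand-in class that holds access rules in a data frame.
setClass("DemoSysmeta", representation(accessPolicy = "data.frame"))

# The generic declares only the arguments needed for dispatch.
setGeneric("demoAddAccessRule", function(x, y, ...) standardGeneric("demoAddAccessRule"))

# y is a single subject; this method adds its own 'permission' argument.
setMethod("demoAddAccessRule", signature("DemoSysmeta", "character"),
          function(x, y, permission, ...) {
  rule <- data.frame(subject = y, permission = permission, stringsAsFactors = FALSE)
  x@accessPolicy <- rbind(x@accessPolicy, rule)
  x
})

# y is a data frame that may contain several rules.
setMethod("demoAddAccessRule", signature("DemoSysmeta", "data.frame"),
          function(x, y, ...) {
  x@accessPolicy <- rbind(x@accessPolicy, y[, c("subject", "permission")])
  x
})

emptyPolicy <- data.frame(subject = character(0), permission = character(0),
                          stringsAsFactors = FALSE)
sm <- new("DemoSysmeta", accessPolicy = emptyPolicy)

# Dispatches on (DemoSysmeta, character).
sm <- demoAddAccessRule(sm, "public", permission = "read")

# Dispatches on (DemoSysmeta, data.frame).
moreRules <- data.frame(subject = c("uid=smith", "uid=smith"),
                        permission = c("read", "write"),
                        stringsAsFactors = FALSE)
sm <- demoAddAccessRule(sm, moreRules)
print(sm@accessPolicy)
```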

Relation to other packaging efforts?

We are currently exploring the use of the DataONE DataPackage format using OAI-ORE for the manifest and BagIt for serialization and transport.
Related packaging efforts need to be considered. Maybe we can/should support multiple serializations. None of the following really deal with all of the facets of packaging (metadata, container structure, identifier mapping, serialization formats), but each deals with some aspects. Related packaging specifications to consider:

Question for discussion: which serialization(s) shall we support, in addition to ORE/BagIt?

SystemMetadata() can't initialize if no <replicationPolicy>

When a SystemMetadata object is initialized from a parsed sysmeta XML document, if that
document doesn't contain a <replicationPolicy> element, then the following error is produced,
as shown in the following example:

> library(datapack)
> library(httr)
> library(XML)
> response <- GET("https://gmn.lternet.edu/mn/v1/meta/doi:10.6073%2FAA%2Fknb-lter-hfr.145.3")
> xml <- xmlParse(content(response, as="text"))
No encoding supplied: defaulting to UTF-8.
> sysmeta <- SystemMetadata(xmlRoot(xml))

 Error in UseMethod("xmlAttrs", node) : 
  no applicable method for 'xmlAttrs' applied to an object of class "NULL" 
10 xmlAttrs(xml[["replicationPolicy"]]) at SystemMetadata.R#260
9 .local(x, ...) 
8 parseSystemMetadata(sm_obj, x) at SystemMetadata.R#202
7 parseSystemMetadata(sm_obj, x) at SystemMetadata.R#183
6 .local(...) 
5 .Method(...) 
4 eval(expr, envir, enclos) 
3 eval(.dotsCall, env) 
2 standardGeneric("SystemMetadata") at SystemMetadata.R#169
1 SystemMetadata(xmlRoot(xml)) 
> 
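
One way to avoid this kind of failure is to check for a missing child element before reading its attributes. The sketch below assumes the XML package is installed (the example above already loads it). The helper `optionalAttrs` is hypothetical, and the two inline documents are simplified stand-ins for real system metadata.

```r
library(XML)

# Hypothetical helper: return the attributes of an optional child element,
# or an empty character vector when the element is absent.
optionalAttrs <- function(node, childName) {
  child <- node[[childName]]
  if (is.null(child)) {
    return(character(0))
  }
  xmlAttrs(child)
}

# Simplified document without a replicationPolicy element.
withoutPolicy <- xmlRoot(xmlParse(
  "<systemMetadata><identifier>example-id</identifier></systemMetadata>",
  asText = TRUE))

# Simplified document with a replicationPolicy element.
withPolicy <- xmlRoot(xmlParse(
  "<systemMetadata><replicationPolicy replicationAllowed='true' numberReplicas='3'/></systemMetadata>",
  asText = TRUE))

print(optionalAttrs(withoutPolicy, "replicationPolicy"))
print(optionalAttrs(withPolicy, "replicationPolicy"))
```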

implement shortcut for associating metadata object with science objects

Currently the insertRelationship method is called to associate a science object with a metadata object (which adds the isDocumentedBy / documents relationships).

A simpler alternative to having the user call insertRelationship is to add a third parameter
to the addData method to specify a metadata object to associate with a science object:

dp <- new("DataPackage")
metaObj <- new("DataObject", id=mdId, data, format="eml://ecoinformatics.org/eml-2.1.1", user, node)
addData(dp, metaObj)
sciObj <- new("DataObject", format="text/csv", user, node, filename=csvfile)
addData(dp, sciObj, metaObj)

Error when calling SystemMetadata() for sysmeta without access policy

When the sysmeta xml is parsed, the following error occurs when the access policy is
checked:

  no applicable method for 'xmlChildren' applied to an object of class "NULL"

BTW this problem can be seen in the dataone package when a sysmeta w/o access policy is
uploaded, then getSystemMetadata() is called.

Use generic param names for S4 dispatch params

Especially for the first param in an S4 generic definition, use a name that is
generic enough to handle all potential implementations. For the
1st parameter, use 'x', as this could potentially be any data type.

insertRelationships() should return a DataPackage object

insertRelationship updates a DataPackage by modifying a slot that contains a hash. This allows
the method to work without returning a DataPackage object, so for example, the following works:

dp <- new("DataPackage")
# Create a relationship
insertRelationship(dp, "/Users/smith/scripts/genFields.R",
   "http://www.w3.org/ns/prov#used",
   "https://knb.ecoinformatics.org/knb/d1/mn/v1/object/doi:1234/_030MXTI009R00_20030812.40.1")

Return the updated DataPackage, to prevent confusion and allow for the standard R way of
doing things:

dp <- new("DataPackage")
# Create a relationship
dp <- insertRelationship(dp, "/Users/smith/scripts/genFields.R",
   "http://www.w3.org/ns/prov#used",
   "https://knb.ecoinformatics.org/knb/d1/mn/v1/object/doi:1234/_030MXTI009R00_20030812.40.1")

insertRelationships method doesn't provide way to specify RDF resource type

Given that the underlying data model for datapackage resourceMaps is RDF, the insertRelationships
methods don't provide a way to specify the RDF resource type (URI, blank node or literal) for the subject and object arguments (the predicate must always be a URI).

Possibly two approaches could be supported to resolve this:
- RDF types could be specified explicitly for subject, object
- let the underlying RDF processing determine the appropriate
RDF resource type based on the string passed in.
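
The second approach could look roughly like the heuristic below. This is only a sketch: `guessNodeType` is a hypothetical helper, and the rules it applies (a leading `_:` marks a blank node, a leading scheme followed by a colon marks a URI, anything else is a literal) are assumptions made for illustration.

```r
# Hypothetical helper: guess an RDF node type from the form of a string.
guessNodeType <- function(value) {
  if (grepl("^_:", value)) {
    "blank"
  } else if (grepl("^[A-Za-z][A-Za-z0-9+.-]*:", value)) {
    "uri"
  } else {
    "literal"
  }
}

examples <- c("_:b1",
              "http://www.w3.org/ns/prov#used",
              "urn:uuid:1234",
              "/Users/smith/scripts/genFields.R",
              "Note: free text")
print(vapply(examples, guessNodeType, character(1)))
```

The last example shows the weakness of guessing: "Note: free text" is classified as a URI because it begins with a word followed by a colon. Cases like this are an argument for also supporting the first approach, where the caller states the type explicitly.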

This issue is related to ropensci/redland-bindings#14

Comments?

datapackage requires dataone

When invoking the command

R CMD INSTALL --no-multiarch --with-keep.source datapackage

the following error is printed

Loading required package: dataone
Error in .requirePackage(package) : 
  unable to find required package ‘dataone’
In addition: Warning message:
In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE,  :
  there is no package called ‘dataone’

This occurs if the dataone package is not available for loading, i.e. remove.packages("dataone")
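
If dataone is meant to be optional, one common pattern is to test for it at run time instead of requiring it when the package loads. The sketch below is an assumption about a possible remedy, not the fix that was applied, and `hasPackage` is a hypothetical helper.

```r
# Hypothetical helper: report whether an optional package can be loaded,
# without attaching it or raising an error when it is missing.
hasPackage <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

if (hasPackage("dataone")) {
  message("dataone is available; upload features can be enabled")
} else {
  message("dataone is not installed; continuing without upload features")
}
```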

Extract DataONEorg/DataPackage class into its own package

DataPackage, SystemMetadata, and probably D1Object would be main classes

  • Decide whether to move SystemMetadata to datapackage
  • Create R-based slots for all DataPackage state variables, including systemmetadata, and ORE relationships
  • Replace all java-based implementations in DataPackage with native R
  • Rename D1Object to something less DataONE centric
  • Replace all java-based implementation in D1Object with native R

add method DataObject:addAccessRules(dataObject, user, access)

A SystemMetadata object is created for a DataObject when it is initialized, but there is no method or
initialization parameter to add access rules to this sysmeta object. The proposed method would
take arguments such as:

dataObj <- addAccessRules(dataObj, user="uid=slaughter,ou=Account,dc=ecoinformatics,dc=org", access=c("read", "write"))

Installation errors: the `redland` package is not available

I tried to install the package with devtools::install_github("ropensci/datapackage"), but got the following errors:

Downloading github repo ropensci/datapackage@master
Installing datapackage
Skipping 1 packages not available: redland
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ  \
  --no-save --no-restore CMD INSTALL  \
  '/private/var/folders/m8/68z5d14d7mv9qyf1bd69ych40000gn/T/RtmpQWZ1bF/devtools32b45c3ccb9/ropensci-datapackage-ff07ac0'  \
  --library='/Library/Frameworks/R.framework/Versions/3.2/Resources/library'  \
  --install-tests 

ERROR: dependency ‘redland’ is not available for package ‘datapackage’
* removing ‘/Library/Frameworks/R.framework/Versions/3.2/Resources/library/datapackage’
Error: Command failed (1)

Update package name per CRAN request

Per CRAN request, the package name will be updated to 'dpack' or 'datapack'.

@mbjones, which one do you prefer?

Downstream packages will have to be updated as well, and tickets will be entered for

  • dataone
  • recordr

edit the README.md for datapackage

We need a complete README describing the R package. Write it. Consider whether it could or should be generated by a Vignette or an .Rmd file that generates both it and a vignette.
