unf's Introduction

Universal Numeric Fingerprint

UNF is a cryptographic hash or signature that can be used to uniquely identify (a version of) a rectangular dataset, or a subset thereof. UNF can be used, in tandem with a DOI or Handle, to form a persistent citation to a versioned dataset. A UNF signature is printed in the following form:

UNF:[UNF version][:UNF header options]:[UNF hash]

This allows a data consumer to quickly, easily, and definitively verify an in-hand data file against a data citation or to test for the equality of two datasets, regardless of their variable order or file format. UNF is used by The Dataverse Network archiving software for data citation (making the UNF package a logical companion to the dvn package). This package implements UNF versions 3 and up (current version is 6). Some details on the UNF algorithm and the R implementation thereof are included in a package vignette ("The UNF Algorithm"), and details on the use of UNF in data citation are available in another vignette ("Data Citation with UNF").
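
The printed form shown above can be taken apart mechanically, for example to inspect a citation string before comparing it with a locally computed value. The following is a minimal sketch in base R. `parse_unf()` is a hypothetical helper introduced here for illustration and is not part of the UNF package; it assumes that neither the header options nor the base64 hash contain a colon. It accepts both the `UNF:6:...` spelling of the template and the `UNF6:...` spelling that appears in the package output later in this document.

```r
# Hypothetical helper (not part of the UNF package): split a printed UNF
# signature into its version, header options, and hash.
parse_unf <- function(x) {
  stopifnot(is.character(x), length(x) == 1L, startsWith(x, "UNF"))
  # Drop the "UNF" prefix and the optional colon that follows it
  body <- sub("^UNF:?", "", x)
  parts <- strsplit(body, ":", fixed = TRUE)[[1]]
  if (!(length(parts) %in% c(2L, 3L))) {
    stop("not a recognizable UNF string: ", x)
  }
  list(
    version = parts[1],
    # Header options are optional; when present they are comma-separated
    options = if (length(parts) == 3L) {
      strsplit(parts[2], ",", fixed = TRUE)[[1]]
    } else {
      character(0)
    },
    hash = parts[length(parts)]
  )
}

str(parse_unf("UNF:6:6oVTvlCR+F1W1HTJ/QUmkA=="))
str(parse_unf("UNF4.1:3,128:8nzEDWbNacXlv5Zypp+3YCQgMao/eNusOv/u5GmBj9I="))
```

The two example strings are taken from output shown elsewhere in this document; the helper only splits text and does not validate the hash itself.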

Please report any mismatches between this implementation and any other implementation (including Dataverse's) on the issues page!

Why UNFs?

While file checksums are a common strategy for verifying a file (e.g., md5 sums are available for validating R packages), they are not well-suited to being used as global signatures for a dataset. A UNF differs from an ordinary file checksum in several important ways:

  1. UNFs are format independent. The UNF for a dataset will be the same regardless of whether the data is saved in an R binary format, a SAS-formatted file, a Stata-formatted file, etc., but file checksums will differ. The UNF is also independent of variable arrangement and naming, which can be unintentionally changed during file reading.

    library("digest")
    library("UNF")
    write.csv(iris, file = "iris.csv", row.names = FALSE)
    iris2 <- read.csv("iris.csv")
    identical(iris, iris2)
    ## [1] FALSE
    
    identical(digest(iris, "md5"), digest(iris2, "md5"))
    ## [1] FALSE
    
    identical(unf(iris), unf(iris2))
    ## [1] TRUE
    
  2. UNFs are robust to insignificant rounding error. This is important when dealing with floating-point numeric values. A UNF will be the same if the data differ only in non-significant digits, whereas a file checksum will not.

    x1 <- 1:20
    x2 <- x1 + 1e-7
    identical(digest(x1), digest(x2))
    ## [1] FALSE
    
    identical(unf(x1), unf(x2))
    ## [1] TRUE
    
  3. UNFs detect misinterpretation of the data by statistical software. If the statistical software misreads the file, the resulting UNF will not match the original, but the file checksums may match. For example, numeric values read as character will produce a different UNF than those values read in as numerics.

    x1 <- 1:20
    x2 <- as.character(x1)
    identical(unf(x1), unf(x2))
    ## [1] FALSE
    
  4. UNFs are strongly tamper resistant. Any accidental or intentional changes to data values will change the resulting UNF. Most file checksums and descriptive statistics detect only certain types of changes.
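
Point 4 can be illustrated in the same way as the others. This is a minimal sketch, assuming the UNF package is installed; `iris_edited` is a name introduced here for illustration, and the comparison uses the `$unf` element that appears in the `str(unf(iris))` output under Package Functionality.

```r
library("UNF")

# Copy the data and alter a single cell by a substantively meaningful amount
iris_edited <- iris
iris_edited$Sepal.Length[1] <- iris_edited$Sepal.Length[1] + 0.1

# The dataset-level signatures should no longer match
identical(unf(iris)$unf, unf(iris_edited)$unf)
## [1] FALSE
```

A change this size is well above the default rounding threshold, so it is reflected in the signature, unlike the 1e-7 perturbation in point 2.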

Package Functionality

  • unf(): The core unf() function calculates the UNF signature for almost any R object for UNF algorithm versions 3, 4, 4.1, 5, or 6, with options to control the rounding of numeric values, truncation of character strings, and some idiosyncratic details of the UNFv5 algorithm as implemented by Dataverse. unf() is a wrapper for functions unf6(), unf5(), unf4(), and unf3(), which calculate vector-level UNF signatures.

    unf(iris)
    ## UNF6:6oVTvlCR+F1W1HTJ/QUmkA==
    
    str(unf(iris))
    ## List of 5
    ##  $ unf      : chr "6oVTvlCR+F1W1HTJ/QUmkA=="
    ##  $ hash     : raw [1:32] ea 85 53 be ...
    ##  $ unflong  : chr "6oVTvlCR+F1W1HTJ/QUmkHEAyPC4LZiHnI1s2rURxbs="
    ##  $ formatted: chr "UNF6:6oVTvlCR+F1W1HTJ/QUmkA=="
    ##  $ variables: Named chr [1:5] "FnQvOCZE9tcn64bP78wLag==" "epaV+rjvURem8qIo0r9LBQ==" "KP6tL8gFSqnG3FLJ887o/g==" "TN39UY6H/vRGv4ARWQTXrw==" ...
    ##   ..- attr(*, "names")= chr [1:5] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ...
    ##  - attr(*, "class")= chr "UNF"
    ##  - attr(*, "version")= num 6
    ##  - attr(*, "digits")= int 7
    ##  - attr(*, "characters")= int 128
    ##  - attr(*, "truncation")= int 128
    
  • %unf%: %unf% is a binary operator that can compare two R objects, or an R object against a "UNF" class summary (e.g., as stored in a study metadata record, or returned by unf()). The function tests whether the objects are identical and, if they are not, provides object- and variable-level UNF comparisons between the two objects, checks for differences in the sorting of the two objects, and (for dataframes) reports indices for rows seemingly present in one object but missing from the other, based on row-level hashes of variables common to both dataframes. This can be used both to compare two objects in general (e.g., to see whether two dataframes differ) and to debug incongruent UNFs. Two UNFs can differ dramatically due to minor changes like rounding, the deletion of an observation, or the addition of a variable, so %unf% provides a useful tool for looking under the hood at the differences between data objects that might produce different UNF signatures.

    u <- unf(iris)
    unf(iris) %unf% u
    ## Objects are identical
    ## 
    ## UNF6:6oVTvlCR+F1W1HTJ/QUmkA== 
    ## 
    ## UNF6:6oVTvlCR+F1W1HTJ/QUmkA==
    
    unf(iris) %unf% unf(iris[,1:3])
    ## Objects are not identical
    ## 
    ## UNF6:6oVTvlCR+F1W1HTJ/QUmkA== 
    ## Mismatched variables:
    ## Petal.Width: TN39UY6H/vRGv4ARWQTXrw==
    ## Species: Xqh76nYY3z8eTfmL1KfxaQ==
    ## 
    ## UNF6:lEajCAiTPXcxJuP+hr8Kew==
    
    unf(iris) %unf% head(iris[,1:3])
    ## Objects are not identical
    ## 
    ## UNF6:6oVTvlCR+F1W1HTJ/QUmkA== 
    ## Mismatched variables:
    ## Sepal.Length: FnQvOCZE9tcn64bP78wLag==
    ## Sepal.Width: epaV+rjvURem8qIo0r9LBQ==
    ## Petal.Length: KP6tL8gFSqnG3FLJ887o/g==
    ## Petal.Width: TN39UY6H/vRGv4ARWQTXrw==
    ## Species: Xqh76nYY3z8eTfmL1KfxaQ==
    ## 
    ## UNF6:0Ppu3rquJJrYvjkDePjGbA== 
    ## Mismatched variables:
    ## Sepal.Length: yMtrQJDMuxcSay0afKLz5A==
    ## Sepal.Width: e6etgUxSU/7XccLSwNzHVQ==
    ## Petal.Length: oSk42LS4+joAOdTAr9OChQ==
    
  • as.unfvector() is an S3 generic method that converts any R vector into the standardized character representation described by the UNF specification. While this functionality is primarily for internal use, it can be helpful for clarifying the difference (or lack thereof) between floating point numbers or between objects with identical meaning but different class representations that perhaps resulted from flawed data importing:

    # floating point ambiguity
    .14*10 == 1.4
    ## [1] FALSE
    
    as.unfvector(.14*10) == as.unfvector(1.4)
    ## [1] TRUE
    
    # substantively irrelevant class differences
    c(0L, 1L) == c(FALSE, TRUE)
    ## [1] TRUE TRUE
    
    as.unfvector(c(0L, 1L))
    ## [1] "+0.e+" "+1.e+"
    
    as.unfvector(c(FALSE, TRUE))
    ## [1] "+0.e+" "+1.e+"
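
The rounding option mentioned in the description of unf() can be seen at work with a small numeric example. This is a minimal sketch, assuming the UNF package is installed; the vectors `x1` and `x2` are made up for illustration, and the `digits` argument is the one used in calls such as `unf(longley, ver=4, digits=3)` elsewhere in this document.

```r
library("UNF")

x1 <- c(1.2345678, 2.3456789)
x2 <- x1 + 1e-5   # differs from x1 in the fifth decimal place

# Default rounding (7 significant digits, per the "digits" attribute above):
# the two vectors are expected to produce different signatures
identical(unf(x1)$unf, unf(x2)$unf)

# Coarser rounding: both vectors agree to 3 significant digits,
# so their signatures are expected to match
identical(unf(x1, digits = 3)$unf, unf(x2, digits = 3)$unf)
```

Because the number of digits changes the signature, a UNF is only comparable with another UNF computed at the same rounding level.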
    

Installation

UNF is on CRAN. To install the latest released version, use:

install.packages("UNF")

To install the latest development version of UNF from GitHub:

# latest (potentially unstable) version from GitHub
if (!require("remotes")) {
    install.packages("remotes")
}
remotes::install_github("leeper/UNF")

unf's Issues

More tests!

  • %unf% equivalence operator

  • print method

  • dispatch to old versions (3, 4, 4.1, 5, error otherwise)

  • raw vectors:

    Normalize bit fields by converting to big-endian form, truncating all leading empty bits, aligning to a byte boundary by padding with leading zero bits, and base64 encoding to form a character string representation.

  • dates

  • datetimes

  • difftimes

  • character encoding

Note: some of these are carried forward from #2.
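
The raw-vector normalization quoted in the list above could be tested against something like the following. This is only a sketch of that one sentence, not the package's implementation: `normalize_raw()` is a hypothetical helper, it assumes the raw vector is already in big-endian byte order, and it treats "truncate leading empty bits, then pad back to a byte boundary" as equivalent to dropping leading all-zero bytes. The handling of an all-zero input is a further assumption. The base64 step uses `base64enc::base64encode()` and is skipped if that package is not installed.

```r
# Hypothetical helper, not part of the UNF package.
# Assumes x is a raw vector already in big-endian byte order.
normalize_raw <- function(x) {
  stopifnot(is.raw(x))
  nonzero <- which(x != as.raw(0))
  if (length(nonzero) == 0L) {
    return(raw(0))  # assumption: an all-zero field normalizes to nothing
  }
  # Dropping leading zero bytes removes the leading empty bits while
  # keeping the remaining bits aligned to a byte boundary
  x[nonzero[1]:length(x)]
}

bits <- normalize_raw(as.raw(c(0x00, 0x00, 0x01, 0x02)))
bits

# Base64-encode the normalized bytes to get a character representation
if (requireNamespace("base64enc", quietly = TRUE)) {
  base64enc::base64encode(bits)
}
```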

Fix tests for R3.0.3

http://cran.r-project.org/web/checks/check_results_UNF.html

Check Details

Version: 2.0 
Check: tests 
Result: ERROR 
    Running the tests in 'tests/test-all.R' failed.
    Last 13 lines of output:

     3. Failure(@test-unf6-datetimes.R#7): Examples from v6 specification -----------
     unf6("2014-08-22T16:51:05Z") not equal to unf6(strptime("2014-08-22T16:51:05Z", "%FT%H:%M:%OSZ", tz = "UTC"), timezone = "UTC")
     Component 1: 1 string mismatch
     Component 2: 32 element mismatches
     Component 3: 1 string mismatch
     Component 4: 1 string mismatch

     4. Failure(@test-unf6-datetimes.R#16): UNFs differ by timezone -----------------
     identical(unf6(strptime("2014-08-22T16:51:05Z", "%FT%H:%M:%OSZ", tz = "UTC"), timezone = "UTC"), unf6(strptime("2014-08-22T16:51:05Z", "%FT%H:%M:%OSZ", tz = "US/Eastern"), timezone = "UTC")) isn't false

     Error: Test failures
     Execution halted 
Flavor: r-oldrel-windows-ix86+x86_64

Add datetime rounding parameter

Dataverse ingests .Rdata files with POSIXt datetime classes to three fractional second decimal places. There needs to be a rounding parameter to control this in order to match a Dataverse UNF to an R UNF.
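
One way to experiment with this before such a parameter exists is to round the datetimes before passing them on. The following is a rough sketch in base R only; `round_datetime()` is a hypothetical helper, not an existing function or argument of the UNF package, and whether rounding (rather than truncation) matches what Dataverse does is an assumption that would need checking.

```r
# Hypothetical helper: round a POSIXct vector to a fixed number of
# fractional-second decimal places (3 by default, i.e. milliseconds).
round_datetime <- function(x, digits = 3L) {
  stopifnot(inherits(x, "POSIXct"))
  secs <- round(as.numeric(x), digits)  # seconds since the epoch
  as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
}

x <- as.POSIXct("2014-08-22 16:51:05", tz = "UTC") + 0.123456
y <- round_datetime(x)

# Difference introduced by rounding, in seconds (at most half a millisecond)
as.numeric(y) - as.numeric(x)

# Note: "%OS3" truncates rather than rounds when formatting, so the last
# printed digit can be affected by floating-point representation
format(y, "%Y-%m-%dT%H:%M:%OS3Z", tz = "UTC")
```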

Inconsistent UNF values

This morning I'm working with some data that hasn't been touched since November (over 7 months ago). I'm the maintainer for this data, it lives on my personal machine, and I use UNF to validate which version of the dataset I'm working with. Today I'm getting UNF values that are inconsistent with values calculated last November. I'm getting similar inconsistencies for some of the examples in ?unf (shown below). In particular I'm getting inconsistencies for unf(longley, ver=4, digits=3) and unf(cbind.data.frame(x1,x2),ver=3) and its equivalents. The UNFs for my data were calculated using version 6.

Both calculations were done using UNF version 2.0.6 on the same machine. One potential difference is last November I was using R 3.5.1 and today I'm using R 4.0.0.

library(UNF)

# Version 6 #

### FORTHCOMING ###

# Version 5 #
## vectors

### just numerics
unf5(1:20) # UNF:5:/FIOZM/29oC3TK/IE52m2A==
#> UNF5:/FIOZM/29oC3TK/IE52m2A==
unf5(-3:3, dvn_zero = TRUE) # UNF:5:pwzm1tdPaqypPWRWDeW6Jw==
#> UNF5:pwzm1tdPaqypPWRWDeW6Jw==

### characters and factors
unf5(c('test','1','2','3')) # UNF:5:fH4NJMYkaAJ16OWMEE+zpQ==
#> UNF5:fH4NJMYkaAJ16OWMEE+zpQ==
unf5(as.factor(c('test','1','2','3'))) # UNF:5:fH4NJMYkaAJ16OWMEE+zpQ==
#> UNF5:fH4NJMYkaAJ16OWMEE+zpQ==

### logicals
unf5(c(TRUE,TRUE,FALSE), dvn_zero=TRUE)# UNF:5:DedhGlU7W6o2CBelrIZ3iw==
#> UNF5:DedhGlU7W6o2CBelrIZ3iw==

### missing values
unf5(c(1:5,NA)) # UNF:5:Msnz4m7QVvqBUWxxrE7kNQ==
#> UNF5:Msnz4m7QVvqBUWxxrE7kNQ==

## variable order and object structure is irrelevant
unf(data.frame(1:3,4:6,7:9)) # UNF:5:ukDZSJXck7fn4SlPJMPFTQ==
#> UNF6:ukDZSJXck7fn4SlPJMPFTQ==
unf(data.frame(7:9,1:3,4:6))
#> UNF6:ukDZSJXck7fn4SlPJMPFTQ==
unf(list(1:3,4:6,7:9))
#> UNF6:ukDZSJXck7fn4SlPJMPFTQ==

# Version 4 #
# version 4
data(longley)
unf(longley, ver=4, digits=3) # PjAV6/R6Kdg0urKrDVDzfMPWJrsBn5FfOdZVr9W8Ybg=
#> UNF4:3,128:KjRoxvNqv+Gkbso2DZ5N3lztfFYA02PPy8KlAByze9s=

# version 4.1
unf(longley, ver=4.1, digits=3) # 8nzEDWbNacXlv5Zypp+3YCQgMao/eNusOv/u5GmBj9I=
#> UNF4.1:3,128:8nzEDWbNacXlv5Zypp+3YCQgMao/eNusOv/u5GmBj9I=

# Version 3 #
x1 <- 1:20
x2 <- x1 + .00001

unf3(x1) # HRSmPi9QZzlIA+KwmDNP8w==
#> UNF3:M+FD+2bN2GJGqHJmhZeWig==
unf3(x2) # OhFpUw1lrpTE+csF30Ut4Q==
#> UNF3:cN+0PxPJHvbQQd5I+pLKpg==

# UNFs are identical at specified level of rounding
identical(unf3(x1), unf3(x2))
#> [1] FALSE
identical(unf3(x1, digits=5),unf3(x2, digits=5))
#> [1] TRUE

# dataframes, matrices, and lists are all treated identically:
unf(cbind.data.frame(x1,x2),ver=3) # E8+DS5SG4CSoM7j8KAkC9A==
#> UNF3:eIjrbuHf+6rWU/XD+4F7+g==
unf(list(x1,x2), ver=3)
#> UNF3:eIjrbuHf+6rWU/XD+4F7+g==
unf(cbind(x1,x2), ver=3)
#> UNF3:eIjrbuHf+6rWU/XD+4F7+g==

sessionInfo()
#> R version 4.0.0 (2020-04-24)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Catalina 10.15.5
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] UNF_2.0.6
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_4.0.0  magrittr_1.5    tools_4.0.0     htmltools_0.4.0
#>  [5] base64enc_0.1-3 yaml_2.2.1      Rcpp_1.0.4.6    stringi_1.4.6  
#>  [9] rmarkdown_2.1   highr_0.8       knitr_1.28      stringr_1.4.0  
#> [13] xfun_0.13       digest_0.6.25   rlang_0.4.6     evaluate_0.14

Created on 2020-06-27 by the reprex package (v0.3.0)
