
storr's Introduction

👋 Hi, I'm Rich

I work at the MRC Centre for Global Infectious Disease Analysis, Imperial College London, as head of RESIDE, the research software engineering group.

Most projects in my personal namespace are historical and/or personal projects that I wrote for fun (e.g. rainbowrite, rfiglet or stegasaur). Actively maintained projects here include redux and thor.

Most of my current software can be found in the mrc-ide, vimc and reside-ic organisations (among others).

(Profile photo from a particularly wet ascent of Crescent Climb, Pavey Ark in December 2019)

storr's People

Contributors

hadley, kendonb, krlmlr, mpadge, richfitz, wlandau-lilly


storr's Issues

storr$type

Implemented as function() self$driver$type()

Cascading caching backends?

I'm looking to contribute to a project that might grow towards cascading caching layers: ask the local disk first; if the local disk doesn't have it, look to a Redis cache; if the Redis cache doesn't have it, look to AWS S3; and so on. I also like the memoization features of https://github.com/HenrikBengtsson/R.cache and would like them to be (eventually) present here. I see that you are working with https://ropensci.org/; is this package going to be a part of that and acquire an open source license?
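
A minimal sketch of the cascading lookup, assuming each layer is itself a storr and that a miss raises storr's classed KeyError condition (cascade_get and the layer names are hypothetical, not storr API):

cascade_get <- function(key, layers) {
  # Try each cache layer in order, cheapest first
  for (st in layers) {
    value <- tryCatch(st$get(key), KeyError = function(e) NULL)
    if (!is.null(value)) {
      return(value)  # note: a legitimately stored NULL reads as a miss here
    }
  }
  stop("key not found in any layer: ", key)
}

# e.g. cascade_get("mykey", list(st_disk, st_redis, st_s3))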

Support pigz for compressing .rds files

If pigz is installed, use it for parallel compression:

if (Sys.which("pigz") != "") {
  con <- pipe("nice pigz > outfile.rds", open = "wb")
  saveRDS(object, con)  # `object` is whatever is being saved
  close(con)
}

One could argue this should be built into R, but for now...

Fix issues with new RSQLite

Looks like the new parametrised query interface is there; hopefully this will allow general blob storage. See also r-dbi/RSQLite#100 for previous discussion here.
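
For reference, a sketch of the kind of blob storage the parametrised interface allows (the table layout and names here are made up; assumes a recent DBI/RSQLite):

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbExecute(con, "CREATE TABLE objects (hash TEXT PRIMARY KEY, value BLOB)")
# Serialise an R object to raw bytes and bind it as a blob parameter
bytes <- serialize(mtcars, NULL)
DBI::dbExecute(con,
               "INSERT INTO objects (hash, value) VALUES (:hash, :value)",
               params = list(hash = "abc123", value = list(bytes)))
DBI::dbDisconnect(con)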

This is an automated email to let you know about the upcoming release
of RSQLite, which will be submitted to CRAN on 2016-08-05. To check
for potential problems, I ran R CMD check on your package storr
(1.0.1).

I found: 0 errors | 1 warning | 0 notes.

checking re-building of vignette outputs ... WARNING
Error in re-building vignettes:
  ...
Quitting from lines 101-109 (drivers.Rmd)
Error: processing vignette 'drivers.Rmd' failed with diagnostics:
Please use dbGetQuery instead
Execution halted

If I got an ERROR because I couldn't install your package (or one of
its dependencies), my apologies. You'll have to run the checks
yourself (unfortunately I don't have the time to diagnose installation
failures).

Regressions may arise due to changes in the public API. In particular,
the dbGetQuery() and summary() methods are not exported anymore.
Instead, use DBI::dbGetQuery() and show(), respectively.
Furthermore, dbSendPreparedQuery(), dbGetPreparedQuery() and
dbListResults() have been deprecated, these functions now raise an
error. Use dbBind() for parametrized queries, listing the results of
a connection is not supported anymore.

Otherwise, please carefully look at the results, and let me know if
I've introduced a bug in RSQLite.

To get the development version of RSQLite so you can run the checks
yourself, you can run:

# install.packages("devtools")
devtools::install_github("rstats-db/RSQLite")

To see what's changed visit
https://github.com/rstats-db/RSQLite/blob/master/NEWS.md.

If you have any questions about this email, please feel free to
respond directly.

best practice for multi-argument hook

Is there a way to do something like the following?

ff <- function(key, namespace, some_other_data) {
   # some calculation that depends on some_other_data
}

st <- storr_external(driver_rds("data/foo"), ff)

In normal code, I could put some_other_data inside the ff function, but in this particular case some_other_data is a target from a remake project that takes a long time to compute.
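
One workaround, assuming only storr_external's documented two-argument hook, is to bind the extra data with a closure (make_hook and compute_with are hypothetical names, not storr API):

make_hook <- function(some_other_data) {
  # Return a two-argument hook that closes over the extra data
  function(key, namespace) {
    compute_with(key, namespace, some_other_data)  # hypothetical computation
  }
}
st <- storr_external(driver_rds("data/foo"), make_hook(some_other_data))

This still requires some_other_data to exist when the storr is created, which is exactly the constraint being asked about.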

mget/mset/mdel support

These can be supported natively by Redis and by DBI; simulate support in the other drivers so that the interface stays the same.

This will reduce the number of individual calls to remote resources.
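
In the meantime, a loop-based fallback could keep the interface uniform for drivers without native multi-key commands (a sketch using only the existing get method; it preserves the API but does not reduce round trips):

mget_fallback <- function(st, keys, namespace = st$default_namespace) {
  # Emulate mget with repeated single-key gets
  lapply(keys, function(k) st$get(k, namespace))
}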

Option in redis driver to store in hashes

Rather than have the key/namespace division be a prefix/key, use redis hashes instead. This should be a fairly straightforward change, and could make things a little faster?
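
A sketch of the two layouts via redux (the exact prefix scheme is my guess at storr's current layout; SET/HSET are standard Redis commands):

r <- redux::hiredis()
# Current layout (roughly): namespace baked into a flat, prefixed key
r$SET("storr:keys:objects:mykey", "somehash")
r$GET("storr:keys:objects:mykey")
# Proposed layout: one Redis hash per namespace, keys as fields
r$HSET("storr:keys:objects", "mykey", "somehash")
r$HGET("storr:keys:objects", "mykey")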

storr does not install (R6 update clash?)

With devtools::install_github(c("richfitz/storr")), I get

* installing *source* package 'storr' ...
** R
** inst
** tests
** preparing package for lazy loading
R6Class driver_external: 'lock' argument has been renamed to 'lock_objects' as of version 2.1. This code will continue to work, but the 'lock' option will be removed in a later version of R6
Error in R6::R6Class("storr_mangled", public = storr_mangled_methods()) : 
  Cannot add a member with reserved name 'clone'.
Error : unable to load R code in package 'storr'
ERROR: lazy loading failed for package 'storr'

Avoid calling exists on key lookup for drivers that can throw reliably

Drivers should declare whether they promise to throw an error on nonexistent key retrieval.

    get=function(key, namespace=self$default_namespace, use_cache=TRUE) {
      self$get_value(self$get_hash(key, namespace), use_cache)
    },

    get_hash=function(key, namespace=self$default_namespace) {
      if (self$exists(key, namespace)) {
        self$driver$get_hash(key, namespace)
      } else {
        stop(KeyError(key, namespace))
      }
    },

    get_value=function(hash, use_cache=TRUE) {
      envir <- self$envir
      if (use_cache && exists0(hash, envir)) {
        value <- envir[[hash]]
      } else {
        if (!self$driver$exists_hash(hash)) {
          stop(HashError(hash))
        }
        value <- self$driver$get_object(hash)
        if (use_cache) {
          envir[[hash]] <- value
        }
      }
      value
    },

If driver$get_hash(key, namespace) throws an error when the key is not there (or, alternatively, simply returns NULL), then we can avoid the self$exists(key, namespace) call. That can be tested for in the inst/spec tests too.

Then the second exists_hash could be removed if the driver promises to throw an error (through some sort of capability reporting thing).

if (self$driver$capabilities$get_object_throws) {
  value <- tryCatch(self$driver$get_object(hash),
                    error = function(e) stop(HashError(hash)))
} else if (self$driver$exists_hash(hash)) {
  value <- self$driver$get_object(hash)
} else {
  stop(HashError(hash))
}

auto-test failures not caught

...which triggers the partial match warnings on my computers.

  • fix the unqualified testthat functions within inst/spec
  • check that the reporter had no failures and throw appropriately
  • in the makefile, turn off local configuration.

OS X CRAN check errors

See the CRAN checks page:

Running ‘testthat.R’ [4s/4s]
Running the tests in ‘tests/testthat.R’ failed.
Complete output:
> library(testthat)
> library(storr)
>
> test_check("storr")
Error in redis_connect_tcp(config$host, config$port) :
Failed to create context: Connection refused
Calls: test_check ... redis_connection -> redis_connect -> redis_connect_tcp -> .Call
In addition: Warning message:
call dbDisconnect() when finished working with a connection
testthat results ================================================================
OK: 634 SKIPPED: 0 FAILED: 0
Execution halted 

Support expiring keys

Redis does lazy key expiry (i.e., it checks the expiry time on read). That could be done as part of the key -> hash lookup, returning NULL if the key has expired. It'll add a little extra work, but would be nice for the "local cache" pattern, especially when using the external driver.

Might be best to make this optional at the storr level (like key mangling and the default namespace), but I'm not sure.
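
A sketch of how lazy expiry could slot into the key -> hash lookup (get_expiry, the delete-on-read, and the NULL convention are all hypothetical, not existing driver API):

get_hash_lazy <- function(driver, key, namespace) {
  # Redis-style lazy expiry: check the clock only when the key is read
  expires <- driver$get_expiry(key, namespace)  # hypothetical driver method
  if (!is.null(expires) && Sys.time() > expires) {
    driver$del_hash(key, namespace)  # drop the stale key
    return(NULL)                     # treat as missing
  }
  driver$get_hash(key, namespace)
}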

To pad or not to pad in encode64()

The base64url package forgoes padding in base64_urlencode() because "=" is sometimes unsafe in urls/files. This seems like an extremely small concern, but I am wondering if there is a weird edge case where we will need safer storr key filenames.

For parallelRemake and drake, I cannot pad because "=" messes with Makefile rules (drake issue 19).
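
For concreteness, the difference on a 4-byte key (padded form via jsonlite, unpadded via base64url as cited above):

# Standard base64 pads with "=" when the input length is not a multiple of 3
jsonlite::base64_enc(charToRaw("data"))
## [1] "ZGF0YQ=="
# base64url drops the padding (and uses URL- and file-safe characters)
base64url::base64_urlencode("data")
## [1] "ZGF0YQ"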

setx

I can see uses for this, but perhaps it's too nasty to get right.

A redis "plaintext" driver for storing plain text

Saves the overhead (space and time) of serialising when all that is being stored is plain text.

This would be useful for the file cache in rrqueue, though that may move into here at some point.
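
To illustrate the space side of that overhead (exact numbers vary by R version):

x <- "plain text payload"
length(serialize(x, NULL))  # serialised R object: payload plus header bytes
nchar(x, type = "bytes")    # the raw text itself: 18 bytes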

Possible CRAN violation - writing to disk

See here r-lib/rappdirs#2 (comment)

Components using rappdirs might need to get permission to write to disk. Thoughts @wcornwell, @dfalster - this will be an issue for baad.data and taxonlookup.

Relevant section of CRAN policy is:

  • Packages should not write in the users’ home filespace, nor anywhere else on the file system apart from the R session’s temporary directory (or during installation in the location pointed to by TMPDIR: and such usage should be cleaned up). Installing into the system’s R installation (e.g., scripts to its bin directory) is not allowed.
    Limited exceptions may be allowed in interactive sessions if the package obtains confirmation from the user.

Which is frankly insane (especially given that we have functions in R that write to disk, like pdf).

file caching

A bit different to file caching. I had aspects of this in remake (caching fingerprints only) and in rrqueue (caching file contents), so there are a few places this can go.

storr-holder

Gabor uses this in httrmock and I use this in at least one other project; where a storr is used for side effects in a package, it might be useful to allow creation/retrieval of the storage itself.
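
One common shape for this, sketched under the assumption that the package wants a single shared storr (the holder function and path here are my own, not from httrmock):

storage <- local({
  st <- NULL
  function() {
    # Create the package's storr lazily, then hand out the same instance
    if (is.null(st)) {
      st <<- storr::storr_rds(rappdirs::user_data_dir("mypackage"))
    }
    st
  }
})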

Depends on development RSQLite

This blocks a CRAN release. Depending on timing, either

  • add a compatibility layer for driver_dbi$set_object. This requires using a function that will possibly be hard-deprecated in the future so this needs some care
  • pull out the compatibility layer from the vignette, and simplify the binary support detection

storr@refactor

devtools::install_github("richfitz/storr@refactor")

Isn't working anymore for some reason; it leads to this error:

Downloading GitHub repo richfitz/storr@refactor
from URL https://api.github.com/repos/richfitz/storr/zipball/refactor
No encoding supplied: defaulting to UTF-8.
Error: lexical error: invalid char in json text.
                                       Not Found
                     (right here) ------^

This means that the taxonlookup and baad installation instructions don't work...

slugs 🐌

Related to a conversation in httrmock, which uses storr to store recorded HTTP requests (r-lib/httrmock#3).

Have you ever contemplated a naming scheme that would allow the incorporation of an optional slug? Perhaps as a prefix.

cc @gaborcsardi

Running out of memory writing to cache with storr_rds

Hi @richfitz, thanks so much for this package, it's coming in really handy. I have an issue, though, with a large file (a 3 GB CSV with 55 columns and 10 million rows): storr runs out of memory writing it to the cache. I'm running Windows 7 64-bit with 24 GB of memory.

download.file("https://pub.data.gov.bc.ca/datasets/949f2233-9612-4b06-92a9-903e817da659/ems_sample_results_historic_expanded.zip", destfile = "test.zip")
csv_file <- unzip("test.zip")
file.rename(csv_file, "test_storr.csv")
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252   
## [3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C                   
## [5] LC_TIME=English_Canada.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5       formatR_1.4        tools_3.3.1       
##  [4] htmltools_0.3.5    yaml_2.1.13        Rcpp_0.12.6       
##  [7] stringi_1.1.1      rmarkdown_1.0.9001 knitr_1.13        
## [10] stringr_1.0.0      digest_0.6.9       evaluate_0.9
library(storr)
packageVersion("storr")
## [1] '1.0.1'
library(readr)

dat <- read_csv("test_storr.csv")
dim(dat)
## [1] 10518573       55
object.size(dat)
## 4644497248 bytes
cache <- storr::storr_rds(tempfile())
cache$set("my_data", dat)
## Warning in writeBin(value[seq(i, min(j, len))], con): Reached total
## allocation of 24371Mb: see help(memory.size)

## Warning in writeBin(value[seq(i, min(j, len))], con): Reached total
## allocation of 24371Mb: see help(memory.size)

## Warning in writeBin(value[seq(i, min(j, len))], con): Reached total
## allocation of 24371Mb: see help(memory.size)

## Warning in writeBin(value[seq(i, min(j, len))], con): Reached total
## allocation of 24371Mb: see help(memory.size)

## Error: cannot allocate vector of size 8.0 Gb
cache$list()
## character(0)
cache$destroy()

I tested just writing the file with saveRDS and it worked just fine, so I dug through the source code of storr and saw that you are using writeBin internally instead of saveRDS. I also noticed that storr used to use saveRDS until this commit. So I installed from GitHub at the commit before that, and it worked:

devtools::install_github("richfitz/storr", 
                         ref = "73505843261c23d092cff82613a0e8c5bd6f9d1f", quiet = TRUE)
library(storr)
packageVersion("storr")
## [1] '0.6.0'
library(readr)

dat <- read_csv("test_storr.csv")
dim(dat)
## [1] 10518573       55
object.size(dat)
## 4644497248 bytes
cache <- storr::storr_rds(tempfile())
cache$set("my_data", dat)
cache$list()
## [1] "my_data"

Don't use in-place file writes

Use case: I thought I could back up a ~30 GB storr from a remake project, to be able to restore it later (richfitz/remake#136). I used cp -lr .remake .remake.bak, which creates hardlinks to the files recursively. After trying something out and then restoring with rm -r .remake; mv .remake.bak .remake, I'm now stuck with an inconsistent storr: because storr writes files in place, the writes went through the hardlinks and modified the "backup" as well...
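
The usual fix is to write each object to a temporary file in the same directory and rename it into place, since the rename replaces the directory entry rather than writing through the shared hardlink (a sketch; write_atomic is a hypothetical helper, not storr API):

write_atomic <- function(value, dest) {
  tmp <- tempfile(tmpdir = dirname(dest))  # same filesystem, so rename is cheap
  on.exit(unlink(tmp))                     # clean up the temp file on failure
  saveRDS(value, tmp)
  file.rename(tmp, dest)                   # swaps the link, not the data
}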
