
storr's Introduction

👋 Hi, I'm Rich

I work at the MRC Centre for Global Infectious Disease Analysis, Imperial College London, as head of RESIDE, the research software engineering group.

Most projects in my personal namespace are historical and/or personal projects that I wrote for fun (e.g. rainbowrite, rfiglet or stegasaur). Actively maintained projects here include redux and thor.

Most of my current software can be found in the mrc-ide, vimc and reside-ic organisations (among others).

(Profile photo from a particularly wet ascent of Crescent Climb, Pavey Ark in December 2019)

storr's People

Contributors

hadley, kendonb, krlmlr, mpadge, richfitz, wlandau-lilly


storr's Issues

storr$type

Implemented as function() self$driver$type()

Cascading caching backends?

I'm looking to contribute to a project that might grow towards cascading caching layers: ask the local disk first; if the local disk doesn't have it, look to a Redis cache; if the Redis cache doesn't have it, look to AWS S3; and so on. I also like the memoization features of https://github.com/HenrikBengtsson/R.cache and would like them to be (eventually) present here. I see that you are working with https://ropensci.org/; is this package going to be a part of that and acquire an open source license?
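
A minimal sketch of the cascading lookup, assuming each layer is itself a storr and that a miss raises storr's classed KeyError condition (cascade_get and the layer names are hypothetical, not storr API):

cascade_get <- function(key, layers) {
  # Try each cache layer in order, cheapest first
  for (st in layers) {
    value <- tryCatch(st$get(key), KeyError = function(e) NULL)
    if (!is.null(value)) {
      return(value)  # note: a legitimately stored NULL reads as a miss here
    }
  }
  stop("key not found in any layer: ", key)
}

# e.g. cascade_get("mykey", list(st_disk, st_redis, st_s3))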

Support pigz for compressing .rds files

If pigz is installed, use it for parallel compression:

if (Sys.which("pigz") != "") {
  con <- pipe("nice pigz > outfile.rds", open = "wb")
  saveRDS(object, con)  # `object` is whatever is being saved
  close(con)
}

One could argue this should be built into R, but for now...

Fix issues with new RSQLite

Looks like the new parametrised query interface is there; hopefully this will allow general blob storage. See also r-dbi/RSQLite#100 for previous discussion here.
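
For reference, a sketch of the kind of blob storage the parametrised interface allows (the table layout and names here are made up; assumes a recent DBI/RSQLite):

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbExecute(con, "CREATE TABLE objects (hash TEXT PRIMARY KEY, value BLOB)")
# Serialise an R object to raw bytes and bind it as a blob parameter
bytes <- serialize(mtcars, NULL)
DBI::dbExecute(con,
               "INSERT INTO objects (hash, value) VALUES (:hash, :value)",
               params = list(hash = "abc123", value = list(bytes)))
DBI::dbDisconnect(con)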

This is an automated email to let you know about the upcoming release
of RSQLite, which will be submitted to CRAN on 2016-08-05. To check
for potential problems, I ran R CMD check on your package storr
(1.0.1).

I found: 0 errors | 1 warning | 0 notes.

checking re-building of vignette outputs ... WARNING
Error in re-building vignettes:
  ...
Quitting from lines 101-109 (drivers.Rmd)
Error: processing vignette 'drivers.Rmd' failed with diagnostics:
Please use dbGetQuery instead
Execution halted

If I got an ERROR because I couldn't install your package (or one of
its dependencies), my apologies. You'll have to run the checks
yourself (unfortunately I don't have the time to diagnose installation
failures).

Regressions may arise due to changes in the public API. In particular,
the dbGetQuery() and summary() methods are not exported anymore.
Instead, use DBI::dbGetQuery() and show(), respectively.
Furthermore, dbSendPreparedQuery(), dbGetPreparedQuery() and
dbListResults() have been deprecated, these functions now raise an
error. Use dbBind() for parametrized queries, listing the results of
a connection is not supported anymore.

Otherwise, please carefully look at the results, and let me know if
I've introduced a bug in RSQLite.

To get the development version of RSQLite so you can run the checks
yourself, you can run:

# install.packages("devtools")
devtools::install_github("rstats-db/RSQLite")

To see what's changed visit
https://github.com/rstats-db/RSQLite/blob/master/NEWS.md.

If you have any questions about this email, please feel free to
respond directly.

best practice for multi-argument hook

Is there a way to do something like the following?

ff <- function(key, namespace, some_other_data) {
   # some calculation that depends on some_other_data
}

st <- storr_external(driver_rds("data/foo"), ff)

In normal code, I could put some_other_data inside the ff function, but in this particular case some_other_data is a target from a remake project that takes a long time to compute.
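
One workaround, assuming only storr_external's documented two-argument hook, is to bind the extra data with a closure (make_hook and compute_with are hypothetical names, not storr API):

make_hook <- function(some_other_data) {
  # Return a two-argument hook that closes over the extra data
  function(key, namespace) {
    compute_with(key, namespace, some_other_data)  # hypothetical computation
  }
}
st <- storr_external(driver_rds("data/foo"), make_hook(some_other_data))

This still requires some_other_data to exist when the storr is created, which is exactly the constraint being asked about.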

mget/mset/mdel support

These can be supported natively by Redis and by DBI; simulate support in the other drivers so that the interface stays the same.

This will reduce the number of individual calls to remote resources.
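
In the meantime, a loop-based fallback could keep the interface uniform for drivers without native multi-key commands (a sketch using only the existing get method; it preserves the API but does not reduce round trips):

mget_fallback <- function(st, keys, namespace = st$default_namespace) {
  # Emulate mget with repeated single-key gets
  lapply(keys, function(k) st$get(k, namespace))
}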

Option in redis driver to store in hashes

Rather than have the key/namespace division be a prefix/key, use redis hashes instead. This should be a fairly straightforward change, and could make things a little faster?
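
A sketch of the two layouts via redux (the exact prefix scheme is my guess at storr's current layout; SET/HSET are standard Redis commands):

r <- redux::hiredis()
# Current layout (roughly): namespace baked into a flat, prefixed key
r$SET("storr:keys:objects:mykey", "somehash")
r$GET("storr:keys:objects:mykey")
# Proposed layout: one Redis hash per namespace, keys as fields
r$HSET("storr:keys:objects", "mykey", "somehash")
r$HGET("storr:keys:objects", "mykey")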

storr does not install (R6 update clash?)

With devtools::install_github(c("richfitz/storr")), I get

* installing *source* package 'storr' ...
** R
** inst
** tests
** preparing package for lazy loading
R6Class driver_external: 'lock' argument has been renamed to 'lock_objects' as of version 2.1. This code will continue to work, but the 'lock' option will be removed in a later version of R6
Error in R6::R6Class("storr_mangled", public = storr_mangled_methods()) : 
  Cannot add a member with reserved name 'clone'.
Error : unable to load R code in package 'storr'
ERROR: lazy loading failed for package 'storr'

Avoid calling exists on key lookup for drivers that can throw reliably

Drivers should declare whether they promise to throw an error on nonexistent key retrieval.

    get=function(key, namespace=self$default_namespace, use_cache=TRUE) {
      self$get_value(self$get_hash(key, namespace), use_cache)
    },

    get_hash=function(key, namespace=self$default_namespace) {
      if (self$exists(key, namespace)) {
        self$driver$get_hash(key, namespace)
      } else {
        stop(KeyError(key, namespace))
      }
    },

    get_value=function(hash, use_cache=TRUE) {
      envir <- self$envir
      if (use_cache && exists0(hash, envir)) {
        value <- envir[[hash]]
      } else {
        if (!self$driver$exists_hash(hash)) {
          stop(HashError(hash))
        }
        value <- self$driver$get_object(hash)
        if (use_cache) {
          envir[[hash]] <- value
        }
      }
      value
    },

If driver$get_hash(key, namespace) throws an error when the key is not there (or, alternatively, simply returns NULL), then we can avoid the self$exists(key, namespace) call. That can be tested for in the inst/spec tests too.

Then the second exists_hash could be removed if the driver promises to throw an error (through some sort of capability reporting thing).

if (self$driver$capabilities$get_object_throws) {
  value <- tryCatch(self$driver$get_object(hash),
                    error = function(e) stop(HashError(hash)))
} else if (self$driver$exists_hash(hash)) {
  value <- self$driver$get_object(hash)
} else {
  stop(HashError(hash))
}

auto-test failures not caught

...which triggers the partial match warnings on my computers.

  • fix the unqualified testthat functions within inst/spec
  • check that the reporter had no failures and throw appropriately
  • in the makefile, turn off local configuration.

OS X CRAN check errors

See the CRAN checks page:

Running ‘testthat.R’ [4s/4s]
Running the tests in ‘tests/testthat.R’ failed.
Complete output:
> library(testthat)
> library(storr)
>
> test_check("storr")
Error in redis_connect_tcp(config$host, config$port) :
Failed to create context: Connection refused
Calls: test_check ... redis_connection -> redis_connect -> redis_connect_tcp -> .Call
In addition: Warning message:
call dbDisconnect() when finished working with a connection
testthat results ================================================================
OK: 634 SKIPPED: 0 FAILED: 0
Execution halted 

Support expiring keys

Redis does lazy key expiry (i.e., it checks the expiry time on read). That could be done as part of the key -> hash lookup, returning NULL if the key has expired. It'll add a little extra work, but would be nice for the "local cache" pattern, especially when using the external driver.

Might be best to make this optional at the storr level (like key mangling and the default namespace), but I'm not sure.
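
A sketch of how lazy expiry could slot into the key -> hash lookup (get_expiry, the delete-on-read, and the NULL convention are all hypothetical, not existing driver API):

get_hash_lazy <- function(driver, key, namespace) {
  # Redis-style lazy expiry: check the clock only when the key is read
  expires <- driver$get_expiry(key, namespace)  # hypothetical driver method
  if (!is.null(expires) && Sys.time() > expires) {
    driver$del_hash(key, namespace)  # drop the stale key
    return(NULL)                     # treat as missing
  }
  driver$get_hash(key, namespace)
}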

To pad or not to pad in encode64()

The base64url package forgoes padding in base64_urlencode() because "=" is sometimes unsafe in urls/files. This seems like an extremely small concern, but I am wondering if there is a weird edge case where we will need safer storr key filenames.

For parallelRemake and drake, I cannot pad because "=" messes with Makefile rules (drake issue 19).
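
For concreteness, the difference on a 4-byte key (padded form via jsonlite, unpadded via base64url as cited above):

# Standard base64 pads with "=" when the input length is not a multiple of 3
jsonlite::base64_enc(charToRaw("data"))
## [1] "ZGF0YQ=="
# base64url drops the padding (and uses URL- and file-safe characters)
base64url::base64_urlencode("data")
## [1] "ZGF0YQ"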

setx

I can see uses for this, but perhaps it's too nasty to get right.

A redis "plaintext" driver for storing plain text

Saves the overhead (space and time) of serialising when all that is being stored is plain text.

This would be useful for the file cache in rrqueue, though that may move into here at some point.
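
To illustrate the space side of that overhead (exact numbers vary by R version):

x <- "plain text payload"
length(serialize(x, NULL))  # serialised R object: payload plus header bytes
nchar(x, type = "bytes")    # the raw text itself: 18 bytes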

Possible CRAN violation - writing to disk

See here r-lib/rappdirs#2 (comment)

Components using rappdirs might need to get permission to write to disk. Thoughts @wcornwell, @dfalster - this will be an issue for baad.data and taxonlookup.

Relevant section of CRAN policy is:

  • Packages should not write in the users’ home filespace, nor anywhere else on the file system apart from the R session’s temporary directory (or during installation in the location pointed to by TMPDIR: and such usage should be cleaned up). Installing into the system’s R installation (e.g., scripts to its bin directory) is not allowed.
    Limited exceptions may be allowed in interactive sessions if the package obtains confirmation from the user.

Which is frankly insane (especially given that we have functions in R that write to disk, like pdf).

file caching

A bit different to file caching. I had aspects of this in remake (caching fingerprints only) and in rrqueue (caching file contents), so there are a few places this can go.

storr-holder

Gabor uses this in httrmock and I use this in at least one other project; where a storr is used for side effects in a package, it might be useful to allow creation/retrieval of the storage itself.
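
One common shape for this, sketched under the assumption that the package wants a single shared storr (the holder function and path here are my own, not from httrmock):

storage <- local({
  st <- NULL
  function() {
    # Create the package's storr lazily, then hand out the same instance
    if (is.null(st)) {
      st <<- storr::storr_rds(rappdirs::user_data_dir("mypackage"))
    }
    st
  }
})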

Depends on development RSQLite

This blocks a CRAN release. Depending on timing, either

  • add a compatibility layer for driver_dbi$set_object. This requires using a function that will possibly be hard-deprecated in the future so this needs some care
  • pull out the compatibility layer from the vignette, and simplify the binary support detection

storr@refactor

devtools::install_github("richfitz/storr@refactor")

Isn't working anymore for some reason; it leads to this error:

Downloading GitHub repo richfitz/storr@refactor
from URL https://api.github.com/repos/richfitz/storr/zipball/refactor
No encoding supplied: defaulting to UTF-8.
Error: lexical error: invalid char in json text.
                                       Not Found
                     (right here) ------^

This means that the taxonlookup and baad installation instructions don't work...

slugs 🐌

Related to a conversation in httrmock, which uses storr to store recorded HTTP requests (r-lib/httrmock#3).

Have you ever contemplated a naming scheme that would allow the incorporation of an optional slug? Perhaps as a prefix.

cc @gaborcsardi

Running out of memory writing to cache with storr_rds

Hi @richfitz, thanks so much for this package, it's coming in really handy. I have an issue, though, with a large file (a 3 GB CSV with 55 columns and 10 million rows): storr runs out of memory writing it to the cache. I'm running Windows 7 64-bit with 24 GB of memory.

download.file("https://pub.data.gov.bc.ca/datasets/949f2233-9612-4b06-92a9-903e817da659/ems_sample_results_historic_expanded.zip", destfile = "test.zip")
csv_file <- unzip("test.zip")
file.rename(csv_file, "test_storr.csv")
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252   
## [3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C                   
## [5] LC_TIME=English_Canada.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5       formatR_1.4        tools_3.3.1       
##  [4] htmltools_0.3.5    yaml_2.1.13        Rcpp_0.12.6       
##  [7] stringi_1.1.1      rmarkdown_1.0.9001 knitr_1.13        
## [10] stringr_1.0.0      digest_0.6.9       evaluate_0.9
library(storr)
packageVersion("storr")
## [1] '1.0.1'
library(readr)

dat <- read_csv("test_storr.csv")
dim(dat)
## [1] 10518573       55
object.size(dat)
## 4644497248 bytes
cache <- storr::storr_rds(tempfile())
cache$set("my_data", dat)
## Warning in writeBin(value[seq(i, min(j, len))], con): Reached total
## allocation of 24371Mb: see help(memory.size)

## Warning in writeBin(value[seq(i, min(j, len))], con): Reached total
## allocation of 24371Mb: see help(memory.size)

## Warning in writeBin(value[seq(i, min(j, len))], con): Reached total
## allocation of 24371Mb: see help(memory.size)

## Warning in writeBin(value[seq(i, min(j, len))], con): Reached total
## allocation of 24371Mb: see help(memory.size)

## Error: cannot allocate vector of size 8.0 Gb
cache$list()
## character(0)
cache$destroy()

I tested just writing the file with saveRDS and it worked just fine, so I dug through the source code of storr and saw that you are using writeBin internally instead of saveRDS. I also noticed that storr used to use saveRDS until this commit. So I installed from GitHub at the commit before that, and it worked:

devtools::install_github("richfitz/storr", 
                         ref = "73505843261c23d092cff82613a0e8c5bd6f9d1f", quiet = TRUE)
library(storr)
packageVersion("storr")
## [1] '0.6.0'
library(readr)

dat <- read_csv("test_storr.csv")
dim(dat)
## [1] 10518573       55
object.size(dat)
## 4644497248 bytes
cache <- storr::storr_rds(tempfile())
cache$set("my_data", dat)
cache$list()
## [1] "my_data"

Don't use in-place file writes

Use case: I thought I could back up a ~30 GB storr from a remake project, to be able to restore it later (richfitz/remake#136). I used cp -lr .remake .remake.bak, which creates hardlinks to the files recursively. After trying something out and then restoring with rm -r .remake; mv .remake.bak .remake, I'm now stuck with an inconsistent storr: because storr writes files in place, the writes went through the hardlinks and modified the "backup" as well...
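
The usual fix is to write each object to a temporary file in the same directory and rename it into place, since the rename replaces the directory entry rather than writing through the shared hardlink (a sketch; write_atomic is a hypothetical helper, not storr API):

write_atomic <- function(value, dest) {
  tmp <- tempfile(tmpdir = dirname(dest))  # same filesystem, so rename is cheap
  on.exit(unlink(tmp))                     # clean up the temp file on failure
  saveRDS(value, tmp)
  file.rename(tmp, dest)                   # swaps the link, not the data
}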
