Comments (14)
This is done through the "external" driver at present. There's no real documentation to speak of, but there's an example in the tests, and we use it here to store things in GitHub releases (caching to disk).
Memoisation is pretty easy to build on top of this; see, for example, here. Once things settle down (see below) I'll probably add a proper interface. I have held off, though, because it seems like fairly well-trodden ground.
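The memoisation idea is language-agnostic; here is a minimal Python sketch of building it on nothing more than exists/get/set (the thread concerns an R package, and `DictStore`/`memoise` here are hypothetical names for illustration, not storr's actual API):

```python
import hashlib
import pickle

class DictStore:
    """Stand-in for any backend offering exists/get/set."""
    def __init__(self):
        self._data = {}
    def exists(self, key):
        return key in self._data
    def get(self, key):
        return self._data[key]
    def set(self, key, value):
        self._data[key] = value

def memoise(store, fn):
    """Cache fn's results under a hash of its pickled arguments."""
    def wrapper(*args):
        key = hashlib.sha256(pickle.dumps((fn.__name__, args))).hexdigest()
        if store.exists(key):
            return store.get(key)
        value = fn(*args)
        store.set(key, value)
        return value
    return wrapper

store = DictStore()
calls = []

def square(x):
    calls.append(x)  # record how often the real function runs
    return x * x

square = memoise(store, square)
print(square(4))  # computed: 16
print(square(4))  # served from the store; square ran only once
```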
I've not tested cascading this past a lookup/storage driver, as doing a 3-step cascade would require write access to the intermediate storage.
The package already has an open source licence; it is BSD 2-clause licenced. Yes, this package will probably end up on rOpenSci eventually, as I use it in a number of other packages that are likely to end up there.
Be warned: I'm in the process of refactoring this package to get things off to CRAN, to make it easier to add things into it, and to separate off some of the GitHub stuff. The timeline is the next ~2 weeks, I hope.
from storr.
OK, I've done a first pass at cleaning the package up. Documentation for building drivers is not really there because I think that the internals might get overhauled more completely (I've started on that offline to see where it goes).
What would be useful is to know:
- what additional backends you have a need for (and/or could provide)
- aside from get support, whether the full cascade is necessary
The current implementation is complicated somewhat by my desire to have indexable serialisation (i.e., serialise a list and access it in pieces rather than just all at once). See here for details. But that could be made opt-in and then the interface simplifies.
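For illustration, here is a hypothetical Python sketch of the indexable-serialisation idea: serialise each element separately under a derived key, plus an index entry, so one piece can be fetched without deserialising the whole list (this is not storr's actual implementation):

```python
import pickle

store = {}  # stand-in backend mapping keys to serialised blobs

def set_indexed(key, items):
    """Serialise each element separately and record a length/index entry."""
    store[key] = pickle.dumps(len(items))
    for i, item in enumerate(items):
        store[f"{key}:{i}"] = pickle.dumps(item)

def get_element(key, i):
    """Fetch one element without touching the others."""
    n = pickle.loads(store[key])
    if not 0 <= i < n:
        raise IndexError(i)
    return pickle.loads(store[f"{key}:{i}"])

set_indexed("results", ["a", "b", "c"])
print(get_element("results", 1))  # -> b
```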
The list of things that backends need to provide would still be quite large:
- get/set/del/exists support for both binary and text objects (can be the same and are for all but Redis at present)
- key and hash mangling for turning keys into something that the underlying driver can support.
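As a rough illustration of that contract, a hypothetical Python sketch (class and function names are made up, not storr's) combining get/set/del/exists with key mangling might look like:

```python
import hashlib
import string

SAFE = set(string.ascii_letters + string.digits + "_-")

def mangle_key(key):
    """Turn an arbitrary key into something a picky backend (e.g. a
    filesystem) can store: keep safe names, hash the rest."""
    if all(c in SAFE for c in key):
        return key
    return "h_" + hashlib.sha256(key.encode()).hexdigest()

class Driver:
    """Minimal contract sketched from the list above: get/set/del/exists
    for both text and binary payloads (here backed by one dict).
    'del' is a reserved word in Python, so the method is named delete."""
    def __init__(self):
        self._data = {}
    def exists(self, key):
        return mangle_key(key) in self._data
    def get(self, key):
        return self._data[mangle_key(key)]
    def set(self, key, value):
        self._data[mangle_key(key)] = value
    def delete(self, key):
        self._data.pop(mangle_key(key), None)

d = Driver()
d.set("plain_key", b"\x00\x01")        # binary payload
d.set("weird key/with:chars", "text")  # key gets hash-mangled
print(d.exists("weird key/with:chars"))  # True
```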
from storr.
Short Version:
The only back ends that I'm not noticing in your current docs are a) an AWS S3 cache and b) a remote machine's storage. I think for both of these get/set/del/exists is a providable feature set in principle. I don't see any problem using a digest-derived hash for creating keys/files in either use case.
I think a cache on a remote machine's storage is probably trivial with what you already have. In addition, for reasons I describe ad nauseam below, I am unable in the short to medium term to provide an S3 caching engine that can be supported on CRAN.
Provided a layer has the basics you describe (get/set/del/exists), a basic cascading cache should be possible using storr caches without needing to change anything integral to storr. Therefore, I don't think there is any compelling reason to embed the cascading cache inside an existing caching package; instead, a cascading cache could simply suggest and wrap external projects. Most of the features I imagine can be handled via metadata at the point of insertion into and retrieval from the cache, and therefore don't need additional engine features per se - but if they are handled by the engine, then it makes sense to expose them:
- List of all currently valid keys
- Storage time of key
- Last access time of key
- Key expiration time
- (only the engine can provide) Information regarding the storage size used by each key
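Most of those metadata items can indeed be tracked by a thin wrapper at set/get time, as suggested above. A hypothetical Python sketch (not any package's real API):

```python
import sys
import time

class MetaCache:
    """Wrap a plain dict store and record, per key: storage time,
    last access time, optional expiry, and approximate stored size."""
    def __init__(self):
        self._data = {}
        self._meta = {}
    def set(self, key, value, ttl=None):
        now = time.time()
        self._data[key] = value
        self._meta[key] = {
            "stored": now,
            "accessed": now,
            "expires": None if ttl is None else now + ttl,
            "size": sys.getsizeof(value),  # engine-level sizes would be exact
        }
    def get(self, key):
        meta = self._meta[key]
        if meta["expires"] is not None and time.time() > meta["expires"]:
            raise KeyError(f"{key} expired")
        meta["accessed"] = time.time()
        return self._data[key]
    def keys(self):
        return list(self._data)

c = MetaCache()
c.set("a", "value", ttl=60)
print(c.keys())   # list of all currently valid keys
print(c.get("a"))
```

Only the true storage size is out of reach for a wrapper like this, which is why the last item in the list is flagged as engine-only.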
I probably shouldn't try to shoehorn this ill-defined thing into your well-oiled caching engine; I should just wrap what you already have and submit pulls here if I find an engine lacks a feature I need/want it to have. Given all of that, it is probably entirely reasonable to close this issue.
Long Rambling Version:
My desire to have an AWS S3 backend is complicated by my unawareness of any CRAN package supporting AWS. Looking on GitHub I can see that there has been some activity in that direction since the last time I looked. Unfortunately, I'm already pretty deeply buried in use-case-specific ad hoc stuff I've already written leveraging rPython. However, I doubt it will ever be ready for CRAN. It stomps all over the Python namespace as if it is the only thing living there. I could fix that... or ignore it if I switch over to rJython. However, there are very few reverse dependencies on those two packages, leaving me to think getting them approved on CRAN would be an uphill battle. The most obvious barrier I see to a clean user experience there is getting the user to install boto. In short, if I were to write a backend to S3 in the next year, it almost certainly would be ugly and never pass muster with CRAN. That being said, I might be able to share it on GitHub.
There are a ton of different ways to handle a remote machine's storage. The first solution that jumps out at me as being able to support indexable serialization would be connecting to a running single-slave-node SNOW cluster and having the result passed back from the SNOW slave node. There might be some advantage in that the SNOW slave node is then responsible for uncompressing the cache file and returning the result, so that it is already serialized in RAM for passing back to the requesting master. The second solution, and probably the more practical one, is just mapping the remote machine's cache directory to the localhost and using it as just another disk cache.
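The "just another disk cache" approach amounts to a filesystem driver pointed at the mounted directory. A hypothetical Python sketch, using a local temp dir in place of the remote mount (keys are assumed filesystem-safe here; real keys would need mangling):

```python
import pickle
import tempfile
from pathlib import Path

class DiskCache:
    """Treat a directory (local, or a mounted remote share) as a
    get/set/del/exists key-value store, one file per key."""
    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
    def _path(self, key):
        return self.root / f"{key}.pkl"
    def exists(self, key):
        return self._path(key).exists()
    def set(self, key, value):
        self._path(key).write_bytes(pickle.dumps(value))
    def get(self, key):
        return pickle.loads(self._path(key).read_bytes())
    def delete(self, key):
        self._path(key).unlink(missing_ok=True)

cache = DiskCache(tempfile.mkdtemp())
cache.set("result", [1, 2, 3])
print(cache.get("result"))  # [1, 2, 3]
```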
I'd already started writing a cascading cache in spaghetti functions before I stumbled across storr or really gave a hard look at R.cache. The features I have working right now in spaghetti functions are:
- On set, creating a limit on the time to live for the cache (not allowing the result to be fetched if past a certain staleness threshold)
- On put, setting the staleness threshold on cache read
- Being choosy about which layers to check on get
- Being choosy about which layers to push to on set
- Backpropagating to faster caches
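Several of the working features above (choosing layers on get, backpropagating to faster caches) can be sketched with nothing but exists/get/set layers. A hypothetical Python illustration, with invented names:

```python
class Layer:
    """One cache layer; ordering in the cascade encodes relative speed."""
    def __init__(self, name):
        self.name = name
        self._data = {}
    def exists(self, key):
        return key in self._data
    def get(self, key):
        return self._data[key]
    def set(self, key, value):
        self._data[key] = value

def cascade_get(layers, key):
    """Check layers fastest-first; on a hit, backpropagate the value
    into every faster layer that missed."""
    for i, layer in enumerate(layers):
        if layer.exists(key):
            value = layer.get(key)
            for faster in layers[:i]:
                faster.set(key, value)
            return value
    raise KeyError(key)

memory, disk, remote = Layer("memory"), Layer("disk"), Layer("remote")
remote.set("x", 42)
print(cascade_get([memory, disk, remote], "x"))  # 42, now also in memory/disk
print(memory.exists("x"))  # True
```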
What I still see as missing features are the ability to:
- expire stale keys from the cache where staleness involves both the above thresholds and an LRU consideration
- handle backpropagation intelligently such that it doesn't turn out to be more hassle than it is worth
- be smart about space usage so that I don't overflow caches that have bounded storage space
The above feature set goes beyond your get/set/del/exists support and would require metadata regarding each item's time of storage, time until stale (specified on set), time of most recent access, and data size (as stored). However, it seems like a lot of extra functionality to ask of a cache, most of it in the realm of metadata.
I started poking at the metadata problem here and there. Some of those items can simply be put as attributes on the stored item, but that requires fetching the full item before you discover that it is stale. Thus some sort of legitimate metadata layer makes sense. For obvious reasons (e.g. key expiration is already handled), Redis looks great as a metadata layer. On the other hand, going round trip to a Redis server to get metadata (or requiring that one be installed) is a bit on the heavy/slow side. rrlite seems like a potential answer, but it reads like the Windows support issue is an open question (seppo0010/rlite#11). That leaves SQLite, but I'm a bit less excited about that for no reason in particular.
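Keeping the metadata under its own key is one way to avoid fetching the full item just to discover it is stale: the staleness check reads only a small record. A hypothetical Python sketch, assuming any get/set backend (here a dict):

```python
import pickle
import time

store = {}  # any get/set backend would do

def set_with_meta(key, value, ttl):
    """Write a small metadata record and the payload under separate keys."""
    store["meta:" + key] = pickle.dumps({"expires": time.time() + ttl})
    store["data:" + key] = pickle.dumps(value)

def get_if_fresh(key):
    """Read the tiny metadata record first; only fetch the (possibly
    large) payload if the entry has not gone stale."""
    meta = pickle.loads(store["meta:" + key])
    if time.time() > meta["expires"]:
        raise KeyError(f"{key} is stale")
    return pickle.loads(store["data:" + key])

set_with_meta("big", list(range(1000)), ttl=3600)
print(len(get_if_fresh("big")))  # 1000
```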
All of that leads back to the short version. I probably shouldn't try to shoehorn this ill defined thing into your well oiled caching engine, I should just wrap what you already have and submit pulls here if I find an engine lacks a feature I need/want it to have.
TTL is definitely a nice-to-have (see #5; open to ideas about implementing that). Storage space is much harder because there's a whole set of logic around that (it might be better to implement it at the driver level for cases where that's important).
For Amazon S3, there are a few options, but none seem very good.
I agree, I don't see a way around doing storage space at a driver level.
Re S3: AWS.tools and https://github.com/armstrtw/Rawscli (the follow-on) both proceed via system calls to Amazon's command line tool(s), except Amazon keeps changing the interface on those pretty much on a whim. So, a long-lived R solution probably should (IMO) depend on a supported Python SDK. In particular, boto3/botocore (relative to the hand-coded boto) looks ripe for the type of procedural generation stuff you've done with the Redis API. In addition, there is a forthcoming, but not yet stable, C++ SDK which seems a preferable build target relative to the community-provided C++ library being leveraged by RS3.
If you're still interested in writing an AWS backend, please see the new vignette: http://richfitz.github.io/storr/vignettes/drivers.html
This walks through creating a new driver, which is much easier than it was before. There's also an automatic test suite to reduce guesswork.
The new version is currently on the "refactor" branch as I check to see if I broke anything across various projects I have that depend on it. Aside from additional documentation I'm thinking this is pretty much good to go to CRAN now.
What about AWS S3 support in the memoise package?
An AWS backend could be done, for sure - it would make most sense to use the package aws.s3 directly (storr drivers need way more than what memoise needs). That would probably be best in another package (storr.aws perhaps).
Actually cascading things is another matter entirely; there's quite a bit of work needed to work out how things should cascade down. I can imagine a "caching" mode where you put one backend in front of another, but how do you keep changes propagating forward through the backends...
With no offense intended to the author of aws.s3, it is still very nascent compared to the basic offerings in Python or Java. In particular, the last time I looked, very large file support was poor, as was the use of IAM creds. Personally I use Python's boto3 via package:reticulate. package:awsjavasdk [which I authored] or package:AWR would also be workable.
Also, it has been a while since this thread revisited a possible rOpenSci submission. Any new thoughts?
I've made some proof-of-concept stuff with reticulate and mostly addressed the issues caused for it by forking. So, S3 access outside of aws.s3 is within reach conceptually. I'd need to do a clean-room rewrite into a public repo if I were to do it. As for storr integration, I haven't been keeping track, but if I were to do it, it would be after all of the above. For what it is worth, https://github.com/HenrikBengtsson/R.cache has the advantage of already being on CRAN.
I'm interested to hear where @richfitz is on this. I know I was very glad to see redux hit CRAN, I really enjoy his packages.
@russellpierce - storr is on CRAN, though I think that really cascading through will require some thinking
We can probably close this issue. If we were to do cascading I'd probably see it as a package sitting on top of this one rather than increasing the complexity here.