Comments (17)
I wonder if using feather for data frames might be faster (though I'd not want to add that dependency here). I saw a benchmark recently that found it way faster for storing very large data.frames. It'd be a bit of an issue though, because they wouldn't always round-trip correctly (not everything can be losslessly converted to feather).
from storr.
And actually, if I uncomment this test and run the tests, it fails:
==> devtools::test()
Loading storr
Loading required package: testthat
Testing storr
drivers [environment]: ..................................
export [environment]: ..........................................
external [environment]: .................
storr [environment]: ...................................................
drivers [rds]: ..................................
export [rds]: ..........................................
external [rds]: .................
storr [rds]: ...................................................
base64: ..
driver rds details: ......................WW1....
environment driver: .....
utils: .
Warnings -----------------------------------------------------------------------
1. large vector support (@test-driver-rds.R#99) - Reached total allocation of 24371Mb: see help(memory.size)
2. large vector support (@test-driver-rds.R#99) - Reached total allocation of 24371Mb: see help(memory.size)
Failed -------------------------------------------------------------------------
1. Error: large vector support (@test-driver-rds.R#99) -------------------------
'Calloc' could not allocate memory (2147483646 of 1 bytes)
1: dr$set_object(hash, x) at C:\_dev\storr/tests/testthat/test-driver-rds.R:99
2: write_bin(value, con) at C:\_dev\storr/R/driver_rds.R:147
3: writeBin(value[seq(i, min(j, len))], con) at C:\_dev\storr/R/utils.R:127
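For anyone following along, frame 3 of the traceback shows the raw vector being written in slices rather than in one writeBin() call. A minimal base-R sketch of that general pattern is below. It is not storr's actual code: the function name write_bin_chunked and the chunk_size default are made up for illustration.

```r
# Write a raw vector to a binary connection in fixed-size slices, so that no
# single writeBin() call has to handle the whole vector at once.
# write_bin_chunked and chunk_size are hypothetical names, not part of storr.
write_bin_chunked <- function(value, con, chunk_size = 2^20) {
  stopifnot(is.raw(value), chunk_size >= 1)
  len <- length(value)
  start <- 1
  while (start <= len) {
    end <- min(start + chunk_size - 1, len)
    writeBin(value[start:end], con)
    start <- end + 1
  }
  invisible(len)
}

# Small round-trip check with a deliberately tiny chunk size
x <- serialize(mtcars, NULL)
path <- tempfile()
con <- file(path, "wb")
write_bin_chunked(x, con, chunk_size = 1000)
close(con)
y <- readBin(path, raw(), n = file.size(path))
stopifnot(identical(x, y))
```

Note that each slice value[start:end] is itself a copy, so smaller chunks trade some speed for a lower peak allocation.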
from storr.
Hi Andy - thanks for this! It came through while I was on holiday. I'll try and replicate soon and see what the story is.
from storr.
Thanks @richfitz! Let me know what I can do.
from storr.
I have a fix for this! It's not pretty but it should work. I'll get it pushed to github soon (it's mixed in with a big overhaul of the package), but can you let me know if you'd be able to test it once it's up?
from storr.
What on earth are y'all storing? :)
Storing really big objects is always going to be really nasty; I'll try and post my experimentation with timings, but a to-disk roundtrip is going to cost ~25s for a file right on the 2GB limit.
from storr.
BIG DATA!
It'll be faster than having to recompute them, so I can deal with it 😎
from storr.
I'm happy to test as well. Thanks @richfitz
from storr.
I've been using feather a bit and it's really fast, and did definitely wonder about a storr backend using it (though obviously only for data frames). The only thing I wish it supported but doesn't (yet?) is list columns.
from storr.
Please have a look at the develop branch. You should be able to install that with remotes::install_github("richfitz/storr@develop")
from storr.
Worked for me! 4 minutes on the same machine and file as above (but I wouldn't use my computer as a good benchmark 😉 )
from storr.
Great, thanks! I'll look into what, if anything, can be done to speed that up with your particular file
from storr.
OK, I see times (user / system / elapsed, in seconds):
- read with readr::read_csv(): 55 / 15 / 71
- serialisation: 24 / 20 / 45
- hash: 17.6 / 0 / 17.6
- set object: 75 / 12 / 89
- get object: 59 / 8 / 69
So that's all pretty terrible :) Total time for a storr roundtrip here is looking to be about 3.7 minutes and my computer is not a slouch (but still just a desktop PC).
R is really struggling to do anything with that object, though; just computing object.size takes 15 seconds.
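If you want to collect comparable numbers on your own machine, something like the following base-R sketch would do it. It is only an approximation of the steps above: it uses saveRDS()/readRDS() as a stand-in for the storr set/get calls, skips the hash step, and uses a small made-up data frame rather than the file discussed here.

```r
# Time the serialise / write / read steps separately with system.time().
# The data frame is a small synthetic placeholder, not the file from this issue.
x <- data.frame(a = runif(1e5), b = sample(letters, 1e5, replace = TRUE))

t_serialise <- system.time(bytes <- serialize(x, NULL))

path <- tempfile(fileext = ".rds")
t_set <- system.time(saveRDS(x, path))    # stand-in for "set object"
t_get <- system.time(y <- readRDS(path))  # stand-in for "get object"

# One row per step; columns are user / system / elapsed seconds
timings <- rbind(serialise = t_serialise, set = t_set, get = t_get)
print(timings[, c("user.self", "sys.self", "elapsed")])

# Size of the object in memory versus its serialised form
print(object.size(x), units = "MB")
length(bytes)
```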
For comparison, I see 18.5s to write a feather file (elapsed) and 19.6s to read it in. That's clearly much better than the 89 / 69s with rds, but we'd still have to pay the cost of serialisation and hash just to determine that we have a "big" object. And the objects as returned aren't identical, or all.equal (attributes are re-ordered). However, they do appear to be equivalent.
I'm not really sure what can be sensibly done to make this less unpleasant but I think we're fundamentally hitting the limits of what is sensible to do in an in-memory data.frame. Let me know if you have any ideas, otherwise I'll just leave this as-is, I think.
from storr.
@richfitz I think you're absolutely right. For this particular data I did switch to using a sqlite backend (not with storr, but it looks like you might be moving in that direction?). But your work fixing this issue is still really helpful, so thanks!
from storr.
The SQLite support in the development version of storr is not meant to replace database-based table access; for that you want to use RSQLite and friends directly. The new support allows storing individual keys in a table. They're not really the same sort of thing: if you used SQL-based storage with storr to keep your 3GB table, you'd still be serialising it into a big opaque blob of data and pushing it into a system that wants to deal with tables! Different tools for different problems, really.
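To make the "opaque blob" point concrete, here is a toy key/value store in base R. It is purely illustrative and is not how storr or its SQLite driver is implemented; kv_set and kv_get are made-up names.

```r
# Toy key/value store: an environment mapping keys to serialised raw vectors.
# Hypothetical illustration only; kv_set / kv_get are not storr functions.
store <- new.env()
kv_set <- function(key, value) assign(key, serialize(value, NULL), envir = store)
kv_get <- function(key) unserialize(get(key, envir = store))

kv_set("cars", mtcars)

# What the store actually holds is just bytes; there are no rows or columns
# for a database engine to index or filter.
blob <- get("cars", envir = store)
class(blob)

# Any "query" means pulling the whole object back and filtering in R.
cars <- kv_get("cars")
efficient <- cars[cars$mpg > 30, ]
```

With a real table in SQLite, the equivalent filter could run inside the database without loading every row into R first.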
I'm glad you got everything working and sorry this took so long to fix.
from storr.
Yep, that makes sense and I kind of thought that was what the SQLite support was for. Thanks, and no worries!
from storr.