Comments (17)

richfitz commented on August 16, 2024

I wonder if using feather for data frames might be faster (though I'd not want to add that dependency here). I saw a benchmark recently that found it much faster for storing very large data.frames. It'd be a bit of an issue, though, because the round trip wouldn't always be exact (not everything can be losslessly converted to feather).
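
A throwaway sketch of the sort of comparison I mean (the data frame and file names here are made up, not taken from the benchmark I saw):

library(feather)
x <- data.frame(a = runif(1e6), b = sample(letters, 1e6, replace = TRUE))

system.time(saveRDS(x, "x.rds"))             # baseline: rds write
system.time(write_feather(x, "x.feather"))   # feather write
system.time(y <- read_feather("x.feather"))  # feather read

identical(x, y)  # likely FALSE: read_feather returns a tibble, so the round trip isn't exact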

ateucher commented on August 16, 2024

And actually, if I uncomment this test and run the tests, it fails:

==> devtools::test()

Loading storr
Loading required package: testthat
Testing storr
drivers [environment]: ..................................
export [environment]: ..........................................
external [environment]: .................
storr [environment]: ...................................................
drivers [rds]: ..................................
export [rds]: ..........................................
external [rds]: .................
storr [rds]: ...................................................
base64: ..
driver rds details: ......................WW1....
environment driver: .....
utils: .

Warnings -----------------------------------------------------------------------
1. large vector support (@test-driver-rds.R#99) - Reached total allocation of 24371Mb: see help(memory.size)

2. large vector support (@test-driver-rds.R#99) - Reached total allocation of 24371Mb: see help(memory.size)

Failed -------------------------------------------------------------------------
1. Error: large vector support (@test-driver-rds.R#99) -------------------------
'Calloc' could not allocate memory (2147483646 of 1 bytes)
1: dr$set_object(hash, x) at C:\_dev\storr/tests/testthat/test-driver-rds.R:99
2: write_bin(value, con) at C:\_dev\storr/R/driver_rds.R:147
3: writeBin(value[seq(i, min(j, len))], con) at C:\_dev\storr/R/utils.R:127
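
(For context, the traceback ends in storr's chunked writeBin() helper; writeBin() historically can't cope with raw vectors at or around the 2 GB mark in a single call, so the serialised object has to be written in slices. A rough sketch of that idea, not the actual utils.R code:)

write_bin_chunked <- function(value, con, chunk_size = 2^26) {
  # value is a raw vector (e.g. from serialize()); write it in slices small
  # enough to stay well below the ~2 GB per-call limit
  len <- length(value)
  i <- 1
  while (i <= len) {
    j <- min(i + chunk_size - 1, len)
    writeBin(value[i:j], con)
    i <- j + 1
  }
}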

richfitz commented on August 16, 2024

Hi Andy - thanks for this! It came through while I was on holiday. I'll try and replicate soon and see what the story is.

ateucher commented on August 16, 2024

Thanks @richfitz! Let me know what I can do.

richfitz commented on August 16, 2024

I have a fix for this! It's not pretty but it should work. I'll get it pushed to github soon (it's mixed in with a big overhaul of the package), but can you let me know if you'd be able to test it once it's up?

fmichonneau commented on August 16, 2024

richfitz commented on August 16, 2024

What on earth are y'all storing? :)

Storing really big objects is always going to be really nasty; I'll try and post my timing experiments, but a to-disk round-trip is going to cost ~25s for a file right at the 2GB limit.

fmichonneau commented on August 16, 2024

BIG DATA!

It'll be faster than having to recompute them, so I can deal with it 😎

ateucher commented on August 16, 2024

I'm happy to test as well. Thanks @richfitz

ateucher commented on August 16, 2024

I've been using feather a bit and it's really fast, and I've definitely wondered about a storr backend using it (though obviously only for data frames). The only thing I wish it supported, but doesn't (yet?), is list columns.
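
A quick illustration of what I mean (a made-up example; the failing call is just my expectation of the limitation):

library(feather)

ok <- data.frame(x = 1:3, y = letters[1:3])
write_feather(ok, tempfile(fileext = ".feather"))        # fine

bad <- data.frame(x = 1:3)
bad$y <- list(1:2, 3:4, 5:6)                             # list column
try(write_feather(bad, tempfile(fileext = ".feather")))  # expect an error: list columns unsupported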

richfitz commented on August 16, 2024

Please have a look at the develop branch. You should be able to install that with remotes::install_github("richfitz/storr@develop")
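
Something along these lines should exercise the path that was failing (x standing in for whatever large object triggered the error):

remotes::install_github("richfitz/storr@develop")

st <- storr::storr_rds(tempfile("storr_"))
st$set("big", x)             # the set that previously died in write_bin()
identical(st$get("big"), x)  # should now round-trip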

ateucher commented on August 16, 2024

Worked for me! 4 minutes on the same machine and file as above (but I wouldn't use my computer as a good benchmark 😉 )

richfitz commented on August 16, 2024

Great, thanks! I'll look into what, if anything, can be done to speed that up with your particular file.

richfitz commented on August 16, 2024

OK, I see times (user / system / elapsed, in seconds):

  • read with readr::read_csv(): 55 / 15 / 71
  • serialisation: 24 / 20 / 45
  • hash: 17.6 / 0 / 17.6
  • set object: 75 / 12 / 89
  • get object: 59 / 8 / 69

So that's all pretty terrible :) Total time for a storr round-trip here looks to be about 3.7 minutes, and my computer is not a slouch (though it's still just a desktop PC).

R is really struggling to do anything with that object, though; just computing object.size takes 15 seconds.

For comparison, I see 18.5s to write a feather file (elapsed) and 19.6s to read it in. That's clearly much better than the 89 / 69s with rds, but we'd still have to pay the serialisation and hash costs just to determine that we have a "big" object. And the objects as returned aren't identical() or even all.equal() (attributes are re-ordered), though they do appear to be equivalent.

I'm not really sure what can sensibly be done to make this less unpleasant; I think we're fundamentally hitting the limits of what an in-memory data.frame can reasonably handle. Let me know if you have any ideas; otherwise I'll just leave this as-is, I think.
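
Roughly the sort of thing those numbers correspond to (a sketch only; path stands in for the big csv, and this isn't the exact code I ran):

system.time(x <- readr::read_csv(path))                  # read
system.time(s <- serialize(x, NULL))                     # serialisation
system.time(h <- digest::digest(s, serialize = FALSE))   # hash
st <- storr::storr_rds(tempfile("storr_"))
system.time(st$set("big", x))                            # set object
system.time(y <- st$get("big"))                          # get object

system.time(feather::write_feather(x, "x.feather"))      # feather write
system.time(z <- feather::read_feather("x.feather"))     # feather read
all.equal(x, z)                                          # attribute order may differ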

ateucher commented on August 16, 2024

@richfitz I think you're absolutely right. For this particular data I did switch to using an SQLite backend (not with storr, but it looks like you might be moving in that direction?). But your work fixing this issue is still really helpful, so thanks!

richfitz commented on August 16, 2024

The SQLite support in the development version of storr is not meant to replace database-based table access; for that you want to use RSQLite and friends directly. The new support allows storing individual keys in a table, which is not really the same sort of thing: if you used SQL-based storage with storr to keep your 3GB table, you'd still be serialising it into a big opaque blob of data and pushing it into a system that wants to deal with tables! Different tools for different problems, really.
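
To make the contrast concrete (a sketch only; the table and key names are made up, and the storr_dbi() call reflects the DBI-backed interface rather than necessarily the exact development-branch API):

library(DBI)
con <- dbConnect(RSQLite::SQLite(), "data.sqlite")

# Table access: the database knows about rows and columns, so you can query them
dbWriteTable(con, "big_table", x)
dbGetQuery(con, "SELECT COUNT(*) FROM big_table")

# Key-value access via storr: the whole object goes in as one serialised blob
st <- storr::storr_dbi("storr_data", "storr_keys", con)
st$set("big_table", x)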

I'm glad you got everything working and sorry this took so long to fix.

ateucher commented on August 16, 2024

Yep, that makes sense and I kind of thought that was what the SQLite support was for. Thanks, and no worries!
