
Comments (3)

MarcusKlik commented on June 19, 2024

Hi @statquant, thanks for the feature request! Your request is related to issues #16 and #30. As you say, for sorted tables we can implement a binary search to retrieve a range of rows for a specified key range. A binary search is very fast: since 2^30 is slightly more than a billion, about 30 seek operations on the fst file are enough to locate a key among a billion records. For selections which are not related to a stored key, we could use the selection mechanism from data.table, but on chunks of data instead of the whole table. The problem, however, is that you can't use aggregate statements for selection in that case, for example:

library(data.table)

dt <- data.table(X = 1:10, Y = 10:1)
dt[X < mean(Y)]  # mean(Y) is 5.5, so the rows with X <= 5 are selected

   X  Y
1: 1 10
2: 2  9
3: 3  8
4: 4  7
5: 5  6

This works for a complete table, but not when the data is split into multiple chunks: each chunk would compute its own mean(Y), so the selection no longer matches the whole-table result. Possible solutions might be:

  • A selection requires the specification of a grouping variable, so that the selection is done per group. If the groups are small enough to be processed in memory, there will be no problems for large data sets.
  • No aggregate selections are allowed, only simple operators. The advantage of this solution is that we can program these simple operators in C++, increasing performance.
  • A more elaborate framework where we allow custom methods as operators on the data. These methods should have a map-reduce-like character. For the example above:
# Two chunks (sample() is random, so the exact result differs between runs)
dt1 <- data.table(X = sample(1:20, 10), Y = sample(1:20, 10))
dt2 <- data.table(X = sample(1:20, 10), Y = sample(1:20, 10))

# Calculate sums and counts
r1 <- dt1[, .(Sum = sum(Y), Count = .N)]
r2 <- dt2[, .(Sum = sum(Y), Count = .N)]

# Combine results and calculate mean
rTot <- rbindlist(list(r1, r2))
rTot[, sum(Sum) / sum(Count)]

[1] 10.35

So we calculated the overall mean from per-chunk sums and counts. The fst package could provide methods such as fst.sum, fst.mean, etc. to perform these operations.
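The same map-reduce idea could be carried over to the selection itself. The sketch below is only an illustration and not part of fst: `chunk_mean` and `select_below_mean` are hypothetical helper names, and the chunks are plain in-memory data.tables standing in for row ranges read from an fst file. It makes two passes: first the per-chunk sums and counts are reduced to a global mean, then that constant is used as a simple threshold on each chunk.

```r
library(data.table)

# Hypothetical helper: global mean of column `col` from per-chunk sums and counts
chunk_mean <- function(chunks, col) {
  parts <- rbindlist(lapply(chunks, function(d) {
    data.table(Sum = sum(d[[col]]), Count = nrow(d))
  }))
  parts[, sum(Sum) / sum(Count)]
}

# Hypothetical helper: rows with X < mean(Y), evaluated chunk by chunk
select_below_mean <- function(chunks) {
  m <- chunk_mean(chunks, "Y")                      # pass 1: map-reduce aggregate
  rbindlist(lapply(chunks, function(d) d[X < m]))   # pass 2: simple filter per chunk
}

dt <- data.table(X = 1:10, Y = 10:1)
chunks <- list(dt[1:3], dt[4:10])                   # two chunks of unequal size

whole   <- dt[X < mean(Y)]                          # reference: whole-table selection
chunked <- select_below_mean(chunks)                # two-pass selection on chunks
naive   <- rbindlist(lapply(chunks, function(d) d[X < mean(Y)]))  # per-chunk means

nrow(whole)    # 5
nrow(chunked)  # 5, same rows as the whole-table selection
nrow(naive)    # 3, because each chunk used its own mean
```

The cost of this approach is that the data has to be read twice, once for the aggregate and once for the filter.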

For your use-case I think that option 2 is probably enough?
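A key-range selection on a sorted table is one of the simple selections that option 2 would cover, and it is the case where the binary search mentioned at the top applies. The following is a minimal sketch under stated assumptions, not how fst implements it: `lower_bound` and `select_key_range` are made-up helper names, a sorted integer vector stands in for a key column stored on disk, and each probe stands in for one seek operation.

```r
# Hypothetical helper: first position whose key is >= target in a sorted vector.
# Returns that position and the number of probes (stand-ins for file seeks).
lower_bound <- function(keys, target) {
  lo <- 1L
  hi <- length(keys) + 1L              # search the half-open interval [lo, hi)
  probes <- 0L
  while (lo < hi) {
    mid <- lo + (hi - lo) %/% 2L
    probes <- probes + 1L
    if (keys[mid] < target) lo <- mid + 1L else hi <- mid
  }
  list(pos = lo, probes = probes)
}

# Hypothetical helper: row indices with from <= key <= to (integer keys assumed)
select_key_range <- function(keys, from, to) {
  first <- lower_bound(keys, from)
  past  <- lower_bound(keys, to + 1L)  # first position with key > to
  list(rows   = seq_len(past$pos - first$pos) + first$pos - 1L,
       probes = first$probes + past$probes)
}

keys <- sort(sample(1:1000000, 100000))  # sorted key column with 100,000 rows
res  <- select_key_range(keys, 2000L, 3000L)

# The selected rows match a full scan of the key column
identical(res$rows, which(keys >= 2000L & keys <= 3000L))

res$probes           # total probes for both bounds
ceiling(log2(1e9))   # 30: probes needed for one bound in a billion-row table
```

Once the first and last row of the range are known, only that contiguous block of rows needs to be read.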


statquant commented on June 19, 2024

@MarcusKlik thanks for the prompt reply, indeed 2) is enough for me.

Honestly, I think it would be enough for most people: when you want to aggregate in some sense, I'd guess you would still want the whole data set, to check what you've done, to change what you've done, etc.


MarcusKlik commented on June 19, 2024

Nice, I will make sure that your feature is on the list for one of the next versions of fst.
