
Comments (2)

kevinburkesegment commented on August 10, 2024

Is it worth using https://github.com/orijtech/structslop to optimize the order that we put the fields in each struct?

the parquet spec does not mention it unfortunately

Can we open a ticket? It seems like they should at least say one way or another what the treatment is.

the null count should fit in a 32-bit integer, since the number of values and the number of nulls are each stored as 32-bit integers in the data page header

I'm confused about this. Wouldn't we still need to store which pages specifically are null?

in order to reconstruct the first row index of a page, we need to sum the values of all deltas up to that page.

This plus the added complexity make me think maybe we should wait until we deploy this and have a better sense of the size savings + access patterns to decide if it's worth it to do this.
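To make the cost of delta encoding concrete, here is a minimal sketch of the reconstruction step. The function name `firstRowIndexes` and the convention that `deltas[0]` is the offset of the first page are assumptions for illustration, not parquet-go's API.

```go
package main

import "fmt"

// firstRowIndexes rebuilds the absolute first-row index of every page from
// per-page deltas by computing a running (prefix) sum.
// Assumption: deltas[i] is the number of rows between the start of page i-1
// and the start of page i, and deltas[0] is the offset of the first page.
func firstRowIndexes(deltas []int32) []int64 {
	indexes := make([]int64, len(deltas))
	sum := int64(0)
	for i, d := range deltas {
		sum += int64(d) // widen before adding so the total cannot overflow int32
		indexes[i] = sum
	}
	return indexes
}

func main() {
	// Hypothetical page layout: pages start at rows 0, 100, 200 and 250.
	fmt.Println(firstRowIndexes([]int32{0, 100, 100, 50})) // prints [0 100 200 250]
}
```

Looking up a single page this way costs O(i) additions for page i unless the prefix sums are materialized up front, which gives back part of the memory that the deltas saved.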

from parquet-go.

achille-roussel commented on August 10, 2024

Is it worth using https://github.com/orijtech/structslop to optimize the order that we put the fields in each struct?

I don't think we have a lot of room for optimization here: most of the memory is going to be held in the backing arrays of the slices, and the fields are made of types that already align well (no small types interleaved with larger ones, etc.).

Can we open a ticket? It seems like they should at least say one way or another what the treatment is.

That seems like the right thing to do. I'll have to familiarize myself with the issue submission process for parquet-format (it doesn't seem to use GitHub issues).

I'm confused about this. Wouldn't we still need to store which pages specifically are null?

Yes, I think I wanted to point out that they are arrays of 64-bit integers in the Thrift definition, which seems larger than necessary. I don't know why they were defined as int64 rather than int32, since the same values are represented with 32-bit integers in other places. Maybe they anticipated using the column index structures for aggregates as well, where more than 32 bits may be needed to represent the null counts.
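If the in-memory representation were narrowed to 32 bits, the conversion would need a range check in case a file carries larger values. A minimal sketch, where `narrowNullCounts` is a hypothetical helper and not part of parquet-go:

```go
package main

import (
	"fmt"
	"math"
)

// narrowNullCounts converts the int64 null counts found in the Thrift
// definition to int32, halving the size of the backing array. It returns an
// error rather than silently truncating a value that does not fit.
func narrowNullCounts(counts []int64) ([]int32, error) {
	out := make([]int32, len(counts))
	for i, c := range counts {
		if c < 0 || c > math.MaxInt32 {
			return nil, fmt.Errorf("null count %d of page %d does not fit in 32 bits", c, i)
		}
		out[i] = int32(c)
	}
	return out, nil
}

func main() {
	// Hypothetical per-page null counts.
	narrow, err := narrowNullCounts([]int64{0, 5, 42})
	fmt.Println(narrow, err)

	// A count above math.MaxInt32 is rejected.
	_, err = narrowNullCounts([]int64{1 << 31})
	fmt.Println(err)
}
```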

This plus the added complexity make me think maybe we should wait until we deploy this and have a better sense of the size savings + access patterns to decide if it's worth it to do this.

I don't know if the access pattern is going to play a big part here. The indexes have to be loaded in memory in order to be effective (if they are kept on disk, we need to issue O(log(N)) I/O operations to binary search through the index). That's why I was trying to estimate the amount of memory needed to represent the indexes in memory. The more compact they are, the larger the dataset can be, and the more memory is left available for other parts of the system.
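For reference, the in-memory lookup could be sketched with the standard library's `sort.Search`. The function `findPage` and its slice of first-row indexes are assumptions for illustration, not parquet-go's API; each probe made by the search is a memory read here, and would become an I/O operation if the index stayed on disk.

```go
package main

import (
	"fmt"
	"sort"
)

// findPage returns the position of the page that contains the given row, or
// -1 if the row comes before the first page.
// Assumption: firstRowIndexes is sorted in ascending order and holds the
// first row index of each page.
func findPage(firstRowIndexes []int64, row int64) int {
	// sort.Search finds the smallest i whose first row is past the target;
	// the page just before it is the one containing the row.
	i := sort.Search(len(firstRowIndexes), func(i int) bool {
		return firstRowIndexes[i] > row
	})
	return i - 1
}

func main() {
	// Hypothetical index: pages start at rows 0, 100, 200 and 250.
	index := []int64{0, 100, 200, 250}
	fmt.Println(findPage(index, 150)) // prints 1
	fmt.Println(findPage(index, 250)) // prints 3
}
```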

