Page indexes are split in two data structures: Column Indexes and Offset Indexes. The

minimize in-memory overhead of page indexes

segmentio commented on August 10, 2024


Comments (2)

kevinburkesegment commented on August 10, 2024

Is it worth using https://github.com/orijtech/structslop to optimize the order that we put the fields in each struct?

the parquet spec does not mention it unfortunately

Can we open a ticket? It seems like they should at least say one way or another what the treatment is.

the null count should fit in a 32-bit integer, since the number of values and the number of nulls are stored as 32-bit integers in the data page header

I'm confused about this. Wouldn't we still need to store which pages specifically are null?

in order to reconstruct the first row index of a page, we need to sum the values of all deltas up to that page.

This, plus the added complexity, makes me think maybe we should wait until we deploy this and have a better sense of the size savings + access patterns to decide if it's worth it to do this.
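To make the trade-off concrete, here is a minimal sketch of the delta idea, assuming first row indexes are stored as 32-bit gaps between consecutive pages. The names `encodeDeltas` and `firstRowIndex` are hypothetical and are not part of parquet-go.

```go
package main

import "fmt"

// encodeDeltas stores, for each page, the gap between its first row index
// and the previous page's first row index. Hypothetical helper; it assumes
// every gap fits in 32 bits.
func encodeDeltas(firstRowIndexes []int64) []uint32 {
	deltas := make([]uint32, len(firstRowIndexes))
	prev := int64(0)
	for i, idx := range firstRowIndexes {
		deltas[i] = uint32(idx - prev)
		prev = idx
	}
	return deltas
}

// firstRowIndex rebuilds the absolute first row index of a page by summing
// every delta up to and including that page, so a lookup costs O(page).
func firstRowIndex(deltas []uint32, page int) int64 {
	sum := int64(0)
	for _, d := range deltas[:page+1] {
		sum += int64(d)
	}
	return sum
}

func main() {
	first := []int64{0, 1000, 2500, 2600}
	deltas := encodeDeltas(first)
	fmt.Println(deltas)                   // [0 1000 1500 100]
	fmt.Println(firstRowIndex(deltas, 2)) // 2500
}
```

The deltas take half the space of the absolute 64-bit indexes, but every lookup becomes a linear scan instead of a direct slice access, which is the added complexity in question.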


achille-roussel commented on August 10, 2024

Is it worth using https://github.com/orijtech/structslop to optimize the order that we put the fields in each struct?

I don't think we have a lot of room for optimization here. Most of the memory is going to be held in the backing arrays of slices, and the fields are made of types that already align well (no small types interleaved with larger ones, etc.).
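As a rough illustration of the alignment point, the sketch below compares struct sizes with `unsafe.Sizeof`. The `columnIndex` layout is a hypothetical stand-in, not the actual parquet-go type, and the exact sizes depend on the platform.

```go
package main

import (
	"fmt"
	"unsafe"
)

// columnIndex is a hypothetical in-memory layout made only of slice
// headers, which all have the same size and alignment, so reordering
// the fields cannot remove any padding.
type columnIndex struct {
	nullPages  []bool
	minValues  [][]byte
	maxValues  [][]byte
	nullCounts []int64
}

// padded interleaves small and large fields; this is the kind of layout
// a tool like structslop would flag.
type padded struct {
	a bool
	b int64
	c bool
	d int64
}

// reordered holds the same fields as padded, largest first.
type reordered struct {
	b int64
	d int64
	a bool
	c bool
}

// sizes reports the size in bytes of the three struct types and of a
// single slice header on the current platform.
func sizes() (index, pad, reord, sliceHeader uintptr) {
	return unsafe.Sizeof(columnIndex{}),
		unsafe.Sizeof(padded{}),
		unsafe.Sizeof(reordered{}),
		unsafe.Sizeof([]byte(nil))
}

func main() {
	index, pad, reord, header := sizes()
	fmt.Println("columnIndex:", index, "bytes, slice header:", header, "bytes")
	fmt.Println("padded:", pad, "bytes, reordered:", reord, "bytes")
}
```

Only the interleaved struct shrinks when its fields are reordered. The slice-only struct is exactly four slice headers wide, and the memory that matters lives in the backing arrays those headers point to.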

Can we open a ticket? It seems like they should at least say one way or another what the treatment is.

That seems like the right thing to do. I'll have to familiarize myself with the issue submission process for parquet-format (it doesn't seem to use GitHub issues).

I'm confused about this. Wouldn't we still need to store which pages specifically are null?

Yes, I think I was referring to the fact that they are arrays of 64-bit integers in the thrift definition, which seems larger than necessary. I don't know why it was defined as int64 rather than int32, since the same values are represented with 32-bit integers in other places. Maybe they anticipated using the column index structures for aggregates as well (where more than 32 bits may be needed to represent the null counts).
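A minimal sketch of what narrowing could look like, assuming the null counts are decoded as `[]int64` and re-stored as `[]int32` with a range check. The function `narrowNullCounts` is hypothetical and not part of parquet-go.

```go
package main

import (
	"fmt"
	"math"
)

// narrowNullCounts copies 64-bit null counts into a 32-bit slice, halving
// the size of the backing array. Hypothetical helper: it returns an error
// instead of truncating when a count does not fit.
func narrowNullCounts(counts []int64) ([]int32, error) {
	narrow := make([]int32, len(counts))
	for i, c := range counts {
		if c < 0 || c > math.MaxInt32 {
			return nil, fmt.Errorf("null count %d of page %d does not fit in 32 bits", c, i)
		}
		narrow[i] = int32(c)
	}
	return narrow, nil
}

func main() {
	counts := []int64{0, 12, 4096}
	narrow, err := narrowNullCounts(counts)
	if err != nil {
		panic(err)
	}
	// Bytes used by the backing arrays: 8 per int64 versus 4 per int32.
	fmt.Println(narrow, len(counts)*8, len(narrow)*4) // [0 12 4096] 24 12
}
```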

This, plus the added complexity, makes me think maybe we should wait until we deploy this and have a better sense of the size savings + access patterns to decide if it's worth it to do this.

I don't know if the access pattern is going to play a big part here. The indexes have to be loaded in memory in order to be effective (if they are kept on disk, then we need to issue O(log(N)) I/O operations to binary search through the index). That's why I was trying to estimate the amount of memory needed to represent the indexes in memory. The more compact they are, the larger the dataset can be, and the more memory there is left available for other parts of the system.
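For reference, a binary search over an in-memory index could look like the sketch below. It assumes the first row index of every page is held in a sorted `[]int64`. The function `pageContainingRow` is hypothetical and not part of parquet-go.

```go
package main

import (
	"fmt"
	"sort"
)

// pageContainingRow returns the index of the page that holds the given row,
// or -1 if the row precedes the first page. Each of the O(log(N)) probes is
// a memory access here; with the index on disk, each probe could become an
// I/O operation.
func pageContainingRow(firstRowIndexes []int64, row int64) int {
	// sort.Search finds the first page that starts after row; the page
	// just before it is the one that contains row.
	i := sort.Search(len(firstRowIndexes), func(i int) bool {
		return firstRowIndexes[i] > row
	})
	return i - 1
}

func main() {
	first := []int64{0, 1000, 2500, 2600}
	fmt.Println(pageContainingRow(first, 1200)) // 1
	fmt.Println(pageContainingRow(first, 2600)) // 3
}
```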
