Comments (2)
Is it worth using https://github.com/orijtech/structslop to optimize the order that we put the fields in each struct?
the parquet spec does not mention it unfortunately
Can we open a ticket? It seems like they should at least say one way or another what the treatment is.
the null count should fit in a 32-bit integer since the number of values and the number of nulls are stored as 32-bit integers in the data page header
I'm confused about this. Wouldn't we still need to store which pages specifically are null?
in order to reconstruct the first row index of a page, we need to sum the values of all deltas up to that page.
This, plus the added complexity, makes me think maybe we should wait until we deploy this and have a better sense of the size savings + access patterns to decide if it's worth doing.
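For reference, the reconstruction cost mentioned in the quote can be sketched as a prefix sum. This is only an illustration, not the code from the PR: the function name `firstRowIndex` and the delta layout (`deltas[0]` holds the first row index of page 0, and each later entry holds the difference from the previous page) are assumptions.

```go
package main

import "fmt"

// firstRowIndex rebuilds the absolute first row index of a page from a
// delta-encoded list. Assumed layout (hypothetical): deltas[0] is the first
// row index of page 0, and deltas[i] is the first row index of page i minus
// the first row index of page i-1.
func firstRowIndex(deltas []int32, page int) int64 {
	var sum int64
	// Every delta up to and including the target page has to be visited,
	// so a single lookup costs O(page) instead of O(1).
	for _, d := range deltas[:page+1] {
		sum += int64(d)
	}
	return sum
}

func main() {
	deltas := []int32{0, 100, 100, 50}
	fmt.Println(firstRowIndex(deltas, 3)) // 0 + 100 + 100 + 50 = 250
}
```

The linear scan per lookup is the complexity being weighed against the smaller encoding.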
from parquet-go.
Is it worth using https://github.com/orijtech/structslop to optimize the order that we put the fields in each struct?
I don't think we have a lot of room for optimization here. Most of the memory is going to be held in the backing arrays of the slices, and the fields are made of types that already align well (no small types interleaved with larger ones, etc.).
Can we open a ticket? It seems like they should at least say one way or another what the treatment is.
That seems like the right thing to do. I'll have to familiarize myself with the issue submission process for parquet-format (it doesn't seem to use GitHub issues).
I'm confused about this. Wouldn't we still need to store which pages specifically are null?
Yes. I think I was referring to the fact that the null counts are arrays of 64-bit integers in the thrift definition, which seems larger than necessary. I don't know why they were defined as int64 rather than int32, since the same values are represented with 32-bit integers in other places. Maybe they anticipated using the column index structures for aggregates as well, where the null counts may need more than 32 bits.
This, plus the added complexity, makes me think maybe we should wait until we deploy this and have a better sense of the size savings + access patterns to decide if it's worth doing.
I don't know if the access pattern is going to play a big part here: the indexes have to be loaded in memory in order to be effective (if they are kept on disk, we need to issue O(log(N)) I/O operations to binary search through the index). That's why I was trying to estimate the amount of memory needed to represent the indexes in memory. The more compact they are, the larger the dataset can be, and the more memory is left available for other parts of the system.
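As a rough sketch of the lookup I have in mind (the `findPage` helper and the `firstRows` layout are hypothetical, not the parquet-go API): with the absolute first row index of each page held in memory, finding the page that contains a row is a binary search with no I/O.

```go
package main

import (
	"fmt"
	"sort"
)

// findPage returns the index of the page containing the given row.
// firstRows is assumed to hold the absolute first row index of each page,
// sorted in ascending order. It returns -1 if row precedes the first page.
func findPage(firstRows []int64, row int64) int {
	// sort.Search finds the smallest i such that firstRows[i] > row;
	// the page containing the row is the one just before it.
	i := sort.Search(len(firstRows), func(i int) bool {
		return firstRows[i] > row
	})
	return i - 1
}

func main() {
	firstRows := []int64{0, 100, 200, 250}
	fmt.Println(findPage(firstRows, 199)) // 1
	fmt.Println(findPage(firstRows, 200)) // 2
}
```

Each probe of the search is a memory access here; with the index left on disk, each probe would become a read.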
from parquet-go.