Comments (7)
Thanks for sending the code snippet to reproduce the issue.
I took a look and I think parquet-go and your program behave as expected here; there is no memory leak.
If I'm understanding correctly, the high memory usage is a combination of two factors:
- your program dynamically creates writers, one for each dimension that you are generating a parquet file for
- the default configuration for a `parquet.Writer` is to allocate 1 MiB buffers for each column
The schema you are using has 5 columns, so each writer uses 5 MiB of memory. I ran the program on a subset of your example input (10% of the rows), which results in generating 183 files: 5 MiB x 183 = ~900 MiB, which is also the heap footprint I was able to measure by modifying your program to generate a memory profile.
Here is a capture of the memory profile, showing that memory is being allocated in 1 MiB blocks in the column buffers used by the writer:
It appears the parquet files you generate end up being pretty small, so these 1 MiB column buffers go mostly unused. I set the PageBufferSize configuration option to 64 KiB, which greatly reduced the memory footprint used by the column buffers. Here is a highlight comparing the memory profiles before and after the configuration change:
from parquet-go.
Hello @vbmithr, thanks for reaching out!
I've added documentation in #130 which describes how to use on-disk page buffers to avoid consuming large amounts of memory when writing large row groups; this might be useful for your use case.
Calling `Flush` on a writer should materialize a row group in the file as well; if memory isn't released or reused in this case it might be a bug. Could you share a code snippet reproducing the issue you reported?
Let me know if that helps.
I guess it is because it tries to put all the data in one row group. I have looked at the WriteRowgroup function, but it seems overkill to have to reimplement a lot just to avoid using all the RAM.
Is there a way to set a maximum number of values per row group, so that row groups are created on demand (and memory freed in between row groups)?
I also looked at having column buffers use swap, but really, what I'd like is to write the file so that not much memory is consumed while doing it, rather than having to swap a lot of data.
I'm probably missing something, but I haven't figured it out yet.
Best
Thanks Achille, I have just sent you a mail with details on how to reproduce.
I have reproduced the memory trace. Indeed, these report very little memory usage, but in practice my program still allocates (and needs!) 5 GiB of RAM to run correctly. I can understand why it consumes at least 2 GiB for loading the JSON file into memory, but I don't know where the last 3 GiB come from. Thanks for your help, anyway!
Unless I'm missing something, parquet-go is behaving as intended here and offers the ability to tune memory utilization to better suit your use case.
I'm closing the issue, feel free to reopen if you think further changes or discussions are needed.
`GOGC` did wonders, eventually. Thanks for the help, @achille-roussel!