Comments (7)
Thanks for sending the code snippet to reproduce the issue.
I took a look and I think parquet-go and your program behave as expected here; there is no memory leak.
If I'm understanding correctly, the high memory usage is a combination of two factors:
- your program dynamically creates writers, one for each dimension that you are generating a parquet file for
- the default configuration for a `parquet.Writer` is to allocate 1 MiB buffers for each column
The schema you are using has 5 columns, so each writer uses 5 MiB of memory. I ran the program on a subset of your example input (10% of the rows), which results in generating 183 files: 5 MiB x 183 = ~900 MiB, which is also the heap footprint I was able to measure by modifying your program to generate a memory profile.
Here is a capture of the memory profile, showing that memory is being allocated in 1 MiB blocks in the column buffers used by the writer:
It appears the parquet files you generate end up being pretty small, so these 1 MiB column buffers go mostly unused. I set the PageBufferSize configuration option to 64 KiB, which greatly reduced the memory footprint used by the column buffers. Here is a highlight comparing the memory profiles before and after the configuration change:
from parquet-go.
Hello @vbmithr, thanks for reaching out!
I've added documentation in #130 which describes how to use on-disk page buffers to avoid consuming large amounts of memory when writing large row groups; this might be useful for your use case.
Calling `Flush` on a writer should materialize a row group in the file as well; if memory isn't released or reused in this case it might be a bug. Could you share a code snippet reproducing the issue you reported?
Let me know if that helps.
I guess it is because it tries to put all the data in one row group. I have looked at the WriteRowgroup function, but it seems overkill to have to reimplement a lot just to avoid using all the RAM.
Is there a way to set a maximum number of values per row group, so that row groups are created on demand (and memory freed in between row groups)?
I also looked at having column buffers use swap, but really, what I'd like is to write the file so that not much memory is consumed while doing it, rather than having to swap a lot of data.
I'm probably missing something, but I haven't figured it out yet.
Best
Thanks Achille, I have just sent you a mail with details on how to reproduce.
I have reproduced the memory trace. Indeed, these report very little memory usage, but in practice my program still allocates (and needs!) 5 GiB of RAM to run correctly. I can understand why it consumes at least 2 GiB for loading the JSON file into memory, but I don't know where the last 3 GiB come from. Thanks for your help, anyway!
Unless I'm missing something, parquet-go is behaving as intended here and offers the ability to tune memory utilization to better suit your use case.
I'm closing the issue, feel free to reopen if you think further changes or discussions are needed.
`GOGC` did wonders, eventually. Thanks for the help, @achille-roussel!