
Comments (7)

achille-roussel commented on August 10, 2024

Thanks for sending the code snippet to reproduce the issue.

I took a look, and I think parquet-go and your program behave as expected here; there is no memory leak.

If I'm understanding correctly, the high memory usage is a combination of two factors:

  • your program dynamically creates writers, one for each dimension that you are generating a parquet file for
  • the default configuration for a parquet.Writer is to allocate 1 MiB buffers for each column

The schema you are using has 5 columns, so each writer uses 5 MiB of memory. I ran the program on a subset of your example input (10% of the rows), which generates 183 files: 5 MiB x 183 = 915 MiB, roughly 900 MiB, which matches the heap footprint I measured by modifying your program to generate a memory profile.
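The arithmetic above can be sketched in a few lines of Go. This is only a back-of-the-envelope estimate, not parquet-go code: the function name `writerFootprint` is hypothetical, and it assumes that all 183 writers are open at the same time and that each one holds exactly one fully allocated buffer per column.

```go
package main

import "fmt"

// mib is the number of bytes in one MiB.
const mib = 1 << 20

// writerFootprint is a hypothetical helper that estimates the bytes held by
// open writers, assuming one page buffer of pageBufferSize bytes per column
// per writer.
func writerFootprint(writers, columns, pageBufferSize int) int {
	return writers * columns * pageBufferSize
}

func main() {
	// 183 files, 5 columns, 1 MiB default buffer per column (from the thread).
	total := writerFootprint(183, 5, mib)
	fmt.Printf("estimated footprint: %d MiB\n", total/mib)
}
```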

Here is a capture of the memory profile, showing that memory is being allocated in 1 MiB blocks in the column buffers used by the writer:

image

It appears the parquet files you generate end up being pretty small, so these 1 MiB column buffers go mostly unused. I set the PageBufferSize configuration option to 64 KiB to reduce the memory footprint of each writer, which greatly reduced the memory used by the column buffers. Here is a comparison of the memory profiles before and after the configuration change:

image
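To give a rough sense of the expected reduction, here is a small estimate comparing the two buffer sizes mentioned above. It is a sketch under the same simplifying assumption (every writer open at once, one buffer per column), it does not call parquet-go, and the helper name `footprintKiB` is made up for illustration. The measured profile may differ from this estimate.

```go
package main

import "fmt"

// footprintKiB is a hypothetical helper that estimates the KiB held by
// column buffers, assuming one buffer of bufferKiB per column per writer.
func footprintKiB(writers, columns, bufferKiB int) int {
	return writers * columns * bufferKiB
}

func main() {
	before := footprintKiB(183, 5, 1024) // 1 MiB buffers (the default cited in the thread)
	after := footprintKiB(183, 5, 64)    // 64 KiB buffers

	fmt.Printf("before: %.1f MiB\n", float64(before)/1024)
	fmt.Printf("after:  %.1f MiB\n", float64(after)/1024)
	fmt.Printf("reduction factor: %dx\n", before/after)
}
```

Under these assumptions the estimate drops by a factor of 16, the ratio between 1 MiB and 64 KiB.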

from parquet-go.

achille-roussel commented on August 10, 2024

Hello @vbmithr, thanks for reaching out!

I've added documentation in #130 that describes how to use on-disk page buffers to avoid consuming large amounts of memory when writing large row groups; this might be useful for your use case.

Calling Flush on a writer should materialize a row group in the file as well; if memory isn't released or reused in this case, it might be a bug. Could you share a code snippet reproducing the issue you reported?

Let me know if that helps.


vbmithr commented on August 10, 2024

I guess it is because it tries to put all the data in one row group. I have looked at the WriteRowgroup function, but it seems overkill to have to reimplement so much just to avoid using all the RAM.
Is there a way to set the maximum number of values per row group, so that row groups are created on demand (and memory is freed in between row groups)?
I also looked at using column buffers backed by swap, but what I'd really like is to write the file without consuming much memory, not to swap a lot of data.
I am probably misunderstanding something, but I haven't figured out what yet.

Best


vbmithr commented on August 10, 2024

Thanks Achille, I have just sent you a mail with details on how to reproduce.


vbmithr commented on August 10, 2024

I have reproduced the memory trace. Indeed, it reports very little memory usage, but in practice my program still allocates (and needs!) 5 GiB of RAM to run correctly. I can understand why it consumes at least 2 GiB for loading the JSON file into memory, but I don't know where the remaining 3 GiB comes from. Thanks for your help, anyway!


achille-roussel commented on August 10, 2024

Unless I'm missing something, parquet-go is behaving as intended here and offers the ability to tune memory utilization to better suit your use case.

I'm closing the issue, feel free to reopen if you think further changes or discussions are needed.


vbmithr commented on August 10, 2024

GOGC did wonders, eventually. Thanks for the help, @achille-roussel!
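For readers unfamiliar with GOGC: it sets the Go garbage collector's target percentage. With the default of 100, the heap is allowed to grow to roughly twice the live data before the next collection, so a lower value trades CPU time for a smaller peak heap. The thread does not say which value was used; the sketch below picks 50 purely as an assumption and sets it programmatically through the standard library's `runtime/debug.SetGCPercent`, which has the same effect as setting the `GOGC` environment variable. The wrapper name `tuneGC` is hypothetical.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// tuneGC is a hypothetical wrapper that sets the garbage collection target
// percentage and returns the previous setting.
func tuneGC(percent int) int {
	return debug.SetGCPercent(percent)
}

func main() {
	// 50 is an illustrative value, not the one used in the thread.
	prev := tuneGC(50)
	fmt.Printf("GC percent changed from %d to 50\n", prev)
}
```

The same setting can be applied without code changes by running the program with the environment variable set, for example `GOGC=50`.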

