Hello, Using arrow adapter, I became aware that the memory (RAM) footprint of the


Huge memory taken for each field when exporting

apache commented on July 25, 2024

Huge memory taken for each field when exporting

from orc.

Comments (14)

coderex2522 commented on July 25, 2024 5

@dongjoon-hyun @wgtmac @LouisClt I will follow up on this issue (ORC-1280) and implement smarter memory management.



luffy-zh commented on July 25, 2024 1

I will work on it.


dongjoon-hyun commented on July 25, 2024

cc @wgtmac , @stiga-huang , @coderex2522


coderex2522 commented on July 25, 2024

@LouisClt To support the zero-copy mechanism, the BufferedOutputStream class has an internal data buffer whose default capacity is 1MB. This default capacity should be made configurable, but note that if the capacity is set too small, the buffer may have to expand often, and every expansion triggers a memcpy.
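The trade-off described above can be illustrated with a small sketch. It is not ORC code: `ContiguousBuffer` and its counters are hypothetical names, and the doubling growth policy is an assumption made only for the illustration. It shows that a contiguous buffer with a small initial capacity has to reallocate and copy its existing contents each time it outgrows that capacity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <utility>

// Hypothetical contiguous buffer (NOT ORC's DataBuffer or BufferedOutputStream).
// It counts how many bytes are copied because of capacity expansions.
class ContiguousBuffer {
 public:
  explicit ContiguousBuffer(std::size_t capacity)
      : data_(new char[capacity]), capacity_(capacity) {
    assert(capacity > 0);
  }

  void append(const char* src, std::size_t len) {
    if (size_ + len > capacity_) {
      // Assumed policy: double the capacity until the new data fits.
      std::size_t newCapacity = capacity_;
      while (newCapacity < size_ + len) newCapacity *= 2;
      std::unique_ptr<char[]> bigger(new char[newCapacity]);
      // This copy of the existing contents is the cost of starting too small.
      std::memcpy(bigger.get(), data_.get(), size_);
      bytesCopied_ += size_;
      ++reallocations_;
      data_ = std::move(bigger);
      capacity_ = newCapacity;
    }
    std::memcpy(data_.get() + size_, src, len);
    size_ += len;
  }

  std::size_t size() const { return size_; }
  std::size_t capacity() const { return capacity_; }
  std::size_t bytesCopied() const { return bytesCopied_; }
  std::size_t reallocations() const { return reallocations_; }

 private:
  std::unique_ptr<char[]> data_;
  std::size_t capacity_;
  std::size_t size_ = 0;
  std::size_t bytesCopied_ = 0;
  std::size_t reallocations_ = 0;
};
```

Appending 16 bytes in 4-byte chunks to a buffer created with capacity 4 reallocates twice in this sketch, while a buffer created with capacity 16 never reallocates. A larger default avoids the copies at the price of memory that may stay unused.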


wgtmac commented on July 25, 2024

We may replace the DataBuffer with a new Buffer implementation that has smarter memory management and automatically grows and shrinks according to actual usage. This management can happen on a per-column basis.
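One possible shape for such a buffer is sketched below, purely as an illustration. `ChunkedBuffer` and all of its methods are hypothetical names and do not reflect any actual ORC class. The sketch assumes the buffer grows by allocating fixed-size blocks on demand, so existing data is never copied on growth, and that blocks which are no longer needed can be released explicitly.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

// Hypothetical block-based buffer (an illustration, not ORC's implementation).
class ChunkedBuffer {
 public:
  explicit ChunkedBuffer(std::size_t blockSize) : blockSize_(blockSize) {
    assert(blockSize > 0);
  }

  // Grow: allocate a new block only when the current ones are full.
  void append(const char* src, std::size_t len) {
    while (len > 0) {
      const std::size_t blockIndex = size_ / blockSize_;
      const std::size_t offset = size_ % blockSize_;
      if (blockIndex == blocks_.size()) {
        blocks_.push_back(std::make_unique<char[]>(blockSize_));
      }
      const std::size_t n = std::min(len, blockSize_ - offset);
      std::memcpy(blocks_[blockIndex].get() + offset, src, n);
      src += n;
      len -= n;
      size_ += n;
    }
  }

  // Forget the contents but keep the blocks for reuse.
  void reset() { size_ = 0; }

  // Shrink: release every block not needed for the current contents.
  void shrinkToFit() {
    blocks_.resize((size_ + blockSize_ - 1) / blockSize_);
  }

  char at(std::size_t i) const {
    assert(i < size_);
    return blocks_[i / blockSize_][i % blockSize_];
  }

  std::size_t size() const { return size_; }
  std::size_t capacity() const { return blocks_.size() * blockSize_; }

 private:
  std::size_t blockSize_;
  std::vector<std::unique_ptr<char[]>> blocks_;
  std::size_t size_ = 0;
};
```

With a block size of 4, appending 10 bytes allocates three blocks (capacity 12). Calling `reset()` followed by `shrinkToFit()` returns the capacity to 0, so a column that stops producing data stops holding memory.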


LouisClt commented on July 25, 2024

Thanks everyone for your answers. I understand the possible performance issues that come with lowering the buffer size too much (although it was fine in my own testing).
I think the solution proposed by @wgtmac would work for me, and would be better than going through global variables, if it is feasible.


wgtmac commented on July 25, 2024

I have created a JIRA to track the progress: https://issues.apache.org/jira/browse/ORC-1280


dongjoon-hyun commented on July 25, 2024

Thank you, @coderex2522 .


LouisClt commented on July 25, 2024

Hello, it seems there were commits referencing this issue. Is this issue fixed now?


wgtmac commented on July 25, 2024

> Hello, it seems there were commits referencing this issue. Is this issue fixed now?

@LouisClt Thanks for your follow-up.

We have implemented a block-based buffer called BlockBuffer (by @coderex2522) and used it to replace the output buffer in the CompressionStream. It can decrease the memory footprint to some extent.

IMO, the next step is to use it to replace the input buffer of the CompressionStream which has the size of compressionBlockSize per stream.

To be precise, the rawInputBuffer of every CompressionStream is fixed to the compression block size, which is 1MB by default. A writer with many columns will therefore suffer from a large memory footprint, and currently nothing can be done to alleviate it.
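To get a feel for the numbers, here is a rough back-of-the-envelope helper. It is only an estimate: `rawInputBufferBytes` is a hypothetical function, the number of streams per column is an assumption (it depends on column type and encoding), and the 1MB default is taken to mean 1024 * 1024 bytes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of the ORC API. It estimates only the memory
// held by fixed-size raw input buffers, one per compression stream.
// streamsPerColumn is an assumed average, not a value taken from ORC.
constexpr std::uint64_t rawInputBufferBytes(std::uint64_t columns,
                                            std::uint64_t streamsPerColumn,
                                            std::uint64_t compressionBlockSize) {
  return columns * streamsPerColumn * compressionBlockSize;
}

// Assumed interpretation of the "1M" default compression block size.
constexpr std::uint64_t kAssumedBlockSize = 1024 * 1024;
```

Under these assumptions, a writer with 1000 columns and two streams per column would hold `rawInputBufferBytes(1000, 2, kAssumedBlockSize)`, that is 2,097,152,000 bytes (about 2 GB), before any row data is written.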

I have created a JIRA to track it: https://issues.apache.org/jira/browse/ORC-1365

cc @coderex2522


LouisClt commented on July 25, 2024

Thanks for your reply, @wgtmac, and for the implementation of the BlockBuffer.
I'll wait for the rawInputBuffer to be replaced by the BlockBuffer in every compression stream, then. Do you think it will take long?


dongjoon-hyun commented on July 25, 2024

Hi, @LouisClt. FYI, according to the Apache ORC release cycle, newly developed features will be delivered in v1.9.0 in September 2023 (provided they are merged into Apache ORC before then).

https://github.com/apache/orc/milestones


LouisClt commented on July 25, 2024

Understood, and thanks for your answer!

