I have a long-running application based on parquet4s that receives a continuous stream

How to write in streaming fashion? (parquet4s, closed)

mjakubowski84 commented on May 30, 2024

How to write in streaming fashion?

from parquet4s.

Comments (6)

mjakubowski84 commented on May 30, 2024

Hi @mac01021!
If Scala's Stream doesn't help, you can try splitting your large batch into smaller batches. Check the recently introduced IncrementalParquetWriter: https://github.com/mjakubowski84/parquet4s/blob/master/core/src/main/scala/com/github/mjakubowski84/parquet4s/IncrementalParquetWriter.scala
With it, you should be able to write multiple batches to a single file before closing it.
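
To make the "many batches, one close" shape concrete, here is a minimal sketch in plain Scala. It deliberately does not call parquet4s: `BatchSink`, `InMemorySink` and `writeInBatches` are hypothetical names invented for illustration, and the in-memory sink only stands in for a real incremental writer, whose exact constructor and signature should be checked in the linked source.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for an incremental writer (NOT the parquet4s API):
// it accepts any number of batches and is closed exactly once at the end.
trait BatchSink[T] extends AutoCloseable {
  def write(batch: Iterable[T]): Unit
}

// In-memory sink, used only so the sketch runs without any dependency.
final class InMemorySink[T] extends BatchSink[T] {
  val batches: ArrayBuffer[Vector[T]] = ArrayBuffer.empty
  var closed: Boolean = false

  def write(batch: Iterable[T]): Unit = {
    require(!closed, "sink already closed")
    batches += batch.toVector
  }

  def close(): Unit = closed = true
}

object StreamingWrite {
  // Splits the incoming records into batches of at most `batchSize`
  // and writes them one by one. The sink is closed once, even if a
  // write throws, so the output is always finalized.
  def writeInBatches[T](records: Iterator[T], batchSize: Int, sink: BatchSink[T]): Unit =
    try records.grouped(batchSize).foreach(batch => sink.write(batch))
    finally sink.close()
}
```

Because `records` is an `Iterator`, only one batch is held in memory at a time, which is the point of splitting a large batch in the first place.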

mac01021 commented on May 30, 2024

Thanks @mjakubowski84!

This looks like just what I need!

Can you provide any guidance related to the cost of a batch? Will the file come out the same whether I write it in a few large batches vs many tiny batches?

mjakubowski84 commented on May 30, 2024

Hard to tell! It is strongly correlated with the size of your entity. Performance also depends on the speed and latency of your IO. I recommend running a series of experiments with different setups.

mac01021 commented on May 30, 2024

But does it do anything specific I should know about, like creating a new row group for every batch?

mjakubowski84 commented on May 30, 2024

No, nothing differs from the normal writer except that you have to call close yourself. Row groups, pages, etc. are still created according to the options provided to the writer.
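
As a rough illustration of why batch size does not shape the file, here is a toy model in plain Scala. It is a simplification and an assumption, not parquet4s or Parquet internals: the hypothetical `ToyRowGroupWriter` flushes a "row group" after a fixed number of rows, whereas a real writer sizes row groups in bytes according to its options. What matters is that the flush decision depends only on the buffered rows, never on where a batch ends.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy model only: flushes a "row group" every `rowsPerGroup` rows.
// Real Parquet writers use a byte-size threshold taken from the writer options.
final class ToyRowGroupWriter[T](rowsPerGroup: Int) {
  require(rowsPerGroup > 0, "rowsPerGroup must be positive")

  private val buffer = ArrayBuffer.empty[T]
  private val flushed = ArrayBuffer.empty[Vector[T]]

  // Rows are buffered one by one; batch boundaries are not remembered.
  def write(batch: Iterable[T]): Unit = batch.foreach { row =>
    buffer += row
    if (buffer.size == rowsPerGroup) flush()
  }

  // Flushes any remaining rows and returns all row groups written.
  def close(): Vector[Vector[T]] = {
    if (buffer.nonEmpty) flush()
    flushed.toVector
  }

  private def flush(): Unit = {
    flushed += buffer.toVector
    buffer.clear()
  }
}

object ToyRowGroupDemo {
  // Writes `rows` in batches of `batchSize` and returns the resulting groups.
  def groupsFor(rows: Seq[Int], batchSize: Int, rowsPerGroup: Int): Vector[Vector[Int]] = {
    val writer = new ToyRowGroupWriter[Int](rowsPerGroup)
    rows.grouped(batchSize).foreach(batch => writer.write(batch))
    writer.close()
  }
}
```

In this model, `ToyRowGroupDemo.groupsFor(0 until 10, 5, 4)` and `ToyRowGroupDemo.groupsFor(0 until 10, 1, 4)` return the same groups, which mirrors the answer above: the layout follows the writer's options, not the batching.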

mac01021 commented on May 30, 2024

Awesome! Thanks!
