Apache Iceberg version 1.4.3 (latest release)

Parquet bloom filter doesn't work with nested fields

hussein-awala commented on September 28, 2024

from iceberg.

Comments (8)

hussein-awala commented on September 28, 2024

@amogh-jahagirdar I created #9902 to test if the bloom filters are added to the files, and they seem to be added for the nested field.

I will try creating my table with different catalogs to check whether the problem is related to the catalog. It could also be related to how the data is written: these tests use the FileAppender directly, so I will try writing the data through the Spark API in these tests.

amogh-jahagirdar commented on September 28, 2024

Thanks for reporting, @hussein-awala. I'm taking a look; I don't remember whether there is some limitation preventing us from supporting bloom filters on nested fields. At least your test shows that there is no limitation on the Parquet side.

amogh-jahagirdar commented on September 28, 2024

Sounds great! I'll take a look at the PR.

I spent some time debugging this today and added a test in the same class to try to reproduce the problem, and I saw the same thing. In ColumnWriteStoreBase, we are actually initializing the Parquet BloomFilterWriter with a valid bloom filter for the nested type. So the table property for nested types is being passed through correctly.

As you said, maybe the Spark APIs go through a different path that somehow ends up losing the configuration. I doubt the catalog changes anything, since this is more about the write path, but feel free to go ahead and try it out.

But this is promising in the sense that we already support this (our Parquet dependency supports it, etc.); we just need to identify why the Spark API (or whatever mechanism was used in the issue description) does not write the bloom filter for nested types.
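
For readers following along, here is a minimal sketch of how the table property discussed above could be assembled for a nested field. It assumes the documented Iceberg property prefix `write.parquet.bloom-filter-enabled.column.` and assumes that a nested field is addressed by its dotted path; the table name `db.events` and the column `payload.user_id` are hypothetical and do not come from this issue.

```python
# Sketch only: builds property keys and a Spark SQL statement as plain strings.
# Assumption: nested fields are referenced by their dotted path (e.g. "payload.user_id").
BLOOM_FILTER_PREFIX = "write.parquet.bloom-filter-enabled.column."


def bloom_filter_properties(columns):
    """Map each column path to its per-column bloom filter table property."""
    return {BLOOM_FILTER_PREFIX + col: "true" for col in columns}


def alter_table_sql(table, properties):
    """Render an ALTER TABLE ... SET TBLPROPERTIES statement for Spark SQL."""
    assignments = ", ".join(
        f"'{key}' = '{value}'" for key, value in sorted(properties.items())
    )
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({assignments})"


# Hypothetical table and columns: one root field and one nested field.
props = bloom_filter_properties(["id", "payload.user_id"])
statement = alter_table_sql("db.events", props)
print(statement)
```

The resulting statement could be passed to `spark.sql(...)` in a Spark session configured with an Iceberg catalog; only files written after the property is set would be expected to carry bloom filters.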

amogh-jahagirdar commented on September 28, 2024

Also, I'm curious which Spark version you are using. I just tested with Spark 3.4 and Spark 3.5, and bloom filters for the nested type appear to be written out, based on the parquet-cli output (just a struct with a single integer field).

hussein-awala commented on September 28, 2024

I use Spark 3.5, Iceberg 1.4.3, and Glue Catalog.

> and bloom filters for nested type appear to be written out

Interesting, I will retest it on Monday morning.

hussein-awala commented on September 28, 2024

@amogh-jahagirdar
Today I found out that I have this issue on a single table only. I tried nested and root fields, with single and multiple bloom filters, and none of them worked. This table has a large number of columns (over 100); I don't know yet whether that is related to the issue. I will continue my investigation and update the issue once I find the source.

I think #9902 is ready to merge.

cccs-jory commented on September 28, 2024

Hello @hussein-awala, if you're testing with a relatively small table that has a small number of distinct values, Spark may be using dictionary encoding for the values. We have discovered in our testing that if Spark is able to dictionary-encode the values in the Parquet file, it will not write the bloom filter (which is by design).
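
To make the point above concrete, here is a rough sketch for judging whether test data is likely to stay dictionary-encoded. It is a simplified heuristic, not the writer's actual fallback logic: it only compares an estimated dictionary size against a limit, assumed here to be 2 MB (the documented Iceberg default for `write.parquet.dict-size-bytes`). The function names and sample data are hypothetical.

```python
import uuid

# Assumption: dictionary size limit of 2 MB (Iceberg's documented default for
# write.parquet.dict-size-bytes). The real writer applies additional checks.
DICT_SIZE_LIMIT_BYTES = 2 * 1024 * 1024


def estimated_dictionary_bytes(values):
    """Estimate dictionary size: each distinct string once, as UTF-8 bytes plus a 4-byte length."""
    return sum(len(v.encode("utf-8")) + 4 for v in set(values))


def likely_dictionary_encoded(values, limit=DICT_SIZE_LIMIT_BYTES):
    """Return True when the estimated dictionary fits under the size limit."""
    return estimated_dictionary_bytes(values) <= limit


# Five distinct values repeated many times: a tiny dictionary.
low_cardinality = ["status-%d" % (i % 5) for i in range(100_000)]
# Random UUID strings: practically every value is distinct.
high_cardinality = [str(uuid.uuid4()) for _ in range(100_000)]

print(likely_dictionary_encoded(low_cardinality))
print(likely_dictionary_encoded(high_cardinality))
```

Under this heuristic, a test that is meant to verify bloom filters would use data closer to the high-cardinality column, so that the writer cannot keep the whole column chunk dictionary-encoded.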

arthurpassos commented on September 28, 2024

@hussein-awala, which tool are you using to inspect the Parquet files? You mentioned parquet-cli, but a Google search leads to https://github.com/chhantyal/parquet-cli and/or https://pypi.org/project/parquet-cli/, which do not seem to offer the same API.

Also, are you aware of any document that describes bloom filter support for Map/Struct types?

It is just a question; I am not otherwise involved in this issue.
