
Comments (8)

hussein-awala commented on September 28, 2024

@amogh-jahagirdar I created #9902 to test if the bloom filters are added to the files, and they seem to be added for the nested field.

I will try creating my table with different catalogs to check whether the problem is catalog-related. It could also be related to how the data is written: these tests use the FileAppender directly, so I will try writing the data through the Spark API in these tests instead.

from iceberg.

amogh-jahagirdar commented on September 28, 2024

Thanks for reporting, @hussein-awala. I'm taking a look. I don't remember whether some limitation prevents us from supporting bloom filters on nested fields. At least the test you did shows that there is no limitation on the Parquet side.


amogh-jahagirdar commented on September 28, 2024

Sounds great! I'll take a look at the PR.

I spent some time debugging this today and added a test in the same class to try to reproduce it, and I saw the same thing. In ColumnWriteStoreBase we are actually initializing the Parquet BloomFilterWriter with a valid bloom filter for the nested type, so the table property for nested types is being passed through correctly.

As you said, maybe the Spark APIs go through a different path that somehow loses the configuration. I doubt the catalog changes anything, since this is more about the write path, but feel free to go ahead and try it out.

But this is promising in the sense that we already support this (our Parquet dependency supports it, etc.); we just need to identify why the Spark API (or whatever mechanism was used in the issue description) does not write the bloom filter for nested types.
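For readers who want to reproduce the setup, here is a minimal sketch of how the per-column bloom filter table property might be assembled for a nested field. It assumes the documented Iceberg property prefix `write.parquet.bloom-filter-enabled.column.` and assumes that a nested field is addressed by its dotted path; the table name `db.events` and the column names are hypothetical, and the helper functions are illustrative only, not part of Iceberg or Spark.

```python
from typing import Sequence

# Documented Iceberg table property prefix for enabling a bloom filter per column.
BLOOM_PREFIX = "write.parquet.bloom-filter-enabled.column."


def bloom_filter_property(column_path: str) -> str:
    """Build the property key for a column path such as 'id' or 'address.zip'.

    Assumption: nested fields are addressed by their dotted path.
    """
    if not column_path or any(not part for part in column_path.split(".")):
        raise ValueError(f"invalid column path: {column_path!r}")
    return BLOOM_PREFIX + column_path


def alter_table_sql(table: str, column_paths: Sequence[str]) -> str:
    """Build a Spark SQL statement that enables bloom filters for the given columns."""
    props = ", ".join(
        f"'{bloom_filter_property(path)}' = 'true'" for path in column_paths
    )
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({props})"


if __name__ == "__main__":
    # Hypothetical table and columns, for illustration only.
    print(alter_table_sql("db.events", ["id", "address.zip"]))
```

The resulting statement could then be passed to `spark.sql(...)` in a session configured with an Iceberg catalog; that step is left out here so the sketch stays runnable without Spark.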


amogh-jahagirdar commented on September 28, 2024

Also, which Spark version are you using? I just tested with Spark 3.4 and Spark 3.5, and bloom filters for the nested type appear to be written out, based on the parquet-cli output (just a struct with a single integer field).


hussein-awala commented on September 28, 2024

I use Spark 3.5, Iceberg 1.4.3, and Glue Catalog.

> and bloom filters for nested type appear to be written out

Interesting, I will retest it on Monday morning.


hussein-awala commented on September 28, 2024

@amogh-jahagirdar
Today I found out that I have this issue on a single table only. I tried nested and root fields, with single and multiple bloom filters, and none of them worked. This table has a large number of columns (over 100); I don't know yet whether that is related to the issue. I will continue investigating and update the issue once I find the source.

I think #9902 is ready to merge.


cccs-jory commented on September 28, 2024

Hello @hussein-awala, if you're testing with a relatively small table that has few distinct values, Spark may be dictionary-encoding the values. In our testing we found that if Spark is able to dictionary-encode the values in the Parquet file, it will not write the bloom filter (which is by design).
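The behavior described above can be illustrated with a toy model. This is only a sketch of the reasoning, not the Parquet writer's actual logic: it assumes a column stays dictionary-encoded while its distinct values fit within a size budget, and that a bloom filter is written only once that budget is exceeded. The function names and the `dict_budget_bytes` parameter are hypothetical.

```python
def dictionary_size_bytes(values) -> int:
    """Approximate dictionary size as the UTF-8 byte length of the distinct values."""
    return sum(len(str(v).encode("utf-8")) for v in set(values))


def would_write_bloom_filter(values, dict_budget_bytes: int) -> bool:
    """Toy rule (assumption): the bloom filter is written only when the
    dictionary no longer fits in the budget and the writer falls back to
    plain encoding."""
    return dictionary_size_bytes(values) > dict_budget_bytes


# Few distinct values: the dictionary stays tiny, so this model skips the bloom filter.
low_cardinality = ["US", "FR", "DE"] * 1000

# Many distinct values: the dictionary outgrows the budget, so the bloom filter is written.
high_cardinality = [f"user-{i:08d}" for i in range(1000)]

if __name__ == "__main__":
    budget = 1024  # hypothetical budget, chosen small for illustration
    print(would_write_bloom_filter(low_cardinality, budget))
    print(would_write_bloom_filter(high_cardinality, budget))
```

Under this model, a test table meant to verify bloom filters would need enough distinct values in the target column to push the writer past dictionary encoding.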


arthurpassos commented on September 28, 2024

@hussein-awala, which tool are you using to inspect the Parquet files? You mentioned parquet-cli, but a Google search leads to https://github.com/chhantyal/parquet-cli and/or https://pypi.org/project/parquet-cli/, which do not seem to offer the same API.

Also, are you aware of any document that describes bloom filter support for Map/Struct types?

This is just a question; I am not otherwise involved in the issue.
