Comments (8)
@amogh-jahagirdar I created #9902 to test whether the bloom filters are added to the files, and they do seem to be added for the nested field.
I will try to create my table with different catalogs to check whether the problem is related to the catalog. It could also be related to how the data is written: these tests use the FileAppender directly, so I will try using the Spark API in these tests to write the data.
from iceberg.
Thanks for reporting, @hussein-awala. I'm taking a look; I forget whether there's some limitation preventing us from supporting bloom filters on nested fields. At least the test you did shows that there is no limitation on the Parquet side.
from iceberg.
Sounds great! I'll take a look at the PR.
I spent some time debugging this today. I added a test in the same class to try to reproduce the problem, and I saw the same thing. In ColumnWriteStoreBase we are actually initializing the Parquet BloomFilterWriter with a valid bloom filter for the nested type, so the table property for nested types is being passed through correctly.
As you said, maybe the Spark APIs go through a different path that somehow loses the configuration. I doubt the catalog changes anything, since this is more about the write path, but feel free to go ahead and try it out.
This is promising in the sense that we already support this (our Parquet dependency supports it, etc.); we just need to identify why the Spark API (or whatever mechanism was used in the issue description) does not write the bloom filter for nested types.
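For reference, the table property discussed above can be sketched as follows. This is a minimal illustration, not the code used in the test: it assumes the nested field is addressed by its dotted path after the `write.parquet.bloom-filter-enabled.column.` prefix, and the table and column names (`demo.db.events`, `payload.user_id`) are hypothetical. It only builds the property key and a Spark SQL statement as strings, so it runs without Spark.

```python
# Prefix of the Iceberg table property that enables a Parquet bloom filter per column.
BLOOM_FILTER_PREFIX = "write.parquet.bloom-filter-enabled.column."


def bloom_filter_property(column_path: str) -> str:
    """Return the property key for a column, e.g. a dotted nested path 'struct.field'."""
    # Assumption: nested fields are referenced by their dotted column path.
    if not column_path or column_path.startswith(".") or column_path.endswith("."):
        raise ValueError(f"invalid column path: {column_path!r}")
    return BLOOM_FILTER_PREFIX + column_path


def alter_table_statement(table: str, column_paths: list) -> str:
    """Build a Spark SQL statement enabling bloom filters for the given columns."""
    props = ", ".join(f"'{bloom_filter_property(p)}' = 'true'" for p in column_paths)
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({props})"


if __name__ == "__main__":
    # Hypothetical table and columns, for illustration only.
    print(alter_table_statement("demo.db.events", ["id", "payload.user_id"]))
```

The resulting statement could be passed to `spark.sql(...)` before writing, so that the same property is exercised through the Spark write path rather than through the FileAppender.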
from iceberg.
Also curious: which Spark version are you using? I just tested with Spark 3.4 and Spark 3.5, and bloom filters for the nested type appear to be written out, based on the parquet-cli output (the test used just a struct with a single integer field).
from iceberg.
I use Spark 3.5, Iceberg 1.4.3, and Glue Catalog.
> and bloom filters for nested type appear to be written out
Interesting, I will retest it on Monday morning.
from iceberg.
@amogh-jahagirdar
Today I found out that I have this issue on a single table only. I tried with nested and root fields, and with single and multiple bloom filters, and none of them worked. This table contains a large number of columns (over 100); I don't know yet whether that is related to the issue. I will continue my investigation and update the issue once I find its source.
I think #9902 is ready to merge.
from iceberg.
Hello @hussein-awala, if you're testing with a relatively small table that has a small number of distinct values, Spark may be using dictionary encoding for the values. We have discovered in our testing that if Spark is able to dictionary-encode the values in the Parquet file, it will not write the bloom filter (which is by design).
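To make the dictionary-encoding point concrete, here is a deliberately simplified model, not the actual writer logic. It assumes a column chunk stays dictionary-encoded whenever its distinct values fit under a dictionary size limit, and that a bloom filter is written only once the writer falls back from dictionary encoding. The function names are hypothetical, and the 2 MiB default limit is an assumption; the real threshold depends on the writer configuration.

```python
def fits_in_dictionary(distinct_values: int, avg_value_bytes: int,
                       dict_size_limit_bytes: int = 2 * 1024 * 1024) -> bool:
    """Rough check: does the dictionary for one column chunk stay under the limit?"""
    # Assumption: dictionary size is approximated as distinct values times value size.
    return distinct_values * avg_value_bytes <= dict_size_limit_bytes


def bloom_filter_expected(distinct_values: int, avg_value_bytes: int,
                          dict_size_limit_bytes: int = 2 * 1024 * 1024) -> bool:
    """In this toy model, a bloom filter appears only when dictionary encoding fails."""
    return not fits_in_dictionary(distinct_values, avg_value_bytes, dict_size_limit_bytes)


if __name__ == "__main__":
    # A small test table: 1,000 distinct 8-byte values fit in the dictionary.
    print(bloom_filter_expected(1_000, 8))
    # A high-cardinality column: 1,000,000 distinct 8-byte values do not fit.
    print(bloom_filter_expected(1_000_000, 8))
```

Under this model, a test that wants to see a bloom filter in the file should write enough distinct values to push the column past the dictionary limit.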
from iceberg.
@hussein-awala, which tool are you using to inspect the Parquet files? You mentioned parquet-cli, but a Google search leads to https://github.com/chhantyal/parquet-cli and/or https://pypi.org/project/parquet-cli/, which do not seem to offer the same API.
Plus, are you guys aware of any document that describes bloom filter support for Map/Struct type?
It is just a question, I am not in the issue loop.
from iceberg.