Comments (8)
@rdblue I see a comment in ParquetMetrics$fromMetadata, which says "TODO: allow struct nesting, but not maps or arrays".
Could you describe why we need to exclude maps and arrays? We have proper statistics for keys/values in maps as well as for elements in lists, don't we? For example, we might support predicates on elements of lists at some point.
I am asking because the simplest option to enable statistics for nested structs is to replace fileSchema.asStruct().field(fieldId) with fileSchema.findField(fieldId) in ParquetMetrics$fromMetadata. If we want to collect statistics only for nested structs, the logic will be more complicated.
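To illustrate the difference between the two lookups, here is a minimal sketch on a toy schema model. The `Field` class and the `topLevelField`, `findField`, and `sampleSchema` helpers are hypothetical stand-ins written for this example, not Iceberg's own classes; they only mimic a lookup restricted to the root struct versus one that searches the whole schema by field ID.

```java
import java.util.Arrays;
import java.util.List;

// Toy model for illustration only; these are NOT Iceberg's classes.
class FieldLookupSketch {
    // A field with an ID, a name, and (for structs) nested child fields.
    static final class Field {
        final int id;
        final String name;
        final List<Field> children;

        Field(int id, String name, Field... children) {
            this.id = id;
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    // Top-level lookup: only direct children of the root struct are considered.
    static Field topLevelField(List<Field> root, int id) {
        for (Field f : root) {
            if (f.id == id) {
                return f;
            }
        }
        return null;
    }

    // Recursive lookup: descends into every nested field as well.
    static Field findField(List<Field> fields, int id) {
        for (Field f : fields) {
            if (f.id == id) {
                return f;
            }
            Field nested = findField(f.children, id);
            if (nested != null) {
                return nested;
            }
        }
        return null;
    }

    // Hypothetical schema: id (1), location (2) { lat (3), lon (4) }.
    static List<Field> sampleSchema() {
        return Arrays.asList(
            new Field(1, "id"),
            new Field(2, "location", new Field(3, "lat"), new Field(4, "lon")));
    }

    public static void main(String[] args) {
        List<Field> schema = sampleSchema();
        System.out.println(topLevelField(schema, 3));  // prints "null": lat is nested
        System.out.println(findField(schema, 3).name); // prints "lat"
    }
}
```

A recursive lookup like this does not distinguish children of structs from children of maps or lists, which is why restricting metrics to nested structs would need extra logic.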
from iceberg.
The reason for now is that this is how the stats and dictionary filters work. If a column has stats, then it is assumed to be at the top level. Structs are logical groupings and don't affect the meaning of a filter, so they are okay.
If we want to collect stats from maps and lists, then we need to have a plan for how to use them. That means implementing more filters like contains, containsKey, or containsValue. Since we don't have those yet, we don't need to store the data for them.
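As a rough sketch of the distinction, assume a file stores numeric lower/upper bounds per column. For a top-level or struct-nested column each row has exactly one value, so an equality filter can use the bounds directly. For list elements the bounds cover every element of every row, so they only make sense for a different predicate such as contains. The class and method names below are hypothetical and do not correspond to existing Iceberg evaluators.

```java
// Illustrative sketch only; names are hypothetical, not Iceberg APIs.
class BoundsFilterSketch {
    // "col = value" on a single-valued column: the file can be skipped
    // when the value lies outside the column's [lower, upper] range.
    static boolean canSkipForEquals(long lower, long upper, long value) {
        return value < lower || value > upper;
    }

    // Hypothetical "contains(listCol, value)": the bounds describe all list
    // elements in the file, so the file can be skipped when no element can
    // possibly equal the value.
    static boolean canSkipForContains(long elemLower, long elemUpper, long value) {
        return value < elemLower || value > elemUpper;
    }

    public static void main(String[] args) {
        // Assumed bounds of 1..100 for both the column and the list elements.
        System.out.println(canSkipForEquals(1, 100, 500));   // prints "true"
        System.out.println(canSkipForContains(1, 100, 42));  // prints "false"
    }
}
```

The range check is the same in both cases; what changes is which predicate it is valid for, and that predicate has to exist before element-level stats are useful.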
from iceberg.
Based on the discussion above, we could add support for struct metrics and struct-based filtering. Filtering on structs would be a good incremental feature for complex schemas, where logical grouping can really help, especially since there is now a PR available in Spark that can be cherry-picked to push down struct-based filters. If we agree, I can work on this (unless @aokolnychyi already has something on the way).
from iceberg.
@prodeezy, sounds good to me! I'll review the PRs.
from iceberg.
Also, I'm updating the summary of this issue to say lower/upper bounds instead of min/max. The spec does not guarantee that these are exact min/max values, so that values of variable-length types can be truncated to save space.
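A minimal sketch of why truncated values are still valid bounds, assuming string values compared with Java's `String.compareTo`: a prefix never sorts after the original, so it is a safe lower bound, while an upper bound has to be truncated and then incremented in its last character. The helper names are hypothetical, and the code works on UTF-16 chars without handling surrogate pairs, so it is a simplification rather than a production implementation.

```java
// Simplified sketch; helper names are hypothetical and surrogate pairs are ignored.
class BoundTruncationSketch {
    // A prefix compares less than or equal to the full string,
    // so it remains a valid (looser) lower bound.
    static String truncateLower(String value, int length) {
        return value.length() <= length ? value : value.substring(0, length);
    }

    // A bare prefix could sort before the original, so after truncating we
    // increment the last character that can be incremented and drop the rest.
    // Returns null when no shorter upper bound exists.
    static String truncateUpper(String value, int length) {
        if (value.length() <= length) {
            return value;
        }
        StringBuilder sb = new StringBuilder(value.substring(0, length));
        for (int i = sb.length() - 1; i >= 0; i--) {
            char c = sb.charAt(i);
            if (c != Character.MAX_VALUE) {
                sb.setCharAt(i, (char) (c + 1));
                sb.setLength(i + 1);
                return sb.toString();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(truncateLower("iceberg", 3)); // prints "ice"
        System.out.println(truncateUpper("iceberg", 3)); // prints "icf"
    }
}
```

Neither result is the true min or max of the data, but "ice" <= "iceberg" <= "icf" still holds, which is all a bound needs for file skipping.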
from iceberg.
@rdblue am I right that valueCounts and nullValueCounts, which are persisted for keys/values in maps as well as for elements in lists, are not used later on? Shall we keep storing them?
from iceberg.
You're right that they aren't used right now. But let's keep them because we will probably want to add more predicates.
from iceberg.
@prodeezy @rdblue I'll submit a PR for this issue after PR#131, which introduces a new test suite that will be extended to cover nested struct fields once this issue is solved.
from iceberg.