Comments (8)

aokolnychyi commented on May 18, 2024

@rdblue I see a comment in ParquetMetrics$fromMetadata, which says TODO: allow struct nesting, but not maps or arrays.

Could you describe why we need to exclude maps and arrays? We have proper statistics for keys/values in maps as well as for elements in lists, don't we? For example, we might support predicates on elements of lists at some point.

I am asking because the simplest option to enable statistics for nested structs is by replacing fileSchema.asStruct().field(fieldId) with fileSchema.findField(fieldId) in ParquetMetrics$fromMetadata. If we want to collect statistics only for nested structs, the logic will be more complicated.
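
As a rough illustration of the difference between the lookups discussed above, here is a minimal sketch. It uses a hypothetical, simplified schema model (`Field`, `Kind`, and all method names below are made up for this example and are not Iceberg's classes), and it assumes that a `findField`-style lookup resolves ids anywhere in the schema, as the comment implies.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified schema model for illustration; not Iceberg's classes.
class FieldLookupSketch {
  enum Kind { PRIMITIVE, STRUCT, LIST, MAP }

  static final class Field {
    final int id;
    final String name;
    final Kind kind;
    final List<Field> children;

    Field(int id, String name, Kind kind, Field... children) {
      this.id = id;
      this.name = name;
      this.kind = kind;
      this.children = Arrays.asList(children);
    }
  }

  // Top-level lookup: only direct children of the root struct are considered.
  static Field topLevelField(Field root, int fieldId) {
    for (Field field : root.children) {
      if (field.id == fieldId) {
        return field;
      }
    }
    return null;
  }

  // Unrestricted recursive lookup: also reaches map keys/values and list elements.
  static Field findAnywhere(Field parent, int fieldId) {
    for (Field field : parent.children) {
      if (field.id == fieldId) {
        return field;
      }
      Field nested = findAnywhere(field, fieldId);
      if (nested != null) {
        return nested;
      }
    }
    return null;
  }

  // Struct-only recursive lookup: descends into structs but not into maps or lists.
  static Field findThroughStructs(Field parent, int fieldId) {
    for (Field field : parent.children) {
      if (field.id == fieldId) {
        return field;
      }
      if (field.kind == Kind.STRUCT) {
        Field nested = findThroughStructs(field, fieldId);
        if (nested != null) {
          return nested;
        }
      }
    }
    return null;
  }

  // Example schema: id, location (struct of lat/lon), tags (list with one element field).
  static Field sampleSchema() {
    return new Field(0, "root", Kind.STRUCT,
        new Field(1, "id", Kind.PRIMITIVE),
        new Field(2, "location", Kind.STRUCT,
            new Field(3, "lat", Kind.PRIMITIVE),
            new Field(4, "lon", Kind.PRIMITIVE)),
        new Field(5, "tags", Kind.LIST,
            new Field(6, "element", Kind.PRIMITIVE)));
  }
}
```

In this sketch, the top-level lookup misses `lat` (id 3), the unrestricted lookup finds both `lat` and the list element (id 6), and the struct-only variant finds `lat` but not the list element. The extra `kind` check is the additional logic that a struct-only approach would need.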

from iceberg.

rdblue commented on May 18, 2024

The reason for now is that this is how the stats and dictionary filters work. If a column has stats, then it is assumed to be at the top level. Structs are logical groupings and don't affect the meaning of a filter, so they are okay.

If we want to collect stats from maps and lists, then we need to have a plan for how to use them. That means implementing more filters like contains, containsKey, or containsValue. Since we don't have those yet, we don't need to store the data for them.
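
To make the distinction concrete, here is a minimal sketch of a stats-based check. Everything in it is hypothetical (the `ColumnStats` class, the `ColumnKind` enum, and the method names are not Iceberg's API), and it uses `long` bounds only to keep the example short. The point is that the same range test means different things: for a top-level or struct-nested column it answers "can some row have this value?", while for list elements it could only answer a `contains`-style question, which is a predicate that does not exist yet.

```java
// Hypothetical sketch of stats-based file pruning; not Iceberg's filter classes.
class StatsFilterSketch {
  enum ColumnKind { SCALAR, LIST_ELEMENT }

  static final class ColumnStats {
    final ColumnKind kind;
    final long lower;
    final long upper;

    ColumnStats(ColumnKind kind, long lower, long upper) {
      this.kind = kind;
      this.lower = lower;
      this.upper = upper;
    }
  }

  // Shared range test: false means no stored value can equal the literal.
  static boolean inRange(ColumnStats stats, long value) {
    return stats.lower <= value && value <= stats.upper;
  }

  // col == value: defined for top-level columns and for fields nested in structs,
  // because each row still has exactly one value for the column.
  static boolean mightMatchEq(ColumnStats stats, long value) {
    if (stats.kind != ColumnKind.SCALAR) {
      throw new IllegalArgumentException("eq is not defined on list elements");
    }
    return inRange(stats, value);
  }

  // Hypothetical contains(listCol, value): the only kind of predicate that
  // element-level bounds could serve.
  static boolean mightMatchContains(ColumnStats stats, long value) {
    if (stats.kind != ColumnKind.LIST_ELEMENT) {
      throw new IllegalArgumentException("contains needs list element stats");
    }
    return inRange(stats, value);
  }
}
```

A result of `false` means the file can be skipped; `true` only means the file might contain a match.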

prodeezy commented on May 18, 2024

Based on the discussion above, we could add support for struct metrics and struct-based filtering. Filtering on structs would be a good incremental feature for complex schemas, where logical grouping can often really help. This is especially true now that there is a PR available in Spark that one can cherry-pick to push down struct-based filters. If we agree, I can work on this (unless @aokolnychyi already has one on the way).

rdblue commented on May 18, 2024

@prodeezy, sounds good to me! I'll review the PRs.

rdblue commented on May 18, 2024

Also, I'm updating the summary of this issue to say lower/upper bounds instead of min/max. The spec does not guarantee that these are exact min/max values, so that values of variable-length types can be truncated to save space.
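
As an illustration of why a truncated value is still a usable bound even though it is no longer the true min or max, here is a minimal sketch. The class and method names are hypothetical, and the truncation works on Java `char` values as a simplifying assumption; it is not meant to reproduce the exact truncation rules of the spec or of any implementation.

```java
// Hypothetical sketch of truncated string bounds; not Iceberg's implementation.
class TruncatedBoundsSketch {
  // A prefix always sorts at or before the original string, so it is a valid lower bound.
  static String truncateLower(String min, int length) {
    return min.length() <= length ? min : min.substring(0, length);
  }

  // For the upper bound, truncate and then increment the last char that can be
  // incremented, so the result sorts after every string sharing the prefix.
  // Returns null when no such bound exists (every char is already the maximum).
  static String truncateUpper(String max, int length) {
    if (max.length() <= length) {
      return max;
    }
    char[] prefix = max.substring(0, length).toCharArray();
    for (int i = prefix.length - 1; i >= 0; i--) {
      if (prefix[i] != Character.MAX_VALUE) {
        prefix[i]++;
        return new String(prefix, 0, i + 1);
      }
    }
    return null;
  }

  // Range check against possibly missing bounds; a null bound never excludes a value.
  static boolean mightContain(String lower, String upper, String value) {
    boolean atOrAboveLower = lower == null || lower.compareTo(value) <= 0;
    boolean atOrBelowUpper = upper == null || value.compareTo(upper) <= 0;
    return atOrAboveLower && atOrBelowUpper;
  }
}
```

For a file whose only value is `"iceberg"`, truncating to three chars gives the bounds `"ice"` and `"icf"`. A lookup for `"icecream"` then cannot be ruled out even though that value is not in the file, which is the cost of the saved space, while a lookup for `"java"` is still ruled out.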

aokolnychyi commented on May 18, 2024

@rdblue am I right that valueCounts and nullValueCounts, which are persisted for keys/values in maps as well as for elements in lists, are not used later on? Shall we keep storing them?

rdblue commented on May 18, 2024

You're right that they aren't used right now. But let's keep them because we will probably want to add more predicates.

aokolnychyi commented on May 18, 2024

@prodeezy @rdblue I'll submit a PR for this issue after PR #131, which introduces a new test suite that will be extended to cover nested struct fields once this issue is solved.
