Comments (1)
Have some new thoughts on this -
Reading the Parquet spec more closely, I see that a non-grouped repeated field is interpreted by default as a required list. We've been relying on this default behavior in Magnolify's `ParquetType`, but we probably should always have been adding a wrapper group to be explicit.
I filed and merged a fix for the underlying incompatibility in parquet-mr's `AvroSchemaConverter`, PARQUET-2425, since technically a non-grouped repeated field should be supported.
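For reference, here is a rough sketch of the two schema shapes in question. The message name `Record` and field name `xs` are made up for illustration, and the exact wrapper that magnolify or parquet-avro emits may differ from the one shown (for example, the legacy two-level layout rather than the three-level one).

A non-grouped repeated field, which relies on the spec's default interpretation:

```
message Record {
  repeated int32 xs;
}
```

The same data with an explicit `LIST`-annotated wrapper group, using the three-level layout described in the Parquet logical types spec:

```
message Record {
  required group xs (LIST) {
    repeated group list {
      required int32 element;
    }
  }
}
```

Per the spec's backward-compatibility rules, a reader should treat both as a required list of required `int32` elements.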
Once the fix is released, IMO we should start moving away from `AvroCompat` in magnolify-parquet. Ideally we could just modify `ParquetType` to always produce a grouped repeated field schema, but that could cause incompatibility issues downstream in the reader (for example, if you're reading a month's worth of partitions, and half of them were produced with a grouped field schema and half weren't, Scio would complain). Therefore, I think we can do this in two parts:
(1) Update all the reader code to treat non-grouped repeated field schemas as equivalent to required repeated field schemas at read time. This would mean updating `Schema.checkCompatibility` as well as wrapping repeated schemas into required groups if an Avro list is detected, here. (We could just use `AvroSchemaConverter#convert(pt.avroSchema)` to get back a wrapped `MessageType` once PARQUET-2425 is released.) Deprecate `AvroCompat` for reads.
(2) Eventually update the writer code to produce wrapped repeated schemas, maybe with a fallback option via `Configuration`. `AvroCompat` has a second function aside from wrapping lists: adding a metadata key, `parquet.avro.schema`, to the file footer. We should encapsulate that logic behind a `Configuration` flag, or simply always write it 🤷. Deprecate `AvroCompat` for writes.
Thoughts?
from magnolify.