
magnolify's Introduction

magnolify


A collection of Magnolia add-ons for common type class derivation, data type conversion, etc.; a simpler and faster successor to shapeless-datatype.

Modules

This library includes modules for Avro, BigQuery, Bigtable, cats, Datastore, Parquet, Protobuf, ScalaCheck, TensorFlow, and more.

Usage

See micro-site for documentation.

How to Release

Magnolify automates releases using sbt-ci-release with GitHub Actions. Simply push a new tag:

git tag -a v0.1.0 -m "v0.1.0"
git push origin v0.1.0

Note that the tag version MUST start with v to be picked up as the release version.

License

Copyright 2019-2021 Spotify AB.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0

magnolify's People

Contributors

anish749, anne-decusatis, brodin, clairemcginty, dependabot[bot], freyrsae, gokyo, hansencc, jatcwang, kellen, leifw, martinbomio, nabbisen, nevillelyh, pismute, regadas, rustedbones, scala-steward, sckelemen, shnapz, spotify-steward[bot], stormy-ua, syodage, turb, virtualirfan


magnolify's Issues

Override params for ScalaCheck `Arbitrary`

I have a case class like

case class Something(id: Int, price: BigDecimal, tax: BigDecimal)

along with many more params.

Now I want to derive Arbitrary[Something] and use different Gen[BigDecimal] instances for the price and tax fields.

Right now I am using the following, but wouldn't it be nice to support this without the copy?

import org.scalacheck.{Arbitrary, Gen}
import magnolify.scalacheck.auto._

implicit val arbSomething: Arbitrary[Something] = Arbitrary {
  for {
    something <- genArbitrary[Something].arbitrary // derived by Magnolify
    price <- Gen.oneOf(Seq[BigDecimal](1, 3, 10))
    tax <- Gen.oneOf(Seq[BigDecimal](0, 1))
  } yield something.copy(price = price, tax = tax)
}
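For context, the obvious alternative of overriding the implicit Arbitrary[BigDecimal] in scope applies uniformly to every BigDecimal field, so price and tax cannot receive different generators that way; a minimal sketch of the limitation:

// This override affects *all* BigDecimal fields alike,
// which is why a per-field override mechanism is being requested:
implicit val arbBigDecimal: Arbitrary[BigDecimal] =
  Arbitrary(Gen.oneOf(Seq[BigDecimal](0, 1, 3, 10)))

// genArbitrary[Something] now draws both price and tax from this same Gen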

Magnolify/tensorflow doesn't support optionals and nested repeated fields

The docs here state that "Optional and repeated types are not supported in a nested field". Is there a plan to add a fix or workaround that would make using those possible?

Currently optionals and nested repeated fields are the main thing standing between us and an upgrade to Scio 10.

Here's a rough example of what our case classes look like:

  case class CaseClass1(
    id: String,
    number: Option[Long],
    someFeatures: Option[SomeFeaturesCaseClass]
  )

  case class CaseClass2(
    baseFeatures: CaseClass1,
    otherFeatures: Map[String, OtherFeaturesCaseClass],
    yetOtherFeatures: Map[String, YetOtherFeaturesCaseClass]
  )

  val CASE_CLASS_CONVERTER: ExampleType[CaseClass2] =
    ExampleType[CaseClass2]

Add `ProtobufType`

For arbitrary protobuf Message types. Should be doable by manipulating the field descriptor, etc.
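A plausible shape, by analogy with the other converters in this library (the exact signature is an assumption, not a settled API):

import com.google.protobuf.Message

// Hypothetical: T is the case class, MsgT the target protobuf message type
trait ProtobufType[T, MsgT <: Message] extends Converter[T, MsgT, MsgT]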

Add method to `ExampleType` to return the tf.schema

When you need to work with tf.example, especially when you are writing them, you often also want to store the tf.schema which describes that particular tf.example dataset. It would be useful to add a method to ExampleType which returns such a tf.schema. Something like:

sealed trait ExampleType[T] extends Converter[T, Example, Example.Builder] {
   ...
  def schema: Schema
}

It is important to note that the returned schema object will only contain information about the names of the fields and their types; it cannot contain any statistics information.

Magnolify decodes the toString value of an optional byte array field instead of the field itself

Given the following type definition:

@BigQueryType.toTable
case class TestType(
  id: String,
  bytes: Option[Array[Byte]]
)

private val bqType = TableRowType[TestType]

Running the following pipeline test:

"test" should "work" in {
    val row = new TableRow().set("id", "test-id").set("bytes", Some(Array(1.toByte, 9.toByte)))
    JobTest[TestJob.type]
        .args("--test-table=test:table.def")
        .input(BigQueryIO(Table.Spec("test:table.def")), Seq(row))
        .run
}

Returns the following stack-trace (trimmed for relevance):

Caused by: java.lang.IllegalArgumentException: com.google.common.io.BaseEncoding$DecodingException: Unrecognized character: {
	at com.google.common.io.BaseEncoding.decode(BaseEncoding.java:219)
	at magnolify.bigquery.TableRowField$.$anonfun$trfByteArray$1(TableRowType.scala:172)
	at magnolify.bigquery.TableRowField$$anon$4.from(TableRowType.scala:160)
	at magnolify.bigquery.TableRowField$$anon$5.from(TableRowType.scala:187)
	at magnolify.bigquery.TableRowField$$anon$5.from(TableRowType.scala:183)
	at magnolify.bigquery.TableRowField.fromAny(TableRowType.scala:71)
	at magnolify.bigquery.TableRowField.fromAny$(TableRowType.scala:71)
	at magnolify.bigquery.TableRowField$$anon$5.fromAny(TableRowType.scala:183)
	at magnolify.bigquery.TableRowField$$anon$2.$anonfun$from$1(TableRowType.scala:110)

The full String that trfByteArray is attempting to decode is {empty=false, defined=true}, rather than the byte array itself. This also happens if None is passed instead of an array, but does not happen if the field is not populated at all.
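For context, the stack trace shows trfByteArray base64-decoding the cell value, so a hand-built TableRow should carry the bytes as a base64 string rather than a Some(...) wrapper; a sketch using Guava's BaseEncoding:

import com.google.api.services.bigquery.model.TableRow
import com.google.common.io.BaseEncoding

// BYTES cells in a TableRow are base64-encoded strings
val row = new TableRow()
  .set("id", "test-id")
  .set("bytes", BaseEncoding.base64().encode(Array(1.toByte, 9.toByte)))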

Cats instances defeating Algebird ones

In Algebird we have case class Min[T](get: T) and an implicit algebird.Semigroup[Min[T]] for it.

However, when deriving implicit sg: cats.Semigroup[Record] with magnolify.cats.auto._, where Record is a case class with a Min[Int] field, the cats auto derivation wins and yields combine(Min(1), Min(2)) == Min(3). This is because algebird.Semigroup extends cats.Semigroup, our derivation is based on cats.Semigroup, and the derived instance wins in implicit resolution.
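A minimal sketch of the clash (assuming the field-wise derivation combines the wrapped Int with Semigroup[Int], i.e. addition):

import cats.Semigroup
import com.twitter.algebird.Min
import magnolify.cats.auto._

case class Record(min: Min[Int])

// Auto derivation treats Min as just another case class and combines
// its `get` field with Semigroup[Int]:
val sg = implicitly[Semigroup[Record]]
sg.combine(Record(Min(1)), Record(Min(2))) // Record(Min(3)), not Record(Min(1))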

camelCase/PascalCase/snake_case/kebab-case mapper

Avro, BigQuery, etc. use snake_case internally, while camelCase is more idiomatic in Scala.
Some users requested case conversion to be supported.

Two possible options:

  1. Add an extra parameter to all *Type converters, i.e. a bijection between internal (Avro, BQ, etc.) and Scala cases. Something like val at = AvroType[MyRecord](Case.Snake, Case.Camel).

However, we should make sure the mappings are 1-to-1 and there are no conflicts (see the sketch at the end of this issue).

  2. Add independent mappers of GenericRecord => GenericRecord, TableRow => TableRow, essentially making a copy of the record by renaming all the fields before feeding it into the converters.

This is more decoupled but adds more overhead, possibly more complexity, and is less reusable across converters. In the case of Avro and BQ, we need to re-map the schema as well. In the case of Avro, GenericRecord is an interface, so we could return a lazy view that wraps another GenericRecord (in the neville/mapper2 branch), but I ended up doing more hacky stuff to get it to behave like a real GenericRecord.

2 WIP branches for GenericRecord mapping, one making copies and one a lazy view:
https://github.com/spotify/magnolify/tree/neville/mapper
https://github.com/spotify/magnolify/tree/neville/mapper2
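To illustrate the 1-to-1 concern from option 1, a naive conversion pair (hypothetical helpers, not part of the library):

// Naive camelCase <-> snake_case helpers
def camelToSnake(s: String): String =
  s.replaceAll("([A-Z])", "_$1").toLowerCase

def snakeToCamel(s: String): String = {
  val parts = s.split("_")
  (parts.head +: parts.tail.map(_.capitalize)).mkString
}

// Not a bijection: a field that already contains an underscore collides
// with a camelCase one after mapping.
camelToSnake("fooBar")  // "foo_bar"
camelToSnake("foo_bar") // "foo_bar" -- conflict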

Add methods to validate schema

For Avro, BigQuery, Parquet, etc., to validate that the type T is compatible with the data source, i.e. Avro Schema, BQ TableSchema, and Parquet MessageType.
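A minimal sketch of what such a method could look like for Avro, assuming naive schema equality (a real check would need field-by-field resolution rules):

import magnolify.avro.AvroType
import org.apache.avro.Schema

// Hypothetical helper, not the proposed API
def isCompatible[T](source: Schema)(implicit at: AvroType[T]): Boolean =
  at.schema == source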

Parquet TODO

  • Avro array support in AvroWriteSupport - old TwoLevelListWriter vs new ThreeLevelListWriter
  • Avro nullable arrays and arrays of nullables
  • Fix parquet.avro.data.supplier with generic records in test #278
  • Schema compatibility check in ReadSupport 2aea4e8
  • Schema evolution for enums #290
  • Schema evolution for arrays 6c00ecb

Optimize `Semigroup#combineAllOption`?

Could be a performance improvement if T is expensive to create/copy, but probably only worth it if the Typeclass[T] instances for all parameters have a custom combineAllOption. We could detect that via reflection, e.g. by checking whether a parameter's combineAllOption is declared by an overriding class rather than by Semigroup itself, but that feels a bit hacky.

PROTO3 `Option[T]` support

After some internal discussion this is what we'll do:

  • remove the check that fails PROTO3 with Option[T] fields
  • add a warning about the potentially asymmetric conversion (sketched below):
  • for Scala -> proto, Some(0/""/false) and None are equivalent
  • for proto -> Scala, we always get Some(0/""/false) and never None
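A sketch of the asymmetry for a proto3 scalar field (pt is an assumed derived converter; names are hypothetical):

case class Record(count: Option[Int])

// Scala -> proto: proto3 scalars have no presence bit, so both sides
// encode to the same message:
//   pt.to(Record(Some(0))) == pt.to(Record(None))

// proto -> Scala: the default value always decodes to Some(0), so the
// round trip can never recover None:
//   pt.from(pt.to(Record(None))) == Record(Some(0))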

Release for magnolia 0.14.x

Hi! Could you make a release of the library (especially the scalacheck module) compiled with a Magnolia release in the 0.14 line? Thanks for the cool library set!

`LocalDateTime` maps to `local-timestamp-micros` instead of `local-timestamp-millis` in `magnolify.avro.logical.millis`

The magnolify.avro.logical.millis package seems to be incorrectly assigning a logical type of local-timestamp-micros when converting a LocalDateTime. The to/from conversions use millis logic, but the logical type used is the one for micros. I would expect the logical type to be local-timestamp-millis:

https://github.com/spotify/magnolify/blob/main/avro/src/main/scala/magnolify/avro/logical/package.scala#L55-L58
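The expected schema, sketched with the plain Avro API (assuming Avro 1.10+, where the local timestamp logical types exist):

import org.apache.avro.{LogicalTypes, Schema}

// What the millis package should attach for LocalDateTime:
val expected = LogicalTypes.localTimestampMillis()
  .addToSchema(Schema.create(Schema.Type.LONG))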

Refined type support

Bring refined type support for type classes, especially for Avro and BigQuery.

E.g., an Avro example:

import eu.timepit.refined.api._
import eu.timepit.refined.numeric._
import magnolify.avro._
import org.apache.avro.generic.GenericRecordBuilder

type PosInt = Int Refined Positive
case class Person(name: String, age: PosInt)

val avroPerson = AvroType[Person]

val schema = avroPerson.schema // the schema uses valid Avro types for the refined types' base types

val bPersonGen = new GenericRecordBuilder(avroPerson.schema)
  .set("name", "Martin")
  .set("age", -10) // wrong value, should be a positive int
  .build()

avroPerson.from(bPersonGen) // this should throw an error

Add support for JDBC reader/writer?

I am wondering if adding a type class for reading/writing from/to a SQL database via JDBC is being considered. It should be possible to implement something like this:

import java.sql.{PreparedStatement, ResultSet}

trait JdbcType[T] extends Converter[T, ResultSet, PreparedStatement]

and derive such converters for any case class. It could then be seamlessly integrated with scio-jdbc.

Compatibility warning when reading Parquet files produced from Avro

parquet-avro writes repeated fields differently than other Parquet modules. A repeated field is normally written as:

repeated T field_name;

But parquet-avro writes it as:

(required|optional) group field_name {
  repeated T array;
}

This maps to field_name: List[T] (required) or field_name: Option[List[T]] (optional), the latter of which is not supported in Magnolify.
To read repeated fields written this way (required group field_name { repeated T array }), one needs to import magnolify.parquet.ParquetArray.AvroCompat._, which changes repeated-field derivation; a usage sketch follows the list below.

We should warn if a user tries to read files written from Avro but forgot the import. This would involve 2 things:

  • Detect that a file is written from Avro - This could be done by checking context.getKeyValueMetadata in ReadSupport has the Avro metadata key
  • Detect whether the ParquetType was derived with AvroCompat or not - This would require propagating isAvro, probably by making it a field in ParquetField/ParquetType.
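For reference, a usage sketch of the import (assuming the current magnolify.parquet package layout):

import magnolify.parquet.ParquetType
import magnolify.parquet.ParquetArray.AvroCompat._ // switch to the parquet-avro list layout

case class Record(fieldName: List[Int])

// With the import, the derived schema reads/writes
//   required group fieldName { repeated int32 array; }
// instead of
//   repeated int32 fieldName;
val pt = ParquetType[Record]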

Enum support

Used mainly in Avro but could be useful for other types as well. We need to support Java Enum, Scala Enumeration, and maybe sealed traits with case object-only subtypes?
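The three enum flavors under consideration, sketched:

// Java enum, defined in Java: enum JavaColor { RED, GREEN, BLUE }

// Scala Enumeration
object ScalaColor extends Enumeration {
  val Red, Green, Blue = Value
}

// Sealed trait with case object-only subtypes
sealed trait AdtColor
case object Red extends AdtColor
case object Green extends AdtColor
case object Blue extends AdtColor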

Bigtable: behaviour of BigtableType.mutationsToRow is different compared to actual Bigtable

The mutationsToRow defined in BigtableType assumes column qualifiers to be UTF-8 strings (for sorting) and sorts the SetCell requests before applying them.

The above code creates as many columns as there are SetCell requests. So when mutations contain the same column but different cell timestamps, this breaks. The expected behaviour would be for one column to hold all the cells that belong to it, with their different cell timestamps.

In actual Bigtable, the row has its column qualifiers sorted, but mutations aren't applied in qualifier order; mutations to the same row are applied serially (a batch write is forced to serialize mutations to the same row).

Flaky tests

https://travis-ci.org/github/spotify/magnolify/jobs/708925430

==> X magnolify.scalacheck.test.ArbitraryDerivationSuite.Prop: Color.uniqueness  0.074s munit.FailException: /home/travis/build/spotify/magnolify/scalacheck/src/test/scala/magnolify/scalacheck/test/ArbitraryDerivationSuite.scala:37
36:    // `forAll(Gen.listOfN(10, g))` fails for `Repeated` & `Collections` when size parameter <= 1
37:    property(s"$name.uniqueness") {
38:      Prop.forAll { l: Long =>
Failing seed: z9omj4mAsT7BcJQoh95CcofqurdkO_jIAGEWFPx7xZA=
You can reproduce this failure by adding the following override to your suite:
  override val scalaCheckInitialSeed = "5E72c0hWZ5ioOvDIj2udNT9KuVs1mGz8EBcV-3zyKgH="
Falsified after 42 passed tests.
> ARG_0: -8560219907156556681

Create CI builds for Avro 1.8, etc.

Due to breaking changes in some old dependency versions that we still support, e.g. Avro 1.8 and Protobuf 3.x, we want to make sure the code stays backwards compatible.

RFC

Motivation

Right now we use a couple of shapeless based libraries in data pipelines.

  • https://github.com/nevillelyh/shapeless-datatype for:
    • conversion between case classes <-> Avro GenericRecord, BigQuery TableRow, Datastore Entity, TensorFlow Example
    • RecordMapper, RecordMatcher, LensMatcher for some generic data type mapping/matching [1]
  • https://github.com/alexarchambault/scalacheck-shapeless for deriving ScalaCheck Gen[T] random generators for tuples and case classes
  • CaseClassDiffy in ratatool to derive Diffy[T] for comparing the "diff" between 2 primitive types, case classes, or Avro/Protobuf/BigQuery records
  • Algebird macros, not shapeless based but also for deriving Semigroup[T] and Monoid[T] for tuples & case classes [2]

The biggest problem with shapeless is slow compilation. Most of the type class derivation cases can be replaced with Magnolia. We already have tfexample-derive for TF Example derivation. For [2], Magnolia should be more flexible and maintainable than handcrafted macros.

Not sure if [1] can be covered with Magnolia, but those components are also not widely used.

Scopes

  • Minimal dependencies, Magnolia and a few necessary ones to reduce migration impact
  • Support Scala 2.11-2.13, with our forked Magnolia for 2.11 support
  • Everything serializable, so they can be used in pipelines
  • Shared interface for similar components, e.g. Converter[T] for type conversion [3]

[3] may require a core sub-project. Another benefit of converters sharing a root trait is reusable tests, etc., but that's debatable.

Tasks

A breakdown of tasks ranked by priority and size.

  • scalacheck Gen[T] derivation
  • avro/bigquery/datastore/tensorflow/protobuf Converter[T] derivation, among them tensorflow already exists, protobuf might be doable but tricky
  • cats Semigroup[T] etc. derivation, since Algebird is not available for Scala 2.13 yet
  • diffy Diffy[T] derivation, with the goal of replacing that in ratatool

Support Avro fixed type

Could be useful for fixed size binary encoding like UUID, checksums, etc. Could be expressed in Scala as a refined type.
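For reference, creating such a fixed schema with the plain Avro API:

import org.apache.avro.Schema

// A 16-byte fixed type, e.g. for an MD5 checksum
val md5Schema = Schema.createFixed("MD5", null, "com.example", 16)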

Ideas

Just a brain dump of nice-to-haves; nothing about feasibility though.

Fix flaky tests

Cogen tests still fail on List and Option sometimes due to Nil/None generated values.

Support Datastore entity keys

Possibly through a field annotation.
https://cloud.google.com/datastore/docs/concepts/entities
https://github.com/googleapis/googleapis/blob/master/google/datastore/v1beta3/entity.proto

For something like Person(@key email: String, name: String, age: Int), the key pieces would map as follows (a sketch follows the list):

  • partition_id
    • project_id: "" or from a class-level @projectId("my-project") annotation
    • namespace_id: "", the case class namespace, or from a class-level @namespace("com.myorg") annotation
  • path (repeated, but we can allow only 1 level for simplicity?)
    • kind: the case class name or from a class-level @kind annotation
    • id/name: Int/String field value
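A sketch of the Key this proposal would produce for Person(@key email = "jane@example.com", ...), using the Datastore v1 protobuf classes (field values are illustrative):

import com.google.datastore.v1.{Key, PartitionId}

val key = Key.newBuilder()
  .setPartitionId(PartitionId.newBuilder()
    .setProjectId("my-project")     // from @projectId, per the proposal
    .setNamespaceId("com.myorg"))   // from @namespace
  .addPath(Key.PathElement.newBuilder()
    .setKind("Person")              // case class name
    .setName("jane@example.com"))   // the @key field's value
  .build()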

Support default values in converters

  • Avro
  • BigQuery
  • Bigtable
  • Datastore
  • Protobuf
  • TensorFlow
For example, a field with a Scala default value currently fails with a NullPointerException instead of falling back to the default:

scala> import magnolify.bigquery._
import magnolify.bigquery._

scala> case class A(i: Long, j: Long = 10)
class A

scala> val t = TableRowType[A]
val t: magnolify.bigquery.TableRowType[A] = magnolify.bigquery.TableRowType$$anon$1@4b88be62

scala> import com.google.api.services.bigquery.model.TableRow
import com.google.api.services.bigquery.model.TableRow

scala> val tr = new TableRow().set("i", 123)
val tr: com.google.api.services.bigquery.model.TableRow = GenericData{classInfo=[f], {i=123}}

scala> t(tr)
java.lang.NullPointerException
  at magnolify.bigquery.TableRowField$.$anonfun$trfLong$1(TableRowType.scala:132)
  at magnolify.bigquery.TableRowField$.$anonfun$trfLong$1$adapted(TableRowType.scala:132)
  at magnolify.bigquery.TableRowField$$anon$4.from(TableRowType.scala:127)
  at magnolify.bigquery.TableRowField.fromAny(TableRowType.scala:61)
  at magnolify.bigquery.TableRowField.fromAny$(TableRowType.scala:61)
  at magnolify.bigquery.TableRowField$$anon$4.fromAny(TableRowType.scala:124)
  at magnolify.bigquery.TableRowField$$anon$2.$anonfun$from$1(TableRowType.scala:87)
  at $anon$1.construct(<console>:1)
  at $anon$1.construct(<console>:1)
  at magnolify.bigquery.TableRowField$$anon$2.from(TableRowType.scala:87)
  at magnolify.bigquery.TableRowField$$anon$2.from(TableRowType.scala:77)
  at magnolify.bigquery.TableRowType$$anon$1.from(TableRowType.scala:47)
  at magnolify.bigquery.TableRowType$$anon$1.from(TableRowType.scala:43)
  at magnolify.bigquery.TableRowType.apply(TableRowType.scala:37)
  at magnolify.bigquery.TableRowType.apply$(TableRowType.scala:37)
  at magnolify.bigquery.TableRowType$$anon$1.apply(TableRowType.scala:43)
  ... 40 elided

BigtableType field names shortening

Bigtable column qualifiers are stored as data in each cell, so the best option is to keep column names as short as possible. At the same time, field names in a user's case class are supposed to be long enough to convey the field's purpose. It would be great to annotate case class fields with an attribute carrying a shortened field name, e.g.:

case class MyClass(@btfieldname("f") superDescriptiveField: String)

I was thinking about adding an implicit ClassTag to the BigtableType apply function and using reflection to traverse the type, build the field map once, and cache it. Wondering if there are any other thoughts on how to do this.
