
magnolify's Introduction

magnolify


A collection of Magnolia add-ons for common type class derivation, data type conversion, etc.; a simpler and faster successor to shapeless-datatype.

Modules

This library includes modules for Avro, BigQuery, Bigtable, cats, Datastore, Parquet, Protobuf, ScalaCheck, TensorFlow, and more.

Usage

See micro-site for documentation.

How to Release

Magnolify automates releases using sbt-ci-release with GitHub Actions. Simply push a new tag:

git tag -a v0.1.0 -m "v0.1.0"
git push origin v0.1.0

Note that the tag version MUST start with v to be picked up as the release version.

License

Copyright 2019-2021 Spotify AB.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0

magnolify's People

Contributors

anish749, anne-decusatis, brodin, clairemcginty, dependabot[bot], freyrsae, gokyo, hansencc, jatcwang, kellen, leifw, martinbomio, nabbisen, nevillelyh, pismute, regadas, rustedbones, scala-steward, sckelemen, shnapz, spotify-steward[bot], stormy-ua, syodage, turb, virtualirfan


magnolify's Issues

Override params for ScalaCheck `Arbitrary`

I have a case class like

case class Something(id: Int, price: BigDecimal, tax: BigDecimal)

along with many more params.

Now I want to derive Arbitrary[Something] and use different Gen[BigDecimal] instances for the price and tax fields.

Right now I am using the following, but wouldn't it be nice to support this without the copy?

import org.scalacheck.{Arbitrary, Gen}
import magnolify.scalacheck.auto._

implicit val arbSomething: Arbitrary[Something] = Arbitrary {
  for {
    something <- genArbitrary[Something].arbitrary // derived by Magnolify
    price <- Gen.oneOf(Seq[BigDecimal](1, 3, 10))
    tax <- Gen.oneOf(Seq[BigDecimal](0, 1))
  } yield something.copy(price = price, tax = tax)
}
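For context, the obvious alternative of overriding the implicit Arbitrary[BigDecimal] in scope applies uniformly to every BigDecimal field, so price and tax cannot receive different generators that way; a minimal sketch of the limitation:

// This override affects *all* BigDecimal fields alike,
// which is why a per-field override mechanism is being requested:
implicit val arbBigDecimal: Arbitrary[BigDecimal] =
  Arbitrary(Gen.oneOf(Seq[BigDecimal](0, 1, 3, 10)))

// genArbitrary[Something] now draws both price and tax from this same Gen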

Magnolify/tensorflow doesn't support optionals and nested repeated fields

The docs here state that "Optional and repeated types are not supported in a nested field". Is there a plan to add a fix or workaround that would make using those possible?

Currently optionals and nested repeated fields are the main thing standing between us and an upgrade to Scio 10.

Here's a rough example of what our case classes look like:

  case class CaseClass1(
    id: String,
    number: Option[Long],
    someFeatures: Option[SomeFeaturesCaseClass]
  )

  case class CaseClass2(
    baseFeatures: CaseClass1,
    otherFeatures: Map[String, OtherFeaturesCaseClass],
    yetOtherFeatures: Map[String, YetOtherFeaturesCaseClass]
  )

  val CASE_CLASS_CONVERTER: ExampleType[CaseClass2] =
    ExampleType[CaseClass2]

Add `ProtobufType`

For arbitrary protobuf Message types. Should be doable by manipulating the field descriptor, etc.
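A plausible shape, by analogy with the other converters in this library (the exact signature is an assumption, not a settled API):

import com.google.protobuf.Message

// Hypothetical: T is the case class, MsgT the target protobuf message type
trait ProtobufType[T, MsgT <: Message] extends Converter[T, MsgT, MsgT]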

Add method to `ExampleType` to return the tf.schema

When you need to work with tf.example, especially when you are writing them, you often also want to store the tf.schema which describes that particular tf.example dataset. It would be useful to add a method to ExampleType which returns such a tf.schema. Something like:

sealed trait ExampleType[T] extends Converter[T, Example, Example.Builder] {
   ...
  def schema: Schema
}

It is important to note that the returned schema object will only contain information about the names of the fields and their types; it cannot contain any statistics information.

Magnolify decodes the toString value of an optional byte array field instead of the field itself

Given the following type definition:

@BigQueryType.toTable
case class TestType(
  id: String,
  bytes: Option[Array[Byte]]
)

private val bqType = TableRowType[TestType]

Running the following pipeline test:

"test" should "work" in {
    val row = new TableRow().set("id", "test-id").set("bytes", Some(Array(1.toByte, 9.toByte)))
    JobTest[TestJob.type]
        .args("--test-table=test:table.def")
        .input(BigQueryIO(Table.Spec("test:table.def")), Seq(row))
        .run
}

Returns the following stack-trace (trimmed for relevance):

Caused by: java.lang.IllegalArgumentException: com.google.common.io.BaseEncoding$DecodingException: Unrecognized character: {
	at com.google.common.io.BaseEncoding.decode(BaseEncoding.java:219)
	at magnolify.bigquery.TableRowField$.$anonfun$trfByteArray$1(TableRowType.scala:172)
	at magnolify.bigquery.TableRowField$$anon$4.from(TableRowType.scala:160)
	at magnolify.bigquery.TableRowField$$anon$5.from(TableRowType.scala:187)
	at magnolify.bigquery.TableRowField$$anon$5.from(TableRowType.scala:183)
	at magnolify.bigquery.TableRowField.fromAny(TableRowType.scala:71)
	at magnolify.bigquery.TableRowField.fromAny$(TableRowType.scala:71)
	at magnolify.bigquery.TableRowField$$anon$5.fromAny(TableRowType.scala:183)
	at magnolify.bigquery.TableRowField$$anon$2.$anonfun$from$1(TableRowType.scala:110)

The full String that trfByteArray is attempting to decode is {empty=false, defined=true}, rather than the byte array itself. This also happens if None is passed instead of an array, but does not happen if the field is not populated at all.
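For context, the stack trace shows trfByteArray base64-decoding the cell value, so a hand-built TableRow should carry the bytes as a base64 string rather than a Some(...) wrapper; a sketch using Guava's BaseEncoding:

import com.google.api.services.bigquery.model.TableRow
import com.google.common.io.BaseEncoding

// BYTES cells in a TableRow are base64-encoded strings
val row = new TableRow()
  .set("id", "test-id")
  .set("bytes", BaseEncoding.base64().encode(Array(1.toByte, 9.toByte)))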

Cats instances defeating Algebird ones

In Algebird we have case class Min[T](get: T) and an implicit algebird.Semigroup[Min[T]] for it.

However, when deriving implicit sg: cats.Semigroup[Record] with magnolify.cats.auto._, where Record is a case class with a Min[Int] field, the cats auto derivation wins and yields combine(Min(1), Min(2)) == Min(3). This is because algebird.Semigroup extends cats.Semigroup, our derivation is based on cats.Semigroup, and the derived instance wins in implicit resolution.
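A minimal sketch of the clash (assuming the field-wise derivation combines the wrapped Int with Semigroup[Int], i.e. addition):

import cats.Semigroup
import com.twitter.algebird.Min
import magnolify.cats.auto._

case class Record(min: Min[Int])

// Auto derivation treats Min as just another case class and combines
// its `get` field with Semigroup[Int]:
val sg = implicitly[Semigroup[Record]]
sg.combine(Record(Min(1)), Record(Min(2))) // Record(Min(3)), not Record(Min(1))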

camelCase/PascalCase/snake_case/kebab-case mapper

Avro, BigQuery, etc. use snake_case internally, while camelCase is more idiomatic in Scala.
Some users requested case conversion to be supported.

Two possible options:

  1. Add an extra parameter to all *Type converters, i.e. a bijection between internal (Avro, BQ, etc.) and Scala cases. Something like val at = AvroType[MyRecord](Case.Snake, Case.Camel).

However, we should make sure the mappings are 1-to-1 and there are no conflicts (see the sketch at the end of this issue).

  2. Add independent mappers of GenericRecord => GenericRecord, TableRow => TableRow, essentially making a copy of the record by renaming all the fields before feeding it into the converters.

This is more decoupled but adds more overhead, possibly more complexity, and is less reusable across converters. In the case of Avro and BQ, we need to re-map the schema as well. In the case of Avro, GenericRecord is an interface, so we could return a lazy view that wraps another GenericRecord (in the neville/mapper2 branch), but I ended up doing more hacky stuff to get it to behave like a real GenericRecord.

2 WIP branches for GenericRecord mapping, one making copies and one a lazy view:
https://github.com/spotify/magnolify/tree/neville/mapper
https://github.com/spotify/magnolify/tree/neville/mapper2
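To illustrate the 1-to-1 concern from option 1, a naive conversion pair (hypothetical helpers, not part of the library):

// Naive camelCase <-> snake_case helpers
def camelToSnake(s: String): String =
  s.replaceAll("([A-Z])", "_$1").toLowerCase

def snakeToCamel(s: String): String = {
  val parts = s.split("_")
  (parts.head +: parts.tail.map(_.capitalize)).mkString
}

// Not a bijection: a field that already contains an underscore collides
// with a camelCase one after mapping.
camelToSnake("fooBar")  // "foo_bar"
camelToSnake("foo_bar") // "foo_bar" -- conflict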

Add methods to validate schema

For Avro, BigQuery, Parquet, etc., to validate that the type T is compatible with the data source, i.e. Avro Schema, BQ TableSchema, and Parquet MessageType.
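A minimal sketch of what such a method could look like for Avro, assuming naive schema equality (a real check would need field-by-field resolution rules):

import magnolify.avro.AvroType
import org.apache.avro.Schema

// Hypothetical helper, not the proposed API
def isCompatible[T](source: Schema)(implicit at: AvroType[T]): Boolean =
  at.schema == source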

Parquet TODO

  • Avro array support in AvroWriteSupport - old TwoLevelListWriter vs new ThreeLevelListWriter
  • Avro nullable arrays and arrays of nullables
  • Fix parquet.avro.data.supplier with generic records in test #278
  • Schema compatibility check in ReadSupport 2aea4e8
  • Schema evolution for enums #290
  • Schema evolution for arrays 6c00ecb

Optimize `Semigroup#combineAllOption`?

Could be a performance improvement if T is expensive to create/copy, but probably only worth it if the Typeclass[T] instances for all parameters have a custom combineAllOption. We could detect that via reflection, e.g. by checking whether a parameter's combineAllOption is declared by an overriding class rather than by Semigroup itself, but that feels a bit hacky.

PROTO3 `Option[T]` support

After some internal discussion this is what we'll do:

  • remove the check that fails PROTO3 with Option[T] fields
  • add a warning about the potentially asymmetric conversion (sketched below):
  • for Scala -> proto, Some(0/""/false) and None are equivalent
  • for proto -> Scala, we always get Some(0/""/false) and never None
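A sketch of the asymmetry for a proto3 scalar field (pt is an assumed derived converter; names are hypothetical):

case class Record(count: Option[Int])

// Scala -> proto: proto3 scalars have no presence bit, so both sides
// encode to the same message:
//   pt.to(Record(Some(0))) == pt.to(Record(None))

// proto -> Scala: the default value always decodes to Some(0), so the
// round trip can never recover None:
//   pt.from(pt.to(Record(None))) == Record(Some(0))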

Release for magnolia 0.14.x

Hi! Could you make a release of the library (especially the scalacheck module) compiled with a Magnolia release in the 0.14 line? Thanks for the cool library set!

`LocalDateTime` maps to `local-timestamp-micros` instead of `local-timestamp-millis` in `magnolify.avro.logical.millis`

The magnolify.avro.logical.millis package seems to be incorrectly assigning a logical type of local-timestamp-micros when converting a LocalDateTime. The to/from conversions use millis logic, but the logical type used is the one for micros. I would expect the logical type to be local-timestamp-millis:

https://github.com/spotify/magnolify/blob/main/avro/src/main/scala/magnolify/avro/logical/package.scala#L55-L58
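The expected schema, sketched with the plain Avro API (assuming Avro 1.10+, where the local timestamp logical types exist):

import org.apache.avro.{LogicalTypes, Schema}

// What the millis package should attach for LocalDateTime:
val expected = LogicalTypes.localTimestampMillis()
  .addToSchema(Schema.create(Schema.Type.LONG))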

Refined type support

Bring refined type support for type classes, especially for Avro and BigQuery.

E.g., an Avro example:

import eu.timepit.refined.api._
import eu.timepit.refined.numeric._
import magnolify.avro._
import org.apache.avro.generic.GenericRecordBuilder

type PosInt = Int Refined Positive
case class Person(name: String, age: PosInt)

val avroPerson = AvroType[Person]

val schema = avroPerson.schema // the schema uses valid Avro types for the refined types' base types

val bPersonGen = new GenericRecordBuilder(avroPerson.schema)
  .set("name", "Martin")
  .set("age", -10) // wrong value, should be a positive int
  .build()

avroPerson.from(bPersonGen) // this should throw an error

Add support for JDBC reader/writer?

I am wondering if adding a type class for reading/writing from/to a SQL database via JDBC is being considered. It should be possible to implement something like this:

import java.sql.{PreparedStatement, ResultSet}

trait JdbcType[T] extends Converter[T, ResultSet, PreparedStatement]

and derive such converters for any case class. It could then be seamlessly integrated with scio-jdbc.

Compatibility warning when reading Parquet files produced from Avro

parquet-avro writes repeated fields differently than other Parquet modules. A repeated field is normally written as:

repeated T field_name;

But parquet-avro writes it as:

(required|optional) group field_name {
  repeated T array;
}

This maps to field_name: List[T] (required) or field_name: Option[List[T]] (optional), the latter of which is not supported in Magnolify.
To read repeated fields written this way (required group field_name { repeated T array }), one needs to import magnolify.parquet.ParquetArray.AvroCompat._, which changes repeated-field derivation; a usage sketch follows the list below.

We should warn if a user tries to read files written from Avro but forgot the import. This would involve 2 things:

  • Detect that a file is written from Avro - This could be done by checking context.getKeyValueMetadata in ReadSupport has the Avro metadata key
  • Detect whether the ParquetType was derived with AvroCompat or not - This would require propagating isAvro, probably by making it a field in ParquetField/ParquetType.
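For reference, a usage sketch of the import (assuming the current magnolify.parquet package layout):

import magnolify.parquet.ParquetType
import magnolify.parquet.ParquetArray.AvroCompat._ // switch to the parquet-avro list layout

case class Record(fieldName: List[Int])

// With the import, the derived schema reads/writes
//   required group fieldName { repeated int32 array; }
// instead of
//   repeated int32 fieldName;
val pt = ParquetType[Record]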

Enum support

Used mainly in Avro but could be useful for other types as well. We need to support Java Enum, Scala Enumeration, and maybe sealed traits with case object-only subtypes?
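The three enum flavors under consideration, sketched:

// Java enum, defined in Java: enum JavaColor { RED, GREEN, BLUE }

// Scala Enumeration
object ScalaColor extends Enumeration {
  val Red, Green, Blue = Value
}

// Sealed trait with case object-only subtypes
sealed trait AdtColor
case object Red extends AdtColor
case object Green extends AdtColor
case object Blue extends AdtColor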

Bigtable: behaviour of BigtableType.mutationsToRow is different compared to actual Bigtable

The mutationsToRow defined in BigtableType assumes column qualifiers to be UTF-8 strings (for sorting) and sorts the SetCell requests before applying them.

The above code creates as many columns as there are SetCell requests. So when mutations contain the same column but different cell timestamps, this breaks. The expected behaviour would be for one column to hold all the cells that belong to it, with their different cell timestamps.

In actual Bigtable, the row has its column qualifiers sorted, but mutations aren't applied in qualifier order; mutations to the same row are applied serially (a batch write is forced to serialize mutations to the same row).

Flaky tests

https://travis-ci.org/github/spotify/magnolify/jobs/708925430

==> X magnolify.scalacheck.test.ArbitraryDerivationSuite.Prop: Color.uniqueness  0.074s munit.FailException: /home/travis/build/spotify/magnolify/scalacheck/src/test/scala/magnolify/scalacheck/test/ArbitraryDerivationSuite.scala:37
36:    // `forAll(Gen.listOfN(10, g))` fails for `Repeated` & `Collections` when size parameter <= 1
37:    property(s"$name.uniqueness") {
38:      Prop.forAll { l: Long =>
Failing seed: z9omj4mAsT7BcJQoh95CcofqurdkO_jIAGEWFPx7xZA=
You can reproduce this failure by adding the following override to your suite:
  override val scalaCheckInitialSeed = "5E72c0hWZ5ioOvDIj2udNT9KuVs1mGz8EBcV-3zyKgH="
Falsified after 42 passed tests.
> ARG_0: -8560219907156556681

Create CI builds for Avro 1.8, etc.

Due to breaking changes in some old dependency versions that we still support, e.g. Avro 1.8 and Protobuf 3.x, we want to make sure the code stays backwards compatible.

RFC

Motivation

Right now we use a couple of shapeless based libraries in data pipelines.

  • https://github.com/nevillelyh/shapeless-datatype for:
    • conversion between case classes <-> Avro GenericRecord, BigQuery TableRow, Datastore Entity, TensorFlow Example
    • RecordMapper, RecordMatcher, LensMatcher for some generic data type mapping/matching [1]
  • https://github.com/alexarchambault/scalacheck-shapeless for deriving ScalaCheck Gen[T] random generators for tuples and case classes
  • CaseClassDiffy in ratatool to derive Diffy[T] for comparing the "diff" between 2 primitive types, case classes, or Avro/Protobuf/BigQuery records
  • Algebird macros, not shapeless based but also for deriving Semigroup[T] and Monoid[T] for tuples & case classes [2]

The biggest problem with shapeless is slow compilation. Most of the type class derivation cases can be replaced with Magnolia. We already have tfexample-derive for TF Example derivation. For [2], Magnolia should be more flexible and maintainable than handcrafted macros.

Not sure if [1] can be covered with Magnolia, but those components are also not widely used.

Scopes

  • Minimal dependencies, Magnolia and a few necessary ones to reduce migration impact
  • Support Scala 2.11-2.13, with our forked Magnolia for 2.11 support
  • Everything serializable, so they can be used in pipelines
  • Shared interface for similar components, e.g. Converter[T] for type conversion [3]

[3] may require a core sub-project. Another benefit of converters sharing a root trait is reusable tests, etc., but that's debatable.

Tasks

A breakdown of tasks ranked by priority and size.

  • scalacheck Gen[T] derivation
  • avro/bigquery/datastore/tensorflow/protobuf Converter[T] derivation, among them tensorflow already exists, protobuf might be doable but tricky
  • cats Semigroup[T] etc. derivation, since Algebird is not available for Scala 2.13 yet
  • diffy Diffy[T] derivation, with the goal of replacing that in ratatool

Support Avro fixed type

Could be useful for fixed size binary encoding like UUID, checksums, etc. Could be expressed in Scala as a refined type.
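For reference, creating such a fixed schema with the plain Avro API:

import org.apache.avro.Schema

// A 16-byte fixed type, e.g. for an MD5 checksum
val md5Schema = Schema.createFixed("MD5", null, "com.example", 16)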

Ideas

Just a brain dump of nice-to-haves; nothing about feasibility though.

Fix flaky tests

Cogen tests still fail on List and Option sometimes due to Nil/None generated values.

Support Datastore entity keys

Possibly through a field annotation.
https://cloud.google.com/datastore/docs/concepts/entities
https://github.com/googleapis/googleapis/blob/master/google/datastore/v1beta3/entity.proto

For something like Person(@key email: String, name: String, age: Int), the key pieces would map as follows (a sketch follows the list):

  • partition_id
    • project_id: "" or from a class-level @projectId("my-project") annotation
    • namespace_id: "", the case class namespace, or from a class-level @namespace("com.myorg") annotation
  • path (repeated, but we can allow only 1 level for simplicity?)
    • kind: the case class name or from a class-level @kind annotation
    • id/name: Int/String field value
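A sketch of the Key this proposal would produce for Person(@key email = "jane@example.com", ...), using the Datastore v1 protobuf classes (field values are illustrative):

import com.google.datastore.v1.{Key, PartitionId}

val key = Key.newBuilder()
  .setPartitionId(PartitionId.newBuilder()
    .setProjectId("my-project")     // from @projectId, per the proposal
    .setNamespaceId("com.myorg"))   // from @namespace
  .addPath(Key.PathElement.newBuilder()
    .setKind("Person")              // case class name
    .setName("jane@example.com"))   // the @key field's value
  .build()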

Support default values in converters

  • Avro
  • BigQuery
  • Bigtable
  • Datastore
  • Protobuf
  • TensorFlow
For example, a field with a Scala default value currently fails with a NullPointerException instead of falling back to the default:

scala> import magnolify.bigquery._
import magnolify.bigquery._

scala> case class A(i: Long, j: Long = 10)
class A

scala> val t = TableRowType[A]
val t: magnolify.bigquery.TableRowType[A] = magnolify.bigquery.TableRowType$$anon$1@4b88be62

scala> import com.google.api.services.bigquery.model.TableRow
import com.google.api.services.bigquery.model.TableRow

scala> val tr = new TableRow().set("i", 123)
val tr: com.google.api.services.bigquery.model.TableRow = GenericData{classInfo=[f], {i=123}}

scala> t(tr)
java.lang.NullPointerException
  at magnolify.bigquery.TableRowField$.$anonfun$trfLong$1(TableRowType.scala:132)
  at magnolify.bigquery.TableRowField$.$anonfun$trfLong$1$adapted(TableRowType.scala:132)
  at magnolify.bigquery.TableRowField$$anon$4.from(TableRowType.scala:127)
  at magnolify.bigquery.TableRowField.fromAny(TableRowType.scala:61)
  at magnolify.bigquery.TableRowField.fromAny$(TableRowType.scala:61)
  at magnolify.bigquery.TableRowField$$anon$4.fromAny(TableRowType.scala:124)
  at magnolify.bigquery.TableRowField$$anon$2.$anonfun$from$1(TableRowType.scala:87)
  at $anon$1.construct(<console>:1)
  at $anon$1.construct(<console>:1)
  at magnolify.bigquery.TableRowField$$anon$2.from(TableRowType.scala:87)
  at magnolify.bigquery.TableRowField$$anon$2.from(TableRowType.scala:77)
  at magnolify.bigquery.TableRowType$$anon$1.from(TableRowType.scala:47)
  at magnolify.bigquery.TableRowType$$anon$1.from(TableRowType.scala:43)
  at magnolify.bigquery.TableRowType.apply(TableRowType.scala:37)
  at magnolify.bigquery.TableRowType.apply$(TableRowType.scala:37)
  at magnolify.bigquery.TableRowType$$anon$1.apply(TableRowType.scala:43)
  ... 40 elided

BigtableType field names shortening

Bigtable column qualifiers are stored as data in each cell, so the best option is to keep column names as short as possible. At the same time, field names in a user's case class are supposed to be long enough to convey the field's purpose. It would be great to annotate case class fields with an attribute carrying a shortened field name, e.g.:

case class MyClass(@btfieldname("f") superDescriptiveField: String)

I was thinking about adding an implicit ClassTag to the BigtableType apply function and using reflection to traverse the type, build the field map once, and cache it. Wondering if there are any other thoughts on how to do this.
