hablapps / doric
Type safety for Spark columns
Home Page: https://www.hablapps.com/doric/
License: Apache License 2.0
Partition transform functions:
def bucket(numBuckets: Int, e: Column): Column
def bucket(numBuckets: Column, e: Column): Column
def days(e: Column): Column
def hours(e: Column): Column
def months(e: Column): Column
def years(e: Column): Column
DataFrame functions:
def sortWithinPartitions(sortCol: String, sortCols: String*): Dataset[T]
def sortWithinPartitions(sortExprs: Column*): Dataset[T] // #233
def sort(sortCol: String, sortCols: String*): Dataset[T]
def sort(sortExprs: Column*): Dataset[T] // #233
def orderBy(sortCol: String, sortCols: String*): Dataset[T]
def orderBy(sortExprs: Column*): Dataset[T] // #233
Column functions:
def ascii(e: Column): Column
def base64(e: Column): Column
def concat_ws(sep: String, exprs: Column*): Column
def decode(value: Column, charset: String): Column
def encode(value: Column, charset: String): Column
def format_number(x: Column, d: Int): Column
def format_string(format: String, arguments: Column*): Column
def initcap(e: Column): Column
def instr(str: Column, substring: String): Column
def length(e: Column): Column
def levenshtein(l: Column, r: Column): Column
def locate(substr: String, str: Column, pos: Int): Column
def locate(substr: String, str: Column): Column
def lower(e: Column): Column
def lpad(str: Column, len: Int, pad: String): Column
def ltrim(e: Column, trimString: String): Column
def ltrim(e: Column): Column
def overlay(src: Column, replace: Column, pos: Column): Column
def overlay(src: Column, replace: Column, pos: Column, len: Column): Column
def regexp_extract(e: Column, exp: String, groupIdx: Int): Column
def regexp_replace(e: Column, pattern: Column, replacement: Column): Column
def regexp_replace(e: Column, pattern: String, replacement: String): Column
def repeat(str: Column, n: Int): Column
def rpad(str: Column, len: Int, pad: String): Column
def rtrim(e: Column, trimString: String): Column
def rtrim(e: Column): Column
def soundex(e: Column): Column
def split(str: Column, pattern: String, limit: Int): Column
def split(str: Column, pattern: String): Column
def substring(str: Column, pos: Int, len: Int): Column
def substring_index(str: Column, delim: String, count: Int): Column
def translate(src: Column, matchingString: String, replaceString: String): Column
def trim(e: Column, trimString: String): Column
def trim(e: Column): Column
def unbase64(e: Column): Column
def upper(e: Column): Column
As asked in #184, this thread is a list of everything needed for the milestone Doric for Scala 2.11.
Pending tasks to migrate doric to Scala 2.11 (Spark 2.4 at least):
spark-fast-tests: build for Scala 2.11 (MrPowers/spark-fast-tests#106)
I will add a new section explaining why doric uses custom types and not Spark User Defined Types (or what the differences are). This will probably be a FAQ.
Originally posted by @eruizalo in #95 (comment)
The doric column types java.sql.{Date, Timestamp} will be removed, leaving only the java.time.{Instant, LocalDate} column types.
Make doric testable in my binder: https://mybinder.org/
I'm really excited about this project!
Think about the features that'll be included in the "initial public release". Once all the initial features are built, ping me, and I'll make a commit to make a compelling sell in the project README.
Once the README is updated, I'll start marketing the project to try to get users and feedback on the code.
Sounds like a good plan? I'm definitely interested in seeing this project grow & get a lot of users!
Given a Row column, we'd like to access its fields as regular Scala object fields.
Currently:
case class User(name: String, age: Int)
val df = List((User("John", 34), 1)).toDF("user", "col2")
val dn: DoricColumn[String] = col[Row]("user").getChild[String]("name")
val da: DoricColumn[Int] = col[Row]("user").getChild[Int]("age")
Desired:
...
val dn: DoricColumn[String] = col[Row]("user").name[String]
val da: DoricColumn[Int] = col[Row]("user").age[Int]
And, even better:
...
val dn: DoricColumn[String] = row.user[Row].name[String]
val da: DoricColumn[Int] = row.user[Row].age[Int]
val db: DoricColumn[Int] = row.col2[Int]
The last example is equivalent to col[Int]("col2"). The idea is to make row similar to this.
If the parent column is not a Row, the invocation should not compile. E.g., the following expression
...
col[Int]("n").name[String]
will make the compiler complain that Int is not equal to Row.
Right now we have some column methods that require an extra signature to pass literals. We should simplify this to a single method and let the user import implicit conversions if desired (see the sketch after the list below).
def cume_dist(): Column
def dense_rank(): Column
def lag(e: Column, offset: Int, defaultValue: Any): Column
def lag(columnName: String, offset: Int, defaultValue: Any): Column // ❌ won't do, same function with DoricColumn
def lag(columnName: String, offset: Int): Column // ❌ won't do, same function with DoricColumn
def lag(e: Column, offset: Int): Column
def lead(e: Column, offset: Int, defaultValue: Any): Column
def lead(columnName: String, offset: Int, defaultValue: Any): Column // ❌ won't do, same function with DoricColumn
def lead(e: Column, offset: Int): Column
def lead(columnName: String, offset: Int): Column // ❌ won't do, same function with DoricColumn
def nth_value(e: Column, offset: Int): Column // #236
def nth_value(e: Column, offset: Int, ignoreNulls: Boolean): Column // #236
def ntile(n: Int): Column
def percent_rank(): Column
def rank(): Column
def row_number(): Column
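A sketch of the single-method approach: keep only the column-based signature and let users opt in to an implicit conversion that lifts plain values into literal columns (all names here are illustrative stand-ins):

import scala.language.implicitConversions

object literals {
  // Stand-ins to keep the sketch self-contained.
  case class DoricColumn[T](expr: String)
  def lit[T](value: T): DoricColumn[T] = DoricColumn(value.toString)

  // Only the column-based signature exists...
  def lag[T](e: DoricColumn[T], offset: DoricColumn[Int]): DoricColumn[T] = ???

  // ...and users who want to pass raw literals import this conversion.
  object implicits {
    implicit def literalToColumn[T](value: T): DoricColumn[T] = lit(value)
  }
}

// import literals._, literals.implicits._
// lag(someColumn, 1) // 1 is lifted to lit(1); no extra overload is needed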
Use new types for Scala 2 and 3 compatibility, to avoid instantiating doric columns as case classes: https://newtypes.monix.io/docs/core.html
Add a Map column that validates the map according to its key and value types; a possible shape is sketched below.
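Here SparkType and DoricColumn are illustrative stand-ins for doric's type class and column:

object maps {
  // Illustrative stand-ins, not doric's actual definitions.
  trait SparkType[T]
  case class DoricColumn[T](name: String)

  // The map column only resolves when both the key and the value types are
  // valid Spark types, so unsupported types become compile-time errors.
  def colMap[K: SparkType, V: SparkType](name: String): DoricColumn[Map[K, V]] =
    DoricColumn(name)
}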
Doric doesn't have time comparators to work with
col[LocalDate]("aaa") > col("bbb")
Not sure if this issue is worth it; remove the TODO comment if it isn't.
Add missing casts
Implement window functions compatible with doric columns
Allow the user to fall back to Spark for doric methods that are not yet implemented, and then transform the result into a doric column (see the sketch below):
(col("a") + col("b")).as[Int]
It would be nice to retrieve the latest version of doric from Maven Central and use it to publish the documentation.
Documentation built from a release may not pick up the correct version; currently the documentation has to be updated manually.
syntax.types, type classes SparkType, NumericType, Cast, etc. Instances of type classes, in companion objects.
sem, with Errors, DataFrameOps, ...
DoricColumn: getters, sparkfunction, ...
concat, Array concat, ... to syntax; package function with mixins from syntax
habla.doric to simply doric
Implemented here
def aggregate(expr: Column, initialValue: Column, merge: (Column, Column) ⇒ Column): Column // Infix method #58
def aggregate(expr: Column, initialValue: Column, merge: (Column, Column) ⇒ Column, finish: (Column) ⇒ Column): Column // Infix method aggregateWT #58
def array_contains(column: Column, value: Any): Column // #109
def array_distinct(e: Column): Column // #109
def array_except(col1: Column, col2: Column): Column // #109
def array_intersect(col1: Column, col2: Column): Column // #109
def array_join(column: Column, delimiter: String): Column // #109
def array_join(column: Column, delimiter: String, nullReplacement: String): Column // #109
def array_max(e: Column): Column // #109
def array_min(e: Column): Column // #109
def array_position(column: Column, value: Any): Column // #109
def array_remove(column: Column, element: Any): Column // #109
def array_repeat(e: Column, count: Int): Column
def array_repeat(left: Column, right: Column): Column // #109
def array_sort(e: Column): Column // #109
def array_union(col1: Column, col2: Column): Column // #109
def arrays_overlap(a1: Column, a2: Column): Column // #109
def arrays_zip(e: Column*): Column // #287
def concat(exprs: Column*): Column // Prefix method concat for strings, concatArrays for arrays #58
def element_at(column: Column, value: Any): Column // #109
def exists(column: Column, f: (Column) ⇒ Column): Column // #109
def explode(e: Column): Column // #109
def explode_outer(e: Column): Column // #109
def filter(column: Column, f: (Column, Column) ⇒ Column): Column // #109
def filter(column: Column, f: (Column) ⇒ Column): Column // Infix method filter #109
def flatten(e: Column): Column // #227
def forall(column: Column, f: (Column) ⇒ Column): Column // #109
def map_concat(cols: Column*): Column // #109
def posexplode(e: Column): Column // #287
def posexplode_outer(e: Column): Column // #287
def reverse(e: Column): Column // #109
def schema_of_csv(csv: Column, options: Map[String, String]): Column // #287
def schema_of_csv(csv: Column): Column
def schema_of_csv(csv: String): Column
def schema_of_json(json: Column, options: Map[String, String]): Column // #287
def schema_of_json(json: Column): Column
def schema_of_json(json: String): Column
def sequence(start: Column, stop: Column): Column // #109
def sequence(start: Column, stop: Column, step: Column): Column // #109
def shuffle(e: Column): Column // #109
def size(e: Column): Column // #109
def slice(x: Column, start: Column, length: Column): Column // #109
def slice(x: Column, start: Int, length: Int): Column
def sort_array(e: Column, asc: Boolean): Column // #109
def sort_array(e: Column): Column // #109
def transform(column: Column, f: (Column, Column) ⇒ Column): Column // Infix method transformWithIndex #58
def transform(column: Column, f: (Column) ⇒ Column): Column // Infix method #58
def zip_with(left: Column, right: Column, f: (Column, Column) ⇒ Column): Column // #109
Implemented here
def map_entries(e: Column): Column // #287
def map_filter(expr: Column, f: (Column, Column) ⇒ Column): Column // #109
def map_from_entries(e: Column): Column // #287
def map_from_arrays(keys: Column, values: Column): Column // #287
def map_keys(e: Column): Column // Postfix method keys #58
def map_values(e: Column): Column // Postfix method values #58
def map_zip_with(left: Column, right: Column, f: (Column, Column, Column) ⇒ Column): Column // #109
def transform_keys(expr: Column, f: (Column, Column) ⇒ Column): Column // #109
def transform_values(expr: Column, f: (Column, Column) ⇒ Column): Column // #109
def explode(e: Column): Column // #287
def explode_outer(e: Column): Column // #287
def from_csv(e: Column, schema: Column, options: Map[String, String]): Column // #287
def from_csv(e: Column, schema: StructType, options: Map[String, String]): Column // #287
def from_json(e: Column, schema: Column, options: Map[String, String]): Column // #287
def from_json(e: Column, schema: Column): Column
def from_json(e: Column, schema: String, options: Map[String, String]): Column
def from_json(e: Column, schema: DataType): Column // #287
def from_json(e: Column, schema: StructType): Column // #287
def from_json(e: Column, schema: DataType, options: Map[String, String]): Column // #287
def from_json(e: Column, schema: StructType, options: Map[String, String]): Column // #287
def get_json_object(e: Column, path: String): Column // #287
def to_csv(e: Column): Column
def to_csv(e: Column, options: Map[String, String]): Column // #287
def to_json(e: Column): Column
def to_json(e: Column, options: Map[String, String]): Column // #287
def json_tuple(json: Column, fields: String*): Column // #287
Compile errors mixing incompatible types of columns
Use scalacheck to randomly generate DataFrames using Scala types.
Will test in a review of the implicit conversions.
Originally posted by @alfonsorr in #95 (comment)
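A minimal sketch of the ScalaCheck idea, assuming a local SparkSession (the generator shapes are illustrative):

import org.scalacheck.Gen
import org.apache.spark.sql.SparkSession

object GenDataFrames {
  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._

  // Build rows from plain Scala types, then lift each sample into a DataFrame.
  val rowGen: Gen[(String, Int)] =
    for {
      name <- Gen.alphaStr
      age  <- Gen.choose(0, 120)
    } yield (name, age)

  val dfGen = Gen.listOfN(100, rowGen).map(_.toDF("name", "age"))
}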
Testing might be improved with custom matchers.
For instance, instead of:
val df = List((User("John", "doe", 34), 1))
.toDF("col", "delete")
.select("col")
...
it("throws an error if the sub column is not of the provided type") {
colStruct("col")
.getChild[String]("age")
.elem
.run(df)
.toEither
.left
.value
.head shouldBe ColumnTypeError("col.age", StringType, IntegerType)
}
it'd be nice to write something like this:
it("..."){
df.select(colStruct("col").getChild[String]("age")) should
failWith(ColumnTypeError("col.age", StringType, IntegerType))
}
In general, testing should not expose implementation details and should stick to the programmer API whenever possible.
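A sketch of such a failWith matcher in ScalaTest; the error types and the way errors are recovered from the evaluated expression are stand-ins for whatever doric exposes:

import org.scalatest.matchers.{MatchResult, Matcher}

object matchers {
  // Illustrative stand-ins for doric's error channel.
  sealed trait DoricError
  final case class ColumnTypeError(name: String, expected: String, found: String)
      extends DoricError

  // Hypothetical: the evaluated expression exposes the errors it produced.
  trait ErroredResult { def errors: List[DoricError] }

  def failWith(expected: DoricError): Matcher[ErroredResult] =
    Matcher { result =>
      MatchResult(
        result.errors.contains(expected),
        s"expected error $expected, but got ${result.errors}",
        s"did not expect error $expected, but it was produced"
      )
    }
}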
We can't create column expressions that access fields of product type columns. For instance, given the following product type:
case class User(name: String, age: Int)
we would like to access fields of users as follows:
col[User]("user").getChildSafe("name"): DoricColumn[String]
col[User]("user").getChildSafe("age"): DoricColumn[Int]
Moreover, we would like to get a compilation error if we try to access a non-existing field:
col[User]("user").getChildSafe("surname") // should not compile
Dynamic invocations should be allowed too:
col[User]("user").child.name[String]: DoricColumn[String]
col[User]("user").child.age[Int]: DoricColumn[Int]
col[User]("user").child.surname[String] // should raise a non-existent column error at runtime
List of all the aggregate functions in Spark, all of them marked as implemented in doric.
def add_months(startDate: Column, numMonths: Column): Column
def add_months(startDate: Column, numMonths: Int): Column
def current_date(): Column
def current_timestamp(): Column
def date_add(start: Column, days: Column): Column
def date_add(start: Column, days: Int): Column
def date_format(dateExpr: Column, format: String): Column
def date_sub(start: Column, days: Column): Column
def date_sub(start: Column, days: Int): Column
def date_trunc(format: String, timestamp: Column): Column
def datediff(end: Column, start: Column): Column
def dayofmonth(e: Column): Column
def dayofweek(e: Column): Column
def dayofyear(e: Column): Column
def from_unixtime(ut: Column, f: String): Column
def from_unixtime(ut: Column): Column
def from_utc_timestamp(ts: Column, tz: Column): Column
def from_utc_timestamp(ts: Column, tz: String): Column
def hour(e: Column): Column
def last_day(e: Column): Column
def minute(e: Column): Column
def month(e: Column): Column
def months_between(end: Column, start: Column, roundOff: Boolean): Column
def months_between(end: Column, start: Column): Column
def next_day(date: Column, dayOfWeek: String): Column
def quarter(e: Column): Column
def second(e: Column): Column
def timestamp_seconds(e: Column): Column
def to_date(e: Column, fmt: String): Column
def to_date(e: Column): Column
def to_timestamp(s: Column, fmt: String): Column
def to_timestamp(s: Column): Column
def to_utc_timestamp(ts: Column, tz: Column): Column
def to_utc_timestamp(ts: Column, tz: String): Column
def trunc(date: Column, format: String): Column
def unix_timestamp(s: Column, p: String): Column
def unix_timestamp(s: Column): Column
def unix_timestamp(): Column
def weekofyear(e: Column): Column
def window(timeColumn: Column, windowDuration: String): Column
def window(timeColumn: Column, windowDuration: String, slideDuration: String): Column
def window(timeColumn: Column, windowDuration: String, slideDuration: String, startTime: String): Column
def year(e: Column): Column
Adding scaladoc links to higher-order Spark functions breaks the doc. Here are some examples:
Related to #109:
org.apache.spark.sql.functions.array
org.apache.spark.sql.functions.transform
org.apache.spark.sql.functions.aggregate
org.apache.spark.sql.functions.filter
org.apache.spark.sql.functions.array_join
Related to #138:
org.apache.spark.sql.functions.locate
org.apache.spark.sql.functions.split
org.apache.spark.sql.functions.to_utc_timestamp
org.apache.spark.sql.functions.from_utc_timestamp
A tag has been added for future fixing: @todo scaladoc link
def abs(e: Column): Column // #223
def acos(columnName: String): Column
def acos(e: Column): Column // #223
def acosh(columnName: String): Column
def acosh(e: Column): Column // #223
def asin(columnName: String): Column
def asin(e: Column): Column // #223
def asinh(columnName: String): Column
def asinh(e: Column): Column // #223
def atan(columnName: String): Column
def atan(e: Column): Column // #223
def atan2(yValue: Double, xName: String): Column
def atan2(yValue: Double, x: Column): Column
def atan2(yName: String, xValue: Double): Column
def atan2(y: Column, xValue: Double): Column
def atan2(yName: String, xName: String): Column
def atan2(yName: String, x: Column): Column
def atan2(y: Column, xName: String): Column
def atan2(y: Column, x: Column): Column // #223
def atanh(columnName: String): Column
def atanh(e: Column): Column // #223
def bin(columnName: String): Column
def bin(e: Column): Column // #223
def bround(e: Column, scale: Int): Column // #223
def bround(e: Column): Column // #223
def cbrt(columnName: String): Column
def cbrt(e: Column): Column // #223
def ceil(columnName: String): Column
def ceil(e: Column): Column // #223
def conv(num: Column, fromBase: Int, toBase: Int): Column // #223
def cos(columnName: String): Column
def cos(e: Column): Column // #223
def cosh(columnName: String): Column
def cosh(e: Column): Column // #223
def degrees(columnName: String): Column
def degrees(e: Column): Column // #223
def exp(columnName: String): Column
def exp(e: Column): Column // #223
def expm1(columnName: String): Column
def expm1(e: Column): Column // #223
def factorial(e: Column): Column // #223
def floor(columnName: String): Column
def floor(e: Column): Column // #223
def hex(column: Column): Column // #223
def hypot(l: Double, rightName: String): Column
def hypot(l: Double, r: Column): Column
def hypot(leftName: String, r: Double): Column
def hypot(l: Column, r: Double): Column
def hypot(leftName: String, rightName: String): Column
def hypot(leftName: String, r: Column): Column
def hypot(l: Column, rightName: String): Column
def hypot(l: Column, r: Column): Column // #223
def log(base: Double, columnName: String): Column
def log(base: Double, a: Column): Column // #223
def log(columnName: String): Column
def log(e: Column): Column // #223
def log10(columnName: String): Column
def log10(e: Column): Column // #223
def log1p(columnName: String): Column
def log1p(e: Column): Column // #223
def log2(columnName: String): Column
def log2(expr: Column): Column // #223
def pmod(dividend: Column, divisor: Column): Column // #223
def pow(l: Double, rightName: String): Column
def pow(l: Double, r: Column): Column
def pow(leftName: String, r: Double): Column
def pow(l: Column, r: Double): Column
def pow(leftName: String, rightName: String): Column
def pow(leftName: String, r: Column): Column
def pow(l: Column, rightName: String): Column
def pow(l: Column, r: Column): Column // #223
def radians(columnName: String): Column
def radians(e: Column): Column // #223
def rint(columnName: String): Column
def rint(e: Column): Column // #223
def round(e: Column, scale: Int): Column
def round(e: Column): Column // #223
def shiftLeft(e: Column, numBits: Int): Column // #223
def shiftRight(e: Column, numBits: Int): Column // #223
def shiftRightUnsigned(e: Column, numBits: Int): Column // #223
def signum(columnName: String): Column
def signum(e: Column): Column // #223
def sin(columnName: String): Column
def sin(e: Column): Column // #223
def sinh(columnName: String): Column
def sinh(e: Column): Column // #223
def sqrt(colName: String): Column
def sqrt(e: Column): Column // #223
def tan(columnName: String): Column
def tan(e: Column): Column // #223
def tanh(columnName: String): Column
def tanh(e: Column): Column // #223
def unhex(column: Column): Column // #223
def toDegrees(columnName: String): Column
def toDegrees(e: Column): Column
def toRadians(columnName: String): Column
def toRadians(e: Column): Column
def array(colName: String, colNames: String*): Column // function array
def array(cols: Column*): Column // function array
def bitwiseNOT(e: Column): Column // #262
def broadcast[T](df: Dataset[T]): Dataset[T] // not needed because it is not Column-dependent
def coalesce(e: Column*): Column
def col(colName: String): Column // use col[T](colName: String) or the specializations colString, colInt, etc.
def column(colName: String): Column // use the same as col
def expr(expr: String): Column
def greatest(columnName: String, columnNames: String*): Column
def greatest(exprs: Column*): Column // #113
def input_file_name(): Column // #113
def isnan(e: Column): Column
def isnull(e: Column): Column // postfix isnull
def least(columnName: String, columnNames: String*): Column
def least(exprs: Column*): Column // #113
def lit(literal: Any): Column // use lit[T](literal: T)
def map(cols: Column*): Column
def map_from_arrays(keys: Column, values: Column): Column
def monotonically_increasing_id(): Column // #113
def nanvl(col1: Column, col2: Column): Column // #262
def negate(e: Column): Column // #262
def not(e: Column): Column // #113
def rand(): Column // #113
def rand(seed: Long): Column // #113
def randn(): Column // #113
def randn(seed: Long): Column // #113
def spark_partition_id(): Column // #113
def struct(colName: String, colNames: String*): Column
def struct(cols: Column*): Column // #95
def when(condition: Column, value: Any): Column // method builder when
def monotonicallyIncreasingId(): Column
Add a column that validates that the data type is a struct.
0.0.2
Like the concat Spark function, using ds"Column value: $myColumn" will output null if myColumn contains a null.
It should have output:
"Column value: null" or
"Column value: "
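A sketch of the null handling the interpolator could apply under the hood, rendering a null value as the string "null" via a per-part coalesce (plain Spark code, not doric's implementation):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{coalesce, concat, lit}

object nullSafe {
  // Each interpolated part is cast to string and null-defaulted before the
  // final concat, so a single null column no longer nulls the whole result.
  def nullSafeConcat(parts: Column*): Column =
    concat(parts.map(p => coalesce(p.cast("string"), lit("null"))): _*)
}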
Create a DoricJoinColumn:
Kleisli[DoricValidated, (DataFrame, DataFrame), Column]
???
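A sketch of that type with cats, assuming DoricValidated is doric's accumulating Validated alias (the error type below is a stand-in):

import cats.data.{Kleisli, ValidatedNec}
import org.apache.spark.sql.{Column, DataFrame}

object joins {
  // Stand-in error type; doric's real DoricValidated may differ.
  sealed trait DoricError
  type DoricValidated[A] = ValidatedNec[DoricError, A]

  // A join column must resolve against BOTH sides of the join, so the
  // Kleisli input is the pair of dataframes being joined.
  type DoricJoinColumn = Kleisli[DoricValidated, (DataFrame, DataFrame), Column]
}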
def assert_true(c: Column, e: Column): Column
def assert_true(c: Column): Column
def crc32(e: Column): Column
def hash(cols: Column*): Column
def md5(e: Column): Column
def raise_error(c: Column): Column
def sha1(e: Column): Column
def sha2(e: Column, numBits: Int): Column
def xxhash64(cols: Column*): Column
Create syntax to operate with arrays