hablapps / doric
Type safety for Spark columns
Home Page: https://www.hablapps.com/doric/
License: Apache License 2.0
Partition transform functions:
def bucket(numBuckets: Int, e: Column): Column
def bucket(numBuckets: Column, e: Column): Column
def days(e: Column): Column
def hours(e: Column): Column
def months(e: Column): Column
def years(e: Column): Column
DataFrame functions:
def sortWithinPartitions(sortCol: String, sortCols: String*): Dataset[T]
def sortWithinPartitions(sortExprs: Column*): Dataset[T] // #233
def sort(sortCol: String, sortCols: String*): Dataset[T]
def sort(sortExprs: Column*): Dataset[T] // #233
def orderBy(sortCol: String, sortCols: String*): Dataset[T]
def orderBy(sortExprs: Column*): Dataset[T] // #233
Column functions:
def ascii(e: Column): Column
def base64(e: Column): Column
def concat_ws(sep: String, exprs: Column*): Column
def decode(value: Column, charset: String): Column
def encode(value: Column, charset: String): Column
def format_number(x: Column, d: Int): Column
def format_string(format: String, arguments: Column*): Column
def initcap(e: Column): Column
def instr(str: Column, substring: String): Column
def length(e: Column): Column
def levenshtein(l: Column, r: Column): Column
def locate(substr: String, str: Column, pos: Int): Column
def locate(substr: String, str: Column): Column
def lower(e: Column): Column
def lpad(str: Column, len: Int, pad: String): Column
def ltrim(e: Column, trimString: String): Column
def ltrim(e: Column): Column
def overlay(src: Column, replace: Column, pos: Column): Column
def overlay(src: Column, replace: Column, pos: Column, len: Column): Column
def regexp_extract(e: Column, exp: String, groupIdx: Int): Column
def regexp_replace(e: Column, pattern: Column, replacement: Column): Column
def regexp_replace(e: Column, pattern: String, replacement: String): Column
def repeat(str: Column, n: Int): Column
def rpad(str: Column, len: Int, pad: String): Column
def rtrim(e: Column, trimString: String): Column
def rtrim(e: Column): Column
def soundex(e: Column): Column
def split(str: Column, pattern: String, limit: Int): Column
def split(str: Column, pattern: String): Column
def substring(str: Column, pos: Int, len: Int): Column
def substring_index(str: Column, delim: String, count: Int): Column
def translate(src: Column, matchingString: String, replaceString: String): Column
def trim(e: Column, trimString: String): Column
def trim(e: Column): Column
def unbase64(e: Column): Column
def upper(e: Column): Column
As asked in #184, this thread is a list of everything needed for the milestone Doric for Scala 2.11.
Pending tasks to migrate doric to Scala 2.11 (Spark 2.4 at least):
spark-fast-tests: build for Scala 2.11 (MrPowers/spark-fast-tests#106)
I will add a new section explaining why doric uses custom types and not Spark User Defined Types (or what the differences are). This will probably be a FAQ.
Originally posted by @eruizalo in #95 (comment)
The doric column types java.sql.{Date, Timestamp} will be removed, leaving only the java.time.{Instant, LocalDate} column types.
Make doric testable in my binder: https://mybinder.org/
I'm really excited about this project!
Think about the features that'll be included in the "initial public release". Once all the initial features are built, ping me, and I'll make a commit to make a compelling sell in the project README.
Once the README is updated, I'll start marketing the project to try to get users and feedback on the code.
Sounds like a good plan? I'm definitely interested in seeing this project grow & get a lot of users!
Given a Row column, we'd like to access its fields as regular Scala object fields.
Currently:
case class User(name: String, age: Int)
val df = List((User("John", 34), 1)).toDF("user", "col2")
val dn: DoricColumn[String] = col[Row]("user").getChild[String]("name")
val da: DoricColumn[Int] = col[Row]("user").getChild[Int]("age")
Desired:
...
val dn: DoricColumn[String] = col[Row]("user").name[String]
val da: DoricColumn[Int] = col[Row]("user").age[Int]
And, even better:
...
val dn: DoricColumn[String] = row.user[Row].name[String]
val da: DoricColumn[Int] = row.user[Row].age[Int]
val db: DoricColumn[Int] = row.col2[Int]
The last example is equivalent to col[Int]("col2"). The idea is to make row similar to this.
If the parent column is not a Row, the invocation should not compile. E.g., the following expression
...
col[Int]("n").name[String]
will make the compiler complain that Int is not equal to Row.
Right now we have some column methods that require an extra signature to pass literals. We should simplify this to a single method and let the user import implicit conversions if desired (see the sketch after the list below).
def cume_dist(): Column
def dense_rank(): Column
def lag(e: Column, offset: Int, defaultValue: Any): Column
def lag(columnName: String, offset: Int, defaultValue: Any): Column // ❌ won't do, same function with DoricColumn
def lag(columnName: String, offset: Int): Column // ❌ won't do, same function with DoricColumn
def lag(e: Column, offset: Int): Column
def lead(e: Column, offset: Int, defaultValue: Any): Column
def lead(columnName: String, offset: Int, defaultValue: Any): Column // ❌ won't do, same function with DoricColumn
def lead(e: Column, offset: Int): Column
def lead(columnName: String, offset: Int): Column // ❌ won't do, same function with DoricColumn
def nth_value(e: Column, offset: Int): Column // #236
def nth_value(e: Column, offset: Int, ignoreNulls: Boolean): Column // #236
def ntile(n: Int): Column
def percent_rank(): Column
def rank(): Column
def row_number(): Column
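A sketch of the single-method approach: keep only the column-based signature and let users opt in to an implicit conversion that lifts plain values into literal columns (all names here are illustrative stand-ins):

import scala.language.implicitConversions

object literals {
  // Stand-ins to keep the sketch self-contained.
  case class DoricColumn[T](expr: String)
  def lit[T](value: T): DoricColumn[T] = DoricColumn(value.toString)

  // Only the column-based signature exists...
  def lag[T](e: DoricColumn[T], offset: DoricColumn[Int]): DoricColumn[T] = ???

  // ...and users who want to pass raw literals import this conversion.
  object implicits {
    implicit def literalToColumn[T](value: T): DoricColumn[T] = lit(value)
  }
}

// import literals._, literals.implicits._
// lag(someColumn, 1) // 1 is lifted to lit(1); no extra overload is needed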
Use new types for Scala 2 and 3 compatibility, to avoid instantiating doric columns as case classes: https://newtypes.monix.io/docs/core.html
Add a Map column that validates the map according to its key and value types; a possible shape is sketched below.
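Here SparkType and DoricColumn are illustrative stand-ins for doric's type class and column:

object maps {
  // Illustrative stand-ins, not doric's actual definitions.
  trait SparkType[T]
  case class DoricColumn[T](name: String)

  // The map column only resolves when both the key and the value types are
  // valid Spark types, so unsupported types become compile-time errors.
  def colMap[K: SparkType, V: SparkType](name: String): DoricColumn[Map[K, V]] =
    DoricColumn(name)
}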
Doric doesn't have time comparators to work with
col[LocalDate]("aaa") > col("bbb")
Not sure if this issue is worth it; remove the TODO comment if it isn't.
Add missing casts
Implement window functions compatible with doric columns
Allow the user to fall back to Spark for doric methods that are not yet implemented, and then transform the result into a doric column (see the sketch below):
(col("a") + col("b")).as[Int]
It would be nice to retrieve the latest version of doric from Maven Central and use it to publish the documentation.
Documentation built from a release may not pick up the correct version; currently the documentation has to be updated manually.
syntax.types, type classes SparkType, NumericType, Cast, etc. Instances of type classes, in companion objects.
sem, with Errors, DataFrameOps, ...
DoricColumn: getters, sparkfunction, ...
concat, Array concat, ... to syntax; package function with mixins from syntax
habla.doric to simply doric
Implemented here
def aggregate(expr: Column, initialValue: Column, merge: (Column, Column) ⇒ Column): Column // Infix method #58
def aggregate(expr: Column, initialValue: Column, merge: (Column, Column) ⇒ Column, finish: (Column) ⇒ Column): Column // Infix method aggregateWT #58
def array_contains(column: Column, value: Any): Column // #109
def array_distinct(e: Column): Column // #109
def array_except(col1: Column, col2: Column): Column // #109
def array_intersect(col1: Column, col2: Column): Column // #109
def array_join(column: Column, delimiter: String): Column // #109
def array_join(column: Column, delimiter: String, nullReplacement: String): Column // #109
def array_max(e: Column): Column // #109
def array_min(e: Column): Column // #109
def array_position(column: Column, value: Any): Column // #109
def array_remove(column: Column, element: Any): Column // #109
def array_repeat(e: Column, count: Int): Column
def array_repeat(left: Column, right: Column): Column // #109
def array_sort(e: Column): Column // #109
def array_union(col1: Column, col2: Column): Column // #109
def arrays_overlap(a1: Column, a2: Column): Column // #109
def arrays_zip(e: Column*): Column // #287
def concat(exprs: Column*): Column // Prefix method concat for strings, concatArrays for arrays #58
def element_at(column: Column, value: Any): Column // #109
def exists(column: Column, f: (Column) ⇒ Column): Column // #109
def explode(e: Column): Column // #109
def explode_outer(e: Column): Column // #109
def filter(column: Column, f: (Column, Column) ⇒ Column): Column // #109
def filter(column: Column, f: (Column) ⇒ Column): Column // Infix method filter #109
def flatten(e: Column): Column // #227
def forall(column: Column, f: (Column) ⇒ Column): Column // #109
def map_concat(cols: Column*): Column // #109
def posexplode(e: Column): Column // #287
def posexplode_outer(e: Column): Column // #287
def reverse(e: Column): Column // #109
def schema_of_csv(csv: Column, options: Map[String, String]): Column // #287
def schema_of_csv(csv: Column): Column
def schema_of_csv(csv: String): Column
def schema_of_json(json: Column, options: Map[String, String]): Column // #287
def schema_of_json(json: Column): Column
def schema_of_json(json: String): Column
def sequence(start: Column, stop: Column): Column // #109
def sequence(start: Column, stop: Column, step: Column): Column // #109
def shuffle(e: Column): Column // #109
def size(e: Column): Column // #109
def slice(x: Column, start: Column, length: Column): Column // #109
def slice(x: Column, start: Int, length: Int): Column
def sort_array(e: Column, asc: Boolean): Column // #109
def sort_array(e: Column): Column // #109
def transform(column: Column, f: (Column, Column) ⇒ Column): Column // Infix method transformWithIndex #58
def transform(column: Column, f: (Column) ⇒ Column): Column // Infix method #58
def zip_with(left: Column, right: Column, f: (Column, Column) ⇒ Column): Column // #109
Implemented here
def map_entries(e: Column): Column // #287
def map_filter(expr: Column, f: (Column, Column) ⇒ Column): Column // #109
def map_from_entries(e: Column): Column // #287
def map_from_arrays(keys: Column, values: Column): Column // #287
def map_keys(e: Column): Column // Postfix method keys #58
def map_values(e: Column): Column // Postfix method values #58
def map_zip_with(left: Column, right: Column, f: (Column, Column, Column) ⇒ Column): Column // #109
def transform_keys(expr: Column, f: (Column, Column) ⇒ Column): Column // #109
def transform_values(expr: Column, f: (Column, Column) ⇒ Column): Column // #109
def explode(e: Column): Column // #287
def explode_outer(e: Column): Column // #287
def from_csv(e: Column, schema: Column, options: Map[String, String]): Column // #287
def from_csv(e: Column, schema: StructType, options: Map[String, String]): Column // #287
def from_json(e: Column, schema: Column, options: Map[String, String]): Column // #287
def from_json(e: Column, schema: Column): Column
def from_json(e: Column, schema: String, options: Map[String, String]): Column
def from_json(e: Column, schema: DataType): Column // #287
def from_json(e: Column, schema: StructType): Column // #287
def from_json(e: Column, schema: DataType, options: Map[String, String]): Column // #287
def from_json(e: Column, schema: StructType, options: Map[String, String]): Column // #287
def get_json_object(e: Column, path: String): Column // #287
def to_csv(e: Column): Column
def to_csv(e: Column, options: Map[String, String]): Column // #287
def to_json(e: Column): Column
def to_json(e: Column, options: Map[String, String]): Column // #287
def json_tuple(json: Column, fields: String*): Column // #287
Compile errors mixing incompatible types of columns
Use scalacheck to randomly generate DataFrames using Scala types.
Will test in a review of the implicit conversions.
Originally posted by @alfonsorr in #95 (comment)
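A minimal sketch of the ScalaCheck idea, assuming a local SparkSession (the generator shapes are illustrative):

import org.scalacheck.Gen
import org.apache.spark.sql.SparkSession

object GenDataFrames {
  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._

  // Build rows from plain Scala types, then lift each sample into a DataFrame.
  val rowGen: Gen[(String, Int)] =
    for {
      name <- Gen.alphaStr
      age  <- Gen.choose(0, 120)
    } yield (name, age)

  val dfGen = Gen.listOfN(100, rowGen).map(_.toDF("name", "age"))
}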
Testing might be improved with custom matchers.
For instance, instead of:
val df = List((User("John", "doe", 34), 1))
.toDF("col", "delete")
.select("col")
...
it("throws an error if the sub column is not of the provided type") {
colStruct("col")
.getChild[String]("age")
.elem
.run(df)
.toEither
.left
.value
.head shouldBe ColumnTypeError("col.age", StringType, IntegerType)
}
it'd be nice to write something like this:
it("..."){
df.select(colStruct("col").getChild[String]("age")) should
failWith(ColumnTypeError("col.age", StringType, IntegerType))
}
In general, testing should not expose implementation details and should stick to the programmer API whenever possible.
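A sketch of such a failWith matcher in ScalaTest; the error types and the way errors are recovered from the evaluated expression are stand-ins for whatever doric exposes:

import org.scalatest.matchers.{MatchResult, Matcher}

object matchers {
  // Illustrative stand-ins for doric's error channel.
  sealed trait DoricError
  final case class ColumnTypeError(name: String, expected: String, found: String)
      extends DoricError

  // Hypothetical: the evaluated expression exposes the errors it produced.
  trait ErroredResult { def errors: List[DoricError] }

  def failWith(expected: DoricError): Matcher[ErroredResult] =
    Matcher { result =>
      MatchResult(
        result.errors.contains(expected),
        s"expected error $expected, but got ${result.errors}",
        s"did not expect error $expected, but it was produced"
      )
    }
}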
We can't create column expressions that access fields of product type columns. For instance, given the following product type:
case class User(name: String, age: Int)
we would like to access fields of users as follows:
col[User]("user").getChildSafe("name"): DoricColumn[String]
col[User]("user").getChildSafe("age"): DoricColumn[Int]
Moreover, we would like to get a compilation error if we try to access a non-existing field:
col[User]("user").getChildSafe("surname") // should not compile
Dynamic invocations should be allowed too:
col[User]("user").child.name[String]: DoricColumn[String]
col[User]("user").child.age[Int]: DoricColumn[Int]
col[User]("user").child.surname[String] // should raise a non-existent column error at runtime
List of all the aggregate functions in Spark, all of them marked as implemented in doric.
def add_months(startDate: Column, numMonths: Column): Column
def add_months(startDate: Column, numMonths: Int): Column
def current_date(): Column
def current_timestamp(): Column
def date_add(start: Column, days: Column): Column
def date_add(start: Column, days: Int): Column
def date_format(dateExpr: Column, format: String): Column
def date_sub(start: Column, days: Column): Column
def date_sub(start: Column, days: Int): Column
def date_trunc(format: String, timestamp: Column): Column
def datediff(end: Column, start: Column): Column
def dayofmonth(e: Column): Column
def dayofweek(e: Column): Column
def dayofyear(e: Column): Column
def from_unixtime(ut: Column, f: String): Column
def from_unixtime(ut: Column): Column
def from_utc_timestamp(ts: Column, tz: Column): Column
def from_utc_timestamp(ts: Column, tz: String): Column
def hour(e: Column): Column
def last_day(e: Column): Column
def minute(e: Column): Column
def month(e: Column): Column
def months_between(end: Column, start: Column, roundOff: Boolean): Column
def months_between(end: Column, start: Column): Column
def next_day(date: Column, dayOfWeek: String): Column
def quarter(e: Column): Column
def second(e: Column): Column
def timestamp_seconds(e: Column): Column
def to_date(e: Column, fmt: String): Column
def to_date(e: Column): Column
def to_timestamp(s: Column, fmt: String): Column
def to_timestamp(s: Column): Column
def to_utc_timestamp(ts: Column, tz: Column): Column
def to_utc_timestamp(ts: Column, tz: String): Column
def trunc(date: Column, format: String): Column
def unix_timestamp(s: Column, p: String): Column
def unix_timestamp(s: Column): Column
def unix_timestamp(): Column
def weekofyear(e: Column): Column
def window(timeColumn: Column, windowDuration: String): Column
def window(timeColumn: Column, windowDuration: String, slideDuration: String): Column
def window(timeColumn: Column, windowDuration: String, slideDuration: String, startTime: String): Column
def year(e: Column): Column
Adding scaladoc links to higher-order Spark functions breaks the doc. Here are some examples:
Related to #109:
org.apache.spark.sql.functions.array
org.apache.spark.sql.functions.transform
org.apache.spark.sql.functions.aggregate
org.apache.spark.sql.functions.filter
org.apache.spark.sql.functions.array_join
Related to #138:
org.apache.spark.sql.functions.locate
org.apache.spark.sql.functions.split
org.apache.spark.sql.functions.to_utc_timestamp
org.apache.spark.sql.functions.from_utc_timestamp
A tag has been added for future fixing: @todo scaladoc link
def abs(e: Column): Column // #223
def acos(columnName: String): Column
def acos(e: Column): Column // #223
def acosh(columnName: String): Column
def acosh(e: Column): Column // #223
def asin(columnName: String): Column
def asin(e: Column): Column // #223
def asinh(columnName: String): Column
def asinh(e: Column): Column // #223
def atan(columnName: String): Column
def atan(e: Column): Column // #223
def atan2(yValue: Double, xName: String): Column
def atan2(yValue: Double, x: Column): Column
def atan2(yName: String, xValue: Double): Column
def atan2(y: Column, xValue: Double): Column
def atan2(yName: String, xName: String): Column
def atan2(yName: String, x: Column): Column
def atan2(y: Column, xName: String): Column
def atan2(y: Column, x: Column): Column // #223
def atanh(columnName: String): Column
def atanh(e: Column): Column // #223
def bin(columnName: String): Column
def bin(e: Column): Column // #223
def bround(e: Column, scale: Int): Column // #223
def bround(e: Column): Column // #223
def cbrt(columnName: String): Column
def cbrt(e: Column): Column // #223
def ceil(columnName: String): Column
def ceil(e: Column): Column // #223
def conv(num: Column, fromBase: Int, toBase: Int): Column // #223
def cos(columnName: String): Column
def cos(e: Column): Column // #223
def cosh(columnName: String): Column
def cosh(e: Column): Column // #223
def degrees(columnName: String): Column
def degrees(e: Column): Column // #223
def exp(columnName: String): Column
def exp(e: Column): Column // #223
def expm1(columnName: String): Column
def expm1(e: Column): Column // #223
def factorial(e: Column): Column // #223
def floor(columnName: String): Column
def floor(e: Column): Column // #223
def hex(column: Column): Column // #223
def hypot(l: Double, rightName: String): Column
def hypot(l: Double, r: Column): Column
def hypot(leftName: String, r: Double): Column
def hypot(l: Column, r: Double): Column
def hypot(leftName: String, rightName: String): Column
def hypot(leftName: String, r: Column): Column
def hypot(l: Column, rightName: String): Column
def hypot(l: Column, r: Column): Column // #223
def log(base: Double, columnName: String): Column
def log(base: Double, a: Column): Column // #223
def log(columnName: String): Column
def log(e: Column): Column // #223
def log10(columnName: String): Column
def log10(e: Column): Column // #223
def log1p(columnName: String): Column
def log1p(e: Column): Column // #223
def log2(columnName: String): Column
def log2(expr: Column): Column // #223
def pmod(dividend: Column, divisor: Column): Column // #223
def pow(l: Double, rightName: String): Column
def pow(l: Double, r: Column): Column
def pow(leftName: String, r: Double): Column
def pow(l: Column, r: Double): Column
def pow(leftName: String, rightName: String): Column
def pow(leftName: String, r: Column): Column
def pow(l: Column, rightName: String): Column
def pow(l: Column, r: Column): Column // #223
def radians(columnName: String): Column
def radians(e: Column): Column // #223
def rint(columnName: String): Column
def rint(e: Column): Column // #223
def round(e: Column, scale: Int): Column
def round(e: Column): Column // #223
def shiftLeft(e: Column, numBits: Int): Column // #223
def shiftRight(e: Column, numBits: Int): Column // #223
def shiftRightUnsigned(e: Column, numBits: Int): Column // #223
def signum(columnName: String): Column
def signum(e: Column): Column // #223
def sin(columnName: String): Column
def sin(e: Column): Column // #223
def sinh(columnName: String): Column
def sinh(e: Column): Column // #223
def sqrt(colName: String): Column
def sqrt(e: Column): Column // #223
def tan(columnName: String): Column
def tan(e: Column): Column // #223
def tanh(columnName: String): Column
def tanh(e: Column): Column // #223
def unhex(column: Column): Column // #223
def toDegrees(columnName: String): Column
def toDegrees(e: Column): Column
def toRadians(columnName: String): Column
def toRadians(e: Column): Column
def array(colName: String, colNames: String*): Column // function array
def array(cols: Column*): Column // function array
def bitwiseNOT(e: Column): Column // #262
def broadcast[T](df: Dataset[T]): Dataset[T] // not needed because it is not Column-dependent
def coalesce(e: Column*): Column
def col(colName: String): Column // use col[T](colName: String) or the specializations colString, colInt, etc.
def column(colName: String): Column // use the same as col
def expr(expr: String): Column
def greatest(columnName: String, columnNames: String*): Column
def greatest(exprs: Column*): Column // #113
def input_file_name(): Column // #113
def isnan(e: Column): Column
def isnull(e: Column): Column // postfix isnull
def least(columnName: String, columnNames: String*): Column
def least(exprs: Column*): Column // #113
def lit(literal: Any): Column // use lit[T](literal: T)
def map(cols: Column*): Column
def map_from_arrays(keys: Column, values: Column): Column
def monotonically_increasing_id(): Column // #113
def nanvl(col1: Column, col2: Column): Column // #262
def negate(e: Column): Column // #262
def not(e: Column): Column // #113
def rand(): Column // #113
def rand(seed: Long): Column // #113
def randn(): Column // #113
def randn(seed: Long): Column // #113
def spark_partition_id(): Column // #113
def struct(colName: String, colNames: String*): Column
def struct(cols: Column*): Column // #95
def when(condition: Column, value: Any): Column // method builder when
def monotonicallyIncreasingId(): Column
Add a column that validates that the data type is a struct.
0.0.2
Like the concat Spark function, using ds"Column value: $myColumn" will output null if myColumn contains a null.
It should have output:
"Column value: null" or
"Column value: "
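A sketch of the null handling the interpolator could apply under the hood, rendering a null value as the string "null" via a per-part coalesce (plain Spark code, not doric's implementation):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{coalesce, concat, lit}

object nullSafe {
  // Each interpolated part is cast to string and null-defaulted before the
  // final concat, so a single null column no longer nulls the whole result.
  def nullSafeConcat(parts: Column*): Column =
    concat(parts.map(p => coalesce(p.cast("string"), lit("null"))): _*)
}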
Create a DoricJoinColumn:
Kleisli[DoricValidated, (DataFrame, DataFrame), Column]
???
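A sketch of that type with cats, assuming DoricValidated is doric's accumulating Validated alias (the error type below is a stand-in):

import cats.data.{Kleisli, ValidatedNec}
import org.apache.spark.sql.{Column, DataFrame}

object joins {
  // Stand-in error type; doric's real DoricValidated may differ.
  sealed trait DoricError
  type DoricValidated[A] = ValidatedNec[DoricError, A]

  // A join column must resolve against BOTH sides of the join, so the
  // Kleisli input is the pair of dataframes being joined.
  type DoricJoinColumn = Kleisli[DoricValidated, (DataFrame, DataFrame), Column]
}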
def assert_true(c: Column, e: Column): Column
def assert_true(c: Column): Column
def crc32(e: Column): Column
def hash(cols: Column*): Column
def md5(e: Column): Column
def raise_error(c: Column): Column
def sha1(e: Column): Column
def sha2(e: Column, numBits: Int): Column
def xxhash64(cols: Column*): Column
Create syntax to operate with arrays