itkpi / trembita

Model complex data transformation pipelines easily

License: Apache License 2.0
Need to add JMH benchmarks for:
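As a starting point, a minimal JMH benchmark skeleton might look like this (assuming the sbt-jmh plugin; the measured operation below is an illustrative stand-in, not an actual trembita benchmark):

```scala
import org.openjdk.jmh.annotations._

@State(Scope.Benchmark)
class PipelineBenchmarks {
  // Illustrative stand-in workload; real benchmarks would exercise
  // DataPipelineT operations on each environment instead.
  val data: Vector[Int] = Vector.range(0, 100000)

  @Benchmark
  def vectorMap(): Vector[Int] = data.map(_ * 2)
}
```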
Trembita is currently able to read data from Kafka using Akka Streams or Spark Streaming.
Need to research the ability to integrate Kafka Streams directly, so that users can use trembita without Akka or Spark for such cases.
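For reference, a minimal standalone Kafka Streams topology (plain kafka-streams-scala, no trembita specifics assumed — topic names and the application id are illustrative) looks like this; a direct integration would wrap something similar as a pipeline source:

```scala
import java.util.Properties
import org.apache.kafka.streams.scala._
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

object KafkaStreamsExample extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "trembita-example")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()
  builder
    .stream[String, String]("input-topic")
    .mapValues(_.toUpperCase)   // stand-in transformation
    .to("output-topic")

  new KafkaStreams(builder.build(), props).start()
}
```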
Allow trembita integration with Grafana and Prometheus for monitoring pipeline performance and visualisation of the pipeline itself.
Integrate scalaz-zio as a separate module.
Research the performance pros/cons of using the scalaz-zio bifunctor IO.
`java.util.stream` is well known and frequently used in Java projects.
It has good performance, so I think we should add the ability to use it as a transport layer along with `Vector` and `ParVector` for sequential and parallel pipelines.
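A speculative sketch of how Java streams could back a capability, following the `E#Repr` encoding used by the `Supports` alias described further down (the `CanReduce` shape here is an assumption, not the actual kernel definition):

```scala
import java.util.stream.{Stream => JStream}

object JavaStreamsSupport {
  // Hypothetical shape of the CanReduce capability
  // (the real kernel definition may differ):
  trait CanReduce[R[_]] {
    def reduce[A](ra: R[A])(f: (A, A) => A): A
  }

  // Java streams can implement it directly via Stream#reduce:
  implicit val javaStreamsCanReduce: CanReduce[JStream] =
    new CanReduce[JStream] {
      def reduce[A](ra: JStream[A])(f: (A, A) => A): A =
        ra.reduce((a, b) => f(a, b)).get // throws on an empty stream
    }
}
```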
It's almost implemented in https://github.com/vitaliihonta/trembita/tree/features/slick.
Implement after #18 is done
Allow the programmer to write the same code for different possible environments.
For instance:
```scala
val pipeline: DataPipelineT[F, A, Sequential] = ???

val pipeline2: DataPipelineT[F, A, Akka[NotUsed] Or Spark] =
  pipeline
    .to[Akka[NotUsed]]
    .orTo[Spark](condition = ???)
    . // some possible heavy operations
```
This should allow more flexible applications which can be run in different environments depending on some condition (for instance, the amount of your data).
Additionally, such implicit derivation should work:

```scala
def func[E <: Environment](implicit ev: E Supports CanGroupBy) = ???

func[Akka[NotUsed] Or Spark]
```
And this shouldn't:

```scala
def func2[E <: Environment](implicit ev: E Supports CanSort) = ???

func2[Akka[NotUsed] Or Spark] // doesn't compile
```
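A speculative sketch of the derivation rule this implies, reusing the `Supports` alias described further down: an `Or` environment supports an operation only when both branches do (presumably both Akka and Spark support `CanGroupBy`, while an unbounded Akka stream cannot support `CanSort`, so `func2` fails):

```scala
// Hypothetical derivation: (L Or R) Supports Op holds iff both branches hold.
implicit def orSupports[L <: Environment, R <: Environment, Op[_[_]]](
  implicit left: L Supports Op,
  right: R Supports Op
): (L Or R) Supports Op = ???
```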
Need to implement a pipeline transformation optimizer that works at compile time.
For instance, we can start with `trembita.ql`. Spark's query analyzer is a good option to research.
Add the following methods (like in Spark SQL):

```scala
case class Foo(i: Int, x: Long, s: String)
case class Bar(ij: Long, ss: String)

val pipeline: DataPipelineT[F, Bar, E] = ???
pipeline.withColumn[Foo](_ / 2)
```

```scala
case class Foo(i: Int, x: Long, s: String)
case class Bar(ij: Long, ss: String)

val pipeline: DataPipelineT[F, Foo, E] = ???
pipeline.select[Bar](a => a.i + a.x, _.s * 2)
```
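For comparison, the `select` example above is roughly what one would write today with a plain `map`, assuming the intended semantics of mapping `Foo` fields onto `Bar` fields positionally:

```scala
// Hand-written equivalent of pipeline.select[Bar](...):
pipeline.map(foo => Bar(ij = foo.i + foo.x, ss = foo.s * 2))
```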
Think about how to integrate Akka HTTP routes into trembita.
Implement after #18 is done.
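A speculative sketch of what serving a pipeline's result over a route could look like, using plain Akka HTTP (the trembita side of the integration is exactly what's to be designed; `runPipeline` is a hypothetical stand-in):

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._
import akka.stream.ActorMaterializer

object PipelineRoutes extends App {
  implicit val system: ActorSystem        = ActorSystem("trembita-http")
  implicit val mat: ActorMaterializer     = ActorMaterializer()

  // Hypothetical: materialize a pipeline into a result to serve.
  def runPipeline(): String = ??? // e.g. evaluated pipeline output rendered as JSON

  val route =
    path("pipeline" / "result") {
      get {
        complete(runPipeline())
      }
    }

  Http().bindAndHandle(route, "localhost", 8080)
}
```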
Currently, querying a pipeline looks like this:

```scala
val pipeline: DataPipelineT[F, A, E] = ???

pipeline.query(_
  .groupBy(...)
  .aggregate(...)
)
```
What about making it less verbose?
```scala
val pipeline: DataPipelineT[F, A, E] = ???

val result: DataPipelineT[F, Foo, E] = pipeline
  .groupBy(...)   // something like DataPipelineTGroupByClause
  .aggregate(...) // DataPipelineTAggregateClause
  .having(...)    // DataPipelineTHavingClause
```
Where:

- `DataPipelineTGroupByClause` - a special class providing the `aggregate` operation
- `DataPipelineTAggregateClause` - a special class providing the `having` & `order` operations
- `DataPipelineTHavingClause` - a special class providing more `having` & `order` operations

The `trembita.ql` package should contain implicit conversions from the `...Clause` classes into `DataPipelineT`.
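A minimal sketch of such a conversion, assuming hypothetical clause classes shaped as described above:

```scala
// Hypothetical: a clause that already knows how to produce the final pipeline.
trait DataPipelineTHavingClause[F[_], A, E <: Environment] {
  def toPipeline: DataPipelineT[F, A, E]
}

// Implicit conversion so a clause can be used wherever a pipeline is expected,
// letting the chained groupBy/aggregate/having expression type-check as a pipeline.
implicit def havingClauseToPipeline[F[_], A, E <: Environment](
  clause: DataPipelineTHavingClause[F, A, E]
): DataPipelineT[F, A, E] = clause.toPipeline
```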
Sometimes in distributed systems there is a need to deploy components as a result of some user interaction.
For instance, deploy a Spark cluster for distributed computations when a user requests it.
Need to research something I'll call "deploy pipelines".
The idea is simple: write an `OutputT` which will (based on input pipeline elements) deploy something (for instance Docker containers, k8s pods, Spark clusters, etc.).
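A very rough sketch of the idea, assuming a simplified output interface (the real `OutputT` abstraction in trembita is richer; `DeploymentRequest`, `SimpleOutput`, and `deployContainer` are made up for illustration):

```scala
// Hypothetical pipeline element describing what to deploy.
final case class DeploymentRequest(image: String, replicas: Int)

// Simplified stand-in for an output: consume elements, produce side effects.
trait SimpleOutput[A] {
  def apply(elements: Vector[A]): Unit
}

// An output that "deploys" something per element; deployContainer is a stub
// that would call Docker / Kubernetes / a Spark cluster manager.
def deployOutput(deployContainer: DeploymentRequest => Unit): SimpleOutput[DeploymentRequest] =
  elements => elements.foreach(deployContainer)
```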
The best idea currently is to store RDD partitions directly in distributed Infinispan.
Other ideas are welcome.
Need to research how to integrate machine learning capabilities into trembita.
The easiest way is to research Spark ML and try to integrate it.
Then we need to research TensorFlow integration.
Having these 2 models, create a separate module `trembita-ml` as a kernel with basic abstractions.
Then integrate `trembita-ml` with Spark ML and TensorFlow Scala.
Currently `trembita.ql` for Spark is implemented on top of `RDD`.
Need to rewrite it so that queries are translated directly into `spark.sql`.
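For illustration, this is the kind of translation implied, using Spark's actual `DataFrame` API (the trembita-side query in the comment is schematic):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("example").master("local[*]").getOrCreate()
import spark.implicits._

// A query like `pipeline.query(_.groupBy(_.country).aggregate(avg(_.salary)))`
// over RDDs would instead become a Catalyst-optimizable DataFrame query:
val df = Seq(("UA", 100.0), ("UA", 200.0), ("PL", 150.0)).toDF("country", "salary")
df.groupBy($"country").agg(avg($"salary").as("avg_salary")).show()
```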
I've recently added the following type alias into `kernel`:

```scala
type Supports[E <: Environment, Op[_[_]]] = Op[E#Repr]
```
Currently it allows you to abstract over the environment you are using:
```scala
def foo[F[_], A, E <: Environment](foo: Foo)(
  implicit canCombineByKey: E Supports CanCombineByKey,
  canReduce: E Supports CanReduce,
  hasSize: E Supports HasSize
  ...
) = ???
```
Need to provide an easier way to do so. For instance, something like:
```scala
type RequiredAPI[E <: Environment] =
  Supports[E,
    CanCombineByKey &
    CanReduce &
    HasSize & ...
  ]

def foo[F[_], A, E <: Environment: RequiredAPI](foo: Foo) = ???
```
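One possible encoding of that `&` combinator in Scala 2, assuming the kind-projector plugin for the type lambda (names and the derivation rule are illustrative, not the actual kernel API):

```scala
// Combine two capabilities into one, so that
// Supports[E, CanReduce & HasSize] = CanReduce[E#Repr] with HasSize[E#Repr].
type &[Op1[_[_]], Op2[_[_]]] = λ[R[_] => Op1[R] with Op2[R]]

// To make the context bound usable, the combined instance would have to be
// derived from the individual ones (sketch only):
implicit def combineOps[R[_], Op1[_[_]], Op2[_[_]]](
  implicit op1: Op1[R], op2: Op2[R]
): Op1[R] with Op2[R] = ???
```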