itkpi / trembita

Model complex data transformation pipelines easily

License: Apache License 2.0
Need to add JMH benchmarks for:
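As a starting point, a minimal JMH benchmark skeleton might look like this (assuming the sbt-jmh plugin; the measured operation below is an illustrative stand-in, not an actual trembita benchmark):

```scala
import org.openjdk.jmh.annotations._

@State(Scope.Benchmark)
class PipelineBenchmarks {
  // Illustrative stand-in workload; real benchmarks would exercise
  // DataPipelineT operations on each environment instead.
  val data: Vector[Int] = Vector.range(0, 100000)

  @Benchmark
  def vectorMap(): Vector[Int] = data.map(_ * 2)
}
```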
Trembita is currently able to read data from Kafka using Akka Streams or Spark Streaming.
Need to research the ability to integrate Kafka Streams directly, so that users can use trembita without Akka or Spark for such cases.
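For reference, a minimal standalone Kafka Streams topology (plain kafka-streams-scala, no trembita specifics assumed — topic names and the application id are illustrative) looks like this; a direct integration would wrap something similar as a pipeline source:

```scala
import java.util.Properties
import org.apache.kafka.streams.scala._
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

object KafkaStreamsExample extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "trembita-example")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()
  builder
    .stream[String, String]("input-topic")
    .mapValues(_.toUpperCase)   // stand-in transformation
    .to("output-topic")

  new KafkaStreams(builder.build(), props).start()
}
```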
Allow trembita integration with Grafana and Prometheus for monitoring pipeline performance and visualisation of the pipeline itself.
Integrate scalaz-zio as a separate module.
Research the performance pros/cons of using the scalaz-zio bifunctor IO.
`java.util.stream` is well known and frequently used in Java projects.
It has good performance, so I think we should add the ability to use it as a transport layer along with `Vector` and `ParVector` for sequential and parallel pipelines.
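A speculative sketch of how Java streams could back a capability, following the `E#Repr` encoding used by the `Supports` alias described further down (the `CanReduce` shape here is an assumption, not the actual kernel definition):

```scala
import java.util.stream.{Stream => JStream}

object JavaStreamsSupport {
  // Hypothetical shape of the CanReduce capability
  // (the real kernel definition may differ):
  trait CanReduce[R[_]] {
    def reduce[A](ra: R[A])(f: (A, A) => A): A
  }

  // Java streams can implement it directly via Stream#reduce:
  implicit val javaStreamsCanReduce: CanReduce[JStream] =
    new CanReduce[JStream] {
      def reduce[A](ra: JStream[A])(f: (A, A) => A): A =
        ra.reduce((a, b) => f(a, b)).get // throws on an empty stream
    }
}
```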
It's almost implemented in https://github.com/vitaliihonta/trembita/tree/features/slick.
Implement after #18 is done
Allow the programmer to write the same code for different possible environments.
For instance:
```scala
val pipeline: DataPipelineT[F, A, Sequential] = ???

val pipeline2: DataPipelineT[F, A, Akka[NotUsed] Or Spark] =
  pipeline
    .to[Akka[NotUsed]]
    .orTo[Spark](condition = ???)
    . // some possible heavy operations
```
This should allow more flexible applications which can be run in different environments depending on some condition (for instance, the amount of your data).
Additionally, such implicit derivation should work:

```scala
def func[E <: Environment](implicit ev: E Supports CanGroupBy) = ???

func[Akka[NotUsed] Or Spark]
```
And this shouldn't:

```scala
def func2[E <: Environment](implicit ev: E Supports CanSort) = ???

func2[Akka[NotUsed] Or Spark] // doesn't compile
```
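A speculative sketch of the derivation rule this implies, reusing the `Supports` alias described further down: an `Or` environment supports an operation only when both branches do (presumably both Akka and Spark support `CanGroupBy`, while an unbounded Akka stream cannot support `CanSort`, so `func2` fails):

```scala
// Hypothetical derivation: (L Or R) Supports Op holds iff both branches hold.
implicit def orSupports[L <: Environment, R <: Environment, Op[_[_]]](
  implicit left: L Supports Op,
  right: R Supports Op
): (L Or R) Supports Op = ???
```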
Need to implement a pipeline transformation optimizer that works at compile time.
For instance, we can start with `trembita.ql`. Spark's query analyzer is a good option to research.
Add the following methods (like in Spark SQL):

```scala
case class Foo(i: Int, x: Long, s: String)
case class Bar(ij: Long, ss: String)

val pipeline: DataPipelineT[F, Bar, E] = ???
pipeline.withColumn[Foo](_ / 2)
```

```scala
case class Foo(i: Int, x: Long, s: String)
case class Bar(ij: Long, ss: String)

val pipeline: DataPipelineT[F, Foo, E] = ???
pipeline.select[Bar](a => a.i + a.x, _.s * 2)
```
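For comparison, the `select` example above is roughly what one would write today with a plain `map`, assuming the intended semantics of mapping `Foo` fields onto `Bar` fields positionally:

```scala
// Hand-written equivalent of pipeline.select[Bar](...):
pipeline.map(foo => Bar(ij = foo.i + foo.x, ss = foo.s * 2))
```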
Think about how to integrate Akka HTTP routes into trembita.
Implement after #18 is done.
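A speculative sketch of what serving a pipeline's result over a route could look like, using plain Akka HTTP (the trembita side of the integration is exactly what's to be designed; `runPipeline` is a hypothetical stand-in):

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._
import akka.stream.ActorMaterializer

object PipelineRoutes extends App {
  implicit val system: ActorSystem        = ActorSystem("trembita-http")
  implicit val mat: ActorMaterializer     = ActorMaterializer()

  // Hypothetical: materialize a pipeline into a result to serve.
  def runPipeline(): String = ??? // e.g. evaluated pipeline output rendered as JSON

  val route =
    path("pipeline" / "result") {
      get {
        complete(runPipeline())
      }
    }

  Http().bindAndHandle(route, "localhost", 8080)
}
```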
Currently, querying a pipeline looks like this:

```scala
val pipeline: DataPipelineT[F, A, E] = ???

pipeline.query(_
  .groupBy(...)
  .aggregate(...)
)
```
What about making it less verbose?
```scala
val pipeline: DataPipelineT[F, A, E] = ???

val result: DataPipelineT[F, Foo, E] = pipeline
  .groupBy(...)   // something like DataPipelineTGroupByClause
  .aggregate(...) // DataPipelineTAggregateClause
  .having(...)    // DataPipelineTHavingClause
```
Where:

- `DataPipelineTGroupByClause` - a special class providing the `aggregate` operation
- `DataPipelineTAggregateClause` - a special class providing the `having` & `order` operations
- `DataPipelineTHavingClause` - a special class providing more `having` & `order` operations

The `trembita.ql` package should contain implicit conversions from the `...Clause` classes into `DataPipelineT`.
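A minimal sketch of such a conversion, assuming hypothetical clause classes shaped as described above:

```scala
// Hypothetical: a clause that already knows how to produce the final pipeline.
trait DataPipelineTHavingClause[F[_], A, E <: Environment] {
  def toPipeline: DataPipelineT[F, A, E]
}

// Implicit conversion so a clause can be used wherever a pipeline is expected,
// letting the chained groupBy/aggregate/having expression type-check as a pipeline.
implicit def havingClauseToPipeline[F[_], A, E <: Environment](
  clause: DataPipelineTHavingClause[F, A, E]
): DataPipelineT[F, A, E] = clause.toPipeline
```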
Sometimes in distributed systems there is a need to deploy components as a result of some user interaction.
For instance, deploy a Spark cluster for distributed computations when a user requests it.
Need to research something I'll call "deploy pipelines".
The idea is simple: write an `OutputT` which will (based on input pipeline elements) deploy something (for instance Docker containers, k8s pods, Spark clusters, etc.).
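A very rough sketch of the idea, assuming a simplified output interface (the real `OutputT` abstraction in trembita is richer; `DeploymentRequest`, `SimpleOutput`, and `deployContainer` are made up for illustration):

```scala
// Hypothetical pipeline element describing what to deploy.
final case class DeploymentRequest(image: String, replicas: Int)

// Simplified stand-in for an output: consume elements, produce side effects.
trait SimpleOutput[A] {
  def apply(elements: Vector[A]): Unit
}

// An output that "deploys" something per element; deployContainer is a stub
// that would call Docker / Kubernetes / a Spark cluster manager.
def deployOutput(deployContainer: DeploymentRequest => Unit): SimpleOutput[DeploymentRequest] =
  elements => elements.foreach(deployContainer)
```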
The best idea currently is to store RDD partitions directly in distributed Infinispan.
Other ideas are welcome.
Need to research how to integrate machine learning capabilities into trembita.
The easiest way is to research Spark ML and try to integrate it.
Then we need to research TensorFlow integration.
Having these 2 models, create a separate module `trembita-ml` as a kernel with basic abstractions.
Then integrate `trembita-ml` with Spark ML and TensorFlow Scala.
Currently `trembita.ql` for Spark is implemented on top of `RDD`.
Need to rewrite it so that queries are translated directly into `spark.sql`.
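For illustration, this is the kind of translation implied, using Spark's actual `DataFrame` API (the trembita-side query in the comment is schematic):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("example").master("local[*]").getOrCreate()
import spark.implicits._

// A query like `pipeline.query(_.groupBy(_.country).aggregate(avg(_.salary)))`
// over RDDs would instead become a Catalyst-optimizable DataFrame query:
val df = Seq(("UA", 100.0), ("UA", 200.0), ("PL", 150.0)).toDF("country", "salary")
df.groupBy($"country").agg(avg($"salary").as("avg_salary")).show()
```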
I've recently added the following type alias into `kernel`:

```scala
type Supports[E <: Environment, Op[_[_]]] = Op[E#Repr]
```
Currently it allows you to abstract over the environment you are using:
```scala
def foo[F[_], A, E <: Environment](foo: Foo)(
  implicit canCombineByKey: E Supports CanCombineByKey,
  canReduce: E Supports CanReduce,
  hasSize: E Supports HasSize
  ...
) = ???
```
Need to provide an easier way to do so. For instance, something like:
```scala
type RequiredAPI[E <: Environment] =
  Supports[E,
    CanCombineByKey &
    CanReduce &
    HasSize & ...
  ]

def foo[F[_], A, E <: Environment: RequiredAPI](foo: Foo) = ???
```
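One possible encoding of that `&` combinator in Scala 2, assuming the kind-projector plugin for the type lambda (names and the derivation rule are illustrative, not the actual kernel API):

```scala
// Combine two capabilities into one, so that
// Supports[E, CanReduce & HasSize] = CanReduce[E#Repr] with HasSize[E#Repr].
type &[Op1[_[_]], Op2[_[_]]] = λ[R[_] => Op1[R] with Op2[R]]

// To make the context bound usable, the combined instance would have to be
// derived from the individual ones (sketch only):
implicit def combineOps[R[_], Op1[_[_]], Op2[_[_]]](
  implicit op1: Op1[R], op2: Op2[R]
): Op1[R] with Op2[R] = ???
```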