research-project's Issues
Use column pruning
As a customer, I want a reader which reads only the specified parameters from the file, so that reading performance improves.
When we call:
val ds = spark.read.format("customBinaryFile").select("param_1", "param_2")
Spark should optimize the query and read only the data corresponding to the two parameters, instead of reading everything and filtering afterward.
Acceptance criteria:
- files are read and processed into a Spark Dataset[Record]
- only parameters specified in the select statement should be read
- attach the optimized query plan for the aforementioned call
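The effect the pruned plan should have can be illustrated in plain Scala (a hedged sketch; the byte-range index, the parameter names, and the `readRange` callback below are all invented for illustration, not the project's real API):

```scala
// Hypothetical sketch: given an index of parameter -> byte range,
// read only the ranges of the requested parameters instead of the
// whole file. All names here are illustrative.
case class ByteRange(offset: Long, length: Long)

def prunedRead(
    index: Map[String, ByteRange],        // parameter name -> location in file
    requested: Seq[String],               // columns surviving Catalyst's pruning
    readRange: ByteRange => Array[Byte]   // low-level reader for one range
): Map[String, Array[Byte]] =
  requested.flatMap { name =>
    index.get(name).map(r => name -> readRange(r)) // touch only requested ranges
  }.toMap

// Simulated file: each parameter owns a 10-byte slice of this array.
val file: Array[Byte] = Array.tabulate(30)(_.toByte)
val index = Map(
  "param_1" -> ByteRange(0, 10),
  "param_2" -> ByteRange(10, 10),
  "param_3" -> ByteRange(20, 10)
)
var bytesRead = 0L
val result = prunedRead(index, Seq("param_1", "param_2"), { r =>
  bytesRead += r.length
  file.slice(r.offset.toInt, (r.offset + r.length).toInt)
})
// Only 20 of the 30 bytes are read: param_3 is never touched.
```

In the real reader, Catalyst supplies the `requested` column list once the source implements the pruning interfaces.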
This is a test issue
As a GitHub noob, I want to create my very first issue to test my superpower.
Acceptance criteria:
- Close the issue
Reader with sc.binaryFiles
As a customer, I want to have a reader which uses sc.binaryFiles for reading multiple binary files.
Acceptance criteria:
- files are read and processed into a Spark Dataset[Record]
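In Spark this is `sc.binaryFiles(path)`, which yields an `RDD[(String, PortableDataStream)]` of (file path, file contents) pairs. A hedged dependency-free stand-in with the same per-file contract, useful for testing the processing logic locally (the directory layout is made up):

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Local stand-in mirroring sc.binaryFiles: (file name, full file contents).
// In the real reader these tuples would come from the RDD instead.
def readBinaryFiles(dir: Path): Seq[(String, Array[Byte])] =
  Files.list(dir).iterator().asScala
    .filter(Files.isRegularFile(_))
    .map(p => p.getFileName.toString -> Files.readAllBytes(p))
    .toSeq
    .sortBy(_._1)

// Demo on a temporary directory with two tiny fake "binary" files.
val dir = Files.createTempDirectory("binfiles")
Files.write(dir.resolve("a.bin"), Array[Byte](1, 2, 3))
Files.write(dir.resolve("b.bin"), Array[Byte](4, 5))
val files = readBinaryFiles(dir)
```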
Reader with DataSourceV1 API
As a customer, I want to have a reader which uses DataSourceV1 for reading multiple binary files.
Acceptance criteria:
- files are read and processed into a Spark Dataset[Record]
Property based testing for processing
As a customer, I want to be sure that the previously defined encoders and decoders work as expected, so I can sleep better.
Acceptance criteria:
- test using ScalaCheck and ScalaTest: https://www.scalatest.org/, https://www.scalacheck.org/
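With ScalaCheck the core property is a roundtrip: `decode(encode(r)) == r` for arbitrary records. A dependency-free sketch of the same property, using `scala.util.Random` in place of ScalaCheck's generators and a toy codec (both are stand-ins for illustration):

```scala
import java.nio.ByteBuffer
import scala.util.Random

// Toy codec: a vector of longs encoded as [count: Int][values: Long*].
def encode(xs: Vector[Long]): Array[Byte] = {
  val buf = ByteBuffer.allocate(4 + 8 * xs.length)
  buf.putInt(xs.length)
  xs.foreach(x => buf.putLong(x))
  buf.array()
}
def decode(bytes: Array[Byte]): Vector[Long] = {
  val buf = ByteBuffer.wrap(bytes)
  Vector.fill(buf.getInt())(buf.getLong())
}

// Roundtrip property over 100 random inputs, in the spirit of
// ScalaCheck's `forAll { xs: Vector[Long] => decode(encode(xs)) == xs }`.
val rnd = new Random(42)
val holds = (1 to 100).forall { _ =>
  val xs = Vector.fill(rnd.nextInt(50))(rnd.nextLong())
  decode(encode(xs)) == xs
}
```

ScalaCheck adds what this sketch lacks: automatic shrinking of failing inputs and generator combinators for nested case classes.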
Use codec for processing large files
As a customer, I want to have a solution that is able to encode/decode large binary files.
A good place to start the research is: https://github.com/scodec/scodec-stream
Acceptance criteria:
- can encode arbitrarily large files
- can decode arbitrarily large files
- can generate a large file, at least 1GB
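scodec-stream combines scodec codecs with streaming so the whole file never has to sit in memory. The idea, sketched without the library (the fixed 8-byte record format is an illustrative assumption):

```scala
import java.io.{BufferedInputStream, FileInputStream}
import java.nio.file.{Files, Path}

// Stream-decode a file of fixed-size records chunk by chunk, so memory
// use stays bounded regardless of file size. The 8-byte record size is
// an assumption for the sketch, not the project's real format.
def decodeStream(path: Path, recordSize: Int = 8): Iterator[Array[Byte]] = {
  val in = new BufferedInputStream(new FileInputStream(path.toFile))
  Iterator.continually {
    val buf = in.readNBytes(recordSize)
    if (buf.length == recordSize) Some(buf) else { in.close(); None }
  }.takeWhile(_.isDefined).map(_.get)
}

// Generate a file and count its records without loading it whole.
val path = Files.createTempFile("large", ".bin")
Files.write(path, Array.fill[Byte](8 * 1000)(1)) // 1000 records
val count = decodeStream(path).size
```

With scodec-stream the `Iterator` would be an fs2 `Stream` driven by a `StreamDecoder`, which also handles records that straddle chunk boundaries.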
CI/CD for the project
As a developer, I want to have a CI/CD pipeline (workflow) set up for the project, to make my life easier.
Acceptance criteria:
- Automatic tests
- Static code analyzer: ScalaStyle and Scapegoat: http://www.scalastyle.org/sbt.html, https://github.com/sksamuel/sbt-scapegoat
- Publish (Optional): https://stackoverflow.com/questions/33091153/can-sbt-publish-to-jfrog-artifactory
- Release (Optional): https://github.com/sbt/sbt-release
Define encoder and decoder for the binary file
As a customer, I want to have an encoder and decoder for my binary file, so as to have a PoC with Scodec.
Acceptance criteria:
- Encoder defined
- Decoder defined
Note: For now, let's assume a well-formed binary file without quality issues.
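With scodec this would be a `Codec[...]` built from combinators; as a hedged plain-JDK sketch of the same encode/decode pair (the record layout below is invented for illustration, not the real file format):

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets.UTF_8

// Invented layout: [nameLen: Int][name: UTF-8][count: Int][values: Float*]
case class Signal(name: String, values: Vector[Float])

def encode(s: Signal): Array[Byte] = {
  val nameBytes = s.name.getBytes(UTF_8)
  val buf = ByteBuffer.allocate(4 + nameBytes.length + 4 + 4 * s.values.length)
  buf.putInt(nameBytes.length).put(nameBytes).putInt(s.values.length)
  s.values.foreach(v => buf.putFloat(v))
  buf.array()
}

def decode(bytes: Array[Byte]): Signal = {
  val buf = ByteBuffer.wrap(bytes)
  val name = {
    val b = new Array[Byte](buf.getInt()); buf.get(b); new String(b, UTF_8)
  }
  Signal(name, Vector.fill(buf.getInt())(buf.getFloat()))
}

val s = Signal("speed", Vector(1.5f, 2.5f))
val roundtripped = decode(encode(s))
```

The scodec version of this layout would compose `int32`, `bytes`, and `vectorOfN(int32, float)` into a single declarative codec.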
Reader with DataSourceV2 API
As a customer, I want to have a reader which uses DataSourceV2 for reading multiple binary files.
Acceptance criteria:
- files are read and processed into a Spark Dataset[Record]
Setup project with SBT
As a developer, I want to set up my Scala project with SBT, so I can work further on it.
Acceptance criteria:
- Use scala 2.12.x
- Add required dependencies: ScalaTest, Typesafe Config, etc.
- Define packages
- Define multibuild project with modules:
- common: contains plain scala binary file processor
- spark2: contains spark 2.4.x binary file processor
- spark3: contains spark 3.0.x binary file processor
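A hedged build.sbt sketch of the module layout (versions and settings are illustrative, not final):

```scala
// build.sbt sketch -- versions and settings are illustrative, not final.
ThisBuild / scalaVersion := "2.12.15"

lazy val common = (project in file("common"))
  .settings(
    libraryDependencies ++= Seq(
      "com.typesafe"  %  "config"    % "1.4.2",
      "org.scalatest" %% "scalatest" % "3.2.15" % Test
    )
  )

lazy val spark2 = (project in file("spark2"))
  .dependsOn(common)
  .settings(libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.8" % Provided)

lazy val spark3 = (project in file("spark3"))
  .dependsOn(common)
  .settings(libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.3" % Provided)

lazy val root = (project in file("."))
  .aggregate(common, spark2, spark3)
```

Marking Spark as `Provided` keeps it off the published classpath, since the cluster supplies it at runtime.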
Scala writer for binary files
As a developer, I want to have binary files generated so they can be used for reading.
Acceptance criteria:
- implement writer
- generate 4-5 binary files
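A minimal sketch of such a generator (the file layout and sizes are assumptions for the sketch; real files would follow the documented format):

```scala
import java.nio.{ByteBuffer, ByteOrder}
import java.nio.file.{Files, Path}

// Write one toy binary file: [count: Int][rawValue: Float]*. The layout
// is an illustrative assumption, not the project's real format.
def writeBinaryFile(path: Path, values: Seq[Float]): Path = {
  val buf = ByteBuffer.allocate(4 + 4 * values.length).order(ByteOrder.BIG_ENDIAN)
  buf.putInt(values.length)
  values.foreach(v => buf.putFloat(v))
  Files.write(path, buf.array())
}

// Generate a handful of files, as the acceptance criteria ask.
val dir = Files.createTempDirectory("generated")
val paths = (1 to 5).map { i =>
  writeBinaryFile(dir.resolve(s"file_$i.bin"), Seq.tabulate(10)(_.toFloat * i))
}
```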
Documentation about the binary file
As a customer, I want to see the description of the binary file and the approach used for processing.
Acceptance criteria:
- document the structure of the binary file
- document the approach used for processing it
Make binary file splittable
As a customer, I want to be sure that the spark readers work with large files, so I can have a reliable reader.
When a file is read with sc.binaryFiles, the whole file is loaded into a single partition. With DataSourceV1 and V2 we can avoid this limitation, and thereby avoid OoM issues for huge files.
Acceptance criteria:
- files are read and processed into a Spark Dataset[Record]
TBD: How to split the file?
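One possible answer to the TBD, assuming fixed-size records (a hedged sketch; a variable-size format would instead need an index or sync markers): align every split boundary to a record boundary so no partition cuts a record in half.

```scala
// Compute record-aligned (offset, length) splits for a file, so each
// Spark partition reads whole records only. Fixed record size is an
// assumption; variable-size records would need an index or sync markers.
def alignedSplits(fileSize: Long, recordSize: Long, targetSplit: Long): Seq[(Long, Long)] = {
  val recordsPerSplit = math.max(1L, targetSplit / recordSize)
  val step = recordsPerSplit * recordSize  // split size, rounded to records
  (0L until fileSize by step).map { offset =>
    offset -> math.min(step, fileSize - offset)
  }
}

// 1000 records of 8 bytes, ~3 KB per split: every boundary lands on a
// multiple of 8, and the last split carries the remainder.
val splits = alignedSplits(fileSize = 8000L, recordSize = 8L, targetSplit = 3000L)
```

In the DataSource implementation, each (offset, length) pair would become one partition's read range.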
Pushdown filtering
As a customer, I want to be able to push down time-based filtering to the file, so reading performance will improve.
When one specifies a time range in the filter statement, the filtering should be pushed down to file level. Ex:
val ds = spark.read.format("customBinaryFile").selectExpr("filter(timeVector, time -> time > 100L ) AS timeVector")
should read only data that has time greater than 100L.
Acceptance criteria:
- files are read and processed into a Spark Dataset[Record]
- only those records whose time falls within the specified time range are returned
- attach the optimized query plan for the aforementioned call
Note:
- https://docs.databricks.com/delta/data-transformation/higher-order-lambda-functions.html
- https://www.waitingforcode.com/apache-spark-sql/apache-spark-2.4.0-features-array-higher-order-functions/read
- in Spark 3.0.x the syntax for higher-order functions becomes simpler
- I have concerns about whether the pushdown will work, but let's see
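Whether Catalyst actually pushes the lambda inside `filter(timeVector, ...)` down to the source is indeed uncertain; the record-level transformation the source would have to perform is straightforward, though. A plain-Scala sketch of that effect (using a trimmed-down Record with only the two vectors):

```scala
// Trimmed-down Record for the sketch: just the paired vectors.
case class Record(timeVector: Array[Long], valueVector: Array[Float])

// Keep only entries with time > minTime -- what a pushed-down filter
// would do while decoding, instead of materializing the full vectors
// first and filtering afterward in Spark.
def pushDownTimeFilter(r: Record, minTime: Long): Record = {
  val kept = r.timeVector.zip(r.valueVector).filter { case (t, _) => t > minTime }
  Record(kept.map(_._1), kept.map(_._2))
}

val r = Record(Array(50L, 150L, 250L), Array(1f, 2f, 3f))
val filtered = pushDownTimeFilter(r, minTime = 100L)
```

Note that time and value entries must be dropped in lockstep, since the two vectors are positionally paired.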
Implement processing logic for binary files
As a customer, I want to have my files processed into the expected format.
Acceptance criteria:
- after processing, the output is stored in Record:
case class Parameter(name: String, unit: String)
case class Record(filename: String, parameter: Parameter, timeVector: Array[Long], valueVector: Array[Float])
Note:
- time has to be stored with millisecond precision, even though timestamps arrive with microsecond precision
- it's a good idea to have signals in memory, while measurements could be stored in a non-strict collection
- values are calculated by the formula:
value = factor * rawValue + offset
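A hedged sketch tying the notes together (the raw input shape and the per-parameter factor/offset arguments are assumptions): microsecond timestamps truncated to milliseconds, and values computed as `factor * rawValue + offset`.

```scala
case class Parameter(name: String, unit: String)
case class Record(filename: String, parameter: Parameter,
                  timeVector: Array[Long], valueVector: Array[Float])

// Build a Record from raw samples. The input shape (microsecond times,
// integer raw values, per-parameter factor/offset) is an assumption.
def process(filename: String, parameter: Parameter,
            timesMicros: Array[Long], rawValues: Array[Int],
            factor: Float, offset: Float): Record =
  Record(
    filename,
    parameter,
    timesMicros.map(_ / 1000L),                 // microseconds -> milliseconds
    rawValues.map(raw => factor * raw + offset) // value = factor * rawValue + offset
  )

val rec = process(
  "file_1.bin", Parameter("speed", "km/h"),
  timesMicros = Array(1000L, 2500L), rawValues = Array(10, 20),
  factor = 0.5f, offset = 1.0f
)
```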