research-project's Issues
Use column pruning
As a customer, I want a reader which reads only the specified parameters from the file, so that reading performance improves.
When we call:
val ds = spark.read.format("customBinaryFile").select("param_1", "param_2")
Spark should optimize the query and read only the data corresponding to the two parameters, instead of reading everything and filtering afterward.
Acceptance criteria:
- files are read and processed into a Spark Dataset[Record]
- only parameters specified in the select statement should be read
- attach the optimized query plan for the aforementioned call
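The effect the pruned plan should have can be illustrated in plain Scala (a hedged sketch; the byte-range index, the parameter names, and the `readRange` callback below are all invented for illustration, not the project's real API):

```scala
// Hypothetical sketch: given an index of parameter -> byte range,
// read only the ranges of the requested parameters instead of the
// whole file. All names here are illustrative.
case class ByteRange(offset: Long, length: Long)

def prunedRead(
    index: Map[String, ByteRange],        // parameter name -> location in file
    requested: Seq[String],               // columns surviving Catalyst's pruning
    readRange: ByteRange => Array[Byte]   // low-level reader for one range
): Map[String, Array[Byte]] =
  requested.flatMap { name =>
    index.get(name).map(r => name -> readRange(r)) // touch only requested ranges
  }.toMap

// Simulated file: each parameter owns a 10-byte slice of this array.
val file: Array[Byte] = Array.tabulate(30)(_.toByte)
val index = Map(
  "param_1" -> ByteRange(0, 10),
  "param_2" -> ByteRange(10, 10),
  "param_3" -> ByteRange(20, 10)
)
var bytesRead = 0L
val result = prunedRead(index, Seq("param_1", "param_2"), { r =>
  bytesRead += r.length
  file.slice(r.offset.toInt, (r.offset + r.length).toInt)
})
// Only 20 of the 30 bytes are read: param_3 is never touched.
```

In the real reader, Catalyst supplies the `requested` column list once the source implements the pruning interfaces.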
This is a test issue
As a GitHub noob, I want to create my very first issue to test my superpower.
Acceptance criteria:
- Close the issue
Reader with sc.binaryFiles
As a customer, I want to have a reader which uses sc.binaryFiles for reading multiple binary files.
Acceptance criteria:
- files are read and processed into a Spark Dataset[Record]
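In Spark this is `sc.binaryFiles(path)`, which yields an `RDD[(String, PortableDataStream)]` of (file path, file contents) pairs. A hedged dependency-free stand-in with the same per-file contract, useful for testing the processing logic locally (the directory layout is made up):

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Local stand-in mirroring sc.binaryFiles: (file name, full file contents).
// In the real reader these tuples would come from the RDD instead.
def readBinaryFiles(dir: Path): Seq[(String, Array[Byte])] =
  Files.list(dir).iterator().asScala
    .filter(Files.isRegularFile(_))
    .map(p => p.getFileName.toString -> Files.readAllBytes(p))
    .toSeq
    .sortBy(_._1)

// Demo on a temporary directory with two tiny fake "binary" files.
val dir = Files.createTempDirectory("binfiles")
Files.write(dir.resolve("a.bin"), Array[Byte](1, 2, 3))
Files.write(dir.resolve("b.bin"), Array[Byte](4, 5))
val files = readBinaryFiles(dir)
```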
Reader with DataSourceV1 API
As a customer, I want to have a reader which uses DataSourceV1 for reading multiple binary files.
Acceptance criteria:
- files are read and processed into a Spark Dataset[Record]
Property based testing for processing
As a customer, I want to be sure that the previously defined encoders and decoders work as expected, so I can sleep better.
Acceptance criteria:
- test using ScalaCheck and ScalaTest: https://www.scalatest.org/, https://www.scalacheck.org/
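With ScalaCheck the core property is a roundtrip: `decode(encode(r)) == r` for arbitrary records. A dependency-free sketch of the same property, using `scala.util.Random` in place of ScalaCheck's generators and a toy codec (both are stand-ins for illustration):

```scala
import java.nio.ByteBuffer
import scala.util.Random

// Toy codec: a vector of longs encoded as [count: Int][values: Long*].
def encode(xs: Vector[Long]): Array[Byte] = {
  val buf = ByteBuffer.allocate(4 + 8 * xs.length)
  buf.putInt(xs.length)
  xs.foreach(x => buf.putLong(x))
  buf.array()
}
def decode(bytes: Array[Byte]): Vector[Long] = {
  val buf = ByteBuffer.wrap(bytes)
  Vector.fill(buf.getInt())(buf.getLong())
}

// Roundtrip property over 100 random inputs, in the spirit of
// ScalaCheck's `forAll { xs: Vector[Long] => decode(encode(xs)) == xs }`.
val rnd = new Random(42)
val holds = (1 to 100).forall { _ =>
  val xs = Vector.fill(rnd.nextInt(50))(rnd.nextLong())
  decode(encode(xs)) == xs
}
```

ScalaCheck adds what this sketch lacks: automatic shrinking of failing inputs and generator combinators for nested case classes.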
Use codec for processing large files
As a customer, I want to have a solution that is able to encode/decode large binary files.
A good place to start the research is: https://github.com/scodec/scodec-stream
Acceptance criteria:
- can encode arbitrarily large files
- can decode arbitrarily large files
- can generate a large file, at least 1GB
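scodec-stream combines scodec codecs with streaming so the whole file never has to sit in memory. The idea, sketched without the library (the fixed 8-byte record format is an illustrative assumption):

```scala
import java.io.{BufferedInputStream, FileInputStream}
import java.nio.file.{Files, Path}

// Stream-decode a file of fixed-size records chunk by chunk, so memory
// use stays bounded regardless of file size. The 8-byte record size is
// an assumption for the sketch, not the project's real format.
def decodeStream(path: Path, recordSize: Int = 8): Iterator[Array[Byte]] = {
  val in = new BufferedInputStream(new FileInputStream(path.toFile))
  Iterator.continually {
    val buf = in.readNBytes(recordSize)
    if (buf.length == recordSize) Some(buf) else { in.close(); None }
  }.takeWhile(_.isDefined).map(_.get)
}

// Generate a file and count its records without loading it whole.
val path = Files.createTempFile("large", ".bin")
Files.write(path, Array.fill[Byte](8 * 1000)(1)) // 1000 records
val count = decodeStream(path).size
```

With scodec-stream the `Iterator` would be an fs2 `Stream` driven by a `StreamDecoder`, which also handles records that straddle chunk boundaries.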
CI/CD for the project
As a developer, I want to have a CI/CD pipeline (workflow) set up for the project, to make my life easier.
Acceptance criteria:
- Automatic tests
- Static code analyzer: ScalaStyle and Scapegoat: http://www.scalastyle.org/sbt.html, https://github.com/sksamuel/sbt-scapegoat
- Publish (Optional): https://stackoverflow.com/questions/33091153/can-sbt-publish-to-jfrog-artifactory
- Release (Optional): https://github.com/sbt/sbt-release
Define encoder and decoder for the binary file
As a customer, I want to have an encoder and decoder for my binary file, so as to have a PoC with Scodec.
Acceptance criteria:
- Encoder defined
- Decoder defined
Note: For now, let's assume a well-formed binary file without quality issues.
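With scodec this would be a `Codec[...]` built from combinators; as a hedged plain-JDK sketch of the same encode/decode pair (the record layout below is invented for illustration, not the real file format):

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets.UTF_8

// Invented layout: [nameLen: Int][name: UTF-8][count: Int][values: Float*]
case class Signal(name: String, values: Vector[Float])

def encode(s: Signal): Array[Byte] = {
  val nameBytes = s.name.getBytes(UTF_8)
  val buf = ByteBuffer.allocate(4 + nameBytes.length + 4 + 4 * s.values.length)
  buf.putInt(nameBytes.length).put(nameBytes).putInt(s.values.length)
  s.values.foreach(v => buf.putFloat(v))
  buf.array()
}

def decode(bytes: Array[Byte]): Signal = {
  val buf = ByteBuffer.wrap(bytes)
  val name = {
    val b = new Array[Byte](buf.getInt()); buf.get(b); new String(b, UTF_8)
  }
  Signal(name, Vector.fill(buf.getInt())(buf.getFloat()))
}

val s = Signal("speed", Vector(1.5f, 2.5f))
val roundtripped = decode(encode(s))
```

The scodec version of this layout would compose `int32`, `bytes`, and `vectorOfN(int32, float)` into a single declarative codec.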
Reader with DataSourceV2 API
As a customer, I want to have a reader which uses DataSourceV2 for reading multiple binary files.
Acceptance criteria:
- files are read and processed into a Spark Dataset[Record]
Setup project with SBT
As a developer, I want to set up my Scala project with SBT, so I can work further on it.
Acceptance criteria:
- Use scala 2.12.x
- Add required dependencies: ScalaTest, Typesafe Config, etc.
- Define packages
- Define multibuild project with modules:
- common: contains plain scala binary file processor
- spark2: contains spark 2.4.x binary file processor
- spark3: contains spark 3.0.x binary file processor
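A hedged build.sbt sketch of the module layout (versions and settings are illustrative, not final):

```scala
// build.sbt sketch -- versions and settings are illustrative, not final.
ThisBuild / scalaVersion := "2.12.15"

lazy val common = (project in file("common"))
  .settings(
    libraryDependencies ++= Seq(
      "com.typesafe"  %  "config"    % "1.4.2",
      "org.scalatest" %% "scalatest" % "3.2.15" % Test
    )
  )

lazy val spark2 = (project in file("spark2"))
  .dependsOn(common)
  .settings(libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.8" % Provided)

lazy val spark3 = (project in file("spark3"))
  .dependsOn(common)
  .settings(libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.3" % Provided)

lazy val root = (project in file("."))
  .aggregate(common, spark2, spark3)
```

Marking Spark as `Provided` keeps it off the published classpath, since the cluster supplies it at runtime.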
Scala writer for binary files
As a developer, I want to have binary files generated so they can be used for reading.
Acceptance criteria:
- implement writer
- generate 4-5 binary files
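A minimal sketch of such a generator (the file layout and sizes are assumptions for the sketch; real files would follow the documented format):

```scala
import java.nio.{ByteBuffer, ByteOrder}
import java.nio.file.{Files, Path}

// Write one toy binary file: [count: Int][rawValue: Float]*. The layout
// is an illustrative assumption, not the project's real format.
def writeBinaryFile(path: Path, values: Seq[Float]): Path = {
  val buf = ByteBuffer.allocate(4 + 4 * values.length).order(ByteOrder.BIG_ENDIAN)
  buf.putInt(values.length)
  values.foreach(v => buf.putFloat(v))
  Files.write(path, buf.array())
}

// Generate a handful of files, as the acceptance criteria ask.
val dir = Files.createTempDirectory("generated")
val paths = (1 to 5).map { i =>
  writeBinaryFile(dir.resolve(s"file_$i.bin"), Seq.tabulate(10)(_.toFloat * i))
}
```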
Documentation about the binary file
As a customer, I want to see the description of the binary file and the approach used for processing.
Acceptance criteria:
- document the structure of the binary file
- document the approach used for processing it
Make binary file splittable
As a customer, I want to be sure that the spark readers work with large files, so I can have a reliable reader.
When a file is read with sc.binaryFiles, the whole file is loaded into a single partition. With DataSourceV1 and V2 we can avoid this limitation, and thereby avoid OoM issues for huge files.
Acceptance criteria:
- files are read and processed into a Spark Dataset[Record]
TBD: How to split the file?
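One possible answer to the TBD, assuming fixed-size records (a hedged sketch; a variable-size format would instead need an index or sync markers): align every split boundary to a record boundary so no partition cuts a record in half.

```scala
// Compute record-aligned (offset, length) splits for a file, so each
// Spark partition reads whole records only. Fixed record size is an
// assumption; variable-size records would need an index or sync markers.
def alignedSplits(fileSize: Long, recordSize: Long, targetSplit: Long): Seq[(Long, Long)] = {
  val recordsPerSplit = math.max(1L, targetSplit / recordSize)
  val step = recordsPerSplit * recordSize  // split size, rounded to records
  (0L until fileSize by step).map { offset =>
    offset -> math.min(step, fileSize - offset)
  }
}

// 1000 records of 8 bytes, ~3 KB per split: every boundary lands on a
// multiple of 8, and the last split carries the remainder.
val splits = alignedSplits(fileSize = 8000L, recordSize = 8L, targetSplit = 3000L)
```

In the DataSource implementation, each (offset, length) pair would become one partition's read range.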
Pushdown filtering
As a customer, I want to be able to push down time-based filtering to the file, so reading performance will improve.
When one specifies a time range in the filter statement, the filtering should be pushed down to file level. Ex:
val ds = spark.read.format("customBinaryFile").selectExpr("filter(timeVector, time -> time > 100L ) AS timeVector")
should read only data that has time greater than 100L.
Acceptance criteria:
- files are read and processed into a Spark Dataset[Record]
- only those records whose time falls within the specified time range are returned
- attach the optimized query plan for the aforementioned call
Note:
- https://docs.databricks.com/delta/data-transformation/higher-order-lambda-functions.html
- https://www.waitingforcode.com/apache-spark-sql/apache-spark-2.4.0-features-array-higher-order-functions/read
- in Spark 3.0.x the syntax for higher-order functions becomes simpler
- I have concerns about whether the pushdown will work, but let's see
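Whether Catalyst actually pushes the lambda inside `filter(timeVector, ...)` down to the source is indeed uncertain; the record-level transformation the source would have to perform is straightforward, though. A plain-Scala sketch of that effect (using a trimmed-down Record with only the two vectors):

```scala
// Trimmed-down Record for the sketch: just the paired vectors.
case class Record(timeVector: Array[Long], valueVector: Array[Float])

// Keep only entries with time > minTime -- what a pushed-down filter
// would do while decoding, instead of materializing the full vectors
// first and filtering afterward in Spark.
def pushDownTimeFilter(r: Record, minTime: Long): Record = {
  val kept = r.timeVector.zip(r.valueVector).filter { case (t, _) => t > minTime }
  Record(kept.map(_._1), kept.map(_._2))
}

val r = Record(Array(50L, 150L, 250L), Array(1f, 2f, 3f))
val filtered = pushDownTimeFilter(r, minTime = 100L)
```

Note that time and value entries must be dropped in lockstep, since the two vectors are positionally paired.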
Implement processing logic for binary files
As a customer, I want to have my files processed into the expected format.
Acceptance criteria:
- after processing, the output is stored in Record:
case class Parameter(name: String, unit: String)
case class Record(filename: String, parameter: Parameter, timeVector: Array[Long], valueVector: Array[Float])
Note:
- time has to be stored with millisecond precision, even though timestamps arrive with microsecond precision
- it's a good idea to have signals in memory, while measurements could be stored in a non-strict collection
- values are calculated by the formula:
value = factor * rawValue + offset
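A hedged sketch tying the notes together (the raw input shape and the per-parameter factor/offset arguments are assumptions): microsecond timestamps truncated to milliseconds, and values computed as `factor * rawValue + offset`.

```scala
case class Parameter(name: String, unit: String)
case class Record(filename: String, parameter: Parameter,
                  timeVector: Array[Long], valueVector: Array[Float])

// Build a Record from raw samples. The input shape (microsecond times,
// integer raw values, per-parameter factor/offset) is an assumption.
def process(filename: String, parameter: Parameter,
            timesMicros: Array[Long], rawValues: Array[Int],
            factor: Float, offset: Float): Record =
  Record(
    filename,
    parameter,
    timesMicros.map(_ / 1000L),                 // microseconds -> milliseconds
    rawValues.map(raw => factor * raw + offset) // value = factor * rawValue + offset
  )

val rec = process(
  "file_1.bin", Parameter("speed", "km/h"),
  timesMicros = Array(1000L, 2500L), rawValues = Array(10, 20),
  factor = 0.5f, offset = 1.0f
)
```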