
spark-etl's People

Contributors

konrads

spark-etl's Issues

ExtractReader and LoadWriter to handle prefix based plugins?

I.e. if the URI prefix is "parquet:", we look up the Parquet LoadWriter plugin and instantiate only that one. This will cater for a combination of different reading/writing strategies.
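
A minimal sketch of the prefix lookup described above, assuming a registry keyed by URI prefix. Every name here (`LoadWriter` as a trait, `ParquetLoadWriter`, `JdbcLoadWriter`, `writers`, `resolve`) is hypothetical and not taken from spark-etl. The registry holds factories rather than instances, so only the plugin whose prefix is actually used gets instantiated.

```scala
object PrefixPlugins {
  // Hypothetical plugin contract; the real LoadWriter API may look different.
  trait LoadWriter {
    def write(path: String): String
  }

  final class ParquetLoadWriter extends LoadWriter {
    def write(path: String): String = s"parquet -> $path"
  }

  final class JdbcLoadWriter extends LoadWriter {
    def write(path: String): String = s"jdbc -> $path"
  }

  // Factories, not instances: a plugin is created only when its prefix is requested.
  val writers: Map[String, () => LoadWriter] = Map(
    "parquet" -> (() => new ParquetLoadWriter),
    "jdbc"    -> (() => new JdbcLoadWriter)
  )

  // Split "prefix:rest" at the first colon and instantiate the matching plugin.
  def resolve(uri: String): Either[String, (LoadWriter, String)] =
    uri.split(":", 2) match {
      case Array(prefix, rest) =>
        writers.get(prefix) match {
          case Some(factory) => Right((factory(), rest))
          case None          => Left(s"No LoadWriter plugin for prefix '$prefix'")
        }
      case _ => Left(s"URI '$uri' has no prefix")
    }
}
```

Returning `Either` keeps an unknown prefix a validation error rather than an exception, which fits a local validation step.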

Plugins might have a lifecycle:

  • for local validation: just validate
  • for running on the server (validate-remote, transform-load, etc.):
    • instantiate the context (e.g. an Oracle connection)
    • run the action
    • close the context
    • post-run step, e.g. to record passes/failures?
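
The lifecycle above could be sketched as follows. The `Plugin` trait and its method names are assumptions made for illustration, not an existing spark-etl interface. The context is closed in a `finally` block so that a failing action still releases it, and the post-run step records the outcome either way.

```scala
import scala.util.control.NonFatal

object PluginLifecycle {
  // Hypothetical contract; C is the context type (e.g. a database connection).
  trait Plugin[C] {
    def validate(): List[String]                        // error messages, empty if valid
    def open(): C                                       // instantiate the context
    def run(ctx: C, action: String): Unit               // run the action
    def close(ctx: C): Unit                             // close the context
    def postRun(action: String, passed: Boolean): Unit  // record pass/failure
  }

  // Local validation: no context is created.
  def validateLocally(plugin: Plugin[_]): List[String] = plugin.validate()

  // Server-side run: open, run, always close, then record the outcome.
  def runOnServer[C](plugin: Plugin[C], action: String): Boolean = {
    val ctx = plugin.open()
    val passed =
      try { plugin.run(ctx, action); true }
      catch { case NonFatal(_) => false }
      finally plugin.close(ctx)
    plugin.postRun(action, passed)
    passed
  }
}
```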

add sinker/outputter

The sinker will be responsible for saving all transform results. It will need:

  • config section, e.g.:

      sinker:
        impl: spark_etl.parquet.Sink
        args: # optional
          param1: strVal1
          param2: strVal2
  • an abstract class:

    • constructor(options: Map[String, String])
    • def validate(transforms: Seq[Transform]) # could fail if a Transform output is unsupported?
    • def sink(transformsAndDfs: Seq[(Transform, DataFrame)])
  • spark_etl.parquet.Sink, extending the above

  • changes to MainUtil:

    • add Sink.validate() to extractCheck
    • add Sink.sink() to transform
    • consider renaming extractCheck, transformCheck, transform...?
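
A rough sketch of the proposed abstract class, written so that it compiles without Spark on the classpath. `Transform` and `DataFrame` below are stand-ins (assumptions) for the project's own `Transform` and for `org.apache.spark.sql.DataFrame`. `RecordingSink` and the `root` option are hypothetical and exist only to exercise the contract; they are not the planned `spark_etl.parquet.Sink`.

```scala
import scala.collection.mutable

object SinkSketch {
  // Stand-ins (assumptions) so the sketch runs without Spark.
  final case class Transform(name: String)
  type DataFrame = Seq[Map[String, Any]]

  abstract class Sink(protected val options: Map[String, String]) {
    /** Returns validation errors; empty when every transform can be sunk. */
    def validate(transforms: Seq[Transform]): Seq[String]

    /** Saves all transform results. */
    def sink(transformsAndDfs: Seq[(Transform, DataFrame)]): Unit
  }

  // Hypothetical implementation that only records what it would write.
  final class RecordingSink(opts: Map[String, String]) extends Sink(opts) {
    val written = mutable.LinkedHashMap.empty[String, Int] // output path -> row count

    def validate(transforms: Seq[Transform]): Seq[String] = {
      val missingRoot =
        if (options.contains("root")) Nil else Seq("missing option 'root'")
      val duplicates = transforms.groupBy(_.name).collect {
        case (name, ts) if ts.size > 1 => s"duplicate transform output '$name'"
      }
      missingRoot ++ duplicates
    }

    def sink(transformsAndDfs: Seq[(Transform, DataFrame)]): Unit =
      transformsAndDfs.foreach { case (t, df) =>
        written(s"${options("root")}/${t.name}") = df.size
      }
  }
}
```

Having `validate` return a list of errors, rather than throw, would let extractCheck report sink problems together with the other checks.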

Move output to separate "load" section in config

app.yaml should contain:

  • extract
  • transform
  • load

The pipelines are derived top-down, i.e. first see which "load" dependencies are defined in "transform", then which "transform" dependencies are defined in "extract".
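
The top-down derivation could look roughly like the sketch below. The config model (`AppConfig`, `Plan`) is an assumption, and the sketch further assumes that transforms depend only on extracts, not on other transforms. Entries that no "load" needs are left out of the plan.

```scala
object PipelineDerivation {
  // Hypothetical in-memory view of app.yaml: each entry maps a name to its dependencies.
  final case class AppConfig(
    extract: Set[String],
    transform: Map[String, Set[String]], // transform name -> extract names
    load: Map[String, Set[String]]       // load name -> transform names
  )

  final case class Plan(extracts: Set[String], transforms: Set[String], loads: Set[String])

  def derive(cfg: AppConfig): Either[Seq[String], Plan] = {
    // Step 1: which transforms do the loads need, and are they all defined?
    val neededTransforms  = cfg.load.values.flatten.toSet
    val missingTransforms = neededTransforms -- cfg.transform.keySet

    // Step 2: which extracts do those transforms need, and are they all defined?
    val neededExtracts  = neededTransforms.flatMap(t => cfg.transform.getOrElse(t, Set.empty[String]))
    val missingExtracts = neededExtracts -- cfg.extract

    val errors =
      missingTransforms.toSeq.sorted.map(t => s"load depends on undefined transform '$t'") ++
        missingExtracts.toSeq.sorted.map(e => s"transform depends on undefined extract '$e'")

    if (errors.isEmpty) Right(Plan(neededExtracts, neededTransforms, cfg.load.keySet))
    else Left(errors)
  }
}
```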

Implement DSL transforms

These will work inside Dataset.transform().
A demo implementation, declared in the "plugins" configuration section, will e.g. perform typing of Strings to Int/Double/Date/Timestamp/etc. Typing should be done as Row[String] -> Row[Int/Double/Date/Timestamp], i.e. not Row[String] -> case classes/classes.
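
To illustrate the shape of such a typing transform without requiring Spark, the sketch below uses a small stand-in `Rows` class whose `transform` method mirrors how `Dataset.transform` applies a function to the whole dataset. `Rows`, `ColType` and `typed` are hypothetical names. With Spark, the equivalent would presumably be a `DataFrame => DataFrame` function passed to `Dataset.transform`.

```scala
import java.time.LocalDate

object TypingTransform {
  // Hypothetical per-column target types.
  sealed trait ColType
  case object AsString extends ColType
  case object AsInt    extends ColType
  case object AsDouble extends ColType
  case object AsDate   extends ColType

  // Stand-in for a Dataset of rows; transform applies a whole-dataset function.
  final case class Rows[A](rows: Seq[Seq[A]]) {
    def transform[B](f: Rows[A] => Rows[B]): Rows[B] = f(this)
  }

  // Parse one string cell into its target type (dates in ISO format, e.g. 2020-01-31).
  private def parse(value: String, colType: ColType): Any = colType match {
    case AsString => value
    case AsInt    => value.toInt
    case AsDouble => value.toDouble
    case AsDate   => LocalDate.parse(value)
  }

  // Row[String] -> Row[typed values]; rows stay rows and are not mapped to case classes.
  // Note: zip drops trailing cells if the schema is shorter than the row.
  def typed(schema: Seq[ColType]): Rows[String] => Rows[Any] = in =>
    Rows(in.rows.map(row => row.zip(schema).map { case (v, t) => parse(v, t) }))
}
```

Usage would be along the lines of `rows.transform(typed(Seq(AsInt, AsDouble, AsDate, AsString)))`.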
