spark-input-sources's Introduction

Input Sources

Input Sources is an abstraction for loading Spark data via configuration files. Currently, it can handle:

  • file path sources
  • table sources
  • SQL sources
  • BigQuery sources

This library aims to be easy to extend to other sources by using a sealed trait with a case class for each new source.
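As a rough illustration of that pattern, here is a minimal sketch that uses plain Scala only. The names `DataSource`, `FileSource`, `TableSource`, `SqlSource`, `describe`, and `kind` are hypothetical and chosen for this example; they are not taken from the library's source, and the real case classes may carry different fields.

```scala
// Hypothetical names throughout; this is not the library's actual code.
sealed trait DataSource {
  // Describe the read this source would perform.
  def describe: String
}

final case class FileSource(path: String, format: String) extends DataSource {
  def describe: String = s"read $format files from $path"
}

final case class TableSource(tableName: String, filter: Option[String] = None) extends DataSource {
  def describe: String =
    filter.fold(s"read table $tableName")(f => s"read table $tableName where $f")
}

// Supporting a new source means adding one more case class like this one.
final case class SqlSource(query: String) extends DataSource {
  def describe: String = s"run query: $query"
}

// Because the trait is sealed, the compiler can warn when a match misses a case.
def kind(source: DataSource): String = source match {
  case _: FileSource  => "file"
  case _: TableSource => "table"
  case _: SqlSource   => "sql"
}
```

With this layout, each new source stays in one place, and any pattern match over the trait is checked for exhaustiveness at compile time.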

// https://central.sonatype.com/artifact/com.growingintech/spark-input-sources_2.12/1.0.1
libraryDependencies += "com.growingintech" %% "spark-input-sources" % "1.0.1"

New Sources

Feel free to submit a PR for any new sources you would like to add. I don't plan on creating accounts with every cloud provider, so it would be helpful if others could contribute sources for Amazon and Azure.

Basic Usage

This simple example parses a HOCON pipeline configuration string, which can hold as many parameters as the user's use case needs. The data definition uses a TableSource.

/*
 * Copyright 2023 GrowingInTech.com. All Rights Reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License"). You may not
 * use this file except in compliance with the License. A copy of the License
 * is located at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * or in the "license" file accompanying this file. This file is distributed on
 * an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
 * express or implied. See the License for the specific language governing
 * permissions and limitations under the License.
 *
 */
import com.growingintech.datasources.InputSources
import com.typesafe.config.ConfigFactory
import pureconfig._
import pureconfig.generic.auto._

import org.apache.spark.sql.DataFrame

val strConfig: String =
  """
    |{
    | pipeline-name: Data Runner
    | date: 20230216
    | data: {
    |   type: table-source
    |   table-name: default.test_data
    |   filter: "date = 20230101 AND x > 2"
    | }
    |}
    |""".stripMargin

case class Params(
                   pipelineName: String,
                   date: Int,
                   data: InputSources
                 )

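// Parse the HOCON string into Params; pureconfig maps kebab-case keys to camelCase fields.
val config: Params = ConfigSource.fromConfig(ConfigFactory.parseString(strConfig)).loadOrThrow[Params]
// Load the DataFrame from the source selected by the `data` block.
val df: DataFrame = config.data.loadData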
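The `type: table-source` entry in the configuration above looks like pureconfig's default handling of sealed families, where a `type` field selects the case class by its kebab-case name. The sketch below shows that mechanism in isolation, assuming pureconfig with its generic module on the classpath. `DemoSource` and the two case classes are hypothetical stand-ins, not the library's own definitions.

```scala
import pureconfig._
import pureconfig.generic.auto._

// Hypothetical stand-ins for the library's source types.
sealed trait DemoSource
final case class TableSource(tableName: String, filter: Option[String]) extends DemoSource
final case class SqlSource(query: String) extends DemoSource

// With pureconfig's default settings, the `type` field picks the case class
// by its kebab-case name, and kebab-case keys map to camelCase fields.
val table: DemoSource = ConfigSource
  .string("""{ type: table-source, table-name: default.test_data, filter: "x > 2" }""")
  .loadOrThrow[DemoSource]

val sql: DemoSource = ConfigSource
  .string("""{ type: sql-source, query: "SELECT 1" }""")
  .loadOrThrow[DemoSource]
```

Under these assumptions, switching a pipeline from one source to another only requires changing the `type` value and its fields in the configuration, not the Scala code that calls it.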

spark-input-sources's People

Contributors

dwsmith1983
