
spark-utils's Introduction

Spark Utils


Motivation

One of the biggest challenges after taking the first steps into the world of writing Apache Spark applications in Scala is taking them to production.

An application of any kind needs to be easy to run and easy to configure.

This project tries to help developers write Spark applications by focusing mainly on the application logic, rather than on the details of configuring the application and setting up the Spark context.

This project also tries to create and encourage a friendly yet professional environment where developers can help each other, so please do not be shy and join through Gitter, Twitter, issue reports or pull requests.

ATTENTION!

At the moment there are a lot of changes happening to the spark-utils project, hopefully for the better.

The latest stable versions, available through Maven Central are

  • Spark 2.4: 0.4.2 to 0.6.2
  • Spark 3.0: 0.6.2 to 1.0.0-RC6
  • Spark >= 3.3.0: 1.0.0-RC7 +

The development version is 1.0.0-RC6, which brings a clean separation between the configuration implementation and the core, and adds a PureConfig-based configuration module that brings the power and features of PureConfig to increase productivity even further and allow for a more mature configuration framework.

The new modules are:

  • spark-utils-io-pureconfig for the new PureConfig implementation

We completely removed the legacy scalaz based configuration framework.

Going forward, we suggest considering the new spark-utils-io-pureconfig module.

Migrating to the new 1.0.0-RC6 is quite easy, as the configuration structure was mainly preserved. More details are available in the RELEASE-NOTES.

For now, some of the documentation related to or referenced from this project might be obsolete or outdated, but there will be more improvements as the project gets closer to the final release.

Test Results Matrix

Spark  | Scala 2.12 | Scala 2.13 | Report 1.0.0-RC6 | Report 1.0.0-RC7
3.0.3  | YES        | N/A        | 3.0.3            | N/A
3.1.3  | YES        | N/A        | 3.1.3            | N/A
3.2.4  | YES        | YES        | 3.2.4            | N/A
3.3.4  | YES        | YES        | 3.3.4            | 3.3.4
3.4.2  | YES        | YES        | 3.4.2            | 3.4.2
3.5.1  | YES        | YES        | 3.5.1            | 3.5.1

Description

This project contains some basic utilities that can help with setting up an Apache Spark application project.

The main point is the simplicity of writing Apache Spark applications, focusing just on the logic, while providing for easy configuration and argument passing.

The code sample below shows how easy it can be to write a file format converter from any acceptable input type, with any acceptable parsing configuration options, to any acceptable output format.

Batch Application

import com.typesafe.config.Config
import org.apache.spark.sql.{ DataFrame, SparkSession }
import org.tupol.spark._
import org.tupol.spark.implicits._ // source/sink decorators (see the implicits note further down)
import scala.util.Try

object FormatConverterExample extends SparkApp[FormatConverterContext, DataFrame] {
  // Build the application context from the given configuration
  override def createContext(config: Config) = FormatConverterContext.extract(config)
  override def run(implicit spark: SparkSession, context: FormatConverterContext): Try[DataFrame] = {
    // Read from the configured input source and write to the configured output sink
    val inputData = spark.source(context.input).read
    inputData.sink(context.output).write
  }
}

Optionally, SparkFun can be used instead of SparkApp to make the code even more concise.

import org.tupol.spark._

object FormatConverterExample extends 
          SparkFun[FormatConverterContext, DataFrame](FormatConverterContext.extract) {
  override def run(implicit spark: SparkSession, context: FormatConverterContext): Try[DataFrame] = 
    spark.source(context.input).read.sink(context.output).write
}

Configuration

Creating the configuration can be as simple as defining a case class to hold the configuration and a factory that helps extract simple and complex data types, like input sources and output sinks.

import org.tupol.spark.io._

case class FormatConverterContext(input: FormatAwareDataSourceConfiguration,
                                  output: FormatAwareDataSinkConfiguration)

The context can be created from configuration files in several ways. This project proposes two:

  • the new PureConfig based framework
  • the legacy ScalaZ based framework

Configuration creation based on PureConfig

import com.typesafe.config.Config
import scala.util.Try

object FormatConverterContext {
  import pureconfig.generic.auto._
  import org.tupol.spark.io.pureconf._
  import org.tupol.spark.io.pureconf.readers._
  def extract(config: Config): Try[FormatConverterContext] = config.extract[FormatConverterContext]
}
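
For a quick sanity check, the sketch below builds a Config from an inline HOCON string and runs the extraction. The key names (format, path) are illustrative assumptions; the exact schema depends on the concrete source and sink configurations you use.

import com.typesafe.config.ConfigFactory

// Hypothetical HOCON layout; adjust the keys to the source/sink types you actually configure.
val config = ConfigFactory.parseString(
  """
    |input.format = csv
    |input.path = "/tmp/input.csv"
    |output.format = parquet
    |output.path = "/tmp/output.parquet"
    |""".stripMargin)

// The extraction failure, if any, is captured in the returned Try
val context = FormatConverterContext.extract(config)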

Streaming Application

For structured streaming applications the format converter might look like this:

import com.typesafe.config.Config
import org.apache.spark.sql.{ DataFrame, SparkSession }
import org.tupol.spark._
import org.tupol.spark.io.streaming.structured._
import scala.util.Try

object StreamingFormatConverterExample extends SparkApp[StreamingFormatConverterContext, DataFrame] {
  override def createContext(config: Config) = StreamingFormatConverterContext.extract(config)
  override def run(implicit spark: SparkSession, context: StreamingFormatConverterContext): Try[DataFrame] = {
    // Read the streaming source and write it to the streaming sink, blocking until the query terminates
    val inputData = spark.source(context.input).read
    inputData.streamingSink(context.output).write.awaitTermination()
  }
}

Configuration

The streaming configuration can be as simple as the following:

import org.tupol.spark.io.streaming.structured._

case class StreamingFormatConverterContext(input: FormatAwareStreamingSourceConfiguration, 
                                           output: FormatAwareStreamingSinkConfiguration)

Configuration creation based on PureConfig

object StreamingFormatConverterContext {
  import com.typesafe.config.Config
  import pureconfig.generic.auto._
  import org.tupol.spark.io.pureconf._
  import org.tupol.spark.io.pureconf.streaming.structured._
  import scala.util.Try
  def extract(config: Config): Try[StreamingFormatConverterContext] = config.extract[StreamingFormatConverterContext]
}

SparkRunnable, SparkApp and SparkFun, together with the configuration framework, provide for easy Spark application creation, with configuration that can be managed through configuration files or application parameters.

The IO frameworks for reading and writing data frames add extra convenience for setting up batch and structured streaming jobs that transform various types of files and streams.

Last but not least, there are many utility functions that provide convenience for loading resources, dealing with schemas and so on.

Most of the common features are also implemented as decorators on the main Spark classes, like SparkContext, DataFrame and StructType, and they are conveniently available by importing the org.tupol.spark.implicits._ package.
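
As a rough illustration of how these decorators read outside of a SparkApp, the sketch below mirrors the source/sink calls from the examples above; the org.tupol.spark.implicits._ import follows the note above, and the context value is assumed to have been extracted from configuration as shown in the previous sections.

import org.apache.spark.sql.SparkSession
import org.tupol.spark.implicits._ // SparkSession / DataFrame decorators, as mentioned above

val spark = SparkSession.builder().appName("format-converter").getOrCreate()

// `context` stands for a FormatConverterContext extracted from configuration, as shown earlier
val inputData = spark.source(context.input).read
val result    = inputData.sink(context.output).write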

Documentation

The documentation for the main utilities and frameworks is available:

Latest stable API documentation is available here.

An extensive tutorial and walk-through can be found here. Extensive samples and demos can be found here.

A nice example on how this library can be used can be found in the spark-tools project, through the implementation of a generic format converter and a SQL processor for both batch and structured streams.

Prerequisites

  • Java 8 or higher
  • Scala 2.12 or 2.13
  • Apache Spark 3.0.x or later (see the test results matrix above)

Getting Spark Utils

Spark Utils is published to Maven Central and Spark Packages:

  • Group id / organization: org.tupol
  • Artifact id / name: spark-utils
  • Latest stable versions:
    • Spark 2.4: 0.4.2 to 0.6.2
    • Spark 3.0: 0.6.2 to 1.0.0-RC6
    • Spark >= 3.3: 1.0.0-RC7 and later

Usage with SBT: add a dependency on the latest version to your sbt build definition file:

libraryDependencies += "org.tupol" %% "spark-utils-io-pureconfig" % "1.0.0-RC6"

Include this package in your Spark applications using spark-shell or spark-submit:

$SPARK_HOME/bin/spark-shell --packages org.tupol:spark-utils_2.12:1.0.0-RC6
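
For a packaged application, a spark-submit invocation might look like the sketch below; the class name, jar name and the use of --files to ship the application.conf alongside the job are illustrative assumptions rather than a prescribed layout.

$SPARK_HOME/bin/spark-submit \
  --class my.org.my_project.FormatConverterExample \
  --packages org.tupol:spark-utils_2.12:1.0.0-RC6 \
  --files application.conf \
  my-project-assembly.jar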

Starting a New spark-utils Project

Note: the g8 seed template below has not yet been updated for the 1.x versions.

The simplest way to start a new spark-utils project is to make use of the spark-apps.seed.g8 template project.

To fill in the project options manually, run:

g8 tupol/spark-apps.seed.g8

The default options look like the following:

name [My Project]:
appname [My First App]:
organization [my.org]:
version [0.0.1-SNAPSHOT]:
package [my.org.my_project]:
classname [MyFirstApp]:
scriptname [my-first-app]:
scalaVersion [2.12.12]:
sparkVersion [3.2.1]:
sparkUtilsVersion [0.4.0]:

To fill in the options in advance, run:

g8 tupol/spark-apps.seed.g8 --name="My Project" --appname="My App" --organization="my.org" --force

What's new?

1.0.0-RC7

  • Adapted to the latest Apache Spark versions, starting from 3.3.x
  • Added StreamingTrigger.AvailableNow
  • Build with Spark 3.3.x and tested against Spark 3.3.0 to 3.5.1

1.0.0-RC1 to 1.0.0-RC6

Major library redesign

  • Cross compiled for Scala 2.12 and 2.13
  • Building with JDK 17, targeting Java 8
  • Added test Java options to handle JDK 17
  • Build with Spark 3.2.x and tested against Spark 3.x
  • Removed the legacy scalaz-based configuration module
  • Added configuration module based on PureConfig
  • DataSource exposes reader in addition to read
  • DataSink and DataAwareSink expose writer in addition to write
  • Added SparkSessionOps.streamingSource
  • Refactored TypesafeConfigBuilder, which has two implementations now: SimpleTypesafeConfigBuilder and FuzzyTypesafeConfigBuilder
  • Small improvements to SharedSparkSession
  • Documentation improvements

0.6.2

  • Fixed core dependency to scala-utils; now using scala-utils-core
  • Refactored the core/implicits package to make the implicits a little more explicit

For previous versions please consult the release notes.

License

This code is open source software licensed under the MIT License.


spark-utils's Issues

Potential security vulnerability in the zstd C library.

Hi, @tupol , I'd like to report a vulnerable dependency in org.tupol:spark-utils-io_2.12:0.6.2.

Issue Description

I noticed that org.tupol:spark-utils-io_2.12:0.6.2 directly depends on org.apache.spark:spark-core_2.12:3.0.1 in the pom. However, as shown in the following dependency graph, org.apache.spark:spark-core_2.12:3.0.1 suffers from the vulnerability exposed by the C library zstd (version 1.4.3): CVE-2021-24032.

Dependency Graph between Java and Shared Libraries

(dependency graph image)

Suggested Vulnerability Patch Versions

org.apache.spark:spark-core_2.12:3.2.0 (>=3.2.0) has upgraded this vulnerable C library zstd to the patch version 1.5.0.

Java build tools cannot report vulnerable C libraries, which may introduce potential security issues into many downstream Java projects. Could you please upgrade this vulnerable dependency?

Thanks for your help~
Best regards,
Helen Parr

SparkApp fails if no configuration is expected

When implementing a SparkApp that has no configuration expectations, the initialization of the app fails with the following exception:

An unexpected com.typesafe.config.ConfigException$Missing was thrown.
ScalaTestFailureLocation: org.tupol.spark.SparkAppSpec$$anonfun$5 at (SparkAppSpec.scala:64)
org.scalatest.exceptions.TestFailedException: An unexpected com.typesafe.config.ConfigException$Missing was thrown.
. . .

Sample test for SparkApp:

  object MockAppNoConfig extends SparkApp[String, Unit] {
    def createContext(config: Config): String = "Hello"
    override def run(implicit spark: SparkSession, config: String): Unit = Unit
  }

  test("SparkApp.main successfully completes") {
    noException shouldBe thrownBy(MockAppNoConfig.main(Array()))
  }

Exceptions vs `Try[T]`

In the early versions the scala-utils and spark-utils were Try[T] centric, meaning that everything that could fail returned a Try[T]. At some point, observing how some developers were using it, I decided that it might be easier to throw exceptions, even though my functional blood was boiling a little.

Question Should we go for a more functional, no side effects approach or should we keep throwing exceptions?
In a sense we are also logging, so we have few pure functions anyway, which makes this question less straightforward than it seems.
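
To make the trade-off concrete, here is a minimal, generic sketch of the two styles under discussion (plain Spark calls, not the actual spark-utils signatures):

import scala.util.Try
import org.apache.spark.sql.{ DataFrame, SparkSession }

// Exception style: the caller gets a DataFrame, or an exception propagates up the stack
def readOrThrow(spark: SparkSession, path: String): DataFrame =
  spark.read.parquet(path)

// Try style: the failure becomes a value the caller has to handle or compose
def readAsTry(spark: SparkSession, path: String): Try[DataFrame] =
  Try(spark.read.parquet(path))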

Cross compile on Scala 2.11 and 2.12

Spark-Utils needs to catch up with the world and as a first step, it needs to cross-compile on both Scala 2.11 and 2.12.
While at it, Spark 2.4.x should be brought up to the latest version as well.

Avro support is native in Spark 2.4.x

Avro support is native in Spark 2.4.x, so we can no longer pass it as "com.databricks.spark.avro"; it is now passed as "avro".

One solution is to add a generic source / sink configuration that accepts anything, with minimal validation. This might work, as it should not break compatibility between versions and allows for more custom sources and sinks.

Another solution is simply to rename "com.databricks.spark.avro" to "avro", but this will break compatibility between the 2.3.x and 2.4.x Spark versions.

Application Configuration File Name

Currently the application configuration file name is limited to application.conf.

We should be able to pass in an alternative application configuration file name as an application parameter, e.g. -conf-file-name='my_app.conf' and the SparkApp should pick the configuration file from my_app.conf.
