Drizzle: Low Latency Execution for Apache Spark

Drizzle is a low-latency execution engine for Apache Spark, targeted at stream processing and iterative workloads. Spark currently uses a bulk-synchronous parallel (BSP) computation model and notifies the scheduler at the end of each task. Invoking the scheduler this often adds overhead, which decreases throughput and increases latency.

In Drizzle, we introduce group scheduling, where multiple batches (or a group) of computation are scheduled at once. This helps decouple the granularity of task execution from scheduling and amortize the costs of task serialization and launch.
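To make the amortization argument concrete, here is a toy cost model in Python. It is not Drizzle's scheduler: the function names, the assumption of a fixed cost per scheduling round, and the numbers in the usage note are all illustrative assumptions.

```python
def scheduling_rounds(num_batches: int, group_size: int) -> int:
    """Scheduler invocations needed when batches are scheduled group_size at a time."""
    if num_batches < 0 or group_size < 1:
        raise ValueError("num_batches must be >= 0 and group_size must be >= 1")
    # Integer ceiling division: the last group may be only partially filled.
    return -(-num_batches // group_size)


def total_time_ms(num_batches: int, group_size: int,
                  compute_ms_per_batch: int, overhead_ms_per_round: int) -> int:
    """Toy model: compute cost grows with batches, scheduling cost with rounds.

    Assumption: each scheduling round has a fixed cost regardless of group size.
    """
    rounds = scheduling_rounds(num_batches, group_size)
    return num_batches * compute_ms_per_batch + rounds * overhead_ms_per_round
```

With made-up costs of 5 ms of compute per batch and 2 ms per scheduling round, 10 batches take 70 ms in this model at group size 1 and 52 ms at group size 10. The compute term is unchanged; only the number of scheduling rounds shrinks.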

Drizzle Example

The current Drizzle prototype exposes a low-level API through the runJobs method in SparkContext. This method takes a Seq of RDDs and corresponding functions to execute on them. Examples of using this API can be seen in DrizzleSingleStageExample and DrizzleRunningSum.
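The shape of that call (a sequence of datasets paired with functions) can be sketched in plain Python without Spark. This is a stand-in, not the runJobs signature: run_jobs_sketch is a hypothetical name, each "RDD" is modeled as a list of partitions, and everything runs sequentially in one process. For the real signature, see the example classes named above.

```python
from typing import Callable, Iterator, List, Sequence


def run_jobs_sketch(
    datasets: Sequence[Sequence[Sequence[int]]],
    funcs: Sequence[Callable[[Iterator[int]], int]],
) -> List[List[int]]:
    """Apply funcs[i] to every partition of datasets[i] in a single call.

    Hypothetical stand-in: each dataset is a list of partitions, and each
    partition is a list of ints. Returns one list of per-partition results
    per dataset.
    """
    if len(datasets) != len(funcs):
        raise ValueError("need exactly one function per dataset")
    results = []
    for partitions, func in zip(datasets, funcs):
        # One "job" per dataset: the function sees an iterator over a partition.
        results.append([func(iter(partition)) for partition in partitions])
    return results
```

For example, `run_jobs_sketch([[[1, 2], [3, 4]], [[5, 6], [7, 8]]], [sum, sum])` returns `[[3, 7], [11, 15]]`: two batches, two partitions each, submitted together rather than one at a time.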

To try out Drizzle locally, we first build Spark following the existing instructions. For example, using SBT we can run:

  ./build/sbt package

We can then run the DrizzleRunningSum example with 4 cores for 10 iterations with group size 10. Note that this example requires at least 4GB of memory on your machine.

  ./bin/run-example --master "local-cluster[4,1,1024]" org.apache.spark.examples.DrizzleRunningSum 10 10

To compare this with existing Spark, we can run the same 10 iterations, but now with a group size of 1:

  ./bin/run-example --master "local-cluster[4,1,1024]" org.apache.spark.examples.DrizzleRunningSum 10 1
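To compare several group sizes without retyping the command, a small Python helper can build the same invocation and time it. This is a convenience sketch: the helper names are hypothetical, it must be run from the root of a built Spark checkout, and the wall-clock time it reports includes JVM and cluster startup, so it is only a coarse comparison.

```python
import subprocess
import time
from typing import List


def drizzle_running_sum_cmd(iterations: int, group_size: int) -> List[str]:
    """Build the run-example command shown above for the given arguments."""
    if iterations < 1 or group_size < 1:
        raise ValueError("iterations and group_size must be >= 1")
    return [
        "./bin/run-example",
        "--master", "local-cluster[4,1,1024]",
        "org.apache.spark.examples.DrizzleRunningSum",
        str(iterations),
        str(group_size),
    ]


def time_run(iterations: int, group_size: int) -> float:
    """Run the example once and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    # check=True raises CalledProcessError if the example exits with an error.
    subprocess.run(drizzle_running_sum_cmd(iterations, group_size), check=True)
    return time.perf_counter() - start
```

Calling `time_run(10, 1)` and `time_run(10, 10)` reproduces the two runs above.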

The benefit from using Drizzle is more apparent on large clusters. Results from running the single stage benchmark for 100 iterations on an Amazon EC2 cluster of 128 machines are shown below.

Status

The source code in this repository is a research prototype and only implements the scheduling techniques described in our paper. The existing Spark unit tests pass with our changes and we are actively working on adding more tests for Drizzle. We are also working towards a Spark JIRA to discuss integrating Drizzle with the Apache Spark project.

Finally, we would like to note that extensions to integrate Structured Streaming and Spark ML will be implemented separately.

For more details

For more details about the architecture of Drizzle, please see our Spark Summit 2015 Talk and our Technical Report.

Acknowledgements

This is joint work with Aurojit Panda, Kay Ousterhout, Mike Franklin, Ali Ghodsi, Ben Recht and Ion Stoica from the AMPLab at UC Berkeley.

Contributors

aarondav, adrian-wang, andrewor14, ankurdave, chenghao-intel, cloud-fan, dongjoon-hyun, gatorsmile, holdenk, hyukjinkwon, jegonzal, jerryshao, jkbradley, joshrosen, kayousterhout, liancheng, marmbrus, mateiz, mengxr, pwendell, rxin, sarutak, scrapcodes, shivaram, srowen, tdas, viirya, yanboliang, yhuai, zsxwing
