Spark SQL Performance Tests

This is a performance testing framework for Spark SQL in Apache Spark 1.3+.

Note: This README is still under development. Please also check our source code for more information.

How to use it

The rest of this document uses the TPC-DS benchmark as an example. We will add content explaining how to use other benchmarks and how to add support for a new benchmark dataset in the future.

Set up a dataset

Before running any query, a dataset needs to be set up by creating a Dataset object. Every benchmark supported in Spark SQL Perf implements its own Dataset class. A Dataset object takes a few parameters describing the benchmark data, and its setup function creates the needed tables. For the TPC-DS benchmark, the class is TPCDS in the package com.databricks.spark.sql.perf.tpcds. For example, to set up a TPC-DS dataset, you can do the following:

import org.apache.spark.sql.parquet.Tables
import com.databricks.spark.sql.perf.tpcds.TPCDS

// Tables in the TPC-DS benchmark used by experiments.
val tables = Tables(sqlContext)
// Set up the TPC-DS experiment.
val tpcds =
  new TPCDS(
    sqlContext = sqlContext,
    sparkVersion = "1.3.1",
    dataLocation = <the location of data>,
    dsdgenDir = <the location of dsdgen on every worker>,
    tables = tables.tables,
    scaleFactor = <scale factor>)

After a TPCDS object is created, its tables can be set up by calling

tpcds.setup()

The setup function first checks whether the needed tables are already stored at the location specified by dataLocation. If not, it creates them there using dsdgen, the data generator tool provided by the TPC-DS benchmark (this tool needs to be pre-installed at the location specified by dsdgenDir on every worker).
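
As a rough illustration, the check-then-generate behavior is similar to the following sketch (this is not the library's actual implementation, and the path is hypothetical):

import org.apache.hadoop.fs.{FileSystem, Path}

// Minimal sketch of the check that setup() performs, assuming a hypothetical location.
val dataLocation = "hdfs:///tpcds-data"  // hypothetical path
val fs = FileSystem.get(sqlContext.sparkContext.hadoopConfiguration)
if (!fs.exists(new Path(dataLocation))) {
  // Data is missing: setup() would run dsdgen (from dsdgenDir) on every
  // worker and write the generated tables to dataLocation.
}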

Run benchmarking queries

After setup, users can use the runExperiment function to run benchmarking queries and record query execution times. Taking TPC-DS as an example, you can start an experiment with:

tpcds.runExperiment(
  queries = <a Seq of Queries>,
  resultsLocation = <the root location of performance results>,
  includeBreakdown = <whether to measure the performance of every physical operator>,
  iterations = <the number of iterations>,
  variations = <variations used in the experiment>,
  tags = <tags of this experiment>)
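
A concrete invocation might look like the sketch below; the query-set name, parameter values, and tag types are assumptions for illustration, not documented API:

// Hypothetical values for illustration only.
tpcds.runExperiment(
  queries = tpcds.interactiveQueries,  // assumed pre-defined query set
  resultsLocation = "/perf/results",
  includeBreakdown = false,
  iterations = 3,
  variations = Seq.empty,              // no variations: default settings only
  tags = Map("cluster" -> "test"))     // assumed Map[String, String]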

For every experiment run (i.e., every call of runExperiment), Spark SQL Perf uses the timestamp of the start time to identify the experiment. Performance results are stored in JSON format in a sub-directory named by that timestamp under the given resultsLocation (for example, results/1429213883272).
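
Because results are plain JSON, a single run can also be inspected directly. A minimal sketch, assuming the example path above and the Spark 1.3 jsonFile API:

// Load the raw JSON results of one run (path taken from the example above).
val rawResults = sqlContext.jsonFile("results/1429213883272")
rawResults.printSchema()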

Retrieve results

The following code can be used to retrieve results:

// Get experiments results.
import com.databricks.spark.sql.perf.Results
val results = Results(resultsLocation = <the root location of performance results>, sqlContext = sqlContext)
// Get the DataFrame representing all results stored in the dir specified by resultsLocation.
val allResults = results.allResults
// Use DataFrame API to get results of a single run.
allResults.filter("timestamp = 1429132621024")
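
From there, the usual DataFrame API applies. For instance, a sketch of aggregating per-query execution times (the column names queryName and executionTime are assumptions about the result schema, so check the actual schema first):

import org.apache.spark.sql.functions.avg

// Hypothetical column names; inspect allResults.printSchema() before relying on them.
allResults
  .filter("timestamp = 1429132621024")
  .groupBy("queryName")
  .agg(avg("executionTime"))
  .show()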
