
db-benchmark's Introduction

Repository for reproducible benchmarking of database-like operations in a single-node environment.
The benchmark report is available at h2oai.github.io/db-benchmark.
We focused mainly on portability and reproducibility. The benchmark is routinely re-run to present up-to-date timings. Most of the solutions used are automatically upgraded to their stable or development versions.
This benchmark is meant to compare scalability in both data volume and data complexity.
Contributions and feedback are very welcome!

Tasks

  • groupby
  • join

Solutions

More solutions have been proposed. Some of them are not yet mature enough to address the benchmark questions well (e.g. modin). Others have not yet been evaluated or implemented. The status of each can be tracked in dedicated issues labelled "new solution" in the project repository.

Reproduce

Batch benchmark run

  • edit path.env and set the julia and java paths
  • if a solution uses python, create a new virtualenv as $solution/py-$solution; for example, for pandas use virtualenv pandas/py-pandas --python=/usr/bin/python3.6
  • install every solution by following the $solution/setup-$solution.sh scripts
  • edit run.conf to define the solutions and tasks to benchmark
  • generate data; for groupby use Rscript _data/groupby-datagen.R 1e7 1e2 0 0 to create G1_1e7_1e2_0_0.csv, re-save it to a binary format where needed (see below), then create a data directory and keep all data files there
  • edit _control/data.csv to define the data sizes to benchmark using the active flag
  • ensure swap is disabled and the ClickHouse server is not yet running
  • start the benchmark with ./run.sh
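The data generation step above can be sketched in Python to show how the four datagen arguments map to the output file name. This is only an illustration: the helper names are hypothetical, and the naming rule ("G1_" followed by the arguments joined with underscores) is inferred from the single example G1_1e7_1e2_0_0.csv, so it may not hold for other tasks or parameter values.

```python
def groupby_datagen_command(nrow, k, na, sort):
    """Return the argv list for the groupby data generator shown above."""
    return ["Rscript", "_data/groupby-datagen.R", nrow, k, str(na), str(sort)]


def groupby_data_name(nrow, k, na, sort):
    """Return the expected csv file name.

    Assumption: the name is "G1_" plus the four arguments joined by "_",
    as suggested by the example G1_1e7_1e2_0_0.csv.
    """
    return f"G1_{nrow}_{k}_{na}_{sort}.csv"


if __name__ == "__main__":
    args = ("1e7", "1e2", 0, 0)
    print(" ".join(groupby_datagen_command(*args)))
    print(groupby_data_name(*args))
```

Keeping the numeric arguments as strings such as "1e7" avoids Python reformatting them (for example to 10000000.0), which would no longer match the file name.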

Single solution benchmark

  • install solution software
    • for python we recommend using virtualenv for better isolation
    • for R ensure that the library is installed in a solution subdirectory, so that library("dplyr", lib.loc="./dplyr/r-dplyr") or library("data.table", lib.loc="./datatable/r-datatable") works
    • note that some solutions may require another solution to be installed to speed up csv data load; for example, dplyr requires data.table and, similarly, pandas requires (py)datatable
  • generate data using the _data/*-datagen.R scripts; for example, Rscript _data/groupby-datagen.R 1e7 1e2 0 0 creates G1_1e7_1e2_0_0.csv; put the data files in the data directory
  • run the benchmark for a single solution using ./_launcher/solution.R --solution=data.table --task=groupby --nrow=1e7
  • run other data cases by passing the extra parameters --k=1e2 --na=0 --sort=0
  • use --quiet=true to suppress the script's output and print timings only; use --print=question,run,time_sec to specify the columns printed to the console, or --print=* to print all of them
  • use --out=time.csv to write timings to a file rather than the console
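The launcher flags listed above can be assembled programmatically. The sketch below is a hypothetical helper (not part of the repository) that only builds the command line; it uses exactly the flags named in this section and does not assume any others exist.

```python
import shlex


def launcher_command(solution, task, nrow, k=None, na=None, sort=None,
                     quiet=False, print_cols=None, out=None):
    """Build the argv list for ./_launcher/solution.R from the flags above."""
    cmd = [
        "./_launcher/solution.R",
        f"--solution={solution}",
        f"--task={task}",
        f"--nrow={nrow}",
    ]
    # Optional data-case parameters are added only when given.
    for name, value in (("k", k), ("na", na), ("sort", sort)):
        if value is not None:
            cmd.append(f"--{name}={value}")
    if quiet:
        cmd.append("--quiet=true")
    if print_cols is not None:
        cmd.append(f"--print={print_cols}")
    if out is not None:
        cmd.append(f"--out={out}")
    return cmd


if __name__ == "__main__":
    cmd = launcher_command("data.table", "groupby", "1e7",
                           k="1e2", na=0, sort=0,
                           quiet=True, print_cols="question,run,time_sec")
    # shlex.join quotes arguments where a shell would need it (e.g. --print=*).
    print(shlex.join(cmd))
```

Passing the resulting list to subprocess.run (rather than a single string) avoids shell expansion of --print=*.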

Extra care needed

  • cudf
    • use conda instead of virtualenv
  • clickhouse
    • generate data having extra primary key column according to clickhouse/setup-clickhouse.sh
    • follow "reproduce interactive environment" section from clickhouse/setup-clickhouse.sh
  • pydatatable
  • dask
    • re-save csv groupby-1e9 and join-1e9 data into parquet format
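For the dask note above, re-saving a csv file as parquet might look like the following sketch. It assumes dask with a parquet engine (such as pyarrow) is installed, and the helper names and the example file name are hypothetical; the repository may use a different script or layout for this step.

```python
from pathlib import Path


def parquet_path(csv_path):
    """Return the csv path with its suffix replaced by .parquet."""
    return Path(csv_path).with_suffix(".parquet")


def resave_as_parquet(csv_path):
    """Read a csv with dask and write it back out in parquet format.

    Assumption: dask and a parquet engine are installed. The import is
    kept local so that parquet_path can be used without dask.
    """
    import dask.dataframe as dd

    out = parquet_path(csv_path)
    # dask writes a directory of part files at this path, not a single file.
    dd.read_csv(csv_path).to_parquet(str(out))
    return out


if __name__ == "__main__":
    # Hypothetical file name following the G1_* pattern used earlier.
    print(parquet_path("data/G1_1e9_1e2_0_0.csv"))
```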

Example environment

Acknowledgment

Timings for some solutions might be missing for particular data sizes or questions. Some functions are not yet implemented in all solutions, so we were unable to answer all questions in all solutions. Some solutions might also run out of memory when running the benchmark script, which results in the process being killed by the OS. Lastly, we also added a timeout for a single benchmark script run; once the timeout value is reached, the script is terminated. Please check the issues labelled "exceptions" in our repository for a list of issues and defects in solutions that make us unable to provide all timings.
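The failure modes described above (timeout, process killed by the OS, ordinary error) can be told apart when a script is launched from Python. The sketch below is one possible way to do this with the standard library; it is not the mechanism used by the benchmark itself, and the function name and outcome labels are hypothetical.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_sec):
    """Run cmd and classify the outcome as ok, error, killed or timeout."""
    try:
        # On timeout, subprocess.run kills the child and raises TimeoutExpired.
        proc = subprocess.run(cmd, timeout=timeout_sec)
    except subprocess.TimeoutExpired:
        return "timeout"
    if proc.returncode < 0:
        # On POSIX a negative return code means the process was terminated
        # by a signal, e.g. SIGKILL from the out-of-memory killer.
        return "killed"
    return "ok" if proc.returncode == 0 else "error"


if __name__ == "__main__":
    quick = [sys.executable, "-c", "pass"]
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(quick, 10))
    print(run_with_timeout(slow, 0.5))
```

Recording the outcome label next to each missing timing makes it easier to see why a given data size or question has no result.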

db-benchmark's People

Contributors

bkamins, jangorecki, mattdowle, michaelchirico, nalimilan, trivialfis
