NeuroCard

NeuroCard is a neural cardinality estimator for multi-table join queries.


NeuroCard's philosophy is to learn as much correlation as possible across tables, thereby achieving high accuracy.

Technical details can be found in the VLDB 2021 paper, NeuroCard: One Cardinality Estimator for All Tables [bibtex].

Quick start | Main modules | Running experiments | Contributors | Citation

Quick start

Set up a conda environment with dependencies installed:

# On Ubuntu/Debian
sudo apt install build-essential
# Install Python environment
conda env create -f environment.yml
conda activate neurocard
# Run commands below inside this directory.
cd neurocard

Download the IMDB dataset as CSV files and place them under datasets/job:

# Download size 1.2GB.
bash scripts/download_imdb.sh

# If you already have the CSVs or can export from a
# database, simply link to an existing directory.
# ln -s <existing_dir_with_csvs> datasets/job
# Run the following if the existing CSVs are without headers.
# python scripts/prepend_imdb_headers.py

Launch a short test run:

python run.py --run test-job-light

Main modules

Module                    | Description
--------------------------|------------------------------------------------------------
run                       | Main script to train and evaluate
experiments               | Registry of experiment configurations
common                    | Abstractions for columns, tables, joined relations; column factorization
factorized_sampler        | Unbiased join sampler
estimators                | Cardinality estimators: probabilistic inference for density models; inference for column factorization
datasets                  | Registry of datasets and schemas
Models: made, transformer | Deep autoregressive models: ResMADE & Transformer
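
To make the "column factorization" entry above more concrete, here is a minimal sketch of the general idea: a column with a large domain is split into several smaller sub-columns by slicing the bits of each value's integer code. The function names and the fixed chunk size are assumptions made for illustration; this is not the code in common or estimators, which may differ in details.

```python
from typing import List


def factorize(value_id: int, total_bits: int, chunk_bits: int) -> List[int]:
    """Split an integer code into sub-column codes, most significant chunk first.

    Hypothetical helper: each chunk holds `chunk_bits` bits, so a sub-column
    has at most 2**chunk_bits distinct values.
    """
    num_chunks = -(-total_bits // chunk_bits)  # ceiling division
    mask = (1 << chunk_bits) - 1
    return [(value_id >> (i * chunk_bits)) & mask
            for i in reversed(range(num_chunks))]


def defactorize(chunks: List[int], chunk_bits: int) -> int:
    """Inverse of factorize: reassemble the original integer code."""
    value = 0
    for chunk in chunks:
        value = (value << chunk_bits) | chunk
    return value


# A 10-bit code (domain size up to 1024) becomes two 5-bit sub-columns.
parts = factorize(1000, total_bits=10, chunk_bits=5)
restored = defactorize(parts, chunk_bits=5)
```

Because the split is lossless, a model can work with small per-sub-column vocabularies while still representing every original value.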

Running experiments

Launch training and evaluation using a single script:

# 'name' is a config registered in experiments.py.
python run.py --run <name>

Registered configs. Hyperparameters are statically declared in experiments.py. New experiments (e.g., changing query files; running hparam tuning) can be specified there.
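
As a rough illustration of a statically declared config registry, the sketch below derives a new experiment from an existing one by overriding a few hyperparameters. All names here (EXPERIMENT_CONFIGS, register_variant, and the config keys) are hypothetical and chosen for the example; consult experiments.py for the actual structure and keys.

```python
import copy

# Hypothetical base config; the real keys live in experiments.py.
BASE_CONFIG = {
    "epochs": 10,
    "bs": 2048,
    "queries_csv": "queries/base.csv",
}

# Hypothetical registry mapping a config name to its hyperparameters.
EXPERIMENT_CONFIGS = {"my-base": BASE_CONFIG}


def register_variant(name, base_name, **overrides):
    """Copy a registered config, apply overrides, and register the result."""
    cfg = copy.deepcopy(EXPERIMENT_CONFIGS[base_name])
    cfg.update(overrides)
    EXPERIMENT_CONFIGS[name] = cfg
    return cfg


# Example: a shorter run that evaluates a different (hypothetical) query file.
short_cfg = register_variant(
    "my-base-short", "my-base", epochs=1, queries_csv="queries/custom.csv")
```

Copying the base before overriding keeps the original entry unchanged, so both variants remain launchable by name.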

Configs for evaluation on pretrained checkpoints and full training runs:

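Benchmark        | Config (reload pretrained ckpt) | Config (re-train)                                         | Model       | Num Params
-----------------|---------------------------------|-----------------------------------------------------------|-------------|-----------
JOB-light        | job-light-reload                | job-light                                                 | ResMADE     | 1.0M
JOB-light-ranges | job-light-ranges-reload         | job-light-ranges                                          | ResMADE     | 1.1M
JOB-light-ranges | job-light-ranges-large-reload   | job-light-ranges-large                                    | Transformer | 5.4M
JOB-M            | job-m-reload                    | job-m                                                     | ResMADE     | 7.2M
JOB-M            | -                               | job-m-large (launch with --gpus=4 or lower the batch size) | Transformer | 107M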

The reload configs load pretrained checkpoints and run evaluation only. Normal configs start training afresh and also run evaluation.

Metrics & Monitoring. The key metrics to track are

  • Cardinality estimation accuracy (Q-errors): fact_psample_<num_psamples>_<quantile>
  • Quality of the density model: train_bits (negative log-likelihood in bits-per-tuple; lower is better).
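
For readers unfamiliar with the two metrics above, the following sketch shows how a Q-error and its quantiles are commonly computed, and how a negative log-likelihood in nats converts to bits. The helper names, the clamping of cardinalities to at least 1, and the nearest-rank quantile are assumptions for illustration; the repository's own computation may differ.

```python
import math


def q_error(estimate: float, actual: float) -> float:
    """Multiplicative error max(est/actual, actual/est); 1.0 is a perfect estimate.

    Both values are clamped to >= 1 (an assumption) to avoid division by zero.
    """
    est = max(estimate, 1.0)
    act = max(actual, 1.0)
    return max(est / act, act / est)


def quantile(values, q: float) -> float:
    """Nearest-rank quantile of a non-empty sequence, with q in [0, 1]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def nats_to_bits(nll_nats: float) -> float:
    """Convert a negative log-likelihood from nats to bits."""
    return nll_nats / math.log(2)


# Q-errors for a few (estimate, true cardinality) pairs.
errors = [q_error(e, a) for e, a in [(10, 100), (100, 10), (50, 50), (0, 5)]]
median_err = quantile(errors, 0.5)
max_err = quantile(errors, 1.0)
```

Over- and under-estimation by the same factor yield the same Q-error, which is why the metric is usually summarized by quantiles such as the median and the maximum.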

The standard output prints these metrics and can be piped into a log file. If TensorBoard is installed, use the following to visualize:

python -m tensorboard.main --logdir ~/ray_results/

Contributors

This repo was written by

Citation

@inproceedings{neurocard,
  title={{NeuroCard}: One Cardinality Estimator for All Tables},
  author={Yang, Zongheng and Kamsetty, Amog and Luan, Sifei and Liang, Eric and Duan, Yan and Chen, Xi and Stoica, Ion},
  journal={Proceedings of the VLDB Endowment},
  volume={14},
  number={1},
  pages={61--73},
  year={2021},
  publisher={VLDB Endowment}
}

Related projects. NeuroCard builds on top of Naru and Variable Skipping.
