Git Product home page Git Product logo

learnedsort's Introduction

Learned Sort

License: GPL v3 Build Status LGTM Grade LGTM Alerts

The official repository for Learned Sort, a model-enhanced sorting algorithm that was published in The Case for a Learned Sorting Algorithm.

Example:

#include "learned_sort.h"

int main(int argc, char *argv[]) {
  
    // Generate some data
    vector<double> arr = {...}

    // Sort in ascending order
    learned_sort::sort(arr.begin(), arr.end());
}

To get started, see Build instructions and Testing instructions.


Table of Contents

  1. Directory structure
  2. Instructions
    1. Building this project
    2. Running the benchmarks
    3. Running the unit tests
  3. Usage
    1. Basic usage
    2. More features
  4. Benchmark Results
    1. Benchmark setup
    2. Performance charts
    3. Special input types
  5. Limitations
  6. License & credits

Directory structure

This repository is organized as follows:

  • The Learned Sort algorithm implementation in include/learned_sort.h
  • Benchmarking code in benchmarks_driver.cc
  • Unit testing code under unit_tests/
  • Dependencies under third_party/

Instructions

Learned Sort is distributed as part of a header-only library. Therefore, in order the use it, it is enough to include the header file in your code.

#include "learned_sort.h"

However, besides the Learned Sort implementation, this repository contains benchmarking and unit testing code. In order to execute those, follow the instructions below.

Building this project

In order to build this project, you will need to have the following software installed in your system:

  • Git
  • CMake
  • Compiler with support for C++20

NOTE This repository has only been tested on GCC 9.3 and Clang 7.0.0 on Linux Ubuntu Focal (20.04 LTS).

# Clone this repository
git clone --recursive https://github.com/learnedsystems/learned-sort.git

# Change into the code directory
cd learned-sort

# Run the compilation script
./compile.sh

The build directory should now look like this:

/learned-sort
    /build
        /bin
            /LearnedSort_benchmarks
            /LearnedSort_tests
            /...
    /...

Running the benchmarks

We use Google Benchmark to measure the running times for Learned Sort and other sorting algorithm baselines for comparison. The benchmarks will be run for various input size, and with enough iterations to provide a stable statistic. The output will show the total running time (column "Time") and the time spent in the CPU (column "CPU") in milliseconds. Each row displays the name of the algorithm that is being benchmarked, followed by the input size and statistic type. The benchmark will repeat a few times (see run.sh) and it will report the mean, median, and standard deviation of the measurements.

In order to execute the benchmarks:

./run.sh

Running the unit tests

We use Google Test and GTest-Parallel to perform unit testing on Learned Sort and other baseline sorting algorithms on various data distributions and data types. After downloading this repository, run the tests to make sure everything is working fine for your system setup.

In order to execute the unit tests:

./test.sh

Should any of the tests fail, then GTest will display a summary of the errors that occurred, otherwise no output is displayed.

Usage

Basic Usage

Generate or read some input data in a vector container and use learned_sort::sort as a drop-in replacement for std::sort.

#include <iostream>
#include <random>
#include <vector>

#include "learned_sort.h"

int main(int argc, char *argv[])
{
    // Define some constants
    static constexpr size_t INPUT_SIZE = 1e6;
    static constexpr int MIN_KEY = 0;
    static constexpr int MAX_KEY = 1;

    // Generate some random uniformly distributed data
    std::random_device rd;
    std::mt19937 g(rd());
    std::uniform_real_distribution<> distribution(MIN_KEY, MAX_KEY);

    // Populate the input
    std::vector<double> arr;
    for (int i = 0; i < INPUT_SIZE; i++)
    {
        arr.push_back(distribution(g));
    }

    // Sort in ascending order
    learned_sort::sort(arr.begin(), arr.end());

    // Verify that the input is now sorted
    if(std::is_sorted(arr.begin(), arr.end()))
        std::cout << "Sorted!" << std::endl;
}

Function signatures

The Learned Sort library provides the following functions:

// Sort in ascending order
template <class RandomIt>
void learned_sort::sort(RandomIt begin, RandomIt end);

// Sort in ascending order using the specified RMI parameters
template <class RandomIt>
void learned_sort::sort(RandomIt begin, RandomIt end, learned_sort::RMI::Params &params);

// Train an RMI using the given RMI parameters
template <class RandomIt>
learned_sort::RMI learned_sort::train(RandomIt begin, RandomIt end, learned_sort::RMI::Params &p);

Training and sorting parameters

In order to sort using a custom set of parameters, you can create an RMI::Params object as follows, and pass it to the sorting function:

RMI::Params p; // Default parameters

or

RMI::Params p(
    sampling_rate,      // float, [0-1],    default .01
    overallocation,     // float, >=1,      default 1.1
    fanout,             // unsigned int,    default 1000
    batch_size,         // unsigned int,    default 10
    threshold,          // unsigned int,    default 100
    model_architecture  // vector<unsigned int>, default {1, 1000} (Each number represents the number of linear models in the layer of the RMI)
    );

Benchmarking on different data distributions

The default distribution of the input data for the benchmarks is set to a Normal distribution, however, we provide synthetic data generators for:

  • Exponential distribution
  • Lognormal distribution
  • Normal distribution
  • Uniform distribution
  • Mixture of Gaussians distribution
  • Chi Squared distribution
  • Zipf distribution
  • RootDups distribution (see: BlockQuicksort paper p.15)

Benchmark results

In the following sections we give concrete performance numbers for a particular server-grade computer.

Benchmark setup

  • CPU: Intel® Xeon® Gold 6150 CPU @ 2.70GHz
  • RAM: 376GB
  • OS: Linux Ubuntu 20.04, kernel 5.4.0-26-generic
  • CXX: GCC 9.3.0-10ubuntu2

Performance charts

The following chart displays the performance of LearnedSort and other sorting algorithms on a set of randomly generated input distributions containing 10M keys. The histograms in the vertical axis correspond to the shape of the distributions used for the benchmark.

Special input types

In the cases when the input distribution contain a large number of duplicates (e.g. Zipf distribution), or when the input is in some special order, the performance of LearnedSort changes as follows. (10M keys)

Limitations

  • This implementation of Learned Sort is only compatible with 2-layer RMI models as they have resulted in the most optimal performance.
  • Currently it is not possible to sort records that contain payloads, i.e., the algorithm can only shuffle the keys in a sorted order. We are in the works of adding support for payloads.
  • This implementation only supports numerical keys, while strings and other complex types are currently unsupported.
  • This is the out-of-place variant of Learned Sort.

Known bugs

Refer to the Issues page for known bugs.

License & Credits

This work is licensed under the GNU General Public License v3.0 and any use in academic settings must cite the corresponding paper:

@inproceedings{10.1145/3318464.3389752,
author = {Kristo, Ani and Vaidya, Kapil and \c{C}etintemel, Ugur and Misra, Sanchit and Kraska, Tim},
title = {The Case for a Learned Sorting Algorithm},
year = {2020},
isbn = {9781450367356},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3318464.3389752},
doi = {10.1145/3318464.3389752},
booktitle = {Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data},
pages = {1001–1016},
numpages = {16},
keywords = {linear models, linear interpolation, learned algorithm, RMI, sorting algorithm, ML for systems, CDF, sorting},
location = {Portland, OR, USA},
series = {SIGMOD '20}
}
  

learnedsort's People

Contributors

anikristo avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.