Learned Sort

The official repository for Learned Sort, a model-enhanced sorting algorithm that was published in The Case for a Learned Sorting Algorithm.

Example:

#include "learned_sort.h"

int main(int argc, char *argv[]) {
  
    // Generate some data
    vector<double> arr = {...}

    // Sort in ascending order
    learned_sort::sort(arr.begin(), arr.end());
}

To get started, see Build instructions and Testing instructions.

Directory structure
Instructions
Usage
1. Basic usage
2. More features
Benchmark Results
Limitations
License & credits

Directory structure

This repository is organized as follows:

The Learned Sort algorithm implementation in include/learned_sort.h
Benchmarking code in benchmarks_driver.cc
Unit testing code under unit_tests/
Dependencies under third_party/

Instructions

Learned Sort is distributed as part of a header-only library. Therefore, in order the use it, it is enough to include the header file in your code.

#include "learned_sort.h"

However, besides the Learned Sort implementation, this repository contains benchmarking and unit testing code. In order to execute those, follow the instructions below.

Building this project

In order to build this project, you will need to have the following software installed in your system:

Git
CMake
Compiler with support for C++20

NOTE This repository has only been tested on GCC 9.3 and Clang 7.0.0 on Linux Ubuntu Focal (20.04 LTS).

# Clone this repository
git clone --recursive https://github.com/learnedsystems/learned-sort.git

# Change into the code directory
cd learned-sort

# Run the compilation script
./compile.sh

The build directory should now look like this:

/learned-sort
    /build
        /bin
            /LearnedSort_benchmarks
            /LearnedSort_tests
            /...
    /...

Running the benchmarks

We use Google Benchmark to measure the running times for Learned Sort and other sorting algorithm baselines for comparison. The benchmarks will be run for various input size, and with enough iterations to provide a stable statistic. The output will show the total running time (column "Time") and the time spent in the CPU (column "CPU") in milliseconds. Each row displays the name of the algorithm that is being benchmarked, followed by the input size and statistic type. The benchmark will repeat a few times (see run.sh) and it will report the mean, median, and standard deviation of the measurements.

In order to execute the benchmarks:

./run.sh

Running the unit tests

We use Google Test and GTest-Parallel to perform unit testing on Learned Sort and other baseline sorting algorithms on various data distributions and data types. After downloading this repository, run the tests to make sure everything is working fine for your system setup.

In order to execute the unit tests:

./test.sh

Should any of the tests fail, then GTest will display a summary of the errors that occurred, otherwise no output is displayed.

Usage

Basic Usage

Generate or read some input data in a vector container and use learned_sort::sort as a drop-in replacement for std::sort.

#include <iostream>
#include <random>
#include <vector>

#include "learned_sort.h"

int main(int argc, char *argv[])
{
    // Define some constants
    static constexpr size_t INPUT_SIZE = 1e6;
    static constexpr int MIN_KEY = 0;
    static constexpr int MAX_KEY = 1;

    // Generate some random uniformly distributed data
    std::random_device rd;
    std::mt19937 g(rd());
    std::uniform_real_distribution<> distribution(MIN_KEY, MAX_KEY);

    // Populate the input
    std::vector<double> arr;
    for (int i = 0; i < INPUT_SIZE; i++)
    {
        arr.push_back(distribution(g));
    }

    // Sort in ascending order
    learned_sort::sort(arr.begin(), arr.end());

    // Verify that the input is now sorted
    if(std::is_sorted(arr.begin(), arr.end()))
        std::cout << "Sorted!" << std::endl;
}

Function signatures

The Learned Sort library provides the following functions:

// Sort in ascending order
template <class RandomIt>
void learned_sort::sort(RandomIt begin, RandomIt end);

// Sort in ascending order using the specified RMI parameters
template <class RandomIt>
void learned_sort::sort(RandomIt begin, RandomIt end, learned_sort::RMI::Params &params);

// Train an RMI using the given RMI parameters
template <class RandomIt>
learned_sort::RMI learned_sort::train(RandomIt begin, RandomIt end, learned_sort::RMI::Params &p);

Training and sorting parameters

In order to sort using a custom set of parameters, you can create an RMI::Params object as follows, and pass it to the sorting function:

RMI::Params p; // Default parameters

RMI::Params p(
    sampling_rate,      // float, [0-1],    default .01
    overallocation,     // float, >=1,      default 1.1
    fanout,             // unsigned int,    default 1000
    batch_size,         // unsigned int,    default 10
    threshold,          // unsigned int,    default 100
    model_architecture  // vector<unsigned int>, default {1, 1000} (Each number represents the number of linear models in the layer of the RMI)
    );

Benchmarking on different data distributions

The default distribution of the input data for the benchmarks is set to a Normal distribution, however, we provide synthetic data generators for:

Exponential distribution
Lognormal distribution
Normal distribution
Uniform distribution
Mixture of Gaussians distribution
Chi Squared distribution
Zipf distribution
RootDups distribution (see: BlockQuicksort paper p.15)

Benchmark results

In the following sections we give concrete performance numbers for a particular server-grade computer.

Benchmark setup

CPU: Intel® Xeon® Gold 6150 CPU @ 2.70GHz
RAM: 376GB
OS: Linux Ubuntu 20.04, kernel 5.4.0-26-generic
CXX: GCC 9.3.0-10ubuntu2

Performance charts

The following chart displays the performance of LearnedSort and other sorting algorithms on a set of randomly generated input distributions containing 10M keys. The histograms in the vertical axis correspond to the shape of the distributions used for the benchmark.

Special input types

In the cases when the input distribution contain a large number of duplicates (e.g. Zipf distribution), or when the input is in some special order, the performance of LearnedSort changes as follows. (10M keys)

Limitations

This implementation of Learned Sort is only compatible with 2-layer RMI models as they have resulted in the most optimal performance.
Currently it is not possible to sort records that contain payloads, i.e., the algorithm can only shuffle the keys in a sorted order. We are in the works of adding support for payloads.
This implementation only supports numerical keys, while strings and other complex types are currently unsupported.
This is the out-of-place variant of Learned Sort.

Known bugs

Refer to the Issues page for known bugs.

License & Credits

This work is licensed under the GNU General Public License v3.0 and any use in academic settings must cite the corresponding paper:

@inproceedings{10.1145/3318464.3389752,
author = {Kristo, Ani and Vaidya, Kapil and \c{C}etintemel, Ugur and Misra, Sanchit and Kraska, Tim},
title = {The Case for a Learned Sorting Algorithm},
year = {2020},
isbn = {9781450367356},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3318464.3389752},
doi = {10.1145/3318464.3389752},
booktitle = {Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data},
pages = {1001–1016},
numpages = {16},
keywords = {linear models, linear interpolation, learned algorithm, RMI, sorting algorithm, ML for systems, CDF, sorting},
location = {Portland, OR, USA},
series = {SIGMOD '20}
}

zzpw / learnedsort Goto Github PK

learnedsort's Introduction