sliding-window-aggregators's Introduction

Sliding Window Aggregators

This repo contains reference implementations of sliding window aggregation algorithms.

All of these algorithms require operators that are associative. We classify the algorithms into two groups: those that require data to arrive in-order, and those that allow data to arrive out-of-order. We refer to the algorithms that require data to arrive in-order as FIFO algorithms, as they assume first-in, first-out semantics. We refer to the algorithms that tolerate disordered data as general algorithms.

The algorithmic complexities given below are with respect to the size of the window, n.

A tutorial and encyclopedia article provide more background on sliding window aggregation algorithms.

DABA

DABA Lite

FiBA

FlatFIT

IOA

Two-Stacks

  • full name: Two-Stacks
  • ordering: in-order required
  • operator requirements: associativity
  • time complexity: worst-case O(n), amortized O(1)
  • space requirements: 2n
  • first appeared: adamax on Stack Overflow
  • implementations: C++, Rust
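
The core idea is straightforward to sketch: keep two stacks whose entries pair each value with a running aggregate, so a query only combines the two stack tops; when the front stack runs dry, the back stack is flipped over and aggregates are recomputed in the opposite direction. Below is an illustrative Rust sketch of that idea, not the repo's actual API; all names are made up for exposition.

// Illustrative sketch of the Two-Stacks technique (not the repo's API).
// `combine` is any associative binary function supplied by the caller.
struct TwoStacksSketch<T, F> {
    front: Vec<(T, T)>, // (value, aggregate of this value and everything newer in `front`)
    back: Vec<(T, T)>,  // (value, aggregate of everything older in `back` plus this value)
    combine: F,
}

impl<T: Clone, F: Fn(&T, &T) -> T> TwoStacksSketch<T, F> {
    fn new(combine: F) -> Self {
        Self { front: Vec::new(), back: Vec::new(), combine }
    }

    fn insert(&mut self, v: T) {
        let agg = match self.back.last() {
            Some((_, a)) => (self.combine)(a, &v),
            None => v.clone(),
        };
        self.back.push((v, agg));
    }

    fn evict(&mut self) {
        if self.front.is_empty() {
            // Flip: move everything from `back` onto `front`, recomputing
            // aggregates in the opposite direction. This is the O(n) worst
            // case; amortized over n evictions it is O(1).
            while let Some((v, _)) = self.back.pop() {
                let agg = match self.front.last() {
                    Some((_, a)) => (self.combine)(&v, a),
                    None => v.clone(),
                };
                self.front.push((v, agg));
            }
        }
        let _ = self.front.pop(); // drop the oldest element
    }

    fn query(&self) -> Option<T> {
        match (self.front.last(), self.back.last()) {
            (Some((_, f)), Some((_, b))) => Some((self.combine)(f, b)),
            (Some((_, f)), None) => Some(f.clone()),
            (None, Some((_, b))) => Some(b.clone()),
            (None, None) => None,
        }
    }
}

fn main() {
    let mut w = TwoStacksSketch::new(|a: &i64, b: &i64| a + b);
    for x in 1..=5 {
        w.insert(x);
    }
    w.evict(); // window is now [2, 3, 4, 5]
    assert_eq!(w.query(), Some(14));
}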

Two-Stacks Lite

Reactive

Recalc

  • full name: Re-Calculate From Scratch
  • ordering: out-of-order allowed
  • operator requirements: none
  • time complexity: O(n)
  • space requirements: n
  • first appeared: no known source
  • implementations: C++, Rust
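
Because this is the baseline the other algorithms improve on, a tiny illustrative sketch (not the repo's API; names are made up) makes the O(n) query cost concrete: the window sits in a plain queue and every query folds over it from scratch.

use std::collections::VecDeque;

// Illustrative sketch of recompute-from-scratch (not the repo's API).
// Queries cost O(n), but nothing is required of the operator beyond
// being a binary function.
struct RecalcSketch<T> {
    window: VecDeque<T>,
}

impl<T: Clone> RecalcSketch<T> {
    fn new() -> Self {
        Self { window: VecDeque::new() }
    }
    fn insert(&mut self, v: T) {
        self.window.push_back(v);
    }
    fn evict(&mut self) {
        let _ = self.window.pop_front();
    }
    fn query(&self, combine: impl Fn(&T, &T) -> T) -> Option<T> {
        let mut iter = self.window.iter();
        let first = iter.next()?.clone();
        Some(iter.fold(first, |acc, x| combine(&acc, x)))
    }
}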

SOE

  • full name: Subtract on Evict
  • ordering: out-of-order allowed
  • operator requirements: associativity, invertibility
  • time complexity: worst-case O(1)
  • space requirements: n
  • first appeared: no known source
  • implementations: C++ (strictly in-order), Rust (strictly in-order)
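
The technique fits in a few lines: keep the raw values plus one running aggregate; insertion combines the new value in, and eviction applies the inverse of the evicted value. Below is an illustrative, FIFO-style sketch (not the repo's API; names are made up), using integer addition as the invertible operator.

use std::collections::VecDeque;

// Illustrative sketch of subtract-on-evict (not the repo's API), with
// addition as the invertible operator: the inverse of "+ v" is "- v".
struct SoeSketch {
    window: VecDeque<i64>,
    sum: i64, // running aggregate of everything currently in the window
}

impl SoeSketch {
    fn new() -> Self {
        Self { window: VecDeque::new(), sum: 0 }
    }
    fn insert(&mut self, v: i64) {
        self.window.push_back(v);
        self.sum += v; // combine the new value into the aggregate
    }
    fn evict(&mut self) {
        if let Some(v) = self.window.pop_front() {
            self.sum -= v; // apply the inverse of the evicted value
        }
    }
    fn query(&self) -> i64 {
        self.sum // worst-case O(1) for insert, evict, and query
    }
}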

Amortized MTA (AMTA)

sliding-window-aggregators's People

Contributors

hirzel, ktangwon, scotts, segeljakt, stevemar, takoyaki65

sliding-window-aggregators's Issues

Question about FiBA implementation: updating hitLeft information in rebalanceAfterEvict function

Hello,

I was reading the paper "Optimal and General Out-of-Order Sliding-Window Aggregation" and took a look at the FiBA implementation code available on GitHub. While inspecting the code, I came across the function Node* rebalanceAfterEvict(Node* node, bool* hitLeft, bool* hitRight, Node* toRepair=NULL) on line 832 of FiBA.hpp.

*hitRight |= sibling->rightSpine();

I noticed that on that line, only the hitRight information is being updated, while the hitLeft information is not being updated. This seems to be a problem, because when rebalancing a node that contains a tuple like (st, 20, t) (which is stored in the right child of the root node), the sibling node would become the left spine, and therefore its aggregated value would also need to be updated.

Could you please explain why only hitRight is being updated, and not hitLeft? Am I missing something here?

Thank you in advance for your help!

Rust version of FlatFIT sometimes fails test

Sometimes when running the tests, FlatFIT fails with:

thread 'test2::flatfit' panicked at 'attempt to add with overflow', /Users/scottschneider/Documents/gh/sliding-window-aggregators/rust/src/flatfit/mod.rs:134:27
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I suspect this only happens sometimes because I think the tests are generating random numbers. We should make that deterministic and diagnose what's happening.
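
As a starting point, here is a hedged sketch of what deterministic input generation could look like, assuming the tests draw their random input from the rand crate (0.8-style API); the function name below is made up for illustration.

use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};

// Illustrative only: a fixed seed makes every test run see the same
// input stream, so a failure like the overflow above becomes reproducible.
fn deterministic_input(len: usize) -> Vec<u64> {
    let mut rng = StdRng::seed_from_u64(0xDEAD_BEEF);
    (0..len).map(|_| rng.gen_range(0..1_000_000)).collect()
}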

Improve sliding window background in README

Ideally, the background in the README should teach some of the basic concepts of sliding window aggregation. This material is covered in the linked tutorial and encyclopedia article, but that should be adapted to this repo.

What should `evict` in `TimeWindow` return?

Quoth @ktangwon,

In the case of a time window, the data structure should indicate whether or not the given timestamp leads to a successful eviction. But I'm not 100% sure what we should make it return, e.g., a flag or some real data.

This was forked from the discussion in PR #48, which changed pop in FifoWindow to not have a return value.

Design Rust experiments

Currently, the Rust implementations are using the Criterion crate to experimentally compare the implementations. I've spent some time with them, and done a deep-dive into Criterion itself. At the moment, my conclusion is that it doesn't quite meet our needs. My main concerns:

  1. Criterion is exclusively a time-based benchmarking framework. What I mean by that is that users tell Criterion how long to run for. Since how long the experiments are run for is static, the workload itself is necessarily variable. The experiments we have used to assess the current SWAGs, and which we have published, work the other way around: we keep the workload static and allow the run time to vary. (In fact, how long they take is one of the main things we report.) The Criterion author makes a case that for microbenchmarks this is fine, and I agree that for most use cases it is. But for us, since we're simulating a streaming system, it's imperative that (say) we actually process 5 million elements, even if it takes 5 minutes. We get less information if we only process 83,000 elements in 5 seconds. We need to make sure we process enough elements so that we can correctly scale up our window sizes, and to make sure we fully exercise all parts of a SWAG. I find our requirements don't mesh with Criterion being a time-based benchmarking framework.
  2. I have had difficulty generating the kinds of comparisons across the dimensions that I want (SWAG, window size).
  3. Collecting and storing raw data does not seem to be a priority of the framework. That's critical for eventually publishing the experiments.
  4. This is a minor point compared to the others, and I could live with it if all of the above worked for our needs easily, but: I'm not comfortable that parts of the experiment are run in code outside of our control. When investigating performance over the years, having complete control over the full lifecycle of the program has helped us fully understand the behaviors we see. When we control everything, we can be confident that differences in performance come from differences in the SWAGs and not from experimental artifacts. (As an example of the sort of thing we have done while investigating performance anomalies, see https://github.com/scotts/sliding-window-aggregators/blob/3b756f77baa340fa9f4b04a1ce486c4498f88292/cpp/src/data_benchmark.cc#L50-L54.)

The good news is that Rust makes writing our own benchmarks much simpler than C++! For example, my mess of template overloads in C++ can become a proper Rust macro, resulting in something much more understandable and extendable:

    query_run! {
        opts =>
            [[ReCalc, Sum, Max],
             [TwoStacks, Sum, Max],
             [Reactive, Sum, Max],
             [SoE, Sum]]
    }

See https://github.com/scotts/sliding-window-aggregators/blob/rust-exp/rust/benches/bench_static.rs for the Rust equivalent to https://github.com/scotts/sliding-window-aggregators/blob/rust-exp/cpp/src/benchmark_driver.cc. Note that the C++ benchmark driver includes a header file which has now also gotten out of hand.

I had hoped we could keep both the Criterion and by-hand benchmarks around at the same time, but I ran into difficulties with that approach; in particular, I ran into conflicts with options when using cargo bench.

@segeljakt, I'm interested to hear your thoughts.

Extend Rust operators and SWAG traits to account for lift and lower

The current Rust traits do not allow for the full type flexibility of the published algorithms and of our current C++ implementations. Specifically, the Rust traits assume that the type that is inserted is necessarily the same as the return type of a query. The implementations also use the same type for storing partial results.

In our papers, we introduced the lift operation which converts an input type In into the stored type Partial. The Partial type is stored in the window, and the aggregation operator is a binary function on the Partial type. We also introduced the lower operation which converts the Partial type into the Out type. In the SWAG, we insert In types, and internally we use lift to convert those into Partials. Queries return values of type Out, and use lower to turn a Partial into Out.

I was forced to contend with this when I started implementing a mean operator in Rust, as the input type is a number (say, i32), but the partial type must have a sum and count (say, {sum: i32, count: usize}). What I have currently is:

#[derive(Copy, Clone)]
pub struct Mean;

#[derive(Copy, Clone, Eq, PartialEq)]
struct MeanPartial<T> {
    sum: T,
    n: usize,
}

Even if we assume the above trait issue regarding lift and lower is solved, extending this approach starts to get awkward. The binary function must be defined on MeanPartial:

impl AbstractMagma<Mean> for MeanPartial<i32> {
    fn operate(&self, other: &Self) -> Self {
        MeanPartial::<i32>{sum: self.sum + other.sum,
                           n: self.n + other.n}
    }
}

By itself, that's not a problem. But having to define things in terms of MeanPartial will extend everywhere. The main problem I see is that the partial type is exposed to the user who creates a window. For example, even if we assume that the current traits allow for all three types, users would have to do something like TwoStacks::<i32, MeanPartial<i32>, i32, Mean> when specifying a window's type.

In C++, we "hid" the partial definition inside of the operator. Window implementations knew they could ask the operator what its partial type is. I am going to explore a similar approach in Rust, so that users will be able to do something closer to TwoStacks::<Mean<i32>>.
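
A rough sketch of that direction (illustrative only, not a settled design; the lift/combine/lower names follow the papers, while the trait name is made up): the operator carries all three types as associated types, so a window only needs the operator as a type parameter and the partial type never appears in user code.

// Illustrative sketch only: the operator owns In, Partial, and Out, so
// windows can be parameterized by the operator alone.
trait AggregateOperator {
    type In;
    type Partial;
    type Out;

    fn lift(&self, v: Self::In) -> Self::Partial;
    fn combine(&self, a: &Self::Partial, b: &Self::Partial) -> Self::Partial;
    fn lower(&self, p: &Self::Partial) -> Self::Out;
}

#[derive(Copy, Clone, Default)]
struct Mean;

#[derive(Copy, Clone, Eq, PartialEq)]
struct MeanPartial {
    sum: i64,
    n: usize,
}

impl AggregateOperator for Mean {
    type In = i64;
    type Partial = MeanPartial;
    type Out = f64;

    fn lift(&self, v: i64) -> MeanPartial {
        MeanPartial { sum: v, n: 1 }
    }
    fn combine(&self, a: &MeanPartial, b: &MeanPartial) -> MeanPartial {
        MeanPartial { sum: a.sum + b.sum, n: a.n + b.n }
    }
    fn lower(&self, p: &MeanPartial) -> f64 {
        p.sum as f64 / p.n as f64
    }
}

With something along these lines, a user could write roughly TwoStacks::<Mean> (or TwoStacks::<Mean<i64>> if Mean stays generic), and MeanPartial remains an implementation detail of the operator.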

@segeljakt, I'll let you know when I have something concrete so I can get your feedback.

In Rust trait, should `pop` return a value?

@hirzel brought this up in a PR:

I think pop should not return anything. I can see from the implementation of some of the aggregation algorithms that it lowers the oldest item and returns that. There are two problems with that. First, it is wasteful to do the lowering if that return value gets discarded by the caller. Second, and more importantly, not all aggregation algorithms hold on to that oldest item, so it is unavailable to be lowered and returned. Specifically, Two-Stacks Lite and DABA Lite don't keep around singleton partials for the front portion of the queue, since that portion is optimized for eviction, where it makes more sense to keep around partials for larger sublists.

I had a similar reaction when @segeljakt first designed FifoWindow. The main reason for doing it is that it lines up with Rust conventions. Regarding point 1, I don't think it's a major concern: if the return value of pop is unused, and the call to pop itself is inlined, dead-code elimination should avoid any overhead (even the call to lower should go away).

Point 2 is more of a concern, but I admit I don't fully follow it. Is the problem that in some cases, the partial result from the thing being evicted is nonsensical to return?

Rust implementation of the chunked-array queue

Implementing a doubly-linked structure is not straightforward in Rust for several reasons. @segeljakt has a WIP (#24) with an implementation. His thoughts on the design follow:

I think there should be one or more chunked array queue implementations floating around on https://crates.io/. The solution I had in mind before was to use:

  • a std::collections::LinkedList for connecting the chunks
  • a std::vec::Vec (or array [T;N]) for storing each chunk.
  • a std::collections::linked_list::CursorMut for traversing between chunks.
  • a higher-level Cursor, building on top of CursorMut for traversing between elements of chunks.

A cursor is like an iterator but does not consume the thing it is iterating over. With a DoubleEndedIterator, the problem is that you cannot visit the same element twice:

fn main() {
    let numbers = vec![1, 2, 3, 4, 5, 6];

    let mut iter = numbers.iter();

    assert_eq!(Some(&1), iter.next());
    assert_eq!(Some(&6), iter.next_back());
    assert_eq!(Some(&5), iter.next_back());
    assert_eq!(Some(&2), iter.next());
    assert_eq!(Some(&3), iter.next());
    assert_eq!(Some(&4), iter.next());
    assert_eq!(None, iter.next());
    assert_eq!(None, iter.next_back());
}

If you need multiple cursors, you might have to create them using an unsafe pointer. I think it is not possible to use a std::cell::RefCell + std::rc::Rc because the cursors must mutably borrow the linked list which they are traversing for their whole lifetime. With RefCell, Rust would panic if two cursors exist simultaneously. There should hopefully be no problem as long as you ensure that the cursors do not point to something which is deallocated.

fn main() {
    use std::collections::LinkedList;
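    // Note: LinkedList cursors (CursorMut, cursor_back_mut) are an unstable
    // API; this example needs a nightly compiler with the
    // `linked_list_cursors` feature enabled at the crate root.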
    let mut l: LinkedList<[i32; 5]> = LinkedList::new();

    let (mut c0, mut c1, mut c2) = unsafe {
        let l = &mut l;
        let l = l as *mut LinkedList<[i32; 5]>;
        let c0 = l.as_mut().unwrap().cursor_back_mut();
        let c1 = l.as_mut().unwrap().cursor_back_mut();
        let c2 = l.as_mut().unwrap().cursor_back_mut();
        (c0, c1, c2)
    };

    l.push_back([0,1,2,3,4]);
    l.push_back([5,6,7,8,9]);

    c0.move_next();
    c1.move_next();
    c2.move_next();
    c2.move_next();

    l.push_front([4,3,2,1,0]);

    c0.move_prev();

    println!("{:?}", l);

    assert_eq!(c0.current(), Some(&mut [4,3,2,1,0]));
    assert_eq!(c1.current(), Some(&mut [0,1,2,3,4]));
    assert_eq!(c2.current(), Some(&mut [5,6,7,8,9]));
}

And,

I asked in the Rust community and was recommended this crate: https://crates.io/crates/bucket_queue

Generalize algorithm links per language in README

In the current README, when we mention a particular algorithm, we link to its C++ implementation. This does not scale as we add implementations in new languages. We should come up with a format that makes it easy to point readers to implementations in the language they are interested in.

How to structure and design experimental scripts

The experimental scripts were written with the C++ implementations in mind. Comparing across language implementations would also be interesting, but this means that:

  1. The scripts need to have a concept of language implementation.
  2. Each language will need a benchmark driver that performs equivalent experiments, ideally exposing the same command-line interface.
