Git Product home page Git Product logo

aronnax's Introduction

Aronnax

Aronnax is a document-based database that permits queries over the history of a document (longitudinal) as well as normal relational queries. I would also like to support continuous views, in which query result sets are updated in real-time as the database applies changes to documents.

The goal here is to investigate the construction of such a logging database, and how the desire to query over history influences the design and implementation of indexes, views, on-disk storage, query language/API.

Overview

Documents and Streams

A document is identified by a UUID and consists of a bag of key/value pairs. Keys are variable-length strings, and values are most likely ints, flots or strings (maybe lists?).

Aronnax stores streams. A stream is the history of the key/value pairs in the bag for a particular UUID. As such, a document at time t is the keys and values in the bag at time t. Over time, keys may have their values changed, or may be added (with a value) or removed from a document.

Each edit of a document occurs at some time t, so it becomes possible to query a document as it existed before or after that time. This implements a timeline of events for a given document. We want to be able to query across all documents at time t, and also query a document over a range of times t0 < t1

A naive implementation may implement a document stream by replicating the full value of a document at every edit timestamp, but it might be better to implement a stream as a collection of "diffs" created by each document update. The value of a document at time t is then the "sum" or "rollup" of all diffs up to (and including?) time t.

My suspicion is that a log-based approach will reduce conflicts between concurrent writers by offering a natural serialization (probably order of delivery) that works with concurrent readers, much in the spirit of data structures such as the time-ordered multiversion concurrency control mechanisms described here. However, despite the larger size required for the full-duplication approach, the reads might be much faster.

Queries

"Vertical" queries behave like one would expect in a normal document database, but are indexed by when keys/values are added/removed/changed. These queries are defined by special syntax in the where clause of a query, and perform relational-type queries over all documents as they exist at a time t, which default to "now". These queries return documents, keys and values.

  • This query returns the unique set of UUIDs that matched the given predicate in the last day since the query was written.

    -- find all stream UUIDs that were in 410 Soda in the past day
    select unique uuid where (Room = 410 and Building = Soda) in (now, now -1d);
  • On a modular sensor platform, the type of temperature sensor was changed from an SHT11 to an SHT13. We want to discover which motes had a sensor changed (assuming we know the time of the change)

    select * where Sensor/Model = 'SHT11' before 1447366661s and Sensor/Model = 'SHT13' after 1447366661s;

    Note that the s in 1447366661s indicates that the number is to be interpreted as a unix timestamp in seconds.

Obviously, there are some new semantics that we would like to cover. where clauses need an additional qualification that identifies how the predicates are to be applied over time. These query semantics can be applied to a single clause, or a compound clause (using and or or):

  • where key=val in <time range>, e.g. where Room = 410 in (now, now -5min). True if the predicate was true at any point within that time range.
  • where key=val for <time range>, e.g. where Room = 410 for (now, now -5min). True if the predicate is true for the entire time range.
  • where key=val before <time>, e.g. where Room = 410 before 1447366661s. True if predicate true at any time before the given time.
  • where key=value ibefore <time>, e.g. ... ibefore 1447366661s. True if the predicate is true in the most immediate edit before the given time.
  • where key=val after <time>, e.g. where Room = 410 after 1447366661s. True if predicate true at any time after the given time.
  • where key=value iafater <time>, e.g. ... iafter 1447366661s. True if the predicate is true in the most immediate edit after the given time.

Alternatively:

  • in <time range>
  • for/during <time range>
  • before <time>
  • at <time> (ibefore)
  • since (after)
  • after (iafter)

Consider being able to select a special @time field on these queries, which will augment the results with the time they were fetched from.


"Horizontal" queries operate across the historical values for a key in a document, and are defined by special syntax in the select clause of a query. These queries return times or ranges of times, augmented with keys or values, and are useful for determining "when" something happened, and maybe some additional details about values at that time. The "vertical" queries assume there is knowledge about time, but the two types of queries can be mixed.

Data Structures

Data structure choice is going to be important here. Here are the influencing decisions, as determined by the queries we want to enable:

  • partial matches on strings in both keys and values
    • substring, at the very least (.*abc, abc.*, .*abc.*)
    • maybe case insensitive?
    • maybe full regex? Probably very slow, but let it through anyway. Optimize for the above cases
  • grouping keys by the document/stream identifier, sliced by time
    • all keys for document ABC that existed at time t

aronnax's People

Contributors

gtfierro avatar

Watchers

 avatar  avatar  avatar

aronnax's Issues

Evaluation of `now` should be consistent

If there are multiple instances of the now keyword in a query, they are evaluated to the current time as they are parsed, so the notion of now can actually be inconsistent w/n a query. At the scales we're working with at this point, this is not a huge deal, but it is definitely something we need to address in the future. They all need to be replaced at once

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.