GutterTree

A fast external memory buffering system for graph streaming.

The following describes some of the basics of how this works. There are three key objects GutterTree, BufferControlBlock, and WorkQueue.

GutterTree

This class specifies most of the meta-data and functions for handling a gutter tree. Specifically, the gutter tree maintains a list of nodes that compose the tree in addition to some other buffers used to enable efficient IO when either flushing or reading data from a node.

Tree Structure

Tree includes parameters M the size of a buffer, B the branching factor, and finally N the number of graph node ids we expect to ingest.

           ---root---
         /      |     \
     node       node    node
   /  |  \       |  \     |  \
node node node node node node node

Each internal node contains a buffer of size M + page_size and has at most B children. We construct the tree so that there is a unique node mapping to each of the N graph nodes. The leaf nodes are of size proportional to a graph streaming sketch (polylog(N)). In the above example B is 3 and N is 7. The bottom level would be full if N equal to 3^2=9.

Flushing

When a either a root node or an internal node of the tree stores data of size ≥ M then it is ready to be flushed. This flush may happen asynchronously if desired so long as the data stored in the buffer does not exceed M.

When flushing we utilize flush_buffers to achieve efficient file writing. The data in the node is scanned and, based upon the source node of each update, is placed into the appropriate flush_buffer. When these buffers become full their contents are written to the corresponding child. The flush buffers are of size roughly equal to a page and therefore these IOs should be efficient.

A flush of a leaf node is simply accomplished by moving the data in question into the WorkQueue. If the WorkQueue is full then this insertion is blocked until the queue is no longer full.

BufferControlBlock

Encodes the meta-data associated with a block including its file_offset and storage_ptr. These two attributes represent the location of a node within the large and physically contiguous backing_store file. The nodes of the tree are stored in the file following a breadth first search. Specifically, the data stored within the buffer at each node is what is held within the file. Therefore the data in each level is contiguous on disk.

Example:

------------------------------------------------
|Node 1| Node 2| Node 3| Node 4| Node 5| Node 6|
------------------------------------------------

This buffer encodes the following tree with B=2 and N=4

      ---root---
       /      \      
   node1      node2    
    /  \       /  \  
node3 node4 node5 node6

Note that the root does not appear in the backing store. This is because it is stored entirely in RAM. There is also no BufferControlBlock for the root node for the same reason.

WorkQueue

When a node leaf node is ready to be processed by the user its data is placed into the WorkQueue. The WorkQueue is an entirely in RAM structure designed to eliminate IO contention between adding data to and getting data out of the gutter tree. With the WorkQueue, requests to the GutterTree for data take place entirely in RAM.

A leaf is ''ready'' to be processed when it is full and, if it was any other node, would normally be flushed. Instead we place the data stored at this leaf into the queue and then delete the data from the leaf node. An analogy for this process is that the leaves are growing fruits (the update buffers) and when these fruits are big and ripe they will naturally fall from the tree.

The WorkQueue is designed with a limited size. This is because RAM usage needs to be minimal and has the added benefit of naturally rate limiting the gutter tree to match the speed at which data can be taken out of it. Operations that need to take data from an empty queue or that need to insert to a full queue are blocked until their preconditions are met.

The structure of the WorkQueue is as follows

-----------------------------------------------------------------------------------
| node 4 updates | clean | clean | node 1 updates | node 2 updates| node 3 updates|
-----------------------------------------------------------------------------------
                   ^ Head                             ^ Tail

Head marks the where the next insertion to the CircularQueue should take place and Tail marks where the next read should happen from. Slots are marked clean if their data has been taken out of the queue and processed. The queue is full if the head ever points to a dirty slot and empty if Head and Tail point to the same slot (which is also clean).

Note in this example that Tail does not point to the updates for node 1. This is because the data for node 1 is currently being processed by a query. Once this query finishes, the thread performing the query will mark the slot with node 1 updates as clean.

victor-c-zhang / hyperguttertree Goto Github PK

hyperguttertree's Introduction

GutterTree

GutterTree

Tree Structure

Flushing

BufferControlBlock

WorkQueue

hyperguttertree's People

Contributors

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent