Git Product home page Git Product logo

hyperguttertree's Introduction

GutterTree

A fast external memory buffering system for graph streaming.

The following describes some of the basics of how this works. There are three key objects GutterTree, BufferControlBlock, and WorkQueue.

GutterTree

This class specifies most of the meta-data and functions for handling a gutter tree. Specifically, the gutter tree maintains a list of nodes that compose the tree in addition to some other buffers used to enable efficient IO when either flushing or reading data from a node.

Tree Structure

Tree includes parameters M the size of a buffer, B the branching factor, and finally N the number of graph node ids we expect to ingest.

           ---root---
         /      |     \
     node       node    node
   /  |  \       |  \     |  \
node node node node node node node

Each internal node contains a buffer of size M + page_size and has at most B children. We construct the tree so that there is a unique node mapping to each of the N graph nodes. The leaf nodes are of size proportional to a graph streaming sketch (polylog(N)). In the above example B is 3 and N is 7. The bottom level would be full if N equal to 3^2=9.

Flushing

When a either a root node or an internal node of the tree stores data of size โ‰ฅ M then it is ready to be flushed. This flush may happen asynchronously if desired so long as the data stored in the buffer does not exceed M.

When flushing we utilize flush_buffers to achieve efficient file writing. The data in the node is scanned and, based upon the source node of each update, is placed into the appropriate flush_buffer. When these buffers become full their contents are written to the corresponding child. The flush buffers are of size roughly equal to a page and therefore these IOs should be efficient.

A flush of a leaf node is simply accomplished by moving the data in question into the WorkQueue. If the WorkQueue is full then this insertion is blocked until the queue is no longer full.

BufferControlBlock

Encodes the meta-data associated with a block including its file_offset and storage_ptr. These two attributes represent the location of a node within the large and physically contiguous backing_store file. The nodes of the tree are stored in the file following a breadth first search. Specifically, the data stored within the buffer at each node is what is held within the file. Therefore the data in each level is contiguous on disk.

Example:

------------------------------------------------
|Node 1| Node 2| Node 3| Node 4| Node 5| Node 6|
------------------------------------------------

This buffer encodes the following tree with B=2 and N=4

      ---root---
       /      \      
   node1      node2    
    /  \       /  \  
node3 node4 node5 node6

Note that the root does not appear in the backing store. This is because it is stored entirely in RAM. There is also no BufferControlBlock for the root node for the same reason.

WorkQueue

When a node leaf node is ready to be processed by the user its data is placed into the WorkQueue. The WorkQueue is an entirely in RAM structure designed to eliminate IO contention between adding data to and getting data out of the gutter tree. With the WorkQueue, requests to the GutterTree for data take place entirely in RAM.

A leaf is ''ready'' to be processed when it is full and, if it was any other node, would normally be flushed. Instead we place the data stored at this leaf into the queue and then delete the data from the leaf node. An analogy for this process is that the leaves are growing fruits (the update buffers) and when these fruits are big and ripe they will naturally fall from the tree.

The WorkQueue is designed with a limited size. This is because RAM usage needs to be minimal and has the added benefit of naturally rate limiting the gutter tree to match the speed at which data can be taken out of it. Operations that need to take data from an empty queue or that need to insert to a full queue are blocked until their preconditions are met.

The structure of the WorkQueue is as follows

-----------------------------------------------------------------------------------
| node 4 updates | clean | clean | node 1 updates | node 2 updates| node 3 updates|
-----------------------------------------------------------------------------------
                   ^ Head                             ^ Tail

Head marks the where the next insertion to the CircularQueue should take place and Tail marks where the next read should happen from. Slots are marked clean if their data has been taken out of the queue and processed. The queue is full if the head ever points to a dirty slot and empty if Head and Tail point to the same slot (which is also clean).

Note in this example that Tail does not point to the updates for node 1. This is because the data for node 1 is currently being processed by a query. Once this query finishes, the thread performing the query will mark the slot with node 1 updates as clean.

hyperguttertree's People

Contributors

etwest avatar victor-c-zhang avatar danieldelayo avatar abi1024 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.