Git Product home page Git Product logo

cost's Introduction

COST

COST is an acronym for the "Configuration that Outperforms a Single Thread", indicating the hardware resources required by a distributed system before it begins to outperform a single-threaded implementation. This repository contains the single-threaded implementations providing the baseline performance.

Specifically, this repository contains single-threaded implementations of three graph algorithms, PageRank, label propagation, and union-find, supporting performance measurements taken on two graphs, twitter_rv and uk_2007_05. The code is intended to be instructive, rather than a meaningful replacement for a graph-processing system.

Instructions

Once cloned, the project can be built and run by typing

cargo run --release

which should result in usage information indicating appropriate arguments:

Running `target/COST`
Invalid arguments.

Usage: COST pagerank  (vertex | hilbert | compressed) <prefix>
       COST label_prop (vertex | hilbert | compressed) <prefix>
       COST union_find (vertex | hilbert | compressed) <prefix>
       COST stats  (vertex | hilbert | compressed) <prefix>
       COST print  (vertex | hilbert | compressed) <prefix>
       COST to_hilbert [--dense] <prefix>
       COST parse_to_hilbert
       COST merge <source>...
       COST twitter <from> <prefix>

The first three modes correspond to the three graph algorithms. The second parameter indicates the binary graph layout. The final parameter must be a path prefix for which certain extensions exist as files, discussed in a moment. The to_hilbert mode performs a graph layout according to a Hilbert space-filling curve, optionally densifying the identifiers.

Graph input

The computation will not do anything productive without graph data, and the graph data use for experiments (processed versions of the graphs linked above) are too large to host here. I'm also not wild about distributing programs that write data back to someone else's computer, without more serious review. That being said, the file src/twitter_parser.rs contains the code that I used to parse twitter_rv.net, the file you get from the link above.

You can find the twitter files by searching for "twitter_rv.tar.gz" (6GB), "twitter_rv_19000000.tar.gz" (764MB) or "twitter_rv_small.tar.gz" (188MB). The sizes listed refer to the compressed size.

To process the small twitter dataset into a set of small.nodes and small.edges files, you can run:

./target/release/COST twitter twitter_rv_15066953.net small

To convert the small.nodes and small.edges file to hilbert, you can run:

./target/release/COST to_hilbert small

Which will create two files, small.upper and small.lower.

To compute pagerank, you can run:

./target/release/COST pagerank hilbert small

The required graph layout is quite simple (as is the code to parse it), and you should be able to write out your own graph data if you would like to try out the code.

The vertex option requires a <prefix> for which files <prefix>.nodes and <prefix>.edges exist. The two files should be binary, where

  • <prefix>.nodes contains a sequence of (u32, u32) pairs representing (node_id, degree).
  • <prefix>.edges contains a sequence of u32 values indicating the concatenation of edge destinations for all nodes indicated above.

The program is just going to map these two files into memory and read them, so you want to make sure that data have the appropriate endian-ness for your system.

The hilbert option requires a <prefix> for which <prefix>.upper and <prefix>.lower exist. The two files should be binary, where

  • <prefix>.upper contains a sequence of ((u16, u16), u32) values, indicating a pair of upper 16 bits of node identifiers, and a count of how many edges have this pair of upper 16 bits.
  • <prefix>.lower contains a concatenated sequence of (u16, u16) values for the lower 16 bits for each edge in each group above.

The easiest way to get a feel for what these should look like is to invoke the to_hilbert option with <prefix> valid for data in the vertex layout, and it will print to the screen what the data look like laid out in Hilbert format. If you change the code to write the data to disk rather than to the terminal, you should be good to go (remember, .upper and .lower, not .nodes and .edges).

Notes

There is a companion COST repository managed by Microsoft Research, including the state of the project several months ago. This may be helpful if you are interested in the corresponding C# implementations. The repository also contains Naiad implementations that were done more recently. I am no longer affiliated with Microsoft and cannot commit to the repository (nor, historically, do they accept pull requests), and must apologize for the sorry state I left the code in. It may be cleaned up in the future (either by me, or other more industrious souls), given the right incentives.

cost's People

Contributors

frankmcsherry avatar maxdemarzi avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.