raphtory / docbrown Goto Github PK

View Code? Open in Web Editor NEW

14.0 5.0 1.0 9.97 MB

Temporal graph storage in rust

License: GNU General Public License v3.0

Rust 58.50% Makefile 0.04% Python 2.98% Jupyter Notebook 38.48%

docbrown's Introduction

This project has been migrated to Raphtory/Raphtory. Please go there for future developments

🌍 Website 📖 Docs Pometry 🐛 Report a Bug Join Slack

What is Doc Brown? 🥼

Doc Brown is the Rust prototype for the next version of Raphtory, rethinking several aspects of the underlying graph model and algorithm API.

Please checkout the issues for the core features to be included in this version, along with their proposed semantics.

Below is a diagram of how Doc Brown works:

GraphDB is the overarching manager for the graph. A GraphDB instance can have N number of shards. These shards (also called TemporalGraphParts) store fragments of a graph. If you have used the Scala/Python version of Raphtory before, the notion of shards is similar to partitions in Raphtory, where parts of a graph are stored into partitions. When an edge or node is added to the graph in Doc Brown, GraphDB will search for an appropriate place inside a shard to place these. In this diagram, there are 4 shards (or partitions) labelled as S1, S2, S3, S4. Altogether they make up the entire temporal graph, hence the name "TemporalGraphPart" for each shard.

Shards are used for performance and distribution reasons. Having multiple shards running in parallel speeds up things a lot in Raphtory. In a matter of seconds, you are able to see your results from your temporal graph analysis. Furthermore, you can run your analysis across multiple machines (e.g. one shard per machine).

Running Doc Brown 👨🏼‍🔬

The API's are currently in...Flux

However, here is a quick start guide if you would like to test out the Raphtory Rust prototype.

Running the LOTR example 🧙🏻‍♂️

Prerequisites

Make sure you have Rust installed on your OS. Here is a guide to install Rust.

1. Set up Doc Brown

Clone the Doc Brown Repository and find the examples directory where you can find the Lord of the Rings Example. Create a folder named "Data" under examples/lotr and download the LOTR CSV data into this folder. You can download the raw csv here.

2. Run Doc Brown

Build Doc Brown by running this command to make sure it compiles and builds:

cargo build

If you have any linker errors please build with the following command:

cargo build --no-default-features

Next run the main function in main.rs which creates a graph from the LOTR csv file, showing the different character interactions throughout the book. To do this, you will need to be in the lotr folder, the file path to this from root is ./examples/src/bin/lotr. Once you are here run this command to run the LOTR example:

cargo run --bin lotr

You should see output that looks something like this with information about the edges and vertices below:

Loaded graph from encoded data files ./examples/src/bin/lotr/data/graphdb.bincode with 139 vertices, 701 edges which took 0 seconds

Gandalf exists = true

Congratulations, you have run your first Raphtory graph in Rust!

3. Running Tests

To run tests in Doc Brown, go back into your root folder and run this command:

cargo test

Code Example

// Create your GraphDB object and state the number of shards you would like, here we have 2
let graph = GraphDB::new(2);

// Add vertex and edges to your graph with the respective properties
graph.add_vertex(
  src_id,
  time,
  &vec![("name".to_string(), Prop::Str("Character".to_string()))],
);

graph.add_vertex(
  dst_id,
  time,
  &vec![("name".to_string(), Prop::Str("Character".to_string()))],
);

graph.add_edge(
  src_id,
  dst_id,
  time,
  &vec![(
      "name".to_string(),
      Prop::Str("Character Co-occurrence".to_string()),
  )],
);

// We calculate a hash for the string "Gandalf" to be used as a unique identifier for Gandalf
let gandalf = utils::calculate_hash(&"Gandalf");

// Get the in-degree, out-degree and degree of Gandalf
let in_degree = graph.degree_window(gandalf, 0, i64::MAX, Direction::IN);
let out_degree = graph.degree_window(gandalf, 0, i64::MAX, Direction::OUT);
let degree = graph.degree_window(gandalf, 0, i64::MAX, Direction::BOTH);

Documentation

DocBrown has Documentation with tutorials, explanations and rust docs. It can be found here on ReadTheDocs

Contributing

Install Rust from install guide
Install Python 3.10 (virtual/conda environment is recommended).
Install pip packages needed to build/test

pip install maturin pytest

Community

Join the growing community of open-source enthusiasts using Raphtory to power their graph analysis projects!

Follow for the latest Raphtory news and development
Join our to chat with us and get answers to your questions!

Articles and Talks about Raphtory

Contributors

Since Doc Brown is still a prototype, we are open to any contributions. If you find any issues or would like to work on some issues yourself, visit the issues page. Join our Slack if you're having any issues or would like to find out more about how you can get stuck in with Raphtory.

License

Raphtory is licensed, check out our LICENSE file.

docbrown's People

Contributors

Stargazers

Watchers

Forkers

miratepuffin

docbrown's Issues

New Semantics for Perspectives

there are no props on edges

Using Strings as vertex IDs

In many cases, the data we are ingesting into Raphtory does not have integer-based IDs for the nodes, but instead a name or hash etc.

In these instances, we are currently mapping the 'names' to an internal Long ID via our own hashing function and then storing the name as a property which can be looked up.

This is fine for the most part but does mean we are required to extract this name for every output as the results otherwise make no sense to the end user.

It would be fantastic if this could be managed better, but at the absolute minimum not having negative IDs (as Java generates) would be a huge plus.

graph.add_edge(time=1,src="a",dst="b")
graph.add_edge(time=2,src="c",dst="a")
assert graph.at(3).is_edge("a","b")
assert graph.at(3).vertex("a").degree() == 2

Cannot call submodules in python import statement

e.g. from pyraphtory.algorithms import triangle_count is not working because of a pyo3 issue. Need a hack around this so that we can call submodules from pyraphtory.

Otherwise, we have to import the module and then call the specific algorithm from the algorithm module:
from pyraphtory import algorithms
algorithms.triangle_count()

Changes to the core graph model

Overview

As we are reworking many of the components of Raphtory, this gives us a good opportunity to rethink aspects of the underlying model which would be hard to change in the current code base. An overview of this can be seen above.

Note: This does not show how things are actually implemented in Doc Brown, but what can be imagined to be stored within Vertices and Edges.

Notable changes and why

Graph State

There are two key differences planned for the state of the graph.

Unifying properties and algorithmic state

This first change comes from the fact that it has been a bit of a rub that in the algorithmic state the user is free to store anything, but at ingestion time the properties are limited to primitive types. This means if you want to create a new graph from the output of a perspective it is often just not possible. As such we have decided to open the properties to be able to store arbitrary data, unifying the underlying storage.

As a component of this we are contemplating allowing the user to write algorithmic state from a perspective back to the underlying graph as a permanent feature which can then be used in any analysis in the future.

edge_data={"weight"=2.3,"balances"=[3,5,1,2,4],"internal_dict"={...}}
graph.add_edge(time=1,src=1,dst=2,data=edge_data)
assert graph.at(1).edge(1,2).get_state("balances")[1]==5

Meta Data

The Second is breaking out 'Immutable Properties' into their own category of 'Meta Data' which does not have an associated time. The reason this was chosen is there were several instances where a user would first add structural information (nodes and edges) and then some additional information about some of these entities.

This could be things like their name, node type or some other property which doesn't really have an exact point at which it happened. In these cases, the option currently is to either look up the earliest time the entity was present (adding quite a lot of overhead) or add at time 0, which means all entities in the graph exist at the beginning of time which is often untrue.

As such these now only have a key and a value and will be imagined to exist for the whole history of the network.

graph.add_edge(time=1,src=1,dst=2)
graph.add_edge(time=2,src=2,dst=3)
graph.add_vertex_meta(id=2,data={"name"="Gandalf","Profession"="Wizard"})
graph.add_vertex_meta(id=3,data={"name"="Sam","Profession"="Gardener"})
assert graph.vertex(2).get_state("name")=="Gandalf"

Edge Layers

The other key change here is the replacement of edge types with 'layers'.

In the current code base, an edge between two nodes may have a type, but this is simply a string and does not allow users to properly define multiple different relationships between the same nodes.

In the above example, we have two nodes that have two relationships ("co-worker" and "partner").

If these were to have different weights etc. these currently have to be managed by properties in quite an ugly way i.e. "co-worker_weight" and "partner_weight". This obviously gets unwieldy very quickly if you have many relationship types and several properties on each.

In addition, for deletions, there is currently no way to model the events in the timeline where the "co-worker" relationship is removed, but the "partner" relationship is kept.

In conclusion, it makes sense to clean this up by having separate edges for each layer. This also allows for projections where we can run an algorithm on each layer and compare results etc.

graph.add_edge(time=1,src=1,dst=2,layer="friend")
graph.add_edge(time=1,src=1,dst=2,layer="co-worker")
graph.delete_edge(time=5,src=1,dst=2layer="friend")
assert graph.at(6).is_edge(1,2,"co-workers")
assert not graph.at(6).is_edge(1,2,"friend")

DefaultVec on Adjacency lists causes duplicate edges

Neighbours_ids() duplicates neighbours with two edges between them

let vs = vec![(1, 1, 2), (2, 1, 3), (3, 2, 1), (4, 3, 2)];

for (t, src, dst) in &vs {
    g.add_edge(*t, *src, *dst, &vec![]);
}

let v = 1;
for i in &WindowedGraph.neighbours_ids(v, Direction::BOTH) {
print(i)
}

i = 2
i = 2
i = 3

Counts vertex 2 twice as a neighbour, should be unique neighbours so only return vertex 2 once

u64 is the wrong type for time it should be i64

Load data in parallel

Load data in parallel
allow TemporalGraph to function as a partition of a larger graph by adding remote/local edges

Projections merging with Perspectives

This feels very "old man shouts at cloud", so any suggestions are very welcomed!

Perspectives currently aren't 'real'

NOTE: The following is all sudo code - we are still discussing how best to represent these concepts within the new API's. If you have any suggestions for how these should look, leave a comment below.

One thing which people have found quite confusing/annoying with the current Raphtory API's is that perspectives are an almost ethereal entity which you can't really touch or interact with. This problem manifests itself in several ways:

When we do something like the below query many people believe they are actually executing on the underlying graph (mutating it) and not 100 different perspectives of the graph over time.

           graph.range(start=1,end=100,increment=1)
           .transform(ConnectedComponents()
           .select()
           .to_df())

When working with perspectives it is incredibly annoying that you cannot check the current state of your analysis and then continue - i.e. if you were to call to_df/write_to at any point, adding further algorithmic steps means rerunning everything again.
It is not possible to check something in one perspective and utilise this in the next.
If we mutate the perspective in some way (such as a filter) we have no way to revert this, massively complicating some algorithms which could make use of such a mechanic.

New Perspective Flow

As a solution to these grievances, the new Perspectives will be tangible objects which are returned from a call to range, at, etc. and can be interacted with individually or as a collection.

This drastically expands the types of analysis which can be done directly within Raphtory, especially when questions involve multiple perspectives. For example we may check the existence of an edge in two perspectives and based upon this access some vertex state in a third:

    perspectives = graph.range(start=1,end=100,increment=1)
    if(perspectives[5].is_edge(12,14) & perspectives[50].is_edge(12,14)):
        return perspectives[25].vertices(12).out_neighbours()
    else 
        return perspectives[25].vertices(14).in_neighbours()

Expanding past access, we have also rethought the flow of a Perspective, an example of which can be seen below. The main points to note here are:

When filtering/projecting the graph in some way (henceforth referred to as a mask) you will be able to remove this mask but keep the state aquired by the graph while it was active. For example, the KCore filter at the top of the flow works by iteratively filtering out nodes until no nodes remain, writing each nodes Highest K as a vertex State. The mask is then removed (returning all vertices) and we perform several algorithms on Nodes with a coreness higher than some threshold.
At multiple points (such as after the Removal of the KCore mask) the actions are branched. The implication here is that the actions following this would be able to access the state of the prior steps, but these steps would not have to be rerun. I.e. Even though to_df is called 4 times, the KCore Filter would only run once.

Perspectives and Projections -> GraphView

One area which has been toyed with loads of the past year is projections - i.e. taking our standard graph and presenting in a different format (possibly with different semantics/API's available) to encompass the many different definitions of a graph. There has been two issues we have run into here - firstly Scala really really hated it and it was always kinda secondary to the GraphLens/Perspectives API.

With this version of Raphtory this will all change as both are being given equal footing within the GraphView. The premise here is that given a Graph we can set our usual time filters/points/windows, but may also set the semantics of the graph/perspectives via calls to different views. This allows us to say support concepts like delay and duration on an edge, without having to try to juggle them all as core concepts in the underlying storage.

support edges with duration (i.e, edge remains active for a specified duration after it is created)

graph.add_edge(time=0, src="a", dst="b", properties={"duration"=10})
graph.as_delay_graph("duration")
assert graph.window(2..5).is_edge("a", "b")
assert not graph.window(11..12).is_edge("a", "b")

support edges with delay (i.e., information needs time to arrive at the destination)

graph.add_edge(time=0, src="a", dst="b" properties={"delay"=10})
graph.as_delay_graph("delay")
assert not graph.window(2..5).is_edge("a", "b")
assert graph.window(0..12).is_edge("a", "b")
assert not graph.window(2..12).is_edge("a", "b")

New functions becoming available for certain projections i.e. positive and negative edges in a signed graph

graph.add_edge(time=0, src="a", dst="b" properties={"sign"=-1})
graph.as_signed_graph("sign")
assert not graph.is_positive_edge("a", "b")

There are several other out of the box projections we have initially thought to support including: (these all need to be drawn :D)

Reversed edges
Directed to undirected edges - Needs some thought on how to combine properties if the edge exists in both directions
Null Models - Shuffling the edges/events in various ways to simplify checking if some result is significant
Temporal Multi-layer - exploding vertices and edges out -- also supporting interlayer edges and several other fun things that need lots of discussion
Bipartite Network Projections - (https://en.wikipedia.org/wiki/Bipartite_network_projection)

raphtory / docbrown Goto Github PK

docbrown's Introduction

This project has been migrated to Raphtory/Raphtory. Please go there for future developments

What is Doc Brown? 🥼

Running Doc Brown 👨🏼‍🔬

Running the LOTR example 🧙🏻‍♂️

Prerequisites

1. Set up Doc Brown

2. Run Doc Brown

3. Running Tests

Code Example

Documentation

Contributing

Community

Articles and Talks about Raphtory

Contributors

License

docbrown's People

Contributors

Stargazers

Watchers

Forkers

docbrown's Issues

Overview

Notable changes and why

Graph State

Unifying properties and algorithmic state

Meta Data

Edge Layers

Perspectives currently aren't 'real'

New Perspective Flow

Perspectives and Projections -> GraphView

Recommend Projects

Recommend Topics

Recommend Org