raphtory / docbrown Goto Github PK

Temporal graph storage in rust

License: GNU General Public License v3.0

Rust 58.50% Makefile 0.04% Python 2.98% Jupyter Notebook 38.48%

docbrown's Issues

Cannot call submodules in python import statement

e.g. from pyraphtory.algorithms import triangle_count is not working because of a pyo3 issue. Need a hack around this so that we can call submodules from pyraphtory.

Otherwise, we have to import the module and then call the specific algorithm from the algorithm module:
from pyraphtory import algorithms
algorithms.triangle_count()

Using Strings as vertex IDs

In many cases, the data we are ingesting into Raphtory does not have integer-based IDs for the nodes, but instead a name or hash etc.

In these instances, we are currently mapping the 'names' to an internal Long ID via our own hashing function and then storing the name as a property which can be looked up.

This is fine for the most part but does mean we are required to extract this name for every output as the results otherwise make no sense to the end user.

It would be fantastic if this could be managed better, but at the absolute minimum not having negative IDs (as Java generates) would be a huge plus.

graph.add_edge(time=1,src="a",dst="b")
graph.add_edge(time=2,src="c",dst="a")
assert graph.at(3).is_edge("a","b")
assert graph.at(3).vertex("a").degree() == 2

Neighbours_ids() duplicates neighbours with two edges between them

let vs = vec![(1, 1, 2), (2, 1, 3), (3, 2, 1), (4, 3, 2)];

for (t, src, dst) in &vs {
    g.add_edge(*t, *src, *dst, &vec![]);
}

let v = 1;
for i in &WindowedGraph.neighbours_ids(v, Direction::BOTH) {
print(i)
}

i = 2
i = 2
i = 3

Counts vertex 2 twice as a neighbour, should be unique neighbours so only return vertex 2 once

DefaultVec on Adjacency lists causes duplicate edges

Projections merging with Perspectives

This feels very "old man shouts at cloud", so any suggestions are very welcomed!

Perspectives currently aren't 'real'

NOTE: The following is all sudo code - we are still discussing how best to represent these concepts within the new API's. If you have any suggestions for how these should look, leave a comment below.

One thing which people have found quite confusing/annoying with the current Raphtory API's is that perspectives are an almost ethereal entity which you can't really touch or interact with. This problem manifests itself in several ways:

When we do something like the below query many people believe they are actually executing on the underlying graph (mutating it) and not 100 different perspectives of the graph over time.

           graph.range(start=1,end=100,increment=1)
           .transform(ConnectedComponents()
           .select()
           .to_df())

When working with perspectives it is incredibly annoying that you cannot check the current state of your analysis and then continue - i.e. if you were to call to_df/write_to at any point, adding further algorithmic steps means rerunning everything again.
It is not possible to check something in one perspective and utilise this in the next.
If we mutate the perspective in some way (such as a filter) we have no way to revert this, massively complicating some algorithms which could make use of such a mechanic.

New Perspective Flow

As a solution to these grievances, the new Perspectives will be tangible objects which are returned from a call to range, at, etc. and can be interacted with individually or as a collection.

This drastically expands the types of analysis which can be done directly within Raphtory, especially when questions involve multiple perspectives. For example we may check the existence of an edge in two perspectives and based upon this access some vertex state in a third:

    perspectives = graph.range(start=1,end=100,increment=1)
    if(perspectives[5].is_edge(12,14) & perspectives[50].is_edge(12,14)):
        return perspectives[25].vertices(12).out_neighbours()
    else 
        return perspectives[25].vertices(14).in_neighbours()

Expanding past access, we have also rethought the flow of a Perspective, an example of which can be seen below. The main points to note here are:

When filtering/projecting the graph in some way (henceforth referred to as a mask) you will be able to remove this mask but keep the state aquired by the graph while it was active. For example, the KCore filter at the top of the flow works by iteratively filtering out nodes until no nodes remain, writing each nodes Highest K as a vertex State. The mask is then removed (returning all vertices) and we perform several algorithms on Nodes with a coreness higher than some threshold.
At multiple points (such as after the Removal of the KCore mask) the actions are branched. The implication here is that the actions following this would be able to access the state of the prior steps, but these steps would not have to be rerun. I.e. Even though to_df is called 4 times, the KCore Filter would only run once.

Perspectives and Projections -> GraphView

One area which has been toyed with loads of the past year is projections - i.e. taking our standard graph and presenting in a different format (possibly with different semantics/API's available) to encompass the many different definitions of a graph. There has been two issues we have run into here - firstly Scala really really hated it and it was always kinda secondary to the GraphLens/Perspectives API.

With this version of Raphtory this will all change as both are being given equal footing within the GraphView. The premise here is that given a Graph we can set our usual time filters/points/windows, but may also set the semantics of the graph/perspectives via calls to different views. This allows us to say support concepts like delay and duration on an edge, without having to try to juggle them all as core concepts in the underlying storage.

support edges with duration (i.e, edge remains active for a specified duration after it is created)

graph.add_edge(time=0, src="a", dst="b", properties={"duration"=10})
graph.as_delay_graph("duration")
assert graph.window(2..5).is_edge("a", "b")
assert not graph.window(11..12).is_edge("a", "b")

support edges with delay (i.e., information needs time to arrive at the destination)

graph.add_edge(time=0, src="a", dst="b" properties={"delay"=10})
graph.as_delay_graph("delay")
assert not graph.window(2..5).is_edge("a", "b")
assert graph.window(0..12).is_edge("a", "b")
assert not graph.window(2..12).is_edge("a", "b")

New functions becoming available for certain projections i.e. positive and negative edges in a signed graph

graph.add_edge(time=0, src="a", dst="b" properties={"sign"=-1})
graph.as_signed_graph("sign")
assert not graph.is_positive_edge("a", "b")

There are several other out of the box projections we have initially thought to support including: (these all need to be drawn :D)

Reversed edges
Directed to undirected edges - Needs some thought on how to combine properties if the edge exists in both directions
Null Models - Shuffling the edges/events in various ways to simplify checking if some result is significant
Temporal Multi-layer - exploding vertices and edges out -- also supporting interlayer edges and several other fun things that need lots of discussion
Bipartite Network Projections - (https://en.wikipedia.org/wiki/Bipartite_network_projection)

New Semantics for Perspectives

there are no props on edges

Load data in parallel

Load data in parallel
allow TemporalGraph to function as a partition of a larger graph by adding remote/local edges

Changes to the core graph model

Overview

As we are reworking many of the components of Raphtory, this gives us a good opportunity to rethink aspects of the underlying model which would be hard to change in the current code base. An overview of this can be seen above.

Note: This does not show how things are actually implemented in Doc Brown, but what can be imagined to be stored within Vertices and Edges.

Notable changes and why

Graph State

There are two key differences planned for the state of the graph.

Unifying properties and algorithmic state

This first change comes from the fact that it has been a bit of a rub that in the algorithmic state the user is free to store anything, but at ingestion time the properties are limited to primitive types. This means if you want to create a new graph from the output of a perspective it is often just not possible. As such we have decided to open the properties to be able to store arbitrary data, unifying the underlying storage.

As a component of this we are contemplating allowing the user to write algorithmic state from a perspective back to the underlying graph as a permanent feature which can then be used in any analysis in the future.

edge_data={"weight"=2.3,"balances"=[3,5,1,2,4],"internal_dict"={...}}
graph.add_edge(time=1,src=1,dst=2,data=edge_data)
assert graph.at(1).edge(1,2).get_state("balances")[1]==5

Meta Data

The Second is breaking out 'Immutable Properties' into their own category of 'Meta Data' which does not have an associated time. The reason this was chosen is there were several instances where a user would first add structural information (nodes and edges) and then some additional information about some of these entities.

This could be things like their name, node type or some other property which doesn't really have an exact point at which it happened. In these cases, the option currently is to either look up the earliest time the entity was present (adding quite a lot of overhead) or add at time 0, which means all entities in the graph exist at the beginning of time which is often untrue.

As such these now only have a key and a value and will be imagined to exist for the whole history of the network.

graph.add_edge(time=1,src=1,dst=2)
graph.add_edge(time=2,src=2,dst=3)
graph.add_vertex_meta(id=2,data={"name"="Gandalf","Profession"="Wizard"})
graph.add_vertex_meta(id=3,data={"name"="Sam","Profession"="Gardener"})
assert graph.vertex(2).get_state("name")=="Gandalf"

Edge Layers

The other key change here is the replacement of edge types with 'layers'.

In the current code base, an edge between two nodes may have a type, but this is simply a string and does not allow users to properly define multiple different relationships between the same nodes.

In the above example, we have two nodes that have two relationships ("co-worker" and "partner").

If these were to have different weights etc. these currently have to be managed by properties in quite an ugly way i.e. "co-worker_weight" and "partner_weight". This obviously gets unwieldy very quickly if you have many relationship types and several properties on each.

In addition, for deletions, there is currently no way to model the events in the timeline where the "co-worker" relationship is removed, but the "partner" relationship is kept.

In conclusion, it makes sense to clean this up by having separate edges for each layer. This also allows for projections where we can run an algorithm on each layer and compare results etc.

graph.add_edge(time=1,src=1,dst=2,layer="friend")
graph.add_edge(time=1,src=1,dst=2,layer="co-worker")
graph.delete_edge(time=5,src=1,dst=2layer="friend")
assert graph.at(6).is_edge(1,2,"co-workers")
assert not graph.at(6).is_edge(1,2,"friend")