# Circulo: A Community Detection Evaluation Framework

### Contribute

Contributors welcome! If you want to contribute, please issue a pull request.

## About

#### The Framework

Circulo is a "Community Detection" Evaluation Framework written primarily in Python. The framework performs statistical analysis against partitions of a graph produced by the execution of a given community detection algorithm. The resulting quantitative measures can be used to drive experiments such as measuring an algorithm's efficacy against specific dataset types or comparing the results of different algorithms on the same dataset. The framework includes the following components:

  • Data ETL (Extract, Transform, Load) Engine: Circulo includes functionality to incorporate existing datasets into the evaluation framework. By default, the framework includes several dataset "downloaders" in the directory circulo/data. To learn how to add a dataset, please see the data README. We encourage users to issue pull requests for new datasets.
  • Algorithm Execution Engine: Circulo includes several algorithms by default, located in the algorithms directory. Using the framework, these algorithms can run in parallel against the included datasets via the run_algos.py tool. Because some algorithms take parameters that can better cater execution to the type of input graph (e.g., directed and/or weighted), algorithm execution is wrapped in the file community.py. This enables the same algorithm to automatically operate differently depending on the dataset, adapting to it where allowed. To add an algorithm to the framework, add its files to algorithms and update the wrapper community.py appropriately.
  • Metrics Engine: The metrics engine provides quantitative analysis of a given partitioning of a graph. The metrics include internal statistical measures of a community (e.g., density), external measures (e.g., expansion), and network-wide metrics (e.g., ground truth comparisons).
  • Experiments Engine: Different types of experiments have been designed to find patterns in the metric results, for example, how algorithms compare when considering both time and accuracy. This component is meant to be a "playground" for experimentation on metric results, so experiments may vary significantly. Each file in the experiments directory is an independent experiment. See the README for more information.
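The wrapping idea behind community.py can be sketched as follows. This is a hypothetical illustration, not the actual Circulo code; the function name and the property/capability names are invented:

```python
def adapt_parameters(graph_props, algo_caps):
    # Hypothetical sketch of the community.py idea: specialize one
    # algorithm invocation based on properties of the input graph.
    # graph_props and algo_caps are made-up stand-ins, not Circulo APIs.
    kwargs = {}
    if graph_props.get("weighted") and "weights" in algo_caps:
        kwargs["weights"] = "weight"        # pass the edge-weight attribute
    if graph_props.get("directed") and "directed" not in algo_caps:
        kwargs["as_undirected"] = True      # collapse before running
    return kwargs

print(adapt_parameters({"weighted": True, "directed": True}, {"weights"}))
# -> {'weights': 'weight', 'as_undirected': True}
```

The point is that one call site can serve every dataset: the wrapper inspects the graph and the algorithm's capabilities and fills in the parameters accordingly.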

#### The Research

Prior to building the Circulo framework, Lab41 conducted a market survey of Community Detection algorithms and metrics. The survey was used to guide the development of Circulo. It includes, but is not limited to, summaries of algorithms, references to academic literature, and general information about the field. The survey can be found here: http://lab41.github.io/survey-community-detection/.

#### The Underlying Graph Framework

Since we did not want to reimplement the notion of a graph, we picked an existing graph framework as a backdrop for our work. Though any of the popular graph frameworks could have been used, iGraph was chosen as our primary graph framework. iGraph offers a number of features:

  • First and foremost, iGraph implements a number of community detection algorithms out of the box. It also provides two data structures for community detection: VertexClustering (non-overlapping communities) and VertexCover (overlapping communities).
  • iGraph is written in C at its core, making it fast.
  • iGraph has wrappers for Python and R.
  • iGraph is a mature framework.

Other frameworks that could have been used include GraphX, GraphLab, SNAP, and NetworkX.

## Installation and Setup

#### Package Requirements

  • git
  • python3
  • igraph (Refer to Appendix A for further instructions)
  • matplotlib
  • cairo (if you want to plot directly from igraph)
  • scipy
  • scikit-learn

#### Installation

Below are instructions for installing Circulo:

#clone the Circulo repository (note: this also clones SNAP)
git clone --recursive https://github.com/Lab41/circulo.git
#set PYTHONPATH env variable
export PYTHONPATH=/path/to/Circulo
#make the snap code base
pushd lib/snap; make; popd

#### Running the Evaluation Framework

At its core, the evaluation framework runs various community detection algorithms against various datasets.

#To run algorithms against the data
python circulo/setup/run_algos.py [parameters ...]
#To run metrics against the results of run_algos
python circulo/setup/run_metrics.py [parameters ...]

## Appendix

#### Appendix A: iGraph Installation

##### Ubuntu

sudo apt-get install igraph
sudo apt-get install libxml2-dev libz-dev python-dev

##### OS X

#install the igraph dylibs using brew
brew install homebrew/science/igraph

#install Cairo
#installs the core libraries for cairo
brew install cairo 

#installs the python site-packages. NOTE: pip does not work for pycairo. 
#If you want to use pip, create sym links to the site packages in brew
brew install py3cairo

#install python igraph
pip3 install python-igraph

## circulo's People

#### Contributors

aganeshlab41, cglewis, lab41paulm, ostrowr, ymt123


## circulo's Issues

Density metric producing values greater than 1

When applied to the flights data:

"Density": {"results": [0.08791494062413698, 0.06871003888998747, 0.2504284898114645, 0.16116941529235382, 0.08605533912696034, 0.14722303849544519, 0.09575923392612859, 0.1865671641791045, 0.1731638418079096, 0.2631578947368421, 0.17857142857142858, 0.20745098039215687, 0.20195667365478687, 0.16611295681063123, 0.11739130434782609, 0.136892177589852, 0.12756410256410255, 0.29, 0.18269230769230768, 0.19318181818181818, 0.24603174603174602, 0.12535612535612534, 0.18888888888888888, 0.145, 0.22043010752688172, 0.3611111111111111, 0.17692307692307693, 0.21166666666666667, 0.14673913043478262, 0.1645021645021645, 0.4912280701754386, 0.22826086956521738, 0.235, 0.15579710144927536, 0.35, 0.17647058823529413, 0.19117647058823528, 0.18333333333333332, 0.17647058823529413, 0.37916666666666665, 0.32026143790849676, 0.5641025641025641, 0.2692307692307692, 0.11904761904761905, 0.3375, 0.3653846153846154, 0.17083333333333334, 0.2087912087912088, 0.19230769230769232, 0.30128205128205127, 0.4935897435897436, 0.4340659340659341, 0.2424242424242424, 0.3516483516483516, 0.2761904761904762, 0.29523809523809524, 0.4444444444444444, 0.13636363636363635, 0.5363636363636364, 0.3181818181818182, 0.21428571428571427, 0.24725274725274726, 0.3269230769230769, 0.4038461538461538, 0.2888888888888889, 0.3653846153846154, 0.15555555555555556, 1.0, 0.5181818181818182, 0.21666666666666667, 0.4444444444444444, 0.8444444444444444, 0.356060606060606, 0.4222222222222222, 0.38095238095238093, 0.6222222222222222, 0.3333333333333333, 0.2857142857142857, 0.39285714285714285, 0.38095238095238093, 0.39090909090909093, 0.4444444444444444, 0.4333333333333333, 0.25, 0.7857142857142857, 0.2857142857142857, 0.2916666666666667, 0.38095238095238093, 0.4761904761904762, 0.5333333333333333, 0.5952380952380952, 0.4, 0.33333333333333337, 0.4333333333333333, 0.35714285714285715, 0.38095238095238093, 0.4, 0.35, 0.26666666666666666, 0.6, 1.0, 0.33333333333333337, 0.5, 0.4166666666666667, 0.6666666666666666, 0.25, 0.5, 
0.5, 0.4166666666666667, 1.3333333333333333, 0.3333333333333333, 0.6666666666666666, 0.6666666666666666, 0.8333333333333334, 0.5, 0.6666666666666666, 1.6666666666666667, 1.0, 0.6666666666666666, 2.0, 0.6666666666666666, 1.3333333333333333, 1.0, 0.5, 1.0, 1.0, 3.0, 1.5, 1.0, 1.0, 1.0, 2.0, 2.0, 1.5, 1.0, 1.0, 1.0, 1.0, 3.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.5], "aggregations": {"Size": 156, "Biased Skewness": 2.2642355484198293, "Min": 0.06871003888998747, "Mean": 0.5625547939299005, "Unbiased Variance": 0.23960389561171933, "Max": 3.0, "Median": 0.3859307359307359, "Biased Kurtosis": 7.116082054416815}}

request for explanation and clarification of some evaluation metrics

Hi,
I was wondering if you could help me understand a concept. Conductance, internal density, expansion, cut ratio, and normalized cut are metrics that don't need ground truth, so after using, for example, the edge betweenness and walktrap algorithms to find communities, we could use these metrics to compare the goodness of the communities found by each algorithm. But I am a little confused: if I take these functions and use them as evaluation metrics for the whole network (calculate the mean of the f(S) function [6] over all the communities found by a community detection algorithm) and compare the mean output of these two algorithms for each metric, can I conclude that the algorithm with the lower value captures better communities? ("We consider the following metrics f(S) that capture the notion of a quality of the cluster. Lower value of score f(S) (when |S| is kept constant)" [6].) But here S is not constant, each community has its own size, and I intend to have the measure for the whole network, so is my understanding right or wrong?

[6]: reference 6 of the survey: Empirical comparison of algorithms for network community detection

Thank you so much
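To make the question concrete, here is a minimal, plain-Python version of one such metric, conductance in the f(S) = c_S / (2 m_S + c_S) form used in [6], averaged over the communities of a partition (an illustration, not Circulo's implementation):

```python
def conductance(edges, community):
    # f(S) = c_S / (2*m_S + c_S): fraction of edge endpoints in S
    # that point outside S (lower is better)
    S = set(community)
    internal = boundary = 0
    for u, v in edges:
        if u in S and v in S:
            internal += 1
        elif (u in S) != (v in S):
            boundary += 1
    return boundary / (2 * internal + boundary)

# two triangles joined by a single bridge edge
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
scores = [conductance(edges, c) for c in ([0, 1, 2], [3, 4, 5])]
print(sum(scores) / len(scores))  # -> 0.14285714285714285 (i.e., 1/7)
```

Averaging f(S) this way is exactly the step the question asks about: it is a common heuristic, but as noted, the "|S| kept constant" caveat from [6] no longer holds.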

RolX sensemaking

K = complete_factor(H, M, h_on_left=True)
print(K)
return K

Hello, I have a question regarding the RolX implementation:
In the function make_sense() we compute the matrix M, compute K, and just return it.
As far as I can see, the original paper in chapter 5 (here) performs another step before returning K: another K' is computed with only one role, and then (K / K') is returned to get the role contribution compared to the default contribution (K').

This is the original paragraph:

NodeSense then computes a nonnegative matrix E such that G * E = M. The matrix E represents the role contribution to node measurements. A default matrix E' is also computed by using G' = ones(n, 1), where the n nodes belong to one role. Then, for each role r and for each measurement s, NodeSense computes E(r,s) / E'(r,s).

Note: Matrix K in the code corresponds to matrix E in the paper.

Is this step missing in the code or should this step be added to be accurate to the original paper?
Thanks in advance
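A rough sketch of the missing E/E' step might look like the following; nn_lstsq here is a crude, invented stand-in for the repo's complete_factor, so treat this as an illustration of the paper's normalization rather than a drop-in patch:

```python
import numpy as np

def nn_lstsq(A, B):
    # crude stand-in for complete_factor: least squares, clipped nonnegative
    X, *_ = np.linalg.lstsq(A, B, rcond=None)
    return np.clip(X, 0.0, None)

def sense_ratios(G, M):
    # K[r, s]: contribution of role r to measurement s (G @ K ~ M)
    K = nn_lstsq(G, M)
    # default contribution: a single role containing every node (G' = ones(n, 1))
    K_default = nn_lstsq(np.ones((G.shape[0], 1)), M)
    return K / K_default        # the paper's E(r, s) / E'(r, s)
```

With G the node-by-role matrix and M the node-by-measurement matrix, each entry of the result compares a role's contribution to the single-role default, as in the quoted paragraph.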

Unsupervised Learning of communities

Cluster all the communities for a given dataset and be sure to include the ground truth communities. See if the ground truth communities cluster together, implying some relationship

Allow datasets to specify optimal contexts

Often you know a little about the expected community structure, for example the number of expected communities. For some algorithms, knowing this information can improve the results considerably; in some cases, this information is required. We need to allow datasets to specify such optimizations so that algorithms that support them can use them.

Multigraph Affecting Metrics

For those multigraphs where a transformation does not occur, some of the metrics may become undefined. For example, density scores can exceed 1 on a multigraph. We have solved this for the cases where a transformation occurs in the run_algos stage. However, when a transformation does not occur (e.g., with ground truth, or with an algorithm that accepts multigraphs), it is very likely that you will receive odd scores from metrics that assume a simple graph, such as density. We will most likely have to conduct a transformation in the metrics to resolve this.
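The effect is easy to reproduce with the undirected density formula d = 2m / (n(n-1)): parallel edges inflate m, so d can exceed 1. A minimal sketch in plain Python (not Circulo's implementation), including the simplify-style collapse that resolves it:

```python
def density(n, edges):
    # undirected density: 2m / (n*(n-1)); parallel edges inflate m
    return 2 * len(edges) / (n * (n - 1))

multi = [(0, 1), (0, 1), (1, 2), (0, 2)]       # edge (0, 1) appears twice
print(density(3, multi))                        # -> 1.3333333333333333 (exceeds 1)

simple = sorted({tuple(sorted(e)) for e in multi})
print(density(3, simple))                       # -> 1.0 after collapsing duplicates
```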

Algorithms give strange results

Hello,
I ran the Circulo algorithms on my own data, which is a graphml file of all the follower connections from a Twitter account.

Sadly I get this result:
screenshot from 2016-06-08 15 07 27

It is a multicolored graph where you can't see any distinct communities (green, red, and blue are all over the graph), and this is happening for every algorithm. This is the final plotted graph file that you can open with iGraph to see the same:
followers.txt

I tried the algorithms on the "karate" dataset and they work fine, so the problem comes from my dataset.
I launch the algorithms on this graphml file (converted to a .txt to be able to attach it on this page), which has 1354 nodes and 2756 edges:
followers.txt

This is what my terminal says when running "run_algos.py":
run_algosTerminal.txt

Is my file too big?
Do I have too many node attributes?
Does Circulo not support MultiDirectedGraphs?

BigClam Parameter Clarification: iGraph or NetworkX?

Hello,

First, thank you for this wonderful implementation of so many community detection algorithms!

In trying to figure out how to use the BigClam algorithm, I noticed that the description of the parameters called for a NetworkX graph, but the import statement shows iGraph. If you have a moment and can clarify that would be great.

Thanks!

Maven gephi_plot issue "org.gephi:gephi-toolkit:jar:0.8.2" missing

Hello,

I want to plot my graph with the json files using the gephi_plot file.
As explained I use "mvn compile assembly:single" command.

I get this error:


[INFO]
[ERROR] BUILD ERROR
[INFO]
[INFO] Failed to resolve artifact.

Missing

  1. org.gephi:gephi-toolkit:jar:0.8.2

Try downloading the file manually from the project website.

Then, install it using the command:
mvn install:install-file -DgroupId=org.gephi -DartifactId=gephi-toolkit -Dversion=0.8.2 -Dpackaging=jar -Dfile=/path/to/file

Alternatively, if you host your own repository you can deploy the file there:
mvn deploy:deploy-file -DgroupId=org.gephi -DartifactId=gephi-toolkit -Dversion=0.8.2 -Dpackaging=jar -Dfile=/path/to/file -Durl=[url] -DrepositoryId=[id]

Path to dependency:
1) com.lab41.circulo:gephi_plot:jar:0.0.1-SNAPSHOT
2) org.gephi:gephi-toolkit:jar:0.8.2


1 required artifact is missing.

for artifact:
com.lab41.circulo:gephi_plot:jar:0.0.1-SNAPSHOT

from the specified remote repositories:
gephi-releases (http://nexus.gephi.org/nexus/content/repositories/releases/),
central (http://repo1.maven.org/maven2),
gephi-snapshots (http://nexus.gephi.org/nexus/content/repositories/snapshots/)

[INFO] ------------------------------------------------------------------------
[INFO] For more information, run Maven with the -e switch
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 5 seconds
[INFO] Finished at: Fri Apr 29 12:09:59 CEST 2016
[INFO] Final Memory: 8M/118M
[INFO] ------------------------------------------------------------------------


I can't find the org.gephi:gephi-toolkit:jar:0.8.2 artifact online.


So I changed in the gephi pom.xml just to test:

        <artifactId>gephi-toolkit</artifactId>
        <version>0.8.2</version>

to

                    <artifactId>gephi-toolkit</artifactId>
                    <version>0.9.1</version>

but it led to a lot of warnings and:

[ERROR] COMPILATION ERROR :
[INFO] -------------------------------------------------------------
[ERROR] No compiler is provided in this environment. Perhaps you are running on a JRE rather than a JDK?
[INFO] 1 error
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[ERROR] BUILD FAILURE

I don't know what to do. Any clue would be a great help.

Thank you for reading

Have trouble in Metrics execution

I'm a newbie here. I'm currently exploring various community discovery algorithms using Circulo. I'm getting an error message while running the metrics script. Please help me resolve this issue.

Dataset : karate

Error message:

Running metrics against karate--conga--0

max() takes 1 positional argument but 2 were given
Traceback (most recent call last):
File "run_metrics.py", line 131, in analyze_json
results_cover.compute_metrics(weights=weights, ground_truth_cover=ground_truth_cover )
File "/home/user/circulo/circulo/metrics/cover.py", line 276, in compute_metrics
max_out_results = maximum_out_degree_fraction(cover, odf=odf, weights=weights)
File "/home/user/circulo/circulo/metrics/cover.py", line 160, in maximum_out_degree_fraction
max_seen = odf.max(0).tocsc()
TypeError: max() takes 1 positional argument but 2 were given

Thanks!
Aman.
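For what it's worth, this looks like a scipy version issue: `odf` here is presumably a scipy sparse matrix, and older scipy releases did not accept an axis argument to sparse `.max()`, failing with exactly this TypeError. On a reasonably recent scipy the call pattern works; a minimal sketch (the `odf` contents are made up for illustration):

```python
import numpy as np
from scipy import sparse

# toy stand-in for the out-degree-fraction matrix in cover.py
odf = sparse.csr_matrix(np.array([[0.0, 0.5],
                                  [0.25, 0.1]]))
col_max = odf.max(axis=0)      # column-wise max; needs a scipy that accepts axis
print(col_max.toarray())       # column maxima: 0.25 and 0.5
```

If upgrading scipy is not an option, converting to a dense array first (`odf.toarray().max(axis=0)`) sidesteps the limitation at some memory cost.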

clean up experiments

Some experiments should belong in utils so that other experiments can leverage their generic functionality.

Bug in ETL of flight data

File "run.py", line 91, in prepare
G = FlightData.initialize_vertices(os.path.join(self.raw_data_path, FLIGHTS_DATA))
File "run.py", line 127, in initialize_vertices
attr = line[AIRPORTS_SCHEMA[a]]

a wrapper for R (R interface)

Hi Circulo developers,
Thanks for the great work.
I was wondering if you intend to make Circulo available for R in the coming months?
Thank you in advance

Error when testing algorithms on my data

Hello,
I am trying to test the algorithms on a .graphml file I created. I am doing my Master's thesis on community finding using graph algorithms; in the end I could add my dataset to Circulo.
But I get this error:

File "/usr/lib/python3.4/multiprocessing/pool.py", line 599, in get
raise self._value
SystemError: error return without exception set

It works for all the other datasets.

What I did:
I put my file in the raw folder from "flights" naming it "flights.graphml".
In my run.py file I have:


import os
import shutil
from circulo.data.databot import *

FILE = "flights.graphml"

class FollowersData(CirculoData):

    def __prepare__(self):
        shutil.copyfile(os.path.join(self.raw_data_path, FILE), self.graph_path)

def main():
    FollowersData("flights").get_graph()

if __name__ == "__main__":
    main()


I took this code from the "southernwoman" data that is also already in a graphml format.

When i execute "python3 run_algos.py flights ALL --output algorithm_results"
I get this script in the terminal:


dalys@dalys ~/Documents/circulo/circulo/setup $ python3 run_algos.py followers ALL --output algorithm_results
[Graph Generation ETL for followers ]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.4/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.4/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "run_algos.py", line 168, in data_fetcher
databot.get_graph()
File "/home/dalys/Documents/circulo/circulo/data/databot.py", line 91, in get_graph
return igraph.load(self.graph_path)
File "/usr/local/lib/python3.4/dist-packages/igraph/__init__.py", line 4063, in read
return Graph.Read(filename, *args, **kwds)
File "/usr/local/lib/python3.4/dist-packages/igraph/__init__.py", line 2223, in Read
return reader(f, *args, **kwds)
SystemError: error return without exception set
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "run_algos.py", line 304, in <module>
main()
File "run_algos.py", line 299, in main
run(algos, datasets, args.output[0], args.samples, args.workers, args.timeout)
File "run_algos.py", line 210, in run
r.get()
File "/usr/lib/python3.4/multiprocessing/pool.py", line 599, in get
raise self._value
SystemError: error return without exception set
dalys@dalys ~/Documents/circulo/circulo/setup $


Thank you for reading; I hope you can help.

where is VertexCoverMetric

In the metrics there is 'from circulo.metrics import VertexCoverMetric',
but where is 'VertexCoverMetric' defined?

thanks

Pruning collapsed graphs

In some cases, graphs must be collapsed in order to accommodate certain algorithms. When this happens, these algorithms typically still cannot handle the collapsed graph, so we prune it based on the edge weights, which represent the sums of the collapsed edges. In certain cases the general pruning algorithm is not sufficient, so this enhancement would allow the user to override the prune method in the databot for a specific dataset.

obtaining membership is computationally expensive

In iGraph, anytime you have a cover and obtain a membership, that operation is computationally expensive. In the calculation of some of our metrics (e.g., expansion) we repeatedly get the membership inside loops. The solution is to compute the membership once and store it in a variable outside the loop. The speedup should be considerable.
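A plain-Python sketch of the fix (membership_from_cover is a made-up helper mirroring what iGraph's VertexCover.membership returns, not Circulo code): compute the membership once, then read it inside the metric loops.

```python
def membership_from_cover(n_vertices, cover):
    # for each vertex, the list of communities it belongs to
    # (communities in a cover may overlap)
    membership = [[] for _ in range(n_vertices)]
    for comm_idx, members in enumerate(cover):
        for v in members:
            membership[v].append(comm_idx)
    return membership

cover = [[0, 1, 2], [2, 3]]
membership = membership_from_cover(4, cover)   # computed once, up front
# ...metric loops then read membership[v] instead of
# recomputing the membership on every iteration
print(membership)  # -> [[0], [0], [0, 1], [1]]
```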

how to run

Could you please provide a way to run this module, i.e., what arguments should be passed from the command line?

Thanks

Attempt to Label Communities

Attempt to label communities based on attributes. Use attributes that are common in the community (relative to the overall graph) to label communities.

WalkTrap Complexity

On the research page, the complexity of the WalkTrap algorithm is listed as O(n^2). The abstract of the referenced paper lists the worst-case complexity as O(mn^2) and the expected complexity as O(n^2 log n).
