
Hy-MMSBM
Hypergraph Mixed-Membership Stochastic Block Model

Inference and benchmarking of higher-order community structure

License: MIT · Made with Python · Code style: black · ArXiv: 2301.11226 · ArXiv: 2212.08593

This repository contains the implementation of the Hy-MMSBM model presented in

   [1] Community Detection in Large Hypergraphs
       Nicolò Ruggeri, Martina Contisciani, Federico Battiston and Caterina De Bacco
       [Science Advances - ArXiv]

   [2] A framework to generate hypergraphs with community structure
       Nicolò Ruggeri, Federico Battiston and Caterina De Bacco
       [ArXiv]

These works present, respectively, efficient inference and sampling methods based on Hy-MMSBM, a stochastic block model for higher-order interactions.
This code is made available to the public; if you make use of it, please cite our work using the references above.

Code installation

The code was developed using Python 3.9, and can be downloaded and used locally as-is.
To install the necessary packages, run the following command:

pip install -r requirements.txt

Inference of community structure

The inference of the affinity matrix w and community assignments u is performed by running the code in main_inference.py.

The most basic run only needs a dataset, the number of communities K, and a path to store the results.
For example, to perform inference on the Senate Bills dataset with K=2 communities, one can run the following command:

python main_inference.py --K 2 --out_dir ./out_inference/senate_bills --real_dataset senate-bills

Input dataset format

The input dataset can be provided in three formats.

1. Real preprocessed dataset
We make available all the real datasets utilized in the experiments in [1]; see the data release section below.
After the download, any of these datasets can be utilized for inference by running the command above and specifying the dataset name using the --real_dataset argument.
The complete list of such datasets is available at src.data.data_io.PREPROCESSED_DATASETS.
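
For reference, the available names can also be printed directly from Python. A minimal sketch, assuming it is run from the repository root so that src is importable:

# Print the names of all preprocessed real datasets accepted by --real_dataset.
from src.data.data_io import PREPROCESSED_DATASETS

for name in sorted(PREPROCESSED_DATASETS):
    print(name)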

2. Text format
Alternatively, a hypergraph can be provided as input via two .txt files containing the list of hyperedges and their respective weights. This allows the user to provide arbitrary datasets as input. We provide examples of the expected format in ./data/examples/justice_dataset/hyperedges.txt and ./data/examples/justice_dataset/weights.txt. To perform inference on a dataset specified in text format, provide the paths to the two files:

python main_inference.py \
    --K 2 \
    --out_dir ./out_inference/justice \
    --hyperedge_file ./data/examples/justice_dataset/hyperedges.txt \
    --weight_file ./data/examples/justice_dataset/weights.txt

In this case, the command above is equivalent to running with --real_dataset justice.
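
As an illustration, the following sketch writes a toy hypergraph in this two-file format. The exact layout assumed here (one space-separated hyperedge per line, with the corresponding weight on the same line of the weight file) is an assumption; the example files above are the authoritative reference.

# Hypothetical toy example: write a hypergraph as hyperedge and weight files.
# Assumed layout: one space-separated hyperedge per line of hyperedges.txt,
# with the weight of the i-th hyperedge on the i-th line of weights.txt.
hyperedges = [(0, 1, 2), (1, 3), (0, 2, 3, 4)]
weights = [1, 2, 1]

with open("hyperedges.txt", "w") as hye_file, open("weights.txt", "w") as w_file:
    for hyperedge, weight in zip(hyperedges, weights):
        hye_file.write(" ".join(map(str, hyperedge)) + "\n")
        w_file.write(f"{weight}\n")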

3. Pickle format
Finally, one can provide a Hypergraph instance, which is the main representation utilized internally in the code (see src.data.representation), serialized via the pickle Python library.
An example equivalent to the above is

python main_inference.py \
    --K 2 \
    --out_dir ./out_inference/justice \
    --pickle_file ./data/examples/justice_dataset/justice.pkl

As with the text format, this allows the user to provide arbitrary hypergraphs as input.
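
As an illustration, the following sketch loads the example Hypergraph above and serializes it back to a new file, which can then be passed via --pickle_file. It assumes the script is run from the repository root, so that the Hypergraph class in src.data.representation can be resolved during unpickling:

import pickle

# Load the example Hypergraph instance shipped with the repository...
with open("./data/examples/justice_dataset/justice.pkl", "rb") as file:
    hypergraph = pickle.load(file)

# ...and serialize it to a new file, ready to be passed via --pickle_file.
with open("my_hypergraph.pkl", "wb") as file:
    pickle.dump(hypergraph, file)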

Additional options

Additional options can be specified; the full documentation is shown by running

python main_inference.py --help

Among the most important ones:

  • --assortative whether to run inference with a diagonal affinity matrix w.
  • --max_hye_size to keep only hyperedges up to a given size for inference. If None, all hyperedges are utilized.
  • --w_prior and --u_prior the rates for the exponential priors on the parameters. A value of zero is equivalent to no prior; any positive value results in MAP inference.
    For non-uniform priors, the path to a file containing a NumPy array can be specified, which will be loaded via numpy.load (see the sketch after this list).
  • --em_rounds the number of EM steps during optimization. Increasing it is sometimes useful when the model doesn't converge rapidly.
  • --training_rounds the number of models to train with different random initializations. The one with the highest log-likelihood is returned and saved.
  • --seed integer random seed.
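
As a sketch of the non-uniform prior format, the following saves an array of rates for w in NumPy's .npy format, so that it can be read back via numpy.load. The K x K shape and the specific rate values are assumptions chosen for illustration only:

# Save a hypothetical non-uniform exponential prior for the affinity matrix w.
import numpy as np

K = 2
w_prior_rates = np.ones((K, K))       # rate 1.0 on off-diagonal entries
np.fill_diagonal(w_prior_rates, 0.1)  # weaker rate on the diagonal
np.save("w_prior.npy", w_prior_rates)  # readable back via numpy.load

The resulting path can then be passed as --w_prior ./w_prior.npy.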

Sampling of synthetic data

Sampling of synthetic data from the generative model can be performed by running the code in main_sampling.py. As explained in the reference paper [2], various inputs can be provided to condition the sampling procedure.
We also refer to the notebook for additional examples.

Vanilla sampling

The simplest form of sampling only requires the affinity matrix w and the community assignments u. These need to be provided as text files readable via numpy.loadtxt, for example:

python main_sampling.py \
    --w ./data/examples/fixed_dimension_sequence_hard/K=2/w.txt \
    --u ./data/examples/fixed_dimension_sequence_hard/K=2/u.txt \
    --out_dir ./out_sampling
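
Such files can be produced with numpy.savetxt. A toy sketch, where the K x K shape of w, the N x K shape of u, and all values are arbitrary choices for illustration:

# Write toy parameter files in the plain-text format read via numpy.loadtxt.
import numpy as np

K, N = 2, 5  # number of communities and number of nodes (arbitrary)
w = np.array([[10.0, 1.0], [1.0, 10.0]])            # assumed K x K affinity matrix
u = np.random.default_rng(seed=123).random((N, K))  # assumed N x K assignments
np.savetxt("w.txt", w)
np.savetxt("u.txt", u)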

Degree and size sequences

Samples can also be conditioned on the degree sequence (i.e. the array specifying each node's degree), the size sequence (i.e. the count of hyperedges of any given size), or both.
These need to be stored in text files: the degree sequence via numpy.savetxt, and the size sequence in a text file specifying, on every line, a hyperedge size and the number of hyperedges of that size. Examples of these formats are given in ./data/examples at fixed_dimension_sequence_hard and justice_dataset.
The command line arguments are specified as follows:

python main_sampling.py \
    --w ./data/examples/fixed_dimension_sequence_hard/K=2/w.txt \
    --u ./data/examples/fixed_dimension_sequence_hard/K=2/u.txt \
    --out_dir ./out_sampling \
    --deg_seq ./data/examples/fixed_dimension_sequence_hard/K=2/deg_seq.txt \
    --dim_seq ./data/examples/fixed_dimension_sequence_hard/K=2/dim_seq.txt
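
As an illustration, the following sketch writes a toy degree sequence and size sequence in the formats described above. The "size count" column layout of the size-sequence file is an assumption; refer to the example files for the authoritative format.

# Write a toy degree sequence and size sequence for conditional sampling.
import numpy as np

# Degree sequence: one degree per node, saved via numpy.savetxt.
deg_seq = np.array([3, 2, 2, 1, 1])
np.savetxt("deg_seq.txt", deg_seq, fmt="%d")

# Size sequence: one line per hyperedge size, as "size count" pairs (assumed):
# here, three hyperedges of size 2 and one hyperedge of size 3.
with open("dim_seq.txt", "w") as file:
    for size, count in [(2, 3), (3, 1)]:
        file.write(f"{size} {count}\n")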

Notice that, when conditioning on both sequences and the sequences are extracted from an already available dataset, it is more convenient and computationally faster to provide the dataset directly, as explained in the following section.

Conditioning on a dataset

Conditioning the sampling on an existing dataset means having the MCMC procedure start from the dataset itself, and is statistically equivalent to conditioning on its degree and size sequences. However, it is computationally faster, since the sequences don't need to be arranged into an initial hypergraph proposal (see reference paper [2]).
The format for providing a dataset as input is the same as the one specified in the corresponding section of the inference procedure.

Additional arguments

Additional arguments that can be provided are:

  • --max_hye_size the maximum hyperedge size allowed.
  • --burn_in_steps and --intermediate_steps the number of MCMC steps to be performed before returning the first sample, and in between returned samples.
  • --exact_dyadic_sampling whether to sample dyadic interactions (i.e. standard edges between two nodes) exactly from their Poisson distribution, as opposed to approximate Central Limit Theorem-based sampling.
  • --allow_rescaling in case the degree or size sequences are provided, whether to rescale the model parameters so that the expected sequences match the input ones.
  • --n_samples number of generated samples.
  • --seed integer random seed.

Data release

All the preprocessed real datasets utilized in [1, 2], as well as some of the synthetic data generated for various experiments, are publicly available.
The data is stored and distributed via Edmond, the Open Research Data Repository of the Max Planck Society.

Alternatively, it can be downloaded directly (approximately 1.7GB compressed, 12.5GB uncompressed) by running the download script:

python download_data.py


