CGMM is a generative approach to learning contexts in graphs. It combines information diffusion and local computation through the use of a deep architecture and stationarity assumptions. The model does NOT preprocess the graph into a fixed structure before learning. Instead, it works with graphs of any size and shape while retaining scalability. Experiments show that this model performs well compared to expensive kernel methods, which extensively analyse the entire input structure in order to extract relevant features. In contrast, CGMM extracts more abstract features as the architecture is built incrementally.
We hope that the exploitation of the proposed framework, which can be extended in many directions, can contribute to the extensive use of both generative and discriminative approaches to the adaptive processing of structured data.
The library includes data and scripts to reproduce the tree/graph classification experiments reported in the paper describing the method.
This research software is provided as is. If you happen to use or modify this code, please remember to cite the foundation papers:
Please see the reference above.
Thanks to the amazing work of Daniele Atzeni we have dramatically increased the performance of bigram computation. With C=4, continuous posteriors and matrix operations in place of nested for loops, we have been able to get a speedup of 900x (yes.. 900x) on NCI1 with a single core. Bravo Daniele!
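To illustrate why matrix operations help here, the sketch below compares a nested-loop and a vectorised computation on a toy graph. It is not the repository's code. It assumes, for illustration only, that the bigram of a node is the outer product of its posterior with the sum of its neighbours' posteriors, flattened to C*C values. The function names `bigram_loops` and `bigram_vectorised` are hypothetical.

```python
import numpy as np


def bigram_loops(posteriors, adjacency):
    """Reference version with nested for loops (slow, for comparison only)."""
    n, c = posteriors.shape
    out = np.zeros((n, c, c))
    for u in range(n):
        for v in range(n):
            if adjacency[u, v]:
                for i in range(c):
                    for j in range(c):
                        out[u, i, j] += posteriors[u, i] * posteriors[v, j]
    return out.reshape(n, c * c)


def bigram_vectorised(posteriors, adjacency):
    """Same quantity computed with one matrix product and one outer product."""
    # Sum of neighbours' posteriors for every node: shape (n, c).
    neighbour_sum = adjacency @ posteriors
    # Per-node outer product, then flatten to (n, c * c).
    outer = np.einsum("ui,uj->uij", posteriors, neighbour_sum)
    return outer.reshape(len(posteriors), -1)


# Toy data: C = 4 states, 6 nodes, random undirected graph (all hypothetical).
rng = np.random.default_rng(0)
C = 4
n = 6
posteriors = rng.dirichlet(np.ones(C), size=n)  # each row sums to 1
adjacency = np.triu(rng.integers(0, 2, size=(n, n)), k=1)
adjacency = adjacency + adjacency.T  # symmetric, no self-loops

slow = bigram_loops(posteriors, adjacency)
fast = bigram_vectorised(posteriors, adjacency)
print(np.allclose(slow, fast))
```

The two functions return the same array; the vectorised one delegates the inner loops to NumPy, which is where a large speedup would typically come from.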
We refactored the whole repository to allow for easy experimentation with incremental architectures. New efficiency improvements are coming soon. Stay tuned!
We provide an extended and refactored version of CGMM, implemented in PyTorch. It includes additional experimental routines for some common graph classification tasks. Please refer to the "Paper Version" release tag for the original code of the paper.
We first need to create a data set. Let's try to parse NCI1:
python PrepareDatasets.py DATA --dataset-name NCI1
In the config file, specify node_type "discrete", as node features are represented as atom types.
For social datasets such as IMDB-BINARY:
python PrepareDatasets.py DATA --dataset-name IMDB-BINARY --use-degree
In the config file, specify node_type "continuous", as the degree should be treated as a continuous value.
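As a rough picture of what a degree-based node feature looks like, here is a minimal sketch. It assumes an undirected graph given as an edge list with each edge listed once. How `PrepareDatasets.py` computes the degree internally is not shown here, and the helper `degree_features` is hypothetical.

```python
from collections import Counter


def degree_features(num_nodes, edges):
    """Return one continuous feature per node: its degree as a float."""
    counts = Counter()
    for u, v in edges:
        # Each undirected edge contributes to the degree of both endpoints.
        counts[u] += 1
        counts[v] += 1
    return [[float(counts[node])] for node in range(num_nodes)]


# Toy graph with 4 nodes (hypothetical data).
edges = [(0, 1), (0, 2), (0, 3), (2, 3)]
features = degree_features(4, edges)
print(features)  # [[3.0], [1.0], [2.0], [2.0]]
```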
To replicate our experiments on graph classification, first modify the config_CGMM.yml file accordingly (use CGMM as model), then execute:
python Launch_Experiments.py --config-file config_CGMM.yml --inner-folds None --outer-folds 10 --inner-processes [processes to use for internal cross validation] --outer-processes [processes to use for external cross validation] --dataset [DATASET STRING]
By default, datasets are created to implement external 10-fold CV for model assessment, i.e. random splits between train and TEST, and an internal hold-out split of the training set (10% as VALIDATION set for model selection). If you change the number of data splits, you have to modify the --inner-folds and --outer-folds arguments accordingly. NOTE: a hold-out technique is associated with setting --inner-folds (or --outer-folds) to None. Reproducibility is not hampered by different random splits in our case.
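The default splitting scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the repository's splitting code: the function `make_splits` and its parameters are hypothetical, the splits are plain random (no stratification is assumed), and the validation set is taken as 10% of each outer training set.

```python
import random


def make_splits(n_samples, outer_folds=10, val_fraction=0.1, seed=0):
    """Outer k-fold CV with an inner hold-out split of each training set."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Partition the shuffled indices into `outer_folds` disjoint test folds.
    folds = [indices[k::outer_folds] for k in range(outer_folds)]
    splits = []
    for k, test in enumerate(folds):
        # Everything outside the current test fold is available for training.
        train_val = [i for j, fold in enumerate(folds) if j != k for i in fold]
        rng.shuffle(train_val)
        # Inner hold-out: reserve a fraction of the training data for validation.
        n_val = max(1, round(val_fraction * len(train_val)))
        splits.append({
            "train": train_val[n_val:],
            "validation": train_val[:n_val],
            "test": test,
        })
    return splits


splits = make_splits(n_samples=200)
first = splits[0]
print(len(splits), len(first["train"]), len(first["validation"]), len(first["test"]))
```

The test fold is used only for model assessment, while the validation set drives model selection within each outer fold.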
For node classification on PPI, use CGMMPPI in the config file instead of CGMM (to be refactored). In this case, you have to preprocess PPI before running with multiprocessing. You can do this by appending the --debug argument the very first time you try to train on PPI with CGMM.