pygod-team / pygod
A Python Library for Graph Outlier Detection (Anomaly Detection)
Home Page: https://pygod.org
License: BSD 2-Clause "Simplified" License
The benchmark paper compared the LOF method. Do you support this method, and is it compatible with the PyGOD framework?
Could you specify in this thread the exact dependencies that your implemented models use?
For instance,
DOMINANT:
When I use a 1080 Ti GPU, I run out of memory on all datasets except Cora, which is inconsistent with the description in the benchmark paper.
Describe the bug
Excellent work! But what confuses me is why training with the same parameters can still produce such different AUC results.
To Reproduce
For example, my AUC in one training run with DOMINANT is 0.83, but the next time it becomes 0.90. Why is the gap so big?
Hope to get your guidance.
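Run-to-run AUC variance like this usually comes from unseeded random sources (weight initialization, negative sampling, data shuffling). A minimal sketch of fixing the common seeds before each run (this narrows, but does not fully eliminate, the gap, since some GPU kernels are nondeterministic):

```python
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    """Fix all common RNG sources so repeated runs start identically."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


set_seed(42)
a = torch.rand(3)
set_seed(42)
b = torch.rand(3)
assert torch.equal(a, b)  # identical draws after re-seeding
```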
Is your feature request related to a problem? Please describe.
I tried to run benchmark scripts on my local after installing the repo but failed.
Here's how I setup the environment.
First, I ran
pip install -r requirements.txt
python setup.py install
to install the repo and its dependencies. However, the following errors were raised when I tried to run python main.py
under pygod/benchmark/
ModuleNotFoundError: No module named 'tqdm'
ModuleNotFoundError: No module named 'torch_geometric'
ModuleNotFoundError: No module named 'pyod'
ImportError: 'NeighborSampler' requires either 'pyg-lib' or 'torch-sparse'
AttributeError: 'DOMINANT' object has no attribute 'decision_scores_'. Did you mean: 'decision_score_'?
where the last one can be fixed by #80, but I still have to install the missing modules with commands
pip install tqdm
pip install torch
pip install torch-geometric
pip install pyod
pip install torch-sparse
pip install torch-scatter -f https://data.pyg.org/whl/torch-2.0.0+cpu.html
then I can run the benchmark script.
Describe the solution you'd like
It would be better to have a dedicated requirements.txt file inside the pygod/benchmark/ folder containing all the required dependencies.
Describe alternatives you've considered
N/A
Additional context
N/A
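As a sketch, such a benchmark-specific requirements file could simply list the packages that were missing in the errors above (exact version pins are an open question and would need to match the tested PyTorch build):

```text
tqdm
torch
torch-geometric
pyod
torch-sparse
torch-scatter
```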
Can you share examples of how to use the package to detect edge-level and subgraph-level anomalies?
In the paper, there is only an attribute decoder, but your code also has a structure decoder. Why? Thank you.
Describe the bug
When calling pygod.utils.load_data(), it sometimes returns the following error message:
ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Additional context
Please refer to 1 and 2 for potential fixing approaches.
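Since the remote server occasionally drops the connection, one possible workaround on the caller's side is a small retry wrapper (a sketch; the `retry` helper and its parameters are not part of PyGOD):

```python
import time


def retry(fn, attempts=3, delay=1.0, exceptions=(ConnectionError,)):
    """Call fn(), retrying on transient network errors with a short pause."""
    for i in range(attempts):
        try:
            return fn()
        except exceptions:
            if i == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(delay)


# usage sketch with the flaky download:
# from pygod.utils import load_data
# data = retry(lambda: load_data("weibo"))
```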
Describe the bug
The weibo dataset was retrieved via the load_data() method in PyGOD. ANEMONE and CoLA are in beta and are imported from pygod.models. When running the ANEMONE and CoLA methods on the weibo dataset, the average AUC-ROC score is less than 0.15 (ANEMONE: 0.0764±0.0273 (0.1391); CoLA: 0.0750±0.0192 (0.1442)).
To Reproduce
from pygod.models import ANEMONE, CoLA
from pygod.metrics import eval_roc_auc
from pygod.utils import load_data

model = ANEMONE()
data = load_data("weibo")
data.y = data.y.bool()
model.fit(data)
outlier_scores = model.decision_function(data)
auc_score = eval_roc_auc(data.y.numpy(), outlier_scores)
Default parameters are used for both models. The benchmark code from benchmark/main.py was also used with a few modifications; the hyperparameters that were changed are the learning rate and the hidden dimensions.
Expected behavior
I would expect AUC-ROC scores for ANEMONE and CoLA to be above 0.5, similar to the other datasets I ran the benchmark on (books, reddit, enron). They perform significantly worse on the weibo dataset.
Additional context
Applying the fix mentioned in #43 did not seem to change performance much.
Describe the bug
To Reproduce
Steps to reproduce the behavior:
cd benchmark
python main.py
Expected behavior
Within the main function, it should load the data with args.dataset. However, the path is incorrect (the default data path), and the cached data are not available for a fresh run.
Additional context
Quick fix:
Replace
Line 14 in b960645
with
from pygod.utils.utility import load_data
...
data = load_data(args.dataset)
Describe the bug
In the BOND paper, it is said that all the datasets are undirected, except Weibo.
Note that Weibo is a directed graph; the remaining datasets used in our benchmark are undirected graphs.
However, the load_data function returns directed PyG graphs (only "reddit" is undirected for some reason). Here is the output of the is_undirected method:
```
inj_cora False
inj_amazon False
inj_flickr False
weibo False
reddit True
disney False
books False
enron False
```
To Reproduce
Here is a colab notebook to reproduce the output above
https://colab.research.google.com/drive/1mNXh66Ac2hUduHvzKtGifC7_huBgCf-5?usp=sharing
Expected behavior
I expected the data to be consistent with what is stated in the paper. Please let me know if I misunderstood something or it's indeed a mistake. Thanks!
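Conceptually, a graph is undirected exactly when every directed edge (u, v) has its reverse (v, u). A minimal pure-Python equivalent of the check PyG's is_undirected performs (a sketch; real PyG edge_index tensors would also account for edge attributes):

```python
def is_undirected(edge_index):
    """True iff every edge (u, v) in the edge list has its reverse (v, u).

    edge_index is a pair of equal-length lists: sources and targets.
    """
    edges = set(zip(edge_index[0], edge_index[1]))
    return all((v, u) in edges for (u, v) in edges)


directed = [[0, 1], [1, 2]]              # 0->1 and 1->2, no reverse edges
undirected = [[0, 1, 1, 2], [1, 0, 2, 1]]  # every edge appears both ways
```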
We would like to start improving model scalability via PyTorch Lightning (https://www.pytorchlightning.ai/).
File "/hdisk2/pygod_benchmark/pygod/models/guide.py", line 158, in fit
x_, s_ = self.model(x, s, edge_index)
File "/hdisk2/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/hdisk2/pygod_benchmark/pygod/models/guide.py", line 369, in forward
s_ = self.struct_ae(s, edge_index)
File "/hdisk2/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/hdisk2/pygod_benchmark/pygod/models/guide.py", line 394, in forward
s = layer(s, edge_index)
File "/hdisk2/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/hdisk2/pygod_benchmark/pygod/models/guide.py", line 411, in forward
out = self.propagate(edge_index, s=self.w2(s))
File "/hdisk2/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/hdisk2/anaconda3/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 103, in forward
return F.linear(input, self.weight, self.bias)
File "/hdisk2/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 1848, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: expected scalar type Float but found Long
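The traceback above ends in a dtype mismatch: the structure matrix s reaches a Linear layer as an integer (Long) tensor, while the layer's weights are Float. A likely fix (a sketch, not the actual GUIDE patch) is casting once before the forward pass:

```python
import torch

s = torch.ones(4, 4, dtype=torch.long)  # e.g. an adjacency built from int counts
layer = torch.nn.Linear(4, 2)
# layer(s) would raise the RuntimeError above (Long input, Float weights)
out = layer(s.float())                  # cast once before the forward pass
```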
Describe the bug
The code does not show how to generate injected datasets like inj_cora.pt.
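PyGOD does ship outlier generators (pygod.generator.gen_contextual_outliers and gen_structural_outliers appear in other issues here); what follows is only a simplified pure-Python sketch of the structural-injection idea, where n groups of m nodes are fully connected into cliques and labeled as outliers (function name and signature are my own, not PyGOD's):

```python
import random


def inject_structural_outliers(num_nodes, edges, m, n, seed=0):
    """Pick n disjoint groups of m nodes and fully connect each group.

    Returns the augmented edge list and a 0/1 outlier label per node.
    Mirrors the idea behind gen_structural_outliers, heavily simplified.
    """
    rng = random.Random(seed)
    labels = [0] * num_nodes
    new_edges = list(edges)
    chosen = rng.sample(range(num_nodes), m * n)  # nodes to make outliers
    for g in range(n):
        group = chosen[g * m:(g + 1) * m]
        for u in group:
            labels[u] = 1
            for v in group:
                if u != v:
                    new_edges.append((u, v))  # directed clique edges
    return new_edges, labels
```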
Hi, thank you for this awesome package. I am working with heterogeneous and knowledge graphs. For example, if I use the famous MovieLens dataset and construct a heterogeneous graph, can I feed it to model.fit(data)?
Describe the bug
Hi, except with the GCNAE model, I keep running into out-of-memory issues with the other models, even when setting the batch size to a very low value. It always tries to allocate around 600 GiB for a batch with around 400k nodes.
RuntimeError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_17284\2902826942.py in <module>
5
6 model = AnomalyDAE(gpu=0, batch_size=8, verbose=True, contamination=0.05)
----> 7 model.fit(batch)
~\anaconda3\lib\site-packages\pygod\models\anomalydae.py in fit(self, G, y_true)
143 """
144 G.node_idx = torch.arange(G.x.shape[0])
--> 145 G.s = to_dense_adj(G.edge_index)[0]
146
147 # automated balancing by std
~\anaconda3\lib\site-packages\torch_geometric\utils\to_dense_adj.py in to_dense_adj(edge_index, batch, edge_attr, max_num_nodes)
46 size = [batch_size, max_num_nodes, max_num_nodes]
47 size += list(edge_attr.size())[1:]
---> 48 adj = torch.zeros(size, dtype=edge_attr.dtype, device=edge_index.device)
49
50 flattened_size = batch_size * max_num_nodes * max_num_nodes
RuntimeError: CUDA out of memory. Tried to allocate 597.53 GiB (GPU 0; 16.00 GiB total capacity; 1.32 GiB already allocated; 12.78 GiB free; 1.35 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
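The allocation size follows directly from the to_dense_adj call in the traceback: a dense float32 adjacency matrix is quadratic in the number of nodes, so batch size has no effect once the full graph is densified. Back-of-the-envelope arithmetic reproduces the error's figure:

```python
n = 400_000                  # nodes in the graph being densified
bytes_needed = n * n * 4     # float32 dense adjacency from to_dense_adj
gib = bytes_needed / 2**30   # ~596 GiB, matching the 597.53 GiB in the error
```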
save and load the pretrained models for future usage.
cora is still large. Once that is slimmed down, remove rmtree in the teardown of the unit test.
I run main.py multiple times with DOMINANT (from https://github.com/pygod-team/pygod/tree/main/benchmark).
I find that although the hyperparameter setting is consistent with the BOND paper (https://arxiv.org/pdf/2206.10071.pdf), the results on inj_cora (AUC: 0.7566±0.0332 (0.7751)) and inj_amazon (AUC: 0.7147±0.0006 (0.7152)) are significantly different from those in Table 3 of the paper, which are 82.7±5.6 (84.3) on inj_cora and 81.3±1.0 (82.2) on inj_amazon.
Is there any advice that you can provide about how to reproduce the results of the BOND paper?
Is it possible to build a metrics tracker inside the logger function?
I am looking for visualization tools to track the loss or scores during each epoch, and I found the embedded logger function that can print that information. Do you think it is possible to add a metrics-tracking function inside fit()/logger()?
BR
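As a sketch of what such a tracker might look like (class and method names are my own, not PyGOD's), the fit loop would call log() once per epoch and the history could then be plotted or exported:

```python
class MetricTracker:
    """Collect per-epoch metric values so they can be inspected after fit()."""

    def __init__(self):
        self.history = {}

    def log(self, name, value):
        """Append one value to the named metric's time series."""
        self.history.setdefault(name, []).append(value)


# usage sketch: inside a training loop
tracker = MetricTracker()
for epoch, loss in enumerate([0.9, 0.5, 0.3]):
    tracker.log("loss", loss)
```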
Hi,
Given a graph object in the prediction API, what do the outlier labels, described as outlier_labels (numpy array of shape (n_samples,)), indicate from a graph perspective?
Does a 1 or 0 in the numpy array indicate whether a node in the graph is anomalous or normal? For example, Labels: [0 0 0 ... 0 0 0]. Does each 0 value pertain to a node in the graph?
So, how should this prediction output be interpreted from a graph perspective? Thanks in advance.
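Assuming the labels follow the usual outlier-detection convention (one entry per node, in node-index order, 1 = outlier, 0 = normal; this is an interpretation, not a quote from the PyGOD docs), mapping labels back to node indices is a one-liner:

```python
import numpy as np

labels = np.array([0, 0, 1, 0, 1])          # one entry per node, index order
anomalous_nodes = np.where(labels == 1)[0]  # node indices flagged as outliers
```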
File "/hdisk2/pygod_benchmark/pygod/models/mlpae.py", line 137, in fit
self._process_decision_scores()
File "/hdisk2/pygod_benchmark/pygod/models/base.py", line 278, in _process_decision_scores
100 * (1 - self.contamination))
File "<__array_function__ internals>", line 6, in percentile
File "/hdisk2/anaconda3/lib/python3.7/site-packages/numpy/lib/function_base.py", line 3733, in percentile
a, q, axis, out, overwrite_input, interpolation, keepdims)
File "/hdisk2/anaconda3/lib/python3.7/site-packages/numpy/lib/function_base.py", line 3853, in _quantile_unchecked
interpolation=interpolation)
File "/hdisk2/anaconda3/lib/python3.7/site-packages/numpy/lib/function_base.py", line 3404, in _ureduce
a = np.asanyarray(a)
File "/hdisk2/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py", line 136, in asanyarray
return array(a, dtype, copy=False, order=order, subok=True)
File "/hdisk2/anaconda3/lib/python3.7/site-packages/torch/_tensor.py", line 678, in __array__
return self.numpy()
RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.
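As the error message itself suggests, the tensor must be detached from the autograd graph before conversion. A minimal reproduction and fix (a sketch; in the traceback above the cast happens inside np.percentile on self.decision_scores_):

```python
import torch

scores = torch.rand(5, requires_grad=True)
# scores.numpy() raises: Can't call numpy() on Tensor that requires grad
arr = scores.detach().numpy()  # break the autograd link first, then convert
```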
The Flickr dataset used in some graph outlier detection papers is the Flickr dataset in PyG, but the inj_flickr dataset implemented in your library is a very different dataset. I hope you can use the correct dataset.
Hi, the wide collection of unsupervised algorithms is amazing. But if there aren't sufficient examples on tuning them, other developers may never use them.
I am planning to use these algorithms on publicly available graphs and write tutorials on them.
I have extensive experience in deep learning but not in graph neural networks. I can pull this off with a sufficient amount of help on the underlying algorithms.
Some models, like GUIDE, take a long time to process the graph; caching the processed result can largely decrease the running time.
When running model.fit (on GPU), I received the following error:
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 64])
Any suggestion on how to get rid of this problem when using the GPU would be appreciated.
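This particular ValueError comes from PyTorch's BatchNorm, which cannot compute batch statistics from a single sample in training mode; it typically appears when the last mini-batch has size 1. A minimal reproduction and two possible workarounds (dropping the incomplete batch in the loader, or switching to eval mode, shown here as a sketch):

```python
import torch

bn = torch.nn.BatchNorm1d(64)
x = torch.randn(1, 64)  # a batch of size 1
# bn(x) in training mode raises:
#   ValueError: Expected more than 1 value per channel when training
bn.eval()               # or pass drop_last=True to the DataLoader instead
out = bn(x)             # eval mode uses running stats, so batch size 1 is fine
```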
The code for masking the target nodes is wrong. The target node is the first node in the subgraph after the random-walk sample, while you mask the last node. The performance of CoLA and ANEMONE improves by 2% after fixing this bug.
Wrong code in CoLA (lines 361~364):
batch_feature = torch.cat(
(batch_feature[:, :-1, :],
added_feat_zero_row,
batch_feature[:, -1:, :]), dim=1)
Correct code:
batch_feature = torch.cat(
(added_feat_zero_row,
batch_feature[:, 1:, :],
batch_feature[:, 0:1, :]), dim=1)
Wrong code in ANEMONE (lines 288~289 and 429~430):
bf = torch.cat(
(bf[:, :-1, :], added_feat_zero_row, bf[:, -1:, :]), dim=1)
Correct code:
bf = torch.cat(
(added_feat_zero_row, bf[:, 1:, :], bf[:, 0 : 1, :]), dim=1)
It is unclear from the README or the documentation whether one can perform outlier detection without having any labels. The Blitz Intro in the docs makes it clear that it works for supervised learning, but what about out-of-the-box unsupervised outlier/anomaly detection?
@kayzliu
When I wrote the DOMINANT example, I found the following issues. Please fix/answer them accordingly.
1. The process_graph function is dedicated to the BlogCatalog dataset; we need to write a general dataloader that can handle any PyG data object. The preprocessing code for BlogCatalog can be put into dominant.py under /example.
2. After calling model.fit(), train_loss became NaN after 5-6 epochs.
Describe the bug
Line 38 in 987776c
Traceback (most recent call last):
File "main.py", line 78, in
main(args)
File "main.py", line 49, in main
"Recall: {:.4f}±{:.4f} ({:.4f})".format(torch.mean(auc),
TypeError: mean(): argument 'input' (position 1) must be Tensor, not list
To Reproduce
Steps to reproduce the behavior:
just run python main.py --model dominant --dataset inj_cora
from benchmark
Expected behavior
After running python main.py --model dominant --dataset inj_cora, it should show the following result:
100%|█████████████████████████████████████████████| 20/20 [05:44<00:00, 17.22s/it]
inj_cora DOMINANT AUC: 0.7666±0.0013 (0.7676) AP: 0.1830±0.0015 (0.1842) Recall: 0.2819±0.0032 (0.2899)
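The TypeError above arises because torch.mean expects a Tensor, not a plain Python list of per-run scores. One likely fix (a sketch; the auc variable name follows the traceback) is converting the list to a tensor first:

```python
import torch

auc = [0.75, 0.77, 0.76]              # per-run scores collected in a Python list
mean = torch.mean(torch.tensor(auc))  # torch.mean needs a Tensor, not a list
std = torch.std(torch.tensor(auc))
```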
Replicate by running examples/anomalydae.py. Please make sure the example can run :)
Predicting for probability:
Traceback (most recent call last):
File "C:/Users/yuezh/PycharmProjects/pygod/examples/anomalydae.py", line 39, in
prob = model.predict_proba(data)
File "C:\Users\yuezh\PycharmProjects\pygod\pygod\models\base.py", line 176, in predict_proba
test_scores = self.decision_function(G)
File "C:\Users\yuezh\PycharmProjects\pygod\pygod\models\anomalydae.py", line 360, in decision_function
A_hat, X_hat = self.model(attrs, adj)
File "C:\Users\yuezh\Anaconda3\envs\torch19\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\yuezh\PycharmProjects\pygod\pygod\models\anomalydae.py", line 169, in forward
A_hat, embed_x = self.structure_AE(x, edge_index)
File "C:\Users\yuezh\Anaconda3\envs\torch19\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\yuezh\PycharmProjects\pygod\pygod\models\anomalydae.py", line 70, in forward
embed_x = self.attention_layer(x, edge_index)
File "C:\Users\yuezh\Anaconda3\envs\torch19\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\yuezh\Anaconda3\envs\torch19\lib\site-packages\torch_geometric\nn\conv\gat_conv.py", line 230, in forward
num_nodes=num_nodes)
File "C:\Users\yuezh\Anaconda3\envs\torch19\lib\site-packages\torch_geometric\utils\loop.py", line 144, in add_self_loops
edge_index = torch.cat([edge_index, loop_index], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 0. Got 2 and 2708 (The offending index is 0)
Is your feature request related to a problem? Please describe.
For now, some detectors (e.g., GUIDE) have two separate autoencoders for attributes and structure, but the two autoencoders share the same hidden-layer dimension. In many cases, there is a significant difference between the dimension of the node attributes and the dimension of the structure information (e.g., the adjacency matrix). Using the same hidden dimension may hamper the performance of the detectors.
Describe the solution you'd like
Enable different hidden dimensions for the attribute autoencoder and the structure autoencoder.
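As a sketch of the requested change (class name, parameter names, and widths are my own illustration, not GUIDE's actual architecture), the two branches simply take independent hidden widths:

```python
import torch
import torch.nn as nn


class TwoHeadAE(nn.Module):
    """Attribute and structure autoencoders with independent hidden widths:
    hid_a for the attribute branch, hid_s for the structure branch."""

    def __init__(self, in_attr, num_nodes, hid_a=64, hid_s=256):
        super().__init__()
        self.attr_ae = nn.Sequential(
            nn.Linear(in_attr, hid_a), nn.ReLU(), nn.Linear(hid_a, in_attr))
        self.struct_ae = nn.Sequential(
            nn.Linear(num_nodes, hid_s), nn.ReLU(), nn.Linear(hid_s, num_nodes))

    def forward(self, x, s):
        # reconstruct attributes and structure with separate capacities
        return self.attr_ae(x), self.struct_ae(s)
```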
When reading the paper and the PyGOD code, I found a problem in how some algorithms reconstruct structural information: the structure decoder reconstructs the adjacency matrix as A_hat = sigmoid(Z Z^T), where Z is the node embedding we have learned. Since sigmoid(z_i^T z_i) > 0, the diagonal of A_hat is always nonzero, while the raw adjacency matrix A has a zero diagonal. So I think we should add a self-loop on A before computing the reconstruction error.
In the PyGOD code, I haven't found this consideration. I modified the code of DOMINANT in this way and found performance improvements on some datasets.
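The mismatch can be seen numerically: the diagonal of sigmoid(Z Z^T) is strictly positive, while a self-loop-free adjacency has a zero diagonal, so adding identity to A aligns the targets (a numpy sketch of the argument, not the DOMINANT patch itself):

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


Z = np.random.randn(5, 8)   # learned node embeddings
A_hat = sigmoid(Z @ Z.T)    # reconstructed adjacency; diagonal is always > 0
A = np.zeros((5, 5))        # raw adjacency has a zero diagonal
A_loop = A + np.eye(5)      # add self-loops before computing the error
```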
any thoughts for speeding this up?
Hi, could you please provide a function that returns the trained node embeddings, so that I can feed the embeddings to a machine learning classifier such as SVM?
Best wishes!
Modify the data processing in the GUIDE model to reduce the package's dependencies.
We can add a tutorial in the document about loading data from numpy, scipy, matlab, networkx and other common data formats. Some of the data loaders can be found in https://pytorch-geometric.readthedocs.io/en/latest/modules/utils.html
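For one of those formats, the conversion is short enough to sketch here: a scipy sparse matrix in COO form already holds the row/column index pairs that make up a PyG-style edge_index of shape (2, num_edges). (This is an illustration for the proposed tutorial, not existing PyGOD code; torch_geometric.utils also ships ready-made converters such as from_networkx.)

```python
import numpy as np
import scipy.sparse as sp

A = sp.coo_matrix(np.array([[0, 1],
                            [1, 0]]))       # toy undirected adjacency
edge_index = np.vstack([A.row, A.col])      # shape (2, num_edges)
```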
Something like this https://github.com/yzhao062/pyod/blob/master/pyod/test/test_data.py would be helpful. We could check at least the shape of the generated data.
The test coverage report can be found here: https://coveralls.io/github/pygod-team/pygod?branch=main
I think for now everything is handled as a full graph. Do we need to add functions for batch operations or samplers?
Running examples\adone.py for replication
C:\Users\yuezh\Anaconda3\envs\torch19\python.exe C:/Users/yuezh/PycharmProjects/pygod/examples/adone.py
training...
Traceback (most recent call last):
File "C:/Users/yuezh/PycharmProjects/pygod/examples/adone.py", line 35, in
model.fit(data)
File "C:\Users\yuezh\PycharmProjects\pygod\pygod\models\adone.py", line 158, in fit
act=self.act).to(self.device)
File "C:\Users\yuezh\PycharmProjects\pygod\pygod\models\adone.py", line 331, in __init__
act=act)
TypeError: __init__() got an unexpected keyword argument 'in_channels'
How do I divide the training set and the testing set if I want to get the results in the paper?
Describe the bug
Hi, I am trying to run example of PyGOD in a subprocess and it does not work for me
To Reproduce
from torch.multiprocessing import Process
import torch_geometric.transforms as T
from torch_geometric.datasets import Planetoid
import torch
from pygod.generator import gen_contextual_outliers, gen_structural_outliers
from pygod.utils import load_data
from pygod.models import AnomalyDAE

def f(data):
    model = AnomalyDAE()
    print('started model fitting')
    model.fit(data)
    print('model fit successful')

if __name__ == '__main__':
    data = Planetoid('./data/Cora', 'Cora', transform=T.NormalizeFeatures())[0]
    data, ya = gen_contextual_outliers(data, n=100, k=50)
    data, ys = gen_structural_outliers(data, m=10, n=10)
    data.y = torch.logical_or(ys, ya).int()
    data = load_data('inj_cora')
    data.y = data.y.bool()
    p = Process(target=f, args=(data,))
    p.start()
    p.join()
Expected behavior
The model should fit, but it does not fit for me when run in a subprocess.
Describe the bug
The current dependencies include installing an external library: argparse.
It must be noted that argparse is part of the core Python libraries. There is no need to install it like other external libraries.
argparse: https://docs.python.org/3/library/argparse.html
🔥 EDIT: The library (argparse) that you are installing from PyPI is no longer maintained, as it is now part of standard Python 3. See my comment here.
See further details:
Lines 1 to 5 in d037b67
I would like to know whether the current algorithms work on graph-level tasks or node-level tasks.