
pygod's People

Contributors

aha12345678, ahmed3amerai, canyuchen, cshjin, kaize0409, kayzliu, oldpanda, parthapratimbanik, xiyanghu, xyvivian, yingtongdou, yzhao062, zhiming-xu

pygod's Issues

About node embedding function

Hi, could you please provide a function that returns the trained node embeddings, so that I can feed the embeddings into a machine learning classifier such as an SVM?

Best wishes!
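A hedged sketch of one way to pull intermediate node embeddings out of a fitted detector with a forward hook; the attribute name model.model and the submodule name shared_encoder are assumptions to verify against the actual model:

import torch

embeddings = {}

def save_embedding(module, inputs, output):
    # Stash whatever the hooked layer produces during a forward pass.
    embeddings['z'] = output.detach()

# 'shared_encoder' is a placeholder name; run print(model.model) after
# fit() to find the submodule that emits the node embeddings.
handle = model.model.shared_encoder.register_forward_hook(save_embedding)
model.decision_function(data)   # any forward pass triggers the hook
handle.remove()

z = embeddings['z'].cpu().numpy()   # (n_nodes, hid_dim), ready for an SVM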

Inconsistency with BOND paper

I ran main.py multiple times with DOMINANT (from https://github.com/pygod-team/pygod/tree/main/benchmark).
Although the hyperparameter settings are consistent with the BOND paper (https://arxiv.org/pdf/2206.10071.pdf), the results on inj_cora (AUC: 0.7566±0.0332 (0.7751)) and inj_amazon (AUC: 0.7147±0.0006 (0.7152)) differ significantly from Table 3 of that paper, which reports 82.7±5.6 (84.3) on inj_cora and 81.3±1.0 (82.2) on inj_amazon.
Is there any advice you can provide on how to reproduce the results of the BOND paper?

Query on Anomaly Prediction and Outlier labels

Hi,

Given a graph object passed to the prediction API, what do the outlier labels, described as outlier_labels (numpy array of shape (n_samples,)), indicate from a graph perspective?

Does a 1 or 0 in the numpy array mark a node in the graph as anomalous or normal? For example, given labels [0 0 0 ... 0 0 0], does each 0 correspond to one node in the graph?

How should this prediction output be interpreted from a graph perspective? Thanks in advance.
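A minimal sketch of mapping the prediction output back to nodes, reusing the load_data/fit/predict pattern seen elsewhere in these issues (following the PyOD convention, 1 marks a predicted outlier):

import numpy as np
from pygod.models import DOMINANT
from pygod.utils import load_data

data = load_data('inj_cora')    # dataset choice is illustrative
model = DOMINANT()
model.fit(data)

labels = model.predict(data)                 # shape (n_nodes,), one label per node
anomalous_nodes = np.where(labels == 1)[0]   # indices of nodes flagged as outliers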

Enabling different hidden dimension for attribute autoencoder and structure autoencoder

Is your feature request related to a problem? Please describe.
For now, some detectors (e.g., GUIDE) have two separate autoencoders, one for attributes and one for structure, but the two autoencoders share the same hidden-layer dimension. In many cases there is a significant difference between the dimension of the node attributes and the dimension of the structure information (e.g., the adjacency matrix), so using the same hidden dimension may hamper the detectors' performance.

Describe the solution you'd like
Enable different hidden dimensions for the attribute autoencoder and the structure autoencoder, e.g. via separate parameters as sketched below.
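A hypothetical sketch of what such an interface could look like; the parameter names hid_a and hid_s are illustrative, not an existing API:

from pygod.models import GUIDE

# Hypothetical: hid_a sizes the attribute autoencoder,
# hid_s sizes the structure autoencoder.
model = GUIDE(hid_a=64, hid_s=8)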

Dominant model data loading and training problems

@kayzliu
While writing the DOMINANT example, I found the following issues. Please fix/answer them accordingly.

  1. The current process_graph function is dedicated to the BlogCatalog dataset; we need a general dataloader that can handle any PyG data object (see the sketch after this list). The BlogCatalog-specific preprocessing can move into dominant.py under /example.
  2. When I run model.fit(), train_loss becomes NaN after 5-6 epochs.
  3. How is the outlier label of BlogCatalog generated?
  4. Should we train the model on clean data and evaluate it on data with outliers?
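A minimal sketch of a dataset-agnostic process_graph, assuming the model only needs a dense adjacency matrix and node indices alongside the standard PyG fields:

import torch
from torch_geometric.utils import to_dense_adj

def process_graph(data):
    # Works for any PyG Data object: derive the dense adjacency
    # from edge_index instead of dataset-specific files.
    data.s = to_dense_adj(data.edge_index)[0]
    data.node_idx = torch.arange(data.x.shape[0])
    return data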

BOND data possible inconsistency

Describe the bug
In the BOND paper, it is said that all the datasets are undirected, except Weibo.

Note that Weibo is a directed graph; the remaining datasets used in our benchmark are undirected graphs.

However, the load_data function returns directed PyG graphs (only "reddit" is undirected, for some reason). Here is the output of the is_undirected method:

inj_cora False
inj_amazon False
inj_flickr False
weibo False
reddit True
disney False
books False
enron False

To Reproduce
Here is a colab notebook to reproduce the output above
https://colab.research.google.com/drive/1mNXh66Ac2hUduHvzKtGifC7_huBgCf-5?usp=sharing

Expected behavior
I expected the data to be consistent with what is stated in the paper. Please let me know if I misunderstood something or it's indeed a mistake. Thanks!

adone gets an unexpected keyword argument

Running examples\adone.py for replication

C:\Users\yuezh\Anaconda3\envs\torch19\python.exe C:/Users/yuezh/PycharmProjects/pygod/examples/adone.py
training...
Traceback (most recent call last):
  File "C:/Users/yuezh/PycharmProjects/pygod/examples/adone.py", line 35, in <module>
    model.fit(data)
  File "C:\Users\yuezh\PycharmProjects\pygod\pygod\models\adone.py", line 158, in fit
    act=self.act).to(self.device)
  File "C:\Users\yuezh\PycharmProjects\pygod\pygod\models\adone.py", line 331, in __init__
    act=act)
TypeError: __init__() got an unexpected keyword argument 'in_channels'

benchmark/main.py torch.mean(auc) throws error

Describe the bug

"Recall: {:.4f}±{:.4f} ({:.4f})".format(torch.mean(auc),

the above code throws the following error:

Traceback (most recent call last):
  File "main.py", line 78, in <module>
    main(args)
  File "main.py", line 49, in main
    "Recall: {:.4f}±{:.4f} ({:.4f})".format(torch.mean(auc),
TypeError: mean(): argument 'input' (position 1) must be Tensor, not list

To Reproduce
Steps to reproduce the behavior:
Just run python main.py --model dominant --dataset inj_cora from the benchmark directory.

Expected behavior
After running python main.py --model dominant --dataset inj_cora, it should show the following result:

100%|█████████████████████████████████████████████| 20/20 [05:44<00:00, 17.22s/it]
inj_cora DOMINANT AUC: 0.7666±0.0013 (0.7676) AP: 0.1830±0.0015 (0.1842) Recall: 0.2819±0.0032 (0.2899)
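A plausible fix, assuming auc is accumulated as a plain Python list of floats over the 20 trials: convert it to a tensor before aggregating.

auc = torch.tensor(auc)   # torch.mean/std/max require a Tensor, not a list
print("Recall: {:.4f}±{:.4f} ({:.4f})".format(
    torch.mean(auc), torch.std(auc), torch.max(auc)))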

Desktop (please complete the following information):

  • OS: Windows 10
  • PyGOD Version 1.0.0
  • GPU: NVIDIA GeForce GTX 1050

batch operation?

I think everything is currently handled as a full graph. Do we need to add functions for batch operations or samplers?

data shape issue in anomalydae

Replicate by running examples/anomalydae.py.

Please make sure the example can run :)

predicting for probability
Traceback (most recent call last):
  File "C:/Users/yuezh/PycharmProjects/pygod/examples/anomalydae.py", line 39, in <module>
    prob = model.predict_proba(data)
  File "C:\Users\yuezh\PycharmProjects\pygod\pygod\models\base.py", line 176, in predict_proba
    test_scores = self.decision_function(G)
  File "C:\Users\yuezh\PycharmProjects\pygod\pygod\models\anomalydae.py", line 360, in decision_function
    A_hat, X_hat = self.model(attrs, adj)
  File "C:\Users\yuezh\Anaconda3\envs\torch19\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\yuezh\PycharmProjects\pygod\pygod\models\anomalydae.py", line 169, in forward
    A_hat, embed_x = self.structure_AE(x, edge_index)
  File "C:\Users\yuezh\Anaconda3\envs\torch19\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\yuezh\PycharmProjects\pygod\pygod\models\anomalydae.py", line 70, in forward
    embed_x = self.attention_layer(x, edge_index)
  File "C:\Users\yuezh\Anaconda3\envs\torch19\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\yuezh\Anaconda3\envs\torch19\lib\site-packages\torch_geometric\nn\conv\gat_conv.py", line 230, in forward
    num_nodes=num_nodes)
  File "C:\Users\yuezh\Anaconda3\envs\torch19\lib\site-packages\torch_geometric\utils\loop.py", line 144, in add_self_loops
    edge_index = torch.cat([edge_index, loop_index], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 0. Got 2 and 2708 (The offending index is 0)

`load_data` error in benchmark

Describe the bug
Running the benchmark on a fresh checkout fails: main.py loads the dataset from a hard-coded default data path where no cached data exists.

To Reproduce
Steps to reproduce the behavior:

  1. To run the benchmark, cd benchmark/.
  2. Run python main.py there.

Expected behavior
Within the main function, the data for args.dataset should be loaded. However, the path is incorrect (a hard-coded default data path), and no cached data is available on a fresh run.

Screenshots
(screenshot of the error omitted)

Desktop (please complete the following information):

  • Linux

Additional context

Quick fix:
Replace

data = torch.load('data/' + args.dataset + '.pt')

with

from pygod.utils import load_data
...
data = load_data(args.dataset)

Degraded performance of ANEMONE and CoLA on weibo dataset

Describe the bug
The weibo dataset was retrieved as provided by the load_data() method in PyGOD. ANEMONE and CoLA are in beta and are called from pygod.models. When running the ANEMONE and CoLA methods on the weibo dataset, the average AUCROC score is less than 0.15 (ANEMONE: 0.0764±0.0273 (0.1391); CoLA: 0.0750±0.0192 (0.1442)).

To Reproduce

from pygod.models import ANEMONE, CoLA
from pygod.metrics import eval_roc_auc
from pygod.utils import load_data

model = ANEMONE()
data = load_data("weibo")
data.y = data.y.bool()
model.fit(data)
outlier_scores = model.decision_function(data)
auc_score = eval_roc_auc(data.y.numpy(), outlier_scores)

Default parameters are used for the two models. The benchmark code from benchmark/main.py was also used with a few modifications; only the learning rate and hidden dimensions were changed.

Expected behavior
I would expect AUCROC scores for ANEMONE and CoLA to be above 0.5, as on the other datasets I ran the benchmark on (books, reddit, enron); performance is significantly worse on the weibo dataset.

Additional context
Applying the fix mentioned in #43 did not seem to change performance much.

Connection Error when calling pygod.utils.load_data()

Describe the bug
When calling pygod.utils.load_data(), it sometimes returns the following error message:
ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
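A hedged client-side workaround sketch, assuming the error bubbles up from requests inside the downloader: retry with backoff, since the remote end only drops connections intermittently.

import time
from requests.exceptions import ConnectionError
from pygod.utils import load_data

for attempt in range(5):
    try:
        data = load_data('inj_cora')   # dataset name illustrative
        break
    except ConnectionError:
        time.sleep(2 ** attempt)       # back off, then retry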

Additional context
Please refer to 1 and 2 for potential fixing approaches.

dependency of models

Could you specify in this thread the exact dependencies that each of your implemented models uses?

For instance,

dominant:

  • XXX>=0.3.2

same parameters but result varies

Describe the bug
Excellent work! What I am confused about is why, when training with the same parameters, the resulting AUC can still vary so much.

To Reproduce
For example, my AUC in one training run with DOMINANT is 0.83, but the next time it becomes 0.90. Why is the gap so big?
Hope to get your guidance.
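Much of the gap typically comes from unseeded random initialization and sampling. A minimal sketch of pinning the seeds before each run, which narrows (but may not fully remove) run-to-run variance:

import random
import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed()
# ...then build and fit the detector as usual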

Unclear if pygod is for supervised outlier detection only

It is unclear from the README or from the documentation whether one can perform outlier detection without having any labels. The Blitz Intro in the docs makes it clear that supervised learning works, but what about out-of-the-box unsupervised outlier/anomaly detection?
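For reference, the detectors in these issues are fitted without labels; labels show up only when evaluating. A minimal unsupervised sketch, reusing the API seen elsewhere in this thread:

from pygod.models import DOMINANT
from pygod.utils import load_data

data = load_data('inj_cora')
model = DOMINANT()
model.fit(data)                          # no labels consumed for training
scores = model.decision_function(data)   # higher score = more anomalous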

Hard to run benchmark scripts directly

Is your feature request related to a problem? Please describe.
I tried to run the benchmark scripts locally after installing the repo, but failed.

Here's how I set up the environment.

First, I ran

  1. pip install -r requirements.txt
  2. python setup.py install

to install the repo and its dependencies. However, the following errors appeared when I tried to run python main.py under pygod/benchmark/:

  • ModuleNotFoundError: No module named 'tqdm'
  • ModuleNotFoundError: No module named 'torch_geometric'
  • ModuleNotFoundError: No module named 'pyod'
  • ImportError: 'NeighborSampler' requires either 'pyg-lib' or 'torch-sparse'
  • AttributeError: 'DOMINANT' object has no attribute 'decision_scores_'. Did you mean: 'decision_score_'?

where the last one can be fixed by #80, but I still had to install the missing modules manually:

pip install tqdm
pip install torch
pip install torch-geometric
pip install pyod
pip install torch-sparse
pip install torch-scatter -f https://data.pyg.org/whl/torch-2.0.0+cpu.html

Only then could I run the benchmark script.

Describe the solution you'd like
It would be better to have a dedicated requirements.txt file inside the folder pygod/benchmark/ containing all the required dependencies (see the sketch below).
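A sketch of what such a benchmark-specific requirements file might contain, based only on the modules listed above (pins omitted; torch-sparse/torch-scatter may still need the PyG wheel index):

# pygod/benchmark/requirements.txt (illustrative)
tqdm
torch
torch-geometric
pyod
torch-sparse
torch-scatter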

Describe alternatives you've considered
N/A

Additional context
N/A

LOF model

The benchmark paper compares against the LOF method. Do you support this method, and is it compatible with the PyGOD framework?

OCGNN code issue

  • The model runs 3× the default number of epochs during training.
  • In the fit() function, the epoch and loss printout should move inside the verbose condition.
  • The current performance looks odd on the Wiki and Cora datasets (see the shared document).
  • Unused code at lines 236, 260-263.
  • Correct the torch.Tensor type hint in the docstrings.

Is there any method to track metrics like loss?

Is it possible to build a metrics tracker into the logger function?
I am looking for visualization tools to track the loss or scores at each epoch, and I found that the embedded logger function can print this information. Do you think it is possible to add metrics tracking inside fit()/logger()?

BR

MLPAE bug when contamination=0.03 is set during model initialization

File "/hdisk2/pygod_benchmark/pygod/models/mlpae.py", line 137, in fit
    self._process_decision_scores()
  File "/hdisk2/pygod_benchmark/pygod/models/base.py", line 278, in _process_decision_scores
    100 * (1 - self.contamination))
  File "<__array_function__ internals>", line 6, in percentile
  File "/hdisk2/anaconda3/lib/python3.7/site-packages/numpy/lib/function_base.py", line 3733, in percentile
    a, q, axis, out, overwrite_input, interpolation, keepdims)
  File "/hdisk2/anaconda3/lib/python3.7/site-packages/numpy/lib/function_base.py", line 3853, in _quantile_unchecked
    interpolation=interpolation)
  File "/hdisk2/anaconda3/lib/python3.7/site-packages/numpy/lib/function_base.py", line 3404, in _ureduce
    a = np.asanyarray(a)
  File "/hdisk2/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py", line 136, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
  File "/hdisk2/anaconda3/lib/python3.7/site-packages/torch/_tensor.py", line 678, in __array__
    return self.numpy()
RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.

ONE does not accept negative values

It appears that ONE throws an error if the input x contains negative values. If this is expected, we should probably document it somewhere.

(screenshot of the error omitted)

Check pygod/test/test_one.py.
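A possible workaround sketch, assuming ONE indeed requires non-negative input: min-max scale the node features into [0, 1] before fitting.

x = data.x
x_min = x.min(dim=0).values
x_max = x.max(dim=0).values
data.x = (x - x_min) / (x_max - x_min + 1e-12)   # epsilon avoids division by zero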

A problem about structural reconstruction

While reading the paper and the PyGOD code, I found a problem in how some algorithms reconstruct the structural information:

$$ \hat{A}=\sigma(\pmb z\pmb z^T) $$

where $z$ is the matrix of node embeddings we have learned, $\sigma$ is the sigmoid function, and $\hat{A}$ is the reconstructed adjacency matrix. One term of the objective function is

$$ \Vert A-\hat{A}\Vert_F^2 $$

where $A$ is the adjacency matrix. But note that the diagonal elements of $\hat{A}$ are close to 1, because

$$ \hat{A}_{ii}=\sigma(z_iz_i^T)=\sigma(\Vert z_i\Vert^2)\ge 0.5 $$

So I think we should add self-loops to $A$ for the reconstruction target:

$$ \Vert(A+I)-\hat{A}\Vert_F^2 $$

I have not found this handled in the PyGOD code. I modified the DOMINANT code in this way and observed a performance improvement on some datasets.
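A minimal sketch of the proposed change, assuming the reconstruction target is built with to_dense_adj as in the rest of the codebase: add the identity to the target adjacency so the near-1 diagonal of $\sigma(zz^T)$ is no longer penalized.

import torch
from torch_geometric.utils import to_dense_adj

s = to_dense_adj(data.edge_index)[0]     # dense adjacency A
s_target = s + torch.eye(s.shape[0])     # reconstruct A + I instead of A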

remove external (non-core-python) library `argparse` as a dependency

Describe the bug

The current dependencies include installing an external library: argparse.

It must be noted that argparse is part of the core Python libraries; there is no need to install it like other external libraries.

🔥 EDIT: The argparse package you are installing from PyPI is no longer maintained, as it is now part of standard python3. See my comment here.

See further details:

argparse>=1.4.0
numpy>=1.19.4
scikit-learn>=0.22.1
scipy>=1.5.2
setuptools>=50.3.1.post20201107

Problem with CoLA and ANEMONE models.

The code for masking the target node is wrong. The target node is the first node in the subgraph produced by the RandomWalk sampler, but the code masks the last node. Fixing this bug improves the performance of CoLA and ANEMONE by about 2%.

Wrong code in CoLA (lines 361~364):

batch_feature = torch.cat(
    (batch_feature[:, :-1, :],
     added_feat_zero_row,
     batch_feature[:, -1:, :]), dim=1)

Correct code:

batch_feature = torch.cat(
    (added_feat_zero_row,
     batch_feature[:, 1:, :],
     batch_feature[:, 0:1, :]), dim=1)

Wrong code in ANEMONE (lines 288~289 and 429~430):

bf = torch.cat(
    (bf[:, :-1, :], added_feat_zero_row, bf[:, -1:, :]), dim=1)

Correct code:

bf = torch.cat(
    (added_feat_zero_row, bf[:, 1:, :], bf[:, 0:1, :]), dim=1)

GUIDE Bug on Cora dataset

File "/hdisk2/pygod_benchmark/pygod/models/guide.py", line 158, in fit
    x_, s_ = self.model(x, s, edge_index)
  File "/hdisk2/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/hdisk2/pygod_benchmark/pygod/models/guide.py", line 369, in forward
    s_ = self.struct_ae(s, edge_index)
  File "/hdisk2/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/hdisk2/pygod_benchmark/pygod/models/guide.py", line 394, in forward
    s = layer(s, edge_index)
  File "/hdisk2/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/hdisk2/pygod_benchmark/pygod/models/guide.py", line 411, in forward
    out = self.propagate(edge_index, s=self.w2(s))
  File "/hdisk2/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/hdisk2/anaconda3/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
  File "/hdisk2/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 1848, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: expected scalar type Float but found Long
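The trace shows an integer-typed structure matrix reaching a Linear layer. A standalone reproduction of the dtype mismatch and the obvious cast (shapes illustrative):

import torch

lin = torch.nn.Linear(128, 64)
s_long = torch.randint(0, 5, (2708, 128))   # Long tensor, as in the trace
# lin(s_long)               # raises the same 'expected scalar type Float' error
out = lin(s_long.float())   # casting the input to float resolves it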

Out of memory

Describe the bug
Hi, with every model except GCNAE I keep running into out-of-memory issues, even when setting the batch size to a very low value. It always tries to allocate around 600 GiB for a graph with around 400k nodes.

RuntimeError                              Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_17284\2902826942.py in <module>
      5 
      6 model = AnomalyDAE(gpu=0, batch_size=8, verbose=True, contamination=0.05)
----> 7 model.fit(batch)

~\anaconda3\lib\site-packages\pygod\models\anomalydae.py in fit(self, G, y_true)
    143         """
    144         G.node_idx = torch.arange(G.x.shape[0])
--> 145         G.s = to_dense_adj(G.edge_index)[0]
    146 
    147         # automated balancing by std

~\anaconda3\lib\site-packages\torch_geometric\utils\to_dense_adj.py in to_dense_adj(edge_index, batch, edge_attr, max_num_nodes)
     46     size = [batch_size, max_num_nodes, max_num_nodes]
     47     size += list(edge_attr.size())[1:]
---> 48     adj = torch.zeros(size, dtype=edge_attr.dtype, device=edge_index.device)
     49 
     50     flattened_size = batch_size * max_num_nodes * max_num_nodes

RuntimeError: CUDA out of memory. Tried to allocate 597.53 GiB (GPU 0; 16.00 GiB total capacity; 1.32 GiB already allocated; 12.78 GiB free; 1.35 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
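The allocation size is consistent with materializing a dense n×n adjacency for the whole graph: fit() calls to_dense_adj(G.edge_index) before any batching, so batch_size cannot reduce it. A quick check (float32, n ≈ 400k):

n = 400_000
print(n * n * 4 / 2**30)   # ≈ 596 GiB for a float32 dense adjacency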

Add tutorials for hyperparameter tuning

Hi, the wide collection of unsupervised algorithms is amazing. But if there aren't sufficient examples on tuning them, other developers may never use it.

I am planning to use these algorithms on publicly available graphs and to write tutorials on them.

I have substantial experience in deep learning but not in graph neural networks; I can pull this off with a sufficient amount of help on the underlying algorithms.

Pygod does not work in a subprocess

Describe the bug
Hi, I am trying to run a PyGOD example in a subprocess, and it does not work for me.

To Reproduce

from torch.multiprocessing import Process

import torch_geometric.transforms as T
from torch_geometric.datasets import Planetoid

import torch
from pygod.generator import gen_contextual_outliers, gen_structural_outliers
from pygod.utils import load_data
from pygod.models import AnomalyDAE



def f(data):
    model = AnomalyDAE()
    print('started model fitting')
    model.fit(data)
    print('model fit successful')

if __name__ == '__main__':
    data = Planetoid('./data/Cora', 'Cora', transform=T.NormalizeFeatures())[0]
    data, ya = gen_contextual_outliers(data, n=100, k=50)
    data, ys = gen_structural_outliers(data, m=10, n=10)
    data.y = torch.logical_or(ys, ya).int()

    data = load_data('inj_cora')
    data.y = data.y.bool()
    p = Process(target=f, args=(data,))
    p.start()
    p.join()

Expected behavior

The model should fit inside the subprocess, but it does not.

Desktop (please complete the following information):

  • OS: all os and systems
  • python: 3.8
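One hedged thing to try (an assumption, not verified against this issue): force the 'spawn' start method, since the default 'fork' on Linux frequently breaks torch workloads in child processes.

from torch.multiprocessing import Process, set_start_method

if __name__ == '__main__':
    set_start_method('spawn')   # must run before any Process is created
    # ...then build the data and start the Process as above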
