
seal_ogb's Introduction

SEAL_OGB -- An Implementation of SEAL for OGB Link Prediction Tasks

About

This repository supports the following paper:

M. Zhang, P. Li, Y. Xia, K. Wang, and L. Jin, Labeling Trick: A Theory of Using Graph Neural Networks for Multi-Node Representation Learning. [PDF]

SEAL is a GNN-based link prediction method. It first extracts a k-hop enclosing subgraph for each target link, then applies a labeling trick named Double Radius Node Labeling (DRNL) to give each node an integer label as an additional feature. Finally, the labeled enclosing subgraphs are fed to a graph neural network to predict link existence.
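To make DRNL concrete, here is a minimal sketch of the labeling formula, assuming dist_x and dist_y are a node's shortest-path distances to the two target nodes inside the enclosing subgraph (function and variable names are illustrative, not this repository's API; in the actual implementation each distance is computed with the other target temporarily masked out):

import math

def drnl_label(dist_x, dist_y):
    # Double Radius Node Labeling: hash the pair of distances to the two
    # target nodes into one integer. Nodes unreachable from either target
    # get label 0; the two target nodes themselves get label 1.
    if math.isinf(dist_x) or math.isinf(dist_y):
        return 0
    d = dist_x + dist_y
    return 1 + min(dist_x, dist_y) + (d // 2) * (d // 2 + d % 2 - 1)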

This repository reimplements SEAL with the PyTorch-Geometric library and tests SEAL on the Open Graph Benchmark (OGB) datasets. SEAL ranked 1st on 3 out of 4 link prediction datasets in the OGB leaderboard at the time of submission. It additionally supports Planetoid-like datasets, such as Cora, CiteSeer, and PubMed, where a random 0.85/0.05/0.1 split and the AUC metric are used. Using a custom dataset is also easy: just replace the Planetoid dataset with your own.

              ogbl-ppa       ogbl-collab    ogbl-ddi      ogbl-vessel    ogbl-citation2
Val results   51.25%±2.52%*  64.95%±0.43%*  28.49%±2.69%  80.53%±0.22%*  87.57%±0.31%*
Test results  48.80%±3.16%*  64.74%±0.43%*  30.56%±3.86%  80.50%±0.21%*  87.67%±0.32%*

* State-of-the-art results; the evaluation metrics are Hits@100, Hits@50, Hits@20, ROC-AUC, and MRR, respectively. For ogbl-collab, we have switched to the new rule: after all hyperparameters are determined on the validation set, we include the validation edges in the training graph and retrain before reporting the test performance. ogbl-citation2 is an updated version of the deprecated ogbl-citation.

The original implementation of SEAL is here.

The original paper of SEAL is:

M. Zhang and Y. Chen, Link Prediction Based on Graph Neural Networks, Advances in Neural Information Processing Systems (NIPS-18). [PDF]

This repository also implements some other labeling tricks, such as Distance Encoding (DE) and Zero-One (ZO), and supports combining labeling tricks with different GNNs, including GCN, GraphSAGE and GIN.

Requirements

Latest tested combination: Python 3.8.5 + PyTorch 1.6.0 + PyTorch_Geometric 1.6.1 + OGB 1.2.4. For ogbl-vessel, we tested SEAL with the following combination: Python 3.8.10 + PyTorch 1.12.1 + PyTorch_Geometric 2.0.4 + OGB 1.3.4.

Install PyTorch

Install PyTorch_Geometric

Install OGB

Other required Python libraries include numpy, scipy, tqdm, etc.

Usage

ogbl-ppa

python seal_link_pred.py --dataset ogbl-ppa --num_hops 1 --use_feature --use_edge_weight --eval_steps 5 --epochs 20 --dynamic_train --dynamic_val --dynamic_test --train_percent 5 

ogbl-collab

python seal_link_pred.py --dataset ogbl-collab --num_hops 1 --train_percent 15 --hidden_channels 256 --use_valedges_as_input

According to OGB, this dataset allows including validation links in training once all the hyperparameters are finalized on the validation set. Thus, you should first tune your hyperparameters without "--use_valedges_as_input", and then append "--use_valedges_as_input" to your final command once all the hyperparameters are determined. See issue.

ogbl-ddi

python seal_link_pred.py --dataset ogbl-ddi --num_hops 1 --ratio_per_hop 0.2 --use_edge_weight --eval_steps 1 --epochs 10 --dynamic_val --dynamic_test --train_percent 1 

For the above three datasets, append "--runs 10" to repeat the experiment 10 times and report the average results.

ogbl-vessel

python seal_link_pred.py --dataset ogbl-vessel --use_feature --num_hops 1 --num_layers 2 --lr 0.001 

ogbl-citation2

python seal_link_pred.py --dataset ogbl-citation2 --num_hops 1 --use_feature --use_edge_weight --eval_steps 1 --epochs 10 --dynamic_train --dynamic_val --dynamic_test --train_percent 2 --val_percent 1 --test_percent 1

Because this dataset uses mean reciprocal rank (MRR) as the evaluation metric, where each positive testing link is ranked against 1000 random negative ones, it requires extracting 1001 enclosing subgraphs for every testing link. This is very time consuming. Thus, the above command uses "--val_percent 1" and "--test_percent 1" to evaluate on only 1% of the validation and test data, giving a fast unbiased estimate of the true MRR. To get the true MRR, change them to "--val_percent 100" and "--test_percent 100". Also, because this dataset is expensive to evaluate, we first train 10 models with 1% validation data in parallel, record the best epoch's model from each run, and then evaluate all 10 best models together using the "--test_multiple_models --val_percent 100 --test_percent 100" option. This option evaluates multiple pretrained models with a single subgraph extraction pass per link, thus avoiding extracting subgraphs for the testing links 10 separate times. You need to specify your pretrained model paths in "seal_link_pred.py".
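For reference, the MRR itself is easy to compute once the scores are in hand; below is a hedged sketch (tensor names are illustrative, not the repository's variables):

import torch

def mrr(pos_score, neg_scores):
    # pos_score: [num_links]; neg_scores: [num_links, 1000] -- the score of
    # each positive test link and of its 1000 random negatives
    rank = 1 + (neg_scores >= pos_score.unsqueeze(1)).sum(dim=1)  # 1 = best
    return (1.0 / rank.float()).mean().item()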

Cora

python seal_link_pred.py --dataset Cora --num_hops 3 --use_feature --hidden_channels 256 --runs 10

CiteSeer

python seal_link_pred.py --dataset CiteSeer --num_hops 3 --hidden_channels 256 --runs 10

PubMed

python seal_link_pred.py --dataset PubMed --num_hops 3 --use_feature --dynamic_train --runs 10

For all datasets, if you specify "--dynamic_train", the enclosing subgraphs of the training links will be extracted on the fly instead of being preprocessed and saved to disk; similarly for "--dynamic_val" and "--dynamic_test". You can increase "--num_workers" to accelerate the dynamic subgraph extraction.

If your dataset is large, the default train/val/test split function might run out of memory (OOM). In this case, you can add "--fast_split" to do a fast split, which cannot guarantee that edges (i, j) and (j, i) won't both appear in the negative links, but scales much better.
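To see why this scales, here is a rough sketch of sampled negative generation (illustrative only, not the exact "--fast_split" code): candidate pairs are drawn at random and rejected against the observed edges, instead of materializing a dense num_nodes x num_nodes negative mask.

import torch

def sample_negatives(edge_index, num_nodes, num_samples):
    existing = set(map(tuple, edge_index.t().tolist()))
    neg = []
    while len(neg) < num_samples:
        i, j = torch.randint(0, num_nodes, (2,)).tolist()
        # only the sampled orientation (i, j) is checked, so (j, i) may also
        # be drawn as a negative -- the caveat mentioned above
        if i != j and (i, j) not in existing:
            neg.append((i, j))
    return torch.tensor(neg).t()  # shape [2, num_samples]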

Other labeling tricks

By default, SEAL uses the DRNL labeling trick. You can alternatively use other labeling tricks such as DE (distance encoding), DE+, or ZO (zero-one labeling) by appending "--node_label de", "--node_label de+", or "--node_label zo".
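For comparison with the DRNL sketch in the About section, the simpler tricks can be sketched as follows (illustrative code under the same distance inputs):

def zo_label(is_target):
    # Zero-One: the two target nodes get 1, every other node gets 0
    return 1 if is_target else 0

def de_label(dist_x, dist_y):
    # Distance Encoding: keep the raw pair of distances to the two targets
    # as the node's extra feature instead of hashing them into one integer
    return (dist_x, dist_y)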

Heuristic methods

This repository also implements two link prediction heuristics: Common Neighbor (CN) and Adamic-Adar (AA), which turn out to outperform many GNN methods on ogbl-ppa and ogbl-collab, surprisingly. An example usage of Common Neighbor is:

python seal_link_pred.py --use_heuristic CN --dataset ogbl-ppa
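As a hedged sketch of what these two heuristics compute on a symmetric 0/1 adjacency matrix (scipy-based, with illustrative names; the repository's implementation may differ in detail):

import numpy as np
from scipy.sparse import csr_matrix

def cn_scores(A, edges):
    # CN(x, y) = |N(x) ∩ N(y)|: the dot product of adjacency rows x and y
    return np.asarray(A[edges[:, 0]].multiply(A[edges[:, 1]]).sum(axis=1)).ravel()

def aa_scores(A, edges):
    # AA(x, y) = sum over common neighbors w of 1 / log(deg(w)); a true
    # common neighbor always has degree >= 2, so the clamp below only
    # avoids log(1) = 0 for isolated or degree-1 nodes
    deg = np.asarray(A.sum(axis=0)).ravel()
    w = 1.0 / np.log(np.maximum(deg, 2))
    W = A.multiply(w.reshape(1, -1)).tocsr()
    return np.asarray(A[edges[:, 0]].multiply(W[edges[:, 1]]).sum(axis=1)).ravel()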

License

SEAL_OGB is released under an MIT license. Find out more about it here.

Reference

If you find the code useful, please cite our papers.

@article{zhang2021labeling,
  title={Labeling Trick: A Theory of Using Graph Neural Networks for Multi-Node Representation Learning},
  author={Zhang, Muhan and Li, Pan and Xia, Yinglong and Wang, Kai and Jin, Long},
  journal={Advances in Neural Information Processing Systems},
  volume={34},
  year={2021}
}

@inproceedings{zhang2018link,
  title={Link prediction based on graph neural networks},
  author={Zhang, Muhan and Chen, Yixin},
  booktitle={Advances in Neural Information Processing Systems},
  pages={5165--5175},
  year={2018}
}

Muhan Zhang, Facebook AI

10/13/2020


seal_ogb's Issues

evaluation on citation2

Hi, thanks for providing the code of SEAL! I have some concerns about the evaluation on ogbl-citation2, which has specific negative samples for each positive sample in the validation and test sets. According to the code, we use a dataloader with num_workers > 1 to get positive/negative samples, but can we guarantee that the sampled data stay in the original order? I noticed that shuffle=False in the dataloader, but the original order is not guaranteed when samples from multiple worker processes are combined.

If the order is somehow random, how can we evaluate the performance of citation2?

Query regarding `--use_valedges_as_input`

Hi Dr. Zhang,

Hope you are doing good.

I had a query regarding the use of validation edges as input, which is useful in the ogbl-collab dataset as per the OGB rules. I noticed that when the flag --use_valedges_as_input is True, we have the following code snippet:

if args.use_valedges_as_input:
    val_edge_index = split_edge['valid']['edge'].t()
    if not directed:
        val_edge_index = to_undirected(val_edge_index)
    data.edge_index = torch.cat([data.edge_index, val_edge_index], dim=-1)
    val_edge_weight = torch.ones([val_edge_index.size(1), 1], dtype=int)
    data.edge_weight = torch.cat([data.edge_weight, val_edge_weight], 0)

But since, after hyperparameter tuning, we are allowed to use validation edges in training, can't we also update split_edge as:

split_edge['train']['edge'] = torch.cat([data.edge_index, val_edge_index], dim=-1)

I was wondering if SEAL would benefit from this change/if this change makes sense. Any input is greatly appreciated!

Warm Regards,
Paul

can you teach me how to predict links in graphs?

Hi Professor Zhang, I can't find the SEAL code for the link prediction task. If I use a model to get graph embeddings via an unsupervised learning algorithm, what is the proper way to do the downstream link prediction task? Much appreciated!

Running Issues of citation dataset

Hi,

I'm using the default settings to run the model on the citation dataset. I tried twice, but each time the code got stuck after the first run. The log is attached below:

Results will be saved in results/ogbl-citation_20201224232705
Command line input: python seal_link_pred.py --dataset ogbl-citation --num_hops 1 --use_feature --use_edge_weight --eval_steps 1 --epochs 10 --dynamic_train --dynamic_val --dynamic_test --train_percent 2 --val_percent 1 --test_percent 1 is saved.
Total number of parameters is 260802
SortPooling k is set to 115
100%|███████████████████████████| 37985/37985 [34:46<00:00, 18.20it/s]
100%|███████████████████████████| 27059/27059 [23:07<00:00, 19.50it/s]
100%|███████████████████████████| 27059/27059 [23:08<00:00, 19.49it/s]
/home/***/anaconda3/envs/dlg_env/lib/python3.8/site-packages/ogb/linkproppred/evaluate.py:235: UserWarning: This overload of nonzero is deprecated:
    nonzero()
Consider using one of the following signatures instead:
    nonzero(*, bool as_tuple) (Triggered internally at /opt/conda/conda-bld/pytorch_1603729009598/work/torch/csrc/utils/python_arg_parser.cpp:882.)
  ranking_list = (argsort == 0).nonzero()
MRR
Run: 01, Epoch: 01, Loss: 0.0910, Valid: 87.25%, Test: 84.76%
  0%| | 1/37985 [00:01<19:04:56, 1.81s/it]

It simply stops over there, no crash or exit.

SEAL - Utilizing multiple edge features

Hello SEAL Team,

thank you very much for the great implementation!

I noticed that you can incorporate node features and edge weights into the SEAL learning process (--use_feature and --use_edge_weight). Dealing with a single edge weight seems pretty straightforward to me, but have you thought about incorporating additional edge features (not only one weight) into SEAL? Is there a possibility of doing this as well?

Can I learn the embedding of individual nodes?

Hi @muhanzhang,

I am a Master's student at NCCU. I saw your SEAL method on the OGB link property prediction task, and I am interested in it. I have a small question; I hope you can resolve my doubt.

I have read your code and understand that SEAL learns an embedding for the subgraph constructed around each pair of nodes. However, I want to know: can this method learn embeddings of individual nodes rather than embeddings of edges? Or can it only be used for learning pairs of nodes (i.e., edges)?

Ability to use your own dataset

Hi @muhanzhang, thanks for sharing this great utility. I managed to run the script on the default dataset 'ogbl-collab', but the point is to use our own datasets. It seems that SEALDataset requires the dataset object to contain a dictionary with split edges, i.e., the split_edge variable:

dataset = PygLinkPropPredDataset(name=args.dataset)
data = dataset[0]
split_edge = dataset.get_edge_split()

train_dataset = eval('SEALDataset')(
    '/.',
    data,
    split_edge,
    num_hops=args.num_hops,
    percent=args.train_percent,
    split='train',
    use_coalesce=use_coalesce,
    node_label=args.node_label,
    ratio_per_hop=args.ratio_per_hop,
    max_nodes_per_hop=args.max_nodes_per_hop,
)

I looked into the code of PygLinkPropPredDataset from OGB; however, the get_edge_split() method just loads the already preprocessed train, test, and val splits. Could you please modify the script so that we users can process our own datasets from a networkx graph object, or give a hint about which splitter utility to use?
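A possible splitter to start from could be PyG's RandomLinkSplit; below is a sketch of repackaging its output into the split_edge layout the script expects (the key names follow the OGB convention; the details are an assumption, not the repository's own utility):

import torch_geometric.transforms as T

transform = T.RandomLinkSplit(num_val=0.05, num_test=0.10,
                              is_undirected=True, split_labels=True)
train_data, val_data, test_data = transform(data)  # data: any PyG Data object

split_edge = {
    'train': {'edge': train_data.pos_edge_label_index.t()},
    'valid': {'edge': val_data.pos_edge_label_index.t(),
              'edge_neg': val_data.neg_edge_label_index.t()},
    'test':  {'edge': test_data.pos_edge_label_index.t(),
              'edge_neg': test_data.neg_edge_label_index.t()},
}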

Question on output

Hi M. Zhang

I'm a Master's student and I'm currently studying this area. Can I ask you a couple of questions?

I cloned your repo and added a sigmoid to the last layer. I ran a small sample and this is my command line:

python seal_link_pred.py --dataset PubMed --batch_size 12 --train_percent 0.1 --val_percent 0.1 --test_percent 0.1 --num_hops 3 --use_feature --epochs 20 --dynamic_train --dynamic_val --dynamic_test --only_test

And I got an output like this

tensor([[0.4867],
[0.4875],
[0.4858],
[0.4848],
[0.4854],
[0.4865],
[0.4855],
[0.4850]])

Can you tell me more about what the output means, and how I can map these results back to the predicted links?

the reason for using 'mod'

In the function 'drnl_node_labeling', why did you use 'dist_over_2, dist_mod_2 = dist // 2, dist % 2'? What role does it play in the label 'z'?
Thanks!

command with RuntimeError

When running the following command, an error occurs:

python seal_link_pred.py --dataset ogbl-citation2 --num_hops 1 --use_feature --use_edge_weight --eval_steps 1 --epochs  3 --dynamic_train --dynamic_val --dynamic_test --train_percent 2 --val_percent 1 --test_percent 1

error:

 python seal_link_pred.py --dataset ogbl-citation2 --num_hops 1 --use_feature --use_edge_weight --eval_steps 1 --epochs  3 --dynamic_train --dynamic_val --dynamic_test --train_percent 2 --val_percent 1 --test_percent 1
Results will be saved in results/ogbl-citation2_20230821205719
Command line input: python seal_link_pred.py --dataset ogbl-citation2 --num_hops 1 --use_feature --use_edge_weight --eval_steps 1 --epochs 3 --dynamic_train --dynamic_val --dynamic_test --train_percent 2 --val_percent 1 --test_percent 1
 is saved.
Total number of parameters is 260802
SortPooling k is set to 115
  0%|                                       | 0/37985 [00:00<?, ?it/s]
Results will be saved in results/ogbl-citation2_20230821205742
Command line input: python E:\Notes\PyNote\GNN\seal\SEAL_OGB-main\seal_link_pred.py --dataset ogbl-citation2 --num_hops 1 --use_feature --use_edge_weight --eval_steps 1 --epochs 3 --dynamic_train --dynamic_val --dynamic_test --train_percent 2 --val_percent 1 --test_percent 1
 is saved.
Total number of parameters is 260802
SortPooling k is set to 115
  0%|                                       | 0/37985 [00:00<?, ?it/s]

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\admin\anaconda3\Lib\multiprocessing\spawn.py", line 120, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\admin\anaconda3\Lib\multiprocessing\spawn.py", line 129, in _main
    prepare(preparation_data)
  File "C:\Users\admin\anaconda3\Lib\multiprocessing\spawn.py", line 240, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users\admin\anaconda3\Lib\multiprocessing\spawn.py", line 291, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen runpy>", line 291, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "E:\Notes\PyNote\GNN\seal\SEAL_OGB-main\seal_link_pred.py", line 700, in <module>
    loss = train()
           ^^^^^^^
  File "E:\Notes\PyNote\GNN\seal\SEAL_OGB-main\seal_link_pred.py", line 164, in train
    for data in pbar:
  File "C:\Users\admin\anaconda3\Lib\site-packages\tqdm\std.py", line 1178, in __iter__
    for obj in iterable:
  File "C:\Users\admin\anaconda3\Lib\site-packages\torch\utils\data\dataloader.py", line 441, in __iter__
    return self._get_iterator()
           ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\admin\anaconda3\Lib\site-packages\torch\utils\data\dataloader.py", line 388, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\admin\anaconda3\Lib\site-packages\torch\utils\data\dataloader.py", line 1042, in __init__
    w.start()
  File "C:\Users\admin\anaconda3\Lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "C:\Users\admin\anaconda3\Lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\admin\anaconda3\Lib\multiprocessing\context.py", line 336, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "C:\Users\admin\anaconda3\Lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\admin\anaconda3\Lib\multiprocessing\spawn.py", line 158, in get_preparation_data
    _check_not_importing_main()
  File "C:\Users\admin\anaconda3\Lib\multiprocessing\spawn.py", line 138, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

Here pbar refers to pbar = tqdm(train_loader, ncols=70).
The command corresponding to the ogbl-ppa dataset also encounters the same error. Only ogbl-collab is working fine.

-- update --
It seems the parameters --dynamic_train --dynamic_val --dynamic_test cause this problem.
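For anyone hitting this on Windows: the traceback itself points at the standard fix, i.e. guarding the script's entry point so that spawn-based DataLoader workers can re-import the module safely (a sketch; seal_link_pred.py's top-level code would need to move into such a function):

def main():
    # illustrative -- wrap the script's top-level training loop here
    ...

if __name__ == '__main__':
    # spawn-based multiprocessing (the default on Windows) re-imports this
    # module in each worker; the guard keeps training from re-running there
    main()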

Problems using custom dataset

Hi!

Thank you very much for providing the code for your impressive paper and the possibility of testing SEAL for custom datasets.
At the moment, I would like to use the dynamic SEAL variant for testing my large custom dataset.

Running:
seal_link_pred.py --dataset MyGraph --num_hops 1 --dynamic_train --dynamic_val --dynamic_test

fails with an out-of-memory error:

../anaconda3/envs/pytorch_geometric/lib/python3.8/site-packages/torch_geometric/utils/train_test_split_edges.py", line 50, in train_test_split_edges
    neg_adj_mask = torch.ones(num_nodes, num_nodes, dtype=torch.uint8)
RuntimeError: [enforce fail at CPUAllocator.cpp:65] . DefaultCPUAllocator: can't allocate memory

When SEAL automatically tries to split the edges, my script crashes:
path = osp.join('dataset', args.dataset)
dataset = Planetoid(path, args.dataset) # replaced by my dataset
split_edge = do_edge_split(dataset)
data = dataset[0]
data.edge_index = split_edge['train']['edge'].t()

What would be the best way to mitigate this? Since we need split_edge later, I wanted to ask for the most convenient way of handling this.

Thank you!

Test with custom dataset

Hi @muhanzhang,

I'm working with a custom dataset; I trained this model and tested it successfully. But now I want to test on a custom edge list. This work calculates enclosing subgraphs for each link; how can I run the test on my custom edge list?

Regarding the question of changing the output link score to link probability.

Hi, I read your paper Link Prediction Based on Graph Neural Networks. This is great work!
Currently I'm running experiments on my own dataset with your code. The output I want is the probability that a given link exists. I pass the last-dimensional output of the DGCNN model through the sigmoid function, as shown below.
# MLP
x = F.relu(self.lin1(x))
x = F.dropout(x, p=0.5, training=self.training)
x = self.lin2(x)
x = torch.sigmoid(x)
I got this result: [screenshot of output scores omitted]
In the additional experiments in your paper, you also performed mean precision comparisons.
I'm not sure whether you set the classification threshold to 0.5 for that part of the experiment, i.e., after the sigmoid, values greater than 0.5 are true edges and values less than 0.5 are false edges. Or do you have other tricks?
I saw a similar answer in a previous issue, but I'm more interested in how the precision experiments in your paper were done.
I would appreciate it if you could answer my question!

Tensor size mismatch error when dynamic options are used

Hi @muhanzhang,

I've used this model for link prediction on a few datasets. Recently, when I run it on two of my new datasets with the dynamic run-type options, i.e. --dynamic_train --dynamic_val --dynamic_test, I get the following error at run time. I should mention that without those options, i.e. with the in-memory run type, it runs successfully. Do you know what causes the error and how I can fix it?

File "/nfstmp/.../seal_link_pred.py", line 708, in
model.load_state_dict(torch.load(model_name))
File "/nfstmp/.../python3.7/site-packages/torch/nn/modules/module.py", line 1052, in load_state_dict
self.class.name, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for DGCNN:
size mismatch for lin1.weight: copying a param with shape torch.Size([128, 800]) from checkpoint, the shape in current model is torch.Size([128, 1344]).

IndexError: too many indices for tensor of dimension 1

Thanks for sharing your great code!
I'm trying to run your sample code:

python seal_link_pred.py --dataset ogbl-ppa --num_hops 1 --use_feature --use_edge_weight --eval_steps 5 --epochs 20 --dynamic_train --dynamic_val --dynamic_test --train_percent 5

there is an 'IndexError':

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/anaconda3/envs/seal2/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/opt/anaconda3/envs/seal2/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/opt/anaconda3/envs/seal2/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/opt/anaconda3/envs/seal2/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/opt/anaconda3/envs/seal2/lib/python3.8/runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/opt/anaconda3/envs/seal2/lib/python3.8/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/opt/anaconda3/envs/seal2/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/shanghanglin/SEAL_OGB/seal_link_pred.py", line 702, in <module>
    loggers[key].print_statistics()
  File "/Users/shanghanglin/SEAL_OGB/utils.py", line 364, in print_statistics
    valid = r[:, 0].max().item()
IndexError: too many indices for tensor of dimension 1

Could you please help me with that? Thank you!

What does batch in the graph mean?

I see that you set batch_size=32, but the input is only 1 subgraph while the label output has 32 entries. I don't understand that.
Also, when I use DRNL for labeling, why do some nodes in the graph have label 0?

Updating results for ogbl-citation2

We have updated the ogbl-citation dataset to ogbl-citation2 (available in ogb>=1.2.4). We would appreciate it if you could re-evaluate your model on the new dataset and make a leaderboard submission. Sorry for the extra work.

Thanks,
Weihua --- OGB Team

[BUG] get_pos_neg_edges neglects negative edges in ogbl datasets

Bug Description

Thank you for reviewing this issue.

As titled, even when negative edges are given, e.g. in ogbl-ddi or ogbl-collab, SEALDataset resamples negative edges regardless. See

if 'edge_neg' in split_edge['train']:

Since ogbl-ddi and ogbl-collab do not have edge_neg in split_edge['train'], we fail the if condition and resample negative edges in the following lines. This may not be the proper use of the OGB datasets.

Error Reproduction

To confirm this bug, we manually removed the negative edges from the dataset and found that the code still runs. This means the provided negative edges are not being used at all.

After

data.edge_index = split_edge['train']['edge'].t()
add these lines:

print(f"{split_edge = }")

split_edge['valid']['edge_neg'] = None
split_edge['test']['edge_neg'] = None

print(f"{split_edge = }")

Then run

python seal_link_pred.py --dataset ogbl-ddi --num_hops 1 --ratio_per_hop 0.2 --use_edge_weight --eval_steps 1 --epochs 2 --runs 1  --dynamic_val --dynamic_test --train_percent 0.01 --val_percent 1 --test_percent 1

Outputs:

Results will be saved in results/ogbl-ddi_20230619211543
Command line input: python seal_link_pred.py --dataset ogbl-ddi --num_hops 1 --ratio_per_hop 0.2 --use_edge_weight --eval_steps 1 --epochs 2 --runs 1 --dynamic_val --dynamic_test --train_percent 0.01 --val_percent 1 --test_percent 1
 is saved.
split_edge = {'train': {'edge': tensor([[4039, 2424],
        [4039,  225],
        [4039, 3901],
        ...,
        [ 647,  708],
        [ 708,  338],
        [ 835, 3554]])}, 'valid': {'edge': tensor([[ 722,  548],
        [ 874, 3436],
        [ 838, 1587],
        ...,
        [3661, 3125],
        [3272, 3330],
        [1330,  776]]), 'edge_neg': tensor([[   0,   58],
        [   0,   84],
        [   0,   90],
        ...,
        [4162, 4180],
        [4168, 4260],
        [4180, 4221]])}, 'test': {'edge': tensor([[2198, 1172],
        [1205,  719],
        [1818, 2866],
        ...,
        [ 326, 1109],
        [ 911, 1250],
        [4127, 2480]]), 'edge_neg': tensor([[   0,    2],
        [   0,   16],
        [   0,   42],
        ...,
        [4168, 4259],
        [4208, 4245],
        [4245, 4259]])}}
split_edge = {'train': {'edge': tensor([[4039, 2424],
        [4039,  225],
        [4039, 3901],
        ...,
        [ 647,  708],
        [ 708,  338],
        [ 835, 3554]])}, 'valid': {'edge': tensor([[ 722,  548],
        [ 874, 3436],
        [ 838, 1587],
        ...,
        [3661, 3125],
        [3272, 3330],
        [1330,  776]]), 'edge_neg': None}, 'test': {'edge': tensor([[2198, 1172],
        [1205,  719],
        [1818, 2866],
        ...,
        [ 326, 1109],
        [ 911, 1250],
        [4127, 2480]]), 'edge_neg': None}}
Total number of parameters is 522946
SortPooling k is set to 244
100%|███████████████████████████████████| 7/7 [00:04<00:00,  1.51it/s]
100%|█████████████████████████████████| 84/84 [00:15<00:00,  5.46it/s]
100%|█████████████████████████████████| 84/84 [00:14<00:00,  5.73it/s]
Hits@20
Run: 01, Epoch: 01, Loss: 0.6864, Valid: 12.29%, Test: 7.35%
Hits@50
Run: 01, Epoch: 01, Loss: 0.6864, Valid: 20.69%, Test: 14.17%
Hits@100
Run: 01, Epoch: 01, Loss: 0.6864, Valid: 30.51%, Test: 22.49%
100%|███████████████████████████████████| 7/7 [00:02<00:00,  2.79it/s]
100%|█████████████████████████████████| 84/84 [00:14<00:00,  5.64it/s]
100%|█████████████████████████████████| 84/84 [00:14<00:00,  5.77it/s]
Hits@20
Run: 01, Epoch: 02, Loss: 0.6602, Valid: 32.83%, Test: 22.34%
Hits@50
Run: 01, Epoch: 02, Loss: 0.6602, Valid: 44.83%, Test: 36.21%
Hits@100
Run: 01, Epoch: 02, Loss: 0.6602, Valid: 53.67%, Test: 49.93%
Hits@20
Run 01:
Highest Valid: 32.83
Highest Eval Point: 2
   Final Test: 22.34
Hits@50
Run 01:
Highest Valid: 44.83
Highest Eval Point: 2
   Final Test: 36.21
Hits@100
Run 01:
Highest Valid: 53.67
Highest Eval Point: 2
   Final Test: 49.93
Hits@20
All runs:
Highest Valid: 32.83 ± nan
   Final Test: 22.34 ± nan
Hits@50
All runs:
Highest Valid: 44.83 ± nan
   Final Test: 36.21 ± nan
Hits@100
All runs:
Highest Valid: 53.67 ± nan
   Final Test: 49.93 ± nan
Total number of parameters is 522946

Possible solution

Maybe we can change

if 'edge_neg' in split_edge['train']:
to

if 'edge_neg' in split_edge[split]:

to fix this bug.

Question regarding `use_edge_weight` parameter for training `ogbl-ddi` and `ogbl-ppa`

Hi Dr. Zhang,

I was trying to recreate the experimental results for ogbl-ddi and ogbl-ppa. However, I saw the argument use_edge_weight being used. I might be wrong here, but aren't ogbl-ddi and ogbl-ppa unweighted datasets?

Is this done to use a vector of ones as edge weight during training? If so, I am curious to know why it's needed.

Thanks,
Paul

Question regarding output

Hi,

I had another question about the output of SEAL_OGB. How do you restrict the output probability to be within 0 and 1? I see a tanh nonlinearity, but I also see other layers afterwards without range restrictions, which suggests to me that the predicted output could be negative or greater than 1.

Issues about heuristic methods, CN and AA, in PPA and DDI

Thanks for sharing your great code!
I was unable to reproduce the results of the CN and AA methods on the PPA and DDI datasets in the OGB leaderboard.

I've read the issues ogb-84 and SEAL_OGB-25, and using your code on the collab dataset I get the same results as the OGB leaderboard with "--use_valedges_as_input". However, on the PPA and DDI datasets, I get errors when using "--use_valedges_as_input". So I can only use the commands in the README, which give the following results, far lower than the OGB leaderboard.

I'm trying to run your sample code:
python seal_link_pred.py --use_heuristic CN --dataset ogbl-ppa

And the results I get are:

Hits@20
All runs:
Highest Valid: 5.38 ± nan
   Final Test: 6.73 ± nan
Hits@50
All runs:
Highest Valid: 8.03 ± nan
   Final Test: 12.47 ± nan
Hits@100
All runs:
Highest Valid: 12.73 ± nan
   Final Test: 21.09 ± nan

I use package version:
ogb==1.3.5
torch==1.13.0
torch-geometric== 2.2.0

Can you have a look? Thanks.

How to reduce training time cost?

Training takes an extremely long time and has very low GPU utilization. This may be caused by the enclosing-subgraph sampler. Do you have any ideas for accelerating the training process while keeping the MRR? Thank you.

Problem with z_score?

Your graph is trained with batch_size = 32, which means 32 subgraphs are compressed into one large graph, but z is marked with value 1 in only 2 positions: z[0] = 1 and z[1] = 1. However, following DRNL, target nodes are labeled 1, so with batch_size=32 your z is supposed to have 32 values equal to 1.

Error when running with private dataset

Hi - thank you for this great project! I am trying to run the code with my own dataset and have followed your example in issue 12. However, I am getting this error. Can you please help?

usage: ipykernel_launcher.py [-h] [--dataset DATASET] [--model MODEL]
[--sortpool_k SORTPOOL_K]
[--num_layers NUM_LAYERS]
[--hidden_channels HIDDEN_CHANNELS]
[--batch_size BATCH_SIZE] [--num_hops NUM_HOPS]
[--ratio_per_hop RATIO_PER_HOP]
[--max_nodes_per_hop MAX_NODES_PER_HOP]
[--node_label NODE_LABEL] [--use_feature]
[--use_edge_weight] [--lr LR] [--epochs EPOCHS]
[--runs RUNS] [--train_percent TRAIN_PERCENT]
[--val_percent VAL_PERCENT]
[--test_percent TEST_PERCENT] [--dynamic_train]
[--dynamic_val] [--dynamic_test]
[--num_workers NUM_WORKERS]
[--train_node_embedding]
[--pretrained_node_embedding PRETRAINED_NODE_EMBEDDING]
[--use_valedges_as_input]
[--eval_steps EVAL_STEPS] [--log_steps LOG_STEPS]
[--data_appendix DATA_APPENDIX]
[--save_appendix SAVE_APPENDIX] [--keep_old]
[--continue_from CONTINUE_FROM] [--only_test]
[--test_multiple_models]
[--use_heuristic USE_HEURISTIC]
ipykernel_launcher.py: error: unrecognized arguments: -f /root/.local/share/jupyter/runtime/kernel-616a648a-8de4-4c78-ab07-2b9702625a65.json
An exception has occurred, use %tb to see the full traceback.

SystemExit: 2
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py:2890: UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)`

Question about input parameter

Hi,

I was wondering what exactly the --train_node_embedding parameter is for? What else besides the DRNL label and the input node features is fed into the DGCNN?

SEAL inferencing on a Raspberry Pi

@muhanzhang Hi there,

  • Can SEAL run inference on an ARM-based device such as a Raspberry Pi with 1 GB of RAM?
  • Does inference with a SEAL model for link prediction require a lot of RAM, time, and CPU?
  • Does training a SEAL model for link prediction require a lot of RAM when my dataset consists of 1,000,000 graphs, in which each graph has fewer than 150 nodes? For such a dataset, how much RAM do you estimate is needed?

Regarding SEAL's performance on Cora, Citeseer and Pubmed

Hi Prof. Zhang,

I am trying to reproduce the performance of SEAL on Cora, Citeseer, and Pubmed. However, it seems that SEAL does not perform well on these datasets; it is even worse than a GNN encoder with a Hadamard product decoder. Given that SEAL is theoretically more powerful than a pure GNN and shows competitive results on other graph datasets like OGB, it does not make sense to me that it degenerates on these 3 common benchmarks.

I wonder if you have had a chance to look over SEAL's results on these 3 datasets and whether you have insights about its relative failure on them. I assume more thorough hyperparameter tuning might help, but unfortunately I haven't found the right recipe to recover its full potential.

Cora      Hits@20        Hits@50        AUC
GCN       52.80 ± 1.50   65.34 ± 2.08   91.74 ± 0.65
SEAL      44.45 ± 3.58   58.15 ± 2.66   86.13 ± 1.39

Citeseer  Hits@20        Hits@50        AUC
GCN       55.55 ± 2.38   64.09 ± 3.34   90.06 ± 1.06
SEAL      45.46 ± 1.90   54.53 ± 0.91   82.72 ± 0.81

Pubmed    Hits@20        Hits@50        AUC
GCN       43.97 ± 2.00   52.76 ± 1.00   91.20 ± 0.50
SEAL      39.89 ± 4.80   48.65 ± 3.65   96.08 ± 0.28

The results above use a 10%/20% split for validation and test and are averaged over 10 runs.

Thanks,

dataset = PygLinkPropPredDataset(name=args.dataset)

Hello Professor Zhang, I would like to ask what causes the following error when I first run the code:
Processing...
Traceback (most recent call last):
File "D:\Pycharm_poject\GitHub_SEAL_OGB-main\seal_link_pred.py", line 420, in
dataset = PygLinkPropPredDataset(name=args.dataset)
File "D:\Program_Data\Anaconda3\envs\pytorch13\lib\site-packages\ogb\linkproppred\dataset_pyg.py", line 64, in init
super(PygLinkPropPredDataset, self).init(self.root, transform, pre_transform)
File "D:\Program_Data\Anaconda3\envs\pytorch13\lib\site-packages\torch_geometric\data\in_memory_dataset.py", line 56, in init
super().init(root, transform, pre_transform, pre_filter)
File "D:\Program_Data\Anaconda3\envs\pytorch13\lib\site-packages\torch_geometric\data\dataset.py", line 87, in init
self._process()
File "D:\Program_Data\Anaconda3\envs\pytorch13\lib\site-packages\torch_geometric\data\dataset.py", line 170, in _process
self.process()
File "D:\Program_Data\Anaconda3\envs\pytorch13\lib\site-packages\ogb\linkproppred\dataset_pyg.py", line 124, in process
additional_node_files = self.meta_info['additional node files'].split(',')
AttributeError: 'float' object has no attribute 'split'

A clarification on SEAL’s underlying GNN engine

Hi Dr. Zhang,

I had the pleasure of reading the labeling trick paper. Upon reading it, I was quite curious about what exactly can be regarded as the SEAL model. Can we call GCN + DRNL-labelled subgraphs + SortPooling aggregation SEAL?

Or can we only call the model SEAL if it uses DGCNN as the underlying backbone, i.e., DGCNN + DRNL-labelled subgraphs + SortPooling aggregation as the last step?

I guess I am just confused about whether GCN + DRNL becomes SEAL only if you use SortPooling at the end. Sorry if the question is a bit difficult to understand; I can clarify further as needed.

- Paul


Feature request for only_predict

Could you please add an only_predict feature (similar to what you have in the SEAL repo)? My goal is to generate prediction scores for a set of links in the graph (provided as input in a separate file). Thanks.

Regarding custom datasets

Could you please elaborate on what you mean by "Using custom datasets is also easy by replacing the Planetoid dataset with your own"? Do we need to convert our data (e.g. USAir) to a torch_geometric dataset? Thanks.
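A minimal sketch of one way to do this, assuming the data is available as a networkx edge list (the file name and the downstream pipeline are illustrative, not this repository's API):

import networkx as nx
from torch_geometric.utils import from_networkx

G = nx.read_edgelist('USAir.txt', nodetype=int)  # hypothetical file name
data = from_networkx(G)  # a PyG Data object with data.edge_index filled in
# a split_edge dict can then be built from data.edge_index, e.g. along the
# do_edge_split(...) path the script already uses for Planetoid datasets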

about SEALDynamicDataset

Hello Dr. Zhang,

I'm a Master's student currently studying this area, and I want to ask you a question.

If I add the --dynamic_train parameter, I run into a problem: I can't get train_data (see the screenshot). The problem appears in model.py; I think the dynamic ordering causes it. Please help me, thanks.
[screenshot omitted]

exceeded memory when running the ppa dataset (128G)

Thanks for sharing your great code!
I'm trying to run your sample code:

python seal_link_pred.py --dataset ogbl-ppa --num_hops 1 --use_feature --use_edge_weight --eval_steps 5 --epochs 20 --dynamic_train --dynamic_val --dynamic_test --train_percent 5

However, when it reaches the step train_dataset = eval(dataset_class)(...), where dataset_class is SEALDataset, it prints [ 12% ||||||||| 128990/10619596 [36:23<5:14:41, 49.39it/s] and is then killed by the server.

The reason is that this process runs out of all the memory (128G). I wonder if there is something wrong with my attempt, and how much memory is needed to run this ppa sample? Thanks!

Multi-relational link prediction in a heterogeneous graph

Hi there!

Amazing work with the SEAL approach! I was wondering how one could apply the scoring to a multi-relational link prediction task. Since you mentioned in the paper that SEAL is not restricted to homogeneous graphs, I was wondering how the implementation has to be adapted to calculate scores for different edge types. Could you possibly show how to do that?

Best,

Sophia

Issues with heuristic methods, i.e., CN and AA

Hi, I found that the testing performance of the heuristic methods I got is significantly different from what is reported on the Open Graph Benchmark. For example:

Running the following command gives me Hits@50 Valid 63.49, test 53.00.

python seal_link_pred.py --use_heuristic AA --dataset ogbl-collab

Running the following command gives me Hits@50 Valid 60.36, test 50.06.

python seal_link_pred.py --use_heuristic CN --dataset ogbl-collab

These results are very different from those reported on the Open Graph Benchmark.

I went through the code on GitHub, but couldn't figure out why.

Question regarding the training time.

Hi, thanks for sharing the code.

I was wondering how long it generally takes to train a model from scratch? I understand it depends on the computational power of the server. In that case, what kind of hardware configuration works best for training SEAL on the OGB tasks?

I'm using the default parameters to test the model on 'ogbl-citation'. Each training epoch takes around 4.5 hours on a server with an E5-2660 (10 physical cores), 264GB of memory, and a Tesla P100. Does the code perform normally under this setting?

About Planetoid edge_index utilization

Hi!

Thank you again for the flexible and interesting implementation of SEAL and the possibility of using custom datasets! I really enjoy it!

In the SEAL implementation, you are using the Planetoid dataset. May I ask why you replace edge_index with split_edge['train']['edge'].t() (cf. the edge index assignment)? Is this line specific to the Planetoid dataset?
