jhy1993 / han Goto Github PK

Heterogeneous Graph Neural Network

Python 100.00%

graph-neural-network network-embedding graph-attention-network heterogeneous-network heterogeneous-graph heterogeneous-graph-neural-network

han's Introduction

Paper

https://github.com/Jhy1993/Representation-Learning-on-Heterogeneous-Graph

HAN

The source code of Heterogeneous Graph Attention Network (WWW-2019).

The source code is based on GAT

Reference

If you make advantage of the HAN model or use the datasets released in our paper, please cite the following in your manuscript:

@article{han2019,
title={Heterogeneous Graph Attention Network},
author={Xiao, Wang and Houye, Ji and Chuan, Shi and  Bai, Wang and Peng, Cui and P. , Yu and Yanfang, Ye},
journal={WWW},
year={2019}
}

How to preprocess DBLP?

Demo: preprocess_dblp.py

Slides

https://github.com/Jhy1993/HAN/blob/master/0516纪厚业%20www%20ppt%20copy.pdf

Q&A

ACM_3025 in our experiments is based on the preprocessed version ACM in other paper (\data\ACM\ACM.mat). Subject is just like Neural Network, Multi-Object Optimization and Face Recognition. In ACM3025, PLP is actually PSP. You can find it in our code.
In ACM, train+val+test < node_num. That is because our model is a semi-supervised model which only need a few labels to optimize our model. The num of node can be found in meta-path based adj mat.
"the model can generate node embeddings for previous unseen nodes or even unseen graph" means the propose HAN can do inductive experiments. However, we cannot find such heterogeneous graph dataset. See experiments setting in Graphsage and GAT for details, especially on PPI dataset.
meta-path can be symmetric or asymmetric. HAN can deal with different types of nodes via project them into the same space.
Can we change the split of dataset and re-conduct some experiments? of course, you can split the dataset by yourself, as long as you use the same split for all models.
How to run baseline (e.g., GCN) and report the best performance of baselines? Taking ACM as an example, we translate heterogenesous graph into two homogeneous graphs via meta-path PAP&PSP. For PAP based homogeneous graph, it only has one type of node paper and two paper connected via PAP. Then, we run GCN on two graphs and report the best performance. Ref https://arxiv.org/pdf/1902.01475v1.pdf and http://web.cs.wpi.edu/~xkong/publications/papers/www18.pdf
Several principles for preprocess data. 1）Extract nodes which have all meta-path based neighbors. 2）Extract features which may meaningful in identifying the characteristics of nodes. For example, if all nodes have one feature, this feature is not meaningful. If only several nodes have one feature, this feature is not meaningful. 3) Extract balanced node label which means different classes should have almost the same number of node. For k classes, each class should select 500 nodes and label them, so we get 500*k labeled nodes.

Datasets

Preprocessed ACM can be found in: https://pan.baidu.com/s/1V2iOikRqHPtVvaANdkzROw 提取码：50k2

https://bupteducn-my.sharepoint.com/:u:/g/personal/jhy1993_bupt_edu_cn/EfLZcHE2e4xBplCVnzcJbQYBurNVOCk7ZIne2YsO3jKbSw?e=vMQ18v

Preprocessed DBLP can be found in: https://pan.baidu.com/s/1Qr2e97MofXsBhUvQqgJqDg 提取码：6b3h

https://bupteducn-my.sharepoint.com/:u:/g/personal/jhy1993_bupt_edu_cn/Ef6A6m2njZ5CqkTN8QcwU8QBuENpB7eDVJRnsV9cWXWmsA?e=wlErKk

Preprocessed IMDB can be found in: 链接:https://pan.baidu.com/s/199LoAr5WmL3wgx66j-qwaw 密码:qkec

Run

Download preprocessed data and modify data path in def load_data_dblp(path='/home/jhy/allGAT/acm_hetesim/ACM3025.mat'):

python ex_acm3025.py

HAN in DGL

https://github.com/dmlc/dgl/tree/master/examples/pytorch/han

han's People

Contributors

Stargazers

Watchers

Forkers

liangxun songfgh chqlee uctoronto jjwangnlp jingmouren yuanyuansiyuan mengcao327 tccoder ianliyi1996 wurentidai mcmaxmm zxycynthia group2two zjl130345 zhengliu212 qwzhong1988 skx300 xiaolinhan szwangsummer wangchong111 jinlianglu96 simonmandlik zhangys2019 jiwenfei wonderfultina jackwangsysu qss2012 gaoli1537 dingdanhao110 mufeili houchaoxu anonymous5522 zhhhzhang jlqzzz patrickgsheng yunlei0518 cslele sdjoko dingyuedydydy pzhaozh czjczj xuyou314 maclarin polarbearorz damioncheng shaneson0 yea02 smileelop youngbigbird1985 hxl523 wangxuekui aabbccgithub davidhnu ventric jzx95 patriciaxiao tony109060581 alfredgin xuchensjtu xuchanguniversity chenyi729 liyuwe ammieqi jaeyun95 peacegui zwytop bishnukuet chaoyue729 liu-guo-jing chestnut1999 zeyunh amorsun wwwwwelkin codeinging sheldonresearch pwforks yangylin caoyang17621 xc15071347094 freekang yhjflower yueyub milkigit baobunuo ariana8456 mzzzzzzzzz febi123 yandexuanxuan ukilin yoghurt-lee liangzai951 xrosliang dayuml louise-lulin alexzhuqch001 daigenan sakuracy tmacmilan bofeixiao

han's Issues

About the code in DGL

The code in https://github.com/dmlc/dgl/blob/master/examples/pytorch/han/model.py Line 19,Line21.
beta = torch.softmax(w, dim=1)
return (beta * z).sum(1)

The dim should be 0 ?

关于代码复现

同学，你好！按照readme所示，相应代码运行不同，找不到对应的数据集？能不能再梳理一下，谢谢！

您好，非常感谢您的工作，以及将代码开源。
我在运行您的程序的时候得到的结果和您论文中不是很一致，所以想请问您几个问题，谢谢！
1）您论文中得到的结果就是在您现在开源这份的代码的参数下运行的结果吗？按照您论文中，在ACM数据集Training=20%的时候，Micro-F1=89.22， Macro-F1=89.40，而我这边运行的结果是： micro_f1: 0.8626 ; macro_f1: 0.8637（KNN中k=5）；
2）还想请问您，您在训练的时候有没有出现过拟合现象呢，按照现在这份代码中的情况，在epoch=100的时候early stop ，但这时的training accuracy已经为100%，但是测试的acc却只有86.12%。不知道您是否也遇到这种情况。
3）请问您base_gattn.py 中的micro_f1函数就是您计算最终结果时候使用的函数吗？我不是很理解里面的下面这行代码： predicted = tf.round(tf.nn.sigmoid(logits))
按照这种计算方式，在predicted中就会出现对某些节点类别的预测是[1,1,1]情况，那么这个时候不管真实节点类别是什么，这个预测结果都会被加到true positives中？并且打印出来的tp: 664 tn: 2788 fp: 1462 fn: 1461，总和是6375？？而那个数据集里的node number是3025？？

期待您的回复，非常感谢！

About the semantic-specific embedding

After formula five, you said "we can obtain P groups of semantic-specific node embeddings" . I looked through the full paper and found no expression or formula for this semantic-specific node embeddings. Is it possible to concatenate the learned embedding of node i for the meta-path Φ in formula four?

如何预处理ACM数据

非常感谢分享的代码，但是对于异构的网络（不同的节点、不同的边），如何预处理数据集？我的理解是需要把不同类型的节点映射到同一个特征空间，在本文中是如何处理的？请问方便公开一下清洗ACM 或者 DBLP4area数据集的清洗脚本，或者是能更详细说明一下论文里所用数据集是如何基于原始ACM 或者 dblp4area数据集清洗得来的吗？谢谢。

Questions about model

Hi,
here I have one question about HAN:

(I have found your previous answer to this quesion) In your code, the dataset is named as "ACM3020.mat'' while it is "ACM.mat" in the file. I change the code's name to load the data, but it seems that they are different data. So which one is right?
Could you please tell me whether HAN can be applied to weighted and directed graph with attribute?

Look forward to your reply!

Parent directory of /pre_trained/acm/acm_allMP_multi_fea_.ckpt doesn't exist, can't save.

请问这个ckpt文件是怎么得到的呢

GCN和GAN上的实验如何做的？

论文提出的方法通过meta-path找到邻居后，直接进行接下来的两层attention操作，而不是像GCN中节点embedding的扩散式表示？请问是否有试验过，在本文模型的基础上，对meta-path进行GCN操作，即对于中心节点，他的表示为自身 + meta-path上从邻居开始一层层传递过来的表示？
GCN和GAN的实验论文中描述为在GCN上测试了所有的meta-path，然后取了最好的结果，请问具体是怎么做的？
是GCN传递的时候，对每个节点，不是所有邻居，只是多条meta-path传递过来的表示？从中取最好的结果是什么意思呢？
本文没有用GCN中的表示传递，但能超过GCN, 说明靠meta-path找到的邻居已经包含了重要的信息？另外能超过GAN, 说明层级注意力机制比较有效？

Initial author features of DBLP dataset

Hi there, i am wondering if it is possible to obtain a running script for preprocessing the DBLP dataset. If I am not mistaken the original DBLP dataset does not have features and you did some preprocessing in order to get the initial features for author nodes. I m interested to know how the initial features of authors are obtained.

Errors report

Line 287, ex_acm3025.py should be
from jhyexp import my_KNN, my_Kmeans#, my_TSNE, my_Linear
since the python file is named jhyexp.py rather than jhyexps.py, and there are no my_TSNE() and my_Linear() fuctions in that .py

ACM and DBLP dataset

Can you share raw files and preprocessing scripts for these datasets ?

Also, can make the preprocessed files downloadable for people outside china? The link you have given is not working.

请问清洗DBLP4area数据集的更多细节

请问方便公开一下清洗DBLP4area数据集的清洗脚本，或者是能更详细说明一下论文里所用数据集是如何基于原始dblp4area数据集清洗得来的吗？谢谢。

我根据论文描述以如下方法在原始DBLP4area上清洗出来的数据集无法复现论文里报告的精度。

选出带标签的A类型节点共4057个；
选出和A类型节点有关联的P类型节点共14328个；
选出和P类型有关联的C类型节点共20个，选出和P类型有关联的T类型节点共8898个（这里和论文里面的8789个略有差别）；
4.然后基于以上构造的邻接矩阵，对于某条metapath，用该路径下对应的邻接矩阵的序列的点积来求该路径的commuting matrix；

Mac run error

Epoch: 0, att_val: [0.5074427 0.49255723]
Training: loss = 1.14501, acc = 0.30667 | Val: loss = 1.12552, acc = 0.64000
2020-08-27 18:43:10.930408: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at save_restore_v2_ops.cc:109 : Not found: pre_trained/acm; No such file or directory

Could you please share the raw data again?

Hi, I am looking for the processed IMDB dataset provided in your paper. Both links of the IMDB in README are not valid anymore (at least to me). Could you please reshare it? Thanks.

关于TF版本的问题

学长你好，方便说一下你这模型所需要的package版本么
或者就告知一下tf版本就行

Valid preprocessing pipeline request

Hi bro, could you provide preprocessing code that can run directly from raw data to preprocessed data? It's really annoying that preprocessing scripts cannot run directly.

关于DBLP数据集的问题

在论文中有提到author的feature是bag-of-words represented keywords
请问author的keywords是从哪里提取的呢，这个repo里好像并未提供

使GNN从Transductive到Inductive的最大改动是？或者说GNN里Inductive和Transductive的最大区别是？

感觉都是输入全图然后mask测试集的label，
就是sample上的区别？
@Jhy1993 谢谢！

the source of the dataset

Hi, thanks for your works.
I wonder the source of the .mat form of ACM dataset because I'm a little confused about some parts of its meaning, i.e. what does the 'n' in ACM['nT'] and ACM['nnPvsT'] mean?
And I also want to know do you preprocess the DBLP and IMDB from scratch (i.e. from the official release), or borrow that from other papers? If the latter, I want to know that source.

Regards.

IMDB数据提取问题

你好，我在提取IMDB数据过程中遇到一些问题，希望能得到你的帮助。

我提取genres中有action,comedy,drama类别的movie，但是我只得到了4380个movie，与论文中不一致，请问你是怎么提取的呢？
movie feature，我提取plot_keywords中单词（去掉短句/词组），得到了3185个单词，与论文中也不一致，请问你这边单词是怎么处理的呢，后续人为挑选单词吗？
train，val，test的数量加和为什么与movie总数量不一致呢？

您好~batch_size只能取1吗？

我想将batch_size 取其他值，但是发现做不到，在layer.py的sp_attn_head函数中有这样的注释：

As tf.sparse_tensor_dense_matmul expects its arguments to have rank-2,

here we make an assumption that our input is of batch size 1, and reshape appropriately.

The method will fail in all other cases!

想问问您应该怎么处理，才能使得batch_size可以取其他值呢？
十分感谢！

How to get the meta-path-based transition matrix M in Equation 1 of the paper ?

Thank you so much for bringing such a wonderful paper. Unfortunately, I did not find the corresponding formula 1 in the paper in the code, especially how the transition matrix M was obtained or entered？ could you please solve my problems ? @Jhy1993

关于如何处理IMDB数据集

您好，IMDB数据集中，电影的类型可能有多个(multi label)对吗？请问是如何处理的？

另外代码缺失了一个实验脚本 jhyexps.py

where is the acm/acm_allMP_multi_fea_.ckpt

rt, how can I get the pre-trained model?

Your code is hard to use. I cannot even take a simple run.

Please attach me if revised.

损失函数

你好，
请问无监督应用场景的损失函数应该怎么定义？

Purpose of heads

I am trying to understand conceptually what the number of heads does. I am struggling to understand from the paper.

Classification task enquiry

taking the IMDB dataset for example, the paper mentioned using M-A-M and M-D-M metapaths to generate embeddings. To be specific, it only generated embeddings for the movie-typed nodes. Thus, may I ask how did you perform the node classification task? for instance, if I input a feature vector of an Actor instance, am I able to obtain the return that specify class A even when the model didn't learn to embed actor types?

学长你好IMDB的数据集已经放出来了吗

一直在follow你们的工作，关于ACM和DBLP数据集都已经有相应的处理过版本，为了后续实验能更好的复现学长你们的工作，方便讲IMDB数据集也公开一下，或者私信一下，非常感谢

Multiple edge types

Hi! @Jhy1993

Thanks for bringing such inspiring work! After reading your paper, I understand that how HAN could handle the multiple types of nodes in a graph. But I am not quite sure whether and how HAN could handle the multiple edges.

For example, a user could 'click' and 'purchase' an item. In this case, we have two types of nodes, item, and user; and the nodes are connected via two types of edges, the 'click' and 'purchase'. The natural goal is to derive embeddings for a user under different types of activities. That is, we may want to derive the embedding of the user based on his 'click' behavior, and get another embedding of the user under his 'purchase' activity.

I wonder that whether HAN could handle such tasks. It looks like simply by defining meth-path, say User-Item-User, cannot fully express the types of edges. Any ideas? Thanks in advance!

最后利用KNN进行分类的实验设置是否合理？

ex_acm3025.py 脚本中，最终用test_mask选出的特征和标签训练KNN进行分类，这样做的意义是：在测试集上划分了训练集、测试集，这样设置是不合理的。为了公平地和其他方法进行比较，应该用所有的特征，以及train_mask选出的标签训练KNN，然后测试KNN在test_mask上的分类效果。

请问下，问下graphsage和transductive gcn在模型上的区别是啥啊？感觉基本一样啊

多谢多谢

ACM dataset

i have downloaded AMC3025.mat file and in that file there are features for 3025 papers and three adj matrices for PLP, PTP and PAP relations with size 3025* 3025. we don't have corresponding heterogeneous graph as described in the paper with authors and subjects.
for applying other methods to this dataset i need the complete heterogeneous graph.
how do i download the heterogeneous graph?

Data Preprocessing

Hi,
Thanks for sharing your code.

Here I have a question related to the ACM data preprocessing.

When I extract papers from KDD, SIGMOD,SIGCOMM, MobiCOMM, and VLDB, there are 4025 papers which is different from what is reported in your paper. Could you pls tell me where the problem might be? Or can you share the cleaned data (not the PAP, PLP adjacency matrix)?

Thanks!

The changed code of Baselines

Can anyone share the baselines which use the new data?Thanks~

imdb的输入特征矩阵

您好！请问IMDB的数据集的输入矩阵是怎样的？我用了论文中M-A-M（电影-演员-电影）来分类电影类型，以每部电影为结点，该节点的特征向量用演员one-hot，是一个非常稀疏的矩阵，而且，最后效果并不好，是我的特征矩阵错了吗？期待您的回复。。。

代码运行问题

你好，想请问你的 load_data_dblp函数里，在导入ACM.mat数据时，找不到对应的 data['label'], data['feature']，请问是不是数据不对呢？

the source of raw datasets

Can you provide the raw files of these datasets?

I want to verify the effect on some models that do not consider the metapath.
Thank you！

Errors on your code

Wrong indent at utils/layers line 20

seq_fts = tf.layers.conv1d(seq, out_sz, 1, use_bias=False)

Error module import at ex_acm3025.py line 5

from models import GAT, HeteGAT, HeteGAT_multi, HeteGAT_multi_const_1, HeteGAT_multi_const_2

it should be

from models.gat import ...

also there are no such classes as HeteGAT_multi_const_1, HeteGAT_multi_const_2