fastgcn's Introduction

FastGCN

This is the TensorFlow implementation of our ICLR 2018 paper: "FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling".

Instructions for the sample code:

[For Reddit dataset]

train_batch_multiRank_inductive_reddit_Mixlayers_sampleA.py is the final model (the product AH in the bottom layer is precomputed). The original Reddit data should be converted into the .npz format using the function transferRedditDataFormat.
Note: by default, this code does no sampling. To enable sampling, change `main(None)` at the bottom of the file to `main(100)`. (The number is the sample size; you can also try other sample sizes.)
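For reference, a minimal sketch of the bottom of the script after the change (the `if __name__` guard is an assumption; only the `main(...)` call matters):

    if __name__ == "__main__":
        # main(None)  # default: no sampling
        main(100)     # importance sampling with a layer sample size of 100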

train_batch_multiRank_inductive_reddit_Mixlayers_uniform.py is the model for uniform sampling.

train_batch_multiRank_inductive_reddit_Mixlayers_appr2layers.py is the model for 2-layer approximation.

create_Graph_forGraphSAGE.py converts the data into the GraphSAGE format so that users can compare our method with GraphSAGE. We also include the converted original Cora dataset in this repository (./data/cora_graphSAGE).

[For pubmed or cora]

train.py is the original GCN model.

pubmed_Mix_sampleA.py — the dataset can be set in the code, for example: flags.DEFINE_string('dataset', 'pubmed', 'Dataset string.')
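A minimal sketch of how such a flag is declared with the TensorFlow 1.x flags API (the imports and the FLAGS variable are assumptions; only the DEFINE_string line is quoted from the code):

    import tensorflow as tf

    flags = tf.app.flags
    FLAGS = flags.FLAGS
    flags.DEFINE_string('dataset', 'pubmed', 'Dataset string.')  # 'cora' also works here

    # elsewhere the chosen dataset is read back as FLAGS.dataset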

pubmed_Mix_uniform.py and pubmed_inductive_appr2layers.py are analogous to the Reddit versions.

pubmed-original*.py means the code runs on the original Cora or Pubmed datasets. Users can also switch datasets by changing the data-loading function from load_data() to load_data_original().
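A sketch of the swap, assuming the two loaders share the return signature of the standard GCN utilities (the unpacked variable names are assumptions):

    # Default split:
    # adj, features, y_train, y_val, y_test, train_mask, val_mask, test_mask = load_data(FLAGS.dataset)

    # Original split:
    adj, features, y_train, y_val, y_test, train_mask, val_mask, test_mask = load_data_original(FLAGS.dataset)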

fastgcn's People

Contributors

cai-lw, gokceneraslan, matenure, submit1


fastgcn's Issues

How can I use it on my own dataset?

How can I generate the graph and set my own labels using my own dataset?
There are no clear instructions on how you did the split and the labels for your datasets ('cora' and the others).
Please add the raw, unprocessed data that you used, along with the scripts that do the preprocessing and graph building.

Training Accuracy much lower than Validation Accuracy

In running the sample code, I found that the training accuracy is much lower than the validation accuracy, which is different from training GraphSAGE on Reddit in their repo. Is this normal?

For example, here is the log output I got:
Epoch: 0042 train_loss= 1.72849 train_acc= 0.66406 val_loss= 3.17200 val_acc= 0.90848 time per batch= 0.01104
Epoch: 0043 train_loss= 1.84603 train_acc= 0.59375 val_loss= 3.18259 val_acc= 0.90506 time per batch= 0.01108
Epoch: 0044 train_loss= 1.86952 train_acc= 0.60156 val_loss= 3.17415 val_acc= 0.90324 time per batch= 0.01116

Also, the paper reports the F1 measure. How can I get the F1 score using this codebase?
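(For reference, a minimal sketch of computing the micro-averaged F1 with scikit-learn; the variable names y_true and y_pred are assumptions, standing for one-hot labels and predicted logits:)

    import numpy as np
    from sklearn.metrics import f1_score

    # y_true: one-hot labels, y_pred: predicted logits, both of shape (num_nodes, num_classes)
    micro_f1 = f1_score(np.argmax(y_true, axis=1),
                        np.argmax(y_pred, axis=1), average='micro')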

Embeddings

Hi,

Where exactly are the learned embeddings stored, for example after running the pubmed appr2layers model?

Thanks
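(Not an authoritative answer, but in GCN-style TensorFlow code the hidden-layer outputs are usually what is meant by embeddings, and they can be fetched with an extra sess.run after training. A sketch, assuming the model keeps per-layer outputs in a model.activations list as Kipf's GCN implementation does:)

    # Hypothetical: fetch the last hidden layer's output as node embeddings.
    embeddings = sess.run(model.activations[-2], feed_dict=feed_dict)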

Files not matching with instructions

Hi
Thanks for the awesome work. Could you please update the instructions for running the code? The files in the README do not correspond to the .py files in the GitHub repository.

Thanks
Anurag

Is this a one-layer version?

Thanks for your work!
I read the file 'pubmed_Mix_sampleA.py' and found that it only samples once, which means the model has just one layer. But the paper says the results are tested on two-layer models. Is there anything I missed?

A question about CGNF

Hello, I recently read your article 'CGNF: CONDITIONAL GRAPH NEURAL FIELDS'. As I can't find your email address, I can only leave you a message here. Could you share the source code for this paper? I'm quite interested in your research.
My email: [email protected]

Reddit adjacency matrix

How did you build the reddit_adj.npz file from the Reddit data provided at http://snap.stanford.edu/graphsage/? I see the code that loads reddit_adj.npz, but there is no code in transferRedditDataFormat to create it. I tried the commented-out code where the adjacency matrix is created from feat_id_map, but it crashed because of an index mismatch.

How is the column probability calculated?

The column norm is squared in the paper, while in the code it is not squared (a plain Frobenius-style norm). This seems like a contradiction. I changed it to the squared form as in the paper and found that accuracy decreased a lot.
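(For reference, a sketch of the two variants being compared, assuming adj is the normalized adjacency matrix as a scipy.sparse matrix; the variable names are assumptions:)

    import numpy as np

    # Squared column norms, as in the paper: q(u) proportional to ||A(:, u)||^2
    col_sq = np.asarray(adj.multiply(adj).sum(axis=0)).ravel()
    p_squared = col_sq / col_sq.sum()

    # Unsquared column norms, as the issue says the code uses
    col = np.sqrt(col_sq)
    p_unsquared = col / col.sum()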

Converting the npz files from the Reddit json files

Hi,

In the file transformRedditGraph2NPZ.py, I have some problems running this code.

In the function transferRedditData2AdjNPZ:

    def transferRedditData2AdjNPZ(dataset_dir):
        G = json_graph.node_link_graph(json.load(open(dataset_dir + "/reddit-G.json")))
        feat_id_map = json.load(open(dataset_dir + "/reddit-id_map.json"))
        feat_id_map = {id: val for id, val in feat_id_map.iteritems()}
        numNode = len(feat_id_map)
        print(numNode)
        adj = sp.lil_matrix((numNode, numNode))
        print("no")
        for edge in G.edges():
            adj[feat_id_map[edge[0]], feat_id_map[edge[1]]] = 1
        sp.save_npz("reddit_adj.npz", sp.coo_matrix(adj))

feat_id_map looks like {'2gh0ul': 12378, ...} and G.edges() yields pairs like (integer, integer).
So I think there should be an error at
adj[feat_id_map[edge[0]], feat_id_map[edge[1]]]
because edge[i] is an integer, and feat_id_map[integer] doesn't make sense.

Is there something I'm missing?
Thanks in advance.
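(One hypothetical workaround, assuming the mismatch is that some node ids in reddit-G.json are already integer indices rather than string keys of feat_id_map:)

    # Hypothetical helper: use the id-map entry when the node id is a key,
    # and fall back to the raw id when it is already an integer index.
    def node_index(n, feat_id_map):
        return feat_id_map[n] if n in feat_id_map else n

    # ...and index the adjacency matrix via:
    # adj[node_index(edge[0], feat_id_map), node_index(edge[1], feat_id_map)] = 1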

Running on Reddit dataset is extremely slow

I downloaded the processed Reddit dataset from #8 (comment), and then ran train_batch_multiRank_inductive_reddit_Mixlayers_sampleA.py with default parameters. It takes about 10 minutes for a single epoch, yet the paper reports 638.6 seconds for the WHOLE training process, so I am ~200x slower than your reported speed.

I am running on an AWS m5.2xlarge instance with the same CPU spec as your machine (8 vCPUs = 4 cores / 8 threads, 2.5 GHz). All dependencies were simply installed with pip.

How can GCN (batched) reduce memory usage?

Hi, thanks for your great work! But I have two questions.

1. The paper says: "The nonbatched version of GCN runs out of memory on the large graph Reddit". How can GCN (batched) reduce memory usage? Don't we need to load the adjacency matrix in GCN (batched) as well?

2. GCN is a transductive method. How does it work on Reddit, which is treated as an inductive dataset in previous work?

No such folder named ./data/cora_graphSAGE

Hello, I want to use the Cora dataset with GraphSAGE, but I can't find a file named cora-G.json, and there is no folder named ./data/cora_graphSAGE. I would appreciate your help, thank you.

Error in transformRedditGraph2NPZ.py

I downloaded the Reddit data from SNAP, unzipped it, and ran

python transformRedditGraph2NPZ.py

Then it shows the following error message.

    Traceback (most recent call last):
      File "transformRedditGraph2NPZ.py", line 69, in <module>
        transferRedditDataFormat("reddit","reddit.npz")
      File "transformRedditGraph2NPZ.py", line 46, in transferRedditDataFormat
        train_ids = [n for n in G.nodes() if not G.node[n]['val'] and not G.node[n]['test']]
      File "transformRedditGraph2NPZ.py", line 46, in <listcomp>
        train_ids = [n for n in G.nodes() if not G.node[n]['val'] and not G.node[n]['test']]
    KeyError: 'val'

Is this a bug, or is it caused by something else? Thanks.
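(A hypothetical workaround, consistent with a later issue in this list that mentions removing nodes without the "val" attribute: skip such nodes instead of indexing them unconditionally.)

    # Hypothetical fix: keep only nodes that actually carry the
    # 'val'/'test' attributes, so missing keys no longer raise KeyError.
    train_ids = [n for n in G.nodes()
                 if 'val' in G.node[n]
                 and not G.node[n]['val'] and not G.node[n]['test']]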

`TypeError: 633 is not JSON serializable` when calling `create_Graph_forGraphSAGE.py`

When running create_Graph_forGraphSAGE.py, I got the following error:


(py27) user@machinae:/FastGCN$ python create_Graph.py
Traceback (most recent call last):
  File "create_Graph.py", line 39, in <module>
    json.dump(data,f)
  File "/miniconda3/envs/py27/lib/python2.7/json/__init__.py", line 189, in dump
    for chunk in iterable:
  File "/miniconda3/envs/py27/lib/python2.7/json/encoder.py", line 434, in _iterencode
    for chunk in _iterencode_dict(o, _current_indent_level):
  File "/miniconda3/envs/py27/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
    for chunk in chunks:
  File "/miniconda3/envs/py27/lib/python2.7/json/encoder.py", line 332, in _iterencode_list
    for chunk in chunks:
  File "/miniconda3/envs/py27/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
    for chunk in chunks:
  File "/miniconda3/envs/py27/lib/python2.7/json/encoder.py", line 442, in _iterencode
    o = _default(o)
  File "/miniconda3/envs/py27/lib/python2.7/json/encoder.py", line 184, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: 633 is not JSON serializable

Could you give me some suggestions?
Thank you.
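(A common cause, offered as a guess: json cannot serialize numpy integer types such as numpy.int64. A sketch of a workaround that converts them on the fly; the helper name np_default is hypothetical:)

    import json
    import numpy as np

    # Hypothetical converter: turn numpy scalars into native Python types
    # so json.dump can serialize them.
    def np_default(o):
        if isinstance(o, np.integer):
            return int(o)
        if isinstance(o, np.floating):
            return float(o)
        raise TypeError(repr(o) + " is not JSON serializable")

    json.dump(data, f, default=np_default)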

Bug in create_Graph.py

There is an error when I run create_Graph.py.
FileNotFoundError: [Errno 2] No such file or directory: 'cora/cora0-G.json'

Test accuracy on Cora, Citeseer, and Pubmed using original split

Hi, I plan to run pubmed-original_transductive_FastGCN.py on Cora 100 times, changing the random seed from 0 to 99. So far (up to seed = 64) I get:
avg: 79.8%
max: 80.6%
min: 78.4%
std: 0.46%

This is much lower than the result reported in your paper (you only report 81.8%; I don't know whether that is the average or the max, or how many runs it is over).

I will run the code on Citeseer and Pubmed too.

Do I need to change the hyperparameters? Could you tell me the hyperparameters for these three datasets with the original data split, so I can get results comparable with GCN?

Thank you very much!

Could you provide a training-loss plot from TensorBoard? Also, the required files are not fully provided.

I ran train_batch_multiRank_inductive_reddit_Mixlayers_sampleA.py.
The data was processed with transferRedditDataFormat and transferRedditData2AdjNPZ in transformRedditGraph2NPZ.py; in transferRedditDataFormat, nodes without the "val" attribute are removed from G (loaded from reddit-G.json).
The result is not right: the training loss does not decrease steadily as training goes on.
So could you provide the generated reddit.npz and reddit_adj.npz on Google Drive or another server, or address the node-attribute problem?

Setting of batch size and sampled layer size for large network

Thank you for your nice work.
Now I am trying to replace the GCN model with the FastGCN model, because my graph is a large network (around 2,973,489 nodes with binary '0'/'1' labels). Only around 2,000 nodes have label '1'.

When I use the default settings on such a large graph, it is hard for the FastGCN model to learn anything.

For such a strongly imbalanced graph, how should I set the parameters, such as the number of layers, the hidden-layer dimension, the batch size, and the sample size for each layer?
I would appreciate any hints or advice!

Graphsage data prep for large datasets

Hello there,
I would like to apply the code to a dataset with a million nodes; however, using the dense matrix causes a memory error during the GraphSAGE data prep. Do you have any idea how to fix it?

x, tx files not opening

Hi, there are two things I want to ask. They may be dumb questions, as I am new to this.

  1. The x and tx files are not loading, but there are no errors either. I am using Python 3.7.

  2. I want to run FastGCN on my own dataset. Is there any guide for creating the x, y, tx, ty, and graph files from my own dataset?
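(On the first question, a guess at one common cause: the Planetoid-style x/tx files are Python 2 pickles, which need an explicit encoding under Python 3. A minimal sketch, assuming the standard data layout data/ind.<dataset>.x:)

    import sys
    import pickle as pkl

    # Python 2 pickles need an explicit encoding when loaded under Python 3.
    with open("data/ind.cora.x", 'rb') as f:
        if sys.version_info > (3, 0):
            x = pkl.load(f, encoding='latin1')
        else:
            x = pkl.load(f)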

Different weights for calculating probability

In the file "pubmed_inductive_appr2layers.py", on line 123 you use the adjacency matrix of only the batch nodes to compute the weights used for the sampling probability of each node, while on line 128 (p2) you use the adjacency matrix of the entire set of training nodes. The other files use p0, defined on line 107 of this file, for the same purpose.

So shouldn't it be consistent throughout, i.e. should p1 be replaced by p0 in this file?
