fastgcn's Introduction

FastGCN

This is the TensorFlow implementation of our ICLR 2018 paper: "FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling".

Instructions for the sample code:

[For Reddit dataset]

train_batch_multiRank_inductive_reddit_Mixlayers_sampleA.py is the final model (the product AH in the bottom layer is precomputed). The original Reddit data should be converted into the .npz format using the function transferRedditDataFormat.
Note: by default, this code does no sampling. To enable sampling, change `main(None)` at the bottom of the file to `main(100)`. (The number is the sample size; you can also try other sample sizes.)
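For reference, a minimal sketch of the bottom of the script after the change (the `if __name__` guard is an assumption; only the `main(...)` call matters):

    if __name__ == "__main__":
        # main(None)  # default: no sampling
        main(100)     # importance sampling with a layer sample size of 100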

train_batch_multiRank_inductive_reddit_Mixlayers_uniform.py is the model for uniform sampling.

train_batch_multiRank_inductive_reddit_Mixlayers_appr2layers.py is the model for 2-layer approximation.

create_Graph_forGraphSAGE.py converts the data into the GraphSAGE format so that users can compare our method with GraphSAGE. We also include the converted original Cora dataset in this repository (./data/cora_graphSAGE).

[For pubmed or cora]

train.py is the original GCN model.

pubmed_Mix_sampleA.py — the dataset can be set in the code, for example: flags.DEFINE_string('dataset', 'pubmed', 'Dataset string.')
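A minimal sketch of how such a flag is declared with the TensorFlow 1.x flags API (the imports and the FLAGS variable are assumptions; only the DEFINE_string line is quoted from the code):

    import tensorflow as tf

    flags = tf.app.flags
    FLAGS = flags.FLAGS
    flags.DEFINE_string('dataset', 'pubmed', 'Dataset string.')  # 'cora' also works here

    # elsewhere the chosen dataset is read back as FLAGS.dataset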

pubmed_Mix_uniform.py and pubmed_inductive_appr2layers.py are analogous to the Reddit versions.

pubmed-original*.py means the code runs on the original Cora or Pubmed datasets. Users can also switch datasets by changing the data-loading function from load_data() to load_data_original().
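A sketch of the swap, assuming the two loaders share the return signature of the standard GCN utilities (the unpacked variable names are assumptions):

    # Default split:
    # adj, features, y_train, y_val, y_test, train_mask, val_mask, test_mask = load_data(FLAGS.dataset)

    # Original split:
    adj, features, y_train, y_val, y_test, train_mask, val_mask, test_mask = load_data_original(FLAGS.dataset)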

fastgcn's People

Contributors

cai-lw, gokceneraslan, matenure, submit1


fastgcn's Issues

How can I use it on my own dataset?

How can I generate the graph and set my own labels using my own dataset?
There are no clear instructions on how you did the split and the labels for your datasets ('cora' and the others).
Please add the raw, unprocessed data that you used, along with the scripts that do the preprocessing and graph building.

Training Accuracy much lower than Validation Accuracy

In running the sample code, I found that the training accuracy is much lower than the validation accuracy, which is different from training GraphSAGE on Reddit in their repo. Is this normal?

For example, here is the log output I got:
Epoch: 0042 train_loss= 1.72849 train_acc= 0.66406 val_loss= 3.17200 val_acc= 0.90848 time per batch= 0.01104
Epoch: 0043 train_loss= 1.84603 train_acc= 0.59375 val_loss= 3.18259 val_acc= 0.90506 time per batch= 0.01108
Epoch: 0044 train_loss= 1.86952 train_acc= 0.60156 val_loss= 3.17415 val_acc= 0.90324 time per batch= 0.01116

Also, the paper reports the F1 measure. How can I get the F1 score using this codebase?
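(For reference, a minimal sketch of computing the micro-averaged F1 with scikit-learn; the variable names y_true and y_pred are assumptions, standing for one-hot labels and predicted logits:)

    import numpy as np
    from sklearn.metrics import f1_score

    # y_true: one-hot labels, y_pred: predicted logits, both of shape (num_nodes, num_classes)
    micro_f1 = f1_score(np.argmax(y_true, axis=1),
                        np.argmax(y_pred, axis=1), average='micro')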

Embeddings

Hi,

Where exactly are the learned embeddings stored, for example after running the pubmed appr2layers model?

Thanks
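(Not an authoritative answer, but in GCN-style TensorFlow code the hidden-layer outputs are usually what is meant by embeddings, and they can be fetched with an extra sess.run after training. A sketch, assuming the model keeps per-layer outputs in a model.activations list as Kipf's GCN implementation does:)

    # Hypothetical: fetch the last hidden layer's output as node embeddings.
    embeddings = sess.run(model.activations[-2], feed_dict=feed_dict)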

Files not matching with instructions

Hi
Thanks for the awesome work. Could you please update the instructions for running the code? The files in the README do not correspond to the .py files in the GitHub repository.

Thanks
Anurag

Is this a one-layer version?

Thanks for your work!
I read the file 'pubmed_Mix_sampleA.py' and found that it only samples once, which means the model has just one layer. But the paper says the results are tested on two-layer models. Is there anything I missed?

A question about CGNF

Hello, I recently read your article 'CGNF: CONDITIONAL GRAPH NEURAL FIELDS'. As I can't find your email address, I can only leave you a message here. Could you share the source code for this paper? I'm quite interested in your research.
My email: [email protected]

Reddit adjacency matrix

How did you build the reddit_adj.npz file from the Reddit data provided at http://snap.stanford.edu/graphsage/? I see the code that loads reddit_adj.npz, but there is no code in transferRedditDataFormat to create it. I tried the commented-out code where the adjacency matrix is created from feat_id_map, but it crashed because of an index mismatch.

How is the column probability calculated?

The column norm is squared in the paper, while in the code it is not squared (a plain Frobenius-style norm). This seems like a contradiction. I changed it to the squared form as in the paper and found that accuracy decreased a lot.
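(For reference, a sketch of the two variants being compared, assuming adj is the normalized adjacency matrix as a scipy.sparse matrix; the variable names are assumptions:)

    import numpy as np

    # Squared column norms, as in the paper: q(u) proportional to ||A(:, u)||^2
    col_sq = np.asarray(adj.multiply(adj).sum(axis=0)).ravel()
    p_squared = col_sq / col_sq.sum()

    # Unsquared column norms, as the issue says the code uses
    col = np.sqrt(col_sq)
    p_unsquared = col / col.sum()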

Converting the npz files from the Reddit json files

Hi,

In the file transformRedditGraph2NPZ.py, I have some problems running this code.

In the function transferRedditData2AdjNPZ:

    def transferRedditData2AdjNPZ(dataset_dir):
        G = json_graph.node_link_graph(json.load(open(dataset_dir + "/reddit-G.json")))
        feat_id_map = json.load(open(dataset_dir + "/reddit-id_map.json"))
        feat_id_map = {id: val for id, val in feat_id_map.iteritems()}
        numNode = len(feat_id_map)
        print(numNode)
        adj = sp.lil_matrix((numNode, numNode))
        print("no")
        for edge in G.edges():
            adj[feat_id_map[edge[0]], feat_id_map[edge[1]]] = 1
        sp.save_npz("reddit_adj.npz", sp.coo_matrix(adj))

feat_id_map looks like {'2gh0ul': 12378, ...} and G.edges() yields pairs like (integer, integer).
So I think there should be an error at
adj[feat_id_map[edge[0]], feat_id_map[edge[1]]]
because edge[i] is an integer, and feat_id_map[integer] doesn't make sense.

Is there something I'm missing?
Thanks in advance.
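(One hypothetical workaround, assuming the mismatch is that some node ids in reddit-G.json are already integer indices rather than string keys of feat_id_map:)

    # Hypothetical helper: use the id-map entry when the node id is a key,
    # and fall back to the raw id when it is already an integer index.
    def node_index(n, feat_id_map):
        return feat_id_map[n] if n in feat_id_map else n

    # ...and index the adjacency matrix via:
    # adj[node_index(edge[0], feat_id_map), node_index(edge[1], feat_id_map)] = 1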

Running on Reddit dataset is extremely slow

I downloaded the processed Reddit dataset from #8 (comment), and then ran train_batch_multiRank_inductive_reddit_Mixlayers_sampleA.py with default parameters. It takes about 10 minutes for a single epoch, yet the paper reports 638.6 seconds for the WHOLE training process, so I am ~200x slower than your reported speed.

I am running on an AWS m5.2xlarge instance with the same CPU spec as your machine (8 vCPUs = 4 cores / 8 threads, 2.5 GHz). All dependencies were simply installed with pip.

How can GCN (batched) reduce memory usage?

Hi, thanks for your great work! But I have two questions.

1. The paper says: "The nonbatched version of GCN runs out of memory on the large graph Reddit". How can GCN (batched) reduce memory usage? Don't we need to load the adjacency matrix in GCN (batched) as well?

2. GCN is a transductive method. How does it work on Reddit, which is treated as an inductive dataset in previous work?

No such folder named ./data/cora_graphSAGE

Hello, I want to use the Cora dataset with GraphSAGE, but I can't find a file named cora-G.json, and there is no folder named ./data/cora_graphSAGE. I would appreciate your help, thank you.

Error in transformRedditGraph2NPZ.py

I downloaded the Reddit data from SNAP, unzipped it, and ran

python transformRedditGraph2NPZ.py

Then it shows the following error message.

    Traceback (most recent call last):
      File "transformRedditGraph2NPZ.py", line 69, in <module>
        transferRedditDataFormat("reddit","reddit.npz")
      File "transformRedditGraph2NPZ.py", line 46, in transferRedditDataFormat
        train_ids = [n for n in G.nodes() if not G.node[n]['val'] and not G.node[n]['test']]
      File "transformRedditGraph2NPZ.py", line 46, in <listcomp>
        train_ids = [n for n in G.nodes() if not G.node[n]['val'] and not G.node[n]['test']]
    KeyError: 'val'

Is this a bug, or is it caused by something else? Thanks.
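(A hypothetical workaround, consistent with a later issue in this list that mentions removing nodes without the "val" attribute: skip such nodes instead of indexing them unconditionally.)

    # Hypothetical fix: keep only nodes that actually carry the
    # 'val'/'test' attributes, so missing keys no longer raise KeyError.
    train_ids = [n for n in G.nodes()
                 if 'val' in G.node[n]
                 and not G.node[n]['val'] and not G.node[n]['test']]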

`TypeError: 633 is not JSON serializable` when calling `create_Graph_forGraphSAGE.py`

When running create_Graph_forGraphSAGE.py, I got the following error:


(py27) user@machinae:/FastGCN$ python create_Graph.py
Traceback (most recent call last):
  File "create_Graph.py", line 39, in <module>
    json.dump(data,f)
  File "/miniconda3/envs/py27/lib/python2.7/json/__init__.py", line 189, in dump
    for chunk in iterable:
  File "/miniconda3/envs/py27/lib/python2.7/json/encoder.py", line 434, in _iterencode
    for chunk in _iterencode_dict(o, _current_indent_level):
  File "/miniconda3/envs/py27/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
    for chunk in chunks:
  File "/miniconda3/envs/py27/lib/python2.7/json/encoder.py", line 332, in _iterencode_list
    for chunk in chunks:
  File "/miniconda3/envs/py27/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
    for chunk in chunks:
  File "/miniconda3/envs/py27/lib/python2.7/json/encoder.py", line 442, in _iterencode
    o = _default(o)
  File "/miniconda3/envs/py27/lib/python2.7/json/encoder.py", line 184, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: 633 is not JSON serializable

Could you give me some suggestions?
Thank you.
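(A common cause, offered as a guess: json cannot serialize numpy integer types such as numpy.int64. A sketch of a workaround that converts them on the fly; the helper name np_default is hypothetical:)

    import json
    import numpy as np

    # Hypothetical converter: turn numpy scalars into native Python types
    # so json.dump can serialize them.
    def np_default(o):
        if isinstance(o, np.integer):
            return int(o)
        if isinstance(o, np.floating):
            return float(o)
        raise TypeError(repr(o) + " is not JSON serializable")

    json.dump(data, f, default=np_default)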

Bug in create_Graph.py

There is an error when I run create_Graph.py.
FileNotFoundError: [Errno 2] No such file or directory: 'cora/cora0-G.json'

Test accuracy on Cora, Citeseer, and Pubmed using original split

Hi, I plan to run pubmed-original_transductive_FastGCN.py on Cora 100 times, changing the random seed from 0 to 99. So far (up to seed = 64) I get:
avg: 79.8%
max: 80.6%
min: 78.4%
std: 0.46%

This is much lower than the result reported in your paper (you only report 81.8%; I don't know whether that is the average or the max, or how many runs it is over).

I will run the code on Citeseer and Pubmed too.

Do I need to change the hyperparameters? Could you tell me the hyperparameters for these three datasets with the original data split, so I can get results comparable with GCN?

Thank you very much!

Could you provide a training-loss plot from TensorBoard? Also, the required files are not fully provided.

I ran train_batch_multiRank_inductive_reddit_Mixlayers_sampleA.py.
The data was processed with transferRedditDataFormat and transferRedditData2AdjNPZ in transformRedditGraph2NPZ.py; in transferRedditDataFormat, nodes without the "val" attribute are removed from G (loaded from reddit-G.json).
The result is not right: the training loss does not decrease steadily as training goes on.
So could you provide the generated reddit.npz and reddit_adj.npz on Google Drive or another server, or address the node-attribute problem?

Setting of batch size and sampled layer size for large network

Thank you for your nice work.
Now I am trying to replace the GCN model with the FastGCN model, because my graph is a large network (around 2,973,489 nodes with binary '0'/'1' labels). Only around 2,000 nodes have label '1'.

When I use the default settings on such a large graph, it is hard for the FastGCN model to learn anything.

For such a strongly imbalanced graph, how should I set the parameters, such as the number of layers, the hidden-layer dimension, the batch size, and the sample size for each layer?
I would appreciate any hints or advice!

Graphsage data prep for large datasets

Hello there,
I would like to apply the code to a dataset with a million nodes; however, using the dense matrix causes a memory error during the GraphSAGE data prep. Do you have any idea how to fix it?

x, tx files not opening

Hi, there are two things I want to ask. They may be dumb questions, as I am new to this.

  1. The x and tx files are not loading, but there are no errors either. I am using Python 3.7.

  2. I want to run FastGCN on my own dataset. Is there any guide for creating the x, y, tx, ty, and graph files from my own dataset?
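(On the first question, a guess at one common cause: the Planetoid-style x/tx files are Python 2 pickles, which need an explicit encoding under Python 3. A minimal sketch, assuming the standard data layout data/ind.<dataset>.x:)

    import sys
    import pickle as pkl

    # Python 2 pickles need an explicit encoding when loaded under Python 3.
    with open("data/ind.cora.x", 'rb') as f:
        if sys.version_info > (3, 0):
            x = pkl.load(f, encoding='latin1')
        else:
            x = pkl.load(f)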

Different weights for calculating probability

In the file "pubmed_inductive_appr2layers.py", on line 123 you use the adjacency matrix of only the batch nodes to compute the weights used for the sampling probability of each node, while on line 128 (p2) you use the adjacency matrix of the entire set of training nodes. The other files use p0, defined on line 107 of this file, for the same purpose.

So shouldn't it be consistent throughout, i.e. should p1 be replaced by p0 in this file?
