Comments (8)
Yes, you should pass the same files you used for training to evaluation; otherwise the system does not know what the graph looks like. You also need to keep the dataset name the same for train and eval, since it is currently used to name the output files (e.g., the .npy mapping files).
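As a concrete sketch, the train and eval invocations should point at the same --data_path and --data_files (file names below are taken from this thread; the dglke_eval flags are my recollection of the dgl-ke CLI, so double-check them against dglke_eval --help):

```shell
# Train: writes entity/relation mapping files (.npy) under --data_path
# and model checkpoints under --save_path. (Sketch; hyperparameters omitted.)
DGLBACKEND=pytorch dglke_train --model_name ComplEx \
    --data_path ./data \
    --data_files LJ_training.txt LJ_validation.txt LJ_test.txt \
    --format raw_udd_hrt --save_path ./ckpts

# Eval: must see the SAME files and --data_path so the saved ID mappings
# line up with the triples being scored.
DGLBACKEND=pytorch dglke_eval --model_name ComplEx \
    --data_path ./data \
    --data_files LJ_training.txt LJ_validation.txt LJ_test.txt \
    --format raw_udd_hrt --model_path ./ckpts
```

If the two runs see different files or a different dataset name, the entity/relation IDs assigned at eval time will not match the embeddings saved during training.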
from dgl-ke.
Ok thank you!
But I passed the same files, and it still yielded this error. I'm pretty sure I have enough space on the machine. Do you know what could be the possible cause?
Can you show me the commands for train and eval, and the call trace where std::bad_alloc happened?
Here is my command:
DGLBACKEND=pytorch dglke_train --model_name ComplEx --data_path ./data \
    --data_files LJ_training.txt LJ_validation.txt LJ_test.txt --format raw_udd_hrt \
    --batch_size 200000 --neg_sample_size 1000 --hidden_dim 100 --gamma 19.9 --lr 0.1 \
    --max_step 2400 --log_interval 100 --batch_size_eval 10000 -adv \
    --regularization_coef 1.00E-09 --test --gpu 1 --num_thread 1 --num_proc 1
And this is the output:
Using backend: pytorch
Logs are being recorded at: ckpts/ComplEx_FB15k_10/train.log
Reading train triples....
Finished. Read 62094396 train triples.
Reading valid triples....
Finished. Read 34496887 valid triples.
Reading test triples....
Finished. Read 34496886 test triples.
|Train|: 62094396
/usr/local/lib/python3.6/dist-packages/dgl/base.py:25: UserWarning: multigraph will be deprecated.DGL will treat all graphs as multigraph in the future.
warnings.warn(msg, warn_type)
|valid|: 34496887
|test|: 34496886
Total initialize time 611.986 seconds
[proc 0][Train](100/2400) average pos_loss: 0.6891365647315979
[proc 0][Train](100/2400) average neg_loss: 0.6942214441299438
[proc 0][Train](100/2400) average loss: 0.6916790020465851
[proc 0][Train](100/2400) average regularization: 7.632575531232532e-05
[proc 0][Train] 100 steps take 11.511 seconds
[proc 0]sample: 10.400, forward: 0.435, backward: 0.562, update: 0.113
......
......
[proc 0][Train](2400/2400) average pos_loss: 0.29677054792642593
[proc 0][Train](2400/2400) average neg_loss: 0.731871457695961
[proc 0][Train](2400/2400) average loss: 0.5143210029602051
[proc 0][Train](2400/2400) average regularization: 0.0003860547285876237
[proc 0][Train] 100 steps take 1.440 seconds
[proc 0]sample: 0.388, forward: 0.426, backward: 0.499, update: 0.126
proc 0 takes 45.253 seconds
training takes 45.25495719909668 seconds
terminate called after throwing an instance of 'std::bad_alloc'
terminate called recursively
terminate called recursively
terminate called recursively
Aborted (core dumped)
Thanks!
It seems you are using a huge batch_size, especially for evaluation. During evaluation, if you do not specify neg_sample_size_eval, the whole entity set is used as candidate negative nodes, which consumes a lot of memory. Reduce batch_size_eval to something smaller, like 100 or 500, and use, for example, neg_sample_size_eval=10000.
Your dataset has ~62M edges, which is much larger than FB15k.
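A back-of-the-envelope calculation shows why scoring against the full entity set blows up: the score tensor alone is batch_size_eval × num_entities float32 values. The node count below is a hypothetical assumption for a graph with ~62M edges, not a number taken from this issue:

```shell
# Rough size of one evaluation score tensor in float32 (4 bytes/score).
# num_entities is an assumed placeholder, not reported in the thread.
batch_size_eval=10000
num_entities=5000000
echo "full entity set: $(( batch_size_eval * num_entities * 4 / 1024**3 )) GiB"

# Capping the candidates with neg_sample_size_eval keeps it tractable.
neg_sample_size_eval=10000
echo "capped: $(( batch_size_eval * neg_sample_size_eval * 4 / 1024**2 )) MiB"
```

Under these assumptions the uncapped score tensor is on the order of 186 GiB before any intermediate activations, which is consistent with an allocation failure (std::bad_alloc) on a single host.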
Thank you! The error is gone, but evaluation becomes very slow with --batch_size_eval set to 100.
How many nodes do you have? If the number is large, e.g., millions of nodes, I recommend using neg_sample_size_eval=10000.
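Putting the two suggestions together, the original command would change only in its evaluation flags. This is a sketch: --neg_sample_size_eval is the CLI spelling of the option suggested here, so verify it against dglke_train --help:

```shell
# Same command as posted above, with a smaller eval batch and a capped
# negative-candidate set for evaluation.
DGLBACKEND=pytorch dglke_train --model_name ComplEx --data_path ./data \
    --data_files LJ_training.txt LJ_validation.txt LJ_test.txt --format raw_udd_hrt \
    --batch_size 200000 --neg_sample_size 1000 --hidden_dim 100 --gamma 19.9 \
    --lr 0.1 --max_step 2400 --log_interval 100 -adv --regularization_coef 1.00E-09 \
    --test --gpu 1 --num_thread 1 --num_proc 1 \
    --batch_size_eval 500 --neg_sample_size_eval 10000
```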
Since the docs have been updated, I'm closing this issue.