
Comments (8)

classicsong commented on June 1, 2024

I see, you are using tee.
Because the output goes through a pipe, it is buffered and arrives in batches. You may need to wait for the test to finish before the rest appears.
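
One way to see progress lines as they are produced is to turn off Python's output buffering before piping into tee. The sketch below assumes the trainer is an ordinary Python process whose stdout is block-buffered when it is connected to a pipe; the inline script is a hypothetical stand-in for a long-running job, not part of dgl-ke, and the thread does not confirm that this alone resolves the apparent freeze.

```shell
# Hypothetical stand-in for a trainer: prints a progress line, then keeps working.
# PYTHONUNBUFFERED=1 makes Python write each line to the pipe immediately,
# so tee can display and record it before the process exits.
PYTHONUNBUFFERED=1 python3 -c 'import time
for step in (1000, 2000, 3000):
    print(f"[Train]({step}/3000) step finished")
    time.sleep(0.1)
' | tee results.txt
```

The same environment variable can be placed in front of any Python-based command, and running a script with `python -u` has the same effect.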

from dgl-ke.

classicsong commented on June 1, 2024

Can you check the following two things:

  1. The CPU usage and GPU usage when it 'freezes'.
  2. The connection to the notebook server.
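
The first check could be done with a few short samples taken while the job appears stuck. This is only a sketch: it assumes a Linux machine with procps `ps`, falls back to a message if `nvidia-smi` is not installed, and the output file name `usage_samples.txt` is made up for illustration.

```shell
# Take three one-second samples of GPU and CPU activity.
for i in 1 2 3; do
    if command -v nvidia-smi >/dev/null 2>&1; then
        # GPU utilization and memory in use, one CSV line per GPU.
        nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader
    else
        echo "nvidia-smi not found; skipping GPU sample"
    fi
    # Header plus the three busiest processes by CPU share.
    ps -eo pid,pcpu,pmem,comm --sort=-pcpu | head -n 4
    sleep 1
done | tee usage_samples.txt
```

If the GPU shows 0% utilization while a Python process is still busy on the CPU, the job is probably still working rather than hung.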


kdutia commented on June 1, 2024

The connection to the notebook server is ok. I've just reduced max_step to 40,000 and now it gets stuck at 22,000.

This is what I get when I run nvidia-smi on my EC2 machine.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   42C    P0    51W / 300W |   1271MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     23371      C   ...ector-tBN8TxCn/bin/python     1269MiB |
+-----------------------------------------------------------------------------+


classicsong commented on June 1, 2024

Can you show me the entire command line?


kdutia commented on June 1, 2024

This is my latest experiment, which is stuck at step 22,000:

/home/ubuntu/.local/share/virtualenvs/heritage-connector-tBN8TxCn/lib/python3.7/site-packages/dgl/base.py:25: UserWarning: multigraph will be deprecated.DGL will treat all graphs as multigraph in the future.
  warnings.warn(msg, warn_type)
Logs are being recorded at: /home/ubuntu/data/results/TransE_l1_hc1708_14/train.log
Reading train triples....
Finished. Read 1112834 train triples.
Reading valid triples....
Finished. Read 34417 valid triples.
Reading test triples....
Finished. Read 34417 test triples.
|Train|: 1112834
|valid|: 34417
|test|: 34417
Total initialize time 5.893 seconds
[proc 0][Train](1000/40000) average pos_loss: 0.9731047201752663
[proc 0][Train](1000/40000) average neg_loss: 0.8851320103555917
[proc 0][Train](1000/40000) average loss: 0.9291183650195599
[proc 0][Train](1000/40000) average regularization: 3.712290239400318e-05
[proc 0][Train] 1000 steps take 11.759 seconds
[proc 0]sample: 1.389, forward: 4.386, backward: 1.811, update: 4.162
[proc 0][Train](2000/40000) average pos_loss: 0.6279566985368729
[proc 0][Train](2000/40000) average neg_loss: 0.6747330973446369
[proc 0][Train](2000/40000) average loss: 0.6513448976278305
[proc 0][Train](2000/40000) average regularization: 4.70831001366605e-05
[proc 0][Train] 1000 steps take 8.641 seconds
[proc 0]sample: 1.267, forward: 4.299, backward: 1.802, update: 1.263
[proc 0][Train](3000/40000) average pos_loss: 0.39179811123013497
[proc 0][Train](3000/40000) average neg_loss: 0.48457677371799945
[proc 0][Train](3000/40000) average loss: 0.43818744249641894
[proc 0][Train](3000/40000) average regularization: 5.761738594082999e-05
[proc 0][Train] 1000 steps take 8.768 seconds
[proc 0]sample: 1.367, forward: 4.343, backward: 1.854, update: 1.194
[proc 0][Train](4000/40000) average pos_loss: 0.34823234072327613
[proc 0][Train](4000/40000) average neg_loss: 0.43337956032902003
[proc 0][Train](4000/40000) average loss: 0.39080595020949843
[proc 0][Train](4000/40000) average regularization: 6.410275710004499e-05
[proc 0][Train] 1000 steps take 8.561 seconds
[proc 0]sample: 1.273, forward: 4.279, backward: 1.805, update: 1.194
[proc 0][Train](5000/40000) average pos_loss: 0.28567829565703867
[proc 0][Train](5000/40000) average neg_loss: 0.381733529381454
[proc 0][Train](5000/40000) average loss: 0.33370591297745705
[proc 0][Train](5000/40000) average regularization: 7.077617696631933e-05
[proc 0][Train] 1000 steps take 8.606 seconds
[proc 0]sample: 1.320, forward: 4.235, backward: 1.804, update: 1.237
[proc 0][Train](6000/40000) average pos_loss: 0.27727535855770113
[proc 0][Train](6000/40000) average neg_loss: 0.359264762930572
[proc 0][Train](6000/40000) average loss: 0.31827006104588507
[proc 0][Train](6000/40000) average regularization: 7.580427336506546e-05
[proc 0][Train] 1000 steps take 8.738 seconds
[proc 0]sample: 1.313, forward: 4.382, backward: 1.802, update: 1.232
[proc 0][Train](7000/40000) average pos_loss: 0.25403493851423264
[proc 0][Train](7000/40000) average neg_loss: 0.340814183909446
[proc 0][Train](7000/40000) average loss: 0.297424561008811
[proc 0][Train](7000/40000) average regularization: 8.013840392231942e-05
[proc 0][Train] 1000 steps take 8.778 seconds
[proc 0]sample: 1.358, forward: 4.336, backward: 1.853, update: 1.220
[proc 0][Train](8000/40000) average pos_loss: 0.249679933860898
[proc 0][Train](8000/40000) average neg_loss: 0.32741397356987
[proc 0][Train](8000/40000) average loss: 0.28854695366322997
[proc 0][Train](8000/40000) average regularization: 8.428859609557549e-05
[proc 0][Train] 1000 steps take 8.737 seconds
[proc 0]sample: 1.307, forward: 4.384, backward: 1.814, update: 1.221
[proc 0][Train](9000/40000) average pos_loss: 0.2401422714293003
[proc 0][Train](9000/40000) average neg_loss: 0.32162084911391137
[proc 0][Train](9000/40000) average loss: 0.2808815600425005
[proc 0][Train](9000/40000) average regularization: 8.721390135906404e-05
[proc 0][Train] 1000 steps take 8.843 seconds
[proc 0]sample: 1.335, forward: 4.411, backward: 1.816, update: 1.269
[proc 0][Train](10000/40000) average pos_loss: 0.23346979524195194
[proc 0][Train](10000/40000) average neg_loss: 0.31008259259164334
[proc 0][Train](10000/40000) average loss: 0.2717761939018965
[proc 0][Train](10000/40000) average regularization: 9.087248668947723e-05
[proc 0][Train] 1000 steps take 8.752 seconds
[proc 0]sample: 1.290, forward: 4.396, backward: 1.831, update: 1.224
[proc 0][Train](11000/40000) average pos_loss: 0.23218116450309753
[proc 0][Train](11000/40000) average neg_loss: 0.3106829285062849
[proc 0][Train](11000/40000) average loss: 0.2714320463836193
[proc 0][Train](11000/40000) average regularization: 9.28232833830407e-05
[proc 0][Train] 1000 steps take 8.918 seconds
[proc 0]sample: 1.371, forward: 4.442, backward: 1.827, update: 1.266
[proc 0][Train](12000/40000) average pos_loss: 0.2225425555408001
[proc 0][Train](12000/40000) average neg_loss: 0.2998654997013509
[proc 0][Train](12000/40000) average loss: 0.2612040272951126
[proc 0][Train](12000/40000) average regularization: 9.625060645339545e-05
[proc 0][Train] 1000 steps take 8.826 seconds
[proc 0]sample: 1.322, forward: 4.444, backward: 1.856, update: 1.194
[proc 0][Train](13000/40000) average pos_loss: 0.22791376207768918
[proc 0][Train](13000/40000) average neg_loss: 0.3035700426399708
[proc 0][Train](13000/40000) average loss: 0.2657419020012021
[proc 0][Train](13000/40000) average regularization: 9.777809328079456e-05
[proc 0][Train] 1000 steps take 8.668 seconds
[proc 0]sample: 1.283, forward: 4.339, backward: 1.832, update: 1.205
[proc 0][Train](14000/40000) average pos_loss: 0.2136377188116312
[proc 0][Train](14000/40000) average neg_loss: 0.29430538304895165
[proc 0][Train](14000/40000) average loss: 0.25397155099362134
[proc 0][Train](14000/40000) average regularization: 0.00010068564637185773
[proc 0][Train] 1000 steps take 8.839 seconds
[proc 0]sample: 1.392, forward: 4.428, backward: 1.816, update: 1.192
[proc 0][Train](15000/40000) average pos_loss: 0.22163018888235092
[proc 0][Train](15000/40000) average neg_loss: 0.29803618866577747
[proc 0][Train](15000/40000) average loss: 0.2598331885784864
[proc 0][Train](15000/40000) average regularization: 0.00010206920657947194
[proc 0][Train] 1000 steps take 8.699 seconds
[proc 0]sample: 1.265, forward: 4.363, backward: 1.824, update: 1.237
[proc 0][Train](16000/40000) average pos_loss: 0.21076470874249936
[proc 0][Train](16000/40000) average neg_loss: 0.29174677131138743
[proc 0][Train](16000/40000) average loss: 0.25125573988258837
[proc 0][Train](16000/40000) average regularization: 0.0001043690236110706
[proc 0][Train] 1000 steps take 8.948 seconds
[proc 0]sample: 1.399, forward: 4.436, backward: 1.861, update: 1.241
[proc 0][Train](17000/40000) average pos_loss: 0.2166016393750906
[proc 0][Train](17000/40000) average neg_loss: 0.293723577266559
[proc 0][Train](17000/40000) average loss: 0.2551626083999872
[proc 0][Train](17000/40000) average regularization: 0.00010588835964153987
[proc 0][Train] 1000 steps take 8.634 seconds
[proc 0]sample: 1.272, forward: 4.379, backward: 1.822, update: 1.151
[proc 0][Train](18000/40000) average pos_loss: 0.20872869043052197
[proc 0][Train](18000/40000) average neg_loss: 0.2904951977562159
[proc 0][Train](18000/40000) average loss: 0.24961194440722465
[proc 0][Train](18000/40000) average regularization: 0.00010771663221385097
[proc 0][Train] 1000 steps take 8.670 seconds
[proc 0]sample: 1.306, forward: 4.291, backward: 1.819, update: 1.243
[proc 0][Train](19000/40000) average pos_loss: 0.21248555865883828
[proc 0][Train](19000/40000) average neg_loss: 0.29046530737914145
[proc 0][Train](19000/40000) average loss: 0.2514754327312112
[proc 0][Train](19000/40000) average regularization: 0.00010925246066472027
[proc 0][Train] 1000 steps take 8.893 seconds
[proc 0]sample: 1.336, forward: 4.475, backward: 1.819, update: 1.253
[proc 0][Train](20000/40000) average pos_loss: 0.20720807661116122
[proc 0][Train](20000/40000) average neg_loss: 0.28932265562191606
[proc 0][Train](20000/40000) average loss: 0.2482653663828969
[proc 0][Train](20000/40000) average regularization: 0.00011058572954789269
[proc 0][Train] 1000 steps take 8.636 seconds
[proc 0]sample: 1.322, forward: 4.292, backward: 1.838, update: 1.174
[proc 0][Train](21000/40000) average pos_loss: 0.2084350918084383
[proc 0][Train](21000/40000) average neg_loss: 0.2872589902477339
[proc 0][Train](21000/40000) average loss: 0.24784704087674617
[proc 0][Train](21000/40000) average regularization: 0.00011231124978075968
[proc 0][Train] 1000 steps take 8.813 seconds
[proc 0]sample: 1.273, forward: 4.403, backward: 1.821, update: 1.305
[proc 0][Train](22000/40000) average pos_loss: 0.20681360100209714
[proc 0][Train](22000/40000) average neg_loss: 0.2890164882643148


classicsong commented on June 1, 2024

I mean the launch command line, as in dglke_train ...
Which DGL version are you using?


kdutia commented on June 1, 2024

The latest release on PyPI. Sorry! The command is:

dglke_train --max_step 40000 --model_name NAMES --data_path ~/data --save_path ~/data/results  --dataset hc1708 \
    --format raw_udd_htr --data_files train.txt valid.txt test.txt \
    --log_interval 1000 --batch_size 1024 --batch_size_eval 16 --neg_sample_size 16 \
    --lr LR --hidden_dim HIDDEN_DIM -rc REGULARIZATION_COEF -g GAMMA \
    --gpu 0 --mix_cpu_gpu --async_update --test --neg_sample_size_eval 100000 | tee results.txt

Also, the following make no difference to whether it works:

  • adding or removing -adv
  • changing LR
  • changing model between TransE_l1 and TransE_l2
  • changing hidden_dim
  • choosing regularization_coef from [2e-6, 2e-8]
  • choosing gamma from [1, 5, 10, 20]


kdutia commented on June 1, 2024

Thanks, I hadn't considered this.

