Comments (14)
@KJGithub2021
All experiments were run on a single NVIDIA GeForce 1080 (12G) GPU card.
The default training parameters are 10 epochs and a batch size of 96, with evaluation every 1,000 steps. You can check them in the default script at https://github.com/JasonForJoy/IMN/blob/master/Ubuntu_V1/scripts/ubuntu_train.sh
from imn.
@JasonForJoy Thanks for your reply. Yes, I had already checked the .sh files, but the default configuration I was referring to is the one given in the train.py source file. Anyhow, I will use the one you pointed out.
Secondly, can you let me know how long the model took to complete training and evaluation on the hardware you mentioned?
from imn.
@KJGithub2021 It took about 90 h (including evaluation on the dev set every 1,000 steps) under the default setting, i.e., 10 epochs and a batch size of 96 on a single NVIDIA GeForce 1080 (12G) GPU card.
@JasonForJoy And was this time measured on the original UDC V2 dataset, which consists of 957,101 training dialogs and 19,560 validation dialogs?
Secondly, can this time (~4 days of training and evaluation) be reduced further? Is there any room for optimizing the model?
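For context, the evaluation overhead implied by these settings can be estimated from the figures quoted in this thread (957,101 training dialogs, batch size 96, evaluation every 1,000 steps). This is only back-of-the-envelope arithmetic on those numbers; the exact step count depends on how the repo actually batches the data:

```python
import math

train_dialogs = 957_101   # Ubuntu V2 training-set size quoted above
batch_size = 96
epochs = 10
eval_every = 1_000        # steps between dev-set evaluations

steps_per_epoch = math.ceil(train_dialogs / batch_size)
total_steps = steps_per_epoch * epochs
num_evals = total_steps // eval_every

print(steps_per_epoch, total_steps, num_evals)  # 9970 99700 99
```

So the default run performs roughly a hundred dev-set evaluations, which is why evaluating less often (as suggested below) shortens the wall-clock time noticeably.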
@KJGithub2021
About 50 h on the Ubuntu V2 dataset.
You might try:
- Enlarging batch_size on a more advanced GPU card
- Evaluating less frequently, e.g., every 2k steps
Okay, thank you for the information! I will let you know if I come across anything.
@JasonForJoy
Do the model checkpoints that are saved after each evaluation allow training to resume automatically if it disconnects at any point, or does training need to be resumed manually through code?
Thanks.
@JasonForJoy
In continuation of my previous query, can you also shed some light on how to resume model training from a saved checkpoint using Google Colab? There must have been times when you had to resume training because of the long runtimes...
I would really appreciate your help. Thanks.
@JasonForJoy Can you kindly respond to this query and give some direction? I really appreciate your help.
@KJGithub2021
Sorry, we have no experience with resuming model training from a saved checkpoint in Google Colab, so we cannot offer a suggestion.
Okay... but then what was the purpose of the code that saves model checkpoints?
@JasonForJoy Understood. But how did you otherwise intend to resume training from a saved checkpoint through your code?
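For what it's worth, checkpoints written by a TensorFlow 1.x `tf.train.Saver` can in principle be restored before the training loop with `saver.restore(sess, tf.train.latest_checkpoint(ckpt_dir))`; whether the IMN code exposes a resume path is not confirmed here, so treat this as a hypothetical sketch. The "pick the newest checkpoint" step can be illustrated without TensorFlow, assuming filenames of the form `model-<step>`:

```python
import os
import re
import tempfile

def latest_checkpoint(ckpt_dir):
    """Return the checkpoint prefix with the highest global step,
    mimicking what tf.train.latest_checkpoint resolves to.
    Assumes files named like 'model-<step>.index'."""
    best_step, best_prefix = -1, None
    for name in os.listdir(ckpt_dir):
        m = re.match(r"(model-(\d+))\.index$", name)
        if m and int(m.group(2)) > best_step:
            best_step = int(m.group(2))
            best_prefix = os.path.join(ckpt_dir, m.group(1))
    return best_prefix

# Demo with dummy checkpoint index files
d = tempfile.mkdtemp()
for step in (1000, 2000, 3000):
    open(os.path.join(d, f"model-{step}.index"), "w").close()
print(latest_checkpoint(d))  # prefix ending in 'model-3000'
```

Restoring from that prefix would then recover the model weights; note that unless the optimizer state and global step are also saved in the checkpoint, "resuming" restarts the optimizer schedule rather than continuing it exactly.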
@JasonForJoy Can you please confirm whether you used batch_size 96 for the test dataset as well, or 128?
Hi @JasonForJoy, can you please confirm whether reducing the batch size (due to only a low-end GPU being available) can affect the model's performance?
Related Issues (20)
- One question about the numbers on the last column of one row HOT 2
- when running bash ubuntu_train.sh, one error is D:\Anaconda\python.exe: can't find '__main__' module in 'D:/' HOT 2
- When running python train.py, there is nothing in checkpoints. & GPU use problem HOT 1
- Number of training steps HOT 12
- Could you provide the application code or recommend a related repo? HOT 1
- The process get stuck after python eval.py HOT 3
- Question about the sentences in responses.txt HOT 1
- Is this suitable for single-turn retrieval? HOT 1
- On computing the evaluation metrics HOT 1
- A few questions about the Douban dataset HOT 4
- Embedding Glove + word2vec embedding
- Graph Execution Error on running tensorflow session
- loss calculated as nan
- model Hyperparameters HOT 1
- Batch Size impact on model performance
- a little lower results on MAP and MRR HOT 1
- Question about model selection HOT 1
- Where do the candidate responses in the dataset come from? HOT 2
- Could you provide the code for training the English word2vec embeddings? HOT 7