
Comments (7)

ahatamiz commented on August 22, 2024

Hi @55998

I tried to reproduce this issue but was not successful. What is your PyTorch version?

persistent_workers is a standard argument of the native PyTorch DataLoader and is used here. The idea is to keep the worker processes alive after an epoch's worth of data has been consumed instead of shutting them down. This should improve performance, but it is not a critical component. I suggest removing it for now until we can pin down the issue.
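For anyone who needs the same script to run on several PyTorch versions, here is a minimal sketch of passing persistent_workers only when the installed DataLoader accepts it. `loader_kwargs` is a hypothetical helper, not part of this repository, and the usage lines assume a dataset named `train_ds`. To my knowledge the argument first appeared in PyTorch 1.7, which would explain why older installs reject it.

```python
import inspect


def loader_kwargs(loader_cls, num_workers, **kwargs):
    """Build DataLoader keyword arguments, adding persistent_workers only when supported.

    loader_cls is expected to be torch.utils.data.DataLoader; it is passed in
    so the check runs against whatever PyTorch version is installed.
    """
    accepted = inspect.signature(loader_cls.__init__).parameters
    out = dict(kwargs, num_workers=num_workers)
    # persistent_workers needs worker processes, so skip it when num_workers == 0.
    if "persistent_workers" in accepted and num_workers > 0:
        out["persistent_workers"] = True
    return out


# Usage (assumes PyTorch is installed and `train_ds` is your dataset):
# from torch.utils.data import DataLoader
# train_loader = DataLoader(train_ds, **loader_kwargs(DataLoader, num_workers=8, batch_size=1))
```

On a version without the argument this simply falls back to the default behaviour, where workers are restarted each epoch.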

Thanks

from research-contributions.

Jx-Tan commented on August 22, 2024

Hi @ahatamiz

Thank you very much. My PyTorch version is 1.6.0. After removing this argument, I was able to continue training.

However, I ran into another bug and hope you can help me solve it.

Describe the bug
RuntimeError: Unsupported data type for NCCL process group, in ..../torch/distributed/distributed_c10d.py line 1185, in all_gather work = _default_pg.allgather([tensor_list], [tensor])

To Reproduce
Steps to reproduce the behavior:

Go to 'UNETR/BTCV'
Install 'monai==0.7.0 nibabel==3.1.1 tqdm==4.59.0 einops==0.3.2 tensorboardx==2.1'
Run the command 'python main.py
--batch_size=1
--logdir=unetr_pretrained
--optim_lr=2e-4
--lrschedule=warmup_cosine
--infer_overlap=0.5
--save_checkpoint
--data_dir=/dataset/dataset0/
--pretrained_dir='./pretrained_models/'
--pretrained_model_name='UNETR_model_best_acc.pth'
--resume_ckpt
--distributed'

Expected behavior
Training starts correctly.

Screenshots
Console output:
[screenshot: 1635478307(1)]

Related code:
[screenshot: 1635478465(1)], in trainer.py line 85

[screenshot: 1635478566(1)], in utils.py line 56

[screenshot: 1635478631(1)], in utils.py line 65

Environment (please complete the following information):
OS: Ubuntu 16.04
Python version: 3.6
MONAI version: 0.7.0
CUDA/cuDNN version: 10.2
GPU models and configuration: 2x NVIDIA GeForce RTX 2080 Ti (11 GB)

Additional context
PyTorch version is 1.6.0


ahatamiz commented on August 22, 2024

Hi @55998

I was able to reproduce this issue.

This issue and the first one are both caused by the PyTorch version. I recommend using torch==1.9.1 to avoid both of them.

I will update the requirements.txt to reflect this.
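Until the requirements file is updated, a small startup guard can fail early with a clear message instead of hitting these errors mid-run. This is only a minimal sketch: `version_tuple` and `require_min_version` are hypothetical helpers, not part of this repository, and the 1.9.1 default simply mirrors the recommendation above.

```python
import re


def version_tuple(version):
    """Parse the leading numeric part of a version string such as '1.9.1+cu102'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError("unrecognised version string: {!r}".format(version))
    # A missing patch component (e.g. '1.9') is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def require_min_version(installed, minimum="1.9.1"):
    """Raise RuntimeError if `installed` is older than `minimum`."""
    if version_tuple(installed) < version_tuple(minimum):
        raise RuntimeError(
            "PyTorch {} found, but at least {} is recommended".format(installed, minimum)
        )


# Usage (assumes PyTorch is installed):
# import torch
# require_min_version(torch.__version__)
```

Comparing tuples rather than raw strings matters here, because a plain string comparison would rank "1.10.0" below "1.9.1".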

Thanks


Jx-Tan commented on August 22, 2024

Hi @ahatamiz

Thank you for your reply. I can now run the code correctly.

But I have a new issue. I am using two NVIDIA GeForce RTX 2080 Ti GPUs with torch==1.9.1, the data is the same as in the README, and the training parameters are:
--feature_size=32
--batch_size=1
--logdir=unetr_test
--optim_lr=2e-4
--lrschedule=warmup_cosine
--infer_overlap=0.5
--save_checkpoint
--data_dir=/dataset/dataset0/
--distributed
I set max_epochs to 3000. When training finished, the output was: Final training 2999/2999 loss: 0.5940, Final validation 2999/2999 acc 0.8024047, Training Finished! Best Accuracy: 0.0
So what is the problem?

[screenshot: 1635560067(1)]


Jx-Tan commented on August 22, 2024

Hi @ahatamiz

I tested with the best model and got an AVG_DICE of 0.79. Training without the pretrained model for 3000 epochs, the final validation accuracy was 0.80.
These numbers seem a little low. What might be the reason? The dataset I use is the abdomen CT raw data from the BTCV challenge.


hnjzbss commented on August 22, 2024

Hi @ahatamiz

Thank you for your reply. I can now run the code correctly.

But I have a new issue. I am using two NVIDIA GeForce RTX 2080 Ti GPUs with torch==1.9.1, the data is the same as in the README, and the training parameters are --feature_size=32 --batch_size=1 --logdir=unetr_test --optim_lr=2e-4 --lrschedule=warmup_cosine --infer_overlap=0.5 --save_checkpoint --data_dir=/dataset/dataset0/ --distributed. I set max_epochs to 3000. When training finished, the output was: Final training 2999/2999 loss: 0.5940, Final validation 2999/2999 acc 0.8024047, Training Finished! Best Accuracy: 0.0. So what is the problem?

[screenshot: 1635560067(1)]

How did you solve this problem?


ahatamiz commented on August 22, 2024

Hi @55998

I was able to reproduce this issue.

Based on this, I have submitted a new pull request (#18) that addresses this issue. Please re-try once it's merged.

Thanks

