
TypeError: __init__() got an unexpected keyword argument 'presistent_workers' (research-contributions, CLOSED)

Jx-Tan commented on August 22, 2024

TypeError:__init__() got an unexpected keyword argument 'presistent_workers'

from research-contributions.

Comments (7)

ahatamiz commented on August 22, 2024

Hi @55998

I tried to reproduce this issue but was not successful. What is your PyTorch version?

persistent_workers is a standard argument of the native PyTorch DataLoader and is used here. The idea is to keep the worker processes alive after the dataset has been consumed once (one epoch) instead of shutting them down and restarting them. This should improve performance, but it is not a critical component. I would suggest removing it for the moment until we can pin down the issue.
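As a possible stopgap instead of deleting the argument by hand, the keyword can be added only when the installed PyTorch supports it. This is a minimal sketch, not code from the repository: `loader_kwargs` is a hypothetical helper, and it relies on `persistent_workers` having been added to `torch.utils.data.DataLoader` in PyTorch 1.7 and requiring `num_workers > 0`.

```python
def loader_kwargs(torch_version: str, num_workers: int) -> dict:
    """Build DataLoader keyword arguments (hypothetical helper, not from the repo).

    persistent_workers is included only when the given PyTorch version
    supports it (1.7 or newer) and worker processes are actually used.
    """
    # Strip local build tags such as "+cu102", then keep "major.minor".
    major, minor = (int(part) for part in torch_version.split("+")[0].split(".")[:2])
    kwargs = {"num_workers": num_workers, "pin_memory": True}
    if (major, minor) >= (1, 7) and num_workers > 0:
        kwargs["persistent_workers"] = True
    return kwargs


# Intended use (requires torch, so it is left as a comment here):
#   import torch
#   from torch.utils.data import DataLoader
#   loader = DataLoader(dataset, batch_size=1, **loader_kwargs(torch.__version__, 8))
```

With PyTorch 1.6.0 the helper leaves the argument out, so the DataLoader constructor no longer sees an unknown keyword.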

Thanks


Jx-Tan commented on August 22, 2024

Hi @ahatamiz

Thank you very much. My PyTorch version is 1.6.0. After removing this argument from the code, I was able to continue training.

But I encountered another bug. I hope you can help me to solve it.

Describe the bug
RuntimeError: Unsupported data type for NCCL process group, in ..../torch/distributed/distributed_c10d.py line 1185, in all_gather work = _default_pg.allgather([tensor_list], [tensor])

To Reproduce
Steps to reproduce the behavior:

Go to 'UNETR/BTCV'
Install 'monai==0.7.0 nibabel==3.1.1 tqdm==4.59.0 einops==0.3.2 tensorboardx==2.1'
Run the command:
python main.py
--batch_size=1
--logdir=unetr_pretrained
--optim_lr=2e-4
--lrschedule=warmup_cosine
--infer_overlap=0.5
--save_checkpoint
--data_dir=/dataset/dataset0/
--pretrained_dir='./pretrained_models/'
--pretrained_model_name='UNETR_model_best_acc.pth'
--resume_ckpt
--distributed

Expected behavior
Training starts correctly.

Screenshots
Console output: (screenshot not preserved)

Related code (screenshots not preserved):
trainer.py, line 85
utils.py, line 56
utils.py, line 65

Environment
OS: Ubuntu 16.04
Python version: 3.6
MONAI version: 0.7.0
CUDA/cuDNN version: 10.2
GPU models and configuration: 2 × NVIDIA GeForce RTX 2080 Ti (11 GB)

Additional context
PyTorch version is 1.6.0


ahatamiz commented on August 22, 2024

Hi @55998

I was able to reproduce this issue.

This issue and the first one are both caused by the PyTorch version. I recommend using torch==1.9.1 to avoid both of them.

I will update the requirements.txt to reflect this.

Thanks


Jx-Tan commented on August 22, 2024

Hi @ahatamiz

Thank you for your reply. Now I can correctly use this code.

But I have a new issue. I am using two NVIDIA GeForce RTX 2080 Ti GPUs and torch==1.9.1, the data is the same as in the README, and the training parameters are:
--feature_size=32
--batch_size=1
--logdir=unetr_test
--optim_lr=2e-4
--lrschedule=warmup_cosine
--infer_overlap=0.5
--save_checkpoint
--data_dir=/dataset/dataset0/
--distributed
I set max_epochs to 3000. When training finished, I got: Final training 2999/2999 loss: 0.5940, Final validation 2999/2999 acc 0.8024047, Training Finished! Best Accuracy: 0.0
So, what is the problem?
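For anyone comparing those two numbers: a "best accuracy" summary is normally a running maximum over the per-epoch validation accuracies, so it should never be lower than the final validation accuracy if it is updated after every validation pass. The sketch below only illustrates that expectation. It is not the code in trainer.py, the cause of the 0.0 value is not stated in this thread, and `track_best` is a hypothetical name.

```python
def track_best(val_accs):
    """Return (best_acc, best_epoch) for a sequence of per-epoch validation accuracies.

    Hypothetical illustration of a running maximum; not taken from the repository.
    """
    best_acc, best_epoch = 0.0, -1
    for epoch, acc in enumerate(val_accs):
        # Update the running maximum whenever this epoch beats the best so far.
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
    return best_acc, best_epoch
```

If the update step is skipped, or runs only in a process whose result is not the one reported, the summary stays at its initial value of 0.0 even though the per-epoch logs show non-zero accuracy.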


Jx-Tan commented on August 22, 2024

Hi @ahatamiz

I tested with the best model and got an AVG_DICE of 0.79. Without the pretrained model, after 3000 epochs, the final validation accuracy was 0.80.
These numbers seem a little low. What could be the reason? The dataset I use is the abdomen CT raw data from the BTCV challenge.


hnjzbss commented on August 22, 2024

Hi @ahatamiz

Thank you for your reply. Now I can correctly use this code.

But I have a new issue. I am using two NVIDIA GeForce RTX 2080 Ti GPUs and torch==1.9.1, the data is the same as in the README, and the training parameters are --feature_size=32 --batch_size=1 --logdir=unetr_test --optim_lr=2e-4 --lrschedule=warmup_cosine --infer_overlap=0.5 --save_checkpoint --data_dir=/dataset/dataset0/ --distributed. I set max_epochs to 3000. When training finished, I got: Final training 2999/2999 loss: 0.5940, Final validation 2999/2999 acc 0.8024047, Training Finished! Best Accuracy: 0.0. So, what is the problem?


How did you solve this problem?


ahatamiz commented on August 22, 2024

Hi @55998

I was able to reproduce this issue.

Based on this, I have submitted a new pull request (#18) that addresses this issue. Please re-try once it's merged.

Thanks
