charlescxk / TorchSemiSeg
[CVPR 2021] CPS: Semi-Supervised Semantic Segmentation with Cross Pseudo Supervision
License: MIT License
Hi,
Thanks for the excellent work. There is a problem: I run the code several times, and each time the loss is different. Are there any random seeds that haven't been fixed?
```python
seed = config.seed
if engine.distributed:
    seed = engine.local_rank
torch.manual_seed(seed)
random.seed(seed)
np.random.seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```
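For reference, run-to-run loss drift can persist even with all of these seeds set, because several cuDNN kernels are non-deterministic by default. A fuller (but still not exhaustive) seeding sketch, written independently of the repo:

```python
import random

import numpy as np
import torch


def seed_everything(seed: int) -> None:
    """Seed the common RNG sources and pin cuDNN to deterministic kernels."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    # Non-deterministic cuDNN algorithm selection is a frequent cause of
    # run-to-run differences; pinning it trades some speed for repeatability.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```

Data-loader worker processes and distributed runs still need their own per-worker seeding on top of this.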
Thanks for sharing! Great work!
I'm trying to reproduce your ResNet-101 result on the PASCAL VOC dataset.
I can't find your config files for ResNet-101, and I'm wondering whether there are any changes compared with your ResNet-50 settings (here)?
Cheers
Hello, I am using one Tesla P100 16 GB with batch size 4 and your default config.
I only reach about 66% mIoU, different from the 73.28 reported in your paper and in your log.
Hello, is the final performance of the experiment the last-epoch result, or the best performance among epochs 20-34? (The labeled ratio is 1/8.)
May I ask how much memory and time the VOC dataset requires? I use the script with the 1/8 data partition, and it needs nearly 64 GB of GPU memory. Is that normal? Thank you for your help.
Hi, first, thanks for sharing this great work!
I was reading the paper and the code, and I noticed that in the Mean Teacher experiment the paper says you use x1 and x2 with two different augmentations, but in the code you simply feed the same x to both the teacher and the student. Is that intended, or did I miss something?
Moreover, could you release the Mean Teacher experiment for Cityscapes? I could not reproduce the results from the paper.
Thanks
Hello, I'm training with two 3080 cards. The batch size can only be set to 4, the learning rate is 0.0025, and the labeled ratio is 1/8. I've run it three times and the mIoU is only 0.66. What could be the reason? Does batch size have a large impact?
Hi, I tried to run the code here and here; the extra dict will always be None, as TrainPre() here always returns an empty extra_dict, and hence batches won't have the key 'img'.

Hi, I have a question about Figure 7 of the paper. What is the formula for obtaining the overlap ratio? Is it an mIoU between the predictions of network 1 and network 2 for each kind of sample?
Thanks
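One plausible reading of the overlap ratio (an assumption on my part, not the authors' stated definition) is the mean IoU computed between the two networks' hard predictions, treating one label map as reference and the other as prediction:

```python
import numpy as np


def overlap_miou(pred_a: np.ndarray, pred_b: np.ndarray, num_classes: int) -> float:
    """Mean IoU between the argmax predictions of two networks.

    pred_a and pred_b are integer label maps of the same shape; classes
    absent from both maps are skipped so they do not drag down the mean.
    """
    ious = []
    for c in range(num_classes):
        a, b = pred_a == c, pred_b == c
        union = np.logical_or(a, b).sum()
        if union == 0:
            continue  # class missing from both maps
        ious.append(np.logical_and(a, b).sum() / union)
    return float(np.mean(ious)) if ious else 1.0
```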
The Cityscapes archive from OneDrive seems to be corrupted. I've tested it on two devices and still can't unpack it.
```
$ unzip city.zip
Archive:  city.zip
warning [city.zip]:  8575179524 extra bytes at beginning or within zipfile
  (attempting to process anyway)
error [city.zip]:  start of central directory not found;
  zipfile corrupt.
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)
```
Hi,
I couldn't run the command `cd ./model/voc8.res50v3+.CPS` because there is no folder called model.
Where is the model folder?
I ran the script.sh file and it reported an error.
```
Traceback (most recent call last):
  File "train.py", line 28, in <module>
    from apex.parallel import DistributedDataParallel, SyncBatchNorm
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 656, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 626, in _load_backward_compatible
  File "/home/dj/anaconda3/envs/py3.6torch1.6/lib/python3.6/site-packages/apex-0.1-py3.6.egg/apex/__init__.py", line 12, in <module>
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 656, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 626, in _load_backward_compatible
  File "/home/dj/anaconda3/envs/py3.6torch1.6/lib/python3.6/site-packages/apex-0.1-py3.6.egg/apex/optimizers/__init__.py", line 2, in <module>
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 656, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 626, in _load_backward_compatible
  File "/home/dj/anaconda3/envs/py3.6torch1.6/lib/python3.6/site-packages/apex-0.1-py3.6.egg/apex/optimizers/fp16_optimizer.py", line 8, in <module>
  File "/home/dj/anaconda3/envs/py3.6torch1.6/lib/python3.6/ctypes/__init__.py", line 361, in __getattr__
    func = self.__getitem__(name)
  File "/home/dj/anaconda3/envs/py3.6torch1.6/lib/python3.6/ctypes/__init__.py", line 366, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /home/dj/anaconda3/envs/py3.6torch1.6/bin/python: undefined symbol: THCudaHalfTensor_normall
Traceback (most recent call last):
  File "/home/dj/anaconda3/envs/py3.6torch1.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/dj/anaconda3/envs/py3.6torch1.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/dj/anaconda3/envs/py3.6torch1.6/lib/python3.6/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/home/dj/anaconda3/envs/py3.6torch1.6/lib/python3.6/site-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/dj/anaconda3/envs/py3.6torch1.6/bin/python', '-u', 'train.py', '--local_rank=0']' returned non-zero exit status 1.
```
My operating environment is PyTorch 1.6, CUDA 10.0, gcc/g++ 7.5.0, and two 2080 Ti GPUs.
I did not follow the method you described when installing apex; I kept failing to install it with that command, but I installed it successfully via `python3 setup.py install`, so I'm turning to you for help.
Hi again, I have another question, about the BN layers.
I want to disable the batch-normalization layers, so I tried changing the model status from model.train() to model.eval() in train.py. (I also tried replacing the BatchNorm layers with Identity layers, but the same error occurred.)
The training script raised this error:
```
ValueError: Expected input batch_size (21) to match target batch_size (2).
```
How can I disable the BN layers? Thank you!
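Not the repo's code, but one way to sidestep this class of problem is to freeze only the BN modules instead of switching the whole model to eval() or swapping in Identity: keep model.train(), but put each BN layer in eval mode so it uses its fixed running statistics.

```python
import torch.nn as nn


def freeze_batchnorm(model: nn.Module) -> None:
    """Put only the BN layers in eval mode and stop their affine weights
    from updating, while the rest of the model keeps training normally."""
    for m in model.modules():
        if isinstance(m, nn.modules.batchnorm._BatchNorm):
            m.eval()  # use running stats; do not update them
            for p in m.parameters():
                p.requires_grad = False
```

Call it after every model.train() (e.g. at the start of each epoch), since train() flips the BN layers back into training mode.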
Hi,
I notice that you wrote a comment here about the 10x head learning rate, but it seems you do not actually apply it. Since training the entire network is really time-consuming, may I kindly ask whether the comment is simply wrong?
Cheers.
Thank you for releasing the code. We ran the default config you published in voc8.res50v3+.CPS; the best performance is 72.76, a little lower than the 73.20 in the paper. Are there any other tricks? Also, the default epoch count is 34 while the paper says 60, yet your released log (73.28) also uses 34 epochs.
This is very nice work.
Thank you so much for your contribution.
Is it possible to release scripts for few-supervision on PASCAL VOC 2012, too?
Does the network generate labels for all pixels of each image to supervise the other network during training? (In GCT, only pixels whose confidence is higher than a certain threshold are selected.)
In the early stage of training, the labels predicted by the still poorly performing networks are inaccurate; is it reasonable to use these wrong labels as supervisory information for the other network?
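For comparison, a GCT-style confidence filter (explicitly not what CPS does; CPS supervises every pixel) can be sketched as follows: pixels below the threshold get the ignore index so cross-entropy skips them.

```python
import torch
import torch.nn.functional as F


def thresholded_pseudo_labels(logits: torch.Tensor, threshold: float = 0.9,
                              ignore_index: int = 255) -> torch.Tensor:
    """Keep only pixels whose max softmax probability exceeds `threshold`."""
    probs = F.softmax(logits, dim=1)     # (N, C, H, W)
    conf, labels = probs.max(dim=1)      # per-pixel confidence and argmax label
    labels[conf < threshold] = ignore_index
    return labels
```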
When I use PyTorch 1.0, apex DDP automatically exits without an error message, so I tried running this code with PyTorch 1.8.1. However, I got results similar to those reported in #14 (72.0±0.2 mIoU for CPS, 1/8 VOC R50 setting). Can anyone figure out a possible reason for the performance decay?
@yh-pengtu, @frank-xwang
Hello, I trained with labeled ratio 1/8, 137 epochs, learning rate 0.02, and batch size 8 on 8 cards, as in your default configuration in city8.res50v3+.CPS, but the mean IoU is 70.682%, which is lower than the 74.39% in the paper. Later, I changed the epochs to 240 and the learning rate to 0.04, following the experiments in the paper, but it still doesn't work. Do you have any advice?
Thank you a lot.
Hi,
Thank you for sharing this excellent work. I have run the code for ResNet-50, but I can't find the configs for ResNet-101; would you be willing to share them?
Besides, I notice there are some tiny changes in ResNet-50 and the ASPP. Is there a reason for them? And how was the ImageNet pre-trained model obtained (standard supervised learning or self-supervised learning)?
Also, I tried to download city.zip, but I failed to unzip it. Would you mind having a check?
By the way, thank you for your code, it is very helpful.
Best regards,
Yuhang
Hi, I tried to set up the base virtual environment, but I faced a problem.
```
$ conda env create -f semiseg.yaml
```
Then I see this message:
```
Collecting package metadata (repodata.json): done
Solving environment: failed
ResolvePackageNotFound:
```
My environment:
Win10
Anaconda version: 1.7.2
GPU: RTX3090
Please help me.
Thank you.
Hi, thank you for releasing this awesome work!
I have a question about the weight decay value in the config for ResNet-50, PASCAL VOC, 1/8 labels.
The paper says "The momentum is fixed as 0.9 and the weight decay is set to 0.0005 (5e-4)".
But in the config file, the weight decay is defined as 1e-4:
```
# TorchSemiSeg/exp.voc/voc8.res50v3+.CPS+CutMix/config.py, line 106
C.weight_decay = 1e-4
```
Which value should I follow? Thank you!
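The paper's 5e-4 and the config's 1e-4 genuinely differ; my inference (not a confirmed answer) is that the released log was produced with the config value, so reproducing the log favors 1e-4. Whichever value you pick, it's worth verifying that the optimizer actually receives it:

```python
import torch

# Toy parameter; in the repo this would be the model's parameter list.
params = [torch.nn.Parameter(torch.zeros(3))]
optimizer = torch.optim.SGD(params, lr=0.0025, momentum=0.9, weight_decay=1e-4)

# The effective value lives in the optimizer's param groups:
print(optimizer.param_groups[0]['weight_decay'])
```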
I followed the default settings of the code and obtained 71.687% mIoU when training with 1/8 labels on PASCAL VOC, but the paper reports 73.67% mIoU. Why is the performance gap so large, and can anyone tell me how to close it? Thanks very much!
Hi, is the learning rate for the 1/16 ratio also 0.0025?
I can't achieve the accuracy in the paper by using 0.0025.
Hi,
Thanks so much for sharing! I also really enjoyed reading your paper!
I just can't find the code for the self-training in your Table 7. Could you please provide some details about it?
Cheers!
First of all, thank you for your great work.
Currently, I'm using your code to train on another dataset and am facing the problem described in the attached image.
After installing apex successfully, I ran the training code with 2 A6000 GPUs, and it stops after a few epochs.
Another problem I want to ask about: my server kept restarting during the several times I ran the training code before it finally worked. Do you know the reason why?
Thank you very much.
Hi @charlesCXK,
Great work, and thanks for sharing the code.
I have a question: the CPS loss in the paper is also computed on the labeled data, but I can't find that in the code. Was it forgotten, or is the code incomplete?
Best regards.
Due to the server's CUDA version, I was unable to use the PyTorch 1.0.0 requested by the author, so I upgraded PyTorch to 1.9.1. But then it didn't work. How can I deal with this problem?
```
/opt/conda/envs/semiseg2/lib/python3.6/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torch.distributed.run.
Note that --use_env is set by default in torch.distributed.run.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
  FutureWarning,
WARNING:torch.distributed.run:*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[the same traceback is printed by each of the four ranks; duplicates deduplicated]
Traceback (most recent call last):
  File "train.py", line 28, in <module>
    from apex.parallel import DistributedDataParallel, SyncBatchNorm
  File "/opt/conda/envs/semiseg2/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/__init__.py", line 12, in <module>
    from . import optimizers
  File "/opt/conda/envs/semiseg2/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/optimizers/__init__.py", line 2, in <module>
    from .fp16_optimizer import FP16_Optimizer
  File "/opt/conda/envs/semiseg2/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/optimizers/fp16_optimizer.py", line 8, in <module>
    lib.THCudaHalfTensor_normall.argtypes=[ctypes.c_void_p, ctypes.c_void_p]
  File "/opt/conda/envs/semiseg2/lib/python3.6/ctypes/__init__.py", line 361, in __getattr__
    func = self.__getitem__(name)
  File "/opt/conda/envs/semiseg2/lib/python3.6/ctypes/__init__.py", line 366, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /opt/conda/envs/semiseg2/bin/python: undefined symbol: THCudaHalfTensor_normall
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 34178) of binary: /opt/conda/envs/semiseg2/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/semiseg2/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/envs/semiseg2/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/semiseg2/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/envs/semiseg2/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/envs/semiseg2/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/envs/semiseg2/lib/python3.6/site-packages/torch/distributed/run.py", line 692, in run
    )(*cmd_args)
  File "/opt/conda/envs/semiseg2/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/semiseg2/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Other Failures:
  [1]:
    time: 2021-09-22_11:17:30
    rank: 1 (local_rank: 1)
    exitcode: 1 (pid: 34179)
    error_file: <N/A>
    msg: "Process failed with exitcode 1"
  [2]:
    time: 2021-09-22_11:17:30
    rank: 2 (local_rank: 2)
    exitcode: 1 (pid: 34180)
    error_file: <N/A>
    msg: "Process failed with exitcode 1"
  [3]:
    time: 2021-09-22_11:17:30
    rank: 3 (local_rank: 3)
    exitcode: 1 (pid: 34181)
    error_file: <N/A>
    msg: "Process failed with exitcode 1"
22 11:17:31 using devices 0, 1, 2, 3
```
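The undefined symbol means the apex extension was compiled against a different PyTorch than the one now installed. One untested workaround (an assumption, not the authors' recommendation) is to rebuild apex against PyTorch 1.9.1, or to drop apex entirely in favor of the native equivalents that have existed since PyTorch 1.6, and to read the rank from the environment as the deprecation warning suggests:

```python
import os

# Native replacements for the apex imports in train.py:
from torch.nn import SyncBatchNorm
from torch.nn.parallel import DistributedDataParallel

# torch.distributed.run no longer passes --local_rank by default;
# read it from the environment instead:
local_rank = int(os.environ.get('LOCAL_RANK', 0))
```

Mixed precision would then come from torch.cuda.amp rather than apex.amp.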
In Figure 3 of the paper, there's a comparison with fully supervised baselines. How can these results be reproduced (especially with HRNet)?
Hi, Xiaokang,
Thanks for sharing such solid work! I noticed that the supervised loss is the OHEM loss. Have you run experiments with the plain CE loss, and how were the results?
Hi. According to Getting Started.md, there are different epoch counts for the different label ratios, but I'm confused about the meaning of this setting.
If I understand correctly, you only need to specify the total iterations here; it shouldn't be necessary to compute the epoch count corresponding to each label ratio.
If I want to train a supervised baseline, does it mean that if my total iterations are 40k and the label ratio is 0.125, then I need to do 40k iterations (say with batch size 8) on only the 0.125 × 10582 labeled images? And if I use the semi-supervised method, does one iteration mean fetching the labeled data once and the unlabeled data once?
Hope to receive your reply. Many thanks.
Best,
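My guess at how the schedule fits together (names and numbers are illustrative; VOC-aug has 10582 training images): one "epoch" walks the labeled subset once, so for a fixed iteration budget the epoch count necessarily changes with the label ratio.

```python
# All values are hypothetical, for illustrating the arithmetic only.
total_images = 10582            # PASCAL VOC augmented train set
ratio = 1 / 8                   # labeled ratio
batch_size = 8

labeled = int(total_images * ratio)         # labeled images available
niters_per_epoch = labeled // batch_size    # one epoch = one pass over them
total_iters = 40000                         # fixed iteration budget
nepochs = total_iters // niters_per_epoch   # epochs needed to spend the budget
print(labeled, niters_per_epoch, nepochs)
```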
Thanks for your marvelous work!
When trying to reproduce your work on the VOC 2012 dataset, I was puzzled by the size of the input images after the cropping operation: should it be batch × channel × 321 × 321, the same as other methods? In your work the image size is batch × channel × 512 × 512, if I'm not mistaken.
We notice that in the manuscript, Y should not backpropagate gradients.
But in the code, there seems to be no operation that stops the gradients?
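One observation (mine, not the authors' answer): with hard pseudo-labels, the gradient stop falls out of the argmax itself, since the resulting integer tensor carries no gradient and cross-entropy never backpropagates through its target. An explicit detach() just makes the intent unambiguous:

```python
import torch

logits_1 = torch.randn(2, 21, 8, 8, requires_grad=True)  # toy output of network 1
pseudo_y = logits_1.detach().argmax(dim=1)  # hard pseudo-label Y for network 2
# pseudo_y is an integer tensor detached from the graph, so no gradient
# can flow back into network 1 through it.
```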
Thanks for sharing the code. I have reproduced the result of voc.CPS, but I still have some confusion.
Hello, I found some differences in the implementations of the other methods.
For Mean Teacher, the paper proposes updating the teacher model's weights with an EMA of the student's weights, but in the code there is only an MSE loss between the teacher and student models.
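For reference, the canonical Mean Teacher update from the original Mean Teacher paper (which may or may not match this repo's implementation) moves the teacher toward the student by an exponential moving average after each optimizer step:

```python
import torch
import torch.nn as nn


@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, alpha: float = 0.99) -> None:
    """teacher <- alpha * teacher + (1 - alpha) * student, parameter-wise."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(alpha).add_(p_s, alpha=1.0 - alpha)
```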
Hi, could you give a date for when you can release the other SOTA semi-supervised segmentation methods?
```python
pooling_size = (min(try_index(self.pooling_size, 0), x.shape[2]),
                min(try_index(self.pooling_size, 1), x.shape[3]))
```
Hi @charlesCXK, I was reading your paper and code, and I have a question concerning the number of epochs (i.e., the max number of iterations): did you choose it arbitrarily? Or did you take the fully supervised iteration count from DeepLabv3+ as a starting point and adapt it to the various splits?
I hope my question is clear :)
Hi,
Thank you for the sharing and excellent work. To reproduce the Table 2 results (w/o CutMix augmentation) for ResNet-101 on the Cityscapes dataset, we just changed the pretrain_model path in config.py to the path of the ResNet-101 model. However, the ResNet-101 results (1/16: 63.632) are far from the paper's (1/16: 72.18), even worse than ResNet-50 (1/16: 68.21). Why is the performance gap so large?
```python
C.dataset_path = osp.join(C.volna, 'DATA/pascal_voc')
C.img_root_folder = C.dataset_path
C.gt_root_folder = C.dataset_path
C.pretrained_model = C.volna + 'DATA/pytorch-weight/resnet50_v1c.pth'
```
Here is my experimental environment:
8 Tesla V100 GPUs, PyTorch 1.0.0, Python 3.6.7
Thank you for your work! I want to know how to train CPS with my custom dataset; where should I change the code?
Hi, first, thanks for sharing this great work!
I noticed that you have released other SOTA methods on the VOC dataset. I wonder if you will release them on the Cityscapes dataset as well?
Thanks
Hi, Xiaokang,
Congratulations on such solid work. I have a question about the different initializations of the two networks: I checked the code of the segmentation model, but I can't find the parameter-initialization operation. I'd like to know the details; could you give me some advice?
Best,
Xiangde.
Thanks for sharing this excellent work!
I have a question about the parameter initialization in this teacher-student paradigm:
```python
param_t.data.copy_(param_s.data)
```
Looking forward to your clarification! Thank you.
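My reading of that line (an assumption about intent, not the authors' statement) is that the teacher starts as an exact copy of the student and is then kept out of autograd, so only the EMA update ever changes it. A minimal sketch of the same pattern:

```python
import copy

import torch
import torch.nn as nn

student = nn.Linear(4, 2)          # stand-in for the student network
teacher = copy.deepcopy(student)   # teacher starts identical to the student
for p in teacher.parameters():
    p.requires_grad_(False)        # teacher is moved only by the EMA update
```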