wkentaro / pytorch-fcn
PyTorch Implementation of Fully Convolutional Networks. (Training code to reproduce the original result is available.)
License: MIT License
Hi, can you provide simple instructions on how to train, run inference, and get the dataset?
Hi,
Thanks so much for your help! I'm new to pytorch.
There is a new error again!
Traceback (most recent call last):
File "examples/voc/train_fcn32s.py", line 100, in <module>
main()
File "examples/voc/train_fcn32s.py", line 56, in main
model.copy_params_from_vgg16(vgg16, init_upscore=False)
File "/home/zheshiyige/Desktop/fully convolutional network/pytorch-fcn-master/torchfcn/models/fcn32s.py", line 117, in copy_params_from_vgg16
l2.weight.data = l1.weight.data.view(l2.weight.size())
File "/home/zheshiyige/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 238, in __getattr__
type(self).__name__, name))
AttributeError: 'Dropout' object has no attribute 'weight'
Thanks and Best Regards,
When resuming from a saved checkpoint using the --resume flag and the path to the .pth.tar file saved under the 'log' folder by the program, there seems to be an error during the optim step.
Traceback (most recent call last):
File "train_fcn32s.py", line 202, in <module>
main()
File "train_fcn32s.py", line 198, in main
trainer.train()
File "/home/arunirc/pytorch-fcn/lib/python2.7/site-packages/torchfcn/trainer.py", line 286, in train
self.train_epoch()
File "/home/arunirc/pytorch-fcn/lib/python2.7/site-packages/torchfcn/trainer.py", line 245, in train_epoch
self.optim.step()
File "/home/arunirc/pytorch-fcn/lib/python2.7/site-packages/torch/optim/sgd.py", line 87, in step
param_state = self.state[p]
KeyError: Parameter containing:
(0 ,0 ,.,.) =
-0.1587 0.0404 -0.2275
-0.1916 -0.1479 0.0287
-0.0271 -0.3107 -0.1193
As the title says, in torchfcn/models/fcn32s.py the first conv1 layer is set up as:
nn.Conv2d(3, 64, 3, padding=100),
Why do we need a padding of 100 instead of 1, given the filter size of 3?
Thanks
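One way to look at the padding question is to follow the spatial size through the network. The sketch below is only arithmetic under assumptions that are not stated in the post: a VGG16-style layout where all later 3x3 convolutions use padding=1, five 2x2 max-pools with stride 2 and ceil_mode=True, and an fc6 layer implemented as an unpadded 7x7 convolution. The function names are hypothetical.

```python
import math


def fc6_input_side(side, first_pad):
    """Side length that reaches fc6 for a square input (assumed VGG16-style layout)."""
    # conv1_1: 3x3 kernel with the given padding. Every later 3x3 conv is
    # assumed to use padding=1 and therefore keeps the size unchanged.
    side = side + 2 * first_pad - 2
    # Five 2x2 max-pools with stride 2; ceil_mode=True is an assumption.
    for _ in range(5):
        side = math.ceil(side / 2)
    return side


def fc6_output_side(side, first_pad, fc6_kernel=7):
    """Output side of fc6, assumed to be an unpadded 7x7 convolution."""
    return fc6_input_side(side, first_pad) - fc6_kernel + 1


for pad in (1, 100):
    # A non-positive result means the 7x7 kernel no longer fits the feature map.
    print(pad, fc6_output_side(100, pad))
```

Under these assumptions a 100x100 input leaves no room for the 7x7 kernel with padding=1, while padding=100 keeps a small but valid feature map. The extra border then has to be cropped away after upsampling.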
Hi,
I cannot find the 'fcn.utils' module, which is used in 'trainer.py' for fcn.utils.label_accuracy_score and fcn.utils.visualize_segmentation.
This project uses a different preprocessing method from the PyTorch pretrained models. The pretrained models use RGB values normalized to [0, 1], while in pytorch-fcn the images are BGR values that are not normalized to [0, 1], only mean-centered. So how does it work?
See this on the PyTorch forum:
All pretrained torchvision models have the same preprocessing, which is to normalize using the following mean/std values: https://github.com/pytorch/examples/blob/master/imagenet/main.py#L92-L93 (input is RGB format)
[ref: https://discuss.pytorch.org/t/how-to-preprocess-input-for-pre-trained-networks/683/2]
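The two conventions contrasted above can be put side by side for a single pixel. This is a minimal sketch, not the repository's code: the torchvision mean/std values are the ones documented for its pretrained models, while the BGR mean values are an assumption and should be checked against the dataset code actually in use. Weights generally expect whatever preprocessing they were trained with, so Caffe-converted weights and torchvision weights would not share an input pipeline.

```python
# torchvision convention: RGB, scaled to [0, 1], then normalized per channel.
IMAGENET_MEAN_RGB = (0.485, 0.456, 0.406)
IMAGENET_STD_RGB = (0.229, 0.224, 0.225)

# Caffe-style convention: BGR, raw 0..255 range, mean subtraction only.
# These mean values are an assumption for illustration.
CAFFE_MEAN_BGR = (104.00699, 116.66877, 122.67892)


def preprocess_torchvision(rgb_pixel):
    """Scale to [0, 1], then subtract the mean and divide by the std (RGB order)."""
    return tuple(
        (v / 255.0 - m) / s
        for v, m, s in zip(rgb_pixel, IMAGENET_MEAN_RGB, IMAGENET_STD_RGB)
    )


def preprocess_caffe(rgb_pixel):
    """Reorder to BGR and subtract the per-channel mean, keeping the 0..255 scale."""
    bgr = rgb_pixel[::-1]
    return tuple(v - m for v, m in zip(bgr, CAFFE_MEAN_BGR))


pixel = (255, 128, 0)  # one RGB pixel
print(preprocess_torchvision(pixel))
print(preprocess_caffe(pixel))
```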
Hello, I trained the network after changing the learning rate to lr=1e4, which is the same as in the FCN paper.
Why does it raise a NaN error in the loss? (It is fine if I use the lr set by your original script.)
Hi, sorry for bothering you.
I cannot download the benchmark data with
http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/semantic_contours/benchmark.tgz -O benchmark.tar
Could you please upload the files to your project?
Thanks!
Hello! I am using your pytorch-fcn, which needs the fcn dependency. However, when I try to convert the Caffe model to a PyTorch one, importing both fcn and caffe kills the kernel. Importing just one of them, caffe or fcn, works. If I import fcn after caffe, or caffe after fcn, it does not work.
Hi,
So sorry to bother you again! Could you tell me what the three arguments out, resume and no-deconv mean? How should I provide them to make it run properly?
Thanks and Best Regards,
As the title says. It seems that
./train_fcn32s.py --out logs/fcs32s_sbd
does not work with the current version.
Best,
The error tells me that I don't have the vgg16-from-caffe pretrained model.
bg455@51d9f43122cd:~/projects/pytorch-fcn/examples/voc$ ./train_fcn32s.py ./config/001.yaml
Traceback (most recent call last):
File "./train_fcn32s.py", line 128, in <module>
main()
File "/home/bg455/.local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/home/bg455/.local/lib/python2.7/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/home/bg455/.local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/bg455/.local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "./train_fcn32s.py", line 93, in main
vgg16 = torchfcn.models.VGG16(pretrained=True)
File "/usr/local/lib/python2.7/dist-packages/torchfcn-1.5.0-py2.7.egg/torchfcn/models/vgg.py", line 10, in VGG16
state_dict = torch.load(model_file)
File "/usr/local/lib/python2.7/dist-packages/torch/serialization.py", line 246, in load
f = open(f, 'rb')
IOError: [Errno 2] No such file or directory: '/home/wkentaro/data/models/torch/vgg16-from-caffe.pth'
Hi WKentaro,
Thank you for the implementation! I am new to PyTorch and I will try this out for an upcoming project.
Does your implementation include Bottleneck (BC) blocks? It seems to improve training time significantly.
Hi,
Thanks for sharing the code. What kind of GPU did you use for the speed test? When I ran the speed test on a GTX 1080 (8 GB memory), the elapsed time was 404.87 [s / 1000 evals] for Chainer and 178.93 [s / 1000 evals] for PyTorch. Whenever I try to run it with the VOC dataset, I get an out-of-memory error, so could you share your experiment environment, such as the GPU type, how long training runs, and how much memory fcn32s_pytorch takes?
Thank you!
Hello, wkentaro. Have you noticed that the training loss suddenly becomes larger when a new epoch starts? I noticed this and think it is caused by incorrect switching between the model's training and evaluation modes.
In train_epoch(), line 171, after validation finishes, I think you should switch the model back to training mode.
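The suggestion above amounts to restoring the training flag once validation is done. The sketch below uses a small stand-in class (TinyModel, hypothetical) that only mimics the train()/eval() flag of torch.nn.Module, so that it runs without PyTorch. With a real model the same pattern would call model.eval() and model.train() on the module itself.

```python
class TinyModel:
    """Stand-in that mimics the train()/eval() flag of torch.nn.Module."""

    def __init__(self):
        self.training = True

    def train(self, mode=True):
        self.training = mode
        return self

    def eval(self):
        return self.train(False)


def validate(model):
    """Run validation in eval mode, then restore the previous mode."""
    was_training = model.training
    model.eval()
    try:
        pass  # the validation loop would run here
    finally:
        # Without this step, layers such as dropout would stay in eval mode
        # for the rest of the training epoch.
        if was_training:
            model.train()


model = TinyModel()
validate(model)
print(model.training)
```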
Hi author,
First, thank you for sharing this. During training, I noticed that the loss (around 30000) is much larger than in the FCN experiments on Caffe (0.4 ~ 3.0). After reading your code, I found the cause. In trainer.py:
loss = F.nll_loss(log_p, target, weight=weight, size_average=False)
if size_average:
    loss /= mask.sum().data[0]
Note that mask here is a torch.ByteTensor, which means each value is stored in a byte (8 bits, range 0~255). Summing it easily overflows, because the result is also stored in a torch.ByteTensor.
For example
test1 = Variable(torch.ones(1, 256,256).type('torch.ByteTensor'))
print(test1, test1.sum())
'''
Variable containing:
( 0 ,.,.) =
1 1 1 ... 1 1 1
1 1 1 ... 1 1 1
1 1 1 ... 1 1 1
... ⋱ ...
1 1 1 ... 1 1 1
1 1 1 ... 1 1 1
1 1 1 ... 1 1 1
[torch.ByteTensor of size 1x256x256]
Variable containing:
0
[torch.ByteTensor of size 1]
'''
The correct way should be
test1 = Variable(torch.ones(1, 256,256).type('torch.ByteTensor'))
print(test1, test1.data.sum())
'''
Variable containing:
( 0 ,.,.) =
1 1 1 ... 1 1 1
1 1 1 ... 1 1 1
1 1 1 ... 1 1 1
... ⋱ ...
1 1 1 ... 1 1 1
1 1 1 ... 1 1 1
1 1 1 ... 1 1 1
[torch.ByteTensor of size 1x256x256]
65536
'''
So to compute the loss properly, you need to change loss /= mask.sum().data[0] to loss /= mask.data.sum().
I have created a pull request to fix this.
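The wrap-around described in this report can be reproduced without PyTorch. The sketch below emulates an 8-bit accumulator in plain Python. It illustrates the arithmetic only, and the helper name sum_uint8 is hypothetical. Whether a given PyTorch version still keeps byte sums in 8 bits is not something this example checks.

```python
def sum_uint8(values):
    """Sum with an 8-bit accumulator, the way an unpromoted byte sum behaves."""
    total = 0
    for v in values:
        total = (total + v) & 0xFF  # keep only the lowest 8 bits
    return total


mask = [1] * (256 * 256)  # same number of ones as the 1x256x256 example

print(sum_uint8(mask))  # 65536 ones wrap around to 0
print(sum(mask))        # a wide accumulator gives 65536
```

Dividing the loss by the wrapped value instead of the true pixel count would explain a loss that is far too large.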
Hi, I did an experiment on the loss function:
log_p = log_p.transpose(1, 2).transpose(2, 3).contiguous().view(-1, c)
I find that the code works both with and without view(-1, c). Why? In my opinion, the version without view(-1, c) is the correct one. Or are the two actually the same in memory? What do you think?
I am training an OCR model using an RNN. My input data are word images of varying dimensions (since words have different lengths), and the size of the class labels is also inconsistent, since each word can have a different number of characters.
tensor_word_dataset = WordImagesDataset(images, truths, transform = ToTensor())
dataset_loader = torch.utils.data.DataLoader(tensor_word_dataset,
batch_size=16, shuffle=True,)
This gives me the error:
RuntimeError: inconsistent tensor size at /py/conda-bld/pytorch_1493673470840/work/torch/lib/TH/generic/THTensorCopy.c:46
The sizes of the first few labels and images, respectively, are:
torch.Size([2]) torch.Size([32, 41])
torch.Size([7]) torch.Size([32, 95])
torch.Size([2]) torch.Size([32, 38])
torch.Size([2]) torch.Size([32, 53])
torch.Size([2]) torch.Size([32, 49])
torch.Size([6]) torch.Size([32, 55])
Any suggestions on how I should fix it? I want to shuffle the data and send it in batches instead of supplying items one at a time.
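One common approach to batching samples of different widths is to pad them to a shared size in a custom collate function, which torch.utils.data.DataLoader accepts through its collate_fn argument. The sketch below shows only the padding logic on plain nested lists so that it runs without PyTorch. The function name pad_collate and the choice of zero padding are assumptions, and a real version would build tensors instead of lists.

```python
def pad_collate(batch, pad_value=0):
    """Pad every image (a list of rows) to the widest image in the batch.

    `batch` is a list of (image, label) pairs. Labels are left unpadded and
    their lengths are returned so that a sequence loss can use them later.
    """
    images, labels = zip(*batch)
    max_width = max(len(img[0]) for img in images)
    padded = [
        [row + [pad_value] * (max_width - len(row)) for row in img]
        for img in images
    ]
    lengths = [len(label) for label in labels]
    return padded, list(labels), lengths


# Two toy "word images" of height 2 with widths 3 and 5.
batch = [
    ([[1, 1, 1], [1, 1, 1]], [7, 8]),
    ([[2, 2, 2, 2, 2], [2, 2, 2, 2, 2]], [4, 5, 6]),
]
images, labels, lengths = pad_collate(batch)
print([len(img[0]) for img in images], lengths)
```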
Hi,
So sorry to bother you again!
Now, there is a new error
Traceback (most recent call last):
File "voc/train_fcn32s.py", line 100, in <module>
main()
File "voc/train_fcn32s.py", line 56, in main
torchfcn.utils.copy_params_vgg16_to_fcn32s(vgg16, model, init_upscore=False)
AttributeError: 'module' object has no attribute 'utils'
But I found that I have installed torchfcn (1.4.1) and can import torchfcn.
Thanks and Best Regards,
I was trying to run training in the VOC example with the log parameter, but this parameter doesn't exist!
I noticed that it outputs logs regardless of the parameter.
Hi,
When I run examples/voc/train_fcn32s.py, there is a strange error:
File "train_fcn32s.py", line 46, in main
model = torchfcn.models.FCN32s(n_class=21, deconv=deconv)
TypeError: __init__() got an unexpected keyword argument 'deconv'
Could you tell me what causes this error?
Thanks and Best Regards,
Hi,
I have a question about the loss function:
loss = F.nll_loss(log_p, target, weight=weight, size_average=False)
if size_average:
    loss /= mask.sum().data[0]
Depending on whether 'size_average' is True or False, the scale of the gradients of the loss function differs.
This may be a problem, because the gradient scale affects how we should set the learning rate.
Is this the reason you use a very low learning rate, 1e-10?
Thanks!!
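The scale difference raised in this question can be made concrete with a toy loss. In the sketch below the per-pixel loss is simply w * x_i, so the gradient with respect to w is easy to write down. The image size of 500 x 375 pixels is a hypothetical choice for illustration, and the function name is made up.

```python
def grad_wrt_w(xs, size_average):
    """Gradient of sum_i (w * x_i) with respect to w, optionally averaged."""
    g = sum(xs)
    return g / len(xs) if size_average else g


pixels = [1.0] * (500 * 375)  # hypothetical image size

g_sum = grad_wrt_w(pixels, size_average=False)
g_mean = grad_wrt_w(pixels, size_average=True)
ratio = g_sum / g_mean

print(ratio)          # the summed loss has a gradient N times larger
print(1e-10 * ratio)  # learning rate giving the same step size with averaging
```

With a summed loss the gradient grows with the number of pixels, so a learning rate that looks tiny can produce the same update as a much larger rate applied to an averaged loss.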
When I implemented the VOC2012 dataset based on your code, I got the error "inconsistent tensor size". The dataset code is the same as yours, but I added some test code, as below:
if __name__ == '__main__':
    dst = VOCDataSet("./data", is_transform=True)
    trainloader = data.DataLoader(dst, batch_size=4)
    for i, data in enumerate(trainloader):
        imgs, labels = data
        print(imgs[0].type())
Did you encounter this problem?
Hi, I tried to train the fcn8s model using pretrained weights from the vgg16 model, but there was no visible learning and the model output was all zero for the first ~10k iterations. However, the training loss converges when I use the fcn16s model to initialize the weights. I could not really understand the reason behind this. Can you help with an explanation?
Hi, I just want to ask how you decide the crop offset in different models. If I want to train this model on different datasets with different image size, how to calculate the precise crop offset?
Hi,
Recently, I ran examples/voc/train_fcn32s.py and encountered another error:
Traceback (most recent call last):
File "train_fcn32s.py", line 101, in <module>
main()
File "train_fcn32s.py", line 55, in main
vgg16.load_state_dict(torch.load(pth_file))
File "/home/zheshiyige/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 331, in load_state_dict
.format(name))
KeyError: 'unexpected key "classifier.1.weight" in state_dict'
I have downloaded /home/zheshiyige/data/models/torch/vgg16-00b39a1b.pth.
Could you tell me how to fix this error?
Thanks and Best Regards,
Hi,
May I ask when you will add FCN8s or FCN16s?
Thanks!
Now after I ran python setup.py install, I got the following error:
import torchfcn
Traceback (most recent call last):
File "<ipython-input-1-c08454750c97>", line 1, in <module>
import torchfcn
File "/home/zhang/anaconda2/lib/python2.7/site-packages/torchfcn-0.2-py2.7.egg/torchfcn/__init__.py", line 3, in <module>
from trainer import Trainer # NOQA
File "/home/zhang/anaconda2/lib/python2.7/site-packages/torchfcn-0.2-py2.7.egg/torchfcn/trainer.py", line 6, in <module>
import fcn
ImportError: No module named fcn
Hi, wkentaro. Sorry to bother you!
I want to train fcn32s like this:
./train_fcn32s.py -g 0
Now I get an error, and I don't know how to fix it.
Traceback (most recent call last):
File "./train_fcn32s.py", line 164, in <module>
main()
File "./train_fcn32s.py", line 160, in main
trainer.train()
File "/home/wukuan/anaconda3/envs/env27/lib/python2.7/site-packages/torchfcn/trainer.py", line 222, in train
self.train_epoch()
File "/home/wukuan/anaconda3/envs/env27/lib/python2.7/site-packages/torchfcn/trainer.py", line 208, in train_epoch
elapsed_time = elapsed_time.total_seconds()
AttributeError: 'float' object has no attribute 'total_seconds'
I want to know some details about this error.
Thanks and Best Regards!
Why is a lot of memory (~11G) consumed after importing torchfcn? And because of this, I am not able to run the training demo due to out of memory. How could this happen?
code-block:: bash
./train_fcn32s.py config/001.yaml
Running this step, I get the following error:
Traceback (most recent call last):
File "./train_fcn32s.py", line 128, in <module>
main()
File "/home/xyz/anaconda2/lib/python2.7/site-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/home/xyz/anaconda2/lib/python2.7/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/home/xyz/anaconda2/lib/python2.7/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/xyz/anaconda2/lib/python2.7/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "./train_fcn32s.py", line 62, in main
cfg, out = load_config_file(config_file)
File "./train_fcn32s.py", line 32, in load_config_file
name += '_VCS-%s' % git_hash()
File "./train_fcn32s.py", line 21, in git_hash
hash = subprocess.check_output(shlex.split(cmd)).strip()
File "/home/xyz/anaconda2/lib/python2.7/subprocess.py", line 212, in check_output
process = Popen(stdout=PIPE, *popenargs, **kwargs)
File "/home/xyz/anaconda2/lib/python2.7/subprocess.py", line 390, in __init__
errread, errwrite)
File "/home/xyz/anaconda2/lib/python2.7/subprocess.py", line 1024, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
Looking forward to your reply!
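The traceback above ends inside subprocess while running git_hash(), and an "[Errno 2] No such file or directory" raised from Popen usually means the executable itself could not be found. A defensive version of such a helper might look like the sketch below. The git command string and the 'unknown' fallback are assumptions, since the original cmd is not shown in the traceback.

```python
import shlex
import subprocess


def git_hash(cmd='git log -n 1 --pretty="%h"', default='unknown'):
    """Return the short commit hash, or `default` when git cannot be run."""
    try:
        out = subprocess.check_output(
            shlex.split(cmd), stderr=subprocess.DEVNULL
        )
    except (OSError, subprocess.CalledProcessError):
        # OSError: the executable is missing.
        # CalledProcessError: git ran but failed, e.g. outside a repository.
        return default
    return out.decode().strip()


print(git_hash())
print(git_hash(cmd='definitely-not-a-real-command-xyz'))
```

Installing git, or running the script from inside a git checkout, would address the same failure without changing the code.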
In the first round of evaluation (with the pretrained VGG16 model) there is the following error:
.../pytorch-fcn/examples/voc/torchfcn/utils.py:24: RuntimeWarning: invalid value encountered in divide
acc_cls = np.diag(hist) / hist.sum(axis=1)
.../pytorch-fcn/examples/voc/torchfcn/utils.py:26: RuntimeWarning: invalid value encountered in divide
iu = np.diag(hist) / (hist.sum(axis=1) + hist.sum(axis=0) - np.diag(hist))
and the training loss becomes NaN. Is this expected?
Also, I am not sure if it's related, but after the first round all the validation errors become NaN.
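The two warnings quoted above come from dividing by zero for classes that never appear in the confusion matrix. The sketch below reproduces the per-class accuracy on plain lists and skips the undefined entries when averaging, which is what numpy's nanmean does for arrays. It addresses only the metric warnings. Whether they are related to the NaN training loss is not established here, and the function names are hypothetical.

```python
def per_class_accuracy(hist):
    """Diagonal over row sum for each class; NaN when the class never occurs."""
    accs = []
    for i, row in enumerate(hist):
        total = sum(row)
        accs.append(row[i] / total if total > 0 else float('nan'))
    return accs


def nanmean(values):
    """Mean over the entries that are not NaN."""
    valid = [v for v in values if v == v]  # NaN is the only value != itself
    return sum(valid) / len(valid) if valid else float('nan')


# Confusion matrix with three classes; the last class has no ground-truth pixels.
hist = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 0],
]
accs = per_class_accuracy(hist)
print(accs, nanmean(accs))
```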
Hi, sorry to disturb you. I got an error when running voc/train_fcn32s.py: I can't import fcn. Do I need an additional package? Thanks.
I tested this out on an AWS p2 instance and I got a significantly slower benchmark. Can you confirm the hardware that you ran your benchmark on?
==> Benchmark: gpu=0, times=1000, dynamic_input=False
==> Testing FCN32s with PyTorch
Elapsed time: 245.73 [s / 1000 evals]
Hz: 4.07 [hz]
Hi, when I run the file "model_caffe_to_pytorch.py", it turns out that there is "no module named fcn". Thank you very much.
By the way, "export PYTHONAPTH=$(pwd)/python:$PYTHONPATH" should be "export PYTHONPATH=$(pwd)/python:$PYTHONPATH", I think.
$ ./train_fcn32s.py
Traceback (most recent call last):
File "./train_fcn32s.py", line 9, in <module>
import torchfcn
File "/home/acgtyrant/Projects/pytorch-fcn/.env/lib/python3.6/site-packages/torchfcn-1.3-py3.6.egg/torchfcn/__init__.py", line 1, in <module>
from . import datasets # NOQA
File "/home/acgtyrant/Projects/pytorch-fcn/.env/lib/python3.6/site-packages/torchfcn-1.3-py3.6.egg/torchfcn/datasets/__init__.py", line 1, in <module>
from .apc import APC2016V1 # NOQA
File "/home/acgtyrant/Projects/pytorch-fcn/.env/lib/python3.6/site-packages/torchfcn-1.3-py3.6.egg/torchfcn/datasets/apc/__init__.py", line 1, in <module>
from v1 import APC2016V1 # NOQA
ModuleNotFoundError: No module named 'v1'
Hi,
I can only find the fcn32s implementation in the VOC example. Are there fcn16s and fcn8s implementations?
Thanks and Best Regards,
Hello. I found that when the training time exceeds 24 hours, the logger writes "1 day" to the time field of "log.csv". However, this causes an error when using the script "learning_curve.py". How can I fix this error? Thanks!
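If the time field is written with str() on a datetime.timedelta, values past 24 hours look like "1 day, 2:03:04.5" and no longer parse as a plain H:M:S string. The sketch below converts either form to seconds. The field format is an assumption based on the "1 day" described above, and parse_elapsed is a hypothetical helper, not part of learning_curve.py.

```python
from datetime import timedelta


def parse_elapsed(text):
    """Convert str(timedelta) output such as '1 day, 2:03:04.5' to seconds."""
    days = 0
    if 'day' in text:
        # Split '1 day, 2:03:04.5' into the day count and the clock part.
        day_part, text = text.split(',', 1)
        days = int(day_part.split()[0])
    hours, minutes, seconds = text.strip().split(':')
    return timedelta(
        days=days,
        hours=int(hours),
        minutes=int(minutes),
        seconds=float(seconds),
    ).total_seconds()


print(parse_elapsed('0:00:30'))
print(parse_elapsed(str(timedelta(days=1, seconds=1))))
```

Writing total_seconds() to the log in the first place would avoid the parsing step altogether.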
Hi,
I want to use a batch size larger than 1.
So I changed batch_size to 4,
e.g.:
train_loader = torch.utils.data.DataLoader(
torchfcn.datasets.SBDClassSeg(root, split='train', transform=True),
batch_size=4, shuffle=True, **kwargs)
But it raises an error, so I cannot use it.
Could you please let me know how to modify it?
Also, is it possible to use multiple GPUs by modifying the code?
Thank you!
Hi,
Thanks a lot for your help! Now the code runs successfully on my computer. But there is a warning, and I'm not sure whether it is an important issue.
/home/zheshiyige/anaconda2/lib/python2.7/site-packages/fcn-5.8.1-py2.7.egg/fcn/utils.py:310: RuntimeWarning: invalid value encountered in true_divide
/home/zheshiyige/.local/lib/python2.7/site-packages/skimage/util/dtype.py:122: UserWarning: Possible precision loss when converting from float64 to uint8
.format(dtypeobj_in, dtypeobj_out))
/home/zheshiyige/.local/lib/python2.7/site-packages/skimage/transform/_warps.py:84: UserWarning: The default mode, 'constant', will be changed to 'reflect' in skimage 0.15.
warn("The default mode, 'constant', will be changed to 'reflect' in "
Train epoch=0: 7%| | 611/8498 [02:25<31:02, 4.24it/s]/home/zheshiyige/anaconda2/lib/python2.7/site-packages/fcn-5.8.1-py2.7.egg/fcn/utils.py:312: RuntimeWarning: invalid value encountered in true_divide
Thanks and Best Regards,
I found that when you trained the FCN8s model, the learning rate was very small (1e-14). I remember that the learning rate is set to 1e-4 in the original FCN paper, so I am a little confused.
Could you explain? Thank you in advance.
Hi, when I use resume to load parameters, it crashes with ''unexpected key "module.features.0.weight" in state_dict''. I don't know what is wrong. Thanks a lot!
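Keys that start with "module." typically appear when a checkpoint was saved from a model wrapped in torch.nn.DataParallel and is then loaded into an unwrapped model. Assuming that is what happened here, one workaround is to strip the prefix before calling load_state_dict. The sketch below shows the key renaming on an ordinary ordered dict with placeholder values, and strip_prefix is a hypothetical helper.

```python
from collections import OrderedDict


def strip_prefix(state_dict, prefix='module.'):
    """Return a copy of state_dict with `prefix` removed from matching keys."""
    return OrderedDict(
        (key[len(prefix):] if key.startswith(prefix) else key, value)
        for key, value in state_dict.items()
    )


# Placeholder values stand in for the parameter tensors.
saved = OrderedDict([
    ('module.features.0.weight', 'w'),
    ('module.features.0.bias', 'b'),
])
print(list(strip_prefix(saved)))
```

Wrapping the model in DataParallel before loading the checkpoint would be the opposite way to make the key names match.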