Comments (9)
Oh wow, what a nightmare. I am happy you found the culprit, though. I am hence closing this issue for now, since I am under the impression that it wasn't an issue with the code in the end. Please feel free to reopen the issue in case I am mistaken. And thanks for keeping us updated throughout all of this!
from pytorch-pwc.
And sorry for forgetting to answer your questions.
Is the block=tuple([ 32, 1, 1 ]) specifying that there are 32 threads for kernel_Correlation_updateOutput or is it specified somewhere else?
It means that the kernel is launched with 32 threads per block (one warp's worth), and it is originally defined here: https://github.com/lmb-freiburg/flownet2/blob/b92e198b56b0e52e1ba0a5a98dc0e39fa5ae70cc/src/caffe/layers/correlation_layer.cu#L17
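For reference, the launch tuple can be read as the CUDA block dimensions: `block=tuple([ 32, 1, 1 ])` means 32 threads per block, so `threadIdx.x` runs from 0 to 31, and the grid size is typically derived by rounding the element count up to a multiple of 32. A minimal plain-Python sketch of that arithmetic (the helper name is illustrative, not from the repository):

```python
# Sketch: deriving a 1-D CUDA launch configuration, assuming each thread
# handles one element. THREADS_PER_BLOCK matches block=tuple([ 32, 1, 1 ]).
THREADS_PER_BLOCK = 32

def launch_config(num_elements):
    # Round up so every element is covered by some thread.
    num_blocks = (num_elements + THREADS_PER_BLOCK - 1) // THREADS_PER_BLOCK
    return num_blocks, THREADS_PER_BLOCK

# 100 elements need 4 blocks of 32 threads; threads 100..127 must be
# bounds-checked inside the kernel.
print(launch_config(100))  # (4, 32)
```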
I can't help but wonder if the 196 channels in self.netSix are part of the cause of my memory error, since the correlation in kernel_Correlation_updateOutput steps through 32 channels at a time. It should stop at 160, but it is instead stopping at 192, which would indicate that it is treating the input as if it had 224 channels while reducing it to 81. I can't imagine what data it is using in those extra channels, but it makes me wonder if that is why the original paper indicated they were having a lot of edge problems in their model. I'm just guessing, though. I know even less about C at this point than Python, but I am learning.
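The stopping points described above can be sketched in plain Python. Assuming a loop of the shape `for (c = 0; c < channels; c += 32)`, the last chunk starts at 160 for 192 channels but at 192 for 196 channels; if each chunk then reads a full 32 entries without a bounds check, indices 196..223 would indeed be out of range (this is an illustration of the arithmetic, not the actual kernel code):

```python
# Sketch: chunk start offsets visited by a loop stepping 32 channels at
# a time, for 192 vs. 196 channels. Illustrative only.
STEP = 32

def chunk_starts(channels):
    # Offsets visited by: for (c = 0; c < channels; c += STEP)
    return list(range(0, channels, STEP))

print(chunk_starts(192)[-1])  # 160 -> last chunk covers 160..191, in range
print(chunk_starts(196)[-1])  # 192 -> a full chunk would cover 192..223
```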
I started training from scratch with a corrected self.netSix, but I'm training to deblur and not directly training the flow. I should probably go back and train the pytorch-pwc model by itself instead, but I'm curious if this will work.
For me, I always get this error when I am using any gpu other than gpu:0. I tried my best to make sure everything is on the same gpu device, but this error won't go away. So I ended up mapping whichever available device to gpu:0 when running docker.
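One common way to get that "map whichever device to gpu:0" behavior, inside or outside Docker, is to restrict CUDA device visibility before any CUDA library is imported; the physical device index below is just an example:

```python
import os

# Must be set before torch/cupy are first imported: with visibility
# restricted to physical device 1, that device is then addressed as
# cuda:0 by the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# import torch  # from here on, torch.device("cuda:0") is physical GPU 1
print(os.environ["CUDA_VISIBLE_DEVICES"])
```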
I only have one GPU, so it wasn't because of that. It is because of a memory leak caused by having the channel count set to 196 instead of 192 while stepping through 32 channels at a time in the CuPy module I mentioned above. I have not had this error any more. Instead of training from scratch, I am just modifying the model after loading the pretrained weights. I'm still running tests, but I think it is going to outperform the original deblur model I'm working on. I haven't had a single error since I changed this code. If I put it back to 196 and restore the multiply by 20, I will get the error sometime during an epoch. You know CuPy better than I do; take a close look at it. Is it not reading memory locations that don't have data defined?
I think I understand the CuPy code a little better now, and there is no memory leak with respect to having 196 channels. The issue seems to have been purely having part of the model in the return statement. I moved the netRefiner layer out of the return statement, like the following, and the memory errors ceased. My only guess is that PyTorch does not handle layers in the return statement exactly like it does in the rest of the forward block.
@sniklaus, I do have one question about the CuPy code, though, and I haven't found an answer on the internet. Is the block=tuple([ 32, 1, 1 ]) specifying that there are 32 threads for kernel_Correlation_updateOutput, or is it specified somewhere else? I assume that threadIdx.x only runs from 0 to 31, depending on which thread is running.
def forward(self, tenOne, tenTwo):
    tenOne = self.netExtractor(tenOne)
    tenTwo = self.netExtractor(tenTwo)

    objEstimate = self.netSix(tenOne[-1], tenTwo[-1], None)
    objEstimate = self.netFiv(tenOne[-2], tenTwo[-2], objEstimate)
    objEstimate = self.netFou(tenOne[-3], tenTwo[-3], objEstimate)
    objEstimate = self.netThr(tenOne[-4], tenTwo[-4], objEstimate)
    objEstimate = self.netTwo(tenOne[-5], tenTwo[-5], objEstimate)

    objEstimate['tenFeat'] = self.netRefiner(objEstimate['tenFeat'])

    return (objEstimate['tenFlow'] + objEstimate['tenFeat']) * 20.0
# end
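For contrast, the two shapes being compared here can be sketched with stand-in functions instead of the real PyTorch modules. Numerically they produce the same result, so any behavioral difference would have to come from how PyTorch/CuPy handles the intermediate tensor, not from the math itself (this is a reconstruction from the description above, not code from the repository):

```python
# Sketch: refiner applied on its own statement vs. inlined in the return
# expression. refiner() is a stand-in for self.netRefiner; plain floats
# stand in for tensors.
def refiner(feat):
    return feat * 2.0  # stand-in for the real refinement network

def forward_separate(flow, feat):
    feat = refiner(feat)              # refiner on its own statement
    return (flow + feat) * 20.0

def forward_inline(flow, feat):
    return (flow + refiner(feat)) * 20.0  # refiner inside the return

print(forward_separate(1.0, 2.0), forward_inline(1.0, 2.0))  # 100.0 100.0
```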
I'm starting to think I have a hardware issue. It was working fine for 135 epochs and went through millions of iterations, and now I can't make any more progress because of this exact same error. No code changes.
I probably need a new video card with some water cooling. I was running at 80 °C for many months, and I bet that has degraded the GPU some. I can't come up with another explanation for why it would be working so well and then stop working at all.
I think I have found a resolution. I am using MSI Afterburner to under-clock my NVIDIA GeForce RTX 2060 Super, which has a boost clock of 1680 MHz. I reduced the core clock by 100 MHz. I also set the GPU temperature limit to 76 °C, and since the power limit is linked, it was reduced to 88%. It has been running for a few hours with no more of this error.
First of all, thank you again, @sniklaus, for making this code available. I have a workstation with two NVIDIA 1080 Ti cards, and I still get the same error whenever I have this code running on one GPU and try to run a separate experiment on the second GPU. As a side note, I've never encountered this issue when running any other code on both GPUs simultaneously. I think my issue is related to what @StArchon94 posted earlier:
For me, I always get this error when I am using any gpu other than gpu:0. I tried my best to make sure everything is on the same gpu device, but this error won't go away. So I ended up mapping whichever available device to gpu:0 when running docker.
This is a very weird issue, and it can be quite problematic since I cannot debug while, say, training a model, which can take a couple of days. @StArchon94, did you end up finding a solution after all?