Comments (7)
@leighton613
Please pull my latest code.
FPN is memory-efficient because its features are diverse and its Fast-RCNN head is light.
Training uses just 5428 MiB of memory with 1 image, shorter side 800, batch size 512, and RPN batch size 256.
Thanks @xmyqsh, I haven't checked your latest update yet, which must be very efficient.
Just for comparison, my implementation is adapted from faster-rcnn and is pretty similar to your previous version.
- faster-rcnn consumes 5000+ MiB
- fpn w/ only p2 consumes 8600+ MiB
- fpn w/ all layers consumes > 12189 MiB and gets ResourceExhaustedError.
Any suggestions?
@leighton613
Your Fast-RCNN consumes 5000+ MiB, so your base model should be ResNet50, right?
And your FPN consumes another 8600+ MiB, so I think you haven't shared the base model between Fast-RCNN and FPN. Meanwhile, your RPN head is incredibly big. Note that the output of FPN's RPN conv is 256, not 512 as in faster-rcnn, and I think yours is larger than 512. P2's feature map, which is the RPN conv's input, is 64 times larger than P5's, whose feature map is the same size as C5's. This makes memory use sensitive to the RPN conv's output size.
The large P2 feature map also makes the number of anchors grow quickly as the number of anchor scales increases. The outputs of the box-regression layer (reg) and the box-classification layer (cls) also grow linearly with the number of anchor scales.
your RPN conv size / faster-rcnn's RPN conv size = (256 / 2048) * 64 * (your RPN conv output / faster-rcnn's RPN conv output)
your reg/cls layer size / faster-rcnn's reg/cls layer size = (your RPN conv output / faster-rcnn's RPN conv output) * 64 * (your number of anchor scales / faster-rcnn's number of anchor scales)
and
your RPN conv size / my RPN conv size = your RPN conv output / ((1 + 1/4 + 1/16 + 1/64) * 256)
your reg/cls layer size / my reg/cls layer size = (your RPN conv output * 1 * your number of anchor scales) / (256 * (1 + 1/4 + 1/16 + 1/64) * 1)
In short, I think you could try decreasing your RPN conv's output size and your number of anchor scales, as well as sharing the base model between FPN and Fast-RCNN.
Anyway, none of the above seems to be the main problem; there must be other causes, but I can't tell before seeing your source code.
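The 64x factor and the (1 + 1/4 + 1/16 + 1/64) sum in the ratios above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the usual FPN strides of 4, 8, 16 and 32 for P2 to P5, and the 800 x 1216 input size and the helper name `pyramid_cells` are made up for illustration (only "shorter side 800" comes from the thread).

```python
def pyramid_cells(height, width, strides=(4, 8, 16, 32)):
    """Count spatial positions per pyramid level (ceil division per side)."""
    return {s: -(-height // s) * -(-width // s) for s in strides}


# Hypothetical input: shorter side 800, longer side 1216 (assumed, divisible by 32).
cells = pyramid_cells(800, 1216)

# How much larger each level is than the coarsest one (stride 32, i.e. P5).
relative_to_p5 = {s: n / cells[32] for s, n in cells.items()}

# Total positions over all levels, measured in units of the P2 map.
total_relative_to_p2 = sum(cells.values()) / cells[4]

for stride, n in cells.items():
    print(f"stride {stride:2d}: {n:6d} positions, {relative_to_p5[stride]:.0f}x P5")
print(f"all levels / P2 = {total_relative_to_p2}")
```

Under these assumptions P2 has 64 times as many positions as P5, while running a shared head over all four levels costs only about 1.33 times as much as running it on P2 alone.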
Thanks @xmyqsh! That's a very thoughtful and detailed analysis. I actually use VGG as the base for convenience, but that doesn't matter at this point. The problem was resolved by changing the tf configuration to let the program use all of the GPU...
Also, I read your updated code, and the single shared anchor_target_layer (instead of four) certainly helps save some RAM ;) However, the (possible) downside is that some of the four roi-pooling layers get empty input (no proposals for that scale) and then fail with "cudaCheckError() failed". I wonder if anyone else has encountered this (or whether I mis-implemented some part)?
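To see how a pyramid level can end up with no proposals, here is a minimal sketch of RoI-to-level assignment. It assumes the rule from the FPN paper, k = floor(k0 + log2(sqrt(w*h) / 224)) with k0 = 4, clamped to levels 2 through 5; the repository's own `_calc_level` may differ, and the names `calc_level` and `split_by_level` as well as the sample boxes are hypothetical.

```python
import math


def calc_level(box, k0=4, k_min=2, k_max=5, canonical=224.0):
    """Map an (x1, y1, x2, y2) box to a pyramid level (FPN paper rule, assumed)."""
    x1, y1, x2, y2 = box
    w = max(x2 - x1, 1e-6)  # guard against degenerate boxes
    h = max(y2 - y1, 1e-6)
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return int(min(max(k, k_min), k_max))


def split_by_level(rois, k_min=2, k_max=5):
    """Group RoIs by assigned level; every level gets a (possibly empty) list."""
    levels = {k: [] for k in range(k_min, k_max + 1)}
    for box in rois:
        levels[calc_level(box, k_min=k_min, k_max=k_max)].append(box)
    return levels


# Made-up proposals: two small boxes and one roughly 224 x 224 box.
rois = [(0, 0, 100, 100), (10, 10, 120, 90), (0, 0, 230, 220)]
rois_by_level = split_by_level(rois)
empty_levels = [k for k, boxes in rois_by_level.items() if not boxes]
print("empty levels:", empty_levels)
```

With these sample boxes, two levels receive no RoIs at all, which is the situation in which a per-level roi-pooling op would be handed an empty input.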
@leighton613
I have encountered the same cudaCheckError() as you.
I just hacked roi_pooling_op_gpu at L100 and L205 to cope with it.
@xmyqsh Thanks! I'll take some time to look into the op later... For now I changed the _calc_level-related function to ensure there won't be any empty rois.
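The thread does not say how the level assignment was changed, so the following is only one possible way to guarantee non-empty inputs, not the change that was actually made. It pads each empty level with a single placeholder RoI and records which entries are real, so the placeholder outputs can be dropped or masked later. The function name `pad_empty_levels` and the placeholder box are hypothetical.

```python
def pad_empty_levels(rois_by_level, placeholder=(0.0, 0.0, 15.0, 15.0)):
    """Give every level at least one RoI; flag placeholders as not real."""
    padded, is_real = {}, {}
    for level, rois in rois_by_level.items():
        if rois:
            padded[level] = list(rois)
            is_real[level] = [True] * len(rois)
        else:
            # Hypothetical dummy box so the pooling op never sees an empty input.
            padded[level] = [placeholder]
            is_real[level] = [False]
    return padded, is_real


# Made-up grouping in which levels 3 and 5 received no proposals.
rois_by_level = {2: [(0, 0, 100, 100)], 3: [], 4: [(0, 0, 230, 220)], 5: []}
padded, is_real = pad_empty_levels(rois_by_level)
print({level: len(boxes) for level, boxes in padded.items()})
```

Any features pooled from a placeholder would have to be excluded from the loss using the `is_real` flags, otherwise they would act as spurious training samples.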
@xmyqsh Hi, I've been testing your code but got the same issue "cudaCheckError() failed in ROIPoolForward: invalid device function". Could you show me more details of how to hack the roi_pooling_op_gpu L100 and L205 to cope with it? Thank you.