Comments (15)
Hi @maaft! Thank you for your interest in our work!
As an example, you may want to take a look at Figure 2 in https://openreview.net/forum?id=SJx9ngStPH. It seems the Gumbel softmax ("GDAS") may lead to a worse one-shot validation error while achieving a similar test regret (i.e., the accuracy of the derived subnetworks).
For now, it is hard for me to say whether using Gumbel softmax is always a good idea. However, I would guess that stabilizing the feature maps in the supernet should lead to a stable search process, which could be vital for dense prediction tasks like segmentation. Gumbel softmax may introduce larger dynamics during the search.
Thank you for your answer!
To clarify, alphas and betas are learned using softmax during training and the network for evaluation is just picked using argmax, correct?
I.e.:
import torch
import torch.nn.functional as F

def train(self):
    # search phase: relax the discrete operator choice into a softmax-weighted mix
    alphas = F.softmax(self.alphas, dim=-1)
    ...

def eval(self):
    # evaluation: discretize by one-hot encoding the operator with the largest alpha
    idx = self.alphas.argmax(dim=-1, keepdim=True)
    alphas = torch.zeros_like(self.alphas)
    alphas.scatter_(1, idx, 1)
By "stabilizing feature maps", do you mean adding a loss that forces each cell to produce an output similar to that of the other cells in the same supercell?
Yes -- the operator with the largest alpha value is selected for training from scratch.
By "stabilizing feature maps" I mean reducing the variation of the feature maps during the search. A large variation can come from the different feature maps produced by different operators sampled by the one-hot Gumbel softmax. In contrast, the feature map varies less if it is the weighted sum of all operators' outputs.
Thanks!
Does this also apply to betas? If yes, how can some cells of the searched architecture have two in-bound connections from previous cells? (E.g. Cell 7 from downsampling-level 16)
Or do you use some kind of thresholds to determine if a connection is disabled or not? And if yes, what is the threshold you are using?
By the way, I noticed that in my case Gumbel-Softmax sampling converges much faster than plain softmax. Note that I don't use hard (one-hot) sampling, and I sample different alphas per image (so that I obtain as many paths as my batch size).
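For reference, a minimal sketch of what I mean by per-image sampling (soft, non-one-hot; self.alphas, ops, and x are placeholders standing in for my code):

# one independent soft Gumbel-Softmax sample of alphas per image in the batch
B = x.size(0)                       # batch size
logits = self.alphas.expand(B, -1)  # (B, num_ops): same logits for every image
alphas_b = F.gumbel_softmax(logits, tau=1.0, hard=False, dim=-1)
# each image in the batch now follows its own softly-sampled path
out = sum(alphas_b[:, i].view(B, 1, 1, 1) * op(x) for i, op in enumerate(ops))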
At index #7 there are actually two cells. See the last paragraph in our section 3.4 (page 6 bottom):
"It is worth noting that, the multi-resolution branches will share both cell weights and feature maps if their cells are of the same operator type, spatial resolution, and expansion ratio. This design contributes to a faster network. Once cells in branches diverge, the sharing between the branches will be stopped and they become individual branches."
This means starting from index #5, two branches will never share anything. They have their separate convolutional layers.
For your Gumbel-Softmax question: I never tried sampling per-image Gumbel noise. If you did not apply temperature annealing in the Gumbel softmax, then maybe you could try: 1) using a smaller architecture learning rate; 2) using a larger temperature for the Gumbel softmax.
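For example, a rough sketch of linear temperature annealing (the start/end values are only illustrative; GDAS-style searches typically decay the temperature over the course of the search):

# anneal the Gumbel-Softmax temperature from tau_max down to tau_min
tau_max, tau_min = 5.0, 0.1  # illustrative values
num_epochs = 100             # illustrative
for epoch in range(num_epochs):
    tau = tau_max - (tau_max - tau_min) * epoch / max(num_epochs - 1, 1)
    # ... sample with F.gumbel_softmax(self.alphas, tau=tau, hard=True) this epoch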
I've read that passage, but I don't really understand it, as your code suggests something else: in model_search.py you initialize your Network_Multi_Path with all cells of all branches, and I don't see shared parameters anywhere. Could you please clarify where the sharing takes place?
Thank you for being so patient and answering my questions! :)
We implement this in our derived architectures.
- At each layer, we say two branches are "connected" (i.e. they share both cell weight and feature maps) only if: 1) they output the same scale, 2) they choose the same operator, 3) they choose the same width: https://github.com/TAMU-VITA/FasterSeg/blob/master/train/model_seg.py#L257
- We merge connected branches into a group: https://github.com/TAMU-VITA/FasterSeg/blob/master/train/model_seg.py#L261
- Branches belonging to the same group share the same cell; otherwise they share neither cells nor feature maps.
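Roughly, the grouping logic looks like this (a simplified sketch with illustrative names; see the linked code for the exact implementation):

# two branches at a layer are "connected" iff scale, operator, and width match
def connected(a, b, layer):
    return (a.scale[layer] == b.scale[layer] and
            a.op[layer] == b.op[layer] and
            a.width[layer] == b.width[layer])

def group_branches(branches, layer):
    # merge connected branches into groups; one cell instance per group
    groups = []
    for branch in branches:
        for group in groups:
            if connected(group[0], branch, layer):
                group.append(branch)  # shares the group's cell weights & feature maps
                break
        else:
            groups.append([branch])   # new group: its own cell, nothing shared
    return groups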
Okay, so the sharing is only done after the model search is finished.
Another thing I noticed: in your paper, Eq. (1) applies the betas to the input of each cell, but in your implementation you apply them to the cell's outputs:
https://github.com/TAMU-VITA/FasterSeg/blob/bb52a004ff83f64d3dd8989104234fdb862d1cc5/search/model_search.py#L330
I guess both approaches are somehow equivalent in the end, but I was wondering why you decided to do the weighting on the output of the cell? As far as I can tell, input weighting as per Eq. (1) would spare us one forward pass per cell and therefore much-needed memory.
Sorry to bother you again, but could you kindly answer my last question?
If we really are able to multiply the betas onto the cell's input, we would spare a lot of memory and could use larger batch sizes.
Also, to come back to my original question: if you train a softmax-weighted sum of all feature maps for each super-cell, how do you decide that the architecture sampled at epoch N+1 is actually better than the architecture from epoch N?
The problem I have observed is that even though the architecture might have gotten better, the argmax-derived model performs worse (i.e., lower IoU), and I think this mainly comes from the softmax training.
Here you see the validation Dice score of the argmax-derived model; red is with hard Gumbel-Softmax sampling, blue is with plain softmax training:
> Sorry to bother you again, but could you kindly answer my last question? If we really are able to multiply the betas onto the cell's input, we would spare a lot of memory and could use larger batch sizes.

I think the current Eq. (1) is actually doing input weighting, as the weighted output will be the input of the next cell.
Without using Gumbel softmax, I doubt we could save memory here, since both outputs need to be calculated in the computation graph.
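To make that concrete: with a soft weighting, every candidate output must stay in the computation graph, whereas a hard one-hot sample would in principle let you execute only the selected branch (a sketch with hypothetical names, not our actual code):

# soft weighting: BOTH branch outputs live in the computation graph
out = beta[0] * branch0(x) + beta[1] * branch1(x)

# hard one-hot Gumbel sample: only the chosen branch is executed
g = F.gumbel_softmax(self.betas, tau=1.0, hard=True)  # e.g. tensor([0., 1.])
idx = int(g.argmax())
out = g[idx] * branches[idx](x)  # g[idx] == 1; straight-through keeps the gradient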
> Also, to come back to my original question: if you train a softmax-weighted sum of all feature maps for each super-cell, how do you decide that the architecture sampled at epoch N+1 is actually better than the architecture from epoch N? The problem I have observed is that even though the architecture might have gotten better, the argmax-derived model performs worse (i.e., lower IoU), and I think this mainly comes from the softmax training. Here you see the validation Dice score of the argmax-derived model; red is with hard Gumbel-Softmax sampling, blue is with plain softmax training:
For this question, this paper may be helpful to you.
In their Figure 2, solid lines are the performance of the derived subnetworks, and dashed lines are the performance of the supernet. We can see that the solid line (test error) does not always decrease over the search process. I think this is one important open challenge for the NAS community, especially for weight-sharing NAS methods.
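In practice, the only way I know to monitor this is to periodically derive and evaluate the discrete architecture during the search, e.g. (a rough sketch; all function names are hypothetical):

# track the derived (argmax) architecture, not only the supernet's metric
for epoch in range(num_epochs):
    train_supernet_one_epoch(supernet, train_loader)
    arch = derive_architecture(supernet)  # argmax over alphas / betas
    score = evaluate(arch, val_loader)    # e.g. IoU of the discretized network
    # note: this score and the supernet's one-shot metric can move in
    # opposite directions, which is exactly the regret problem above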
> I think the current Eq. (1) is actually doing input weighting, as the weighted output will be the input of the next cell.

Yes, this is what I wrote and understand. But in your code you do output weighting, and I don't understand why.
I mean that output reweighting at cell_i is actually doing input reweighting for cell_{i+1}.
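In code the two formulations are literally the same computation, just attributed to different cells (a minimal sketch with hypothetical names):

# Eq. (1) view: weight the input of cell_{i+1}
y1 = cell_next(beta * cell_i(x))

# implementation view: weight the output of cell_i, then feed it forward
h = beta * cell_i(x)
y2 = cell_next(h)

# y1 == y2 -- cell_i runs exactly once either way, so there is no extra
# forward pass to save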
Okay, I will continue to use input weighting then, as it spares me one forward pass through my cell (faster search time, lower memory consumption).