Comments (15)
Hi @maaft! Thank you for your interest in our work!
As an example, you may want to take a look at Figure 2 in https://openreview.net/forum?id=SJx9ngStPH. It seems the Gumbel softmax ("GDAS") may lead to a worse one-shot validation error while achieving a similar test regret (i.e., the accuracy of the derived subnetworks).
For now, it is hard for me to say whether using Gumbel softmax is always a good idea. However, I would guess that stabilizing the feature maps in the supernet should lead to a stable search process, which could be vital for dense prediction tasks like segmentation. Gumbel softmax may introduce larger dynamics during the search.
Thank you for your answer!
To clarify, alphas and betas are learned using softmax during training and the network for evaluation is just picked using argmax, correct?
I.e.:
import torch
import torch.nn.functional as F

def train(self):
    # search phase: relax the discrete operator choice into a softmax-weighted mix
    alphas = F.softmax(self.alphas, dim=-1)
    ...

def eval(self):
    # evaluation: discretize by one-hot encoding the operator with the largest alpha
    idx = self.alphas.argmax(dim=-1, keepdim=True)
    alphas = torch.zeros_like(self.alphas)
    alphas.scatter_(1, idx, 1)
By "stabilizing feature maps", do you mean adding a loss that forces each cell to produce an output similar to that of the other cells in the same supercell?
Yes -- the operator with the largest alpha value is selected for training from scratch.
By "stabilizing feature maps" I mean reducing the variation of the feature maps during the search. A large variation can come from the different feature maps produced by different operators sampled by the one-hot Gumbel softmax. In contrast, the feature map varies less if it is the weighted sum of all operators' outputs.
Thanks!
Does this also apply to betas? If yes, how can some cells of the searched architecture have two in-bound connections from previous cells? (E.g. Cell 7 from downsampling-level 16)
Or do you use some kind of thresholds to determine if a connection is disabled or not? And if yes, what is the threshold you are using?
By the way, I noticed that in my case Gumbel-Softmax sampling converges much faster than plain softmax. Note that I don't use hard (one-hot) sampling, and I sample different alphas per image (so that I obtain as many paths as my batch size).
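For reference, a minimal sketch of what I mean by per-image sampling (soft, non-one-hot; self.alphas, ops, and x are placeholders standing in for my code):

# one independent soft Gumbel-Softmax sample of alphas per image in the batch
B = x.size(0)                       # batch size
logits = self.alphas.expand(B, -1)  # (B, num_ops): same logits for every image
alphas_b = F.gumbel_softmax(logits, tau=1.0, hard=False, dim=-1)
# each image in the batch now follows its own softly-sampled path
out = sum(alphas_b[:, i].view(B, 1, 1, 1) * op(x) for i, op in enumerate(ops))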
At index #7 there are actually two cells. See the last paragraph in our section 3.4 (page 6 bottom):
"It is worth noting that, the multi-resolution branches will share both cell weights and feature maps if their cells are of the same operator type, spatial resolution, and expansion ratio. This design contributes to a faster network. Once cells in branches diverge, the sharing between the branches will be stopped and they become individual branches."
This means starting from index #5, two branches will never share anything. They have their separate convolutional layers.
For your Gumbel-Softmax question: I never tried sampling per-image Gumbel noise. If you did not apply temperature annealing in the Gumbel softmax, then maybe you could try: 1) using a smaller architecture learning rate; 2) using a larger temperature for the Gumbel softmax.
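For example, a rough sketch of linear temperature annealing (the start/end values are only illustrative; GDAS-style searches typically decay the temperature over the course of the search):

# anneal the Gumbel-Softmax temperature from tau_max down to tau_min
tau_max, tau_min = 5.0, 0.1  # illustrative values
num_epochs = 100             # illustrative
for epoch in range(num_epochs):
    tau = tau_max - (tau_max - tau_min) * epoch / max(num_epochs - 1, 1)
    # ... sample with F.gumbel_softmax(self.alphas, tau=tau, hard=True) this epoch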
I've read that passage, but I don't really understand it, as your code suggests something else: in model_search.py you initialize your Network_Multi_Path with all cells of all branches, and I don't see shared parameters anywhere. Could you please clarify where the sharing takes place?
Thank you for being so patient and answering my questions! :)
We implement this in our derived architectures.
- At each layer, we say two branches are "connected" (i.e. they share both cell weight and feature maps) only if: 1) they output the same scale, 2) they choose the same operator, 3) they choose the same width: https://github.com/TAMU-VITA/FasterSeg/blob/master/train/model_seg.py#L257
- We merge connected branches into a group: https://github.com/TAMU-VITA/FasterSeg/blob/master/train/model_seg.py#L261
- Branches belonging to the same group share the same cell; otherwise they share neither cells nor feature maps.
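Roughly, the grouping logic looks like this (a simplified sketch with illustrative names; see the linked code for the exact implementation):

# two branches at a layer are "connected" iff scale, operator, and width match
def connected(a, b, layer):
    return (a.scale[layer] == b.scale[layer] and
            a.op[layer] == b.op[layer] and
            a.width[layer] == b.width[layer])

def group_branches(branches, layer):
    # merge connected branches into groups; one cell instance per group
    groups = []
    for branch in branches:
        for group in groups:
            if connected(group[0], branch, layer):
                group.append(branch)  # shares the group's cell weights & feature maps
                break
        else:
            groups.append([branch])   # new group: its own cell, nothing shared
    return groups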
Okay, so the sharing is only done after the model search is finished.
Another thing I noticed: in your paper, Eq. (1) applies the betas to the input of each cell, but in your implementation you apply them to the cell's outputs:
https://github.com/TAMU-VITA/FasterSeg/blob/bb52a004ff83f64d3dd8989104234fdb862d1cc5/search/model_search.py#L330
I guess both approaches are somehow equivalent in the end, but I was wondering why you decided to do the weighting on the output of the cell? As far as I can tell, input weighting as per Eq. (1) would spare us one forward pass per cell and therefore much-needed memory.
Sorry to bother you again, but could you kindly answer my last question?
If we really are able to multiply the betas onto the cell's input, we would spare a lot of memory and could use larger batch sizes.
Also, to come back to my original question: if you train a softmax-weighted sum of all feature maps for each super-cell, how do you decide that the architecture sampled at epoch N+1 is actually better than the architecture from epoch N?
The problem I have observed is that even though the architecture might have gotten better, the argmax-derived model performs worse (i.e., lower IoU), and I think this mainly comes from the softmax training.
Here you see the validation Dice score of the argmax-derived model; red is with hard Gumbel-Softmax sampling, blue is with plain softmax training:
> Sorry to bother you again, but could you kindly answer my last question? If we really are able to multiply the betas onto the cell's input, we would spare a lot of memory and could use larger batch sizes.

I think the current Eq. (1) is actually doing input weighting, as the weighted output will be the input of the next cell.
Without using Gumbel softmax, I doubt we could save memory here, since both outputs need to be calculated in the computation graph.
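To make that concrete: with a soft weighting, every candidate output must stay in the computation graph, whereas a hard one-hot sample would in principle let you execute only the selected branch (a sketch with hypothetical names, not our actual code):

# soft weighting: BOTH branch outputs live in the computation graph
out = beta[0] * branch0(x) + beta[1] * branch1(x)

# hard one-hot Gumbel sample: only the chosen branch is executed
g = F.gumbel_softmax(self.betas, tau=1.0, hard=True)  # e.g. tensor([0., 1.])
idx = int(g.argmax())
out = g[idx] * branches[idx](x)  # g[idx] == 1; straight-through keeps the gradient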
> Also, to come back to my original question: if you train a softmax-weighted sum of all feature maps for each super-cell, how do you decide that the architecture sampled at epoch N+1 is actually better than the architecture from epoch N? The problem I have observed is that even though the architecture might have gotten better, the argmax-derived model performs worse (i.e., lower IoU), and I think this mainly comes from the softmax training. Here you see the validation Dice score of the argmax-derived model; red is with hard Gumbel-Softmax sampling, blue is with plain softmax training:
For this question, this paper may be helpful to you.
In their Figure 2, solid lines are the performance of the derived subnetworks, and dashed lines are the performance of the supernet. We can see that the solid line (test error) does not always decrease over the search process. I think this is one important open challenge for the NAS community, especially for weight-sharing NAS methods.
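In practice, the only way I know to monitor this is to periodically derive and evaluate the discrete architecture during the search, e.g. (a rough sketch; all function names are hypothetical):

# track the derived (argmax) architecture, not only the supernet's metric
for epoch in range(num_epochs):
    train_supernet_one_epoch(supernet, train_loader)
    arch = derive_architecture(supernet)  # argmax over alphas / betas
    score = evaluate(arch, val_loader)    # e.g. IoU of the discretized network
    # note: this score and the supernet's one-shot metric can move in
    # opposite directions, which is exactly the regret problem above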
> I think the current Eq. (1) is actually doing input weighting, as the weighted output will be the input of the next cell.

Yes, this is what I wrote and understand. But in your code you do output weighting, and I don't understand why.
I mean that output reweighting at cell_i is actually doing input reweighting for cell_{i+1}.
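In code the two formulations are literally the same computation, just attributed to different cells (a minimal sketch with hypothetical names):

# Eq. (1) view: weight the input of cell_{i+1}
y1 = cell_next(beta * cell_i(x))

# implementation view: weight the output of cell_i, then feed it forward
h = beta * cell_i(x)
y2 = cell_next(h)

# y1 == y2 -- cell_i runs exactly once either way, so there is no extra
# forward pass to save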
Okay, I will continue to use input weighting then, as it spares me one forward pass through my cell (faster search time, lower memory consumption).