chenwydj commented on June 10, 2024

Hi @maaft! Thank you for your interest in our work!

As an example, you may want to take a look at Figure 2 in https://openreview.net/forum?id=SJx9ngStPH. It seems the Gumbel softmax ("GDAS") may lead to a worse one-shot validation error while achieving a similar test regret (i.e., the accuracy of the derived subnetworks).

For now, it is hard for me to say whether using Gumbel softmax is always a good idea. However, I would guess that stabilizing the feature maps in the supernet should lead to a more stable search process, which could be vital for dense prediction tasks like segmentation. Gumbel softmax may introduce larger dynamics during the search.

maaft commented on June 10, 2024

Thank you for your answer!

To clarify: the alphas and betas are learned with softmax during training, and the network for evaluation is just picked via argmax, correct?

I.e.:

def train(self):
    # soft relaxation: every operator contributes, weighted by its softmaxed alpha
    alphas = F.softmax(self.alphas, dim=-1)
    ...

def eval(self):
    # hard selection: a one-hot vector for the operator with the largest alpha
    idx = self.alphas.argmax(dim=-1, keepdim=True)
    alphas = torch.zeros_like(self.alphas)
    alphas.scatter_(1, idx, 1)

By "stabilizing feature maps", do you mean adding a loss that forces each cell to produce an output similar to that of the other cells in the same supercell?

chenwydj commented on June 10, 2024

Yes -- the operator with the largest alpha value is selected for train-from-scratch.

By "stabilizing feature maps" I mean reducing the variations of feature maps during the search. A large variation could come from different feature maps produced by different operators sampled by the one-hot Gumble softmax. Instead, the feature map will be less variant if it is the weighted sum of all operators' outputs.

maaft commented on June 10, 2024

Thanks!
Does this also apply to the betas? If yes, how can some cells of the searched architecture have two in-bound connections from previous cells? (E.g., cell 7 on downsampling level 16.)

[image: diagram of the searched architecture]

Or do you use some kind of threshold to determine whether a connection is disabled? And if yes, what threshold are you using?

By the way, I noticed that in my case Gumbel softmax sampling converges much faster than plain softmax. Note that I don't use hard (one-hot) sampling, and I sample different alphas per image (so that I obtain as many paths as my batch size).
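
A minimal sketch of that per-image sampling (batch size and shapes are hypothetical):

import torch
import torch.nn.functional as F

B, num_ops = 16, 8  # hypothetical batch size and number of candidate operators
logits = torch.randn(num_ops, requires_grad=True)

# Expand the shared logits to one row per image; gumbel_softmax draws
# independent noise per row, so each image follows its own soft path.
alphas = F.gumbel_softmax(logits.expand(B, num_ops), tau=1.0, hard=False)  # (B, num_ops)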

chenwydj commented on June 10, 2024

At index #7 there are actually two cells. See the last paragraph of our Section 3.4 (page 6, bottom):

"It is worth noting that, the multi-resolution branches will share both cell weights and feature maps if their cells are of the same operator type, spatial resolution, and expansion ratio. This design contributes to a faster network. Once cells in branches diverge, the sharing between the branches will be stopped and they become individual branches."

This means that starting from index #5, the two branches never share anything; they have their own separate convolutional layers.

Regarding your Gumbel softmax question: I have never tried sampling per-image Gumbel noise. If you did not apply temperature annealing in the Gumbel softmax, you could try: 1) a smaller architecture learning rate; 2) a larger temperature for the Gumbel softmax.
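
For example, a linear annealing schedule could look like this (the values are hypothetical, not from the paper):

import torch
import torch.nn.functional as F

alphas = torch.randn(8, requires_grad=True)  # hypothetical architecture logits
tau_max, tau_min, epochs = 5.0, 0.5, 100     # hypothetical schedule

for epoch in range(epochs):
    # Anneal from a large temperature (smooth, low-variance weights)
    # towards a small one (near one-hot) over the course of the search.
    tau = tau_max - (tau_max - tau_min) * epoch / (epochs - 1)
    weights = F.gumbel_softmax(alphas, tau=tau, hard=True)
    # ... run the supernet with these weights and update alphas ...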

maaft commented on June 10, 2024

I've read that passage, but I don't really understand it, as your code suggests something else:

In model_search.py you initialize Network_Multi_Path with all cells of all branches, and I don't see shared parameters anywhere. Could you please clarify where the sharing takes place?

Thank you for being so patient and answering my questions! :)

chenwydj commented on June 10, 2024

We implement this in our derived architectures.

  1. At each layer, we say two branches are "connected" (i.e., they share both cell weights and feature maps) only if: 1) they output the same scale, 2) they choose the same operator, and 3) they choose the same width: https://github.com/TAMU-VITA/FasterSeg/blob/master/train/model_seg.py#L257
  2. We merge connected branches into a group: https://github.com/TAMU-VITA/FasterSeg/blob/master/train/model_seg.py#L261
  3. Branches belonging to the same group share the same cell; otherwise they share neither cells nor feature maps (see the sketch below).
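
A minimal sketch of that grouping logic (hypothetical data structures; see model_seg.py above for the actual implementation):

# Hypothetical per-layer branch descriptors: output scale, operator, width.
branches = [
    {"scale": 16, "op": 3, "width": 8},
    {"scale": 16, "op": 3, "width": 8},   # identical -> shares a cell with branch 0
    {"scale": 32, "op": 1, "width": 12},  # different -> gets its own cell
]

groups = {}
for i, branch in enumerate(branches):
    # Two branches are "connected" only if scale, operator, and width all match.
    key = (branch["scale"], branch["op"], branch["width"])
    groups.setdefault(key, []).append(i)

# One cell is instantiated per group; its members share weights and feature maps.
for key, members in groups.items():
    print(key, "-> branches", members, "share one cell")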

maaft commented on June 10, 2024

Okay, so the sharing is only done after the model search is finished.

Another thing I noticed is that in your paper you use the betas on the input of each cell:
[image: Eq. (1) from the paper, applying the betas to each cell's input]

But in your implementation you use the betas on the cell's outputs:
https://github.com/TAMU-VITA/FasterSeg/blob/bb52a004ff83f64d3dd8989104234fdb862d1cc5/search/model_search.py#L330

I guess the two approaches are somehow equivalent in the end, but I was wondering why you decided to do the weighting on the output of the cell.

Afaik, input weighting as per Eq. (1) would spare us one forward pass per cell and therefore save badly needed memory.
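
A minimal sketch of the two variants (a hypothetical cell with two beta-weighted inputs, for illustration only):

import torch
import torch.nn as nn

cell = nn.Conv2d(16, 16, 3, padding=1)  # hypothetical stand-in for a cell
x0 = torch.randn(1, 16, 32, 32)         # feature maps from two predecessor cells
x1 = torch.randn(1, 16, 32, 32)
b0, b1 = 0.7, 0.3                        # hypothetical (softmaxed) beta weights

# Input weighting, as Eq. (1) reads: a single forward pass through the cell.
y_in = cell(b0 * x0 + b1 * x1)

# Output weighting, as in model_search.py: one forward pass per input,
# costing extra compute and activation memory.
y_out = b0 * cell(x0) + b1 * cell(x1)

The two coincide only if the cell is affine and the betas sum to 1; with nonlinearities inside the cell they generally differ.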

maaft commented on June 10, 2024

Sorry to bother you again, but could you kindly answer my last question?

If we really are able to multiply the betas on the cell's input, we would save a lot of memory and could use larger batch sizes.

maaft commented on June 10, 2024

Also, to come back to my original question:
If you train a softmax-weighted sum of all feature maps for each supercell, how do you decide that the architecture sampled at epoch N+1 is actually better than the architecture from epoch N?

The problem I have observed is that even while the architecture might have gotten better, the argmax-derived model performs worse (i.e., lower IoU), and I think this mainly comes from the softmax training.

Here you see the validation Dice score of the argmax-derived model. Red is with hard Gumbel softmax sampling; blue is with plain softmax training:

[image: validation Dice score curves of the argmax-derived model]

chenwydj commented on June 10, 2024

> Sorry to bother you again, but could you kindly answer my last question?
>
> If we really are able to multiply the betas on the cell's input, we would save a lot of memory and could use larger batch sizes.

I think the current Eq. (1) is actually doing input weighting, as the weighted output will be the input into the next cell.

Without using Gumbel softmax, I doubt we can save memory here, since both outputs still need to be computed in the computation graph.

chenwydj commented on June 10, 2024

> Also, to come back to my original question:
> If you train a softmax-weighted sum of all feature maps for each supercell, how do you decide that the architecture sampled at epoch N+1 is actually better than the architecture from epoch N?
>
> The problem I have observed is that even while the architecture might have gotten better, the argmax-derived model performs worse (i.e., lower IoU), and I think this mainly comes from the softmax training.
>
> Here you see the validation Dice score of the argmax-derived model. Red is with hard Gumbel softmax sampling; blue is with plain softmax training:
>
> [image: validation Dice score curves of the argmax-derived model]

For this question, this paper may be helpful to you.

In their Figure 2, solid lines are the performance of the derived subnetworks, and dashed lines are the performance of the supernet. We can see that the solid line (test error) does not always keep decreasing over the course of the search. I think this is an important open challenge for the NAS community, especially for weight-sharing NAS methods.

maaft commented on June 10, 2024

> I think the current Eq. (1) is actually doing input weighting, as the weighted output will be the input into the next cell.

Yes, this is what I wrote and what I understand. But in your code you do output weighting, and I don't understand why. See:

See:
https://github.com/TAMU-VITA/FasterSeg/blob/bb52a004ff83f64d3dd8989104234fdb862d1cc5/search/model_search.py#L331

chenwydj commented on June 10, 2024

I mean that the output reweighting at cell_i is effectively an input reweighting for cell_{i+1}.
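
In code terms, a toy two-cell chain (hypothetical) makes the point concrete:

import torch
import torch.nn as nn

cell_i = nn.Conv2d(16, 16, 3, padding=1)     # hypothetical cells
cell_next = nn.Conv2d(16, 16, 3, padding=1)
x = torch.randn(1, 16, 32, 32)
beta = 0.6                                    # hypothetical beta weight

# Weighting cell_i's output...
y_i = beta * cell_i(x)
# ...yields exactly the tensor that cell_{i+1} consumes, i.e. it is an
# input weighting from cell_{i+1}'s point of view.
out = cell_next(y_i)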

maaft commented on June 10, 2024

Okay, then I will continue to use input weighting, as it spares one forward pass through the cell (faster search time, lower memory consumption).
