
Comments (8)

theskyv commented on August 29, 2024

Thank you for your reply and the discussion. I have three new questions I would like to discuss with you @ChaoningZhang :

  1. Is it more important to produce output similar to SAM's, or to output real and valid objects? As my pictures above show, FastSAM outputs a complete bus, while MobileSAM outputs a bus window with a lot of noise (>30%). Perhaps MobileSAM is more similar to SAM in mIoU, but I don't think "a window with a lot of noise" is more useful than "a complete bus".

  2. Is the mIoU comparison method fair? MobileSAM uses knowledge distillation to learn the output of the original SAM, while FastSAM is trained independently. It is well known that knowledge distillation makes the student more similar to the teacher, so taking the original SAM as ground truth and checking which of FastSAM and MobileSAM is more similar to it feels unfair, almost like "testing on the training data". (The metric itself is sketched at the end of this comment.)

  3. Is FastSAM really unsuitable for small objects? Looking at FastSAM's segmentation results, I found that it can segment car windows well; it is just that, when faced with segmentation ambiguity, FastSAM tends to prioritize large objects while SAM/MobileSAM tend to prioritize small objects, and neither preference is inherently superior. Taking the quantitative data in the FastSAM paper as evidence, FastSAM performs comparably to SAM (ViT-H) on small objects and only slightly worse on large objects. I would like to know why you think FastSAM is not suitable for detecting small objects.

[attached screenshots comparing the FastSAM and MobileSAM segmentation results]

Looking forward to discussing this with you.
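
For readers following the numbers later in the thread: the mIoU both sides quote is the per-image intersection-over-union between a candidate model's mask and the mask the original SAM produces for the same prompt, averaged over the test images. Below is a minimal numpy sketch of the metric; the actual evaluation code is not shown in this thread, so this is only an illustration:

```python
import numpy as np

def iou(pred: np.ndarray, ref: np.ndarray) -> float:
    """IoU between two boolean masks of the same shape."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    union = np.logical_or(pred, ref).sum()
    if union == 0:
        return 1.0  # both masks empty: count as a perfect match
    return float(np.logical_and(pred, ref).sum()) / float(union)

def miou(pairs) -> float:
    """Mean IoU over (candidate_mask, sam_reference_mask) pairs."""
    return float(np.mean([iou(p, r) for p, r in pairs]))
```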


ChaoningZhang commented on August 29, 2024

> While reading the paper, I was confused on one point: the paper says that FastSAM needs at least two prompt points, and on that basis compares the performance of FastSAM and MobileSAM in segment-anything mode.
>
> But as the HuggingFace demo posted by FastSAM shows, their approach supports a single point, and it has worked well in my own attempts. If this is indeed a writing error, I very much look forward to your revision and to new experiments that effectively compare the two methods.
>
> BTW, I am also curious how exactly Table 7 in the paper was produced, and why model performance can be demonstrated by varying the distance between the positive and negative prompt points. [attached screenshot]

Thanks for your interest in our work. When we first compared with their method, we found that it could not work with a single point. We have just carefully read their code again and found that it does work with a single point. We will conduct a comparison in this setup soon and update the results here to keep you informed.

For the results in Table 7, we do not use the distance to prove that ours is better. The distance is more or less a hyperparameter here, and we show that FastSAM has a significantly lower mIoU at all distances. A more extensive comparison is on the way.
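
As a rough illustration of the setup Table 7 appears to describe: one positive point is placed on the object and one negative point at a swept pixel distance from it, and both are passed to a SAM-style predictor. The point placement and file paths below are assumptions, not the authors' actual protocol:

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Placeholder paths; sam_vit_h_4b8939.pth is the official ViT-H checkpoint name.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

def predict_with_distance(cx: int, cy: int, d: int) -> np.ndarray:
    """Positive point on the object at (cx, cy), negative point d pixels away."""
    coords = np.array([[cx, cy], [cx + d, cy]], dtype=np.float32)
    labels = np.array([1, 0])  # 1 = foreground click, 0 = background click
    masks, _, _ = predictor.predict(
        point_coords=coords, point_labels=labels, multimask_output=False
    )
    return masks[0]

mask = predict_with_distance(cx=320, cy=240, d=75)  # d is the swept hyperparameter
```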

Thanks again for your interest in our work.


ChaoningZhang commented on August 29, 2024


We have just confirmed that FastSAM can indeed work with a single point: the mIoU (averaged over 100 images) for FastSAM is around 0.43, while that for MobileSAM is around 0.74. We hope this addresses your confusion; otherwise, please kindly let us know.


theskyv commented on August 29, 2024

I'm curious how you chose the 100 test images, how exactly the mIoU was calculated, and how the point locations were chosen.

I tried FastSAM's HuggingFace demo and the MobileSAM demo in pull request #4 and found that MobileSAM outputs some strange results in single-point mode.

[attached screenshots of the MobileSAM single-point results]


ChaoningZhang commented on August 29, 2024

As stated in the paper, the mIoU was calculated by comparison with the mask generated by the original SAM, with the point set at the middle of the image. I suggest you experiment with the original SAM as well: the mask you provided actually looks very good, and if you try the original SAM you will likely get the same mask as the one you reported. Detecting the small object (the window instead of the car, in this case) is the preferable behavior.
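
For concreteness, the protocol described above (a single prompt at the image center, scored against the original SAM's mask) could look roughly like the following sketch. The mobile_sam import mirrors the segment_anything API as in the MobileSAM README; the paths and the choice of the top-scoring mask are assumptions:

```python
import cv2
import numpy as np
from mobile_sam import sam_model_registry, SamPredictor  # same API as segment_anything

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)  # placeholder image
h, w = image.shape[:2]
center = np.array([[w / 2.0, h / 2.0]], dtype=np.float32)  # prompt at the image center

predictor = SamPredictor(sam_model_registry["vit_t"](checkpoint="weights/mobile_sam.pt"))
predictor.set_image(image)
masks, scores, _ = predictor.predict(
    point_coords=center, point_labels=np.array([1]), multimask_output=True
)
mobile_mask = masks[np.argmax(scores)]  # highest-scoring of the three candidates
# Running the same call with the original SAM ("vit_h" from segment_anything)
# gives the reference mask; the per-image IoU of the two, averaged over the
# test images, is the mIoU quoted in this thread.
```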

We tried the image you provided on the official Segment Anything demo (https://segment-anything.com/demo); the results are shown below:
[attached screenshot of the SAM demo result]

The case you pointed out shows that our MobileSAM aligns with the original SAM in detecting small objects well, while FastSAM fails to. Thank you very much for raising this issue; it sheds light on the fact that FastSAM does not seem to be as suitable for detecting small objects as the original SAM and our MobileSAM.


ChaoningZhang commented on August 29, 2024


We are currently very busy with other ongoing work, so we will try to answer your questions briefly. An important characteristic of SAM is that it addresses the ambiguity issue: given a single point, SAM can output three masks, and our MobileSAM follows this design. If a larger object were always more meaningful than a smaller one, as you suggest, SAM could simply choose, among the three masks, the one covering the largest region. Obviously, that does not make sense, which is why SAM instead relies on a score-ranking mechanism to choose the final mask. In other words, it is the model that automatically prefers the smaller object, without human bias. FastSAM seems to ignore this ambiguity issue, and ignoring it while claiming the larger object (the bus) might itself be biased: for human A, a point can mean the bus, while for human B, the same point can mean the window. Do you agree?
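
The three-mask behavior described above can be seen directly in the segment_anything API: with multimask_output=True, a single point yields three candidate masks plus predicted quality scores, and the final mask is the highest-scoring one rather than, say, the largest one. A minimal sketch contrasting the two selection rules; the paths and click location are placeholders:

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
image = cv2.cvtColor(cv2.imread("bus.jpg"), cv2.COLOR_BGR2RGB)  # placeholder image
predictor.set_image(image)

# One ambiguous click -> three candidate masks (e.g. window / door / whole bus).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]], dtype=np.float32),  # placeholder click
    point_labels=np.array([1]),
    multimask_output=True,
)

by_score = masks[np.argmax(scores)]                 # SAM's actual selection rule
by_area = masks[np.argmax(masks.sum(axis=(1, 2)))]  # naive "always the largest object"
```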


ChaoningZhang commented on August 29, 2024


I am closing this for the moment since there are no follow-up issues. Thanks again for your interest in our work!


