
Comments (8)

theskyv commented on August 29, 2024

Thank you for your reply and the discussion. I have three new questions I would like to discuss with you @ChaoningZhang :

  1. Is it more important to produce output similar to SAM's, or to output real and valid objects? As my pictures above show, FastSAM outputs a complete bus, while MobileSAM outputs a bus window with a lot of noise (>30%). Perhaps MobileSAM is more similar to SAM in mIoU, but I don't think "a window with a lot of noise" is more useful than "a complete bus".

  2. Is the mIoU comparison method fair? MobileSAM uses knowledge distillation to learn the output of the original SAM, while FastSAM is trained independently. It is well known that knowledge distillation makes the student more similar to the teacher, so taking the original SAM as ground truth and checking which of FastSAM and MobileSAM is more similar to it feels unfair, almost like "testing on the training data". (The metric itself is sketched at the end of this comment.)

  3. Is FastSAM really unsuitable for small objects? Looking at FastSAM's segmentation results, I found that it can segment car windows well; it is just that, when faced with segmentation ambiguity, FastSAM tends to prioritize large objects while SAM/MobileSAM tend to prioritize small objects, and neither preference is inherently superior. Taking the quantitative data in the FastSAM paper as evidence, FastSAM performs comparably to SAM (ViT-H) on small objects and only slightly worse on large objects. I would like to know why you think FastSAM is not suitable for detecting small objects.

[attached screenshots comparing the FastSAM and MobileSAM segmentation results]

Looking forward to discussing this with you.
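
For readers following the numbers later in the thread: the mIoU both sides quote is the per-image intersection-over-union between a candidate model's mask and the mask the original SAM produces for the same prompt, averaged over the test images. Below is a minimal numpy sketch of the metric; the actual evaluation code is not shown in this thread, so this is only an illustration:

```python
import numpy as np

def iou(pred: np.ndarray, ref: np.ndarray) -> float:
    """IoU between two boolean masks of the same shape."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    union = np.logical_or(pred, ref).sum()
    if union == 0:
        return 1.0  # both masks empty: count as a perfect match
    return float(np.logical_and(pred, ref).sum()) / float(union)

def miou(pairs) -> float:
    """Mean IoU over (candidate_mask, sam_reference_mask) pairs."""
    return float(np.mean([iou(p, r) for p, r in pairs]))
```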


ChaoningZhang commented on August 29, 2024

> While reading the paper, I was confused on one point: the paper says that FastSAM needs at least two prompt points, and on that basis compares the performance of FastSAM and MobileSAM in segment-anything mode.
>
> But as the HuggingFace demo posted by FastSAM shows, their approach supports a single point, and it has worked well in my own attempts. If this is indeed a writing error, I very much look forward to your revision and to new experiments that effectively compare the two methods.
>
> BTW, I am also curious how exactly Table 7 in the paper was produced, and why model performance can be demonstrated by varying the distance between the positive and negative prompt points. [attached screenshot]

Thanks for your interest in our work. When we first compared with their method, we found that it could not work with a single point. We have just carefully read their code again and found that it does work with a single point. We will conduct a comparison in this setup soon and update the results here to keep you informed.

For the results in Table 7, we do not use the distance to prove that ours is better. The distance is more or less a hyperparameter here, and we show that FastSAM has a significantly lower mIoU at all distances. A more extensive comparison is on the way.
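
As a rough illustration of the setup Table 7 appears to describe: one positive point is placed on the object and one negative point at a swept pixel distance from it, and both are passed to a SAM-style predictor. The point placement and file paths below are assumptions, not the authors' actual protocol:

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Placeholder paths; sam_vit_h_4b8939.pth is the official ViT-H checkpoint name.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

def predict_with_distance(cx: int, cy: int, d: int) -> np.ndarray:
    """Positive point on the object at (cx, cy), negative point d pixels away."""
    coords = np.array([[cx, cy], [cx + d, cy]], dtype=np.float32)
    labels = np.array([1, 0])  # 1 = foreground click, 0 = background click
    masks, _, _ = predictor.predict(
        point_coords=coords, point_labels=labels, multimask_output=False
    )
    return masks[0]

mask = predict_with_distance(cx=320, cy=240, d=75)  # d is the swept hyperparameter
```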

Thanks again for your interest in our work.


ChaoningZhang commented on August 29, 2024


We have just confirmed that FastSAM can indeed work with a single point: the mIoU (averaged over 100 images) for FastSAM is around 0.43, while that for MobileSAM is around 0.74. We hope this addresses your confusion; otherwise, please kindly let us know.


theskyv commented on August 29, 2024

I'm curious how you chose the 100 test images, how exactly the mIoU was calculated, and how the point locations were chosen.

I tried FastSAM's HuggingFace demo and the MobileSAM demo in pull request #4 and found that MobileSAM outputs some strange results in single-point mode.

[attached screenshots of the MobileSAM single-point results]


ChaoningZhang commented on August 29, 2024

As stated in the paper, the mIoU was calculated by comparison with the mask generated by the original SAM, with the point set at the middle of the image. I suggest you experiment with the original SAM as well: the mask you provided actually looks very good, and if you try the original SAM you will likely get the same mask as the one you reported. Detecting the small object (the window instead of the car, in this case) is the preferable behavior.
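
For concreteness, the protocol described above (a single prompt at the image center, scored against the original SAM's mask) could look roughly like the following sketch. The mobile_sam import mirrors the segment_anything API as in the MobileSAM README; the paths and the choice of the top-scoring mask are assumptions:

```python
import cv2
import numpy as np
from mobile_sam import sam_model_registry, SamPredictor  # same API as segment_anything

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)  # placeholder image
h, w = image.shape[:2]
center = np.array([[w / 2.0, h / 2.0]], dtype=np.float32)  # prompt at the image center

predictor = SamPredictor(sam_model_registry["vit_t"](checkpoint="weights/mobile_sam.pt"))
predictor.set_image(image)
masks, scores, _ = predictor.predict(
    point_coords=center, point_labels=np.array([1]), multimask_output=True
)
mobile_mask = masks[np.argmax(scores)]  # highest-scoring of the three candidates
# Running the same call with the original SAM ("vit_h" from segment_anything)
# gives the reference mask; the per-image IoU of the two, averaged over the
# test images, is the mIoU quoted in this thread.
```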

We tried the image you provided on the official Segment Anything demo (https://segment-anything.com/demo); the results are shown below:
[attached screenshot of the SAM demo result]

The case you pointed out shows that our MobileSAM aligns with the original SAM in detecting small objects well, while FastSAM fails to. Thank you very much for raising this issue; it sheds light on the fact that FastSAM does not seem to be as suitable for detecting small objects as the original SAM and our MobileSAM.


ChaoningZhang commented on August 29, 2024


We are currently very busy with other ongoing work, so we will try to answer your questions briefly. An important characteristic of SAM is that it addresses the ambiguity issue: given a single point, SAM can output three masks, and our MobileSAM follows this design. If a larger object were always more meaningful than a smaller one, as you suggest, SAM could simply choose, among the three masks, the one covering the largest region. Obviously, that does not make sense, which is why SAM instead relies on a score-ranking mechanism to choose the final mask. In other words, it is the model that automatically prefers the smaller object, without human bias. FastSAM seems to ignore this ambiguity issue, and ignoring it while claiming the larger object (the bus) might itself be biased: for human A, a point can mean the bus, while for human B, the same point can mean the window. Do you agree?
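
The three-mask behavior described above can be seen directly in the segment_anything API: with multimask_output=True, a single point yields three candidate masks plus predicted quality scores, and the final mask is the highest-scoring one rather than, say, the largest one. A minimal sketch contrasting the two selection rules; the paths and click location are placeholders:

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
image = cv2.cvtColor(cv2.imread("bus.jpg"), cv2.COLOR_BGR2RGB)  # placeholder image
predictor.set_image(image)

# One ambiguous click -> three candidate masks (e.g. window / door / whole bus).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]], dtype=np.float32),  # placeholder click
    point_labels=np.array([1]),
    multimask_output=True,
)

by_score = masks[np.argmax(scores)]                 # SAM's actual selection rule
by_area = masks[np.argmax(masks.sum(axis=(1, 2)))]  # naive "always the largest object"
```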


ChaoningZhang commented on August 29, 2024


I am closing this for the moment since there are no follow-up issues. Thanks again for your interest in our work!


