
aot's Introduction

A Github Pages template for academic websites. This was forked (then detached) by Stuart Geiger from the Minimal Mistakes Jekyll Theme, which is © 2016 Michael Rose and released under the MIT License. See LICENSE.md.

I think I've got things running smoothly and fixed some major bugs, but feel free to file issues or make pull requests if you want to improve the generic template / theme.

Note: if you are using this repo and now get a notification about a security vulnerability, delete the Gemfile.lock file.

Instructions

  1. Register a GitHub account if you don't have one and confirm your e-mail (required!)
  2. Fork this repository by clicking the "fork" button in the top right.
  3. Go to the repository's settings (the rightmost item in the tabs that start with "Code", below "Unwatch"). Rename the repository to "[your GitHub username].github.io", which will also be your website's URL.
  4. Set site-wide configuration and create content & metadata (see below -- also see this set of diffs showing what files were changed to set up an example site for a user with the username "getorg-testacct")
  5. Upload any files (like PDFs, .zip files, etc.) to the files/ directory. They will appear at https://[your GitHub username].github.io/files/example.pdf.
  6. Check the status by going to the repository settings, in the "GitHub Pages" section
  7. (Optional) Use the Jupyter notebooks or Python scripts in the markdown_generator folder to generate markdown files for publications and talks from a TSV file (see the sketch below).
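
The real notebooks and scripts in the markdown_generator folder handle more metadata, but a minimal sketch of the idea, assuming a hypothetical publications.tsv with title, date, venue, and url_slug columns, looks like this:

    # Minimal sketch: turn each row of a publications TSV into a markdown file
    # with YAML front matter. The column names (title, date, venue, url_slug)
    # are illustrative assumptions; the real scripts in markdown_generator/
    # expect their own TSV layout. The _publications/ directory must exist.
    import csv

    with open("publications.tsv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            filename = f"_publications/{row['date']}-{row['url_slug']}.md"
            front_matter = (
                "---\n"
                f"title: \"{row['title']}\"\n"
                f"date: {row['date']}\n"
                f"venue: \"{row['venue']}\"\n"
                "collection: publications\n"
                "---\n"
            )
            with open(filename, "w", encoding="utf-8") as out:
                out.write(front_matter)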

See more info at https://academicpages.github.io/

To run locally (not on GitHub Pages, to serve on your own computer)

  1. Clone the repository and make updates as detailed above
  2. Make sure you have ruby-dev, bundler, and nodejs installed: sudo apt install ruby-dev ruby-bundler nodejs
  3. Run bundle clean to clean up the directory (no need to run --force)
  4. Run bundle install to install ruby dependencies. If you get errors, delete Gemfile.lock and try again.
  5. Run bundle exec jekyll liveserve to generate the HTML and serve it from localhost:4000; the local server will automatically rebuild and refresh the pages on change.

Changelog -- bugfixes and enhancements

There is one logistical issue with a ready-to-fork template theme like academicpages that makes it a little tricky to get bug fixes and updates to the core theme. If you fork this repository, customize it, and then pull again, you'll probably get merge conflicts. If you want to keep your various .yml configuration files and markdown files, you can delete the repository and fork it again, or you can patch manually.

To support this, all changes to the underlying code appear as a closed issue with the tag 'code change' -- get the list here. Each issue thread includes a comment linking to the single commit or a diff across multiple commits, so those with forked repositories can easily identify what they need to patch.

aot's People

Contributors

z-x-yang


aot's Issues

The performance gains from image datasets (COCO, VOC, etc.)

Hi,

I have read some of your great papers for VOS.

I noticed that models are usually first pre-trained on image datasets (COCO, VOC, etc.) and then trained on video datasets (DAVIS, YTB). Could you share any experience or results about the performance gap between training with and without image datasets?

In addition, is the MobileNetV2 in this work pre-trained on ImageNet or trained from scratch?

Many thanks in advance.
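
Not an official answer, but for context: STM-style VOS methods typically run the image-dataset pre-training stage by synthesizing short clips from single images with random deformations, which is where the extra performance usually comes from. A minimal sketch of that common trick (an illustration of the usual practice, not necessarily AOT's exact recipe):

    # Minimal sketch of static-image pretraining for VOS: synthesize a short
    # "video" from one image by applying random affine transforms to the image
    # and its mask. Illustrative only; parameters are arbitrary assumptions.
    import random
    import torchvision.transforms.functional as TF

    def synthesize_clip(image, mask, length=5):
        clip = []
        for _ in range(length):
            angle = random.uniform(-20, 20)
            translate = [random.randint(-30, 30), random.randint(-30, 30)]
            scale = random.uniform(0.9, 1.1)
            frame = TF.affine(image, angle, translate, scale, shear=[0.0])
            frame_mask = TF.affine(mask, angle, translate, scale, shear=[0.0])
            clip.append((frame, frame_mask))
        return clip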

Identification Embedding

Hi,
Do you treat the background like an object when embedding identifications, or do you aggregate the background after decoding?

Thanks.
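
Not speaking for the authors, but for concreteness, one common way to implement an identification embedding assigns the background (label 0) its own identification vector, just like the objects; whether AOT does exactly this, or instead aggregates the background after decoding, is what this issue asks. A minimal sketch under that assumption:

    # Minimal sketch (assumption, not AOT's confirmed implementation): the mask
    # labels background as 0 and objects as 1..N, and every label, including
    # the background, is mapped to a learnable identification vector.
    import torch
    import torch.nn.functional as F

    num_ids, embed_dim = 11, 256                 # e.g. 10 objects + background
    id_bank = torch.nn.Embedding(num_ids, embed_dim)

    mask = torch.randint(0, 3, (1, 30, 30))      # labels 0 (background), 1, 2
    one_hot = F.one_hot(mask, num_classes=num_ids).float()   # (1, H, W, num_ids)
    id_embedding = one_hot @ id_bank.weight                  # (1, H, W, embed_dim)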

About the input dimensions of Long-Term Attention


How do you deal with the variable input dimensions, since the input dimensions of torch.nn.MultiheadAttention seem to be fixed? Or do I have a misunderstanding about the use of torch.nn.MultiheadAttention?
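
Not an official answer, but as far as torch.nn.MultiheadAttention itself is concerned, only embed_dim and num_heads are fixed at construction time; the sequence lengths of the query and key/value can change from call to call, so a growing memory does not require a fixed input size. A quick sketch:

    # torch.nn.MultiheadAttention fixes embed_dim and num_heads at construction,
    # but the sequence lengths of query/key/value can vary between calls.
    import torch

    attn = torch.nn.MultiheadAttention(embed_dim=256, num_heads=8)

    q = torch.randn(100, 1, 256)           # (target_len, batch, embed_dim)
    for mem_len in (100, 400, 900):        # e.g. a growing long-term memory
        k = v = torch.randn(mem_len, 1, 256)
        out, _ = attn(q, k, v)             # works for any mem_len
        print(out.shape)                   # torch.Size([100, 1, 256])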

Multi-view object segmentation

Hi,
I haven't explored your algorithm yet but I was wondering if it could be useful to me.
I wanted to know whether this solution could also be used when there are several images of the same object taken from multiple angles (not necessarily in order), rather than a video of it.

Is there any theoretical basis that would motivate this, or have any experiments along these lines been done?

Thanks
Gianluca

The sampling strategy during training

Hi Zongxin,

May I know more details about the training strategies?

  • In this paper the sequence length is 5 during training; does that mean 1 frame as the reference (long-term), 1 frame as the previous frame (short-term), and 3 frames as current frames to predict in a sequential manner?
  • Is the sampling of the frame indexes in a sequence the same as in CFBI? If not, could you share some details? (See the sketch below for the setup I have in mind.)
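
For concreteness, here is a minimal sketch of the kind of frame-index sampling the second bullet refers to (one reference frame followed by frames predicted sequentially); the gap values are purely illustrative, and the actual AOT/CFBI strategy is exactly what this question asks about:

    # Minimal, illustrative sketch of CFBI-style sequence sampling (not the
    # confirmed AOT strategy): pick an ordered set of frame indices with
    # bounded gaps; the first index serves as the reference frame.
    import random

    def sample_sequence(num_frames, seq_len=5, max_gap=5):
        start = random.randint(0, max(0, num_frames - (seq_len - 1) * max_gap - 1))
        indices = [start]
        for _ in range(seq_len - 1):
            gap = random.randint(1, max_gap)
            indices.append(min(indices[-1] + gap, num_frames - 1))
        return indices  # indices[0] = reference, indices[1:] predicted sequentially

    print(sample_sequence(num_frames=80))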

Thank you.

Question regarding Table 1

Hi, love your AOT/AOST. Excellent scalability. The open-source code is another big plus.

I was wondering what counts as "all frames" in Table 1 of AOST. Most papers don't mention whether they use the 6 FPS or 30 FPS version (and they also don't release the code!). The authors of STM did say they use all the frames: seoungwugoh/STM#3 (comment). HMMN uses a similar evaluation structure, so this is likely also the case; KMN has no code, but their baseline score matches STM, so it is also likely that they use the 30 FPS version. Am I missing something here?

Cheers.

About training details

Hi, Zongxin,

Thank you for the nice work. I have some questions about the training details.

  1. How many models do you use in Table 1? Do DAVIS val/test share one model and YouTube-VOS 2018/2019 share another?
  2. You say "For main training, the training steps are 100,000 for YouTube-VOS or 50,000 for DAVIS"; what is the total number of training iterations? I ask because I think "training steps" might refer to an intermediate step of adjusting the learning rate (see the sketch below for the reading I have in mind).
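
For what it's worth, in many VOS codebases the quoted "training steps" number is the total number of optimizer iterations, and a polynomial learning-rate decay is stretched over exactly that many steps. A minimal sketch of that reading (an assumption about the schedule, not a confirmation of AOT's actual one):

    # Minimal sketch: "training steps" read as the total number of iterations
    # that a polynomial learning-rate decay spans (assumption, not confirmed;
    # base_lr and power are illustrative values).
    def poly_lr(step, total_steps=100_000, base_lr=2e-4, power=0.9):
        return base_lr * (1 - step / total_steps) ** power

    for step in (0, 50_000, 99_999):
        print(step, poly_lr(step))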

About the X^m_l in Long-Term Attention

Does the X^m_l in Long-Term Attention (red box in the picture) refer to the features output by the backbone, or is it obtained through self-attention like X^t_l (blue box in the picture)?

[attached figure: Long-Term Attention diagram, with X^m_l highlighted in red and X^t_l in blue]

Finetuning on my own dataset

Hi Zongxin,

Congrats on a series of great works as well as Segment-and-Track Everything Project!

Now I am testing the Track Everything project on a multi-view microscopic dataset where severe occlusions are likely. My target is tracking many small "black circles", each only about 20 x 20 pixels in a 1024 x 1024 image, with possibly hundreds of them in a single image.

I was pleasantly surprised that, after some effort, your model gave us pretty nice results, but it still sometimes wrongly tracks multiple circles as the same object, or the masks can be somewhat rough.

I guess some additional finetuning work awaits us, though I am not an expert in this area. My current effort is mostly exploring recent SOTA works and preparing datasets for them. Unfortunately, I can't find a way to optimize the tracking part of your model (I have already finetuned SAM on our task, so every frame produces a pretty nice mask, but I cannot track the masks correctly).

Could you please give me some advice? Is there any plan to release code for finetuning on customized datasets? I am very happy to help test and discuss if you are interested in how well your model works with medical and microscopic data like CT, MRI, Cryo-EM, or Cryo-ET after finetuning.

Thanks in advance!

Example video of results

Hello,

Is it possible to provide an example video or two which demonstrate the results of the method?

Thanks,
Chris
