
Object Discovery from Motion-Guided Tokens

This is the repository for Object Discovery from Motion-Guided Tokens, published at CVPR 2023.

[Project Page] [Paper]

Set up

Python 3 dependencies:

  • torch 1.7.1+CUDA11.0
  • matplotlib
  • opencv-python (imported as cv2)
  • numpy
  • scipy
  • tqdm

Datasets

MOVi-E

The MOVi-E dataset can be accessed from the official repo. After downloading, we save the data to .npy files for training; see process_movi.py for details.

TRI-PD

The PD datasets (RGB, flow, depth, semantic masks) and additional annotations (moving-object masks; dynamic-object masks) are available at [TRI-PD dataset]. The "simplified" folder contains flow (both forward and backward), RGB, and motion masks. The "full" folder additionally contains RGB, depth, flow, motion masks, and semantic masks.

The raw PD dataset (which contains RGB, semantic segmentation, instance segmentation, optical flow, depth, camera calibrations, 2D/3D bounding boxes, etc.) is connected to TRI's Vidar project. Leave a message in the issues or contact [email protected] for annotations other than the simplified ones.

Sample code to convert the motion vectors to flow (x, y):

import cv2
import numpy as np

# Read the motion-vector image with all channels unchanged (-1 = IMREAD_UNCHANGED).
rgba = cv2.imread('x.png', -1)
r, g, b, a = rgba[:, :, 0], rgba[:, :, 1], rgba[:, :, 2], rgba[:, :, 3]
h, w = rgba.shape[:2]
# Combine the low/high channels into 16-bit integer offsets
# (cast first so the addition does not overflow the channel dtype).
dx_i = r.astype(np.int32) + g.astype(np.int32) * 256
dy_i = b.astype(np.int32) + a.astype(np.int32) * 256
# Map [0, 65535] to [-1, 1], then scale to pixel units.
flow_x = ((dx_i / 65535.0) * 2.0 - 1.0) * w
flow_y = ((dy_i / 65535.0) * 2.0 - 1.0) * h
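The same decoding can be wrapped as a reusable function that takes an already-loaded (H, W, 4) array, which makes it easy to apply across a whole dataset. This is a sketch; the channel layout and 16-bit normalization simply follow the snippet above.

```python
import numpy as np

def motion_vectors_to_flow(rgba):
    """Decode a (H, W, 4) motion-vector image into per-pixel (flow_x, flow_y).

    Channels 0/1 hold the low/high parts of dx, channels 2/3 of dy;
    values are normalized over [0, 65535], mapped to [-1, 1], and
    scaled to pixel units, matching the README snippet.
    """
    r, g, b, a = (rgba[:, :, i].astype(np.int64) for i in range(4))
    h, w = rgba.shape[:2]
    dx_i = r + g * 256
    dy_i = b + a * 256
    flow_x = ((dx_i / 65535.0) * 2.0 - 1.0) * w
    flow_y = ((dy_i / 65535.0) * 2.0 - 1.0) * h
    return flow_x, flow_y
```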

To download files from Google Drive on a server, check gdown. Sample code to download the files in the folder:

import gdown
url = "https://drive.google.com/drive/folders/1q5AjqhoivJb67h9MZCgUtqb4CooDrZhC"

gdown.download_folder(url, quiet=False, use_cookies=False)

KITTI

The KITTI dataset can be downloaded from the official website; we use all the RGB images for training. The motion segmentations we used can be downloaded from here.

Dataset structure

MOVI

root 
	- train
		- video-0000
			- rgb.npy
			- forward_flow.npy
			- backward_flow.npy
			- depth.npy
			- segment.npy
		- video-0001
		- ...
	- val
	- test
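Given the layout above, one video folder can be loaded into memory with a few lines of numpy. A minimal sketch, assuming each video-XXXX folder contains exactly the five .npy files listed (the helper names `load_movi_video` and `list_videos` are illustrative, not part of the repo):

```python
import os
import numpy as np

FIELDS = ["rgb", "forward_flow", "backward_flow", "depth", "segment"]

def load_movi_video(video_dir):
    """Load one video-XXXX folder into a dict of numpy arrays, keyed by field name."""
    return {f: np.load(os.path.join(video_dir, f + ".npy")) for f in FIELDS}

def list_videos(split_dir):
    """List all video-XXXX subfolders of a split (train/val/test), sorted."""
    return sorted(
        os.path.join(split_dir, d)
        for d in os.listdir(split_dir)
        if d.startswith("video-")
    )
```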

TRI-PD

root 
   - scene_000001
      - rgb
         - camera_01
            - 000000000000000005.png
            - ...
         - camera_04
         - camera_05
         - camera_06
         - camera_07
         - camera_08
         - camera_09
      - motion_vectors_2d
      - back_motion_vectors_2d
      - moving_masks
      - ari_masks
      - est_masks
   - scene_000003
   - ...
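For TRI-PD, iterating the scene/modality/camera hierarchy above is the main bookkeeping step before decoding anything. A hedged sketch of a frame enumerator (the function name `pd_frames` is hypothetical; the scene_*/modality/camera_* layout follows the structure shown):

```python
import os

def pd_frames(root, modality="rgb"):
    """Yield (scene, camera, frame_path) for every PNG under
    root/scene_*/<modality>/camera_*/, in sorted order."""
    for scene in sorted(d for d in os.listdir(root) if d.startswith("scene_")):
        mod_dir = os.path.join(root, scene, modality)
        for cam in sorted(d for d in os.listdir(mod_dir) if d.startswith("camera_")):
            cam_dir = os.path.join(mod_dir, cam)
            for f in sorted(os.listdir(cam_dir)):
                if f.endswith(".png"):
                    yield scene, cam, os.path.join(cam_dir, f)
```

The same loop works for the other modalities (motion_vectors_2d, moving_masks, ...) by changing the `modality` argument.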

KITTI

root
	- 2011_09_26_drive_0001_sync
		- image_02
			- data
				- 0000000000.png
				- 0000000001.png
				- ...
			- raft_seg
				- 0000000000.png
				- 0000000001.png
				- ...
		- image_03
			- data
			- raft_seg
	- 2011_09_26_drive_0002_sync
	- ...
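Since each KITTI camera folder keeps RGB frames under data/ and motion segmentations under raft_seg/ with matching file names, pairing them is a simple directory walk. A sketch, assuming masks share the RGB file names as in the layout above (`paired_frames` is an illustrative helper, not a repo function):

```python
import os

def paired_frames(drive_dir, camera="image_02"):
    """Yield (rgb_path, seg_path) pairs for frames that have a raft_seg mask."""
    rgb_dir = os.path.join(drive_dir, camera, "data")
    seg_dir = os.path.join(drive_dir, camera, "raft_seg")
    for name in sorted(os.listdir(rgb_dir)):
        seg_path = os.path.join(seg_dir, name)
        if name.endswith(".png") and os.path.exists(seg_path):
            yield os.path.join(rgb_dir, name), seg_path
```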

Train the model

See trainPD.sh, trainKITTI.sh, and trainMOVI.sh for sample training scripts. See the args in the training Python scripts for details.

Evaluating the pre-trained model

To evaluate or infer on the test set, first download the pre-trained model (or train it with the training code), then run

python eval(movi/pd/kitti).py

Note that we provide the version without motion cues for MOVi-E, and with motion cues for TRI-PD and KITTI.

Visualize the slot masks

To run inference and visualize results on a video of arbitrary length, see Plot.py for sample code.

Pre-trained models

Pre-trained models are located in the pre-trained models folder in this drive.

Other helpful codes

In this repo, we mainly provide the architecture for the VQ-space + perceiver decoder. Implementations of the other decoder and reconstruction-space choices described in our paper can be found in the others folder.

Acknowledgement

The slot attention modules are adapted from the PyTorch slot attention implementation and the official Google repo; the estimated motion segments are generated with the Towards Segmenting Anything That Moves repo.

For generating the estimated annotations, we use SMURF and Vidar.

Previous work

Discovering objects that can move

@inproceedings{bao2022discovering,
    Author = {Bao, Zhipeng and Tokmakov, Pavel and Jabri, Allan and Wang, Yu-Xiong and Gaidon, Adrien and Hebert, Martial},
    Title = {Discovering Objects that Can Move},
    Booktitle = {CVPR},
    Year = {2022},
}

Citation

@inproceedings{bao2023object,
    Author = {Bao, Zhipeng and Tokmakov, Pavel and Wang, Yu-Xiong and Gaidon, Adrien and Hebert, Martial},
    Title = {Object Discovery from Motion-Guided Tokens},
    Booktitle = {CVPR},
    Year = {2023},
}

