
CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding

ACL anthology

TL;DR: CONE (see the overview below) tackles the emerging and challenging problem of video temporal grounding (VTG) in long-form videos. It is a plug-and-play framework on top of existing VTG models that handles long videos through a sliding-window mechanism, and a coarse-to-fine alignment framework built as a pipeline of window slicing and selection, proposal generation, and proposal ranking.

[Figure: CONE framework overview]

This repo supports data pre-processing, training, and evaluation on both the Ego4D-NLQ and MAD benchmarks.

📒 News

🗄 Table of Contents

📝 Preparation

Install dependencies

  • Follow INSTALL.md for installing necessary dependencies and compiling the code.
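The authoritative steps are in INSTALL.md; as a rough, hypothetical sketch of the usual flow (the environment name, Python version, and requirements file below are assumptions, not taken from INSTALL.md):

# Hypothetical sketch only; follow INSTALL.md for the exact commands
conda create -n cone python=3.7 -y      # assumed environment name and Python version
conda activate cone
pip install -r requirements.txt         # assumed dependency file
# INSTALL.md also covers compiling the code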

Prepare offline data

  • Download the full Ego4D-NLQ data: Ego4D-NLQ (8.29 GB).
  • Download the partial MAD data: MAD (6.5 GB). We CANNOT share the MAD visual features at this moment; please request access to the MAD dataset from the official resource: MAD github.
  • We describe the feature extraction and file pre-processing procedures for both benchmarks in detail; please refer to Feature_Extraction_MD.
  • If you unzip the Ego4D-NLQ data, the extracted folder structure should look like this:
This folder
└───offline_extracted_features/
│    └───egovlp_video_feature_1.875fps.tar.gz
│    └───egovlp_text_cls_feature.tar.gz
│    └───egovlp_text_token_feature.tar.gz
│    └───...
└───offline_lmdb/
│    └───egovlp_video_feature_1.875fps/
│    └───egovlp_egovlp_text_features/
│    └───...
└───data/
│    └───ego4d_ori_data/
│    │    └───nlq_train.json
│    │    └───...
│    └───ego4d_data/
│    │    └───train.jsonl
│    │    └───...
└───one_training_sample/
│    └───tensorboard_log/
│    └───inference_ego4d_val_top20_nms_0.5_preds.txt
│    └───inference_ego4d_val_top20_nms_0.5_preds.json
│    └───...
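As a hypothetical sanity check after downloading (the archive name and root path are placeholders, not the actual file names):

# Placeholder names for illustration only
unzip ego4d_nlq_data.zip -d /path/to/ego4d_nlq_root
cd /path/to/ego4d_nlq_root
ls offline_extracted_features offline_lmdb data one_training_sample   # should match the tree above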

🔧 Experiments

Note that our default base model is Moment-DETR. We also release code with 2D-TAN as the base model; please refer to the 2D-TAN README.

Ego4D-NLQ

Please refer to Ego4d-NLQ_ECCV_2022_workshop for detailed information about our submission to the Ego4D ECCV 2022 Challenge.

Ego4D-NLQ training

Training can be launched by running the following command:

bash cone/scripts/train_ego4d.sh CUDA_DEVICE_ID NUM_QUERIES WINDOW_LENGTH ADAPTER 

CUDA_DEVICE_ID is the CUDA device id. NUM_QUERIES is the number of moment queries (default: 5). WINDOW_LENGTH is the number of visual features inside one video window. ADAPTER selects the visual adapter module type and can be either linear or none.

The checkpoints and other experiment log files will be written into cone_results. For training under different settings, you can append additional command line flags to the command above. For more configurable options, please check our config file cone/config.py.
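For example, a hypothetical variation that reuses the --no_adapter_loss flag shown in the MAD commands below:

# Hypothetical run: disable the visual adapter and its auxiliary loss
bash cone/scripts/train_ego4d.sh 0 5 90 none --no_adapter_loss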

The actual command used in the experiments is

bash cone/scripts/train_ego4d.sh 0 5 90 linear 

In addition, we empirically find that performance increases when the textual token feature extractor is replaced by CLIP or RoBERTa, so we recommend using CLIP or RoBERTa token features via the following commands:

bash cone/scripts/train_ego4d_clip.sh 0 5 90 linear 
bash cone/scripts/train_ego4d_roberta.sh 0 5 90 linear 

Ego4D-NLQ inference

Once the model is trained, you can use the following commands for inference:

bash cone/scripts/inference_ego4d.sh CUDA_DEVICE_ID CHECKPOINT_PATH EVAL_ID  --nms_thd 0.5  --topk_window 20
bash cone/scripts/inference_ego4d_test.sh CUDA_DEVICE_ID CHECKPOINT_PATH EVAL_ID --nms_thd 0.5  --topk_window 20

where CUDA_DEVICE_ID is the CUDA device id, CHECKPOINT_PATH is the path to the saved checkpoint, and EVAL_ID is a name string identifying the evaluation run. We adopt Non-Maximum Suppression (NMS) with a threshold of 0.5 and set the number of pre-filtered windows to 20.
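A concrete (hypothetical) invocation, where the checkpoint path and evaluation id are placeholders, might look like:

# Hypothetical example: evaluate a checkpoint saved under cone_results/ on the validation split
bash cone/scripts/inference_ego4d.sh 0 cone_results/<experiment_dir>/model_best.ckpt my_val_run --nms_thd 0.5 --topk_window 20   # checkpoint filename assumed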

  • The results (Recall@K at IoU = 0.3 or 0.5) on the validation set should be similar to the numbers below, as reported in the main paper.
Method \ Metric   R@1 IoU=0.3   R@5 IoU=0.3   R@1 IoU=0.5   R@5 IoU=0.5
CONE              14.15         30.33         8.18          18.02

In addition, we provide our experiment log files: Ego4D-NLQ-Training-Sample (24 MB).

Note that we run inference on 3874 queries in the validation split, whereas NaQ removes zero-duration ground-truth queries and runs inference on 3529 queries. The reported performance of CONE would be higher (i.e., multiplied by 3874/3529 ≈ 1.098) if we used the same validation split as NaQ.

MAD

MAD training

Training can be launched by running the following command:

bash cone/scripts/train_mad.sh CUDA_DEVICE_ID NUM_QUERIES WINDOW_LENGTH ADAPTER  

CUDA_DEVICE_ID is the CUDA device id. NUM_QUERIES is the number of moment queries (default: 5). WINDOW_LENGTH is the number of visual features inside one video window. ADAPTER selects the visual adapter module type and can be either linear or none.

The actual commands used in the experiments are

bash cone/scripts/train_mad.sh 0 5 125 linear 
bash cone/scripts/train_mad.sh 0 5 125 none --no_adapter_loss 

MAD inference

Once the model is trained, you can use the following commands for inference:

bash cone/scripts/inference_mad.sh CUDA_DEVICE_ID CHECKPOINT_PATH EVAL_ID  --nms_thd 0.5  --topk_window 30
bash cone/scripts/inference_mad_test.sh CUDA_DEVICE_ID CHECKPOINT_PATH EVAL_ID  --nms_thd 0.5  --topk_window 30

where CUDA_DEVICE_ID is the CUDA device id, CHECKPOINT_PATH is the path to the saved checkpoint, and EVAL_ID is a name string identifying the evaluation run.

We adopt Non-Maximum Suppression (NMS) with a threshold of 0.5 and set the number of pre-filtered windows to 30.

  • The results (Recall@K at IoU = 0.3) should be similar to the numbers below, as reported in the main paper.
Method \ R@K   1      5      10     50
CONE (val)     6.73   15.20  20.07  32.09
CONE (test)    6.87   16.11  21.53  34.73

In addition, we provide the experiment log files: MAD-Training-Sample (370 MB).

πŸ‹οΈβ€οΈ Run predictions on your own videos and queries

You may also want to run the CONE model on your own videos and queries. Currently, it supports moment retrieval on first-person videos with the EgoVLP video feature extractor. For third-person videos, the video/text feature extractors should be replaced with CLIP for better performance.

Preliminaries

Create a ckpt folder and place the two weight files, Egovlp.pth and model_best.ckpt, inside it.

Create an example folder and place one Ego4D example video with the uid "94cdabf3-c078-4ad4-a3a1-c42c8fc3f4ad" inside it.
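Put together as a hypothetical shell sketch (the source paths and the .mp4 extension are assumptions; adjust them to your local files):

# Hypothetical preliminaries; adjust paths to where your files actually live
mkdir -p ckpt example
cp /path/to/Egovlp.pth ckpt/            # EgoVLP feature-extractor weights
cp /path/to/model_best.ckpt ckpt/       # trained CONE checkpoint
cp /path/to/94cdabf3-c078-4ad4-a3a1-c42c8fc3f4ad.mp4 example/   # Ego4D example video (extension assumed)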

Install some additional dependencies

pip install transformers easydict decord
pip install einops timm
pip install pytorchvideo

Run

Run the example provided in this repo:

python run_on_video/run.py

The output will look like the following:

Build models...
Loading feature extractors...
Loading EgoVLP models
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Loading trained Moment-DETR model...
Loading CONE models
Run prediction...
video_name:  94cdabf3-c078-4ad4-a3a1-c42c8fc3f4ad
text_query:  Did I wash the green pepper?
-----------------------------prediction------------------------------------
Rank 1, moment boundary in seconds: 87.461 103.1118, score: 1.9370151082074316
Rank 2, moment boundary in seconds: 350.099 360.7614, score: 1.9304019422713785
Rank 3, moment boundary in seconds: 95.9942 101.9733, score: 1.9060271403367295
Rank 4, moment boundary in seconds: 275.3885 286.7189, score: 1.8871944230965596
Rank 5, moment boundary in seconds: 384.3145 393.3277, score: 1.701088363940821  

βœ‰οΈ Contact

This repo is maintained by Zhijian Hou. Questions and discussions are welcome via [email protected].

πŸ™ Acknowledgements

This code is based on Moment-DETR. We use resources from CLIP and EgoVLP to extract the features. We thank the authors for their awesome open-source contributions.


cone's Issues

ModuleNotFoundError

When I use mad_clip_text_extractor.py to extract the CLIP textual query and token features, it raises ModuleNotFoundError: No module named 'data_utils'. When I check the utils directory, there is no file called 'data_utils'. Where exactly is the ClipFeatureExtractor module?

MAD Score

The MAD score reported in your paper is higher than the one reported in the README file. Is there any change in the config?

Visualisation

How did you create the visualisations for Ego4D and MAD?

Which CLIP features do you use for the MAD dataset?

Hi, I'd like to know which CLIP features you are using for the MAD dataset. Is it CLIP_B32 or CLIP_L14? By the way, I unzipped your Ego4D tar.gz and found two files, egovlp_video_feature_1.87fps.tar.gz and egovlp_video_feature_1.875fps.tar.gz. Which one did you use? There doesn't seem to be much difference between the two.

Something wrong when I reproduce the paper

Hello author, when I reproduced the experiment following your instructions, something seems to have gone wrong. I got some incorrect results: after some epochs, R5, R10, R50, and R100 become completely identical.

NaQ Comparison

You mentioned in the ReadMe file: "Note that we inference on 3874 queries in validation split, but NaQ removes zero-duration ground-truth queries and inferences on 3529 queries in validation split . The performance of CONE will be higher (i.e., multiplied by 3874/3529=1.098) if we use the same validation split of NaQ."

Can you point me to the source of this information?

mad_clip_text_extractor.py

The original data for MAD contains MAD_train.json. In the 'mad_clip_text_extractor.py' file you load train.jsonl. Are they the same?
