
Source code of the paper titled *Improving Video Captioning with Temporal Composition of a Visual-Syntactic Embedding*

License: MIT License

Topics: video-captioning, msvd, msr-vtt, wacv2021, deep-learning, pos-tagging, representation-learning, encoder-decoder, syntactic-representations, video-description


Improving Video Captioning with Temporal Composition of a Visual-Syntactic Embedding


This repository contains the source code for the paper Improving Video Captioning with Temporal Composition of a Visual-Syntactic Embedding. Video captioning is the task of predicting a semantically and syntactically correct sequence of words given some context video. In this paper, we consider syntactic representation learning as an essential component of video captioning. We construct a visual-syntactic embedding by mapping into a common vector space a visual representation, which depends only on the video, and a syntactic representation, which depends only on the Part-of-Speech (POS) tagging structure of the video description. We integrate this joint representation into an encoder-decoder architecture that we call Visual-Semantic-Syntactic Aligned Network (SemSynAN), which guides the decoder (text generation stage) by aligning temporal compositions of visual, semantic, and syntactic representations. We tested the proposed architecture and obtained state-of-the-art results on two widely used video captioning datasets: the Microsoft Video Description (MSVD) dataset and the Microsoft Research Video-to-Text (MSR-VTT) dataset.
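As a rough illustration of the core idea (not the released model code), the visual-syntactic embedding can be sketched as two projection heads that map a video representation and a POS-based syntactic representation into the same space, trained with a max-margin ranking loss so that matching pairs score higher than mismatched ones. All layer sizes, module names, and the margin below are illustrative assumptions, not the paper's exact configuration:

# Minimal PyTorch sketch of a visual-syntactic joint embedding (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, visual_dim=2048, syntax_dim=512, embed_dim=300):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, embed_dim)  # video branch
        self.syntax_proj = nn.Linear(syntax_dim, embed_dim)  # POS-tag branch

    def forward(self, visual_feats, syntax_feats):
        v = F.normalize(self.visual_proj(visual_feats), dim=-1)
        s = F.normalize(self.syntax_proj(syntax_feats), dim=-1)
        return v, s

def ranking_loss(v, s, margin=0.2):
    # Hinge-based ranking loss over all in-batch negatives.
    scores = v @ s.t()                                  # cosine similarities (B x B)
    pos = scores.diag().view(-1, 1)
    cost_s = (margin + scores - pos).clamp(min=0)       # contrast syntactic negatives
    cost_v = (margin + scores - pos.t()).clamp(min=0)   # contrast visual negatives
    mask = torch.eye(scores.size(0), device=scores.device).bool()
    return cost_s.masked_fill(mask, 0).sum() + cost_v.masked_fill(mask, 0).sum()

# Example with a batch of 4 pooled visual features and 4 syntactic encodings.
model = JointEmbedding()
v, s = model(torch.randn(4, 2048), torch.randn(4, 512))
loss = ranking_loss(v, s)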

Table of Contents

  1. Model
  2. Requirements
  3. Manual
  4. Qualitative Results
  5. Quantitative Results
  6. Citation

Model

Figure: Video Captioning with Visual-Syntactic Embedding (SemSynAN) architecture.
Figure: Visual-Syntactic Embedding.

Requirements

  1. Python 3.6
  2. PyTorch 1.2.0
  3. NumPy
  4. h5py

Manual

git clone --recursive https://github.com/jssprz/visual_syntactic_embedding_video_captioning.git

Download Data

mkdir -p data/MSVD && wget -i msvd_data.txt -P data/MSVD
mkdir -p data/MSR-VTT && wget -i msrvtt_data.txt -P data/MSR-VTT

For extracting your own visual feature representations, you can use our visual-feature-extractor module.

Training

If you want to train your own models, you can reuse the dataset information stored and tokenized in the corpus.pkl files. To construct these files you can use the scripts we provide in the video_captioning_dataset module. The content of these files is organized as follows (a minimal loading sketch is given after the list):

0: train_data: captions and idxs of training videos in format [corpus_widxs, vidxs], where:

  • corpus_widxs is a list of lists with the index of words in the vocabulary
  • vidxs is a list of indexes of video features in the features file

1: val_data: same format as train_data.

2: test_data: same format as train_data.

3: vocabulary: in format {'word': count}.

4: idx2word: the vocabulary in format {idx: 'word'}.

5: word_embeddings: the word-embedding matrix; the i-th row is the vector of the i-th word in the vocabulary.
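Assuming corpus.pkl unpickles to a sequence indexed exactly as described above (positions 0 to 5), a minimal loading sketch looks like this (the path is only an example):

# Sketch of reading corpus.pkl according to the field layout listed above.
import pickle

with open('data/MSVD/corpus.pkl', 'rb') as f:
    corpus = pickle.load(f)

train_data, val_data, test_data = corpus[0], corpus[1], corpus[2]
vocabulary, idx2word, word_embeddings = corpus[3], corpus[4], corpus[5]

corpus_widxs, vidxs = train_data                        # word-index captions and video indices
first_caption = [idx2word[i] for i in corpus_widxs[0]]  # decode the first training caption
print(len(vidxs), 'training captions,', len(idx2word), 'vocabulary entries')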

We use the val_references.txt and test_references.txt files for computing the evaluation metrics only.

Testing

1. Download the pre-trained models (epoch 41 for MSVD, epoch 12 for MSR-VTT)

wget https://s06.imfd.cl/04/github-data/SemSynAN/MSVD/captioning_chkpt_41.pt -P pretrain/MSVD
wget https://s06.imfd.cl/04/github-data/SemSynAN/MSR-VTT/captioning_chkpt_12.pt -P pretrain/MSR-VTT

2. Generate captions for test samples

python test.py -chckpt pretrain/MSVD/captioning_chkpt_41.pt -data data/MSVD/ -out results/MSVD/
python test.py -chckpt pretrain/MSR-VTT/captioning_chkpt_12.pt -data data/MSR-VTT/ -out results/MSR-VTT/

3. Compute the evaluation metrics

python evaluate.py -gen results/MSVD/predictions.txt -ref data/MSVD/test_references.txt
python evaluate.py -gen results/MSR-VTT/predictions.txt -ref data/MSR-VTT/test_references.txt
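For reference only, the snippet below is not the repo's evaluate.py (which has its own metric implementations and file format); it just illustrates the kind of computation involved by scoring corpus-level BLEU-4 with NLTK once hypotheses and references are grouped per video:

# Illustration only: corpus-level BLEU-4 with NLTK for per-video hypotheses/references.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

hypotheses = {'vid1': 'a man is playing a guitar',
              'vid2': 'a woman is cooking'}
references = {'vid1': ['a man plays a guitar', 'someone is playing guitar'],
              'vid2': ['a woman is cooking food', 'a lady cooks a meal']}

hyp_tokens = [hypotheses[v].split() for v in hypotheses]
ref_tokens = [[r.split() for r in references[v]] for v in hypotheses]

bleu4 = corpus_bleu(ref_tokens, hyp_tokens,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=SmoothingFunction().method1)
print('BLEU-4: {:.3f}'.format(bleu4))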

Qualitative Results

Figure: qualitative results (example generated captions).

Quantitative Results

Dataset   Epoch   BLEU-4   METEOR   CIDEr   ROUGE-L
MSVD        100     64.4     41.9    111.5      79.5
MSR-VTT      60     46.4     30.4     51.9      64.7

Citation

@InProceedings{Perez-Martin_2021_WACV,
    author    = {Perez-Martin, Jesus and Bustos, Benjamin and Perez, Jorge},
    title     = {Improving Video Captioning With Temporal Composition of a Visual-Syntactic Embedding},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2021},
    pages     = {3039-3049}
}


Issues

code for training

Sorry, I did not find the code for training the model. Will you publish it in the future?

regarding dataloader

Dear Sir,

I need to know how you build the dataloader. Please suggest a suitable data loader for your code.

Training Code

Hello, are you planning to release the training code soon?

ECO features

Thank you for sharing your amazing work.

I need to extract ECO features only. Could you explain how to do that? Specifically, I just need to extract the ECO features of the videos. Do I have to run all of the ECO code and models from its GitHub repository, or is there a simpler way?

about training code

Thanks for sharing this amazing work.
The repository does not seem to contain the training code. Could you provide it?

Code??

I am very interested in this paper. When will the code be published?

Result gap

Hi,

Thanks for your code. I found a gap between my results and the reported ones (in the paper and this repo) after following the "test" instructions exactly. Here are my results:

MSVD:
Bleu_1: 0.906, Bleu_2: 0.814, Bleu_3: 0.725, Bleu_4: 0.627 (vs. 0.644), METEOR: 0.397, ROUGE_L: 0.783, CIDEr: 1.089

MSR-VTT:
Bleu_1: 0.831, Bleu_2: 0.702, Bleu_3: 0.566, Bleu_4: 0.443 (vs. 0.464), METEOR: 0.288, ROUGE_L: 0.625, CIDEr: 0.501

Is there something wrong?

Thank you.

Could you give more details about the semantic concept (SC) detector?

Hi,

Thank you for your code. Could you give more details about the semantic concept (SC) detector? For example, how is the SC vocabulary formed? Since you only release `cnn_sem_globals`, which contains the probabilities of SL (400-d), I wonder how the 400 dimensions correspond to the words in the SC vocabulary.

Thank you.
