Fine-grained Image Captioning with CLIP Reward

Authors: Jaemin Cho, David Seunghyun Yoon, Ajinkya Kale, Franck Dernoncourt, Trung Bui, Mohit Bansal
Findings of NAACL 2022 Paper
(Inference using pretrained model on custom image)
Try Replicate web demo and docker image here

Code structure

# Configurations
./configs/
    # MLE
    phase1/
    # RL
    phase2/

# COCO caption evaluation
./cider
./coco-caption

# Preprocessing
./clip # CLIP feature extractor
./scripts # COCO preprocessing
./scripts_FineCapEval # FineCapEval preprocessing
./data # Storing preprocessed features

# Core model / Rewards / Data loading
./captioning

# Training / Evaluation
./tools

# Fine-tuning CLIP Text encoder
./retrieval

# Pretrained checkpoints
./save

# Storing original dataset files
./datasets

Setup

Install Dependencies

# Create python environment (optional)
conda create -n clip4caption python=3.7
source activate clip4caption

# python dependenceies
pip install -r requirements.txt

## Install this repo as package
pip install -e .

# Install Detectron2 (optional for training utilities)
pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu113/torch1.10/index.html

# Setup coco-caption (optional for text metric evaluation)
git clone https://github.com/clip-vil/cider
git clone https://github.com/clip-vil/coco-caption
bash get_stanford_models.sh
bash get_google_word2vec_model.sh

# Install java (optional for METEOR evaluation as part of text metrics)
sudo apt install default-jre

Download Pretrained models

We host model checkpoints via google drive. Download checkpoints as below. The .ckpt file size for captioning and CLIP models are 669.65M and 1.12G, respectively.

# Captioning model
./save/
    clipRN50_cider/
        clipRN50_cider-last.ckpt
    clipRN50_cider_clips/
        clipRN50_cider_clips-last.ckpt
    clipRN50_clips/
        clipRN50_clips-last.ckpt
    clipRN50_clips_grammar/
        clipRN50_clips_grammar-last.ckpt
    clipRN50_mle/
        clipRN50_mle-last.ckpt

# Finetuned CLIP Text encoder
./retrieval/
    save/
        clip_negative_text-last.ckpt

Dataset preparation

# Original dataset files - to be downloaded
./datasets/
    # Download from http://mscoco.org/dataset/#download
    COCO/
        images/
            train2014/
            val2014/
        annotations/
            captions_train2014.json
            captions_val2014.json

    # Download from https://drive.google.com/drive/folders/1jlwInAsVo-PdBdJlmHKPp34dLnxIIMLx
    FineCapEval/
        images/
            XXX.jpg

MS COCO

Download files

./datasets/
    # Download from http://mscoco.org/dataset/#download
     COCO/
        images/
            train2014/
            val2014/
        annotations/
            captions_train2014.json
            captions_val2014.json


./data/
    # Download from http://cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip
    dataset_coco.json

    # Download from from https://drive.google.com/drive/folders/1eCdz62FAVCGogOuNhy87Nmlo5_I0sH2J
    coco-train-words.p

Text processing

python scripts/prepro_labels.py --input_json data/dataset_coco.json --output_json data/cocotalk.json --output_h5 data/cocotalk

Visual feature extraction

python scripts/clip_prepro_feats.py --input_json data/dataset_coco.json --output_dir data/cocotalk --images_root datasets/COCO/images --model_type RN50

# optional (n_jobs)
--n_jobs 4 --job_id 0

Visual fetaure extraction for CLIP-S Reward

python scripts/clipscore_prepro_feats.py --input_json data/dataset_coco.json --output_dir data/cocotalk --images_root datasets/COCO/images

# optional (n_jobs)
--n_jobs 4 --job_id 0

FineCapEval

Download files from https://drive.google.com/drive/folders/1jlwInAsVo-PdBdJlmHKPp34dLnxIIMLx?usp=sharing

./datasets/
    FineCapEval/
        images/
            XXX.jpg

./data/
    FineCapEval.json
    FineCapEval.csv

Visual feature extraction

python scripts_FineCapEval/clip_prepro_feats.py --input_json data/FineCapEval.json --output_dir data/FineCapEval --images_root datasets/FineCapEval/images --model_type RN50

# optional (n_jobs)
--n_jobs 4 --job_id 0

Training and Evaluation

1) MLE training

export MLE_ID='clipRN50_mle'

# Training
python tools/train_pl.py --cfg configs/phase1/$MLE_ID.yml --id $MLE_ID

# Evaluation
EVALUATE=1 python tools/train_pl.py --cfg configs/phase1/$MLE_ID.yml --id $MLE_ID

# Text-to-Iage Retrieval with CLIP VIT-B/32
python tools/eval_clip_retrieval.py --gen_caption_path "./eval_results/$MLE_ID.json"

# Evaluation on FineCapEval
python tools/finecapeval_inference.py --reward mle
python tools/eval_finecapeval.py --generated_id2caption ./FineCapEval_results/clipRN50_mle.json

2) RL finetuning

Reward: CIDEr

export REWARD='cider'
export MLE_ID='clipRN50_mle'
export RL_ID='clipRN50_'$REWARD

# Copy MLE checkpoint as starting point of RL finetuning
mkdir save/$RL_ID
cp save/$MLE_ID/$MLE_ID-last.ckpt save/$RL_ID/$RL_ID-last.ckpt

# Training
python tools/train_pl.py --cfg configs/phase2/$RL_ID.yml --id $RL_ID

# Evaluation
EVALUATE=1 python tools/train_pl.py --cfg configs/phase2/$RL_ID.yml --id $RL_ID

# Text-to-Iage Retrieval with CLIP VIT-B/32
python tools/eval_clip_retrieval.py --gen_caption_path "./eval_results/$RL_ID.json"

# Evaluation on FineCapEval
python tools/finecapeval_inference.py --reward $REWARD
python tools/eval_finecapeval.py --generated_id2caption ./FineCapEval_results/$RL_ID.json

Reward: CLIP-S

export REWARD='clips'
export MLE_ID='clipRN50_mle'
export RL_ID='clipRN50_'$REWARD

# Copy MLE checkpoint as starting point of RL finetuning
mkdir save/$RL_ID
cp save/$MLE_ID/$MLE_ID-last.ckpt save/$RL_ID/$RL_ID-last.ckpt

# Training
python tools/train_pl.py --cfg configs/phase2/$RL_ID.yml --id $RL_ID

# Evaluation
EVALUATE=1 python tools/train_pl.py --cfg configs/phase2/$RL_ID.yml --id $RL_ID

# Text-to-Iage Retrieval with CLIP VIT-B/32
python tools/eval_clip_retrieval.py --gen_caption_path "./eval_results/$RL_ID.json"

# Evaluation on FineCapEval
python tools/finecapeval_inference.py --reward $REWARD
python tools/eval_finecapeval.py --generated_id2caption ./FineCapEval_results/$RL_ID.json

Reward: CLIP-S + CIDEr

export REWARD='clips_cider'
export MLE_ID='clipRN50_mle'
export RL_ID='clipRN50_'$REWARD

# Copy MLE checkpoint as starting point of RL finetuning
mkdir save/$RL_ID
cp save/$MLE_ID/$MLE_ID-last.ckpt save/$RL_ID/$RL_ID-last.ckpt

# Training
python tools/train_pl.py --cfg configs/phase2/$RL_ID.yml --id $RL_ID

# Evaluation
EVALUATE=1 python tools/train_pl.py --cfg configs/phase2/$RL_ID.yml --id $RL_ID

# Text-to-Iage Retrieval with CLIP VIT-B/32
python tools/eval_clip_retrieval.py --gen_caption_path "./eval_results/$RL_ID.json"

# Evaluation on FineCapEval
python tools/finecapeval_inference.py --reward $REWARD
python tools/eval_finecapeval.py --generated_id2caption ./FineCapEval_results/$RL_ID.json

Reward: CLIP-S + Grammar

Run CLIP Finetuning (for grammar) following ./retrieval/README.md
Run RL training using the updated CLIP

export REWARD='clips_grammar'
export MLE_ID='clipRN50_mle'
export RL_ID='clipRN50_'$REWARD

# Copy MLE checkpoint as starting point of RL finetuning
mkdir save/$RL_ID
cp save/$MLE_ID/$MLE_ID-last.ckpt save/$RL_ID/$RL_ID-last.ckpt

# Training
python tools/train_pl.py --cfg configs/phase2/$RL_ID.yml --id $RL_ID

# Evaluation
EVALUATE=1 python tools/train_pl.py --cfg configs/phase2/$RL_ID.yml --id $RL_ID

# Text-to-Iage Retrieval with CLIP VIT-B/32
python tools/eval_clip_retrieval.py --gen_caption_path "./eval_results/$RL_ID.json"

# Evaluation on FineCapEval
python tools/finecapeval_inference.py --reward $REWARD
python tools/eval_finecapeval.py --generated_id2caption ./FineCapEval_results/$RL_ID.json

Acknowledgments

We thank the developers of CLIP-ViL, ImageCaptioning.pytorch, CLIP, coco-caption, cider for their public code release.

Reference

Please cite our paper if you use our models in your works:

@inproceedings{Cho2022CLIPReward,
  title     = {Fine-grained Image Captioning with CLIP Reward},
  author    = {Jaemin Cho and Seunghyun Yoon and Ajinkya Kale and Franck Dernoncourt and Trung Bui and Mohit Bansal},
  booktitle = {Findings of NAACL},
  year      = {2022}
}

murilo / clip-caption-reward Goto Github PK

clip-caption-reward's Introduction