visdial-challenge-starter-pytorch's Introduction

Visual Dialog Challenge Starter Code

PyTorch starter code for the Visual Dialog Challenge 2019.

If you use this code in your research, please consider citing:

@misc{desai2018visdialch,
  author =       {Karan Desai and Abhishek Das and Dhruv Batra and Devi Parikh},
  title =        {Visual Dialog Challenge Starter Code},
  howpublished = {\url{https://github.com/batra-mlp-lab/visdial-challenge-starter-pytorch}},
  year =         {2018}
}

What's new with v2019?

If you are a returning user (from Visual Dialog Challenge 2018), here are the key highlights of v2019 of this starter code:

  1. Almost a complete rewrite of v2018, which increased speed, readability, modularity and extensibility.
  2. Multi-GPU support - try out specifying GPU ids to train/evaluate scripts as: --gpu-ids 0 1 2 3
  3. Docker support - we provide a Dockerfile which can help you set up all the dependencies with ease.
  4. Stronger baseline - our Late Fusion Encoder is equipped with Bottom-up Top-Down attention. We also provide pre-extracted image features (links below).
  5. Minimal pre-processed data - no need to download tens of pre-processed data files anymore (which were typically referred to as visdial_data.h5 and visdial_params.json).

Setup and Dependencies

This starter code is implemented using PyTorch v1.0 and provides out-of-the-box support for CUDA 9 and cuDNN 7. There are two recommended ways to set up this codebase: Anaconda or Miniconda, and Docker.

Anaconda or Miniconda

  1. Install the Anaconda or Miniconda distribution based on Python 3+ from their downloads site.
  2. Clone this repository and create an environment:
git clone https://www.github.com/batra-mlp-lab/visdial-challenge-starter-pytorch
conda create -n visdialch python=3.6

# activate the environment and install all dependencies
conda activate visdialch
cd visdial-challenge-starter-pytorch/
pip install -r requirements.txt

# install this codebase as a package in development mode
python setup.py develop
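
After installation, you can optionally run a quick sanity check from the activated environment to confirm that PyTorch 1.0 and CUDA are visible (just a check, not part of the setup):

# sanity_check.py (optional)
import torch

print("torch version:", torch.__version__)           # expected: 1.0.x
print("CUDA available:", torch.cuda.is_available())  # True if CUDA 9 + cuDNN 7 are set up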

Note: Docker setup is necessary if you wish to extract image features using Detectron.

Docker

We provide a Dockerfile which creates a light-weight image with all the dependencies installed.

  1. Install nvidia-docker, which enables usage of GPUs from inside a container.
  2. Build the image as:
cd docker
docker build -t visdialch .
  3. Run this image in a container by setting the user and group, attaching the project root (this codebase) as a volume, and setting the shared memory size according to your requirements (depends on the memory usage of your model).
nvidia-docker run -u $(id -u):$(id -g) \
                  -v $PROJECT_ROOT:/workspace \
                  --shm-size 16G visdialch /bin/bash

We recommend this development workflow; attaching the codebase as a volume immediately reflects source code changes inside the container environment. We also recommend keeping all source code for data loading, models and other utilities inside the visdialch directory. Since it is a setuptools-style package, it makes handling absolute/relative imports and module resolution less painful. Scripts using visdialch can be created anywhere in the filesystem, as long as the current conda environment is active.
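
For example, once the environment is active, a throwaway script anywhere on disk can import from the package. The module path below mirrors visdialch/data/readers.py referenced elsewhere on this page; treat the exact names as illustrative:

# illustrative usage of the visdialch package from an arbitrary script location
from visdialch.data.readers import DialogsReader

reader = DialogsReader("data/visdial_1.0_val.json")  # path relative to $PROJECT_ROOT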

Download Data

  1. Download the VisDial v1.0 dialog json files from here and keep them under the $PROJECT_ROOT/data directory, so the default arguments work.

  2. Get the word counts for VisDial v1.0 train split here. They are used to build the vocabulary.

  3. We also provide pre-extracted image features of VisDial v1.0 images, using a Faster R-CNN pre-trained on Visual Genome. If you wish to extract your own image features, skip this step and download the VisDial v1.0 images from here instead. Extracted features for v1.0 train, val and test are available for download at these links.

  4. We also provide pre-extracted FC7 features from VGG16, although v2019 of this codebase does not use them anymore.

Training

This codebase supports both generative and discriminative decoding; read more here. For reference, we provide the Late Fusion Encoder from the Visual Dialog paper.

We provide a training script which accepts arguments as config files. The config file should contain arguments specific to a particular experiment, such as those defining the model architecture or optimization hyperparameters. Other arguments, such as GPU ids or the number of CPU workers, are declared in the script and passed as argparse-style arguments.
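
As a rough sketch of this split (not the repository's exact train.py), experiment settings come from YAML while runtime settings come from argparse:

# minimal sketch of the config-file + argparse pattern described above
import argparse

import yaml

parser = argparse.ArgumentParser()
parser.add_argument("--config-yml", default="configs/lf_disc_faster_rcnn_x101.yml")
parser.add_argument("--gpu-ids", nargs="+", type=int, default=[0])
parser.add_argument("--cpu-workers", type=int, default=4)
args = parser.parse_args()

with open(args.config_yml) as f:
    config = yaml.safe_load(f)

# model/optimization hyperparameters live in the YAML, e.g.:
print(config["model"]["lstm_hidden_size"], config["solver"]["batch_size"])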

Train the baseline model provided in this repository as:

python train.py --config-yml configs/lf_disc_faster_rcnn_x101.yml --gpu-ids 0 1 # provide more ids for multi-GPU execution other args...

To extend this starter code, add your own encoder/decoder modules into their respective directories and include their names as choices in your config file. We have an --overfit flag, which can be useful for rapid debugging. It takes a batch of 5 examples and overfits the model on them.
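
As a purely illustrative skeleton (the class name and forward signature are hypothetical, not the codebase's interface), a new encoder module might look like this before being wired into the config:

import torch
import torch.nn as nn

class MyEncoder(nn.Module):
    """Toy encoder: pools bottom-up image features into a single vector."""

    def __init__(self, img_feature_size: int = 2048, hidden_size: int = 512):
        super().__init__()
        self.projection = nn.Linear(img_feature_size, hidden_size)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch_size, num_proposals, img_feature_size)
        return self.projection(image_features.mean(dim=1))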

Saving model checkpoints

This script will save model checkpoints at every epoch to the path specified by --save-dirpath. Refer to visdialch/utils/checkpointing.py for more details on how checkpointing is managed.
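
A checkpoint saved this way can be inspected or reloaded with plain PyTorch. A small sketch follows; the file name and dictionary keys are assumptions, so check checkpointing.py for the actual layout:

import torch

# hypothetical checkpoint path; inspect the stored keys before loading them
ckpt = torch.load("checkpoints/checkpoint_9.pth", map_location="cpu")
print(list(ckpt.keys()))
# typically followed by something like model.load_state_dict(ckpt["model"])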

Logging

We use Tensorboard for logging training progress. Recommended: execute tensorboard --logdir /path/to/save_dir --port 8008 and visit localhost:8008 in the browser.
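
If you want to log your own scalars, here is a minimal sketch using tensorboardX (which is compatible with PyTorch 1.0; whether the starter code uses this exact package internally is an assumption):

from tensorboardX import SummaryWriter

writer = SummaryWriter("checkpoints/")  # same directory you pass to --logdir
writer.add_scalar("train/loss", 0.42, global_step=100)
writer.close()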

Evaluation

Evaluation of a trained model checkpoint can be done as follows:

python evaluate.py --config-yml /path/to/config.yml --load-pthpath /path/to/checkpoint.pth --split val --gpu-ids 0

This will generate an EvalAI submission file and report the metrics from the Visual Dialog paper (mean reciprocal rank, R@{1, 5, 10}, mean rank), as well as Normalized Discounted Cumulative Gain (NDCG), introduced in the first Visual Dialog Challenge (2018).

The metrics reported here would be the same as those reported through EvalAI by making a submission in val phase. To generate a submission file for test-std or test-challenge phase, replace --split val with --split test.
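
For intuition, the rank-based metrics can be reproduced from the rank of the ground-truth answer among the 100 candidate options; a rough sketch (not the evaluation code itself):

import torch

def sketch_rank_metrics(gt_ranks: torch.Tensor) -> dict:
    # gt_ranks: rank of the ground-truth answer per question, 1 = ranked first
    gt_ranks = gt_ranks.float()
    return {
        "r@1": (gt_ranks <= 1).float().mean().item(),
        "r@5": (gt_ranks <= 5).float().mean().item(),
        "r@10": (gt_ranks <= 10).float().mean().item(),
        "mean_rank": gt_ranks.mean().item(),
        "mrr": (1.0 / gt_ranks).mean().item(),
    }

print(sketch_rank_metrics(torch.tensor([1, 3, 12, 60])))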

Results and pretrained checkpoints

Performance on v1.0 test-std (trained on v1.0 train + val):

Model R@1 R@5 R@10 MeanR MRR NDCG
lf-disc-faster-rcnn-x101 0.4617 0.7780 0.8730 4.7545 0.6041 0.5162
lf-gen-faster-rcnn-x101 0.3620 0.5640 0.6340 19.4458 0.4657 0.5421

Acknowledgements

  • This starter code began as a fork of batra-mlp-lab/visdial-rl. We thank the developers for doing most of the heavy-lifting.
  • The Lua-torch codebase of Visual Dialog, at batra-mlp-lab/visdial, served as an important reference while developing this codebase.
  • Some documentation and design strategies of the Metric, Reader and Vocabulary classes are inspired by AllenNLP. It is not a dependency, because the use case in this codebase is currently too limited to justify one.

visdial-challenge-starter-pytorch's People

Contributors

abhshkdz, dependabot[bot], hsm207, shubhamagarwal92, yashkant

visdial-challenge-starter-pytorch's Issues

About image features

Hello! Thank you for providing the image features. However, I could not find the bounding box information for these features. Could you provide the image features along with their bounding boxes? Thanks a lot.

Training step is too slow

Hi,
Thank you for your code.
As I dug deeper into this code, I found that the training step is particularly slow. The problem here (I guess) is the dataset construction, where too many operations (e.g., padding sequences, building the history) are implemented in __getitem__.
I wonder, have you tried wrapping these operations in the __init__ function? This might consume more memory but would definitely accelerate training.
Thanks.
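
To make the suggested trade-off concrete, here is a hedged sketch (not the repository's Dataset class) of padding once in __init__ instead of per item in __getitem__:

import torch
from torch.utils.data import Dataset

class PrepaddedDataset(Dataset):
    """Illustration only: precompute padded tensors up front (more memory),
    so __getitem__ reduces to cheap indexing (less CPU work per batch)."""

    def __init__(self, token_lists, max_len=20, pad_index=0):
        self.padded = torch.full((len(token_lists), max_len), pad_index, dtype=torch.long)
        for i, tokens in enumerate(token_lists):
            tokens = tokens[:max_len]
            self.padded[i, : len(tokens)] = torch.tensor(tokens, dtype=torch.long)

    def __len__(self):
        return self.padded.size(0)

    def __getitem__(self, index):
        return self.padded[index]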

Extracting Actual Images

Could you elaborate on the relationship between the image_ids in the new dataset and the COCO image_ids? We are trying to visualize some of the images using a script hooked into the COCO API, but there seems to be no correlation between the image_ids used here and the ones in COCO.
Is there something we're missing?

extract image features

Hi, thanks for sharing the Visual Dialog challenge code. If I extract image features by myself, where can I get "config_faster_rcnn_x101.yaml" and "model_faster_rcnn_x101.pkl"?

Need suggestion about embeddings

I am trying to use ELMo embeddings from AllenNLP and need some suggestions.

In the for loop of __getitem__, before the question is converted to indices, I also save the raw question:

dialog[i]["raw_question"] = dialog[i]["question"] # Tokenized

which can then be converted to char_ids and an ELMo embedding:

        ques_char_ids = batch_to_ids([dialog_round["raw_question"] for dialog_round in dialog])
        ques_elmo_emb = self._elmo_wrapper(ques_char_ids)


    def _elmo_wrapper(self, char_ids, max_sequence_length = None):
        # Refer: https://github.com/allenai/allennlp/issues/2659
        """
        Parameters
        ----------
        char_ids : torch.Tensor
            char ids of the raw sequences

        Returns
        -------
        torch.Tensor
            Tensor of sequences padded to max length

        """
        if not max_sequence_length:
            max_sequence_length = self.config["max_sequence_length"]
        # with torch.no_grad():
        #     elmo_seq = self.elmo(char_ids)['elmo_representations'][0]
        # elmo_seq = self.elmo(char_ids)['elmo_representations'][0].requires_grad_(False)
        elmo_seq = self.elmo(char_ids)['elmo_representations'][0].detach()
        batch_size, timesteps, emb_dim  = elmo_seq.size()
        if timesteps > max_sequence_length:
            elmo_emb = elmo_seq[:, :max_sequence_length, :]
        else:
            # Pad zeros
            zeroes_size = max_sequence_length - elmo_seq.size(1)
            zeros = torch.zeros(batch_size, zeroes_size, emb_dim).type_as(elmo_seq)
            elmo_emb = torch.cat([elmo_seq, zeros], 1)

        return elmo_emb

However, training becomes too slow. Do you have any experience with ELMo, and can you suggest why this is happening?

I think one possible workaround is to extract and save the embeddings as a pre-processing step. Could you please share your data generation scripts?
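
A rough sketch of that pre-extraction idea, caching ELMo outputs to disk with h5py (the file names and the choice of h5py are assumptions, not taken from the codebase):

import h5py
import torch
from allennlp.modules.elmo import Elmo, batch_to_ids

# placeholder paths to the standard ELMo option/weight files
elmo = Elmo("elmo_options.json", "elmo_weights.hdf5",
            num_output_representations=1, dropout=0.0)

questions = [["is", "it", "sunny"], ["how", "many", "people", "are", "there"]]
with torch.no_grad():
    embeddings = elmo(batch_to_ids(questions))["elmo_representations"][0]

with h5py.File("data/elmo_question_embeddings.h5", "w") as f:
    f.create_dataset("questions", data=embeddings.numpy())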

generative decoder

Thank you for your code. Has the author tried using the generative decoder?

How to get multi-gpu to work?

I created a Compute Engine instance on Google Cloud with 4 K80 GPUs, followed the instructions in the repo to set up the Anaconda environment, and downloaded the data. I ran training with:

python train.py --gpu-ids 0 1 2 3

The batch_size is 128 and cpu_workers is 4.

During training, I used nvidia-smi and could see that all 4 GPUs are utilized (but rarely at 100%). Furthermore, the seconds per iteration are a lot worse compared to a single GPU (8 vs 2).

What other configs should I adjust to get a speedup from using multiple GPUs?

Softmax dimension is wrong.

I think there is a critical mistake in the decoder code.

The dimension used for the softmax is wrong.

Since the score tensor's size is [batch_size x answer_options], the dimension should be changed (dim 0 -> dim 1).
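
For reference, the shape in question can be sanity-checked in isolation (illustration only, not the decoder's code):

import torch

scores = torch.randn(128, 100)        # (batch_size, answer_options)
probs = torch.softmax(scores, dim=1)  # normalize over the 100 answer options
print(probs.sum(dim=1)[:3])           # each row sums to 1 when dim=1 is used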

Torch=1.0.0 is not found

ERROR: Could not find a version that satisfies the requirement torch==1.0.0 (from -r requirements.txt (line 10)) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
ERROR: No matching distribution found for torch==1.0.0 (from -r requirements.txt (line 10))?

The 'answer' would be 0 if the answer is one word

[dialog_round["answer"][:-1] for dialog_round in dialog]

Hi, the code here confuses me. Since 'dialog_round["answer"][:-1]' and 'dialog_round["answer"][1:]' ignore the last and the first word respectively, if the answer is one word, the 'answers_in' and 'answers_out' would be '0'. In this situation, the model would not learn anything from this sample.
I am not sure if I am understanding this right; looking forward to your reply.
Thank you.
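
The slicing being discussed is easy to check in isolation; what happens for a one-word answer depends on whether boundary tokens are added beforehand (the <S>/</S> tokens below are purely hypothetical):

answer = ["yes"]                 # one-word tokenized answer, no boundary tokens
print(answer[:-1], answer[1:])   # [] [] -> both input and target are empty

answer = ["<S>", "yes", "</S>"]  # hypothetical: with start/end tokens added first
print(answer[:-1], answer[1:])   # ['<S>', 'yes'] ['yes', '</S>']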

Bounding box coordinates

Hi @kdexd ,

Is it possible to release the bounding box information (coordinates/labels) of the Detectron features, so that these features can be mapped back to the original images?

Thanks.

missing file 'data/visdial_1.0_train.json' when running train.py

Thanks for posting the Visual Dialog challenge code. When going through the readme file, I could follow it up to the step where we invoke training. When running
python train.py --config-yml configs/lf_disc_faster_rcnn_x101.yml --gpu-ids 4 5 6 7
I get the following error. I cannot seem to find 'data/visdial_1.0_train.json'

(visdialch) beymer@alm00:~/VisualDialog/visdial-challenge-starter-pytorch$ python train.py --config-yml configs/lf_disc_faster_rcnn_x101.yml --gpu-ids 4 5 6 7
dataset:
concat_history: true
image_features_test_h5: data/features_faster_rcnn_x101_test.h5
image_features_train_h5: data/features_faster_rcnn_x101_train.h5
image_features_val_h5: data/features_faster_rcnn_x101_val.h5
img_norm: 1
max_sequence_length: 20
vocab_min_count: 5
word_counts_json: data/visdial_1.0_word_counts_train.json
model:
decoder: disc
dropout: 0.5
encoder: lf
img_feature_size: 2048
lstm_hidden_size: 512
lstm_num_layers: 2
word_embedding_size: 300
solver:
batch_size: 128
initial_lr: 0.01
lr_gamma: 0.1
lr_milestones:
- 4
- 7
- 10
num_epochs: 20
training_splits: train
warmup_epochs: 1
warmup_factor: 0.2

config_yml : configs/lf_disc_faster_rcnn_x101.yml
train_json : data/visdial_1.0_train.json
val_json : data/visdial_1.0_val.json
val_dense_json : data/visdial_1.0_val_dense_annotations.json
gpu_ids : [4, 5, 6, 7]
cpu_workers : 4
overfit : False
validate : False
in_memory : False
save_dirpath : checkpoints/
load_pthpath :
Traceback (most recent call last):
File "train.py", line 104, in
config["dataset"], args.train_json, overfit=args.overfit, in_memory=args.in_memory
File "/home/beymer/VisualDialog/visdial-challenge-starter-pytorch/visdialch/data/dataset.py", line 26, in init
self.dialogs_reader = DialogsReader(dialogs_jsonpath)
File "/home/beymer/VisualDialog/visdial-challenge-starter-pytorch/visdialch/data/readers.py", line 35, in init
with open(dialogs_jsonpath, "r") as visdial_file:
FileNotFoundError: [Errno 2] No such file or directory: 'data/visdial_1.0_train.json'

Why do we use max_sequence_length - 1 in dataset.py?

def _pad_sequences(self, sequences: List[List[int]]):
    """Given tokenized sequences (either questions, answers or answer
    options, tokenized in ``__getitem__``), padding them to maximum
    specified sequence length. Return as a tensor of size
    ``(*, max_sequence_length)``.

    This method is only called in ``__getitem__``, chunked out separately
    for readability.

    Parameters
    ----------
    sequences : List[List[int]]
        List of tokenized sequences, each sequence is typically a
        List[int].

    Returns
    -------
    torch.Tensor, torch.Tensor
        Tensor of sequences padded to max length, and length of sequences
        before padding.
    """

    for i in range(len(sequences)):
        sequences[i] = sequences[i][
            : self.config["max_sequence_length"] - 1
        ]
    sequence_lengths = [len(sequence) for sequence in sequences]

    # Pad all sequences to max_sequence_length.
    maxpadded_sequences = torch.full(
        (len(sequences), self.config["max_sequence_length"]),
        fill_value=self.vocabulary.PAD_INDEX,
    )
    padded_sequences = pad_sequence(
        [torch.tensor(sequence) for sequence in sequences],
        batch_first=True,
        padding_value=self.vocabulary.PAD_INDEX,
    )
    maxpadded_sequences[:, : padded_sequences.size(1)] = padded_sequences
    return maxpadded_sequences, sequence_lengths

VisDial v0.9

Hello.
I want to use VisDial v0.9.
So, I ran prepro.py -version 0.9 and got visdial_data.h5 and visdial_params.json.
But when I run train.py, I get this error.
What can I do to solve this problem?

Traceback (most recent call last):
File "train.py", line 146, in
for i, batch in enumerate(dataloader):
File "/home/ailab/anaconda2/envs/visdial-chal/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 188, in next
batch = self.collate_fn([self.dataset[i] for i in indices])
File "/home/ailab/anaconda2/envs/visdial-chal/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 188, in
batch = self.collate_fn([self.dataset[i] for i in indices])
File "/home/ailab/visdial-challenge-starter-pytorch/dataloader.py", line 164, in getitem
item['num_rounds'] = self.data[dtype + '_num_rounds'][idx]
IndexError: index 87666 is out of range for dimension 0 (of size 82783)

ffi.lua:56 expected align(#) on line 579

When running the command
th prepro_img_vgg16.lua -imageRoot ../image_root -gpuid 0

there are errors:

/home/denniswu/torch/install/bin/lua: .../denniswu/torch/install/share/lua/5.1/trepl/init.lua:389: .../denniswu/torch/install/share/lua/5.1/trepl/init.lua:389: ...me/denniswu/torch/install/share/lua/5.1/hdf5/ffi.lua:56: expected align(#) on line 579 stack traceback: [C]: in function 'error' .../denniswu/torch/install/share/lua/5.1/trepl/init.lua:389: in function 'require' prepro_img_vgg16.lua:3: in main chunk [C]: in function 'dofile' .../torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk

Has anyone met this problem?

Thanks in advance.

Shared memory issues with parallelization

Hi @kdexd

I am running into all kinds of shared memory errors after this commit 9c1ee36

pytorch/pytorch#8976
pytorch/pytorch#973

I guess this parallelization is not stable; sometimes it runs, while other times it breaks (even after trying possible solutions), such as:

import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')

# https://github.com/pytorch/pytorch/issues/973
# raise the soft limit on open file descriptors for the current process
import resource
rlimit = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (2048 * 4, rlimit[1]))

Is there a leak somewhere? Might be best to have a look.

RuntimeError: DataLoader worker (pid 22114) is killed by signal: Killed.

If I set cpu-workers to 4, then after hundreds of iterations I get the error “RuntimeError: DataLoader worker (pid 22114) is killed by signal: Killed.”

I searched related topics; some suggested cpu-workers=0. So I set it to 0, but after hundreds of iterations the process was still killed. This time only “Killed” is printed, with no other hints.

In the meantime, with cpu-workers=0, training is too slow, about 1.1~2 s/it.

In the end, I want to know how long it took you to train this model.

Tokenizing is slow

The tokenization process is too slow, especially for debugging. A debug option, or an option to load a pre-processed file, would be appreciated.

About concat_history in dataset.py

Hi, the concat_history flag in dataset.py is quite confusing.

I think if self.config.get("concat_history", True): would be correct
not if self.config.get("concat_history", False):.

Due to the code above, the dataloader returns the concatenated history if concat_history==False.
