image-captioner

Image Captioning using Encoder-Decoder

https://imagecaptioner.herokuapp.com/

An image captioning application based on the Neural Image Caption model. It uses an encoder-decoder architecture, with a pretrained CNN as the encoder and an LSTM as the decoder.

Overview

Recurrent Neural Networks (RNNs) are used in a wide range of applications, including machine translation. The encoder-decoder architecture suits settings where a variable-length input sequence is mapped to a variable-length output sequence. The same kind of network can also be used for image captioning.

In image captioning, the core idea is to use a CNN as the encoder and an RNN as the decoder. This application uses the architecture proposed in Show and Tell: A Neural Image Caption Generator.

[Figure: image captioner structure]

Here's an excerpt from the paper:

Here, we propose to follow this elegant recipe, replacing the encoder RNN by a deep convolution neural network (CNN). Over the last few years it has been convincingly shown that CNNs can produce a rich representation of the input image by embedding it to a fixed-length vector, such that this representation can be used for a variety of vision tasks. Hence, it is natural to use a CNN as an image “encoder”, by first pre-training it for an image classification task and using the last hidden layer as an input to the RNN decoder that generates sentences. We call this model the Neural Image Caption, or NIC.

Implementation

This image-captioner application is developed using PyTorch and Django. All the code related to model implementation is in the pytorch directory.

  • Dataset used: MS-COCO dataset
  • Vocabulary: The vocabulary is a mapping between words and indices.
  • Encoder: A ResNet152 model pretrained on ImageNet is used as the encoder.
  • Decoder: An LSTM (Long Short-Term Memory) network is used as the decoder. The decoder is given a special <start> token to mark the start of a sentence and an <end> token to mark its end. A plain RNN computes its next hidden state from two terms, each with its own weight matrix: the input at the current time-step and the hidden state from the previous time-step. To condition on the image, we add a third term, the image features multiplied by a third weight matrix. We then sample a word from the vocabulary at every time-step.
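
The vocabulary mapping and the per-time-step sampling described above can be sketched in plain Python. This is a minimal illustration, not the code in the pytorch directory: the `Vocabulary` class, the `greedy_decode` helper, the `step` callable, and the `<unk>` token for unknown words are all assumptions introduced for the example. Greedy selection (always taking the highest-scoring word) stands in for whatever sampling strategy the real decoder uses.

```python
from typing import Callable, Dict, List, Sequence

# <start> and <end> come from the description above; <unk> is an assumption.
START, END, UNK = "<start>", "<end>", "<unk>"


class Vocabulary:
    """Two-way mapping between words and integer indices."""

    def __init__(self, words: Sequence[str]) -> None:
        self.word2idx: Dict[str, int] = {}
        self.idx2word: List[str] = []
        # Special tokens are added first so their indices stay fixed.
        for word in (START, END, UNK, *words):
            self.add(word)

    def add(self, word: str) -> None:
        if word not in self.word2idx:
            self.word2idx[word] = len(self.idx2word)
            self.idx2word.append(word)

    def encode(self, caption: str) -> List[int]:
        """Wrap a caption in <start>/<end> and map each word to its index."""
        tokens = [START, *caption.lower().split(), END]
        unk = self.word2idx[UNK]
        return [self.word2idx.get(token, unk) for token in tokens]

    def decode(self, indices: Sequence[int]) -> str:
        """Map indices back to words, dropping <start> and <end>."""
        words = [self.idx2word[i] for i in indices]
        return " ".join(w for w in words if w not in (START, END))


def greedy_decode(
    step: Callable[[List[int]], List[float]],
    vocab: Vocabulary,
    max_len: int = 20,
) -> List[int]:
    """Pick the highest-scoring word at each time-step until <end> or max_len.

    `step` is a hypothetical stand-in for the decoder: given the indices
    generated so far, it returns one score per vocabulary entry.
    """
    sequence = [vocab.word2idx[START]]
    end = vocab.word2idx[END]
    for _ in range(max_len):
        scores = step(sequence)
        best = max(range(len(scores)), key=scores.__getitem__)
        sequence.append(best)
        if best == end:
            break
    return sequence
```

With `vocab = Vocabulary(["a", "dog", "runs"])`, `vocab.encode("A dog runs")` returns `[0, 3, 4, 5, 1]`, and `vocab.decode` on that list gives back `"a dog runs"`. In the real model, `step` would run the LSTM on the image features and the words generated so far.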

Running locally

  1. Clone the repo

    $ git clone https://github.com/kHarshit/image-captioner.git
    $ cd image-captioner
    
  2. (Optional) Create a virtual environment using either conda or virtualenv

    $ conda env create -f environment.yaml
    
  3. Install the dependencies

    $ pip install -r requirements.txt
    
  4. Run server

    $ python manage.py runserver
    

Your app should now be running on localhost:8000.

Input Output
demo1.png demo2.png

Image captioning with Attention

The problem with the encoder-decoder approach is that all the input information must be compressed into a fixed-length context vector. This makes it difficult for the network to cope with large amounts of input information (e.g. long sentences in text) and to produce good results from that context vector alone. With an attention mechanism, the encoder CNN produces a grid of vectors instead of a single context vector summarizing the input image. At each time-step, in addition to sampling the vocabulary, the decoder produces a distribution over locations in the image, indicating where the model is looking, so that it focuses on one part of the image at a time. The idea is described in Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.
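
One attention step over a grid of image vectors might look like the sketch below. It is an illustration under stated assumptions rather than the paper's method or this repository's code: the `attend` function is hypothetical, and it scores each location with a plain dot product against the decoder's hidden state, whereas the paper uses a learned scoring function.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax: turns scores into a distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(
    features: Sequence[Sequence[float]],
    hidden: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Return (attention weights, context vector) for one decoding step.

    features: one vector per image location (the flattened CNN grid).
    hidden:   the decoder's current hidden state, same length as each vector.
    """
    # Assumption: dot-product scoring, chosen only to keep the sketch short.
    scores = [sum(f * h for f, h in zip(vec, hidden)) for vec in features]
    # Distribution over image locations: where the model "looks".
    weights = softmax(scores)
    # Context vector: attention-weighted average of the location vectors.
    dim = len(features[0])
    context = [
        sum(w * vec[d] for w, vec in zip(weights, features))
        for d in range(dim)
    ]
    return weights, context
```

The weights always sum to 1. When the hidden state is all zeros, every location gets equal weight; a hidden state aligned with one location's vector pulls nearly all the weight onto that location, and the context vector follows it.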
