Image captioning refers to the process of describing an image with a sentence. By asking computers to generate captions for an input image, we may find ways to improve quality of life for people with visual impairments. Image captioning could also be used to build and optimize search engines for image content.
For our final project for DL Summer '22, we built a CNN-RNN model that automatically generates image descriptions.
CNN-RNN model: the image is encoded into a context vector by a CNN, which is then passed to an RNN decoder [1]
Here, we use the ResNet-18 model for our CNN encoder.
We used the Flickr image dataset from Kaggle, with a total of 8,000 images and corresponding human-annotated captions.
We followed these steps to create our final model:
- Tokenizing the captions (we used spaCy)
- Image augmentation
- Processing images as input for the CNN (ResNet-18)
- CNN -> LSTM
- Using the LSTM output to generate the result captions
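The tokenization step can be sketched with spaCy's blank English tokenizer (a full pipeline such as en_core_web_sm would also work; the special-token names and min_freq threshold here are illustrative):

```python
from collections import Counter
import spacy

nlp = spacy.blank("en")  # tokenizer only, no model download required

def tokenize(caption):
    return [tok.text.lower() for tok in nlp.tokenizer(caption)]

def build_vocab(captions, min_freq=2):
    """Map tokens to integer ids; rare words fall back to <unk>."""
    counts = Counter(tok for cap in captions for tok in tokenize(cap))
    vocab = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "<unk>": 3}
    for tok, freq in counts.items():
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab

captions = ["A dog runs on the grass .", "A dog jumps over a log ."]
vocab = build_vocab(captions, min_freq=2)
ids = [vocab.get(t, vocab["<unk>"]) for t in tokenize(captions[0])]
```

Each caption is then wrapped in `<sos>`/`<eos>` tokens and padded to a common length before batching.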
Link to the notebook here. Link to the helper functions and model files.
Average log loss: 3.3332 after a few epochs
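For context, the average log loss here is the per-word cross-entropy over the vocabulary. With uniform (all-zero) logits over a hypothetical 3,000-word vocabulary, the baseline loss would be ln(3000) ≈ 8.0, so 3.33 indicates the model has learned substantially more than chance:

```python
import math
import torch
import torch.nn as nn

vocab_size = 3000  # illustrative, not our actual vocabulary size
criterion = nn.CrossEntropyLoss()

# Uniform (all-zero) logits over the vocabulary: the "knows nothing" baseline
logits = torch.zeros(8, vocab_size)           # 8 target words in the batch
targets = torch.randint(0, vocab_size, (8,))
baseline = criterion(logits, targets).item()  # equals ln(vocab_size)
```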
- The model can only use words from the vocabulary it was trained on
- The loss function only penalizes getting individual words wrong; there is no notion of meaning retention
- It cannot handle grammar
- It needs more epochs (research groups normally take days to train their models)
- Increase LSTM layering
- Add attention
- Use subword tokenizers
- Train for more epochs
- Use a different loss function, for example BLEU or embedding comparisons
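Note that BLEU is not differentiable, so it would serve as an evaluation metric (or as a reward signal in a reinforcement-learning setup) rather than a drop-in training loss. A quick sketch of scoring a generated caption with NLTK:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "a dog runs across the green grass".split()
generated = "a dog runs across the grass".split()

smooth = SmoothingFunction().method1  # avoids zero scores on short sentences
score = sentence_bleu([reference], generated, smoothing_function=smooth)
perfect = sentence_bleu([reference], reference)  # identical caption scores 1.0
```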
Some other applications using captions could be:
- Given a sentence/query, find the matching images.
- Object detection
Our Slides for the final presentation
Python==3.7.13
jupyter==1.0.0
matplotlib==3.2.2
numpy==1.21.6
pandas==1.3.5
Pillow==7.1.2
spacy==3.3.1
torch @ https://download.pytorch.org/whl/cu113/torch-1.11.0%2Bcu113-cp37-cp37m-linux_x86_64.whl
torchaudio @ https://download.pytorch.org/whl/cu113/torchaudio-0.11.0%2Bcu113-cp37-cp37m-
tqdm==4.64.0