Image captioning refers to the process of describing an image with a sentence. By asking computers to generate captions for an input image, we may find ways to improve quality of life for people with visual impairments. Image captioning could also be used to build and optimize search engines for image content.
For our final project for DL Summer '22, we built a CNN-RNN model that automatically generates image descriptions.
CNN-RNN model: the image is encoded into a context vector by a CNN, which is then passed to an RNN decoder [1]
Here, we use the ResNet-18 model for our CNN encoder.
We used the Flickr image dataset from Kaggle, with a total of 8,000 images and corresponding human-annotated captions.
We followed these steps to create our final model:
- Tokenizing the captions (we used spaCy)
- Image augmentation
- Processing images as input for the CNN (ResNet-18)
- CNN -> LSTM
- Using the LSTM output to generate the result captions
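The tokenization step can be sketched with spaCy's blank English tokenizer (a full pipeline such as en_core_web_sm would also work; the special-token names and min_freq threshold here are illustrative):

```python
from collections import Counter
import spacy

nlp = spacy.blank("en")  # tokenizer only, no model download required

def tokenize(caption):
    return [tok.text.lower() for tok in nlp.tokenizer(caption)]

def build_vocab(captions, min_freq=2):
    """Map tokens to integer ids; rare words fall back to <unk>."""
    counts = Counter(tok for cap in captions for tok in tokenize(cap))
    vocab = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "<unk>": 3}
    for tok, freq in counts.items():
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab

captions = ["A dog runs on the grass .", "A dog jumps over a log ."]
vocab = build_vocab(captions, min_freq=2)
ids = [vocab.get(t, vocab["<unk>"]) for t in tokenize(captions[0])]
```

Each caption is then wrapped in `<sos>`/`<eos>` tokens and padded to a common length before batching.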
Link to the notebook here. Link to the helper functions and model files.
Average log loss: 3.3332 after a few epochs
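For context, the average log loss here is the per-word cross-entropy over the vocabulary. With uniform (all-zero) logits over a hypothetical 3,000-word vocabulary, the baseline loss would be ln(3000) ≈ 8.0, so 3.33 indicates the model has learned substantially more than chance:

```python
import math
import torch
import torch.nn as nn

vocab_size = 3000  # illustrative, not our actual vocabulary size
criterion = nn.CrossEntropyLoss()

# Uniform (all-zero) logits over the vocabulary: the "knows nothing" baseline
logits = torch.zeros(8, vocab_size)           # 8 target words in the batch
targets = torch.randint(0, vocab_size, (8,))
baseline = criterion(logits, targets).item()  # equals ln(vocab_size)
```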
- The model can only use words from the vocabulary it was trained on
- The loss function only penalizes getting individual words wrong; there is no notion of meaning retention
- It cannot handle grammar
- It needs more epochs (research groups normally take days to train their models)
- Increase LSTM layering
- Add attention
- Use subword tokenizers
- Train for more epochs
- Use a different loss function, for example BLEU or embedding comparisons
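Note that BLEU is not differentiable, so it would serve as an evaluation metric (or as a reward signal in a reinforcement-learning setup) rather than a drop-in training loss. A quick sketch of scoring a generated caption with NLTK:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "a dog runs across the green grass".split()
generated = "a dog runs across the grass".split()

smooth = SmoothingFunction().method1  # avoids zero scores on short sentences
score = sentence_bleu([reference], generated, smoothing_function=smooth)
perfect = sentence_bleu([reference], reference)  # identical caption scores 1.0
```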
Some other applications using captions could be:
- Given a sentence/query, find the matching images.
- Object detection
Our Slides for the final presentation
Python==3.7.13
jupyter==1.0.0
matplotlib==3.2.2
numpy==1.21.6
pandas==1.3.5
Pillow==7.1.2
spacy==3.3.1
torch @ https://download.pytorch.org/whl/cu113/torch-1.11.0%2Bcu113-cp37-cp37m-linux_x86_64.whl
torchaudio @ https://download.pytorch.org/whl/cu113/torchaudio-0.11.0%2Bcu113-cp37-cp37m-
tqdm==4.64.0