
Video Caption with Neuraltalk2

General information

This is a code release for captioning videos using Neuraltalk2. We provide a way to extract deep image features with VGG-16 and to detect shot boundaries from those features. We can also finetune the MS-COCO model, annotate the key frames, and attach the resulting captions to the video sequence. A sample output can be found here on YouTube.

Steps to generate video captions

Follow the instructions and install all required libraries.

Below we show an example of generating captions from a test video sequence. (The test video is part of the 'santa' video which we show on YouTube.) You should be able to replicate the workflow with your own video.

Extract frames from the video.

We first extract the frames from the video using ffmpeg. There are a few parameters you need to set up: '-ss' denotes the starting time, '-t' is the duration of the video you want to process, and '-i' gives the input video; replace the directory and the video name with your own file. '-r' defines the frame rate, and here we use 5 frames per second. Finally, you define the output name pattern of the extracted image sequence in a new directory; the '%04d' pattern produces zero-padded names such as s0001.jpg.

ffmpeg -ss 00:00:00 -t 00:00:30 -i YOUR_WORKING_DIRECTORY/data/test.mp4 -r 5.0 YOUR_WORKING_DIRECTORY/data/santa/img/s%04d.jpg

Extract deep features (VGG-16) from the video frames that you just generated.

Besides the Caffe package, we use a pre-trained model called VGG-16, a very deep convolutional network with 16 weight layers. You should download the weights and layer configuration into your Caffe directory.

Now, you can extract the visual features from the video frames. We provide a script called 'caffe_feat.py' for that. You need to open the file and change 'caffe_root' and 'input_path' to your own directories. Then, run the following command.

python caffe_feat.py

It will generate a feature file called 'feat.txt' in svm-light format in the 'input_path' folder.
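For reference, below is a minimal sketch of what a script like 'caffe_feat.py' could look like, assuming the standard pycaffe API. The layer name ('fc7'), the prototxt/caffemodel file names, and the directory paths are assumptions you would adapt to your setup, not necessarily what the released script uses.

import os
import numpy as np
import caffe

caffe_root = '/path/to/caffe/'                         # assumption: your Caffe install
input_path = 'YOUR_WORKING_DIRECTORY/data/santa/img/'  # frames extracted above

# Load VGG-16 (file names are assumptions; use the deploy prototxt and weights you downloaded).
net = caffe.Net(caffe_root + 'models/vgg16/VGG_ILSVRC_16_layers_deploy.prototxt',
                caffe_root + 'models/vgg16/VGG_ILSVRC_16_layers.caffemodel',
                caffe.TEST)
net.blobs['data'].reshape(1, 3, 224, 224)

# Standard ImageNet preprocessing: channel-first, BGR, mean subtraction.
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))
transformer.set_mean('data', np.array([104.0, 117.0, 123.0]))
transformer.set_raw_scale('data', 255)
transformer.set_channel_swap('data', (2, 1, 0))

with open(os.path.join(input_path, 'feat.txt'), 'w') as out:
    for name in sorted(os.listdir(input_path)):
        if not name.endswith('.jpg'):
            continue
        img = caffe.io.load_image(os.path.join(input_path, name))
        net.blobs['data'].data[...] = transformer.preprocess('data', img)
        net.forward()
        feat = net.blobs['fc7'].data[0].flatten()
        # svm-light format: a label followed by 1-based index:value pairs.
        pairs = ' '.join('%d:%f' % (i + 1, v) for i, v in enumerate(feat) if v != 0)
        out.write('0 %s\n' % pairs)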

Find the key frames from the video.

Having extracted the visual features from all frames in the video, we can now find the key frames that separate the video shots. Change to your working directory and run the script below.

python caption.py 'YOUR_WORKING_DIRECTORY'  'genKeyframes'

A group of key frames will be stored under 'YOUR_WORKING_DIRECTORY/key/'.
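The idea behind the key frame selection can be sketched as follows: a shot boundary is declared wherever consecutive frame features differ strongly, and one frame per shot is kept. The cosine distance and the threshold value below are illustrative assumptions, not necessarily what 'caption.py' implements.

import numpy as np

def load_svmlight(path, dim=4096):               # fc7 has 4096 dimensions
    feats = []
    with open(path) as f:
        for line in f:
            vec = np.zeros(dim)
            for pair in line.split()[1:]:        # skip the leading label
                idx, val = pair.split(':')
                vec[int(idx) - 1] = float(val)
            feats.append(vec)
    return np.array(feats)

def key_frame_indices(feats, threshold=0.3):     # threshold is an assumption
    keys = [0]                                   # the first frame starts the first shot
    for i in range(1, len(feats)):
        a, b = feats[i - 1], feats[i]
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        if 1.0 - cos > threshold:                # large feature jump -> new shot
            keys.append(i)
    return keys

feats = load_svmlight('YOUR_WORKING_DIRECTORY/data/santa/img/feat.txt')
print(key_frame_indices(feats))                  # 0-based indices of the key frames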

Generate video captions from the key frames of the video.

Now, use the tools from Neuraltalk2 to generate captions from the key frames. Find the path to the installed Neuraltalk2 package and run 'eval.lua' as below:

th eval.lua -model /YOUR_NEURALTALK2_MODEL_PATH/model_coco.t7 -image_folder  YOUR_WORKING_PATH/data/santa/key  -num_images -1 > caplog.txt

Here, we create a log file to store the captions. This is a workaround (hack) for the 'vis.json' file that Neuraltalk2 generates by default. Below, we also make an additional edit to the log file in order to create the .srt file. Open the 'caplog.txt' file, remove the header and footer notes, and leave only the caption information, as in the sample below (or filter it automatically with the helper sketched after the sample):

cp "/homeappl/home/gcao/tmp/Video-Caption/data/santa/key/s0001.jpg" vis/imgs/img1.jpg	
image 1: a black and white photo of a car parked on the side of the road	
evaluating performance... 1/-1 (0.000000)	
cp "/homeappl/home/gcao/tmp/Video-Caption/data/santa/key/s0037.jpg" vis/imgs/img2.jpg	
image 2: an airplane is parked on the tarmac at an airport	
evaluating performance... 2/-1 (0.000000)	
cp "/homeappl/home/gcao/tmp/Video-Caption/data/santa/key/s0013.jpg" vis/imgs/img3.jpg	
image 3: a car is parked on the side of the road	
evaluating performance... 3/-1 (0.000000)	
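If you prefer not to edit the log by hand, a small filter like the one below (an assumption based on the log layout shown above) keeps only the 'cp', 'image N:' and 'evaluating performance' lines:

import re

with open('caplog.txt') as raw, open('caplog_clean.txt', 'w') as clean:
    for line in raw:
        # keep only the lines that look like the sample above
        if re.match(r'(cp "|image \d+:|evaluating performance)', line):
            clean.write(line)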

Create the srt file

Here, we want to create a caption file with time stamps corresponding to the video. Below is how we do that.

python caption.py 'YOUR_WORKING_PATH/'  'genSrt'
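As a rough illustration of what 'genSrt' could do (the exact logic of 'caption.py' may differ), the sketch below pairs each caption in the cleaned 'caplog.txt' with the frame number encoded in the key-frame file name, converts frame numbers to time stamps at the 5 fps used above, and writes a standard 'santa.srt'. The 3-second duration for the last caption is an arbitrary assumption.

import re

FPS = 5.0                      # frame rate used in the ffmpeg extraction step

def to_srt_time(seconds):
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3600000)
    m, ms = divmod(ms, 60000)
    s, ms = divmod(ms, 1000)
    return '%02d:%02d:%02d,%03d' % (h, m, s, ms)

entries = []                   # (frame number, caption) pairs
with open('caplog.txt') as log:
    frame = None
    for line in log:
        m = re.search(r's(\d+)\.jpg', line)          # frame number from the key-frame file name
        if m:
            frame = int(m.group(1))
        m = re.match(r'image \d+: (.*)', line.strip())
        if m and frame is not None:
            entries.append((frame, m.group(1)))

entries.sort()
with open('santa.srt', 'w') as srt:
    for i, (frame, caption) in enumerate(entries):
        start = (frame - 1) / FPS
        # each caption runs until the next key frame; the last one gets 3 seconds (assumption)
        end = (entries[i + 1][0] - 1) / FPS if i + 1 < len(entries) else start + 3.0
        srt.write('%d\n%s --> %s\n%s\n\n' % (i + 1, to_srt_time(start),
                                             to_srt_time(end), caption))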

Attach the caption to the original video

Finally, we can attach the captions to the video and save the result as 'capped.mp4'.

ffmpeg -i YOUR_WORKING_PATH/data/test.mp4 -vf subtitles=santa.srt capped.mp4

Voila! Now you can caption your videos with Neuraltalk2. Note that the subtitles you generated come from the pre-trained MS-COCO model provided by Karpathy. You can follow his instructions to train new language models. In the future, we may update this with our own model.
