
lip-reading-deeplearning's Introduction

Lip Reading - Cross Audio-Visual Recognition using 3D Convolutional Neural Networks - Official Project Page


This repository contains the code, developed in TensorFlow, for the following paper:


The input pipeline must be prepared by the user. This code provides an implementation of coupled 3D Convolutional Neural Networks for audio-visual matching; lip reading is one specific application of this work.

If you use this code, please consider citing the following paper:

@article{torfi20173d,
  title={3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition},
  author={Torfi, Amirsina and Iranmanesh, Seyed Mehdi and Nasrabadi, Nasser and Dawson, Jeremy},
  journal={IEEE Access},
  year={2017},
  publisher={IEEE}
  }

Table of Contents

Training

Lip Tracking Demo

Audio-visual recognition (AVR) has been considered a solution for speech recognition tasks when the audio is corrupted, as well as a visual recognition method for speaker verification in multi-speaker scenarios. The approach of AVR systems is to leverage the information extracted from one modality to improve the recognition ability of the other modality by complementing the missing information.

The essential problem is to find the correspondence between the audio and visual streams, which is the goal of this work. We propose a coupled 3D Convolutional Neural Network (CNN) architecture that maps both modalities into a representation space and evaluates the correspondence of audio-visual streams using the learned multimodal features.

The proposed architecture incorporates both spatial and temporal information jointly to effectively find the correlation between the temporal information of the different modalities. Using a relatively small network architecture and a much smaller dataset, our proposed method surpasses the performance of existing similar methods for audio-visual matching that use CNNs for feature representation. We also demonstrate that an effective pair selection method can significantly increase performance.

The input pipeline must be provided by the user. The rest of the implementation assumes a dataset that contains utterance-based extracted features.

For lip tracking, the desired video must be fed as input. First, cd to the corresponding directory:

cd code/lip_tracking

Then run the dedicated Python script as shown below:

python VisualizeLip.py --input input_video_file_name.ext --output output_video_file_name.ext

Running the aforementioned script extracts the lip motions by saving the mouth area of each frame and creates the output video with a rectangle drawn around the mouth area for better visualization.

The required arguments are defined by the following snippet in the VisualizeLip.py file:

import argparse

# Command-line arguments of VisualizeLip.py
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--input", required=True,
                help="path to input video file")
ap.add_argument("-o", "--output", required=True,
                help="path to output video file")
ap.add_argument("-f", "--fps", type=int, default=30,
                help="FPS of output video")
ap.add_argument("-c", "--codec", type=str, default="MJPG",
                help="codec of output video")
args = vars(ap.parse_args())

Some of the arguments have default values, so no further action is required for them.
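For example, a hypothetical invocation that overrides the optional arguments (the file names below are placeholders) could look like:

python VisualizeLip.py --input sample_video.mp4 --output lip_tracking_output.avi --fps 30 --codec MJPG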

In the visual section, the videos are post-processed to have an equal frame rate of 30 f/s. Then, face tracking and mouth area extraction are performed on the videos using the dlib library [dlib]. Finally, all mouth areas are resized to the same size and concatenated to form the input feature cube. The dataset does not contain any audio files; the audio is extracted from the videos using the FFmpeg framework [ffmpeg]. The processing pipeline is shown in the figure below.

readme_images/processing.gif
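To give an idea of this stage, here is a minimal per-frame sketch, assuming OpenCV, dlib, and the standard 68-point landmark model (the model file path is an assumption). It is not the repository's exact implementation, which lives in code/lip_tracking/VisualizeLip.py.

import cv2
import dlib
import numpy as np

# Hypothetical helper, not the repository's implementation.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed model file

def extract_mouth(frame, size=(100, 60)):
    """Return the mouth region of one frame as a 60x100 gray-scale image, or None if no face is found."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if len(faces) == 0:
        return None
    # In the 68-point model, landmarks 48-67 outline the mouth.
    shape = predictor(gray, faces[0])
    points = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)], dtype=np.int32)
    x, y, w, h = cv2.boundingRect(points)
    mouth = gray[y:y + h, x:x + w]
    return cv2.resize(mouth, size)  # dsize is (width, height), so this yields a 60x100 image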

The proposed architecture utilizes two non-identical ConvNets which use a pair of speech and video streams. The network input is a pair of features that represent the lip movements and the speech features extracted from 0.3 seconds of a video clip. The main task is to determine whether a stream of audio corresponds to a lip motion clip within the desired stream duration. In the next two sub-sections, we explain the inputs for the speech and visual streams.

Speech Net

On the time axis, the temporal features are non-overlapping 20 ms windows that are used to generate spectral features with a local characteristic. The input speech feature map, which is represented as an image cube, corresponds to the spectrogram as well as the first- and second-order derivatives of the MFEC features. These three channels correspond to the image depth. Collectively, from a 0.3-second clip, 15 temporal feature sets (each consisting of 40 MFEC features) can be derived, which form a speech feature cube. Each input feature map for a single audio stream therefore has dimensionality 15 × 40 × 3. This representation is depicted in the following figure:

readme_images/Speech_GIF.gif

The speech features have been extracted using the [SpeechPy] package.

Please refer to code/speech_input/input_feature.py to get an idea of how the input pipeline works.
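As a rough, illustrative sketch of how such a 15 × 40 × 3 cube could be built with SpeechPy (the parameter choices and file name below are assumptions; the repository's pipeline may differ), consider:

import scipy.io.wavfile as wav
import speechpy

# Hypothetical 0.3-second mono clip; the file name is a placeholder.
fs, signal = wav.read("sample_0p3s_clip.wav")

# 40 log-energy filterbank (MFEC-like) features from non-overlapping 20 ms windows.
mfec = speechpy.feature.lmfe(signal, sampling_frequency=fs,
                             frame_length=0.020, frame_stride=0.020,
                             num_filters=40)

# Stack the static features with their first- and second-order derivatives,
# yielding a cube of shape (num_frames, 40, 3) -- roughly 15 x 40 x 3 for 0.3 s.
feature_cube = speechpy.feature.extract_derivative_feature(mfec)
print(feature_cube.shape)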

Visual Net

The frame rate of each video clip used in this effort is 30 f/s. Consequently, 9 successive image frames form the 0.3-second visual stream. The input of the visual stream of the network is a cube of size 9x60x100, where 9 is the number of frames that represent the temporal information. Each channel is a 60x100 gray-scale image of the mouth region.

readme_images/lip_motion.jpg
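As a small sketch of what this input looks like as an array (using placeholder data rather than real mouth crops):

import numpy as np

# Placeholder for nine consecutive 60x100 gray-scale mouth crops;
# in practice these come from the lip-tracking stage described above.
mouth_frames = [np.zeros((60, 100), dtype=np.float32) for _ in range(9)]

visual_cube = np.stack(mouth_frames, axis=0)   # shape: (9, 60, 100)
visual_cube = visual_cube[..., np.newaxis]     # shape: (9, 60, 100, 1), a single gray-scale channel
print(visual_cube.shape)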

The architecture is a coupled 3D convolutional neural network in which two different networks with different sets of weights must be trained. For the visual network, the spatial and temporal information of the lip motions are incorporated jointly and fused to exploit the temporal correlation. For the audio network, the extracted energy features are considered the spatial dimension, and the stacked audio frames form the temporal dimension. In the proposed 3D CNN architecture, the convolutional operations are performed on successive temporal frames for both the audio and visual streams.

readme_images/DNN-Coupled.png
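For orientation only, the following is a heavily simplified sketch of the coupled idea: two separate 3D-convolutional towers, one per modality, each producing an embedding that can later be compared with a distance-based (e.g. contrastive) loss. It assumes TensorFlow 1.x with tf.contrib.slim, and the layer sizes are illustrative; the repository's actual architecture is in code/training_evaluation.

import tensorflow as tf

slim = tf.contrib.slim  # assumes TensorFlow 1.x

def visual_tower(mouth_cube):
    # mouth_cube: (batch, 9, 60, 100, 1)
    with tf.variable_scope('mouth_cnn'):
        net = slim.conv3d(mouth_cube, 16, [3, 5, 5], scope='conv1')
        net = slim.conv3d(net, 32, [3, 3, 3], stride=[1, 2, 2], scope='conv2')
        net = slim.flatten(net)
        return slim.fully_connected(net, 64, activation_fn=None, scope='embedding')

def speech_tower(speech_cube):
    # speech_cube: (batch, 15, 40, 3); add a trailing channel axis for 3D convolutions
    net = tf.expand_dims(speech_cube, -1)  # (batch, 15, 40, 3, 1)
    with tf.variable_scope('speech_cnn'):
        net = slim.conv3d(net, 16, [3, 5, 1], scope='conv1')
        net = slim.conv3d(net, 32, [3, 3, 1], scope='conv2')
        net = slim.flatten(net)
        return slim.fully_connected(net, 64, activation_fn=None, scope='embedding')

# The two 64-dimensional embeddings are then compared (e.g. with a Euclidean
# distance inside a contrastive loss) to score the audio-visual correspondence.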

First, clone the repository. Then cd to the dedicated directory:

cd code/training_evaluation

Finally, the train.py file must be executed:

python train.py

For the evaluation phase, a similar script must be executed:

python test.py

The results below demonstrate the effect of the proposed method on accuracy and the speed of convergence.

(figure: accuracy comparison)

The best result, which is the right-most one, belongs to our proposed method.

(figure: speed of convergence)

The effect of the proposed online pair selection method is shown in the figure.

The current version of the code does not contain the adaptive pair selection method proposed in the paper "3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition". Only a simple pair selection with hard thresholding is included at the moment.
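As an illustration of such hard-threshold selection (not the paper's adaptive online method), a minimal NumPy sketch, where label 1 marks genuine (matching) pairs and 0 marks impostor pairs, might look like this:

import numpy as np

def select_contributing_pairs(distances, labels, margin=1.0):
    # Keep every genuine pair, and keep an impostor pair only when it violates
    # the margin, i.e. imp_dis < max_gen + margin (a hard threshold).
    distances = np.asarray(distances, dtype=np.float32)
    labels = np.asarray(labels)
    max_gen = distances[labels == 1].max()
    keep = (labels == 1) | ((labels == 0) & (distances < max_gen + margin))
    return distances[keep], labels[keep]

# Toy example: five pairs with their embedding distances and genuine/impostor labels.
dists = [0.4, 0.9, 2.5, 1.1, 3.0]
labs = [1, 1, 0, 0, 0]
print(select_contributing_pairs(dists, labs, margin=0.5))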

We look forward to your kind feedback. Please help us improve the code and make our work better. To contribute, please create a pull request and we will review it promptly. Once again, we appreciate your feedback and code inspections.

References

[SpeechPy] @misc{amirsina_torfi_2017_810392, author = {Amirsina Torfi}, title = {astorfi/speech_feature_extraction: SpeechPy}, month = jun, year = 2017, doi = {10.5281/zenodo.810392}, url = {https://doi.org/10.5281/zenodo.810391}}
[dlib] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.
[ffmpeg] FFmpeg Developers. FFmpeg tool (version be1d324) [software], 2016.


lip-reading-deeplearning's Issues

Where to find MFEC api?

Thanks for your awesome work! I can't find the MFEC implementation in SpeechPy; could you tell me where to get it? Thank you!

Need help about demo

Anyone who is currently working on this, or has worked on it before: I need help. I have run this project, and lip tracking with dlib works fine for me. I want to know how I can do lip reading from a video; it only outputs the tracked lips, with no text output of the lip reading.

Increase the training for longer videos

How do I train and test on longer videos?
Right now, if I give an input video longer than 6 seconds, it processes only the first 6 seconds of the video.

Online pair selection?

Hi, in your paper the pair selection algorithm selects the main contributing impostor pairs, i.e. those with imp_dis < (max_gen + margin). Could you clarify why these are the main contributing impostor pairs?

redundant activation functions in lipread_mouth.py

Hi, thank you for such a nice repo.

I noticed that your code uses the slim library and a custom PReLU activation after every conv layer.
The problem is that after slim.conv2d (which actually performs 3D convolutions in this case) the tensor has already passed through an activation, because the default 'activation_fn' parameter is ReLU.
So the PReLU alphas don't learn, because instead of negative values you get all zeros.
To fix that: net = slim.conv2d(inputs,...., activation_fn=None)

The version of tensorflow.

Could you please give some information about the required version of TensorFlow? I get different errors on different versions of TensorFlow. Thank you.
tensorflow:1.11.0

 File "/Users/apple/anaconda3/envs/venv/lib/python2.7/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1025, in convolution
    (conv_dims + 2, input_rank))
ValueError: Convolution expects input with rank 4, got 5

tensorflow:1.6.0

  File "/Users/apple/anaconda3/envs/tensorflow-1.0.0/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1751, in restore
    raise ValueError("Can't load save_path when it is None.")
ValueError: Can't load save_path when it is None.

./results/mouth/frame_%*.png: could not find codec parameters

Hi:
I am glad to have found this project, but when I run ./run.sh or ./lip_tracking_demo.sh, the error is:
[image2 @ 0x1aa4100] Could not open file : ./results/mouth/frame_.png
[image2 @ 0x1aa4100] Could not find codec parameters (Video: png)
./results/mouth/frame_%*.png: could not find codec parameters

because results/mouth/frame_%*.png cannot be found. Can you release this file, or give me an example of what form frame_%*.png takes? Is frame_%*.png only the mouth region, like 'lip-reading-deeplearning-master/readme_images/1.gif'?

How to train my dataset

Thank you for the excellent work and publicly available code.
I want to ask a question: how do I train on my own dataset? I did not find the data input port in the program; may I ask how to do it?

Help in Lip reading

Mr. Torfi,
Could you please help me with lip reading? I am able to detect the lips, but what is the next step?
Sorry for the inconvenience, and thank you very much. I am really interested in your work.

keras not used?

Just scanned through the code quickly, it appears that keras is not used anywhere. Was that a matter of principle, or is the required functionality not supported at the keras level? Thanks!

A problem of Multiclass Classification

As far as I understood, your code supports only a binary classification problem. I could not find any information in the paper regarding the classes (are the "Words"/"Subjects" the classes?). I am trying to use this for a multi-class problem. Since pairing has been done for the frame sequences of each video (9 of them) with the corresponding speech spectrogram and MFEC features, I suppose there will be no problem if one changes the number of classes.
When I change the number of classes from 2 to 6, I get this error; can you help me?

Epoch 1, Minibatch 1 of 15 , Minibatch Loss= 1056.706787, EER= 0.50000, AUC= 0.33333, AP= 0.69683, contrib = 8 pairs
Epoch 1, Minibatch 2 of 15 , Minibatch Loss= 1793.572998, EER= 0.50000, AUC= 0.55000, AP= 0.61167, contrib = 9 pairs
Epoch 1, Minibatch 3 of 15 , Minibatch Loss= 1273.130249, EER= 0.50000, AUC= 0.62500, AP= 0.80417, contrib = 6 pairs
Epoch 1, Minibatch 4 of 15 , Minibatch Loss= 1280.513916, EER= 0.25000, AUC= 0.60714, AP= 0.81829, contrib = 11 pairs
Epoch 1, Minibatch 5 of 15 , Minibatch Loss= 1651.882568, EER= 0.40000, AUC= 0.60000, AP= 0.67778, contrib = 9 pairs
Epoch 1, Minibatch 6 of 15 , Minibatch Loss= 1395.890381, EER= 0.40000, AUC= 0.48000, AP= 0.53429, contrib = 10 pairs
Epoch 1, Minibatch 7 of 15 , Minibatch Loss= 1423.493164, EER= 0.27273, AUC= 0.63636, AP= 0.58000, contrib = 16 pairs
Epoch 1, Minibatch 8 of 15 , Minibatch Loss= 1248.631836, EER= 0.50000, AUC= 0.55000, AP= 0.61167, contrib = 9 pairs
Epoch 1, Minibatch 9 of 15 , Minibatch Loss= 1377.684937, EER= 0.50000, AUC= 0.54167, AP= 0.74385, contrib = 10 pairs
Epoch 1, Minibatch 10 of 15 , Minibatch Loss= 1460.154419, EER= 0.33333, AUC= 0.83333, AP= 0.88750, contrib = 7 pairs
Epoch 1, Minibatch 11 of 15 , Minibatch Loss= 1794.762451, EER= 0.40000, AUC= 0.33333, AP= 0.67771, contrib = 17 pairs
Epoch 1, Minibatch 12 of 15 , Minibatch Loss= 1140.301392, EER= 0.50000, AUC= 0.37500, AP= 0.36667, contrib = 6 pairs
Epoch 1, Minibatch 13 of 15 , Minibatch Loss= 1273.781738, EER= 0.66667, AUC= 0.47619, AP= 0.51664, contrib = 16 pairs
Epoch 1, Minibatch 14 of 15 , Minibatch Loss= 989.276489, EER= 0.50000, AUC= 0.58333, AP= 0.36667, contrib = 8 pairs
Epoch 1, Minibatch 15 of 15 , Minibatch Loss= 1625.663696, EER= 0.33333, AUC= 0.83333, AP= 0.95028, contrib = 11 pairs
TESTING: Epoch 1, Minibatch 1 of 3 
TESTING: Epoch 1, Minibatch 2 of 3 
Traceback (most recent call last):
  File "/media/Data/Scripts/lip-reading-deeplearning/code/training_evaluation/train.py", line 667, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/media/Data/Scripts/lip-reading-deeplearning/code/training_evaluation/train.py", line 659, in main
    score_dissimilarity_vector[i * batch_k_validation:(i + 1) * batch_k_validation])
  File "/media/Data/Scripts/lip-reading-deeplearning/code/training_evaluation/roc_curve/calculate_roc.py", line 16, in calculate_eer_auc_ap
    AUC = metrics.roc_auc_score(label, -distance, average='macro', sample_weight=None)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/ranking.py", line 277, in roc_auc_score
    sample_weight=sample_weight)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/base.py", line 72, in _average_binary_score
TESTING: Epoch 1, Minibatch 3 of 3 
    raise ValueError("{0} format is not supported".format(y_type))
ValueError: multiclass format is not supported

Is this because the ROC calculation (line 659 of train.py) does not support multi-class classification? Can you tell me which parts need modification? Maybe I missed something.

about the demo

Hello, I ran your code and saw the demo. The paper talks about audio and video synchronization, but why is the demo only lip tracking? How can I see the effect of audio-video synchronization, and where can I find the training data?

Lip reading details

According to the documentation and as per the research paper, we can get the utterances for the lip movements. How do I run it and get the response?

cannot set WRITEABLE flag to True of this array

When I run VisualizeLip.py, I encounter the following problem:

Traceback (most recent call last):
  File "VisualizeLip.py", line 129, in <module>
    frame.setflags(write=True)
ValueError: cannot set WRITEABLE flag to True of this array

I looked it up on the Internet and it said the problem is the numpy library's version;
my TensorFlow is 2.1.0, and it conflicts with the numpy library.

This recording has been archived

Hello!

The Training/Evaluation DEMO video no longer exists. It shows: "All unclaimed recordings (the ones not linked to any user account) are automatically archived 7 days after upload."

Unable to execute the project

I am following the training/evaluation video demo. When I run the ./run.sh command, I'm getting errors

[screenshot]

Could you tell me what I should do?
I tried a lot with ChatGPT but to no avail. I am stuck; can anyone please help?

error when changing the input size

I changed the input image size into: 'mouth': np.random.random_sample(size=(num_training_samples, 9, 64, 64, 1))

Then I get this error:
Negative dimension size caused by subtracting 5 from 3 for 'tower_0/mouth_cnn/fc5/fc5_1/convolution' (op: 'Conv3D') with input shapes: [?,9,3,3,128], [1,2,5,128,256]

How should I change the network structure to fix that? Thank you.

How to synchronize your audio and image frame?

Hello,
I would like to know how you deal with synchronizing the lip-movement frames and the audio.
Since the input FPS may vary from video to video (e.g., 30 FPS means about 33 ms per frame, so each frame represents about 33 ms of audio), how do you construct the corresponding video-audio pairs?

Question about speech features

You mentioned the "use of non-overlapping Hamming windows for generating speech features", but I'm not sure how to do that here. Could you describe the procedure in detail? Thanks a lot!

Error with using Dlib

[screenshot]

It seems dlib... is not available; how can I solve this?

I will be using the library for lip reading.
Thank you

ValueError: not enough values to unpack (expected 6, got 3)

Hi,
Thanks for the project. Please let me know which version of TensorFlow you are using, as I am getting "ValueError: not enough values to unpack (expected 6, got 3)" in the train.py and test.py files at sess.run(); maybe there is a compatibility issue. P.S.: I am using a Windows machine.

Number of filters

I may be mistaken, but shouldn't we be creating 128, not 64 filters here? The comment below says shape=(?, 9, 128); we also had 128 in the previous layer.

net = slim.conv3d(net, 64, [1, 1, 1], padding='VALID', activation_fn=None, normalizer_fn=None, scope='fc5')

net = slim.conv3d(net, 64, [15, 1, 1],padding='VALID', activation_fn=None, normalizer_fn=None, scope='fc5')

local variable 'i' referenced before assignment

When I run test.py:

  File "test.py", line 576, in <module>
    tf.app.run()
  File "/home/ligen/anaconda2/envs/python3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "test.py", line 401, in main
    with tf.name_scope('%s_%d' % ('tower', i)) as scope:
UnboundLocalError: local variable 'i' referenced before assignment
