LipScribe: An application that converts the lip movements of a speaker in a silent video to text and displays the result in an Android application. Exploiting the ability of 3D CNNs to extract information from spatio-temporal data, this deep learning project predicts words from a sequence of video frames.
- Install the Android application using the APK from https://drive.google.com/drive/folders/10pGHK0VYddb7Kn0rjqMDCVR4Nh-bR_U3?usp=sharing
- On launch, the application checks for the camera and requests permission to record videos and capture pictures.
- Allow the application to record videos and capture pictures.
- Allow the application to access files and media when prompted.
- Tap 'Start Camera' on the start page to begin recording a video.
- The recorded video is read from external storage and passed to the model for prediction; this happens in the background while a loading screen is displayed.
- The predicted word the speaker uttered is displayed on the screen.
- Use the Android application to record a video.
- The video goes through preprocessing: frames are extracted from the video, and a Haar Cascade classifier then locates and crops the speaker's lips in each frame.
- The cropped lip sequence is fed to a 3D CNN model, which outputs the predicted word.
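The preprocessing steps above can be sketched as follows. In practice OpenCV's Haar cascade API (`cv2.CascadeClassifier`) would locate the mouth; to keep this sketch self-contained, detection is replaced by a fixed lower-face crop, the frames are synthetic, and the resolution, frame count, and crop fractions are illustrative assumptions, not values from the original code.

```python
import numpy as np

def crop_mouth_region(frame, y_frac=0.65, h_frac=0.3, x_frac=0.25, w_frac=0.5):
    """Crop an approximate mouth region from a grayscale frame.

    In the real pipeline a Haar cascade (cv2.CascadeClassifier) would
    locate the face/mouth; a fixed fractional crop stands in for it here.
    """
    h, w = frame.shape
    y0, x0 = int(h * y_frac), int(w * x_frac)
    return frame[y0:y0 + int(h * h_frac), x0:x0 + int(w * w_frac)]

def frames_to_model_input(frames, size=(32, 64)):
    """Stack per-frame mouth crops into a (1, T, H, W, 1) tensor for a 3D CNN."""
    crops = []
    for frame in frames:
        mouth = crop_mouth_region(frame)
        # Nearest-neighbour resize via index sampling (cv2.resize in practice).
        ys = np.linspace(0, mouth.shape[0] - 1, size[0]).astype(int)
        xs = np.linspace(0, mouth.shape[1] - 1, size[1]).astype(int)
        crops.append(mouth[np.ix_(ys, xs)])
    clip = np.stack(crops).astype(np.float32) / 255.0
    return clip[np.newaxis, ..., np.newaxis]  # add batch and channel axes

# Synthetic 25-frame grayscale clip at 120x160.
frames = [np.random.randint(0, 256, (120, 160), dtype=np.uint8) for _ in range(25)]
x = frames_to_model_input(frames)
print(x.shape)  # (1, 25, 32, 64, 1)
```

A tensor of this shape (batch, time, height, width, channels) is the standard input layout for a 3D convolution over a frame sequence.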
An application compatible with the Android operating system is developed to serve the model's predictions. The application requires a model built with TensorFlow 1.15. The ffmpeg library is used to extract frames from the video during preprocessing. The mouth region is then extracted, converted into embeddings, and passed as input to the model.
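As a hedged illustration of the ffmpeg frame-extraction step, the snippet below builds the command such a pipeline would typically run; the input path, frame rate, and output filename pattern are assumptions for the sketch, not values taken from the original project.

```python
def build_ffmpeg_cmd(video_path, out_dir, fps=25):
    """Build an ffmpeg command that dumps video frames as numbered PNGs.

    Paths and fps are illustrative; the real app's values may differ.
    """
    return [
        "ffmpeg",
        "-i", video_path,             # input video recorded by the app
        "-vf", f"fps={fps}",          # sample frames at a fixed rate
        f"{out_dir}/frame_%04d.png",  # numbered output frames
    ]

cmd = build_ffmpeg_cmd("recording.mp4", "frames")
print(" ".join(cmd))
# To actually run it (requires ffmpeg on PATH):
# import subprocess; subprocess.run(cmd, check=True)
```

Building the command as an argument list (rather than a shell string) avoids quoting issues when the video path contains spaces.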