LipScribe: An application that converts the lip movements of a speaker in a silent video to text and displays the result in an Android application. Exploiting the ability of 3D CNNs to extract information from spatio-temporal data, this deep learning project predicts words from a sequence of video frames.
- Install the Android application using the APK from https://drive.google.com/drive/folders/10pGHK0VYddb7Kn0rjqMDCVR4Nh-bR_U3?usp=sharing
- On launch, the application checks for the camera and requests permission to record videos and capture pictures.
- Allow the application to record videos and capture pictures.
- Allow the application to access files and media when prompted.
- Tap 'Start Camera' on the start page to begin recording a video.
- The recorded video is read from external storage and passed to the model for prediction; this happens in the background while a loading screen is displayed.
- The predicted word the speaker uttered is displayed on the screen.
- Use the Android application to record a video.
- The video goes through preprocessing: frames are extracted from the video, and a Haar Cascade classifier then locates and crops the speaker's lips in each frame.
- The cropped lip sequence is fed to a 3D CNN model, which outputs the predicted word.
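The preprocessing steps above can be sketched as follows. In practice OpenCV's Haar cascade API (`cv2.CascadeClassifier`) would locate the mouth; to keep this sketch self-contained, detection is replaced by a fixed lower-face crop, the frames are synthetic, and the resolution, frame count, and crop fractions are illustrative assumptions, not values from the original code.

```python
import numpy as np

def crop_mouth_region(frame, y_frac=0.65, h_frac=0.3, x_frac=0.25, w_frac=0.5):
    """Crop an approximate mouth region from a grayscale frame.

    In the real pipeline a Haar cascade (cv2.CascadeClassifier) would
    locate the face/mouth; a fixed fractional crop stands in for it here.
    """
    h, w = frame.shape
    y0, x0 = int(h * y_frac), int(w * x_frac)
    return frame[y0:y0 + int(h * h_frac), x0:x0 + int(w * w_frac)]

def frames_to_model_input(frames, size=(32, 64)):
    """Stack per-frame mouth crops into a (1, T, H, W, 1) tensor for a 3D CNN."""
    crops = []
    for frame in frames:
        mouth = crop_mouth_region(frame)
        # Nearest-neighbour resize via index sampling (cv2.resize in practice).
        ys = np.linspace(0, mouth.shape[0] - 1, size[0]).astype(int)
        xs = np.linspace(0, mouth.shape[1] - 1, size[1]).astype(int)
        crops.append(mouth[np.ix_(ys, xs)])
    clip = np.stack(crops).astype(np.float32) / 255.0
    return clip[np.newaxis, ..., np.newaxis]  # add batch and channel axes

# Synthetic 25-frame grayscale clip at 120x160.
frames = [np.random.randint(0, 256, (120, 160), dtype=np.uint8) for _ in range(25)]
x = frames_to_model_input(frames)
print(x.shape)  # (1, 25, 32, 64, 1)
```

A tensor of this shape (batch, time, height, width, channels) is the standard input layout for a 3D convolution over a frame sequence.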
An application compatible with the Android operating system is developed to serve the model's predictions. The application requires a model built with TensorFlow 1.15. The ffmpeg library is used to extract frames from the video during preprocessing. The mouth region is then extracted, converted into embeddings, and passed as input to the model.
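As a hedged illustration of the ffmpeg frame-extraction step, the snippet below builds the command such a pipeline would typically run; the input path, frame rate, and output filename pattern are assumptions for the sketch, not values taken from the original project.

```python
def build_ffmpeg_cmd(video_path, out_dir, fps=25):
    """Build an ffmpeg command that dumps video frames as numbered PNGs.

    Paths and fps are illustrative; the real app's values may differ.
    """
    return [
        "ffmpeg",
        "-i", video_path,             # input video recorded by the app
        "-vf", f"fps={fps}",          # sample frames at a fixed rate
        f"{out_dir}/frame_%04d.png",  # numbered output frames
    ]

cmd = build_ffmpeg_cmd("recording.mp4", "frames")
print(" ".join(cmd))
# To actually run it (requires ffmpeg on PATH):
# import subprocess; subprocess.run(cmd, check=True)
```

Building the command as an argument list (rather than a shell string) avoids quoting issues when the video path contains spaces.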