
  DysarthrAI

A Voice for All: Communication assistant for people with Dysarthric speech

Simon Hodgkinson, Michael Powers, Rich Ung

Project Deliverable

Table of Contents

  1. About
  2. Datasets
  3. Model Details
  4. Website App Implementation
  5. Resources
  6. Contact Us
  7. Appendix

About

Our mission is to improve the communication abilities of people with dysarthric speech. Dysarthria is a condition where muscles used for speech are weak and hard to control, resulting in slurred or slow speech that can be difficult to understand. Common causes of dysarthria include neurological disorders such as stroke, brain injury, brain tumors, and conditions that cause facial paralysis or tongue or throat muscle weakness.

Our application, DysarthrAI, is a communication assistant for people with dysarthric speech. It enables these individuals to communicate phrases to others, regardless of their vocal abilities. The speaker-dependent model requires the user to store phrases they wish to communicate in the future, along with translations of those phrases. Once a phrase is saved, the user can speak it into the app, which uses our algorithm to identify the phrase and play back a clear audio translation using text-to-speech.

Datasets

We used the TORGO dataset located here.

The TORGO data is downloaded and unzipped to data/TORGO. This folder contains 8 subfolders, one per speaker ("F01", "F03", etc.): 3 females and 5 males. These directories are added to the .gitignore file because they are very large and would take up too much space in our repository.

We performed a series of transformations and data cleaning as shown within the following notebooks:

  1. Download Dataset
  2. Create Spectrograms
  3. Create indexes
  4. Create MFCCs

These notebooks produce the spectrograms and MFCCs we use to analyze the audio files and build our models.
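As an illustration of the spectrogram step, here is a minimal sketch of computing a log-scaled mel spectrogram with librosa; the file path and parameter values are placeholders rather than the exact settings used in the notebooks:

```python
import librosa
import numpy as np

# Load one TORGO utterance (placeholder path).
signal, sr = librosa.load("data/TORGO/F01/Session1/wav_arrayMic/0001.wav", sr=16000)

# Compute a mel spectrogram and convert it to a log (dB) scale.
mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_fft=1024, hop_length=256, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)

print(log_mel.shape)  # (n_mels, n_frames)
```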

We also ran our datasets through AWS Transcribe and the Google Translate API to see how accurately off-the-shelf services transcribe audio from people with dysarthric speech. Our code is found within this notebook.
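For reference, a baseline transcription request to AWS Transcribe via boto3 looks roughly like the sketch below; the bucket, key, and job name are hypothetical, and the linked notebook contains the actual code we used:

```python
import time
import boto3

transcribe = boto3.client("transcribe")

# Hypothetical S3 location of one TORGO audio file.
job_name = "torgo-f01-0001"
media_uri = "s3://example-bucket/torgo/F01_0001.wav"

transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={"MediaFileUri": media_uri},
    MediaFormat="wav",
    LanguageCode="en-US",
)

# Poll until the job finishes, then print the transcript location.
while True:
    job = transcribe.get_transcription_job(TranscriptionJobName=job_name)["TranscriptionJob"]
    if job["TranscriptionJobStatus"] in ("COMPLETED", "FAILED"):
        break
    time.sleep(5)

print(job.get("Transcript", {}).get("TranscriptFileUri"))
```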

Model Details

MFCCs (Mel-Frequency Cepstral Coefficients)

  • MFCCs have been found to outperform spectrograms in ASR systems
  • Similar to a log-scaled spectrogram, with the frequency bands bucketed into distinct 'coefficients'
  • Inspired by human hearing (we resolve sound in quasi-logarithmic frequency bands); a minimal extraction sketch follows below
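A minimal MFCC extraction sketch with librosa (the path and parameter values are illustrative, not necessarily those used in our notebooks):

```python
import librosa

def audio_to_mfcc(path, sr=16000, n_mfcc=13):
    """Load an audio file and return its MFCC matrix of shape (n_mfcc, n_frames)."""
    signal, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)

mfcc = audio_to_mfcc("data/TORGO/F01/Session1/wav_arrayMic/0001.wav")  # placeholder path
print(mfcc.shape)
```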

Dynamic Time Warping (DTW)

  • Measures similarity between two sequences while accounting for different production rates
  • Does not require a lot of training data, unlike deep learning approaches such as CNNs
  • Dysarthric speech is particularly prone to pauses and variable speed, making it a good candidate for normalization using DTW
  • The idea is to compare the MFCCs of an input phrase to pre-recorded training examples, using DTW to eliminate temporal distortion, then assign the input phrase the label of the training example with the minimum distance (see the sketch after this list)
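As a rough sketch, the DTW comparison between two MFCC sequences can be computed with librosa's built-in DTW; `audio_to_mfcc` is the hypothetical helper sketched above:

```python
import librosa

def dtw_distance(mfcc_a, mfcc_b):
    """DTW alignment cost between two MFCC matrices of shape (n_mfcc, n_frames)."""
    D, _ = librosa.sequence.dtw(X=mfcc_a, Y=mfcc_b, metric="euclidean")
    return D[-1, -1]

# Compare a new utterance against one stored training example (placeholder paths).
query = audio_to_mfcc("query.wav")
reference = audio_to_mfcc("saved.wav")
print(dtw_distance(query, reference))
```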

DTW and "Shifting"

  • A review of examples the DTW algorithm gets wrong suggests an issue with the alignment of the MFCC vectors.
  • Even though the DTW algorithm is designed to handle sequences that are not perfectly aligned, it appears it can still sometimes struggle, which motivates the "shifting" step sketched below.
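We read "shifting" as offsetting the start of the input MFCC sequence by a few frames before running DTW and keeping the best-scoring offset; the sketch below reflects that assumption and reuses the hypothetical helpers above, so the exact scheme in our notebook may differ:

```python
def shifted_dtw_distance(query_mfcc, reference_mfcc, max_shift=10):
    """Try several start-frame offsets of the query and keep the smallest DTW cost."""
    best = float("inf")
    for shift in range(max_shift + 1):
        shifted = query_mfcc[:, shift:]
        if shifted.shape[1] < 2:  # not enough frames left to align
            break
        best = min(best, dtw_distance(shifted, reference_mfcc))
    return best
```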

Final Model

  • Our final model using the concepts described above is located here.

Website App Implementation

We've built a website app that allows a user to:

  • Upload audio files with a translation label ("saved phrases") - one file for each phrase the user wishes the system to recognize
  • Upload audio files without a translation label ("requested phrases") and request a translation from the system
  • Provide translation validation (yes/no) back to the system

Front End

This allows us to run our model on new audio and gather additional training data to further improve our models.

When an audio file with translation label (“saved phrase”) is added, the system will:

  • Convert the audio to an MFCC representation and store it in a database

When a “requested phrase” enters the system, the model will:

  • Convert the audio to an MFCC representation
  • Calculate the DTW distance between that MFCC and each "saved phrase" MFCC
  • Choose the “saved phrase” that is the closest match, i.e. the minimum DTW distance
  • Display the translation label (a minimal lookup sketch follows below)
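Putting the pieces together, the lookup amounts to a nearest-neighbour search over the stored MFCCs. Below is a minimal sketch that reuses the hypothetical helpers above, with an in-memory dict standing in for the database:

```python
def translate_phrase(query_path, saved_phrases):
    """Return the label of the saved phrase with the minimum DTW distance.

    saved_phrases: dict mapping label -> stored MFCC matrix (stands in for the database).
    """
    query_mfcc = audio_to_mfcc(query_path)
    distances = {label: dtw_distance(query_mfcc, mfcc) for label, mfcc in saved_phrases.items()}
    return min(distances, key=distances.get)
```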

Back End

The website was built using several AWS services:

Website App Architecture

Backend for Model

The final model is packaged in a Docker container running a Flask application. The Flask application gathers data from S3, runs the model, and writes the results to DynamoDB. Because Flask is a Python framework, like the model itself, integrating the model into the app was straightforward. The Docker container is deployed to Fargate, which lets us run containers in the cloud without managing the underlying infrastructure.
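A simplified sketch of what such a Flask service might look like; the route, bucket name, table name, and field names are hypothetical rather than the production configuration, and `translate_phrase` is the helper sketched earlier:

```python
import boto3
from flask import Flask, jsonify, request

app = Flask(__name__)
s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("phrases")  # hypothetical table name

def load_saved_phrases():
    # Stand-in: the real app loads the stored "saved phrase" MFCCs from persistent storage.
    return {}

@app.route("/translate", methods=["POST"])
def translate():
    # The frontend posts the S3 key of the uploaded "requested phrase".
    key = request.json["audio_key"]
    local_path = "/tmp/input.wav"
    s3.download_file("dysarthrai-audio", key, local_path)  # hypothetical bucket name

    # Run the DTW model against the stored phrases.
    label = translate_phrase(local_path, load_saved_phrases())

    # Record the predicted label so the frontend can display it.
    table.update_item(
        Key={"audio_key": key},
        UpdateExpression="SET predicted_label = :l",
        ExpressionAttributeValues={":l": label},
    )
    return jsonify({"audio_key": key, "predicted_label": label})
```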

Frontend Website

The frontend website is built using React, a JavaScript library for building interactive user interfaces. The site is deployed to S3, and Route 53 and CloudFront direct users to it whenever they access dysarthrai.com. The frontend uses S3 to upload, store, and manage audio files, DynamoDB to find and update audio labels, and the Docker/Fargate service to run the model described above.

Resources

Contact Us

Feel free to contact any of the team members below if you have any additional questions:

Appendix

Loading Environment

Run the following command within the base directory of this repository to build the notebook Docker environment for this project:

docker build -t w210/capstone:1.0 .

Run the following command within the base directory of this repository to run the notebook Docker environment for this project:

docker run --rm -p 8888:8888 -p 6006:6006 -e JUPYTER_ENABLE_LAB=yes -v "$PWD":/home/jovyan/work w210/capstone:1.0
