
  DysarthrAI

A Voice for All: Communication assistant for people with Dysarthric speech

Simon Hodgkinson, Michael Powers, Rich Ung

Project Deliverable

Table of Contents

  1. About
  2. Datasets
  3. Model Details
  4. Website App Implementation
  5. Resources
  6. Contact Us
  7. Appendix

About

Our mission is to improve the communication abilities of people with dysarthric speech. Dysarthria is a condition where muscles used for speech are weak and hard to control, resulting in slurred or slow speech that can be difficult to understand. Common causes of dysarthria include neurological disorders such as stroke, brain injury, brain tumors, and conditions that cause facial paralysis or tongue or throat muscle weakness.

Our application, DysarthrAI, is a communication assistant for people with dysarthric speech. It enables these individuals to communicate phrases to others, regardless of their vocal abilities. The speaker-dependent model requires the user to store phrases they wish to communicate in the future, along with translations of those phrases. Once a phrase is saved, the user can speak it into the app, which uses our algorithm to identify the phrase and play back a clear audio translation using text-to-speech.

Datasets

We used the TORGO dataset located here.

The TORGO data is downloaded and unzipped to data/TORGO. This folder contains 8 subfolders, one per speaker ("F01", "F03", etc.): 3 females and 5 males. These directories are added to the .gitignore file because they are very large and would take up too much space in our repository.

We performed a series of transformations and data cleaning as shown within the following notebooks:

  1. Download Dataset
  2. Create Spectrograms
  3. Create indexes
  4. Create MFCCs

These notebooks produce the spectrograms and MFCCs we use to analyze the audio files and build our models.
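As an illustration of the spectrogram step, here is a minimal sketch of computing a log-scaled mel spectrogram with librosa; the file path and parameter values are placeholders rather than the exact settings used in the notebooks:

```python
import librosa
import numpy as np

# Load one TORGO utterance (placeholder path).
signal, sr = librosa.load("data/TORGO/F01/Session1/wav_arrayMic/0001.wav", sr=16000)

# Compute a mel spectrogram and convert it to a log (dB) scale.
mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_fft=1024, hop_length=256, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)

print(log_mel.shape)  # (n_mels, n_frames)
```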

We also ran our datasets through AWS Transcribe and the Google Translate API to see how accurately off-the-shelf services transcribe audio from people with dysarthric speech. Our code is found within this notebook.
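For reference, a baseline transcription request to AWS Transcribe via boto3 looks roughly like the sketch below; the bucket, key, and job name are hypothetical, and the linked notebook contains the actual code we used:

```python
import time
import boto3

transcribe = boto3.client("transcribe")

# Hypothetical S3 location of one TORGO audio file.
job_name = "torgo-f01-0001"
media_uri = "s3://example-bucket/torgo/F01_0001.wav"

transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={"MediaFileUri": media_uri},
    MediaFormat="wav",
    LanguageCode="en-US",
)

# Poll until the job finishes, then print the transcript location.
while True:
    job = transcribe.get_transcription_job(TranscriptionJobName=job_name)["TranscriptionJob"]
    if job["TranscriptionJobStatus"] in ("COMPLETED", "FAILED"):
        break
    time.sleep(5)

print(job.get("Transcript", {}).get("TranscriptFileUri"))
```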

Model Details

MFCCs (Mel-Frequency Cepstral Coefficients)

  • MFCCs have been found to outperform spectrograms in ASR systems
  • Similar to a log-scaled spectrogram, with the frequency bands bucketed into distinct 'coefficients'
  • Inspired by human hearing (we resolve sound in quasi-logarithmic frequency bands); a minimal extraction sketch follows below
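A minimal MFCC extraction sketch with librosa (the path and parameter values are illustrative, not necessarily those used in our notebooks):

```python
import librosa

def audio_to_mfcc(path, sr=16000, n_mfcc=13):
    """Load an audio file and return its MFCC matrix of shape (n_mfcc, n_frames)."""
    signal, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)

mfcc = audio_to_mfcc("data/TORGO/F01/Session1/wav_arrayMic/0001.wav")  # placeholder path
print(mfcc.shape)
```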

Dynamic Time Warping (DTW)

  • Measures similarity between two sequences while accounting for different production rates
  • Does not require a lot of training data, unlike deep learning approaches such as CNNs
  • Dysarthric speech is particularly prone to pauses and variable speed, making it a good candidate for normalization using DTW
  • The idea is to compare the MFCCs of an input phrase to pre-recorded training examples, using DTW to eliminate temporal distortion, then assign the input phrase the label of the training example with the minimum distance (see the sketch after this list)
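As a rough sketch, the DTW comparison between two MFCC sequences can be computed with librosa's built-in DTW; `audio_to_mfcc` is the hypothetical helper sketched above:

```python
import librosa

def dtw_distance(mfcc_a, mfcc_b):
    """DTW alignment cost between two MFCC matrices of shape (n_mfcc, n_frames)."""
    D, _ = librosa.sequence.dtw(X=mfcc_a, Y=mfcc_b, metric="euclidean")
    return D[-1, -1]

# Compare a new utterance against one stored training example (placeholder paths).
query = audio_to_mfcc("query.wav")
reference = audio_to_mfcc("saved.wav")
print(dtw_distance(query, reference))
```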

DTW and "Shifting"

  • A review of examples the DTW algorithm gets wrong suggests an issue with the alignment of the MFCC vectors.
  • Even though the DTW algorithm is designed to handle sequences that are not perfectly aligned, it appears it can still sometimes struggle, which motivates the "shifting" step sketched below.
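We read "shifting" as offsetting the start of the input MFCC sequence by a few frames before running DTW and keeping the best-scoring offset; the sketch below reflects that assumption and reuses the hypothetical helpers above, so the exact scheme in our notebook may differ:

```python
def shifted_dtw_distance(query_mfcc, reference_mfcc, max_shift=10):
    """Try several start-frame offsets of the query and keep the smallest DTW cost."""
    best = float("inf")
    for shift in range(max_shift + 1):
        shifted = query_mfcc[:, shift:]
        if shifted.shape[1] < 2:  # not enough frames left to align
            break
        best = min(best, dtw_distance(shifted, reference_mfcc))
    return best
```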

Final Model

  • Our final model using the concepts described above is located here.

Website App Implementation

We've built a website app that allows a user to:

  • Upload audio files with a translation label ("saved phrases") - one file for each phrase the user wishes the system to recognize
  • Upload audio files without a translation label ("requested phrases") and request a translation from the system
  • Provide translation validation (yes/no) back to the system

Front End

This allows us to run our model on new audio and gather additional training data to further improve our models.

When an audio file with translation label (“saved phrase”) is added, the system will:

  • Convert the audio to an MFCC representation and store it in a database

When a “requested phrase” enters the system, the model will:

  • Convert the audio to an MFCC representation
  • Calculate the DTW distance between that MFCC and each "saved phrase" MFCC
  • Choose the “saved phrase” that is the closest match, i.e. the minimum DTW distance
  • Display the translation label (a minimal lookup sketch follows below)
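Putting the pieces together, the lookup amounts to a nearest-neighbour search over the stored MFCCs. Below is a minimal sketch that reuses the hypothetical helpers above, with an in-memory dict standing in for the database:

```python
def translate_phrase(query_path, saved_phrases):
    """Return the label of the saved phrase with the minimum DTW distance.

    saved_phrases: dict mapping label -> stored MFCC matrix (stands in for the database).
    """
    query_mfcc = audio_to_mfcc(query_path)
    distances = {label: dtw_distance(query_mfcc, mfcc) for label, mfcc in saved_phrases.items()}
    return min(distances, key=distances.get)
```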

Back End

The website was built using several AWS services:

Website App Architecture

Backend for Model

The final model is packaged in a Docker container running a Flask application. The Flask application gathers data from S3, runs the model, and writes the results to DynamoDB. Because Flask is a Python framework, like the model itself, integrating the model into the app was straightforward. The Docker container is deployed to Fargate, which lets us run containers in the cloud without managing the underlying infrastructure.
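A simplified sketch of what such a Flask service might look like; the route, bucket name, table name, and field names are hypothetical rather than the production configuration, and `translate_phrase` is the helper sketched earlier:

```python
import boto3
from flask import Flask, jsonify, request

app = Flask(__name__)
s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("phrases")  # hypothetical table name

def load_saved_phrases():
    # Stand-in: the real app loads the stored "saved phrase" MFCCs from persistent storage.
    return {}

@app.route("/translate", methods=["POST"])
def translate():
    # The frontend posts the S3 key of the uploaded "requested phrase".
    key = request.json["audio_key"]
    local_path = "/tmp/input.wav"
    s3.download_file("dysarthrai-audio", key, local_path)  # hypothetical bucket name

    # Run the DTW model against the stored phrases.
    label = translate_phrase(local_path, load_saved_phrases())

    # Record the predicted label so the frontend can display it.
    table.update_item(
        Key={"audio_key": key},
        UpdateExpression="SET predicted_label = :l",
        ExpressionAttributeValues={":l": label},
    )
    return jsonify({"audio_key": key, "predicted_label": label})
```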

Frontend Website

The frontend website is built using React, a JavaScript library for building interactive user interfaces. The site is deployed to S3, and Route 53 and CloudFront direct users to it whenever they access dysarthrai.com. The frontend uses S3 to upload, store, and manage audio files, DynamoDB to find and update audio labels, and the Docker/Fargate service to run the model described above.

Resources

Contact Us

Feel free to contact any of the team members below if you have any additional questions:

Appendix

Loading Environment

Run the following command within the base directory of this repository to build the notebook Docker environment for this project:

docker build -t w210/capstone:1.0 .

Run the following command within the base directory of this repository to run the notebook Docker environment for this project:

docker run --rm -p 8888:8888 -p 6006:6006 -e JUPYTER_ENABLE_LAB=yes -v "$PWD":/home/jovyan/work w210/capstone:1.0
