Speech Accent Detection

Humans speak a language with an accent, and a particular accent necessarily reflects a person's linguistic background. This model identifies a speaker's accent from an audio recording. The results could be used to detect a student's accent and help English learners reduce it through training.

About

English is a global language, and it is becoming a must-know language for most people. As English is adopted by different nations, it changes based on the location and the mother tongues of the people in that area; thus we find American English, British English, Indian English, and other varieties. One of the most significant differences between these varieties is the accent.

Accent detection would allow us to determine a student's accent level so that they can be retrained by a native speaker, or to help a school that wants to hire a teacher evaluate that teacher's accent.

Objectives

  • The model can classify the speaker's accent based on an audio file (WAV format).

Dependencies

The model runs on Ubuntu 18.04. The required Python libraries are listed in requirements.txt.

Dataset

  1. The George Mason University Speech Accent Archive contains around 3,500 audio files from speakers from over 100 countries.

    All speakers in the dataset read from the same passage:

    "Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station."

  2. CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92).

    This CSTR VCTK Corpus includes speech data uttered by 110 English speakers with various accents. Each speaker reads out 400 sentences, which were selected from a newspaper, the rainbow passage, and an elicitation paragraph used for the speech accent archive. The newspaper texts were taken from Herald Glasgow, with permission from Herald & Times Group. Each speaker has a different set of newspaper texts selected based on a greedy algorithm that increases the contextual and phonetic coverage.

  3. Mozilla Common Voice. This dataset contains tens of thousands of recordings of native and non-native speakers reading different sentences.

The datasets' .mp3 audio files were converted to .wav audio files.
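
For example, the conversion can be scripted with pydub; this is a minimal sketch assuming ffmpeg is installed, and the file names are illustrative:

    # Minimal .mp3 -> .wav conversion sketch using pydub (requires ffmpeg).
    from pydub import AudioSegment

    audio = AudioSegment.from_mp3("recording.mp3")  # illustrative input file
    audio.export("recording.wav", format="wav")     # write a WAV copy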

Mel-Frequency Cepstral Coefficients (MFCC)

The audio files are vectorized by computing MFCCs. MFCCs are meant to mimic the biological process of humans creating sound to produce phonemes, and the way humans perceive these sounds.

Phonemes are the base units of sound that combine to make up the words of a language. Non-native English speakers use different phonemes than native speakers, and the phonemes they use are shaped by their native language(s). By identifying differences in phonemes, we can differentiate between accents.

Overview

  1. Bin the raw audio signal
    To produce a matrix from a continuous signal, we bin the audio signal into frames. On short time scales, we assume that the audio signal does not change very much. Longer frames vary too much, and shorter frames do not provide enough signal. The standard is to bin the raw audio signal into 20-40 ms frames.

    The following steps are applied to every frame, and a set of coefficients is determined for each one (a NumPy sketch of the full pipeline follows this list):

  2. Calculate the periodogram power estimates
    This process models how the cochlea interprets sounds by vibrating at different locations based on the incoming frequencies. The periodogram is an analog for this process, as it measures spectral density at different frequencies. First, we take the Discrete Fourier Transform of every frame. The periodogram power estimate of frame i is then P_i(k) = |S_i(k)|^2 / N, where S_i(k) is the k-th DFT coefficient of the frame and N is the DFT length.

  3. Apply Mel filterbank and sum energies in each filter
    The cochlea can't differentiate between frequencies that are very close to each other. This problem is amplified at higher frequencies, meaning that greater ranges of frequencies will be increasingly interpreted as the same pitch. So, we sum up the signal at various increasing ranges of frequencies to get a measure of the energy density in each range.

    This filterbank is a set of 26 triangular filters. These filters are vectors that are mostly zero, except for a small range of the spectrum. First, we convert frequencies to the Mel scale (which maps a frequency to its perceived pitch). Then we multiply each filter with the power spectrum and sum the result to obtain the filterbank energies. In the end, we have a single coefficient for each filter.

  4. Take the log of all filter energies
    We take the log of the previously calculated filterbank energies because humans can differentiate between low-frequency sounds better than between high-frequency sounds. The shape of the matrix hasn't changed, so we still have 26 coefficients.

  5. Take Discrete Cosine Transform (DCT) of the log filterbank energies
    Because the standard is to create overlapping filterbanks, these energies are correlated, and we use the DCT to decorrelate them. The higher DCT coefficients are then dropped, which has been shown to improve model performance, leaving us with 13 cepstral coefficients.
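
For concreteness, here is a minimal NumPy sketch of the five steps above. The frame lengths, FFT size, filter count, and file name are illustrative assumptions, not necessarily the repository's exact implementation:

    # Minimal MFCC pipeline sketch (steps 1-5 above); parameters are illustrative.
    import numpy as np
    from scipy.fftpack import dct
    from scipy.io import wavfile

    def mfcc(signal, sample_rate, frame_ms=25, hop_ms=10, n_fft=512,
             n_filters=26, n_ceps=13):
        # 1. Bin the raw signal into short overlapping frames (20-40 ms).
        frame_len = int(sample_rate * frame_ms / 1000)
        hop_len = int(sample_rate * hop_ms / 1000)
        n_frames = 1 + (len(signal) - frame_len) // hop_len
        frames = np.stack([signal[i * hop_len:i * hop_len + frame_len]
                           for i in range(n_frames)])
        frames = frames * np.hamming(frame_len)

        # 2. Periodogram power estimate: |DFT|^2 / N for each frame.
        power = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft

        # 3. Mel filterbank: triangular filters spaced evenly on the Mel scale.
        def hz_to_mel(hz):
            return 2595 * np.log10(1 + hz / 700)

        def mel_to_hz(mel):
            return 700 * (10 ** (mel / 2595) - 1)

        mel_points = np.linspace(0, hz_to_mel(sample_rate / 2), n_filters + 2)
        bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
        fbank = np.zeros((n_filters, n_fft // 2 + 1))
        for m in range(1, n_filters + 1):
            left, center, right = bins[m - 1], bins[m], bins[m + 1]
            fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
            fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
        energies = power @ fbank.T  # one coefficient per filter per frame

        # 4. Log of the filterbank energies (hearing is roughly logarithmic).
        log_energies = np.log(energies + 1e-10)

        # 5. DCT to decorrelate; keep the first 13 cepstral coefficients.
        return dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_ceps]

    sample_rate, signal = wavfile.read('speaker.wav')  # hypothetical input file
    features = mfcc(signal.astype(float), sample_rate)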

Resources for learning about MFCCs:

  1. Pratheeksha Nair's Medium Article
  2. Haytham Fayek's Personal Website
  3. Practical Cryptography

Models

From the Keras FAQ: "A Keras model has two modes: training and testing. Regularization mechanisms, such as Dropout and L1/L2 weight regularization, are turned off at testing time.

Besides, the training loss is the average of the losses over each batch of training data. Because your model is changing over time, the loss over the first batches of an epoch is generally higher than over the last batches. On the other hand, the testing loss for an epoch is computed using the model as it is at the end of the epoch, resulting in a lower loss."
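
The quoted behavior is easy to verify directly. A small sketch using TensorFlow's Keras API, showing that Dropout perturbs activations in training mode but is an identity in testing mode:

    # Dropout is active only in training mode; at test time it is a no-op.
    import numpy as np
    import tensorflow as tf

    layer = tf.keras.layers.Dropout(0.5)
    x = np.ones((1, 4), dtype='float32')
    print(layer(x, training=True))   # some entries zeroed, the rest scaled by 2
    print(layer(x, training=False))  # unchanged: all ones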

FFNN (Feed-Forward Neural Network)

The FFNN model architecture is shown as a diagram in the repository.
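
As a rough illustration only (the layer sizes and input shape below are assumptions, not the repository's exact configuration), a comparable Keras feed-forward classifier over flattened MFCC features might look like this:

    # Hypothetical FFNN sketch over flattened MFCC features; sizes are assumed.
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Dropout

    model = Sequential([
        Dense(256, activation='relu', input_shape=(13 * 100,)),  # assumed 13 MFCCs x 100 frames
        Dropout(0.3),
        Dense(128, activation='relu'),
        Dropout(0.3),
        Dense(2, activation='softmax'),  # native vs. non-native
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])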

Classification Results

  • Accuracy: 0.63
  • Recall: 0.75
  • Precision: 0.69
  • F1 score: 0.69

CNN (Convolutional Neural Network)

The CNN model architecture is shown as a diagram in the repository.
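
Again as a rough illustration (not the repository's exact configuration), a comparable Keras CNN could treat the MFCC matrix as a single-channel image:

    # Hypothetical CNN sketch over the 2-D MFCC matrix; sizes are assumed.
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

    model = Sequential([
        Conv2D(32, (3, 3), activation='relu', input_shape=(13, 100, 1)),  # assumed shape
        MaxPooling2D((2, 2)),
        Conv2D(64, (3, 3), activation='relu'),
        MaxPooling2D((2, 2)),
        Flatten(),
        Dense(64, activation='relu'),
        Dropout(0.3),
        Dense(2, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])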

Classification Results

  • Accuracy: 0.90
  • Recall: 0.91
  • Precision: 0.93
  • F1 score: 0.93

LSTM (Long Short-Term Memory)
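
A hypothetical Keras sketch of an LSTM classifier that consumes the MFCC frames as a time series (the layer sizes are assumptions, not the repository's configuration):

    # Hypothetical LSTM sketch over sequences of MFCC frames; sizes are assumed.
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense, Dropout

    model = Sequential([
        LSTM(128, input_shape=(100, 13)),  # assumed 100 frames of 13 MFCCs
        Dropout(0.3),
        Dense(2, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])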

Classification Results

  • Accuracy: 0.87
  • Recall: 0.90
  • Precision: 0.89
  • F1 score: 0.89

Incorrect Classifications

Going back and listening to the files where the model failed led to two conclusions:

  • The majority of the misclassified test data was incorrectly labeled.
  • Most of the remaining misclassified data was problematic because the accent seemed to be a blend, indicating that the speaker may also be fluent in another language.

While it would be best if the model could also correctly classify these blended accents, the mixed accents may not pose a severe problem. Speech recognition systems may not have a problem picking up what these speakers are saying. For example, a speech recognition system trained mainly on US data may be able to pick up reasonably well on a speaker from the UK who has spent a fair amount of time in the US. As long as the model is classifying speakers with more traditional UK accents, we can build another speech recognition model for these speakers.

Future Work

Try to classify specific accents, instead of just native and non-native. We would ultimately like to train the CNN to classify most accents, given more datasets.

Unfortunately, the current dataset is not large enough for multi-accent detection.

speech-accent-detection's People

Contributors

k-farruh


speech-accent-detection's Issues

Inference script

An inference script to apply the pretrained models would be a helpful addition.
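
A minimal sketch of what such a script might look like, assuming librosa for feature extraction and a saved Keras model; the model path, input shape, and label order are hypothetical:

    # Hypothetical inference sketch: load a saved model and classify one file.
    import numpy as np
    import librosa
    from tensorflow.keras.models import load_model

    y, sr = librosa.load('sample.wav', sr=16000)         # illustrative input
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
    x = mfcc[np.newaxis, ..., np.newaxis]                # (1, 13, n_frames, 1) for a CNN
    model = load_model('models/cnn_accent.h5')           # hypothetical path
    pred = model.predict(x)
    print('native' if pred[0].argmax() == 0 else 'non-native')  # assumed label order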

Pkl file not found

Hey, I was trying to run the code locally and it threw this error:

    accent_df = CombineAndCreateDF(dir_external, dir_interim)
    File "c:\Users\Asus\Desktop\speech-accent-detection-master\src\data\make_dataset.py", line 221, in __init__
    with open(path_df_audio_accent, 'rb') as input_:
    FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\Asus\Desktop\speech-accent-detection-master\data\interim\df_accent_gmu.pkl'

Can you help me, please?
