
demfier / multimodal-speech-emotion-recognition


Lightweight and Interpretable ML Model for Speech Emotion Recognition and Ambiguity Resolution (trained on IEMOCAP dataset)

License: MIT License

Jupyter Notebook 97.19% Python 2.81%
speech-emotion-recognition pytorch scikit-learn pandas librosa multimodal-emotion-recognition python3 iemocap lstm

multimodal-speech-emotion-recognition's Introduction

Multimodal Speech Emotion Recognition and Ambiguity Resolution

Overview

Identifying emotion from speech is a non-trivial task, owing in part to the ambiguous definition of emotion itself. In this work, we build light-weight multimodal machine learning models and compare them against heavier, less interpretable deep learning counterparts. For both types of models, we use hand-crafted features from a given audio signal. Our experiments show that the light-weight models are comparable to the deep learning baselines and even outperform them in some cases, achieving state-of-the-art performance on the IEMOCAP dataset.

The hand-crafted feature vectors obtained are used to train two types of models (a minimal training sketch follows the list):

  1. ML-based: Logistic Regression, SVMs, Random Forest, eXtreme Gradient Boosting and Multinomial Naive-Bayes.
  2. DL-based: Multi-Layer Perceptron and LSTM Classifier.
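
For illustration, here is a minimal sketch of training one of the ML-based models on the hand-crafted features. This is not the repository's notebook code; the CSV path and the wav_file/label column names are assumptions based on the feature-extraction code shown later in this document.

```python
# Minimal sketch (not the notebook code): train a Random Forest on the
# hand-crafted audio features. Path and column names are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv('data/pre-processed/audio_features.csv')  # hypothetical path
X = df.drop(columns=['wav_file', 'label'], errors='ignore').values
y = df['label'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(n_estimators=300, random_state=42).fit(X_train, y_train)
print('accuracy:', accuracy_score(y_test, clf.predict(X_test)))
```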

This project was carried out as a course project for CS 698 - Computational Audio, taught by Prof. Richard Mann at the University of Waterloo. For a more detailed explanation, please check the report.

Datasets

The IEMOCAP dataset was used for all the experiments in this work. Please refer to the report for a detailed explanation of pre-processing steps applied to the dataset.

Requirements

All the experiments have been tested using the following libraries:

  • xgboost==0.82
  • torch==1.0.1.post2
  • scikit-learn==0.20.3
  • numpy==1.16.2
  • jupyter==1.0.0
  • pandas==0.24.1
  • librosa==0.7.0

To avoid conflicts, it is recommended to set up a new Python virtual environment to install these libraries. Once the environment is set up, run pip install -r requirements.txt to install the dependencies.

Instructions to run the code

  1. Clone this repository by running git clone git@github.com:Demfier/multimodal-speech-emotion-recognition.
  2. Go to the root directory of this project by running cd multimodal-speech-emotion-recognition/ in your terminal.
  3. Start a jupyter notebook by running jupyter notebook from the root of this project.
  4. Run 1_extract_emotion_labels.ipynb to extract labels from the transcriptions and compile the other required data into a CSV.
  5. Run 2_build_audio_vectors.ipynb to build vectors from the original wav files and save them into a pickle file.
  6. Run 3_extract_audio_features.ipynb to extract 8-dimensional audio feature vectors for the audio vectors.
  7. Run 4_prepare_data.ipynb to preprocess and prepare the audio + text data for the experiments.
  8. It is recommended to train the LSTMClassifier before running any of the other experiments for easy comparison with the other models later on (a minimal model sketch is shown after this list):
  • Change config.py for any of the experiment settings. For instance, to train a speech2emotion classifier, make the necessary changes to lstm_classifier/s2e/config.py. The same procedure applies to the text2emotion (t2e) and text+speech2emotion (combined) classifiers.
  • Run python lstm_classifier.py from lstm_classifier/{exp_mode} to train an LSTM classifier for the respective experiment mode (possible values of exp_mode: s2e/t2e/combined).
  9. Run 5_audio_classification.ipynb to train ML classifiers for audio.
  10. Run 5.1_sentence_classification.ipynb to train ML classifiers for text.
  11. Run 5.2_combined_classification.ipynb to train ML classifiers for audio + text.
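
As a rough illustration of the LSTM-based setup (not the repository's exact LSTMClassifier; the hidden size and number of emotion classes below are placeholders, while the 8-dimensional input matches the hand-crafted audio features from step 6):

```python
import torch
import torch.nn as nn

class SimpleLSTMClassifier(nn.Module):
    """Minimal sketch of an LSTM emotion classifier over feature sequences."""
    def __init__(self, input_size=8, hidden_size=64, num_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, num_classes)

    def forward(self, x):                      # x: (batch, seq_len, input_size)
        rnn_output, _ = self.lstm(x)
        last_hidden = rnn_output[:, -1, :]     # use the last time step
        return torch.log_softmax(self.out(last_hidden), dim=1)

model = SimpleLSTMClassifier()
dummy_batch = torch.randn(2, 5, 8)             # 2 toy sequences of length 5
print(model(dummy_batch).shape)                # torch.Size([2, 4])
```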

Note: Make sure to set the correct model paths in the notebooks; not all paths are relative right now, and this needs some refactoring.

UPDATE: You can access the preprocessed data files here to skip the steps 4-7: https://www.dropbox.com/scl/fo/jdzz2y9nngw9rxsbz9vyj/h?rlkey=bji7zcqclusagzfwa7alm59hx&dl=0

Results

Accuracy, F1-score, precision and recall are reported for the different experiments.
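
For reference, all four metrics can be computed with scikit-learn; the macro averaging below is an assumption, and the exact scheme is specified in the report.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 2, 2, 1, 0]   # toy labels
y_pred = [0, 2, 2, 2, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='macro')
print(accuracy, precision, recall, f1)
```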

Audio

| Models | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|
| RF | 56.0 | 56.0 | 57.2 | 57.3 |
| XGB | 55.6 | 56.0 | 56.9 | 56.8 |
| SVM | 33.7 | 15.2 | 17.4 | 21.5 |
| MNB | 31.3 | 9.1 | 19.6 | 17.2 |
| LR | 33.4 | 14.9 | 17.8 | 20.9 |
| MLP | 41.0 | 36.5 | 42.2 | 35.9 |
| LSTM | 43.6 | 43.4 | 53.2 | 40.6 |
| ARE (4-class) | 56.3 | - | 54.6 | - |
| E1 (4-class) | 56.2 | 45.9 | 67.6 | 48.9 |
| E1 | 56.6 | 55.7 | 57.3 | 57.3 |

E1: Ensemble (RF + XGB + MLP)
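
The exact ensembling scheme is described in the report; one common way to realize an RF + XGB + MLP ensemble is a soft-voting combination, sketched below (not the notebook code):

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

ensemble = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=300)),
        ('xgb', XGBClassifier()),
        ('mlp', MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)),
    ],
    voting='soft',  # average predicted class probabilities; 'hard' would majority-vote
)
# ensemble.fit(X_train, y_train) and ensemble.predict(X_test) as with any sklearn estimator
```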

Text

| Models | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|
| RF | 62.2 | 60.8 | 65.0 | 62.0 |
| XGB | 56.9 | 55.0 | 70.3 | 51.8 |
| SVM | 62.1 | 61.7 | 62.5 | 63.5 |
| MNB | 61.9 | 62.1 | 71.8 | 58.6 |
| LR | 64.2 | 64.3 | 69.5 | 62.3 |
| MLP | 60.6 | 61.5 | 62.4 | 63.0 |
| LSTM | 63.1 | 62.5 | 65.3 | 62.8 |
| TRE (4-class) | 65.5 | - | 63.5 | - |
| E1 (4-class) | 63.1 | 61.4 | 67.7 | 59.0 |
| E2 | 64.9 | 66.0 | 71.4 | 63.2 |

E2: Ensemble (RF + XGB + MLP + MNB + LR)
E1: Ensemble (RF + XGB + MLP)

Audio + Text

| Models | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|
| RF | 65.3 | 65.8 | 69.3 | 65.5 |
| XGB | 62.2 | 63.1 | 67.9 | 61.7 |
| SVM | 63.4 | 63.8 | 63.1 | 65.6 |
| MNB | 60.5 | 60.3 | 70.3 | 57.1 |
| MLP | 66.1 | 68.1 | 68.0 | 69.6 |
| LR | 63.2 | 63.7 | 66.9 | 62.3 |
| LSTM | 64.2 | 64.7 | 66.1 | 65.0 |
| MDRE (4-class) | 75.3 | - | 71.8 | - |
| E1 (4-class) | 70.3 | 67.5 | 73.2 | 65.5 |
| E2 | 70.1 | 71.8 | 72.9 | 71.5 |

For more details, please refer to the report.

Citation

If you find this work useful, please cite:

@article{sahu2019multimodal,
  title={Multimodal Speech Emotion Recognition and Ambiguity Resolution},
  author={Sahu, Gaurav},
  journal={arXiv preprint arXiv:1904.06022},
  year={2019}
}

multimodal-speech-emotion-recognition's People

Contributors

demfier, kahliang


multimodal-speech-emotion-recognition's Issues

Add SAVEE dataset too

Currently we need to oversample for the fear and sad classes. Maybe SAVEE contains training examples for those categories.

run in 2 hours vs 10

Used multiprocessing with 100 processes; the optimal number may vary based on the machine.
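
A rough sketch of this kind of parallelization is shown below; extract_features_for_session is a hypothetical helper standing in for the per-session feature-extraction loop, and the pool size should be tuned to the machine.

```python
from multiprocessing import Pool

def extract_features_for_session(sess):
    # Hypothetical helper: load the pickled audio vectors for one IEMOCAP session
    # and return the hand-crafted features for all of its utterances.
    ...

if __name__ == '__main__':
    with Pool(processes=8) as pool:  # the issue used 100; tune per machine
        results = pool.map(extract_features_for_session, range(1, 6))
```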

extract_audio_features is not giving valid csv output

The final block of the third file, 3_extract_audio_features, is not giving a proper output file. It prints "Some exception occurred" for the whole file. What might be the issue here?

```python
labels_df = pd.read_csv(labels_path)

for sess in range(1, 6):
    audio_vectors = pickle.load(open('{}{}.pkl'.format(audio_vectors_path, sess), 'rb'))
    for index, row in tqdm(labels_df[labels_df['wav_file'].str.contains('Ses0{}'.format(sess))].iterrows()):
        try:
            wav_file_name = row['wav_file']
            label = emotion_dict[row['emotion']]
            y = audio_vectors[wav_file_name]

            feature_list = [wav_file_name, label]  # wav_file, label
            sig_mean = np.mean(abs(y))
            feature_list.append(sig_mean)  # sig_mean
            feature_list.append(np.std(y))  # sig_std

            rmse = librosa.feature.rmse(y + 0.0001)[0]
            feature_list.append(np.mean(rmse))  # rmse_mean
            feature_list.append(np.std(rmse))  # rmse_std

            silence = 0
            for e in rmse:
                if e <= 0.4 * np.mean(rmse):
                    silence += 1
            silence /= float(len(rmse))
            feature_list.append(silence)  # silence

            y_harmonic = librosa.effects.hpss(y)[0]
            feature_list.append(np.mean(y_harmonic) * 1000)  # harmonic (scaled by 1000)

            # based on the pitch detection algorithm
            cl = 0.45 * sig_mean
            center_clipped = []
            for s in y:
                if s >= cl:
                    center_clipped.append(s - cl)
                elif s <= -cl:
                    center_clipped.append(s + cl)
                elif np.abs(s) < cl:
                    center_clipped.append(0)
            auto_corrs = librosa.core.autocorrelate(np.array(center_clipped))
            feature_list.append(1000 * np.max(auto_corrs) / len(auto_corrs))  # auto_corr_max (scaled by 1000)
            feature_list.append(np.std(auto_corrs))  # auto_corr_std

            df_features = df_features.append(pd.DataFrame(feature_list, index=columns).transpose(), ignore_index=True)
        except:
            print('Some exception occured')

df_features.to_csv('/content/drive/My Drive/data/pre-processed/audio_features.csv', index=False)
```

It only creates a CSV file of around 100 bytes. Thanks in advance!
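
One way to narrow this down (a debugging sketch, not a fix): catch and print the actual exception instead of the generic message, so the failing call becomes visible.

```python
import numpy as np
import librosa

y = np.random.randn(16000).astype(np.float32)   # toy signal standing in for a wav vector

try:
    rmse = librosa.feature.rmse(y + 0.0001)[0]   # same call as in the loop above
except Exception as e:
    # Printing the real exception shows whether the failure comes from a librosa
    # API change, a missing wav_file key, or something else entirely.
    print('feature extraction failed:', repr(e))
```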

Questions about document handling.

I want to know what changes were made to the two CSV files (audio_test.csv, audio_train.csv) in data/s2e/ to generate modified_df_test.csv and modified_df_train.csv.
I ran into an error when using torch.LongTensor(data[1]) directly on the original two CSV files while running lstm_classifier.py, and the repository does not explain how to generate the two new CSV files.

s2e Utils.py

In utils.py of s2e, the code branches on test vs. train, but the pre-processing steps do not produce files such as modified_df_train.csv / modified_df_test.csv. Also, when we try to run the s2e lstm_classifier by renaming modified_df to modified_df_train/test, we get an error in FloatTensor(input_batch) about str values in utils.
I dropped wav_file, Unnamed: 0 and label from the dataframe to make it work, but now it says "input.size(-1) must be equal to input_size. Expected 8, got 7".

dataset

Can you share the dataset folder containing the audio_vectors and CSV files?

Make unified data splits for all modes

Currently, I have duplicated code from lstm_classifier.py into the notebook because the data splits in modified_df_*.csv are not the same as the ones in the notebook. If I add such a global split, running experiments would be a breeze and the aforementioned code duplication could be avoided.

Compile Results


wrong code at t2e/lstm_classifier.py

Lines 42-43:

```python
# sum hidden states
class_scores = F.softmax(self.out(rnn_output), dim=1)
```

should be changed to:

```python
# sum hidden states
class_scores = F.log_softmax(self.out(rnn_output), dim=1)
```
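
For context (I have not verified which loss the repository's training loop uses): this change matters whenever the loss is NLLLoss, which expects log-probabilities rather than probabilities.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 6)            # toy (batch, num_classes) scores
targets = torch.tensor([0, 2, 5, 1])

# NLLLoss expects log-probabilities, so it must be paired with log_softmax.
loss_ok = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# Feeding plain softmax probabilities runs without error but gives a meaningless loss.
loss_wrong = F.nll_loss(F.softmax(logits, dim=1), targets)

# Equivalent alternative: keep raw logits and use cross_entropy (log_softmax + NLLLoss).
loss_ce = F.cross_entropy(logits, targets)
print(loss_ok.item(), loss_wrong.item(), loss_ce.item())
```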

Some exception occured

Hello @Demfier
I got a lot of "Some exception occured" messages during the third step, and it had a bad effect on the fourth step, i.e. prepare_data. Could you please tell me how to fix this? Thanks in advance!

ValueError: too many dimensions 'str'

When I run lstm_classifier.py, I get the error mentioned above.

Here is the traceback:

```
Traceback (most recent call last):
  File "/content/drive/MyDrive/IS698 Project/lstm.py", line 56, in
    train_batches = load_data()
  File "/content/drive/MyDrive/IS698 Project/md_utils.py", line 35, in load_data
    batches.append([torch.FloatTensor(input_batch),
ValueError: too many dimensions 'str'
```

I am also getting an error when running the s2e model. Can you please help me out?
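
This error typically means the array handed to torch.FloatTensor still contains string values (e.g. the wav file name column). A minimal sketch of the usual workaround, with the column names as assumptions:

```python
import pandas as pd
import torch

# Toy stand-in for modified_df_train.csv; column names are assumptions.
df = pd.DataFrame({
    'wav_file': ['Ses01F_impro01_F000', 'Ses01F_impro01_F001'],  # string column
    'label': [2, 0],                                             # assumed already numeric
    'sig_mean': [0.011, 0.034],
    'sig_std': [0.020, 0.051],
})

# torch.FloatTensor(df.values) fails with "too many dimensions 'str'" because of the
# string column; keep the label separately and feed only numeric feature columns.
labels = torch.LongTensor(df['label'].values)
features = torch.FloatTensor(
    df.drop(columns=['wav_file', 'label']).select_dtypes(include='number').values
)
print(features.shape)  # torch.Size([2, 2])
```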

The labels of the confusion matrix seem to be swapped

Hi, I think you have swapped the x and y labels in the plot_confusion_matrix method in the notebooks.

It confused me, so I thought I should ask you to check that. However, that is only a small thing in this excellent repo.

Regards,
Kai Karren
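
I have not checked the notebook's plot_confusion_matrix, but for reference, scikit-learn's confusion_matrix puts true labels on the rows and predicted labels on the columns, so the axis labels should follow that convention:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 2, 2, 1, 0]   # toy labels
y_pred = [0, 2, 2, 2, 1, 0]

cm = confusion_matrix(y_true, y_pred)  # rows: true labels, columns: predicted labels
plt.imshow(cm, cmap='Blues')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.colorbar()
plt.show()
```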
