
TensorFlow implementation of "Multimodal Speech Emotion Recognition using Audio and Text," IEEE SLT-18

Home Page: https://arxiv.org/abs/1810.04635

License: MIT License

Topics: speech-emotion-recognition, paralinguistics, multimodal-deep-learning


multimodal-speech-emotion

This repository contains the source code used in the following paper:

Multimodal Speech Emotion Recognition using Audio and Text, IEEE SLT-18, [paper]


[requirements]

tensorflow==1.4 (tested on cuda-8.0, cudnn-6.0)
python==2.7
scikit-learn==0.20.0
nltk==3.3

[download data corpus]

  • IEMOCAP [link] [paper]
  • download the IEMOCAP data from its original web page (a license agreement is required)

[preprocessed-data schema (our approach)]

  • Get the preprocessed dataset [application link]

    If you want to download the "preprocessed dataset," please request a license from the IEMOCAP team first.

  • for preprocessing, refer to the code in "./preprocessing"

  • We cannot publish the ASR-processed transcriptions due to license issues (a commercial API was used); however, we assume it is reasonably straightforward to extract ASR transcripts from the audio signal yourself (we used the Google Cloud Speech API).

  • Format of the data for our experiments (a loading sketch follows the list below):

    MFCC : MFCC features of the audio signal (ex. train_audio_mfcc.npy)
    [#samples, 750, 39] - (#samples, sequence (max 7.5 s), dims)

    MFCC-SEQN : valid length of the audio signal sequence (ex. train_seqN.npy)
    [#samples] - (#samples)

    PROSODY : prosody features of the audio signal (ex. train_audio_prosody.npy)
    [#samples, 35] - (#samples, dims)

    TRANS : sequence of the transcription (indexed) of each utterance (ex. train_nlp_trans.npy)
    [#samples, 128] - (#samples, sequence (max))

    LABEL : target label of the audio signal (ex. train_label.npy)
    [#samples] - (#samples)
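
  • As a quick sanity check, the arrays can be loaded with NumPy and their shapes compared against the schema above. This is only a minimal sketch; the file names follow the examples listed above, and the directory layout is an assumption.

    import numpy as np

    # Assumed location of the preprocessed files; adjust to your own layout.
    DATA_DIR = "./data/processed/IEMOCAP/"

    mfcc    = np.load(DATA_DIR + "train_audio_mfcc.npy")     # (#samples, 750, 39)
    seq_n   = np.load(DATA_DIR + "train_seqN.npy")           # (#samples,)
    prosody = np.load(DATA_DIR + "train_audio_prosody.npy")  # (#samples, 35)
    trans   = np.load(DATA_DIR + "train_nlp_trans.npy")      # (#samples, 128)
    label   = np.load(DATA_DIR + "train_label.npy")          # (#samples,)

    n = mfcc.shape[0]
    assert mfcc.shape[1:] == (750, 39)
    assert prosody.shape == (n, 35)
    assert trans.shape == (n, 128)
    assert seq_n.shape == label.shape == (n,)
    print("loaded %d training samples" % n)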

[source code]

  • the repository contains code for the following models (a conceptual sketch of the multimodal fusion follows the list):

    Audio Recurrent Encoder (ARE)
    Text Recurrent Encoder (TRE)
    Multimodal Dual Recurrent Encoder (MDRE)
    Multimodal Dual Recurrent Encoder with Attention (MDREA)
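
  • For orientation, the sketch below shows the overall MDRE structure described in the paper: a GRU encoder over the audio features and a GRU encoder over the indexed transcript, whose final states are concatenated and fed to a dense softmax classifier. It is a minimal conceptual sketch in TensorFlow 1.x, not the repository's actual implementation; hyperparameters, the prosody input, and the attention variant (MDREA) are omitted.

    import tensorflow as tf

    def mdre_logits(audio, audio_len, text, text_len, vocab_size,
                    n_classes=4, hidden=128, emb_dim=100):
        # Audio Recurrent Encoder: GRU over MFCC frames, keep the last valid state.
        with tf.variable_scope("ARE"):
            _, audio_state = tf.nn.dynamic_rnn(
                tf.nn.rnn_cell.GRUCell(hidden), audio,
                sequence_length=audio_len, dtype=tf.float32)
        # Text Recurrent Encoder: embed token indices, GRU over the transcript.
        with tf.variable_scope("TRE"):
            emb = tf.get_variable("embedding", [vocab_size, emb_dim])
            _, text_state = tf.nn.dynamic_rnn(
                tf.nn.rnn_cell.GRUCell(hidden),
                tf.nn.embedding_lookup(emb, text),
                sequence_length=text_len, dtype=tf.float32)
        # Multimodal fusion: concatenate the two utterance vectors and classify.
        fused = tf.concat([audio_state, text_state], axis=-1)
        return tf.layers.dense(fused, n_classes)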


[training]

  • refer to "reference_script.sh"
  • the final result will be stored in "./TEST_run_result.txt"

[cite]

  • Please cite our paper when you use our code, model, or dataset:

    @inproceedings{yoon2018multimodal,
    title={Multimodal Speech Emotion Recognition Using Audio and Text},
    author={Yoon, Seunghyun and Byun, Seokhyun and Jung, Kyomin},
    booktitle={2018 IEEE Spoken Language Technology Workshop (SLT)},
    pages={112--118},
    year={2018},
    organization={IEEE}
    }


multimodal-speech-emotion's Issues

Wav files dataset

Hi David, I'm doing multimodal sentiment analysis on speech data, so I wanted to ask whether you could send me your raw WAV-file dataset via Google Drive or a repository?

MFCC12EDA.csv does not reach row 4457391

Hello David:
Thank you very much for sharing your code. When I run IEMOCAP_01_wav_to_feature.ipynb in the Preprocessing folder, I only get around 599 MFCC rows for a .wav file, which does not reach 4457391. I don't know how to solve this. Please help me, thank you very much.

Best wishes!

How to get the processed_trans.npy file

IEMOCAP_02_to_four_category.ipynb looks for a file called processed_trans.npy in part 07-A, cell 42:

data = np.load('../data/processed/IEMOCAP/processed_trans.npy')

Apparently, it contains 128-dimensional features for every data point, but I could not find any part of the code that generates this .npy file. The closest I found was the generation of a precessed_tran.csv file in IEMOCAP_00_extract_label_transcription.ipynb, which writes the session IDs and the raw text transcriptions as strings. How can I convert these raw values into the 128-dimensional features in the .npy file?
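
Judging from the TRANS schema above ([#samples, 128] token indices), one plausible way to produce such an array is to tokenize each transcription, map tokens to vocabulary indices, and pad to length 128. The sketch below only illustrates that idea and is not the repository's preprocessing code; the tokenizer choice, the vocabulary construction, and the special <pad>/<unk> indices are assumptions.

    import numpy as np
    from nltk.tokenize import word_tokenize  # needs the nltk 'punkt' data

    MAX_LEN = 128  # sequence length used by the TRANS arrays

    def build_vocab(transcripts):
        # Hypothetical vocabulary: 0 is reserved for padding, 1 for unknown tokens.
        vocab = {"<pad>": 0, "<unk>": 1}
        for text in transcripts:
            for tok in word_tokenize(text.lower()):
                if tok not in vocab:
                    vocab[tok] = len(vocab)
        return vocab

    def index_transcripts(transcripts, vocab):
        out = np.zeros((len(transcripts), MAX_LEN), dtype=np.int32)
        for i, text in enumerate(transcripts):
            ids = [vocab.get(t, 1) for t in word_tokenize(text.lower())][:MAX_LEN]
            out[i, :len(ids)] = ids
        return out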

Do you have any idea why the attention-based model is worse than the RNN model?

Hi, I see that the attention-based model performs worse than the RNN-based model in the original paper. Do you have an idea why?

I also read "SPEECH EMOTION RECOGNITION USING MULTI-HOP ATTENTION MECHANISM", where you suggested that MHA-1, which is based on the attention mechanism, gives better performance. However, I see almost no difference between MHA-1 and the attention model you proposed here, except that you used a Bi-LSTM in the latter.

I noticed that you used a different feature extractor (openSMILE vs. Kaldi). Could this account for the difference?

How to get the recognition rate of individual emotions

Hello, David,
I had some problems when running the code and hope you can give me some help.
I ran 'reference_script.sh' and got the final overall recognition rate in 'TEST_run_result', but I don't know how to get the individual accuracy for the 'happy, angry, sad, neutral' categories, nor how to obtain './inference_log/audio.txt' and './inference_log/audio_label.txt'.
Looking forward to your reply. Thank you very much.
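
If per-utterance predictions and labels are available (for instance from files like the inference logs mentioned above), the per-class recognition rate can be computed with a few lines of NumPy. This is a generic illustration, not code from this repository; the file format and the label order are assumptions.

    import numpy as np

    # Assumed format: one integer class label per line, same order in both files.
    pred = np.loadtxt("./inference_log/audio.txt", dtype=int)
    true = np.loadtxt("./inference_log/audio_label.txt", dtype=int)

    classes = ["happy", "angry", "sad", "neutral"]  # assumed label order
    for c, name in enumerate(classes):
        mask = (true == c)
        acc = (pred[mask] == c).mean() if mask.any() else float("nan")
        print("%-8s accuracy: %.3f  (%d utterances)" % (name, acc, mask.sum()))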

Can you provide WAP code? Thanks!

Hello, David.
I saw your paper and ran the code.
I get a confusion matrix after running analysis/Confusion_Matrix.ipynb,
but I can't compute WAP to evaluate my results.
Can you provide WAP code?
Or can I just compute WAP from the confusion matrix?
Thanks.
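
If WAP is read as weighted average precision (per-class precision weighted by class support), it can indeed be derived from a confusion matrix. The sketch below is a generic illustration under that reading, not the paper's official evaluation code; it assumes the rows of the matrix are true labels and the columns are predictions.

    import numpy as np

    def weighted_average_precision(cm):
        # cm[i, j] = number of utterances with true class i predicted as class j
        cm = np.asarray(cm, dtype=float)
        support = cm.sum(axis=1)        # true count per class
        predicted = cm.sum(axis=0)      # predicted count per class
        precision = np.divide(np.diag(cm), predicted,
                              out=np.zeros_like(predicted), where=predicted > 0)
        return float((support * precision).sum() / support.sum())

    # Toy 4-class example (happy, angry, sad, neutral); replace with your matrix.
    cm = [[50,  5,  3,  2],
          [ 4, 60,  1,  5],
          [ 2,  3, 55, 10],
          [ 6,  7,  9, 78]]
    print(round(weighted_average_precision(cm), 3))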

Application for a visiting Ph.D.

Hi David,
Thanks for your code, which has helped my research. I would like to become a visiting Ph.D. student at your university. Could you give me your e-mail address so we can communicate? I would like to seek your advice.
Thanks very much.
Yours,
Harotu

Unable to open link to download IEMOCAP preprocessed dataset

Hello, David
I've filled out the form to request the dataset, and I've also gotten permission from the IEMOCAP team, but I cannot open the download URL. Could you check whether there is a problem with the URL? I would be grateful for your help!

Different size of prosody features

Following your pre-processing steps, I got the same MFCC and emobase2010 features, but the prosody features are different. I saw in your preprocessing notebook that the size is 10039.
However, this is what I got:
[screenshot of the extracted prosody feature shape]
The size of the prosody features I got is 10081.

I wonder if you used the same arff_targets.conf.inc in the openSMILE config.

Missing arff_targets.conf.inc config file

Hi @david-yoon, first of all, thank you for your great effort on this project.
I tried to preprocess the data, but it gets stuck at prosody extraction. The error is about the missing arff_targets.conf.inc file, which is referenced in standard_data_output.conf.inc.

I can't manage to find it in this repo and don't know how to define it.

Could you help by providing the file or a way to define it?

Thanks
