
TensorFlow implementation of "Multimodal Speech Emotion Recognition using Audio and Text," IEEE SLT-18

Home Page: https://arxiv.org/abs/1810.04635

License: MIT License

Topics: speech-emotion-recognition, paralinguistics, multimodal-deep-learning


multimodal-speech-emotion

This repository contains the source code used in the following paper:

Multimodal Speech Emotion Recognition using Audio and Text, IEEE SLT-18, [paper]


[requirements]

tensorflow==1.4 (tested on cuda-8.0, cudnn-6.0)
python==2.7
scikit-learn==0.20.0
nltk==3.3

[download data corpus]

  • IEMOCAP [link] [paper]
  • download the IEMOCAP data from its original web page (a license agreement is required)

[preprocessed-data schema (our approach)]

  • Get the preprocessed dataset [application link]

    If you want to download the "preprocessed dataset," please request a license from the IEMOCAP team first.

  • for preprocessing, refer to the code in "./preprocessing"

  • We cannot publish the ASR-processed transcriptions due to license issues (a commercial API was used); however, we assume it is reasonably straightforward to extract ASR transcripts from the audio signal yourself (we used the Google Cloud Speech API).

  • Format of the data for our experiments (a loading sketch follows the list below):

    MFCC : MFCC features of the audio signal (ex. train_audio_mfcc.npy)
    [#samples, 750, 39] - (#samples, sequence (max 7.5 s), dims)

    MFCC-SEQN : valid length of the audio signal sequence (ex. train_seqN.npy)
    [#samples] - (#samples)

    PROSODY : prosody features of the audio signal (ex. train_audio_prosody.npy)
    [#samples, 35] - (#samples, dims)

    TRANS : sequence of the transcription (indexed) of each utterance (ex. train_nlp_trans.npy)
    [#samples, 128] - (#samples, sequence (max))

    LABEL : target label of the audio signal (ex. train_label.npy)
    [#samples] - (#samples)
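
  • As a quick sanity check, the arrays can be loaded with NumPy and their shapes compared against the schema above. This is only a minimal sketch; the file names follow the examples listed above, and the directory layout is an assumption.

    import numpy as np

    # Assumed location of the preprocessed files; adjust to your own layout.
    DATA_DIR = "./data/processed/IEMOCAP/"

    mfcc    = np.load(DATA_DIR + "train_audio_mfcc.npy")     # (#samples, 750, 39)
    seq_n   = np.load(DATA_DIR + "train_seqN.npy")           # (#samples,)
    prosody = np.load(DATA_DIR + "train_audio_prosody.npy")  # (#samples, 35)
    trans   = np.load(DATA_DIR + "train_nlp_trans.npy")      # (#samples, 128)
    label   = np.load(DATA_DIR + "train_label.npy")          # (#samples,)

    n = mfcc.shape[0]
    assert mfcc.shape[1:] == (750, 39)
    assert prosody.shape == (n, 35)
    assert trans.shape == (n, 128)
    assert seq_n.shape == label.shape == (n,)
    print("loaded %d training samples" % n)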

[source code]

  • the repository contains code for the following models (a conceptual sketch of the multimodal fusion follows the list):

    Audio Recurrent Encoder (ARE)
    Text Recurrent Encoder (TRE)
    Multimodal Dual Recurrent Encoder (MDRE)
    Multimodal Dual Recurrent Encoder with Attention (MDREA)
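
  • For orientation, the sketch below shows the overall MDRE structure described in the paper: a GRU encoder over the audio features and a GRU encoder over the indexed transcript, whose final states are concatenated and fed to a dense softmax classifier. It is a minimal conceptual sketch in TensorFlow 1.x, not the repository's actual implementation; hyperparameters, the prosody input, and the attention variant (MDREA) are omitted.

    import tensorflow as tf

    def mdre_logits(audio, audio_len, text, text_len, vocab_size,
                    n_classes=4, hidden=128, emb_dim=100):
        # Audio Recurrent Encoder: GRU over MFCC frames, keep the last valid state.
        with tf.variable_scope("ARE"):
            _, audio_state = tf.nn.dynamic_rnn(
                tf.nn.rnn_cell.GRUCell(hidden), audio,
                sequence_length=audio_len, dtype=tf.float32)
        # Text Recurrent Encoder: embed token indices, GRU over the transcript.
        with tf.variable_scope("TRE"):
            emb = tf.get_variable("embedding", [vocab_size, emb_dim])
            _, text_state = tf.nn.dynamic_rnn(
                tf.nn.rnn_cell.GRUCell(hidden),
                tf.nn.embedding_lookup(emb, text),
                sequence_length=text_len, dtype=tf.float32)
        # Multimodal fusion: concatenate the two utterance vectors and classify.
        fused = tf.concat([audio_state, text_state], axis=-1)
        return tf.layers.dense(fused, n_classes)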


[training]

  • refer to "reference_script.sh"
  • the final result will be stored in "./TEST_run_result.txt"

[cite]

  • Please cite our paper when you use our code, model, or dataset:

    @inproceedings{yoon2018multimodal,
    title={Multimodal Speech Emotion Recognition Using Audio and Text},
    author={Yoon, Seunghyun and Byun, Seokhyun and Jung, Kyomin},
    booktitle={2018 IEEE Spoken Language Technology Workshop (SLT)},
    pages={112--118},
    year={2018},
    organization={IEEE}
    }


multimodal-speech-emotion's Issues

Wav files dataset

Hi David, I'm doing multimodal sentiment analysis on speech data, so I wanted to ask whether you could send me your raw WAV-file dataset via Google Drive or a repository?

MFCC12EDA.csv does not reach row 4457391

Hello David:
Thank you very much for sharing your code. When I run IEMOCAP_01_wav_to_feature.ipynb in the Preprocessing folder, I only get around 599 MFCC rows for a .wav file, which does not reach 4457391. I don't know how to solve this. Please help me, thank you very much.

Best wishes!

How to get the processed_trans.npy file

IEMOCAP_02_to_four_category.ipynb looks for a file called processed_trans.npy in part 07-A, cell 42:

data = np.load('../data/processed/IEMOCAP/processed_trans.npy')

Apparently, it contains 128-dimensional features for every data point, but I could not find any part of the code that generates this .npy file. The closest I found was the generation of a precessed_tran.csv file in IEMOCAP_00_extract_label_transcription.ipynb, which writes the session IDs and the raw text transcriptions as strings. How can I convert these raw values into the 128-dimensional features in the .npy file?
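
Judging from the TRANS schema above ([#samples, 128] token indices), one plausible way to produce such an array is to tokenize each transcription, map tokens to vocabulary indices, and pad to length 128. The sketch below only illustrates that idea and is not the repository's preprocessing code; the tokenizer choice, the vocabulary construction, and the special <pad>/<unk> indices are assumptions.

    import numpy as np
    from nltk.tokenize import word_tokenize  # needs the nltk 'punkt' data

    MAX_LEN = 128  # sequence length used by the TRANS arrays

    def build_vocab(transcripts):
        # Hypothetical vocabulary: 0 is reserved for padding, 1 for unknown tokens.
        vocab = {"<pad>": 0, "<unk>": 1}
        for text in transcripts:
            for tok in word_tokenize(text.lower()):
                if tok not in vocab:
                    vocab[tok] = len(vocab)
        return vocab

    def index_transcripts(transcripts, vocab):
        out = np.zeros((len(transcripts), MAX_LEN), dtype=np.int32)
        for i, text in enumerate(transcripts):
            ids = [vocab.get(t, 1) for t in word_tokenize(text.lower())][:MAX_LEN]
            out[i, :len(ids)] = ids
        return out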

Do you have any idea why the attention-based model is worse than the RNN model?

Hi, I see that the attention-based model performs worse than the RNN-based model in the original paper. Do you have an idea why?

I also read "SPEECH EMOTION RECOGNITION USING MULTI-HOP ATTENTION MECHANISM", where you suggested that MHA-1, which is based on the attention mechanism, gives better performance. However, I see almost no difference between MHA-1 and the attention model you proposed here, except that you used a Bi-LSTM in the latter.

I noticed that you used a different feature extractor (openSMILE vs. Kaldi). Could this account for the difference?

How to get the recognition rate of individual emotions

Hello, David,
I had some problems when running the code and hope you can give me some help.
I ran 'reference_script.sh' and got the final overall recognition rate in 'TEST_run_result', but I don't know how to get the individual accuracy for the 'happy, angry, sad, neutral' categories, nor how to obtain './inference_log/audio.txt' and './inference_log/audio_label.txt'.
Looking forward to your reply. Thank you very much.
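
If per-utterance predictions and labels are available (for instance from files like the inference logs mentioned above), the per-class recognition rate can be computed with a few lines of NumPy. This is a generic illustration, not code from this repository; the file format and the label order are assumptions.

    import numpy as np

    # Assumed format: one integer class label per line, same order in both files.
    pred = np.loadtxt("./inference_log/audio.txt", dtype=int)
    true = np.loadtxt("./inference_log/audio_label.txt", dtype=int)

    classes = ["happy", "angry", "sad", "neutral"]  # assumed label order
    for c, name in enumerate(classes):
        mask = (true == c)
        acc = (pred[mask] == c).mean() if mask.any() else float("nan")
        print("%-8s accuracy: %.3f  (%d utterances)" % (name, acc, mask.sum()))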

Can you provide WAP code? Thanks!

Hello, David.
I saw your paper and ran the code.
I get a confusion matrix after running analysis/Confusion_Matrix.ipynb,
but I can't compute WAP to evaluate my results.
Can you provide WAP code?
Or can I just compute WAP from the confusion matrix?
Thanks.
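
If WAP is read as weighted average precision (per-class precision weighted by class support), it can indeed be derived from a confusion matrix. The sketch below is a generic illustration under that reading, not the paper's official evaluation code; it assumes the rows of the matrix are true labels and the columns are predictions.

    import numpy as np

    def weighted_average_precision(cm):
        # cm[i, j] = number of utterances with true class i predicted as class j
        cm = np.asarray(cm, dtype=float)
        support = cm.sum(axis=1)        # true count per class
        predicted = cm.sum(axis=0)      # predicted count per class
        precision = np.divide(np.diag(cm), predicted,
                              out=np.zeros_like(predicted), where=predicted > 0)
        return float((support * precision).sum() / support.sum())

    # Toy 4-class example (happy, angry, sad, neutral); replace with your matrix.
    cm = [[50,  5,  3,  2],
          [ 4, 60,  1,  5],
          [ 2,  3, 55, 10],
          [ 6,  7,  9, 78]]
    print(round(weighted_average_precision(cm), 3))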

Application for a visiting Ph.D.

Hi David,
Thanks for your code, which has helped my research. I would like to become a visiting Ph.D. student at your university. Could you give me your e-mail address so we can communicate? I would like to seek your advice.
Thanks very much.
Yours,
Harotu

Unable to open link to download IEMOCAP preprocessed dataset

Hello, David
I've filled out the form to request the dataset, and I've also gotten permission from the IEMOCAP team, but I cannot open the download URL. Could you check whether there is a problem with the URL? I would be grateful for your help!

Different size of prosody features

Following your pre-processing steps, I got the same MFCC and emobase2010 features, but the prosody features are different. I saw in your preprocessing notebook that the size is 10039.
However, this is what I got:
[screenshot of the extracted prosody feature shape]
The size of the prosody features I got is 10081.

I wonder if you used the same arff_targets.conf.inc in the openSMILE config.

Missing arff_targets.conf.inc config file

Hi @david-yoon, first of all, thank you for your great effort on this project.
I tried to preprocess the data, but it gets stuck at prosody extraction. The error is about the missing arff_targets.conf.inc file, which is referenced in standard_data_output.conf.inc.

I can't manage to find it in this repo and don't know how to define it.

Could you help by providing the file or a way to define it?

Thanks
