Multimodal Speech Emotion Recognition using Audio and Text, IEEE SLT-18, [paper]
- IEMOCAP [link] [paper]
- download the IEMOCAP data from its original webpage (a license agreement is required)
- for the preprocessing, refer to the code in "./preprocessing" (a feature-extraction sketch follows below)
- this part comes from the paper author's repository: https://github.com/david-yoon/multimodal-speech-emotion
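As a rough illustration of the audio features the preprocessing produces, here is a minimal sketch using librosa. Librosa itself, the 16 kHz sample rate, and the 39-dim / 750-frame sizes are assumptions for illustration only; the scripts in "./preprocessing" define the author's actual pipeline.

```python
# Minimal sketch of MFCC extraction, assuming librosa; the real
# frame settings and feature dimensions live in "./preprocessing".
import numpy as np
import librosa

def extract_mfcc(wav_path, n_mfcc=39, max_len=750):
    """Return a fixed-length (max_len, n_mfcc) MFCC matrix and its valid length."""
    y, sr = librosa.load(wav_path, sr=16000)  # IEMOCAP audio is 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, n_mfcc)
    seq_n = min(mfcc.shape[0], max_len)       # valid length, cf. *_seqN.npy
    padded = np.zeros((max_len, n_mfcc), dtype=np.float32)
    padded[:seq_n] = mfcc[:seq_n]
    return padded, seq_n
```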
- Examples of the preprocessed data files (a loading sketch follows this list)
MFCC : MFCC features of the audio signal (e.g., train_audio_mfcc.npy)
MFCC-SEQN : valid length of each audio sequence (e.g., train_seqN.npy)
PROSODY : prosody features of the audio signal (e.g., train_audio_prosody.npy)
LABEL : target label of the audio signal (e.g., train_label.npy)
TRANS : indexed transcription sequences of the data (e.g., train_nlp_trans.npy)
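A quick way to sanity-check the preprocessed arrays is to load them with NumPy; the "./data" directory below is an assumption, so adjust the path to wherever the preprocessing step wrote its output.

```python
# Sanity-check the preprocessed arrays; file names follow the list above.
# The "./data" directory is an assumption — point it at the actual
# output location of the preprocessing step.
import numpy as np

for name in ["train_audio_mfcc", "train_seqN", "train_audio_prosody",
             "train_label", "train_nlp_trans"]:
    arr = np.load(f"./data/{name}.npy")
    print(f"{name}: shape={arr.shape}, dtype={arr.dtype}")
```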
- run "train_script.sh"
- or run "single_text_trainer.py", "single_audio_trainer.py", "multi_modal_trainer.py", and "multi_modal_attn_trainer.py" manually with python (a runner sketch follows after this list)
- run "evaluate.ipynb" to inference testing data with 4 different models
- and then, run "analysis/Confusion_Matrix.ipynb" to plot the confusion matrix
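If "train_script.sh" does not fit your setup, the four trainers can also be launched sequentially from Python; this subprocess wrapper is a convenience sketch and not part of the original repository.

```python
# Convenience sketch: launch the four trainers sequentially, mirroring
# what "train_script.sh" presumably does. Any command-line flags the
# trainers expect must be appended to each command.
import subprocess
import sys

TRAINERS = [
    "single_text_trainer.py",
    "single_audio_trainer.py",
    "multi_modal_trainer.py",
    "multi_modal_attn_trainer.py",
]

for script in TRAINERS:
    print(f"--- running {script} ---")
    subprocess.run([sys.executable, script], check=True)  # abort on first failure
```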
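For the final step, a confusion matrix can be rebuilt from saved predictions with scikit-learn and matplotlib. "analysis/Confusion_Matrix.ipynb" remains the reference implementation; the four emotion classes below are an assumption based on the common 4-class IEMOCAP setup.

```python
# Sketch of the confusion-matrix plot; "analysis/Confusion_Matrix.ipynb"
# is the reference. The four emotion classes are an assumption based on
# the common 4-class IEMOCAP setup (angry / happy / sad / neutral).
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

def plot_confusion(y_true, y_pred, labels=("angry", "happy", "sad", "neutral")):
    cm = confusion_matrix(y_true, y_pred, normalize="true")  # row-normalized
    fig, ax = plt.subplots()
    im = ax.imshow(cm, cmap="Blues", vmin=0.0, vmax=1.0)
    ax.set_xticks(range(len(labels)))
    ax.set_xticklabels(labels)
    ax.set_yticks(range(len(labels)))
    ax.set_yticklabels(labels)
    ax.set_xlabel("predicted")
    ax.set_ylabel("true")
    for i in range(len(labels)):
        for j in range(len(labels)):
            ax.text(j, i, f"{cm[i, j]:.2f}", ha="center", va="center")
    fig.colorbar(im, ax=ax)
    plt.show()
```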