
Visual Speech Recognition for Multiple Languages

Authors

Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, Maja Pantic.

Update

2023-07-26: We have released our training recipe for real-time AV-ASR, see here.

2023-06-16: We have released our training recipe for AutoAVSR, see here.

2023-03-27: We have released our AutoAVSR models for LRS3, see here.

Introduction

This is the repository of Visual Speech Recognition for Multiple Languages, the successor of End-to-End Audio-Visual Speech Recognition with Conformers. With this repository, you can achieve WERs of 19.1%, 1.0%, and 0.9% for visual, audio-only, and audio-visual speech recognition (VSR, ASR, and AV-ASR), respectively, on LRS3.

Tutorial

We provide a tutorial (Open In Colab) showing how to use our Auto-AVSR models to perform speech recognition (ASR, VSR, and AV-ASR), crop mouth ROIs, and extract visual speech features.

Demo

English -> Mandarin -> Spanish | French -> Portuguese -> Italian

Preparation

  1. Clone the repository and enter it:
git clone https://github.com/mpc001/Visual_Speech_Recognition_for_Multiple_Languages
cd Visual_Speech_Recognition_for_Multiple_Languages
  2. Set up the environment:
conda create -y -n autoavsr python=3.8
conda activate autoavsr
  3. Install PyTorch, torchvision, and torchaudio by following the instructions here, then install the remaining packages:
pip install -r requirements.txt
conda install -c conda-forge ffmpeg
  4. Download and extract a pre-trained model and/or language model from the model zoo into the following directories (see the example layout after this list):
  • ./benchmarks/${dataset}/models

  • ./benchmarks/${dataset}/language_models

  5. [For VSR and AV-ASR] Install the RetinaFace or MediaPipe face tracker.
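
For example, after downloading a model package for LRS3, extraction might look like the following. The archive names are hypothetical placeholders; use the file names you actually downloaded from the model zoo.

# archive names below are hypothetical placeholders
unzip LRS3_models.zip -d ./benchmarks/LRS3/models/
unzip LRS3_language_models.zip -d ./benchmarks/LRS3/language_models/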

Benchmark evaluation

python eval.py config_filename=[config_filename] \
               labels_filename=[labels_filename] \
               data_dir=[data_dir] \
               landmarks_dir=[landmarks_dir]
  • [config_filename] is the model configuration path, located in ./configs.

  • [labels_filename] is the labels path, located in ${lipreading_root}/benchmarks/${dataset}/labels.

  • [data_dir] and [landmarks_dir] are the directories of the original dataset and the corresponding landmarks, respectively.

  • gpu_idx=-1 can be added to switch from cuda:0 to cpu.
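
For example, an evaluation of a visual-only model on LRS3 might look like the following. The configuration and label file names are illustrative; use the files that ship with the model you downloaded.

# file names below are illustrative placeholders
python eval.py config_filename=./configs/LRS3_V_WER19.1.ini \
               labels_filename=./benchmarks/LRS3/labels/lrs3_test.csv \
               data_dir=/path/to/LRS3 \
               landmarks_dir=/path/to/LRS3_landmarks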

Speech prediction

python infer.py config_filename=[config_filename] data_filename=[data_filename]
  • [data_filename] is the path to the audio/video file.

  • detector=mediapipe can be added to switch from RetinaFace to MediaPipe tracker.
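
For example, to run VSR on a single video with the MediaPipe tracker (the configuration name and video path are illustrative):

# file names below are illustrative placeholders
python infer.py config_filename=./configs/LRS3_V_WER19.1.ini \
                data_filename=./example/video.mp4 \
                detector=mediapipe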

Mouth ROIs cropping

python crop_mouth.py data_filename=[data_filename] dst_filename=[dst_filename]
  • [dst_filename] is the path where the cropped mouth ROIs will be saved.
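
For example (paths are illustrative):

# file names below are illustrative placeholders
python crop_mouth.py data_filename=./example/video.mp4 \
                     dst_filename=./example/video_roi.mp4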

Model zoo

Overview

We support a number of datasets for speech recognition:

AutoAVSR models

Lip Reading Sentences 3 (LRS3)

Components        WER    URL                                     Size (MB)
Visual-only       19.1   GoogleDrive or BaiduDrive (key: dqsy)   891
Audio-only        1.0    GoogleDrive or BaiduDrive (key: dvf2)   860
Audio-visual      0.9    GoogleDrive or BaiduDrive (key: sai5)   1540
Language models   -      GoogleDrive or BaiduDrive (key: t9ep)   191
Landmarks         -      GoogleDrive or BaiduDrive (key: mi3c)   18577

VSR models for multiple languages

Lip Reading Sentences 2 (LRS2)

Components        WER    URL                                     Size (MB)
Visual-only       26.1   GoogleDrive or BaiduDrive (key: 48l1)   186
Language models   -      GoogleDrive or BaiduDrive (key: 59u2)   180
Landmarks         -      GoogleDrive or BaiduDrive (key: 53rc)   9358

Lip Reading Sentences 3 (LRS3)

Components        WER    URL                                     Size (MB)
Visual-only       32.3   GoogleDrive or BaiduDrive (key: 1b1s)   186
Language models   -      GoogleDrive or BaiduDrive (key: 59u2)   180
Landmarks         -      GoogleDrive or BaiduDrive (key: mi3c)   18577

Chinese Mandarin Lip Reading (CMLR)

Components        CER    URL                                     Size (MB)
Visual-only       8.0    GoogleDrive or BaiduDrive (key: 7eq1)   195
Language models   -      GoogleDrive or BaiduDrive (key: k8iv)   187
Landmarks         -      GoogleDrive or BaiduDrive (key: 1ret)   3721

CMU Multimodal Opinion Sentiment, Emotions and Attributes (CMU-MOSEAS)

Components                    WER    URL                                     Size (MB)
Visual-only (Spanish)         44.5   GoogleDrive or BaiduDrive (key: m35h)   186
Visual-only (Portuguese)      51.4   GoogleDrive or BaiduDrive (key: wk2h)   186
Visual-only (French)          58.6   GoogleDrive or BaiduDrive (key: t1hf)   186
Language models (Spanish)     -      GoogleDrive or BaiduDrive (key: 0mii)   180
Language models (Portuguese)  -      GoogleDrive or BaiduDrive (key: l6ag)   179
Language models (French)      -      GoogleDrive or BaiduDrive (key: 6tan)   179
Landmarks                     -      GoogleDrive or BaiduDrive (key: vsic)   3040

GRID

Components                WER    URL                                     Size (MB)
Visual-only (Overlapped)  1.2    GoogleDrive or BaiduDrive (key: d8d2)   186
Visual-only (Unseen)      4.8    GoogleDrive or BaiduDrive (key: ttsh)   186
Landmarks                 -      GoogleDrive or BaiduDrive (key: 16l9)   1141

You can include data_ext=.mpg in your command line to match the video file extension in the GRID dataset.
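
For example, a GRID evaluation command might look like the following (the configuration, labels, and paths are illustrative):

# file names below are illustrative placeholders
python eval.py config_filename=./configs/GRID_V_WER1.2.ini \
               labels_filename=./benchmarks/GRID/labels/test.csv \
               data_dir=/path/to/GRID \
               landmarks_dir=/path/to/GRID_landmarks \
               data_ext=.mpg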

Lombard GRID

Components                         WER    URL                                     Size (MB)
Visual-only (Unseen, Front Plain)  4.9    GoogleDrive or BaiduDrive (key: 38ds)   186
Visual-only (Unseen, Side Plain)   8.0    GoogleDrive or BaiduDrive (key: k6m0)   186
Landmarks                          -      GoogleDrive or BaiduDrive (key: cusv)   309

You can include data_ext=.mov in your command line to match the video file extension in the Lombard GRID dataset.

TCD-TIMIT

Components                WER    URL                                     Size (MB)
Visual-only (Overlapped)  16.9   GoogleDrive or BaiduDrive (key: jh65)   186
Visual-only (Unseen)      21.8   GoogleDrive or BaiduDrive (key: n2gr)   186
Language models           -      GoogleDrive or BaiduDrive (key: 59u2)   180
Landmarks                 -      GoogleDrive or BaiduDrive (key: bnm8)   930

Citation

If you use the Auto-AVSR models or training code, please consider citing the following paper:

@inproceedings{ma2023auto,
  author={Ma, Pingchuan and Haliassos, Alexandros and Fernandez-Lopez, Adriana and Chen, Honglie and Petridis, Stavros and Pantic, Maja},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels}, 
  year={2023},
}

If you use the VSR models for multiple languages, please consider citing the following paper:

@article{ma2022visual,
  title={{Visual Speech Recognition for Multiple Languages in the Wild}},
  author={Ma, Pingchuan and Petridis, Stavros and Pantic, Maja},
  journal={{Nature Machine Intelligence}},
  volume={4},
  pages={930--939},
  year={2022},
  url={https://doi.org/10.1038/s42256-022-00550-z},
  doi={10.1038/s42256-022-00550-z}
}

License

Note that the code may be used only for comparative or benchmarking purposes, and only for non-commercial purposes under the terms of the License.

Contact

Pingchuan Ma (pingchuan.ma16[at]imperial.ac.uk)

Issues

Failing to reproduce the result from the demo video

I tried to reproduce the result of the Mandarin lip reading from the following demo video:

https://youtu.be/FIau-6JA9Po?t=33

I made a clip, "demo_cn.mp4", covering 0:33-0:41 of this video.

My code:

python main.py --config-filename configs/CMLR_V_WER8.0.ini --data-filename inputs/demo_cn.mp4

The output:

load a pre-trained model from: models/CMLR/CMLR_V_WER8.0/model.pth
face tracking speed: 4.90 fps.
hyp: 有一种种的人俗话说的大家人才能真的一年里的一个行

This is different from the one shown in the demo: 中青祝愿大家在新的一年里新春愉快身体健康

I also extracted mouth ROIs from the clip (link).

Would you please let me know if I missed anything?

How are multiple datasets combined during training?

Hi, thanks for this great work.

I have a question about section "3.8 Using Additional Training Data" of your paper "Visual Speech Recognition for Multiple Languages in the Wild".

For example, for LRS3 the best WER of 32.1 is achieved by combining the datasets LRW + LRS2 + AVSpeech + LRS3. I was wondering how they are combined during training. Which of the following scenarios is correct?

Scenario A:

  1. Pretrain using LRW + LRS2 + AVSpeech datasets
  2. Initialise from 1 above, then train on the LRS3 dataset only

Scenario B:

  1. Pretrain using LRW + LRS2 + AVSpeech datasets
  2. Initialise from 1 above, then train on LRW + LRS2 + AVSpeech + LRS3 datasets

Would there be a performance difference between these two scenarios?

Thanks

Inference time

Do you have any benchmarks for inference time? I tried a 10-second video and it took about 1-2 minutes to print the result.

How to deal with a dataset of already-cropped mouth regions

Hi,

I would like to ask how I can work with data where the mouth region has already been cropped. I want to apply all of this project's pre-processing and data augmentation steps to my own data, but because face detection is used, many landmarks were not detected or came back as None. How can I apply this project to my own dataset?

Thanks and best regards

'git lfs pull' does not work

When installing face_detection with the command 'git lfs pull' from 'tools/readme.md', PyCharm reports: 'batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.'
How can I fix this?

CMU-MOSEAS Dataset

Do you still have a copy of the CMU-MOSEAS dataset? I've been informed by the authors that they lost all copies of it. If you still have a copy, I would immensely appreciate it if you could share it.

Thanks.

Using the GPU for inference

Hello. I am attempting to use the GPU to perform lip reading on a sample video with this simple command:

python main.py --config-filename configs/LRS3_V_WER32.3.ini --data-filename test_video.mp4 --gpu-idx 0

The terminal output looks normal, loading the pretrained model, etc., and then I hit this error:

load a pre-trained model from: models/LRS3/LRS3_V_WER32.3/model.pth
face tracking speed: 1.51 fps.
Traceback (most recent call last):
  File "main.py", line 210, in <module>
    main()
  File "main.py", line 189, in main
    one_step_inference(
  File "main.py", line 159, in one_step_inference
    output = lipreader(data_filename, landmarks_filename)
  File "Visual_Speech_Recognition_for_Multiple_Languages\lipreading\subroutines.py", line 107, in __call__
    output = self.model.predict(sequence)
  File "Visual_Speech_Recognition_for_Multiple_Languages\lipreading\model.py", line 141, in predict
    nbest_hyps = self.beam_search(
  File "Miniconda3\envs\VSRML\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "Visual_Speech_Recognition_for_Multiple_Languages\espnet\nets\beam_search.py", line 373, in forward
    running_hyps = self.post_process(i, maxlen, maxlenratio, best, ended_hyps)
  File "Visual_Speech_Recognition_for_Multiple_Languages\espnet\nets\batch_beam_search.py", line 348, in post_process
    return self._batch_select(running_hyps, remained_ids)
  File "Visual_Speech_Recognition_for_Multiple_Languages\espnet\nets\batch_beam_search.py", line 51, in _batch_select
    score=hyps.score[ids],
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

The lip reading task works correctly if I don't specify a GPU. From looking around a bit, the problem appears to be that some tensors are moved from CPU to GPU while others are not, which breaks at runtime. Any help would be appreciated. Thank you!

GRID dataset

Hello, I'm using your pre-trained model to test on the GRID dataset, but for some reason all the outputs are the same sentence. Can you help me with this?

Version issues

Could you please tell me which version of PyTorch you are using? I get some errors with version 2.0. Thank you!

Is there an audio-visual Chinese model?

Thanks for releasing this awesome work! I noticed that the Chinese lip reading model is based only on the visual modality. I used the visual model, but it performed poorly on example video clips, as in #5. Is there an audio-visual version that achieves better results?

Thanks.

Error with the CMUMOSEAS_V_ES_WER44.5 model

When running the CMUMOSEAS_V_ES_WER44.5 model, execution breaks.

The same does not happen with its other variations (CMUMOSEAS_V_FR_WER58.6 and CMUMOSEAS_V_PT_WER51.4 work fine).


Do you know why? Can anyone help me?

Thanks!

Training Code Required

Thanks for the release. Will the training code be available in the future? Thanks.

Multilingual models

Hi, thanks for your excellent work.

Are all released models language-specific, or are there multilingual models as well?

Thanks.

RetinaFace detector does not work

Thank you for releasing your code to the public. I run into issues when applying crop_mouth.py with the RetinaFace detector, while with MediaPipe it works well!


I have another question: I call AVSRDataLoader(modality="video", speed_rate=1, transform=False, detector=cfg.detector, convert_gray=False) when using the MediaPipe detector, which means no transform is applied to the frames, even though I see that transforms are normally applied to the videos.


regards

pre-trained VSR / ASR model

As mentioned in S3, the pre-trained models are always trained on the same data as the full model (though I do not know the pre-training details), and in particular the pre-trained VSR model has exactly the same architecture as the full one. So I wonder why the supervised signals (e.g., intermediate representations) from the pre-trained VSR model still make sense. Could you give an in-depth explanation?

GRID dataset

Could you please share the original GRID dataset? Some items are missing online.

There is something wrong in the README

The Mouth ROIs cropping section should use the command "python crop_mouth.py" instead of "python main.py" (the old one).

Also, it seems there is no module named 'ibug.face_detection'?
