mpc001 / Visual_Speech_Recognition_for_Multiple_Languages
Visual Speech Recognition for Multiple Languages
License: Other
Which method do you use to extract landmarks from images?
Hi, I'm a beginner in lip reading. I'm curious how low the latency of lip recognition can be. Is there any way to reduce the delay?
Thank you very much.
I'm a beginner in Python. I selected this topic as my final-year project and I'm using Ubuntu.
Where do I specify the path to the models in the given code?
When I install face_detection with the command 'git lfs pull' from 'tools/readme.md',
PyCharm tells me: 'batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.'
How can I fix this?
Hi,
Do you have plans to release the code/model of the audio-visual speech recognition model proposed in "End-to-end Audio-visual Speech Recognition with Conformers"?
Thanks!
Thanks for releasing the awesome work! I noticed that the Chinese lip reading model is based on the visual modality. I used the visual model but it achieved poor performance on the example video clips like #5. Is there an audio-visual version that hopefully achieves better results?
Thanks.
Hi,
Is there a reason why there is no language model (LM) for the audio-only model?
As I see here:
Hi, thanks for this great work.
I have a question about the section "3.8 Using Additional Training Data"
from your paper "Visual Speech Recognition for Multiple Languages in the Wild"
For example, for LRS3 the best WER of 32.1 is achieved by combining the datasets LRW + LRS2 + AVSpeech + LRS3. I was just wondering how they're combined during training; which of the scenarios would be correct?
Scenario A:
Scenario B:
Would there be a performance difference between these 2 scenarios?
Thanks
Hello
So, attempting to use the GPU, I tried lip reading on a sample video with this simple command:
python main.py --config-filename configs/LRS3_V_WER32.3.ini --data-filename test_video.mp4 --gpu-idx 0
The terminal runs normally, loading the pretrained model etc., then I hit this error:
load a pre-trained model from: models/LRS3/LRS3_V_WER32.3/model.pth
face tracking speed: 1.51 fps.
Traceback (most recent call last):
File "main.py", line 210, in <module>
main()
File "main.py", line 189, in main
one_step_inference(
File "main.py", line 159, in one_step_inference
output = lipreader(data_filename, landmarks_filename)
File "Visual_Speech_Recognition_for_Multiple_Languages\lipreading\subroutines.py", line 107, in __call__
output = self.model.predict(sequence)
File "Visual_Speech_Recognition_for_Multiple_Languages\lipreading\model.py", line 141, in predict
nbest_hyps = self.beam_search(
File "Miniconda3\envs\VSRML\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "Visual_Speech_Recognition_for_Multiple_Languages\espnet\nets\beam_search.py", line 373, in forward
running_hyps = self.post_process(i, maxlen, maxlenratio, best, ended_hyps)
File "Visual_Speech_Recognition_for_Multiple_Languages\espnet\nets\batch_beam_search.py", line 348, in post_process
return self._batch_select(running_hyps, remained_ids)
File "Visual_Speech_Recognition_for_Multiple_Languages\espnet\nets\batch_beam_search.py", line 51, in _batch_select
score=hyps.score[ids],
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
The lip reading task works correctly if I don't specify a GPU parameter. Looking around a bit, the problem apparently is that offloading some variables from CPU to GPU breaks somewhere during runtime. Any help would be appreciated. Thank you!
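The RuntimeError in the traceback suggests the index tensor in `_batch_select` lives on a different device than the tensor it indexes. A minimal workaround (a sketch of the general fix, not the maintainers' patch; the helper name is mine) is to move the indices onto the indexed tensor's device before slicing:

```python
import torch

def batch_select_scores(scores: torch.Tensor, ids) -> torch.Tensor:
    """Index `scores` with `ids`, first moving the indices to the
    same device as the tensor. This avoids the 'indices should be
    either on cpu or on the same device as the indexed tensor'
    RuntimeError seen when mixing CPU-built indices with GPU tensors."""
    ids = torch.as_tensor(ids, device=scores.device, dtype=torch.long)
    return scores[ids]

# Example: indices built as a plain Python list still work.
scores = torch.tensor([10, 50, 90])
kept = batch_select_scores(scores, [2, 0])
print(kept.tolist())  # [90, 10]
```

Applying the same `.to(tensor.device)` move to each field selected in `espnet/nets/batch_beam_search.py` should resolve the crash when `--gpu-idx` is set.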
Hi,
I'd like to ask how I can deal with data where the mouth region is already cropped (at varying sizes). I want to apply all of this project's preprocessing and data augmentation steps to my own data, but because face detection was used, the landmarks were not all detected and many of them were None.
So how can I apply this project to my own dataset?
Thanks, best regards.
Hello. I trained on LRS2 following the auto_avsr recipe and using the pretrained LRS3 AVSR model avsr_trlrwlrs2lrs3vox2avsp_base.pth. The LRS2 dataset was processed following your instructions, but the results are strange: the WER is above 1. Even if I evaluate avsr_trlrwlrs2lrs3vox2avsp_base.pth directly, the WER is still above 1, and the predicted sentences are wrong; for example, the ground truth is "And for me the surprise was" but the prediction is "It's time for the final round which".
Do you have any idea about this? Thank you in advance.
I tried to reproduce the Mandarin lip reading result from the following demo video:
https://youtu.be/FIau-6JA9Po?t=33
I've made a clip "demo_cn.mp4" from this video 0:33-0:41.
My code:
python main.py --config-filename configs/CMLR_V_WER8.0.ini --data-filename inputs/demo_cn.mp4
The output:
load a pre-trained model from: models/CMLR/CMLR_V_WER8.0/model.pth
face tracking speed: 4.90 fps.
hyp: 有一种种的人俗话说的大家人才能真的一年里的一个行
This is different from the one shown in the demo: 中青祝愿大家在新的一年里新春愉快身体健康
I also extracted mouth ROIs from the clip (link).
Would you please let me know if I missed anything?
As mentioned in S3, the pre-trained models are always trained on the same data as the full model (though I don't know the pre-training details), and in particular the pre-trained VSR model has exactly the same architecture as the full one. So I wonder why the supervised signals (e.g., intermediate representations) from the pre-trained VSR still make sense. Could you give an in-depth explanation?
The code uses only a very simple method that loads the whole video into memory; this can easily OOM when the video gets long.
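A streaming alternative (an illustrative sketch, not this repo's API) is to process the frames in fixed-size chunks so that only one chunk is ever resident in memory, regardless of clip length:

```python
from typing import Iterable, Iterator, List

def chunked(frames: Iterable, chunk_size: int) -> Iterator[List]:
    """Yield successive fixed-size batches from a frame stream,
    keeping at most `chunk_size` frames in memory at a time."""
    batch: List = []
    for frame in frames:
        batch.append(frame)
        if len(batch) == chunk_size:
            yield batch
            batch = []
    if batch:  # trailing partial chunk
        yield batch

# Example: a 10-frame "video" processed in chunks of 4 frames.
sizes = [len(c) for c in chunked(range(10), 4)]
print(sizes)  # [4, 4, 2]
```

The same generator pattern works with a frame source such as `cv2.VideoCapture`, feeding each chunk through the cropping and inference steps before the next one is read.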
Can you please tell me which version of PyTorch you are using? I get some errors with version 2.0. Thank you!
Hi, thanks for your excellent work.
I wonder if all released models are language specific? Are there multilingual models?
Thanks.
Mouth ROI cropping should use the command "python crop_mouth.py" instead of "python main.py" (the old one).
Also, it seems there is no module named 'ibug.face_detection'?
Do you still have a copy of the CMU-MOSEAS dataset? I've been informed by the authors that they lost all copies of it. If you still have a copy I would immensely appreciate it if you could share it.
Thanks.
Thanks for the release. I wonder if the training code will be available in the future? Thanks.
Do you have any benchmark for inference time? I tried a 10-second video and it took about 1-2 minutes to print the result.
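When profiling where those minutes go (face tracking vs. beam search, say), a simple stdlib-only timing wrapper is enough; this is a generic sketch, not something shipped with the repo:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str, log: dict):
    """Record the wall-clock duration of a pipeline stage in `log`."""
    start = time.perf_counter()
    yield
    log[label] = time.perf_counter() - start

# Example: timing two mock stages of the inference pipeline.
log = {}
with timed("face_tracking", log):
    time.sleep(0.01)  # stand-in for the real tracking step
with timed("beam_search", log):
    time.sleep(0.02)  # stand-in for decoding
print(sorted(log))  # ['beam_search', 'face_tracking']
```

Wrapping the real calls this way shows quickly whether the bottleneck is the face tracker (the "1.51 fps" reported above) or the beam-search decoder.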
Could you please share the original GRID dataset? There are some missing items online.
Thank you for releasing your code to the public. I ran into issues when applying crop_mouth.py with the RetinaFace detector, while with MediaPipe it works well!
I have another question: when using the MediaPipe detector I instantiated AVSRDataLoader(modality="video", speed_rate=1, transform=False, detector=cfg.detector, convert_gray=False), which means no transform is applied to the frames at all, even though I see there are transforms on videos, such as the ones below.
Regards
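For reference, a typical frame transform in lipreading pipelines standardizes pixel intensities with dataset statistics; the sketch below is illustrative only (the mean/std values and function name are assumptions, not this repo's exact transform):

```python
import numpy as np

def normalize_frames(frames: np.ndarray,
                     mean: float = 0.421,
                     std: float = 0.165) -> np.ndarray:
    """Scale uint8 frames to [0, 1], then standardize with
    (illustrative) dataset mean/std, as lipreading models
    usually expect zero-centered inputs."""
    frames = frames.astype(np.float32) / 255.0
    return (frames - mean) / std

# Example: one 2x2 grayscale "frame" with shape (T, H, W).
frame = np.array([[[0, 255], [128, 64]]], dtype=np.uint8)
out = normalize_frames(frame)
print(out.shape)  # (1, 2, 2)
```

If the model was trained with such a normalization, running inference with `transform=False` (raw uint8 frames) would plausibly explain degraded results, so it is worth checking which transforms the training configuration applied.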