mpc001 / end-to-end-lipreading
PyTorch code for End-to-End Audiovisual Speech Recognition
Hi, I also use a single 1080Ti for training, and I haven't changed any parameters. Why does it take around 20 minutes per 1% of an epoch in the video-only setting?
Process: [ 4716/488766 (1%)] Loss: 6.3010 Acc:0.0000 Cost time: 1142s Estimated time:118110s
@mpc001
end-to-end-lipreading/video_only/model.py
Line 56 in 595a988
Here, after changing "BatchNorm2d" to "BatchNorm1d", the program runs normally.
From the official PyTorch documentation:
CLASS torch.nn.BatchNorm2d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)[SOURCE]
Applies Batch Normalization over a 4D input (a mini-batch of 2D inputs with additional channel dimension) as described in the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
Apparently it should have been BatchNorm1d. The same typo appears in the audio-visual model.
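As a quick sanity check (the shapes below are illustrative, not taken from the repo), BatchNorm1d accepts the 3D (batch, channels, length) tensor that BatchNorm2d rejects:

```python
import torch
import torch.nn as nn

# Illustrative shapes: a 3D tensor (batch, channels, length), as a temporal
# conv backend would produce. BatchNorm1d handles 2D/3D input; BatchNorm2d
# requires 4D input and raises on anything else.
x = torch.randn(4, 512, 29)

bn1d = nn.BatchNorm1d(512)
out = bn1d(x)
print(out.shape)  # torch.Size([4, 512, 29])

bn2d = nn.BatchNorm2d(512)
try:
    bn2d(x)
except (ValueError, RuntimeError):
    print("BatchNorm2d rejects 3D input")
```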
Thank you for your code.
I have a question about training.
In the temporal conv step, did you use a learning rate of 0.0003, no weight decay, and the learning-rate scheduler in this code?
In my run it overfitted after only 11 epochs, and the accuracy is too low.
Hi, thank you for the code. I want to ask about the concatenation part,
inputs = torch.cat((audio_outputs, video_outputs), dim=2)
outputs = concat_model(inputs)
Does this mean they are concatenated along the feature axis? Is that because the audio and visual inputs have different numbers of timesteps?
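For what it's worth, concatenating on dim=2 joins the two streams on the feature axis, which only works once both have the same batch size and number of timesteps (the shapes below are illustrative, not the repo's actual ones):

```python
import torch

# Illustrative shapes: batch 8, 29 timesteps, 512 features per stream.
# For (B, T, F) tensors, dim=2 is the feature axis, so the timestep counts
# must already match before the cat.
audio_outputs = torch.randn(8, 29, 512)
video_outputs = torch.randn(8, 29, 512)
inputs = torch.cat((audio_outputs, video_outputs), dim=2)
print(inputs.shape)  # torch.Size([8, 29, 1024])
```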
Which accuracy should I trust, the one in the original paper or the one you report on GitHub? Thanks.
Hello, thank you for your work. I would like to ask where you got the babble noise that is added to the audio files in your work?
Hello, I'm trying to train the lip-reading network following your advice. Could you please tell me what accuracy the model should achieve at each of the three stages?
I suffered from overfitting when training the video-only model, and cvtransfomer.py doesn't seem to be used in the dataset code? Thanks
when I run (as suggested in README) : CUDA_VISIBLE_DEVICES='0' python main.py --path '' --dataset /mnt/data/rajivratn/lrw/lipread_mp4 --mode 'temporalConv' --every-frame False --batch-size 36 --lr 3e-4 --epochs 30 --test False
I am getting the following error:
RuntimeError: invalid argument 1: must be strictly positive at /pytorch/torch/lib/TH/generic/THTensorMath.c:2247
Hi, thank you for your work. Could you provide the pretrained model?
Thank you for your contribution!
Could you please give more details on data preparing?
Which model/code do you use to extract mouth ROI?
Am I right that I can only run this repository after preparing the data myself?
Thank you!
could you please provide the way the dataset has to be arranged for running the code?
@mpc001
Thank you for your code !
I want to run your code, and I noticed that you wrote ResNet34 yourself even though PyTorch provides a pretrained ResNet34.
Are there any differences between your ResNet34 and the one provided by PyTorch?
Thank you very much!
Hello,
In the audiovisual code there is a concat mode (path to a pre-trained concat model). Is this for the pretrained audiovisual model?
Also, the code declares two references to model (one in main and one in concat_model) that throw an error. Can we return the concat model instead, since no model is declared in these functions?
Thanks in advance
I have sent you the requested email address and you have not sent me a link. @mpc001
@mpc001
Thanks for your code. I tried to replicate it in Keras/TensorFlow, but it suffers from serious overfitting. In the N2 phase (3D CNN + ResNet + temporal conv), I followed your training settings exactly (data augmentation and ROI cropping); the training accuracy is around 88%, but the validation accuracy is only about 65%, far from your 74.6%. I don't know why, so could you share more training details and tricks?
Hello,
I have some doubts about the process of training the audiovisual model. Currently, I am following the steps indicated in the README, going from temporalConv to backend and later finetuneGRU to train the whole network. I have some questions about the process:
Training with pre-trained models:
When training with the pre-trained models, the net got stuck at this part:
inputs = torch.cat((audio_outputs, video_outputs), dim=2)
Concatenation here requires 3D tensors, but the "video_outputs" and "audio_outputs" I get are 2D tensors such as [B, 500]. What should these 3D tensors for audio and video look like? Is there code missing, or should something transform them into the shape the net requires?
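One possible workaround, purely a guess on my side: unsqueeze a middle dimension on the 2D outputs so that dim=2 exists before the cat:

```python
import torch

# Hypothetical fix, not the repo's intended path: lift the 2D [B, 500]
# backend outputs to [B, 1, 500] so concatenation along dim=2 has a
# feature axis to work on.
audio_outputs = torch.randn(4, 500)
video_outputs = torch.randn(4, 500)
inputs = torch.cat((audio_outputs.unsqueeze(1), video_outputs.unsqueeze(1)), dim=2)
print(inputs.shape)  # torch.Size([4, 1, 1000])
```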
These are the tensors I get for the conv1 and conv2 backends. The backend conv1 tensors are not the same; should they be equal?
initial audio tensor: [B,19456]
audio after backend conv1: [B,2048,1]
audio after backend conv2(final): [B,500]
initial video tensor : [B,1,29,96,96]
video after backend conv1: [B,1024,1]
video after backend conv2(final): [B,500]
The concat pretrained model is actually three files named _a.pt, _b.pt, and _v.pt. Should all three be merged and used as input, along with the audio and video models, to start training with the temporal convolutional backend? Or which one should I use first?
Training from scratch:
I started training from scratch without the pre-trained models, but only the audiovisual net, because that is the part I am interested in. Is that the correct approach, or should I first train the audio-only and video-only models from scratch as well?
If I also train the audio-only and video-only models, should I use the .pt files from their last phase ("finetuneGRU") as inputs for training the audiovisual net? Besides that, how do I get concat_model.pt? What should the temporalConv training for audiovisual look like with these inputs (audio_model.pt, video_model.pt, and concat_model.pt)?
In the README, step ii says "Throw away the temporal convolutional backend, freeze the parameters of the frontend and the ResNet and train the LSTM backend". Isn't this already specified somewhere in the code?
Would it be possible for the authors to provide the specifications of the machine used for training, and the estimated running time per epoch on that machine?
@mpc001
Hello! Thanks for your great work!
I'm wondering how to get the intermediate output of ResNet34 using the pre-trained video-only model. Could you please offer a code example for extracting those lip embeddings?
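In case it helps while waiting for an answer: a generic PyTorch pattern (not code from this repo) for grabbing an intermediate activation is a forward hook. It's demonstrated here on a toy model, but the same register_forward_hook call would work on the ResNet submodule of the pretrained video-only model:

```python
import torch
import torch.nn as nn

# Generic pattern: register a forward hook on the submodule whose output you
# want, run a normal forward pass, and read the captured tensor afterwards.
feats = {}

def save_output(module, inputs, output):
    feats['mid'] = output.detach()

model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 5))
model[0].register_forward_hook(save_output)  # hook the "middle" layer
_ = model(torch.randn(3, 10))
print(feats['mid'].shape)  # torch.Size([3, 20])
```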
Hi! Thank you for your work!
I'm trying to reproduce your network architecture in Keras. I'm not sure I understand everything, so I have a few questions:
Thank you!
Getting this error while trying to run the code
Traceback (most recent call last):
  File "main.py", line 212, in <module>
    main()
  File "main.py", line 208, in main
    test_adam(args, use_gpu)
  File "main.py", line 181, in test_adam
    train_test(model, dset_loaders, criterion, 0, 'val', optimizer, args, logger, use_gpu, save_path)
  File "main.py", line 105, in train_test
    _, preds = torch.max(F.softmax(outputs, dim=1).data, 1)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/functional.py", line 768, in softmax
    return torch._C._nn.softmax(input, dim)
RuntimeError: dimension out of range (expected to be in range of [-1, 0], but got 1)
Any ideas on how to go about debugging this?
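The error says `outputs` is 1D (valid dims are only -1 and 0), so dim=1 doesn't exist; that usually means the batch dimension was squeezed away somewhere, for instance with a batch size of 1. A minimal reproduction and workaround (500 classes is just an illustrative assumption):

```python
import torch
import torch.nn.functional as F

outputs = torch.randn(500)          # 1D: F.softmax(outputs, dim=1) raises this error
if outputs.dim() == 1:
    outputs = outputs.unsqueeze(0)  # restore the batch dimension -> [1, 500]
probs = F.softmax(outputs, dim=1)
print(probs.shape)  # torch.Size([1, 500])
```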
In "data = librosa.load(filename, sr=16000)[0]", I find that data.shape is different on my Mac and Ubuntu. (Maybe the reason is the ".mp4" video format? I get the same shape with ".mov".)
On Ubuntu, data.shape is (20480,), larger than 19456. But librosa.load is far too slow on Ubuntu (maybe three days to process everything; I don't know why), so I ran "convert_audio.py" on my Mac (where "librosa.load" is about 100 times faster).
On the Mac, data.shape is (18368,), smaller than 19456, so I ran "data = librosa.load(filename, sr=17000)[0][-19456:]" instead. But with your pretrained audio model I only get 94.67% accuracy rather than 97.72%.
So I have some questions.
What shape do you get from "data = librosa.load(filename, sr=16000)[0]"?
Why is data.shape different between my Mac and Ubuntu?
Why is librosa.load so slow on Ubuntu?
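Not an answer to why the decoders differ, but a length mismatch like this can at least be neutralized by padding or trimming every clip to the 19456 samples the model expects. A sketch (the fix_length helper and the zero-pad-at-the-front strategy are my own assumptions, not the repo's preprocessing):

```python
import numpy as np

def fix_length(data, target_len=19456):
    """Pad with leading zeros or trim from the front so every clip has
    exactly target_len samples; 19456 is the length the audio model expects."""
    if len(data) >= target_len:
        return data[-target_len:]
    pad = np.zeros(target_len - len(data), dtype=data.dtype)
    return np.concatenate([pad, data])

short = np.ones(18368, dtype=np.float32)  # e.g. the Mac decoder's output length
long_ = np.ones(20480, dtype=np.float32)  # e.g. the Ubuntu decoder's output length
print(fix_length(short).shape, fix_length(long_).shape)  # (19456,) (19456,)
```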
Hi. I would like to know how to add train, validation, and test data after executing the main.py program in the audio-only folder. Please reply.
When I run the main.py, I get the FileNotFoundError: No such file or directory 'MONEY/NoisyAudio/-5dB/MONEY_00581.npz'
Epoch 0/29
Current Learning rate: [0.0003]
Traceback (most recent call last):
  File "/home/Documents/audiovisual/audio_only/main.py", line 256, in <module>
    main()
  File "/home/Documents/audiovisual/audio_only/main.py", line 252, in main
    test_adam(args, use_gpu)
  File "/home/Documents/audiovisual/audio_only/main.py", line 230, in test_adam
    model = train_test(model, dset_loaders, criterion, epoch, 'train', optimizer, args, logger, use_gpu, save_path)
  File "/home/evialv/Documents/audiovisual/audio_only/main.py", line 146, in train_test
    outputs = model(inputs)
  File "/home/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/Documents/audiovisual/audio_only/model.py", line 156, in forward
    x = x.view(-1, self.frameLen, self.inputDim)
RuntimeError: shape '[-1, 29, 512]' is invalid for input of size 497664
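For anyone debugging this: the error means the flattened tensor size is not divisible by frameLen * inputDim, so the layers before the view produced a different feature size than the model expects (often a frame-count or input-shape mismatch in the data pipeline). A quick check with the numbers from the traceback:

```python
# 497664 elements cannot be reshaped to (-1, 29, 512) because
# 29 * 512 = 14848 does not divide 497664 evenly.
numel = 497664
frame_len, input_dim = 29, 512
remainder = numel % (frame_len * input_dim)
print(remainder)  # 7680, i.e. non-zero, so x.view(-1, 29, 512) must fail
```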
Thank your for your good work. I want to use your trained model to test if it works on my task and I'll appreciate it if you can provide the trained model.
Hello,
I scanned the code and the paper, and I can't find the code for adding babble noise at different levels to the audio clips. Could you please tell me how to do this?
Thank you!
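While waiting for the authors: a generic way (my own sketch, not necessarily their procedure) to mix a noise recording into clean audio at a target SNR is to scale the noise by the ratio of signal and noise powers:

```python
import numpy as np

def add_noise_snr(clean, noise, snr_db):
    """Mix `noise` into `clean` at the given SNR in dB.
    Generic sketch; the paper's exact babble-noise source and mixing
    procedure may differ."""
    noise = noise[:len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale so that clean_power / (scale^2 * noise_power) == 10^(snr_db/10).
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # stand-in for a 1 s clip at 16 kHz
babble = rng.standard_normal(16000)  # stand-in for a babble recording
noisy = add_noise_snr(clean, babble, snr_db=0)
```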
Hello.
I’m doing my research on multimodal AI, such as multimodal ASR.
How can I get some pretrained models of audiovisual net?
Has anyone successfully reproduced the results from the provided pretrained models?
I tried a simple script to predict the mp4s of ABOUT in the training data, but it produces wildly different labels for each sample. It seems unlikely this could achieve 83.39% accuracy, so I guess something must be wrong in the following script:
import imageio
import numpy as np
import torch
import torch.nn.functional as F
import cv2

from model import lipreading


def load_frames(path):
    # depending on your environment, this might sometimes produce 30 frames
    cap = np.array(list(imageio.get_reader(path, 'ffmpeg')))
    images = np.stack(
        [cv2.cvtColor(cap[_], cv2.COLOR_RGB2GRAY) for _ in range(29)], axis=0)
    images = images[:, 84:172, 120:208] / 255.
    mean = 0.413621
    std = 0.1700239
    images = (images - mean) / std
    images = images.reshape(1, 1, 29, 88, 88)
    images = torch.tensor(images, dtype=torch.float32)
    return images


def reload_model(model, path):
    model_dict = model.state_dict()
    pretrained_dict = torch.load(path)
    model_dict.update(pretrained_dict)
    model.load_state_dict(model_dict)
    return model


# load model
m = lipreading(mode='finetuneGRU')
m = reload_model(m, 'model.pt')
m.eval()

for j in range(2, 10):
    images = load_frames('lipread_mp4/ABOUT/train/ABOUT_%05d.mp4' % j)
    # print the label
    outputs = m(images)
    outputs = torch.mean(outputs, 1)
    outputs = torch.argmax(F.softmax(outputs, dim=1), dim=1)
    print(outputs)
Hi, thanks for your work. Could you please provide the pretrained models for both audio and video? What do I need in order to get them? Just a GitHub account? I already have one.