smeetrs / deep_avsr

A PyTorch implementation of the Deep Audio-Visual Speech Recognition paper.

License: MIT License

Python 100.00%

audio-visual-speech-recognition automatic-speech-recognition lip-reading speech-recognition speech-to-text visual-speech-recognition

deep_avsr's Issues

When I train the AO model, the WER and CER are always 1.

Thank you for your code. I have been stuck on this problem for days.
When I trained the AO model, the WER and CER were always 1. I printed the predictionbatch, and every predictionbatch is [39,39,39,39,39,39,39,39] or [1,39,1,39,1,39,1,39]. I tried changing the seed value, but it didn't help. When I trained the VO model, however, training behaved normally.

The WER is always 1.000

When I was training, the loss dropped to 3.18, but the WER and CER were always 1.000. I printed the predictionbatch, and every predictionbatch is [39,39,39,39,39,39,39,39]. Do you know what causes this?
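
For readers debugging the same symptom, it may help to see why a constant prediction produces a WER and CER of exactly 1. The sketch below is not the repository's metric code. It is a minimal edit-distance implementation, and it assumes (this thread does not confirm it) that index 39 is an end-of-sequence token, so that an all-39 prediction decodes to empty text. The function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref_text, hyp_text):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(ref_text, hyp_text) / max(len(ref_text), 1)


def wer(ref_text, hyp_text):
    """Word error rate: word edits divided by number of reference words."""
    ref_words = ref_text.split()
    return edit_distance(ref_words, hyp_text.split()) / max(len(ref_words), 1)


reference = "HELLO WORLD"
empty_hypothesis = ""  # assumed decoding of an all-<EOS> prediction
print(wer(reference, empty_hypothesis), cer(reference, empty_hypothesis))
```

With an empty hypothesis every reference word and character counts as a deletion, so both rates equal 1. Because insertions are also counted, the same definition allows a WER above 1 when the hypothesis contains more words than the reference.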

What does an iteration mean?

I am unclear as to what an iteration means in the context of pretraining. As I understand it, we run pretraining with an increasing number of words (1, 2, 3, etc.), passing the pretrained model from each run forward to the next. I presume pretraining needs to be run until convergence for each NUM_WORD (= 1, 2, etc.). Is this what is referred to as an iteration?

Thanks so much for your response and this repo.

Broken pipe error in Dataloader

@lordmartian That change resolved the error. However, I had set numworkers in the config file to 0 in order to avoid a multiprocess error that was masking the original error. Setting numworkers to anything other than 0 still causes this error.

File "", line 1, in
File "D:\Users\arunm\anaconda3\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "D:\Users\arunm\anaconda3\lib\multiprocessing\spawn.py", line 114, in _main
prepare(preparation_data)
File "D:\Users\arunm\anaconda3\lib\multiprocessing\spawn.py", line 225, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "D:\Users\arunm\anaconda3\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
run_name="mp_main")
File "D:\Users\arunm\anaconda3\lib\runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "D:\Users\arunm\anaconda3\lib\runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "D:\Users\arunm\anaconda3\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "D:\Users\arunm\PycharmProjects\AV_Speech_Recognition\audio_visual\pretrain.py", line 106, in
trainingLoss, trainingCER, trainingWER = train(model, pretrainLoader, optimizer, loss_function, device, trainParams)
File "D:\Users\arunm\PycharmProjects\AV_Speech_Recognition\audio_visual\utils\general.py", line 39, in train
for batch, (inputBatch, targetBatch, inputLenBatch, targetLenBatch) in enumerate(tqdm(trainLoader, leave=False, desc="Train", ncols=75)):
File "D:\Users\arunm\anaconda3\lib\site-packages\tqdm\std.py", line 1107, in iter
for obj in iterable:
File "D:\Users\arunm\anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 278, in iter
return _MultiProcessingDataLoaderIter(self)
File "D:\Users\arunm\anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 682, in init
w.start()
File "D:\Users\arunm\anaconda3\lib\multiprocessing\process.py", line 112, in start
self._popen = self._Popen(self)
File "D:\Users\arunm\anaconda3\lib\multiprocessing\context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "D:\Users\arunm\anaconda3\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "D:\Users\arunm\anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 46, in init
prep_data = spawn.get_preparation_data(process_obj._name)
File "D:\Users\arunm\anaconda3\lib\multiprocessing\spawn.py", line 143, in get_preparation_data
Traceback (most recent call last):
File "D:/Users/arunm/PycharmProjects/AV_Speech_Recognition/audio_visual/pretrain.py", line 106, in
_check_not_importing_main()
File "D:\Users\arunm\anaconda3\lib\multiprocessing\spawn.py", line 136, in _check_not_importing_main
trainingLoss, trainingCER, trainingWER = train(model, pretrainLoader, optimizer, loss_function, device, trainParams)
File "D:\Users\arunm\PycharmProjects\AV_Speech_Recognition\audio_visual\utils\general.py", line 39, in train
for batch, (inputBatch, targetBatch, inputLenBatch, targetLenBatch) in enumerate(tqdm(trainLoader, leave=False, desc="Train", ncols=75)):
File "D:\Users\arunm\anaconda3\lib\site-packages\tqdm\std.py", line 1107, in iter
is not going to be frozen to produce an executable.''')
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.
for obj in iterable:

File "D:\Users\arunm\anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 278, in iter
return _MultiProcessingDataLoaderIter(self)
File "D:\Users\arunm\anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 682, in init
w.start()
File "D:\Users\arunm\anaconda3\lib\multiprocessing\process.py", line 112, in start
self._popen = self._Popen(self)
File "D:\Users\arunm\anaconda3\lib\multiprocessing\context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "D:\Users\arunm\anaconda3\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "D:\Users\arunm\anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 89, in init
reduction.dump(process_obj, to_child)
File "D:\Users\arunm\anaconda3\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
BrokenPipeError: [Errno 32] Broken pipe

Originally posted by @arunm95 in #9 (comment)

Improvement in AV model performance over AO

Thanks for your reply.
I noted that you implemented only the transformer-CTC architecture from the paper. I implemented the transformer-seq2seq model based on ESPnet, but the AV model performs slightly worse than the AO model. I think this is due to the context vector fusion step.

The paper says that the context vectors of the audio and video modalities are concatenated channel-wise and fed to feed-forward layers. I concatenated the two context vectors (e.g., both of dimension (32, 123, 512)) into (32, 123, 1024), used a linear layer to project back to (32, 123, 512), and then fed the result to the feed-forward layers, but the performance is always slightly worse than the AO model. Do you have any idea how to implement the channel-wise concatenation?
I noted that you used a convolution layer to fuse the audio and video embeddings in the transformer-CTC model. Should I use the same method to fuse the two context vectors?

Any idea or suggestions will be appreciated. Thanks!

Originally posted by @yuexianghubit in #4 (comment)

Train and validation WER both remain 1 while training VO model

Hi,
I am trying to train the video-only model. When 'PRETRAIN_NUM_WORDS' is 1, the WER on both the training and validation sets stays at 1 and shows no improvement.

`Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 182 || Tr.Loss: 3.239813 Val.Loss: 3.226672 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Epoch 183: reducing learning rate of group 0 to 1.0000e-06.
Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 183 || Tr.Loss: 3.241490 Val.Loss: 3.221430 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 184 || Tr.Loss: 3.240253 Val.Loss: 3.238177 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 185 || Tr.Loss: 3.228107 Val.Loss: 3.234346 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 186 || Tr.Loss: 3.234290 Val.Loss: 3.216766 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 187 || Tr.Loss: 3.241915 Val.Loss: 3.232590 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 188 || Tr.Loss: 3.233189 Val.Loss: 3.228462 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 189 || Tr.Loss: 3.236741 Val.Loss: 3.223365 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 190 || Tr.Loss: 3.235876 Val.Loss: 3.216625 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 191 || Tr.Loss: 3.241944 Val.Loss: 3.242806 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 192 || Tr.Loss: 3.237240 Val.Loss: 3.243809 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 193 || Tr.Loss: 3.238747 Val.Loss: 3.219588 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
`
Is this situation normal?
Thanks for your suggestions.

Originally posted by @yuexianghubit in #4 (comment)

Transferring to train set was not going well

Hi there,

I have been using your code to experiment on features extracted using a custom visual frontend. The pretraining was not bad: after the curriculum learning (1,2,3,5,7,9,13,17,21,29,37), the validation WER reached about 83.1% and the validation CER about 52.5%.
Step: 199 || Tr.Loss: 0.428430 Val.Loss: 2.574524 || Tr.CER: 0.109 Val.CER: 0.525 || Tr.WER: 0.363 Val.WER: 0.831
(The step size was set to 200 because I just wanted a quick try.)

However, the training phase did not go as well, and it seems the pretrained model was not being used, because the train CER started from 95.5%.

[Figures: Train_Val_Loss, Train_Val_WER, Train_Val_CER]

As you can see, the val CER eventually reached about 50%, while the WER remained at 109.5%.
In your experiments, how did the CER/WER change when moving to the training phase? Any advice would be really helpful.

tuple attribute error when running pretrain.py

Hi,

I'm currently trying to train the model on the LRS2 dataset. Preprocessing executes successfully, but when I try to run pretrain.py I run into the following error:

Traceback (most recent call last):
File "audio_visual/pretrain.py", line 106, in
trainingLoss, trainingCER, trainingWER = train(model, pretrainLoader, optimizer, loss_function, device, trainParams)
File "audio_visual\utils\general.py", line 39, in train
for batch, (inputBatch, targetBatch, inputLenBatch, targetLenBatch) in enumerate(tqdm(trainLoader, leave=False, desc="Train", ncols=75)):
File "anaconda3\lib\site-packages\tqdm\std.py", line 1107, in iter
for obj in iterable:
File "anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 346, in next
data = self.dataset_fetcher.fetch(index) # may raise StopIteration
File "anaconda3\lib\site-packages\torch\utils\data_utils\fetch.py", line 47, in fetch
return self.collate_fn(data)
File "audio_visual\data\utils.py", line 242, in collate_fn
inputBatch = pad_sequence([data[0] for data in dataBatch])
File "anaconda3\lib\site-packages\torch\nn\utils\rnn.py", line 369, in pad_sequence
max_size = sequences[0].size()
AttributeError: 'tuple' object has no attribute 'size'

I've traced it to the use of a tuple to pass both the audio and video inputs from prepare_pretrain_input in utils.py but am unsure how to resolve the problem.

Transformer Decoder in jointDecoder in avnet

Hi,
Why is it that nn.TransformerDecoder is not used in jointDecoder in av_net?
Have you tried using it?

How to initialize encoders with AO and VO sets

Hi, could you specify exactly how to initialize the AV encoders/decoders with the AO and VO trained models in the config file? I'm unclear on where to add those to the settings. Thanks!
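
As a rough illustration of the weight transfer described in the README (AO Audio Encoder → AV Audio Encoder, VO Video Encoder → AV Video Encoder, VO Video Decoder → AV Joint Decoder), the sketch below remaps parameter names by prefix. It does not show how the repository's config file handles this. The prefixes `audioEncoder.`, `videoEncoder.`, `videoDecoder.` and `jointDecoder.` are assumed names, and plain dictionaries with string values stand in for real `state_dict()` objects.

```python
def remap_prefix(state_dict, old_prefix, new_prefix):
    """Select the entries under old_prefix and rename them to new_prefix."""
    return {new_prefix + key[len(old_prefix):]: value
            for key, value in state_dict.items()
            if key.startswith(old_prefix)}


# Stand-ins for the trained AO and VO state dicts (hypothetical key names).
ao_state = {"audioEncoder.0.weight": "ao_w", "audioDecoder.0.weight": "ao_dec"}
vo_state = {"videoEncoder.0.weight": "vo_w", "videoDecoder.0.weight": "vo_dec"}

av_init = {}
av_init.update(remap_prefix(ao_state, "audioEncoder.", "audioEncoder."))
av_init.update(remap_prefix(vo_state, "videoEncoder.", "videoEncoder."))
av_init.update(remap_prefix(vo_state, "videoDecoder.", "jointDecoder."))
print(sorted(av_init))
```

With real PyTorch models, a dictionary built this way would typically be passed to `load_state_dict` with `strict=False`, so that AV parameters without a counterpart keep their fresh initialization.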

When I was training the AO model, the WER and CER were always 1.000

When I was pretraining the AO model, the WER and CER were always 1.000. I printed the predictionbatch, and every predictionbatch is [39,39,39,39,39,39,39,39] or [1,39,1,39,1,39,1,39]. Is that normal?

Why set the two parameters "AUDIO_ONLY_PROBABILITY" and "VIDEO_ONLY_PROBABILITY"?

Hi, I noticed that your training is very much like "joint training", i.e., learning three different tasks in the same network at the same time. However, the details are not quite the same: "joint training" does not have the above two parameters. It trains on three inputs simultaneously and then assigns different weights, which your work doesn't seem to do. Did you refer to other papers for your approach? If so, can you share them? If not, can you explain why you set it up this way?
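
To make the question concrete, here is one way two such probabilities could select the input mode for each training sample. This is a guess at the general mechanism, not the repository's implementation. The function name, the mode labels, and the value 0.2 are all illustrative assumptions.

```python
import random


def choose_input_mode(audio_only_prob, video_only_prob, rng=random):
    """Pick the modalities kept for one sample: "AO", "VO" or "AV".

    The probability mass not assigned to "AO" or "VO" goes to "AV".
    """
    u = rng.random()  # uniform in [0, 1)
    if u < audio_only_prob:
        return "AO"   # the video stream would be dropped or zeroed
    if u < audio_only_prob + video_only_prob:
        return "VO"   # the audio stream would be dropped or zeroed
    return "AV"       # both streams are kept


rng = random.Random(0)
counts = {"AO": 0, "VO": 0, "AV": 0}
for _ in range(10000):
    counts[choose_input_mode(0.2, 0.2, rng)] += 1
print(counts)
```

Under this reading a single network sees all three input conditions during training, but each sample contributes one loss term rather than a weighted sum of three.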

Error happening when evaluating

Hi. Thank you for sharing your code. When I run train.py, an error happens. When the code reaches
validationLoss, validationCER, validationWER = evaluate(model, valLoader, loss_function, device, valParams), the following traceback occurs:
Traceback (most recent call last):
File "D:\avsr\deep_avsr\audio_visual\train.py", line 160, in
main()
File "D:\avsr\deep_avsr\audio_visual\train.py", line 114, in main
validationLoss, validationCER, validationWER = evaluate(model, valLoader, loss_function, device, valParams)
File "D:\avsr\deep_avsr\audio_visual\utils\general.py", line 84, in evaluate
for batch, (inputBatch, targetBatch, inputLenBatch, targetLenBatch) in enumerate(tqdm(evalLoader, leave=False, desc="Eval",
File "D:\Anaconda3\envs\autoavsr\lib\site-packages\tqdm\std.py", line 1182, in _iter
for obj in iterable:
File "D:\Anaconda3\envs\autoavsr\lib\site-packages\torch\utils\data\dataloader.py", line 633, in next
data = self._next_data()
File "D:\Anaconda3\envs\autoavsr\lib\site-packages\torch\utils\data\dataloader.py", line 1345, in _next_data
return self._process_data(data)
File "D:\Anaconda3\envs\autoavsr\lib\site-packages\torch\utils\data\dataloader.py", line 1371, in _process_data
data.reraise()
File "D:\Anaconda3\envs\autoavsr\lib\site-packages\torch_utils.py", line 644, in reraise
raise exception
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 5.
Original Traceback (most recent call last):
File "D:\Anaconda3\envs\autoavsr\lib\site-packages\torch\utils\data_utils\worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "D:\Anaconda3\envs\autoavsr\lib\site-packages\torch\utils\data_utils\fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "D:\Anaconda3\envs\autoavsr\lib\site-packages\torch\utils\data_utils\fetch.py", line 51, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "D:\avsr\deep_avsr\audio_visual\data\lrs2_dataset.py", line 112, in getitem
inp, trgt, inpLen, trgtLen = prepare_main_input(audioFile, visualFeaturesFile, targetFile, noise, self.reqInpLen, self.charToIx,
File "D:\avsr\deep_avsr\audio_visual\data\utils.py", line 27, in prepare_main_input
with open(targetFile, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'D:/lrs2_v1/main/.txt'

So it appears that a file cannot be found. This is weird, since the training process works and the error happens only during evaluation, even though I think the dataloaders for the two processes have a similar structure. Can you help me fix this problem?
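
The missing path ends in `main/.txt`, which would be consistent with an empty sample name, for example a blank line in the list file for the validation split. That diagnosis is an assumption, not something confirmed in this thread. The sketch below shows a hypothetical helper, `read_datalist`, that skips blank lines when building sample paths. The list-file layout and the sample names are made up for the demonstration.

```python
import os
import tempfile


def read_datalist(list_path, data_dir):
    """Return sample path prefixes, skipping blank lines in the list file."""
    samples = []
    with open(list_path, "r") as f:
        for line in f:
            fields = line.split()
            if not fields:  # blank or whitespace-only line
                continue
            samples.append(data_dir + "/main/" + fields[0])
    return samples


# Self-check with a temporary list file that contains a blank line.
with tempfile.TemporaryDirectory() as tmp:
    list_path = os.path.join(tmp, "val.txt")
    with open(list_path, "w") as f:
        f.write("sampleA/00001\n\nsampleB/00002\n")
    datalist = read_datalist(list_path, "D:/lrs2_v1")
    print(datalist)
```

Without the blank-line check, the empty line would yield the prefix `D:/lrs2_v1/main/`, and appending `.txt` would give exactly the path in the traceback.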

Pretrained models not available

Can you please provide pretrained models for the project?

Selection of videos

Dear Professor, is it possible to use all the data in each training step when training the network?
In the code I see that the pretrain and train videos are grouped and only one video from each group is selected at a time.

    if self.dataset == "pretrain":
        # (1) index goes from 0 to stepSize-1
        # (2) divide the dataset into partitions of size stepSize and select a random partition
        # (3) fetch the sample at position 'index' in this randomly selected partition
        base = self.stepSize * np.arange(int(len(self.datalist)/self.stepSize)+1)
        ixs = base + index
        ixs = ixs[ixs < len(self.datalist)]
        index = np.random.choice(ixs)
    targetFile = self.datalist[index] + ".txt"
    targetFile_name = self.datalist[index]
    inp, trgt, inpLen, trgtLen = prepare_pretrain_input(visualFeaturesFile, targetFile, self.numWords, self.charToIx, self.videoParams)

Is the data not grouped during validation and testing?

In addition, this model is TM-CTC. Could the TM-seq2seq model be published as well?

I hope to get your answer.
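
The index arithmetic in the quoted snippet can be reproduced without NumPy to see which samples are reachable. The sketch below assumes, as the snippet's first comment states, that `index` runs from 0 to stepSize-1. The function names are illustrative.

```python
import random


def candidate_indices(index, step_size, dataset_len):
    """Dataset positions that share the offset `index` across all partitions."""
    return list(range(index, dataset_len, step_size))


def sample_index(index, step_size, dataset_len, rng=random):
    """Pick one of the candidate positions uniformly at random."""
    return rng.choice(candidate_indices(index, step_size, dataset_len))


print(candidate_indices(2, 4, 10))  # → [2, 6]

# Across all offsets 0..step_size-1, every dataset position is a candidate once.
covered = sorted(i for off in range(4) for i in candidate_indices(off, 4, 10))
print(covered == list(range(10)))  # → True
```

Under this reading, one step draws only stepSize samples, but every sample in the dataset can be drawn in some step, so the full dataset is covered in expectation over many steps rather than within a single one.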

The python version 3.6.9 is not supported anymore.

Questions about meanings of variable in beam search

Hi @lordmartian

Sorry to bother you again! I am new to visual speech recognition and have some questions about the details of the beam search. Could you offer some explanation? Thanks in advance.

Q1: The meaning of entries[labeling].logPrTotal and entries[labeling].logPrText
From the Connectionist Temporal Classification loss (CTC loss), I have a rudimentary understanding of the blank, a symbol inserted between duplicate characters. But I cannot find an explanation of logPrTotal or logPrText, and these two variables appear to be used in calculating the score in the beam search. Could you help me understand them better?

Thanks, I am eagerly waiting for your reply!

Best wishes!
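
In CTC prefix beam search implementations of this style, each candidate labeling usually tracks the probability of paths ending in a blank and of paths ending in a non-blank. The total is their sum, and a separate text score holds the language-model probability of the labeling. Whether `logPrTotal` and `logPrText` in this repository follow that convention exactly is an assumption. The sketch below uses its own hypothetical field names and shows only how such quantities combine in log space.

```python
import math
from dataclasses import dataclass

NEG_INF = float("-inf")


def log_add(a, b):
    """Stable log(exp(a) + exp(b)) that also handles -inf inputs."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


@dataclass
class BeamEntry:
    log_pr_blank: float = NEG_INF      # paths for this labeling ending in blank
    log_pr_non_blank: float = NEG_INF  # paths ending in a real character
    log_pr_text: float = 0.0           # language-model log score (0 means no LM)

    @property
    def log_pr_total(self):
        """Log of the summed probability of all paths for this labeling."""
        return log_add(self.log_pr_blank, self.log_pr_non_blank)

    def score(self, lm_weight=1.0):
        """One common ranking score: acoustic total plus weighted LM score."""
        return self.log_pr_total + lm_weight * self.log_pr_text


entry = BeamEntry(log_pr_blank=math.log(0.3), log_pr_non_blank=math.log(0.2))
print(math.exp(entry.log_pr_total))
```

The blank and non-blank cases are kept separate because extending a labeling with a repeat of its last character is only valid after a blank. The total is what gets compared across beams.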

config.py file is missing?

Hi,
I happened to come across this repo and think it's pretty awesome. I can't seem to locate the config.py file, though. Was that intentional?

the result under noisy audio

Hi,
I have a question about the results under noisy audio. My understanding is that obtaining the WER for noisy audio requires setting args["NOISE_PROBABILITY"] = 1 in config.py. Could you please tell me whether my understanding is correct? Thank you very much for your help.

audio-visual.pt file

Hello, thanks for sharing the code.
I have a question, is the audio-visual.pt file you uploaded already integrated with the weights of AO and VO?

preprocess for lrs2 dataset

Hi, which face detector and face alignment method are used for the LRS2 and LRS3 datasets?

Language Model Execution Never Ends

While performing inference with the audio-visual model with USE_LM set to True, the demo gets stuck and never completes. Specifically, line 34 in lrs2_char_lm.py (batch, finalStateBatch = self.lstm(batch, initStateBatch)) executes endlessly. How can this be resolved? Thanks!

how much training time

Hi there,
thanks for the nice code.

I didn't find early stopping in your code. So after each change of 'PRETRAIN_NUM_WORDS', will it run for 1000 steps/epochs? That would take many days to train the whole model.

May I know how long it took you to train the whole AV model?
Looking forward to your reply. Thanks!
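
Since the question notes that the code has no early stopping, here is a minimal patience-based sketch that could be wrapped around a training loop. It is not part of the repository. The class name, the `patience` and `min_delta` parameters, and the sample validation WER values are illustrative assumptions.

```python
class EarlyStopper:
    """Signal a stop when validation WER has not improved for `patience` steps."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_steps = 0

    def should_stop(self, val_wer):
        if val_wer < self.best - self.min_delta:
            # Strict improvement: remember it and reset the counter.
            self.best = val_wer
            self.bad_steps = 0
        else:
            self.bad_steps += 1
        return self.bad_steps >= self.patience


stopper = EarlyStopper(patience=3)
val_wers = [0.9, 0.8, 0.81, 0.82, 0.80, 0.79]  # made-up values
for step, val_wer in enumerate(val_wers):
    if stopper.should_stop(val_wer):
        print("stopping at step", step)
        break
```

If a learning-rate scheduler that reduces on plateau is also in use, the patience here would usually be set larger than the scheduler's, so that the reduced rate gets a chance to help before training stops.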

Details of visual frontend model

I noticed that the open source model in Ref 1 is based on the TensorFlow framework. Have you converted it to PyTorch? In addition, the model in Ref 1 was trained with the visual frontend and the backend together. Did you cut off the backend and retain only the weights and structure of the visual frontend?

The error when use preprocess.py

Hi,
when I use preprocess.py to process LRS2, I encounter the following problem:

preprocessing.py does not generate the audio WAV files. How can I solve this problem?
Thank you very much.

CUDA out of memory on preprocess.py

Issue

I keep getting CUDA out of memory error from preprocessing my data. I ran the video_only/preprocess.py to preprocess LRS-3 dataset. I always ran into this issue on the same file.

Traceback

Number of data samples to be processed = 21864


Starting preprocessing ....

Preprocess:  36%|██████▊            | 7830/21864 [37:01<3:02:49,  1.28it/s]
Ignoring error for /content/Dataset/Part 6/kSZEsPnhIXg/00005: CUDA out of memory. Tried to allocate 47.83 GiB (GPU 0; 39.59 GiB total capacity; 13.32 GiB already allocated; 24.53 GiB free; 13.34 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/content/drive/MyDrive/FYP/deep_avsr/video_only/preprocess.py", line 53, in main
    preprocess_sample(file, params)
  File "/content/drive/MyDrive/FYP/deep_avsr/video_only/utils/preprocessing.py", line 58, in preprocess_sample
    outputBatch = vf(inputBatch)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/drive/MyDrive/FYP/deep_avsr/video_only/models/visual_frontend.py", line 107, in forward
    batch = self.frontend3D(inputBatch)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py", line 607, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py", line 603, in _conv_forward
    input, weight, bias, self.stride, self.padding, self.dilation, self.groups
RuntimeError: CUDA out of memory. Tried to allocate 47.83 GiB (GPU 0; 39.59 GiB total capacity; 13.32 GiB already allocated; 24.53 GiB free; 13.34 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Workaround

I have tried setting max_split_size_mb on PYTORCH_CUDA_ALLOC_CONF environment variable and it would still have this issue.
I set the value to 1, 10, 1024, and 2048. None have worked.

I'm forced to work around it by catching the error and deleting the file that causes the out of memory because it's always the same file that is causing it.
My workaround in video_only/preprocess.py

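    import contextlib
    import os
    import sys
    import traceback

    def silent_rem(file):
        # Remove a file, ignoring the error if it is missing.
        with contextlib.suppress(OSError):
            os.remove(file)

    for file in tqdm(filesList, leave=True, desc="Preprocess", ncols=75):
        try:
            preprocess_sample(file, params)
        except RuntimeError as e:
            # Log the failure, then delete every file belonging to the offending sample.
            print(f"\nIgnoring error for {file}:", e, file=sys.stderr)
            traceback.print_exc()
            silent_rem(file + ".txt")
            silent_rem(file + ".mp4")
            silent_rem(file + ".npy")
            silent_rem(file + ".png")
        finally:
            # Release unused cached GPU memory before the next sample.
            torch.cuda.empty_cache()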

I don't really like this but it's the only way I can solve this and move forward. It would be really helpful if there is a way to fix it.

System Information

I've run it on 2 separate machines with the same issue occurring on different files.

Google Colab Pro+ on GPU Runtime 40GB Memory with 52 GB RAM, Ubuntu 18
- Python 3.7.14
- torch==1.12.1+cu113
- torchaudio==0.12.1+cu113
- torchvision==0.13.1+cu113
RTX 3060 (Laptop) 8GB Memory with 16 GB RAM, Windows 11
- Python 3.6.7
- torch==1.10.2+cu113
- torchaudio==0.10.2+cu113
- torchvision==0.11.3+cu113

ctc_loss problem when running on multiple GPUs

Hi, I didn't have any bugs when training the AV model on a single GPU. But when I try to train it on multiple GPUs, the CTC loss raises the following error:

Traceback (most recent call last):
File "/home/cca01/work2020/guominghao/Deep_AV_ASR/audio_visual/train.py", line 162, in
main()
File "/home/cca01/work2020/guominghao/Deep_AV_ASR/audio_visual/train.py", line 111, in main
trainingLoss, trainingCER, trainingWER = train(model, trainLoader, optimizer, loss_function, device, trainParams)
File "/home/cca01/work2020/guominghao/Deep_AV_ASR/audio_visual/utils/general.py", line 74, in train
loss = loss_function(outputBatch, targetBatch, inputLenBatch, targetLenBatch)
File "/home/cca01/work2020/guominghao/.conda/envs/Deep_AV_PYTORCH/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in call
result = self.forward(*input, **kwargs)
File "/home/cca01/work2020/guominghao/.conda/envs/Deep_AV_PYTORCH/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 1295, in forward
self.zero_infinity)
File "/home/cca01/work2020/guominghao/.conda/envs/Deep_AV_PYTORCH/lib/python3.6/site-packages/torch/nn/functional.py", line 1767, in ctc_loss
zero_infinity)
RuntimeError: Expected tensor to have size at least 660 at dimension 1, but got size 1474 for argument #2 'targets' (while checking arguments for ctc_loss_gpu)

Have you ever encountered this problem?

preprocess error

When I run the preprocess file, I encounter the following issue:

File "/home/aa/zzs/deep_avsr_master/audio_visual/preprocess.py", line 110, in
main()
File "/home/aa/zzs/deep_avsr_master/audio_visual/preprocess.py", line 49, in main
preprocess_sample(file, params)
File "/home/aa/zzs/deep_avsr_master/audio_visual/utils/preprocessing.py", line 52, in preprocess_sample
cv.imwrite(roiFile, np.floor(255*np.concatenate(roiSequence, axis=1)).astype(np.int))
File "<array_function internals>", line 180, in concatenate
ValueError: need at least one array to concatenate

Curriculum learning

You have the following description in Important Training Details:
"We perform iterations of Curriculum Learning by changing the PRETRAIN_NUM_WORDS config option. The number of words used in each iteration of curriculum learning is as follows: 1,2,3,5,7,9,13,17,21,29,37, i.e., 11 iterations in total."

I am confused. Can you describe in detail what "iteration" refers to? My understanding is that the network needs to run for 1000 steps ("NUM_STEPS") every time "PRETRAIN_NUM_WORDS" is changed in the curriculum learning strategy.
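
One reading of the quoted passage is that an "iteration" is a complete pretraining run for one value of PRETRAIN_NUM_WORDS, with each run initialized from the previous run's checkpoint. The sketch below illustrates that reading only. `run_curriculum` and `fake_pretrain` are hypothetical stand-ins, and running a fixed 1000 steps per run is the asker's assumption, not something the quote states.

```python
CURRICULUM = [1, 2, 3, 5, 7, 9, 13, 17, 21, 29, 37]  # from the quoted README text


def run_curriculum(pretrain_fn, num_steps=1000):
    """Run one pretraining stage per word count, chaining checkpoints."""
    checkpoint = None
    history = []
    for iteration, num_words in enumerate(CURRICULUM, start=1):
        checkpoint = pretrain_fn(num_words=num_words,
                                 num_steps=num_steps,
                                 init_from=checkpoint)
        history.append((iteration, num_words, checkpoint))
    return history


def fake_pretrain(num_words, num_steps, init_from):
    """Placeholder for a real pretraining run; returns a checkpoint name."""
    return f"ckpt_words{num_words}"


history = run_curriculum(fake_pretrain)
print(len(history))  # → 11
```

In practice each stage would correspond to editing the config option and rerunning the pretraining script by hand, stopping either after a fixed number of steps or once the validation metrics plateau.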

The performance when doing curriculum learning

Hi there,

Thanks a lot for your open source repo and code. I adapted your code to build a new AVNet and train the model directly on AV data. During curriculum learning with the number of words = 1, my model converges at a WER of 0.55. But when I use this checkpoint to initialize training with the number of words = 2, the model converges at a WER of 0.7, which is worse than with 1 word. I expected the WER to drop after each iteration.

My question is: did you ever encounter problems like this when training the AO or VO model?

When pretraining the VO model, the loss decreases for a few epochs, then immediately increases and stays there.

Hi, thank you for the work.
I reproduced the code, and it works well with your pretrained weights (pretrain, train, test.py all OK).
But I'd like to try some different methods, so I didn't use them this time
(I haven't changed the code, I just didn't load the pretrained weights).

I was running the video_only pretrain.py
Number of Words = 2
My seed = 18547840

The loss values look odd:

Step: 000 || Tr.Loss: 3.135768 Val.Loss: 2.934046 || Tr.CER: 0.977 Val.CER: 0.967 || Tr.WER: 1.022 Val.WER: 1.001
Step: 001 || Tr.Loss: 2.911791 Val.Loss: 2.872616 || Tr.CER: 0.928 Val.CER: 0.890 || Tr.WER: 1.013 Val.WER: 1.000
Step: 002 || Tr.Loss: 2.946833 Val.Loss: 3.349309 || Tr.CER: 0.917 Val.CER: 1.000 || Tr.WER: 1.012 Val.WER: 1.000
Step: 003 || Tr.Loss: 3.326881 Val.Loss: 3.314617 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 004 || Tr.Loss: 3.323197 Val.Loss: 3.303168 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 005 || Tr.Loss: 3.323742 Val.Loss: 3.327703 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 006 || Tr.Loss: 3.324991 Val.Loss: 3.318675 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 007 || Tr.Loss: 3.321263 Val.Loss: 3.317882 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 008 || Tr.Loss: 3.319875 Val.Loss: 3.322166 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 009 || Tr.Loss: 3.323132 Val.Loss: 3.320766 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 010 || Tr.Loss: 3.325111 Val.Loss: 3.319928 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 011 || Tr.Loss: 3.323602 Val.Loss: 3.306033 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 012 || Tr.Loss: 3.322382 Val.Loss: 3.324228 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 013 || Tr.Loss: 3.319179 Val.Loss: 3.308431 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 014 || Tr.Loss: 3.324116 Val.Loss: 3.316021 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 015 || Tr.Loss: 3.320856 Val.Loss: 3.307545 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

At steps 1 and 2 the loss drops to 2.9, but it goes back to 3.3 at the next step and stays there (this continued until step 130, when I stopped it).
The WER remains at 1.00, too.
I found this situation a little different from #6.
I wonder whether this gives you a clue about how it happens or how to avoid it.
Or is it just a normal situation?

Thank you for your kindness.

Details of the pretrained visual frontend and language models?

Hi,
Can you provide any more details on the pretrained visual frontend and LMs? Which datasets were used for the pretrained models?

Thanks!

How AO and VO model are trained?

Thanks for sharing such a useful repo.

You mention that:

"We train the AO and VO models first. We then initialize the AV model with weights from the trained AO and VO models as follows: AO Audio Encoder → AV Audio Encoder, VO Video Encoder → AV Video Encoder, VO Video Decoder → AV Joint Decoder."

You use AO and VO to initialize AV for further pre-training on LRS2 pre-train set and fine-tuning on LRS-2 train set. But when you pre-train AO and VO before that initialization, do you only use the pre-train set of LRS-2 or do you pre-train on pre-train+train sets or do you pre-train on pre-train and fine-tune on train set before using AO and VO for AV model training? This is particularly important for us as we will use AO and VO model weights you shared and want to know which LRS2 sets they have seen during training.

Thanks!

About loss curve

Hello, when I pretrained the AO model, I trained for a total of 220 epochs with PRETRAIN_NUM_WORDS=3. The best val WER, 0.259, was reached at epoch 190. However, the loss curve seems abnormal.
What is the reason?
I have uploaded the graph I generated. Please help me take a look.
Looking forward to your reply, thank you!

question

Hello: I did not find the pretrain.txt and preval.txt files after running preprocess.py. I created them manually, but they were still empty after executing the script. Running pretrain.py then fails with a dataset loading error, apparently because nothing can be read from these two files. Looking at the code logic, pretrain.txt is expected to exist from the start, but you didn't provide it, right? Thanks.

RuntimeError: cublas runtime error

Hello, I got this error when preprocessing:
RuntimeError: cublas runtime error : the GPU program failed to execute at C:/w/1/s/windows/pytorch/aten/src/THC/THCBlas.cu:331
It seems that CUDA 10.0 does not match. Can I reproduce this code with CUDA 12.0?

Looking forward to your reply, thank you!

What do <EOS> and " " mean?

Hi, I'm working with your code and have some questions.

What does <EOS> mean? End of sequence? If so, where is the start-of-sequence token <SOS>?
In LRS2 the videos are already sliced, so one video represents one sequence.
Does that mean <EOS> and <SOS> are not necessary?
And does " " mean the space between words?

https://github.com/lordmartian/deep_avsr/blob/2fe30359162f71f2bed1b17275122c00594ad40c/video_only/config.py#L26-L29
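
As a small illustration of how a character table with a space and an end-of-sequence entry is commonly used, the sketch below encodes a transcript into indices and decodes it again. The mapping is a toy one invented for this example, not the table in config.py. Leaving index 0 unused, as if reserved for the CTC blank, is likewise an assumption.

```python
# Toy mapping for illustration only; the real table lives in config.py.
CHAR_TO_INDEX = {" ": 1, "A": 2, "B": 3, "C": 4, "<EOS>": 5}
INDEX_TO_CHAR = {index: char for char, index in CHAR_TO_INDEX.items()}


def encode(text):
    """Map each character to its index and append the <EOS> index."""
    return [CHAR_TO_INDEX[c] for c in text] + [CHAR_TO_INDEX["<EOS>"]]


def decode(indices):
    """Map indices back to text, stopping at the first <EOS>."""
    chars = []
    for i in indices:
        if i == CHAR_TO_INDEX["<EOS>"]:
            break
        chars.append(INDEX_TO_CHAR[i])
    return "".join(chars)


target = encode("AB CAB")
print(target)          # → [2, 3, 1, 4, 2, 3, 5]
print(decode(target))  # → AB CAB
```

A start-of-sequence token is mainly needed by autoregressive (seq2seq) decoders, which must be fed something before the first output character. A CTC model emits a distribution for every input frame and does not require one.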

Out of Memory errors during pretraining

I seem to hit OOM errors as I proceed into pretraining steps/epochs. The memory footprint keeps increasing after each step/epoch. I would have expected it to remain constant for each step/epoch. Am I wrong?

The memory footprint was around ~9G when pretraining started and has steadily increased since.

top - 20:43:22 up 10:22,  1 user,  load average: 2.15, 2.53, 2.41
Tasks: 353 total,   2 running, 268 sleeping,   0 stopped,   0 zombie
%Cpu(s):  8.8 us,  3.3 sy,  0.0 ni, 86.6 id,  1.2 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 16368460 total,   190368 free, **13370028 used**,  2808064 buff/cache
KiB Swap:  6291452 total,  4440572 free,  1850880 used.  2390908 avail Mem

A couple of steps later

top - 20:44:38 up 10:23,  1 user,  load average: 1.63, 2.29, 2.33
Tasks: 352 total,   2 running, 268 sleeping,   0 stopped,   0 zombie
%Cpu(s):  9.1 us,  3.3 sy,  0.0 ni, 86.2 id,  1.4 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 16368460 total,   159724 free, **13424752** used,  2783984 buff/cache
KiB Swap:  6291452 total,  4431356 free,  1860096 used.  2333444 avail Mem

Is this increase expected or does something need to be re-initialized?

I pretrained multiple times by changing num_words. I was able to do so up to 29, by which point I had to decrease the batch size to 1 to avoid CUDA OOM. I wasn't able to train for num_words = 37, as it hit CUDA out of memory each time. Is there any way other than reducing the batch size to solve this? (I'm using an 11 GB GPU.)
Since I wasn't able to pretrain for 37 words, I continued to train.py. In another issue (closed now) you mentioned a final pretraining WER for AV of 0.245 and a WER after training of 0.168. After pretraining up to 29 words my pretraining WER was about the same as yours, but after running train.py I was unable to decrease it as much as you had. My final WER after training came out around 0.180. Is there anything you did after pretraining or while running train.py that helped decrease yours so much? Do you have any other suggestions for decreasing the WER?

Where is the seq2seq model in 'audio_visual' or 'video_only'?

Hi @lordmartian

Thanks for your generous sharing! It seems that the Transformer seq2seq model in Fig. 2 is not implemented. I wonder if I have missed something important. If so, could you help me figure it out?

Thanks for your time!
