smeetrs / deep_avsr
A PyTorch implementation of the Deep Audio-Visual Speech Recognition paper.
License: MIT License
Thank you for your code. I have been stuck on this problem for days.
When I trained the AO model, the WER and CER were always 1. I printed predictionbatch, and every one was [39,39,39,39,39,39,39,39] or [1,39,1,39,1,39,1,39]. I tried changing the seed value, but it didn't help. When I trained the VO model, however, everything was normal.
When I was training, the loss dropped to 3.18, but the WER and CER were always 1.000. I printed predictionbatch, and every one was [39,39,39,39,39,39,39,39]. Do you know what caused this?
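One way to see why the metrics sit at exactly 1.000 for output like this is to compute them by hand. The sketch below is not the repository's metric code. It assumes that index 39 stands for the end-of-sequence token, so a prediction made only of 39s decodes to an empty string; that mapping is an assumption and should be checked against the character map in config.py. The helpers `edit_distance` and `error_rates` are hypothetical names.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def error_rates(ref_text, hyp_text):
    """Return (CER, WER), each normalised by the reference length."""
    cer = edit_distance(ref_text, hyp_text) / len(ref_text)
    ref_words, hyp_words = ref_text.split(), hyp_text.split()
    wer = edit_distance(ref_words, hyp_words) / len(ref_words)
    return cer, wer


# A prediction that decodes to an empty string deletes every reference
# character and every reference word.
print(error_rates("HELLO WORLD", ""))  # (1.0, 1.0)
```

Because both rates are normalised by the reference length, they can also exceed 1 when the prediction contains more words than the reference, for example `error_rates("A B", "X Y Z")` gives a WER of 1.5.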
I am unclear as to what an iteration means in the context of pretraining. As I understand, we run pretraining with increasing number of words (1, 2, 3 etc) where we pass forward the pretrained models from each run. I presume the pretrain needs to be run until convergence for each NUM_WORD (=1,2,etc). Is this what is referred to as an iteration?
Thanks so much for your response and this repo.
@lordmartian That change resolved the error. However, I had set numworkers in the config file to 0 in order to avoid a multiprocessing error that was masking the original error. Setting numworkers to anything other than 0 still causes this error:
File "", line 1, in
File "D:\Users\arunm\anaconda3\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "D:\Users\arunm\anaconda3\lib\multiprocessing\spawn.py", line 114, in _main
prepare(preparation_data)
File "D:\Users\arunm\anaconda3\lib\multiprocessing\spawn.py", line 225, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "D:\Users\arunm\anaconda3\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
run_name="mp_main")
File "D:\Users\arunm\anaconda3\lib\runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "D:\Users\arunm\anaconda3\lib\runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "D:\Users\arunm\anaconda3\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "D:\Users\arunm\PycharmProjects\AV_Speech_Recognition\audio_visual\pretrain.py", line 106, in
trainingLoss, trainingCER, trainingWER = train(model, pretrainLoader, optimizer, loss_function, device, trainParams)
File "D:\Users\arunm\PycharmProjects\AV_Speech_Recognition\audio_visual\utils\general.py", line 39, in train
for batch, (inputBatch, targetBatch, inputLenBatch, targetLenBatch) in enumerate(tqdm(trainLoader, leave=False, desc="Train", ncols=75)):
File "D:\Users\arunm\anaconda3\lib\site-packages\tqdm\std.py", line 1107, in iter
for obj in iterable:
File "D:\Users\arunm\anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 278, in iter
return _MultiProcessingDataLoaderIter(self)
File "D:\Users\arunm\anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 682, in init
w.start()
File "D:\Users\arunm\anaconda3\lib\multiprocessing\process.py", line 112, in start
self._popen = self._Popen(self)
File "D:\Users\arunm\anaconda3\lib\multiprocessing\context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "D:\Users\arunm\anaconda3\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "D:\Users\arunm\anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 46, in init
prep_data = spawn.get_preparation_data(process_obj._name)
File "D:\Users\arunm\anaconda3\lib\multiprocessing\spawn.py", line 143, in get_preparation_data
Traceback (most recent call last):
File "D:/Users/arunm/PycharmProjects/AV_Speech_Recognition/audio_visual/pretrain.py", line 106, in
_check_not_importing_main()
File "D:\Users\arunm\anaconda3\lib\multiprocessing\spawn.py", line 136, in _check_not_importing_main
trainingLoss, trainingCER, trainingWER = train(model, pretrainLoader, optimizer, loss_function, device, trainParams)
File "D:\Users\arunm\PycharmProjects\AV_Speech_Recognition\audio_visual\utils\general.py", line 39, in train
for batch, (inputBatch, targetBatch, inputLenBatch, targetLenBatch) in enumerate(tqdm(trainLoader, leave=False, desc="Train", ncols=75)):
File "D:\Users\arunm\anaconda3\lib\site-packages\tqdm\std.py", line 1107, in iter
is not going to be frozen to produce an executable.''')
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
for obj in iterable:
File "D:\Users\arunm\anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 278, in iter
return _MultiProcessingDataLoaderIter(self)
File "D:\Users\arunm\anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 682, in init
w.start()
File "D:\Users\arunm\anaconda3\lib\multiprocessing\process.py", line 112, in start
self._popen = self._Popen(self)
File "D:\Users\arunm\anaconda3\lib\multiprocessing\context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "D:\Users\arunm\anaconda3\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "D:\Users\arunm\anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 89, in init
reduction.dump(process_obj, to_child)
File "D:\Users\arunm\anaconda3\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
BrokenPipeError: [Errno 32] Broken pipe
Originally posted by @arunm95 in #9 (comment)
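The RuntimeError in the traceback above names the usual remedy: on platforms that start child processes with spawn (Windows in this report), any code that launches workers has to sit behind an `if __name__ == '__main__':` guard. The sketch below shows that idiom with the standard library only, so it runs without PyTorch. The same structure would apply to a script that iterates a DataLoader with worker processes, which is an inference from the traceback rather than a confirmed fix for pretrain.py. The names `square` and `main` are hypothetical.

```python
import multiprocessing as mp


def square(x):
    # Work executed in a child process. It lives at module top level so
    # that a spawned child can import it by name.
    return x * x


def main(num_workers=2):
    # Everything that starts processes lives inside main(), so nothing
    # is launched merely by importing this file.
    with mp.Pool(processes=num_workers) as pool:
        return pool.map(square, range(5))


if __name__ == "__main__":
    # With the spawn start method each child re-imports this file; the
    # guard keeps the children from calling main() again.
    mp.freeze_support()  # only matters for frozen executables
    print(main())
```

Moving the training loop of a script into a `main()` function and calling it under the guard follows the same pattern.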
Thanks for your reply.
I noticed that you implemented only the transformer-CTC architecture from the paper. I implemented the transformer-seq2seq based on ESPnet, but the performance of the AV model is slightly worse than that of the AO model. I think this is due to the context vector fusion step.
The paper says that the context vectors of the audio and video modalities are concatenated channel-wise and fed to feed-forward layers. I just concatenated the two context vectors (e.g., both of dimension (32, 123, 512)) into (32, 123, 1024), used a linear layer to project to (32, 123, 512), and then fed the result to the feed-forward layers, but the performance is always slightly worse than the AO model. Do you have any idea how to implement the channel-wise concatenation?
I noticed that you used a convolution layer to fuse the audio and video embeddings in the transformer-CTC. Should I use the same method to fuse the two context vectors?
Any idea or suggestions will be appreciated. Thanks!
Originally posted by @yuexianghubit in #4 (comment)
Hi,
I am trying to train the video-only model. When 'PRETRAIN_NUM_WORDS' is 1, the WER on both the training and testing sets stays at 1 all the time, with no improvement at all.
`Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 182 || Tr.Loss: 3.239813 Val.Loss: 3.226672 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Epoch 183: reducing learning rate of group 0 to 1.0000e-06.
Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 183 || Tr.Loss: 3.241490 Val.Loss: 3.221430 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 184 || Tr.Loss: 3.240253 Val.Loss: 3.238177 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 185 || Tr.Loss: 3.228107 Val.Loss: 3.234346 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 186 || Tr.Loss: 3.234290 Val.Loss: 3.216766 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 187 || Tr.Loss: 3.241915 Val.Loss: 3.232590 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 188 || Tr.Loss: 3.233189 Val.Loss: 3.228462 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 189 || Tr.Loss: 3.236741 Val.Loss: 3.223365 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 190 || Tr.Loss: 3.235876 Val.Loss: 3.216625 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 191 || Tr.Loss: 3.241944 Val.Loss: 3.242806 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 192 || Tr.Loss: 3.237240 Val.Loss: 3.243809 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 193 || Tr.Loss: 3.238747 Val.Loss: 3.219588 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
`
Is this situation normal?
Thanks for your suggestions.
Originally posted by @yuexianghubit in #4 (comment)
Hi there,
I have been using your code to experiment on features extracted using a custom visual front. The pretraining was not bad, and on the validation set the WER reached about 83.1% while the CER was 36.3% after the curriculum learning. (1,2,3,5,7,9,13,17,21,29,37).
Step: 199 || Tr.Loss: 0.428430 Val.Loss: 2.574524 || Tr.CER: 0.109 Val.CER: 0.525 || Tr.WER: 0.363 Val.WER: 0.831
(The step size was set to 200 because I just wanted a quick try.)
However, the training phase was not so good, and it seems the pretrained model was not working because the train CER started from 95.5%.
[Figures: Train_Val_Loss, Train_Val_WER, Train_Val_CER]
As you can see, the val CER reached about 50% by the end, while the WER remained at 109.5%...
In your experiment, how did the CER/WER change when moving to the training phase? It would be really helpful if you could give me some advice.
Hi,
I'm currently trying out training the model on the LRS2 dataset. Preprocess executes successfully, but on trying to run the pretrain I run into the following error:
Traceback (most recent call last):
File "audio_visual/pretrain.py", line 106, in
trainingLoss, trainingCER, trainingWER = train(model, pretrainLoader, optimizer, loss_function, device, trainParams)
File "audio_visual\utils\general.py", line 39, in train
for batch, (inputBatch, targetBatch, inputLenBatch, targetLenBatch) in enumerate(tqdm(trainLoader, leave=False, desc="Train", ncols=75)):
File "anaconda3\lib\site-packages\tqdm\std.py", line 1107, in iter
for obj in iterable:
File "anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 346, in next
data = self.dataset_fetcher.fetch(index) # may raise StopIteration
File "anaconda3\lib\site-packages\torch\utils\data_utils\fetch.py", line 47, in fetch
return self.collate_fn(data)
File "audio_visual\data\utils.py", line 242, in collate_fn
inputBatch = pad_sequence([data[0] for data in dataBatch])
File "anaconda3\lib\site-packages\torch\nn\utils\rnn.py", line 369, in pad_sequence
max_size = sequences[0].size()
AttributeError: 'tuple' object has no attribute 'size'
I've traced it to the use of a tuple to pass both the audio and video inputs from prepare_pretrain_input in utils.py but am unsure how to resolve the problem.
Hi,
Why is it that nn.TransformerDecoder is not used in jointDecoder in av_net?
Have you tried using it?
Hi, could you specify exactly how to initialize the AV encoders/decoders with the AO and VO trained models in the config file? I'm unclear on where to add those to the settings. Thanks!
When I was pretraining the AO model, the WER and CER were always 1.000. I printed predictionbatch, and every one was [39,39,39,39,39,39,39,39] or [1,39,1,39,1,39,1,39]. Is that normal?
Hi, I noticed that your training is very much like "joint training", i.e., learning three different tasks in the same network at the same time. However, the details are not quite the same: "joint training" does not have the above two parameters. It trains on three inputs simultaneously and then assigns different weights, which your work doesn't seem to do. Did you refer to other papers for your approach? If so, can you share them? If not, can you explain why you set it up this way?
Hi. Thank you for sharing your code. When I run train.py, an error occurs. When the code reaches
validationLoss, validationCER, validationWER = evaluate(model, valLoader, loss_function, device, valParams), the following traceback appears:
Traceback (most recent call last):
File "D:\avsr\deep_avsr\audio_visual\train.py", line 160, in
main()
File "D:\avsr\deep_avsr\audio_visual\train.py", line 114, in main
validationLoss, validationCER, validationWER = evaluate(model, valLoader, loss_function, device, valParams)
File "D:\avsr\deep_avsr\audio_visual\utils\general.py", line 84, in evaluate
for batch, (inputBatch, targetBatch, inputLenBatch, targetLenBatch) in enumerate(tqdm(evalLoader, leave=False, desc="Eval",
File "D:\Anaconda3\envs\autoavsr\lib\site-packages\tqdm\std.py", line 1182, in _iter
for obj in iterable:
File "D:\Anaconda3\envs\autoavsr\lib\site-packages\torch\utils\data\dataloader.py", line 633, in next
data = self._next_data()
File "D:\Anaconda3\envs\autoavsr\lib\site-packages\torch\utils\data\dataloader.py", line 1345, in _next_data
return self._process_data(data)
File "D:\Anaconda3\envs\autoavsr\lib\site-packages\torch\utils\data\dataloader.py", line 1371, in _process_data
data.reraise()
File "D:\Anaconda3\envs\autoavsr\lib\site-packages\torch_utils.py", line 644, in reraise
raise exception
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 5.
Original Traceback (most recent call last):
File "D:\Anaconda3\envs\autoavsr\lib\site-packages\torch\utils\data_utils\worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "D:\Anaconda3\envs\autoavsr\lib\site-packages\torch\utils\data_utils\fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "D:\Anaconda3\envs\autoavsr\lib\site-packages\torch\utils\data_utils\fetch.py", line 51, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "D:\avsr\deep_avsr\audio_visual\data\lrs2_dataset.py", line 112, in getitem
inp, trgt, inpLen, trgtLen = prepare_main_input(audioFile, visualFeaturesFile, targetFile, noise, self.reqInpLen, self.charToIx,
File "D:\avsr\deep_avsr\audio_visual\data\utils.py", line 27, in prepare_main_input
with open(targetFile, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'D:/lrs2_v1/main/.txt'
So it appears that something cannot be found. This is weird, since the training process works but the error occurs during evaluation, even though I think the dataloaders for the two processes have a similar structure. Can you help me fix this problem?
Can you please provide pretrained models for the project?
Dear Professor, is it possible to get all the data in each training step when training the network?
I ask because in the code I see that the pretrain and train videos are grouped and only one video of the group is selected at a time.
if self.dataset == "pretrain":
    # (1) index goes from 0 to stepSize-1
    # (2) dividing the dataset into partitions of size equal to stepSize and selecting a random partition
    # (3) fetch the sample at position 'index' in this randomly selected partition
    base = self.stepSize * np.arange(int(len(self.datalist)/self.stepSize)+1)
    ixs = base + index
    ixs = ixs[ixs < len(self.datalist)]
    index = np.random.choice(ixs)
targetFile = self.datalist[index] + ".txt"
targetFile_name = self.datalist[index]
inp, trgt, inpLen, trgtLen = prepare_pretrain_input(visualFeaturesFile, targetFile, self.numWords, self.charToIx, self.videoParams)
Is the data not grouped during validation and testing?
In addition, this model is TM-CTC. Could the TM-seq2seq model be published?
I hope to get your answer.
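To make the index arithmetic in the quoted snippet concrete, here is a standard-library restatement with toy numbers. It mirrors only the four lines that compute `index`; the function names and the toy sizes are hypothetical, and it says nothing about how validation or test data are handled.

```python
import random


def candidate_indices(index, step_size, dataset_len):
    """Positions that sit at offset `index` inside each partition."""
    num_partitions = dataset_len // step_size + 1
    candidates = [p * step_size + index for p in range(num_partitions)]
    # Drop positions that fall past the end of the dataset.
    return [i for i in candidates if i < dataset_len]


def sample_index(index, step_size, dataset_len, rng=random):
    """Pick one of the candidate positions uniformly at random."""
    return rng.choice(candidate_indices(index, step_size, dataset_len))


# Toy setup: 10 samples, 4 items per step.
print(candidate_indices(1, 4, 10))  # [1, 5, 9]
print(candidate_indices(3, 4, 10))  # [3, 7]
```

Read this way, one step draws `step_size` samples, one per offset, each from a randomly chosen partition. Any single step sees only part of the data, while repeated steps can reach every sample.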
Hi @lordmartian
Sorry to bother you again! I am a newbie to Visual Speech Recognition. I have some questions about the details of beam search. Could you offer some explanation? Thanks in advance.
My questions concern entries[labeling].logPrTotal and entries[labeling].logPrText. I understand blank, which means a symbol inserted between duplicate characters, but there seems to be no explanation of logPrTotal or logPrText. Besides, these two variables seem to play a part in calculating the score used in beam search. Could you help me understand this better? Thanks! I am eagerly waiting for your reply!
Best wishes!
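For readers with the same question, the sketch below shows how CTC beam search implementations commonly keep their scores. It illustrates the general technique and is not the code in this repository. The snake_case names are hypothetical. The reading that a "total" log probability combines the paths ending in blank with those ending in a character, and that a "text" log probability holds a language model score, is an assumption to verify against the decoder source.

```python
import math
from dataclasses import dataclass

NEG_INF = float("-inf")


def log_add(a, b):
    """Return log(exp(a) + exp(b)) without leaving log space."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


@dataclass
class BeamEntry:
    log_pr_blank: float = NEG_INF      # paths of this labeling ending in blank
    log_pr_non_blank: float = NEG_INF  # paths ending in a character
    log_pr_text: float = 0.0           # assumed language model score

    @property
    def log_pr_total(self):
        # Both kinds of path collapse to the same labeling, so their
        # probabilities are summed.
        return log_add(self.log_pr_blank, self.log_pr_non_blank)

    def score(self, lm_weight=0.0):
        # Beams are ranked by acoustic score plus a weighted LM score.
        return self.log_pr_total + lm_weight * self.log_pr_text


entry = BeamEntry(log_pr_blank=math.log(0.2), log_pr_non_blank=math.log(0.3))
print(math.exp(entry.log_pr_total))
```

The two path probabilities are tracked separately because extending a labeling with a repeated character is only valid after a blank, so the decoder needs to know how much probability mass ends in each state.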
Hi,
I happened to come across this repo and think it's pretty awesome. I can't seem to locate the config.py file, though. Was that intentional?
Hi,
I have some questions about the results under noisy audio. My understanding is that, to get the WER for noisy audio, one needs to set 'args["NOISE_PROBABILITY"] = 1' in config.py. Could you please tell me whether my understanding is correct? Thank you very much for your help.
Hello, thanks for sharing the code.
I have a question, is the audio-visual.pt file you uploaded already integrated with the weights of AO and VO?
Hi, which face detector and face alignment method are used for the LRS2 and LRS3 datasets?
While performing inference on the audio-visual model with USE_LM set to True, the demo gets stuck and never completes. Specifically, line number 34 in lrs2_char_lm.py, batch, finalStateBatch = self.lstm(batch, initStateBatch), executes endlessly. How can this be resolved? Thanks!
Hi there,
Thanks for such nice code.
I didn't find early stopping in your code. So each time 'PRETRAIN_NUM_WORDS' is changed, will it run 1000 steps/epochs? That would take many days to train the whole model.
May I know how long it took you to train the whole AV model?
Looking forward to your reply. Thanks!
I noticed that the open source model in Ref 1 is based on the TensorFlow framework, have you converted this open source model to the PyTorch framework? In addition, the model in Ref 1 was trained with the visual front end and the back end together, did you cut the back end before and only retain the weight and structure of the visual front end?
I keep getting a CUDA out of memory error while preprocessing my data. I ran video_only/preprocess.py to preprocess the LRS-3 dataset, and I always run into this issue on the same file.
Traceback
Number of data samples to be processed = 21864
Starting preprocessing ....
Preprocess: 36%|██████▊ | 7830/21864 [37:01<3:02:49, 1.28it/s]
Ignoring error for /content/Dataset/Part 6/kSZEsPnhIXg/00005: CUDA out of memory. Tried to allocate 47.83 GiB (GPU 0; 39.59 GiB total capacity; 13.32 GiB already allocated; 24.53 GiB free; 13.34 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
File "/content/drive/MyDrive/FYP/deep_avsr/video_only/preprocess.py", line 53, in main
preprocess_sample(file, params)
File "/content/drive/MyDrive/FYP/deep_avsr/video_only/utils/preprocessing.py", line 58, in preprocess_sample
outputBatch = vf(inputBatch)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/content/drive/MyDrive/FYP/deep_avsr/video_only/models/visual_frontend.py", line 107, in forward
batch = self.frontend3D(inputBatch)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py", line 139, in forward
input = module(input)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py", line 607, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py", line 603, in _conv_forward
input, weight, bias, self.stride, self.padding, self.dilation, self.groups
RuntimeError: CUDA out of memory. Tried to allocate 47.83 GiB (GPU 0; 39.59 GiB total capacity; 13.32 GiB already allocated; 24.53 GiB free; 13.34 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I have tried setting max_split_size_mb in the PYTORCH_CUDA_ALLOC_CONF environment variable, and the issue still occurs.
I set the value to 1, 10, 1024, and 2048. None of them worked.
I'm forced to work around it by catching the error and deleting the file that causes the out of memory because it's always the same file that is causing it.
My workaround in video_only/preprocess.py
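# imports needed by this snippet
import contextlib
import os
import sys
import traceback
import torch
from tqdm import tqdm

def silent_rem(file):
    # delete the file, ignoring a missing file or any other OS error
    with contextlib.suppress(OSError):
        os.remove(file)

for file in tqdm(filesList, leave=True, desc="Preprocess", ncols=75):
    try:
        preprocess_sample(file, params)
    except RuntimeError as e:
        # report the failure, then drop every file belonging to the sample
        print(f"\nIgnoring error for {file}:", e, file=sys.stderr)
        traceback.print_exc()
        silent_rem(file + ".txt")
        silent_rem(file + ".mp4")
        silent_rem(file + ".npy")
        silent_rem(file + ".png")
    finally:
        # release cached GPU memory before the next sample
        torch.cuda.empty_cache()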
I don't really like this but it's the only way I can solve this and move forward. It would be really helpful if there is a way to fix it.
I've run it on 2 separate machines with the same issue occurring on different files.
Hi, I didn't have any bugs when training AV on a single GPU. But when I try to train AV on multiple GPUs, CTC_LOSS gives the following error:
Traceback (most recent call last):
File "/home/cca01/work2020/guominghao/Deep_AV_ASR/audio_visual/train.py", line 162, in
main()
File "/home/cca01/work2020/guominghao/Deep_AV_ASR/audio_visual/train.py", line 111, in main
trainingLoss, trainingCER, trainingWER = train(model, trainLoader, optimizer, loss_function, device, trainParams)
File "/home/cca01/work2020/guominghao/Deep_AV_ASR/audio_visual/utils/general.py", line 74, in train
loss = loss_function(outputBatch, targetBatch, inputLenBatch, targetLenBatch)
File "/home/cca01/work2020/guominghao/.conda/envs/Deep_AV_PYTORCH/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in call
result = self.forward(*input, **kwargs)
File "/home/cca01/work2020/guominghao/.conda/envs/Deep_AV_PYTORCH/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 1295, in forward
self.zero_infinity)
File "/home/cca01/work2020/guominghao/.conda/envs/Deep_AV_PYTORCH/lib/python3.6/site-packages/torch/nn/functional.py", line 1767, in ctc_loss
zero_infinity)
RuntimeError: Expected tensor to have size at least 660 at dimension 1, but got size 1474 for argument #2 'targets' (while checking arguments for ctc_loss_gpu)
Have you ever encountered this problem?
When I run the preprocess file, I encounter the following issue:
File "/home/aa/zzs/deep_avsr_master/audio_visual/preprocess.py", line 110, in
main()
File "/home/aa/zzs/deep_avsr_master/audio_visual/preprocess.py", line 49, in main
preprocess_sample(file, params)
File "/home/aa/zzs/deep_avsr_master/audio_visual/utils/preprocessing.py", line 52, in preprocess_sample
cv.imwrite(roiFile, np.floor(255*np.concatenate(roiSequence, axis=1)).astype(np.int))
File "<array_function internals>", line 180, in concatenate
ValueError: need at least one array to concatenate
You have the following description in Important Training Details:
"We perform iterations of Curriculum Learning by changing the PRETRAIN_NUM_WORDS config option. The number of words used in each iteration of curriculum learning is as follows: 1,2,3,5,7,9,13,17,21,29,37, i.e., 11 iterations in total."
I am confused. Can you describe in detail what "iteration" refers to? My understanding is that the network needs to iterate 1000 times ("NUM_STEPS") every time "PRETRAIN_NUM_WORDS" is changed in the curriculum learning strategy.
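One reading of the quoted training note is that an "iteration" is a complete pretraining run at a fixed PRETRAIN_NUM_WORDS, with each run starting from the weights of the previous one. The sketch below spells out that reading only. The functions `run_curriculum` and `fake_pretrain` are hypothetical stand-ins, and treating NUM_STEPS as an upper bound per iteration is an assumption; whether each run should instead stop at convergence is exactly what this issue asks.

```python
# Word counts quoted from "Important Training Details".
WORD_SCHEDULE = [1, 2, 3, 5, 7, 9, 13, 17, 21, 29, 37]


def run_curriculum(pretrain_fn, schedule=WORD_SCHEDULE, num_steps=1000):
    """Run one pretraining pass per schedule entry, chaining the weights."""
    weights = None  # the first iteration starts from scratch
    history = []
    for iteration, num_words in enumerate(schedule, start=1):
        weights = pretrain_fn(num_words=num_words,
                              init_weights=weights,
                              max_steps=num_steps)
        history.append((iteration, num_words))
    return weights, history


def fake_pretrain(num_words, init_weights, max_steps):
    # Stand-in for a real training run: returns a label, not a checkpoint.
    return f"weights_after_{num_words}_words"


final_weights, history = run_curriculum(fake_pretrain)
print(len(history), final_weights)  # 11 weights_after_37_words
```

In practice each pass would correspond to editing PRETRAIN_NUM_WORDS and the pretrained-weights path in the config and rerunning the pretraining script.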
Hi there,
Thanks a lot for your open source repo and code. I adapted your code to build a new AVNet and train it directly on AV data. When doing curriculum learning with the number of words = 1, my model converged at a WER of 0.55. But when I used this checkpoint to initialize training with the number of words = 2, the model converged at a WER of 0.7, which is even worse than with 1 word. I expected the WER to drop after each iteration.
My question is: did you ever encounter problems like this when training the AO or VO model?
Hi, thank you for the work.
I reproduced the code, and it works well with your pretrained weights (pretrain, train, test.py all OK).
But I'd like to try some different methods, so I didn't use the pretrained weights this time
(I haven't changed the code, I just didn't load the pretrained weights).
I was running the video_only pretrain.py
Number of Words = 2
My seed = 18547840
And the loss value looks tricky:
Step: 000 || Tr.Loss: 3.135768 Val.Loss: 2.934046 || Tr.CER: 0.977 Val.CER: 0.967 || Tr.WER: 1.022 Val.WER: 1.001
Step: 001 || Tr.Loss: 2.911791 Val.Loss: 2.872616 || Tr.CER: 0.928 Val.CER: 0.890 || Tr.WER: 1.013 Val.WER: 1.000
Step: 002 || Tr.Loss: 2.946833 Val.Loss: 3.349309 || Tr.CER: 0.917 Val.CER: 1.000 || Tr.WER: 1.012 Val.WER: 1.000
Step: 003 || Tr.Loss: 3.326881 Val.Loss: 3.314617 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 004 || Tr.Loss: 3.323197 Val.Loss: 3.303168 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 005 || Tr.Loss: 3.323742 Val.Loss: 3.327703 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 006 || Tr.Loss: 3.324991 Val.Loss: 3.318675 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 007 || Tr.Loss: 3.321263 Val.Loss: 3.317882 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 008 || Tr.Loss: 3.319875 Val.Loss: 3.322166 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 009 || Tr.Loss: 3.323132 Val.Loss: 3.320766 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 010 || Tr.Loss: 3.325111 Val.Loss: 3.319928 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 011 || Tr.Loss: 3.323602 Val.Loss: 3.306033 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 012 || Tr.Loss: 3.322382 Val.Loss: 3.324228 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 013 || Tr.Loss: 3.319179 Val.Loss: 3.308431 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 014 || Tr.Loss: 3.324116 Val.Loss: 3.316021 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 015 || Tr.Loss: 3.320856 Val.Loss: 3.307545 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
At steps 1 and 2 the loss dropped to 2.9, but it went back to 3.3 at the very next step and stayed there (this continued until step 130, when I stopped it).
The WER remained at 1.00 as well.
I found this situation a little different from #6.
I wonder whether this gives you a clue about how it happens, or how to avoid it?
Or is it just a normal situation?
Thank you for your kindness.
Hi,
Can you provide any more details on the pretrained visual frontend and LMs? Which datasets were used for the pretrained models?
Thanks!
Thanks for sharing such a useful repo.
You mention that:
"We train the AO and VO models first. We then initialize the AV model with weights from the trained AO and VO models as follows: AO Audio Encoder → AV Audio Encoder, VO Video Encoder → AV Video Encoder, VO Video Decoder → AV Joint Decoder."
You use AO and VO to initialize AV for further pre-training on LRS2 pre-train set and fine-tuning on LRS-2 train set. But when you pre-train AO and VO before that initialization, do you only use the pre-train set of LRS-2 or do you pre-train on pre-train+train sets or do you pre-train on pre-train and fine-tune on train set before using AO and VO for AV model training? This is particularly important for us as we will use AO and VO model weights you shared and want to know which LRS2 sets they have seen during training.
Thanks!
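The mapping quoted above (AO Audio Encoder → AV Audio Encoder, VO Video Encoder → AV Video Encoder, VO Video Decoder → AV Joint Decoder) can be pictured as a renaming of checkpoint keys. The sketch below uses plain dictionaries in place of real state dicts. The key prefixes such as `audioEncoder.` and `jointDecoder.` are hypothetical and would need to match the actual module names, and this is not necessarily how the repository performs the initialization.

```python
def take_submodule(state_dict, src_prefix, dst_prefix):
    """Copy the entries under src_prefix, renamed to dst_prefix."""
    return {dst_prefix + key[len(src_prefix):]: value
            for key, value in state_dict.items()
            if key.startswith(src_prefix)}


def build_av_init(ao_state, vo_state):
    """Assemble AV initial weights from AO and VO checkpoints."""
    av_init = {}
    av_init.update(take_submodule(ao_state, "audioEncoder.", "audioEncoder."))
    av_init.update(take_submodule(vo_state, "videoEncoder.", "videoEncoder."))
    # The VO decoder weights are reused under the joint decoder's name.
    av_init.update(take_submodule(vo_state, "videoDecoder.", "jointDecoder."))
    return av_init


# Plain integers stand in for weight tensors.
ao = {"audioEncoder.w": 1, "audioDecoder.w": 2}
vo = {"videoEncoder.w": 3, "videoDecoder.w": 4}
print(build_av_init(ao, vo))
```

With PyTorch, a dictionary built this way would typically be passed to `load_state_dict(..., strict=False)` so that AV-only layers keep their fresh initialization.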
Hello, when I pretrained the AO model, I trained for a total of 220 epochs with PRETRAIN_NUM_WORDS=3. The val WER was best at epoch 190, at 0.259. However, the loss curve seems to be abnormal.
What is the reason?
I uploaded the graph I generated, please help me have a look
Looking forward to your reply, thank you!
Hello, I did not find the pretrain.txt and preval.txt files after running preprocess.py. I created them manually, but they were still empty after executing the script. When I then ran pretrain.py, there was a dataset loading error, and it seems that the content of the two txt files cannot be read. Looking at the code logic, pretrain.txt is supposed to exist from the beginning, but you didn't provide it, right? Thanks.
Hello, I got this error when preprocessing:
RuntimeError: cublas runtime error : the GPU program failed to execute at C:/w/1/s/windows/pytorch/aten/src/THC/THCBlas.cu:331
It seems that CUDA 10.0 does not match. Can I reproduce this code with CUDA 12.0?
Looking forward to your reply, thank you!
Hi, I'm working with your code and have some questions.
What does <EOS> mean? End of sequence? If it is end of sequence, where is the start of sequence, <SOS>?
In LRS2 the videos are sliced, so one video represents one sequence. Does that mean <EOS> and <SOS> are not necessary?
And does " " mean the space between words?
I seem to hit OOM errors as I proceed into pretraining steps/epochs. The memory footprint keeps increasing after each step/epoch. I would have expected it to remain constant for each step/epoch. Am I wrong?
The memory footprint was around ~9G when pretraining started and has steadily increased since.
top - 20:43:22 up 10:22, 1 user, load average: 2.15, 2.53, 2.41
Tasks: 353 total, 2 running, 268 sleeping, 0 stopped, 0 zombie
%Cpu(s): 8.8 us, 3.3 sy, 0.0 ni, 86.6 id, 1.2 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 16368460 total, 190368 free, **13370028 used**, 2808064 buff/cache
KiB Swap: 6291452 total, 4440572 free, 1850880 used. 2390908 avail Mem
A couple of steps later
top - 20:44:38 up 10:23, 1 user, load average: 1.63, 2.29, 2.33
Tasks: 352 total, 2 running, 268 sleeping, 0 stopped, 0 zombie
%Cpu(s): 9.1 us, 3.3 sy, 0.0 ni, 86.2 id, 1.4 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 16368460 total, 159724 free, **13424752** used, 2783984 buff/cache
KiB Swap: 6291452 total, 4431356 free, 1860096 used. 2333444 avail Mem
Is this increase expected or does something need to be re-initialized?
In the dataset, you pad the input video feature on both left and right sides. And in collate_fn, you pad the video feature sequence to the same size, which pads to the right side. Is it necessary to pad on the left side of the input feature?
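The two kinds of padding this question distinguishes can be illustrated with plain lists. The sketch is hedged: it assumes the dataset-level padding splits the missing length between left (rounded down) and right (rounded up), which should be checked against the dataset utilities, and both function names are hypothetical. It shows the mechanics only and does not settle whether the left padding is necessary.

```python
def pad_both_sides(seq, req_len, pad_value=0):
    """Dataset-level padding: centre seq within at least req_len items."""
    missing = max(req_len - len(seq), 0)
    left = missing // 2     # rounded down
    right = missing - left  # rounded up
    return [pad_value] * left + list(seq) + [pad_value] * right


def pad_right(seqs, pad_value=0):
    """Batch-level padding: extend each sequence to the longest one."""
    longest = max(len(s) for s in seqs)
    return [list(s) + [pad_value] * (longest - len(s)) for s in seqs]


print(pad_both_sides([7, 8, 9], 6))  # [0, 7, 8, 9, 0, 0]
print(pad_right([[1, 2, 3], [4]]))   # [[1, 2, 3], [4, 0, 0]]
```

The first function works on one sample to reach a required minimum length; the second only equalizes lengths inside a batch, which is the role of padding in collate_fn.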
Hi
I ran pretraining with num_word 1 and 2, but the validation WER is always 1.0.
The training WER goes down to about 0.5, but the validation WER does not go down.
(I've looked at other issues and changed the seed value.)
Is this correct?
If I increase the num_word in the pretrain, will the Validation WER go down?
Hi I had a few questions about pretrain.py and train.py in AV.
Hi @lordmartian
Thanks for your generous sharing! It seems that the Transformer seq2seq model in Fig. 2 is not implemented. I wonder if I have missed something important. If so, could you help me figure it out?
Thanks for your time!