Hi,
I loaded the weights of the ao model trained on clean speech, instead of training from scratch. The results improved under the 10 dB cafeteria noise condition: the CER is 35.64% and the WER is 62.39%. But this does not match the results in the paper (CER 25.61%, WER 54.48%). Could you please give me some suggestions?
Thanks a lot.
from avsr-tf1.
Hi @xjwla
Are you reproducing the results from our ICMI'18 article on TCD-TIMIT?
Yes, my training pipeline involves a multi-stage process where the same model is fine-tuned on gradually increasing audio SNRs. You can see this as a form of curriculum learning. Otherwise, it would be difficult to learn good representations directly on noisy data samples.
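Such a staged setup requires noisy copies of each utterance at known SNRs. As a rough illustration (a generic sketch, not the repository's actual noise-mixing code), the noise clip can be scaled so that the mixture hits a target SNR:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean + noise has the requested SNR, then mix.

    Both inputs are 1-D float arrays; `noise` must be at least as long
    as `clean`. Returns the noisy mixture at the target SNR.
    """
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise[: len(clean)], dtype=np.float64)
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Power the noise must have so that 10*log10(P_clean / P_noise) == snr_db
    target_noise_power = clean_power / (10.0 ** (snr_db / 10.0))
    scale = np.sqrt(target_noise_power / noise_power)
    return clean + scale * noise
```

With records generated this way, each curriculum stage simply points the same model at a different SNR's tfrecord files.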
A signal-to-noise ratio of 10 dB is still a relatively easy condition, so there shouldn't be large differences from clean speech. At least on LRS2, I don't remember ever seeing convergence issues down to 10 dB.
Are you using the code in this repository, or have you re-implemented the networks in your own framework? What about the data pipeline? Can you listen to a few samples to find out if they match their advertised SNR?
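One quick sanity check along these lines, assuming you still have the aligned clean waveform for each noisy record, is to estimate the realized SNR from the pair (a minimal sketch, not code from this repository):

```python
import numpy as np

def measured_snr_db(clean, noisy):
    """Estimate the SNR of a noisy mixture, given the aligned clean signal.

    The noise is recovered as the residual (noisy - clean), so this only
    works when the two signals are sample-aligned and identically scaled.
    """
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noisy, dtype=np.float64) - clean
    signal_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    return 10.0 * np.log10(signal_power / noise_power)
```

If the value printed for a few utterances is far from the advertised 10 dB, the problem is in the record-writing step rather than in the model.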
Seq2seq networks with LSTMs are quite tricky to train, particularly on a small dataset, but the default settings in this repository ensure that you have all the bells and whistles enabled (in tf 1.x !).
Without specific information about your experiment, there is a large number of possible causes. My best advice would be to first validate your current setup on a more established audio-only dataset like LibriSpeech, to rule out potential issues in the code or in the data pipeline.
Thank you very much for your reply.
Yes, I am reproducing the results from ICMI'18 on TCD-TIMIT, and I am using the code in this repository. I have successfully reproduced the results in your paper on clean speech, but I encounter some problems under the noise condition as mentioned above. I use write_records_tcd.py from this repository to generate the tfrecord files with added noise. The only difference from the default settings is that I changed the feature type to 'logmel' (the default is 'logmel_stack_w8s3').
And now I am looking for the cause following your suggestion. I have not altered the networks, but the results on speech with 10 dB noise are much worse than on clean speech. Maybe I got something wrong in 'write_records_tcd.py'?
Thank you so much, you are very kind.
Thanks a lot for the clarifications, @xjwla
Hmm, I reckon that the audio sequence pre-processing could have a big impact on attention-based seq2seq models.
The main difference between logmel and logmel_stack_w8s3 is the feature frame rate and the amount of information per frame. The former computes the log magnitude spectrum based on a short-time Fourier transform (STFT) with a frame length of 25 ms and a step size of 10 ms. The latter considers a window of 8 consecutive STFT frames and applies a stride of 3, so the frame rate decreases by a factor of 3 and the receptive field is about 95 ms per frame (80 + 15). Some research literature suggests that low frame rates are necessary for the CTC/RNN-T model family. This repository implements a seq2seq with global (full utterance) attention, which may be prohibitively expensive to train at high frame rates.
Can you try the logmel_stack_w8s3 transform and see if it makes a difference at 10 dB SNR? As you can see in the code, stack_w8s3 is a simple post-processing of logmel, so it doesn't change the data samples.
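To make the window-8 / stride-3 stacking concrete, here is a minimal sketch (the repository's implementation may differ in padding and edge handling):

```python
import numpy as np

def stack_frames(feats, window=8, stride=3):
    """Stack consecutive feature frames.

    Each output frame concatenates `window` consecutive input frames,
    advancing by `stride` frames each step, so the frame rate drops by
    a factor of `stride` while each frame covers a wider time context.

    feats: array of shape (num_frames, num_bins), e.g. log-mel features.
    """
    num_frames, num_bins = feats.shape
    out = []
    for start in range(0, num_frames - window + 1, stride):
        out.append(feats[start:start + window].reshape(-1))
    return np.stack(out)  # shape: (num_stacked, window * num_bins)
```

With 25 ms frames and a 10 ms hop, each stacked frame spans (8 - 1) * 10 + 25 = 95 ms, while the sequence fed to the attention mechanism becomes three times shorter.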
I tried the logmel_stack_w8s3 transform at 10 dB SNR. Unfortunately, it didn't make a large difference: the CER is 31.07% and the WER is 58.53%. Now, at 10 dB SNR, I first load the model parameters trained on clean speech (on clean speech the CER is 20.85% and the WER is 45.58%) and train until the error rate no longer decreases. Then I reduce the learning rate from 0.001 to 0.0001. But the result is still not ideal. Do I need to change some of the parameters compared to training in clean conditions? Thanks a lot.
The example run_audio.py script is designed so that you can launch a full experiment under very similar conditions to what is described in the article, except for the number of epochs per noise level. In case you are using a modified version of this script (e.g. with your own data paths), could you please paste its contents here? The default parameters of the AVSR class can be overridden from the main launch script through kwargs, if needed. To answer your question, you don't need to change any hyper-parameters under different noise conditions.
If you provide all the audio record files (i.e. clean, 10 dB, 0 dB, -5 dB), there is no need to manually load model parameters from checkpoints. The AVSR.train method will take care of that. Again, training directly on noisy samples is likely to worsen the accuracy on TCD-TIMIT, and I would like to have a clearer picture of the experiment you are running.
Also, what dataset partitioning are you using?
I hope this helps. Please let me know if you find the cause of your issue.