
Comments (6)

xjwla commented on September 23, 2024

Hi,

I loaded the weights of the audio-only (AO) model trained on clean speech, instead of training from scratch. The result improved under the 10 dB cafeteria noise condition: the CER is 35.64% and the WER is 62.39%. However, this does not match the results in the paper (CER 25.61%, WER 54.48%). Could you please give me some suggestions?

Thanks a lot.


georgesterpu commented on September 23, 2024

Hi @xjwla

Are you reproducing the results from our ICMI'18 article on TCD-TIMIT?

Yes, my training pipeline involves a multi-stage process where the same model is fine-tuned on progressively noisier audio (decreasing SNRs). You can see this as a form of curriculum learning; otherwise, it would be difficult to learn good representations directly on noisy data samples. Schematically, the staging looks like the sketch below.
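
Here, train_stage is a hypothetical placeholder, not a function from this repository; the point is only that each stage resumes from the previous, easier stage's checkpoint:

```python
def train_stage(record_file, init_checkpoint=None):
    """Placeholder: fine-tune the model on one noise level and
    return the path of the resulting checkpoint."""
    ...

# Easiest to hardest: clean speech first, then lower SNRs.
snr_stages = ["clean", "10db", "0db", "-5db"]

checkpoint = None
for snr in snr_stages:
    # Each stage initialises from the checkpoint of the previous stage.
    checkpoint = train_stage(f"tcdtimit_{snr}.tfrecord",
                             init_checkpoint=checkpoint)
```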

A signal-to-noise ratio of 10 dB is still a relatively easy condition, so there shouldn't be large differences from clean speech. At least on LRS2, I don't remember ever seeing convergence issues down to 10 dB.

Are you using the code in this repository, or have you re-implemented the networks in your own framework? What about the data pipeline? Can you listen to a few samples to find out if they match their advertised SNR?
Seq2seq networks with LSTMs are quite tricky to train, particularly on a small dataset, but the default settings in this repository ensure that you have all the bells and whistles enabled (in TF 1.x!).
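
To check the advertised SNR programmatically, assuming additive mixing and that you can recover the clean and noisy waveforms as NumPy arrays, something like this should be enough (not code from this repository):

```python
import numpy as np

def measure_snr(clean, noise):
    """SNR in dB between a clean waveform and the noise that was
    added to it: 10 * log10(P_signal / P_noise)."""
    p_signal = np.mean(clean.astype(np.float64) ** 2)
    p_noise = np.mean(noise.astype(np.float64) ** 2)
    return 10.0 * np.log10(p_signal / p_noise)

# A sample advertised as 10 dB should measure close to 10.0,
# e.g. measure_snr(clean_wav, noisy_wav - clean_wav)
```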

Without specific information about your experiment, there is a large number of possible causes. My best advice would be to first validate your current setup on a more established audio-only dataset such as LibriSpeech, so you can rule out potential issues in the code or in the data pipeline.


xjwla commented on September 23, 2024

Thank you very much for your reply.

Yes, I am reproducing the results from ICMI'18 on TCD-TIMIT, and I am using the code in this repository. I have successfully reproduced the results in your paper on clean speech, but I encounter the problems with the noisy conditions mentioned above. I use write_records_tcd.py from this repository to generate the TFRecord files with added noise. The only difference between my settings and the defaults is that I changed the feature type to 'logmel'; the default setting is 'logmel_stack_w8s3'.
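
For reference, my understanding is that additive noise is usually scaled to a target SNR before mixing, roughly like this (a generic NumPy sketch; write_records_tcd.py may differ in its details):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that mixing it with `clean` yields the
    requested SNR, then return the noisy waveform."""
    # Trim the noise to the utterance length (assumes it is long enough).
    noise = noise[: len(clean)]
    p_signal = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Solve snr_db = 10 * log10(p_signal / (gain**2 * p_noise)) for gain.
    gain = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + gain * noise
```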

I am now looking for the cause, following your suggestion. I did not alter the networks, but the result on speech with 10 dB noise is much worse than on clean speech. Maybe I got something wrong in write_records_tcd.py?

Thank you so much, you are very kind.


georgesterpu commented on September 23, 2024

Thanks a lot for the clarifications, @xjwla

Hmm, I reckon that the audio sequence pre-processing could have a big impact on attention-based seq2seq models.

The main difference between logmel and logmel_stack_w8s3 is the feature frame rate and the amount of information per frame. The former computes the log magnitude spectrum based on a short-time Fourier transform with a frame length of 25 ms and a step size of 10 ms. The latter considers a window of 8 consecutive STFT frames and applies a stride of 3, so the frame rate decreases by a factor of 3 and the receptive field is about 95 ms per frame (80 + 15). Some research literature suggests that low frame rates are necessary for the CTC/RNN-T model family. This repository implements a seq2seq with global (full utterance) attention, which may be prohibitively expensive to train on long, high frame-rate input sequences.
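
For illustration, the stacking step amounts to something like this (a NumPy sketch of my description above, not the exact code in this repository):

```python
import numpy as np

def stack_frames(logmel, window=8, stride=3):
    """Stack `window` consecutive feature frames and keep every
    `stride`-th stack, reducing the frame rate by `stride`.
    `logmel` has shape [num_frames, num_features]."""
    stacks = [logmel[i:i + window].reshape(-1)
              for i in range(0, len(logmel) - window + 1, stride)]
    # Output shape: roughly [num_frames / stride, window * num_features]
    return np.stack(stacks)
```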

Can you try the logmel_stack_w8s3 transform and see if it makes a difference at 10 dB SNR? As you can see in the code, stack_w8s3 is a simple post-processing of logmel, so it doesn't change the data samples.


xjwla commented on September 23, 2024

I tried the logmel_stack_w8s3 transform at 10 dB SNR. Unfortunately, it didn't make a large difference: the CER is 31.07% and the WER is 58.53%. Currently, at 10 dB SNR, I first load the model parameters trained on clean speech (where the CER is 20.85% and the WER is 45.58%) and train until the error rate no longer decreases, then reduce the learning rate from 0.001 to 0.0001. But the result is still not ideal. Do I need to change some of the parameters compared to training in clean conditions? Thanks a lot.


georgesterpu commented on September 23, 2024

The example run_audio.py script is designed so that you can launch a full experiment under conditions very similar to those described in the article, except for the number of epochs per noise level. If you are using a modified version of this script (e.g. with your own data paths), could you please paste its contents here? The default parameters of the AVSR class can be overridden from the main launch script through kwargs, if needed. To answer your question: no, you don't need to change any hyper-parameters under different noise conditions.

If you provide all the audio record files (i.e. clean, 10 dB, 0 dB, -5 dB), there is no need to manually load model parameters from checkpoints; the AVSR.train method will take care of that. Again, training directly on noisy samples is likely to worsen the accuracy on TCD-TIMIT, and I would like to have a clearer picture of the experiment you are running.
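
To make this concrete, here is a rough sketch of what I mean (all names, paths, and keyword arguments below are placeholders; please check run_audio.py for the actual signatures):

```python
# Hypothetical launch sketch; see run_audio.py for the real code.
from avsr import AVSR  # assumed import path

experiment = AVSR(
    # All four noise levels are handed over at once, so that the train
    # method can stage the curriculum and restore checkpoints itself.
    audio_train_records=["clean.tfrecord", "10db.tfrecord",
                         "0db.tfrecord", "-5db.tfrecord"],
    learning_rate=0.001,  # defaults can be overridden via kwargs
)
experiment.train()
```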

Also, what dataset partitioning are you using?

I hope this helps. Please let me know if you find the cause of your issue.

