Comments (10)

pzelasko commented on July 21, 2024
  1. Yes, the distribution of examples in mini-batches will be stationary (matching the weights passed to mux) until one of the iterators ends; the tail of the iteration no longer preserves it.
  2. It’s hard to say, but oversampling smaller datasets typically helps to get better results on them.
  3. I suggest calling .repeat().shuffle() on every input cutset to make them infinite, and then tweaking the weights. After that, training should be tracked by steps rather than epochs. This ensures that the model observes the dataset distribution you want throughout the whole training.
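
The behaviour described in the list above can be illustrated with a minimal sketch that uses only the Python standard library, not Lhotse itself. `weighted_mux` is a hypothetical stand-in for weighted multiplexing, and `itertools.cycle` plays the role of `.repeat()`; both are assumptions made for illustration.

```python
import itertools
import random
from collections import Counter


def weighted_mux(streams, weights, seed=0):
    """Pick a source stream at random for every item, in proportion to `weights`.

    Exhausted streams are dropped, so the mix changes once any stream ends.
    """
    rng = random.Random(seed)
    active = [(iter(s), w) for s, w in zip(streams, weights)]
    while active:
        idx = rng.choices(range(len(active)), weights=[w for _, w in active])[0]
        try:
            yield next(active[idx][0])
        except StopIteration:
            active.pop(idx)


small = ["small"] * 10      # a small dataset
large = ["large"] * 1000    # a much larger one

# Finite inputs: the small stream runs out early and the tail is all "large".
finite = list(weighted_mux([small, large], weights=[0.5, 0.5]))

# Infinite inputs: the 50/50 mix holds for as long as we keep drawing,
# so iteration has to be cut off after a fixed number of steps.
infinite = weighted_mux(
    [itertools.cycle(small), itertools.cycle(large)], weights=[0.5, 0.5]
)
counts = Counter(itertools.islice(infinite, 10_000))
print(Counter(finite[-500:]), counts)
```

With finite inputs the requested mix only holds near the start; with infinite inputs it holds at every step, which is why progress then has to be measured in steps.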

from icefall.

pzelasko commented on July 21, 2024

> Thanks for your reply. It's working for me.
>
> But why is the map function executed in every epoch, modifying the original data each time, when using .repeat(), while valid_cuts does not modify the original data? In my test, I use .repeat(1) on train_cuts but not on valid_cuts.

It might be related to eager vs. lazy cut sets. A lazy cut set is read from the file on each iteration, so mutating changes do not persist. With eager cut sets, the changes persist and stack on top of each other with each iteration.
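
A minimal sketch of this difference, using plain Python objects rather than Lhotse classes. `EagerSet`, `LazySet`, and `mutate` are hypothetical names introduced for illustration; the list of JSON strings stands in for a manifest file on disk.

```python
import json

# Stands in for a manifest file on disk.
RAW_LINES = ['{"id": "utt1", "text": "abc"}']


def mutate(item):
    """A map function that changes its argument in place."""
    item["text"] = item["text"] + "!"
    return item


class EagerSet:
    """Parses everything once and keeps the same objects in memory."""

    def __init__(self, lines):
        self.items = [json.loads(line) for line in lines]

    def __iter__(self):
        return iter(self.items)


class LazySet:
    """Re-parses the source on every iteration, yielding fresh objects."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        return (json.loads(line) for line in self.lines)


def run_epochs(dataset, num_epochs):
    return [[mutate(item)["text"] for item in dataset] for _ in range(num_epochs)]


eager_texts = run_epochs(EagerSet(RAW_LINES), 3)
lazy_texts = run_epochs(LazySet(RAW_LINES), 3)
print(eager_texts)  # [['abc!'], ['abc!!'], ['abc!!!']]
print(lazy_texts)   # [['abc!'], ['abc!'], ['abc!']]
```

The eager variant hands the same objects to the map function in every epoch, so in-place changes accumulate; the lazy variant starts from the stored data each time.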

johnchienbronci commented on July 21, 2024

Thanks for your reply.

> The training should be tracked by steps and not epochs after that. You’ll ensure that throughout the whole training the model observes the dataset distribution you want.

How can I modify the code (zipformer/train.py) to set the maximum number of steps instead of epochs?
Currently, the parameter can only specify the number of epochs.

pzelasko commented on July 21, 2024

You might need to modify the code a bit in that case. Since the dataloader will never finish iterating, you’d need to move validation and checkpoint saving into the training loop, to be executed every N steps. Also make sure to set a different random seed if you continue the training; otherwise you’ll iterate over the same data as before (most Lhotse classes accept a “trng” seed value to automatically randomize the seed, at the cost of non-reproducibility).
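
One possible shape for such a loop is sketched below. It is not taken from zipformer/train.py: `train_by_steps` and all of its parameters are hypothetical, and `itertools.count()` merely stands in for a dataloader that never ends.

```python
import itertools


def train_by_steps(batches, train_step, validate, save_checkpoint,
                   max_steps, valid_interval, save_interval):
    """Run exactly `max_steps` updates over a possibly infinite batch iterator."""
    step = 0
    # islice stops the otherwise endless iteration after max_steps batches.
    for step, batch in enumerate(itertools.islice(batches, max_steps), start=1):
        train_step(batch)
        # Validation and checkpointing live inside the loop, keyed on the step.
        if step % valid_interval == 0:
            validate(step)
        if step % save_interval == 0:
            save_checkpoint(step)
    return step


log = []
last_step = train_by_steps(
    batches=itertools.count(),                 # placeholder infinite dataloader
    train_step=lambda batch: None,             # placeholder for forward/backward
    validate=lambda step: log.append(("valid", step)),
    save_checkpoint=lambda step: log.append(("save", step)),
    max_steps=10,
    valid_interval=3,
    save_interval=5,
)
print(last_step, log)
```

Keying everything on the step counter means the loop no longer relies on the dataloader signalling the end of an epoch.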

johnchienbronci commented on July 21, 2024

ok, thanks

novahsubronci commented on July 21, 2024

@pzelasko
Hi,
I have a question about .repeat()

[Image: zipformer_map]

When I use .repeat() on train_cuts, it seems to call the map function in every epoch:

epoch 1:
common_voice_en_18722497-6765_repeat0: HE IS BURIED IN THE PÈRE LACHAISE CEMETERY IN PARIS
common_voice_en_18722497-6765_repeat0 encode: HE IS BURIED IN THE PũĩRE LACHAISE CEMETERY IN PARIS
epoch 2:
common_voice_en_18722497-6765_repeat0: HE IS BURIED IN THE PũĩRE LACHAISE CEMETERY IN PARIS
common_voice_en_18722497-6765_repeat0 encode: HE IS BURIED IN THE PūŎŪŎRE LACHAISE CEMETERY IN PARIS
epoch 3:
common_voice_en_18722497-6765_repeat0: HE IS BURIED IN THE PūŎŪŎRE LACHAISE CEMETERY IN PARIS
common_voice_en_18722497-6765_repeat0 encode: HE IS BURIED IN THE PūŐūįūŐūįRE LACHAISE CEMETERY IN PARIS
epoch 4:
common_voice_en_18722497-6765_repeat0: HE IS BURIED IN THE PūŐūįūŐūįRE LACHAISE CEMETERY IN PARIS
common_voice_en_18722497-6765_repeat0 encode: HE IS BURIED IN THE PūŐūıūŐŪŔūŐūıūŐŪŔRE LACHAISE CEMETERY IN PARIS

This looks very weird: the original character (i.e. È) is converted into other characters, and the result gets longer and longer with every epoch.

If I don't use .repeat(), this doesn't happen.

Is this a bug with .repeat()?

pzelasko commented on July 21, 2024

You may want to copy both the cut and the supervision inside the map function, so that the function is not applied repeatedly to the same objects. E.g.:

from lhotse.utils import fastcopy
# Inside the map function: return copies instead of mutating the cut in place.
return fastcopy(cut, supervisions=[fastcopy(cut.supervisions[0], text=new_text)])
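
The effect of copying can be reproduced without Lhotse. In the sketch below, the `Cut` and `Supervision` dataclasses, `byte_encode`, and both map functions are hypothetical stand-ins; `dataclasses.replace` is used as a standard-library analogue of `fastcopy`, and the UTF-8 to Latin-1 round trip only imitates an encoding that expands non-ASCII characters.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass
class Supervision:
    text: str


@dataclass
class Cut:
    id: str
    supervisions: List[Supervision]


def byte_encode(text: str) -> str:
    """Stand-in encoding: each non-ASCII character expands into several characters."""
    return text.encode("utf-8").decode("latin-1")


def encode_in_place(cut: Cut) -> Cut:
    # Mutates the supervision that the cut set still holds.
    cut.supervisions[0].text = byte_encode(cut.supervisions[0].text)
    return cut


def encode_with_copy(cut: Cut) -> Cut:
    # Builds new objects and leaves the stored cut untouched.
    sup = replace(cut.supervisions[0], text=byte_encode(cut.supervisions[0].text))
    return replace(cut, supervisions=[sup])


def text_lengths(map_fn, epochs=3):
    cuts = [Cut("utt1", [Supervision("PÈRE")])]  # an in-memory (eager) set
    return [[len(map_fn(c).supervisions[0].text) for c in cuts] for _ in range(epochs)]


in_place = text_lengths(encode_in_place)
copied = text_lengths(encode_with_copy)
print(in_place)  # [[5], [7], [11]]
print(copied)    # [[5], [5], [5]]
```

With in-place mutation, every epoch re-encodes text that was already encoded, so it keeps growing; with copies, every epoch starts from the original text.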

novahsubronci commented on July 21, 2024

Thanks for your reply.
It's working for me.

But why is the map function executed in every epoch, modifying the original data each time, when using .repeat(), while valid_cuts does not modify the original data?
In my test, I use .repeat(1) on train_cuts but not on valid_cuts.

danpovey commented on July 21, 2024

> Thanks for your reply.
>
> > The training should be tracked by steps and not epochs after that. You’ll ensure that throughout the whole training the model observes the dataset distribution you want.
>
> How can I modify the code (zipformer/train.py) to set the maximum number of steps instead of epochs? Currently, the parameter can only specify the number of epochs.

Something else to watch out for: if you do this, you should use the Eden2 scheduler, not the Eden scheduler.
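
A rough sketch of why the choice of scheduler can matter here. It assumes that an Eden-style schedule multiplies a batch-based decay term by an epoch-based one, and that a batch-only schedule such as Eden2 has no epoch term. The function names, constants, and exponents are illustrative, and warmup is omitted; check icefall's optim.py for the actual definitions.

```python
def eden_like_factor(batch, epoch, lr_batches=5000.0, lr_epochs=6.0):
    """Illustrative LR factor with both a batch term and an epoch term (assumed form)."""
    batch_term = ((batch**2 + lr_batches**2) / lr_batches**2) ** -0.25
    epoch_term = ((epoch**2 + lr_epochs**2) / lr_epochs**2) ** -0.25
    return batch_term * epoch_term


def batch_only_factor(batch, lr_batches=5000.0):
    """Illustrative LR factor that depends on the batch count alone (assumed form)."""
    return ((batch**2 + lr_batches**2) / lr_batches**2) ** -0.5


# With an infinite dataloader the epoch counter never advances, so the
# epoch term stays at 1.0 and contributes no decay at all.
stuck_epoch = eden_like_factor(batch=100_000, epoch=0)
advancing_epoch = eden_like_factor(batch=100_000, epoch=20)
print(stuck_epoch, advancing_epoch, batch_only_factor(100_000))
```

Under these assumptions, a schedule that relies on the epoch count decays more slowly than intended when training is driven purely by steps, whereas a batch-only schedule is unaffected.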

novahsubronci commented on July 21, 2024

Thanks for your help
