Use CutSet.mux to effect? about icefall HOT 10 CLOSED

johnchienbronci commented on July 21, 2024

Use CutSet.mux to effect?

from icefall.

Comments (10)

pzelasko commented on July 21, 2024

Yes, the distribution of examples in mini-batches will be stationary (matching the weights in mux) until some iterator ends; the tail of the iteration no longer preserves it.
It’s hard to say, but typically oversampling smaller datasets helps to get better results on them.
I suggest calling .repeat().shuffle() on every input cut set to make them infinite, then tweaking the weights. After that, the training should be tracked by steps and not epochs. This ensures that the model observes the dataset distribution you want throughout the whole training.
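
To make the idea of weighted mixing over infinite, reshuffled streams concrete, here is a minimal pure-Python sketch. It does not use Lhotse and is not how CutSet.mux is implemented; `repeat_shuffled` and `mux` are hypothetical stand-ins, and the dataset sizes and weights are made up for illustration.

```python
import itertools
import random
from collections import Counter


def repeat_shuffled(items, rng):
    """Yield items forever, reshuffling on every pass (stand-in for an infinite, shuffled cut set)."""
    while True:
        one_pass = list(items)
        rng.shuffle(one_pass)
        yield from one_pass


def mux(streams, weights, rng):
    """At each step, draw from one stream chosen with probability proportional to its weight."""
    while True:
        (stream,) = rng.choices(streams, weights=weights, k=1)
        yield next(stream)


rng = random.Random(0)
large = [f"large-{i}" for i in range(1000)]
small = [f"small-{i}" for i in range(10)]

# Oversample the small dataset: it holds about 1% of the data but gets weight 0.3.
stream = mux(
    [repeat_shuffled(large, rng), repeat_shuffled(small, rng)],
    weights=[0.7, 0.3],
    rng=rng,
)

# The streams never end, so take a fixed number of examples (i.e. count steps).
sample = list(itertools.islice(stream, 20_000))
counts = Counter(name.split("-")[0] for name in sample)
print({source: round(n / len(sample), 2) for source, n in counts.items()})
```

Because neither stream ever ends, the mixing proportions stay close to the weights for as long as you keep drawing, which is why the run has to be bounded by a step count rather than by an epoch.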

pzelasko commented on July 21, 2024

> Thanks for your reply. It's working for me.
> But why does it execute the map function and modify the original data in every epoch when using .repeat(), while valid_cuts does not modify the original data? In my test, I use .repeat(1) on train_cuts but not on valid_cuts.

It might be related to eager vs. lazy cut sets. A lazy cut set is read from the file on each iteration, so mutating changes are not persistent. With eager cut sets, the changes persist and stack on top of each other on every iteration.
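
The difference can be illustrated with a small pure-Python sketch. It does not use Lhotse: `Item`, `LazySource`, and `mutate` are hypothetical stand-ins, where the "lazy" source rebuilds its objects on every iteration (as re-reading a file would) and the "eager" one keeps them in a list.

```python
from dataclasses import dataclass


@dataclass
class Item:
    text: str


def mutate(item):
    # In-place change, like a map function that edits the object it receives.
    item.text = item.text + "!"
    return item


class LazySource:
    """Re-creates its items on every iteration, like re-reading a manifest file."""

    def __iter__(self):
        return iter([Item("hello")])


eager = [Item("hello")]  # the same objects are reused across iterations
lazy = LazySource()      # fresh objects on every iteration

eager_seen, lazy_seen = [], []
for epoch in range(3):
    eager_seen += [mutate(item).text for item in eager]
    lazy_seen += [mutate(item).text for item in lazy]

print(eager_seen)  # ['hello!', 'hello!!', 'hello!!!']
print(lazy_seen)   # ['hello!', 'hello!', 'hello!']
```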

johnchienbronci commented on July 21, 2024

Thanks for your reply.

> The training should be tracked by steps and not epochs after that. You’ll ensure that throughout the whole training the model observes the dataset distribution you want.

How can I modify the code (zipformer/train.py) to set a maximum number of steps instead of epochs?
Currently, the parameter can only specify the number of epochs.

pzelasko commented on July 21, 2024

You might need to modify the code a bit in that case. Since the dataloader will never finish iterating, you’d need to move validation and checkpoint saving into the training loop, to be executed every N steps. Also make sure to set a different random seed if you continue the training; otherwise you’ll iterate over the same data as before. (Most Lhotse classes accept a “trng” seed value, which randomizes the seed automatically at the cost of reproducibility.)
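
As a rough sketch of that restructuring (not the actual zipformer/train.py code), the loop below counts steps over a never-ending iterator and runs validation and checkpointing every N steps. The function `train_by_steps` and its callback parameters are hypothetical names chosen for illustration.

```python
import itertools


def train_by_steps(batches, max_steps, every_n_steps, train_step, validate, save_checkpoint):
    """Run at most max_steps updates over a (possibly infinite) iterator of batches."""
    step = 0
    # islice bounds the loop, since the iterator itself never finishes.
    for step, batch in enumerate(itertools.islice(batches, max_steps), start=1):
        train_step(batch)
        if step % every_n_steps == 0:
            validate(step)
            save_checkpoint(step)
    return step


# Toy usage: itertools.count() stands in for an infinite dataloader.
log = []
final_step = train_by_steps(
    batches=itertools.count(),
    max_steps=10,
    every_n_steps=4,
    train_step=lambda batch: None,
    validate=lambda step: log.append(("valid", step)),
    save_checkpoint=lambda step: log.append(("ckpt", step)),
)
print(final_step, log)
```

When resuming, the saved step count would be restored and the data seed changed, as noted above, so the resumed run does not replay the same data order.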

johnchienbronci commented on July 21, 2024

ok, thanks

novahsubronci commented on July 21, 2024

@pzelasko
Hi,
I have a question about .repeat().

When I use .repeat() on train_cuts, it seems to call the map function in every epoch:

epoch 1:
common_voice_en_18722497-6765_repeat0: HE IS BURIED IN THE PÈRE LACHAISE CEMETERY IN PARIS
common_voice_en_18722497-6765_repeat0 encode: HE IS BURIED IN THE PũĩRE LACHAISE CEMETERY IN PARIS
epoch 2:
common_voice_en_18722497-6765_repeat0: HE IS BURIED IN THE PũĩRE LACHAISE CEMETERY IN PARIS
common_voice_en_18722497-6765_repeat0 encode: HE IS BURIED IN THE PūŎŪŎRE LACHAISE CEMETERY IN PARIS
epoch 3:
common_voice_en_18722497-6765_repeat0: HE IS BURIED IN THE PūŎŪŎRE LACHAISE CEMETERY IN PARIS
common_voice_en_18722497-6765_repeat0 encode: HE IS BURIED IN THE PūŐūįūŐūįRE LACHAISE CEMETERY IN PARIS
epoch 4:
common_voice_en_18722497-6765_repeat0: HE IS BURIED IN THE PūŐūįūŐūįRE LACHAISE CEMETERY IN PARIS
common_voice_en_18722497-6765_repeat0 encode: HE IS BURIED IN THE PūŐūıūŐŪŔūŐūıūŐŪŔRE LACHAISE CEMETERY IN PARIS

It looks very weird: the original character (i.e. È) is converted into other characters, and the text then gets longer in every epoch.

Without .repeat(), this doesn't happen.

Is this a bug in .repeat()?

pzelasko commented on July 21, 2024

You may want to copy both the cut and the supervision inside the map function, so that the function is not applied repeatedly to the same objects. E.g.

from lhotse.utils import fastcopy
# inside the map function: return copies instead of mutating `cut` in place
return fastcopy(cut, supervisions=[fastcopy(cut.supervisions[0], text=new_text)])
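
The effect of copying can be checked without Lhotse using the sketch below. `Cut` and `Supervision` here are simplified hypothetical dataclasses, `dataclasses.replace` stands in for fastcopy, and `encode` is an assumed non-idempotent transform chosen only because it makes non-ASCII characters grow on each application; it is not necessarily what the original map function did.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass
class Supervision:
    text: str


@dataclass
class Cut:
    id: str
    supervisions: List[Supervision]


def encode(text):
    # Hypothetical non-idempotent transform: every application lengthens non-ASCII characters.
    return text.encode("utf-8").decode("latin-1")


def map_in_place(cut):
    # Mutates the stored cut, so the next pass starts from already-encoded text.
    cut.supervisions[0].text = encode(cut.supervisions[0].text)
    return cut


def map_with_copy(cut):
    # Returns new objects; the stored cut keeps its original text.
    sup = replace(cut.supervisions[0], text=encode(cut.supervisions[0].text))
    return replace(cut, supervisions=[sup])


mutated = [Cut("a", [Supervision("PÈRE")])]
copied = [Cut("a", [Supervision("PÈRE")])]

for epoch in range(3):
    in_place_lens = [len(map_in_place(c).supervisions[0].text) for c in mutated]
    copy_lens = [len(map_with_copy(c).supervisions[0].text) for c in copied]
    print(epoch, in_place_lens, copy_lens)
```

With in-place mutation the text length keeps growing from one pass to the next, while the copying variant produces the same result on every pass and leaves the stored cut unchanged.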

novahsubronci commented on July 21, 2024

Thanks for your reply.
It's working for me.

But why does it execute the map function and modify the original data in every epoch when using .repeat(), while valid_cuts does not modify the original data?
In my test, I use .repeat(1) on train_cuts but not on valid_cuts.

danpovey commented on July 21, 2024

> Thanks for your reply.
>
> The training should be tracked by steps and not epochs after that. You’ll ensure that throughout the whole training the model observes the dataset distribution you want.
>
> How can I modify the code (zipformer/train.py) to set the maximum number of steps instead of epochs? Currently, the parameter can only specify the number of epochs.

Something else to watch out for: if you do this, you should use the Eden2 scheduler, not the Eden scheduler.
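
A likely reason is that Eden decays the learning rate with both the batch count and the epoch count, and the epoch count never advances when the dataloader is infinite. The sketch below assumes the decay formulas given in the docstrings of icefall's optim.py as best recalled (warmup omitted); treat the exact exponents as an assumption and check them against the source. The function names and the `lr_batches` / `lr_epochs` values are illustrative only.

```python
def eden_factor(batch, epoch, lr_batches, lr_epochs):
    """Assumed Eden decay factor: a batch term times an epoch term (warmup omitted)."""
    batch_term = ((batch**2 + lr_batches**2) / lr_batches**2) ** -0.25
    epoch_term = ((epoch**2 + lr_epochs**2) / lr_epochs**2) ** -0.25
    return batch_term * epoch_term


def eden2_factor(batch, lr_batches):
    """Assumed Eden2 decay factor: depends on the batch count only (warmup omitted)."""
    return ((batch**2 + lr_batches**2) / lr_batches**2) ** -0.5


# With an infinite dataloader the epoch stays at 0, so Eden's epoch term stays at 1
# and only the weaker batch term decays the learning rate.
for batch in (0, 1_000, 10_000, 100_000):
    print(batch, eden_factor(batch, 0, 1_000, 4), eden2_factor(batch, 1_000))
```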

novahsubronci commented on July 21, 2024

Thanks for your help
