coqui-ai / trainer
🐸 - A general purpose model trainer, as flexible as it gets
There are two problems. The first occurs when starting VCTK VITS training with DDP, which results in a division by zero error. The second occurred with a custom VITS recipe:
> EVALUATION
| > Synthesizing test sentences.
! Run is removed from /home/wdb/src/gitlab.com/whydobirds/tts-coqui/recipes/frank/vits/vits_frank-June-01-2022_02+35PM-75e2a0fd
Traceback (most recent call last):
File "/home/wdb/miniconda3/envs/tts-coqui/lib/python3.8/site-packages/trainer/trainer.py", line 1492, in fit
self._fit()
File "/home/wdb/miniconda3/envs/tts-coqui/lib/python3.8/site-packages/trainer/trainer.py", line 1480, in _fit
self.test_run()
File "/home/wdb/miniconda3/envs/tts-coqui/lib/python3.8/site-packages/trainer/trainer.py", line 1416, in test_run
self.model.test_log(test_outputs, self.dashboard_logger, self.training_assets, self.total_steps_done)
File "/home/wdb/miniconda3/envs/tts-coqui/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'test_log'
Start training VITS in distributed mode:
python -m trainer.distribute --script recipes/vctk/vits/train_vits.py --gpus "0,1"
This immediately results in a division by zero error.
Starting with a custom script:
python -m trainer.distribute --script recipes/custom/vits/train_vits.py --gpus "0,1"
This results in the stack trace shown above. Presumably the VCTK recipe will run into the same problem once the division by zero is resolved.
The training run should complete successfully.
The wget above 404s.
I think it might just be
Line 1415 in 5217943
Because the optional arguments are hardcoded, as in https://github.com/coqui-ai/Trainer/blob/main/trainer/distribute.py#L37-L40, other arguments such as --start_with_eval or --use_accelerate are not passed on and keep their defaults when running, for example, https://github.com/coqui-ai/TTS/blob/dev/TTS/bin/train_tts.py with a command such as
CUDA_VISIBLE_DEVICES="0,1,2,3" python -m trainer.distribute --start_with_eval true --use_accelerate true --script train_tts.py --config_path <config_path>
Here start_with_eval and use_accelerate will both be false, because only continue_path, restore_path, group_id, and use_ddp (which is fixed to true) are changed.
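One way to avoid a hardcoded list is to forward every option the launcher does not define itself. The following is a minimal sketch, not the actual trainer/distribute.py code: the function name build_worker_command and the form of the --rank and --use_ddp flags are hypothetical, and only --script and --gpus come from the commands quoted above.

```python
import argparse


def build_worker_command(argv, rank):
    """Split launcher options from options meant for the training script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--script", required=True)
    parser.add_argument("--gpus", default=None)
    # parse_known_args returns the options the parser does not define
    # instead of raising an error, so they can be forwarded unchanged.
    args, unknown = parser.parse_known_args(argv)
    # The --rank / --use_ddp spelling below is a hypothetical placeholder.
    command = ["python", args.script, f"--rank={rank}", "--use_ddp=true"]
    return command + unknown


cmd = build_worker_command(
    ["--script", "train_tts.py", "--start_with_eval", "true", "--use_accelerate", "true"],
    rank=0,
)
print(cmd)
```

With this approach the launcher does not need to know about new trainer options; anything it does not recognise reaches the training script as written.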
See the command example written in the description.
Hi,
Following the YourTTS VCTK recipe, I tried to restore a model to continue training.
But I get the following error:
> Restoring from best_model_22016.pth ...
> Restoring Model...
> Restoring Optimizer...
Traceback (most recent call last):
File "/home/caraduf/Models/train_yourtts_16kHz.py", line 318, in <module>
trainer = Trainer(
File "/home/caraduf/CoquiTTS/Trainer/trainer/trainer.py", line 507, in __init__
(self.model, self.optimizer, self.scaler, self.restore_step, self.restore_epoch) = self.restore_model(
File "/home/caraduf/CoquiTTS/Trainer/trainer/trainer.py", line 711, in restore_model
optimizer = _restore_list_objs(checkpoint["optimizer"], optimizer)
File "/home/caraduf/CoquiTTS/Trainer/trainer/trainer.py", line 701, in _restore_list_objs
obj.load_state_dict(states)
AttributeError: 'list' object has no attribute 'load_state_dict'
obj is a list, but the else branch is still executed:
def _restore_list_objs(states, obj):
    if isinstance(obj, list):
        for idx, state in enumerate(states):
            obj[idx].load_state_dict(state)
    if isinstance(obj, dict):
        for key, state in states.items():
            obj[key].load_state_dict(state)
    else:
        obj.load_state_dict(states)
    return obj
A workaround is to replace the second if with an elif, because if obj is a list it cannot also be a dict. So in my opinion it makes sense to use an elif, but I may be wrong.
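The proposed change can be checked in isolation. This sketch copies the function from the report with the second if turned into an elif; FakeOptimizer is a hypothetical stand-in that only records what it was given, so no real optimizer or checkpoint is needed.

```python
def restore_list_objs(states, obj):
    """Load saved state(s) into one object, a list of objects, or a dict of objects."""
    if isinstance(obj, list):
        for idx, state in enumerate(states):
            obj[idx].load_state_dict(state)
    elif isinstance(obj, dict):
        for key, state in states.items():
            obj[key].load_state_dict(state)
    else:
        obj.load_state_dict(states)
    return obj


class FakeOptimizer:
    """Hypothetical stand-in for an optimizer; it only records the loaded state."""

    def __init__(self):
        self.state = None

    def load_state_dict(self, state):
        self.state = state


optimizers = restore_list_objs([{"lr": 0.1}, {"lr": 0.2}], [FakeOptimizer(), FakeOptimizer()])
print([opt.state for opt in optimizers])
```

With the original if/if/else chain, the list case falls through to the else branch and calls load_state_dict on the list itself, which is the AttributeError shown in the traceback.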
Set a restore path to a checkpoint in the recipe and run the recipe.
python3 train_yourtts.py
The model is restored and training continues.
{
"CUDA": {
"GPU": [
"NVIDIA GeForce RTX 3090"
],
"available": true,
"version": "11.7"
},
"Packages": {
"PyTorch_debug": false,
"PyTorch_version": "1.13.1+cu117",
"Trainer": "v0.0.22",
"numpy": "1.22.4"
},
"System": {
"OS": "Linux",
"architecture": [
"64bit",
"ELF"
],
"processor": "x86_64",
"python": "3.10.6",
"version": "#64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023"
}
}
Just a reminder: we need to implement a batch size finder and a learning rate finder in the trainer.
Hi,
I have two questions:
Thank you!
🚀 Feature Description
Hi,
Sometimes while training, I need my GPU for something else (e.g. to do some work with Whisper) or I need to switch off the computer. So I have to interrupt the training, and sometimes this falls right between two checkpoints (e.g. checkpoints are saved every 10k iterations and the run is 7k iterations past the last saved checkpoint).
In this case I lose all the training done since the previous checkpoint.
It would therefore be more convenient if a checkpoint were saved when I interrupt the training, so that I can later restore the training from exactly that point.
Solution
When the training process is interrupted (ctrl-c), make coqui save a checkpoint at the current step (as it does when save_step is reached).
Alternative Solutions
I could lower save_step, but then the checkpoints are too close to each other.
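The requested behaviour can be sketched with a plain loop that catches KeyboardInterrupt. This is only an illustration under stated assumptions, not how the trainer is implemented: train, run_step and save_checkpoint are hypothetical names, and the interrupt is simulated inside the step function.

```python
def train(num_steps, save_step, run_step, save_checkpoint):
    """Run `run_step` for each step; checkpoint on schedule and on Ctrl-C."""
    step = 0
    try:
        for step in range(1, num_steps + 1):
            run_step(step)
            if step % save_step == 0:
                save_checkpoint(step)
    except KeyboardInterrupt:
        # The interrupted step did not finish, so save the last completed one,
        # unless the regular schedule has already saved it.
        last_done = step - 1
        if last_done > 0 and last_done % save_step != 0:
            save_checkpoint(last_done)
        raise


saved = []


def flaky_step(step):
    if step == 7:
        raise KeyboardInterrupt  # simulates pressing Ctrl-C during step 7


try:
    train(20, 5, flaky_step, saved.append)
except KeyboardInterrupt:
    pass
print(saved)
```

Re-raising after the save keeps the usual Ctrl-C semantics, so the process still exits; only the extra checkpoint is added.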
Additional context
On 776eba8 this test tries to call get_data_loader with the unexpected kwarg samples.
_______________________________ test_train_mnist _______________________________
self = Trainer()
def fit(self) -> None:
"""Where the ✨️magic✨️ happens..."""
try:
> self._fit()
trainer/trainer.py:1403:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = Trainer()
def _fit(self) -> None:
"""🏃 train -> evaluate -> test for the number of epochs."""
self._restore_best_loss()
self.total_steps_done = self.restore_step
for epoch in range(0, self.config.epochs):
if self.num_gpus > 1:
# let all processes sync up before starting with a new epoch of training
dist.barrier()
self.callbacks.on_epoch_start(self)
self.keep_avg_train = KeepAverage()
self.keep_avg_eval = KeepAverage() if self.config.run_eval else None
self.epochs_done = epoch
self.c_logger.print_epoch_start(epoch, self.config.epochs, self.output_path)
if not self.skip_train_epoch:
> self.train_epoch()
trainer/trainer.py:1387:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = Trainer()
def train_epoch(self) -> None:
"""Main entry point for the training loop. Run training on the all training samples."""
# initialize the data loader
> self.train_loader = self.get_train_dataloader(
self.training_assets,
self.train_samples,
verbose=True,
)
trainer/trainer.py:1148:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = Trainer(), training_assets = {}, samples = None, verbose = True
def get_train_dataloader(self, training_assets: Dict, samples: List, verbose: bool) -> DataLoader:
"""Initialize and return a training data loader.
Call ```model.get_train_data_loader``` if it is implemented, else call ```model.get_data_loader```
and set ```is_eval=False```.
Args:
ap (AudioProcessor): Audio processor.
samples (List): Data samples used for training.
verbose (bool): enable/disable printing loader stats at initialization.
Returns:
DataLoader: Initialized training data loader.
"""
if self.num_gpus > 1:
if hasattr(self.model.module, "get_train_data_loader"):
loader = self.model.module.get_train_data_loader(
self.config,
self.training_assets,
samples,
verbose,
self.num_gpus,
self.args.rank,
)
return loader
else:
if hasattr(self.model, "get_train_data_loader"):
loader = self.model.get_train_data_loader(
self.config, self.training_assets, samples, verbose, self.num_gpus
)
return loader
> return self._get_loader(
self.model,
self.config,
training_assets,
False,
samples,
verbose,
self.num_gpus,
)
trainer/trainer.py:679:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = Trainer()
model = MnistModel(
(layer_1): Linear(in_features=784, out_features=128, bias=True)
(layer_2): Linear(in_features=128, out_features=256, bias=True)
(layer_3): Linear(in_features=256, out_features=10, bias=True)
)
config = MnistModelConfig(output_path='output', logger_uri=None, run_name='run', project_name=None, run_description='🐸Coqui tra...rue, lr=0.001, optimizer='Adam', optimizer_params={}, lr_scheduler=None, lr_scheduler_params={}, use_grad_scaler=False)
assets = {}, is_eval = False, samples = None, verbose = True, num_gpus = 0
def _get_loader(
self,
model: nn.Module,
config: Coqpit,
assets: Dict,
is_eval: str,
samples: List,
verbose: bool,
num_gpus: int,
) -> DataLoader:
if num_gpus > 1:
if hasattr(model.module, "get_data_loader"):
loader = model.module.get_data_loader(
config,
assets,
is_eval,
samples,
verbose,
num_gpus,
self.args.rank,
)
else:
if hasattr(model, "get_data_loader"):
> loader = model.get_data_loader(
config=config, assets=assets, is_eval=is_eval, samples=samples, verbose=verbose, num_gpus=num_gpus
)
E TypeError: get_data_loader() got an unexpected keyword argument 'samples'
Seems like you wanted to say < 3.10 but wrote <= 3.10; at least the latter would allow 3.10, but not 3.10.2.
Lines 33 to 39 in 45a5604
Building coqui-trainer with python 3.10.2
Clarification whether < or <= is the intended constraint.
> RuntimeError: Coqui-Trainer requires python >= 3.6 and <=3.10 but your Python version is 3.10.2 (main, Jan 13 2022, 19:06:22) [GCC 10.3.0]
- Version 0.0.5
- Python 3.10.2
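The error message is consistent with a check that compares the full version tuple against (3, 10); that is an assumption here, since the check itself is not quoted. A minimal sketch of the difference, with is_supported as a hypothetical helper that compares only major and minor:

```python
def is_supported(version_info, minimum=(3, 6), maximum=(3, 10)):
    """Compare only (major, minor) so patch releases such as 3.10.2 pass."""
    major_minor = tuple(version_info[:2])
    return minimum <= major_minor <= maximum


full = (3, 10, 2)
print(full <= (3, 10))     # full tuple comparison: a longer tuple sorts after its prefix
print(is_supported(full))  # major/minor comparison
```

Passing sys.version_info works as well, because slicing it yields a plain tuple.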
Downloading a checkpoint that was saved by ClearML from S3 and then trying to load it raises the following error.
/miniforge3/lib/python3.9/site-packages/torch/serialization.py in __init__(self, name_or_buffer)
240 class _open_zipfile_reader(_opener):
241 def __init__(self, name_or_buffer) -> None:
--> 242 super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
243
244
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
Try running
import os
from dataclasses import dataclass, field

import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST

from trainer import Trainer, TrainerArgs, TrainerConfig, TrainerModel


@dataclass
class MnistModelConfig(TrainerConfig):
    optimizer: str = "Adam"
    lr: float = 0.001
    epochs: int = 5
    print_step: int = 1
    plot_step: int = 1
    save_step: int = 1
    dashboard_logger: str = "clearml"
    project_name: str = "pytorch-mnist"
    run_name: str = "test-run"


class MnistModel(TrainerModel):
    def __init__(self):
        super().__init__()
        # mnist images are (1, 28, 28) (channels, height, width)
        self.layer_1 = nn.Linear(28 * 28, 128)
        self.layer_2 = nn.Linear(128, 256)
        self.layer_3 = nn.Linear(256, 10)

    def forward(self, x):
        batch_size, _, _, _ = x.size()
        # (b, 1, 28, 28) -> (b, 1*28*28)
        x = x.view(batch_size, -1)
        x = self.layer_1(x)
        x = F.relu(x)
        x = self.layer_2(x)
        x = F.relu(x)
        x = self.layer_3(x)
        x = F.log_softmax(x, dim=1)
        return x

    def train_step(self, batch, criterion):
        x, y = batch
        logits = self(x)
        loss = criterion(logits, y)
        return {"model_outputs": logits}, {"loss": loss}

    def eval_step(self, batch, criterion):
        x, y = batch
        logits = self(x)
        loss = criterion(logits, y)
        return {"model_outputs": logits}, {"loss": loss}

    def get_criterion(self):
        return torch.nn.NLLLoss()

    def get_data_loader(self, config, assets, is_eval, samples, verbose, num_gpus, rank=0):
        transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
        dataset = MNIST(os.getcwd(), train=not is_eval, download=True, transform=transform)
        mnist_train = DataLoader(dataset, batch_size=8)
        return mnist_train


def test_train_mnist():
    model = MnistModel()
    trainer = Trainer(TrainerArgs(), MnistModelConfig(), model=model, output_path=os.getcwd())
    trainer.fit()


if __name__ == "__main__":
    test_train_mnist()
{
"CUDA": {
"GPU": [
"NVIDIA A100-SXM4-80GB",
"NVIDIA A100-SXM4-80GB"
],
"available": true,
"version": "11.3"
},
"Packages": {
"PyTorch_debug": false,
"PyTorch_version": "1.11.0",
"Trainer": "v0.0.4",
"numpy": "1.21.2"
},
"System": {
"OS": "Linux",
"architecture": [
"64bit",
"ELF"
],
"processor": "x86_64",
"python": "3.9.5",
"version": "#106-Ubuntu SMP Thu Jan 6 23:58:14 UTC 2022"
}
}
🚀 Feature Description
Save last N checkpoints on ClearML logger.
Solution
Implement saving only the last N checkpoints on a dashboard logger of choice.
Additional context
Keeping all the checkpoints is unnecessary and induces extra costs on cloud storage.
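For the local side of this request, pruning to the last N checkpoints can be sketched as below. The file name pattern checkpoint_<step>.pth and the helper name keep_last_n_checkpoints are assumptions made for illustration; removing the matching artifacts from ClearML or another dashboard logger is not shown.

```python
import tempfile
from pathlib import Path


def keep_last_n_checkpoints(folder, n, pattern="checkpoint_*.pth"):
    """Delete all but the `n` checkpoints with the highest step number."""

    def step_of(path):
        # "checkpoint_12000.pth" -> 12000
        return int(path.stem.split("_")[-1])

    # Sort numerically; a plain string sort would put 10000 before 2000.
    checkpoints = sorted(Path(folder).glob(pattern), key=step_of)
    stale = checkpoints[:-n] if n > 0 else checkpoints
    for path in stale:
        path.unlink()
    return [p.name for p in checkpoints if p not in stale]


with tempfile.TemporaryDirectory() as tmp:
    for step in (1000, 2000, 10000, 3000):
        (Path(tmp) / f"checkpoint_{step}.pth").touch()
    kept = keep_last_n_checkpoints(tmp, n=2)
print(kept)
```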
👟 must save not only the checkpoints but also the other model artifacts like config.json, speaker.json, etc.
Gradient accumulation breaks the vocoder training in 🐸TTS.
Try any vocoder recipe or end-to-end recipe with gradient accumulation.
No need for system info
👟 needs to raise an error when there is no eval sample or skip the eval step.
Raise an error " [!] No eval samples provided"
Or skip the eval step
https://github.com/coqui-ai/TTS/issues/1447
I've been trying to train YourTTS on a Google Compute instance, but it doesn't seem to work using trainer.distribute.
Previously I could run it, but it would get up to the same point in initialization and then one of the training workers would crash, with the others freezing.
I am running largely unchanged code from the provided recipe; I have simply reduced the worker count to work on the cloud instance and added my own dataset.
It previously trained fine without distributed training until it ran out of VRAM, and training locally on a 3090 works fine, if slowly.
Also, TTS is installed at the latest version; I'm not sure why collect_env_info.py didn't catch it.
CUDA_VISIBLE_DEVICES="0,1,2,3" python -m trainer.distribute --script train_yourtts.py
on a Google Compute instance.
Runs the training script with processing split between the GPUs.
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1666, in fit
self._fit()
File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1618, in _fit
self.train_epoch()
File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1350, in train_epoch
for cur_step, batch in enumerate(self.train_loader):
File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
data = self._next_data()
File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
return self._process_data(data)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
data.reraise()
File "/opt/conda/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise
raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/opt/conda/lib/python3.10/site-packages/TTS/tts/models/vits.py", line 263, in __getitem__
item = self.samples[idx]
TypeError: list indices must be integers or slices, not list
{
"CUDA": {
"GPU": [
"Tesla T4",
"Tesla T4",
"Tesla T4",
"Tesla T4"
],
"available": true,
"version": "11.7"
},
"Packages": {
"PyTorch_debug": false,
"PyTorch_version": "2.0.1+cu117",
"Trainer": "v0.0.27",
"numpy": "1.23.5"
},
"System": {
"OS": "Linux",
"architecture": [
"64bit",
""
],
"processor": "",
"python": "3.10.10",
"version": "#1 SMP Debian 5.10.179-1 (2023-05-12)"
}
}
The num_gpus test hangs.
Check the CI action logs.
CI Action instances
The dependencies in requirements.txt do not include:
There is no requirements file for tests, but they require
Release versions are currently only available on PyPI, while the exact Git commit they were built from has to be guessed and can't be cleanly referenced for automatic updates.
Trainer saves the config that is given at the beginning of the training rather than the one restored when continuing a training run.
Continue one of your training runs using a Python script and pass a modified config.json.
The trainer will use the restored config but save the modified config.
Save the config exactly as used.
{
"CUDA": {
"GPU": [
""
],
"available": true,
"version": "11.3"
},
"Packages": {
"PyTorch_debug": false,
"PyTorch_version": "1.11.0",
"Trainer": "v0.0.12",
"numpy": "1.21.6"
},
"System": {
"OS": "Linux",
"architecture": [
"64bit",
"ELF"
],
"processor": "x86_64",
"python": "3.9.5",
"version": "#119-Ubuntu SMP Mon Mar 7 18:49:24 UTC 2022"
}
}
possibly_batched_index is not a list.
python -m trainer.distribute --gpus "0,1" --script train_multi.py --restore_path /root/.local/share/tts/tts_models--en--ljspeech--vits/model_file.pth
> TRAINING (2022-08-26 01:41:21)
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1533, in fit
self._fit()
File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1517, in _fit
self.train_epoch()
File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1281, in train_epoch
for cur_step, batch in enumerate(self.train_loader):
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
data = self._next_data()
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
return self._process_data(data)
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
data.reraise()
File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 461, in reraise
raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
TypeError: 'int' object is not iterable
! Run is kept in logs/multi_be/vits_ljs_speaker_embedded-August-26-2022_01+41AM-0000000
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1533, in fit
self._fit()
File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1517, in _fit
self.train_epoch()
File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1281, in train_epoch
for cur_step, batch in enumerate(self.train_loader):
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
data = self._next_data()
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
return self._process_data(data)
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
data.reraise()
File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 461, in reraise
raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
TypeError: 'int' object is not iterable
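Both DataLoader tracebacks on this page fail on the same list comprehension, and the two messages can be reproduced without PyTorch by imitating that one line. This is only an illustration of what the index must look like; that a sampler yielding the wrong kind of index (single integers where batches are expected, or batches where single integers are expected) is the cause in these runs is an assumption.

```python
def fetch(dataset, possibly_batched_index):
    # Same expression as the line quoted in the tracebacks.
    return [dataset[idx] for idx in possibly_batched_index]


samples = ["a", "b", "c", "d"]

print(fetch(samples, [0, 2]))  # a batch of integer indices works

errors = []
for bad_index in (1, [[0, 1], [2, 3]]):
    try:
        fetch(samples, bad_index)
    except TypeError as err:
        errors.append(str(err))
print(errors)
```

A bare integer reproduces the "'int' object is not iterable" message, and a batch whose elements are themselves lists reproduces "list indices must be integers or slices, not list".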
docker pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime
Trying to import this package on Windows gives the following error:
File "<stdin>", line 1, in <module>
File "...\lib\site-packages\trainer\__init__.py", line 4, in <module>
from trainer.trainer import *
File "...\lib\site-packages\trainer\trainer.py", line 47, in <module>
multiprocessing.set_start_method("fork")
File "...\lib\multiprocessing\context.py", line 246, in set_start_method
self._actual_context = self.get_context(method)
File "...\lib\multiprocessing\context.py", line 238, in get_context
return super().get_context(method)
File "...\lib\multiprocessing\context.py", line 192, in get_context
raise ValueError('cannot find context for %r' % method) from None
ValueError: cannot find context for 'fork'
Since there is no fork method on Windows, please either fix this issue or do not mark this package as platform-independent.
Install this package on a Windows system.
Start Python.
Try to import trainer.
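A possible guard is to ask the platform which start methods it supports before selecting one. The sketch below uses only the standard library; pick_start_method is a hypothetical helper, and falling back to "spawn" is one choice among several, not necessarily what the package should do.

```python
import multiprocessing


def pick_start_method(preferred="fork", fallback="spawn"):
    """Return `preferred` if this platform supports it, else `fallback`."""
    available = multiprocessing.get_all_start_methods()
    return preferred if preferred in available else fallback


if __name__ == "__main__":
    method = pick_start_method()
    # force=True avoids a RuntimeError if a start method was already set.
    multiprocessing.set_start_method(method, force=True)
    print(method)
```

Code that relies on fork semantics (for example, sharing state with workers implicitly) may still need changes when it runs under spawn.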
- trainer v0.0.5 (installed via conda-forge)
- OS: Windows 10
Supposedly, setting a value for the --grad_accum_steps option should decrease the number of steps needed to complete an epoch by simulating a larger batch size. However, setting it to any value has no effect at all on training speed or on the total step counter.
batch_size=8
using the recipe train_vits_tts_phonemes.py --grad_accum_steps=16
or add the value directly to Trainer()
Set --grad_accum_steps to another value
Assuming batch_size=8 and grad_accum_steps=8, an epoch that has 6400 steps to complete should reduce the number of steps to 100 while having an effective batch size of 64 (8*8) during training.
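The expected numbers can be worked through with a small calculation. It assumes that 6400 refers to the number of training samples, since that is the reading under which 100 follows: 6400 / 8 = 800 loader iterations, and 800 / 8 = 100 optimizer updates. The function names are hypothetical, and how the trainer counts and displays steps is not shown here.

```python
import math


def steps_per_epoch(num_samples, batch_size, grad_accum_steps=1):
    """Return (loader iterations, optimizer updates) for one epoch."""
    iterations = math.ceil(num_samples / batch_size)
    updates = math.ceil(iterations / grad_accum_steps)
    return iterations, updates


def count_updates(num_iterations, grad_accum_steps):
    """Mimic a loop that steps the optimizer every `grad_accum_steps` batches."""
    updates = 0
    for i in range(1, num_iterations + 1):
        # loss.backward() would accumulate gradients here on every iteration
        if i % grad_accum_steps == 0 or i == num_iterations:
            updates += 1  # optimizer.step(); optimizer.zero_grad()
    return updates


print(steps_per_epoch(6400, 8, 8))
print(count_updates(800, 8))
```

Gradient accumulation still runs a forward and backward pass for every batch, so the time per epoch is not expected to drop much; what changes is the number of optimizer updates and the effective batch size.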
{
"CUDA": {
"GPU": [
"NVIDIA GeForce RTX 2060"
],
"available": true,
"version": "11.7"
},
"Packages": {
"PyTorch_debug": false,
"PyTorch_version": "2.0.1+cu117",
"Trainer": "v0.0.29",
"numpy": "1.22.0"
},
"System": {
"OS": "Linux",
"architecture": [
"64bit",
"ELF"
],
"processor": "",
"python": "3.10.12",
"version": "#1 ZEN SMP PREEMPT_DYNAMIC Wed, 02 Aug 2023 10:40:11 +0000"
}
}
I tried using the WandB logger for training in the TTS repo, but it didn't work.
CUDA_VISIBLE_DEVICES=0 python recipes/ljspeech/vits_tts/train_vits.py \
--coqpit.dashboard_logger wandb \
--coqpit.project_name FakeName \
--coqpit.wandb_entity FakeEntity
It crashes with this error:
Traceback (most recent call last):
File "runs/train_vits.py", line 85, in <module>
eval_samples=eval_samples,
File "/home/fijipants/miniconda3/envs/coqui-0.6.1/lib/python3.7/site-packages/trainer/trainer.py", line 359, in __init__
self.dashboard_logger = logger_factory(config, output_path)
File "/home/fijipants/miniconda3/envs/coqui-0.6.1/lib/python3.7/site-packages/trainer/logging/__init__.py", line 36, in logger_factory
entity=config.wandb_entity,
TypeError: Can't instantiate abstract class WandbLogger with abstract methods add_audio, add_figure, add_scalar
It should work just like the default Tensorboard logger
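The TypeError is Python's standard response to instantiating a class that leaves abstract methods unimplemented. The sketch below reproduces it with hypothetical classes: only the three method names come from the error message, while the class names and method signatures are assumptions and not the trainer's real logger interface.

```python
from abc import ABC, abstractmethod


class BaseLogger(ABC):
    @abstractmethod
    def add_scalar(self, title, value, step):
        ...

    @abstractmethod
    def add_figure(self, title, figure, step):
        ...

    @abstractmethod
    def add_audio(self, title, audio, step, sample_rate):
        ...


class IncompleteLogger(BaseLogger):
    """Overrides none of the abstract methods, so it cannot be instantiated."""


class CompleteLogger(BaseLogger):
    def __init__(self):
        self.records = []

    def add_scalar(self, title, value, step):
        self.records.append(("scalar", title, value, step))

    def add_figure(self, title, figure, step):
        self.records.append(("figure", title, step))

    def add_audio(self, title, audio, step, sample_rate):
        self.records.append(("audio", title, step, sample_rate))


try:
    IncompleteLogger()
except TypeError as err:
    print(err)

logger = CompleteLogger()
logger.add_scalar("loss", 0.5, 1)
print(logger.records)
```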
{
"CUDA": {
"GPU": [
"NVIDIA GeForce RTX 3090",
"NVIDIA GeForce RTX 3090"
],
"available": true,
"version": "11.3"
},
"Packages": {
"PyTorch_debug": false,
"PyTorch_version": "1.10.2",
"TTS": "0.6.1",
"numpy": "1.19.5"
},
"System": {
"OS": "Linux",
"architecture": [
"64bit",
""
],
"processor": "x86_64",
"python": "3.7.11",
"version": "#202202230823 SMP PREEMPT Wed Feb 23 14:53:24 UTC 2022"
}
}
Hi everybody,
Trainer is great, but still uses APEX, which is deprecated and tends to cause problems.
Could you remove and/or replace with Nvidia/amp?
Best regards
Just run Trainer on an Nvidia GPU
not relevant
Currently, we save checkpoints without audio samples for each one. In this way, we cannot evaluate the models correctly. My suggestion is to run test_run() every time that we save a checkpoint.
This is because in multi-GPU mode the functions of a model live under the module namespace, so any model function should be called as model.module.model_func(), and the same applies to the callbacks.
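A small helper can hide the difference between the wrapped and the unwrapped case. This is a sketch with hypothetical names: FakeDDP only imitates a wrapper that stores the real model in a module attribute, and unwrap_model is not a function of the trainer.

```python
def unwrap_model(model):
    """Return the wrapped model if `model` has a `module` attribute, else `model` itself."""
    return getattr(model, "module", model)


class MyModel:
    def test_log(self):
        return "logged"


class FakeDDP:
    """Hypothetical stand-in for a wrapper that keeps the real model in `.module`."""

    def __init__(self, module):
        self.module = module


wrapped = FakeDDP(MyModel())
print(hasattr(wrapped, "test_log"))
print(unwrap_model(wrapped).test_log())
print(unwrap_model(MyModel()).test_log())
```

The attribute check is a simplification: a model that defines its own attribute named module would be unwrapped by mistake, so an explicit isinstance check against the wrapper class is the stricter alternative.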
logging module
🚀 Feature Description
I'm fine-tuning a Coqui TTS model on a different dataset and language. Since it's never reaching the best loss of the pre-trained model it never saves the best model (of the fine-tuning stage). I am only able to see the final checkpoints saved but the best model gets lost in history since I keep only the final X checkpoints.
Solution
If I could reset the best loss saved in the model, I believe it would be possible to save the best model starting from the fine-tuning training.
Error when starting model training from checkpoint in Coqui TTS
When a checkpoint is saved for later training, the last training and eval losses are stored as a dict. When training from scratch, the last training loss is stored as a float. Hence, training that starts from a checkpoint fails when the two are compared.
https://colab.research.google.com/drive/1OwemROn306_JIYASjx39d52eXFHS1O_u
The training should stop
Traceback (most recent call last):
File "/mnt/Work/anaconda3/envs/tts-env/lib/python3.10/site-packages/trainer/trainer.py", line 1808, in fit
self._fit()
File "/mnt/Work/anaconda3/envs/tts-env/lib/python3.10/site-packages/trainer/trainer.py", line 1771, in _fit
self.save_best_model()
File "/mnt/Work/anaconda3/envs/tts-env/lib/python3.10/site-packages/trainer/utils/distributed.py", line 35, in wrapped_fn
return fn(*args, **kwargs)
File "/mnt/Work/anaconda3/envs/tts-env/lib/python3.10/site-packages/trainer/trainer.py", line 1893, in save_best_model
self.best_loss = save_best_model(
File "/mnt/Work/anaconda3/envs/tts-env/lib/python3.10/site-packages/trainer/io.py", line 183, in save_best_model
if current_loss < best_loss:
TypeError: '<' not supported between instances of 'float' and 'dict'
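One defensive option is to reduce the stored value to a float before comparing. The sketch below is an illustration only: the dict keys train_loss and eval_loss and both function names are assumptions, not taken from trainer/io.py.

```python
def loss_for_comparison(best_loss, key="eval_loss"):
    """Return a float whether `best_loss` was stored as a float or as a dict."""
    if isinstance(best_loss, dict):
        value = best_loss.get(key)
        # Fall back to infinity so that any real loss counts as an improvement.
        return float("inf") if value is None else float(value)
    return float(best_loss)


def is_improvement(current_loss, best_loss, key="eval_loss"):
    return current_loss < loss_for_comparison(best_loss, key)


print(is_improvement(0.4, 0.5))
print(is_improvement(0.4, {"train_loss": 0.3, "eval_loss": 0.5}))
print(is_improvement(0.4, {"train_loss": 0.3, "eval_loss": None}))
```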
-torch: 2.1.0
-trainer: 0.0.31
-python: 3.10
-OS: Endeavor OS
-cuda: cuda_12.2.r12.2
-GPU: NVIDIA RTX 3060
-pytorch installation: pip
Hi,
When running YourTTS recipe with my own LJSpeech dataset, during the first evaluation I get the following error :
> DataLoader initialization
| > Tokenizer:
| > add_blank: True
| > use_eos_bos: False
| > use_phonemes: False
| > Number of instances : 20
| > Preprocessing samples
| > Max text length: 119
| > Min text length: 21
| > Avg text length: 45.2
|
| > Max audio length: 119952.0
| > Min audio length: 18103.0
| > Avg audio length: 48773.45
| > Num. instances discarded samples: 0
| > Batch group size: 0.
> Using weighted sampler for attribute 'speaker_name' with alpha '1.0'
None
> Attribute weights for '['ljspeech']'
| > [0.22360679774997894]
> EVALUATION
! Run is kept in /home/caraduf/Models/YourTTS_ME_22k-February-07-2023_06+31AM-0000000
Traceback (most recent call last):
File "/home/caraduf/CoquiTTS/Trainer/trainer/trainer.py", line 1659, in fit
self._fit()
File "/home/caraduf/CoquiTTS/Trainer/trainer/trainer.py", line 1614, in _fit
self.eval_epoch()
File "/home/caraduf/CoquiTTS/Trainer/trainer/trainer.py", line 1501, in eval_epoch
outputs,
UnboundLocalError: local variable 'outputs' referenced before assignment
I updated the trainer to the latest version following the instructions for github but the issue still occurs.
Also note that training VITS model against the same dataset (and also the same max value [10 seconds or 10 x 22050]) is working. So it stops only when running the YourTTS recipe. I will try with debug mode ON and see if it shows interesting things.
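An UnboundLocalError of this kind typically appears when a variable is assigned only inside a loop that never runs, for example when the evaluation loader yields no batches. Whether that is what happens here is an assumption; the log reports 20 eval instances, which is fewer than the batch size set in the recipe. A minimal sketch of the pattern and a guard, with hypothetical names:

```python
def eval_epoch(loader, eval_step):
    """Run `eval_step` over `loader`; return the last outputs, or None if it is empty."""
    outputs = None  # defined up front so an empty loader cannot leave it unbound
    for batch in loader:
        outputs = eval_step(batch)
    if outputs is None:
        print(" [!] No eval batches were produced, skipping evaluation logging.")
    return outputs


print(eval_epoch([], lambda batch: batch * 2))
print(eval_epoch([1, 2, 3], lambda batch: batch * 2))
```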
Here is the adapted recipe :
import os
import torch
from trainer import Trainer, TrainerArgs
from TTS.bin.compute_embeddings import compute_embeddings
from TTS.bin.resample import resample_files
from TTS.config.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import CharactersConfig, Vits, VitsArgs, VitsAudioConfig
from TTS.utils.downloaders import download_vctk
torch.set_num_threads(24)
# pylint: disable=W0105
"""
This recipe replicates the first experiment proposed in the YourTTS paper (https://arxiv.org/abs/2112.02418).
YourTTS model is based on the VITS model however it uses external speaker embeddings extracted from a pre-trained speaker encoder and has small architecture changes.
In addition, YourTTS can be trained in multilingual data, however, this recipe replicates the single language training using the VCTK dataset.
If you are interested in multilingual training, we have commented on parameters on the VitsArgs class instance that should be enabled for multilingual training.
In addition, you will need to add the extra datasets following the VCTK as an example.
"""
# Path where you want to save the models outputs (configs, checkpoints and tensorboard logs)
OUT_PATH = os.path.dirname(os.path.abspath(__file__)) # "/raid/coqui/Checkpoints/original-YourTTS/"
# Name of the run for the Trainer
RUN_NAME = "YourTTS_ME_22k"
Me_Rec_1_config = BaseDatasetConfig(
formatter="ljspeech", dataset_name="ME_Rec1", meta_file_train="metadata.csv", path="/home/caraduf/Datasets/ME_22kHz/Rec_1_LARGE_V2_22.05kHz_dataset", language="fr-fr"
)
Me_Rec_2_config = BaseDatasetConfig(
formatter="ljspeech", dataset_name="ME_Rec2", meta_file_train="metadata.csv", path="/home/caraduf/Datasets/ME_22kHz/Rec_2_LARGE_V2_22.05kHz_dataset", language="fr-fr"
)
# Add all dataset configs here; in our case we just want to train with the VCTK dataset, so we only need to add VCTK. Note: If you want to add new datasets, just add them here and the speaker embeddings (d-vectors) for the new dataset will be computed automatically :)
DATASETS_CONFIG_LIST = [
Me_Rec_1_config,
Me_Rec_2_config
]
# If you want to do transfer learning and speedup your training you can set here the path to the original YourTTS model
RESTORE_PATH = None # "/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth"
# This parameter is useful for debugging: it skips the training epochs and just runs the evaluation and produces the test sentences
SKIP_TRAIN_EPOCH = False
# Set here the batch size to be used in training and evaluation
BATCH_SIZE = 32
# Training Sampling rate and the target sampling rate for resampling the downloaded dataset (Note: If you change this you might need to redownload the dataset !!)
# Note: If you add new datasets, please make sure that the dataset sampling rate and this parameter are matching, otherwise resample your audios
SAMPLE_RATE = 22050
# Max audio length in seconds to be used in training (every audio bigger than it will be ignored)
MAX_AUDIO_LEN_IN_SECONDS = 10
# # Define the number of threads used during the audio resampling
# NUM_RESAMPLE_THREADS = 10
# # Check if VCTK dataset is not already downloaded, if not download it
# if not os.path.exists(VCTK_DOWNLOAD_PATH):
# print(">>> Downloading VCTK dataset:")
# download_vctk(VCTK_DOWNLOAD_PATH)
# resample_files(VCTK_DOWNLOAD_PATH, SAMPLE_RATE, file_ext="flac", n_jobs=NUM_RESAMPLE_THREADS)
# # init configs
# vctk_config = BaseDatasetConfig(
# formatter="vctk",
# dataset_name="vctk",
# meta_file_train="",
# meta_file_val="",
# path=VCTK_DOWNLOAD_PATH,
# language="en",
# ignored_speakers=[
# "p261",
# "p225",
# "p294",
# "p347",
# "p238",
# "p234",
# "p248",
# "p335",
# "p245",
# "p326",
# "p302",
# ], # Ignore the test speakers to full replicate the paper experiment
# )
### Extract speaker embeddings
SPEAKER_ENCODER_CHECKPOINT_PATH = (
"https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar"
)
SPEAKER_ENCODER_CONFIG_PATH = "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json"
D_VECTOR_FILES = [] # List of speaker embeddings/d-vectors to be used during the training
# Iterate over all the dataset configs and check whether the speaker embeddings are already computed; if not, compute them
for dataset_conf in DATASETS_CONFIG_LIST:
    # Check if the embeddings weren't already computed, if not compute it
    embeddings_file = os.path.join(dataset_conf.path, "speakers.pth")
    if not os.path.isfile(embeddings_file):
        print(f">>> Computing the speaker embeddings for the {dataset_conf.dataset_name} dataset")
        compute_embeddings(
            SPEAKER_ENCODER_CHECKPOINT_PATH,
            SPEAKER_ENCODER_CONFIG_PATH,
            embeddings_file,
            old_spakers_file=None,
            config_dataset_path=None,
            formatter_name=dataset_conf.formatter,
            dataset_name=dataset_conf.dataset_name,
            dataset_path=dataset_conf.path,
            meta_file_train=dataset_conf.meta_file_train,
            meta_file_val=dataset_conf.meta_file_val,
            disable_cuda=False,
            no_eval=False,
        )
    D_VECTOR_FILES.append(embeddings_file)
# Audio config used in training.
audio_config = VitsAudioConfig(
sample_rate=SAMPLE_RATE,
hop_length=256,
win_length=1024,
fft_size=1024,
mel_fmin=0.0,
mel_fmax=None,
num_mels=80,
)
# Init VitsArgs, setting the arguments that are needed for the YourTTS model
model_args = VitsArgs(
d_vector_file=D_VECTOR_FILES,
use_d_vector_file=True,
d_vector_dim=512,
num_layers_text_encoder=10,
speaker_encoder_model_path=SPEAKER_ENCODER_CHECKPOINT_PATH,
speaker_encoder_config_path=SPEAKER_ENCODER_CONFIG_PATH,
resblock_type_decoder="1", # In the paper, we accidentally trained YourTTS using ResNet blocks type 2; if you like, you can use ResNet blocks type 1 as in the VITS model
# Useful parameter to enable the Speaker Consistency Loss (SCL) described in the paper
# use_speaker_encoder_as_loss=True,
# Useful parameters to enable multilingual training
# use_language_embedding=True,
# embedded_language_dim=4,
)
# General training config; here you can change the batch size and other useful parameters
config = VitsConfig(
output_path=OUT_PATH,
model_args=model_args,
run_name=RUN_NAME,
project_name="YourTTS",
run_description="""
- Original YourTTS trained using shorter extracts made by new method
""",
dashboard_logger="tensorboard",
logger_uri=None,
audio=audio_config,
batch_size=BATCH_SIZE,
batch_group_size=48,
eval_batch_size=BATCH_SIZE,
num_loader_workers=4,
eval_split_max_size=256,
print_step=50,
plot_step=100,
log_model_step=1000,
save_step=5000,
save_n_checkpoints=10,
save_checkpoints=True,
target_loss="loss_1",
print_eval=False,
use_phonemes=False,
phonemizer="espeak",
phoneme_language="fr-fr",
compute_input_seq_cache=True,
add_blank=True,
text_cleaner="multilingual_cleaners",
characters=CharactersConfig(
characters_class="TTS.tts.models.vits.VitsCharacters",
pad="_",
eos="&",
bos="*",
blank=None,
#characters="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\u00af\u00b7\u00df\u00e0\u00e1\u00e2\u00e3\u00e4\u00e6\u00e7\u00e8\u00e9\u00ea\u00eb\u00ec\u00ed\u00ee\u00ef\u00f1\u00f2\u00f3\u00f4\u00f5\u00f6\u00f9\u00fa\u00fb\u00fc\u00ff\u0101\u0105\u0107\u0113\u0119\u011b\u012b\u0131\u0142\u0144\u014d\u0151\u0153\u015b\u016b\u0171\u017a\u017c\u01ce\u01d0\u01d2\u01d4\u0430\u0431\u0432\u0433\u0434\u0435\u0436\u0437\u0438\u0439\u043a\u043b\u043c\u043d\u043e\u043f\u0440\u0441\u0442\u0443\u0444\u0445\u0446\u0447\u0448\u0449\u044a\u044b\u044c\u044d\u044e\u044f\u0451\u0454\u0456\u0457\u0491\u2013!'(),-.:;? ",
characters="!',-.:?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz «»ÀÇÉÊàâçèéêëîïôùûœ–’…",
punctuations="!'(),-.:;? ",
phonemes="",
is_unique=True,
is_sorted=True,
),
phoneme_cache_path=None,
precompute_num_workers=12,
start_by_longest=True,
datasets=DATASETS_CONFIG_LIST,
cudnn_benchmark=False,
max_audio_len=SAMPLE_RATE * MAX_AUDIO_LEN_IN_SECONDS,
mixed_precision=True,
test_sentences=[
[
"Il m'a fallu du temps pour créer cette voix alors ma bouche ne restera pas fermée.",
# "ME",
None,
"fr_FR",
],
[
"Il m'a fallu beaucoup de temps pour développer une voix, et maintenant que je l'ai, je ne vais pas me taire.",
# "ME",
None,
"fr_FR",
],
[
"Mais son âge rendait cette dernière qualité plus saillante.",
# "ME",
None,
"fr_FR",
],
# [
# "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
# "VCTK_p277",
# None,
# "en",
# ],
# [
# "Be a voice, not an echo.",
# "VCTK_p239",
# None,
# "en",
# ],
# [
# "I'm sorry Dave. I'm afraid I can't do that.",
# "VCTK_p258",
# None,
# "en",
# ],
# [
# "This cake is great. It's so delicious and moist.",
# "VCTK_p244",
# None,
# "en",
# ],
# [
# "Prior to November 22, 1963.",
# "VCTK_p305",
# None,
# "en",
# ],
],
# Enable the weighted sampler
use_weighted_sampler=True,
# Ensures that all speakers are seen in the training batch equally no matter how many samples each speaker has
weighted_sampler_attrs={"speaker_name": 1.0},
weighted_sampler_multipliers={},
# Sets the Speaker Consistency Loss (SCL) α to 9, as in the paper
speaker_encoder_loss_alpha=9.0,
)
# Load all the dataset samples and split them into training and evaluation sets
train_samples, eval_samples = load_tts_samples(
config.datasets,
eval_split=True,
eval_split_max_size=config.eval_split_max_size,
eval_split_size=config.eval_split_size,
)
# Init the model
model = Vits.init_from_config(config)
# Init the trainer and 🚀
trainer = Trainer(
TrainerArgs(restore_path=RESTORE_PATH, skip_train_epoch=SKIP_TRAIN_EPOCH),
config,
output_path=OUT_PATH,
model=model,
train_samples=train_samples,
eval_samples=eval_samples,
)
trainer.fit()
Create a French dataset in LJSpeech format (22.05 kHz audio).
Adapt the dataset config and the sample rate in the provided recipe.
Launch it.
python3 YourTTS_recipe.py
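The recipe comments ask that the dataset sampling rate match `SAMPLE_RATE`. A minimal sketch of one way to check this before launching, using only the standard library, is shown below. The function name `find_sample_rate_mismatches` and the directory layout are hypothetical and not part of the recipe, and the `wave` module only reads uncompressed PCM WAV files, so other formats would need a different reader.

```python
import wave
from pathlib import Path


def find_sample_rate_mismatches(dataset_dir, expected_rate=22050):
    """Return (path, rate) pairs for WAV files whose rate differs from expected_rate.

    Hypothetical helper for illustration; it is not part of the TTS or Trainer packages.
    """
    mismatches = []
    for wav_path in sorted(Path(dataset_dir).rglob("*.wav")):
        # Only the header is needed, so no audio frames are read.
        with wave.open(str(wav_path), "rb") as wav_file:
            rate = wav_file.getframerate()
        if rate != expected_rate:
            mismatches.append((str(wav_path), rate))
    return mismatches
```

Calling `find_sample_rate_mismatches("path/to/dataset", 22050)` on the dataset folder returns an empty list when every file already matches the recipe's `SAMPLE_RATE`; any listed file would need resampling first.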
Training should proceed normally.
No response
{
"CUDA": {
"GPU": [
"NVIDIA GeForce RTX 3090"
],
"available": true,
"version": "11.7"
},
"Packages": {
"PyTorch_debug": false,
"PyTorch_version": "1.13.1+cu117",
"Trainer": "v0.0.22",
"numpy": "1.22.4"
},
"System": {
"OS": "Linux",
"architecture": [
"64bit",
"ELF"
],
"processor": "x86_64",
"python": "3.10.6",
"version": "#64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023"
}
}
No response
🚀 Feature Description
Uploading the training Python script will improve reproducibility.
For single-GPU training, I am using train_yourtts.py. When I switch to multi-GPU, the program runs but shows no speedup. I checked the code in distribute.py and found that it only sets up the environment and starts the parallel processes; it does not perform any gather or synchronization operations. Is this by design, or am I misusing trainer.distribute?
CUDA_VISIBLE_DEVICES=0,1 python -m trainer.distribute --script recipes/vctk/yourtts/train_yourtts.py
I expected roughly a 2x speedup, but training progresses at the same speed as on a single GPU.
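One thing worth checking here is the number of steps per epoch rather than the time per step. The sketch below assumes the data loader shards the samples across ranks in the usual distributed-sampler fashion, which is an assumption and has not been verified against the Trainer source in this report. Under that assumption each process handles about 1/N of the data, so the per-step time can look unchanged while each epoch needs fewer steps. The sample count of 40000 is a made-up figure for illustration.

```python
import math


def steps_per_epoch(num_samples, batch_size, num_gpus=1):
    """Estimate optimizer steps per epoch when samples are split evenly across GPUs.

    Assumes each rank receives ceil(num_samples / num_gpus) samples and that the
    last incomplete batch is kept (no drop_last).
    """
    samples_per_rank = math.ceil(num_samples / num_gpus)
    return math.ceil(samples_per_rank / batch_size)


# Hypothetical dataset size; the batch size of 32 matches BATCH_SIZE in the recipe above.
print(steps_per_epoch(40000, 32, num_gpus=1))  # 1250
print(steps_per_epoch(40000, 32, num_gpus=2))  # 625
```

If the step count per epoch is identical with one and two GPUs, the processes are probably not sharing the work, which would point to a setup problem rather than expected behavior.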
No response
{
"CUDA": {
"GPU": [
"NVIDIA H100 80GB HBM3",
"NVIDIA H100 80GB HBM3",
"NVIDIA H100 80GB HBM3",
"NVIDIA H100 80GB HBM3",
"NVIDIA H100 80GB HBM3",
"NVIDIA H100 80GB HBM3",
"NVIDIA H100 80GB HBM3",
"NVIDIA H100 80GB HBM3"
],
"available": true,
"version": "12.1"
},
"Packages": {
"PyTorch_debug": false,
"PyTorch_version": "2.1.1+cu121",
"Trainer": "v0.0.34",
"numpy": "1.22.0"
},
"System": {
"OS": "Linux",
"architecture": [
"64bit",
"ELF"
],
"processor": "x86_64",
"python": "3.9.18",
"version": "#99-Ubuntu SMP Mon Oct 30 20:42:41 UTC 2023"
}
}
No response
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1591, in fit
self._fit()
File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1544, in _fit
self.train_epoch()
File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1309, in train_epoch
_, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1172, in train_step
num_optimizers=len(self.optimizer),
File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1021, in _optimize
with torch.autocast(device_type=device, dtype=dtype, enabled=config.mixed_precision):
AttributeError: module 'torch' has no attribute 'autocast'
https://github.com/coqui-ai/TTS/blob/dev/recipes/ljspeech/delightful_tts/train_delightful_tts.py
No response
torch version: '1.8.1+cu102'. I have checked the docs for several versions of torch, and there seems to be no top-level autocast in torch.
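The traceback is consistent with the installed PyTorch being older than the release that added the top-level `torch.autocast` context manager. To my knowledge that release is 1.10; older versions such as 1.8.1 only expose `torch.cuda.amp.autocast`. A minimal sketch of a version check is shown below. The helper names are hypothetical, and the 1.10 threshold is stated as an assumption in the code.

```python
import re

# Assumption: the top-level torch.autocast context manager first shipped in PyTorch 1.10.
MIN_AUTOCAST_VERSION = (1, 10)


def parse_torch_version(version):
    """Turn a version string such as '1.8.1+cu102' into a (major, minor) tuple."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def has_top_level_autocast(version):
    """Return True if this PyTorch version is expected to provide torch.autocast."""
    return parse_torch_version(version) >= MIN_AUTOCAST_VERSION


# Version strings taken from the environment reports on this page.
for reported in ("1.8.1+cu102", "1.13.1+cu117", "2.1.1+cu121"):
    print(reported, has_top_level_autocast(reported))
```

In a live environment the same check could be run against `torch.__version__`; if it returns False, upgrading PyTorch or pinning an older Trainer release are the likely options.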
No response
Error while training:
resource.getrlimit(resource.RLIMIT_NOFILE) is (1048576, 1048576)
| > stats_path:None
2023-06-14T07:29:43.025431079Z | > base:10
2023-06-14T07:29:43.025437149Z | > hop_length:256
2023-06-14T07:29:43.025444429Z | > win_length:1024
2023-06-14T07:29:43.025450699Z > initialization of speaker-embedding layers.
2023-06-14T07:29:43.025462919Z Traceback (most recent call last):
2023-06-14T07:29:43.025469199Z File "/workspace/coqui-tts/train.py", line 320, in <module>
2023-06-14T07:29:43.025476859Z trainer = Trainer(
2023-06-14T07:29:43.025484659Z File "/usr/local/lib/python3.10/dist-packages/trainer/trainer.py", line 405, in __init__
2023-06-14T07:29:43.025494939Z self.use_cuda, self.num_gpus = self.setup_training_environment(args=args, config=config, gpu=gpu)
2023-06-14T07:29:43.025500099Z File "/usr/local/lib/python3.10/dist-packages/trainer/trainer.py", line 632, in setup_training_environment
2023-06-14T07:29:43.025543959Z resource.setrlimit(resource.RLIMIT_NOFILE, (4096, rlimit[1]))
2023-06-14T07:29:43.025560229Z ValueError: not allowed to raise maximum limit
This is due to these lines:
Lines 653 to 660 in 9879d3d
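A more defensive way to adjust the open-file limit is sketched below. It is an illustration only, not the Trainer's actual code: the function name `ensure_min_open_files` is hypothetical. The idea is to leave the hard limit untouched, raise the soft limit only when it is below the target, and fall back to the current limits if the platform refuses the change. The `resource` module is available on Unix-like systems only.

```python
import resource


def ensure_min_open_files(target=4096):
    """Raise the soft RLIMIT_NOFILE to at least `target` when allowed.

    Returns the (soft, hard) limits in effect afterwards. The hard limit is never changed.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Nothing to do if the soft limit is unlimited or already high enough.
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft, hard
    # Never ask for more than the hard limit permits.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    except (ValueError, OSError):
        # The platform rejected the change; keep the existing limits.
        return soft, hard
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

With the limits reported above, (1048576, 1048576), this function would return immediately without calling `setrlimit`, since the soft limit already exceeds 4096.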
No errors
{
"CUDA": {
"GPU": [
"Tesla V100-FHHL-16GB"
],
"available": true,
"version": "11.7"
},
"Packages": {
"PyTorch_debug": false,
"PyTorch_version": "2.0.1+cu117",
"Trainer": "v0.0.20",
"numpy": "1.22.4"
},
"System": {
"OS": "Linux",
"architecture": [
"64bit",
"ELF"
],
"processor": "x86_64",
"python": "3.10.6",
"version": "#46-Ubuntu SMP Fri Jul 10 00:24:02 UTC 2020"
}
}
No response