
ar-vits's Introduction

AutoRegressive-VITS

(WIP) text to speech using autoregressive transformer and VITS

Note

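  • The model's quality has not been fully verified and results may not be good, so proceed with caution. The pretrained model is still being trained.
  • Training from scratch requires a massive amount of data (at least a thousand hours or more?), similar to VALL-E, SPEAR-TTS, and SoundStorm. With little data, results will not be good.
  • Because vits+refenc is very limited for zero-shot synthesis, this repository does not pursue zero-shot. The goal is, given a large pretrained LM, to use the strength of the autoregressive LM so that finetuning on a small dataset yields good prosody.
  • Some preliminary synthesized samples have been added.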

Todo

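  • Train on Genshin Impact data
  • Collect more open-source Chinese data (about 600 hours expected), train, and release a pretrain (x) --> results on out-of-distribution text are poor (for example, reading Classical Chinese), and long sentences are unstable and can break down
    • Add word-level BERT features and repeat them to the phoneme level to improve out-of-distribution results
    • Merge multiple utterances from the same speaker into a single audio clip to raise the average clip duration and improve the stability of long-sentence synthesis
    • Switch to RoPE relative positional encoding to improve the stability of long-sentence synthesis?
  • Write finetuning code and add sid support
  • Improve the Japanese and English text frontends, collect more Japanese and English data (about 600 hours per language expected), then train and release a pretrain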
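
The Todo item about repeating word-level BERT features to the phoneme level can be illustrated with a small sketch. This is not the repository's code: the function name `repeat_to_phoneme_level` and the `word2ph` list (number of phonemes per word) are hypothetical names chosen for illustration, and plain Python lists stand in for real BERT tensors.

```python
from typing import List, Sequence


def repeat_to_phoneme_level(
    word_feats: Sequence[Sequence[float]],
    word2ph: Sequence[int],
) -> List[List[float]]:
    """Repeat each word-level feature vector once per phoneme of that word.

    word_feats: one feature vector per word (hypothetical stand-in for BERT output).
    word2ph:    number of phonemes each word expands to (hypothetical name).
    """
    if len(word_feats) != len(word2ph):
        raise ValueError("need exactly one phoneme count per word")
    phone_feats: List[List[float]] = []
    for feat, n_phones in zip(word_feats, word2ph):
        if n_phones < 0:
            raise ValueError("phoneme counts must be non-negative")
        # Copy the vector for every phoneme so rows do not alias each other.
        phone_feats.extend(list(feat) for _ in range(n_phones))
    return phone_feats


# Toy example: 3 words with 2-dim features that expand to 2, 1, and 3 phonemes.
word_feats = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
word2ph = [2, 1, 3]
phone_feats = repeat_to_phoneme_level(word_feats, word2ph)
print(len(phone_feats))  # one row per phoneme
```

The output has one row per phoneme, so it can be aligned frame-for-frame with the phoneme sequence fed to the text-to-semantic model.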

Structure

structure.png

Training pipeline

  1. Jointly train the S2 VITS decoder and the quantizer
  2. Extract semantic tokens
  3. Train S1 (text to semantic tokens)
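
As a rough illustration of step 2, semantic tokens can be thought of as the indices of the nearest codebook entries for each feature frame. The sketch below assumes a single codebook with a Euclidean nearest-neighbor lookup; the repository's actual quantizer is not described here and may differ, and all names (`extract_semantic_tokens`, `codebook`, `frames`) are hypothetical.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def extract_semantic_tokens(
    frames: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> List[int]:
    """Map each feature frame to the index of its nearest codebook vector."""
    if not codebook:
        raise ValueError("codebook must not be empty")
    tokens: List[int] = []
    for frame in frames:
        distances = [squared_distance(frame, code) for code in codebook]
        # The index of the closest code is the discrete token for this frame.
        tokens.append(distances.index(min(distances)))
    return tokens


# Toy data: a 3-entry codebook and 4 two-dimensional frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 1.9], [1.1, 0.8]]
tokens = extract_semantic_tokens(frames, codebook)
print(tokens)  # [0, 1, 2, 1]
```

Under this view, S1 learns to predict such a token sequence from text, and S2 decodes the tokens back to audio.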

VITS S2 training

  • resample.py
  • gen_phonemes.py
  • extract_ssl_s2.py
  • gen_filelist_s2.py
  • train_s2.py

GPT S1 training

  • extract_vq_s1.py
  • gen_filelist_s1.py
  • train_s1.py
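
The two script lists above imply a fixed run order: all S2 steps first, then the S1 steps, since S1 needs tokens from the trained S2 quantizer. A minimal sketch of that ordering follows. It only builds and prints the commands and does not run them. The README does not document any command-line arguments or config files, so none are shown here; the scripts may well require some.

```python
import sys
from typing import List

# Script names are taken from the lists above; arguments are unknown and omitted.
S2_STEPS = [
    "resample.py",
    "gen_phonemes.py",
    "extract_ssl_s2.py",
    "gen_filelist_s2.py",
    "train_s2.py",
]
S1_STEPS = [
    "extract_vq_s1.py",
    "gen_filelist_s1.py",
    "train_s1.py",
]


def build_commands(python: str = sys.executable) -> List[List[str]]:
    """Return the scripts as argv lists, S2 steps first, then S1 steps."""
    return [[python, script] for script in S2_STEPS + S1_STEPS]


commands = build_commands("python")
for cmd in commands:
    # Dry run: print each command instead of executing it.
    print(" ".join(cmd))
```

To actually execute the steps, each argv list could be passed to `subprocess.run(cmd, check=True)` once the required arguments for each script are known.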

Inference

  • s1_infer.py/s2_infer.py (work in progress)

Pretrained models

  • work in progress


ar-vits's Issues

Why does S2 take phoneme inputs?

Hi, thanks for your work! I am a bit confused about one thing:

It seems that S1 already maps phonemes to semantic tokens, and S2 then maps semantic tokens to wav.
HuBERT tokens should carry enough semantic information, so why do we also feed the phoneme sequence into S2?

Dialect support: Cantonese

Cantonese, a vibrant and widely spoken dialect, is a fundamental part of the cultural and linguistic landscape of Hong Kong, Guangzhou, and their surrounding regions. With over 85.5 million speakers worldwide, it is one of the most spoken Chinese dialects, making it an essential communication tool for countless individuals. As a proud Hongkonger, I use Cantonese in my daily life for conversations with my family and friends.

However, despite its widespread usage, Cantonese resources, especially in the field of technology, have historically been limited compared to other more prominent languages.

I hope gpt-vits will support this dialect, thank you!

Nice work!!!

Nice work!!! Without the author's work, the neighboring GPT-SoVITs project would not exist 🤣
