jansystemic / amphion

This project forked from open-mmlab/amphion

Home Page: https://openhlt.github.io/amphion/

License: MIT License

Shell 3.44% Python 96.50% Cython 0.05%

Amphion: An Open-Source Audio, Music, and Speech Generation Toolkit


Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development. Amphion offers a unique feature: visualizations of classic models or architectures. We believe that these visualizations are beneficial for junior researchers and engineers who wish to gain a better understanding of these models.

The North-Star objective of Amphion is to offer a platform for studying the conversion of any inputs into audio. Amphion is designed to support individual generation tasks, including, but not limited to:

  • TTS: Text to Speech (⛳ supported)
  • SVS: Singing Voice Synthesis (👨‍💻 developing)
  • VC: Voice Conversion (👨‍💻 developing)
  • SVC: Singing Voice Conversion (⛳ supported)
  • TTA: Text to Audio (⛳ supported)
  • TTM: Text to Music (👨‍💻 developing)
  • more…

In addition to the specific generation tasks, Amphion also includes several vocoders and evaluation metrics. A vocoder is an important module for producing high-quality audio signals, while shared evaluation metrics are critical for measuring generation tasks consistently and comparably.

Here is the Amphion v0.1 demo, whose voice, audio effects, and singing voice are generated by our models. Just enjoy it!

Amphion-Demo-EN.mp4

🚀 News

  • 2023/12/18: Amphion v0.1 release.
  • 2023/11/28: Amphion alpha release.

⭐ Key Features

TTS: Text to Speech

  • Amphion achieves state-of-the-art performance when compared with existing open-source repositories on text-to-speech (TTS) systems. It supports the following models or architectures:
    • FastSpeech2: A non-autoregressive TTS architecture that utilizes feed-forward Transformer blocks.
    • VITS: An end-to-end TTS architecture that utilizes a conditional variational autoencoder with adversarial learning.
    • Vall-E: A zero-shot TTS architecture that uses a neural codec language model with discrete codes.
    • NaturalSpeech2: An architecture for TTS that utilizes a latent diffusion model to generate natural-sounding voices.

SVC: Singing Voice Conversion

  • Amphion supports multiple content-based features from various pretrained models, including WeNet, Whisper, and ContentVec. Their specific roles in SVC have been investigated in our NeurIPS 2023 workshop paper.
  • Amphion implements several state-of-the-art model architectures, including diffusion-, transformer-, VAE-, and flow-based models. The diffusion-based architecture uses a bidirectional dilated CNN as its backend and supports several sampling algorithms such as DDPM, DDIM, and PNDM. Additionally, it supports single-step inference based on the Consistency Model.
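
To make the sampling algorithms mentioned above more concrete, here is a minimal sketch of a single deterministic DDIM update (eta = 0) in plain Python. It is not Amphion's implementation: the function name `ddim_step`, its arguments, and the use of flat lists instead of tensors are assumptions made for illustration. `alpha_bar_t` and `alpha_bar_prev` stand for the cumulative noise-schedule products at the current and previous timesteps.

```python
import math


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) on a flat list of floats.

    Hypothetical helper for illustration only, not part of Amphion.
    x_t:            noisy sample at the current timestep
    eps_pred:       noise predicted by the model for x_t
    alpha_bar_t:    cumulative schedule product at the current timestep
    alpha_bar_prev: cumulative schedule product at the previous timestep
    """
    sqrt_ab_t = math.sqrt(alpha_bar_t)
    sqrt_one_minus_ab_t = math.sqrt(1.0 - alpha_bar_t)

    # Estimate the clean sample implied by the predicted noise.
    x0_pred = [
        (x - sqrt_one_minus_ab_t * e) / sqrt_ab_t
        for x, e in zip(x_t, eps_pred)
    ]

    # Move the estimate to the previous (less noisy) timestep,
    # reusing the same predicted noise direction.
    sqrt_ab_prev = math.sqrt(alpha_bar_prev)
    sqrt_one_minus_ab_prev = math.sqrt(1.0 - alpha_bar_prev)
    x_prev = [
        sqrt_ab_prev * x0 + sqrt_one_minus_ab_prev * e
        for x0, e in zip(x0_pred, eps_pred)
    ]
    return x_prev, x0_pred
```

When the predicted noise equals the true noise, `x0_pred` recovers the clean sample exactly, which is a convenient sanity check when wiring a sampler to a trained model.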

TTA: Text to Audio

  • Amphion supports TTA with a latent diffusion model. Its design follows AudioLDM, Make-an-Audio, and AUDIT. It is also the official implementation of the text-to-audio generation part of our NeurIPS 2023 paper.

Vocoder

Evaluation

Amphion provides a comprehensive objective evaluation of the generated audio. The evaluation metrics include:

  • F0 Modeling: F0 Pearson Coefficients, F0 Periodicity Root Mean Square Error, F0 Root Mean Square Error, Voiced/Unvoiced F1 Score, etc.
  • Energy Modeling: Energy Root Mean Square Error, Energy Pearson Coefficients, etc.
  • Intelligibility: Character/Word Error Rate, which can be calculated based on Whisper and more.
  • Spectrogram Distortion: Frechet Audio Distance (FAD), Mel Cepstral Distortion (MCD), Multi-Resolution STFT Distance (MSTFT), Perceptual Evaluation of Speech Quality (PESQ), Short Time Objective Intelligibility (STOI), etc.
  • Speaker Similarity: Cosine similarity, which can be calculated based on RawNet3, WeSpeaker, and more.
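
As a rough illustration of what some of the metrics listed above measure, the sketch below computes F0 RMSE, a Pearson coefficient, speaker-embedding cosine similarity, and word error rate in plain Python. These are simplified stand-ins, not Amphion's evaluation code: all function names are hypothetical, the F0 RMSE here assumes that a value of 0 marks an unvoiced frame and only scores frames voiced in both tracks, and the embeddings and transcripts are assumed to come from external models (for example a speaker encoder and an ASR system).

```python
import math


def f0_rmse(ref_f0, pred_f0):
    """RMSE over frames voiced in both tracks (assumption: 0 means unvoiced)."""
    pairs = [(r, p) for r, p in zip(ref_f0, pred_f0) if r > 0 and p > 0]
    if not pairs:
        raise ValueError("no frames are voiced in both tracks")
    return math.sqrt(sum((r - p) ** 2 for r, p in pairs) / len(pairs))


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

For example, `word_error_rate("the cat sat", "the dog sat")` counts one substitution over three reference words.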

Datasets

Amphion unifies data preprocessing for open-source datasets including AudioCaps, LibriTTS, LJSpeech, M4Singer, Opencpop, OpenSinger, SVCC, VCTK, and more. The supported dataset list can be seen here (updating).

📀 Installation

git clone https://github.com/open-mmlab/Amphion.git
cd Amphion

# Install Python Environment
conda create --name amphion python=3.9.15
conda activate amphion

# Install Python package dependencies
sh env.sh

🐍 Usage in Python

Instructions for the different tasks are detailed in the following recipes:

🙏 Acknowledgement

©️ License

Amphion is under the MIT License. It is free for both research and commercial use cases.

📚 Citations

@article{zhang2023amphion,
      title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit}, 
      author={Xueyao Zhang and Liumeng Xue and Yuancheng Wang and Yicheng Gu and Xi Chen and Zihao Fang and Haopeng Chen and Lexiao Zou and Chaoren Wang and Jun Han and Kai Chen and Haizhou Li and Zhizheng Wu},
      journal={arXiv},
      year={2023},
      volume={abs/2312.09911}
}
