realtime-yukarin's Introduction

Realtime Yukarin: an application for real-time voice conversion

Realtime Yukarin is an application for real-time voice conversion with a single command. It requires trained deep-learning models and a computer with a GPU. The source code is open source under the MIT License, so you can modify it or use it in your own applications, whether commercial or non-commercial.

Japanese README

Supported environment

  • Windows
  • GeForce GTX 1060
  • 6GB GPU memory
  • Intel Core i7-7700 CPU @ 3.60GHz
  • Python 3.6

Preparation

Install required libraries

pip install -r requirements.txt

Prepare trained models

You need two trained models: a first-stage model responsible for voice conversion, and a second-stage model for enhancing the quality of the converted result. You can create the first-stage model with Yukarin and the second-stage model with Become Yukarin.

You also need frequency-statistics files, created with Yukarin, for voice pitch conversion.

Here, each filename is as follows:

| Content | Filename |
| --- | --- |
| Frequency statistics for input voice | ./sample/input_statistics.npy |
| Frequency statistics for target voice | ./sample/target_statistics.npy |
| First-stage model from Yukarin | ./sample/model_stage1/predictor.npz |
| First stage's config file | ./sample/model_stage1/config.json |
| Second-stage model from Become Yukarin | ./sample/model_stage2/predictor.npz |
| Second stage's config file | ./sample/model_stage2/config.json |
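Before moving on, you can confirm the files above are in place. A minimal sketch using only the Python standard library (the helper name `missing_files` is our own, not part of the project):

```python
from pathlib import Path

# The file layout described in the table above.
REQUIRED_FILES = [
    './sample/input_statistics.npy',
    './sample/target_statistics.npy',
    './sample/model_stage1/predictor.npz',
    './sample/model_stage1/config.json',
    './sample/model_stage2/predictor.npz',
    './sample/model_stage2/config.json',
]

def missing_files(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not Path(p).exists()]

if __name__ == '__main__':
    missing = missing_files(REQUIRED_FILES)
    if missing:
        print('Missing files:', ', '.join(missing))
    else:
        print('All required files are in place.')
```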

Verification

You can verify the prepared files by executing ./check.py. The following example converts 5 seconds of voice data from input.wav and saves the result to output.wav.

python check.py \
    --input_path 'input.wav' \
    --input_time_length 5 \
    --output_path 'output.wav' \
    --input_statistics_path './sample/input_statistics.npy' \
    --target_statistics_path './sample/target_statistics.npy' \
    --stage1_model_path './sample/model_stage1/predictor.npz' \
    --stage1_config_path './sample/model_stage1/config.json' \
    --stage2_model_path './sample/model_stage2/predictor.npz' \
    --stage2_config_path './sample/model_stage2/config.json'

If you have problems, you can ask questions on GitHub Issues.

Run

To perform real-time voice conversion, create a config file config.yaml and run ./run.py.

python run.py ./config.yaml

Description of config file

# Name of input sound device. Partial Match. Details are below.
input_device_name: str

# Name of output sound device. Partial Match. Details are below.
output_device_name: str

# Input sampling rate
input_rate: int

# Output sampling rate
output_rate: int

# frame_period for the acoustic features
frame_period: int

# Length of voice to convert at one time (seconds).
# If it is too long, delay will increase, and if it is too short, processing will not catch up.
buffer_time: float

# Method to calculate the fundamental frequency: world or crepe.
# CREPE needs additional libraries; see requirements.txt for details.
extract_f0_mode: world

# Length of voice to be synthesized at one time (number of samples)
vocoder_buffer_size: int

# Amplitude scaling for input.
# When it is more than 1, the amplitude becomes large, and when it is less than 1, the amplitude becomes small.
input_scale: float

# Amplitude scaling for output.
# When it is more than 1, the amplitude becomes large, and when it is less than 1, the amplitude becomes small.
output_scale: float

# Silence threshold for input (dB).
# The smaller the value, the more easily the input is judged as silence.
input_silent_threshold: float

# Silence threshold for output (dB).
# The smaller the value, the more easily the output is judged as silence.
output_silent_threshold: float
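The silence thresholds compare the level of each audio chunk, in decibels, against the configured value. A rough sketch of how such a level could be computed — this is an illustration of the idea, not the project's exact formula:

```python
import math

def rms_db(samples):
    """Root-mean-square level of a chunk of samples, in dB relative to full scale (1.0)."""
    if not samples:
        return float('-inf')
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        return float('-inf')
    return 20.0 * math.log10(rms)

def is_silent(samples, threshold_db):
    """A chunk quieter than the threshold is treated as silence."""
    return rms_db(samples) < threshold_db

# A full-scale constant signal sits at 0 dB; a very quiet one falls below a -40 dB threshold.
print(is_silent([1.0] * 256, -40.0))    # False
print(is_silent([0.001] * 256, -40.0))  # True
```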

# Overlap for encoding (seconds)
encode_extra_time: float

# Overlap for converting (seconds)
convert_extra_time: float

# Overlap for decoding (seconds)
decode_extra_time: float

# Path of frequency statistics file
input_statistics_path: str
target_statistics_path: str

# Path of trained model file
stage1_model_path: str
stage1_config_path: str
stage2_model_path: str
stage2_config_path: str
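Putting the fields together, a config.yaml might look like the following. The device names and numeric values here are placeholders chosen for illustration; adjust them to your own environment and models:

```yaml
input_device_name: 'Microphone'
output_device_name: 'Speaker'
input_rate: 16000
output_rate: 24000
frame_period: 5
buffer_time: 0.5
extract_f0_mode: world
vocoder_buffer_size: 1024
input_scale: 1.0
output_scale: 1.0
input_silent_threshold: -80.0
output_silent_threshold: -80.0
encode_extra_time: 0.1
convert_extra_time: 0.1
decode_extra_time: 0.1
input_statistics_path: './sample/input_statistics.npy'
target_statistics_path: './sample/target_statistics.npy'
stage1_model_path: './sample/model_stage1/predictor.npz'
stage1_config_path: './sample/model_stage1/config.json'
stage2_model_path: './sample/model_stage2/predictor.npz'
stage2_config_path: './sample/model_stage2/config.json'
```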

(preliminary knowledge) Name of sound device

In the example below, Logitech Speaker is the name of the sound device.
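Because the device name fields are matched by partial match, a substring such as Logitech is enough. A small sketch of how such a selection could work — the function is illustrative, not the project's actual implementation:

```python
def find_device(devices, name_fragment):
    """Return the first device whose name contains the given fragment, else None."""
    for device in devices:
        if name_fragment in device:
            return device
    return None

# Hypothetical device list as an audio backend might report it.
devices = ['Microsoft Sound Mapper', 'Logitech Speaker', 'Realtek Microphone']
print(find_device(devices, 'Logitech'))  # Logitech Speaker
```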

License

MIT License

realtime-yukarin's People

Contributors

hiroshiba, wakiyamap


realtime-yukarin's Issues

Questions and documentation

  1. Which model does Yukarin use for its training?
  2. Are there any target voice training document specifications?
  3. Would public voice datasets help with training?
  4. Does this project work with English datasets?
  5. Why is the example page's voice so "robotic"/"compressed"?

An error occurs when running run.py or check.py

When I run check.py or run.py, the following is printed and the program fails:
OSError: /usr/local/lib/python3.6/dist-packages/world4py/libworld.so: cannot open shared object file: No such file or directory
(I searched for the file, and it does exist at that path.
 * I checked with os.path.exists(_WORLD_LIBRARY_PATH) and confirmed it returns True.)

I know that the recommended environment for realtime-yukarin is Windows, but since the same error also occurs with become-yukarin, I am posting it here.

If you know of a solution, I would be very grateful if you could share it.

OS: Ubuntu 18.04.5 LTS (Bionic Beaver)
Python 3.6.9

Full error:

python3 check.py --input_path 'input.wav' --input_time_length 5 --output_path 'output.wav' --input_statistics_path './sample/input_statistics.npy' --target_statistics_path './sample/target_statistics.npy' --stage1_model_path './sample/model_stage1/predictor_260000.npz' --stage1_config_path './sample/model_stage1/config.json' --stage2_model_path './sample/model_stage2/predictor_8000.npz' --stage2_config_path './sample/model_stage2/config.json'
Traceback (most recent call last):
  File "check.py", line 13, in <module>
    from realtime_voice_conversion.config import VocodeMode
  File "/home/****/ドキュメント/IkeboMaster/realtime-yukarin/realtime_voice_conversion/__init__.py", line 1, in <module>
    from . import stream
  File "/home/****/ドキュメント/IkeboMaster/realtime-yukarin/realtime_voice_conversion/stream/__init__.py", line 2, in <module>
    from .decode_stream import DecodeStream
  File "/home/****/ドキュメント/IkeboMaster/realtime-yukarin/realtime_voice_conversion/stream/decode_stream.py", line 7, in <module>
    from ..yukarin_wrapper.vocoder import Vocoder
  File "/home/****/ドキュメント/IkeboMaster/realtime-yukarin/realtime_voice_conversion/yukarin_wrapper/vocoder.py", line 5, in <module>
    from world4py.native import structures, apidefinitions, utils
  File "/usr/local/lib/python3.6/dist-packages/world4py/native/__init__.py", line 6, in <module>
    from world4py.native import apis, tools, utils, structures
  File "/usr/local/lib/python3.6/dist-packages/world4py/native/apis.py", line 6, in <module>
    from world4py.native import apidefinitions, structures, utils
  File "/usr/local/lib/python3.6/dist-packages/world4py/native/apidefinitions.py", line 7, in <module>
    from world4py.native import structures, instance
  File "/usr/local/lib/python3.6/dist-packages/world4py/native/instance.py", line 9, in <module>
    _WORLD = ctypes.cdll.LoadLibrary(_WORLD_LIBRARY_PATH)
  File "/usr/lib/python3.6/ctypes/__init__.py", line 426, in LoadLibrary
    return self._dlltype(name)
  File "/usr/lib/python3.6/ctypes/__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /usr/local/lib/python3.6/dist-packages/world4py/libworld.so: cannot open shared object file: No such file or directory

Can you share your thinking behind the real-time design?

Hi, I have a voice conversion model and I want to turn it into a real-time model.

  1. I am confused about what "start_time" and "extra_time" mean in your code.
  2. I want to record audio from the microphone, process the audio data, and play the processed audio at the same time. How can I design the code?

Thank you very much!
