
Comments (68)

ibab avatar ibab commented on August 10, 2024 11

Yeah, it's definitely a planned feature.
I'll get to it eventually, but I'd also accept contributions if someone is interested.
A solution to this should also integrate with the AudioReader interface.

from tensorflow-wavenet.

ibab avatar ibab commented on August 10, 2024 6

I've also thought about just plugging in the raw text, but I'm pretty sure we would need at least some kind of attention mechanism if we want it to work properly (i.e. some way for the network to figure out which parts of the text correspond to which sections of the waveform).

from tensorflow-wavenet.

alexbeloi avatar alexbeloi commented on August 10, 2024 5

@ibab @jyegerlehner @Zeta36

Apologies for my absence/delay, I was in the middle of moving when I started working on this. Hopefully we can get the ball rolling on this again.

from tensorflow-wavenet.

alexbeloi avatar alexbeloi commented on August 10, 2024 4

I'm starting to work on it, I think I can get some basic implementation working over the next couple of days. Global part should be easy, and a dumb implementation (upsampling by repeating values) of local conditioning should be fast to implement as well.

This way, we can get to a stage where the net can produce some low-quality speech, then work on improving the quality by adding more sophisticated upsampling methods.

The white paper also talks about using local conditioning features beyond just the text data, they do some preprocessing to compute phonetic features from the text. That would be nice to add later as well.

from tensorflow-wavenet.

rockyrmit avatar rockyrmit commented on August 10, 2024 4

(from one of the WaveNet co-authors):
Linguistic features which we used were similar to those listed in this document.
https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/F0parametrisation/hts_lab_format.pdf

AFAIK there is no publicly available large TTS speech database containing linguistic features :-(
So the TTS research community (especially universities) often uses small ones.

One candidate is the CMU ARCTIC databases with the HTS demo. CMU ARCTIC has 4 US English speakers (about 1 hour per speaker). It is distributed with phoneme-level segmentations. The HTS demo shows how to extract other linguistic features (described in the above-mentioned document) from raw texts using festival. If you have any TTS experts / PhD researchers around, they may be familiar with how to use festival / the HTS demo.

Let me know if anyone wants to start working on the linguistic features and local conditioning.

from tensorflow-wavenet.

Zeta36 avatar Zeta36 commented on August 10, 2024 4

I think it is very important for this project not to die that somebody publish or share their implementation of local or global conditioning (even if it is unfinished). I'm afraid this project will get stuck in its current state if nobody takes the next step.

I've done my best, but I'm afraid I have neither the equipment (no GPU) nor the knowledge to do much more than what I've already done.

from tensorflow-wavenet.

jyegerlehner avatar jyegerlehner commented on August 10, 2024 4

@alexbeloi @Zeta36

I think you two were both working on some global conditioning code. I've got a branch with an implementation of global conditioning with a working test here. It does not implement anything in the file reader that reports back the speaker ID. I implemented the test with toy data: we globally condition on a speaker id of 0, 1 or 2, and the model generates a sine wave of a different frequency depending on which ID is chosen.

I'd be just as happy to use one of your implementations instead of mine, especially since I think you've probably got the AudioReader modified to report speaker id and I don't have that. But I'd like to preserve the tests that I wrote.

So I'd like to know if you are still planning on contributing global conditioning, and have ideas on how to merge our various contributions.

from tensorflow-wavenet.

jyegerlehner avatar jyegerlehner commented on August 10, 2024 3

@alexbeloi
I'm contemplating working from your branch, and adding my test on top of it. Looking at your branch I notice a few things:

https://github.com/alexbeloi/tensorflow-wavenet/blob/master/wavenet/model.py#L560

Here I was using tf.nn.embedding_lookup, not tf.one_hot, to go from the integer that specifies "speaker_id".

compactness
I think one problem with using tf.one_hot instead of tf.nn.embedding_lookup is its effect on the size of the 'gcond_filter' and 'gcond_gate' parameter tensors. These occur in every dilation layer, and the size of each is global_condition_channels x dilation_channels. When using tf.one_hot, global_condition_channels equals the number of mutually exclusive categories, whereas with tf.nn.embedding_lookup, global_condition_channels specifies the embedding size and can be chosen independently of the number of mutually exclusive categories. This might be a size-16 or size-32 embedding, as opposed to a size-109 vector (to cover the speakers in the VCTK corpus).
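A minimal sketch of the difference (TF 1.x-era API; toy sizes, and the variable names are illustrative rather than taken from either branch):

```python
import tensorflow as tf

num_speakers = 109        # e.g. VCTK
embedding_size = 16       # freely chosen when using an embedding
dilation_channels = 32

speaker_id = tf.placeholder(tf.int32, [None])  # batch of integer speaker IDs

# One-hot route: the condition vector has num_speakers channels, so each
# per-layer 'gcond_*' projection is 109 x dilation_channels.
h_one_hot = tf.one_hot(speaker_id, depth=num_speakers)            # [batch, 109]

# Embedding route: the condition vector has embedding_size channels, so each
# per-layer projection is only 16 x dilation_channels.
embedding_table = tf.get_variable('gc_embedding',
                                  [num_speakers, embedding_size])
h_embedded = tf.nn.embedding_lookup(embedding_table, speaker_id)  # [batch, 16]
```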

generality
Another problem is generality: one might wish to do global conditioning where there isn't an enumeration of mutually exclusive categories upon which one is conditioning. Your approach works fine when there are only 109 speakers in the VCTK corpus, but what if one wishes to condition upon some embedding vector produced by, say, seq2seq, or a context stack (section 2.6 in the paper)? I don't think the number of possible character sequences that correspond to valid sentences in a language could feasibly be enumerated. But you can produce a dense embedding vector of fixed size (say, 1000) that represents any sentence. The h in the equation at the bottom of page 4 of the paper can be any vector you want to condition on, but with tf.one_hot it can only be an input to the WaveNetModel as an integer enumerating all possible values.

local conditioning: separate PR?
I think it's usually good practice to break up large changes into smaller ones, so as not to try to "eat the elephant" all in one sitting. Each of global and local conditioning is complicated enough a change I think they are better in separate PRs. I'd suggest putting them in their own named branches rather than your master.

local conditioning: hard-wired to strings
https://github.com/alexbeloi/tensorflow-wavenet/blob/master/wavenet/model.py#L566

I'm guessing your use of tf.string_to_hash_bucket_fast() is intended to process linguistic features (which come as strings? I don't really know). But the paper also mentions local conditioning for context stacks (section 2.6), which will not be strings but a dense embedding vector y, as in the equation at the top of page 5.

local conditioning: upsampling/deconvolution
Your tf.image.resize_images, I think, does what they said doesn't work as well (page 5, last paragraph of 2.5). I think this needs to be a strided transposed convolution (a.k.a. deconvolution).
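For reference, a strided transposed-convolution upsampler could look roughly like this (a sketch only, with made-up shapes; not code from either branch):

```python
import tensorflow as tf

# Toy shapes: 100 condition frames with 64 channels, upsampled 256x in time.
batch, frames, channels, factor = 1, 100, 64, 256

features = tf.placeholder(tf.float32, [batch, frames, channels])
features_4d = tf.expand_dims(features, 1)       # conv2d_transpose expects 4-D

filters = tf.get_variable('upsample_filter', [1, factor, channels, channels])
upsampled = tf.nn.conv2d_transpose(
    features_4d, filters,
    output_shape=[batch, 1, frames * factor, channels],
    strides=[1, 1, factor, 1])
upsampled = tf.squeeze(upsampled, 1)            # [batch, frames * 256, channels]
```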

So in short, I think what I'm proposing is that global_condition vector h and local_condition vector y come into the WaveNetModel class as dense vectors of any size from any source, and that any encoding (e.g. tf.one_hot or tf.nn.embedding_lookup) be done outside the WaveNetModel. Then, when we're working with VCTK we can do one_hot or embedding_lookup to produce global_condition, but when we're dealing with other things that produce a dense vector we can accommodate that too.

I think the approach you are taking works as long as all we care about is the VCTK corpus (or a few music genres) without context stacks. But context stacks are definitely on my road map, so I'd prefer not to see local conditioning hard-wired to strings.

Maybe the wider community is happy with your approach and if so perhaps they can speak up.

BTW these are my initial thoughts; I often miss things and am very persuadable.

from tensorflow-wavenet.

r9y9 avatar r9y9 commented on August 10, 2024 3

Hi, I have also implemented global and local conditioning. See https://github.com/r9y9/wavenet_vocoder for details. Audio samples are found at https://r9y9.github.io/wavenet_vocoder/. I think that is ready to use as a mel-spectrogram vocoder with DeepVoice or Tacotron.

EDIT: you can find some discussion at r9y9/wavenet_vocoder#1.

from tensorflow-wavenet.

thomasmurphycodes avatar thomasmurphycodes commented on August 10, 2024 2

I mean let's give it a shot and see what happens. Google Research has a bunch of papers over on their page about HMM-ing characters to phonemes, so we could look into a subproject where we try to implement that.

On Sat, Oct 8, 2016 at 2:25 PM, Alex Beloi [email protected] wrote:

I was thinking to just use the raw text from the corpus data for local
conditioning to start, just encode each character into a vector and
upsample (by repeats) it to the number of samples in the audio file, not
ideal but it's a start. Characters should be able to act as a really rough
proxy for phonetic features.

Ideally, the raw text should be processed (perhaps via some other model)
into a sequence of phonetic features and then that would be upsampled
to the size of the audio sample.



from tensorflow-wavenet.

alexbeloi avatar alexbeloi commented on August 10, 2024 2

@Zeta36, @ibab Apologies for the delays, the local/global conditioning has been taking a bit longer than expected.

I can push my progress to my fork by tonight. What I have right now runs, though for some reason training stalls at exactly iteration 116 (i.e. the process will not continue to the next iteration, despite the default num_steps = 4000).

One of the main time sinks is that it takes a long time to train and then generate wav files to check if the conditioning is doing anything at all. No real way around that.

from tensorflow-wavenet.

jyegerlehner avatar jyegerlehner commented on August 10, 2024 2

@alexbeloi

they became mismatched over time because of the audio slicing.

That sounds good. I'm glad you stumbled over that tripwire before I got to it :P.

I fear we may have duplicated some effort, but you are ahead of me. I hadn't got to the audio reader part yet. I've spent most of the time building out model_test.py so that we can test training and "speaker id"-conditioned generation. So perhaps we can combine your global conditioning with my test, or pick the better parts of both.

Have you by any chance incorporated speaker shuffling in your audio reader changes? I think we're going to need that, so you might keep it in mind as you write that code, if not implement it in the first PR.

from tensorflow-wavenet.

alexbeloi avatar alexbeloi commented on August 10, 2024 2

@jyegerlehner

The shuffling has been in the back of my mind. I haven't worked on it yet; it definitely needs to get implemented at some point for the data to be closer to IID.

@ibab and all
I've caught up my changes with upstream/master and pushed them to my fork. So far I have the model and training parts done for both global and local conditioning, but not the generation. I haven't been able to verify that the conditioning is working since I haven't gotten the generation working yet.

I want to clean it up more and modularize the embedding/upsampling before making a PR but if anyone wants to hack away at it in parallel, feel free.

https://github.com/alexbeloi/tensorflow-wavenet

Running the following will train the model with global conditioning on the speaker_id from the VCTK corpus data, and local conditioning on the corresponding text data:
python train.py --vctk

The way I've implemented it, the conditioning gets applied to each dilation layer (not just the initial one); it's not clear to me from the paper whether that's the intended method.
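For what it's worth, here is a minimal sketch of what "conditioning applied in every dilation layer" can look like, with each layer getting its own pair of 1x1 condition projections added inside the tanh/sigmoid gates. Names and shapes are assumptions for illustration, not the actual model.py code:

```python
import tensorflow as tf

def conditioned_gated_unit(x, h, dilation, name, channels=32):
    """One causal dilated gated activation, conditioned on h.

    x: [batch, time, residual_channels] layer input.
    h: [batch, time, condition_channels] conditioning signal, already
       upsampled (or broadcast, for global conditioning) to the audio length.
    """
    with tf.variable_scope(name):
        in_ch = x.get_shape()[-1].value
        cond_ch = h.get_shape()[-1].value
        w_filter = tf.get_variable('filter', [2, in_ch, channels])
        w_gate = tf.get_variable('gate', [2, in_ch, channels])
        # Per-layer condition weights, i.e. a separate V_{f,k}, V_{g,k} per layer k.
        w_cond_f = tf.get_variable('cond_filter', [1, cond_ch, channels])
        w_cond_g = tf.get_variable('cond_gate', [1, cond_ch, channels])

        # Causal padding so the dilated convolution preserves the input length.
        x_pad = tf.pad(x, [[0, 0], [dilation, 0], [0, 0]])
        filt = tf.nn.convolution(x_pad, w_filter, padding='VALID',
                                 dilation_rate=[dilation])
        gate = tf.nn.convolution(x_pad, w_gate, padding='VALID',
                                 dilation_rate=[dilation])
        # 1x1 convolutions of the condition, added inside the nonlinearities.
        cond_f = tf.nn.conv1d(h, w_cond_f, stride=1, padding='SAME')
        cond_g = tf.nn.conv1d(h, w_cond_g, stride=1, padding='SAME')
        return tf.tanh(filt + cond_f) * tf.sigmoid(gate + cond_g)
```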

from tensorflow-wavenet.

alexbeloi avatar alexbeloi commented on August 10, 2024 2

@jyegerlehner Thanks for the feedback, I agree with everything you've pointed out. My plan was to do hacky VCTK-specific embeddings, get the math right, then go back and replace them with more generic embeddings/upsampling.

@Zeta36 Thanks for verifying that some of these things work! I'll have to look at what you say regarding the sample_size. I thought the way I had it, it was queuing the same global condition for each piece that is sliced and queued from the sample.

from tensorflow-wavenet.

chrisnovello avatar chrisnovello commented on August 10, 2024 2

Exciting thread here!

fwiw: I re-recorded one of the entries from VCTK in my own voice, and got decent babble results. I used the VCTK txt and recording style with the intention of later training on the full corpus + my voice in the mix. Planning to do more recordings, and I'd be happy to do them in a way that helps generate data with linguistic features (by adding markup myself, and/or reading passages designed with them in mind, etc). I might be able to find some others to help on this as well. Let me know if any of this would be useful!

from tensorflow-wavenet.

Zeta36 avatar Zeta36 commented on August 10, 2024 1

@alexbeloi, imagine you have 5 wav files, each of a different size. load_vctk_audio() yields the raw audio vector, the speaker id and the text the wav is saying. If you fill the sample holder, the id holder and the text holder in one go (self.sample_size equal to None), everything is correct. But if you set a sample_size to cut the audio into pieces, you have this problem:

  1. We have 5 raw audio files and we start the first iteration with buffer_ clean. We append the first raw audio vector to the buffer and cut the first piece of buffer_ with a certain sample size, after which we feed the three holders. We then repeat, cutting another piece of sample_size and feeding again.

  2. We repeat this process while len(buffer_) > self.sample_size, so when cutting a piece leaves len(buffer_) less than or equal to self.sample_size, we ignore this last piece (this is the real problem) and restart the loop with a new raw audio file and a new speaker id and text. But now buffer_ is NOT clean as in the first loop; it still holds the remaining piece of the previous audio.

In other words, when we start cutting an audio vector, the last piece is ignored and stays in buffer_ until the next iteration. This is not a major problem when working without conditioning, as until now, but it cannot stay this way with conditioning, because in the second iteration you begin to mix raw audio data from different speakers and texts.

A quick solution would be simply to clear buffer_ at the beginning of each iteration, on the line right after
for audio, extra in iterator:
using
buffer_ = np.array([])

This would be a solution, but it ignores the last piece of every audio file, which may not be a good idea.

Regards,
Samu.

from tensorflow-wavenet.

beppeben avatar beppeben commented on August 10, 2024 1

Hi guys,

I've given a little thought to local conditioning, with the final goal of training the network to do TTS.
I am by no means an expert in this domain; I've just tried to come up with some ideas for the task that seemed reasonable to me (but could very well be trivial or wrong). I would love to hear your opinion on this.

So here's how I imagine this to work:

Let's take an input text T that we would like to use as a local condition to the net. Each entry T(i) is a one-hot encoding of a character (so T(i) has dimension 30 or something). We might add two special characters START and END. We could first compute a vector TT in order to roughly identify phonemes from small groups of contiguous letters. Say we use 3 letters, so phoneme i will be defined as

TT(i) = f(T(i-1)*w1 + T(i)*w2 + T(i+1)*w3)

where f is some nonlinear function (sigmoid or similar) acting pointwise on the input vector.
We can associate a feature vector H(i) to each phoneme i, intuitively describing how it sounds. So we can write

H = f(W_H*TT)

Now we need to find a scalar duration S(i) for each phoneme i, giving a measure of how long it will last. We could do it by quantizing a set of possible durations and then defining

S = soft_max(W_S*TT)

Now we need to transform this into a local condition at sample t. This conditioning L(t) could be defined as a "weighted" mixture between neighbouring phonemes

L(t) = sum_i H(i)*N((t-M(i,t))/S(i))

where N denotes the gaussian pdf and M(i,t) is the position of phoneme i in the sample, as seen from sample t. We could model instead MM(i,t) = M(i,t)-M(i-1,t), so we can impose that all components be positive.

The process could be initialized with uniform spacing of phonemes over the wave, i.e. MM(i,0) = sample_size/text_size. But then it can evolve as the wave progresses. So for example the position of the first phoneme M(1, t) could get delayed over time with respect to the initial value, if the wave realizes that it's mostly generating silence at the beginning.

The processes MM(i,·) could be modeled using an RNN (or a set of RNNs?) in a way that I haven't completely figured out. The idea is, however, that its generation be conditioned on the input and dilation layers of the original wavenet, and that everything be trained as one single net.
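To make the weighting step concrete, here is a small NumPy sketch of the mixture idea above, with the simplification that M is static (no dependence on t) and N is the standard normal pdf; all names and sizes are made up:

```python
import numpy as np

def local_condition(H, M, S, num_samples):
    """L(t) = sum_i H(i) * N((t - M(i)) / S(i)).

    H: [num_phonemes, channels]  per-phoneme feature vectors
    M: [num_phonemes]            phoneme centre positions, in samples
    S: [num_phonemes]            per-phoneme spread/duration, in samples
    Returns L with shape [num_samples, channels].
    """
    t = np.arange(num_samples)[:, None]            # [num_samples, 1]
    z = (t - M[None, :]) / S[None, :]              # [num_samples, num_phonemes]
    weights = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return weights @ H

# Toy usage: 5 phonemes spread uniformly over one second at 16 kHz.
H = np.random.randn(5, 16)
M = np.linspace(1600, 14400, 5)
S = np.full(5, 1600.0)
L = local_condition(H, M, S, 16000)                # shape (16000, 16)
```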

A potential issue that I see is that this has to be trained using a sample size that contains all the given text, since cutting the input text in the middle of a sentence is always somehow arbitrary as you can never be sure that the sound waves contain the corresponding text after the split. So this can be computationally challenging.

What do you think? Has something like this already been done?

from tensorflow-wavenet.

rafaelvalle avatar rafaelvalle commented on August 10, 2024 1

I assume conditioning can be done in the same way as in char2wav, where a decoder learns vocoder features from a sequence of characters and feeds them into wavenet for training. Note that char2wav is trained end-to-end.
https://mila.umontreal.ca/en/publication/char2wav-end-to-end-speech-synthesis/

from tensorflow-wavenet.

jakeoverbeek avatar jakeoverbeek commented on August 10, 2024 1

Hello guys.

Any progress on the local conditioning? We are doing a thesis on TTS and the WaveNet model looks pretty interesting. We have the computing power to test things. @beppeben @alexbeloi @jyegerlehner did you guys make any progress?

from tensorflow-wavenet.

rafaelvalle avatar rafaelvalle commented on August 10, 2024 1

This is to confirm that we also got global and local conditioning to work for a wavenet decoder based on mel spectrograms, the same way @r9y9 has done in his repo, i.e. upsampling with upsampling layers and matrix addition inside the tanh and sigmoid.
It would be very useful to have a successful implementation of local conditioning of linguistic features!

from tensorflow-wavenet.

Zeta36 avatar Zeta36 commented on August 10, 2024

Is somebody working on this already?

from tensorflow-wavenet.

thomasmurphycodes avatar thomasmurphycodes commented on August 10, 2024

I agree global will be easier, should just be a one-hot vector representing the speaker. Am I thinking about this wrong that the local conditioning requires us to train on data sets that contain the phonetic data as a feature vector in addition to the waveform feature? What dataset are you thinking of using?

On Sat, Oct 8, 2016 at 1:56 PM, Alex Beloi [email protected] wrote:

I'm starting to work on it, I think I can get some basic implementation
working over the next couple of days. Global part should be easy, and a
dumb implementation (upsampling by repeating values) of local conditioning
should be fast to implement as well.

This way, we can get to a stage where the net can produces some
low-quality speech. Then work on improving the quality by adding more
sophisticated upsampling methods.

The white paper also talks about using local conditioning features beyond
just the text data, they do some preprocessing to compute phonetic features
from the text. That would be nice to add later as well.



from tensorflow-wavenet.

alexbeloi avatar alexbeloi commented on August 10, 2024

I was thinking to just use the raw text from the corpus data for local conditioning to start: just encode each character into a vector and upsample it (by repeats) to the number of samples in the audio file. Not ideal, but it's a start. Characters should be able to act as a really rough proxy for phonetic features.

Ideally, the raw text should be processed (perhaps via some other model) into a sequence of phonetic features and then that would be upsampled to the size of the audio sample.
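A rough NumPy sketch of that character-repeat starting point, just to illustrate the shape of the idea (toy code, not from any branch):

```python
import numpy as np

def upsample_text_by_repeats(text, num_audio_samples, vocab_size=256):
    """One-hot encode each character and repeat it so the text sequence is
    stretched to the length of the audio."""
    char_ids = np.frombuffer(text.encode('ascii', 'replace'), dtype=np.uint8)
    one_hot = np.eye(vocab_size, dtype=np.float32)[char_ids]   # [chars, vocab]
    repeats = int(np.ceil(num_audio_samples / len(char_ids)))
    return np.repeat(one_hot, repeats, axis=0)[:num_audio_samples]

# e.g. a two-second clip at 16 kHz:
features = upsample_text_by_repeats("please call stella", 32000)  # (32000, 256)
```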

from tensorflow-wavenet.

nakosung avatar nakosung commented on August 10, 2024

@thomasmurphycodes Could you post the list of papers?

from tensorflow-wavenet.

thomasmurphycodes avatar thomasmurphycodes commented on August 10, 2024

Yeah, will do tomorrow when I'm in the office; they're on a box I have there.

On Sat, Oct 8, 2016 at 10:38 PM, Nako Sung [email protected] wrote:

@thomasmurphycodes https://github.com/thomasmurphycodes Could you post
the list of papers?



from tensorflow-wavenet.

thomasmurphycodes avatar thomasmurphycodes commented on August 10, 2024

I think that's the case for sure. They explicitly mention the convolution up-sampling (zero-padding) in the paper.

On Mon, Oct 10, 2016 at 7:30 AM, Igor Babuschkin [email protected]
wrote:

I've also thought about just plugging in the raw text, but I'm pretty sure
we would need at least some kind of attention mechanism if we want it to
work properly (i.e. some way for the network to figure out which parts of
the text correspond to which sections of the waveform).



from tensorflow-wavenet.

wuaalb avatar wuaalb commented on August 10, 2024

In #92 HMM-aligned phonetic features are already provided. The upsampling/repeating-values step is for going from a feature vector per HMM frame to a feature vector per time-domain sample.

from tensorflow-wavenet.

rockyrmit avatar rockyrmit commented on August 10, 2024

Found Merlin online. Has anyone here used their training data (the CMU_ARCTIC datasets) as linguistic features to train the wavenet?

from tensorflow-wavenet.

thomasmurphycodes avatar thomasmurphycodes commented on August 10, 2024

Possibly a memory overhead issue? Or is it converging?

On Oct 14, 2016, at 11:24, Alex Beloi [email protected] wrote:

@Zeta36, @ibab Apologies for the delays, the local/global conditioning has been taking a bit longer than expected.

I can push my progress my fork by tonight. What I have right now runs, though for some reason training stalls at exactly iteration 116 (i.e. the process will not continue to the next iteration, despite default num_steps = 4000)

One of the main time sinks is that it takes a long time to train and then generate wav files to check if the conditioning is doing anything at all. No real way around that.



from tensorflow-wavenet.

alexbeloi avatar alexbeloi commented on August 10, 2024

I figured out the issue with that; it was related to the file reader and queue. I created a second queue for the text files and was dequeuing text/audio together, but they became mismatched over time because of the audio slicing.

from tensorflow-wavenet.

Zeta36 avatar Zeta36 commented on August 10, 2024

@alexbeloi you are doing a great job!!

I have replicated my text WaveNet implementation (#117) but using your model modifications for global and local conditioning. After training the model on texts in Spanish and English (ID = 1 for the Spanish texts and ID = 2 for the English ones), I could later generate text in either language independently by setting the parameter --speaker_id to 1 or 2!!

This means that your global conditioning is working perfectly!!

Keep working on it!!

I would like to mention one thing about your code. In the AudioReader, when we iterate after reading the audio and cut the audio into buffers of self.sample_size, the ID and the text sometimes start to mix badly.

Imagine, for example, that we read from a folder with 5 wav files and that load_vctk_audio() returns a tuple with the raw audio data, the ID of the speaker, and the plain text. If we set self.sample_size to None then everything works fine, because we feed sample_placeholder, id_placeholder and text_placeholder correctly (the whole raw audio is fed into the sample holder at once). But, and this is important, if we set a sample_size, then the audio gets cut and the id and the text in some cases start to mix badly, and the placeholders start to be fed incorrectly: for example, a sample holder ends up fed with raw data from two different wav files while the ID and text are reported incorrectly.

I had this problem with my text experiment, where at some moments the sample holder contained both Spanish and English text at the same time.

from tensorflow-wavenet.

alexbeloi avatar alexbeloi commented on August 10, 2024

@Zeta36 Ah, I see now.

If we don't want to drop the tail piece of audio, we can pad it with silence, queue it, and have the buffer cleared as you suggest. Or the choice between dropping and padding could be determined by whether silence_threshold is set or not.
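A sketch of the padding option, with illustrative names (not the actual AudioReader code):

```python
import numpy as np

def pad_tail(piece, sample_size):
    """Pad the last slice of an utterance with silence up to sample_size,
    so no audio is dropped and the buffer can be cleared between files."""
    if len(piece) < sample_size:
        piece = np.pad(piece, (0, sample_size - len(piece)),
                       mode='constant', constant_values=0.0)
    return piece
```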

from tensorflow-wavenet.

sonach avatar sonach commented on August 10, 2024

@alexbeloi
Good job!
(1) I notice your code for using local_condition:
conv_filter = conv_filter + causal_conv(local_condition, weights_lcond_filter, dilation)
I think the local_condition doesn't need dilation; it is just a 1x1 conv, and doesn't need causal_conv, a plain conv1d is OK (see the sketch at the end of this comment). So what is your consideration here?
(2)

The way I've implemented it, the conditioning gets applied to each dilation layer (not just the initial one), it's not clear to me from the paper if that's the intended method.

I think what you've done is the intended method. V_{g,k} * y means that every layer (k is the layer index) has separate weights.
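A sketch of the 1x1-convolution alternative from point (1), with placeholder shapes (the variable names mirror the quoted snippet but are otherwise made up):

```python
import tensorflow as tf

batch, time, cond_channels, dilation_channels = 1, 16000, 64, 32
local_condition = tf.placeholder(tf.float32, [batch, time, cond_channels])
weights_lcond_filter = tf.get_variable(
    'weights_lcond_filter', [1, cond_channels, dilation_channels])

# A 1x1 convolution: no dilation and no causal shift, just a per-timestep
# projection of the (already aligned) condition onto the layer's channels.
cond_contribution = tf.nn.conv1d(local_condition, weights_lcond_filter,
                                 stride=1, padding='SAME')
```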

from tensorflow-wavenet.

thomasmurphycodes avatar thomasmurphycodes commented on August 10, 2024

That's a great idea Chris. I wonder if we could create an expanded multi-speaker set on the VCTK text within this project.

On Mon, Oct 17, 2016 at 2:59 AM, Chris Novello [email protected]
wrote:

Exciting thread here!

fwiw: I re-recorded one of the entries from VCTK in my own voice, and got decent
babble results
https://soundcloud.com/paperkettle/wavenet-babble-test-trained-a-neural-network-to-speak-with-my-voice.
I used the VCTK txt and recording style with the intention of later
training on the full corpus + my voice in the mix. Planning to do more
recordings, and I'd be happy to do them in a way that helps generate data
with linguistic features (by adding markup myself, and/or reading passages
designed with them in mind, etc). I might be able to find some others to
help on this as well. Let me know if any of this would be useful!



from tensorflow-wavenet.

linVdcd avatar linVdcd commented on August 10, 2024

@alexbeloi Hi, I used your code to train VCTK. But when I tried to generate a wav file, I got an error. This is the way I used the generate.py file:
python generate.py --wav_out_path=out.wav --speaker_id=2 --speaker_text='hello world' --samples=16000 --logdir=./logdir/train/2016-10-18T12-35-15 ./logdir/train/2016-10-18T12-35-15/model.ckpt-2000

And I got the error:
Shape must be rank 2 but is rank 3 for 'wavenet_1/dilated_stack/layer0/MatMul_6' (op: 'MatMul') with input shapes: [?,?,32], [32,32].

Did I miss something? Thank you.

from tensorflow-wavenet.

alexbeloi avatar alexbeloi commented on August 10, 2024

@lin5547 Hi, thanks for testing things. You haven't missed anything; the generation part is unfortunately still a work in progress. I'm looking to have things working by the end of the week.

@sonach You're right, the paper says this should be just a 1x1 conv, will make the change.

from tensorflow-wavenet.

sonach avatar sonach commented on August 10, 2024

@alexbeloi

The way I've implemented it, the conditioning gets applied to each dilation layer (not just the initial one), it's not clear to me from the paper if that's the intended method.

I discussed this with an ASR expert. In speaker adaptation applications, the speaker ID vector is applied to every layer instead of the first layer only. So your implementation should be OK :)

from tensorflow-wavenet.

bryandeng avatar bryandeng commented on August 10, 2024

@rockyrmit
If we use linguistic features in HTS label format, Merlin's front-end provides an out-of-the-box solution to the conversion from labels to NumPy feature vectors.

https://github.com/CSTR-Edinburgh/merlin/blob/master/src/frontend/label_normalisation.py#L45

from tensorflow-wavenet.

vasquez75 avatar vasquez75 commented on August 10, 2024

Greetings @alexbeloi and/or @Zeta36!

Can you give me detailed steps for how I can make this thing say actual words? I downloaded the repository to my local machine, created a subfolder named "VCTK-Corpus" in the tensorflow-wavenet-master directory, threw some wav files into the VCTK-Corpus folder I created, and ran python train.py --data_dir=VCTK-Corpus. I am now able to generate the alien sounds when I run python generate.py --samples 16000 model.ckpt-1000, but really I'd like to hear it talk.
Note: I'm not using the real VCTK corpus. I have a bunch of wav files of my own and the text to go with them. Is there a specific set of pre-processing steps that I need to do?
A step-by-step guide would be great. Let me know, thanks!

from tensorflow-wavenet.

NickShahML avatar NickShahML commented on August 10, 2024

I just wanted to comment on this real quick with a few naive ideas.

From a ByteNet perspective, the decoder is conditioned on both the source network's input AND the output of the decoder's previous timestep.

Idea 1

Therefore, one simple strategy I had for local conditioning is just to sum the source network's output and the regular input. This does prevent the network from fully learning their distinct features, though.

Idea 2

Idea two would be to concat the inputs. In this way we would expand the "width" of the image by a factor of 2. However, the same convolutional kernel would be applied to both types of inputs.

Idea 3

Use tf.atrous_conv2d to include a height dimension rather than just width. The height could incorporate multiple signals (not just 2). This in my opinion would be the best option and comes at the cost of doubling parameter sizes.

Thoughts on these?

from tensorflow-wavenet.

jyegerlehner avatar jyegerlehner commented on August 10, 2024

@LeavesBreathe It's frustrating that the ByteNet paper doesn't spell out how the decoder brings the source network's output into the decoder's "residual multiplicative" unit; it just shows a single vector coming in. Or did I just miss it? I guess you didn't see it either, which is why we're contriving our own way to do it.

Your ideas sound plausible to me.

On #3, if I understand you, the "extra" convolution seems redundant. The s vector is already the result of a time convolution, and then the decoder network is doing its own convolutions. I don't imagine it would hurt anything, but it makes the code more complicated.

A simpler thing, and probably my favorite at the moment, is like your idea #2, except couldn't we just concatenate s and t along channels? So in Figure 1, if the s_8 value has m channels and the t_8 value has n channels, then the concatenated result has m+n channels, and that concatenation result is what flows through the res blocks in Figure 3, such that 2d = m+n, where 2d is as labelled in Figure 3.
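In code, that concatenation option would be as simple as (toy shapes, purely for illustration):

```python
import tensorflow as tf

batch, time, m, n = 1, 100, 128, 128
s = tf.placeholder(tf.float32, [batch, time, m])   # source network output
t = tf.placeholder(tf.float32, [batch, time, n])   # target embedding
decoder_input = tf.concat([s, t], axis=-1)         # [batch, time, m + n], i.e. 2d channels
```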

from tensorflow-wavenet.

NickShahML avatar NickShahML commented on August 10, 2024

@jyegerlehner I'm with you -- idea 2 is the simplest to implement that also yields some promise. Code wise, idea 3 would be an entire rewrite as suddenly all tensors would now need to be 4d.

I do believe, if you look at Fig. 3 in the ByteNet paper, they have 2d channels that they then reduce to d with a 1x1 conv. I honestly don't understand why it is done this way -- why don't they just keep the 2d the whole way? To keep the computation cheaper?

To be clear, we would concatenate these two inputs along dimension 2 of the input_batch tensor, correct?

Right now for text, my inputs into the entire wavenet are: [batch_size, timesteps, embedding_size]. If we concatenate the inputs, then we now have: [batch_size, timesteps, embedding_size*2]

I will also work on implementing symmetric padding in the meantime for those interested in the source network. I'm currently working on bytenet here: https://github.com/LeavesBreathe/bytenet_tensorflow

from tensorflow-wavenet.

jyegerlehner avatar jyegerlehner commented on August 10, 2024

@LeavesBreathe

2d that they then use a 1x1 conv to reduce to 1d. I honestly don't understand why it is done this way -- why don't they just keep the 2d the whole way? To keep the computation cheaper?

I bet it's mostly to save memory. Each one of those many sigmoid, tanh and element-wise addition and multiplication ops in the multiplicative residual unit produces another tensor. That's at least umpteen of them. Making them smaller is a big memory savings. Plus, I think residual units are different than non-residual layers or blocks: any lower dimensional bottleneck in a non-residual layer will force you to throw away information if you can't compress it, but you don't have that problem with residual layers, since everything that was there in the input is still there at the output. Anyway, that's my theory.

To be clear, we would concatenate these two inputs along dimension 2 in the input_batch tensor correct?

Yeah that's what I was thinking. But.. I think we missed something. Notice in Figure 2 where they talk about what happens when the source and target streams are of different length. As when the German source sentence ends before the English target. They say they simply don't condition the decoder on the source any more. Which means the source contribution can just go away. So what would we do in our concatenation scheme? Maybe we could just set the missing source contribution to zeros. Or we could go back to your idea 1, and instead of concatenating, sum s and t. That "feels" like it might be better to me, since they would have the same embedding space.. and would make more sense for the source contribution to just go away... maybe?

from tensorflow-wavenet.

NickShahML avatar NickShahML commented on August 10, 2024

They say they simply don't condition the decoder on the source any more. Which means the source contribution can just go away. So what would we do in our concatenation scheme? Maybe we could just set the missing source contribution to zeros. Or we could go back to your idea 1, and instead of concatenating, sum s and t. That "feels" like it might be better to me, since they would have the same embedding space.. and would make more sense for the source contribution to just go away... maybe?

@jyegerlehner I'm pretty confident that in the case where you run out of source input timesteps, they pad the actual source network inputs with zeros. From the paper:

At each step the target network takes as input the corresponding column of the
source representation until the target network produces the end-of-sequence symbol. The
source representation is zero-padded on the fly: if the target network produces symbols
beyond the length of the source sequence, the corresponding conditioning column is set to
zero. In the latter case the predictions of the target network are conditioned on source
and target representations from previous steps. Figure 2 represents the dynamic unfolding
process.

So I don't believe summing is the way to go. Instead, I think the second idea, concatenating them, is the way to approach this.


What you said about the residual units and memory savings makes a lot of sense to me, especially since d was reported to be 892 (not sure why they chose that number). I think I'll build this block on Sunday in my fork.


Also...I'm struggling to convert the causal convolution so that it will accept dilations from both sides just like the source network. Maybe this is simpler than I think, but if you could help with that, this would be useful for those who want to use wavenet as a classifier.

If there was a dilation rate set to 8, then 4 holes would go to the left and 4 would go to the right. This is depicted in figure 1 in the source network. Working on it here:

https://github.com/LeavesBreathe/bytenet_tensorflow/blob/master/bytenet/convolution_ops.py

from tensorflow-wavenet.

jyegerlehner avatar jyegerlehner commented on August 10, 2024

I pretty confident that in the case where you run out of source input timesteps, they pad the actual source network inputs with zeros.

Well sure, zeros. But that merely begs the question: are the zeros being added, or concatenated, to the target embedding?

especially since d was reported to be 892

Thanks. I never noticed that. I must have the reading-comprehension/attention-span of a gnat.

Also...I'm struggling to convert the causal convolution so that it will accept dilations from both sides just like the source network.

I don't see what the problem is. Sure, the decoder/target network is a WaveNetModel. It does causal/masked convolutions. The encoder/source network is not causal. It's just a run-of-the-mill conv net (doing convolutions in time). Well not quite run-of-the-mill. I think you can work out the filter width, stride and dilation from the red part of Figure 1. Or maybe from the words they wrote. I dunno. I tend to look at the pictures. They talk about how the input is n-gram encoding (which sounds complicated) and so on blah blah. I haven't looked into it closely. And there's that whole "sub-batch-normalization" which I would probably skip the first time around because who knows what that means. I haven't found batch-normalization to be very helpful but that could just be me.

if you could help with that

I'm still working on global conditioning for wavenet which we haven't merged yet. I intend to train WaveNet on the full VCTK corpus next. One thing at a time. So I'm afraid in the near term you're going to have to deal with this without me. But if you can wait long enough I might get back to this.

from tensorflow-wavenet.

NickShahML avatar NickShahML commented on August 10, 2024

@jyegerlehner Will respond to all of this on Sunday. Unfortunately, I can't work today -- I will try using the tf.atrous_conv2d for the source network as I believe it is more efficient. With tf.conv1d you're still using tf.conv2d internally.

Also I read this paper probably 10 times so i definitely missed the 892 the first 5 times.

from tensorflow-wavenet.

ibab avatar ibab commented on August 10, 2024

@LeavesBreathe: I might be misunderstanding how you want to use tf.atrous_conv2d, but note that its rate parameter sets the dilation rate for both dimensions at the same time, so it might not give you what you want.

from tensorflow-wavenet.

NickShahML avatar NickShahML commented on August 10, 2024

@ibab and @jyegerlehner I'm back working on this. I wanted to try tf.atrous_conv2d because it would save in computation (and coding). I understand that the rate parameter is used for both height and width. In our case, we are just interested in the width.

However, if we set the height to just 1 (like we are currently doing), wouldn't this op work successfully? It can dilate all it wants on the height dimension, but there is only one row it can receive values from. Perhaps this is a bad strategy to approach this with.

I'll think about this more -- perhaps I need to do the conv1d approach that you guys led with. I'm just confused as to how to make a non-causal yet atrous (i.e. dilated) conv1d in code. The whole point of this would be for the source network. Let me work on this more and get back to you tomorrow.

from tensorflow-wavenet.

Zeta36 avatar Zeta36 commented on August 10, 2024

Hello again, @alexbeloi :).

@jyegerlehner did a lot of work on global conditioning; maybe you can use his work instead of continuing with your (now old) branch.

Regards.

from tensorflow-wavenet.

GuangChen2016 avatar GuangChen2016 commented on August 10, 2024

@alexbeloi @jyegerlehner @Zeta36 @
Hello, guys!
Has any one of you worked out the local conditioning part? I am still working on this, but I don't yet have any idea how to insert the textual information into the network. One problem is that we cut the audio into fixed-size pieces; how can the textual information match up with the fixed pieces?
I am confused by that, but I am quite interested in this. Can you give me some advice or suggestions?

Best Regards.

from tensorflow-wavenet.

Whytehorse avatar Whytehorse commented on August 10, 2024

Input -> HMM/CNN -> Output
Text -> HMM/CNN -> Speech
The training data should be the actual text and phonetic data and wav files and speaker id
From the article
https://deepmind.com/blog/wavenet-generative-model-raw-audio/
"Knowing What to Say

In order to use WaveNet to turn text into speech, we have to tell it what the text is. We do this by transforming the text into a sequence of linguistic and phonetic features (which contain information about the current phoneme, syllable, word, etc.) and by feeding it into WaveNet. This means the network’s predictions are conditioned not only on the previous audio samples, but also on the text we want it to say.

If we train the network without the text sequence, it still generates speech, but now it has to make up what to say. As you can hear from the samples below, this results in a kind of babbling, where real words are interspersed with made-up word-like sounds:"

from tensorflow-wavenet.

ttslr avatar ttslr commented on August 10, 2024

@Whytehorse Have you implemented local conditioning for the WaveNet? Can you share the code?

Thank you very much! Best Regards!

from tensorflow-wavenet.

Whytehorse avatar Whytehorse commented on August 10, 2024

I don't have any code yet since I'm still working on porting TensorFlow to my non-NVIDIA GPU. Anyway, my understanding is that you need to get some pre-made data set which has recorded speech that is tagged in time with text. There already exist such data sets; plus, you can use movies with subtitles and/or use any TTS API.

from tensorflow-wavenet.

JahnaviChowdary avatar JahnaviChowdary commented on August 10, 2024

@alexbeloi
Is the local conditioning for the generation part done?

from tensorflow-wavenet.

AlvinChen13 avatar AlvinChen13 commented on August 10, 2024

Any progress for local conditioning part?

from tensorflow-wavenet.

wilsoncai1992 avatar wilsoncai1992 commented on August 10, 2024

@alexbeloi Any idea how we could verify the local conditioning, and then proceed to completing .wav output?

from tensorflow-wavenet.

rafaelvalle avatar rafaelvalle commented on August 10, 2024

@wilsoncai1992 by verify do you mean checking that it's working appropriately? One could, for example, iteratively slice the audio into N equal sized regions and interpolate between 2 conditions, increasing N at each iteration but keeping the number of conditions constant. I would choose 2 conditions that are supposedly easy to train and perceived as dissimilar, e.g. speech and music or Portuguese and German speech or speech and non-speech.

from tensorflow-wavenet.

Whytehorse avatar Whytehorse commented on August 10, 2024

@beppeben That's similar to the hidden Markov model approach that was taken by the likes of Carnegie Mellon University in CMU-sphinx. It produces speech that sounds like Stephen Hawking. The new approach uses recurrent neural networks (LSTMs). In this approach, you need parameters and hyper-parameters. The hyper-parameters can be anything, but for speech they are likely to be frequency range, pitches, intonations, durations, etc. These are what are actually being trained, but they could be formulated via FFT and other functions to get a 100% accurate speech synthesizer without training. The parameters could be anything, but for speech they are things like text (what to say) and voice (how to say it). We could even extend these models to include many more parameters, like mood, speed, etc.

from tensorflow-wavenet.

beppeben avatar beppeben commented on August 10, 2024

@rafaelvalle Thanks for pointing that out. char2wav seems to use a type of local conditioning that was first introduced in a paper by Graves in the context of handwriting generation. I agree that the same attention-based local conditioning (working directly on the text input, with no separate feature extraction, trained jointly with the main net) could be used for the wavenet.

I was surprised to notice that the conditioning works quite similarly to what I proposed above, being based on a mixture of gaussians for optimally weighting the part of text to focus on at each point in the sample.

In my proposal, every character (or group of 3 characters) has its own gaussian weight, while in Graves' paper the weight of each character is determined by the sum of K gaussian functions (which are the same for all the characters).

Also, in my proposal the gaussians are defined and centered on the sample space, while in Graves' they live on the character space. I imagined their means to represent the location of each character (or phoneme) on the wave, while in Graves' their means \kappa have a less clear interpretation. Graves' approach is most probably more convenient than mine, though, since the number of gaussians K is fixed and does not depend on the number of characters.

Implementing that on the wavenet would require deciding which variables to use to determine the conditioning weights. Graves uses the first layer of the same RNN that generates the handwriting. char2wav probably does something similar, since the wave generation is also based on an RNN. Wavenet could use some channels of the dilation layers for that purpose, even though I wonder if they contain enough memory for the task.

from tensorflow-wavenet.

beppeben avatar beppeben commented on August 10, 2024

I implemented a Graves-style local conditioning in my own fork, by introducing another wavenet at the bottom of the main one, which computes the character attention weights to be later fed into the main net.

I couldn't get any satisfactory results yet; it doesn't seem easy to learn the correct text/wave alignment without a previous segmentation step (as they do in Deep Voice, for example).

But it's also true that I only have an old CPU so I had to greatly reduce the size of the net/sampling rate to be able to get any training done at all, so maybe some better results could be achieved with some extra computing power.

from tensorflow-wavenet.

dp-aixball avatar dp-aixball commented on August 10, 2024

@beppeben
Any samples? Thanks!

from tensorflow-wavenet.

matanox avatar matanox commented on August 10, 2024

@beppeben I think, unfortunately, the original WaveNet article indeed discloses little of the alignment method. It seems to say not much more than "External models predicting log F0 values and phone durations from linguistic features were also trained for each language."

It also seems to imply, in Table 2 of the article, that without these alignments their own MOS results were not impressive compared to the legacy methods. I find that a little troubling in terms of scientific disclosure. The Baidu Deep Voice papers do include some guidance, but this still looks to me like one of the hardest parts of the architecture to reproduce for a given input dataset!

In stark contrast, the Tacotron paper/architecture eliminates the need for this data preparation step, or at least it doesn't require phoneme-level alignment between the (audio, text) pairs as part of its input.

from tensorflow-wavenet.

potrepka avatar potrepka commented on August 10, 2024

Hey, just an idea - since I'm just starting to get into all this (and may actually be thinking of doing a PhD now!) - it seems to me that pitch detection in music would be a much easier route if you're looking to generate a corpus to test local conditioning.

from tensorflow-wavenet.

kastnerkyle avatar kastnerkyle commented on August 10, 2024

For interested parties: Merlin (https://github.com/CSTR-Edinburgh/merlin) has a tutorial on how to extract both text features and vocoder features from audio, as well as some chunks doing HMM for text feature -> audio feature alignment. Alternatively, a weaker version of the WaveNet pipeline could take the per timestep recognition information out of Kaldi, maybe using something like Gentle (https://github.com/lowerquality/gentle). I have experimented with these some for "in-the-wild" data. They also talk some about the LSTM setups in the appendix of the WaveNet paper, and DeepVoice has a similar setup which could be helpful reading.

from tensorflow-wavenet.

aleixcm avatar aleixcm commented on August 10, 2024

Hi @rafaelvalle, is this version available somewhere?
Best.

from tensorflow-wavenet.

rafaelvalle avatar rafaelvalle commented on August 10, 2024

@aleixcm NVIDIA has recently released alien CUDA code with faster than real-time inference for wavenet conditioned on mel spectrograms. https://github.com/NVIDIA/nv-wavenet/

from tensorflow-wavenet.

candlewill avatar candlewill commented on August 10, 2024

This is a very long thread now, and it has not been discussed for nearly a year. Here is some of my understanding; I hope someone can point out my mistakes.

It is not well explained how to implement the upsampling (deconvolution or repeat) needed to add linguistic local-conditioning features. Since the local conditioning should be upsampled to the same time-series length as the audio, we need to know the audio length before upsampling. During training the audio length is known, but it is not known at prediction time, which makes upsampling impossible. One way around this is an attention mechanism, but that is not mentioned in the paper. Another way is to use other local-conditioning features (e.g., mel features in r9y9's implementation), where each frame corresponds to a fixed number of samples.
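A small sketch of why frame-based features sidestep the length problem (assumed hop size, toy data): the upsample factor is fixed by the hop, so it is known at generation time as well.

```python
import numpy as np

hop_length = 256                      # assumed STFT hop, in samples
mel = np.random.randn(80, 500)        # [mel_bins, frames], toy data

# Each frame covers exactly hop_length samples, so repeating by the hop
# gives features aligned sample-by-sample with the generated audio.
upsampled = np.repeat(mel, hop_length, axis=1)   # [80, 500 * 256]
```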

from tensorflow-wavenet.

SatyamKumarr avatar SatyamKumarr commented on August 10, 2024

@alexbeloi Great work on implementing local conditioning. You have carried out work on text-to-speech synthesis by providing text/speech pairs for local conditioning. I want to know how local conditioning is carried out for voice conversion. Any idea on how to update the parameters in your code in order to pass acoustic features?

@Zeta36 @alexbeloi Is it necessary to use an LSTM or an RNN with GRUs to preserve the sequence?
#117 (comment)

from tensorflow-wavenet.
