positional encoding about deepvoice3_pytorch HOT 12 CLOSED

r9y9 commented on May 27, 2024

positional encoding

from deepvoice3_pytorch.

Comments (12)

r9y9 commented on May 27, 2024

Sorry, I'm not sure what you mean by repeating twice. In the paper, the encoding is

sin(position_rate * pos / np.power(10000, i / d_pos_vec))  (for even i)
cos(position_rate * pos / np.power(10000, i / d_pos_vec))  (for odd i)

The strategy in the code is to compute position_rate * pos / np.power(10000, 2 * (i // 2) / d_pos_vec) (eq. 1) for each i and pos, and then to apply sin and cos to slices with stride 2:

position_enc[1:, 0::2] = torch.sin(position_enc[1:, 0::2])  # dim 2i
position_enc[1:, 1::2] = torch.cos(position_enc[1:, 1::2])  # dim 2i+1
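
For reference, the strategy described above can be sketched end to end in NumPy. This is only an illustration under assumptions: the function name `sinusoid_table` and the argument `n_position` are hypothetical, and NumPy stands in for the torch calls quoted above.

```python
import numpy as np

def sinusoid_table(n_position, d_pos_vec, position_rate=1.0):
    """Build a table where dims 2i and 2i+1 share one frequency (eq. 1)."""
    pos = np.arange(n_position, dtype=np.float64)[:, None]  # shape (n_position, 1)
    i = np.arange(d_pos_vec)[None, :]                        # shape (1, d_pos_vec)
    # eq. 1: the exponent uses 2 * (i // 2), so each even/odd pair gets the same angle
    table = position_rate * pos / np.power(10000.0, 2 * (i // 2) / d_pos_vec)
    table[1:, 0::2] = np.sin(table[1:, 0::2])  # dim 2i
    table[1:, 1::2] = np.cos(table[1:, 1::2])  # dim 2i+1
    return table  # row 0 (pos = 0) is skipped by the slices and stays all zeros

table = sinusoid_table(n_position=4, d_pos_vec=256)
print(table[1, :4])
```

Row 0 is left untouched by the `[1:]` slices, mirroring the two quoted lines.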

taras-sereda commented on May 27, 2024

The strategy is clear.
The even values in your code are correct, but the odd values are wrong:

import numpy as np

d_pos_vec = 256
position_rate = 1.0
pos = 1
positions = [position_rate * pos / np.power(10000, 2 * (i//2) / d_pos_vec) for i in range(d_pos_vec)]
positions[0::2] = np.sin(positions[0::2])
positions[1::2] = np.cos(positions[1::2])
print(positions[:4])
[0.8414709848078965, 0.54030230586813977, 0.80196179521478528, 0.59737532508120794]

Approach from the paper:

d_pos_vec = 256
position_rate = 1.0
pos = 1
print(np.sin(position_rate * pos / np.power(10000, 0 / d_pos_vec)))
print(np.cos(position_rate * pos / np.power(10000, 1 / d_pos_vec)))
print(np.sin(position_rate * pos / np.power(10000, 2 / d_pos_vec)))
print(np.cos(position_rate * pos / np.power(10000, 3 / d_pos_vec)))
0.841470984808
0.569695008693
0.801961795215
0.623420035442

The results should be the same. Do you agree?
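
The two computations above can also be compared over all dimensions at once. A minimal sketch, assuming NumPy; the helper `apply_sin_cos` and the names `angle_shared` and `angle_own` are made up for this illustration and do not come from the repository.

```python
import numpy as np

d_pos_vec = 256
position_rate = 1.0
pos = 1
i = np.arange(d_pos_vec)

# Variant from the code: dims 2i and 2i+1 share the exponent 2 * (i // 2) / d_pos_vec.
angle_shared = position_rate * pos / np.power(10000.0, 2 * (i // 2) / d_pos_vec)
# Variant from the formula as quoted in this thread: each dim uses its own exponent i / d_pos_vec.
angle_own = position_rate * pos / np.power(10000.0, i / d_pos_vec)

def apply_sin_cos(angle):
    """Apply sin to even dims and cos to odd dims (hypothetical helper)."""
    out = np.empty_like(angle)
    out[0::2] = np.sin(angle[0::2])
    out[1::2] = np.cos(angle[1::2])
    return out

enc_shared = apply_sin_cos(angle_shared)
enc_own = apply_sin_cos(angle_own)

print(np.allclose(enc_shared[0::2], enc_own[0::2]))  # even dims
print(np.allclose(enc_shared[1::2], enc_own[1::2]))  # odd dims
```

The even dimensions coincide because `2 * (i // 2)` equals `i` whenever `i` is even; only the odd dimensions use a different exponent.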

r9y9 commented on May 27, 2024

I see, you are right. Thank you for catching this. The implementation was actually adapted from https://github.com/jadore801120/attention-is-all-you-need-pytorch. I'm not sure the difference affects speech quality.

taras-sereda commented on May 27, 2024

@r9y9 You are welcome.
I'm not sure either whether it affects the speech quality. I just wanted to clarify.
By the way, have you tried the self-attention idea from attention-is-all-you-need-pytorch?
It's on my list to try it in the encoder part of deepvoice3.

r9y9 commented on May 27, 2024

I haven't tried it yet. If you get an impressive result with it, that would be great!

taras-sereda commented on May 27, 2024

fixed #20

tuan3w commented on May 27, 2024

I think the original implementation is right (see paper). i is the dimension index, not the position of the word. PE(pos, 2i) and PE(pos, 2i+1) are the basis of the space (as in an FFT). This allows the model to learn to attend by relative position: See

r9y9 commented on May 27, 2024

@tuan3w Thank you for the explanation. For your information, assuming the DeepVoice3 paper is correct, @taras-sereda is right. However, as you pointed out, it makes more sense to design the positional encoding so that PE(pos+k, 2i) can be represented as a linear combination of PE(pos, 2i) and PE(pos, 2i+1) for any fixed k. I understand this allows the model to learn to attend by relative position.
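
The linear-combination property can be checked numerically. The sketch below assumes the shared-frequency form, PE(pos, 2i) = sin(w_i * pos) and PE(pos, 2i+1) = cos(w_i * pos) with w_i = position_rate / 10000^(2i / d_pos_vec). The offset `k = 5` and the helper names `pe_even` and `pe_odd` are arbitrary choices for illustration.

```python
import numpy as np

d_pos_vec = 256
position_rate = 1.0
k = 5  # fixed offset, arbitrary choice for this illustration
positions = np.arange(1, 50, dtype=np.float64)

# One frequency per (2i, 2i+1) pair.
two_i = 2 * np.arange(d_pos_vec // 2)
omega = position_rate / np.power(10000.0, two_i / d_pos_vec)  # shape (128,)

def pe_even(p):
    """PE(pos, 2i) for every position in p, shape (len(p), 128)."""
    return np.sin(np.outer(p, omega))

def pe_odd(p):
    """PE(pos, 2i+1) for every position in p, shape (len(p), 128)."""
    return np.cos(np.outer(p, omega))

# Angle-addition identity: sin(w(pos + k)) = cos(wk) sin(w pos) + sin(wk) cos(w pos).
# The coefficients cos(wk) and sin(wk) depend on k and i only, not on pos.
lhs = pe_even(positions + k)
rhs = np.cos(omega * k) * pe_even(positions) + np.sin(omega * k) * pe_odd(positions)
print(np.allclose(lhs, rhs))
```

The identity relies on both members of a pair using the same frequency. If dim 2i+1 had its own exponent, cos(w' * pos) with w' different from w_i would not combine with sin(w_i * pos) in this way.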

@taras-sereda, what do you think?

taras-sereda commented on May 27, 2024

@r9y9 The justification provided by @tuan3w is convincing in favour of the approach described in the Attention Is All You Need paper, but the positional encoding described in the DeepVoice3 paper is clearly different. I'm not sure it makes much difference which one is used, considering that learnable positional encoding (which assumes no analytical representation) gives similar results; see https://arxiv.org/pdf/1706.03762.pdf, Table 3, row (E).

r9y9 commented on May 27, 2024

@taras-sereda While I don't fully understand why DeepVoice3 uses a slightly different version of positional encoding, personally, either is fine if it actually works. As you may have noticed, the code in the repository is not trying to replicate DeepVoice3 exactly, but to build a good TTS system based on ideas from DeepVoice3. I have been using https://arxiv.org/abs/1706.03762 style positional encoding (with position rate) so far and have gotten reasonable results. So my question is: did you get a reasonable result with the modification?

r9y9 commented on May 27, 2024

Reverted aeed225 for now. Happy to reapply if this change actually works well.

stale commented on May 27, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
