transformer.ipynb : MultiHeadAttention parameter p is never used (nyu-dlsp20, CLOSED)

reachtarunhere commented on June 10, 2024


from nyu-dlsp20.

Comments (4)

Atcold commented on June 10, 2024

Definitely the second version.

A few more edits, since you're at it.
The forward should have two inputs, x and ξ.
x is one input while ξ is the secondary one.

d' should be the dimensionality of q and k, and d'' the dimensionality of v.
The matrices should have height h * d.

Currently we have that

self.d_k = d_model // self.num_heads

but it should be the other way around, with d' and d'' of our choice, and then d_total = h*d.
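The suggestion above could be sketched roughly as follows. This is only an illustration, not the notebook's code: the class name GeneralMultiHeadAttention and the parameter names d_x, d_xi, d_qk and d_v are hypothetical, and it assumes that queries come from x while keys and values come from ξ, and that the output is projected back to the dimensionality of x.

```python
import math

import torch
from torch import nn


class GeneralMultiHeadAttention(nn.Module):
    """Hypothetical sketch: per-head sizes d_qk (d') and d_v (d'') are chosen freely."""

    def __init__(self, d_x, d_xi, num_heads, d_qk, d_v, p=0.1):
        super().__init__()
        self.h, self.d_qk, self.d_v = num_heads, d_qk, d_v
        # nn.Linear stores weights as (out_features, in_features),
        # so each projection matrix has h * d rows: d_total = h * d.
        self.W_q = nn.Linear(d_x, num_heads * d_qk, bias=False)
        self.W_k = nn.Linear(d_xi, num_heads * d_qk, bias=False)
        self.W_v = nn.Linear(d_xi, num_heads * d_v, bias=False)
        # Assumption: the output is mapped back to the size of x.
        self.W_h = nn.Linear(num_heads * d_v, d_x)
        self.dropout = nn.Dropout(p)

    def _split(self, t, d):
        # (bs, length, h * d) -> (bs, h, length, d)
        bs, length, _ = t.shape
        return t.view(bs, length, self.h, d).transpose(1, 2)

    def forward(self, x, xi):
        # Assumption: x produces the queries, xi the keys and values.
        Q = self._split(self.W_q(x), self.d_qk)
        K = self._split(self.W_k(xi), self.d_qk)
        V = self._split(self.W_v(xi), self.d_v)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_qk)
        A = torch.softmax(scores, dim=-1)      # (bs, h, q_length, k_length)
        H = self.dropout(A) @ V                # (bs, h, q_length, d_v)
        bs, _, q_length, _ = H.shape
        H = H.transpose(1, 2).reshape(bs, q_length, self.h * self.d_v)
        return self.W_h(H), A


attn = GeneralMultiHeadAttention(d_x=32, d_xi=48, num_heads=4, d_qk=8, d_v=16)
attn.eval()                   # disable dropout for a deterministic check
x = torch.randn(2, 5, 32)     # primary input
xi = torch.randn(2, 7, 48)    # secondary input, different length and width
out, A = attn(x, xi)
print(out.shape, A.shape)
```

With this layout nothing has to divide evenly: d_qk and d_v are picked first, and the projection widths follow from them.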

Question: did you watch my lecture about this notebook?


reachtarunhere commented on June 10, 2024

Thanks for the pointers. In my personal implementation I made a more general version which does not require dimension match in inputs so that the values can have different dims. I'll refactor it and send the PR.

PS: Yes I did watch the lecture about the notebook. Really enjoyed it :)


Atcold commented on June 10, 2024

Yup, must have been a leftover dropout.
Feel free to send a PR that removes it and closes this issue.


reachtarunhere commented on June 10, 2024

In the __init__ p is defined but never used. I assume I can remove the dropout, OR:

I looked at some other implementations and they seem to use the dropout. Most of them apply this dropout on the computed A matrix. So we can keep the dropout to be consistent with other implementations. I modified the code as below (Note the lines with lots of ######################)

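import numpy as np
import torch
from torch import nn

# Assumption: the notebook aliases nn.Softmax under this name
nn_Softargmax = nn.Softmax

class MultiHeadAttention(nn.Module):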
    def __init__(self, d_model, num_heads, p, d_input=None):
        super().__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        # Tarun: handling for single vs multiple heads?
        if d_input is None:
            d_xq = d_xk = d_xv = d_model
        else:
            d_xq, d_xk, d_xv = d_input
            
        # Make sure that the embedding dimension of model is a multiple of number of heads
        assert d_model % self.num_heads == 0

        self.d_k = d_model // self.num_heads
        
        # These are still of dimension d_model. They will be split into number of heads 
        self.W_q = nn.Linear(d_xq, d_model, bias=False)
        self.W_k = nn.Linear(d_xk, d_model, bias=False)
        self.W_v = nn.Linear(d_xv, d_model, bias=False)
        
        # Outputs of all sub-layers need to be of dimension d_model
        self.W_h = nn.Linear(d_model, d_model)
        
        self.dropout = nn.Dropout(p) ######################### NEW ##################################
        
    def scaled_dot_product_attention(self, Q, K, V):
        batch_size = Q.size(0) 
        k_length = K.size(-2) 
        
        # Scaling by sqrt(d_k) so that the soft(arg)max doesn't saturate
        Q = Q / np.sqrt(self.d_k)                         # (bs, n_heads, q_length, dim_per_head)
        scores = torch.matmul(Q, K.transpose(2,3))          # (bs, n_heads, q_length, k_length)
        
        A = nn_Softargmax(dim=-1)(scores)   # (bs, n_heads, q_length, k_length)
        
        # Get the weighted average of the values
        ######################### NEW dropout before matmul ##################################
        H = torch.matmul(self.dropout(A), V)     # (bs, n_heads, q_length, dim_per_head)

        return H, A

I then ran the training and evaluation again with the default dropout of 0.1, as in the code, and I consistently get about 1% higher accuracy over many runs.
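For anyone wondering what dropout on the A matrix actually does, here is a small standalone check. It is a sketch with made-up tensor sizes, separate from the notebook. In training mode nn.Dropout zeroes each attention weight with probability p and rescales the survivors by 1 / (1 - p), so rows of A generally no longer sum to one. In eval mode it is the identity.

```python
import torch
from torch import nn

torch.manual_seed(0)
p = 0.1
dropout = nn.Dropout(p)

# Illustrative sizes only: (bs, n_heads, q_length, k_length)
scores = torch.randn(1, 2, 4, 4)
A = torch.softmax(scores, dim=-1)

# Training mode: some weights are zeroed, the rest are scaled by 1 / (1 - p).
dropout.train()
A_train = dropout(A)
kept = A_train != 0
print(torch.allclose(A_train[kept], A[kept] / (1 - p)))

# Eval mode: dropout is a no-op, so the attention matrix is unchanged.
dropout.eval()
A_eval = dropout(A)
print(torch.equal(A_eval, A))
```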

So which PR would you prefer: remove it, or add it in the above fashion?
