Comments (4)
Definitely the second version.
A few more edits, since you're at it.
The forward should have two inputs, x and ξ: x is one input while ξ is the secondary one. d' should be the dimensionality of q and k, while d'' the dimensionality of v. The matrices should be h * d tall. Currently we have
self.d_k = d_model // self.num_heads
but it should be the other way around, with d' and d'' of our choice, and then d_total = h * d.
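A minimal sketch of what this could look like, assuming a hypothetical class name MultiHeadAttentionSketch and parameter names d_prime / d_pprime for d' and d'' (not from the notebook, just for illustration):

```python
import torch
import torch.nn as nn

class MultiHeadAttentionSketch(nn.Module):
    """Sketch: d' (q, k) and d'' (v) are chosen freely; d_total = h * d."""
    def __init__(self, d_model, num_heads, d_prime, d_pprime):
        super().__init__()
        self.num_heads = num_heads
        self.d_prime = d_prime    # d': dimensionality of q and k
        self.d_pprime = d_pprime  # d'': dimensionality of v
        # Projection matrices are h * d tall, independent of d_model
        self.W_q = nn.Linear(d_model, num_heads * d_prime, bias=False)
        self.W_k = nn.Linear(d_model, num_heads * d_prime, bias=False)
        self.W_v = nn.Linear(d_model, num_heads * d_pprime, bias=False)
        self.W_h = nn.Linear(num_heads * d_pprime, d_model)

    def forward(self, x, xi):
        # x provides the queries; xi (the secondary input) the keys and values
        bs = x.size(0)
        q = self.W_q(x).view(bs, -1, self.num_heads, self.d_prime).transpose(1, 2)
        k = self.W_k(xi).view(bs, -1, self.num_heads, self.d_prime).transpose(1, 2)
        v = self.W_v(xi).view(bs, -1, self.num_heads, self.d_pprime).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.d_prime ** 0.5
        a = scores.softmax(dim=-1)
        h = (a @ v).transpose(1, 2).reshape(bs, -1, self.num_heads * self.d_pprime)
        return self.W_h(h)
```

With two inputs of different lengths, the output keeps the query length and d_model width.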
Question: did you watch my lecture about this notebook?
from nyu-dlsp20.
Thanks for the pointers. In my personal implementation I made a more general version that does not require the input dimensions to match, so the values can have different dimensions. I'll refactor it and send the PR.
PS: Yes I did watch the lecture about the notebook. Really enjoyed it :)
Yup, must have been a leftover dropout.
Feel free to send a PR that removes it and closes this issue.
I can remove the dropout,
OR
I looked at some other implementations and they seem to use the dropout. Most of them apply it to the computed A matrix, so we could keep the dropout to be consistent with other implementations. I modified the code as below (note the lines marked with #########################):
import numpy as np
import torch
import torch.nn as nn

# nn_Softargmax is the notebook's alias for nn.Softmax
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads, p, d_input=None):
        super().__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        if d_input is None:
            d_xq = d_xk = d_xv = d_model
        else:
            d_xq, d_xk, d_xv = d_input
        # Make sure that the embedding dimension of model is a multiple of number of heads
        assert d_model % self.num_heads == 0
        self.d_k = d_model // self.num_heads
        # These are still of dimension d_model. They will be split into number of heads
        self.W_q = nn.Linear(d_xq, d_model, bias=False)
        self.W_k = nn.Linear(d_xk, d_model, bias=False)
        self.W_v = nn.Linear(d_xv, d_model, bias=False)
        # Outputs of all sub-layers need to be of dimension d_model
        self.W_h = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(p)  ######################### NEW ##################################

    def scaled_dot_product_attention(self, Q, K, V):
        batch_size = Q.size(0)
        k_length = K.size(-2)
        # Scaling by d_k so that the soft(arg)max doesn't saturate
        Q = Q / np.sqrt(self.d_k)                    # (bs, n_heads, q_length, dim_per_head)
        scores = torch.matmul(Q, K.transpose(2, 3))  # (bs, n_heads, q_length, k_length)
        A = nn_Softargmax(dim=-1)(scores)            # (bs, n_heads, q_length, k_length)
        # Get the weighted average of the values
        ######################### NEW: dropout before matmul ##################################
        H = torch.matmul(self.dropout(A), V)         # (bs, n_heads, q_length, dim_per_head)
        return H, A
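As a standalone sanity check of the modification above (hedged sketch with made-up tensor sizes; the dropout is applied to the attention weights A before the weighted average, and is the identity in eval mode):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bs, n_heads, q_len, k_len, d_head = 2, 4, 5, 7, 8
Q = torch.randn(bs, n_heads, q_len, d_head)
K = torch.randn(bs, n_heads, k_len, d_head)
V = torch.randn(bs, n_heads, k_len, d_head)
dropout = nn.Dropout(p=0.1)

scores = (Q / d_head ** 0.5) @ K.transpose(-2, -1)  # (bs, n_heads, q_len, k_len)
A = scores.softmax(dim=-1)
H = dropout(A) @ V                                  # (bs, n_heads, q_len, d_head)
assert H.shape == (bs, n_heads, q_len, d_head)

# In eval mode dropout is the identity, so both variants agree at test time
dropout.eval()
assert torch.allclose(dropout(A) @ V, A @ V)
```

So at evaluation time the dropout variant behaves exactly like the original; the difference only shows up as regularization during training.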
I then ran the training and evaluation again with the default 0.1 dropout as in the code, and I consistently get about 1 percentage point higher accuracy over many runs.
So which PR would you prefer: remove the dropout, or add it in the above fashion?