This repository contains NLP implementations in PyTorch. It includes:
euhkimnlp's Issues
5. GPT-1 Implementation Issues
Pre-processing
1. Generate fine-tuning dataset
I need a torchtext wrapper that inserts the sos, eos, and delimiter tokens.
1) How can I add the delimiter token, and which class (or function, or module) should be responsible for it?
I'm not sure it's the best approach or the de facto standard in NLP, but I guess it is much easier to implement by concatenating the texts with special tokens, including the delimiter token. Therefore, inserting the special tokens (i.e. $) and concatenating the hypothesis and premise should be implemented outside of the model.
2) Is the total sequence length 512 or 512*3 in the QA task?
3) Should I add pad_idx to each of the document, question, and set of possible answers separately, or after concatenating them?
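One way to resolve questions 1) and 3) is to do the insertion and padding as a plain preprocessing step outside the model, padding after concatenation. A minimal sketch, independent of torchtext; the token ids chosen for `<sos>`, `<eos>`, `$`, and `<pad>` are hypothetical:

```python
# Hypothetical special-token ids; in practice these come from the vocab.
SOS, EOS, DELIM, PAD = 0, 1, 2, 3

def build_entailment_input(premise_ids, hypothesis_ids, max_len=512):
    # [<sos>; premise; $; hypothesis; <eos>], padded AFTER concatenation
    seq = [SOS] + premise_ids + [DELIM] + hypothesis_ids + [EOS]
    seq = seq[:max_len]
    return seq + [PAD] * (max_len - len(seq))
```

Padding once after concatenation keeps a single fixed-length sequence per example, which matches the single-input-sequence format GPT-1 uses for fine-tuning.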
Language modeling
1. Is the embedding weight equivalent to the last linear layer weight, or do they only share the initialized weight at the beginning of training?
The huggingface code initializes the linear layer weight with the embedding layer's. Note that `self.decoder.weight = model.embed.weight` assigns the same `Parameter` object to both layers, so they stay tied throughout training, not only at initialization.
```python
import torch.nn as nn

class LMHead(nn.Module):
    """ Language Model Head for the transformer """

    def __init__(self, model, cfg, trunc_and_reshape=True):
        super(LMHead, self).__init__()
        self.n_embd = cfg.n_embd
        embed_shape = model.embed.weight.shape
        self.decoder = nn.Linear(embed_shape[1], embed_shape[0], bias=False)
        self.decoder.weight = model.embed.weight  # Tied weights
        self.trunc_and_reshape = trunc_and_reshape  # XD
```
I don't think the original paper shows this explicitly, and I'm not sure why the code is implemented this way.
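A small standalone check (not from the repository) that the assignment `decoder.weight = embed.weight` ties the parameters for the whole training run, not just at initialization:

```python
import torch
import torch.nn as nn

embed = nn.Embedding(10, 4)
decoder = nn.Linear(4, 10, bias=False)
decoder.weight = embed.weight  # same Parameter object from now on

# Optimize only the embedding; the decoder changes too, since it is
# literally the same tensor.
opt = torch.optim.SGD(embed.parameters(), lr=0.1)
loss = decoder(embed(torch.tensor([1, 2]))).sum()
loss.backward()
opt.step()
```

After the optimizer step, `decoder.weight is embed.weight` still holds, so gradients from both the embedding and the output projection accumulate into one shared matrix.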
2. Why doesn't the last linear layer have a bias?
According to the original paper, the last linear layer is the embedding matrix W_e itself. An embedding layer has no bias term, so the tied linear layer shouldn't have one either.
The `LMHead` code above (from huggingface) is the last layer of the GPT-1 language model; it constructs the final linear layer (W_e) with `bias=False`, so there is no bias term.
Fine-tuning
1. How can I train the model with L3(x)? What is the target?
2. Why do I need LMHead for a fine-tuning task?
Because the fine-tuning loss is L_3(x) = L_2(x) + \lambda L_1(x), i.e., the sum of the task (fine-tuning) loss L_2 and the auxiliary language-modeling loss L_1. Therefore, we need both the LMHead and the fine-tuning head to compute the losses on the fine-tuning task.
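The combined objective can be sketched as follows; the shapes, vocabulary size, and value of lambda are all hypothetical:

```python
import torch
import torch.nn.functional as F

lam = 0.5  # lambda: weight on the auxiliary LM loss

lm_logits = torch.randn(2, 5, 100)          # (batch, seq, vocab) from LMHead
lm_targets = torch.randint(0, 100, (2, 5))  # next-token targets
clf_logits = torch.randn(2, 3)              # (batch, n_classes) from task head
clf_targets = torch.randint(0, 3, (2,))

L1 = F.cross_entropy(lm_logits.reshape(-1, 100), lm_targets.reshape(-1))
L2 = F.cross_entropy(clf_logits, clf_targets)
L3 = L2 + lam * L1  # both heads are needed to compute this
```

This makes concrete why the LMHead cannot be discarded at fine-tuning time: L_3 cannot be evaluated without L_1.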
3. What is the shape of all possible answers $a_k$ in a Q.A. task?
According to the original paper, all possible answers are concatenated with the context and question:

> For these tasks, we are given a context document z, a question q, and a set of possible answers {a_k}. We concatenate the document context and question with each possible answer, adding a delimiter token in between to get [z; q; $; a_k].

So what is the shape of the input for the QA fine-tuning task? Does it mean that all a_k are flattened (i.e. concatenated)? Since each answer a_k forms its own sequence [z; q; $; a_k], the paper processes each sequence independently and normalizes over the answers with a softmax, so the input is a batch of n_answers sequences rather than one flattened sequence.
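A minimal sketch of that construction, one row per candidate answer; the delimiter and pad ids are hypothetical:

```python
# Hypothetical special-token ids.
DELIM, PAD = 2, 3

def build_qa_inputs(z_ids, q_ids, answers, max_len=512):
    """Build one [z; q; $; a_k] row per candidate answer."""
    rows = []
    for a_k in answers:
        seq = z_ids + q_ids + [DELIM] + a_k
        seq = seq[:max_len]
        rows.append(seq + [PAD] * (max_len - len(seq)))
    return rows  # shape: (n_answers, max_len)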
6. BERT Paper Issue
0. Why does BERT use an encoder-only architecture?
Why not just use both of them?
Inputs for the encoder and the decoder are the same when you train on an LM objective with an unsupervised dataset. Therefore, the encoder and the decoder of the transformer would learn redundant information from the dataset. See this for more details.
Is the powerful performance of BERT due to MLM, NSP, or the architecture itself (i.e. the encoder)?
Why didn't prior works try a bidirectional LM, and why did they fail to get good results?
What are the differences between the effects of the encoder and decoder architectures?
Encoders are more like an AE ((denoising) auto-encoder) approach, while decoders are more like an AR (auto-regressive) approach.
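The AR/AE distinction boils down to the attention mask; a minimal sketch:

```python
import torch

seq_len = 4
# AR decoder: causal (lower-triangular) mask, each token sees only the past.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
# AE-style encoder (e.g. BERT): full bidirectional attention over all tokens.
bidirectional_mask = torch.ones(seq_len, seq_len)
```

Because the bidirectional mask leaks every token to every position, an encoder cannot use the plain next-token LM objective and needs a denoising objective like MLM instead.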
1. Introduction
Why does the QA task belong to token-level tasks rather than paraphrasing (i.e. sentence-level tasks)?
Because the QA task is to predict, for each token, whether it is the start or end token of the answer span.
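A sketch of this token-level formulation; the hidden size and sequence length are hypothetical:

```python
import torch
import torch.nn as nn

hidden = torch.randn(1, 8, 16)  # (batch, seq_len, hidden) from the encoder
qa_head = nn.Linear(16, 2)      # 2 logits per token: start and end
start_logits, end_logits = qa_head(hidden).split(1, dim=-1)

# The predicted span is the argmax over token positions.
start_idx = start_logits.squeeze(-1).argmax(dim=-1)
end_idx = end_logits.squeeze(-1).argmax(dim=-1)
```

The head produces a score per token, not per sentence, which is why extractive QA is classed with the token-level tasks.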
3. BERT
What does "arbitrary span of contiguous text" mean? Why is it needed? Can't we just take (a) linguistic sentence(s) as the input?
Why doesn't BERT always replace “masked” words with the actual [MASK] token?
The original paper says:

> Although this allows us to obtain a bidirectional pre-trained model, a downside is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. To mitigate this, we do not always replace “masked” words with the actual [MASK] token.

If the model were trained to reconstruct the original input only at [MASK] positions, it would learn only to turn [MASK] tokens into plausible words, rather than to represent the whole sentence or understand its context, and [MASK] never appears during fine-tuning. Therefore, “masked” words are not always replaced with the actual [MASK] token, which forces the model to learn a contextual representation for every token.
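Concretely, 15% of positions are selected for prediction; of those, 80% become [MASK], 10% become a random token, and 10% stay unchanged. A minimal sketch; the [MASK] id and the -100 ignore-label convention are assumptions, not BERT's actual code:

```python
import random

MASK_ID = 103  # hypothetical [MASK] id

def mask_tokens(token_ids, vocab_size, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    masked = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 = position not predicted
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok  # predict the original token here
            r = rng.random()
            if r < 0.8:                    # 80%: replace with [MASK]
                masked[i] = MASK_ID
            elif r < 0.9:                  # 10%: replace with a random token
                masked[i] = rng.randrange(vocab_size)
            # remaining 10%: keep the original token unchanged
    return masked, labels
```

Because a selected position may hold [MASK], a random word, or the original word, the model can never tell which tokens are corrupted and must represent all of them well.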