This repository contains NLP implementations in PyTorch. It includes:
euhkimnlp's Issues
5. GPT-1 Implementation Issues
Pre-processing
1. Generate fine-tuning dataset
I need a torchtext wrapper that inserts the sos, eos, and delimiter tokens.
1) How can I add the delimiter token, and which class (or function, or module) should be responsible for it?
I'm not sure it's the best approach or the de facto standard in NLP, but I guess it is much easier to implement by concatenating the texts with special tokens, including the delimiter token. Therefore, inserting the special tokens (i.e. $) and concatenating the hypothesis and premise should be implemented outside of the model.
2) Is the total sequence length 512 or 512*3 in the QA task?
3) Should I add pad_idx to each of the document, question, and set of possible answers separately, or after concatenating them?
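One way to resolve questions 1) and 3) is to do the insertion and padding as a plain preprocessing step outside the model, padding after concatenation. A minimal sketch, independent of torchtext; the token ids chosen for `<sos>`, `<eos>`, `$`, and `<pad>` are hypothetical:

```python
# Hypothetical special-token ids; in practice these come from the vocab.
SOS, EOS, DELIM, PAD = 0, 1, 2, 3

def build_entailment_input(premise_ids, hypothesis_ids, max_len=512):
    # [<sos>; premise; $; hypothesis; <eos>], padded AFTER concatenation
    seq = [SOS] + premise_ids + [DELIM] + hypothesis_ids + [EOS]
    seq = seq[:max_len]
    return seq + [PAD] * (max_len - len(seq))
```

Padding once after concatenation keeps a single fixed-length sequence per example, which matches the single-input-sequence format GPT-1 uses for fine-tuning.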
Language modeling
1. Is the embedding weight equivalent to the last linear layer weight, or do they only share the initialized weight at the beginning of training?
The huggingface code initializes the linear layer weight with the embedding layer's. Note that `self.decoder.weight = model.embed.weight` assigns the same `Parameter` object to both layers, so they stay tied throughout training, not only at initialization.
```python
import torch.nn as nn

class LMHead(nn.Module):
    """ Language Model Head for the transformer """

    def __init__(self, model, cfg, trunc_and_reshape=True):
        super(LMHead, self).__init__()
        self.n_embd = cfg.n_embd
        embed_shape = model.embed.weight.shape
        self.decoder = nn.Linear(embed_shape[1], embed_shape[0], bias=False)
        self.decoder.weight = model.embed.weight  # Tied weights
        self.trunc_and_reshape = trunc_and_reshape  # XD
```
I don't think the original paper shows this explicitly, and I'm not sure why the code is implemented this way.
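A small standalone check (not from the repository) that the assignment `decoder.weight = embed.weight` ties the parameters for the whole training run, not just at initialization:

```python
import torch
import torch.nn as nn

embed = nn.Embedding(10, 4)
decoder = nn.Linear(4, 10, bias=False)
decoder.weight = embed.weight  # same Parameter object from now on

# Optimize only the embedding; the decoder changes too, since it is
# literally the same tensor.
opt = torch.optim.SGD(embed.parameters(), lr=0.1)
loss = decoder(embed(torch.tensor([1, 2]))).sum()
loss.backward()
opt.step()
```

After the optimizer step, `decoder.weight is embed.weight` still holds, so gradients from both the embedding and the output projection accumulate into one shared matrix.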
2. Why doesn't the last linear layer have a bias?
According to the original paper, the last linear layer is the embedding matrix W_e itself. An embedding layer has no bias term, so the tied linear layer shouldn't have one either.
The `LMHead` code above (from huggingface) is the last layer of the GPT-1 language model; it constructs the final linear layer (W_e) with `bias=False`, so there is no bias term.
Fine-tuning
1. How can I train the model with L3(x)? What is the target?
2. Why do I need LMHead for a fine-tuning task?
Because the fine-tuning loss is L_3(x) = L_2(x) + \lambda L_1(x), i.e., the sum of the task (fine-tuning) loss L_2 and the auxiliary language-modeling loss L_1. Therefore, we need both the LMHead and the fine-tuning head to compute the losses on the fine-tuning task.
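The combined objective can be sketched as follows; the shapes, vocabulary size, and value of lambda are all hypothetical:

```python
import torch
import torch.nn.functional as F

lam = 0.5  # lambda: weight on the auxiliary LM loss

lm_logits = torch.randn(2, 5, 100)          # (batch, seq, vocab) from LMHead
lm_targets = torch.randint(0, 100, (2, 5))  # next-token targets
clf_logits = torch.randn(2, 3)              # (batch, n_classes) from task head
clf_targets = torch.randint(0, 3, (2,))

L1 = F.cross_entropy(lm_logits.reshape(-1, 100), lm_targets.reshape(-1))
L2 = F.cross_entropy(clf_logits, clf_targets)
L3 = L2 + lam * L1  # both heads are needed to compute this
```

This makes concrete why the LMHead cannot be discarded at fine-tuning time: L_3 cannot be evaluated without L_1.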
3. What is the shape of all possible answers $a_k$ in a Q.A. task?
According to the original paper, all possible answers are concatenated with the context and question:

> For these tasks, we are given a context document z, a question q, and a set of possible answers {a_k}. We concatenate the document context and question with each possible answer, adding a delimiter token in between to get [z; q; $; a_k].

So what is the shape of the input for the QA fine-tuning task? Does it mean that all a_k are flattened (i.e. concatenated)? Since each answer a_k forms its own sequence [z; q; $; a_k], the paper processes each sequence independently and normalizes over the answers with a softmax, so the input is a batch of n_answers sequences rather than one flattened sequence.
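A minimal sketch of that construction, one row per candidate answer; the delimiter and pad ids are hypothetical:

```python
# Hypothetical special-token ids.
DELIM, PAD = 2, 3

def build_qa_inputs(z_ids, q_ids, answers, max_len=512):
    """Build one [z; q; $; a_k] row per candidate answer."""
    rows = []
    for a_k in answers:
        seq = z_ids + q_ids + [DELIM] + a_k
        seq = seq[:max_len]
        rows.append(seq + [PAD] * (max_len - len(seq)))
    return rows  # shape: (n_answers, max_len)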
6. BERT Paper Issue
0. Why does BERT use an encoder-only architecture?
Why not just use both of them?
Inputs for the encoder and the decoder are the same when you train on an LM objective with an unsupervised dataset. Therefore, the encoder and the decoder of the transformer would learn redundant information from the dataset. See this for more details.
Is the powerful performance of BERT due to MLM, NSP, or the architecture itself (i.e. the encoder)?
Why didn't prior works try a bidirectional LM, and why did they fail to get good results?
What are the differences between the effects of the encoder and decoder architectures?
Encoders are more like an AE ((denoising) auto-encoder) approach, while decoders are more like an AR (auto-regressive) approach.
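The AR/AE distinction boils down to the attention mask; a minimal sketch:

```python
import torch

seq_len = 4
# AR decoder: causal (lower-triangular) mask, each token sees only the past.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
# AE-style encoder (e.g. BERT): full bidirectional attention over all tokens.
bidirectional_mask = torch.ones(seq_len, seq_len)
```

Because the bidirectional mask leaks every token to every position, an encoder cannot use the plain next-token LM objective and needs a denoising objective like MLM instead.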
1. Introduction
Why does the QA task belong to token-level tasks rather than paraphrasing (i.e. sentence-level tasks)?
Because the QA task is to predict, for each token, whether it is the start or end token of the answer span.
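A sketch of this token-level formulation; the hidden size and sequence length are hypothetical:

```python
import torch
import torch.nn as nn

hidden = torch.randn(1, 8, 16)  # (batch, seq_len, hidden) from the encoder
qa_head = nn.Linear(16, 2)      # 2 logits per token: start and end
start_logits, end_logits = qa_head(hidden).split(1, dim=-1)

# The predicted span is the argmax over token positions.
start_idx = start_logits.squeeze(-1).argmax(dim=-1)
end_idx = end_logits.squeeze(-1).argmax(dim=-1)
```

The head produces a score per token, not per sentence, which is why extractive QA is classed with the token-level tasks.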
3. BERT
What does "arbitrary span of contiguous text" mean? Why is it needed? Can't we just take (a) linguistic sentence(s) as the input?
Why doesn't BERT always replace “masked” words with the actual [MASK] token?
The original paper says:

> Although this allows us to obtain a bidirectional pre-trained model, a downside is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. To mitigate this, we do not always replace “masked” words with the actual [MASK] token.

If the model were trained to reconstruct the original input only at [MASK] positions, it would learn only to turn [MASK] tokens into plausible words, rather than to represent the whole sentence or understand its context, and [MASK] never appears during fine-tuning. Therefore, “masked” words are not always replaced with the actual [MASK] token, which forces the model to learn a contextual representation for every token.
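Concretely, 15% of positions are selected for prediction; of those, 80% become [MASK], 10% become a random token, and 10% stay unchanged. A minimal sketch; the [MASK] id and the -100 ignore-label convention are assumptions, not BERT's actual code:

```python
import random

MASK_ID = 103  # hypothetical [MASK] id

def mask_tokens(token_ids, vocab_size, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    masked = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 = position not predicted
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok  # predict the original token here
            r = rng.random()
            if r < 0.8:                    # 80%: replace with [MASK]
                masked[i] = MASK_ID
            elif r < 0.9:                  # 10%: replace with a random token
                masked[i] = rng.randrange(vocab_size)
            # remaining 10%: keep the original token unchanged
    return masked, labels
```

Because a selected position may hold [MASK], a random word, or the original word, the model can never tell which tokens are corrupted and must represent all of them well.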