nlp's Issues

5. GPT-1 implementation issues

Pre-processing

1. Generate fine-tuning dataset

I need a wrapper for torchtext that inserts the sos, eos, and delimiter tokens.

1) How can I add the delimiter token, and which class (or function or module) should be responsible for it?

I'm not sure whether this is the best approach or the de facto standard in NLP, but concatenating the texts with the special tokens (including the delimiter token) seems much easier to implement. Therefore, the preprocessing that inserts the special tokens (e.g. $) and concatenates the hypothesis and premise should be implemented outside of the model, as sketched below.
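
As a rough illustration of what "outside of the model" could look like, here is a plain-Python sketch (not a torchtext wrapper, and the special-token strings are placeholders rather than values fixed by the paper):

def build_entailment_input(premise, hypothesis, sos="<s>", eos="<e>", delim="$"):
    """Sketch: concatenate the tokenized premise and hypothesis with start,
    delimiter, and end tokens before the example ever reaches the model."""
    return [sos] + premise + [delim] + hypothesis + [eos]

# Example: tokenized premise/hypothesis in, one token sequence out.
tokens = build_entailment_input(["a", "man", "is", "running"], ["a", "person", "moves"])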

2) Is the total sequence length 512 or 512*3 for the Q.A. task?

3) Should I add pad_idx to each of the document, the question, and the set of possible answers separately, or only after concatenating them?

Language modeling

1. Is the embedding weight tied to the last linear layer's weight throughout training, or do they only share the initialized weight at the beginning of training?

The Hugging Face code initializes the linear layer's weight from the embedding layer's weight. At first glance it looks as if the linear and embedding layers share the weight only at initialization, but not during training; however, the assignment below stores the same Parameter object in both modules, so the weights stay tied throughout training.

import torch.nn as nn

class LMHead(nn.Module):
    """ Language Model Head for the transformer """

    def __init__(self, model, cfg, trunc_and_reshape=True):
        super(LMHead, self).__init__()
        self.n_embd = cfg.n_embd
        embed_shape = model.embed.weight.shape  # (vocab_size, n_embd)
        # Projects hidden states back onto the vocabulary, without a bias term
        self.decoder = nn.Linear(embed_shape[1], embed_shape[0], bias=False)
        self.decoder.weight = model.embed.weight  # Tied weights: the same Parameter as the embedding
        self.trunc_and_reshape = trunc_and_reshape  # XD

I don't think the original paper shows this explicitly, and I'm not sure why the code is implemented this way.
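
A quick way to check the tying behaviour outside the repo (a standalone sketch, not Hugging Face code): because the assignment shares a single Parameter, a gradient update applied through the decoder also moves the embedding.

import torch
import torch.nn as nn

embed = nn.Embedding(10, 4)
decoder = nn.Linear(4, 10, bias=False)
decoder.weight = embed.weight               # same Parameter object, not a copy
assert decoder.weight is embed.weight       # one tensor shared by two modules

logits = decoder(embed(torch.tensor([[1, 2, 3]])))    # (1, 3, 10)
logits.sum().backward()
with torch.no_grad():
    decoder.weight -= 0.1 * decoder.weight.grad       # manual SGD step on the decoder
print(torch.equal(decoder.weight, embed.weight))      # True: the embedding moved too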

2. Why doesn't the last linear layer have a bias?

According to the original paper, this linear layer corresponds to the embedding matrix W_e. An embedding layer doesn't include a bias term, so the last linear layer shouldn't have one either.

The LMHead code quoted above (the last layer of Hugging Face's GPT-1 language model) shows exactly this: the last linear layer (W_e) is created with bias=False.

Fine-tuning

1. How can I train the model with L_3(x)? What is the target?

2. Why do I need LMHead for a fine-tuning task?

Because the loss on the downstream task is L_3(x) = L_2(x) + \lambda L_1(x), i.e., the supervised fine-tuning loss plus the language-modeling loss as an auxiliary objective. Therefore, we need both the LMHead and the task-specific head to compute the losses during fine-tuning, for example as follows.
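
A minimal sketch of how the two terms could be combined (the function and argument names are mine, not from the repo):

import torch.nn.functional as F

def finetune_loss(lm_logits, clf_logits, input_ids, labels, lam=0.5):
    """L_3 = L_2 (task loss) + lam * L_1 (auxiliary LM loss).

    lm_logits:  (batch, seq_len, vocab) from the LMHead
    clf_logits: (batch, n_classes)      from the task-specific head
    input_ids:  (batch, seq_len) token ids; the LM target is the next token
    labels:     (batch,) class labels for the supervised task
    """
    # L_1: predict token t+1 from positions up to t (shift logits/targets by one)
    lm_loss = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        input_ids[:, 1:].reshape(-1))
    # L_2: ordinary classification loss on the task labels
    clf_loss = F.cross_entropy(clf_logits, labels)
    return clf_loss + lam * lm_loss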

3. What is the shape of all possible answers $a_k$ in a Q.A. task?

According to the original paper, the document, the question, and each possible answer are concatenated:

For these tasks, we are given a context document z, a question q, and a set of possible answers {a_k}. We concatenate the document context and question with each possible answer, adding a delimiter token in between to get [z; q; $; a_k].

So what is the shape of the input for the Q.A. fine-tuning task? Does it mean that all the a_k are flattened (i.e., concatenated into a single sequence)? See the sketch below for my current reading.
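
If I read the paper correctly, each [z; q; $; a_k] is built as a separate sequence and the K candidate sequences are processed independently, rather than flattened into one input. A sketch of that reading (the helper and special-token names are hypothetical):

def build_qa_inputs(document, question, answers, sos="<s>", eos="<e>", delim="$"):
    """Sketch: one sequence [z; q; $; a_k] per candidate answer."""
    return [[sos] + document + question + [delim] + answer + [eos]
            for answer in answers]

# Example: two candidate answers give two independent sequences.
seqs = build_qa_inputs(["the", "cat", "sat", "."], ["where", "did", "it", "sit", "?"],
                       [["on", "the", "mat"], ["on", "the", "sofa"]])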

6. BERT Paper Issue

0. Why does BERT use an encoder-only architecture?

Why not just use both of them?

The inputs to the encoder and the decoder are the same when you train on an LM objective with an unsupervised dataset. Therefore, the encoder and the decoder of the transformer would learn redundant information from the dataset. See this for more details.

Is BERT's strong performance due to MLM, NSP, or the architecture itself (i.e., the encoder)?

Why didn't prior work try bidirectional LMs, or did they try and fail to get good results?

What are the differences between the effects of the encoder and the decoder architectures?

Encoders are closer to the AE ((denoising) autoencoder) approach, while decoders are closer to AR (autoregressive) approaches.

1. Introduction

Why does the Q.A. task belong to the token-level tasks rather than to paraphrasing (i.e., the sentence-level tasks)?

Because the Q.A. task is to predict, for each token, whether it is the start/end token of the answer span, as sketched below.
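
A minimal sketch of that token-level view (not BERT's actual code; the shapes and names are illustrative): a span head scores every token as a possible start or end of the answer.

import torch
import torch.nn as nn

batch, seq_len, hidden_size = 2, 128, 768
hidden_states = torch.randn(batch, seq_len, hidden_size)   # encoder output

qa_head = nn.Linear(hidden_size, 2)                # one start + one end logit per token
start_logits, end_logits = qa_head(hidden_states).split(1, dim=-1)
start_logits = start_logits.squeeze(-1)            # (batch, seq_len)
end_logits = end_logits.squeeze(-1)                # (batch, seq_len)

start_idx = start_logits.argmax(dim=-1)            # predicted answer start per example
end_idx = end_logits.argmax(dim=-1)                # predicted answer end per example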

3. BERT

What does "arbitrary span of contiguous text" mean? Why do I need it? Can we just take (a) linguistic sentence(s) as an input?

Why doesn't BERT always replace "masked" words with the actual [MASK] token?

The original paper says:

Although this allows us to obtain a bidirectional pre-trained model, a downside is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. To mitigate this, we do not always replace “masked” words with the actual [MASK] token.

If the model were trained only to reconstruct the original input at the [MASK] positions, it would have nothing useful to do at fine-tuning time, where no [MASK] tokens appear: it would have learned only to turn [MASK] tokens into proper words rather than to build representations of the whole sentence and its context. Therefore, "masked" words are not always replaced with the actual [MASK] token, so that the model learns the language itself (i.e., the context).
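
Concretely, the paper replaces a selected token with [MASK] 80% of the time, with a random token 10% of the time, and leaves it unchanged 10% of the time. A minimal sketch of that procedure (the function name and the ignore-index convention are mine, not from the paper):

import random

def mask_tokens(token_ids, vocab_size, mask_id, mask_prob=0.15):
    """BERT-style masking sketch: ~15% of tokens are selected as prediction
    targets; of those, 80% become [MASK], 10% become a random token, and 10%
    stay unchanged. The model must predict the original token at every
    selected position."""
    inputs, targets = list(token_ids), [-100] * len(token_ids)  # -100 = ignore
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            targets[i] = tok                              # predict the original token
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_id                       # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)  # 10%: random token
            # else: 10%: keep the original token unchanged
    return inputs, targets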
