Git Product home page Git Product logo

sentencode's Introduction

SentEncode

Sentence encoder made as part of practical assignment for UvA AI course "Advanced Topics in Computational sementics

Goal

The goal is to train a sentence encoder and classifier on a natural language inference (NLI) task, and then transfer the trained sentence encoder to another task. Performance in the transfer learning tasks can be measured with SentEval.

Structure of the code

train.py

Contains the main code to train the model, using pytorch lightning Trainer. Usage:

python train.py --encoder_type ...

Other command line options give flexibility to specify training parameters, model parameters and location of source files.

files

--data_dir, default="data/"
--dataset_dir, default="snli_1_0/"
--dataset_file, default="snli_1.0"  # train.jsonl, valid.jsonl, test.jsonl will be appended
--vocab_file, default=None
--embeddings_file, default= "glove.840B.300d.txt"
--snli_embeddings, default=None

device options

--device, default="gpu"

train options

--lr, type=float, default=0.1
--lr_limit, type=float, default=1E-5
--weight_decay, type=float, default=0.99
--batch_size, type=int, default=64
--patience, type=int, default=0

model options

--encoder_type, default="mean", choices=["mean", "lstm", "bilstm", "poolbilstm"]
--classifier, default="mlp", choices=["mlp", "linear"]
--embedding_size, type=int, default=300
--hidden_size, type=int, default=2048
--num_layers, type=int, default=64
--aggregate_method, default="max", choices=["max", "avg"]

After training, two model files are saved:

  • encoder_<encoder_type>: contains the state dict of the sentence encoder
  • classifier_<encoder_type>: contains the state dict of the classier These state dicts can be used by test.py to calculate accuracy on the test dataset. The sentence encoder state dict is also used by eval.py, to test performance of the sentence encoder on the SentEval tasks.

test.py

Loads state dicts of sentence encoder and classifier and tests performance of the combination on the SNLI test dataset. Usage:

python test.py --encoder_type ...

Command line options similar to train.py (see above). In addition, there is the following option:

--models_dir, default="models/"

eval.py

Loads state dict of sentence encoder and tests performance of the sentence encoder against the SentEval tasks. Usage:

python eval.py --encoder_type ... --tasks ... 

Command line options are similar to train.py (see above). In addition, there are the following options:

--tasks, nargs="*", type=str, default=['MR', 'CR', 'SUBJ', 'MPQA', 'SST2', 'TREC', 'MRPC', 'SICKEntailment']
--models_dir, default="models/"

snli_lightning.py

A pytorch lightning wrapper for the embedding, the sentence encoder and the classifier. Contains specification of the optimizer, learning rate scheduler, and the train, validation and test steps.

encoder.py

Code to encode two sentences and generate a score that signals if/how the two sentences are related: entailment, neutral, contradiction

models.py

Code that defines the four available options for the sentence encoder:

  • MeanEmbedding: calculate the mean of the token embeddings;
  • UniLSTM: use the last hidden state as sentence representation;
  • BiLSTM: use the concatenation of the hidden state of last token of the forward pass and the hidden state of first token on the backward pass as sentence representation;
  • PoolBiLSTM: concatenates the forward and backward hidden states of each token, and use pooling (max of average) over all tokens to create sentence representation. All modules are initialised using a dict 'opt' with the relevant parameters for the model.

vocab.py

Code for building the vocabulary from source sentences, and for matching of vocabulary with word embeddings

data.py

Code to load the SNLI dataset. Also contains the function 'batchify' that is used by the dataloader to transform a list of examples from the dataset to the batch format that is used by to train/evaluate the model

Interactive testing

A Colab notebook can be used to perform the SentEval tests interactively. The user is asked to specify which model, and which task(s), and then the SentEval check is run with the given model and task(s).

https://colab.research.google.com/drive/1xvCqrutfoLzPWNP7BHkR_gEDcANN0I4c?usp=sharing

Pretrained models

The pretrained sentence encoders and classifiers for each model can be downloaded from:

https://drive.google.com/drive/folders/1vMA9nbBabWRfaf2UY3ef5hnOO8cPo68f?usp=sharing

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.