
BERT-pytorch

PyTorch implementation of BERT in "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (https://arxiv.org/abs/1810.04805)

Requirements

All dependencies can be installed via:

pip install -r requirements.txt

Quickstart

Prepare data

First, you need to prepare your data in an appropriate format. Your corpus is assumed to satisfy the following constraints.

  • Each line is a document.
  • A document consists of sentences, separated by a vertical bar (|).
  • A sentence is assumed to be already tokenized. Tokens are separated by spaces.
  • A sentence has no more than 256 tokens.
  • A document has at least 2 sentences.
  • You have two distinct data files, one for training data and the other for validation data.

This repo comes with example data for pretraining in the data/example directory. Here is the content of the data/example/train.txt file:

One, two, three, four, five,|Once I caught a fish alive,|Six, seven, eight, nine, ten,|Then I let go again.
I'm a little teapot|Short and stout|Here is my handle|Here is my spout.
Jack and Jill went up the hill|To fetch a pail of water.|Jack fell down and broke his crown,|And Jill came tumbling after.  

Also, this repo includes SST-2 data in the data/SST-2 directory for sentiment classification.

Build dictionary

python bert.py preprocess-index data/example/train.txt --dictionary=dictionary.txt

Running the above command produces a dictionary.txt file in your current directory.
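Conceptually, this step collects token frequencies from the training corpus. A rough sketch of that idea follows; the actual format of dictionary.txt produced by bert.py may differ, so treat the output layout here as an assumption:

```python
from collections import Counter

def build_dictionary(corpus_path, dict_path):
    """Count token frequencies in a corpus (documents per line, sentences
    separated by '|', tokens by spaces) and write one 'token<TAB>count'
    entry per line, most frequent first. Output format is an assumption."""
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            for sentence in line.rstrip("\n").split("|"):
                counts.update(sentence.split())
    with open(dict_path, "w", encoding="utf-8") as out:
        for token, count in counts.most_common():
            out.write(f"{token}\t{count}\n")
```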

Pre-train the model

python bert.py pretrain --train_data data/example/train.txt --val_data data/example/val.txt --checkpoint_output model.pth

This step trains the BERT model with an unsupervised objective. This step also:

  • logs the training procedure every epoch
  • outputs model checkpoints periodically
  • reports the best checkpoint based on the validation metric
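The unsupervised objective described in the paper is masked language modeling: roughly 15% of input tokens are selected as prediction targets, and each selected token is replaced with [MASK] 80% of the time, with a random token 10% of the time, and left unchanged 10% of the time. A standalone sketch of that selection rule (illustrative only, not this repo's code):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Apply BERT-style masking. Returns the (possibly) modified token list
    and the indices that were chosen as prediction targets."""
    rng = random.Random(seed)
    masked = list(tokens)
    targets = []
    for i in range(len(tokens)):
        if rng.random() < mask_prob:
            targets.append(i)
            r = rng.random()
            if r < 0.8:
                masked[i] = "[MASK]"          # 80%: replace with mask token
            elif r < 0.9:
                masked[i] = rng.choice(vocab)  # 10%: replace with random token
            # else: 10% keep the original token unchanged
    return masked, targets
```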

Fine-tune the model

You can fine-tune the pretrained BERT model on a downstream task. For example, you can fine-tune your model on the SST-2 sentiment classification task.

python bert.py finetune --pretrained_checkpoint model.pth --train_data data/SST-2/train.tsv --val_data data/SST-2/dev.tsv

This command also logs the procedure, outputs checkpoints, and reports the best checkpoint.
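If you want to inspect the SST-2 data yourself, a minimal reader sketch follows. It assumes the GLUE layout for train.tsv and dev.tsv: tab-separated with a header row containing `sentence` and `label` columns; if this repo's copy uses a different layout, adjust accordingly:

```python
import csv

def read_sst2(path):
    """Read a GLUE-style SST-2 TSV file into (sentence, label) pairs.
    Assumes a header row with 'sentence' and 'label' columns."""
    examples = []
    with open(path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE: sentences may contain quote characters literally
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            examples.append((row["sentence"], int(row["label"])))
    return examples
```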

See also

  • Transformer-pytorch : My own implementation of the Transformer. This BERT implementation is based on that repo.

Author

@dreamgonfly
