
Efficient Transformers with Dynamic Token Pooling


Environment | Data | Training | Repository | Issues | Cite

Paper: Efficient Transformers with Dynamic Token Pooling

Environment:

conda create -n dynamic-pooling python=3.8
conda activate dynamic-pooling
pip install -r requirements.txt
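
To verify the environment is usable before training, a minimal sanity check (assuming requirements.txt installs PyTorch, which the codebase depends on) could look like:

# sanity_check.py -- hypothetical helper, not part of the repository
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())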

Data:

  • Download & preprocess
    • text8
      • bash scripts/get_text8.sh
    • wiki40b
      • bash scripts/get_wiki40b.sh $lang
      • where $lang is for example vi
      • see Link for the abbreviations of the other languages
      • The script first downloads wiki40b under ./data/wiki40b/$lang/, and then applies our cleaners on top of it, based on the text8 cleaning rules. The final training data sits under ./data/wiki40b/$lang/text8. On some systems, errors may occur when downloading wiki40b via the datasets library; in that case, once you manage to get the data, simply apply our cleaners to it.
  • Train Unigram
    • python tokenizer_data/train_tokenizer.py $vocab_size $dataset
    • $vocab_size is the integer target vocab size of the Unigram model
    • $dataset is text8 for text8, and wiki40b/$lang/text8 for wiki40b (a rough sketch of this step follows the list)
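
For intuition, here is a minimal sketch of what the Unigram training step amounts to, written directly against the HuggingFace tokenizers library; the vocab size and file paths below are illustrative assumptions, and ./tokenizer_data/train_tokenizer.py remains the source of truth:

# train_unigram_sketch.py -- hypothetical, not the repository's script
from tokenizers import Tokenizer, models, trainers

vocab_size = 8192                 # stand-in for $vocab_size
dataset = "data/text8/text8"      # stand-in for $dataset

tokenizer = Tokenizer(models.Unigram())
trainer = trainers.UnigramTrainer(
    vocab_size=vocab_size,
    special_tokens=["<unk>"],
    unk_token="<unk>",
)
tokenizer.train([dataset], trainer)
tokenizer.save(f"tokenizer_data/unigram-{vocab_size}.json")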

Training:

  • Training by default starts with a simple test that checks the autoregressive property of the model (a sketch of this check follows the run command below). We support gradient accumulation, distributed training, and half-precision training.

  • To run training use:

C=configs/whitespaces.yaml GPUS= bash scripts/run_exp.sh
- C -> defines the path to the config
- GPUS -> defines the number of GPUs for a distributed run; when not given, training runs on a single GPU/CPU
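
The autoregressive check mentioned above can be sketched as follows; this is an illustrative reconstruction of the idea rather than the repository's actual test, and the model interface, shapes, and tolerance are assumptions. The invariant: in a causal model, perturbing tokens at positions >= t must not change the logits at positions < t.

# autoregressive_check_sketch.py -- hypothetical, illustrative only
import torch

def check_autoregressive(model, vocab_size=256, seq_len=32, t=16):
    # model: any callable mapping (batch, seq) token ids
    # to (batch, seq, vocab) next-token logits.
    model.eval()
    x = torch.randint(vocab_size, (1, seq_len))
    y = x.clone()
    y[:, t:] = torch.randint(vocab_size, (1, seq_len - t))  # perturb the "future"
    with torch.no_grad():
        logits_x, logits_y = model(x), model(y)
    # Prefix logits may depend only on prefix tokens.
    assert torch.allclose(logits_x[:, :t], logits_y[:, :t], atol=1e-5), \
        "information leaked from future tokens"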

Repository:

This repository is a fork of: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/Transformer-XL

We decided to fork from the Nvidia implementation of Transformer XL because Transformer XL is a strong and established baseline in language modelling, and the Nvidia code is well-optimised for current hardware.

  • ./configs/
    • we've prepared configs for all models presented in our work, i.e., Vanilla, Fixed, Entropy, Unigram, Whitespaces, Gumbel
  • ./tokenizer_data/
    • Pretrained tokenizers, built with the HuggingFace/SentencePiece libraries, for all datasets we've tested in the paper. You can train them yourself by running:
      • python ./tokenizer_data/train_tokenizer.py $ARGS
      • the args are defined in ./tokenizer_data/train_tokenizer.py
  • ./cleaners/
    • Implementation of the preprocessing rules applied to the raw wiki40b and cc-100 datasets
  • Boundary Predictor:
    • {Vanilla, Fixed, Whitespaces}
      • These approaches do not need a boundary predictor. Boundaries are extracted from the data itself in boundary_creator.py and then used in the DataLoader.
    • {Unigram}
      • Segmentation based on Unigram needs a Boundary Predictor, because Unigram itself is not autoregressive. We train the Boundary Predictor module defined in hourglass.py to predict the Unigram segmentation; the Boundary Predictor is autoregressive, which makes the whole model autoregressive as well. The Unigram segmentation itself is extracted in boundary_creator.py.
    • {Entropy, Gumbel}
      • These approaches are end-to-end and use the main model to train the Boundary Predictor. The entire logic is implemented in hourglass.py (a rough sketch of the idea follows this list).
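
For intuition about the end-to-end variants, here is a hedged sketch of a Gumbel-sigmoid boundary predictor with a straight-through estimator; the module name, shapes, and temperature are illustrative assumptions, and hourglass.py is the authoritative implementation:

# boundary_predictor_sketch.py -- hypothetical, illustrative only
import torch

class BoundaryPredictorSketch(torch.nn.Module):
    def __init__(self, dim, temperature=0.5):
        super().__init__()
        self.proj = torch.nn.Linear(dim, 1)
        self.temperature = temperature

    def forward(self, hidden):                    # hidden: (batch, seq, dim)
        logits = self.proj(hidden).squeeze(-1)    # per-position boundary logits
        # Gumbel-sigmoid: add Logistic(0, 1) noise, then relax with a sigmoid.
        u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)  # avoid log(0)
        noise = torch.log(u) - torch.log1p(-u)
        soft = torch.sigmoid((logits + noise) / self.temperature)
        hard = (soft > 0.5).float()
        # Straight-through: hard 0/1 boundaries forward, soft gradients backward.
        return hard + soft - soft.detach()

The hard 0/1 decisions determine which positions start a pooled segment, while gradients flow through the relaxed probabilities; lowering the temperature pushes the relaxation closer to discrete decisions.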

Issues:

In case of any questions or problems with the codebase, feel free to raise a GitHub issue or contact me directly at: [email protected]

Cite:

@misc{nawrot2022dynamic,
      title={Efficient Transformers with Dynamic Token Pooling},
      author={Piotr Nawrot and Jan Chorowski and Adrian Łańcucki and Edoardo M. Ponti},
      year={2022},
      eprint={2211.09761},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
