
mini_gpt's Introduction

Emergent Abilities in Reduced-Scale Generative Language Models

This repository contains code to filter data from existing corpora based on a child vocabulary and to train small language models on the filtered data.

Installation

git clone git@github.com:text-machine-lab/mini_gpt.git
cd mini_gpt
pip install -r requirements.txt

Usage

The tokenizer for filtering is from filter_vocab_cpp.
Compile the C++-based filtering code and copy the resulting object file to the src directory.

The vocabulary used to simplify the pre-training data can be found in data/AOChildes_word_frequency.csv. This vocabulary is based on child-directed speech transcripts that can be found here.
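
As a rough illustration of vocabulary-based filtering, the sketch below keeps a text span only when enough of its words belong to an allowed vocabulary. It is not the repository's C++ filter: the tiny CHILD_VOCAB set, the regex tokenization, the function names, and the min_fraction threshold are all assumptions made for this example.

```python
import re

# Hypothetical stand-in for the words listed in data/AOChildes_word_frequency.csv.
CHILD_VOCAB = {"the", "dog", "ran", "to", "a", "big", "tree", "and", "sat"}

# Assumed tokenization: lowercase runs of letters and apostrophes.
WORD_RE = re.compile(r"[a-z']+")


def in_vocab_fraction(text, vocab):
    """Return the fraction of word tokens in `text` that appear in `vocab`."""
    words = WORD_RE.findall(text.lower())
    if not words:
        return 0.0
    return sum(w in vocab for w in words) / len(words)


def filter_spans(spans, vocab, min_fraction=1.0):
    """Keep only spans whose in-vocabulary fraction is at least `min_fraction`."""
    return [s for s in spans if in_vocab_fraction(s, vocab) >= min_fraction]


spans = ["The dog ran to a big tree.", "Quantum chromodynamics is hard."]
kept = filter_spans(spans, CHILD_VOCAB)
print(kept)
```

Lowering min_fraction would admit spans with a few out-of-vocabulary words; the threshold actually used by the filtering code is not stated in this README.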

Download the SlimPajama dataset using git lfs:

git lfs install
git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B

Chunks 1-10 are downloaded when this command is run.

To gather the unfiltered dataset:

python SlimPajama_unfiltered.py

For vocabulary filtering, run the following command once per chunk:

python SlimPajama_filtering.py --chunk_id 1

Pre-training data:

The pre-training data, which consists of the vocabulary-filtered SlimPajama dataset, can be found here: 22B and 2.1B.

To train the BPE tokenizer:

python create_tokenizer.py --dataset_path ./dataset \
    --vocab_size 15_000 \
    --save_dir ./tokenizer
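
For readers unfamiliar with byte-pair encoding, the toy sketch below shows the core idea of learning merge rules: repeatedly fuse the most frequent adjacent symbol pair. It is a minimal illustration under stated assumptions, not what create_tokenizer.py does internally; the function name learn_bpe, the word list, and the tie-breaking rule are hypothetical.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words."""
    # Represent each distinct word as a tuple of symbols with its frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties go to the lexicographically larger pair.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges


merges = learn_bpe(["low", "low", "lower", "lowest", "new", "newer"], num_merges=3)
print(merges)
```

A production tokenizer applies the same loop at a much larger scale (the command above requests a vocabulary of 15,000 entries) and typically operates on bytes or pre-tokenized text rather than raw characters.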

To pre-train a language model with distributed training:

python -u -m accelerate.commands.launch main.py \
     --lr 2.8e-3 --num_warmup_steps 1000 --num_layers 8 \
     --hidden_size 32 --use_tokenizer filtered \
     --chkpt_dir ../models/SlimPajama_Nov23_context128_vocab_21k/filtered/hidden_32_num_layer_8_int_128 \
     --int_size 128 --rope_theta 20

Notebooks

To create the minigpt dataset, count the number of tokens in the filtered dataset, and analyze the filtered dataset, use the notebook notebooks/2.0-dataset-statistics.ipynb.

To apply position interpolation to pre-trained models, use the notebook notebooks/1.0-rope-pi.ipynb.
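
The idea behind position interpolation for rotary position embeddings (RoPE) can be sketched in a few lines: positions in a longer context are rescaled so they fall inside the range seen during training. The code below is a self-contained illustration, not the notebook's implementation; the function names, the head dimension of 8, and the target context of 256 are assumptions, while the base of 20 and the training context of 128 echo the pre-training command above.

```python
import math


def rope_angles(position, head_dim, theta=10000.0, scale=1.0):
    """Rotation angles for one position; `scale` < 1 applies position interpolation."""
    assert head_dim % 2 == 0, "RoPE pairs up dimensions, so head_dim must be even"
    pos = position * scale
    return [pos * theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def apply_rope(vec, position, theta=10000.0, scale=1.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... of `vec` by the RoPE angles."""
    angles = rope_angles(position, len(vec), theta, scale)
    out = []
    for i, ang in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(ang) - y * math.sin(ang),
                x * math.sin(ang) + y * math.cos(ang)]
    return out


train_ctx, target_ctx = 128, 256   # target_ctx is an assumed value for illustration
scale = train_ctx / target_ctx     # interpolation factor L_train / L_target

# With interpolation, position 200 in the longer context is rotated exactly
# like position 100 was during training.
interpolated = rope_angles(200, head_dim=8, theta=20.0, scale=scale)
original = rope_angles(100, head_dim=8, theta=20.0)
print(interpolated == original)
```

Because every rescaled position stays below train_ctx, the model only sees rotation angles from the range it was trained on, which is the motivation for interpolating rather than extrapolating positions.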

To filter downstream evaluation datasets based on the AO-Childes vocabulary, use the notebook notebooks/4.0-dataset_filtering.ipynb.

To get generations from the pre-trained and baseline models, use the notebook notebooks/3.0-model-generations.ipynb.

Citation

@misc{muckatira2024emergent,
      title={Emergent Abilities in Reduced-Scale Generative Language Models},
      author={Sherin Muckatira and Vijeta Deshpande and Vladislav Lialin and Anna Rumshisky},
      year={2024},
      eprint={2404.02204},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
