
LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training

📢 A SMALLER AFFORDABLE MoE MODEL FOR EVERYONE!!

🎉 Introduction

LLaMA-MoE is a series of open-source Mixture-of-Experts (MoE) models based on LLaMA and SlimPajama. We build LLaMA-MoE in two steps:

  1. Partition LLaMA's FFNs into sparse experts and insert a top-K gate for each layer of experts (a minimal sketch follows the figure below).
  2. Continually pre-train the initialized MoE model with optimized data sampling weights from Sheared LLaMA and filtered datasets from SlimPajama.

[Figure: MoE routing in LLaMA-MoE]
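
For step 1 above, the following is a minimal sketch in plain PyTorch (class and parameter names are hypothetical) of what splitting an FFN into experts and routing through a top-K gate looks like; it is an illustration, not the repository's implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitFFNMoE(nn.Module):
    """Sketch: split a simplified (non-gated) FFN of width `intermediate_size`
    into `num_experts` equal neuron groups and activate only `top_k` per token."""

    def __init__(self, hidden_size: int, intermediate_size: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        assert intermediate_size % num_experts == 0
        expert_size = intermediate_size // num_experts
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, expert_size, bias=False),
                nn.SiLU(),
                nn.Linear(expert_size, hidden_size, bias=False),
            )
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_size)
        probs = F.softmax(self.gate(x), dim=-1)            # routing probabilities per expert
        weights, indices = probs.topk(self.top_k, dim=-1)  # keep top-K experts per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e                  # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

# Example usage (weights are not renormalized over the top-K, for brevity):
# moe = SplitFFNMoE(hidden_size=4096, intermediate_size=11008, num_experts=8, top_k=2)
# y = moe(torch.randn(5, 4096))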

🔥 Features

  1. Lightweight Models: The number of activated model parameters is only 3.0~3.5B, making them friendly for deployment and research use.
  2. Multiple Expert Construction Methods:
    1. Neuron-Independent: Random, Clustering, Co-activation Graph, Gradient (Zhang et al., 2022, Zuo et al., 2022)
    2. Neuron-Sharing: Inner, Inter (residual)
  3. Multiple MoE Gating Strategies:
    1. TopK Noisy Gate (Shazeer et al., 2017)
    2. Switch Gating (Fedus et al., 2022)
  4. Fast Continual Pre-training:
    1. FlashAttention-v2 integrated (Dao, 2023)
    2. Fast streaming dataset loading
  5. Abundant Monitoring Items (a sketch of the balance terms follows this list):
    1. Gate load, gate importance
    2. Loss on steps, loss on tokens, balance loss
    3. TGS (tokens/GPU/second), MFU (model FLOPs utilization)
    4. Other visualization utilities
  6. Dynamic Weight Sampling:
    1. Self-defined static sampling weights
    2. Sheared LLaMA's dynamic batch loading (Xia et al., 2023)
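
The gate-load, gate-importance, and balance-loss items above can be illustrated with the CV²-style auxiliary terms from Shazeer et al. (2017); this is a hedged sketch, and the exact form used in this codebase may differ:

import torch

def balance_terms(gate_probs: torch.Tensor, expert_indices: torch.Tensor, num_experts: int):
    """gate_probs: (num_tokens, num_experts) softmax router outputs.
    expert_indices: (num_tokens, top_k) chosen expert ids."""
    importance = gate_probs.sum(dim=0)  # probability mass assigned to each expert
    load = torch.bincount(expert_indices.flatten(), minlength=num_experts).float()  # tokens per expert

    def cv_squared(x: torch.Tensor, eps: float = 1e-10) -> torch.Tensor:
        # squared coefficient of variation: small when experts are evenly used
        return x.float().var() / (x.float().mean() ** 2 + eps)

    balance_loss = cv_squared(importance) + cv_squared(load)
    return importance, load, balance_loss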

🚀 QuickStart

# python>=3.10

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = "llama-moe/LLaMA-MoE-v1-3_5B-2_8"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True)
model.eval()
model.to("cuda:0")

input_text = "Suzhou is famous of"
inputs = tokenizer(input_text, return_tensors="pt")
inputs = inputs.to("cuda:0")

pred = model.generate(**inputs, max_length=50, temperature=0.0)  # greedy decoding (do_sample defaults to False)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# Suzhou is famous of its beautiful gardens. The most famous one is the Humble Administrator's Garden. It is a classical Chinese garden with a history of more than 600 years. The garden is divided into three

⚙️ Installation

  1. Prepare the conda environment: conda create -n smoe python=3.11 (if your environment name is not smoe, you may need to change the environment name in the launch scripts)
  2. Add the correct environment variables to ~/.bashrc (gcc is set to a newer version for installing flash-attn), e.g.:
    export PATH=/mnt/petrelfs/share/cuda-11.8/bin:$PATH
    export LD_LIBRARY_PATH=/mnt/petrelfs/share/cuda-11.8/lib64:$LD_LIBRARY_PATH
    export PATH=/mnt/petrelfs/share/gcc-10.1.0/bin:$PATH
    export LD_LIBRARY_PATH=/mnt/petrelfs/share/gcc-10.1.0/lib64:$LD_LIBRARY_PATH
  3. Apply the variables: source ~/.bashrc
  4. Install PyTorch (CUDA-11.8): pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  5. Install dependencies: pip install -r requirements.txt
  6. Install flash-attn: pip install flash-attn==2.0.1 --no-build-isolation. You may need to follow the flash-attn installation instructions to avoid some errors.
  7. Install the latest Git: conda install git
  8. Clone the repo: git clone git@github.com:pjlab-sys4nlp/llama-moe.git (if you haven't set up an SSH key for GitHub, you may not be able to clone via SSH; see the GitHub docs on SSH keys)
  9. Change current directory: cd llama-moe
  10. Install smoe in editable mode: pip install -e .[dev]
  11. Setup pre-commit hooks: pre-commit install

📊 Model Performance

| Model | #Activated Experts | #Experts | #Activated Params | Links |
| :--- | :---: | :---: | :---: | :---: |
| LLaMA-MoE-3.0B | 2 | 16 | 3.0B | [🤗 HF Weights] |
| LLaMA-MoE-3.5B (4/16) | 4 | 16 | 3.5B | [🤗 HF Weights] |
| LLaMA-MoE-3.5B (2/8) | 2 | 8 | 3.5B | [🤗 HF Weights] |

| Model | Average | SciQ | PIQA | WinoGrande | ARC-e | ARC-c (25) | HellaSwag (10) | LogiQA | BoolQ (32) | LAMBADA | NQ (32) | MMLU (5) |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| OPT-2.7B | 50.3 | 78.9 | 74.8 | 60.8 | 54.4 | 34.0 | 61.4 | 25.8 | 63.3 | 63.6 | 10.7 | 25.8 |
| Pythia-2.8B | 51.5 | 83.2 | 73.6 | 59.6 | 58.8 | 36.7 | 60.7 | 28.1 | 65.9 | 64.6 | 8.7 | 26.8 |
| INCITE-BASE-3B | 53.7 | 85.6 | 73.9 | 63.5 | 61.7 | 40.3 | 64.7 | 27.5 | 65.8 | 65.4 | 15.2 | 27.2 |
| Open-LLaMA-3B-v2 | 55.6 | 88.0 | 77.9 | 63.1 | 63.3 | 40.1 | 71.4 | 28.1 | 69.2 | 67.4 | 16.0 | 26.8 |
| Sheared-LLaMA-2.7B | 56.4 | 87.5 | 76.9 | 65.0 | 63.3 | 41.6 | 71.0 | 28.3 | 73.6 | 68.3 | 17.6 | 27.3 |
| LLaMA-MoE-3.0B | 55.5 | 84.2 | 77.5 | 63.6 | 60.2 | 40.9 | 70.8 | 30.6 | 71.9 | 66.6 | 17.0 | 26.8 |
| LLaMA-MoE-3.5B (4/16) | 57.7 | 87.6 | 77.9 | 65.5 | 65.6 | 44.2 | 73.3 | 29.7 | 75.0 | 69.5 | 20.3 | 26.8 |
| LLaMA-MoE-3.5B (2/8) | 57.6 | 88.4 | 77.6 | 66.7 | 65.3 | 43.1 | 73.3 | 29.6 | 73.9 | 69.4 | 19.8 | 27.0 |

Numbers in parentheses after a task name denote the number of few-shot examples; the remaining tasks are evaluated zero-shot.

🚧 Expert Construction

  • Neuron-Independent
    • IndependentRandom: bash ./scripts/expert_construction/split/run_split_random.sh
    • IndependentClustering: bash ./scripts/expert_construction/split/run_split_clustering.sh
  • Neuron-Sharing
    • SharingInner: bash ./scripts/expert_construction/split/run_split_gradient.sh
    • SharingInter: bash ./scripts/expert_construction/split/run_split_gradient_residual.sh

For more information, please refer to Expert Construction docs.
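
For intuition, the neuron-independent Random split boils down to partitioning a layer's FFN intermediate-neuron indices into equal-sized, disjoint sets. A hedged stand-alone sketch (not the script's actual code):

import torch

def random_neuron_split(intermediate_size: int, num_experts: int, seed: int = 0):
    """Randomly partition FFN intermediate-neuron indices into equal expert index sets."""
    generator = torch.Generator().manual_seed(seed)
    permutation = torch.randperm(intermediate_size, generator=generator)
    return permutation.chunk(num_experts)  # tuple of index tensors, one per expert

# e.g. split a LLaMA-7B-sized FFN (intermediate_size=11008) into 8 experts of 1376 neurons each
expert_neuron_indices = random_neuron_split(11008, 8)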

🚅 Continual Pre-training

Tokenization

Download SlimPajama into /path_to_data and put data from different domains into separate folders:

  • /path_to_data/en_arxiv
  • /path_to_data/en_book
  • /path_to_data/en_c4
  • /path_to_data/en_cc
  • /path_to_data/en_stack
  • /path_to_data/en_wikipedia
  • /path_to_data/github

Each file name should end with .jsonl, and each line should look like:

{"id": "id-info", "content": "raw text to be tokenized"}

Run the following command to tokenize the data in each folder:

python -m smoe.utils.tokenize \
  -f jsonl \
  -t /path_to_tokenizer \
  -i /path_to_data/en_arxiv \
  -o /path_to_data_tokenized/en_arxiv
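
To process all domains in one pass, a small wrapper (not part of the repository) can call the same module once per folder:

import subprocess

domains = ["en_arxiv", "en_book", "en_c4", "en_cc", "en_stack", "en_wikipedia", "github"]
for domain in domains:
    subprocess.run(
        ["python", "-m", "smoe.utils.tokenize",
         "-f", "jsonl",
         "-t", "/path_to_tokenizer",
         "-i", f"/path_to_data/{domain}",
         "-o", f"/path_to_data_tokenized/{domain}"],
        check=True,  # stop on the first failure
    )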

Continual Pre-training (CPT)

  • NOTICE: Please create the logs/ folder manually: mkdir -p logs
  • To run the continual pre-training, please check the CPT docs.

💎 Evaluation
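
The scores in the table above come from standard benchmarks, so they can be reproduced with a common harness. Below is a minimal sketch assuming a recent EleutherAI lm-evaluation-harness (v0.4+); task names, few-shot counts, and model arguments may need adjusting to match the settings used for the table:

from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    # trust_remote_code is needed for the custom LLaMA-MoE model class
    model_args="pretrained=llama-moe/LLaMA-MoE-v1-3_5B-2_8,dtype=bfloat16,trust_remote_code=True",
    tasks=["sciq", "piqa", "winogrande", "arc_easy"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])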

📑 Citation

@article{llama-moe-2023,
  title={LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training},
  author={LLaMA-MoE Team},
  journal={arXiv preprint arXiv:},
  url={https://arxiv.org/abs/},
  year={2023}
}

LLaMA-MoE Team w/ ❤️
