
LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training

📢 A SMALLER AFFORDABLE MoE MODEL FOR EVERYONE!!

🎉 Introduction

LLaMA-MoE is a series of open-source Mixture-of-Experts (MoE) models based on LLaMA and SlimPajama. We build LLaMA-MoE in two steps:

  1. Partition LLaMA's FFNs into sparse experts and insert a top-K gate for each layer of experts (a minimal sketch follows the figure below).
  2. Continually pre-train the initialized MoE model with optimized data sampling weights from Sheared LLaMA and filtered datasets from SlimPajama.

[Figure: MoE routing in LLaMA-MoE]
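
For step 1 above, the following is a minimal sketch in plain PyTorch (class and parameter names are hypothetical) of what splitting an FFN into experts and routing through a top-K gate looks like; it is an illustration, not the repository's implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitFFNMoE(nn.Module):
    """Sketch: split a simplified (non-gated) FFN of width `intermediate_size`
    into `num_experts` equal neuron groups and activate only `top_k` per token."""

    def __init__(self, hidden_size: int, intermediate_size: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        assert intermediate_size % num_experts == 0
        expert_size = intermediate_size // num_experts
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, expert_size, bias=False),
                nn.SiLU(),
                nn.Linear(expert_size, hidden_size, bias=False),
            )
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_size)
        probs = F.softmax(self.gate(x), dim=-1)            # routing probabilities per expert
        weights, indices = probs.topk(self.top_k, dim=-1)  # keep top-K experts per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e                  # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

# Example usage (weights are not renormalized over the top-K, for brevity):
# moe = SplitFFNMoE(hidden_size=4096, intermediate_size=11008, num_experts=8, top_k=2)
# y = moe(torch.randn(5, 4096))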

🔥 Features

  1. Lightweight Models: The number of activated model parameters is only 3.0~3.5B, making them friendly for deployment and research use.
  2. Multiple Expert Construction Methods:
    1. Neuron-Independent: Random, Clustering, Co-activation Graph, Gradient (Zhang et al., 2022, Zuo et al., 2022)
    2. Neuron-Sharing: Inner, Inter (residual)
  3. Multiple MoE Gating Strategies:
    1. TopK Noisy Gate (Shazeer et al., 2017)
    2. Switch Gating (Fedus et al., 2022)
  4. Fast Continual Pre-training:
    1. FlashAttention-v2 integrated (Dao, 2023)
    2. Fast streaming dataset loading
  5. Abundant Monitoring Items (a sketch of the balance terms follows this list):
    1. Gate load, gate importance
    2. Loss on steps, loss on tokens, balance loss
    3. TGS (tokens/GPU/second), MFU (model FLOPs utilization)
    4. Other visualization utilities
  6. Dynamic Weight Sampling:
    1. Self-defined static sampling weights
    2. Sheared LLaMA's dynamic batch loading (Xia et al., 2023)
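
The gate-load, gate-importance, and balance-loss items above can be illustrated with the CV²-style auxiliary terms from Shazeer et al. (2017); this is a hedged sketch, and the exact form used in this codebase may differ:

import torch

def balance_terms(gate_probs: torch.Tensor, expert_indices: torch.Tensor, num_experts: int):
    """gate_probs: (num_tokens, num_experts) softmax router outputs.
    expert_indices: (num_tokens, top_k) chosen expert ids."""
    importance = gate_probs.sum(dim=0)  # probability mass assigned to each expert
    load = torch.bincount(expert_indices.flatten(), minlength=num_experts).float()  # tokens per expert

    def cv_squared(x: torch.Tensor, eps: float = 1e-10) -> torch.Tensor:
        # squared coefficient of variation: small when experts are evenly used
        return x.float().var() / (x.float().mean() ** 2 + eps)

    balance_loss = cv_squared(importance) + cv_squared(load)
    return importance, load, balance_loss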

🚀 QuickStart

# python>=3.10

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = "llama-moe/LLaMA-MoE-v1-3_5B-2_8"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True)
model.eval()
model.to("cuda:0")

input_text = "Suzhou is famous of"
inputs = tokenizer(input_text, return_tensors="pt")
inputs = inputs.to("cuda:0")

pred = model.generate(**inputs, max_length=50, temperature=0.0)  # greedy decoding (do_sample defaults to False)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# Suzhou is famous of its beautiful gardens. The most famous one is the Humble Administrator's Garden. It is a classical Chinese garden with a history of more than 600 years. The garden is divided into three

⚙️ Installation

  1. Prepare the conda environment: conda create -n smoe python=3.11 (if your environment name is not smoe, you may need to change the environment name in the launch scripts)
  2. Add the correct environment variables to ~/.bashrc (gcc is set to a newer version for installing flash-attn), e.g.:
    export PATH=/mnt/petrelfs/share/cuda-11.8/bin:$PATH
    export LD_LIBRARY_PATH=/mnt/petrelfs/share/cuda-11.8/lib64:$LD_LIBRARY_PATH
    export PATH=/mnt/petrelfs/share/gcc-10.1.0/bin:$PATH
    export LD_LIBRARY_PATH=/mnt/petrelfs/share/gcc-10.1.0/lib64:$LD_LIBRARY_PATH
  3. Apply the variables: source ~/.bashrc
  4. Install PyTorch (CUDA-11.8): pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  5. Install dependencies: pip install -r requirements.txt
  6. Install flash-attn: pip install flash-attn==2.0.1 --no-build-isolation. You may need to follow the flash-attn installation instructions to avoid some errors.
  7. Install the latest Git: conda install git
  8. Clone the repo: git clone git@github.com:pjlab-sys4nlp/llama-moe.git (if you haven't set up an SSH key for GitHub, you may not be able to clone via SSH; see the GitHub docs on SSH keys)
  9. Change current directory: cd llama-moe
  10. Install smoe in editable mode: pip install -e .[dev]
  11. Setup pre-commit hooks: pre-commit install

📊 Model Performance

| Model | #Activated Experts | #Experts | #Activated Params | Links |
| :--- | :---: | :---: | :---: | :---: |
| LLaMA-MoE-3.0B | 2 | 16 | 3.0B | [🤗 HF Weights] |
| LLaMA-MoE-3.5B (4/16) | 4 | 16 | 3.5B | [🤗 HF Weights] |
| LLaMA-MoE-3.5B (2/8) | 2 | 8 | 3.5B | [🤗 HF Weights] |

| Model | Average | SciQ | PIQA | WinoGrande | ARC-e | ARC-c (25) | HellaSwag (10) | LogiQA | BoolQ (32) | LAMBADA | NQ (32) | MMLU (5) |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| OPT-2.7B | 50.3 | 78.9 | 74.8 | 60.8 | 54.4 | 34.0 | 61.4 | 25.8 | 63.3 | 63.6 | 10.7 | 25.8 |
| Pythia-2.8B | 51.5 | 83.2 | 73.6 | 59.6 | 58.8 | 36.7 | 60.7 | 28.1 | 65.9 | 64.6 | 8.7 | 26.8 |
| INCITE-BASE-3B | 53.7 | 85.6 | 73.9 | 63.5 | 61.7 | 40.3 | 64.7 | 27.5 | 65.8 | 65.4 | 15.2 | 27.2 |
| Open-LLaMA-3B-v2 | 55.6 | 88.0 | 77.9 | 63.1 | 63.3 | 40.1 | 71.4 | 28.1 | 69.2 | 67.4 | 16.0 | 26.8 |
| Sheared-LLaMA-2.7B | 56.4 | 87.5 | 76.9 | 65.0 | 63.3 | 41.6 | 71.0 | 28.3 | 73.6 | 68.3 | 17.6 | 27.3 |
| LLaMA-MoE-3.0B | 55.5 | 84.2 | 77.5 | 63.6 | 60.2 | 40.9 | 70.8 | 30.6 | 71.9 | 66.6 | 17.0 | 26.8 |
| LLaMA-MoE-3.5B (4/16) | 57.7 | 87.6 | 77.9 | 65.5 | 65.6 | 44.2 | 73.3 | 29.7 | 75.0 | 69.5 | 20.3 | 26.8 |
| LLaMA-MoE-3.5B (2/8) | 57.6 | 88.4 | 77.6 | 66.7 | 65.3 | 43.1 | 73.3 | 29.6 | 73.9 | 69.4 | 19.8 | 27.0 |

Numbers in parentheses after a task name denote the number of few-shot examples; the remaining tasks are evaluated zero-shot.

🚧 Expert Construction

  • Neuron-Independent
    • IndependentRandom: bash ./scripts/expert_construction/split/run_split_random.sh
    • IndependentClustering: bash ./scripts/expert_construction/split/run_split_clustering.sh
  • Neuron-Sharing
    • SharingInner: bash ./scripts/expert_construction/split/run_split_gradient.sh
    • SharingInter: bash ./scripts/expert_construction/split/run_split_gradient_residual.sh

For more information, please refer to Expert Construction docs.
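
For intuition, the neuron-independent Random split boils down to partitioning a layer's FFN intermediate-neuron indices into equal-sized, disjoint sets. A hedged stand-alone sketch (not the script's actual code):

import torch

def random_neuron_split(intermediate_size: int, num_experts: int, seed: int = 0):
    """Randomly partition FFN intermediate-neuron indices into equal expert index sets."""
    generator = torch.Generator().manual_seed(seed)
    permutation = torch.randperm(intermediate_size, generator=generator)
    return permutation.chunk(num_experts)  # tuple of index tensors, one per expert

# e.g. split a LLaMA-7B-sized FFN (intermediate_size=11008) into 8 experts of 1376 neurons each
expert_neuron_indices = random_neuron_split(11008, 8)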

🚅 Continual Pre-training

Tokenization

Download SlimPajama into /path_to_data and put data from different domains into separate folders:

  • /path_to_data/en_arxiv
  • /path_to_data/en_book
  • /path_to_data/en_c4
  • /path_to_data/en_cc
  • /path_to_data/en_stack
  • /path_to_data/en_wikipedia
  • /path_to_data/github

Each file name should end with .jsonl, and each line should look like:

{"id": "id-info", "content": "raw text to be tokenized"}

Run the following command to tokenize the data in each folder:

python -m smoe.utils.tokenize \
  -f jsonl \
  -t /path_to_tokenizer \
  -i /path_to_data/en_arxiv \
  -o /path_to_data_tokenized/en_arxiv
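
To process all domains in one pass, a small wrapper (not part of the repository) can call the same module once per folder:

import subprocess

domains = ["en_arxiv", "en_book", "en_c4", "en_cc", "en_stack", "en_wikipedia", "github"]
for domain in domains:
    subprocess.run(
        ["python", "-m", "smoe.utils.tokenize",
         "-f", "jsonl",
         "-t", "/path_to_tokenizer",
         "-i", f"/path_to_data/{domain}",
         "-o", f"/path_to_data_tokenized/{domain}"],
        check=True,  # stop on the first failure
    )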

Continual Pre-training (CPT)

  • NOTICE: Please create the logs/ folder manually: mkdir -p logs
  • To run the continual pre-training, please check the CPT docs.

💎 Evaluation
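
The scores in the table above come from standard benchmarks, so they can be reproduced with a common harness. Below is a minimal sketch assuming a recent EleutherAI lm-evaluation-harness (v0.4+); task names, few-shot counts, and model arguments may need adjusting to match the settings used for the table:

from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    # trust_remote_code is needed for the custom LLaMA-MoE model class
    model_args="pretrained=llama-moe/LLaMA-MoE-v1-3_5B-2_8,dtype=bfloat16,trust_remote_code=True",
    tasks=["sciq", "piqa", "winogrande", "arc_easy"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])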

📑 Citation

@article{llama-moe-2023,
  title={LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training},
  author={LLaMA-MoE Team},
  journal={arXiv preprint arXiv:},
  url={https://arxiv.org/abs/},
  year={2023}
}

LLaMA-MoE Team w/ ❤️
