
flagscale's Introduction

Introduction

FlagScale is a comprehensive toolkit for large-scale models, developed with the support of the Beijing Academy of Artificial Intelligence (BAAI). It builds upon open-source projects such as Megatron-LM and vLLM.

Our primary objective with FlagScale is to optimize the use of computational resources for large models while maintaining numerical stability and model effectiveness. Currently, FlagScale is in its early stages of development. We are actively collaborating with the community to enhance its capabilities, with the aim of supporting a variety of large models across diverse hardware architectures.

Highlights

FlagScale provides developers with the actual configurations, optimization schemes and hyper-parameter settings used for large model training at BAAI. It also helps developers rapidly establish a basic yet complete LLM pipeline, covering training, fine-tuning, inference and serving. Its main features are:

  • Provide the training schemes of the Aquila models from BAAI, which guarantee training convergence
  • Support model weight conversion to the HuggingFace format and repartitioning of the distributed optimizer
  • Keep timely synchronization with the upstream projects

News and Updates

  • 2024.4.11 🔥 We release the new version (v0.3):

    • Accomplish the heterogeneous hybrid training of the Aquila2-70B-Expr model on a cluster utilizing a combination of NVIDIA and Iluvatar chips.
    • Provide the training of the Aquila2 series across a variety of AI chips from six distinct manufacturers.
  • 2023.11.30 We release the new version (v0.2):

    • Provide the actually used training scheme for Aquila2-70B-Expr, including the parallel strategies, optimizations and hyper-parameter settings.
    • Support heterogeneous training on chips of different generations with the same architecture or compatible architectures, including NVIDIA GPUs and Iluvatar CoreX chips.
    • Support training on Chinese domestic hardware, including Iluvatar CoreX and Baidu KUNLUN chips.
  • 2023.10.11 We release the initial version (v0.1), supporting the Aquila models and providing our actually used training schemes for Aquila2-7B and Aquila2-34B, including the parallel strategies, optimizations and hyper-parameter settings.

Quick Start

We highly recommend that developers follow the Megatron-LM Usage guide. Here we provide instructions for the Aquila LLMs:

Setup

  1. Install the Megatron-LM dependencies as described in the original repository

  2. Install the requirements for FlagScale

git clone git@github.com:baai-opensp/FlagScale.git
cd FlagScale
pip install -r requirements.txt

Pretrain the Aquila model

  1. Start a distributed training job
python run.py --config-path ./examples/aquila/conf --config-name config

FlagScale leverages Hydra for configuration management. The YAML configuration is structured into four key sections:

  • experiment: Defines the experiment directory, backend, and other related environmental configurations.
  • system: Details execution parameters, such as parallel strategies and precision of operations.
  • model: Describes the model's architecture along with its associated hyperparameters.
  • data: Specifies configurations related to the data used by the model.

All valid configurations correspond to the arguments used in Megatron-LM, with hyphens (-) replaced by underscores (_); for example, Megatron-LM's --micro-batch-size becomes micro_batch_size in the YAML. For a complete list of available configurations, please refer to the Megatron-LM arguments source file.

To kickstart the training process, consider using the existing YAML files in the examples folder as a template. Simply copy and modify these files to suit your needs (a minimal sketch is given after the list below). Please note the following important configurations:

  • exp_dir: the directory for saving checkpoints, TensorBoard logs and other logging information.

  • hostfile: the hostfile file path for the current training, which consists of a list of hostnames and slot counts. For example:

    hostnames-1/IP-1 slots=8
    hostnames-2/IP-2 slots=8
    

    These hostnames or IPs represent machines accessible via passwordless SSH and the slots specify the number of GPUs available on that machine.

  • data_path: the path of the training datasets following the Megatron-LM format. To quickly run the pretraining process, we also provide a small processed dataset (bin and idx files) from the Pile dataset.
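
As a reference point, here is a hedged sketch of a minimal configuration with the four sections described above. The section names come from this document, but the individual keys, values and their exact placement are illustrative assumptions; treat the YAML files under ./examples/aquila/conf as the authoritative templates.

mkdir -p ./my_conf
cat > ./my_conf/config.yaml << 'EOF'
experiment:
  exp_dir: ./outputs/aquila_demo        # checkpoints, TensorBoard logs and other outputs
  hostfile: ./hostfile                  # "hostname slots=N" entries as shown above (placement assumed)
system:
  tensor_model_parallel_size: 2         # Megatron-LM arguments with "-" replaced by "_"
  pipeline_model_parallel_size: 1
  bf16: true                            # precision of operations
model:
  num_layers: 32                        # model architecture and hyper-parameters
  hidden_size: 4096
  num_attention_heads: 32
data:
  data_path: ./pile_sample/pile_text_document   # Megatron-LM format (bin and idx files)
EOF

The training job would then be launched in the same way as above, e.g. python run.py --config-path ./my_conf --config-name config.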

  2. Stop a distributed training job
python run.py --config-path ./examples/aquila/conf --config-name config action=stop

Do the heterogeneous training

Please check out the v0.3 branch first and follow the instructions below.

It is very simple to run heterogeneous training on chips of different generations with the same or compatible architectures. You only need to follow the steps below; everything else remains the same as the homogeneous training above. In addition, you can refer to examples 1, 2 and 3 for a better understanding.

  1. Extend the hostfile

    Before running heterogeneous training, extend the hostfile by adding the device types. You are free to choose the identifier strings for these device types, but please ensure they are not duplicated.

    hostnames-1/IP-1 slots=8 typeA
    hostnames-2/IP-2 slots=8 typeB
    
  2. Add the heterogeneous configuration

    • If you choose the heterogeneous pipeline parallelism mode, please set the following configurations:

      • hetero-mode: specify the heterogeneous training mode pp.
      • hetero-current-device-type: specify the device type of the current node.
      • hetero-device-types: specify all the device types used in this training.
      • hetero-pipeline-stages: specify the stage splitting configuration. For example, given 2 4 4 3 5 5 5, the total pipeline parallel size is 2 + 3 = 5, the total number of model layers is 4 + 4 + 5 + 5 + 5 = 23, the pipeline parallel size for the first device type in the hetero-device-types list is 2, and the pipeline parallel size for the second device type in the hetero-device-types list is 3 (see the configuration sketch after this list).
    • If you choose the heterogeneous data parallelism mode, please set the following configurations:

      • hetero-mode: specify the heterogeneous training mode dp.
      • hetero-current-device-type: specify the device type of the current node.
      • hetero-device-types: specify all the device types used in this training.
      • hetero-micro-batch-sizes: specify the micro batch size splitting configuration. For example, given 2 1 3 2, the total data parallel size is 2 + 3 = 5 and the micro batch size for each training iteration is 2 * 1 + 3 * 2 = 8; the data parallel size for the first device type in the hetero-device-types list is 2, and the data parallel size for the second device type in the hetero-device-types list is 3.
      • Remove the micro-batch-size configuration because hetero-micro-batch-sizes serves the same purpose.
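
To make the splitting arithmetic concrete, here is a hedged sketch that writes illustrative fragments for both modes. The key names follow the hyphen-to-underscore rule from the Quick Start section; the space-separated value format and the exact position of these keys inside the YAML file are assumptions, so please consult the referenced examples for the real layout.

# Heterogeneous pipeline parallelism (pp) mode
cat > hetero_pp_fragment.yaml << 'EOF'
hetero_mode: pp
hetero_current_device_type: typeA       # the device type of the current node
hetero_device_types: typeA typeB        # all device types used in this training
hetero_pipeline_stages: 2 4 4 3 5 5 5   # typeA: 2 stages holding 4 + 4 layers
                                        # typeB: 3 stages holding 5 + 5 + 5 layers
                                        # -> pipeline parallel size 2 + 3 = 5, 23 layers in total
EOF

# Heterogeneous data parallelism (dp) mode (micro-batch-size is removed in this mode)
cat > hetero_dp_fragment.yaml << 'EOF'
hetero_mode: dp
hetero_current_device_type: typeA
hetero_device_types: typeA typeB
hetero_micro_batch_sizes: 2 1 3 2       # typeA: data parallel size 2 with micro batch size 1
                                        # typeB: data parallel size 3 with micro batch size 2
                                        # -> data parallel size 2 + 3 = 5, micro batch size per iteration 2*1 + 3*2 = 8
EOF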

From FlagScale to HuggingFace

  1. Change to the FlagScale directory
cd FlagScale/megatron 
  2. Merge multiple checkpoints into a single checkpoint (if needed)
python tools/checkpoint_util.py --model-type GPT \
        --load-dir ${LOAD_DIR} --save-dir ${SAVE_DIR} \
        --true-vocab-size 100008 --vocab-file ${FlagScale_HOME}/examples/aquila/tokenizer/vocab.json \
        --megatron-path ${FlagScale_HOME} --target-tensor-parallel-size 1 --target-pipeline-parallel-size 1

Please set the following variables before running the command:

  • LOAD_DIR: the directory for loading the original checkpoint.
  • SAVE_DIR: the directory for saving the merged checkpoint.
  • FlagScale_HOME: the directory of FlagScale.
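
For example (with hypothetical paths), the variables might be exported like this before running the merge command:

export FlagScale_HOME=/path/to/FlagScale        # hypothetical FlagScale checkout
export LOAD_DIR=/path/to/aquila_ckpt            # original (possibly sharded) checkpoint
export SAVE_DIR=/path/to/aquila_ckpt_merged     # destination for the merged TP=1/PP=1 checkpoint
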
  3. Convert the merged checkpoint to the HuggingFace format
export PYTHONPATH=${FlagScale_HOME}:$PYTHONPATH

python scripts/convert_megatron_unsharded_to_huggingface.py \
        --input-dir ${SAVE_DIR}/iter_${ITERATION}/mp_rank_00/ \
        --output-dir ${SAVE_DIR}/iter_${ITERATION}_hf \
        --num-layers 60 --hidden-size 6144 \
        --num-attention-heads 48 --group-query-attention --num-query-groups 8 \
        --data-type bf16 --multiple-of 4096 --hidden-dim-multiplier 1.3

Please set the following variables before running the command:

  • FlagScale_HOME: the directory of FlagScale.
  • SAVE_DIR: the directory for loading the merged checkpoint.
  • ITERATION: the iteration number from latest_checkpointed_iteration.txt in SAVE_DIR, which needs to be zero-padded to 7 digits (see the example below).
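
As a hedged illustration of the zero-padding requirement (paths are placeholders), ITERATION can be derived from latest_checkpointed_iteration.txt like this:

export SAVE_DIR=/path/to/aquila_ckpt_merged
# read the latest iteration and left-pad it with zeros to 7 digits, e.g. 12000 -> 0012000
export ITERATION=$(printf "%07d" "$(cat ${SAVE_DIR}/latest_checkpointed_iteration.txt)")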

Note that the above configuration is for converting Aquila-34B, and you may need to change model configurations such as num_layers and hidden_size as needed.

Serve a model

  1. Change to the FlagScale directory
cd FlagScale/megatron
  2. Merge multiple checkpoints into a single checkpoint (if needed)
python tools/checkpoint_util.py --model-type GPT \
        --load-dir ${LOAD_DIR} --save-dir ${SAVE_DIR} \
        --true-vocab-size 100008 --vocab-file ${FlagScale_HOME}/examples/aquila/tokenizer/vocab.json \
        --megatron-path ${FlagScale_HOME} --target-tensor-parallel-size 1 --target-pipeline-parallel-size 1

Please set the following variables before running the command:

  • LOAD_DIR: the directory for loading the original checkpoint.
  • SAVE_DIR: the directory for saving the merged checkpoint.
  • FlagScale_HOME: the directory of FlagScale.
  3. Serve the Aquila2 model using the script below. Here we take Aquila2-34B as an example and assume you have an A800-80G GPU.
python ../examples/aquila/34B/inference_auto.py \
       --server-port ${SERVER_PORT} --master-process ${MASTER_PORT} \
       --device "0" --iteration -1 --checkpoint-path "${CKPT_DIR}" \
       --model-info "Aquila-34b"

Please set the following variables before running the command:

  • SERVER_PORT: the server port for serving the model.
  • MASTER_PORT: the port of the master process.
  • CKPT_DIR: the directory for loading the merged checkpoint.
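
For example (hypothetical values):

export SERVER_PORT=5050                         # port on which the model server listens
export MASTER_PORT=6000                         # port of the master process
export CKPT_DIR=/path/to/aquila_ckpt_merged     # merged checkpoint from the previous step
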
  4. After you have served an Aquila model successfully, you can send a request to test it.
python tools/test/test_api_flask.py

Repartition the distributed optimizer [optional]

When using the distributed optimizer, you can use the following tool to repartition it if the parallel scheme is changed during training.

  1. Change to the FlagScale directory
cd FlagScale/megatron
  2. Repartition the model weight
python tools/checkpoint_util_lite.py --conversion-type weight --model-type GPT --load-dir ${LOAD_DIR} --save-dir ${SAVE_DIR} \ 
    --true-vocab-size 100008 --vocab-file ${FlagScale_HOME}/examples/aquila/tokenizer/vocab.json --megatron-path  ${FlagScale_HOME} \
    --target-tensor-parallel-size ${TP} --target-pipeline-parallel-size ${PP} 

Please set the following variables before running the command:

  • LOAD_DIR: the directory for loading the original checkpoint.
  • SAVE_DIR: the directory for saving the converted checkpoint.
  • FlagScale_HOME: the directory of FlagScale.
  • TP: the target tensor parallel size.
  • PP: the target pipeline parallel size.
  3. Repartition the distributed optimizer
python tools/checkpoint_util_lite.py --conversion-type optimizer --model-type GPT --load-dir ${LOAD_DIR} --save-dir ${SAVE_DIR} \ 
    --true-vocab-size 100008 --vocab-file ${FlagScale_HOME}/examples/aquila/tokenizer/vocab.json --megatron-path  ${FlagScale_HOME} \
    --target-tensor-parallel-size ${TP} --target-pipeline-parallel-size ${PP} 

Please set the following variables before running the command, using the same values as in the model weight conversion:

  • LOAD_DIR: the directory for loading the original checkpoint.
  • SAVE_DIR: the directory for saving the converted checkpoint.
  • FlagScale_HOME: the directory of FlagScale.
  • TP: the target tensor parallel size.
  • PP: the target pipeline parallel size.
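
For example (hypothetical paths and target sizes), both repartition commands above can share the same variables; the weight repartition (step 2) is run first and the optimizer repartition (step 3) then reuses identical values:

export FlagScale_HOME=/path/to/FlagScale
export LOAD_DIR=/path/to/ckpt_old_scheme        # checkpoint saved with the previous parallel scheme
export SAVE_DIR=/path/to/ckpt_tp4_pp2           # destination for the repartitioned checkpoint
export TP=4                                     # target tensor parallel size
export PP=2                                     # target pipeline parallel size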

Future work

We will work with the community together on the following items:

  • Release the actually used training schemes for more models from BAAI
  • Add customized optimizations and integrate techniques from other excellent open-source projects such as DeepSpeed and vLLM
  • Support LLMs with different model structures
  • Support the model training with more hardware architectures

License

This project is mainly based on the Megatron-LM project and is licensed under the Apache License (Version 2.0). This project also contains other third-party components under other open-source licenses. See the LICENSE file for more information.


flagscale's Issues

Feature request: additional ways to split (repartition) model weights

1. Support splitting the weights on other accelerator cards, not only NVIDIA cards. The default tools/checkpoint_util.py involves logic compiled for NVIDIA, which other cards do not support.
2. Support splitting the weights across multiple machines in a distributed way. Some accelerators are not set up with shared storage, and once the model is large, copying the weights around is very inconvenient, so a multi-machine weight-splitting feature would be helpful.
3. Reduce the peak host-side memory. Host memory differs across machines: an NVIDIA machine with 1 TB of host memory can split the weights on a single node, but hosts with less memory (for example 512 GB) run out of memory (OOM) while splitting, so an option to lower the peak memory would help, for example loading one layer and immediately saving that layer before moving on.

Llama2 70B loss decreases differently under different PP settings

Attachments: 7_gpu_pp1.log, 31_gpu_pp4.log, and a screenshot of the loss curves.

When running Llama2 70B (with the number of layers reduced), the loss decreases differently with PP=1 than with PP=4. The logs and the loss-curve plot are attached above; the script is as follows:

export CUDA_DEVICE_MAX_CONNECTIONS=1

GPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=192.167.5.2
MASTER_PORT=29501
NUM_NODES=4
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NUM_NODES))

CHECKPOINT_PATH='/data/zhangling21/ckpts/'
TENSORBOARD_LOGS_PATH='/data/zhangling21/tensorboard_logs/'
TOKENIZER_PATH='/data/zhangling21/llama_00_text_document/tokenizer/tokenizer.model'
DATA_PATH='/data/zhangling21/llama_00_text_document/llama_00_text_document'

DISTRIBUTED_ARGS=(
    --nproc_per_node $GPUS_PER_NODE
    --nnodes $NUM_NODES
    --node_rank $NODE_RANK
    --master_addr $MASTER_ADDR
    --master_port $MASTER_PORT
)
# --tokenizer-type LLaMASentencePieceTokenizer \
# --rmsnorm-epsilon 1e-5

LLAMA_MODEL_ARGS=(
    --num-layers 8
    --hidden-size 8192
    --ffn-hidden-size 28672
    --num-attention-heads 64
    --seq-length 4096
    --max-position-embeddings 4096
    --group-query-attention
    --num-query-groups 8
    --tokenizer-type Llama2Tokenizer
    --tokenizer-model $TOKENIZER_PATH
    --swiglu
    --normalization RMSNorm
    --use-rotary-position-embeddings
    --no-position-embedding
    --disable-bias-linear
)
# --optimizer adam
# --adam-eps 1e-05
# --no-contiguous-buffers-in-local-ddp
# --recompute-method uniform
# --no-async-tensor-model-parallel-allreduce
# --embedding-dropout 0
# --multi-query-attention
# --multi-query-group-num 8
# --ffn-dim-multiplier 1.3
# --recompute-granularity full
# --distribute-saved-activations
# --recompute-num-layers 1
# --memory-saving

# --fp16

    # --optimizer adam
    # --adam-eps 1e-05
TRAINING_ARGS=(
    --micro-batch-size 1
    --global-batch-size 44
    --train-samples 24414
    --weight-decay 1e-2
    --optimizer adam
    --clip-grad 1.0
    --lr 0.00015
    --lr-decay-style cosine
    --min-lr 1.0e-5
    --lr-warmup-fraction .01
    --adam-beta1 0.9
    --adam-beta2 0.95
    --attention-dropout 0.0
    --hidden-dropout 0.0
    --untie-embeddings-and-output-weights
    --multiple-of 4096
    --no-gradient-accumulation-fusion
    --recompute-granularity 'full'
    --recompute-num-layers 1
    --recompute-method 'uniform'
    --no-async-tensor-model-parallel-allreduce
)

MODEL_PARALLEL_ARGS=(
        --tensor-model-parallel-size 8
        --pipeline-model-parallel-size 4
)

DATA_ARGS=(
    --data-path $DATA_PATH
    --split 1
)

EVAL_AND_LOGGING_ARGS=(
    --log-interval 1
    --init-method-std 0.02
    --seed 1234
    --eval-iters 0
    --use-cpu-initialization
)
    #--load "/data/zhangling21/llama_00_text_document/ckpt0227_8L"
    #--no-load-rng
    #--save "/data/zhangling21/llama_00_text_document/ckpt0227_8L"
    #--save-interval 1

cmd="torchrun ${DISTRIBUTED_ARGS[@]} pretrain_llama.py \
        ${LLAMA_MODEL_ARGS[@]} \
        ${TRAINING_ARGS[@]} \
        ${MODEL_PARALLEL_ARGS[@]} \
        ${DATA_ARGS[@]} \
        ${EVAL_AND_LOGGING_ARGS[@]}"
echo $cmd
eval $cmd

[QUESTION] Support other hardware?

So far, FlagScale supports training on Chinese domestic hardware, including Iluvatar CoreX and Baidu KUNLUN chips.
Will any other hardware be supported in the future?
