llama-x's Introduction

Llama-X


Llama-X: Open Academic Research on Improving LLaMA to SOTA LLM

This is the repo for Llama-X, which aims to:

  • Progressively improve the performance of LLaMA to a SOTA LLM together with the open-source community.
  • Conduct Llama-X as open academic research that is long-term, systematic, and rigorous.
  • Save the community from repetitive work so that we can collaborate to make larger and faster progress.

The project will follow these principles:

  • We will publish all the code, models, data, and experiment details.
  • We will continuously improve the model version by version and open-source the newest methods.
  • We will summarize the methods of each main version as academic papers.
  • We announce a complete research plan. Contributors are welcome to cooperate with each other to progressively improve Llama-X through iteration of the target versions.
  • The check-in of a new model must achieve a significant improvement over the current version on automatic evaluation.

📣 Please join us on Discord if you are interested in Llama-X.

Contents

  1. News

  2. Ten main research areas

  3. Llama-X Model Version

  4. Llama-X Evaluation

  5. Llama-X Paper List

  6. Usage

  7. How to contribute

News

We have completed training of the first version of our model (Llama-X 3.0.1, 7B). Please try the model on the demo page; the data, code, and model weights at different scales will be uploaded to this repo later.

Ten main research areas

[1]. Research on Instruction Tuning

  • instruction-following tuning

[2]. Research on RLHF & RLAIF

  • fundamental RLHF
  • AI learning from AI

[3]. Research on Data Quality

  • high-quality data for pre-training, fine-tuning, user feedback, multi-modality, etc.

[4]. Research on Long Context Transformer

  • enable efficient transformers for long sequences (>30k tokens)

[5]. Research on Multi-modal (text + image) Modeling

  • text + image in; text out

[6]. Research on Multilingual

  • multilingual performance comparable to English

[7]. Research on Efficient Infrastructure and Optimization

  • improve training and inference speed
  • build a deep learning stack that scales predictably

[8]. Research on Evaluation

  • comprehensive evaluation of model capabilities

[9]. Research on Interpretability

  • interpret the source of each capability of LLM

[10]. Research on LLM on Actions

  • combine LLMs with search, recommendation, and other plugins

Llama-X Model Version

Llama-X         Baseline            Performance
3.0.0 (LLaMA)   GPT-3               Outperform
3.1.0           text-davinci-001    Comparable
3.2.0           text-davinci-002    Comparable
3.3.0           text-davinci-003    Comparable
3.5.0           gpt-35-turbo        Comparable
3.6.0           GPT-4               80% Avg. Gap
3.7.0           GPT-4               60% Avg. Gap
3.8.0           GPT-4               40% Avg. Gap
3.9.0           GPT-4               20% Avg. Gap
4.0.0           GPT-4               Comparable

We are now focusing on research areas [1] and [3] above, and will publish our first version of the model (Llama-X 3.0.1) and paper.

Llama-X Evaluation

Each new version of the Llama-X model should significantly outperform (by more than 1%) the current version on automatic evaluation across all of the following Type-A benchmarks. Additional evaluation on the Type-B benchmarks will be added from version 3.6.0 onward:

Type Benchmarks
A MMLU
A HumanEval
A GSM-8K
A NaturalQuestions
A TruthfulQA
B Leetcode
B GRE
B AP
B MMLU-Multilingual
B Visual Inputs (TBD)

Results:

Model                           MMLU    TruthfulQA  GSM-8K  NaturalQuestions
InstructGPT davinci v2 (175B)^  0.57    0.62        0.35    0.389
Llama-X 3.0.1 (7B)              0.4412  0.2032      0.1887  0.2422
Llama-i (7B)                    0.5121  0.2142      0.2259  0.3499

^ The results of InstructGPT davinci v2 (175B) are copied from Stanford CRFM Benchmark.

Llama-X Paper List

  1. LLaMA: Open and Efficient Foundation Language Models.

Usage

  • Setup. Install the conda environment:
conda create -n llamax python=3.10
conda activate llamax
git clone https://github.com/AetherCortex/Llama-X.git
cd Llama-X/src
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.3 -c pytorch
pip install transformers==4.31.0
cd ../..
pip install -r requirements.txt
The example instruction-tuning data is at Llama-X/src/data/alpaca_data.json.
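
For a quick sanity check of that file, here is a minimal sketch (it assumes the standard Alpaca schema of instruction/input/output records; it is not part of the repo's scripts):

import json

# Peek at the instruction-tuning data (assumed to be an Alpaca-style
# JSON list of {"instruction", "input", "output"} records).
with open("Llama-X/src/data/alpaca_data.json") as f:
    examples = json.load(f)

print(len(examples), "examples")
print(json.dumps(examples[0], indent=2))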
  • Convert LLaMA checkpoint to HuggingFace format:
cd Llama-X/src
python transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /path/to/llama-7B/ \
    --model_size 7B \
    --output_dir /path/to/llama-7B/hf
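
A minimal sketch to verify that the converted checkpoint loads (it reuses the --output_dir above and is only a sanity check, not part of the repo's scripts):

from transformers import LlamaForCausalLM, LlamaTokenizer

# Load the converted Hugging Face checkpoint and print a couple of config fields.
tokenizer = LlamaTokenizer.from_pretrained("/path/to/llama-7B/hf")
model = LlamaForCausalLM.from_pretrained("/path/to/llama-7B/hf")
print(model.config.hidden_size, model.config.num_hidden_layers)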
  • Train LLaMA-7B with DeepSpeed ZeRO-3:
deepspeed train.py \
    --model_name_or_path /path/to/llama-7B/hf \
    --data_path /path/to/example_data.json \
    --output_dir /path/to/llama-7B/hf/ft \
    --num_train_epochs 3 \
    --model_max_length 512 \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --warmup_steps 2 \
    --logging_steps 2 \
    --lr_scheduler_type "cosine" \
    --report_to "tensorboard" \
    --gradient_checkpointing True \
    --deepspeed configs/deepspeed_config.json \
    --fp16 True
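
For reference, the effective global batch size equals per_device_train_batch_size × number of GPUs × gradient_accumulation_steps, so on a single 8-GPU node the settings above give 64 × 8 × 1 = 512 examples per optimizer step.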
  • Train LLaMA-7B with DeepSpeed ZeRO-3 on multiple nodes:
deepspeed --num_gpus num_of_gpus_in_each_node \
    --num_nodes num_of_nodes \
    --master_addr ip_address_of_main_node \
    --master_port 34545 \
    --hostfile configs/hostfile \
    train.py \
    --model_name_or_path /path/to/llama-7B/hf \
    --data_path /path/to/example_data.json \
    --output_dir /path/to/llama-7B/hf/ft \
    --num_train_epochs 3 \
    --model_max_length 512 \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --warmup_steps 2 \
    --logging_steps 2 \
    --lr_scheduler_type "cosine" \
    --report_to "tensorboard" \
    --gradient_checkpointing True \
    --deepspeed configs/deepspeed_config.json \
    --fp16 True
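
The hostfile referenced above lists one line per node in DeepSpeed's "hostname slots=N" format; the hostnames below are placeholders for your own machines:

# configs/hostfile (example)
worker-1 slots=8
worker-2 slots=8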
  • The current code of Llama-X supports:
    • Full fine-tuning: optimize the full LLaMA checkpoint, instead of Low-Rank Adaptation (LoRA).
    • High efficiency: train a 7B model on 50k examples/epoch with batch_size=64 within 1 hour on 8 x V100 GPUs.
LLaMA  Batch Size  V100s  Time (h)
7B     64          8      1.00
13B    32          8      2.00
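
Put differently, the 7B row corresponds to roughly 50,000 examples / 3,600 s ≈ 14 examples per second across the 8 V100s.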
  • Inference
# web demo inference
python generate.py

# batch inference
To Do
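
Until the batch-inference script is released, here is a minimal sketch of batched generation with the Hugging Face generate API; the checkpoint path and generation settings are assumptions, not the repo's eventual script, and the prompt template is the Alpaca instruction format used in the training data.

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

# Hypothetical path to the fine-tuned checkpoint produced by the training step above.
model_path = "/path/to/llama-7B/hf/ft"

tokenizer = LlamaTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token   # LLaMA has no pad token by default
tokenizer.padding_side = "left"             # left-pad for decoder-only generation

model = LlamaForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).cuda()
model.eval()

template = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)
prompts = [
    template.format(instruction="List three primary colors."),
    template.format(instruction="Name the capital of France."),
]

inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)

for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text, "\n---")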

How to contribute

Developers can become Contributors by contributing helpful code, data, papers, computing resources, etc.

  1. Code: including algorithm implementation, training optimization, inference optimization, and model deployment.

  2. Data: every research area and version iteration requires high-quality data, including instruction-answer, pre-training, multi-modal, multilingual, and user-feedback data.

  3. Paper: we will maintain a Llama-X Paper List, and use Llama-X as the base model for optimized, fully tested, and significantly improved academic papers. You can check in to the Llama-X Paper List.

  4. Computing resources: we hope to accelerate model iteration by coordinating spare computing power from developers or non-profit sponsorship from universities and enterprises.

How to communicate with us

  1. Github Issues

  2. Email: [email protected]

  3. Discord: Join us on Discord

Thanks

This project has been inspired by multiple open-source projects:

Meta AI LLaMA

Huggingface Transformers Llama

Alpaca and Alpaca-LoRA

Disclaimer

The use of resources (e.g., code, data, and model weights) related to this project is limited to academic research and is prohibited for commercial purposes. The content generated by any Llama-X model is affected by factors such as randomness and uncontrollability, and this project cannot guarantee its accuracy. This project assumes no legal responsibility for the content of model outputs, nor any responsibility for losses that may arise from the use of the related resources and output results.

llama-x's People

Contributors

aethercortex, aiforallnlp, chelsea-yeung, cyril-jz, llm-p, nlpkazhe, nlpxucan, robertmarton, sky1worker, victorsungo, wangpeiyi9979, yxkryptonite, zhaopu7

llama-x's Issues

About Llama-X and Alpaca repo

Hi, may I know why the hyperparameters of the training commands in Llama-X (this repo) and Alpaca are different? E.g., the batch size is 128 vs. 512 (64 × 8), and the warmup is 0.03 (ratio) vs. 2 steps.
Which hyperparameters should we adopt?

Another question: what is Llama-i (7B) in the Llama-X Evaluation section? Also, the reported GSM-8K result is 18.8%, while my own Llama-X model (trained with the hyperparameters in this repo) only reaches 10%. I am not sure why the gap is so large. Would you mind sharing your GSM-8K evaluation script for Llama-X? Thank you.

There is an endless loop between convert_tokens_to_ids(self.unk_token) and self._convert_token_to_id_with_added_voc(tokens)

Hi, this is part of my error information, thanks!

File "/mnt/data/wxc/workspace/Llama-X/src/transformers/src/transformers/tokenization_utils_fast.py", line 257, in _convert_token_to_id_with_added_voc
return self.unk_token_id
File "/mnt/data/wxc/workspace/Llama-X/src/transformers/src/transformers/tokenization_utils_base.py", line 1150, in unk_token_id
return self.convert_tokens_to_ids(self.unk_token)
File "/mnt/data/wxc/workspace/Llama-X/src/transformers/src/transformers/tokenization_utils_fast.py", line 250, in convert_tokens_to_ids
return self._convert_token_to_id_with_added_voc(tokens)

Only the lm_head.weight and embed_tokens.weight parameters are loaded

I can only load these parameters:
○ lm_head.weight : 131076096
○ model.embed_tokens.weight : 131076096
and the other parameters are None.

Number of parameters:  262152192
model.embed_tokens.weight : 131076096
model.layers.0.self_attn.q_proj.weight : 0
model.layers.0.self_attn.k_proj.weight : 0
model.layers.0.self_attn.v_proj.weight : 0
model.layers.0.self_attn.o_proj.weight : 0
model.layers.0.mlp.gate_proj.weight : 0
model.layers.0.mlp.down_proj.weight : 0
model.layers.0.mlp.up_proj.weight : 0
model.layers.0.input_layernorm.weight : 0
model.layers.0.post_attention_layernorm.weight : 0

Learning rate fixed at 0 during training via DeepSpeed

I followed all the setup instructions given in the README.
The command I am using is:

deepspeed train.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --data_path Llama-X/data/alpaca_data.json \
    --output_dir ./model_weights_finetuned \
    --num_train_epochs 3 \
    --model_max_length 512 \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --warmup_steps 2 \
    --logging_steps 2 \
    --lr_scheduler_type "cosine" \
    --report_to "tensorboard" \
    --gradient_checkpointing True \
    --deepspeed configs/deepspeed_config.json \
    --fp16 True

Initially, I got the following error:

ValueError: Found `optimizer` configured in the DeepSpeed config, but no `scheduler`. Please configure a scheduler in the DeepSpeed config.

I downgraded to transformers version 4.29.2 as suggested here.

Now, training is happening, but the learning rate is fixed at zero from the very beginning. Below are the logs:

[2023-08-28 04:36:42,566] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:43,585] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-08-28 04:36:43,585] [INFO] [runner.py:555:main] cmd = /home/anmol/anaconda3/envs/llamax/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None train.py --model_name_or_path meta-llama/Llama-2-7b-hf --data_path /home/anmol/TieredModels/code/08_llamax_approach/Llama-X/data/alpaca_data.json --output_dir ./model_weights_finetuned --num_train_epochs 3 --model_max_length 512 --per_device_train_batch_size 64 --per_device_eval_batch_size 1 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy steps --save_steps 100 --save_total_limit 2 --learning_rate 2e-5 --warmup_steps 2 --logging_steps 2 --lr_scheduler_type cosine --report_to tensorboard --gradient_checkpointing True --deepspeed configs/deepspeed_config.json --fp16 True
[2023-08-28 04:36:44,193] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:45,187] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-08-28 04:36:45,187] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-08-28 04:36:45,187] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-08-28 04:36:45,187] [INFO] [launch.py:163:main] dist_world_size=4
[2023-08-28 04:36:45,187] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2023-08-28 04:36:45,955] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:45,977] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:45,980] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:45,996] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:47,765] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-28 04:36:47,765] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-28 04:36:47,765] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-08-28 04:36:47,772] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-28 04:36:47,772] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-28 04:36:47,773] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-28 04:36:47,773] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-28 04:36:47,801] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-28 04:36:47,801] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-28 04:36:54,794] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 6.74B parameters
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00,  1.91s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00,  1.91s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00,  1.91s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00,  1.97s/it]
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Running tokenizer on train dataset (num_proc=32): 100%|██████████| 52002/52002 [00:02<00:00, 24943.76 examples/s]
52002
Sample 12208 of the training set: {'input_ids': [1, 13866, 338, 385, 15278, 393, 16612, 263, 3414, 29889, 14350, 263, 2933, 393, 7128, 2486, 1614, 2167, 278, 2009, 29889, 13, 13, 2277, 29937, 2799, 4080, 29901, 13, 4391, 385, 4544, 1813, 411, 263, 28435, 322, 263, 1014, 2813, 292, 13, 13, 2277, 29937, 13291, 29901, 29966, 1420, 29958, 13, 1678, 529, 2813, 29958, 13, 4706, 529, 3257, 29958, 5494, 292, 322, 3323, 2813, 292, 829, 3257, 29958, 13, 1678, 1533, 2813, 29958, 13, 1678, 529, 2587, 29958, 13, 4706, 529, 29882, 29896, 29958, 5494, 292, 829, 29882, 29896, 29958, 13, 4706, 529, 29882, 29906, 29958, 4035, 2813, 292, 829, 29882, 29906, 29958, 13, 1678, 1533, 2587, 29958, 13, 829, 1420, 29958, 2], 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 29966, 1420, 29958, 13, 1678, 529, 2813, 29958, 13, 4706, 529, 3257, 29958, 5494, 292, 322, 3323, 2813, 292, 829, 3257, 29958, 13, 1678, 1533, 2813, 29958, 13, 1678, 529, 2587, 29958, 13, 4706, 529, 29882, 29896, 29958, 5494, 292, 829, 29882, 29896, 29958, 13, 4706, 529, 29882, 29906, 29958, 4035, 2813, 292, 829, 29882, 29906, 29958, 13, 1678, 1533, 2587, 29958, 13, 829, 1420, 29958, 2]}.
Sample 46872 of the training set: {'input_ids': [1, 13866, 338, 385, 15278, 393, 16612, 263, 3414, 29889, 14350, 263, 2933, 393, 7128, 2486, 1614, 2167, 278, 2009, 29889, 13, 13, 2277, 29937, 2799, 4080, 29901, 13, 6113, 263, 29871, 29941, 29899, 29946, 10541, 5828, 1048, 263, 285, 9102, 1058, 27401, 278, 2462, 29889, 13, 13, 2277, 29937, 13291, 29901, 26222, 2501, 263, 931, 29892, 727, 471, 263, 26565, 285, 9102, 1058, 10600, 297, 263, 282, 898, 297, 278, 25013, 29889, 3118, 2462, 29892, 263, 2107, 13569, 3974, 14455, 714, 29892, 20616, 292, 599, 278, 15006, 297, 278, 4038, 29889, 450, 285, 9102, 471, 10087, 304, 1371, 29892, 577, 1183, 5089, 2986, 714, 310, 278, 282, 898, 322, 4822, 278, 25013, 29892, 11705, 292, 19225, 11308, 297, 902, 13394, 29889, 2296, 5096, 287, 701, 278, 11308, 2820, 902, 282, 898, 322, 2825, 263, 2594, 4336, 304, 5557, 278, 3974, 515, 9677, 292, 29889, 2973, 14332, 9109, 29892, 278, 285, 9102, 750, 7160, 278, 2462, 29991, 2], 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 26222, 2501, 263, 931, 29892, 727, 471, 263, 26565, 285, 9102, 1058, 10600, 297, 263, 282, 898, 297, 278, 25013, 29889, 3118, 2462, 29892, 263, 2107, 13569, 3974, 14455, 714, 29892, 20616, 292, 599, 278, 15006, 297, 278, 4038, 29889, 450, 285, 9102, 471, 10087, 304, 1371, 29892, 577, 1183, 5089, 2986, 714, 310, 278, 282, 898, 322, 4822, 278, 25013, 29892, 11705, 292, 19225, 11308, 297, 902, 13394, 29889, 2296, 5096, 287, 701, 278, 11308, 2820, 902, 282, 898, 322, 2825, 263, 2594, 4336, 304, 5557, 278, 3974, 515, 9677, 292, 29889, 2973, 14332, 9109, 29892, 278, 285, 9102, 750, 7160, 278, 2462, 29991, 2]}.
Sample 4920 of the training set: {'input_ids': [1, 13866, 338, 385, 15278, 393, 16612, 263, 3414, 29892, 3300, 2859, 411, 385, 1881, 393, 8128, 4340, 3030, 29889, 14350, 263, 2933, 393, 7128, 2486, 1614, 2167, 278, 2009, 29889, 13, 13, 2277, 29937, 2799, 4080, 29901, 13, 6113, 263, 740, 297, 5132, 304, 7252, 1023, 6031, 29889, 13, 13, 2277, 29937, 10567, 29901, 13, 1576, 1023, 6031, 526, 525, 11548, 29915, 322, 525, 272, 927, 4286, 13, 13, 2277, 29937, 13291, 29901, 1753, 7252, 29918, 19651, 29898, 29879, 29896, 29892, 269, 29906, 1125, 13, 29871, 396, 3831, 598, 1023, 6031, 322, 736, 263, 7223, 995, 13, 29871, 565, 269, 29896, 1275, 269, 29906, 29901, 13, 1678, 736, 5852, 13, 29871, 1683, 29901, 13, 1678, 736, 7700, 13, 13, 29937, 4321, 1206, 13, 1807, 29896, 353, 525, 11548, 29915, 13, 1807, 29906, 353, 525, 272, 927, 29915, 13, 13, 2914, 353, 7252, 29918, 19651, 29898, 1807, 29896, 29892, 1347, 29906, 29897, 13, 2158, 29898, 2914, 29897, 2], 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 1753, 7252, 29918, 19651, 29898, 29879, 29896, 29892, 269, 29906, 1125, 13, 29871, 396, 3831, 598, 1023, 6031, 322, 736, 263, 7223, 995, 13, 29871, 565, 269, 29896, 1275, 269, 29906, 29901, 13, 1678, 736, 5852, 13, 29871, 1683, 29901, 13, 1678, 736, 7700, 13, 13, 29937, 4321, 1206, 13, 1807, 29896, 353, 525, 11548, 29915, 13, 1807, 29906, 353, 525, 272, 927, 29915, 13, 13, 2914, 353, 7252, 29918, 19651, 29898, 1807, 29896, 29892, 1347, 29906, 29897, 13, 2158, 29898, 2914, 29897, 2]}.
[2023-08-28 04:37:06,368] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-08-28 04:37:06,370] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-08-28 04:37:06,376] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-08-28 04:37:06,377] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
 [WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /home/anmol/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Creating extension directory /home/anmol/.cache/torch_extensions/py310_cu113/cpu_adam...
 [WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /home/anmol/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
 [WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /home/anmol/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
 [WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /home/anmol/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Emitting ninja build file /home/anmol/.cache/torch_extensions/py310_cu113/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/torch/include -isystem /home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/torch/include/TH -isystem /home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/torch/include/THC -isystem /home/anmol/anaconda3/envs/llamax/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -march=native -fopenmp -D__AVX256__ -D__DISABLE_CUDA__ -c /home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[2/2] c++ cpu_adam.o -shared -fopenmp -L/home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 16.43460202217102 seconds
Time to load cpu_adam op: 16.424538373947144 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 16.438047647476196 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 16.528525590896606 seconds
Parameter Offload: Total persistent parameters: 266240 in 65 params
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.04}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.05}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.06}

Does anyone have any idea what I might be doing wrong?

About the training strategy

Very nice project and appreciate your contribution!

I have seen the DeepSpeed config and want to confirm the current training strategy. For LLaMA-13B, the training uses ZeRO-3 optimization, gradient checkpointing, and CPU offload, right? I'm curious whether you have tried tensor parallelism (used in the original LLaMA training) or model parallelism.

We would also love to contribute a model-parallel training implementation for fast large-scale training, aiming at models larger than 13B. Currently I'm investigating the torch/fairscale pipeline-parallel mechanism.

Best,
Fangkai

Concern on the language

Interesting project, but I have some concerns about the language.
As is known, there are relatively few Chinese tokens in LLaMA's training data, and each Chinese character is tokenized into several tokens, which is inefficient for generation. Will the project handle this, e.g., by adding new tokens and doing some further pre-training?

About llama-2-70B fine-tuning

Thanks for your amazing work!

I'm experiencing out-of-memory problems when using Llama-X's fine-tuning code for supervised fine-tuning of the 70B model (Llama-2-13B doesn't have this problem), with a configuration of three nodes of 8 x A100 (40 GB).

So I would like to ask about the training configuration you used, if possible. Thanks a lot!

How big is your RAM?

I trained on 4 x V100 32 GB and got OOM. If I set offload to CPU, the process is killed because it runs out of RAM. So how large is your RAM? And is it possible to train LLaMA 7B with 4 x V100s? Really thanks for the reply.

RuntimeError: CUDA out of memory.

I fine-tuned LLaMA-7B on 8 x V100 32 GB. However, it runs out of CUDA memory.

RuntimeError: CUDA out of memory. Tried to allocate 688.00 MiB (GPU 6; 31.75 GiB total capacity; 29.87 GiB already allocated; 41.94 MiB free; 30.29 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-04-04 17:43:59,766] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1840
[2023-04-04 17:44:02,370] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1841
[2023-04-04 17:44:05,213] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1842
[2023-04-04 17:44:07,817] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1843
[2023-04-04 17:44:10,420] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1844
[2023-04-04 17:44:13,023] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1845
[2023-04-04 17:44:15,585] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1846
[2023-04-04 17:44:15,586] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1847
[2023-04-04 17:44:18,429] [ERROR] [launch.py:324:sigkill_handler] ['/home/aiscuser/.conda/envs/llamax/bin/python3.1', '-u', 'train.py', '--local_rank=7', '--model_name_or_path', './llama-7b-hf', '--data_path', '../data/alpaca_data.json', '--output_dir', 'output/', '--num_train_epochs', '3', '--per_device_train_batch_size', '64', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '100', '--save_total_limit', '2', '--learning_rate', '2e-5', '--warmup_steps', '2', '--logging_steps', '2', '--lr_scheduler_type', 'cosine', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--deepspeed', 'configs/deepspeed_config.json', '--fp16', 'True'] exits with return code = 1

watch -n 1 nvidia-smi


    Tue Apr  4 17:48:15 2023
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  On   | 00000001:00:00.0 Off |                    0 |
    | N/A   42C    P0    67W / 300W |   4491MiB / 32768MiB |     32%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla V100-SXM2...  On   | 00000002:00:00.0 Off |                    0 |
    | N/A   44C    P0    57W / 300W |   2482MiB / 32768MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   2  Tesla V100-SXM2...  On   | 00000003:00:00.0 Off |                    0 |
    | N/A   40C    P0    54W / 300W |   2502MiB / 32768MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   3  Tesla V100-SXM2...  On   | 00000004:00:00.0 Off |                    0 |
    | N/A   41C    P0    56W / 300W |   2482MiB / 32768MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   4  Tesla V100-SXM2...  On   | 00000005:00:00.0 Off |                    0 |
    | N/A   39C    P0    54W / 300W |   2522MiB / 32768MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   5  Tesla V100-SXM2...  On   | 00000006:00:00.0 Off |                    0 |
    | N/A   43C    P0    59W / 300W |   2522MiB / 32768MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   6  Tesla V100-SXM2...  On   | 00000007:00:00.0 Off |                    0 |
    | N/A   40C    P0    55W / 300W |   2502MiB / 32768MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   7  Tesla V100-SXM2...  On   | 00000008:00:00.0 Off |                    0 |
    | N/A   43C    P0    53W / 300W |   2522MiB / 32768MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|

Need optimization for mps

Reproduction info:

  • MacBook Pro 2021 14" with M1 chip
  • macOS 12.4
  • 6B model

Inference is very slow on a macOS machine with MPS and needs further optimization.

utils.py with openai==0.8.0: TypeError: 'type' object is not subscriptable

This code at line 47 of ./src/utils.py (the return_text=False parameter of def openai_completion):

def openai_completion(
    prompts: Union[str, Sequence[str], Sequence[dict[str, str]], dict[str, str]],
    decoding_args: OpenAIDecodingArguments,
    model_name="text-davinci-003",
    sleep_time=2,
    batch_size=1,
    max_instances=sys.maxsize,
    max_batches=sys.maxsize,
    return_text=False,

shows this error:
File "train.py", line 27, in
import utils
File "/home/llama/Llama-X-main/src/utils.py", line 47, in
return_text=False,
TypeError: 'type' object is not subscriptable

My versions are openai==0.8.0 and deepspeed==0.10.0, with the other packages from requirements.txt. Even after changing the transformers version to 4.29.2, I still cannot solve this error; maybe some package versions are wrong. If anyone has a suggestion, please leave a comment here. Thank you very much!

Improve LLaMA for visual understanding like GPT-4

Thanks for the good work!

We have tried to improve the LLaMA model to understand visual information and support multi-modal chatting.
We are inspired by the idea that a good ViT (e.g., the CLIP vision encoder) and a well-trained large language model (e.g., LLaMA), joined by a connection network (e.g., an MLP or Transformer), can cover visual applications, like PaLM-E.

The results on image captioning, VQA, and more multi-modal tasks are promising at 7B, and we call on more people to help test larger models.

Github: https://github.com/feizc/Visual-LLaMA

  • fine-tuning scripts and hyper-parameters setting
  • datasets for fine-grained alignment and instruct tuning
  • interactive gradio and visual chatbot

How to fix unstable loss? I am using the WizardLM / Llama-X training code with the Vicuna-style chat format to fine-tune the Llama-2-7b-hf model.

I'm using the Llama-X (https://github.com/AetherCortex/Llama-X) training code with the Vicuna-style chat template to fine-tune the Llama-2-7b-hf model. However, I'm observing unstable loss during the process.

Please find the detailed Weights & Biases report at (https://wandb.ai/arpitsh018/huggingface/reports/Untitled-Report--Vmlldzo1NjE2Njgz).

Training Parameters:

os.system(f'deepspeed train.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --data_path ../data/dummy_conversation.json \
    --output_dir ./checkpoint/finetuned-llama2-7b \
    --num_train_epochs 1 \
    --model_max_length 4096 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 3 \
    --learning_rate 2e-5 \
    --warmup_steps 0 \
    --logging_steps 1 \
    --lr_scheduler_type "cosine" \
    --report_to "wandb" \
    --gradient_checkpointing True \
    --deepspeed configs/deepspeed_config.json \
    --bf16 True')

deepspeed_config.json:

{
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 0,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 0,
        "stage3_max_reuse_distance": 0,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "bf16": {
        "enabled": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0
        }
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto",
            "total_num_steps": "auto"
        }
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "wall_clock_breakdown": false
}

I'm eager to understand how to stabilize the loss for my training sessions. Any insights or recommendations would be greatly appreciated.

Why use offload_param to CPU?

I think loading the model shards can be completed without offloading for a 6.7B-parameter model, so why offload the parameters to the CPU?

        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
