openmoss / collie

Collaborative Training of Large Language Models in an Efficient Way

Home Page: https://openlmlab-collie.readthedocs.io

License: Apache License 2.0

Python 99.85% Dockerfile 0.05% Shell 0.10%

deep-learning deepspeed nlp pytorch

collie's Issues

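Which Megatron-LM version is used?

examples/alpaca/train.py uses Megatron-LM, and running it with the latest Megatron-LM version raises an error.
After modifying the source code to work around the error, the loss is NaN with both data parallelism and tensor parallelism.
Could you tell us which Megatron-LM version you use?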

Is lr_scheduler for Lomo implemented now?

Hi authors,

Is lr = trainer.lr_scheduler.step(global_step) for Lomo implemented in the Trainer?

If so, how can it be enabled?

Thanks!

AdaLomo optimizer step method

Hi @KaiLv69

Thanks for the writeup and implementation of AdaLomo! It looks like it is missing the step method that torch needs in order to use it in other frameworks. Could you help with this, please?

Error: with ZeRO-3 enabled, LlamaForCausalLM.from_pretrained on llama2 70B consumes a large amount of host memory and causes OOM

With 8 V100 GPUs, ZeRO-3 enabled, TP=1, PP=1, DP=8, calling LlamaForCausalLM.from_pretrained on the llama 70B model runs out of memory (host RAM, not GPU memory). The machine has 512 GB of physical memory.
The cause is line 304 of base.py on the dev branch:
state_dict = {}
if (not is_zero3_enabled(config) or env.dp_rank == 0
        or config.low_cpu_mem_usage or config.quantization_config.load_in_8bit
        or getattr(config.quantization_config, "load_in_4bit", False)):
    state_dict = cls.load_parallel_state_dict(
        path=model_path_or_name, config=config,
        process_exclusion=process_exclusion, **kwargs
    )
This makes all 8 processes load the state_dict once each, which consumes a lot of memory and leads to OOM.
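
A rough back-of-the-envelope check of the report above. It assumes fp16 weights (2 bytes per parameter) and that each of the 8 processes materializes a full copy of the state_dict; both are assumptions for illustration, not measured figures.

```python
# Rough host-memory estimate; every figure below is an assumption for illustration.
PARAMS = 70e9            # approximate parameter count of a 70B model
BYTES_PER_PARAM = 2      # assuming fp16/bf16 weights
NUM_PROCESSES = 8        # DP=8, one process per GPU
HOST_RAM_GB = 512        # physical memory reported in the issue

per_process_gb = PARAMS * BYTES_PER_PARAM / 1e9
total_gb = per_process_gb * NUM_PROCESSES

print(f"one copy:   {per_process_gb:.0f} GB")
print(f"all copies: {total_gb:.0f} GB (host RAM: {HOST_RAM_GB} GB)")
print("exceeds host RAM:", total_gb > HOST_RAM_GB)
```

Under these assumptions a single copy fits in 512 GB but eight copies do not, which is consistent with the reported OOM.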

Could support for LLAMA2 be added? Thanks

As the title says,
LLAMA2: https://github.com/facebookresearch/llama

flash_attn error when fine-tuning moss7B with collie on a 3090

RuntimeError: FlashAttention backward for head dim > 64 requires A100 or H100 GPUs as the implementation needs a large amount of shared memory.

Is bf16 supported?

Training loss is NaN

I am using pipeline parallelism for supervised fine-tuning of the Moss-7B base model on 4xV100, but the loss is NaN throughout training. What could be the cause?

My configuration is as follows:

"""
Fine-tune the Moss-base model with CoLLie
"""
import sys
sys.path.append('..')
import torch
from transformers import AutoTokenizer

from collie.config import CollieConfig

from collie.data import CollieDatasetForTraining
from collie.controller.trainer import Trainer
from collie.controller.evaluator import EvaluatorForPerplexity, EvaluatorForGeneration
from collie.models.moss import MossForCausalLM
from collie.utils.monitor import StepTimeMonitor, TGSMonitor, MemoryMonitor, LossMonitor, EvalMonitor
from collie.metrics import DecodeMetric, PPLMetric
from collie.module import GPTLMLoss

# 1. Set paths
# 1.1 Path to the pretrained model
pretrained_model = "/pretrained_weights/moss-base-7b"

# 2. Set up the configuration
# 2.1 Load the configuration
config = CollieConfig.from_pretrained(pretrained_model, trust_remote_code=True,
                                      local_files_only=True)
config.tp_size = 1
config.dp_size = 1
config.pp_size = 4
config.use_flash = False
config.train_epochs = 1
config.eval_per_n_steps = 0
config.eval_per_n_epochs = 1 
config.train_micro_batch_size = 1
config.eval_batch_size = 1
config.gradient_accumulation_steps = 4

# 3. Set up the tokenizer
tokenizer = AutoTokenizer.from_pretrained(pretrained_model,
                                          trust_remote_code=True,
                                          local_files_only=True)

# 4. Load the dataset
train_dataset = [
    {
        'input': 'Collie is a python package for ',
        'output': 'finetuning large language models.'
    } for _ in range(10000)
]
train_dataset = CollieDatasetForTraining(train_dataset, tokenizer)
eval_dataset = train_dataset[:32]

# 5. Load the pretrained model
model = MossForCausalLM.from_pretrained(pretrained_model, config=config)

# 6. Set up the optimizer
# optimizer = Lomo(
#     model,
#     lr = 0.001,
#     clip_grad_norm = 5.0
# )

# Lomo is incompatible with pp (pipeline parallelism)
optimizer = torch.optim.AdamW(model.parameters(), lr=9e-6)

# 7. Add monitors
monitors = [
    StepTimeMonitor(config),
    TGSMonitor(config),
    MemoryMonitor(config),
    LossMonitor(config),
    EvalMonitor(config)
]

# 8. Add evaluators
evaluator_ppl = EvaluatorForPerplexity(
    model = model,
    config = config,
    dataset = eval_dataset,
    monitors = [
        EvalMonitor(config)
    ],
    metrics = {
        'ppl': PPLMetric()
    }
)
evaluator_decode = EvaluatorForGeneration(
    model = model,
    config = config,
    tokenizer = tokenizer,
    dataset = eval_dataset,
    monitors = [
        EvalMonitor(config)
    ],
    metrics = {
        'decode': DecodeMetric()
    }

)

# 9. Instantiate the trainer
trainer = Trainer(
    model = model,
    config = config,
    loss_fn = GPTLMLoss(-100),
    optimizer = optimizer,
    train_dataset = train_dataset,
    monitors = monitors,
    evaluators = [evaluator_ppl, evaluator_decode]
)
# 10. Train / evaluate
trainer.train()

# CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes=1 --nproc_per_node=4 finetune_moss_base.py

The output is as follows:

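Error when loading the model after replacing the tokenizer

I used the example code from the README with 8 GPUs, replaced the model with llama2-70B and the tokenizer with chinese-llama2, so the vocabulary size changes from 32000 to 55296.
I added one line: model.resize_token_embeddings(len(tokenizer))
Because an error reported mismatched dimensions, I crudely changed [start_pos_new:end_pos_new, :] to [start_pos_new:end_pos_new] (maybe this is where I went wrong?).
After that, training and saving the model work normally, but loading the saved model raises: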

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 493, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2903, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3310, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
        size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([256000, 8192]) from checkpoint, the shape in current model is torch.Size([55296, 8192]).
        You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.

After setting ignore_mismatched_sizes=True, the output is complete gibberish.
Since 256000 = 32000*8, I suspect the embedding expansion had already failed at that point.
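
The arithmetic behind that suspicion can be checked quickly. The sketch below only illustrates the hypothesis (each of the 8 ranks contributed an unresized, unsharded 32000-row embedding and the saved checkpoint concatenated them); it does not reproduce collie's actual saving code.

```python
# Illustration of the hypothesis in the post; not collie's saving logic.
OLD_VOCAB = 32000      # original llama2 vocabulary size
NEW_VOCAB = 55296      # chinese-llama2 vocabulary size
WORLD_SIZE = 8         # number of GPUs/ranks used
CKPT_ROWS = 256000     # rows of embed_tokens.weight found in the checkpoint

# If every rank contributed a full, unresized embedding, concatenation gives:
rows_if_unresized_copies = OLD_VOCAB * WORLD_SIZE

# If the resized embedding were sharded evenly, each rank would hold:
rows_per_rank_if_resized = NEW_VOCAB // WORLD_SIZE

print("matches checkpoint rows:", rows_if_unresized_copies == CKPT_ROWS)
print("rows per rank after a correct resize:", rows_per_rank_if_resized)
```

If the hypothesis holds, the saved tensor never reflected the resized vocabulary, which would also explain the gibberish output when the mismatch is ignored.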

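[BUG] When using CollieDatasetForClassification for HELM-style classification evaluation, the max new token slicing is wrong

When using CollieDatasetForTraining for HELM-style classification evaluation, collie slices out max new token tokens for evaluation even if the model's generation is shorter than max new token. The position of the slice is also wrong.

For example, when evaluating on MMLU the model only outputs one of the options A, B, C, D, so this error occurs whenever max new token > 1.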
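
One way to avoid scoring padding is to cut the generated part at the first end-of-sequence token instead of always taking max new token tokens. This is a minimal sketch under assumptions: `extract_generated`, `prompt_len` and `eos_id` are hypothetical names, the prompt is assumed to occupy the first `prompt_len` positions, and none of this is collie's evaluator code.

```python
from typing import List

def extract_generated(output_ids: List[int], prompt_len: int,
                      max_new_tokens: int, eos_id: int) -> List[int]:
    """Return only the tokens the model actually generated.

    Hypothetical helper: slices after the prompt, caps at max_new_tokens,
    and stops at the first EOS so trailing padding is never scored.
    """
    generated = output_ids[prompt_len:prompt_len + max_new_tokens]
    if eos_id in generated:
        generated = generated[:generated.index(eos_id)]
    return generated

# Prompt of 3 tokens; the model answers with one token (7), then EOS (2) and padding (0).
ids = [11, 12, 13, 7, 2, 0, 0]
answer = extract_generated(ids, prompt_len=3, max_new_tokens=4, eos_id=2)
print(answer)
```

With left-padded batches the prompt offset would differ per sample, so `prompt_len` would have to account for the padding.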

Can tensor parallelism and pipeline parallelism be used together with LoRA? It raises ValueError: Target module ColumnParallelLinearWithoutBias() is not supported. Currently, only `torch.nn.Linear` and `Conv1D` are supported.

ImportError: cannot import name 'PeftConfig' from 'peft.utils'

Thank you for your excellent work! When running the code, the following appears:
[2023-09-09 14:50:43,666] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
False
/home/a40-01/anaconda3/envs/py38/lib/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
'CUDASetup' object has no attribute 'cuda_available'
Traceback (most recent call last):
  File "test.py", line 8, in <module>
    from collie.config import CollieConfig
  File "/home/jdyy-a40-01/anaconda3/envs/py38/lib/python3.8/site-packages/collie/__init__.py", line 3, in <module>
    from .config import CollieConfig
  File "/home/jdyy-a40-01/anaconda3/envs/py38/lib/python3.8/site-packages/collie/config.py", line 6, in <module>
    from peft.utils import PeftConfig, PeftType

Is this a CUDA version problem? Thanks!

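Is the attention score computation of llama under tensor parallelism incorrect?

From what I see in the code, llama's attention score is computed by directly multiplying query and key, but under tensor parallelism the query and key should both already be split. Multiplying them directly means the attention score is not computed on the full query and key. Isn't that incorrect?
I hope someone can clarify this!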
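
For reference, in Megatron-style tensor parallelism the query and key projections are split along the head dimension, and each head's scores depend only on that head's own query and key. The toy check below, in plain Python, assumes such a head-wise split; it is not taken from collie's LLaMA code.

```python
# Toy check: per-head attention scores are identical whether all heads live on
# one device or are split across tensor-parallel ranks (head-wise split assumed).

def scores(q_head, k_head):
    """Raw attention scores q @ k^T for one head (lists of token vectors)."""
    return [[sum(a * b for a, b in zip(q_tok, k_tok)) for k_tok in k_head]
            for q_tok in q_head]

# 2 heads, 2 tokens, head_dim 2: q[h][t] is the query vector of token t in head h.
q = [[[1, 0], [0, 1]], [[2, 1], [1, 3]]]
k = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]

# Single device: compute every head.
full = [scores(q[h], k[h]) for h in range(2)]

# Two tensor-parallel ranks: rank r only holds head r.
rank0 = scores(q[0], k[0])
rank1 = scores(q[1], k[1])

print(full == [rank0, rank1])
```

What would be incorrect is a split inside a head (along head_dim) without a reduction across ranks; a head-wise split avoids that because no head is ever divided.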

Training fails without an error message

I followed the steps in the README and only changed the model to llama-2-70B.
Output:

Training Epoch: 0 / 1     0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:00 / -:--:--
Training Batch: 0 / 157   0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:00 / -:--:--  /usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly.
The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use
use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
Training Epoch: 0 / 1     0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:07 / -:--:--
Training Batch: 0 / 157   0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:07 / -:--:--  [2023-08-08 03:08:43,774] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -7) local_rank: 0 (pid: 51194) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0.dev20230725+cu121', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 788, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
collie.py FAILED
-----------------------------------------------------
Failures:
[1]:
  time      : 2023-08-08_03:08:43
  host      : 58283303bbb0
  rank      : 1 (local_rank: 1)
  exitcode  : -7 (pid: 51195)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 51195
[2]:
  time      : 2023-08-08_03:08:43
  host      : 58283303bbb0
  rank      : 2 (local_rank: 2)
  exitcode  : -7 (pid: 51196)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 51196
[3]:
  time      : 2023-08-08_03:08:43
  host      : 58283303bbb0
  rank      : 3 (local_rank: 3)
  exitcode  : -7 (pid: 51197)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 51197
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-08_03:08:43
  host      : 58283303bbb0
  rank      : 0 (local_rank: 0)
  exitcode  : -7 (pid: 51194)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 51194
=====================================================

After restarting the container, it went back to normal.
After restarting again, I switched to 8 GPUs: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:29402 --nnodes=1 --nproc_per_node=8
Output:

Training Epoch: 0 / 1    0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:00 / -:--:--
Training Batch: 0 / 79   0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:00 / -:--:--  /usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly.
The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use
use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
Training Epoch: 0 / 1    0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:13 / -:--:--
Training Batch: 0 / 79   0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:13 / -:--:--  [2023-08-08 03:20:09,633] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 96 closing signal SIGTERM
[2023-08-08 03:20:09,639] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 98 closing signal SIGTERM
[2023-08-08 03:20:09,641] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 99 closing signal SIGTERM
[2023-08-08 03:20:09,643] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 100 closing signal SIGTERM
[2023-08-08 03:20:09,645] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 101 closing signal SIGTERM
[2023-08-08 03:20:09,648] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 102 closing signal SIGTERM
[2023-08-08 03:20:11,395] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -7) local_rank: 0 (pid: 95) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0.dev20230725+cu121', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 788, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
==================================================
collie.py FAILED
--------------------------------------------------
Failures:
[1]:
  time      : 2023-08-08_03:20:09
  host      : 06a78451e09d
  rank      : 2 (local_rank: 2)
  exitcode  : -7 (pid: 97)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 97
--------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-08_03:20:09
  host      : 06a78451e09d
  rank      : 0 (local_rank: 0)
  exitcode  : -7 (pid: 95)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 95
==================================================

How should I debug this kind of situation?

Which version of megatron is used?

│ /data/conda/usr/llm/envs/llama_etuning/lib/python3.10/site-packages/collie/models/llama/m │
│ odel.py:235 in __init__                                                                   │
│                                                                                           │
│   232 │                                                                                   │
│   233 │   def __init__(self, config: CollieConfig) -> None:                               │
│   234 │   │   super().__init__(config)                                                    │
│ ❱ 235 │   │   self.embed_tokens = tensor_parallel.VocabParallelEmbedding(                 │
│   236 │   │   │   self.collie_config.vocab_size,                                          │
│   237 │   │   │   self.collie_config.hidden_size                                          │
│   238 │   │   )

VocabParallelEmbedding.__init__() missing 2 required keyword-only arguments: 'init_method' and 'config'

chatGLM2 raises an error with tensor parallelism

dev branch:
collie/examples/finetune_chatglm2_for_summary.py
Configuration:
config.pp_size = 1
config.tp_size = 8
Error:
Traceback (most recent call last):
  File "finetune_chatglm2_for_summary.py", line 81, in <module>
    model = ChatGLM2ForCausalLM.from_pretrained(args.model_path, config=config)
  File "/home/sse-ard/merlin.xie/proj/dev/collie/collie/models/base.py", line 332, in from_pretrained
    set_module_tensor_to_device(
  File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/modeling.py", line 285, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([65024, 4096]) in "weight" (which has shape torch.Size([8128, 4096])), this look incorrect.
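
The two shapes in the error are related by the tensor-parallel degree: 65024 / 8 = 8128, so the full embedding is being assigned to a parameter that expects a single rank's shard. Below is a minimal sketch of the row range each rank would own, assuming an even vocabulary split; the helper name `vocab_range` is hypothetical and not part of collie.

```python
def vocab_range(vocab_size: int, tp_size: int, tp_rank: int) -> range:
    """Rows of the embedding owned by tp_rank, assuming an even split."""
    if vocab_size % tp_size != 0:
        raise ValueError("vocab_size must be divisible by tp_size in this sketch")
    per_rank = vocab_size // tp_size
    return range(tp_rank * per_rank, (tp_rank + 1) * per_rank)

rows = vocab_range(vocab_size=65024, tp_size=8, tp_rank=0)
print(len(rows), rows.start, rows.stop)
```

A loader that slices the full weight with such a range before assigning it would produce a [8128, 4096] tensor per rank, matching the parameter shape in the message.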

[Feature] Could an InternLM example be added to examples?

The documentation and code show that InternLM is supported, but the examples directory has no InternLM example.

UnboundLocalError: local variable 'master_addr' referenced before assignment

config = CollieConfig.from_pretrained("decapoda-research/llama-7b-hf")
model = LlamaForCausalLM.from_pretrained(model, config=config)

optimizer = AdamW(model.parameters(), lr=learning)

model, optimizer = accelerator.prepare(
    model, optimizer
)

Evaluating is too slow

Hi there. I find that my evaluation stage is too slow, as shown in the figure below (training takes 1:43 for 10 steps, while evaluation has already taken 3:43 for only 1 step). What might cause this kind of issue?

The config and trainer are listed below (using the latest dev branch):

# CollieConfig
pp_size: 1
tp_size: 1
dp_size: 4
train_micro_batch_size: 4
eval_batch_size: 4
dataloader_num_workers: 4
eval_per_n_epochs: 1
eval_per_n_steps: 0
ds_config:
    train_micro_batch_size_per_gpu: 4
    zero_optimization:
        stage: 2

# trainer
evaluators = [
    Evaluator(
        model=model,
        config=collie_cfg,
        dataset=dataset_val,
        collate_fn=data_collator,
        tokenizer=tokenizer,
        monitors=[
            EvalMonitor(collie_cfg)
        ],
    )
]
trainer = Trainer(
    config=collie_cfg,
    model=model,
    optimizer=optimizer,
    lr_scheduler=lr_scheduler,
    train_dataset=dataset_trn,
    train_dataset_collate_fn=data_collator,
    evaluators=evaluators,
)

What is Collie's open-source license? Could a license file be added?

What is Collie's open-source license? Could a license file be added? Thanks!

__init__() missing 'init_method' and 'config'

base_model: llama-7b
How it was run: CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes=1 --nproc_per_node=4 train.py (the train.py file in the adalomo directory)

When loading the model with collie's LlamaForCausalLM, deepspeed's partition_parameters.py raises the following error:
__init__() missing 2 required keyword-only arguments: 'init_method' and 'config'

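Problem with the lr_scheduler setting

I tried setting an lr_scheduler in finetune_moss_for_training.py and passing it to the trainer. Looking at ds_logs, the learning rate stays constant at its maximum value, which puzzles me.
How should I change this? Many thanks!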

[QUESTION]Multi-node multi-gpu training

Hello, I am currently fine-tuning a Llama-2-70B model on 3*8*A100 (40G) and noticed that each of the following commands runs out of memory (OOM) at startup:

(1) WORLD_SIZE=24 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun \
--nproc_per_node=8 \
--master_addr=192.168.0.6 \
--master_port=9901 \
finetune_llama_ptuning.py

(2) python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=3 \
    --node_rank=0 \
    --master_addr=192.168.0.6 \
    --master_port=9901 \
    finetune_llama_ptuning.py
Note: change --node_rank to 1 and 2 on the other two machines.

(3) deepspeed --num_gpus 8     --num_nodes 3     --master_addr 192.168.0.6     --master_port 9901     --hostfile /mnt/download/configs/hostfile.txt     finetune_llama_ptuning.py     --deepspeed /mnt/git/configs/deepspeed/deepspeed_zero2.json

Could you share the general multi-node, multi-GPU training command used for this project? I turned to collie to solve an OOM error I had encountered earlier, and I would like to stay as close as possible to the official configuration to get better results, since the OOM still occurs at the moment. Thanks a lot!
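
For orientation, with torchrun every node runs the same command with the same --nnodes, --master_addr and --master_port but a different --node_rank. Under static rendezvous the global rank of each process then follows from the node rank and the local rank. The sketch below does that bookkeeping for the 3 x 8 setup described above; it is illustrative only and does not address the OOM itself.

```python
NNODES = 3            # machines
NPROC_PER_NODE = 8    # GPUs (processes) per machine

def global_rank(node_rank: int, local_rank: int) -> int:
    """Global rank under static rendezvous: nodes are ordered by --node_rank."""
    return node_rank * NPROC_PER_NODE + local_rank

world_size = NNODES * NPROC_PER_NODE
print("world size:", world_size)
for node in range(NNODES):
    ranks = [global_rank(node, local) for local in range(NPROC_PER_NODE)]
    print(f"node {node}: ranks {ranks[0]}..{ranks[-1]}")
```

Note that command (1) above sets WORLD_SIZE=24 in the environment but passes neither --nnodes nor --node_rank, so the launcher has no way to tell the three machines apart.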

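Can a model be trained from scratch?

Is it possible to train from scratch, without the pretrained tokenizer and model, and produce a new model of my own? Would that work?

collie and lomo are incompatible

Using lomo gives the following error: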

  /home/xx/collie/examples/alpaca/train.py(150)<module>()
-> trainer.train()
  /home/xx/collie/collie/controller/trainer.py(325)train()
-> loss = self.train_fn(self, batch, self.global_batch_idx)
> /home/lzy/collie/collie/controller/trainer.py(407)train_fn()
-> trainer.engine.optimizer.get_param_coordinator(training=True).reset_step()
<class 'AttributeError'> 'NoneType' object has no attribute 'get_param_coordinator'

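How many GB of memory do the A100s mentioned in the project README have?

The minimum number of GPUs (A100) each model requires when using the Adam optimizer

Could fine-tuning support for baichuan-2 be added? Or could you provide a guide on how to add a new model for fine-tuning? Thanks

As the title says, thanks!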

baichuan-2: https://github.com/baichuan-inc/Baichuan2

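Alternatively, could you provide a guide on how to add a new model for fine-tuning? Users who need one could then follow the guide and add it themselves.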

Support for LLaMA-2 70B with Grouped-Query Attention

Due to the Grouped-Query Attention introduced in LLaMA-2 70B (see the llama issue), it cannot be loaded into the collie implementation of LLaMA. I hope LLaMA-2 70B can be supported in collie. Thanks.

Traceback (most recent call last):
  File "/nvme1/gptdata/share1/projects/collie/examples/download.py", line 49, in <module>
    model = LlamaForCausalLM.from_pretrained(model_name, config=config)
  File "/nvme1/gptdata/share1/app/mambaforge/envs/collie/lib/python3.9/site-packages/collie/models/base.py", line 306, in from_pretrained
    state_dict = cls.load_parallel_state_dict(
  File "/nvme1/gptdata/share1/app/mambaforge/envs/collie/lib/python3.9/site-packages/collie/models/llama/model.py", line 414, in load_parallel_state_dict
    part_state_dict[key] = rearrange(
RuntimeError: shape '[8192, 8192]' is invalid for input of size 8388608
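
The numbers in this error are consistent with grouped-query attention. LLaMA-2 70B uses a hidden size of 8192 with 64 query heads but only 8 key/value heads, so its key and value projections are smaller than the square matrix the loader tries to reshape to. A quick check of the arithmetic:

```python
# Shape arithmetic for LLaMA-2 70B key/value projections under grouped-query attention.
HIDDEN_SIZE = 8192
NUM_HEADS = 64        # query heads
NUM_KV_HEADS = 8      # key/value heads (GQA)

head_dim = HIDDEN_SIZE // NUM_HEADS      # 128
kv_rows = NUM_KV_HEADS * head_dim        # 1024

kv_elements = kv_rows * HIDDEN_SIZE      # size of k_proj / v_proj
square_elements = HIDDEN_SIZE ** 2       # size a [8192, 8192] reshape needs

print("k_proj/v_proj elements:", kv_elements)
print("elements needed for [8192, 8192]:", square_elements)
```

The first figure equals the "input of size 8388608" in the traceback, which suggests the failing tensor is a [1024, 8192] key or value projection.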

save_checkpoint

My code is:

every_n_epochs = 1
every_n_batches = 10
last = True
model_only = True
callbacks = [CheckpointCallback(save_path,
                                every_n_epochs=every_n_epochs,
                                every_n_batches=every_n_batches,
                                last=last,
                                model_only=model_only)]

# 9. Instantiate the trainer
trainer = Trainer(
    model = model,
    config = config,
    loss_fn = GPTLMLoss(-100),
    optimizer = optimizer,
    train_dataset = train_dataset,
    monitors = monitors,
    evaluators = [evaluator_ppl, evaluator_decode],
    callbacks=callbacks
)

The error is as follows:

AttributeError: 'DeepSpeedZeRoOffload' object has no attribute 'checkpoint_event_prologue'

ColumnParallelLinearWithoutBias is not supported by peft

When using the InternLM model with LoRA and a tensor parallelism degree of 2, it raises ValueError: Target module ColumnParallelLinearWithoutBias() is not supported. Currently, only `torch.nn.Linear` and `Conv1D` are supported.

tensor parallel + zero3 error

Can zero3 be used together with model parallelism? In my attempt I used

config.use_flash = False
config.tp_size = 4
config.ds_config = {
        "fp16": {
            "enabled": True
        },
        "zero_allow_untested_optimizer": True,
        "zero_force_ds_cpu_optimizer": False,
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "cpu",
                "pin_memory": False
            }
        },
        "monitor_config": {
            "enabled": True,
            "tag": "adan",
            "csv_monitor": {
                "enabled": True,
                "output_path": "./ds_logs/"
            }
        }
}

and got the following error:

Traceback (most recent call last):
  File "examples/alpaca/train.py", line 97, in <module>
    model.load_state_dict(state_dict)
  File "/home/xx/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
        size mismatch for layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for layers.0.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for layers.0.mlp.gate_proj.weight: copying a param with shape torch.Size([2752, 4096]) from checkpoint, the shape in current model is torch.Size([0]).

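Can a 7B Moss model trained with CoLLie not be loaded with Huggingface's AutoModelForCausalLM?

I downloaded fnlp/moss-base-7b from Huggingface and trained it with the MossForCausalLM class in CoLLie. Can the saved checkpoint no longer be used for generation with Huggingface's AutoModelForCausalLM?

Loading raises an error with the message:

ValueError: The state dictionary of the model you are trying to load is corrupted. Are you sure it was properly saved?

How can this be solved?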

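Can this project be used for continued pretraining of a model?

Thank you for your open-source contribution!
I know this project can be used for fine-tuning; can it also be used for pretraining?
I saw in the documentation that this project accepts datasets without labels. Does that mean pretraining is already implemented?
If so, how exactly should it be configured?

Could chatglm3 support be added?

As the title says, chatglm3 has been released.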

https://github.com/THUDM/ChatGLM3

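[BUG] Evaluation with parallelism may not traverse the dataset exactly once

My guess is that when the parallel size or batch size is set improperly (it does not evenly divide the dataset size), some data may be computed more than once.

[Question] About training visualization

Thank you for your contribution to the project!

Problem description

For the fine-tuning process, does collie offer a way to visualize training progress and curves in real time?

Looking forward to your reply!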
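
The guess can be illustrated with the padding scheme that distributed samplers commonly use: when the dataset size is not divisible by the number of ranks, indices are repeated so that every rank gets the same count. This is a simplified sketch modelled on that behaviour (as in PyTorch's DistributedSampler with drop_last=False, without shuffling); it is not collie's dataloader, and `shard_indices` is a hypothetical helper.

```python
import math

def shard_indices(num_samples: int, world_size: int):
    """Split indices across ranks, padding by wrapping around to the start."""
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    indices = list(range(num_samples))
    indices += indices[: total - num_samples]      # repeated samples used as padding
    return [indices[rank:total:world_size] for rank in range(world_size)]

shards = shard_indices(num_samples=10, world_size=4)
seen = [i for shard in shards for i in shard]
duplicates = sorted(i for i in set(seen) if seen.count(i) > 1)
print(shards)
print("evaluated:", len(seen), "duplicates:", duplicates)
```

With 10 samples on 4 ranks, 12 samples are evaluated and two of them are counted twice, so metrics averaged over all ranks can be slightly skewed unless the padded entries are dropped.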

[BUG] ImportError: cannot import name 'PeftConfig' from 'peft.utils'

Describe
During initial setup, while still in the testing phase and running the sample code without any modifications, I encountered "ImportError: cannot import name 'PeftConfig' from 'peft.utils'".
(There is almost no information on the web about this bug, which surprises me.)

To Reproduce

conda create -n collie python=3.10 or conda create -n collie python=3.8
pip install collie-lm -i https://pypi.tuna.tsinghua.edu.cn/simple/
cd collie/examples
run any py code file: python finetune_moss_for_training.py
(I tried to change the address of the model to a local address and make the appropriate changes, but I get the error without modifying the code, which should be an environment configuration issue.)

System info

OS: Ubuntu 20.04.6 LTS
Configured as 3 groups with 8*A100 graphics cards (total of 24 A100-40G graphics cards)
Python = 3.10
Others:

Package                  Version
------------------------ ----------
absl-py                  1.4.0
accelerate               0.22.0
anyio                    3.7.1
appdirs                  1.4.4
beautifulsoup4           4.12.2
bitsandbytes             0.41.1
cachetools               5.3.1
certifi                  2023.7.22
charset-normalizer       3.2.0
click                    8.1.7
cmake                    3.27.2
collie-lm                1.0.3
deepspeed                0.10.2
docker-pycreds           0.4.0
einops                   0.6.1
exceptiongroup           1.1.3
fastapi                  0.103.1
filelock                 3.12.3
fsspec                   2023.9.0
gitdb                    4.0.10
GitPython                3.1.34
google                   3.0.0
google-auth              2.22.0
google-auth-oauthlib     1.0.0
grpcio                   1.57.0
hjson                    3.1.0
huggingface-hub          0.16.4
idna                     3.4
Jinja2                   3.1.2
lit                      16.0.6
Markdown                 3.4.4
markdown-it-py           3.0.0
MarkupSafe               2.1.3
mdurl                    0.1.2
megatron-core            0.2.0
mpmath                   1.3.0
networkx                 3.1
ninja                    1.11.1
numpy                    1.25.2
nvidia-cublas-cu11       11.10.3.66
nvidia-cuda-cupti-cu11   11.7.101
nvidia-cuda-nvrtc-cu11   11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11        8.5.0.96
nvidia-cufft-cu11        10.9.0.58
nvidia-curand-cu11       10.2.10.91
nvidia-cusolver-cu11     11.4.0.1
nvidia-cusparse-cu11     11.7.4.91
nvidia-nccl-cu11         2.14.3
nvidia-nvtx-cu11         11.7.91
oauthlib                 3.2.2
packaging                23.1
pandas                   2.1.0
pathtools                0.1.2
peft                     0.5.0
pip                      23.2.1
protobuf                 3.20.1
psutil                   5.9.5
py-cpuinfo               9.0.0
pyasn1                   0.5.0
pyasn1-modules           0.3.0
pydantic                 1.10.12
Pygments                 2.16.1
python-dateutil          2.8.2
pytz                     2023.3
PyYAML                   6.0.1
regex                    2023.8.8
requests                 2.31.0
requests-oauthlib        1.3.1
rich                     13.5.2
rsa                      4.9
safetensors              0.3.3
scipy                    1.11.2
sentencepiece            0.1.99
sentry-sdk               1.30.0
setproctitle             1.3.2
setuptools               68.0.0
six                      1.16.0
smmap                    5.0.0
sniffio                  1.3.0
soupsieve                2.5
starlette                0.27.0
sympy                    1.12
tensorboard              2.14.0
tensorboard-data-server  0.7.1
tokenizers               0.13.3
torch                    2.0.1
tqdm                     4.66.1
transformers             4.32.1
triton                   2.0.0
typing_extensions        4.7.1
tzdata                   2023.3
urllib3                  1.26.16
wandb                    0.15.9
websockets               11.0.3
Werkzeug                 2.3.7
wheel                    0.38.4

Bug Report

[2023-09-04 13:51:44,714] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/mnt/git/collie/examples/finetune_moss_for_training.py", line 8, in <module>
    from collie.config import CollieConfig
  File "/mnt/anaconda/envs/collie/lib/python3.10/site-packages/collie/__init__.py", line 3, in <module>
    from .config import CollieConfig
  File "/mnt/anaconda/envs/collie/lib/python3.10/site-packages/collie/config.py", line 6, in <module>
    from peft.utils import PeftConfig, PeftType
ImportError: cannot import name 'PeftConfig' from 'peft.utils' (/mnt/anaconda/envs/collie/lib/python3.10/site-packages/peft/utils/__init__.py)
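
The traceback above suggests that the installed peft release no longer exports PeftConfig from peft.utils. A minimal sketch of a defensive import follows. It assumes the class is still reachable from some other module path, such as the top-level peft package; import_first is a hypothetical helper, not part of Collie or peft.

```python
import importlib


def import_first(name, module_paths):
    """Return attribute `name` from the first module in `module_paths` that provides it."""
    for path in module_paths:
        try:
            module = importlib.import_module(path)
        except ImportError:
            # Module is not installed or not importable: try the next candidate.
            continue
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError(f"cannot import {name!r} from any of {list(module_paths)}")


# Intended use (assumes peft is installed; the candidate paths are guesses):
# PeftConfig = import_first("PeftConfig", ["peft.utils", "peft"])

# Demonstration with the standard library so the sketch runs anywhere:
sqrt = import_first("sqrt", ["json", "math"])
print(sqrt(9.0))
```

Pinning peft to the version Collie was developed against would be the other obvious route, but the issue does not say which version that is.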

save_16bit_model does not save the proper state_dict

Dear authors,

On V100, the torch.cat-based implementation for saving the state_dict is memory-hungry and can cause OOM for LLaMA 7B when gathering weights onto a single GPU, so I am trying to save the state_dict with trainer.engine.save_16bit_model instead. However, loading that state_dict through the standard Hugging Face from_pretrained interface raises ValueError: The state dictionary of the model you are trying to load is corrupted. Are you sure it was properly saved? Does save_parallel_state_dict in the Collie implementation of LLaMA use the same state_dict keys as the Hugging Face save_pretrained implementation for LLaMA?
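
One way to narrow down a key mismatch like this is to diff the saved keys against the keys the target model expects. The sketch below is only an illustration: the key names are made up, and diff_state_dict_keys is a hypothetical helper, not a Collie or transformers function.

```python
def diff_state_dict_keys(saved_keys, expected_keys):
    """Report keys the model expects but the checkpoint lacks, and vice versa."""
    saved, expected = set(saved_keys), set(expected_keys)
    return {
        "missing": sorted(expected - saved),
        "unexpected": sorted(saved - expected),
    }


# Illustrative keys only; real ones would come from
# torch.load(path, map_location="cpu").keys() and model.state_dict().keys().
expected = ["model.embed_tokens.weight", "lm_head.weight"]
saved = ["module.model.embed_tokens.weight", "module.lm_head.weight"]

report = diff_state_dict_keys(saved, expected)
print("missing:", report["missing"])
print("unexpected:", report["unexpected"])
```

If every expected key shows up as missing while a prefixed variant shows up as unexpected, the checkpoint probably needs a key rename rather than a re-save.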

Could Lomo class support `param_groups`?

As a subclass of torch.optim.Optimizer, could collie.optim.Lomo support param_groups by calling super().__init__(params, defaults)?
That would let us use more schedulers and apply per-parameter options to exclude modules that should not use weight_decay.
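
For context, the per-parameter format that torch.optim.Optimizer subclasses accept is a list of dicts, each with a "params" entry plus overrides. A minimal sketch of building such groups follows. build_param_groups and the keyword list are assumptions for illustration, and whether Lomo accepts this input is exactly what this issue asks.

```python
def build_param_groups(named_params, weight_decay, no_decay_keywords=("bias", "norm")):
    """Split (name, param) pairs into a decayed group and a non-decayed group."""
    decay, no_decay = [], []
    for name, param in named_params:
        if any(keyword in name.lower() for keyword in no_decay_keywords):
            no_decay.append(param)
        else:
            decay.append(param)
    groups = [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
    # Drop empty groups so the optimizer never receives an empty "params" list.
    return [group for group in groups if group["params"]]


# Strings stand in for tensors here; in practice pass model.named_parameters().
toy_params = [
    ("layers.0.mlp.weight", "W0"),
    ("layers.0.mlp.bias", "b0"),
    ("layers.0.input_layernorm.weight", "g0"),
]
for group in build_param_groups(toy_params, weight_decay=0.01):
    print(group)
```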

Running examples/alpaca/train.py on V100 fails with No module named 'petrel_client'. Does anyone know how to fix this?

Command used
CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:29402 --nnodes=1 --nproc_per_node=4 train.py
Error message
[INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
  File "train.py", line 67, in <module>
    state_dict = LlamaForCausalLM.load_parallel_state_dict(
  File "/home/collie/collie/examples/alpaca/../../collie/models/llama/model.py", line 365, in load_parallel_state_dict
    if not io_driver.exists(path):
  File "/home/collie/collie/examples/alpaca/../../collie/driver/io/petrel.py", line 76, in exists
    from petrel_client.client import Client
ModuleNotFoundError: No module named 'petrel_client'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 4188) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0a0+fe05266', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
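
The traceback shows that the example goes through the petrel I/O driver, which needs the optional petrel_client package. A minimal sketch of choosing the protocol at runtime follows. It assumes that load_parallel_state_dict also accepts a local-filesystem protocol named "file" and a local checkpoint path; both are assumptions about Collie's API, and pick_io_protocol is a hypothetical helper.

```python
import importlib.util


def pick_io_protocol(optional_module="petrel_client"):
    """Return "petrel" only if the optional client is importable, else "file"."""
    if importlib.util.find_spec(optional_module) is not None:
        return "petrel"
    # Assumption: "file" is the name of the local-filesystem protocol.
    return "file"


protocol = pick_io_protocol()
print("using protocol:", protocol)

# Intended use inside train.py (the local path is a placeholder):
# state_dict = LlamaForCausalLM.load_parallel_state_dict(
#     path="/path/to/llama-7b-hf", config=config,
#     protocol=protocol, format="hf")
```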

DeepSpeed initialization for models from the transformers library

Dear authors,

I found that Collie cannot initialize DeepSpeed when using models from the transformers library. For example, when I replace this line of the script with the transformers from_pretrained interface, which cannot accept a CollieConfig, even the monitors cannot be registered correctly because DeepSpeed is not initialized (DeepSpeed backend not set, please initialize it using init_process_group()). Is there a workaround for this, or can Collie only train its internally reimplemented models?

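Question about the "support llama2" change on the dev branch: can llama-2 and llama-1 both be supported at the same time?

A question about this change on the dev branch, "support llama2":
https://github.com/OpenLMLab/collie/commit/5adf2a368dcb107ef5685057c0380eeaad10c1a2

Can llama-2 and llama-1 both be supported at the same time? As users, we would very much like both to be supported.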

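Besides enabling checkpointing, what else can reduce GPU memory usage under dp+tp+pp?

I am running the llama-65b model in dp+tp+pp mode on 80 V100 GPUs with tp_size=4, pp_size=20, and dp=1. After the model and optimizer are loaded, each GPU uses about 15 GB of memory. I then set batch_size=2 and gradient_accumulation_steps=160. The batch_size is already very small, yet once training starts the most heavily loaded GPU reaches 30 GB, and even with checkpointing set to true it still needs 28 GB. I would not expect activations alone to take this much memory. As a result I cannot raise batch_size, the GPU bubble rate is rather high, and training is slower than plain ZeRO stage 3.
So I would like to ask: is there any way to reduce memory usage further?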
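
For reference, the idle ("bubble") share of a pipeline schedule is commonly estimated as (p - 1) / (m + p - 1) for p stages and m micro-batches per optimizer step. The sketch below applies that idealized estimate. It assumes equal-time stages and that gradient_accumulation_steps corresponds to the number of micro-batches, neither of which the post confirms.

```python
def pipeline_bubble_fraction(pp_size, num_micro_batches):
    """Idealized share of time a pipeline stage sits idle in one step."""
    return (pp_size - 1) / (num_micro_batches + pp_size - 1)


# pp_size=20 as in the post; the micro-batch counts are for comparison only.
for num_micro_batches in (20, 40, 160):
    fraction = pipeline_bubble_fraction(20, num_micro_batches)
    print(f"micro-batches={num_micro_batches:4d}  bubble={fraction:.3f}")
```

Under this estimate the bubble shrinks as more micro-batches are in flight per step, so a large remaining slowdown would point at something else, such as communication or uneven stage load.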

Multinode training

I'm trying to train LLaMA 30B on 2 nodes with 4 GPUs each. I launch the script with the following command, then join from the second node by changing node_rank:

torchrun --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr=172.16.100.114 --master_port=29500 train.py --backend=nccl --use_syn

But the script does not proceed after the connection is established; it freezes as shown below.

Why are the pretrained models loaded from different paths? Is there some other consideration behind this?

Why are the pretrained models loaded from different paths? For example:

1)

https://github.com/OpenLMLab/collie/blob/dev/examples/alpaca/train.py
 1.1 Pretrained model path
pretrained_model = 'decapoda-research/llama-7b-hf'
......
 5. Load the pretrained model
model = LlamaForCausalLM(config)
state_dict = LlamaForCausalLM.load_parallel_state_dict(
    path="hdd:s3://opennlplab_hdd/models/llama/llama-7b-hf",
    config=config,
    protocol="petrel",
    format="hf"
)

2)

https://github.com/OpenLMLab/collie/blob/dev/examples/finetune_llama_for_classification.py
config = CollieConfig.from_pretrained("decapoda-research/llama-7b-hf")
...
model = LlamaForCausalLM.from_pretrained("/mnt/petrelfs/zhangshuo/model/llama-7b-hf", config=config)

There are other examples as well; I won't list them here.

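3D parallelism hangs when the number of nodes is greater than 1

I am using Collie to run 3d_parallelism with the 3d_parallelism.py script from one_sentence_overfitting under legacy, with tp_size=4 and pp_size=4 on 16 GPUs. The processes with pp_rank=0 and pp_rank=3 hang just before optimizer.step. With 8 GPUs, i.e. without crossing nodes, nothing hangs. What could be the cause?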

How to convert parallel state_dict to normal state_dict?

Hi there! I saved a parallel state_dict (only parameters with requires_grad=True) on 8 GPUs remotely. How can I load these state_dicts and save them as a single one locally? Thanks in advance.

collie_dp0_pp0_tp0.pt  collie_zero_dp0_pp0_tp0.pt  collie_zero_dp2_pp0_tp0.pt  collie_zero_dp4_pp0_tp0.pt  collie_zero_dp6_pp0_tp0.pt
collie.json            collie_zero_dp1_pp0_tp0.pt  collie_zero_dp3_pp0_tp0.pt  collie_zero_dp5_pp0_tp0.pt  collie_zero_dp7_pp0_tp0.pt
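
As a first step toward merging, the shard files can be grouped by the dp/pp/tp ranks encoded in their names. The sketch below only parses the filenames listed above and does not load or merge any tensors. Reading the zero_ prefix as ZeRO-partitioned shards is a guess, and group_shards is a hypothetical helper.

```python
import re
from collections import defaultdict

# Matches names such as collie_dp0_pp0_tp0.pt and collie_zero_dp3_pp0_tp0.pt.
SHARD_PATTERN = re.compile(r"^collie_(zero_)?dp(\d+)_pp(\d+)_tp(\d+)\.pt$")


def group_shards(filenames):
    """Group shard files by kind and sort them by (dp, pp, tp) rank."""
    groups = defaultdict(list)
    for filename in filenames:
        match = SHARD_PATTERN.match(filename)
        if match is None:
            continue  # e.g. collie.json
        kind = "zero" if match.group(1) else "model"
        dp, pp, tp = (int(match.group(i)) for i in (2, 3, 4))
        groups[kind].append((dp, pp, tp, filename))
    return {kind: sorted(items) for kind, items in groups.items()}


files = ["collie_dp0_pp0_tp0.pt", "collie.json"]
files += [f"collie_zero_dp{rank}_pp0_tp0.pt" for rank in range(8)]
for kind, items in group_shards(files).items():
    print(kind, [item[3] for item in items])
```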

Error when using the _ShardContainer data class

Using the _ShardContainer data class raises the error below. What is the cause? (dev branch)

ValueError: mmap closed or invalid

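Error when expanding the vocabulary of llama-2-7b

I am on the dev branch, using the script in examples/further_pretrain_llama. The command is

torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:29402 --nnodes=1 --nproc_per_node=8 expand_vocab.py

I only changed the llama paths, including those for config, tokenizer, and model.from_pretrained. The error is as follows: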

╭──────────────────────────── Traceback (most recent call last) ────────────────────────────╮
│ /d2/data/chuxiong/collie/examples/further_pretrain_llama/expand_vocab.py:85 in <module>   │
│                                                                                           │
│    82 │   model.get_input_embedding()[1].weight.requires_grad = True                      │
│    83 if model.get_lm_head()[1] is not None:                                              │
│    84 │   model.get_lm_head()[1].weight.requires_grad = True                              │
│ ❱  85 optimizer = torch.optim.AdamW(                                                      │
│    86 │   filter(lambda p: p.requires_grad, model.parameters()), lr=2e-4)                 │
│    87 lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(                          │
│    88 │   optimizer, T_max=config.train_epochs * len(train_dataset), eta_min=0)           │
│                                                                                           │
│ /d1/conda3/envs/scx_llm/lib/python3.10/site-packages/torch/optim/adamw.py:50 in __init__  │
│                                                                                           │
│    47 │   │   │   differentiable=differentiable,                                          │
│    48 │   │   │   fused=fused,                                                            │
│    49 │   │   )                                                                           │
│ ❱  50 │   │   super().__init__(params, defaults)                                          │
│    51 │   │                                                                               │
│    52 │   │   if fused:                                                                   │
│    53 │   │   │   if differentiable:                                                      │
│                                                                                           │
│ /d1/conda3/envs/scx_llm/lib/python3.10/site-packages/torch/optim/optimizer.py:187 in      │
│ __init__                                                                                  │
│                                                                                           │
│   184 │   │                                                                               │
│   185 │   │   param_groups = list(params)                                                 │
│   186 │   │   if len(param_groups) == 0:                                                  │
│ ❱ 187 │   │   │   raise ValueError("optimizer got an empty parameter list")               │
│   188 │   │   if not isinstance(param_groups[0], dict):                                   │
│   189 │   │   │   param_groups = [{'params': param_groups}]                               │
│   190                                                                                     │
╰───────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: optimizer got an empty parameter list
╭──────────────────────────── Traceback (most recent call last) ────────────────────────────╮
│ /d2/data/chuxiong/collie/examples/further_pretrain_llama/expand_vocab.py:77 in <module>   │
│                                                                                           │
│    74 # Prepare model, resize embeddings; train only embedding/lm_head to converge faster │
│    75 model = LlamaForCausalLM.from_pretrained(                                           │
│    76 │   "../../../llama-2-7b", config=config)                                           │
│ ❱  77 model.resize_token_embeddings(len(llama_tokenizer) + 7)  # round up                 │
│    78 for p in model.parameters():                                                        │
│    79 │   p.requires_grad = False                                                         │
│    80 # Under pipeline, embedding and lm_head are split across processes, so check        │
│                                                                                           │
│ /d1/conda3/envs/scx_llm/lib/python3.10/site-packages/collie/models/base.py:634 in         │
│ resize_token_embeddings                                                                   │
│                                                                                           │
│   631 │   │   │   │   │   │   = lm_head.bias.data[start_pos_old:end_pos_old]              │
│   632 │   │   │   │   if end_pos_new < (new_num_tokens // env.tp_size):                   │
│   633 │   │   │   │   │   initization_method = self.collie_config.initization_method      │
│ ❱ 634 │   │   │   │   │   if self.collie_config.initization_method_params is not None:    │
│   635 │   │   │   │   │   │   initization_method = initization_method(new_lm_head.weight[ │
│   636 │   │   │   │   │   │   │   │   │   │   │   │   │   │   │   │   **self.collie_confi │
│   637 │   │   │   │   │   │   if lm_head.bias is not None:                                │
│                                                                                           │
│ /d1/conda3/envs/scx_llm/lib/python3.10/site-packages/collie/config.py:206 in __getattr__  │
│                                                                                           │
│   203 │   │   self.model_config.save_pretrained(path)                                     │
│   204 │                                                                                   │
│   205 │   def __getattr__(self, name):                                                    │
│ ❱ 206 │   │   return getattr(self.model_config, name)                                     │
│   207 │                                                                                   │
│   208 │   def __setattr__(self, name: str, value: Any) -> None:                           │
│   209 │   │   if name in self.__annotations__.keys():                                     │
│                                                                                           │
│ /d1/conda3/envs/scx_llm/lib/python3.10/site-packages/transformers/configuration_utils.py: │
│ 261 in __getattribute__                                                                   │
│                                                                                           │
│   258 │   def __getattribute__(self, key):                                                │
│   259 │   │   if key != "attribute_map" and key in super().__getattribute__("attribute_ma │
│   260 │   │   │   key = super().__getattribute__("attribute_map")[key]                    │
│ ❱ 261 │   │   return super().__getattribute__(key)                                        │
│   262 │                                                                                   │
│   263 │   def __init__(self, **kwargs):                                                   │
│   264 │   │   # Attributes with defaults                                                  │
╰───────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'LlamaConfig' object has no attribute 'initization_method_params'
╭──────────────────────────── Traceback (most recent call last) ────────────────────────────╮
│ /d2/data/chuxiong/collie/examples/further_pretrain_llama/expand_vocab.py:77 in <module>   │
│                                                                                           │
│    74 # Prepare model, resize embeddings; train only embedding/lm_head to converge faster │
│    75 model = LlamaForCausalLM.from_pretrained(                                           │
│    76 │   "../../../llama-2-7b", config=config)                                           │
│ ❱  77 model.resize_token_embeddings(len(llama_tokenizer) + 7)  # round up                 │
│    78 for p in model.parameters():                                                        │
│    79 │   p.requires_grad = False                                                         │
│    80 # Under pipeline, embedding and lm_head are split across processes, so check        │
│                                                                                           │
│ /d1/conda3/envs/scx_llm/lib/python3.10/site-packages/collie/models/base.py:548 in         │
│ resize_token_embeddings                                                                   │
│                                                                                           │
│   545 │   │   │   │   │   = embedding.weight.data[start_pos_old:end_pos_old, :]           │
│   546 │   │   │   │   if end_pos_new < (new_num_tokens // env.tp_size):                   │
│   547 │   │   │   │   │   initization_method = self.collie_config.initization_method      │
│ ❱ 548 │   │   │   │   │   if self.collie_config.initization_method_params is not None:    │
│   549 │   │   │   │   │   │   initization_method = initization_method(new_embedding.weigh │
│   550 │   │   │   │   │   │   │   │   │   │   │   │   │   │   │   │   **self.collie_confi │
│   551 │   │   │   │   │   else:                                                           │
│                                                                                           │
│ /d1/conda3/envs/scx_llm/lib/python3.10/site-packages/collie/config.py:206 in __getattr__  │
│                                                                                           │
│   203 │   │   self.model_config.save_pretrained(path)                                     │
│   204 │                                                                                   │
│   205 │   def __getattr__(self, name):                                                    │
│ ❱ 206 │   │   return getattr(self.model_config, name)                                     │
│   207 │                                                                                   │
│   208 │   def __setattr__(self, name: str, value: Any) -> None:                           │
│   209 │   │   if name in self.__annotations__.keys():                                     │
│                                                                                           │
│ /d1/conda3/envs/scx_llm/lib/python3.10/site-packages/transformers/configuration_utils.py: │
│ 261 in __getattribute__                                                                   │
│                                                                                           │
│   258 │   def __getattribute__(self, key):                                                │
│   259 │   │   if key != "attribute_map" and key in super().__getattribute__("attribute_ma │
│   260 │   │   │   key = super().__getattribute__("attribute_map")[key]                    │
│ ❱ 261 │   │   return super().__getattribute__(key)                                        │
│   262 │                                                                                   │
│   263 │   def __init__(self, **kwargs):                                                   │
│   264 │   │   # Attributes with defaults                                                  │
╰───────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'LlamaConfig' object has no attribute 'initization_method_params'
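
For anyone reading the traceback: the error names `LlamaConfig` rather than the CoLLiE config because `config.py:206` forwards any attribute it cannot find to `self.model_config`. Below is a minimal sketch of that mechanism in plain Python. `InnerConfig` and `WrapperConfig` are hypothetical stand-ins, not the real classes, and the `getattr(..., None)` line at the end is only one possible defensive pattern, not a confirmed fix for `base.py`.

```python
class InnerConfig:
    """Hypothetical stand-in for the wrapped model config (e.g. LlamaConfig)."""

    def __init__(self):
        self.vocab_size = 32000


class WrapperConfig:
    """Hypothetical stand-in for a config that forwards unknown attributes."""

    def __init__(self, model_config):
        self.model_config = model_config
        self.initization_method = None  # present on the wrapper itself

    def __getattr__(self, name):
        # Only called when normal lookup on the wrapper fails,
        # so the lookup (and any error) moves to the inner config.
        return getattr(self.model_config, name)


config = WrapperConfig(InnerConfig())

message = ""
try:
    config.initization_method_params  # defined on neither object
except AttributeError as err:
    message = str(err)  # names InnerConfig, not WrapperConfig

# Defensive lookup: fall back to None when the attribute is missing.
params = getattr(config, "initization_method_params", None)
```

If the installed package and the example script come from different versions, a field that the script's config class does not define would fail in exactly this way, so checking that both come from the same branch may be worth a try.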

Llama2 70B training error

Training llama2 70B with the latest code on the dev branch runs into the following problems:
│collie/collie/models/llama/model.py:203 in _forward │
│ │
│ 200 │ │ │ │ │ │ │ .permute(0, 2, 1, 4, 3) \ │
│ 201 │ │ │ │ │ │ │ .reshape(batch_size, self.num_key_value_heads, │
│ 202 │ │ │ │ │ │ │ │ │ seq_len + start_pos, -1) │
│ ❱ 203 │ │ │ new_layer_past = torch.stack((present_key, value.permute([0, 2, 1, 3])), dim │
│ 204 │ │ attention_mask = attention_mask if attention_mask is not None else torch.ones((q │
│ 205 │ │ if self.config.use_flash: │
│ 206 │ │ │ output = flash_attention(query, key, value, attention_mask)
RuntimeError: stack expects each tensor to be equal size, but got [1, 8, 2048, 1024] at entry 0 and [1, 64, 2048, 128] at entry 1
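
A quick check on the two shapes in that message, as a sketch only. The head counts below are the commonly published Llama 2 70B values (64 query heads, 8 key/value heads, head dim 128) and are an assumption, since the report does not state them. This is shape bookkeeping in plain Python, not a patch for `model.py`.

```python
from math import prod

# Shapes copied from the RuntimeError message.
key_shape = (1, 8, 2048, 1024)
value_shape = (1, 64, 2048, 128)

# Assumed Llama 2 70B attention layout (not stated in the report).
num_heads, num_kv_heads, head_dim = 64, 8, 128
repeats = num_heads // num_kv_heads  # query heads per key/value head


def can_stack(*shapes):
    """Mirror the precondition of torch.stack: all shapes must be identical."""
    return len(set(shapes)) == 1


# Both tensors hold the same number of elements, only the layout differs.
same_elements = prod(key_shape) == prod(value_shape)

# The key's last dim equals repeats * head_dim, i.e. the repeated heads
# appear to have been folded into the last dimension by the reshape.
key_last_dim_is_folded = key_shape[-1] == repeats * head_dim

# Hypothetical consistent layout: move the repeat factor back into the head dim.
regrouped_key_shape = (
    key_shape[0],
    key_shape[1] * repeats,
    key_shape[2],
    key_shape[3] // repeats,
)
```

Under these assumptions `regrouped_key_shape` matches `value_shape`, which suggests the reshape on lines 200-202 uses `num_key_value_heads` where the value tensor has already been expanded to the full head count.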

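That is the first problem. The second is with the dev branch code from a few days ago: calling trainer.save_model for llama2 70B (8 V100s, training itself runs) hits a GPU memory OOM. If training fits, saving should not run out of GPU memory. The latest dev code may still have this problem; it just fails with the error above before reaching the save step.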
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py:1553 in │
│ _allgather_params_coalesced │
│ │
│ 1550 │ │ allgather_params = [] │
│ 1551 │ │ for psize in partition_sizes: │
│ 1552 │ │ │ tensor_size = psize * self.num_partitions │
│ ❱ 1553 │ │ │ flat_tensor = torch.empty(tensor_size, dtype=param_list[0].dtype, device=sel │
│ 1554 │ │ │ flat_tensor.requires_grad = False │
│ 1555 │ │ │ allgather_params.append(flat_tensor) │
│ 1556 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (GPU 7; 31.75 GiB total capacity; 29.60 GiB already allocated; 312.75 MiB free; 29.63 GiB reserved in total by PyTorch) If reserved memory
is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
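
Some arithmetic on the numbers in that message, as a rough sketch. The matrix shape (8192 x 28672, the commonly published Llama 2 70B MLP projection) and fp16 storage are assumptions, not something stated in the report. The partition count of 8 is taken from the 8 GPUs mentioned above.

```python
MIB = 1024 ** 2

# Numbers copied from the OutOfMemoryError message.
requested_mib = 448.00  # "Tried to allocate"
free_mib = 312.75       # "free"

# Assumed: one Llama 2 70B MLP projection stored in fp16 (2 bytes per value).
hidden_size, intermediate_size, bytes_per_param = 8192, 28672, 2
full_weight_mib = hidden_size * intermediate_size * bytes_per_param / MIB

# From the traceback: tensor_size = psize * self.num_partitions.
num_partitions = 8  # assumed equal to the number of GPUs
per_rank_shard_mib = requested_mib / num_partitions

# How much memory the allocation was short by.
shortfall_mib = requested_mib - free_mib
```

Under these assumptions the 448 MiB request is exactly one full, ungathered MLP weight, so the save path needs room for at least one complete parameter on top of what training already holds, and this GPU was about 135 MiB short of that.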

@00INDEX could you take a look when you have a moment?
