chinese-mixtral's People

Contributors

imounttai, ymcui


chinese-mixtral's Issues

Training details

Check before submitting issues

  • Make sure to pull the latest code, as some issues and bugs have been fixed.
  • I have read the Wiki and FAQ section AND searched for similar issues and did not find a similar problem or solution.
  • Third-party plugin issues - e.g., llama.cpp, LangChain, text-generation-webui; we recommend checking the corresponding project for solutions.

Type of Issue

None

Operating System

Linux

Describe your issue in detail

Could you provide the training code? I ran into some problems when training with my own code. Thanks.

# Paste the code you ran here

Dependencies (must be provided for code-related issues)

# Paste your dependencies here (put images outside the code block, otherwise they will not display)

Execution logs or screenshots

# Paste the execution log here (put images outside the code block, otherwise it will not display)

Was the vocabulary extended?

Check before submitting issues

  • Make sure to pull the latest code, as some issues and bugs have been fixed.
  • I have read the Wiki and FAQ section AND searched for similar issues and did not find a similar problem or solution.
  • Third-party plugin issues - e.g., llama.cpp, LangChain, text-generation-webui; we recommend checking the corresponding project for solutions.

Type of Issue

None

Operating System

None

Describe your issue in detail

(Describe the problem you encountered in detail here)

# Paste the code you ran here

Dependencies (must be provided for code-related issues)

# Paste your dependencies here (put images outside the code block, otherwise they will not display)

Execution logs or screenshots

# Paste the execution log here (put images outside the code block, otherwise it will not display)

Training error

Check before submitting issues

  • Make sure to pull the latest code, as some issues and bugs have been fixed.
  • I have read the Wiki and FAQ section AND searched for similar issues and did not find a similar problem or solution.
  • Third-party plugin issues - e.g., llama.cpp, LangChain, text-generation-webui; we recommend checking the corresponding project for solutions.

Type of Issue

None

Operating System

None

Describe your issue in detail

I ran into problems with both full-parameter and LoRA fine-tuning; the problem is the same as hiyouga/LLaMA-Factory#1998.
At the moment only 4-bit + zero2_no_offload runs through. Did the authors hit this problem during training, and how was it solved? Thanks.
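
For context, a minimal sketch of the 4-bit + LoRA setup that does run through; the model path and target modules below are placeholders, and this assumes the standard transformers/peft QLoRA-style APIs rather than the repository's exact training script:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization: the only configuration that currently trains
# without errors here (combined with DeepSpeed ZeRO-2, no offload).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "path/to/chinese-mixtral",  # placeholder path
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
)
model = prepare_model_for_kbit_training(model)  # cast norms, enable input grads

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "w1", "w2", "w3"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)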

Dependencies (must be provided for code-related issues)

No response

Execution logs or screenshots

No response

Vocabulary extension question

Check before submitting issues

  • Make sure to pull the latest code, as some issues and bugs have been fixed.
  • I have read the Wiki and FAQ section AND searched for similar issues and did not find a similar problem or solution.
  • Third-party plugin issues - e.g., llama.cpp, LangChain, text-generation-webui; we recommend checking the corresponding project for solutions.

Type of Issue

Other issues

Operating System

Linux

Describe your issue in detail

Why was the vocabulary not extended? Chinese-LLaMA extended its vocabulary; wouldn't that be friendlier for Chinese?
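
To make the trade-off concrete, here is a hedged sketch for measuring how many tokens the stock tokenizer spends per Chinese character; the model ID is illustrative and the numbers are not from the authors:

from transformers import AutoTokenizer

# An extended vocabulary (as in Chinese-LLaMA) encodes Chinese with fewer
# tokens per character, i.e. a longer effective context and cheaper decoding;
# the stock Mixtral tokenizer falls back to multi-token pieces for many
# Chinese characters.
text = "大规模语言模型在中文任务上的表现"
tok = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")  # illustrative ID
ids = tok(text, add_special_tokens=False)["input_ids"]
print(f"{len(ids)} tokens for {len(text)} characters "
      f"({len(ids) / len(text):.2f} tokens/char)")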

Dependencies (must be provided for code-related issues)

# Paste your dependencies here (put images outside the code block, otherwise they will not display)

Execution logs or screenshots

# Paste the execution log here (put images outside the code block, otherwise it will not display)

Setting load_in_kbits to 8 or 16 both raises an error; only 4 can be fine-tuned. What is the reason?

Check before submitting issues

  • Make sure to pull the latest code, as some issues and bugs have been fixed.
  • I have read the Wiki and FAQ section AND searched for similar issues and did not find a similar problem or solution.
  • Third-party plugin issues - e.g., llama.cpp, LangChain, text-generation-webui; we recommend checking the corresponding project for solutions.

Type of Issue

Model training and fine-tuning

Operating System

Linux

Describe your issue in detail

Setting load_in_kbits to 8 or 16 both raises an error; only 4 can be fine-tuned. What is the reason?

# CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
lr=1e-4
lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate,w1,w2,w3"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

pretrained_model=/share1/zouff/llm_model/chinese-mixtral-instruct
dataset_dir=/share1/zouff/py_pro/Chinese-Mixtral/train_data/zybm_mix_1_0318
per_device_train_batch_size=1
per_device_eval_batch_size=1
gradient_accumulation_steps=8
max_seq_length=512
output_dir=/share1/zouff/py_pro/Chinese-Mixtral/output_model/zybm_mix_30_0320
validation_file=/share1/zouff/py_pro/Chinese-Mixtral/scripts/data/zybm_medical_dev_new.json

deepspeed_config_file=ds_zero2_no_offload.json

torchrun --nnodes 1 --nproc_per_node 8 --master_port 12355 run_clm_sft_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${pretrained_model} \
    --dataset_dir ${dataset_dir} \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --do_eval \
    --seed $RANDOM \
    --fp16 \
    --num_train_epochs 30 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.05 \
    --weight_decay 0.1 \
    --logging_strategy steps \
    --logging_steps 200 \
    --save_strategy steps \
    --save_total_limit 40 \
    --evaluation_strategy steps \
    --eval_steps 200 \
    --save_steps 200 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --max_seq_length ${max_seq_length} \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --lora_dropout ${lora_dropout} \
    --modules_to_save ${modules_to_save} \
    --torch_dtype float16 \
    --validation_file ${validation_file} \
    --load_in_kbits 8 \
    --gradient_checkpointing \
    --ddp_find_unused_parameters False \
    --output_router_logits
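
For what it's worth, my understanding of the flag (an assumption about how load_in_kbits is wired up, not a quote of the repository's script) is that 4 and 8 select different bitsandbytes code paths while 16 skips quantization entirely, so the three values exercise very different backward passes:

import torch
from transformers import BitsAndBytesConfig

def make_quant_config(load_in_kbits: int):
    """Hypothetical helper mirroring what a --load_in_kbits flag would do."""
    if load_in_kbits == 4:
        # bitsandbytes 4-bit (QLoRA path) - the one that fine-tunes successfully
        return BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16,
        )
    if load_in_kbits == 8:
        # bitsandbytes LLM.int8(); its backward pass is where the SCB shape
        # mismatch in the execution log below is raised
        return BitsAndBytesConfig(load_in_8bit=True)
    return None  # 16: plain fp16 load, no quantization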

Dependencies (must be provided for code-related issues)

# accelerate                0.27.2
addict                    2.4.0
aiofiles                  23.2.1
aiohttp                   3.9.3
aiosignal                 1.3.1
aliyun-python-sdk-core    2.15.0
aliyun-python-sdk-kms     2.16.2
altair                    5.2.0
annotated-types           0.6.0
anyio                     4.3.0
arxiv                     2.1.0
asttokens                 2.4.1
async-timeout             4.0.3
attrs                     23.2.0
bitsandbytes              0.42.0
blessed                   1.20.0
blinker                   1.7.0
cachetools                5.3.3
certifi                   2024.2.2
cffi                      1.16.0
charset-normalizer        3.3.2
click                     8.1.7
cloudpickle               3.0.0
colorama                  0.4.6
comm                      0.2.1
contourpy                 1.2.0
cpm-kernels               1.0.11
crcmod                    1.7
cryptography              42.0.5
cupy-cuda12x              12.1.0
cycler                    0.12.1
dataclasses-json          0.6.4
datasets                  2.17.1
debugpy                   1.6.7
decorator                 5.1.1
deepspeed                 0.13.1
dill                      0.3.8
diskcache                 5.6.3
distro                    1.9.0
docstring-parser          0.15
einops                    0.7.0
entrypoints               0.4
exceptiongroup            1.2.0
executing                 2.0.1
fastapi                   0.110.0
fastrlock                 0.8.2
feedparser                6.0.10
ffmpy                     0.3.2
filelock                  3.13.1
fonttools                 4.49.0
frozenlist                1.4.1
fsspec                    2023.10.0
gast                      0.5.4
gitdb                     4.0.11
GitPython                 3.1.42
gpustat                   1.1.1
gradio                    4.19.2
gradio_client             0.10.1
greenlet                  3.0.3
h11                       0.14.0
hjson                     3.1.0
httpcore                  1.0.4
httptools                 0.6.1
httpx                     0.27.0
huggingface-hub           0.21.3
idna                      3.6
importlib-metadata        7.0.1
importlib_resources       6.1.2
interegular               0.3.3
ipykernel                 6.29.3
ipython                   8.22.2
jedi                      0.19.1
jieba                     0.42.1
Jinja2                    3.1.3
jmespath                  0.10.0
joblib                    1.3.2
jsonpatch                 1.33
jsonpointer               2.4
jsonschema                4.21.1
jsonschema-specifications 2023.12.1
jupyter-client            7.3.4
jupyter_core              5.7.1
kiwisolver                1.4.5
langchain                 0.1.9
langchain-community       0.0.24
langchain-core            0.1.27
langchainhub              0.1.14
langsmith                 0.1.10
lark                      1.1.9
latex2mathml              3.77.0
llvmlite                  0.42.0
loguru                    0.7.2
Markdown                  3.5.2
markdown-it-py            3.0.0
MarkupSafe                2.1.5
marshmallow               3.21.0
matplotlib                3.8.3
matplotlib-inline         0.1.6
mdtex2html                1.3.0
mdurl                     0.1.2
modelscope                1.13.0
mpmath                    1.3.0
msgpack                   1.0.8
multidict                 6.0.5
multiprocess              0.70.16
mypy-extensions           1.0.0
nest_asyncio              1.6.0
networkx                  3.2.1
ninja                     1.11.1.1
nltk                      3.8.1
numba                     0.59.0
numpy                     1.26.4
nvidia-cublas-cu12        12.1.3.1
nvidia-cuda-cupti-cu12    12.1.105
nvidia-cuda-nvrtc-cu12    12.1.105
nvidia-cuda-runtime-cu12  12.1.105
nvidia-cudnn-cu12         8.9.2.26
nvidia-cufft-cu12         11.0.2.54
nvidia-curand-cu12        10.3.2.106
nvidia-cusolver-cu12      11.4.5.107
nvidia-cusparse-cu12      12.1.0.106
nvidia-ml-py              12.535.133
nvidia-nccl-cu12          2.18.1
nvidia-nvjitlink-cu12     12.3.101
nvidia-nvtx-cu12          12.1.105
openai                    1.13.3
orjson                    3.9.15
oss2                      2.18.4
outlines                  0.0.34
packaging                 23.2
pandas                    2.2.1
parso                     0.8.3
peft                      0.9.0
pexpect                   4.9.0
pickleshare               0.7.5
pillow                    10.2.0
pip                       23.3.1
platformdirs              4.2.0
prometheus_client         0.20.0
prompt-toolkit            3.0.42
protobuf                  4.25.3
psutil                    5.9.0
ptyprocess                0.7.0
pure-eval                 0.2.2
py-cpuinfo                9.0.0
pyarrow                   15.0.0
pyarrow-hotfix            0.6
pycparser                 2.21
pycryptodome              3.20.0
pydantic                  2.6.3
pydantic_core             2.16.3
pydeck                    0.8.1b0
pydub                     0.25.1
Pygments                  2.17.2
PyJWT                     2.8.0
pynvml                    11.5.0
pyparsing                 3.1.1
python-dateutil           2.8.2
python-dotenv             1.0.1
python-multipart          0.0.9
pytz                      2024.1
PyYAML                    6.0.1
pyzmq                     25.1.2
ray                       2.9.3
referencing               0.33.0
regex                     2023.12.25
requests                  2.31.0
rich                      13.7.1
rouge-chinese             1.0.3
rpds-py                   0.18.0
ruamel.yaml               0.18.6
ruamel.yaml.clib          0.2.8
ruff                      0.2.2
safetensors               0.4.2
scikit-learn              1.4.1.post1
scipy                     1.12.0
semantic-version          2.10.0
sentence-transformers     2.4.0
sentencepiece             0.2.0
setuptools                68.2.2
sgmllib3k                 1.0.0
shellingham               1.5.4
shtab                     1.7.0
simplejson                3.19.2
six                       1.16.0
smmap                     5.0.1
sniffio                   1.3.1
sortedcontainers          2.4.0
SQLAlchemy                2.0.27
sse-starlette             2.0.0
stack-data                0.6.2
starlette                 0.36.3
streamlit                 1.31.1
sympy                     1.12
tenacity                  8.2.3
threadpoolctl             3.3.0
tiktoken                  0.6.0
timm                      0.9.16
tokenizers                0.15.2
toml                      0.10.2
tomli                     2.0.1
tomlkit                   0.12.0
toolz                     0.12.1
torch                     2.1.2
torchvision               0.17.1
tornado                   6.1
tqdm                      4.66.2
traitlets                 5.14.1
transformers              4.38.2
triton                    2.1.0
trl                       0.7.11
typer                     0.9.0
types-requests            2.31.0.20240218
typing_extensions         4.10.0
typing-inspect            0.9.0
tyro                      0.7.3
tzdata                    2024.1
tzlocal                   5.2
urllib3                   2.2.1
uvicorn                   0.27.1
uvloop                    0.19.0
validators                0.22.0
vllm                      0.3.3
watchdog                  4.0.0
watchfiles                0.21.0
wcwidth                   0.2.13
websockets                11.0.3
wheel                     0.41.2
xformers                  0.0.23.post1
xxhash                    3.4.1
yapf                      0.40.2
yarl                      1.9.4
zhipuai                   2.0.1
zipp                      3.17.0

Execution logs or screenshots

# Tracebacks from all 8 ranks are interleaved in the original output; every rank fails identically, so a single deduplicated copy follows:
Traceback (most recent call last):
  File "/share1/zouff/py_pro/Chinese-Mixtral/scripts/training/run_clm_sft_with_peft.py", line 424, in <module>
    main()
  File "/share1/zouff/py_pro/Chinese-Mixtral/scripts/training/run_clm_sft_with_peft.py", line 396, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/zouff/anaconda3/envs/glm/lib/python3.10/site-packages/transformers/trainer.py", line 1624, in train
    return inner_training_loop(
  File "/home/zouff/anaconda3/envs/glm/lib/python3.10/site-packages/transformers/trainer.py", line 1961, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/zouff/anaconda3/envs/glm/lib/python3.10/site-packages/transformers/trainer.py", line 2911, in training_step
    self.accelerator.backward(loss)
  File "/home/zouff/anaconda3/envs/glm/lib/python3.10/site-packages/accelerate/accelerator.py", line 1960, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/home/zouff/anaconda3/envs/glm/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/home/zouff/anaconda3/envs/glm/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/zouff/anaconda3/envs/glm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1964, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/home/zouff/anaconda3/envs/glm/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2040, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/home/zouff/anaconda3/envs/glm/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/zouff/anaconda3/envs/glm/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/zouff/anaconda3/envs/glm/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/zouff/anaconda3/envs/glm/lib/python3.10/site-packages/torch/autograd/function.py", line 288, in apply
    return user_fn(self, *args)
  File "/home/zouff/anaconda3/envs/glm/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 288, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/zouff/anaconda3/envs/glm/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/zouff/anaconda3/envs/glm/lib/python3.10/site-packages/torch/autograd/function.py", line 288, in apply
    return user_fn(self, *args)
  File "/home/zouff/anaconda3/envs/glm/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 485, in backward
    .mul_(state.SCB.unsqueeze(1).mul(1.0 / 127.0))
RuntimeError: The size of tensor a (32) must match the size of tensor b (8) at non-singleton dimension 0

  0%|          | 0/3810 [00:03<?, ?it/s]
[2024-03-21 09:55:50,608] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 34314) of binary: /home/zouff/anaconda3/envs/glm/bin/python
Traceback (most recent call last):
  File "/home/zouff/anaconda3/envs/glm/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/zouff/anaconda3/envs/glm/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/zouff/anaconda3/envs/glm/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/zouff/anaconda3/envs/glm/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/zouff/anaconda3/envs/glm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/zouff/anaconda3/envs/glm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
run_clm_sft_with_peft.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-21_09:55:50
  host      : g01
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 34315)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-03-21_09:55:50
  host      : g01
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 34316)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-03-21_09:55:50
  host      : g01
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 34317)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-03-21_09:55:50
  host      : g01
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 34318)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-03-21_09:55:50
  host      : g01
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 34319)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2024-03-21_09:55:50
  host      : g01
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 34320)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2024-03-21_09:55:50
  host      : g01
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 34321)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-21_09:55:50
  host      : g01
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 34314)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

During SFT fine-tuning, training was interrupted after a checkpoint had been saved; when I try to resume from the saved checkpoint, it fails with the same error every time. How can I solve this?

Check before submitting issues

  • Make sure to pull the latest code, as some issues and bugs have been fixed.
  • I have read the Wiki and FAQ section AND searched for similar issues and did not find a similar problem or solution.
  • Third-party plugin issues - e.g., llama.cpp, LangChain, text-generation-webui; we recommend checking the corresponding project for solutions.

Type of Issue

Model training and fine-tuning

Operating System

Linux

Describe your issue in detail

(Describe the problem you encountered in detail here)

# CUDA_VISIBLE_DEVICES=0,1,2,3
lr=1e-4
lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate,w1,w2,w3"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

pretrained_model=/share1/zouff/llm_model/chinese-mixtral-instruct
dataset_dir=/share1/zouff/py_pro/Chinese-Mixtral/train_data/zybm_mix_1_0318
per_device_train_batch_size=1
per_device_eval_batch_size=1
gradient_accumulation_steps=8
max_seq_length=512
output_dir=/share1/zouff/py_pro/Chinese-Mixtral/output_model/zybm_mix_1_0318
validation_file=/share1/zouff/py_pro/Chinese-Mixtral/scripts/data/zybm_medical_dev_new.json

deepspeed_config_file=ds_zero2_no_offload.json

torchrun --nnodes 1 --nproc_per_node 4 run_clm_sft_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${pretrained_model} \
    --dataset_dir ${dataset_dir} \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --do_eval \
    --seed $RANDOM \
    --fp16 \
    --num_train_epochs 3 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.05 \
    --weight_decay 0.1 \
    --logging_strategy steps \
    --logging_steps 200 \
    --save_strategy steps \
    --save_total_limit 5 \
    --evaluation_strategy steps \
    --eval_steps 200 \
    --save_steps 200 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --max_seq_length ${max_seq_length} \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --lora_dropout ${lora_dropout} \
    --modules_to_save ${modules_to_save} \
    --torch_dtype float16 \
    --validation_file ${validation_file} \
    --load_in_kbits 4 \
    --gradient_checkpointing \
    --ddp_find_unused_parameters False \
    --output_router_logits \
    --resume_from_checkpoint /share1/zouff/py_pro/Chinese-Mixtral/output_model/zybm_mix_1_0318/checkpoint-200
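
A guess about the NCCL timeout in the execution log at the end of this issue: the ALLREDUCE dies at Timeout(ms)=1800000, the 30-minute default, even though the script passes --ddp_timeout 30000, which suggests the Trainer's timeout never reaches the process group when DeepSpeed initializes it. A hedged workaround sketch (an assumption, not the repository's code) is to initialize the process group yourself with a larger timeout before the Trainer/DeepSpeed takes over:

from datetime import timedelta
import torch.distributed as dist

# Raise the collective timeout explicitly; DeepSpeed/Trainer will reuse an
# already-initialized process group instead of creating one with the default
# 1800 s timeout. This is a workaround sketch, not the repository's code.
if not dist.is_initialized():
    dist.init_process_group(backend="nccl", timeout=timedelta(seconds=30000))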

Dependencies (must be provided for code-related issues)

# accelerate==0.27.2
addict==2.4.0
aiofiles==23.2.1
aiohttp==3.9.3
aiosignal==1.3.1
aliyun-python-sdk-core==2.15.0
aliyun-python-sdk-kms==2.16.2
altair==5.2.0
annotated-types==0.6.0
anyio==4.3.0
arxiv==2.1.0
asttokens @ file:///home/conda/feedstock_root/build_artifacts/asttokens_1698341106958/work
async-timeout==4.0.3
attrs==23.2.0
bitsandbytes==0.42.0
blessed==1.20.0
blinker==1.7.0
cachetools==5.3.3
certifi==2024.2.2
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
cloudpickle==3.0.0
colorama==0.4.6
comm @ file:///home/conda/feedstock_root/build_artifacts/comm_1704278392174/work
contourpy==1.2.0
cpm-kernels==1.0.11
crcmod==1.7
cryptography==42.0.5
cupy-cuda12x==12.1.0
cycler==0.12.1
dataclasses-json==0.6.4
datasets==2.17.1
debugpy @ file:///croot/debugpy_1690905042057/work
decorator @ file:///home/conda/feedstock_root/build_artifacts/decorator_1641555617451/work
deepspeed==0.13.1
dill==0.3.8
diskcache==5.6.3
distro==1.9.0
docstring-parser==0.15
einops==0.7.0
entrypoints @ file:///home/conda/feedstock_root/build_artifacts/entrypoints_1643888246732/work
exceptiongroup @ file:///home/conda/feedstock_root/build_artifacts/exceptiongroup_1704921103267/work
executing @ file:///home/conda/feedstock_root/build_artifacts/executing_1698579936712/work
fastapi==0.110.0
fastrlock==0.8.2
feedparser==6.0.10
ffmpy==0.3.2
filelock==3.13.1
fonttools==4.49.0
frozenlist==1.4.1
fsspec==2023.10.0
gast==0.5.4
gitdb==4.0.11
GitPython==3.1.42
gpustat==1.1.1
gradio==4.19.2
gradio_client==0.10.1
greenlet==3.0.3
h11==0.14.0
hjson==3.1.0
httpcore==1.0.4
httptools==0.6.1
httpx==0.27.0
huggingface-hub==0.21.3
idna==3.6
importlib-metadata==7.0.1
importlib_resources==6.1.2
interegular==0.3.3
ipykernel @ file:///home/conda/feedstock_root/build_artifacts/ipykernel_1708996548741/work
ipython @ file:///home/conda/feedstock_root/build_artifacts/ipython_1709559745751/work
jedi @ file:///home/conda/feedstock_root/build_artifacts/jedi_1696326070614/work
jieba==0.42.1
Jinja2==3.1.3
jmespath==0.10.0
joblib==1.3.2
jsonpatch==1.33
jsonpointer==2.4
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
jupyter-client @ file:///home/conda/feedstock_root/build_artifacts/jupyter_client_1654730843242/work
jupyter_core @ file:///home/conda/feedstock_root/build_artifacts/jupyter_core_1704727030956/work
kiwisolver==1.4.5
langchain==0.1.9
langchain-community==0.0.24
langchain-core==0.1.27
langchainhub==0.1.14
langsmith==0.1.10
lark==1.1.9
latex2mathml==3.77.0
llvmlite==0.42.0
loguru==0.7.2
Markdown==3.5.2
markdown-it-py==3.0.0
MarkupSafe==2.1.5
marshmallow==3.21.0
matplotlib==3.8.3
matplotlib-inline @ file:///home/conda/feedstock_root/build_artifacts/matplotlib-inline_1660814786464/work
mdtex2html==1.3.0
mdurl==0.1.2
modelscope==1.13.0
mpmath==1.3.0
msgpack==1.0.8
multidict==6.0.5
multiprocess==0.70.16
mypy-extensions==1.0.0
nest_asyncio @ file:///home/conda/feedstock_root/build_artifacts/nest-asyncio_1705850609492/work
networkx==3.2.1
ninja==1.11.1.1
nltk==3.8.1
numba==0.59.0
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==12.535.133
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.3.101
nvidia-nvtx-cu12==12.1.105
openai==1.13.3
orjson==3.9.15
oss2==2.18.4
outlines==0.0.34
packaging @ file:///home/conda/feedstock_root/build_artifacts/packaging_1696202382185/work
pandas==2.2.1
parso @ file:///home/conda/feedstock_root/build_artifacts/parso_1638334955874/work
peft==0.9.0
pexpect @ file:///home/conda/feedstock_root/build_artifacts/pexpect_1706113125309/work
pickleshare @ file:///home/conda/feedstock_root/build_artifacts/pickleshare_1602536217715/work
pillow==10.2.0
pip==23.3.1
platformdirs @ file:///home/conda/feedstock_root/build_artifacts/platformdirs_1706713388748/work
prometheus_client==0.20.0
prompt-toolkit @ file:///home/conda/feedstock_root/build_artifacts/prompt-toolkit_1702399386289/work
protobuf==4.25.3
psutil @ file:///opt/conda/conda-bld/psutil_1656431268089/work
ptyprocess @ file:///home/conda/feedstock_root/build_artifacts/ptyprocess_1609419310487/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl
pure-eval @ file:///home/conda/feedstock_root/build_artifacts/pure_eval_1642875951954/work
py-cpuinfo==9.0.0
pyarrow==15.0.0
pyarrow-hotfix==0.6
pycparser==2.21
pycryptodome==3.20.0
pydantic==2.6.3
pydantic_core==2.16.3
pydeck==0.8.1b0
pydub==0.25.1
Pygments @ file:///home/conda/feedstock_root/build_artifacts/pygments_1700607939962/work
PyJWT==2.8.0
pynvml==11.5.0
pyparsing==3.1.1
python-dateutil==2.8.2
python-dotenv==1.0.1
python-multipart==0.0.9
pytz==2024.1
PyYAML==6.0.1
pyzmq @ file:///croot/pyzmq_1705605076900/work
ray==2.9.3
referencing==0.33.0
regex==2023.12.25
requests==2.31.0
rich==13.7.1
rouge-chinese==1.0.3
rpds-py==0.18.0
ruamel.yaml==0.18.6
ruamel.yaml.clib==0.2.8
ruff==0.2.2
safetensors==0.4.2
scikit-learn==1.4.1.post1
scipy==1.12.0
semantic-version==2.10.0
sentence-transformers==2.4.0
sentencepiece==0.2.0
setuptools==68.2.2
sgmllib3k==1.0.0
shellingham==1.5.4
shtab==1.7.0
simplejson==3.19.2
six @ file:///home/conda/feedstock_root/build_artifacts/six_1620240208055/work
smmap==5.0.1
sniffio==1.3.1
sortedcontainers==2.4.0
SQLAlchemy==2.0.27
sse-starlette==2.0.0
stack-data @ file:///home/conda/feedstock_root/build_artifacts/stack_data_1669632077133/work
starlette==0.36.3
streamlit==1.31.1
sympy==1.12
tenacity==8.2.3
threadpoolctl==3.3.0
tiktoken==0.6.0
timm==0.9.16
tokenizers==0.15.2
toml==0.10.2
tomli==2.0.1
tomlkit==0.12.0
toolz==0.12.1
torch==2.1.2
torchvision==0.17.1
tornado @ file:///home/conda/feedstock_root/build_artifacts/tornado_1648827254365/work
tqdm==4.66.2
traitlets @ file:///home/conda/feedstock_root/build_artifacts/traitlets_1704212992681/work
transformers==4.38.2
triton==2.1.0
trl==0.7.11
typer==0.9.0
types-requests==2.31.0.20240218
typing-inspect==0.9.0
typing_extensions @ file:///home/conda/feedstock_root/build_artifacts/typing_extensions_1708904622550/work
tyro==0.7.3
tzdata==2024.1
tzlocal==5.2
urllib3==2.2.1
uvicorn==0.27.1
uvloop==0.19.0
validators==0.22.0
vllm==0.3.3
watchdog==4.0.0
watchfiles==0.21.0
wcwidth @ file:///home/conda/feedstock_root/build_artifacts/wcwidth_1704731205417/work
websockets==11.0.3
wheel==0.41.2
xformers==0.0.23.post1
xxhash==3.4.1
yapf==0.40.2
yarl==1.9.4
zhipuai==2.0.1
zipp==3.17.0

Execution logs or screenshots

#  26%|██▋       | 201/762 [00:35<01:38,  5.67it/s]
 27%|██▋       | 202/762 [01:09<03:53,  2.40it/s]
 27%|██▋       | 203/762 [01:44<07:02,  1.32it/s]
 27%|██▋       | 204/762 [02:18<11:24,  1.23s/it]
 27%|██▋       | 205/762 [02:53<17:33,  1.89s/it]
 27%|██▋       | 206/762 [03:27<25:40,  2.77s/it]
 27%|██▋       | 207/762 [04:01<36:33,  3.95s/it]
 27%|██▋       | 208/762 [04:35<50:38,  5.49s/it]
 27%|██▋       | 209/762 [05:10<1:08:46,  7.46s/it]
 28%|██▊       | 210/762 [05:44<1:30:08,  9.80s/it]
 28%|██▊       | 211/762 [06:17<1:54:33, 12.47s/it]
 28%|██▊       | 212/762 [06:51<2:21:25, 15.43s/it]
 28%|██▊       | 213/762 [07:25<2:49:23, 18.51s/it]
 28%|██▊       | 214/762 [07:59<3:15:56, 21.45s/it]
 28%|██▊       | 215/762 [08:34<3:41:47, 24.33s/it]
 28%|██▊       | 216/762 [09:08<4:02:33, 26.65s/it]
 28%|██▊       | 217/762 [09:42<4:18:47, 28.49s/it]
 29%|██▊       | 218/762 [10:16<4:31:55, 29.99s/it]
 29%|██▊       | 219/762 [10:51<4:41:46, 31.14s/it]
 29%|██▉       | 220/762 [11:25<4:49:47, 32.08s/it]
 29%|██▉       | 221/762 [12:00<4:55:29, 32.77s/it]
 29%|██▉       | 222/762 [12:34<4:59:26, 33.27s/it]
 29%|██▉       | 223/762 [13:08<5:01:07, 33.52s/it]
 29%|██▉       | 224/762 [13:42<5:01:59, 33.68s/it]
 30%|██▉       | 225/762 [14:17<5:02:52, 33.84s/it]
 30%|██▉       | 226/762 [14:51<5:03:45, 34.00s/it]
 30%|██▉       | 227/762 [15:25<5:04:03, 34.10s/it]
 30%|██▉       | 228/762 [15:59<5:03:23, 34.09s/it]
 30%|███       | 229/762 [16:33<5:02:05, 34.01s/it]
 30%|███       | 230/762 [17:07<5:01:36, 34.02s/it]
 30%|███       | 231/762 [17:42<5:02:07, 34.14s/it]
 30%|███       | 232/762 [18:15<5:00:55, 34.07s/it]
 31%|███       | 233/762 [18:49<5:00:10, 34.05s/it]
 31%|███       | 234/762 [19:24<4:59:41, 34.06s/it]
 31%|███       | 235/762 [19:58<4:59:40, 34.12s/it]
 31%|███       | 236/762 [20:32<4:58:51, 34.09s/it]
 31%|███       | 237/762 [21:06<4:58:00, 34.06s/it]
 31%|███       | 238/762 [21:40<4:57:09, 34.03s/it]
 31%|███▏      | 239/762 [22:14<4:56:19, 33.99s/it]
 31%|███▏      | 240/762 [22:48<4:56:25, 34.07s/it][E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6909, OpType=ALLREDUCE, NumelIn=131072000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800084 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6909, OpType=ALLREDUCE, NumelIn=131072000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800654 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6909, OpType=ALLREDUCE, NumelIn=131072000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800084 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6909, OpType=ALLREDUCE, NumelIn=131072000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800654 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6909, OpType=ALLREDUCE, NumelIn=131072000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800128 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6909, OpType=ALLREDUCE, NumelIn=131072000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800128 milliseconds before timing out.
[2024-03-19 10:54:46,862] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 128173 closing signal SIGTERM
[2024-03-19 10:54:46,862] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 128174 closing signal SIGTERM
[2024-03-19 10:54:54,867] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 128172) of binary: /home/zouff/anaconda3/envs/glm/bin/python
Traceback (most recent call last):
  File "/home/zouff/anaconda3/envs/glm/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/zouff/anaconda3/envs/glm/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/zouff/anaconda3/envs/glm/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/zouff/anaconda3/envs/glm/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/zouff/anaconda3/envs/glm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/zouff/anaconda3/envs/glm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
run_clm_sft_with_peft.py FAILED
-------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-19_10:54:46
  host      : g01
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 128175)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 128175
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-19_10:54:46
  host      : g01
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 128172)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 128172
=======================================================

Computation assessment

Check before submitting issues

  • Make sure to pull the latest code, as some issues and bugs have been fixed.
  • I have read the Wiki and FAQ section AND searched for similar issues and did not find a similar problem or solution.
  • Third-party plugin issues - e.g., llama.cpp, LangChain, text-generation-webui; we recommend checking the corresponding project for solutions.

Type of Issue

Other issues

Operating System

None

Describe your issue in detail

You described that training used 48 A40 GPUs; can you also share (or estimate) how long the pre-training phase and the instruction-tuning phase took?
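
Not an official number, just a back-of-envelope way to frame the estimate; every constant below is an assumption to be swapped for the real figures:

# Rough wall-clock estimate for the pre-training phase. All numbers are
# illustrative assumptions, not figures from the authors.
n_gpus = 48                   # A40s, as described
tokens = 20e9                 # assumed pre-training corpus size in tokens
tokens_per_gpu_per_sec = 400  # assumed per-GPU training throughput

seconds = tokens / (n_gpus * tokens_per_gpu_per_sec)
print(f"~{seconds / 86400:.1f} days")  # about 12 days under these assumptions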

Dependencies (must be provided for code-related issues)

No response

Execution logs or screenshots

No response

Will the training scripts be released?

Check before submitting issues

  • Make sure to pull the latest code, as some issues and bugs have been fixed.
  • I have read the Wiki and FAQ section AND searched for similar issues and did not find a similar problem or solution.
  • Third-party plugin issues - e.g., llama.cpp, LangChain, text-generation-webui; we recommend checking the corresponding project for solutions.

Type of Issue

Model training and fine-tuning

Operating System

Linux

Describe your issue in detail

(Paste the code you ran here)

none

Dependencies (must be provided for code-related issues)

none

Execution logs or screenshots

none

Is there any plan to upload Chinese-Mixtral-Instruct to the ModelScope community?

Check before submitting issues

  • Make sure to pull the latest code, as some issues and bugs have been fixed.
  • I have read the Wiki and FAQ section AND searched for similar issues and did not find a similar problem or solution.
  • Third-party plugin issues - e.g., llama.cpp, LangChain, text-generation-webui; we recommend checking the corresponding project for solutions.

Type of Issue

None

Operating System

None

Describe your issue in detail

Downloading via either Baidu Netdisk or HF is very slow, and neither option is friendly to Linux systems.
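
In the meantime, one workaround is a resumable download through huggingface_hub; the repo ID below is illustrative, and interrupted downloads can simply be re-run:

from huggingface_hub import snapshot_download

# Downloads all model files and resumes partially-downloaded files if the
# connection drops; re-running the call continues where it left off.
snapshot_download(
    repo_id="hfl/chinese-mixtral-instruct",  # illustrative repo ID
    local_dir="./chinese-mixtral-instruct",
)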

Dependencies (must be provided for code-related issues)

# Paste your dependencies here (put images outside the code block, otherwise they will not display)

Execution logs or screenshots

# Paste the execution log here (put images outside the code block, otherwise it will not display)

A question about training cost

Check before submitting issues

  • Make sure to pull the latest code, as some issues and bugs have been fixed.
  • I have read the Wiki and FAQ section AND searched for similar issues and did not find a similar problem or solution.
  • Third-party plugin issues - e.g., llama.cpp, LangChain, text-generation-webui; we recommend checking the corresponding project for solutions.

Type of Issue

Other issues

Operating System

Linux

Describe your issue in detail

(Describe the problem you encountered in detail here)
What was the size of the training dataset, and which GPU model was used, in what quantity and with how much VRAM? We hope to fine-tune further on our own dataset to adapt the model to our domain. Thanks for your support!

Dependencies (must be provided for code-related issues)

Not applicable

Execution logs or screenshots

Not applicable
