
flagperf's People

Contributors

baai-openplatform, chenrui9312, clveryang, dynamicheart, fajingyi, forestlee95, fred1912, hodoryu, huiyiygy, jamesruio, kathrine94, kerwinkai, kungyork, lin5462107, ljy-2000, nrikoh, reiase, scothunder, sherryxie1, shh2000, stezpy, tianxiao-baai, twang07, upvenly, wzd09, xq25478, yan-rui, yuzhou03, zbj-china, zhangling21


flagperf's Issues

Question: how is MFU calculated?

  1. In the training tasks, every model in this project computes MFU as
    MFU = 6 * tokens-per-second * params / per-iter-time
    I am curious what the factor 6 represents (see the sketch after this list).

  2. The formula also differs from the ones given in other papers, e.g. the MFU calculations used by Megatron and nanoGPT.
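For context, a minimal sketch of the widely used MFU convention (Megatron-LM / PaLM-style accounting), not FlagPerf's actual code: the 6 is roughly 2 FLOPs per parameter per token for the forward pass (one multiply plus one add per weight) and about 4 for the backward pass. The function name and the example numbers below are illustrative assumptions.

```python
def mfu(num_params: float, tokens_per_second: float, peak_flops: float) -> float:
    """Model FLOPs Utilization = achieved FLOPs per second / theoretical peak FLOPs."""
    # ~2 FLOPs/param/token forward + ~4 FLOPs/param/token backward -> factor 6
    achieved_flops_per_second = 6 * num_params * tokens_per_second
    return achieved_flops_per_second / peak_flops

# Example: 7B-parameter model, 3000 tokens/s per GPU, A100 BF16 peak ~312 TFLOPS.
print(f"MFU = {mfu(7e9, 3000, 312e12):.2%}")  # ~40%
```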

mixtral_8x7B dataset download URL is broken

Longformer multi-card training: evaluation results are not averaged across cards

During multi-card training, intermittent model accuracy evaluation is required, but the longformer model does not average the evaluation results across cards. Although every card runs inference over the full evaluation data, the inference results are not guaranteed to be identical on all cards (because of random factors such as dropout). As a result, different cards produce different evaluation results, which in turn affects each card's training state: some cards reach the target accuracy and stop training while others do not and keep training, creating the illusion that the machine has hung. The per-card results should be all_gathered during evaluation and then averaged, as in the sketch below.
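A minimal sketch of the suggested fix, assuming a torch.distributed process group is already initialized; an all_reduce averages the metric equivalently to all_gather followed by a mean (the function name is a placeholder, not FlagPerf code).

```python
import torch
import torch.distributed as dist

def averaged_eval_metric(local_metric: float, device: torch.device) -> float:
    """Average a per-rank evaluation metric across all ranks so that every
    rank makes the same stop/continue decision."""
    t = torch.tensor([local_metric], dtype=torch.float32, device=device)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)       # sum the metric over all ranks
    return (t / dist.get_world_size()).item()      # divide by world size = mean
```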

mixtral_8x7B tokenizer error

Problem description:
Following the mixtral_8x7B run instructions, launching on GPU fails with: AttributeError: 'LlamaTokenizer' object has no attribute 'unique_identifiers'

Analysis:
(1) In the official code FlagPerf/blob/main/training/benchmarks/mixtral_8x7B/megatron/tokenizer.patch, when args.tokenizer_type == 'MistralTokenizer' the patch directly calls tokenizer = AutoTokenizer.from_pretrained;

(2) This differs from how the other tokenizers are implemented in /Megatron-LM/megatron/training/tokenizer/tokenizer.py: tokenizers such as gpt2xx, llama2 and llama3 inherit from MegatronTokenizer (/Megatron-LM/megatron/core/datasets/megatron_tokenizer.py);

(3) The unique_identifiers attribute reported missing in the error above is initialized in the MegatronTokenizer class; because the MistralTokenizer created by the patch does not inherit from MegatronTokenizer, unique_identifiers is never initialized.
[screenshot]

Suggestion:
Please check whether FlagPerf/blob/main/training/benchmarks/mixtral_8x7B/megatron/tokenizer.patch is missing the inheritance from MegatronTokenizer, e.g. something similar to the Llama3 implementation:
[screenshot]
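A hedged sketch of such a wrapper, mirroring the spirit of the Llama3 implementation. The class name and the exact set of abstract methods/properties are assumptions and should be checked against the MegatronTokenizer interface in the pinned Megatron-LM version.

```python
from megatron.core.datasets.megatron_tokenizer import MegatronTokenizer
from transformers import AutoTokenizer

class _HFMistralTokenizer(MegatronTokenizer):
    """Illustrative wrapper: inheriting MegatronTokenizer lets the base-class
    __init__ initialize unique_identifiers."""

    def __init__(self, tokenizer_path: str):
        super().__init__(tokenizer_path)  # records unique_identifiers
        self._tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

    def tokenize(self, text: str):
        return self._tokenizer(text, add_special_tokens=False)["input_ids"]

    def detokenize(self, token_ids):
        return self._tokenizer.decode(token_ids)

    @property
    def vocab(self):
        return self._tokenizer.get_vocab()

    @property
    def inv_vocab(self):
        return {v: k for k, v in self._tokenizer.get_vocab().items()}

    @property
    def vocab_size(self):
        return len(self._tokenizer)

    @property
    def eod(self):
        return self._tokenizer.eos_token_id
```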

Handle Non-Zero Return Codes in Process Execution in start_pytorch_task.py

Description:

While reviewing the start_pytorch_task.py file under training/run_benchmarks/pytorch (around line 193), it appears necessary to add return-code checks for the launched processes. The code currently uses proc.wait() to wait for each process to finish, but lacks proper handling of non-zero return codes.

Suggested change:

Modify the code block starting at line 193 of start_pytorch_task.py to add a return-code check for each launched process (see the sketch below):
1. If a process's return code is non-zero,
2. raise an exception with an informative error message, and
3. include the process ID in the error message so the offending process can be identified.
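A minimal sketch of the suggested check. The `processes` list here is a stand-in for the Popen objects that start_pytorch_task.py actually waits on; the dummy command only exists to make the example runnable.

```python
import subprocess

# Stand-in for the Popen objects created earlier in start_pytorch_task.py.
processes = [subprocess.Popen("exit 3", shell=True)]

for proc in processes:
    ret = proc.wait()
    if ret != 0:
        # Surface the failure instead of silently continuing.
        raise RuntimeError(
            f"training process pid={proc.pid} exited with non-zero return code {ret}"
        )
```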

Several questions about the metrics, please clarify

  1. End-to-end time | e2e_time | total time + Perf initialization time, etc.
    What exactly does this refer to?

  2. Whole inference throughput | p_infer_whole | actual number of inferred tokens divided by total inference time
    --- Does "actual number of inferred tokens" include both prompt tokens and completion tokens,
    or only the completion tokens?

  3. MMLU accuracy (few_shots: 5)
    The ACC of both llama2-7b and Aquila_7b_mmlu is below 50%. With correctness this low, are the test results still meaningful?


Metric list

Metric name | Metric key | Notes
Data precision | precision | fp32/fp16 selectable
Batch size | bs |
Device memory usage | mem | usually called "VRAM", in GiB
End-to-end time | e2e_time | total time + Perf initialization time, etc.
Whole validation throughput | p_val_whole | actual number of validated tokens divided by total validation time
Core validation throughput | p_val_core | excludes time spent on IO
Whole inference throughput | p_infer_whole | actual number of inferred tokens divided by total inference time
Core inference throughput | *p_infer_core | excludes time spent on IO
Inference result | acc (inference/validation) | MMLU accuracy (few_shots: 5)
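For illustration, a minimal sketch of the "whole" vs. "core" throughput distinction in the table above; the token count and the IO/compute time split are assumptions, not FlagPerf's actual instrumentation.

```python
inferred_tokens = 1_000_000   # actual number of inferred tokens (illustrative)
total_infer_time = 125.0      # seconds of wall clock, including IO (illustrative)
io_time = 25.0                # seconds spent on data loading / IO (illustrative)

p_infer_whole = inferred_tokens / total_infer_time              # 8000 tokens/s
p_infer_core = inferred_tokens / (total_infer_time - io_time)   # 10000 tokens/s
print(p_infer_whole, p_infer_core)
```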

CPM dataloaders stall for ~40s at "Prefix dict has been built successfully" while jieba builds its prefix dict, possibly machine-dependent

While using the data loaders provided by CPM, we observed a significant performance issue during construction of the prefix dictionary in the jieba library.
Specifically, this step stalls at the message "Prefix dict has been built successfully", taking approximately 40 seconds to complete.
We suspect this problem may be related to variations in machine performance; the stall can be measured in isolation as in the sketch below.
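A small sketch to time the stall on a given machine: jieba.initialize() builds the prefix dictionary eagerly instead of on the first cut() call, so the measurement isolates exactly this step.

```python
import time

import jieba

start = time.time()
jieba.initialize()  # build the prefix dictionary eagerly
print(f"jieba prefix dict built in {time.time() - start:.1f}s")
```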

bert-paddle training fails at runtime (dataset and model prepared following the bert-pytorch tutorial)

Traceback (most recent call last):
  File "model_framework/bert/paddle/run_pretraining.py", line 136, in <module>
    config, state = main()
  File "model_framework/bert/paddle/run_pretraining.py", line 91, in main
    eval_loss, eval_mlm_acc = evaluator.evaluate(trainer)
  File "/usr/local/lib/python3.7/dist-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/base.py", line 347, in _decorate_function
    return func(*args, **kwargs)
  File "/nfs/aishPerf_test/training/model_framework/bert/paddle/train/evaluator.py", line 59, in evaluate
    loss, mlm_acc, num_masked = trainer.inference(batch)
  File "/nfs/aishPerf_test/training/model_framework/bert/paddle/train/trainer.py", line 228, in inference
    return self.forward(batch)
  File "/nfs/aishPerf_test/training/model_framework/bert/paddle/train/trainer.py", line 223, in forward
    next_sentence_labels)
  File "/usr/local/lib/python3.7/dist-packages/paddle/nn/layer/layers.py", line 1254, in __call__
    return self.forward(*inputs, **kwargs)
  File "/nfs/aishPerf_test/training/model_framework/bert/paddle/model/models/modeling.py", line 729, in forward
    next_sentence_label, return_dict)
  File "/usr/local/lib/python3.7/dist-packages/paddle/nn/layer/layers.py", line 1254, in __call__
    return self.forward(*inputs, **kwargs)
  File "/nfs/aishPerf_test/training/model_framework/bert/paddle/model/models/modeling.py", line 729, in forward
    next_sentence_label, return_dict)
  File "/usr/local/lib/python3.7/dist-packages/paddle/nn/layer/layers.py", line 1254, in __call__
    return self.forward(*inputs, **kwargs)
  File "/nfs/aishPerf_test/training/model_framework/bert/paddle/model/models/modeling.py", line 623, in forward
    labels)  # masked_positions: this argument is supposed to locate the masked values
  File "/usr/local/lib/python3.7/dist-packages/paddle/nn/layer/layers.py", line 1254, in __call__
    return self.forward(*inputs, **kwargs)
  File "/nfs/aishPerf_test/training/model_framework/bert/paddle/model/models/modeling.py", line 581, in forward
    prediction_scores = self.predictions(sequence_output, masked_positions)
  File "/usr/local/lib/python3.7/dist-packages/paddle/nn/layer/layers.py", line 1254, in __call__
    return self.forward(*inputs, **kwargs)
  File "/nfs/aishPerf_test/training/model_framework/bert/paddle/model/models/modeling.py", line 517, in forward
    hidden_states = self.transform(hidden_states)
  File "/usr/local/lib/python3.7/dist-packages/paddle/nn/layer/layers.py", line 1254, in __call__
    return self.forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/nn/layer/common.py", line 175, in forward
    x=input, weight=self.weight, bias=self.bias, name=self.name
  File "/usr/local/lib/python3.7/dist-packages/paddle/nn/functional/common.py", line 1842, in linear
    return _C_ops.linear(x, weight, bias)
ValueError: (InvalidArgument) The Input(X) dims size must not be equal 0, but reviced dims size is 0.
  [Hint: Expected phi::product(x.dims()) != 0, but received phi::product(x.dims()):0 == 0:0.] (at ../paddle/phi/kernels/impl/matmul_kernel_impl.h:978)

What is in-container launch?

  1. Background
    In the current cases, a training task runs by creating a Docker container on the target machine (possibly the local machine). Another common scenario is that the runtime environment itself is already Docker, i.e. the user runs the program inside a container. Because Docker cannot be started from within Docker, the workflow must instead attach to an already running container, with passwordless ssh configured inside that container and the case runtime environment prepared in advance.

  2. User guide
    2.1. Set the environment variable

export EXEC_IN_CONTAINER=True

2.2. Make sure the basic server configuration inside the container is complete: hardware drivers, networking, hardware virtualization, etc.

  1. Make sure websites reachable from the Chinese mainland are accessible from the container at normal speed

  2. Make sure the container image and the software packages inside the container are installed in the correct versions

  3. Make sure the hardware can be found inside the container

  4. Make sure root ssh trust and passwordless sudo are configured between all servers

  5. Make sure the monitoring tools are installed: cpu (sysstat), memory (free), power (ipmitool), and system info (the accelerator status command). For example, on Ubuntu install them with apt install [sysstat/ipmitool]

  3. Feature description
    When the environment variable EXEC_IN_CONTAINER=True is set, FlagPerf assumes it is running inside a Docker image. Compared with running on a physical machine, the workflow skips the following 4 steps (see the sketch after this list):

  • Log in to all nodes and prepare the image
  • Start the container on all nodes
  • Configure the container environment on all nodes
  • Shut down the container on all nodes
    With this design, when using the in-container launch feature, the user is responsible for configuring the Docker environment, passwordless ssh between containers, and the dependencies required by the corresponding cases.
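A minimal sketch of how such a flag can gate the workflow; the helper functions below are hypothetical stand-ins for the four skipped steps, not actual FlagPerf internals.

```python
import os

# Hypothetical stand-ins for the container-management steps listed above.
def prepare_image():    print("log in to all nodes and prepare the image")
def start_containers(): print("start the container on all nodes")
def setup_containers(): print("configure the container environment on all nodes")
def stop_containers():  print("shut down the container on all nodes")
def run_case():         print("run the benchmark case")

in_container = os.environ.get("EXEC_IN_CONTAINER", "False").lower() == "true"

if not in_container:      # on a physical machine, the containers are managed for the user
    prepare_image()
    start_containers()
    setup_containers()
run_case()
if not in_container:
    stop_containers()
```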

Single-node training works, but multi-node training hangs when launching on the first machine

1. Set HOST in run_benchmarks/config/cluster_conf.py to the IPs of the two machines.
2. Set CASES in run_benchmarks/config/test_conf.py for two machines, e.g. nnodes is 2.
3. Passwordless ssh is configured between the two machines.
4. The program is run only on the first machine, started with:
export EXEC_IN_CONTAINER=True
python3 ./run_benchmarks/run.py

5. The process hangs while launching on the first machine; the corresponding log is:
[DEBUG] [cluster_manager.py,94]Run cmd on host with ssh. ssh cmd=ssh -o ConnectTimeout=3 -o StrictHostKeyChecking=no -l root -p 22 10.9.67.75 'python3 FlagPerf/training/run_benchmarks/megatron/start_megatron_task.py --vendor nvidia --case_name llama3_8B:megatron_core060:A100:2:8:1 --model_name llama3_8B --train_script run_pretraining.py --nnodes 2 --nproc 8 --hosts 10.9.67.75,10.9.67.76 --hosts_ports 2222 --data_dir flag_perf_test/data_dir --log_dir FlagPerf/training/result/run20240604153551 --log_level debug --extern_config_file config_A100x2x8.py --enable_extern_config --master_port 29501 --round 1 --visible_dev_env CUDA_VISIBLE_DEVICES --master_addr 10.9.67.75 --node_rank 0 --host_addr 10.9.67.75' host=10.9.67.75 timeout=15

6. We have traced the hang to run_cmd_wait in the function shown in the screenshot below:
[screenshot]

7. For multi-node training, are there any other points to pay attention to?
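One way to narrow this down is to run the same remote command outside FlagPerf and check whether ssh itself returns; a hedged sketch, with the host and port taken from the log above purely as examples.

```python
import subprocess

# Reproduce the ssh step from the log in isolation; replace the echo with the
# real start_megatron_task.py command line to check whether the remote launch returns.
cmd = (
    "ssh -o ConnectTimeout=3 -o StrictHostKeyChecking=no -l root -p 22 "
    "10.9.67.75 'echo ssh-ok'"
)
result = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=15)
print(result.returncode, result.stdout.strip(), result.stderr.strip())
```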

The LOCAL_RANK environment variable is changed by start_pytorch_task.py

Description:
When launching training with training/run_benchmarks/pytorch/start_pytorch_task.py, in the single-node single-GPU case the default LOCAL_RANK=-1 is overwritten by

current_env["LOCAL_RANK"] = str(local_rank)

before the process is launched with

process = subprocess.Popen(start_cmd, shell=True, env=current_env)

which leads to errors later on.

For example, when accelerate is invoked, the distributed-type check at https://github.com/huggingface/accelerate/blob/80da9cfb09bb3cc9f1b385cb55d6b90d025a5fd9/src/accelerate/state.py#L195 goes wrong. A possible guard is sketched below.
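A minimal sketch of one possible guard: only export LOCAL_RANK when a real local rank has been assigned, leaving the -1 default untouched. The variables mirror those in start_pytorch_task.py but are placeholders here, and the command is a dummy that just prints what the child process sees.

```python
import os
import subprocess

local_rank = -1  # single-node single-GPU default
start_cmd = "python3 -c \"import os; print('LOCAL_RANK =', os.environ.get('LOCAL_RANK'))\""
current_env = os.environ.copy()

# Only override LOCAL_RANK when a real local rank is assigned, so downstream
# tools (e.g. accelerate) still detect the non-distributed case correctly.
if local_rank >= 0:
    current_env["LOCAL_RANK"] = str(local_rank)

process = subprocess.Popen(start_cmd, shell=True, env=current_env)
process.wait()
```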

Running the nvidia transformer example fails: No module named 'fairseq.data.batch_C'

Specifically using "transformer:pytorch_1.13:A100:1:8:1": "/home/datasets_ckpt/transformer/train/". A quick test shows that fairseq.data.batch_C is already installed in the container and imports fine on its own, but the import fails inside the training flow. The script tested is FlagPerf/training/benchmarks/transformer/pytorch/run_pretraining.py.


The error log:
Traceback (most recent call last):
  File "/data/peiyuan.zhang/FlagPerf/training/benchmarks/transformer/pytorch/run_pretraining.py", line 13, in <module>
    from train.evaluator import Evaluator
  File "/data/peiyuan.zhang/FlagPerf/training/benchmarks/transformer/pytorch/train/evaluator.py", line 4, in <module>
    from fairseq.data import data_utils
  File "/data/peiyuan.zhang/FlagPerf/training/benchmarks/transformer/pytorch/fairseq/data/__init__.py", line 25, in <module>
    from .language_pair_dataset import LanguagePairDataset, load_dataset_splits
  File "/data/peiyuan.zhang/FlagPerf/training/benchmarks/transformer/pytorch/fairseq/data/language_pair_dataset.py", line 26, in <module>
    from . import data_utils
  File "/data/peiyuan.zhang/FlagPerf/training/benchmarks/transformer/pytorch/fairseq/data/data_utils.py", line 30, in <module>
    import fairseq.data.batch_C
ModuleNotFoundError: No module named 'fairseq.data.batch_C'
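A quick diagnostic sketch: check which fairseq the training process actually imports. If fairseq.__file__ points into the FlagPerf benchmark directory rather than site-packages, the local source tree is shadowing the installed package that ships the compiled batch_C extension; this is an assumption about the cause, not a confirmed diagnosis.

```python
import fairseq

print("fairseq imported from:", fairseq.__file__)
try:
    import fairseq.data.batch_C  # compiled extension, only present in the installed build
    print("fairseq.data.batch_C found")
except ModuleNotFoundError as exc:
    print("fairseq.data.batch_C missing:", exc)
```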
