
flagperf's People

Contributors

baai-openplatform, chenrui9312, clveryang, dynamicheart, fajingyi, forestlee95, fred1912, hodoryu, huiyiygy, jamesruio, kathrine94, kerwinkai, kungyork, lin5462107, ljy-2000, nrikoh, reiase, scothunder, sherryxie1, shh2000, stezpy, tianxiao-baai, twang07, upvenly, wzd09, xq25478, yan-rui, yuzhou03, zbj-china, zhangling21


flagperf's Issues

Question: how is MFU calculated?

  1. In the training tasks, every model in this project computes MFU as
    MFU = 6 * tokens-per-second * params / per-iter-time
    I am curious what the factor 6 represents (see the sketch after this list).

  2. The formula also differs from the ones given in other papers, e.g. the MFU calculations used by Megatron and nanoGPT.
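For context, a minimal sketch of the widely used MFU convention (Megatron-LM / PaLM-style accounting), not FlagPerf's actual code: the 6 is roughly 2 FLOPs per parameter per token for the forward pass (one multiply plus one add per weight) and about 4 for the backward pass. The function name and the example numbers below are illustrative assumptions.

```python
def mfu(num_params: float, tokens_per_second: float, peak_flops: float) -> float:
    """Model FLOPs Utilization = achieved FLOPs per second / theoretical peak FLOPs."""
    # ~2 FLOPs/param/token forward + ~4 FLOPs/param/token backward -> factor 6
    achieved_flops_per_second = 6 * num_params * tokens_per_second
    return achieved_flops_per_second / peak_flops

# Example: 7B-parameter model, 3000 tokens/s per GPU, A100 BF16 peak ~312 TFLOPS.
print(f"MFU = {mfu(7e9, 3000, 312e12):.2%}")  # ~40%
```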

mixtral_8x7B dataset download URL is broken

Longformer multi-card training: evaluation results are not averaged across cards

During multi-card training, intermittent model accuracy evaluation is required, but the longformer model does not average the evaluation results across cards. Although every card runs inference over the full evaluation data, the inference results are not guaranteed to be identical on all cards (because of random factors such as dropout). As a result, different cards produce different evaluation results, which in turn affects each card's training state: some cards reach the target accuracy and stop training while others do not and keep training, creating the illusion that the machine has hung. The per-card results should be all_gathered during evaluation and then averaged, as in the sketch below.
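A minimal sketch of the suggested fix, assuming a torch.distributed process group is already initialized; an all_reduce averages the metric equivalently to all_gather followed by a mean (the function name is a placeholder, not FlagPerf code).

```python
import torch
import torch.distributed as dist

def averaged_eval_metric(local_metric: float, device: torch.device) -> float:
    """Average a per-rank evaluation metric across all ranks so that every
    rank makes the same stop/continue decision."""
    t = torch.tensor([local_metric], dtype=torch.float32, device=device)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)       # sum the metric over all ranks
    return (t / dist.get_world_size()).item()      # divide by world size = mean
```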

mixtral_8x7B tokenizer error

Problem description:
Following the mixtral_8x7B run instructions, launching on GPU fails with: AttributeError: 'LlamaTokenizer' object has no attribute 'unique_identifiers'

Analysis:
(1) In the official code FlagPerf/blob/main/training/benchmarks/mixtral_8x7B/megatron/tokenizer.patch, when args.tokenizer_type == 'MistralTokenizer' the patch directly calls tokenizer = AutoTokenizer.from_pretrained;

(2) This differs from how the other tokenizers are implemented in /Megatron-LM/megatron/training/tokenizer/tokenizer.py: tokenizers such as gpt2xx, llama2 and llama3 inherit from MegatronTokenizer (/Megatron-LM/megatron/core/datasets/megatron_tokenizer.py);

(3) The unique_identifiers attribute reported missing in the error above is initialized in the MegatronTokenizer class; because the MistralTokenizer created by the patch does not inherit from MegatronTokenizer, unique_identifiers is never initialized.
[screenshot]

Suggestion:
Please check whether FlagPerf/blob/main/training/benchmarks/mixtral_8x7B/megatron/tokenizer.patch is missing the inheritance from MegatronTokenizer, e.g. something similar to the Llama3 implementation:
[screenshot]
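A hedged sketch of such a wrapper, mirroring the spirit of the Llama3 implementation. The class name and the exact set of abstract methods/properties are assumptions and should be checked against the MegatronTokenizer interface in the pinned Megatron-LM version.

```python
from megatron.core.datasets.megatron_tokenizer import MegatronTokenizer
from transformers import AutoTokenizer

class _HFMistralTokenizer(MegatronTokenizer):
    """Illustrative wrapper: inheriting MegatronTokenizer lets the base-class
    __init__ initialize unique_identifiers."""

    def __init__(self, tokenizer_path: str):
        super().__init__(tokenizer_path)  # records unique_identifiers
        self._tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

    def tokenize(self, text: str):
        return self._tokenizer(text, add_special_tokens=False)["input_ids"]

    def detokenize(self, token_ids):
        return self._tokenizer.decode(token_ids)

    @property
    def vocab(self):
        return self._tokenizer.get_vocab()

    @property
    def inv_vocab(self):
        return {v: k for k, v in self._tokenizer.get_vocab().items()}

    @property
    def vocab_size(self):
        return len(self._tokenizer)

    @property
    def eod(self):
        return self._tokenizer.eos_token_id
```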

Handle Non-Zero Return Codes in Process Execution in start_pytorch_task.py

Description:

While reviewing the start_pytorch_task.py file under training/run_benchmarks/pytorch (around line 193), it appears necessary to add return-code checks for the launched processes. The code currently uses proc.wait() to wait for each process to finish, but lacks proper handling of non-zero return codes.

Suggested change:

Modify the code block starting at line 193 of start_pytorch_task.py to add a return-code check for each launched process (see the sketch below):
1. If a process's return code is non-zero,
2. raise an exception with an informative error message, and
3. include the process ID in the error message so the offending process can be identified.
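A minimal sketch of the suggested check. The `processes` list here is a stand-in for the Popen objects that start_pytorch_task.py actually waits on; the dummy command only exists to make the example runnable.

```python
import subprocess

# Stand-in for the Popen objects created earlier in start_pytorch_task.py.
processes = [subprocess.Popen("exit 3", shell=True)]

for proc in processes:
    ret = proc.wait()
    if ret != 0:
        # Surface the failure instead of silently continuing.
        raise RuntimeError(
            f"training process pid={proc.pid} exited with non-zero return code {ret}"
        )
```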

Several questions about the metrics, please clarify

  1. End-to-end time | e2e_time | total time + Perf initialization time, etc.
    What exactly does this refer to?

  2. Whole inference throughput | p_infer_whole | actual number of inferred tokens divided by total inference time
    --- Does "actual number of inferred tokens" include both prompt tokens and completion tokens,
    or only the completion tokens?

  3. MMLU accuracy (few_shots: 5)
    The ACC of both llama2-7b and Aquila_7b_mmlu is below 50%. With correctness this low, are the test results still meaningful?


Metric list

Metric name | Metric key | Notes
Data precision | precision | fp32/fp16 selectable
Batch size | bs |
Device memory usage | mem | usually called "VRAM", in GiB
End-to-end time | e2e_time | total time + Perf initialization time, etc.
Whole validation throughput | p_val_whole | actual number of validated tokens divided by total validation time
Core validation throughput | p_val_core | excludes time spent on IO
Whole inference throughput | p_infer_whole | actual number of inferred tokens divided by total inference time
Core inference throughput | *p_infer_core | excludes time spent on IO
Inference result | acc (inference/validation) | MMLU accuracy (few_shots: 5)
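For illustration, a minimal sketch of the "whole" vs. "core" throughput distinction in the table above; the token count and the IO/compute time split are assumptions, not FlagPerf's actual instrumentation.

```python
inferred_tokens = 1_000_000   # actual number of inferred tokens (illustrative)
total_infer_time = 125.0      # seconds of wall clock, including IO (illustrative)
io_time = 25.0                # seconds spent on data loading / IO (illustrative)

p_infer_whole = inferred_tokens / total_infer_time              # 8000 tokens/s
p_infer_core = inferred_tokens / (total_infer_time - io_time)   # 10000 tokens/s
print(p_infer_whole, p_infer_core)
```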

CPM dataloaders stall for ~40s at "Prefix dict has been built successfully" while jieba builds its prefix dict, possibly machine-dependent

While using the data loaders provided by CPM, we observed a significant performance issue during construction of the prefix dictionary in the jieba library.
Specifically, this step stalls at the message "Prefix dict has been built successfully", taking approximately 40 seconds to complete.
We suspect this problem may be related to variations in machine performance; the stall can be measured in isolation as in the sketch below.
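A small sketch to time the stall on a given machine: jieba.initialize() builds the prefix dictionary eagerly instead of on the first cut() call, so the measurement isolates exactly this step.

```python
import time

import jieba

start = time.time()
jieba.initialize()  # build the prefix dictionary eagerly
print(f"jieba prefix dict built in {time.time() - start:.1f}s")
```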

bert-paddle training fails at runtime (dataset and model prepared following the bert-pytorch tutorial)

Traceback (most recent call last):
  File "model_framework/bert/paddle/run_pretraining.py", line 136, in <module>
    config, state = main()
  File "model_framework/bert/paddle/run_pretraining.py", line 91, in main
    eval_loss, eval_mlm_acc = evaluator.evaluate(trainer)
  File "/usr/local/lib/python3.7/dist-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/base.py", line 347, in _decorate_function
    return func(*args, **kwargs)
  File "/nfs/aishPerf_test/training/model_framework/bert/paddle/train/evaluator.py", line 59, in evaluate
    loss, mlm_acc, num_masked = trainer.inference(batch)
  File "/nfs/aishPerf_test/training/model_framework/bert/paddle/train/trainer.py", line 228, in inference
    return self.forward(batch)
  File "/nfs/aishPerf_test/training/model_framework/bert/paddle/train/trainer.py", line 223, in forward
    next_sentence_labels)
  File "/usr/local/lib/python3.7/dist-packages/paddle/nn/layer/layers.py", line 1254, in __call__
    return self.forward(*inputs, **kwargs)
  File "/nfs/aishPerf_test/training/model_framework/bert/paddle/model/models/modeling.py", line 729, in forward
    next_sentence_label, return_dict)
  File "/usr/local/lib/python3.7/dist-packages/paddle/nn/layer/layers.py", line 1254, in __call__
    return self.forward(*inputs, **kwargs)
  File "/nfs/aishPerf_test/training/model_framework/bert/paddle/model/models/modeling.py", line 729, in forward
    next_sentence_label, return_dict)
  File "/usr/local/lib/python3.7/dist-packages/paddle/nn/layer/layers.py", line 1254, in __call__
    return self.forward(*inputs, **kwargs)
  File "/nfs/aishPerf_test/training/model_framework/bert/paddle/model/models/modeling.py", line 623, in forward
    labels)  # masked_positions: this argument is supposed to locate the masked values
  File "/usr/local/lib/python3.7/dist-packages/paddle/nn/layer/layers.py", line 1254, in __call__
    return self.forward(*inputs, **kwargs)
  File "/nfs/aishPerf_test/training/model_framework/bert/paddle/model/models/modeling.py", line 581, in forward
    prediction_scores = self.predictions(sequence_output, masked_positions)
  File "/usr/local/lib/python3.7/dist-packages/paddle/nn/layer/layers.py", line 1254, in __call__
    return self.forward(*inputs, **kwargs)
  File "/nfs/aishPerf_test/training/model_framework/bert/paddle/model/models/modeling.py", line 517, in forward
    hidden_states = self.transform(hidden_states)
  File "/usr/local/lib/python3.7/dist-packages/paddle/nn/layer/layers.py", line 1254, in __call__
    return self.forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/nn/layer/common.py", line 175, in forward
    x=input, weight=self.weight, bias=self.bias, name=self.name
  File "/usr/local/lib/python3.7/dist-packages/paddle/nn/functional/common.py", line 1842, in linear
    return _C_ops.linear(x, weight, bias)
ValueError: (InvalidArgument) The Input(X) dims size must not be equal 0, but reviced dims size is 0.
  [Hint: Expected phi::product(x.dims()) != 0, but received phi::product(x.dims()):0 == 0:0.] (at ../paddle/phi/kernels/impl/matmul_kernel_impl.h:978)

What is in-container launch?

  1. Background
    In the current cases, a training task runs by creating a Docker container on the target machine (possibly the local machine). Another common scenario is that the runtime environment itself is already Docker, i.e. the user runs the program inside a container. Because Docker cannot be started from within Docker, the workflow must instead attach to an already running container, with passwordless ssh configured inside that container and the case runtime environment prepared in advance.

  2. User guide
    2.1. Set the environment variable

export EXEC_IN_CONTAINER=True

2.2. Make sure the basic server configuration inside the container is complete: hardware drivers, networking, hardware virtualization, etc.

  1. Make sure websites reachable from the Chinese mainland are accessible from the container at normal speed

  2. Make sure the container image and the software packages inside the container are installed in the correct versions

  3. Make sure the hardware can be found inside the container

  4. Make sure root ssh trust and passwordless sudo are configured between all servers

  5. Make sure the monitoring tools are installed: cpu (sysstat), memory (free), power (ipmitool), and system info (the accelerator status command). For example, on Ubuntu install them with apt install [sysstat/ipmitool]

  3. Feature description
    When the environment variable EXEC_IN_CONTAINER=True is set, FlagPerf assumes it is running inside a Docker image. Compared with running on a physical machine, the workflow skips the following 4 steps (see the sketch after this list):

  • Log in to all nodes and prepare the image
  • Start the container on all nodes
  • Configure the container environment on all nodes
  • Shut down the container on all nodes
    With this design, when using the in-container launch feature, the user is responsible for configuring the Docker environment, passwordless ssh between containers, and the dependencies required by the corresponding cases.
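A minimal sketch of how such a flag can gate the workflow; the helper functions below are hypothetical stand-ins for the four skipped steps, not actual FlagPerf internals.

```python
import os

# Hypothetical stand-ins for the container-management steps listed above.
def prepare_image():    print("log in to all nodes and prepare the image")
def start_containers(): print("start the container on all nodes")
def setup_containers(): print("configure the container environment on all nodes")
def stop_containers():  print("shut down the container on all nodes")
def run_case():         print("run the benchmark case")

in_container = os.environ.get("EXEC_IN_CONTAINER", "False").lower() == "true"

if not in_container:      # on a physical machine, the containers are managed for the user
    prepare_image()
    start_containers()
    setup_containers()
run_case()
if not in_container:
    stop_containers()
```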

Single-node training works, but multi-node training hangs when launching on the first machine

1. Set HOST in run_benchmarks/config/cluster_conf.py to the IPs of the two machines.
2. Set CASES in run_benchmarks/config/test_conf.py for two machines, e.g. nnodes is 2.
3. Passwordless ssh is configured between the two machines.
4. The program is run only on the first machine, started with:
export EXEC_IN_CONTAINER=True
python3 ./run_benchmarks/run.py

5. The process hangs while launching on the first machine; the corresponding log is:
[DEBUG] [cluster_manager.py,94]Run cmd on host with ssh. ssh cmd=ssh -o ConnectTimeout=3 -o StrictHostKeyChecking=no -l root -p 22 10.9.67.75 'python3 FlagPerf/training/run_benchmarks/megatron/start_megatron_task.py --vendor nvidia --case_name llama3_8B:megatron_core060:A100:2:8:1 --model_name llama3_8B --train_script run_pretraining.py --nnodes 2 --nproc 8 --hosts 10.9.67.75,10.9.67.76 --hosts_ports 2222 --data_dir flag_perf_test/data_dir --log_dir FlagPerf/training/result/run20240604153551 --log_level debug --extern_config_file config_A100x2x8.py --enable_extern_config --master_port 29501 --round 1 --visible_dev_env CUDA_VISIBLE_DEVICES --master_addr 10.9.67.75 --node_rank 0 --host_addr 10.9.67.75' host=10.9.67.75 timeout=15

6. We have traced the hang to run_cmd_wait in the function shown in the screenshot below:
[screenshot]

7. For multi-node training, are there any other points to pay attention to?
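One way to narrow this down is to run the same remote command outside FlagPerf and check whether ssh itself returns; a hedged sketch, with the host and port taken from the log above purely as examples.

```python
import subprocess

# Reproduce the ssh step from the log in isolation; replace the echo with the
# real start_megatron_task.py command line to check whether the remote launch returns.
cmd = (
    "ssh -o ConnectTimeout=3 -o StrictHostKeyChecking=no -l root -p 22 "
    "10.9.67.75 'echo ssh-ok'"
)
result = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=15)
print(result.returncode, result.stdout.strip(), result.stderr.strip())
```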

The LOCAL_RANK environment variable is changed by start_pytorch_task.py

Description:
When launching training with training/run_benchmarks/pytorch/start_pytorch_task.py, in the single-node single-GPU case the default LOCAL_RANK=-1 is overwritten by

current_env["LOCAL_RANK"] = str(local_rank)

before the process is launched with

process = subprocess.Popen(start_cmd, shell=True, env=current_env)

which leads to errors later on.

For example, when accelerate is invoked, the distributed-type check at https://github.com/huggingface/accelerate/blob/80da9cfb09bb3cc9f1b385cb55d6b90d025a5fd9/src/accelerate/state.py#L195 goes wrong. A possible guard is sketched below.
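A minimal sketch of one possible guard: only export LOCAL_RANK when a real local rank has been assigned, leaving the -1 default untouched. The variables mirror those in start_pytorch_task.py but are placeholders here, and the command is a dummy that just prints what the child process sees.

```python
import os
import subprocess

local_rank = -1  # single-node single-GPU default
start_cmd = "python3 -c \"import os; print('LOCAL_RANK =', os.environ.get('LOCAL_RANK'))\""
current_env = os.environ.copy()

# Only override LOCAL_RANK when a real local rank is assigned, so downstream
# tools (e.g. accelerate) still detect the non-distributed case correctly.
if local_rank >= 0:
    current_env["LOCAL_RANK"] = str(local_rank)

process = subprocess.Popen(start_cmd, shell=True, env=current_env)
process.wait()
```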

Running the nvidia transformer example fails: No module named 'fairseq.data.batch_C'

Specifically using "transformer:pytorch_1.13:A100:1:8:1": "/home/datasets_ckpt/transformer/train/". A quick test shows that fairseq.data.batch_C is already installed in the container and imports fine on its own, but the import fails inside the training flow. The script tested is FlagPerf/training/benchmarks/transformer/pytorch/run_pretraining.py.


The error log:
Traceback (most recent call last):
  File "/data/peiyuan.zhang/FlagPerf/training/benchmarks/transformer/pytorch/run_pretraining.py", line 13, in <module>
    from train.evaluator import Evaluator
  File "/data/peiyuan.zhang/FlagPerf/training/benchmarks/transformer/pytorch/train/evaluator.py", line 4, in <module>
    from fairseq.data import data_utils
  File "/data/peiyuan.zhang/FlagPerf/training/benchmarks/transformer/pytorch/fairseq/data/__init__.py", line 25, in <module>
    from .language_pair_dataset import LanguagePairDataset, load_dataset_splits
  File "/data/peiyuan.zhang/FlagPerf/training/benchmarks/transformer/pytorch/fairseq/data/language_pair_dataset.py", line 26, in <module>
    from . import data_utils
  File "/data/peiyuan.zhang/FlagPerf/training/benchmarks/transformer/pytorch/fairseq/data/data_utils.py", line 30, in <module>
    import fairseq.data.batch_C
ModuleNotFoundError: No module named 'fairseq.data.batch_C'
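A quick diagnostic sketch: check which fairseq the training process actually imports. If fairseq.__file__ points into the FlagPerf benchmark directory rather than site-packages, the local source tree is shadowing the installed package that ships the compiled batch_C extension; this is an assumption about the cause, not a confirmed diagnosis.

```python
import fairseq

print("fairseq imported from:", fairseq.__file__)
try:
    import fairseq.data.batch_C  # compiled extension, only present in the installed build
    print("fairseq.data.batch_C found")
except ModuleNotFoundError as exc:
    print("fairseq.data.batch_C missing:", exc)
```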
