flagopen / flagperf
FlagPerf is an open-source software platform for benchmarking AI chips.
License: Apache License 2.0
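The retinanet configuration is missing the 2x8 and 1x1 setups.
In training/benchmarks/chatglm3_6b/deepspeed/README.md, the 12 GB openwebtext.tar.xz referenced at https://drive.google.com/drive/folders/1IaD_SIIB-K3Sij_-JjWoPy_UrWqQRdjx has been removed from that site and replaced with urlsf_subsetxxxxxx.tar files. Please update the README.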
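For every model in the project's training tasks, MFU is computed as
MFU = 6 * token-per-second * params / per-iter-time
What does the 6 stand for?
This also differs from the formulas given in other papers,
for example the MFU calculations used by Megatron and nanoGPT.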
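One common reading of the factor 6, offered as background rather than as FlagPerf's documented definition: for a dense transformer, training costs roughly 2 FLOPs per parameter per token in the forward pass and roughly 4 in the backward pass, so about 6 * params FLOPs per token. Under that assumption, MFU is the achieved FLOP/s divided by the hardware's peak FLOP/s. Formulas that also count attention FLOPs give somewhat different figures. The function name and the sample numbers below are hypothetical.

```python
def approx_mfu(params: float, tokens_per_second: float,
               peak_flops_per_second: float) -> float:
    """Approximate MFU using the 6 * params FLOPs-per-token rule of thumb.

    Assumption: attention-score FLOPs are ignored, so this is a rough
    estimate, not an exact count.
    """
    if peak_flops_per_second <= 0:
        raise ValueError("peak_flops_per_second must be positive")
    flops_per_token = 6.0 * params              # ~2 forward + ~4 backward
    achieved_flops = flops_per_token * tokens_per_second
    return achieved_flops / peak_flops_per_second


# Hypothetical numbers: 7e9 parameters, 3000 tokens/s on one device,
# and a device peak of 312e12 FLOP/s.
mfu = approx_mfu(7e9, 3000.0, 312e12)
print(f"approximate MFU: {mfu:.3f}")
```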
Hello, the openwebtext dataset download link in the baichuan2-13b training instructions is no longer valid. Please update the link, or confirm whether that dataset matches the openwebtext dataset on Hugging Face (the Hugging Face download is over 20 GB).
At https://github.com/FlagOpen/FlagPerf/tree/main/training/benchmarks/mixtral_8x7B/megatron, the wudao dataset download links on the page fail. Specifically, running:
(1) wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/llama2-datasets/wudao_llama2bpe_content_document.bin
Error:
(2) wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/llama2-datasets/wudao_llama2bpe_content_document.idx
Error:
Multi-card training requires periodic evaluation of model accuracy, but the longformer model does not average the evaluation results across cards during multi-card evaluation. Although each card runs inference on the full dataset, the results are not guaranteed to be identical on every card because of random factors such as dropout. The per-card evaluation results therefore differ, and this can affect the training state: some cards reach the target accuracy and stop training, while others do not and keep training, which makes the machine appear to hang. The evaluation step needs to all_gather the results from all cards and then average them.
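The effect of averaging can be sketched without a cluster. The example below simulates the list of per-rank accuracies that every rank would hold after a gather step; in a real run that list would come from a collective such as torch.distributed.all_gather. The function names, the accuracies, and the target are hypothetical.

```python
from statistics import fmean
from typing import Sequence


def averaged_metric(per_rank_values: Sequence[float]) -> float:
    """Average the evaluation results gathered from all ranks."""
    if not per_rank_values:
        raise ValueError("no evaluation results were gathered")
    return fmean(per_rank_values)


def should_stop(per_rank_values: Sequence[float], target: float) -> bool:
    """Every rank evaluates the same averaged value, so all ranks agree."""
    return averaged_metric(per_rank_values) >= target


# Hypothetical accuracies from 4 ranks, slightly different due to dropout.
gathered = [0.641, 0.638, 0.644, 0.639]
target = 0.640

# Without averaging, each rank decides on its own value and they disagree.
local_decisions = [acc >= target for acc in gathered]
print(local_decisions)

# With averaging, there is a single shared decision for all ranks.
print(should_stop(gathered, target))
```

Because every rank computes the decision from the same gathered list, either all ranks stop or all continue, which avoids the apparent hang.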
Problem:
Following the mixtral_8x7B run instructions, launching on GPU fails with: AttributeError: 'LlamaTokenizer' object has no attribute 'unique_identifiers'
Analysis:
(1) In the official code, FlagPerf/blob/main/training/benchmarks/mixtral_8x7B/megatron/tokenizer.patch, when args.tokenizer_type == 'MistralTokenizer' the patch calls tokenizer = AutoTokenizer.from_pretrained directly;
(2) This differs from how the other tokenizers are implemented in /Megatron-LM/megatron/training/tokenizer/tokenizer.py: the gpt2xx, llama2, llama3 and similar tokenizers inherit from MegatronTokenizer (/Megatron-LM/megatron/core/datasets/megatron_tokenizer.py);
(3) The "unique_identifiers" attribute reported missing above is initialized in the MegatronTokenizer class. Because the MistralTokenizer path in the patch file does not inherit from MegatronTokenizer, unique_identifiers is never initialized.
Suggestion:
Please check whether FlagPerf/blob/main/training/benchmarks/mixtral_8x7B/megatron/tokenizer.patch is missing the inheritance from MegatronTokenizer, for example by following the way the Llama3 tokenizer is implemented.
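The pattern described in the analysis can be illustrated with stand-in classes. This is a minimal sketch, not the Megatron-LM or FlagPerf code: BaseTokenizerStandIn, RawTokenizerStandIn, and WrappedTokenizer are hypothetical names, and the base class only mimics the one behavior that matters here, setting unique_identifiers in its constructor.

```python
from collections import OrderedDict


class BaseTokenizerStandIn:
    """Hypothetical stand-in for a base class that sets unique_identifiers."""

    def __init__(self, *tokenizer_paths, **tokenizer_options):
        self.unique_identifiers = OrderedDict()
        self.unique_identifiers["class"] = type(self).__name__
        self.unique_identifiers["tokenizer_path"] = list(tokenizer_paths)
        for option, value in tokenizer_options.items():
            self.unique_identifiers[option] = str(value)


class RawTokenizerStandIn:
    """Hypothetical stand-in for a tokenizer returned directly by a loader."""

    def __init__(self, path):
        self.path = path

    def encode(self, text):
        return [ord(ch) for ch in text]


class WrappedTokenizer(BaseTokenizerStandIn):
    """Wraps the raw tokenizer and inherits the base-class attributes."""

    def __init__(self, path):
        super().__init__(path)          # initializes unique_identifiers
        self._inner = RawTokenizerStandIn(path)

    def tokenize(self, text):
        return self._inner.encode(text)


raw = RawTokenizerStandIn("some/path")
wrapped = WrappedTokenizer("some/path")
print(hasattr(raw, "unique_identifiers"))      # the raw object lacks it
print(hasattr(wrapped, "unique_identifiers"))  # the wrapper has it
```

Returning the raw object corresponds to the failure mode in the report, while the wrapper shows the suggested direction of going through the base class.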
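Description:
While analyzing start_pytorch_task.py in the training/run_benchmarks/pytorch directory (line 193), it appears that return-code checking is needed for the processes being run. The code currently waits for each process to finish with proc.wait(), but does not handle non-zero return codes.
Suggested change:
Modify the code block starting at line 193 of start_pytorch_task.py to check the return code of each process.
1. If a process's return code is non-zero:
2. Raise an exception with an informative error message.
3. Include the process ID in the error message to identify the failing process.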
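The suggested change could look roughly like the sketch below. It is not the code from start_pytorch_task.py: wait_and_check is a hypothetical helper, and the child processes are tiny Python one-liners used only to produce known exit codes.

```python
import subprocess
import sys


def wait_and_check(procs):
    """Wait for every process, then raise if any exited with a non-zero code."""
    failures = []
    for proc in procs:
        returncode = proc.wait()
        if returncode != 0:
            failures.append((proc.pid, returncode))
    if failures:
        details = ", ".join(
            f"pid {pid} exited with code {code}" for pid, code in failures
        )
        raise RuntimeError(f"training process failed: {details}")


# Hypothetical demo: one child exits with 0, the other with 3.
procs = [
    subprocess.Popen([sys.executable, "-c", f"import sys; sys.exit({code})"])
    for code in (0, 3)
]
try:
    wait_and_check(procs)
except RuntimeError as err:
    print(err)
```

The helper waits for all processes before raising so that no child is left unreaped; raising on the first failure is an equally reasonable choice if the remaining processes are terminated explicitly.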
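End-to-end time | e2e_time | Total time plus Perf initialization and similar overhead
What does this refer to?
Total inference throughput | p_infer_whole | Actual number of inference tokens divided by total inference time
Does "actual number of inference tokens" include both the prompt tokens and the completion tokens,
or only the completion tokens?
MMLU answer accuracy (few_shots: 5)
The ACC of both llama2-7b and Aquila_7b_mmlu is below 50%. With correctness this low, are these test results meaningful?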
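Metric list
Metric name | Metric key | Notes |
---|---|---|
Data precision | precision | fp32 or fp16 |
Batch size | bs | |
Hardware memory usage | mem | Usually called "device memory"; unit is GiB |
End-to-end time | e2e_time | Total time plus Perf initialization and similar overhead |
Total validation throughput | p_val_whole | Actual number of validation tokens divided by total validation time |
Validation compute throughput | p_val_core | Excludes time spent on IO |
Total inference throughput | p_infer_whole | Actual number of inference tokens divided by total inference time |
Inference compute throughput | *p_infer_core | Excludes time spent on IO |
Inference result | acc (inference/validation) | MMLU answer accuracy (few_shots: 5) |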
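The difference between the "whole" and "core" throughput rows can be sketched as follows. This is an illustration under assumptions, not FlagPerf's implementation: the function name and the numbers are hypothetical, and the token count is taken as an input because the table does not say which tokens are counted.

```python
def throughputs(num_tokens: int, total_seconds: float, io_seconds: float):
    """Return (whole, core) throughput in tokens per second.

    whole: tokens divided by the total time.
    core:  tokens divided by the time left after subtracting IO.
    """
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    if not 0 <= io_seconds < total_seconds:
        raise ValueError("io_seconds must be in [0, total_seconds)")
    whole = num_tokens / total_seconds
    core = num_tokens / (total_seconds - io_seconds)
    return whole, core


# Hypothetical run: 12000 tokens in 10 s, of which 2 s were spent on IO.
whole, core = throughputs(12000, 10.0, 2.0)
print(whole, core)
```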
While using the data loaders provided by CPM, we observed a significant performance issue when the jieba library builds its prefix dictionary.
Specifically, this step stalls at the message "Prefix dict has been built successfully." and takes approximately 40 seconds to complete.
We suspect that the problem is related to differences in machine performance.
Traceback (most recent call last):
File "model_framework/bert/paddle/run_pretraining.py", line 136, in <module>
config, state = main()
File "model_framework/bert/paddle/run_pretraining.py", line 91, in main
eval_loss, eval_mlm_acc = evaluator.evaluate(trainer)
File "/usr/local/lib/python3.7/dist-packages/decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/base.py", line 347, in _decorate_function
return func(*args, **kwargs)
File "/nfs/aishPerf_test/training/model_framework/bert/paddle/train/evaluator.py", line 59, in evaluate
loss, mlm_acc, num_masked = trainer.inference(batch)
File "/nfs/aishPerf_test/training/model_framework/bert/paddle/train/trainer.py", line 228, in inference
return self.forward(batch)
File "/nfs/aishPerf_test/training/model_framework/bert/paddle/train/trainer.py", line 223, in forward
next_sentence_labels)
File "/usr/local/lib/python3.7/dist-packages/paddle/nn/layer/layers.py", line 1254, in __call__
return self.forward(*inputs, **kwargs)
File "/nfs/aishPerf_test/training/model_framework/bert/paddle/model/models/modeling.py", line 729, in forward
next_sentence_label, return_dict)
File "/usr/local/lib/python3.7/dist-packages/paddle/nn/layer/layers.py", line 1254, in __call__
return self.forward(*inputs, **kwargs)
File "/nfs/aishPerf_test/training/model_framework/bert/paddle/model/models/modeling.py", line 729, in forward
next_sentence_label, return_dict)
File "/usr/local/lib/python3.7/dist-packages/paddle/nn/layer/layers.py", line 1254, in __call__
return self.forward(*inputs, **kwargs)
File "/nfs/aishPerf_test/training/model_framework/bert/paddle/model/models/modeling.py", line 623, in forward
labels) # the masked_positions argument should locate the values that were masked
File "/usr/local/lib/python3.7/dist-packages/paddle/nn/layer/layers.py", line 1254, in __call__
return self.forward(*inputs, **kwargs)
File "/nfs/aishPerf_test/training/model_framework/bert/paddle/model/models/modeling.py", line 581, in forward
prediction_scores = self.predictions(sequence_output, masked_positions)
File "/usr/local/lib/python3.7/dist-packages/paddle/nn/layer/layers.py", line 1254, in __call__
return self.forward(*inputs, **kwargs)
File "/nfs/aishPerf_test/training/model_framework/bert/paddle/model/models/modeling.py", line 517, in forward
hidden_states = self.transform(hidden_states)
File "/usr/local/lib/python3.7/dist-packages/paddle/nn/layer/layers.py", line 1254, in __call__
return self.forward(*inputs, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddle/nn/layer/common.py", line 175, in forward
x=input, weight=self.weight, bias=self.bias, name=self.name
File "/usr/local/lib/python3.7/dist-packages/paddle/nn/functional/common.py", line 1842, in linear
return _C_ops.linear(x, weight, bias)
ValueError: (InvalidArgument) The Input(X) dims size must not be equal 0, but reviced dims size is 0.
[Hint: Expected phi::product(x.dims()) != 0, but received phi::product(x.dims()):0 == 0:0.] (at ../paddle/phi/kernels/impl/matmul_kernel_impl.h:978)
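Background
In the current cases, training tasks run by creating a Docker container on the target machine (which may be the local machine). Another common scenario is that the runtime environment is itself a Docker container, that is, the user runs the program inside a container. Because Docker cannot be started inside Docker, FlagPerf must instead access an already-running container; passwordless ssh login must be configured inside that container, and the case runtime environment must be prepared in advance.
User guide
2.1. Set the environment variable
export EXEC_IN_CONTAINER=True
2.2. Make sure the basic server setup inside the container is complete: hardware drivers, network, hardware virtualization, and so on
Make sure websites reachable from mainland China can be accessed at normal speed
Make sure the container image and the software packages inside the container are installed at the correct versions
Make sure the hardware can be found inside the container
Make sure ssh trust between the root accounts of all servers and passwordless sudo are set up
Make sure the monitoring tools are installed: CPU (sysstat), memory (free), power (ipmitool), and system information (the command for checking accelerator card status). For example, on Ubuntu install them with apt install [sysstat/ipmitool]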
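The tool check in the last item can be automated with a short script. This is a minimal sketch, not part of FlagPerf: the command-to-package mapping is an assumption (sar is used as a representative sysstat command, and free is assumed to come from procps), and the accelerator status command is left out because it is vendor specific.

```python
import shutil

# Assumed mapping from command name to the package that usually provides it.
REQUIRED_TOOLS = {
    "sar": "sysstat",
    "free": "procps",
    "ipmitool": "ipmitool",
}


def missing_tools(tools, which=shutil.which):
    """Return descriptions of the commands that are not found on PATH."""
    return [
        f"{command} (package: {package})"
        for command, package in tools.items()
        if which(command) is None
    ]


missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("missing monitoring tools: " + ", ".join(missing))
else:
    print("all monitoring tools found")
```

The which parameter is injectable so the check can be exercised without depending on what is installed on the current machine.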
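Feature statement
When the user sets the environment variable EXEC_IN_CONTAINER=True, FlagPerf is treated as running inside a Docker image. Compared with running on a physical machine, the workflow skips the following 4 steps:
When the in-container launch feature is used, configuring the Docker environment, passwordless ssh between containers, and the library dependencies of the case are left to the user.
Can this run on an internal network, or can an external proxy be configured?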
1. Set HOST in run_benchmarks/config/cluster_conf.py to the IPs of the two hosts.
2. Set CASES in run_benchmarks/config/test_conf.py to a two-node case, e.g. nnodes is 2.
3. Passwordless ssh is already configured between the two machines.
4. Run the following only on the first machine:
export EXEC_IN_CONTAINER=True
python3 ./run_benchmarks/run.py
5. The process hangs at startup on the first machine. The corresponding log is:
[DEBUG] [cluster_manager.py,94]Run cmd on host with ssh. ssh cmd=ssh -o ConnectTimeout=3 -o StrictHostKeyChecking=no -l root -p 22 10.9.67.75 'python3 FlagPerf/training/run_benchmarks/megatron/start_megatron_task.py --vendor nvidia --case_name llama3_8B:megatron_core060:A100:2:8:1 --model_name llama3_8B --train_script run_pretraining.py --nnodes 2 --nproc 8 --hosts 10.9.67.75,10.9.67.76 --hosts_ports 2222--data_dir flag_perf_test/data_dir --log_dir FlagPerf/training/result/run20240604153551 --log_level debug --extern_config_file config_A100x2x8.py --enable_extern_config --master_port 29501 --round 1 --visible_dev_env CUDA_VISIBLE_DEVICES --master_addr 10.9.67.75 --node_rank 0 --host_addr 10.9.67.75' host=10.9.67.75 timeout=15
7. Is there anything else to watch out for in multi-node training?
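Description:
When training is launched with training/run_benchmarks/pytorch/start_pytorch_task.py, in the single-node, single-card scenario the default LOCAL_RANK=-1 is
The case used is "transformer:pytorch_1.13:A100:1:8:1": "/home/datasets_ckpt/transformer/train/". A quick test shows that fairseq.data.batch_C is already installed in the container and imports without error on its own, but importing it inside the training flow fails. The script tested is FlagPerf/training/benchmarks/transformer/pytorch/run_pretraining.py
Error log: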
Traceback (most recent call last):
File "/data/peiyuan.zhang/FlagPerf/training/benchmarks/transformer/pytorch/run_pretraining.py", line 13, in <module>
from train.evaluator import Evaluator
File "/data/peiyuan.zhang/FlagPerf/training/benchmarks/transformer/pytorch/train/evaluator.py", line 4, in <module>
from fairseq.data import data_utils
File "/data/peiyuan.zhang/FlagPerf/training/benchmarks/transformer/pytorch/fairseq/data/__init__.py", line 25, in <module>
from .language_pair_dataset import LanguagePairDataset, load_dataset_splits
File "/data/peiyuan.zhang/FlagPerf/training/benchmarks/transformer/pytorch/fairseq/data/language_pair_dataset.py", line 26, in <module>
from . import data_utils
File "/data/peiyuan.zhang/FlagPerf/training/benchmarks/transformer/pytorch/fairseq/data/data_utils.py", line 30, in <module>
import fairseq.data.batch_C
ModuleNotFoundError: No module named 'fairseq.data.batch_C'