
amperecomputingai / ampere_model_library


AML's goal is to make benchmarking of various AI architectures on Ampere CPUs a pleasurable experience :)

Home Page: https://hub.docker.com/u/amperecomputingai

License: Apache License 2.0

Python 94.49% Jupyter Notebook 4.61% Shell 0.90%
aarch64 ampere arm64 armv8-a artificial-intelligence computer-vision inference machine-learning mlperf-inference model-zoo natural-language-processing onnxruntime pytorch tensorflow dlrm large-language-models llama2 yolov8

ampere_model_library's Introduction


Ampere Model Library


AML's goal is to make benchmarking of various AI architectures on Ampere CPUs a pleasurable experience :)

This means we want the library to be quick to set up and to get you the numbers you are interested in. On top of that, we want the code to be readable and well structured so it's easy to inspect what exactly is being measured. If you feel we are not quite there yet, please let us know right away by raising an issue! Thank you :)

AML setup

Visit our Docker Hub for our framework selection.

sudo apt update && sudo apt install -y docker.io
sudo docker run --privileged=true -it amperecomputingai/pytorch:latest
# we also offer onnxruntime and tensorflow

You should see terminal output similar to this:

Ampere docker welcome prompt

Now, inside the Docker container, run:

git clone --recursive https://github.com/AmpereComputingAI/ampere_model_library.git
cd ampere_model_library
bash setup_deb.sh
source set_env_variables.sh

You are good to go! 👌

Examples

The go-to solution is the benchmark.py script

The benchmark script allows you to quickly evaluate the performance of your Ampere system using the following models:

  • ResNet-50 v1.5
  • Whisper medium EN
  • DLRM
  • BERT large
  • YOLO v8s

It's incredibly user-friendly and designed to assist you in getting the best out of your system.

After completing setup with Ampere Optimized PyTorch (see AML setup), it's as easy as:

python3 benchmark.py --no-interactive  # remove --no-interactive if you want a quick estimation of performance

Evaluation results

Running particular AI architectures

Architectures are categorized based on the task they were originally envisioned for. Therefore, you will find ResNet and VGG under computer_vision and BERT under natural_language_processing. The usual workflow is to first set up AML (see AML setup), source the environment variables by running source set_env_variables.sh, and then run run.py (or a similarly named Python file) in the directory of the architecture you want to benchmark, as sketched below. Some models require additional setup steps to be completed first; these are described in the README.md files in their respective directories.
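A minimal sketch of that workflow, using the ResNet-50 directory as an example (the exact run.py flags differ per architecture; check each model's README.md or the script's help output):

source set_env_variables.sh
cd computer_vision/classification/resnet_50_v15
python3 run.py --help  # with argparse-based scripts this lists the supported arguments (model, precision, batch size, framework, etc.)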

ResNet-50 v1.5

ResNet-50 architecture

Note that the example uses PyTorch - we recommend using Ampere Optimized PyTorch for best results (see AML setup).

source set_env_variables.sh
IGNORE_DATASET_LIMITS=1 AIO_IMPLICIT_FP16_TRANSFORM_FILTER=".*" AIO_NUM_THREADS=32 python3 computer_vision/classification/resnet_50_v15/run.py -m resnet50 -p fp32 -b 16 -f pytorch

The command above will run the model utilizing 32 threads with a batch size of 16. Implicit conversion to the FP16 datatype will be applied - you can default to fp32 precision by not setting the AIO_IMPLICIT_FP16_TRANSFORM_FILTER variable.

PSA: you can adjust the level of AIO debug messages by setting AIO_DEBUG_MODE to a value in the range 0 to 4 (where 0 is the most peaceful).
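For instance, the same ResNet-50 run kept at fp32 (FP16 filter unset) and with the quietest debug level looks like this:

source set_env_variables.sh
IGNORE_DATASET_LIMITS=1 AIO_DEBUG_MODE=0 AIO_NUM_THREADS=32 python3 computer_vision/classification/resnet_50_v15/run.py -m resnet50 -p fp32 -b 16 -f pytorch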

Whisper tiny EN

Whisper architecture

Note that the example uses PyTorch - we recommend using Ampere Optimized PyTorch for best results (see AML setup).

source set_env_variables.sh
AIO_IMPLICIT_FP16_TRANSFORM_FILTER=".*" AIO_NUM_THREADS=32 python3 speech_recognition/whisper/run.py -m tiny.en

The command above will run the model utilizing 32 threads. Implicit conversion to the FP16 datatype will be applied - you can default to fp32 precision by not setting the AIO_IMPLICIT_FP16_TRANSFORM_FILTER variable.

LLaMA2 7B

Transformer vs LLaMA

Note that the example uses PyTorch - we recommend using Ampere Optimized PyTorch for best results (see AML setup).

Before running this example you need to be granted access by Meta to the LLaMA2 model. Go here: Meta and here: HF to learn more.

source set_env_variables.sh
wget https://github.com/tloen/alpaca-lora/raw/main/alpaca_data.json
AIO_IMPLICIT_FP16_TRANSFORM_FILTER=".*" AIO_NUM_THREADS=32 python3 natural_language_processing/text_generation/llama2/run.py -m meta-llama/Llama-2-7b-chat-hf --dataset_path=alpaca_data.json

The command above will run the model utilizing 32 threads. Implicit conversion to the FP16 datatype will be applied - you can default to fp32 precision by not setting the AIO_IMPLICIT_FP16_TRANSFORM_FILTER variable.

YOLO v8 large

YOLO object detection

Note that the example uses PyTorch - we recommend using Ampere Optimized PyTorch for best results (see AML setup).

source set_env_variables.sh
wget https://github.com/ultralytics/assets/releases/download/v0.0.0/yolov8l.pt
AIO_IMPLICIT_FP16_TRANSFORM_FILTER=".*" AIO_NUM_THREADS=32 python3 computer_vision/object_detection/yolo_v8/run.py -m yolov8l.pt -p fp32 -f pytorch

The command above will run the model utilizing 32 threads. Implicit conversion to the FP16 datatype will be applied - you can default to fp32 precision by not setting the AIO_IMPLICIT_FP16_TRANSFORM_FILTER variable.

BERT large

BERT embeddings

Note that the example uses PyTorch - we recommend using Ampere Optimized PyTorch for best results (see AML setup).

source set_env_variables.sh
wget -O bert_large_mlperf.pt https://zenodo.org/records/3733896/files/model.pytorch?download=1
AIO_IMPLICIT_FP16_TRANSFORM_FILTER=".*" AIO_NUM_THREADS=32 python3 natural_language_processing/extractive_question_answering/bert_large/run_mlperf.py -m bert_large_mlperf.pt -p fp32 -f pytorch

The command above will run the model utilizing 32 threads. Implicit conversion to the FP16 datatype will be applied - you can default to fp32 precision by not setting the AIO_IMPLICIT_FP16_TRANSFORM_FILTER variable.

ampere_model_library's People

Contributors

amperetravis, dkupnicki, frank-onspecta, jan-grzybek-ampere, kaczorrrro, kkontny, marcelwilnicki, mwalenciak-ampere, smamindl


ampere_model_library's Issues

Add support for torch.jit.trace() models to Pytorch.

Currently the model zoo supports only the torch.jit.script(model) API to convert a PyTorch model to TorchScript. However, not all models work with this API (3D-UNet, Hugging Face BERT models). There is a second API in PyTorch for converting a model to TorchScript, called trace: a model can be converted with a torch.jit.trace(model, input_tensor) call.
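A minimal sketch of the trace path described above, using a torchvision ResNet-50 as a stand-in (this is not the actual model zoo runner code):

import torch
import torchvision

# build an example model and a representative input; tracing records the ops executed for this input
model = torchvision.models.resnet50().eval()
example_input = torch.randn(1, 3, 224, 224)

traced = torch.jit.trace(model, example_input)  # works for some models that torch.jit.script() rejects
frozen = torch.jit.freeze(traced)               # optional inference-time freeze, mirroring the script path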

It seems that scikit-image has not been built correctly.

Fresh build,
Machine: Graviton
Command: python3 benchmark_tf.py -m mobilenet_v2 -p fp32 -b 16 --timeout=15.0
Full log: https://ci3.onspecta.com/job/Nightly_Builds/job/tf2-long-run/123/console

WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.

Traceback (most recent call last):
  File "/root/miniforge3/envs/framework/lib/python3.8/site-packages/skimage/__init__.py", line 121, in <module>
    from ._shared import geometry
ImportError: /root/miniforge3/envs/framework/lib/python3.8/site-packages/skimage/_shared/../../scikit_image.libs/libgomp-d22c30c5.so.1.0.0: cannot allocate memory in static TLS block

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "benchmark_tf.py", line 2, in <module>
    import models.tf as tf_models
  File "/space/jenkins/workspace/Nightly_Builds/tf2-long-run/model_zoo_dev/models/tf.py", line 50, in <module>
    import model_zoo.computer_vision.semantic_segmentation.unet_3d.brats_19.run as run_3d_unet_brats
  File "/space/jenkins/workspace/Nightly_Builds/tf2-long-run/model_zoo_dev/model_zoo/computer_vision/semantic_segmentation/unet_3d/brats_19/run.py", line 5, in <module>
    from utils.cv.brats import BraTS19
  File "/space/jenkins/workspace/Nightly_Builds/tf2-long-run/model_zoo_dev/model_zoo/utils/cv/brats.py", line 10, in <module>
    from batchgenerators.utilities.file_and_folder_operations import *
  File "/root/miniforge3/envs/framework/lib/python3.8/site-packages/batchgenerators/__init__.py", line 3, in <module>
    import batchgenerators.augmentations
  File "/root/miniforge3/envs/framework/lib/python3.8/site-packages/batchgenerators/augmentations/__init__.py", line 3, in <module>
    from . import color_augmentations, crop_and_pad_augmentations, spatial_transformations, \
  File "/root/miniforge3/envs/framework/lib/python3.8/site-packages/batchgenerators/augmentations/color_augmentations.py", line 17, in <module>
    from batchgenerators.augmentations.utils import general_cc_var_num_channels, illumination_jitter
  File "/root/miniforge3/envs/framework/lib/python3.8/site-packages/batchgenerators/augmentations/utils.py", line 25, in <module>
    from skimage.transform import resize
  File "/root/miniforge3/envs/framework/lib/python3.8/site-packages/skimage/__init__.py", line 124, in <module>
    _raise_build_error(e)
  File "/root/miniforge3/envs/framework/lib/python3.8/site-packages/skimage/__init__.py", line 102, in _raise_build_error
    raise ImportError("""%s
ImportError: /root/miniforge3/envs/framework/lib/python3.8/site-packages/skimage/_shared/../../scikit_image.libs/libgomp-d22c30c5.so.1.0.0: cannot allocate memory in static TLS block
It seems that scikit-image has not been built correctly.

Add READMEs

Elements:

  • ref accuracy for every precision / framework
  • instructions on downloading models + datasets; help the user get through the process as much as possible
  • summary / comparison of models (their acc + perf at a given precision) trained on the same task at e.g. the model_zoo/cv/classification level
  • change DLS to AIO, OnSpecta to Ampere
  • info on each model's limitations, such as the max batch size supported
  • some visualizations of the task, output, etc.? GIFs, images, strings with examples for NLP ... let's make things visually appealing :P

VGG16 fails in Pytorch

Inside ghcr.io/amperecomputingai/arm_aml_pytorch:fa0620f container:

After launching the run.py script I got the following error:

/onspecta/model_zoo_dev/ampere_model_library/computer_vision/classification/vgg_16# AIO_NUM_THREADS=16 IMAGENET_IMG_PATH=/onspecta/model_zoo_dev/downloads/ImageNet/ILSVRC2012_onspecta/ IMAGENET_LABELS_PATH=/onspecta/model_zoo_dev/downloads/ImageNet/imagenet_labels_onspecta.txt python run.py -p fp32 -f pytorch
Segmentation fault (core dumped)

The problem is the TF import in this file. When I comment out the following lines from the script:
#from utils.tf import TFFrozenModelRunner
#from utils.tflite import TFLiteRunner
the script works. All models that directly support both PyTorch and TF have this issue.
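A hypothetical sketch of that workaround as a code change: defer the TF import so it only happens when a TF runner is actually requested (module and class names are taken from the issue text, not verified against the current code):

def get_tf_runner():
    # importing TF lazily keeps the PyTorch-only path from loading TensorFlow at module import time
    from utils.tf import TFFrozenModelRunner
    return TFFrozenModelRunner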

2x performance drop using pytorch depending on how input data is fed into model

Hi!

I'm not sure if this is a good place to get help/report an issue; if not, let me know what would be a better way.

First of all, thanks for the great library! However, trying it out, I faced some performance issues that I can't fully understand. Here is an example script. It is quite simple: I'm using a standard resnet50 model and feeding it some data in a loop. What I noticed is that small changes in how I feed the data can cause a quite drastic 2x performance drop.

import time
import torch
from torchvision.models import resnet50

n_warmup = 5
n_bench = 50

model = resnet50()
model.eval()
model = torch.compile(model, backend='aio', options={'modelname': 'resnet50'})

data = torch.randn(1, 3, 224, 224)

for i in range(n_bench + n_warmup):
    if i == n_warmup:
        start = time.time()
    x = data                              # Latency: ~100 ms
    # x = torch.stack([data[0]])          # Latency: ~200 ms
    # x = torch.randn(1, 3, 224, 224)     # Latency: ~100 ms
    # x = data.clone()                    # Latency: ~200 ms 
    model(x)

duration = time.time() - start
latency = duration / n_bench

print(f'Latency: {latency * 1000:.0f} ms')

So, if you use the same data or create random data on each iteration it works fast. However, if you use torch.stack or just .clone(), performance drops. TBH, I don't understand why these quite small changes would matter.

I tried to use torch.jit.script and torch.jit.trace instead of torch.compile, but the results weren't any better; in fact, with torch.jit.script latency was ~200 ms even in the 1st case (same data on each iteration). torch.jit.trace and torch.compile were very similar.

For comparison, without compilation/scripting/tracing latency is ~290ms no matter how you feed the data.

What I also noticed is a difference in CPU usage. In both cases both CPU cores are 100% utilized; however, when it is slow, more time is spent in kernel threads (red portion).

Fast:
Pasted Graphic 9

Slow:
Pasted Graphic 8

I'm also attaching logs obtained with AIO_DEBUG_MODE=5

log_fast.txt
log_slow.txt

I'm using your latest docker image amperecomputingai/pytorch:1.7.0

Is there something obvious that I'm missing?

AML should support AIO_NUM_THREADS="all"

AIO supports running on the maximum number of cores detected. This is an important setting. In ML deployments through Kubernetes, CPU pinning will be controlled dynamically by Docker and the Kubernetes controller, not by AIO env var flags. The model zoo scripts currently only support AIO_NUM_THREADS as a number, but AIO actually supports AIO_NUM_THREADS="all". We need to remove this constraint.
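A minimal sketch of relaxing that constraint when parsing the variable (the helper name is hypothetical; the actual AML parsing code may differ):

import os

def parse_aio_num_threads(value: str) -> int:
    # accept "all" as "use every detected core", otherwise expect an integer
    if value.strip().lower() == "all":
        return os.cpu_count() or 1  # cpu_count() can return None on some systems
    return int(value)

num_threads = parse_aio_num_threads(os.environ.get("AIO_NUM_THREADS", "all"))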

Nano GPT model

The command:

python run.py --model_name gpt2 -f pytorch --lambada_path /ampere/aml/natural_language_processing/text_generation/nanogpt/lambada_test_plain_text.txt

results in the following error:

Traceback (most recent call last):
File "/ampere/mzd/ampere_model_library/utils/pytorch.py", line 65, in init
self._frozen_script = torch.jit.freeze(torch.jit.script(self._model), preserved_attrs=[func])
File "/usr/local/lib/python3.10/dist-packages/torch/jit/_script.py", line 1284, in script
return torch.jit._recursive.create_script_module(
File "/usr/local/lib/python3.10/dist-packages/torch/jit/_recursive.py", line 480, in create_script_module
return create_script_module_impl(nn_module, concrete_type, stubs_fn)
File "/usr/local/lib/python3.10/dist-packages/torch/jit/_recursive.py", line 542, in create_script_module_impl
script_module = torch.jit.RecursiveScriptModule._construct(cpp_module, init_fn)
File "/usr/local/lib/python3.10/dist-packages/torch/jit/_script.py", line 614, in _construct
init_fn(script_module)
File "/usr/local/lib/python3.10/dist-packages/torch/jit/_recursive.py", line 520, in init_fn
scripted = create_script_module_impl(orig_value, sub_concrete_type, stubs_fn)
File "/usr/local/lib/python3.10/dist-packages/torch/jit/_recursive.py", line 542, in create_script_module_impl
script_module = torch.jit.RecursiveScriptModule._construct(cpp_module, init_fn)
File "/usr/local/lib/python3.10/dist-packages/torch/jit/_script.py", line 614, in _construct
init_fn(script_module)
File "/usr/local/lib/python3.10/dist-packages/torch/jit/_recursive.py", line 520, in init_fn
scripted = create_script_module_impl(orig_value, sub_concrete_type, stubs_fn)
File "/usr/local/lib/python3.10/dist-packages/torch/jit/_recursive.py", line 542, in create_script_module_impl
script_module = torch.jit.RecursiveScriptModule._construct(cpp_module, init_fn)
File "/usr/local/lib/python3.10/dist-packages/torch/jit/_script.py", line 614, in _construct
init_fn(script_module)
File "/usr/local/lib/python3.10/dist-packages/torch/jit/_recursive.py", line 520, in init_fn
scripted = create_script_module_impl(orig_value, sub_concrete_type, stubs_fn)
File "/usr/local/lib/python3.10/dist-packages/torch/jit/_recursive.py", line 542, in create_script_module_impl
script_module = torch.jit.RecursiveScriptModule._construct(cpp_module, init_fn)
File "/usr/local/lib/python3.10/dist-packages/torch/jit/_script.py", line 614, in _construct
init_fn(script_module)
File "/usr/local/lib/python3.10/dist-packages/torch/jit/_recursive.py", line 520, in init_fn
scripted = create_script_module_impl(orig_value, sub_concrete_type, stubs_fn)
File "/usr/local/lib/python3.10/dist-packages/torch/jit/_recursive.py", line 546, in create_script_module_impl
create_methods_and_properties_from_stubs(concrete_type, method_stubs, property_stubs)
File "/usr/local/lib/python3.10/dist-packages/torch/jit/_recursive.py", line 397, in create_methods_and_properties_from_stubs
concrete_type._create_methods_and_properties(property_defs, property_rcbs, method_defs, method_rcbs, method_defaults)
RuntimeError:
Module 'CausalSelfAttention' has no attribute 'bias' :
File "/ampere/aml/natural_language_processing/text_generation/nanogpt/nanoGPT/model.py", line 78
# manual implementation of attention
att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
~~~~~~~~~ <--- HERE
att = F.softmax(att, dim=-1)
att = self.attn_dropout(att)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/ampere/aml/natural_language_processing/text_generation/nanogpt/run.py", line 78, in
main()
File "/ampere/aml/natural_language_processing/text_generation/nanogpt/run.py", line 71, in main
run_pytorch_fp32(**vars(args))
File "/ampere/aml/natural_language_processing/text_generation/nanogpt/run.py", line 63, in run_pytorch_fp32
run_pytorch(model_name, batch_size, num_runs, timeout, lambada_path, disable_jit_freeze, **kwargs)
File "/ampere/aml/natural_language_processing/text_generation/nanogpt/run.py", line 57, in run_pytorch
runner = PyTorchRunner(model, disable_jit_freeze=disable_jit_freeze, func="generate")
File "/ampere/mzd/ampere_model_library/utils/pytorch.py", line 69, in init
self._frozen_script = torch.jit.freeze(torch.jit.trace(self._model, example_inputs))
File "/usr/local/lib/python3.10/dist-packages/torch/jit/_trace.py", line 793, in trace
raise RuntimeError("example_kwarg_inputs should be a dict")
RuntimeError: example_kwarg_inputs should be a dict

Alpaca accuracy

I've noticed that Alpaca's accuracy takes a hit with our submodule.

With default transformers it is at F1 ~0.6, with the submodule ~0.3 - this happens also with native torch, so it's not an AIO issue but rather something introduced by changes in the submodule. Additionally, native torch's performance is highly degraded by the submodule changes - w/o them, Alpaca runs at 0.7 tps with 32t; w/ them it runs at 0.05 tps. This is acceptable if necessary to enable AIO perf (0.9 tps with 32t), but if both AIO and native torch could be well served by the same code that would be awesome, because otherwise we won't be able to use this code to benchmark the competition and compare directly.

@dkupnicki can you take a look into the submodule's code and figure out why this is? @kkontny if you have any hints / intuition please provide us.

PyTorch models need two warmup runs

On the first run of a model, PyTorch gathers profiling data for the graph optimization which occurs on the second run. Each of these runs can take several seconds, which can bias the performance results. The first two runs in PyTorch should be dropped from the performance calculations.
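A minimal sketch of what dropping those runs could look like in a benchmarking loop (plain torchvision ResNet-50 as a stand-in, not the actual AML benchmark code):

import time
import torch
from torchvision.models import resnet50

model = resnet50().eval()
batch = torch.randn(1, 3, 224, 224)
num_runs = 10
warmup_runs = 2  # the first two runs are spent on profiling and graph optimization

latencies = []
with torch.no_grad():
    for _ in range(num_runs):
        start = time.time()
        model(batch)
        latencies.append(time.time() - start)

steady_state = latencies[warmup_runs:]
print(f"avg latency: {sum(steady_state) / len(steady_state) * 1000:.0f} ms")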

Add PyTorch models

Please work on adding the following models' PyTorch pipelines to the model zoo:

  • resnet18
  • resnet50
  • vgg16
  • mobilenet_v3_large
  • mobilenet_v3_small
  • mobilenet_v2
  • squeezenet1
  • shufflenet_v2_x1_0
  • mnasnet1_0
  • resnext50_32x4d
  • wide_resnet50_2
  • densenet121
  • inception_v3
  • alexnet
  • googlenet
  • bert_base

The list comes from @kkontny - reportedly the models should be available on the karol/pytorch branch of objdet.
Second-priority task for now - if a model is not trained / there's no pipeline ready for it, don't stress about it.

--disable_jit_freeze flag is not working in many Pytorch models

In some models (I found it in resnet50, vgg16, bert_base, roberta, but please check others too) the disable_jit_freeze parameter isn't correctly passed to the run_pytorch_fp function.
As a result, running with JIT freeze disabled doesn't work, which is very useful for testing eager mode.

CI test for model_zoo

  • CI test replicating the way an MZ user interacts with the library
  • testing new PRs with Jenkins and returning a status

Issues running AML on amperecomputingai/onnxruntime:1.8.0 image

Hi team, the following issues were observed while trying AML hands-on with the onnxrt-aio:1.8.0 image. Please advise whether these can be ignored or worked around.

1. Error while executing git clone --recursive https://github.com/AmpereComputingAI/ampere_model_library.git

...
Receiving objects: 100% (559/559), 1.30 MiB | 45.85 MiB/s, done.
Resolving deltas: 100% (337/337), done.
Cloning into '/home/azureuser/ampere_model_library/text_to_image/stable_diffusion/stablediffusion'...
git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: clone of 'git@github.com:AmpereComputingAI/stablediffusion.git' into submodule path '/home/azureuser/ampere_model_library/text_to_image/stable_diffusion/stablediffusion' failed
Failed to clone 'text_to_image/stable_diffusion/stablediffusion' a second time, aborting

2. Error while executing bash setup_deb.sh

...
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
test-tube 0.7.5 requires torch>=1.1.0, which is not installed.
streamlit 1.28.0 requires altair<6,>=4.0, which is not installed.
streamlit 1.28.0 requires blinker<2,>=1.0.0, which is not installed.
streamlit 1.28.0 requires gitpython!=3.1.19,<4,>=3.0.7, which is not installed.
streamlit 1.28.0 requires pydeck<1,>=0.8.0b4, which is not installed.
streamlit 1.28.0 requires tenacity<9,>=8.1.0, which is not installed.
streamlit 1.28.0 requires toml<2,>=0.10.1, which is not installed.
streamlit 1.28.0 requires tzlocal<6,>=1.1, which is not installed.
streamlit 1.28.0 requires validators<1,>=0.2, which is not installed.
streamlit 1.28.0 requires watchdog>=2.1.5; platform_system != "Darwin", which is not installed.
...
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ultralytics 8.0.75 requires seaborn>=0.11.0, which is not installed.
ultralytics 8.0.75 requires sentry-sdk, which is not installed.
ultralytics 8.0.75 requires torchvision>=0.8.1, which is not installed.
open-clip-torch 2.7.0 requires torchvision, which is not installed.
nnunet 1.7.0 requires dicom2nifti, which is not installed.
nnunet 1.7.0 requires sklearn, which is not installed.
batchgenerators 0.21 requires unittest2, which is not installed.
pytorch-lightning 1.9.1 requires torchmetrics>=0.7.0, but you have torchmetrics 0.6.0 which is incompatible.
nnunet 1.7.0 requires batchgenerators>=0.23, but you have batchgenerators 0.21 which is incompatible.
...
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ultralytics 8.0.75 requires seaborn>=0.11.0, which is not installed.
ultralytics 8.0.75 requires sentry-sdk, which is not installed.

3. Attempt to get default onnx fp32 result:

# AIO_PROCESS_MODE=0 OMP_NUM_THREADS=16 python3 run.py -m resnet50_v1.onnx -p fp32 -f ort

FAIL: this model seems to be unsupported in a specified precision: fp32

Looks like ort backend supports fp16 only in AML? https://github.com/AmpereComputingAI/ampere_model_library/blob/main/computer_vision/classification/resnet_50_v15/run.py#L187

4. Attempt to get default onnx fp16 result:

# AIO_PROCESS_MODE=0 OMP_NUM_THREADS=16 python3 run.py -m resnet50_v1.onnx -p fp16 -f ort

Intraop parallelism set to 16 threads


Running with ONNX Runtime

  0%|                                                                                                                     | 0/60 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 199, in <module>
    main()
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 188, in main
    run_ort_fp16(**vars(args))
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 151, in run_ort_fp16
    return run_ort_fp(model_path, batch_size, num_runs, timeout, images_path, labels_path)
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 127, in run_ort_fp
    return run_model(run_single_pass, runner, dataset, batch_size, num_runs, timeout)
  File "/aml/utils/benchmark.py", line 229, in run_model
    single_pass_func(runner, dataset)
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 114, in run_single_pass
    output = ort_runner.run(batch_size)
  File "/aml/utils/ort.py", line 36, in run
    outputs = self.session.run(self._output_names, self._feed_dict)
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 200, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Unexpected input data type. Actual: (tensor(float16)) , expected: (tensor(float))
  0%|                                                                                                                     | 0/60 [00:00<?, ?it/s]

5. Attempt to get onnx-aio fp16 result:

# AIO_NUM_THREADS=16 python3 run.py -m resnet50_v1.onnx -p fp16 -f ort


Intraop parallelism set to 16 threads


Running with ONNX Runtime

  0%|                                                                                                                     | 0/60 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 199, in <module>
    main()
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 188, in main
    run_ort_fp16(**vars(args))
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 151, in run_ort_fp16
    return run_ort_fp(model_path, batch_size, num_runs, timeout, images_path, labels_path)
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 127, in run_ort_fp
    return run_model(run_single_pass, runner, dataset, batch_size, num_runs, timeout)
  File "/aml/utils/benchmark.py", line 229, in run_model
    single_pass_func(runner, dataset)
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 114, in run_single_pass
    output = ort_runner.run(batch_size)
  File "/aml/utils/ort.py", line 36, in run
    outputs = self.session.run(self._output_names, self._feed_dict)
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 200, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Unexpected input data type. Actual: (tensor(float16)) , expected: (tensor(float))
  0%|                                                                                                                     | 0/60 [00:00<?, ?it/s]

The onnx model is https://zenodo.org/record/2592612/files/resnet50_v1.onnx (see the input-inspection sketch after point 6 below).

6. Attempt to download the ONNX Runtime model in fp16 precision,
https://www.dropbox.com/s/r80ndhbht7tixn5/resnet_50_v1.5_fp16.onnx, described in README.md:

# python3 run.py -m resnet_50_v1.5_fp16.onnx -p fp16 -f ort

Intraop parallelism set to 16 threads

Traceback (most recent call last):
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 199, in <module>
    main()
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 188, in main
    run_ort_fp16(**vars(args))
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 151, in run_ort_fp16
    return run_ort_fp(model_path, batch_size, num_runs, timeout, images_path, labels_path)
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 125, in run_ort_fp
    runner = OrtRunner(model_path)
  File "/aml/utils/ort.py", line 26, in __init__
    self.session = ort.InferenceSession(model, session_options, providers=ort.get_available_providers())
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 360, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 397, in _create_inference_session
    sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidProtobuf: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Load model from resnet_50_v1.5_fp16.onnx failed:Protobuf parsing failed.
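Regarding points 4 and 5 above, a minimal sketch for checking which input dtype the downloaded ONNX model actually expects (model path as in point 3); if it reports tensor(float), the graph is fp32 and feeding fp16 inputs will fail exactly as shown:

import onnxruntime as ort

session = ort.InferenceSession("resnet50_v1.onnx", providers=ort.get_available_providers())
for inp in session.get_inputs():
    # tensor(float) means the model expects fp32 inputs; tensor(float16) would accept fp16
    print(inp.name, inp.type, inp.shape)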

tensorflow profiler layers information for NLP models using transformers repo

You can run the project in this Docker image: https://github.com/OnSpecta/manifests/blob/main/x86/tf2.x/Dockerfile
After setting up the Docker container and downloading the model zoo, run the following:

export PYTHONPATH=/path/to/model_zoo

Check out the branch marcel/transformers and run
git submodule update --init --recursive
then
git pull --recurse-submodules
next cd into nlp/transformers and perform
pip install -e .

Now you should be able to run the model_zoo/nlp/run.py script. Example:

python3 nlp/run.py --sequence_length=1 --batch_size=1

In order to run with the profiler, set the --profiler flag and set the path to the logs with the environment variable PROFILER_LOG_DIR.
NOTE: set a distinct directory for this purpose, as it is cleared with each run of the profiler.
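A hedged end-to-end example combining the steps above (the log directory path is just an illustration):

export PYTHONPATH=/path/to/model_zoo
export PROFILER_LOG_DIR=/tmp/profiler_logs  # use a dedicated directory; it is cleared on every profiler run
python3 nlp/run.py --sequence_length=1 --batch_size=1 --profiler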

TROUBLESHOOTING:

Problem: No module setuptools
Solution: pip3 install -U pip setuptools

Problem: error: can't find Rust compiler
Solution:
install rust and restart the console to update the paths
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
(https://www.rust-lang.org/tools/install)

module 'utils.cv.dataset' has no attribute 'OutOfInstances' - after adding transformers

After adding the transformers lib to the pipeline Docker image, Jenkins throws an error while building.

+ python3 benchmark_tf.py -m mobilenet_v2 -p fp32 -b 16 --timeout=15.0

The result is:

WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
WARNING:root:Limited tf.summary API due to missing TensorBoard installation.
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
Version: latest-1043-g773bbcaa0
git-773bbcaa0,jerrybdev,2021-09-10T14:12:30+02:00 built 20210917_141335 by  on  99043dd6832e
Invalid value passed to DLS_CPU_LEVEL. Possible values are 'avx2', 'avx512'. Got:  0
Attempt to register kernel  AvgPoolingMeta<FP16[8]>@NEON with priority clashes (priority-wise) with the following kernels:  AvgPoolingMeta<FLOAT[4]>@NEON AvgPoolingMeta<INT8[32]>@NEON 
Attempt to register kernel  MaxPoolingMeta<FP16[8]>@NEON with priority clashes (priority-wise) with the following kernels:  MaxPoolingMeta<FLOAT[4]>@NEON MaxPoolingMeta<INT8[16]>@NEON 
Unknown DLS variable: DLS_X86 = "0"
Unknown DLS variable: DLS_TARGET_ARCH = "altra"
Unknown DLS variable: DLS_ARM64 = "1"
AVX512_ENABLED: 0
DLS_PROCESS_MODE:  1
DLS_NUM_THREADS:  16
CPU_BIND:  1
MEM_BIND:  1
DLS_SUPERNODE 0
2021-09-17 15:17:54.290872: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
2021-09-17 15:17:54.294854: W tensorflow/core/platform/profile_utils/cpu_utils.cc:116] Failed to find bogomips or clock in /proc/cpuinfo; cannot determine CPU frequency
PlatformInfo(vendor_id=3, cpu_family=8, cpu_model=3340, isa=NEON, L1=CacheInfo(size=65536, inclusive=1, share_count=1), L2=CacheInfo(size=1048576, inclusive=0, share_count=1), L3=CacheInfo(size=33554432, inclusive=0, share_count=80))

Intraop parallelism set to 16 threads


Running with TensorFlow

Traceback (most recent call last):
  File "/space/jenkins/workspace/Nightly_Builds/tf2-long-run/model_zoo_dev/model_zoo/utils/cv/imagenet.py", line 74, in __get_path_to_img
    file_name = self.__file_names[self.__current_img]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "benchmark_tf.py", line 29, in <module>
    main()
  File "benchmark_tf.py", line 25, in main
    model.handler(model, args.batch_size, None, args.timeout)
  File "/space/jenkins/workspace/Nightly_Builds/tf2-long-run/model_zoo_dev/run_utils/imagenet.py", line 66, in imagenet_handler
    return run(model.run_func)
  File "/space/jenkins/workspace/Nightly_Builds/tf2-long-run/model_zoo_dev/run_utils/imagenet.py", line 55, in run
    return func(
  File "/space/jenkins/workspace/Nightly_Builds/tf2-long-run/model_zoo_dev/model_zoo/computer_vision/classification/mobilenet_v2/run.py", line 55, in run_tf_fp32
    return run_tf_fp(model_path, batch_size, num_of_runs, timeout, images_path, labels_path)
  File "/space/jenkins/workspace/Nightly_Builds/tf2-long-run/model_zoo_dev/model_zoo/computer_vision/classification/mobilenet_v2/run.py", line 51, in run_tf_fp
    return run_model(run_single_pass, runner, dataset, batch_size, num_of_runs, timeout)
  File "/space/jenkins/workspace/Nightly_Builds/tf2-long-run/model_zoo_dev/model_zoo/utils/benchmark.py", line 119, in run_model
    single_pass_func(runner, dataset)
  File "/space/jenkins/workspace/Nightly_Builds/tf2-long-run/model_zoo_dev/model_zoo/computer_vision/classification/mobilenet_v2/run.py", line 38, in run_single_pass
    tf_runner.set_input_tensor("input:0", imagenet.get_input_array(shape))
  File "/space/jenkins/workspace/Nightly_Builds/tf2-long-run/model_zoo_dev/model_zoo/utils/cv/imagenet.py", line 93, in get_input_array
    self.path_to_latest_image = self.__get_path_to_img()
  File "/space/jenkins/workspace/Nightly_Builds/tf2-long-run/model_zoo_dev/model_zoo/utils/cv/imagenet.py", line 76, in __get_path_to_img
    raise utils_ds.OutOfInstances("No more ImageNet images to process in the directory provided")
AttributeError: module 'utils.cv.dataset' has no attribute 'OutOfInstances'
