Foundation model benchmarking tool. Run any model on any AWS platform and benchmark for performance across instance type and serving stack options.

Home Page: https://aws-samples.github.io/foundation-model-benchmarking-tool/

License: MIT No Attribution

Jupyter Notebook 55.83% Python 42.31% Shell 1.60% Dockerfile 0.26%

benchmarking foundation-models inferentia llama2 p4d sagemaker generative-ai benchmark bedrock llama3

foundation-model-benchmarking-tool's Introduction

FMBench

Benchmark any Foundation Model (FM) on any AWS Generative AI service [Amazon SageMaker, Amazon Bedrock, Amazon EKS, Amazon EC2, or Bring your own endpoint.]

Amazon Bedrock | Amazon SageMaker | Amazon EKS | Amazon EC2

FMBench is a Python package for running performance benchmarks for any Foundation Model (FM) deployed on any AWS Generative AI service, be it Amazon SageMaker, Amazon Bedrock, Amazon EKS, or Amazon EC2. The FMs could be deployed on these platforms either directly through FMbench, or, if they are already deployed then also they could be benchmarked through the Bring your own endpoint mode supported by FMBench.

Here are some salient features of FMBench:

Highly flexible: in that it allows for using any combinations of instance types (g5, p4d, p5, Inf2), inference containers (DeepSpeed, TensorRT, HuggingFace TGI and others) and parameters such as tensor parallelism, rolling batch etc. as long as those are supported by the underlying platform.
Benchmark any model: it can be used to be benchmark open-source models, third party models, and proprietary models trained by enterprises on their own data.
Run anywhere: it can be run on any AWS platform where we can run Python, such as Amazon EC2, Amazon SageMaker, or even the AWS CloudShell. It is important to run this tool on an AWS platform so that internet round trip time does not get included in the end-to-end response time latency.

Use FMBench to benchmark an LLM on any AWS generative AI service for price and performance (inference latency, transactions/minute). Here is one of the plots generated by FMBench to help answer the price performance question for the Llama2-13b model when hosted on Amazon SageMaker (the instance types in the legend have been blurred out on purpose, you can find them in the actual plot generated on running FMBench).

Models benchmarked

Configuration files are available in the configs folder for the following models in this repo.

Llama3 on Amazon SageMaker

Llama3 is now available on SageMaker (read blog post), and you can now benchmark it using FMBench. Here are the config files for benchmarking Llama3-8b-instruct and Llama3-70b-instruct on ml.p4d.24xlarge, ml.inf2.24xlarge and ml.g5.12xlarge instances.

Config file for Llama3-8b-instruct on ml.p4d.24xlarge and ml.g5.12xlarge.
Config file for Llama3-70b-instruct on ml.p4d.24xlarge and ml.g5.48xlarge.
Config file for Llama3-8b-instruct on ml.inf2.24xlarge and ml.g5.12xlarge.

Full list of benchmarked models

Model	EC2 g5	EC2 Inf2/Trn1	SageMaker g4dn/g5/p3	SageMaker Inf2/Trn1	SageMaker P4	SageMaker P5	Bedrock On-demand throughput	Bedrock provisioned throughput
Anthropic Claude-3 Sonnet							✅	✅
Anthropic Claude-3 Haiku							✅
Mistral-7b-instruct			✅		✅	✅	✅
Mistral-7b-AWQ						✅
Mixtral-8x7b-instruct							✅
Llama3.1-8b instruct		✅		✅			✅
Llama3.1-70b instruct		✅		✅			✅
Llama3-8b instruct	✅	✅	✅	✅	✅	✅	✅
Llama3-70b instruct	✅		✅	✅	✅		✅
Llama2-13b chat			✅	✅	✅		✅
Llama2-70b chat			✅	✅	✅		✅
Amazon Titan text lite							✅
Amazon Titan text express							✅
Cohere Command text							✅
Cohere Command light text							✅
AI21 J2 Mid							✅
AI21 J2 Ultra							✅
Gemma-2b			✅
Phi-3-mini-4k-instruct			✅
distilbert-base-uncased			✅

New in this release

v1.0.52

Compile for AWS Chips (Trainium, Inferentia) and deploy to SageMaker directly through FMBench.
Llama3.1-8b and Llama3.1-70b config files for AWS Chips (Trainium, Inferentia).
Misc. bug fixes.

v1.0.51

FMBench has a website now. Rework the README file to make it lightweight.
Llama3.1 config files for Bedrock.

v1.0.50

Llama3-8b on Amazon EC2 inf2.48xlarge config file.
Update to new version of DJL LMI (0.28.0).

Release history

Getting started

FMBench is available as a Python package on PyPi and is run as a command line tool once it is installed. All data that includes metrics, reports and results are stored in an Amazon S3 bucket.

Important

💡 All documentation for FMBench is available on the FMBench website

You can run FMBench on either a SageMaker notebook or on an EC2 VM. Both options are described here as part of the documentation. You can even run FMBench as a Docker container A Quickstart guide for SageMaker is bring provided below as well.

👉 The following sections are discussing running FMBench the tool, as different from where the FM is actually deployed. For example, we could run FMBench on EC2 but the model being deployed is on SageMaker or even Bedrock.

Quickstart

FMBench on a SageMaker Notebook

Each FMBench run works with a configuration file that contains the information about the model, the deployment steps, and the tests to run. A typical FMBench workflow involves either directly using an already provided config file from the configs folder in the FMBench GitHub repo or editing an already provided config file as per your own requirements (say you want to try benchmarking on a different instance type, or a different inference container etc.).

👉 A simple config file with key parameters annotated is included in this repo, see config-llama2-7b-g5-quick.yml. This file benchmarks performance of Llama2-7b on an ml.g5.xlarge instance and an ml.g5.2xlarge instance. You can use this config file as it is for this Quickstart.
Launch the AWS CloudFormation template included in this repository using one of the buttons from the table below. The CloudFormation template creates the following resources within your AWS account: Amazon S3 buckets, Amazon IAM role and an Amazon SageMaker Notebook with this repository cloned. A read S3 bucket is created which contains all the files (configuration files, datasets) required to run FMBench and a write S3 bucket is created which will hold the metrics and reports generated by FMBench. The CloudFormation stack takes about 5-minutes to create.

AWS Region Link

us-east-1 (N. Virginia)

us-west-2 (Oregon)

us-gov-west-1 (GovCloud West)
Once the CloudFormation stack is created, navigate to SageMaker Notebooks and open the fmbench-notebook.

AWS Region	Link
us-east-1 (N. Virginia)
us-west-2 (Oregon)
us-gov-west-1 (GovCloud West)

On the fmbench-notebook open a Terminal and run the following commands.

conda create --name fmbench_python311 -y python=3.11 ipykernel
source activate fmbench_python311;
pip install -U fmbench

Now you are ready to fmbench with the following command line. We will use a sample config file placed in the S3 bucket by the CloudFormation stack for a quick first run.
1. We benchmark performance for the Llama2-7b model on a ml.g5.xlarge and a ml.g5.2xlarge instance type, using the huggingface-pytorch-tgi-inference inference container. This test would take about 30 minutes to complete and cost about $0.20.
2. It uses a simple relationship of 750 words equals 1000 tokens, to get a more accurate representation of token counts use the Llama2 tokenizer (instructions are provided in the next section). It is strongly recommended that for more accurate results on token throughput you use a tokenizer specific to the model you are testing rather than the default tokenizer. See instructions provided later in this document on how to use a custom tokenizer.
```
account=`aws sts get-caller-identity | jq .Account | tr -d '"'`
region=`aws configure get region`
fmbench --config-file s3://sagemaker-fmbench-read-${region}-${account}/configs/llama2/7b/config-llama2-7b-g5-quick.yml > fmbench.log 2>&1
```
3. Open another terminal window and do a tail -f on the fmbench.log file to see all the traces being generated at runtime.
```
tail -f fmbench.log
```
4. 👉 For streaming support on SageMaker and Bedrock checkout these config files:
  1. config-llama3-8b-g5-streaming.yml
  2. config-bedrock-llama3-streaming.yml
The generated reports and metrics are available in the sagemaker-fmbench-write-<replace_w_your_aws_region>-<replace_w_your_aws_account_id> bucket. The metrics and report files are also downloaded locally and in the results directory (created by FMBench) and the benchmarking report is available as a markdown file called report.md in the results directory. You can view the rendered Markdown report in the SageMaker notebook itself or download the metrics and report files to your machine for offline analysis.

If you would like to understand what is being done under the hood by the CloudFormation template, see the DIY version with gory details

`FMBench` on SageMaker in GovCloud

No special steps are required for running FMBench on GovCloud. The CloudFormation link for us-gov-west-1 has been provided in the section above.

Not all models available via Bedrock or other services may be available in GovCloud. The following commands show how to run FMBench to benchmark the Amazon Titan Text Express model in the GovCloud. See the Amazon Bedrock GovCloud page for more details.

account=`aws sts get-caller-identity | jq .Account | tr -d '"'`
region=`aws configure get region`
fmbench --config-file s3://sagemaker-fmbench-read-${region}-${account}/configs/bedrock/config-bedrock-titan-text-express.yml > fmbench.log 2>&1

Results

Depending upon the experiments in the config file, the FMBench run may take a few minutes to several hours. Once the run completes, you can find the report and metrics in the local results-* folder in the directory from where FMBench was run. The rpeort and metrics are also written to the write S3 bucket set in the config file.

Here is a screenshot of the report.md file generated by FMBench.

Benchmark models deployed on different AWS Generative AI services (Docs)

FMBench comes packaged with configuration files for benchmarking models on different AWS Generative AI services i.e. Bedrock, SageMaker, EKS and EC2 or bring your own endpoint even.

Enhancements

View the ISSUES on GitHub and add any you might think be an beneficial iteration to this benchmarking harness.

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

Star History

Support

Schedule Demo 👋 - send us an email 🙂
Community Discord 💭
Our emails ✉️ [email protected] / [email protected]

Contributors

foundation-model-benchmarking-tool's People

Contributors

Stargazers

Watchers

foundation-model-benchmarking-tool's Issues

The README gives an impression that this tool is comparing instance types when it should be serving stacks

Update the readme to clarify that it is the complete serving stack that we are comparing across different experiments. The serving stack includes everything from the instance type, the inference container and the backend configuration parameters.

FMBench to support an REST Predictor

To add support for an REST predictor to run inferences on a model deployed out of sagemaker/bedrock

Error: need to escape, but no escapechar set

Running with the default debug.sh configuration (but on Python 3.10), I'm seeing the below error:

[2024-05-31 03:26:59,308] p5078 {main.py:107} ERROR - Failed to execute 1_generate_data.ipynb: 
---------------------------------------------------------------------------
Exception encountered at "In [10]":
---------------------------------------------------------------------------
Error                                     Traceback (most recent call last)
Cell In[10], line 3
      1 # Convert DataFrame to a CSV format string
      2 csv_buffer = io.StringIO()
----> 3 df.to_csv(csv_buffer, index=False)
      4 csv_data = csv_buffer.getvalue()
      5 all_prompts_file = config['dir_paths']['all_prompts_file']

File ~/.cache/pypoetry/virtualenvs/fmbench-XBAYeWJo-py3.10/lib/python3.10/site-packages/pandas/core/generic.py:3902, in NDFrame.to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, lineterminator, chunksize, date_format, doublequote, escapechar, decimal, errors, storage_options)
   3891 df = self if isinstance(self, ABCDataFrame) else self.to_frame()
   3893 formatter = DataFrameFormatter(
   3894     frame=df,
   3895     header=header,
   (...)
   3899     decimal=decimal,
   3900 )
-> 3902 return DataFrameRenderer(formatter).to_csv(
   3903     path_or_buf,
   3904     lineterminator=lineterminator,
   3905     sep=sep,
   3906     encoding=encoding,
   3907     errors=errors,
   3908     compression=compression,
   3909     quoting=quoting,
   3910     columns=columns,
   3911     index_label=index_label,
   3912     mode=mode,
   3913     chunksize=chunksize,
   3914     quotechar=quotechar,
   3915     date_format=date_format,
   3916     doublequote=doublequote,
   3917     escapechar=escapechar,
   3918     storage_options=storage_options,
   3919 )

File ~/.cache/pypoetry/virtualenvs/fmbench-XBAYeWJo-py3.10/lib/python3.10/site-packages/pandas/io/formats/format.py:1152, in DataFrameRenderer.to_csv(self, path_or_buf, encoding, sep, columns, index_label, mode, compression, quoting, quotechar, lineterminator, chunksize, date_format, doublequote, escapechar, errors, storage_options)
   1131     created_buffer = False
   1133 csv_formatter = CSVFormatter(
   1134     path_or_buf=path_or_buf,
   1135     lineterminator=lineterminator,
   (...)
   1150     formatter=self.fmt,
   1151 )
-> 1152 csv_formatter.save()
   1154 if created_buffer:
   1155     assert isinstance(path_or_buf, StringIO)

File ~/.cache/pypoetry/virtualenvs/fmbench-XBAYeWJo-py3.10/lib/python3.10/site-packages/pandas/io/formats/csvs.py:266, in CSVFormatter.save(self)
    247 with get_handle(
    248     self.filepath_or_buffer,
    249     self.mode,
   (...)
    254 ) as handles:
    255     # Note: self.encoding is irrelevant here
    256     self.writer = csvlib.writer(
    257         handles.handle,
    258         lineterminator=self.lineterminator,
   (...)
    263         quotechar=self.quotechar,
    264     )
--> 266     self._save()

File ~/.cache/pypoetry/virtualenvs/fmbench-XBAYeWJo-py3.10/lib/python3.10/site-packages/pandas/io/formats/csvs.py:271, in CSVFormatter._save(self)
    269 if self._need_to_save_header:
    270     self._save_header()
--> 271 self._save_body()

File ~/.cache/pypoetry/virtualenvs/fmbench-XBAYeWJo-py3.10/lib/python3.10/site-packages/pandas/io/formats/csvs.py:309, in CSVFormatter._save_body(self)
    307 if start_i >= end_i:
    308     break
--> 309 self._save_chunk(start_i, end_i)

File ~/.cache/pypoetry/virtualenvs/fmbench-XBAYeWJo-py3.10/lib/python3.10/site-packages/pandas/io/formats/csvs.py:320, in CSVFormatter._save_chunk(self, start_i, end_i)
    317 data = [res.iget_values(i) for i in range(len(res.items))]
    319 ix = self.data_index[slicer]._format_native_types(**self._number_format)
--> 320 libwriters.write_csv_rows(
    321     data,
    322     ix,
    323     self.nlevels,
    324     self.cols,
    325     self.writer,
    326 )

File writers.pyx:72, in pandas._libs.writers.write_csv_rows()

Error: need to escape, but no escapechar set

I believe it can be resolved by setting e.g. df.to_csv(csv_buffer, index=False, escapechar="\\"), but seems weird that other people wouldn't have encountered this already? Presumably it's pretty normal for generated prompts to contain commas and/or quote marks... From the Pandas DataFrame.to_csv doc it seems like this escapechar option has existed since at least v1 and always defaulted to None, so unlikely to have been introduced by a dependency upgrade or similar :/

Add Autoscaling functionality to FMBench

To have a functionalit that enables FMBT to support sagemaker endpoint autoscaling via config files for experiment runs.

Unable to locate credential issue

I am trying to run FM bench on an EC2 instance. I am not using SageMaker. I am getting following error:

File ~/anaconda3/envs/fmbench_python311/lib/python3.11/site-packages/botocore/auth.py:418, in SigV4Auth.add_auth(self, request)
416 def add_auth(self, request):
417 if self.credentials is None:
--> 418 raise NoCredentialsError()
419 datetime_now = datetime.datetime.utcnow()
420 request.context['timestamp'] = datetime_now.strftime(SIGV4_TIMESTAMP)

NoCredentialsError: Unable to locate credentials

Do i need to use AWS key and access id somewhere ?

No such file or directory: '/tmp/fmbench-read/tokenizer'

I've been trying to run FMBench with self-contained setup as described on the README to test whether Python 3.10 can be supported as per #94

(But one caveat that I'm running on a SageMaker Notebook Instance rather than plain EC2)

Setting up conda & Poetry, then running ./copy_s3_content.sh and ./debug.sh, my debug script fails in notebook 0 with error:

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/fmbench-read/tokenizer'

Sure enough, if I ls /tmp/fmbench-read I see folders for configs, prompt_template, scripts, and source_data - but no tokenizers... What's the expected way to set up the local tokenizers folder, and could we get it automated within ./copy_s3_content.sh?

Need for Integration of FMBench Tool with External 3rd party Synthetic Data Generation Tools for Benchmarking usecase.

It would be good to have a native integration in the Config.Yaml file of FMBench for integrating with other 3rd party tools for pulling synthetically generated datasets.
How to split this synthetically generated datasets for FMBench based evaluation of FM's would be important functionality to have.
Final Results (results.md) generated by this tool could have some Visual comparison capability with other 3rd party tools which could be used for holistic evaluation of FM's.

Merge model accuracy metrics also into this tool

The fact that we are running inference means that we can also measure the accuracy of those inferences i.e. through rouge score, cosine similarity (to an expert generated response) or other metrics. If we add that then this tool can provide a complete bechmarking solution that includes accuracy as well as cost.

Support Python 3.10 for SageMaker Studio notebooks

Hi team,

Since FMBench today requires Python 3.11+, it can't be installed on the current SageMaker Studio JupyterLab (default Python kernel at SM Distribution v1.7 runs Python 3.10)

Are you really using newly launched features in Python 3.11? Would you be able to add 3.10 support so we can use the library in SageMaker Studio notebooks?

Thanks!

Add a AWS CloudFormation template and provide some canned datasets to get started

Try FMBench with instance count set to > 1 to see how scaling impacts latency and transactions per minute

It would be interesting to see the effect of scaling to multiple instances behind the same endpoint. How does inference latency change as endpoints start to scale (automatically, we could also add parameters for scaling policy), can we support the more transactions with auto-scaling instances while keeping the latency below a threshold and what are the cost implications of doing that. This needs to be fleshed out but this is an interesting area.

This would also need to include support for the Inference Configuration feature that is now available with SageMaker.

Add support for a custom token counter

Currently only the LLama tokenizer is supported but we want to allow users to bring their own token counting logic for different models. This way regardless of the model type, or token count methodology, the user should be able to get accurate results based on their token counter that they use.

Goal: Abstracting out the repo and tool to a point where no matter what token counter type the user uses, you can bring that and run the container to get accurate test results.

Support for custom datasets and custom inference scripts

For models other than Llama and Mistral (say BERT) we need datasets other than LongBench and these models have their own response format.

Add support for bring your own dataset by parameterizing the prompt template.
Add support for custom inference scripts.

[Highest priority] Add support for reading and writing files (configuration, metrics, bring your own model scripts) from Amazon S3.

This includes adding support for s3 readability and interaction, including all data and metrics being accessible via your personal s3 bucket. The goal of this issue is to abstract out the code on this repo in a way where you can bring your own script, your own model, source data files, upload it to s3, and then expect this repo to run, in order to generate these test results within your s3 bucket that you define. Aim for this is to have a folder in a bucket where you upload your source data files, a folder where you upload your bring your own model script, prompt template, and other model artifacts as need be, and then run this repo to generate test results within the 'data' folder that is programmatically generated containing information on metrics, per chunk and inference results, deployed model configurations, and more.

Add support for different payload formats for bring your own datasets for that might be needed for different inference containers

This tool currently supports the HF TGI container, and DJL Deep Speed container on SageMaker and both use the same format but in future other containers might need a different payload format.

Goal: To give user full flexibility to bring their payloads or contain code that generalizes payload generation irrespective of the container type that the user uses. Two options for solution to this issue here:

1/ Have the user bring in their own payload
2/ Have a generic function defined to convert the payload in support for the container type the user is using to deploy their model and generate inference from.

Containerize FMBench and provide instructions for running the container on EC2

Containerize FMBT and provide instructions for running the container on EC2 - Once all of the files are integrated via S3 and all the code is abstracted out in terms of generating metrics for any deployable model on SageMaker (including bring your own model/scripts), we want to be able to containerize this and run the container on EC2.

Goal: To choose the specific config file and prompt to run the container on it, generating results without any heavy lifting of any development efforts.

Is it possible to run FMBench on an EC2 instance on endpoints currently not supported on Sagemaker.

For example i want to run an endpoint which is running on intel/amd/nvidia instances which is generally available. ie, instances which are not available on Sagemaker but available on AWS.

Code cleanup needed to replace the notebooks with regular python files

The work for this repo started as a skunkworks project done over the holidays in the Winter of 2024 and so at the time this was just a bunch of notebooks but now that it has transformed into a formal open-source project with a ton of functionality and a bunch of roadmap items, the notebooks have turned unwieldy!

We need to replace the notebooks with regular python scripts and plus there is a whole bunch of code cleanup that needs to happen to replace global imports, use type hints everywhere, optimizations of functions, etc., the list is long.

Assigning this to myself for now, and would create issues for specific items.

Add documentation for the config file on readthedocs.io

The config file has lots of parameters and while some of them are explained via inline comments, we really need a better way for documenting this.

Loosen dependencies for more flexible installation

As discussed here on StackOverflow:

Applications should generally lock dependencies to exact versions, for reliable deployment
Libraries should generally support broad dependency version ranges where practical, to accommodate installing them on a range of environments and using them in a range of downstream applications

I suggest that poetry.lock accomplishes (1) for users wishing to download fmbench from source, but users installing fmbench from PyPI fall in the camp of (2), and would like fmbench to play nicely with whatever other dependencies might be in their environment.

For stable libraries that follow semver, it seems like we should be able to trust caret requirements? For e.g. specifically I would think something like the below (which I haven't fully tested):

ipywidgets 8.1.1 -> ^8.0.0 (unless we care about specific bug fixes they released?)
transformers 4.36.2 -> ^4.36.2 (Idk which potentially new models you're consuming that might prevent downgrade)
pandas 2.1.4 -> ^2.0.0(even this is only like a year old?)
datasets 2.16.1 -> ^2.14.0 (2.14.0 has an important caching change)
sagemaker 2.220.0 -> ^2.119.0 (Current SMStudio / SageMaker Distribution v1.8 version)
litellm 1.35.8 -> ^1.35.8 (idk how far back we could push this?)
plotly 5.22.0 -> ^5.15.0 (before which there were compatibility issues with Pandas 2.0)

For unstable libraries (seaborn, tomark, kaleido), maybe we could at least use tilde requirements to allow patch versions?

Add support for a default tokenizer if no tokenizer is specified

To add support for bring your own 'SageMaker' pre existing endpoint

We want to add and test functionality for an already deployed SageMaker endpoint and test it with FMBench without having the need to re deploy the endpoint/create a new endpoint

Add support for models hosted on platforms other than SageMaker

While this tool was never thought of as testing models hosted on anything other than SageMaker, but technically there is nothing preventing this. There are two things that need to happen for this.

1/ The models have to be deployed on the platform of choice. This part can be externalized, meaning the deployment code in this repo does not deploy those models, they are deployed separately, outside of this tool.

2/ Have support for bring your own inference script which knows how to query your endpoint. This inference script then runs inferences against the endpoints on platforms other than SageMaker. And so at this point it does not matter if the endpoint is on EKS or EC2.

Add retry handling for throttling errors during inference

Remove the config_filepath.txt as a config reference for developer workflow

To remove the config_filepath.txt as a config reference for developer workflow since we have to change it every time in addition to the debug.sh file. Either to add clearer instructions for this or have just one source for developer workflow config reference (such as the debug.sh file)

Compare different models for the same dataset

There is nothing in FMBench which prevents different experiments in the same config file to use different models, but the generated report is not architected in the same way i.e. it is not created to compare different models but rather compare the same model across serving stacks, so that would need to change. This has been requested by multiple customers, the idea being if we find different models that are fit for task, we now want to find the model and serving stack combination which provides the best price:performance.

UI Support for FMBench

To Do:

1/ Add a user friendly interface for users to seamlessly benchmark/evaluate models
2/ Website/Interface would make it easier for users to benchmark/track results from various runs

ETA: To be decided

Add code to determine the cost of running an entire experiment and include it in the final report

Add code to determine the cost of running an entire experiment and include it in the final report. This would only include the cost of running the SageMaker endpoints based on hourly public pricing (the cost of running this code on a notebook or a EC2 is trivial in comparison and can be ignored).

Running this entire benchmarking test, we can add a couple of lines to calculate the total cost being used to run the specific experiment from an end to end perspective to answer simple questions like:

I am running the experiment for this config file and got the benchmarking results successfully in 'x' time. What is the cost that will be incurred to run this experiment?

Emit live metrics

Emit live metrics so that they can be monitored through Grafana via live dashboard. More information to come on this issue but the goal here is to provide full flexibility to the user to be able to view metrics in ways that best suits the needs of their business and technological goals.

[TBD] --> Some sort of an analytics pipeline sending and emitting live results for different model configurations, their results and different metrics based on the needs of the user.

TypeError: unsupported format string passed to NoneType.format

I am getting this error while trying to run fmbench on EC2 instance. Please see the command line below:

fmbench --config-file /tmp/fmbench-read/configs/byoe/config-model-byo-sagemaker-endpoint.yml --local-mode yes --write-bucket placeholder > fmbench.log 2>&1

I had added my instance type as m7a.32xlarge and edited pricing.yml is edited with m7a.32 xlarge pricing but even then I am getting this error below.

[2024-07-24 19:42:03,100] p61605 {clientwrap.py:98} WARNING - INFO:main:the cost for running bring-your-own-sm-endpoint running on m7a.32xlarge for 0.4006618220009841s is $None

[2024-07-24 19:42:03,225] p61605 {clientwrap.py:91} INFO - get_inference, uuid=3429a58e3a4843d6880caff2dbd1994b, error occurred with my-own-endpoint, exception='NoneType' object has no attribute 'get'
get_inference, uuid=14188c702d3f4132816efcf23a7eb1bf, error occurred with my-own-endpoint, exception='NoneType' object has no attribute 'get'
get_inference, uuid=7fbfdc7dd51b497dad014a724eb609dd, error occurred with my-own-endpoint, exception='NoneType' object has no attribute 'get'
get_inference, uuid=033060cba8d14e9f8d91261e8bd98f21, error occurred with my-own-endpoint, exception='NoneType' object has no attribute 'get'
get_inference, uuid=2ff98b5c93304932b96c2359288751f3, error occurred with my-own-endpoint, exception='NoneType' object has no attribute 'get'
get_inference, uuid=7c79f122066c4fd78913355b655da053, error occurred with my-own-endpoint, exception='NoneType' object has no attribute 'get'
get_inference, uuid=75f5b1ec23964a7798db730438b51417, error occurred with my-own-endpoint, exception='NoneType' object has no attribute 'get'
get_inference, uuid=53a1e2d4fdf94f52b3412963e1e56979, error occurred with my-own-endpoint, exception='NoneType' object has no attribute 'get'
get_inference, uuid=826a61034a234f40bcbae9191c37f974, error occurred with my-own-endpoint, exception='NoneType' object has no attribute 'get'
get_inference, uuid=d8bd5ed8e129413a8a456f43f4551858, error occurred with my-own-endpoint, exception='NoneType' object has no attribute 'get'
get_inference, uuid=685d348a24ac4b6598e78870f4013640, error occurred with my-own-endpoint, exception='NoneType' object has no attribute 'get'
get_inference, uuid=a40973ff82374f2bb829c96fb92ae1c2, error occurred with my-own-endpoint, exception='NoneType' object has no attribute 'get'
get_inference, uuid=239f2fc046894045a2d1da2fec401fc7, error occurred with my-own-endpoint, exception='NoneType' object has no attribute 'get'
get_inference, uuid=696541e7dd614469b2f8163a327c65bc, error occurred with my-own-endpoint, exception='NoneType' object has no attribute 'get'
get_inference, uuid=12aed3eba7a34bd8a5d60b1cbd30e0e7, error occurred with my-own-endpoint, exception='NoneType' object has no attribute 'get'
get_inference, uuid=fda193122ef142e8be7b8ac255e05398, error occurred with my-own-endpoint, exception='NoneType' object has no attribute 'get'
get_inference, uuid=4a8f6537899e48d4ae76fd8cd476c6a8, error occurred with my-own-endpoint, exception='NoneType' object has no attribute 'get'

[2024-07-24 19:42:03,607] p61605 {engines.py:261} INFO - Ending Cell 25-----------------------------------------
Executing: 64%|██████▍ | 25/39 [00:05<00:03, 4.22cell/s]
[2024-07-24 19:42:04,767] p61605 {main.py:107} ERROR - Failed to execute 3_run_inference.ipynb:

Exception encountered at "In [16]":

TypeError Traceback (most recent call last)
Cell In[16], line 134
120 exp_cost = predictor.calculate_cost(exp_instance_type,
121 experiment.get('instance_count'),
122 pricing_config,
123 experiment_duration,
124 prompt_tokens_total,
125 completion_tokens_total)
126 logger.info(f"the cost for running {experiment['name']} running on "
127 f"{exp_instance_type} for {experiment_duration}s is ${exp_cost}")
129 experiment_durations.append({
130 'experiment_name': experiment['name'],
131 'instance_type': exp_instance_type,
132 'instance_count': experiment.get('instance_count'),
133 'duration_in_seconds': f"{experiment_duration:.2f}",
--> 134 'cost': f"{exp_cost:.6f}",
135 })
137 logger.info(f"experiment={e_idx+1}/{num_experiments}, name={experiment['name']}, "
138 f"duration={experiment_duration:.6f} seconds, exp_cost={exp_cost:.6f}, done")

TypeError: unsupported format string passed to NoneType.format

FMbench support for different AWS Region

Would be good to have a CFT template that works in different AWS Region. Would be good to provide support in us-west-2

Reports only show information per instance instead of per endpoint

In this configuration, I wanted to compare the performance of two different hosting images (LMI and TGI) on the same endpoint. (I added the g5 just to get past the statistics issue). So the config includes 2 inf2 endpoints and on g5 endpoint.

However, the graph only shows the metrics for the instance (presumably the best run) instead of a different circle for each endpoint.

We might also do the same thing if we were testing out different configurations, like batch sizes on inf2.

Inf2-LMI-versions.yml.txt

The business summary plot in the report needs to have a caption for disclaimer

Visualizations are powerful, so the message that it is the full serving stack that results in a particular performance benchmark can get lost unless explicitly called out. It is possible that someone could take the impression that a given instance type always performance better without considering that it is the instance type+inference container +parameters and so the results should not be taken out of context.

Provide config file for FLAN-T5 out of the box

FLAN-T5 XL is still used by multiple customers so a comparison of this model across g5.xlarge and g5.2xlarge instances would be very useful and so a config file for this should be provided.

KeyError: "Column(s) ['GPUMemoryUtilization', 'GPUUtilization'] do not exist" for non-GPU instances

I tried to benchmark two inf2 instances against teach other.

I got the error KeyError: "Column(s) ['GPUMemoryUtilization', 'GPUUtilization'] do not exist"

error excerpt and config file attached.

However, it does work for configs/llama2/7b/config-llama2-7b-inf2-g5.yml

I think this is because the g5 instance does have those values, so the GPU columns exist in the dataframe.
error log excerpt.txt
Inf2-TGIvLMI.txt

Per notebook run support via py package repo

The need of this issue is to have granular access to each notebook while having the repo being a package that can be run via pip. This is for advanced users in the space who want to change the code for different metrics and modifications so they can run each notebook one by one along with having the option to pip install the fmbt package.

Add Time to first token (TTFT), time per output token (TPOT) and time to last token (TTLT)

To add time to first token (TTFT), time per output token (TPOT) and time to last token (TTLT). This would be done using the streaming API that is now supported for SageMaker and Bedrock, and maybe EC2 as well over time.

config_filepath is incorrect

src/fmbench/config_filepath.txt
and
manifest.txt
Both show the config files being located in the config directory, but they are now split up under subdirectories.

This causes
from fmbench.utils import *
in
src/fmbench/0_setup.ipynb

to fail with:

config file current -> configs/config-bedrock-claude.yml, None
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[9], line 10
      8 import fmbench.scripts
      9 from pathlib import Path
---> 10 from fmbench.utils import *
     11 from fmbench.globals import *
     12 from typing import Dict, List, Optional

File ~/.fmbench/lib/python3.11/site-packages/fmbench/utils.py:11
      9 import unicodedata
     10 from pathlib import Path
---> 11 from fmbench import globals
     12 from fmbench import defaults
     13 from typing import Dict, List

File ~/.fmbench/lib/python3.11/site-packages/fmbench/globals.py:53
     51     CONFIG_FILE_CONTENT = response.text
     52 else:
---> 53     CONFIG_FILE_CONTENT = Path(CONFIG_FILE).read_text()
     55 # check if the file is still parameterized and if so replace the parameters with actual values
     56 # if the file is not parameterized then the following statements change nothing
     57 args = dict(region=session.region_name,
     58             role_arn=arn_string,
     59             write_bucket=f"{defaults.DEFAULT_BUCKET_WRITE}-{region_name}-{account_id}",
     60             read_bucket=f"{defaults.DEFAULT_BUCKET_READ}-{region_name}-{account_id}")

File /usr/lib/python3.11/pathlib.py:1058, in Path.read_text(self, encoding, errors)
   1054 """
   1055 Open the file in text mode, read it, and close the file.
   1056 """
   1057 encoding = io.text_encoding(encoding)
-> 1058 with self.open(mode='r', encoding=encoding, errors=errors) as f:
   1059     return f.read()

File /usr/lib/python3.11/pathlib.py:1044, in Path.open(self, mode, buffering, encoding, errors, newline)
   1042 if "b" not in mode:
   1043     encoding = io.text_encoding(encoding)
-> 1044 return io.open(self, mode, buffering, encoding, errors, newline)

FileNotFoundError: [Errno 2] No such file or directory: 'configs/config-bedrock-claude.yml'

Allow different experiments in the same config file to use different regions

The tool can easily allow to compare performance across regions if we allowed region to be a part of the per experiment config.

FMBench needs to be configurable so that no S3 access is needed

FMBench needs to be configurable so that no S3 access is needed.

To integrate the Open-Orca HF dataset to FMBench

To add another dataset support out of the box using the OpenOrca HF dataset. Link: https://huggingface.co/datasets/Open-Orca/OpenOrca

Add evaluation support to FMBench

To do:

Provide evaluation support to users via FMBench. This includes the following:
1. Subjective Evaluation on different tasks (using LLM as a judge), provide evaluation ratings
2. Quantitative Evaluation using Cosine Similarity Scores
Compare the performance of models based on evaluation criteria: evaluation ratings/overall cosine similarity/etc

This is a WIP issue --> Changes might be made overtime

Assign cost per run for FMBT

To calculate the cost per config file run for this FMBT harness. This includes model instance type, inference, cost per transactions and so on to sum up the entire run's total cost.

Convert 'FMBT' into a python packaged version

Create a Python Package out of the 'FMBT' tool to be able to use it as a PyPi Library.

Add Bedrock benchmarking to this tool

Can we add Bedrock to this tool? While support for bring your own inference script would do that but we need to think through Bedrock specific options such as provisioned throughput, auto-generated reported formats, do we want to compare Bedrock and SageMaker side by side.

FMBench to support benchmarking for embedding models

This issue represents the need to benchmark for embedding models that are hosted via sagemaker, bedrock and bring your own custom embedding models.

aws-samples / foundation-model-benchmarking-tool Goto Github PK

foundation-model-benchmarking-tool's Introduction

FMBench

Amazon Bedrock | Amazon SageMaker | Amazon EKS | Amazon EC2

Models benchmarked

Llama3 on Amazon SageMaker

Full list of benchmarked models

New in this release

v1.0.52

v1.0.51

v1.0.50

Getting started

Quickstart

FMBench on SageMaker in GovCloud

Results

Benchmark models deployed on different AWS Generative AI services (Docs)

Enhancements

Security

License

Star History

Support

Contributors

foundation-model-benchmarking-tool's People

Contributors

Stargazers

Watchers

Forkers

foundation-model-benchmarking-tool's Issues

[2024-07-24 19:42:03,607] p61605 {engines.py:261} INFO - Ending Cell 25----------------------------------------- Executing: 64%|██████▍ | 25/39 [00:05<00:03, 4.22cell/s] [2024-07-24 19:42:04,767] p61605 {main.py:107} ERROR - Failed to execute 3_run_inference.ipynb:

Exception encountered at "In [16]":

Recommend Projects

Recommend Topics

Recommend Org

`FMBench` on SageMaker in GovCloud

[2024-07-24 19:42:03,607] p61605 {engines.py:261} INFO - Ending Cell 25-----------------------------------------
Executing: 64%|██████▍ | 25/39 [00:05<00:03, 4.22cell/s]
[2024-07-24 19:42:04,767] p61605 {main.py:107} ERROR - Failed to execute 3_run_inference.ipynb: