
deepspeed-mii's Introduction

License: Apache 2.0


DeepSpeed Model Implementations for Inference (MII)

Introducing MII, an open-source Python library designed by DeepSpeed to democratize powerful model inference with a focus on high-throughput, low latency, and cost-effectiveness.

  • MII features include blocked KV-caching, continuous batching, Dynamic SplitFuse, tensor parallelism, and high-performance CUDA kernels to support fast, high-throughput text generation for LLMs such as Llama-2-70B, Mixtral (MoE) 8x7B, and Phi-2. The latest updates in v0.2 add new model families, performance optimizations, and feature enhancements. MII now delivers up to 2.5 times higher effective throughput compared to leading systems such as vLLM. For detailed performance results, please see our latest DeepSpeed-FastGen blog and DeepSpeed-FastGen release blog.

Key Technologies

MII for High-Throughput Text Generation

MII provides accelerated text-generation inference through the use of four key technologies:

  • Blocked KV Caching
  • Continuous Batching
  • Dynamic SplitFuse
  • High Performance CUDA Kernels

For a deeper dive into these features, please refer to our blog, which also includes a detailed performance analysis.

MII Legacy

In the past, MII introduced several key performance optimizations for low-latency serving scenarios:

  • DeepFusion for Transformers
  • Multi-GPU Inference with Tensor-Slicing
  • ZeRO-Inference for Resource Constrained Systems
  • Compiler Optimizations

How does MII work?

Figure 1: MII architecture, showing how MII automatically optimizes OSS models using DS-Inference before deploying them. DeepSpeed-FastGen optimizations in the figure have been published in our blog post.

Under the hood, MII is powered by DeepSpeed-Inference. Based on the model architecture, model size, batch size, and available hardware resources, MII automatically applies the appropriate set of system optimizations to minimize latency and maximize throughput.

Supported Models

MII currently supports over 20,000 models across eight popular model architectures. We plan to add more models in the near term. If there are specific model architectures you would like supported, please file an issue and let us know. All current models leverage Hugging Face in our backend to provide both the model weights and the model's corresponding tokenizer. For our current release we support the following model architectures:

model family     size range     ~model count
falcon           7B - 180B      300
llama            7B - 65B       19,000
llama-2          7B - 70B       900
mistral          7B             6,000
mixtral (MoE)    8x7B           1,100
opt              0.1B - 66B     1,300
phi-2            2.7B           200
qwen             7B - 72B       200

MII Legacy Model Support

MII Legacy APIs support over 50,000 different models including BERT, RoBERTa, Stable Diffusion, and other text-generation models like Bloom, GPT-J, etc. For a full list please see our legacy supported models table.

Getting Started with MII

DeepSpeed-MII allows users to create non-persistent and persistent deployments for supported models in just a few lines of code.


The fastest way to get started is with our PyPI release of DeepSpeed-MII, which means you can be up and running within minutes via:

pip install deepspeed-mii

For ease of use, and to avoid the lengthy compile times that many projects in this space require, we distribute a pre-compiled Python wheel covering the majority of our custom kernels through a new library called DeepSpeed-Kernels. We have found this library to be very portable across environments with NVIDIA GPUs with compute capabilities 8.0+ (Ampere+), CUDA 11.6+, and Ubuntu 20+. In most cases you shouldn't even need to know this library exists, as it is a dependency of DeepSpeed-MII and will be installed with it. However, if for whatever reason you need to compile our kernels manually, please see our advanced installation docs.

Non-Persistent Pipeline

A non-persistent pipeline is a great way to try DeepSpeed-MII. Non-persistent pipelines only exist for the duration of the Python script you are running. The full example for running a non-persistent pipeline deployment is only 4 lines. Give it a try!

import mii
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
response = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
print(response)

The returned response is a list of Response objects. We can access several details about the generation (e.g., response[0].prompt_length):

  • generated_text: str Text generated by the model.
  • prompt_length: int Number of tokens in the original prompt.
  • generated_length: int Number of tokens generated.
  • finish_reason: str Reason for stopping generation. stop indicates the EOS token was generated and length indicates the generation reached max_new_tokens or max_length.
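To make the fields above concrete, here is a minimal sketch that summarizes a batch of responses. The FakeResponse dataclass is a hypothetical stand-in that only mirrors the four documented attributes; it is not MII's own class, and summarize is an illustrative helper rather than part of the library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeResponse:
    """Hypothetical stand-in mirroring the four documented Response fields."""
    generated_text: str
    prompt_length: int
    generated_length: int
    finish_reason: str  # "stop" or "length", as described in the list above


def summarize(responses: List[FakeResponse]) -> dict:
    """Aggregate token counts and flag generations cut off by a length limit."""
    return {
        "total_prompt_tokens": sum(r.prompt_length for r in responses),
        "total_generated_tokens": sum(r.generated_length for r in responses),
        "truncated": [i for i, r in enumerate(responses) if r.finish_reason == "length"],
    }


# Made-up values for illustration only; real objects come from pipe(...).
responses = [
    FakeResponse("a library for ...", prompt_length=3, generated_length=128, finish_reason="length"),
    FakeResponse("a city in Washington.", prompt_length=3, generated_length=6, finish_reason="stop"),
]
summary = summarize(responses)
print(summary)
```

Since the real Response objects expose the same attribute names, the same helper should work on them unchanged, although that has not been verified here.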

If you want to free device memory and destroy the pipeline, use the destroy method:

pipe.destroy()

Tensor parallelism

Taking advantage of multi-GPU systems for greater performance is easy with MII. When run with the deepspeed launcher, tensor parallelism is automatically controlled by the --num_gpus flag:

# Run on a single GPU
deepspeed --num_gpus 1

# Run on multiple GPUs
deepspeed --num_gpus 2

Pipeline Options

While only the model name or path is required to stand up a non-persistent pipeline deployment, we offer customization options to our users:

mii.pipeline() Options:

  • model_name_or_path: str Name or local path to a HuggingFace model.
  • max_length: int Sets the default maximum token length for the prompt + response.
  • all_rank_output: bool When enabled, all ranks return the generated text. By default, only rank 0 will return text.

Users can also control the generation characteristics for individual prompts (i.e., when calling pipe()) with the following options:

  • max_length: int Sets the per-prompt maximum token length for prompt + response.
  • min_new_tokens: int Sets the minimum number of tokens generated in the response. max_length will take precedence over this setting.
  • max_new_tokens: int Sets the maximum number of tokens generated in the response.
  • ignore_eos: bool (Defaults to False) Setting to True prevents generation from ending when the EOS token is encountered.
  • top_p: float (Defaults to 0.9) When set below 1.0, filter tokens and keep only the most probable, where token probabilities sum to ≥top_p.
  • top_k: int (Defaults to None) When None, top-k filtering is disabled. When set, the number of highest probability tokens to keep.
  • temperature: float (Defaults to None) When None, temperature is disabled. When set, modulates token probabilities.
  • do_sample: bool (Defaults to True) When True, sample output logits. When False, use greedy sampling.
  • return_full_text: bool (Defaults to False) When True, prepends the input prompt to the returned text.
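The top_k and top_p options above describe standard filtering steps, which can be sketched in plain Python. This is an illustration of the general technique, not MII's implementation: the function name filter_tokens, the order of the two filters, and the toy probabilities are all assumptions.

```python
from typing import Dict, Optional


def filter_tokens(
    probs: Dict[str, float],
    top_k: Optional[int] = None,
    top_p: float = 1.0,
) -> Dict[str, float]:
    """Apply top-k, then top-p filtering, and renormalize the surviving tokens."""
    # Rank tokens from most to least probable.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        # Keep only the k highest-probability tokens.
        ranked = ranked[:top_k]
    if top_p < 1.0:
        # Keep the smallest prefix whose cumulative probability reaches top_p.
        kept, cumulative = [], 0.0
        for token, p in ranked:
            kept.append((token, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}


# Toy next-token distribution (made up for illustration).
probs = {"the": 0.5, "a": 0.3, "an": 0.15, "of": 0.05}
nucleus = filter_tokens(probs, top_p=0.9)  # drops "of"
top2 = filter_tokens(probs, top_k=2)       # keeps "the" and "a"
print(nucleus, top2)
```

Sampling would then draw from the renormalized distribution, while do_sample=False would simply pick the most probable token.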

Persistent Deployment

A persistent deployment is ideal for use with long-running and production applications. The persistent model uses a lightweight gRPC server that can be queried by multiple clients at once. The full example for running a persistent model is only 5 lines. Give it a try!

import mii
client = mii.serve("mistralai/Mistral-7B-v0.1")
response = client.generate(["Deepspeed is", "Seattle is"], max_new_tokens=128)

The returned response is a list of Response objects. We can access several details about the generation (e.g., response[0].prompt_length):

  • generated_text: str Text generated by the model.
  • prompt_length: int Number of tokens in the original prompt.
  • generated_length: int Number of tokens generated.
  • finish_reason: str Reason for stopping generation. stop indicates the EOS token was generated and length indicates the generation reached max_new_tokens or max_length.

If we want to generate text from other processes, we can do that too:

client = mii.client("mistralai/Mistral-7B-v0.1")
response = client.generate("Deepspeed is", max_new_tokens=128)

When we no longer need a persistent deployment, we can shut down the server from any client:


Model Parallelism

Taking advantage of multi-GPU systems for better latency and throughput is also easy with the persistent deployments. Model parallelism is controlled by the tensor_parallel input to mii.serve:

client = mii.serve("mistralai/Mistral-7B-v0.1", tensor_parallel=2)

The resulting deployment will split the model across 2 GPUs to deliver faster inference and higher throughput than a single GPU.

Model Replicas

We can also use multi-GPU (and multi-node) systems by setting up multiple model replicas and relying on the load balancing that DeepSpeed-MII provides:

client = mii.serve("mistralai/Mistral-7B-v0.1", replica_num=2)

The resulting deployment will load 2 model replicas (one per GPU) and load-balance incoming requests between the 2 model instances.

Model parallelism and replicas can also be combined to take advantage of systems with many more GPUs. In the example below, we run 2 model replicas, each split across 2 GPUs on a system with 4 GPUs:

client = mii.serve("mistralai/Mistral-7B-v0.1", tensor_parallel=2, replica_num=2)

The choice between model parallelism and model replicas for maximum performance will depend on the nature of the hardware, model, and workload. For example, with small models users may find that model replicas provide the lowest average latency for requests. Meanwhile, large models may achieve greater overall throughput when using only model parallelism.
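As a quick sanity check before launching, the GPU count implied by the two settings can be computed up front. This is a small sketch based on the examples above (each replica is split across tensor_parallel GPUs); the helper names gpus_required and fits are hypothetical and not part of MII.

```python
def gpus_required(tensor_parallel: int = 1, replica_num: int = 1) -> int:
    """GPUs consumed when each of replica_num replicas spans tensor_parallel GPUs."""
    if tensor_parallel < 1 or replica_num < 1:
        raise ValueError("tensor_parallel and replica_num must be >= 1")
    return tensor_parallel * replica_num


def fits(available_gpus: int, tensor_parallel: int = 1, replica_num: int = 1) -> bool:
    """Return True when the requested layout fits on the available GPUs."""
    return gpus_required(tensor_parallel, replica_num) <= available_gpus


# The combined example above: 2 replicas, each split across 2 GPUs.
print(gpus_required(tensor_parallel=2, replica_num=2))  # 4
print(fits(available_gpus=4, tensor_parallel=2, replica_num=2))
```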


RESTful API

MII makes it easy to set up and run model inference via RESTful APIs by setting enable_restful_api=True when creating a persistent MII deployment. The RESTful API can receive requests at http://{HOST}:{RESTFUL_API_PORT}/mii/{DEPLOYMENT_NAME}. A full example is provided below:

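client = mii.serve("mistralai/Mistral-7B-v0.1", deployment_name="mistral-deployment",
                   enable_restful_api=True, restful_api_port=28080)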

Note: While providing a deployment_name is not necessary (MII will autogenerate one for you), it is good practice to provide a deployment_name so that you can ensure you are interfacing with the correct RESTful API.

You can then send prompts to the RESTful gateway with any HTTP client, such as curl:

curl --header "Content-Type: application/json" --request POST  -d '{"prompts": ["DeepSpeed is", "Seattle is"], "max_length": 128}' http://localhost:28080/mii/mistral-deployment

or python:

import json
import requests
url = "http://localhost:28080/mii/mistral-deployment"
params = {"prompts": ["DeepSpeed is", "Seattle is"], "max_length": 128}
json_params = json.dumps(params)
# POST the JSON payload to the RESTful gateway
output = requests.post(
    url, data=json_params, headers={"Content-Type": "application/json"}
)
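If the third-party requests package is not available, the same POST request can be assembled with the standard library alone. The sketch below only builds the request object; build_request is a hypothetical helper, and the URL layout and payload keys are taken from the curl example above.

```python
import json
from urllib import request


def build_request(host: str, port: int, deployment_name: str, prompts, **generate_kwargs) -> request.Request:
    """Build a POST request for the gateway URL layout described above."""
    url = f"http://{host}:{port}/mii/{deployment_name}"
    body = json.dumps({"prompts": prompts, **generate_kwargs}).encode("utf-8")
    return request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_request(
    "localhost", 28080, "mistral-deployment",
    ["DeepSpeed is", "Seattle is"], max_length=128,
)
print(req.full_url)
# Sending it needs a running deployment, e.g.:
#   with request.urlopen(req) as resp:
#       print(resp.read().decode("utf-8"))
```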

Persistent Deployment Options

While only the model name or path is required to stand up a persistent deployment, we offer customization options to our users.

mii.serve() Options:

  • model_name_or_path: str (Required) Name or local path to a HuggingFace model.
  • max_length: int (Defaults to maximum sequence length in model config) Sets the default maximum token length for the prompt + response.
  • deployment_name: str (Defaults to f"{model_name_or_path}-mii-deployment") A unique identifying string for the persistent model. If provided, client objects should be retrieved with client = mii.client(deployment_name).
  • tensor_parallel: int (Defaults to 1) Number of GPUs to split the model across.
  • replica_num: int (Defaults to 1) The number of model replicas to stand up.
  • enable_restful_api: bool (Defaults to False) When enabled, a RESTful API gateway process is launched that can be queried at http://{host}:{restful_api_port}/mii/{deployment_name}. See the section on RESTful APIs for more details.
  • restful_api_port: int (Defaults to 28080) The port number used to interface with the RESTful API when enable_restful_api is set to True.

mii.client() Options:

  • model_or_deployment_name: str Name of the model or deployment_name passed to mii.serve()

Users can also control the generation characteristics for individual prompts (i.e., when calling client.generate()) with the following options:

  • max_length: int Sets the per-prompt maximum token length for prompt + response.
  • min_new_tokens: int Sets the minimum number of tokens generated in the response. max_length will take precedence over this setting.
  • max_new_tokens: int Sets the maximum number of tokens generated in the response.
  • ignore_eos: bool (Defaults to False) Setting to True prevents generation from ending when the EOS token is encountered.
  • top_p: float (Defaults to 0.9) When set below 1.0, filter tokens and keep only the most probable, where token probabilities sum to ≥top_p.
  • top_k: int (Defaults to None) When None, top-k filtering is disabled. When set, the number of highest probability tokens to keep.
  • temperature: float (Defaults to None) When None, temperature is disabled. When set, modulates token probabilities.
  • do_sample: bool (Defaults to True) When True, sample output logits. When False, use greedy sampling.
  • return_full_text: bool (Defaults to False) When True, prepends the input prompt to the returned text.
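One way to read how the length options and ignore_eos interact is the toy stopping loop below. It is an interpretation of the documented behavior, not MII's code: the EOS placeholder, the function generate_until_stop, and the choice to skip a suppressed EOS token are all assumptions made for illustration.

```python
from typing import Iterable, List, Optional, Tuple

EOS = "<eos>"  # placeholder end-of-sequence marker for this sketch


def generate_until_stop(
    token_stream: Iterable[str],
    prompt_length: int,
    max_length: int,
    max_new_tokens: Optional[int] = None,
    min_new_tokens: int = 0,
    ignore_eos: bool = False,
) -> Tuple[List[str], str]:
    """Consume tokens until a limit or EOS is hit; return (tokens, finish_reason)."""
    # max_length bounds prompt + response, so it caps the new-token budget
    # even when min_new_tokens asks for more.
    budget = max(0, max_length - prompt_length)
    if max_new_tokens is not None:
        budget = min(budget, max_new_tokens)
    generated: List[str] = []
    for token in token_stream:
        if len(generated) >= budget:
            return generated, "length"
        if token == EOS:
            if not ignore_eos and len(generated) >= min_new_tokens:
                return generated, "stop"
            continue  # EOS suppressed: ignored or below min_new_tokens
        generated.append(token)
    return generated, "length"  # the toy stream ran out before any EOS


stream = ["a", "b", EOS, "c", EOS, "d"]
print(generate_until_stop(stream, prompt_length=3, max_length=10))
print(generate_until_stop(stream, prompt_length=3, max_length=5, min_new_tokens=3))
```

In the second call the budget of two new tokens is exhausted before min_new_tokens can be satisfied, which matches the note that max_length takes precedence.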


This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.


This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

deepspeed-mii's People


ccoulombe, cderinbogaz, cli99, cmikeh2, dc3671, eltociear, gauravrajguru, greshilov, jeffra, jihnenglin, kamalkraj, kitstar, lalalune, loadams, mallorbc, microsoftopensource, mrwyattii, msinha251, novaturient95, pawanosman, phanishekhar, ringohoffman, s-jse, samyam, sarattha, tahabinhuraib, thytu, tohtana, tosinseg, weiqisun



deepspeed-mii's Issues

No text is shown when using MII in fp32 and greedy search

When using greedy search (do_sample=False) and dtype=fp32 the generated tokens are not shown in the output of the query. I believe the text generation is happening, because different values for max_new_tokens lead to different runtimes for the query. See this notebook as a minimal example.

  • possibly related to #101
  • 1 T4, GPU memory 16GB
  • deepspeed-mii version 0.0.3
  • transformers version 4.24.0
  • Amazon Linux 2

TXT2IMAGE - TXTRuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasGemmStridedBatchedExFix

Thanks for this great optimization,
We're using a fresh ec2 G5XL instance,

After installing everything and running python
I see the following error:

    attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

I've installed the environment using: pip install deepspeed[sd] deepspeed-mii

when running ds_report I see the following output:

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.9/site-packages/torch']
torch version .................... 1.13.0+cu117
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.5
deepspeed install path ........... ['/opt/conda/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.7.5, unknown, unknown
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0

OOM Error when deploying BLOOM-3B on 16GB GPU via MII

When deploying the bigscience/bloom-3b (in fp32) via MII on a T4 GPU I receive a CUDA out of memory error, see this notebook. When deploying the same model (also in fp32) via the standard HF Pipeline API, it works, see this notebook.

My expectation would be that it should be possible to deploy the same model via MII if I can deploy it via HF Pipelines. If this is not possible then it'd be good to explain why and set expectations with users.

  • 1 T4, GPU memory 16GB
  • deepspeed-mii version 0.0.3
  • transformers version 4.24.0
  • Amazon Linux 2
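For context on the numbers involved, a rough estimate of the memory taken by the weights alone can be computed as below. This is a back-of-the-envelope sketch, not a diagnosis: the parameter count of about 3 billion is inferred from the model name, and the estimate ignores activations, the CUDA context, and any temporary copies made while loading.

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Approximate memory for model weights only, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


# Assumed parameter count (~3e9), taken from the "3b" in the model name.
for dtype in ("fp32", "fp16"):
    print(dtype, round(weight_memory_gib(3e9, dtype), 2), "GiB")
```

Under these assumptions the fp32 weights already occupy most of a 16GB card, so any extra copy or buffer during deployment could plausibly push it over the limit.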

Stable Diffusion | Multi-Pipeline Support (i.e. img2img)

Hi, thank you for the incredible work done here.

Curious as to if img2img and inpainting are planned for release via MII for Stable Diffusion? Happy to potentially help add those features.

It would be ideal to be able to pass in any class that inherits from diffusers.pipeline_utils.DiffusionPipeline, and then just allow the passed kwargs to handle the various inputs. Doing this would allow both img2img, inpainting, and any other community pipelines that exist out there to take advantage of mii.

AML deployment error due to missing az cli arguments

When trying to run the aml example, e.g. bloom aml, it tries to run get_acr_name() but fails because it's missing the resource group name argument. Is there a way to pass in user arguments such as the resource group, subscription, etc.? It would also be nice to expose more arguments for the aml online endpoints, such as auth_mode; for example, we aren't allowed to use keys, only aml_tokens, in production environments. I can also imagine other deployment attributes/arguments being useful, such as instance_count or type.

[2022-12-08 10:53:37,253] [INFO] [] ************* MII is using DeepSpeed Optimizations to accelerate your model *************
ERROR: the following arguments are required: --resource-group/-g, --name/-n

Examples from AI knowledge base:
Read more about the command in reference docs


Unable to obtain ACR name from Azure-CLI. Please verify that you:
        - Have Azure-CLI installed (
        - Are logged in to an active account on Azure-CLI ($az login)
        - Have Azure-CLI ML plugin installed ($az extension add --name ml)


Traceback (most recent call last):
  File "/mnt/c/Users/davidaponte/Documents/CS677-DeepLearning/deeplearning/deeplearning/deep_learning/text_to_image/deepspeed_mii/", line 7, in <module>
  File "/home/bambam/.pyenv/versions/deeplearning/lib/python3.9/site-packages/mii/", line 112, in deploy
    _deploy_aml(deployment_name=deployment_name, model_name=model, version=version)
  File "/home/bambam/.pyenv/versions/deeplearning/lib/python3.9/site-packages/mii/", line 124, in _deploy_aml
    acr_name = mii.aml_related.utils.get_acr_name()
  File "/home/bambam/.pyenv/versions/deeplearning/lib/python3.9/site-packages/mii/aml_related/", line 31, in get_acr_name
    raise (e)
  File "/home/bambam/.pyenv/versions/deeplearning/lib/python3.9/site-packages/mii/aml_related/", line 13, in get_acr_name
    acr_name = subprocess.check_output(
  File "/home/bambam/.pyenv/versions/3.9.0/lib/python3.9/", line 420, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/home/bambam/.pyenv/versions/3.9.0/lib/python3.9/", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['az', 'ml', 'workspace', 'show', '--query', 'container_registry']' returned non-zero exit status 2.

Ubuntu 20.04.4 LTS (Focal Fossa)

CUDA OOM when loading large models

I'm trying out deepspeed-mii on a local machine (8 GPU with 23GB VRAM each). Smaller models like bloom-560m and EleutherAI/gpt-neo-2.7B worked well. However, I got CUDA OOM errors when loading larger models, like bloom-7b1. For some even larger models like EleutherAI/gpt-neox-20b, the server just crashed without any specific error messages or logs.

I've tried deepspeed inference before, and it worked fine on these models.

I use this script to deploy models

import mii

mii_configs = {"tensor_parallel": 8, "dtype": "fp16"}

Is there something I should change to my deployment script?


RuntimeError: This event loop is already running

Hi all, really intrigued by this project, love the idea of democratising the use of large models! I've been playing around and encountered a few bugs/unexpected behaviour, so will raise some issues. Happy to help and provide constructive feedback where I can :)

When running the example provided at I receive the following error: RuntimeError: This event loop is already running. See this notebook for a minimal example to reproduce the error.

I found a workaround using nest_asyncio.apply(), see this notebook. Nevertheless this strikes me as a bug (or at least unintended behaviour).
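The error message itself can be reproduced with the standard library alone, which may help narrow things down. The sketch below is an assumption about the cause, not a confirmed diagnosis: it shows that asking an already running asyncio loop to run another coroutine to completion raises this exact error, which is the situation inside a notebook kernel that runs its own loop.

```python
import asyncio


async def work() -> str:
    return "done"


async def nested_call() -> str:
    """Try to drive a coroutine from inside an already running loop."""
    loop = asyncio.get_running_loop()
    coro = work()
    try:
        # The loop is busy running nested_call(), so this call is rejected.
        loop.run_until_complete(coro)
    except RuntimeError as err:
        return str(err)
    finally:
        coro.close()  # avoid a "coroutine was never awaited" warning
    return "no error"


message = asyncio.run(nested_call())
print(message)  # This event loop is already running
```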

  • Possibly related to #87 , although this example here is not using ZeRO
  • Tested on a vanilla g4dn EC2 instance, no processes running
  • 1 T4, GPU memory 16GB
  • deepspeed-mii version 0.0.3
  • transformers version 4.24.0
  • Amazon Linux 2

RuntimeError: server crashed for some reason, unable to proceed

Using default example to deploy Deploying MII-Public on Azure ML:
Compute instance: TeslaK80 12GB
Kernel: Python 3.8 - AzureML

pip install deepspeed-mii

restart kernel

using this fails:

import mii

mii_configs = {"tensor_parallel": 1, "dtype": "fp16"}

AssertionError: text-generation only supports ['distilgpt2', 'gpt2-large'...

using this modified to tensor_parallel=1 fails:

import mii

mii_configs = {
    "dtype": "fp16",
    "tensor_parallel": 1,
    "port_number": 50950,
name = "microsoft/bloom-deepspeed-inference-fp16"

           deployment_name=name + "_deployment",

RuntimeError: server crashed for some reason, unable to proceed

Also switching to int8 didn't help.

Is my compute instance too small?

DS-MII License query

Hi @mrwyattii ,
I am trying to create an inference solution for large models that has support for various frameworks like DS-inference, DS-ZeRO and standard HF codebase.

Is it fine, if I extend some of the classes in MII like MIIServerClient and borrow some pieces of code from the proto files?
This is the relevant PR: huggingface/transformers-bloom-inference#25

Deactivate quantization?


I've been playing around with the SD image generation. I am seeing the 1.8x speedup (which is awesome), but I've also noticed a small drop in quality. How would I go about deactivating quantization to see whether that's the reason for the drop?

Example "" not working

When running I encounter the following error message :

raise ValueError(f"model must be a torch.nn.Module, got {type(self.module)}"

It's raised from


Is "CompVis/stable-diffusion-v1-4" still handled?

Installed packages
certifi @ file:///croot/certifi_1665076670883/work/certifi

Clean up and enhance MII config

  • Add pydantic support for our config so that mistyped configs error out instead of silently ignoring like they do in deepspeed
  • Bake in default config values so that if someone passes a config of {"tensor_parallel": 4} it will pick up the default port number without them needing to specify it.

example use case:

config = {"tensor_parallel": 4}
           deployment_name=name + "_deployment",
           local_model_path=".cache/models/" + name,
  • add unit tests around configuration files, e.g., error out on typos and incorrect types
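A rough idea of the requested behavior, sketched with standard-library dataclasses as a stand-in for pydantic. Everything here is hypothetical: the DeploymentConfig class, the load_config helper, and in particular the default port value, which is a placeholder rather than MII's actual default.

```python
from dataclasses import dataclass, fields


@dataclass
class DeploymentConfig:
    """Hypothetical config; field names follow the examples in this issue."""
    tensor_parallel: int = 1
    dtype: str = "fp16"
    port_number: int = 50050  # placeholder default, not MII's real value

    def __post_init__(self):
        # Reject values of the wrong type instead of accepting them silently.
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, f.type):
                raise TypeError(
                    f"{f.name} must be {f.type.__name__}, got {type(value).__name__}"
                )


def load_config(raw: dict) -> DeploymentConfig:
    """Build a config, failing loudly on mistyped keys."""
    allowed = {f.name for f in fields(DeploymentConfig)}
    unknown = set(raw) - allowed
    if unknown:
        raise ValueError(f"Unknown config keys: {sorted(unknown)}")
    return DeploymentConfig(**raw)


config = load_config({"tensor_parallel": 4})
print(config)  # the default port is filled in without being specified
```

A typo such as {"tensor_paralel": 4} raises ValueError here, and {"tensor_parallel": "4"} raises TypeError, which is the kind of check the unit tests mentioned above could assert.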

New microsoft/bloom-deepspeed-inference-fp16 weights not working with DeepSpeed MII

New microsoft/bloom-deepspeed-inference-fp16 and microsoft/bloom-deepspeed-inference-int8 weights not working with DeepSpeed MII

@jeffra @RezaYazdaniAminabadi

Traceback (most recent call last):
  File "scripts/bloom-inference-server/", line 83, in <module>
    model = DSInferenceGRPCServer(args)
  File "/net/llm-shared-nfs/nfs/mayank/BigScience-Megatron-DeepSpeed/scripts/bloom-inference-server/ds_inference/", line 36, in __init__
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/mii/", line 70, in deploy
    mii.utils.check_if_task_and_model_is_valid(task, model)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/mii/", line 108, in check_if_task_and_model_is_valid
    assert (
AssertionError: text-generation only supports [.....]

The list of models doesn't contain the new weights.

Support for FLAN-T5

I saw that T5 wasn't in the list of supported Hugging Face transformers models. Are there plans or an ETA for when the T5 family would be added? FLAN-T5 is a very strong LLM for zero-shot and few-shot instruction prompting. I am currently building out a hacky implementation for hosting with deepspeed-inference, but having it natively supported in deepspeed-mii would be ideal.

Change number of max_tokens from 1024

I am currently able to deploy, query, and shut down a model using the provided scripts.

However, unlike using DeepSpeed inference on its own, I am not able to figure out how to change the number of max generated tokens from 1024 to a different value.

I believe this is currently not supported, but I could be mistaken.

I believe the issue can be found with the code here:

engine = deepspeed.init_inference(getattr(inference_pipeline,

A value called max_tokens needs to be passed as an argument.

If I am correct, this should be a fairly simple fix. I may create a PR for it if I can resolve it.

Socket timeouts in MII

@mrwyattii seeing this a lot lately:

Traceback (most recent call last):
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/mii/", line 268, in _request_async_response
    response = await self.stubs[stub_id].GeneratorReply(req)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/grpc/aio/", line 285, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1667202507.928505909","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/","file_line":5391,"referenced_errors":[{"created":"@1667202507.928504405","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/","file_line":398,"grpc_status":14}]}"
Task exception was never retrieved
future: <Task finished name='Task-3477' coro=<MIIServerClient._request_async_response() done, defined at /net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/mii/> exception=<AioRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1667202507.928579654","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/","file_line":5391,"referenced_errors":[{"created":"@1667202507.928578643","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/","file_line":398,"grpc_status":14}]}"
Traceback (most recent call last):
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/mii/", line 268, in _request_async_response
    response = await self.stubs[stub_id].GeneratorReply(req)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/grpc/aio/", line 285, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1667202507.928579654","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/","file_line":5391,"referenced_errors":[{"created":"@1667202507.928578643","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/","file_line":398,"grpc_status":14}]}"
Task exception was never retrieved
future: <Task finished name='Task-3472' coro=<MIIServerClient._request_async_response() done, defined at /net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/mii/> exception=<AioRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1667202508.129364892","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/","file_line":5391,"referenced_errors":[{"created":"@1667202508.129363364","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/","file_line":398,"grpc_status":14}]}"
Traceback (most recent call last):
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/mii/", line 268, in _request_async_response
    response = await self.stubs[stub_id].GeneratorReply(req)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/grpc/aio/", line 285, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1667202508.129364892","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/","file_line":5391,"referenced_errors":[{"created":"@1667202508.129363364","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/","file_line":398,"grpc_status":14}]}"
Task exception was never retrieved
future: <Task finished name='Task-3473' coro=<MIIServerClient._request_async_response() done, defined at /net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/mii/> exception=<AioRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1667202508.453402948","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/","file_line":5391,"referenced_errors":[{"created":"@1667202508.453401110","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/","file_line":398,"grpc_status":14}]}"
Traceback (most recent call last):
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/mii/", line 268, in _request_async_response
    response = await self.stubs[stub_id].GeneratorReply(req)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/grpc/aio/", line 285, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1667202508.453402948","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/","file_line":5391,"referenced_errors":[{"created":"@1667202508.453401110","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/","file_line":398,"grpc_status":14}]}"

OPT in TP or PP mode

Is there a way to inference OPT models in TensorParallel or PipelineParallel mode?

As I understand:

  • BLOOM uses llm provider which loads the model weights as meta tensors first and then assigns devices during checkpoint loading in ds-inference.

  • OPT uses hf provider with 🤗 pipeline and directly loads checkpoint weights on a specific device.

However, only MP is supported on the 🤗 side (using accelerate). Is there a way to run OPT inference with the llm provider?

Unable to run for txt2img benchmark

I've run into some strange protobuf-related errors. When I first ran into this, I was able to resolve it by changing my protobuf version to >=3.20.0, but now that doesn't work anymore.

My hunch is that it's related to how I am installing things. I wasn't sure what the correct way was to install deepspeed and deepspeed-mii, so I have been trying the following:
pip install deepspeed[sd] deepspeed-mii

I am now seeing this error when trying to run
[Screenshot 2022-12-08 131428]

Ubuntu LTS

Support for Fairseq Translation Model

Hi, does DeepSpeed-MII support fairseq's translation models, such as transformer.wmt16.en-de or transformer.wmt19.en-de? No translation task is listed in the Supported Models and Tasks section.

Is model split in OPT TP mode?

I tried HF OPT-13b on a 4-GPU machine with tensor-parallel: 4. One observation is that all GPUs used the same amount of memory (~25G), which is consistent with other users' reports. I also found that the memory is the same as when using tensor-parallel: 2. So my question is whether the model is split after it is loaded into CPU memory, as said in this thread. My understanding is that, if the model is split, per-GPU memory should be a quarter with tensor-parallel: 4 and a half with tensor-parallel: 2.

By the way, I also didn't see a real latency reduction when increasing the tensor-parallel degree (the latency differs by only 2 or 3 ms).
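As a rough sanity check on the expectation above, the per-GPU weight footprint can be estimated from the parameter count and the dtype width. This is only a sketch: it assumes roughly 13e9 parameters, fp16 weights, and an even split across ranks, and it ignores activations, KV cache, CUDA context, and workspace buffers. The helper name is made up for illustration.

```python
def weights_per_gpu_gib(num_params, bytes_per_param, tensor_parallel):
    """Estimate weight memory per GPU in GiB, assuming an even shard per rank."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / tensor_parallel / 2**30


# Assumed figures: ~13e9 parameters stored as fp16 (2 bytes each).
for tp in (1, 2, 4):
    print(f"tensor-parallel {tp}: ~{weights_per_gpu_gib(13e9, 2, tp):.1f} GiB per GPU")
```

Under these assumptions, a reading close to the tensor-parallel 1 figure on every GPU would be consistent with each rank holding an unsharded copy, but that is an inference from the arithmetic, not something the numbers alone can confirm.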

Memory issue when loading OPT

I have recently been trying to run OPT models on MII but ran into some memory issues. The OPT model I used is facebook/opt-13b. The mii-config and deployment parameters are as follows:

mii_configs = {
    "dtype": "fp32",
    "tensor_parallel": 4,

name = "facebook/opt-13b"

           deployment_name=name + "_deployment",

The checkpoint is already downloaded into the model_path. Since the checkpoint size of opt-13b is around 26 GB, I assumed it should work on a machine with 4 x V100 and 224 GB of memory. But during loading (even before the server started), MII reported that the server had crashed and exited quietly. I then checked the memory usage and was surprised to find that MII had used up all 224 GB of memory. So my question is: why does MII consume several times more memory than the checkpoint size? Is there any configuration to change this behavior?
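One possible explanation, offered purely as an assumption and not confirmed anywhere in this thread, is that each of the four tensor-parallel processes loads its own full fp32 copy of the model into host memory before sharding. The back-of-the-envelope estimate below shows what that would cost; the parameter count (about 13e9) and the helper name are illustrative.

```python
def host_ram_gib(num_params, bytes_per_param, num_processes):
    """Estimate host RAM in GiB if every process holds a full copy of the weights."""
    return num_params * bytes_per_param * num_processes / 2**30


# Assumed: ~13e9 parameters, fp32 (4 bytes each), one full copy per rank.
single_copy = host_ram_gib(13e9, 4, 1)
four_copies = host_ram_gib(13e9, 4, 4)
print(f"one fp32 copy:    ~{single_copy:.0f} GiB")
print(f"four fp32 copies: ~{four_copies:.0f} GiB")
```

If that assumption held, four copies alone would come close to the 224 GB on the machine before counting any other overhead.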

[BUG] number of dims don't match in permute when inferencing with bert model type

When serving models of "bert" type, the following error showed up.

To reproduce, checkout #19 and start a local server with "" using "bert-base-uncased", and query the server with the following client code

import os
import grpc
import mii

# bert
name = "bert-base-uncased"
mask = "[MASK]"
print(f"Querying {name}...")

generator = mii.mii_query_handle(name + "_deployment")
result = generator.query({'query': "Hello I'm a " + mask + " model."})

Question : How to query a remote DeepSpeed server?

I don't see any parameter that lets the user specify a remote DeepSpeed server to target.
Is there any option for that?

If yes :

  • How

If no :

  • How could we do it manually?
  • Do you intend to implement such a feature in the near future?

Second question : Is there any option to manually load/unload a model at query time ?

Errors running Zero-Inference text generation example


I'm trying to run the example provided for text generation with Zero-Inference, and having trouble getting predictions without running into errors.

When I try to deploy the exact same model and config, I first get a validation error for the aio configuration.

Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/mii/launch/", line 70, in <module>
  File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/mii/launch/", line 56, in main
    inference_pipeline = load_models(task_name=args.task_name,
  File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/mii/models/", line 87, in load_models
    ds_config = DeepSpeedConfig(ds_config_path)
  File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/deepspeed/runtime/", line 811, in __init__
  File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/deepspeed/runtime/", line 830, in _initialize_params
    self.zero_config = get_zero_config(param_dict)
  File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/deepspeed/runtime/zero/", line 66, in get_zero_config
    return DeepSpeedZeroConfig(**zero_config_dict)
  File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/deepspeed/runtime/", line 54, in __init__
  File "pydantic/", line 406, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for DeepSpeedZeroConfig
  extra fields not permitted (type=value_error.extra)

If I remove the aio config, the server starts successfully, but when I try to create a generator and query it (just like in your sample), I get another error from the generator.query() call:

RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_5004/ in <cell line: 1>()
----> 1 result = generator.query({'query': ["DeepSpeed is the", "Seattle is"]})

~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/mii/ in query(self, request_dict, **query_kwargs)
    357         else:
    358             assert self.initialize_grpc_client, "grpc client has not been setup when this model was created"
--> 359             response = self.asyncio_loop.run_until_complete(
    360                 self._query_in_tensor_parallel(request_dict,
    361                                                query_kwargs))

~/anaconda3/envs/pytorch_p38/lib/python3.8/asyncio/ in run_until_complete(self, future)
    590         """
    591         self._check_closed()
--> 592         self._check_running()
    594         new_task = not futures.isfuture(future)

~/anaconda3/envs/pytorch_p38/lib/python3.8/asyncio/ in _check_running(self)
    550     def _check_running(self):
    551         if self.is_running():
--> 552             raise RuntimeError('This event loop is already running')
    553         if events._get_running_loop() is not None:
    554             raise RuntimeError(

RuntimeError: This event loop is already running

Any help is greatly appreciated.
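The second traceback can be reproduced with plain asyncio: calling run_until_complete on a loop that is already running raises this exact RuntimeError, which is what one would expect inside a notebook kernel that runs its own event loop. The sketch below reproduces the error and shows one generic workaround (running the coroutine on a fresh loop in a worker thread). fake_query and run_blocking are stand-ins invented for illustration, not MII functions.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def fake_query():
    """Stand-in for an async request; not part of MII."""
    await asyncio.sleep(0)
    return "response"


def run_blocking(coro_fn):
    """Run coro_fn() to completion on a fresh event loop in a worker thread."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro_fn()).result()


async def notebook_like():
    """Simulate code that runs while an event loop is already active."""
    loop = asyncio.get_running_loop()
    error = None
    coro = fake_query()
    try:
        loop.run_until_complete(coro)  # same call pattern as in the traceback
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        error = str(exc)
    # Workaround: a separate thread has no running loop, so asyncio.run works.
    result = run_blocking(fake_query)
    return error, result


error, result = asyncio.run(notebook_like())
print(error)
print(result)
```

Running the same client code from a regular Python script, where no loop is active, is another way to sidestep the problem.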

"error: cuda_runtime_api.h: No such file or directory"

Hello, I'm trying to run the basic example. I have several LLMs working and have used Huggingface Hub to download them, for reference. However, I get this error in the title. Indeed this file is not found in:
/home/user/.local/lib/python3.10/site-packages/torch/include/c10/

I did find it here:

I had a challenging time getting my nvidia driver to work with the right cuda version during torch install. Current PyTorch version is: Version: 1.12.1+cu116. You can see the version 11.7 in the above path. I'm not sure how relevant that is, but this is the only combination of cuda and torch versions I could get working. I think c10 denotes the default version of torch installed with python 3.10 on Ubuntu 22.04. Which is supported by this quote from SE:

"PyTorch doesn't use the system's CUDA library. When you install PyTorch using the precompiled binaries using either pip or conda it is shipped with a copy of the specified version of the CUDA library which is installed locally."

The output does say:
Installed CUDA version 11.7 does not match the version torch was compiled with 11.6 but since the APIs are compatible, accepting this combination Using /home/user/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...

Do I need to set some environment vars and/or install another version of PyTorch in a virtualenv? I'm a little short on space, so hoping not. It seems there is some conflict between the default PyTorch c10 locations and the discovered 11.6/11.7 version of CUDA.

Quick side note: the models were downloaded to /tmp/mii_models. Is it possible to use the standard Hugging Face model locations?
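For the missing-header error, one thing worth checking is whether a full CUDA toolkit (not just the runtime libraries bundled with the PyTorch wheel) is visible to the extension build. The sketch below looks for cuda_runtime_api.h under a few candidate roots. CUDA_HOME and CUDA_PATH are the environment variables PyTorch's extension builder consults; /usr/local/cuda is only a common default and may not apply to this machine. The helper name is made up for illustration.

```python
import os


def find_cuda_header(roots, header="cuda_runtime_api.h"):
    """Return the first <root>/include/<header> that exists, or None."""
    for root in roots:
        if not root:
            continue  # skip unset environment variables
        path = os.path.join(root, "include", header)
        if os.path.isfile(path):
            return path
    return None


candidates = [
    os.environ.get("CUDA_HOME"),
    os.environ.get("CUDA_PATH"),
    "/usr/local/cuda",  # assumed default install location
]
print(find_cuda_header(candidates) or "cuda_runtime_api.h not found in candidate roots")
```

If nothing is found, pointing CUDA_HOME at a toolkit install that contains the include directory is a reasonable next step to try.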

Graceful teardown

Currently there's no way to tear down a local or Azure deployment gracefully. We currently just pkill python, which is clearly not a clean solution.
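To make the request concrete, a graceful teardown of a server subprocess usually means "ask politely, wait, then force". The sketch below shows that pattern with the standard library only. stop_process is a hypothetical helper, and the sleeping child process merely stands in for a model server; none of this is MII's API.

```python
import subprocess
import sys


def stop_process(proc, grace_seconds=5.0):
    """Terminate proc, escalating to kill if it does not exit in time."""
    if proc.poll() is not None:
        return proc.returncode  # already exited
    proc.terminate()  # SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        return proc.wait()


# Stand-in for a long-running server process.
server = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = stop_process(server)
print("server exited with", exit_code)
```

A launcher that spawns one worker per GPU would likely also need its child processes signalled (for example via a process group), which this sketch does not cover.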

Running bigscience/bloom-350m example returns AssertionError

When I run the following example from the readme:

import mii
mii_configs = {"tensor_parallel": 1, "dtype": "fp16"}

It returns the following:

/usr/local/lib/python3.7/dist-packages/mii/ in check_if_task_and_model_is_valid(task, model_name)
    108     assert (
    109         model_name in valid_task_models
--> 110     ), f"{task_name} only supports {valid_task_models}"

AssertionError: text-generation only supports....

Error. I suspect this is related to a change in model weights.

Can you point me in the right direction?

And also, thanks for this amazing repo! Can't wait to use it 👍 💯

FileNotFoundError: [Errno 2] No such file or directory: 'deepspeed'


When running the following code, I get a FileNotFoundError.

Any idea why this happens? I followed the usual install through conda (pytorch+cuda) and pip install .

mii_configs = {"tensor_parallel": 1, "dtype": "fp16"}
[2022-08-25 12:41:19,489] [INFO] [] *************DeepSpeed Optimizations: True*************
[2022-08-25 12:41:19,524] [INFO] [] multi-gpu deepspeed launch: ['deepspeed', '--num_gpus', '1', '--no_local_rank', '--no_python', '/mnt/2287294e-32c7-437b-84bd-452a29548b1a/conda_env/DeepSpeedInterface/bin/python', '-m', 'mii.launch.multi_gpu_server', '--task-name', 'text-generation', '--model', 'gpt2', '--model-path', '/tmp/mii_models', '--port', '50050', '--ds-optimize', '--provider', 'hugging-face', '--config', 'eyJ0ZW5zb3JfcGFyYWxsZWwiOiAxLCAicG9ydF9udW1iZXIiOiA1MDA1MCwgImR0eXBlIjogImZwMTYiLCAiZW5hYmxlX2N1ZGFfZ3JhcGgiOiBmYWxzZSwgImNoZWNrcG9pbnRfZGljdCI6IG51bGx9']

FileNotFoundError                         Traceback (most recent call last)
Input In [2], in <cell line: 2>()
      1 mii_configs = {"tensor_parallel": 1, "dtype": "fp16"}
----> 2 mii.deploy(task="text-generation",
      3            model="gpt2",
      4            deployment_name="gpt2_deployment",
      5            mii_config=mii_configs)

File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/NLP/Text_Generation/text_generator/DeepSpeed-MII/mii/, in deploy(task, model, deployment_name, deployment_type, model_path, enable_deepspeed, enable_zero, ds_config, mii_config)
     92     print(f"Score file created at {generated_score_path(deployment_name)}")
     93 elif deployment_type == DeploymentType.LOCAL:
---> 94     return _deploy_local(deployment_name, model_path=model_path)
     95 else:
     96     raise Exception(f"Unknown deployment type: {deployment_type}")

File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/NLP/Text_Generation/text_generator/DeepSpeed-MII/mii/, in _deploy_local(deployment_name, model_path)
     99 def _deploy_local(deployment_name, model_path):
--> 100     mii.utils.import_score_file(deployment_name).init()

File /tmp/mii_cache/gpt2_deployment/, in init()
     26 assert task is not None, "The task name should be set before calling init"
     28 global model
---> 29 model = mii.MIIServerClient(task,
     30                             model_name,
     31                             model_path,
     32                             ds_optimize=configs[mii.constants.ENABLE_DEEPSPEED_KEY],
     33                             ds_zero=configs[mii.constants.ENABLE_DEEPSPEED_ZERO_KEY],
     34                             ds_config=configs[mii.constants.DEEPSPEED_CONFIG_KEY],
     35                             mii_configs=configs[mii.constants.MII_CONFIGS_KEY],
     36                             use_grpc_server=use_grpc_server,
     37                             initialize_grpc_client=initialize_grpc_client)

File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/NLP/Text_Generation/text_generator/DeepSpeed-MII/mii/, in MIIServerClient.__init__(self, task_name, model_name, model_path, ds_optimize, ds_zero, ds_config, mii_configs, initialize_service, initialize_grpc_client, use_grpc_server)
     80     self.model = None
     82 if self.initialize_service:
---> 83     self.process = self._initialize_service(model_name,
     84                                             model_path,
     85                                             ds_optimize,
     86                                             ds_zero,
     87                                             ds_config,
     88                                             mii_configs)
     89     if self.use_grpc_server:
     90         self._wait_until_server_is_live()

File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/NLP/Text_Generation/text_generator/DeepSpeed-MII/mii/, in MIIServerClient._initialize_service(self, model_name, model_path, ds_optimize, ds_zero, ds_config, mii_configs)
    207     mii_env = os.environ.copy()
    208     mii_env["TRANSFORMERS_CACHE"] = model_path
--> 209     process = subprocess.Popen(cmd, env=mii_env)
    210 return process

File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/conda_env/DeepSpeedInterface/lib/python3.9/, in Popen.__init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, user, group, extra_groups, encoding, errors, text, umask)
    947         if self.text_mode:
    948             self.stderr = io.TextIOWrapper(self.stderr,
    949                     encoding=encoding, errors=errors)
--> 951     self._execute_child(args, executable, preexec_fn, close_fds,
    952                         pass_fds, cwd, env,
    953                         startupinfo, creationflags, shell,
    954                         p2cread, p2cwrite,
    955                         c2pread, c2pwrite,
    956                         errread, errwrite,
    957                         restore_signals,
    958                         gid, gids, uid, umask,
    959                         start_new_session)
    960 except:
    961     # Cleanup if the child failed starting.
    962     for f in filter(None, (self.stdin, self.stdout, self.stderr)):

File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/conda_env/DeepSpeedInterface/lib/python3.9/, in Popen._execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, restore_signals, gid, gids, uid, umask, start_new_session)
   1819     if errno_num != 0:
   1820         err_msg = os.strerror(errno_num)
-> 1821     raise child_exception_type(errno_num, err_msg, err_filename)
   1822 raise child_exception_type(err_msg)

FileNotFoundError: [Errno 2] No such file or directory: 'deepspeed'
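The launch log shows the command starting with a bare 'deepspeed', so subprocess.Popen has to resolve it through PATH. A quick check like the one below can tell whether the launcher is visible to the current Python process. It also looks next to the running interpreter, on the assumption (not confirmed here) that the notebook kernel was started without the conda environment's bin directory on PATH. find_launcher is a hypothetical helper.

```python
import os
import shutil
import sys


def find_launcher(name="deepspeed"):
    """Return a path to the executable `name`, or None if it cannot be found."""
    found = shutil.which(name)  # same PATH lookup a bare command name relies on
    if found:
        return found
    # Fall back to the directory holding the current interpreter (e.g. <env>/bin).
    candidate = os.path.join(os.path.dirname(sys.executable), name)
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    return None


print(find_launcher() or "deepspeed launcher not found on PATH or next to the interpreter")
```

If the launcher is only found next to the interpreter, prepending that directory to PATH before deploying may be enough to get past this error.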

Add release tag 0.05

I noticed version.txt is at 0.05 but there is no release tag for 0.05 and PyPI is at 0.04. This change was made over a month ago. Perhaps there was meant to be a release tag but for some reason, it was forgotten?

Passing `stopping_criteria` to DeepSpeed MII

Hi, would it be possible to pass in a stopping_criteria inside .generate()?

mii_generator = mii.mii_query_handle('name')
mii_generator.query({"query": ['hello']}, stopping_criteria=[])

Currently we get an error (can't pass a list of objects through grpc):

~/venv/lib/python3.7/site-packages/mii/ in kwarg_dict_to_proto(kwarg_dict)
    176         return proto_value
--> 178     return {k: get_proto_value(v) for k, v in kwarg_dict.items()}

~/venv/lib/python3.7/site-packages/mii/ in <dictcomp>(.0)
    176         return proto_value
--> 178     return {k: get_proto_value(v) for k, v in kwarg_dict.items()}

~/venv/lib/python3.7/site-packages/mii/ in get_proto_value(value)
    173     def get_proto_value(value):
    174         proto_value = mii.grpc_related.proto.modelresponse_pb2.Value()
--> 175         setattr(proto_value, dtype_proto_field[type(value)], value)
    176         return proto_value

KeyError: <class 'list'>

Use case is, for a text-generation task, I'd like to stop at a newline / custom token.
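The KeyError comes from a lookup keyed on type(value), as the last traceback frame shows, so any keyword argument whose type is not in that table fails before the request is sent. The sketch below imitates that dispatch to show how a caller could spot unsupported kwargs up front. The table contents and field names are hypothetical and are not taken from MII's proto definitions.

```python
# Hypothetical type-to-field table; the real mapping in MII may differ.
DTYPE_PROTO_FIELD = {
    str: "svalue",
    int: "ivalue",
    float: "fvalue",
    bool: "bvalue",
}


def proto_field_for(value):
    """Mimic the failing lookup: raises KeyError for unsupported types."""
    return DTYPE_PROTO_FIELD[type(value)]


def unsupported_kwargs(kwargs):
    """List the kwarg names whose values could not be serialized this way."""
    return [name for name, value in kwargs.items()
            if type(value) not in DTYPE_PROTO_FIELD]


query_kwargs = {"max_new_tokens": 32, "stopping_criteria": []}
print(unsupported_kwargs(query_kwargs))
```

This only explains the failure mode. Passing Python objects such as stopping criteria across a gRPC boundary would need explicit serialization support on both sides.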

Error in inf/nan tensors


Traceback (most recent call last):
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/mii/", line 268, in _request_async_response
    response = await self.stubs[stub_id].GeneratorReply(req)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/grpc/aio/", line 285, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Exception calling application: probability tensor contains either `inf`, `nan` or element < 0"
        debug_error_string = "{"created":"@1667201081.916813042","description":"Error received from peer ipv6:[::1]:50956","file":"src/core/lib/surface/","file_line":1068,"grpc_message":"Exception calling application: probability tensor contains either `inf`, `nan` or element < 0","grpc_status":2}"

I see this sometimes for BLOOM-176B

Stop Sequence

Hi Deepspeed-MII team,

I was wondering if there is a way to implement a stop sequence or stop token in ds-mii to stop generation early.

In the current implementation, the model mostly generates max_new_tokens number of tokens. In huggingface transformers, it's possible to implement custom stopping criteria but I did not find this option here.

I tried setting the eos_token_id to the desired stop token but somehow the model keeps generating even after producing the stop token.

Cheers, V
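Until server-side stopping is available, one client-side stopgap is to cut the returned text at the first stop sequence. This does not save any generation time, since the model still produces all the tokens, and it should be applied to the generated continuation rather than to the prompt. truncate_at_stop is a hypothetical helper, not an MII function.

```python
def truncate_at_stop(text, stop_sequences):
    """Return text up to (not including) the earliest stop sequence found."""
    cut = len(text)
    for stop in stop_sequences:
        if not stop:
            continue  # ignore empty strings, which would match at index 0
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


generated = "The answer is 42.\nQuestion: what is next?"
print(truncate_at_stop(generated, ["\n", "Question:"]))
```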

Issue with default mii_cache location

The default mii_cache location is hardcoded as /tmp/mii_cache, and we run into issues in a cluster environment when multiple users submit jobs and write to that directory. It might be better for the default cache location to respect an environment variable set on the system.

MII_CACHE_PATH_DEFAULT = "/tmp/mii_cache"
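The behavior being requested could look roughly like the sketch below: prefer an environment variable when it is set, and otherwise fall back to a per-user subdirectory of the default so that jobs from different users do not collide. The variable name MII_CACHE_PATH and the per-user fallback are assumptions made for illustration, not a description of what MII does.

```python
import getpass
import os

MII_CACHE_PATH_DEFAULT = "/tmp/mii_cache"


def resolve_cache_path(env=None, user=None):
    """Pick the cache directory: env override first, else a per-user default."""
    env = os.environ if env is None else env
    override = env.get("MII_CACHE_PATH")  # assumed variable name
    if override:
        return override
    user = getpass.getuser() if user is None else user
    return os.path.join(MII_CACHE_PATH_DEFAULT, user)


print(resolve_cache_path({"MII_CACHE_PATH": "/scratch/job123/mii_cache"}))
print(resolve_cache_path({}, user="alice"))
```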

feature request : Docker image for deepspeed-mii

Motivation :

As a developer, I want to be able to test deepspeed-mii easily.
However, even when using conda (or another Python package manager, e.g. pypenv), I still encounter errors (with protobuf, for example).

Solution :

Fastest one : Provide a Dockerfile that the developer/user could build to use and test deepspeed-mii.
What would be amazing : On each deepspeed-mii change, a CI job builds the Docker image and uploads/updates it on Docker Hub.

This should take long to do but would be great to have 🙂

'DSUNet' object has no attribute 'config'

I'm unable to get the example script working from here:

When I run without arguments it loads the model and deploys okay.
But then using the --query produces this:

ERROR:grpc._server:Exception calling application: 'DSUNet' object has no attribute 'config'
Traceback (most recent call last):
  File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/site-packages/grpc/", line 443, in _call_behavior
    response_or_iterator = behavior(argument, context)
  File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/site-packages/mii/grpc_related/", line 77, in Txt2ImgReply
    response = self.inference_pipeline(request, **query_kwargs)
  File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/autograd/", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/site-packages/diffusers/pipelines/stable_diffusion/", line 504, in __call__
    height = height or self.unet.config.sample_size * self.vae_scale_factor
  File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/", line 1265, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DSUNet' object has no attribute 'config'
Traceback (most recent call last):
  File "", line 52, in <module>
    result = generator.query({
  File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/site-packages/mii/", line 367, in query
    response = self.asyncio_loop.run_until_complete(
  File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/asyncio/", line 616, in run_until_complete
    return future.result()
  File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/site-packages/mii/", line 263, in _query_in_tensor_parallel
    await responses[0]
  File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/site-packages/mii/", line 313, in _request_async_response
    response = await self.stubs[stub_id].Txt2ImgReply(req)
  File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/site-packages/grpc/aio/", line 290, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Exception calling application: 'DSUNet' object has no attribute 'config'"
        debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:50050 {created_time:"2022-11-28T15:19:01.639187607-08:00", grpc_status:2, grpc_message:"Exception calling application: \'DSUNet\' object has no attribute \'config\'"}"


deepspeed                     0.7.5
deepspeed-mii                 0.0.3
transformers                  4.24.0

Some generate parameters do not work for query

When using DeepSpeed MII, some parameters do not work when querying the model, even though they work with model.generate or with Hugging Face pipelines. I have also tried these parameters using DeepSpeed inference on its own and found them to work.

The parameters that cause issues for me are num_beams and bad_words_ids but there may be more.

I have found do_sample, max_length, min_length, top_k, top_p, temperature, repetition_penalty, and early_stopping to not cause issues but there may be more.

Support for Albert and Swin/ViT

Just curious whether there is a plan to support Albert and Swin/ViT. I am currently working with a model for multimodal learning that involves language models like Albert and vision transformers like Swin and ViT. If there is no immediate plan for this due to limited resources, is there any documentation on adding support for new or customized models, so I could help?

mii example FileNotFoundError: [Errno 2] No such file or directory: '/tmp/mii_cache/bert-base-uncased_deployment/'

Deepspeed-MII=latest version

test example in file

import mii

# roberta
name = "roberta-base"
mask = "<mask>"
# bert
name = "bert-base-uncased"
mask = "[MASK]"
print(f"Querying {name}...")

generator = mii.mii_query_handle(name + "_deployment")
result = generator.query({'query': "Hello I'm a " + mask + " model."})
print("time_taken:", result.time_taken)


Error code

(ds1) [root@6301babb8dc8a1eeb0ed2044 DeepSpeed-MII (main)]# python
Querying bert-base-uncased...
Traceback (most recent call last):
  File "", line 11, in <module>
    generator = mii.mii_query_handle(name + "_deployment")
  File "/root/workspace/sharing/big-storage/hyungrak/tuning/text_filter/mii_test/DeepSpeed-MII/mii/", line 34, in mii_query_handle
    configs = mii.utils.import_score_file(deployment_name).configs
  File "/root/workspace/sharing/big-storage/hyungrak/tuning/text_filter/mii_test/DeepSpeed-MII/mii/", line 147, in import_score_file
  File "<frozen importlib._bootstrap_external>", line 839, in exec_module
  File "<frozen importlib._bootstrap_external>", line 975, in get_code
  File "<frozen importlib._bootstrap_external>", line 1032, in get_data
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/mii_cache/bert-base-uncased_deployment/'

Why is this error occurring?
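Elsewhere on this page, tracebacks show mii.deploy writing a score file under /tmp/mii_cache/<deployment_name>/ and the query handle importing it, so this error would be expected if the query script runs before any deployment with that name was created on the same machine. A small pre-check like the following can confirm that. The helper and the default cache root are assumptions for illustration only.

```python
import os


def is_deployed(deployment_name, cache_root="/tmp/mii_cache"):
    """Return True if a non-empty cache directory exists for the deployment."""
    deployment_dir = os.path.join(cache_root, deployment_name)
    return os.path.isdir(deployment_dir) and len(os.listdir(deployment_dir)) > 0


name = "bert-base-uncased"
if not is_deployed(name + "_deployment"):
    print(f"No local deployment found for {name}; run the deployment script first.")
```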
