
sagemaker-inference-toolkit's Introduction


SageMaker Inference Toolkit


Serve machine learning models within a Docker container using Amazon SageMaker.

📚 Background

Amazon SageMaker is a fully managed service for data science and machine learning (ML) workflows. You can use Amazon SageMaker to simplify the process of building, training, and deploying ML models.

Once you have a trained model, you can include it in a Docker container that runs your inference code. A container provides an effectively isolated environment, ensuring a consistent runtime regardless of where the container is deployed. Containerizing your model and code enables fast and reliable deployment of your model.

The SageMaker Inference Toolkit implements a model serving stack and can be easily added to any Docker container, making it deployable to SageMaker. This library's serving stack is built on Multi Model Server, and it can serve your own models or those you trained on SageMaker using machine learning frameworks with native SageMaker support. If you use a prebuilt SageMaker Docker image for inference, this library may already be included.

For more information, see the Amazon SageMaker Developer Guide sections on building your own container with Multi Model Server and using your own models.

๐Ÿ› ๏ธ Installation

To install this library in your Docker image, add the following line to your Dockerfile:

RUN pip3 install multi-model-server sagemaker-inference

Here is an example of a Dockerfile that installs SageMaker Inference Toolkit.
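For illustration, a minimal Dockerfile along these lines might look like the following (the base image, the Java runtime required by Multi Model Server, and the file paths are assumptions):

    FROM python:3.8-slim

    # Multi Model Server runs on the JVM, so a Java runtime is required.
    RUN apt-get update \
        && apt-get install -y --no-install-recommends openjdk-11-jre-headless \
        && rm -rf /var/lib/apt/lists/*

    RUN pip3 install multi-model-server sagemaker-inference

    # Serving entrypoint (see the Usage section below).
    COPY entrypoint.py /usr/local/bin/entrypoint.py
    ENTRYPOINT ["python", "/usr/local/bin/entrypoint.py"]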

💻 Usage

Implementation Steps

To use the SageMaker Inference Toolkit, you need to do the following:

  1. Implement an inference handler, which is responsible for loading the model and providing input, predict, and output functions. (Here is an example of an inference handler.)

    import textwrap

    from sagemaker_inference import content_types, decoder, default_inference_handler, encoder, errors
    
    class DefaultPytorchInferenceHandler(default_inference_handler.DefaultInferenceHandler):
    
        def default_model_fn(self, model_dir, context=None):
            """Loads a model. For PyTorch, a default function to load a model cannot be provided.
            Users should provide customized model_fn() in script.
    
            Args:
                model_dir: a directory where model is saved.
                context (obj): the request context (default: None).
    
            Returns: A PyTorch model.
            """
            raise NotImplementedError(textwrap.dedent("""
            Please provide a model_fn implementation.
            See documentation for model_fn at https://github.com/aws/sagemaker-python-sdk
            """))
    
        def default_input_fn(self, input_data, content_type, context=None):
            """A default input_fn that can handle JSON, CSV and NPZ formats.
    
            Args:
                input_data: the request payload serialized in the content_type format
                content_type: the request content_type
                context (obj): the request context (default: None).
    
            Returns: input_data deserialized into torch.FloatTensor or torch.cuda.FloatTensor depending if cuda is available.
            """
            return decoder.decode(input_data, content_type)
    
        def default_predict_fn(self, data, model, context=None):
            """A default predict_fn for PyTorch. Calls a model on data deserialized in input_fn.
            Runs prediction on GPU if cuda is available.
    
            Args:
                data: input data (torch.Tensor) for prediction deserialized by input_fn
                model: PyTorch model loaded in memory by model_fn
                context (obj): the request context (default: None).
    
            Returns: a prediction
            """
            return model(data)
    
        def default_output_fn(self, prediction, accept, context=None):
            """A default output_fn for PyTorch. Serializes predictions from predict_fn to JSON, CSV or NPY format.
    
            Args:
                prediction: a prediction result from predict_fn
                accept: type which the output data needs to be serialized
                context (obj): the request context (default: None).
    
            Returns: output data serialized
            """
            return encoder.encode(prediction, accept)

    Note that passing context as an argument to the handler functions is optional. You can omit context from the function declaration if it is not needed at runtime. For example, the following handler function declarations will also work:

    def default_model_fn(self, model_dir)
    
    def default_input_fn(self, input_data, content_type)
    
    def default_predict_fn(self, data, model)
    
    def default_output_fn(self, prediction, accept)
    
  2. Implement a handler service that is executed by the model server. (Here is an example of a handler service.) For more information on how to define your HANDLER_SERVICE file, see the MMS custom service documentation.

    from sagemaker_inference.default_handler_service import DefaultHandlerService
    from sagemaker_inference.transformer import Transformer
    from sagemaker_pytorch_serving_container.default_inference_handler import DefaultPytorchInferenceHandler
    
    
    class HandlerService(DefaultHandlerService):
        """Handler service that is executed by the model server.
        Determines the specific default inference handlers to use based on the model being used.
        This class extends ``DefaultHandlerService``, which defines the following:
            - The ``handle`` method is invoked for all incoming inference requests to the model server.
            - The ``initialize`` method is invoked at model server start up.
        Based on: https://github.com/awslabs/multi-model-server/blob/master/docs/custom_service.md
        """
        def __init__(self):
            transformer = Transformer(default_inference_handler=DefaultPytorchInferenceHandler())
            super(HandlerService, self).__init__(transformer=transformer)
  3. Implement a serving entrypoint, which starts the model server. (Here is an example of a serving entrypoint.)

    from sagemaker_inference import model_server
    
    model_server.start_model_server(handler_service=HANDLER_SERVICE)
  4. Define the location of the entrypoint in your Dockerfile.

    ENTRYPOINT ["python", "/usr/local/bin/entrypoint.py"]

Complete Example

Here is a complete example demonstrating usage of the SageMaker Inference Toolkit in your own container for deployment to a multi-model endpoint.

📜 License

This library is licensed under the Apache 2.0 License. For more details, please take a look at the LICENSE file.

๐Ÿค Contributing

Contributions are welcome! Please read our contributing guidelines if you'd like to open an issue or submit a pull request.

sagemaker-inference-toolkit's People

Contributors

ajaykarpur, akulk314, amaharek, bastianzim, bveeramani, chuyang-deng, davidthomas426, dhanainme, ericangelokim, humanzz, jakob-keller, knakad, laurenyu, maaquib, mvsusp, nskool, obrep, qingzi-lan, rohithkrn, sachanub, waytrue17, yonghyeokrhee


sagemaker-inference-toolkit's Issues

How to overwrite batch transform output in S3

I did not find documentation on overwriting batch transform output.
If I run the same batch transform job multiple times over time, how should I configure the transformer to overwrite the output results (i.e., without changing the output_path)?

Support user-defined batch inference logic

Describe the feature you'd like

Currently, TorchServe's batch inference is handled by looping through the requests and feeding them individually to the user-defined transform function (#108). However, this doesn't take full advantage of the GPU's parallelism and compute power, yielding slower endpoints with low resource utilization.

On the other hand, TorchServe's documentation on batch inference shows an example where the developer handles this logic and feeds the entire input batch to the model.

For my use case, this is highly desirable to increase the throughput of the model.

How would this feature be used? Please describe.

Provide batch_transform_fn-style functions. If a user wants to customize the default batch logic, they can provide batch_input_fn, batch_predict_fn, and batch_output_fn functions that receive the entire batch of requests as input, as sketched below.
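For illustration only, a hypothetical batch_predict_fn might look like the sketch below; the function name and signature follow this proposal, not an existing toolkit API:

    import torch

    def batch_predict_fn(batch_data, model):
        """Run one forward pass over the whole batch instead of one request at a time."""
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        batch = torch.stack(batch_data).to(device)  # batch_data: list of per-request tensors
        model = model.to(device)
        model.eval()
        with torch.no_grad():
            return model(batch)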

Describe alternatives you've considered

I haven't found an alternative to achieve this functionality using the sagemaker-pytorch-inference-toolkit, so I'm writing a custom Dockerfile that just uses torchserve.

Ability to define custom log4j properties

Is your feature request related to a problem? Please describe.

Currently I can't define the log4j properties because the default log4j properties file is hardcoded.

I would like to configure this so that the MMS logs are written.

Describe the solution you'd like
Have the ability to define a log4j.properties file in my package.

I think the cleanest way here is to perhaps add more arguments to the start_model_server() function so that users can define their own properties.

At the very least, I think the default log4j properties should match the default log4j properties in multi-model-server:

https://github.com/awslabs/multi-model-server/blob/master/frontend/server/src/main/resources/log4j.properties

Additional context

A similar issue for config.properties is tracked here.

Support CustomAttributes in script mode (for PyTorch, HuggingFace, etc)

Describe the feature you'd like

Users of {PyTorch, HuggingFace, XYZ...} SageMaker DLCs in script mode should be able to access the SageMaker CustomAttributes request header (and maybe other current/future request context?) via script mode function overrides (like input_fn, transform_fn), just like TensorFlow users already can, as pointed out in aws/sagemaker-pytorch-inference-toolkit#111.

How would this feature be used? Please describe.

CustomAttributes could be useful for a range of purposes as outlined in the service doc. One particular use case I have come across multiple times now is to build endpoints for processing images and video chunks, where we'd like the main request body content type to be the image/video itself, but add some additional metadata (video stream ID, language, feature flags, etc).

Today, AFAIK, users looking to consume this header would need to plan and build quite complex modifications to the DLC serving containers. It's a steep curve for users to go from the nice function override API to having to understand the ecosystem of:

  • Transformers vs Handler Services vs Handlers
  • "Default" vs final implementations
  • Relationship between TorchServe (or SageMaker MMS), the sagemaker-inference-toolkit, and the toolkit for their framework of choice, e.g. sagemaker-huggingface-inference-toolkit

To support consuming the extra context simply via custom inference script overrides (input_fn, etc.), I believe a change to this sagemaker-inference-toolkit library would be needed/best.

Describe alternatives you've considered

  1. Are there any nice code samples out there demonstrating minimal tweaks to the transformer/handler stacks of containers like PyTorch and HF?
    • I'm not aware of any in the gap between script mode vs big serving stack changes... But would love to hear!
  2. Breaking change to handler override function APIs
    • Because there's no dict/namespace-like context object or **kwargs flexibility in the current APIs for input_fn / predict_fn / output_fn / transform_fn, there's nowhere really to put additional data without breaking things
    • The API of these functions could be amended to accept some kind of context object through which CustomAttributes (and any additional requirements in future) could be surfaced.
    • If I understand correctly, Transformer._default_transform_fn defines the signatures expected of these functions in script mode. Transformer.transform seems to dictate what the expected API of a transform_fn override would be, but it doesn't pass through additional context there either.
  3. Non-breaking change via Python inspect.signature
    • If a breaking change to these function override APIs is not possible, perhaps this library's default Transformer could use inspect.signature(...) to check the provided function's APIs on the fly at run time and pass through extra context arguments iff it's supported?
    • This would allow e.g. def input_fn(self, input_data, content_type) and def input_fn(self, input_data, content_type, extra_context) in user code to both work correctly with the library.
    • The change could be made to this base library without requiring default handlers in downstream libraries to be updated straight away (e.g. pytorch-inference-toolkit, huggingface-inference-toolkit, etc).

Even if (3) is the selected option, I'd suggest introducing an extensible object/namespace argument like context, rather than a specific item like custom_attributes, to avoid growing the complexity of the default transformer code and API if further fields need to be added in the future.
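As a rough illustration of option (3), the dispatcher could inspect the user-supplied override before calling it and pass the extra context only when the function declares a parameter for it (hypothetical helper, not the toolkit's actual code; the parameter name `context` is an assumption):

    import inspect

    def call_with_optional_context(fn, *args, context=None):
        """Call ``fn``, passing ``context`` only if it declares a parameter for it."""
        params = inspect.signature(fn).parameters
        accepts_context = "context" in params or any(
            p.kind == inspect.Parameter.VAR_KEYWORD for p in params.values()
        )
        if accepts_context:
            return fn(*args, context=context)
        return fn(*args)

With something like this, both `def input_fn(input_data, content_type)` and `def input_fn(input_data, content_type, context=None)` in user code would keep working, which is the non-breaking behaviour the option describes.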

Additional context

As far as I can tell, this library's default Transformer (and therefore its user function override logic) seems to be directly inherited by:

So I think that adding CustomAttributes support in this toolkit would bubble through to at least these frameworks once their DLCs are rebuilt against the updated version.

psutil 5.9.6 seems to be throwing ZombieProcess when retrieving the mms process

Describe the bug
We use a custom image for our SageMaker endpoint, and on Friday, Oct 20, 2023, we experienced instability in our endpoint after re-deploying. It seems that the latest version of psutil, 5.9.6, throws ZombieProcess more frequently, causing the server to restart. This causes the endpoint to occasionally return non-200 responses when predictions are requested.

The change in psutil may be this fix on their end to what they recognize as a ZombieProcess:
giampaolo/psutil#2288

We were able to resolve our issue by rolling back to psutil 5.9.5. So, I'm unsure if sagemaker-inference should pin the version of psutil in your package or if the fix needs to be done here:

https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L276
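For illustration only, a tolerant process lookup could skip processes that disappear or turn into zombies while being inspected (this is a sketch, not the library's actual code; the MMS_NAMESPACE value is assumed from the traceback below):

    import psutil

    MMS_NAMESPACE = "com.amazonaws.ml.mms.ModelServer"  # assumed value for illustration

    def _retrieve_mms_server_process():
        """Return MMS server processes, ignoring processes that cannot be inspected."""
        matches = []
        for process in psutil.process_iter():
            try:
                if MMS_NAMESPACE in process.cmdline():
                    matches.append(process)
            except (psutil.ZombieProcess, psutil.NoSuchProcess, psutil.AccessDenied):
                continue  # the process exited or became a zombie mid-iteration; skip it
        return matches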

To reproduce
Create a custom sagemaker endpoint image with psutil 5.9.6 and deploy it.

Expected behavior
The model endpoint is stable, consistently returns successful predictions, and the ZombieProcess exception is not raised frequently.

Screenshots or logs
Here is a traceback we are seeing:

  File "/usr/local/lib/python3.8/site-packages/sagemaker_inference/model_server.py", line 99, in start_model_server
    mms_process = _retry_retrieve_mms_server_process(env.startup_timeout)
  File "/usr/local/lib/python3.8/site-packages/sagemaker_inference/model_server.py", line 199, in _retry_retrieve_mms_server_process
    return retrieve_mms_server_process()
  File "/usr/local/lib/python3.8/site-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/local/lib/python3.8/site-packages/retrying.py", line 212, in call
    raise attempt.get()
  File "/usr/local/lib/python3.8/site-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/local/lib/python3.8/site-packages/six.py", line 719, in reraise
    raise value
  File "/usr/local/lib/python3.8/site-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/usr/local/lib/python3.8/site-packages/sagemaker_inference/model_server.py", line 206, in _retrieve_mms_server_process
    if MMS_NAMESPACE in process.cmdline():
  File "/usr/local/lib64/python3.8/site-packages/psutil/__init__.py", line 702, in cmdline
    return self._proc.cmdline()
  File "/usr/local/lib64/python3.8/site-packages/psutil/_pslinux.py", line 1650, in wrapper
    return fun(self, *args, **kwargs)
  File "/usr/local/lib64/python3.8/site-packages/psutil/_pslinux.py", line 1788, in cmdline
    self._raise_if_zombie()
  File "/usr/local/lib64/python3.8/site-packages/psutil/_pslinux.py", line 1693, in _raise_if_zombie
    raise ZombieProcess(self.pid, self._name, self._ppid)

System information

  • sagemaker inference version 1.5.11
  • custom docker image based on amazon linux 2
    • framework name: scikit-learn
    • framework version: 1.0.2
    • Python version: 3.8
    • processing unit type: cpu

Additional context
n/a

Launch MMS without repackaging model contents

Describe the feature you'd like
Inference toolkit, when starting up MMS, will repackage the model contents by copying the contents from /opt/ml/model to /.sagemaker/mms/models: https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L76.

This is unnecessary; MMS can simply read the model contents from /opt/ml/model. This would save startup time by removing the need to copy files from one location to another, and would also help the container restart successfully, since it would not run into the issue where /.sagemaker/mms/models/model is already present (say, if the container crashed and was restarted on the same host).

How would this feature be used? Please describe.
See above

Describe alternatives you've considered
N/A

Additional context
N/A

Enhance UX for inference

The current SageMaker module wrapping process makes debugging very hard for both training and deployment. For inference, a decorator is the simplest kind of solution. For example, instead of requiring users to provide model_fn (which currently takes only one argument, model_dir, so a model that needs more arguments to initialize becomes frustrating to set up), we could have a decorator like

@sagemaker.model_fn
def foo(*args, **kwargs): # one arg should be named model_dir for example
    ...

(Same goes for transform_fn, input_fn, output_fn.) Then the decorator, when applied, looks for model_dir and everything else follows.
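For illustration, a registry-based decorator along the lines of this proposal might look like the following (hypothetical API, not something the toolkit provides today):

    # Hypothetical registry that the serving stack could consult instead of looking up
    # a fixed function name in the user script.
    _HANDLER_REGISTRY = {}

    def model_fn(fn):
        """Register ``fn`` as the model-loading hook, regardless of its signature."""
        _HANDLER_REGISTRY["model_fn"] = fn
        return fn

    @model_fn
    def load_model(model_dir, num_classes=10, checkpoint_name="best.pt"):
        # Extra arguments become possible because the registry (not a fixed call
        # signature) decides how the hook is invoked.
        ...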

Document default_pre_model_fn and default_model_warmup_fn

What did you find confusing? Please describe.
I was trying to investigate how to load the model when the inference endpoint starts, so that there wouldn't be any delay on the first request while the model is loaded. I found support for this by reading the code: https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/transformer.py#L200. Nothing about this functionality is mentioned in this repo's README.

Describe how documentation can be improved
Document default_pre_model_fn and default_model_warmup_fn in the README.

Thanks!

Cloudwatch does not log stack trace when error occurs during invocation.

Describe the bug
When a code-related error comes up during model invocation, the stack trace of the error is not logged to CloudWatch.
The exception from SageMaker points to a CloudWatch URL, but the logs at that URL do not show the trace of the error.
My current workaround is to manually catch and print the stack trace:

import os
import sys
import traceback

try:
    from a.b import c
except Exception as error:
    print(error)
    print("First attempt failed. Trying second.")
    exc_type, exc_obj, exc_tb = sys.exc_info()
    fname = os.path.split(exc_tb.tb_frame.f_code.co_filename)[1]
    print(exc_type, fname, exc_tb.tb_lineno)
    print(traceback.format_exc())

    from A.a.b import c

To reproduce

  1. Deploy a model, making sure the entry_point has some Python error (like a bad import).
  2. Invoke the deployed endpoint.

Expected behavior
The stack trace should be logged in cloudwatch.

Screenshots or logs
When you go to the attached URL, the stack trace is not found there.

System information
A description of your system. Please provide:

  • SageMaker Python SDK version:1.56.2
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): Pytorch
  • Framework version: 1.5.0
  • Python version: 3.7
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): N

Support for parquet encoder and decoder

Describe the feature you'd like
Support for a Parquet MIME type in the SageMaker inference toolkit. E.g., in the README of this repo, there is an example default_input_fn():

    def default_input_fn(self, input_data, content_type, context=None):
        """A default input_fn that can handle JSON, CSV and NPZ formats.

        Args:
            input_data: the request payload serialized in the content_type format
            content_type: the request content_type
            context (obj): the request context (default: None).

        Returns: input_data deserialized into torch.FloatTensor or torch.cuda.FloatTensor depending if cuda is available.
        """
        return decoder.decode(input_data, content_type)

Looking into decoder.decode, I see the following MIME types are supported:

_decoder_map = {
    content_types.NPY: _npy_to_numpy,
    content_types.CSV: _csv_to_numpy,
    content_types.JSON: _json_to_numpy,
    content_types.NPZ: _npz_to_sparse,
}

It should not be too hard to add Parquet here. Parquet is a data format commonly used with large datasets and is also supported by other SageMaker services, for example Autopilot.
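For illustration, a Parquet decoder could look something like this (a sketch only; it assumes pyarrow and pandas are available and uses a hypothetical content_types.PARQUET constant):

    import io

    import pyarrow.parquet as pq

    def _parquet_to_numpy(string_like, dtype=None):
        """Deserialize a Parquet payload (bytes) into a numpy array."""
        table = pq.read_table(io.BytesIO(string_like))
        return table.to_pandas().to_numpy(dtype=dtype)

    # Hypothetical registration alongside the existing decoders:
    # _decoder_map[content_types.PARQUET] = _parquet_to_numpy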

How would this feature be used? Please describe.
Reduce storage and data I/O costs, and increase processing speed.

Describe alternatives you've considered

CSV is the standard, but it's a much less efficient way to store, read and write column-oriented data.

Additional context

Error in batch transform with custom image

Describe the problem

I needed to add the GluonCV library to my code environment, and since the default MXNet container does not include the Python package, I needed to create a custom image with it installed.

I got the default MXNet container from here: https://github.com/aws/sagemaker-mxnet-serving-container and followed all the instructions. To include GluonCV, I then simply added this to the Dockerfile and built the image:

RUN ${PIP} install --no-cache-dir mxnet-mkl==$MX_VERSION \
                                  mxnet-model-server==$MMS_VERSION \
                                  keras-mxnet==2.2.4.1 \
                                  numpy==1.14.5 \
                                  gluoncv \
                                  onnx==1.4.1 \
                                  ...

I built the image, then uploaded it to AWS ECR.

I am able to verify that the docker image has been successfully uploaded and I have a valid URI like so:
552xxxxxxx.dkr.ecr.us-west-2.amazonaws.com/preprod-mxnet-serving:1.4.1-cpu-py3

Then, when instantiating the MXNet model, I added a reference to this image URI like so:

sagemaker_model = MXNetModel(model_data = 's3://' + sagemaker_session.default_bucket() + '/model/yolo_object_person_detector.tar.gz',
                            role = role, 
                             entry_point = 'entry_point.py',
                             image = '552xxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/preprod-mxnet-serving:1.4.1-cpu-py3',
                             py_version='py3',
                             framework_version='1.4.1',
                            sagemaker_session = sagemaker_session)

But I got an error message. Here is the full log:

Traceback (most recent call last):
File "/usr/local/bin/dockerd-entrypoint.py", line 21, in <module>
serving.main()
File "/usr/local/lib/python3.6/site-packages/sagemaker_mxnet_serving_container/serving.py", line 54, in main
_start_model_server()
File "/usr/local/lib/python3.6/site-packages/retrying.py", line 49, in wrapped_f
return Retrying(*dargs, **dkw).call(f, *args, **kw)
File "/usr/local/lib/python3.6/site-packages/retrying.py", line 206, in call
return attempt.get(self._wrap_exception)
File "/usr/local/lib/python3.6/site-packages/retrying.py", line 247, in get
six.reraise(self.value[0], self.value[1], self.value[2])
File "/usr/local/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/usr/local/lib/python3.6/site-packages/retrying.py", line 200, in call
attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
File "/usr/local/lib/python3.6/site-packages/sagemaker_mxnet_serving_container/serving.py", line 49, in _start_model_server
model_server.start_model_server(handler_service=HANDLER_SERVICE)
File "/usr/local/lib/python3.6/site-packages/sagemaker_inference/model_server.py", line 63, in start_model_server
'/dev/null'])
File "/usr/local/lib/python3.6/subprocess.py", line 287, in call
with Popen(*popenargs, **kwargs) as p:
File "/usr/local/lib/python3.6/subprocess.py", line 729, in __init__
restore_signals, start_new_session)
File "/usr/local/lib/python3.6/subprocess.py", line 1364, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)

OSError: [Errno 14] Bad address: 'tail'

Be able to change SageMaker endpoint log level

Describe the feature you'd like
Be able to change the SageMaker endpoint cloudwatch log level.

As in the AWS support case 7309023801, the pre-built AWS DL container + SageMaker endpoint currently has no option to change the CloudWatch log level, so it creates INFO logs for every health check access. This makes it difficult to see the relevant error logs.

How would this feature be used? Please describe.
Be able to only see the error logs.

Describe alternatives you've considered
As in case 7309023801, create a BYO container, but that is overkill just to change the log level.

Additional context

The CloudWatch log is cluttered with INFO entries from /ping health checks.

2020-09-21 11:24:10,359 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:32868 "GET /ping HTTP/1.1" 200 0
2020-09-21 11:24:15,359 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:32868 "GET /ping HTTP/1.1" 200 0
2020-09-21 11:24:20,359 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:32868 "GET /ping HTTP/1.1" 200 0
2020-09-21 11:24:25,359 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:32868 "GET /ping HTTP/1.1" 200 0
2020-09-21 11:24:30,359 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:32868 "GET /ping HTTP/1.1" 200 0
2020-09-21 11:24:35,359 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:32868 "GET /ping HTTP/1.1" 200 0
2020-09-21 11:24:40,359 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:32868 "GET /ping HTTP/1.1" 200 0
2020-09-21 11:24:45,359 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:32868 "GET /ping HTTP/1.1" 200 0
2020-09-21 11:24:50,359 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:32868 "GET /ping HTTP/1.1" 200 0
2020-09-21 11:24:55,359 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:32868 "GET /ping HTTP/1.1" 200 0
2020-09-21 11:25:00,359 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:32868 "GET /ping HTTP/1.1" 200 0
2020-09-21 11:25:05,359 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:32868 "GET /ping HTTP/1.1" 200 0
2020-09-21 11:25:10,359 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:32868 "GET /ping HTTP/1.1" 200 0
2020-09-21 11:25:15,359 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:32868 "GET /ping HTTP/1.1" 200 0
2020-09-21 11:25:20,359 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:32868 "GET /ping HTTP/1.1" 200 0
2020-09-21 11:25:25,359 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:32868 "GET /ping HTTP/1.1" 200 0
2020-09-21 11:25:30,359 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:32868 "GET /ping HTTP/1.1" 200 0
2020-09-21 11:25:35,359 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:32868 "GET /ping HTTP/1.1" 200 0
2020-09-21 11:25:40,359 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:32868 "GET /ping HTTP/1.1" 200 0
2020-09-21 11:25:45,359 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:32868 "GET /ping HTTP/1.1" 200 0
2020-09-21 11:25:50,359 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:32868 "GET /ping HTTP/1.1" 200 0
2020-09-21 11:25:55,359 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:32868 "GET /ping HTTP/1.1" 200 0
2020-09-21 11:26:00,359 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:32868 "GET /ping HTTP/1.1" 200 0
2020-09-21 11:26:05,359 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:32868 "GET /ping HTTP/1.1" 200 0
2020-09-21 11:26:10,359 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:32868 "GET /ping HTTP/1.1" 200 0
2020-09-21 11:26:15,359 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:32868 "GET /ping HTTP/1.1" 200 0
2020-09-21 11:26:20,359 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:32868 "GET /ping HTTP/1.1" 200 0
2020-09-21 11:26:25,359 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:32868 "GET /ping HTTP/1.1" 200 0

Enhancing Multi-Model Server Logging Support - JSON Format Integration

Description:

In an effort to improve the flexibility and integration capacity of our Multi-Model Server (MMS) with third-party log management solutions, such as Datadog, I propose that we extend the existing logging functionality to support JSON log format.

Currently, the MMS properties can be configured, but the logging format is restricted and does not permit users to alter the log4j2.xml file to implement JSON logging. The existing log4j2.xml configuration is as follows:

<Console name="STDOUT" target="SYSTEM_OUT">
    <PatternLayout pattern="%d{ISO8601} [%-5p] %t %c - %m%n"/>
</Console>

To enable JSON logging (for instance, in a CloudWatch environment), the configuration would need modification to something like:

<Console name="STDOUT" target="SYSTEM_OUT">
    <JSONLayout compact="true" eventEol="true" properties="true" stacktraceAsString="true" includeTimeMillis="true"/>
</Console>

This modification, however, would necessitate the installation of additional dependencies for the multi-model-server:

https://repo1.maven.org/maven2/com/fasterxml/jackson/core/jackson-databind/2.12.3/jackson-databind-2.12.3.jar
https://repo1.maven.org/maven2/com/fasterxml/jackson/core/jackson-annotations/2.12.3/jackson-annotations-2.12.3.jar
https://repo1.maven.org/maven2/com/fasterxml/jackson/core/jackson-core/2.12.3/jackson-core-2.12.3.jar

Proposed Solution

The optimal solution would be to incorporate the XML configuration into an external configuration file and introduce a user-configurable property that allows users to toggle JSON logging on or off as needed. This can be added to the config.properties file:

json_logging=true/false

This adjustment would drastically improve the server's versatility and adaptability to various logging requirements and environments.

Alternative Solution

An additional option to consider would be the introduction of an environment variable that points to the log4j2.xml file's location. This variable could be used at the entry point when initiating the model server start process.

This adjustment could be made here:

https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L43

Support CodeArtifact repositories for installing Python packages

Describe the feature you'd like

We'd like the ability to install internal Python packages via CodeArtifact instead of just PyPI.

How would this feature be used? Please describe.

To install internal Python packages that cannot be published publicly to PyPI in SageMaker serving instances. Adding support for CodeArtifact would integrate it better with other AWS services.

CodeArtifact provides a 12-hour token, so if we create credentials and pass them in during model package creation, they would likely expire before the endpoint is refreshed in the future, or before a new batch transform job is run more than 12 hours after model package creation.

(This applies more to inference jobs like endpoints and batch transforms because dependencies get installed at run-time, not build time.)

This is not as much of a concern for SageMaker Training Jobs since we can pass credentials and jobs start up almost immediately (probably an issue with spot instance jobs which have a >12 hour wait time, though). But our use case is specifically for inference related services.

A solution could be to add an AWS CodeArtifact login step, with something like the code below, before _install_requirements:

from dataclasses import dataclass
import logging
import subprocess

logger = logging.getLogger(__name__)


@dataclass
class CodeArtifactConfig:
    domain: str
    account: int
    repository: str


def _code_artifact_login(code_artifact_config: CodeArtifactConfig):
    logger.info("logging into CodeArtifact...")
    code_artifact_login_cmd = [
        "aws",
        "codeartifact",
        "login",
        "--tool",
        "pip",
        "--domain",
        code_artifact_config.domain,
        "--domain-owner",
        str(code_artifact_config.account),  # subprocess arguments must be strings
        "--repository",
        code_artifact_config.repository,
    ]

    try:
        subprocess.check_call(code_artifact_login_cmd)
    except subprocess.CalledProcessError:
        logger.error("failed to login to CodeArtifact, exiting")
        raise ValueError("failed to login to CodeArtifact, exiting")

And add before line 79:

    if code_artifact_config:
        _code_artifact_login(code_artifact_config)

Describe alternatives you've considered

Currently, private packages can either be served via an external service like Artifactory / Gemfury (by adding --extra-index-url <URL> to requirements.txt), or by relative imports and dependency injection during packaging.

Another alternative we've considered is forking the repo, adding the above-mentioned changes in a private fork, and using that for our SageMaker model deploys.

Additional context

I'm happy to take a stab at implementing this if there's interest.

Triton container documentation implies py38

The documentation for the Triton image (https://github.com/aws/deep-learning-containers/blob/master/available_images.md#nvidia-triton-inference-containers-sm-support-only) mentions py38 as the Python version option. However, the example URL contains py3.

What did you find confusing? Please describe.
Different versions of Triton support different Python versions depending on which version of Ubuntu the container is based on (i.e. up to 23.02 ships with py38 as the default, later versions ship with py310)

Describe how documentation can be improved
Explicitly/correctly document which system python is shipped with the image.

Additional context
Unlike most SageMaker containers, where additional packages are installed via requirements.txt, Triton requires custom packages to be installed into a conda env and then compressed using conda-pack. This requires explicit knowledge of which system Python the Triton image contains, and since 23.02 this is no longer py38.

Local development WorkerLifeCycle skip

Is your feature request related to a problem? Please describe.
Whenever I test my API locally I have to wait for about 5 minutes for the WorkerLifeCycle to finish. That slows down development immensely.

Describe the solution you'd like
I would like a config of sorts to initialize the server faster (maybe a dummy worker) so I can test my inference function without waiting for 5 minutes.

Describe alternatives you've considered
Right now I have made a separate Python script to simulate the call to my model (with the payload), but it is far from ideal.

Include "Requires-Python" in source distributions

Describe the bug

The SageMaker inference toolkit (sagemaker_inference) is only tested against Python 2.7, 3.6, and 3.7, yet installers like pip, poetry, etc. will gladly attempt to install the toolkit on any other Python version (potentially up to 3.10, depending on transitive dependencies and their metadata specs), since the package lacks the Requires-Python metadata entry.

Inspecting the sdist archives of the toolkit indicates that a version of setuptools is used which supports Metadata Version 2.1. Support for Metadata Version 2.1 was added to setuptools in 38.6.0; support for Requires-Python was, however, already added to setuptools in 24.2.0.

In turn, it should be an easy change to add a python_requires parameter to the setup() call inside setup.py.
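For example, a specifier along these lines (illustrative only; it mirrors the Python versions the toolkit is currently tested against):

    from setuptools import setup

    setup(
        name="sagemaker_inference",
        # Only advertise the versions the toolkit is actually tested on.
        python_requires=">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*, !=3.5.*, <3.8",
        # ... remaining arguments unchanged ...
    )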

To reproduce

Just build the following Dockerfile:

FROM python:3.8.13-slim

RUN python -m pip install --upgrade pip==22.1
RUN python -m pip install sagemaker_inference

It will install just fine without any issues, even though sagemaker_inference has never been tested against 3.8.

Expected behavior

Installers like pip, poetry, etc. should refuse to install the sagemaker_inference package on a Python version the toolkit has never been tested against.

Screenshots or logs

docker build logs (docker build -t foo .):

Sending build context to Docker daemon  46.08kB
Step 1/3 : FROM python:3.8.13-slim
 ---> 09d1a78893d5
Step 2/3 : RUN python -m pip install --upgrade pip==22.1
 ---> Running in 47ec102cfcfb
Collecting pip==22.1
  Downloading pip-22.1-py3-none-any.whl (2.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 4.4 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.0.4
    Uninstalling pip-22.0.4:
      Successfully uninstalled pip-22.0.4
Successfully installed pip-22.1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Removing intermediate container 47ec102cfcfb
 ---> f582073bb4b0
Step 3/3 : RUN python -m pip install sagemaker_inference
 ---> Running in f0f9f41b53a4
Collecting sagemaker_inference
  Downloading sagemaker_inference-1.6.1.tar.gz (21 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting numpy
  Downloading numpy-1.22.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16.8/16.8 MB 5.7 MB/s eta 0:00:00
Collecting six
  Downloading six-1.16.0-py2.py3-none-any.whl (11 kB)
Collecting psutil
  Downloading psutil-5.9.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (283 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 283.8/283.8 kB 4.6 MB/s eta 0:00:00
Collecting retrying==1.3.3
  Downloading retrying-1.3.3.tar.gz (10 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting scipy
  Downloading scipy-1.8.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (41.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 41.6/41.6 MB 7.5 MB/s eta 0:00:00
Building wheels for collected packages: sagemaker_inference, retrying
  Building wheel for sagemaker_inference (setup.py): started
  Building wheel for sagemaker_inference (setup.py): finished with status 'done'
  Created wheel for sagemaker_inference: filename=sagemaker_inference-1.6.1-py2.py3-none-any.whl size=27836 sha256=0617626f9f56cba60d256217b6a83421bb221c19646f5b3a540cb23816be4c03
  Stored in directory: /root/.cache/pip/wheels/47/27/bf/37ef3641057c337d5d7116f8b8ce87be599608a70c571115c8
  Building wheel for retrying (setup.py): started
  Building wheel for retrying (setup.py): finished with status 'done'
  Created wheel for retrying: filename=retrying-1.3.3-py3-none-any.whl size=11447 sha256=38dbee0dcb07c7f2a723b1659fe06481fc6b97b63378889acdd141eeffebd6fa
  Stored in directory: /root/.cache/pip/wheels/c4/a7/48/0a434133f6d56e878ca511c0e6c38326907c0792f67b476e56
Successfully built sagemaker_inference retrying
Installing collected packages: six, psutil, numpy, scipy, retrying, sagemaker_inference
Successfully installed numpy-1.22.3 psutil-5.9.0 retrying-1.3.3 sagemaker_inference-1.6.1 scipy-1.8.0 six-1.16.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Removing intermediate container f0f9f41b53a4
 ---> 6e235e79ade8
Successfully built 6e235e79ade8
Successfully tagged foo:latest

System information

None

Additional context

None

Model Server fails to start with multi-model-server version 1.1.5

Describe the bug
With the latest update of the multi-model-server component, 1.1.5, for the log4j security issue, the server no longer starts. From the error, the problem is clearly related to log4j, probably the configuration file.

Multi-model-server 1.1.4 works with log4j v1.x, and 1.1.5 now works with log4j v2.16.

To reproduce
Create a virtualenv, install multi-model-server 1.1.4, install sagemaker-inference-toolkit, and start a server; it works correctly.
Update multi-model-server to 1.1.5 and start the server; it returns the error indicated in the logs section.

Expected behavior
Server starts with no problem

Screenshots or logs

java.lang.NoClassDefFoundError: com/lmax/disruptor/EventTranslatorVararg
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at org.apache.logging.log4j.core.async.AsyncLoggerContextSelector.createContext(AsyncLoggerContextSelector.java:46)
at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.locateContext(ClassLoaderContextSelector.java:218)
at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.getContext(ClassLoaderContextSelector.java:136)
at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.getContext(ClassLoaderContextSelector.java:123)
at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.getContext(ClassLoaderContextSelector.java:117)
at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:150)
at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:47)
at org.apache.logging.log4j.LogManager.getContext(LogManager.java:196)
at org.apache.logging.log4j.spi.AbstractLoggerAdapter.getContext(AbstractLoggerAdapter.java:137)
at org.apache.logging.slf4j.Log4jLoggerFactory.getContext(Log4jLoggerFactory.java:55)
at org.apache.logging.log4j.spi.AbstractLoggerAdapter.getLogger(AbstractLoggerAdapter.java:47)
at org.apache.logging.slf4j.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:33)
at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:363)
at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:388)
at com.amazonaws.ml.mms.servingsdk.impl.PluginsManager.(PluginsManager.java:30)
at com.amazonaws.ml.mms.servingsdk.impl.PluginsManager.(PluginsManager.java:29)
at com.amazonaws.ml.mms.ModelServer.main(ModelServer.java:84)
Caused by: java.lang.ClassNotFoundException: com.lmax.disruptor.EventTranslatorVararg
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 29 more

System information
A description of your system.

  • SageMaker Inference Toolkit 1.5.7
  • Custom Docker image:
    • framework name Tensorflow
    • framework version 2.2
    • Python version 3.6
    • processing unit type CPU

In a multi-model scenario pass model name as argument to input_fn() and output_fn()

Describe the feature you'd like
In cases where the input being sent for inference is different for each of the hosted models, input_fn() would need to branch based on the model that was invoked. Currently, input_fn() is only passed the input data, so branching based on the model name in input_fn() is not possible. The same applies to output_fn().

How would this feature be used? Please describe.
This feature would be used in multi-model endpoints where different kinds of models work on different kinds of inputs. These inputs would need some preprocessing based on the model that was invoked.

Describe alternatives you've considered
The only workaround is to pass the model name as a feature in the input data.

Additional context
None.

End-to-End example for *NON* multi model deployment

What did you find confusing? Please describe.
Based on the documentation and the complete example multi_model_bring_your_own, it seems that sagemaker-inference-toolkit is only for multi-model requirements. But I have also seen links to sagemaker_pytorch_serving_container, which suggests that is not the case.

There is no clear instruction in the documentation or an end-to-end example link indicating that it can be used for single model hosting scenarios as well.

Describe how documentation can be improved
Provide one more end-to-end example for single-model hosting, along with some points in favor of using this Python package instead of designing your own Docker containers from scratch.

Additional context

Deploying multiple model artifacts, each having their own inference handler

What did you find confusing? Please describe.
I am trying to deploy multiple tarball model artifacts to a SageMaker multi-model endpoint, but would like to use a different inference handler for each model, since each model needs different pre-processing and post-processing.

Describe how documentation can be improved
I see the documentation is fairly clear on how to specify a custom inference handler, but not clear on whether differing custom handlers can be specified for each model.

Additional context
I discovered that a custom handler can be provided to the MMS model archiver here, but it's not clear if this allows different handlers for each model.

I love the inference toolkit, and would sincerely appreciate a response regarding whether it is possible to define differing inference handlers per model, and how to do so.

Allow model_fn to get more arguments

Right now, model_fn only accepts model_dir, which is very limiting when a model needs some arguments to be instantiated before it can load the checkpoint.

Server Timeout Unit is Minutes in MMS, but docstring says Seconds

Describe the bug
The model server timeout ("used for model server's backend workers before they are deemed unresponsive and rebooted"), currently set via the SAGEMAKER_MODEL_SERVER_TIMEOUT env var, is listed in seconds in the property method's docstring...

def model_server_timeout(self):  # type: () -> int
    """int: Timeout, in **seconds**, used for model server's backend workers before
    they are deemed unresponsive and rebooted.
    """
    return self._model_server_timeout

...but the actual unit used downstream in the multi-model-server worker manager is minutes, not seconds.

// TODO: Change this to configurable param
ModelWorkerResponse reply = replies.poll(responseTimeout, TimeUnit.MINUTES);

Because of this, the default timeout of 20 in the inference toolkit is actually a 20-minute timeout, not a 20-second timeout.

It seems odd that the unit is minutes, and because this is parsed as an int in the inference toolkit's argparse it only gives a resolution of whole minutes (instead of, say, .33 minutes for a 20s-equivalent timeout), so should I report this downstream in multi-model-server? If you don't want to change it, we should at least fix the docstring in the inference toolkit.

Issue attaching eia device for huggingface transformers roBERTa model

Describe the bug
Invoking the torcheia.jit.attach_eia() method on the Hugging Face transformers roBERTa model results in the following error: RuntimeError: class '__torch__.torch.nn.modules.normalization.___torch_mangle_6.LayerNorm' already defined.

To reproduce

import torch, torcheia
from transformers import RobertaForSequenceClassification

# Load pretrained roBERTa base model 
roberta_model = RobertaForSequenceClassification.from_pretrained("roberta-base", torchscript=True)

# Manufacture some input data
attention_mask = torch.Tensor([[1,1,1,1,1,1], [1,1,1,1,0,0]] * 4).long()
input_ids = torch.Tensor([[0,9226,16,10,1296,2], [0,463,277,2,1,1]] * 4).long()

# Validate that the model can run with input data
roberta_model.eval()
roberta_model(input_ids, attention_mask)

# Trace model
traced_model = torch.jit.trace(roberta_model, [attention_mask, input_ids])

# Validate that the traced model can run with input data
traced_model.eval()
traced_model(input_ids, attention_mask)


torch._C._jit_set_profiling_executor(False)
eia_model = torcheia.jit.attach_eia(traced_model, 0)

Expected behavior
I expect the attach_eia method to work correctly with this model.

System information

  • SageMaker JupyterLab notebook with conda_amazonei_pytorch_latest_p36 environment

Additional context
There's a similar error message associated with this issue in the pytorch project: pytorch/pytorch#29170
It looks like the related solution was merged in before version 1.5.1.

SageMaker inference should be able to run as non-root user.

Describe the bug

When running as a non-root user within a container, sagemaker-inference fails to start the multi-model-server. This works when all packages are installed as root, and the entrypoint script is run as root. The entrypoint script starts the model server using:

sagemaker_inference.model_server.start_model_server(......)

To reproduce

  1. Install the libraries as in the Dockerfile snippet:
RUN ["useradd", "-ms", "/bin/bash", "-d", "/home/<user>", "<user>" ]

ENV CUSTOM_INFERENCE_DIR=/home/<user>/custom_inference

RUN mkdir -p ${CUSTOM_INFERENCE_DIR}

COPY code/* ${CUSTOM_INFERENCE_DIR}/

RUN chown -R <user>:root ${CUSTOM_INFERENCE_DIR}

RUN chmod -R +rwx ${CUSTOM_INFERENCE_DIR}

USER <user>

RUN pip install mxnet-model-server multi-model-server sagemaker-inference

RUN pip install retrying

NOTE: Running a CLI

Expected behavior

SageMaker MMS should start without any issues.

Screenshots or logs

File "/home/<user>/.local/lib/python3.6/site-packages/retrying.py", line 49, in wrapped_f

    return Retrying(*dargs, **dkw).call(f, *args, **kw)

  File "/home/<user>/.local/lib/python3.6/site-packages/retrying.py", line 206, in call

    return attempt.get(self._wrap_exception)

  File "/home/<user>/.local/lib/python3.6/site-packages/retrying.py", line 247, in get

    six.reraise(self.value[0], self.value[1], self.value[2])

  File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise

    raise value

  File "/home/<user>/.local/lib/python3.6/site-packages/retrying.py", line 200, in call

    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)

  File "/home/<user>/custom_inference/entrypoint.py", line 21, in _start_mms

    model_server.start_model_server(handler_service=HANDLER_SERVICE)

  File "/home/<user>/.local/lib/python3.6/site-packages/sagemaker_inference/model_server.py", line 77, in start_model_server

    _create_model_server_config_file()

  File "/home/<user>/.local/lib/python3.6/site-packages/sagemaker_inference/model_server.py", line 143, in _create_model_server_config_file

    utils.write_file(MMS_CONFIG_FILE, configuration_properties)

  File "/home/<user>/.local/lib/python3.6/site-packages/sagemaker_inference/utils.py", line 47, in write_file

    with open(path, mode) as f:

PermissionError: [Errno 13] Permission denied: '/etc/sagemaker-mms.properties'

Checking on my development machine as well, it doesn't seem like a non-root user has access to /etc.

Can this library be updated so that it can run as a non-root user?

System information

sagemaker-inference==1.5.2

  • Custom Docker image, ubuntu based.

    • framework name: tensorflow

    • framework version: 2.3.0

    • Python version: 3.6

    • processing unit type: CPU

Additional context

I worked around this initial problem by granting write access to the /etc folder, but it would be ideal if the configuration were stored in a user-writable directory.
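For reference, the workaround boils down to something like the following Dockerfile lines (a sketch; the config file path is taken from the traceback above):

    # Pre-create the MMS config file and make it writable by the non-root user,
    # so sagemaker-inference can write /etc/sagemaker-mms.properties at startup.
    RUN touch /etc/sagemaker-mms.properties \
        && chown <user>:root /etc/sagemaker-mms.properties \
        && chmod u+rw /etc/sagemaker-mms.properties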

Controlling multiple copies of same model within Sagemaker multi-model-server

As per the MXNet inference doc, the main dispatcher thread is single-threaded. https://cwiki.apache.org/confluence/display/MXNET/Parallel+Inference+in+MXNet

And the SageMaker inference toolkit calls/starts the MXNet multi-model-server.
https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L54

When I load a single model into the SageMaker endpoint with multi-model-server, the CloudWatch metric TotalModelCount shows 4 (instead of 1). Similarly, for 2 models, the count increases to 8. Could you please explain the reason?

How does the MXNet model server handle multiple concurrent/parallel requests for a particular model?
When we load a model inside multi-model-server, does it apply forking/multiprocessing to host multiple copies of the same model to improve throughput/latency?
If yes, what is the default value, and is there any config to decide how many copies of the model will be spawned?

Also, what does the worker thread actually perform?
https://github.com/awslabs/multi-model-server/blob/master/mms/model_service_worker.py#L166-L212

Any guide or pointer will be highly appreciated.

OOM errors creating an endpoint for LLMs

Describe the bug

It seems like the model serving endpoints don't utilize the NVMe drives effectively. When I try to serve a 13B parameter LLM (my model.tar.gz is ~42GB on S3) I get errors that the disk is out of space, and the endpoint fails to create.

I think the root of the issue is that the endpoint is trying to put too much stuff into the / disk volume instead of using the NVMe, which is mounted at /tmp.

Screenshots or logs

Here are all my log events for the endpoint startup failure:

ERROR - Failed to save the model-archive to model-path "/.sagemaker/mms/models". Check the file permissions and retry.
Traceback (most recent call last):
  File "/opt/conda/bin/model-archiver", line 8, in <module>
    sys.exit(generate_model_archive())
  File "/opt/conda/lib/python3.8/site-packages/model_archiver/model_packaging.py", line 63, in generate_model_archive
    package_model(args, manifest=manifest)
  File "/opt/conda/lib/python3.8/site-packages/model_archiver/model_packaging.py", line 44, in package_model
    ModelExportUtils.archive(export_file_path, model_name, model_path, files_to_exclude, manifest,
  File "/opt/conda/lib/python3.8/site-packages/model_archiver/model_packaging_utils.py", line 262, in archive
    ModelExportUtils.archive_dir(model_path, mar_path,
  File "/opt/conda/lib/python3.8/site-packages/model_archiver/model_packaging_utils.py", line 308, in archive_dir
    shutil.copy(file_path, dst_dir)
  File "/opt/conda/lib/python3.8/shutil.py", line 418, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/opt/conda/lib/python3.8/shutil.py", line 275, in copyfile
    _fastcopy_sendfile(fsrc, fdst)
  File "/opt/conda/lib/python3.8/shutil.py", line 166, in _fastcopy_sendfile
    raise err from None
  File "/opt/conda/lib/python3.8/shutil.py", line 152, in _fastcopy_sendfile
    sent = os.sendfile(outfd, infd, offset, blocksize)
OSError: [Errno 28] No space left on device: '/opt/ml/model/pytorch_model-00003-of-00005.bin' -> '/.sagemaker/mms/models/model/pytorch_model-00003-of-00005.bin'
Traceback (most recent call last):
  File "/usr/local/bin/dockerd-entrypoint.py", line 23, in <module>
    serving.main()
  File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/serving.py", line 34, in main
    _start_mms()
  File "/opt/conda/lib/python3.8/site-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/conda/lib/python3.8/site-packages/retrying.py", line 212, in call
    raise attempt.get()
  File "/opt/conda/lib/python3.8/site-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/conda/lib/python3.8/site-packages/six.py", line 719, in reraise
    raise value
  File "/opt/conda/lib/python3.8/site-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/serving.py", line 30, in _start_mms
    mms_model_server.start_model_server(handler_service=HANDLER_SERVICE)
  File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/mms_model_server.py", line 85, in start_model_server
    _adapt_to_mms_format(handler_service, model_dir)
  File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/mms_model_server.py", line 138, in _adapt_to_mms_format
    subprocess.check_call(model_archiver_cmd)
  File "/opt/conda/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['model-archiver', '--model-name', 'model', '--handler', 'sagemaker_huggingface_inference_toolkit.handler_service', '--model-path', '/opt/ml/model', '--export-path', '/.sagemaker/mms/models', '--archive-format', 'no-archive', '--f']' returned non-zero exit status 1.

I also injected a df -kh call to see what the disk utilization was and got:

Filesystem      Size  Used Avail Use% Mounted on
overlay          52G   27G   26G  52% /
tmpfs            64M     0   64M   0% /dev
tmpfs            32G     0   32G   0% /sys/fs/cgroup
shm              30G   20K   30G   1% /dev/shm
/dev/nvme1n1    550G  948M  521G   1% /tmp
/dev/nvme0n1p1   52G   27G   26G  52% /etc/hosts
tmpfs            32G   12K   32G   1% /proc/driver/nvidia
devtmpfs         32G     0   32G   0% /dev/nvidia0
tmpfs            32G     0   32G   0% /proc/acpi
tmpfs            32G     0   32G   0% /sys/firmware

So storing things at /.sagemaker/... or at /opt/ml/... is going to fail either way. It needs to be on the NVMe drive at /tmp.

System information
Specifics of my requirements.txt, inference.py, and invocation code are below.

requirements.txt in model.tar.gz/code

accelerate==0.16.0
transformers==4.26.0
bitsandbytes==0.37.0

inference.py in model.tar.gz/code

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

def model_fn(model_dir):
    model =  AutoModelForSeq2SeqLM.from_pretrained(
        model_dir,
        device_map="auto",
        load_in_8bit=True,
        cache_dir="/tmp/model_cache/",
    )
    tokenizer = AutoTokenizer.from_pretrained(
        model_dir,
        cache_dir="/tmp/model_cache/",
    )

    return model, tokenizer


def predict_fn(data, model_and_tokenizer):
    model, tokenizer = model_and_tokenizer
    inputs = data.pop("inputs", data)
    parameters = data.pop("parameters", None)

    input_ids = tokenizer(inputs, return_tensors="pt").input_ids
    if parameters is not None:
        outputs = model.generate(input_ids, **parameters)
    else:
        outputs = model.generate(input_ids)

    prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return [{"generated_text": prediction}]

My invocation code:

import boto3
from sagemaker.huggingface import HuggingFaceModel
import sagemaker

default_bucket = "my_bucket"
boto_session = boto3.Session(profile_name="sagemaker", region_name="us-west-2")
sagemaker_session = sagemaker.Session(boto_session=boto_session, default_bucket=default_bucket)

huggingface_model = HuggingFaceModel(
   model_data=f"s3://{default_bucket}/sagemaker/google/flan-t5-xxl/model.tar.gz",
   role="arn:aws:iam::my_role",
   env={'HF_TASK':'text-generation'},
   sagemaker_session=sagemaker_session,
   transformers_version="4.17",
   pytorch_version="1.10",
   py_version="py38",
)

predictor = huggingface_model.deploy(
   initial_instance_count=1,
   instance_type="ml.g5.4xlarge",
   model_data_download_timeout=300,
   container_startup_health_check_timeout=600,
)

data = {"inputs": "What color is a banana?"}
predictor.predict(data)

Additional details

I've also tried altering the SAGEMAKER_BASE_DIR environment variable to point into /tmp, but that just gives an error about a model-dir directory.

model-server archives the model in /opt/ml/model

Hello,

Is it expected behavior that, in multi-model mode, the container is run with one model in the single model directory /opt/ml/model? I noticed that the model server starts by running the model archiver in this directory. I wasn't aware that this was a requirement for starting MME in a container.

stop_server function

A stop_server() function in model_server.py

It would be helpful to include a simple-to-use function for stopping the server. This would be especially helpful for testing the server locally and tearing it down, and it would make for a simpler, higher-quality experience for developers and users of AWS.

Alternatives: It seems possible to go into the multi-model-server code to stop the server when needed, but this requires some digging and switching libraries. It also seems possible to stop the server by managing the running processes, but this would likewise require some digging into how to do it.

Additional context: After some additional digging, it seems that a call to multi-model-server --stop would do the job, perhaps with some additional modifications. I still think a wrapper for that in this library would make for a better development experience than needing to dive into the source code like that.

Where to find the current source that would handle this: awslabs/multi-model-server/mms/model_server.py, in the start() function (line 40 at the time of writing). There is also the ArgParser class in awslabs/multi-model-server/mms/arg_parser.py, specifically the mms_parser() function (line 34 at the time of writing).
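A minimal sketch of what such a wrapper might look like, assuming the multi-model-server CLI is on PATH (the function name is illustrative and not part of the library today):

import subprocess


def stop_model_server():
    """Stop a locally running Multi Model Server via its own CLI."""
    subprocess.run(["multi-model-server", "--stop"], check=True)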

Add (and test) Support for Python 3.8

Describe the feature you'd like

The SageMaker Inference Toolkit (sagemaker_inference) should support Python 3.8 at a bare minimum. It's currently only tested against 2.7, 3.6, and 3.7.

How would this feature be used? Please describe.

Python 3.7 is about to reach end of life in June 2023. In fact, many popular open-source libraries commonly used alongside this inference toolkit are already shifting to 3.8 and discontinuing support for 3.7. Examples include, but are not limited to:

Describe alternatives you've considered

There are no alternatives. Python 3.7 will reach end of life soon, and the inference toolkit will become impossible to use with up-to-date common AI libraries.

Additional context

Related: #106

Custom `model_fn` function not found when extending the PyTorch inference container

Background

I am trying to do single-model batch transform in SageMaker to get predictions from a pre-trained model (I did not train the model on SageMaker). My end goal is to be able to run just a bit of Python code to start a batch transform job and grab the results from S3 when it's done.

import boto3
client = boto3.client("sagemaker")
client.create_transform_job(...)

# occasionally monitor the job
client.describe_transform_job(...)

# fetch results once job is finished
client = boto3.client("s3")
...

I can successfully get the results I need using Transformer.transform() in a SageMaker notebook instance (see the appendix below for code snippets), but in my project I do not want to depend on the SageMaker Python SDK. Instead, I'd rather use boto3 like in the pseudocode above.

The issue

I referenced this example notebook to try to extend a PyTorch inference container (see the appendix below for the Dockerfile I am using), but I can't get the same results that I get when I use the SageMaker Python SDK in a notebook instance. Instead I get this error:

Backend worker process died.
Traceback (most recent call last):
    File "/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py", line 182, in <module>
        worker.run_server()
    File "/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py", line 154, in run_server
        self.handle_connection(cl_socket)
    File "/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py", line 116, in handle_connection
        service, result, code = self.load_model(msg)
    File "/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py", line 89, in load_model
        service = model_loader.load(model_name, model_dir, handler, gpu, batch_size, envelope)
    File "/opt/conda/lib/python3.6/site-packages/ts/model_loader.py", line 110, in load
        initialize_fn(service.context)
    File "/home/model-server/tmp/models/d00cc5c716dc4e4582250bd89915b99b/handler_service.py", line 51, in initialize
        super().initialize(context)
    File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/default_handler_service.py", line 66, in initialize
        self._service.validate_and_initialize(model_dir=model_dir)
    File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 158, in validate_and_initialize
        self._model = self._model_fn(model_dir)
    File "/opt/conda/lib/python3.6/site-packages/sagemaker_pytorch_serving_container/default_pytorch_inference_handler.py", line 55, in default_model_fn
        NotImplementedError:
            Please provide a model_fn implementation.
            See documentation for model_fn at https://github.com/aws/sagemaker-python-sdk

The problem seems to be that when the inference toolkit tries to import a customized inference.py script, it can't find it, presumably because /opt/ml/model/code is not found in sys.path.

if find_spec(user_module_name) is not None:

If I understand the code correctly, then in the snippet below (which runs before the snippet above), we are attempting to add the code_dir to the Python path via the PYTHONPATH environment variable, but this won't affect the currently running process.

# add model_dir/code to python path
code_dir_path = "{}:".format(model_dir + "/code")
if PYTHON_PATH_ENV in os.environ:
    os.environ[PYTHON_PATH_ENV] = code_dir_path + os.environ[PYTHON_PATH_ENV]
else:
    os.environ[PYTHON_PATH_ENV] = code_dir_path

I wonder if it should be like this instead:

import sys
from sagemaker_inference.environment import code_dir
...
# add model_dir/code to python path 
if code_dir not in sys.path:
    sys.path.append(code_dir)
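In the meantime, a hedged workaround on the user side is a thin custom handler service that puts the code directory on sys.path before the default initialization runs, passed as the handler_service when the model server is started (the class name and path below are illustrative assumptions, not part of the library):

import sys

from sagemaker_inference.default_handler_service import DefaultHandlerService


class HandlerService(DefaultHandlerService):
    def initialize(self, context):
        # Assumed default location of the user code; adjust to your layout.
        code_dir = "/opt/ml/model/code"
        if code_dir not in sys.path:
            sys.path.insert(0, code_dir)
        super().initialize(context)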

Appendix

Notebook cells containing code I was able to run successfully

Here's what I can get running in a SageMaker notebook instance (ml.p2.xlarge). The last cell takes about 5 minutes to run.

from sagemaker import get_execution_role
from sagemaker.pytorch.model import PyTorchModel

# fill out proper values here
path_to_model = "s3://bucket/path/to/model/model.tar.gz"

repo = "GITHUB_REPO_URL_HERE"
branch = "BRANCH_NAME_HERE"
token = "GITHUB_PAT_HERE"

path_to_code_location = "s3://bucket/path/to/code/location"
github_repo_source_dir = "relative/path/to/entry/point"

path_to_output = "s3://bucket/path/to/output"
path_to_input = "s3://bucket/path/to/input"
pytorch_model = PyTorchModel(
    image_uri="763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:1.4-gpu-py36",  # the latest supported version I could get working
    model_data=path_to_model,
    git_config={
        "repo": repo,
        "branch": branch,
        "token": token,
    },
    code_location=path_to_code_location,  # must provide this so that a default bucket isn't created
    source_dir=github_repo_source_dir,
    entry_point="inference.py",
    role=get_execution_role(),
    py_version="py3",
    framework_version="1.4",  # must provide this even though we are supplying `image_uri`
)
transformer = pytorch_model.transformer(
    instance_count=1,
    instance_type="local_gpu",
    strategy="SingleRecord",
    output_path=path_to_output,
    accept="image/png",
)
transformer.transform(
    data=path_to_input,
    data_type="S3Prefix",
    content_type="image/png",
    compression_type=None,
    wait=True,
    logs=True,
)

Dockerfile for extended container

# Tutorial for extending AWS SageMaker PyTorch containers:
# https://github.com/aws/amazon-sagemaker-examples/blob/master/advanced_functionality/pytorch_extending_our_containers/pytorch_extending_our_containers.ipynb
ARG REGION=us-west-2

# SageMaker PyTorch Image
FROM 763104351884.dkr.ecr.$REGION.amazonaws.com/pytorch-inference:1.8.1-gpu-py36-cu111-ubuntu18.04

ARG CODE_DIR=/opt/ml/model/code
ENV PATH="${CODE_DIR}:${PATH}"

# /opt/ml and all subdirectories are utilized by SageMaker, we use the /code subdirectory to store our user code.
COPY /inference ${CODE_DIR}

# Used by the SageMaker PyTorch container to determine our user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY ${CODE_DIR}

# Used by the SageMaker PyTorch container to determine our program entry point.
# For more information: https://github.com/aws/sagemaker-pytorch-container
ENV SAGEMAKER_PROGRAM inference.py

JVM detects the CPU count as 1 when more CPUs are available to the container

Describe the bug
This issue is related to the issue aws/sagemaker-python-sdk#1275

The JVM detects the CPU count as 1 when more CPUs are available to the container.

To reproduce

  1. Clone the SageMaker example.
  2. Deploy the model using the same endpoint.
  3. Check the CloudWatch logs; the number of detected CPU cores will be reported as something like Number of CPUs: 1.

Expected behavior
The CPU count reported in CloudWatch should match the CPU count of the instance used, for example 4 if the instance is an ml.m4.xlarge.

System information
Container: pytorch-inference:1.5-gpu-py3
SageMaker inference v1.1.2

Warning: "Calling MMS with mxnet-model-server. Please move to multi-model-server."

Describe the bug
When deploying a SageMaker endpoint using the example Dockerfile (https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/multi_model_bring_your_own), we now get the warning "Calling MMS with mxnet-model-server. Please move to multi-model-server."

It seems the command here:
https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L83

should be changed to match:
https://github.com/awslabs/multi-model-server/blob/72e09f998874f5701404031441573e0eb6f2f866/mms/model_server.py#L20
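For reference, a hedged sketch of the suggested change in model_server.py (the variable and constant names approximate the toolkit's and may differ between versions):

# Swap only the executable name; the rest of the argument list the toolkit
# already builds (config file, log config, model store, etc.) stays the same.
multi_model_server_cmd = [
    "multi-model-server",  # was "mxnet-model-server", which now prints the deprecation warning
    "--start",
    "--mms-config", MMS_CONFIG_FILE,
]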

To reproduce
See above

Expected behavior
No warning given


Long model loading times in Multi Model Server

Describe the bug
According to the SageMaker multi-model server documentation, the server caches 'frequently' used models in memory (to my understanding, in RAM) in order to reduce response time by avoiding loading the model again and again.
First question: what does 'frequently' mean?

If I query the same model again and again with a delay of 30 s between the invoke_endpoint calls, the server seems to reload the model into memory, leading to long response times of ~3 s instead of the usual ~0.5 s obtained when calling the model at intervals of less than 30 s.

To reproduce

  • Deploy a SageMaker multi-model server endpoint using boto3.
  • Create a SageMaker runtime client using boto3 and execute the following code:
import time

import boto3

rt_client = boto3.client("sagemaker-runtime")

for i in range(20):
    start = time.time()
    response = rt_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='application/x-npy',
        TargetModel='model_store/custom_model_1.tar.gz',  # constantly the same model
        Body=payload,  # byte-encoded numpy array
    )
    end = time.time()
    response_time = end - start
    print(f'Request took {response_time}s.')
    time.sleep(30)

Expected behavior
The first call is slow (about 3 s) and the following 19 calls lie in the expected ~0.5 s range, which is the time it takes to call the endpoint when the model is already loaded.

Once I set the time.sleep() argument lower than 30 s, e.g. to 20 s, the calls are, most of the time, as fast as expected.

Is there any way to influence the timing of the unloading behavior?
To my understanding, I would expect the model to stay in memory as long as the memory is not needed for loading other, more frequently used models. However, this does not seem to be the case, as each call takes the full 3 s.

Screenshots or logs
With time.sleep(30):

	 Call: 0 of 20 with 4 samples took: 2.847299098968506s.
	 Call: 1 of 20 with 4 samples took: 3.017570734024048s.
	 Call: 2 of 20 with 4 samples took: 2.866020917892456s.
	 Call: 3 of 20 with 4 samples took: 2.888610363006592s.
	 Call: 4 of 20 with 4 samples took: 3.0125389099121094s.
	 Call: 5 of 20 with 4 samples took: 2.9569602012634277s.
	 Call: 6 of 20 with 4 samples took: 2.8126561641693115s.
	 Call: 7 of 20 with 4 samples took: 2.912917375564575s.
	 Call: 8 of 20 with 4 samples took: 2.866114854812622s.
	 Call: 9 of 20 with 4 samples took: 2.9781384468078613s.
	 Call: 10 of 20 with 4 samples took: 3.4418649673461914s.
	 Call: 11 of 20 with 4 samples took: 2.79472017288208s.
	 Call: 12 of 20 with 4 samples took: 2.992703437805176s.
	 Call: 13 of 20 with 4 samples took: 2.954014301300049s.
	 Call: 14 of 20 with 4 samples took: 2.9481523036956787s.
	 Call: 15 of 20 with 4 samples took: 2.928661346435547s.
	 Call: 16 of 20 with 4 samples took: 2.8345978260040283s.
	 Call: 17 of 20 with 4 samples took: 2.922405481338501s.
	 Call: 18 of 20 with 4 samples took: 2.982257843017578s.
	 Call: 19 of 20 with 4 samples took: 2.8227620124816895s.

With time.sleep(20):

	 Call: 0 of 20 with 4 samples took: 3.329136848449707s.
	 Call: 1 of 20 with 4 samples took: 0.5629911422729492s.
	 Call: 2 of 20 with 4 samples took: 0.5595850944519043s.
	 Call: 3 of 20 with 4 samples took: 0.5578911304473877s.
	 Call: 4 of 20 with 4 samples took: 0.5557725429534912s.
	 Call: 5 of 20 with 4 samples took: 0.5681345462799072s.
	 Call: 6 of 20 with 4 samples took: 0.5488979816436768s.
	 Call: 7 of 20 with 4 samples took: 0.5555169582366943s.
	 Call: 8 of 20 with 4 samples took: 0.5792186260223389s.
	 Call: 9 of 20 with 4 samples took: 0.9297688007354736s.
	 Call: 10 of 20 with 4 samples took: 0.6043572425842285s.
	 Call: 11 of 20 with 4 samples took: 0.572312593460083s.
	 Call: 12 of 20 with 4 samples took: 0.5600907802581787s.
	 Call: 13 of 20 with 4 samples took: 2.9460437297821045s.
	 Call: 14 of 20 with 4 samples took: 0.5780775547027588s.
	 Call: 15 of 20 with 4 samples took: 0.5762953758239746s.
	 Call: 16 of 20 with 4 samples took: 0.5773897171020508s.
	 Call: 17 of 20 with 4 samples took: 0.5769815444946289s.
	 Call: 18 of 20 with 4 samples took: 0.5663411617279053s.
	 Call: 19 of 20 with 4 samples took: 0.579679012298584s.

System information

  • Custom Docker Image:
    • Inference Framework: SkLearn
    • Sagemaker Inference Toolkit: 1.6.1
    • Multimodel Server: 1.1.8
    • Python version: 3.9
    • processing unit type: CPU (ml.t2.medium)

Is streaming supported?

Hi,

This isn't documented anywhere, or at least not anywhere I've looked.

Looking at the MMS repo's issues, I don't think it is.

But a confirmation on this/an updating of the docs would be greatly appreciated.

Thanks,
Aaron

Request encoding function for application/jsonlines output format

Is your feature request related to a problem? Please describe.
When I call encoder.encode(prediction, accept='application/jsonlines') with prediction being a list of JSON records, it raises an error.
Describe the solution you'd like
I hope the returned output is a file with JSON records arranged one per line, i.e.:

{'key': value, ...}
{'key': value, ...}
...

Describe alternatives you've considered
If the above request has already been solved, please give the documentation source. Thanks.
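While waiting for native support, a hedged sketch of a JSON Lines encoder that a custom output_fn could use (the helper is illustrative and not part of sagemaker_inference today):

import json

from sagemaker_inference import encoder


def output_fn(prediction, accept):
    if accept == "application/jsonlines":
        # One JSON record per line, as requested above.
        return "\n".join(json.dumps(record) for record in prediction)
    return encoder.encode(prediction, accept)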
