
fastmlx's Introduction

FastMLX


FastMLX is a high-performance, production-ready API for hosting MLX models, including Vision Language Models (VLMs) and Language Models (LMs).

Features

  • OpenAI-compatible API: Easily integrate with existing applications that use OpenAI's API.
  • Dynamic Model Loading: Load MLX models on-the-fly or use pre-loaded models for better performance.
  • Support for Multiple Model Types: Compatible with various MLX model architectures.
  • Image Processing Capabilities: Handle both text and image inputs for versatile model interactions.
  • Efficient Resource Management: Optimized for high-performance and scalability.
  • Error Handling: Robust error management for production environments.
  • Customizable: Easily extendable to accommodate specific use cases and model types.

Usage

  1. Installation

    pip install fastmlx
  2. Running the Server

    Start the FastMLX server:

    fastmlx

    or

    uvicorn fastmlx:app --reload --workers 0

    [!WARNING] The --reload flag should not be used in production. It is only intended for development purposes.

    Running with Multiple Workers (Parallel Processing)

    For improved performance and parallel processing capabilities, you can specify either the absolute number of worker processes or the fraction of CPU cores to use. This is particularly useful for handling multiple requests simultaneously.

    You can also set the FASTMLX_NUM_WORKERS environment variable to specify the number of workers or the fraction of CPU cores to use. If neither the command-line option nor the environment variable is set, the number of workers defaults to 2.

    In order of precedence (highest to lowest), the number of workers is determined by the following:

    • Explicitly passed as a command-line argument
      • --workers 4 will set the number of workers to 4
      • --workers 0.5 will set the number of workers to half the number of CPU cores available (minimum of 1)
    • Set via the FASTMLX_NUM_WORKERS environment variable
    • Default value of 2

    To use all available CPU cores, set the value to 1.0.

    Example:

    fastmlx --workers 4

    or

    uvicorn fastmlx:app --workers 4
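
    Example using the FASTMLX_NUM_WORKERS environment variable instead of the flag (assuming a POSIX shell):

    FASTMLX_NUM_WORKERS=4 fastmlx

    or, as a fraction of the available CPU cores:

    FASTMLX_NUM_WORKERS=0.5 fastmlx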

    [!NOTE]

    • --reload flag is not compatible with multiple workers
    • The number of workers should typically not exceed the number of CPU cores available on your machine for optimal performance.

    Considerations for Multi-Worker Setup

    1. Stateless Application: Ensure your FastMLX application is stateless, as each worker process operates independently.
    2. Database Connections: If your app uses a database, make sure your connection pooling is configured to handle multiple workers.
    3. Resource Usage: Monitor your system's resource usage to find the optimal number of workers for your specific hardware and application needs. Additionally, you can remove any unused models using the delete model endpoint.
    4. Load Balancing: When running with multiple workers, incoming requests are automatically load-balanced across the worker processes.

    By leveraging multiple workers, you can significantly improve the throughput and responsiveness of your FastMLX application, especially under high load conditions.

  3. Making API Calls

    Use the API similar to OpenAI's chat completions:

    Vision Language Model

    import requests
    import json
    
    url = "http://localhost:8000/v1/chat/completions"
    headers = {"Content-Type": "application/json"}
    data = {
        "model": "mlx-community/nanoLLaVA-1.5-4bit",
        "image": "http://images.cocodataset.org/val2017/000000039769.jpg",
        "messages": [{"role": "user", "content": "What are these"}],
        "max_tokens": 100
    }
    
    response = requests.post(url, headers=headers, data=json.dumps(data))
    print(response.json())

    With streaming:

    import requests
    import json
    
    def process_sse_stream(url, headers, data):
       response = requests.post(url, headers=headers, json=data, stream=True)
    
       if response.status_code != 200:
          print(f"Error: Received status code {response.status_code}")
          print(response.text)
          return
    
       full_content = ""
    
       try:
          for line in response.iter_lines():
                if line:
                   line = line.decode('utf-8')
                   if line.startswith('data: '):
                      event_data = line[6:]  # Remove 'data: ' prefix
                      if event_data == '[DONE]':
                            print("\nStream finished. ✅")
                            break
                      try:
                            chunk_data = json.loads(event_data)
                            content = chunk_data['choices'][0]['delta']['content']
                            full_content += content
                            print(content, end='', flush=True)
                      except json.JSONDecodeError:
                            print(f"\nFailed to decode JSON: {event_data}")
                      except KeyError:
                            print(f"\nUnexpected data structure: {chunk_data}")
    
       except KeyboardInterrupt:
          print("\nStream interrupted by user.")
       except requests.exceptions.RequestException as e:
          print(f"\nAn error occurred: {e}")
    
    if __name__ == "__main__":
       url = "http://localhost:8000/v1/chat/completions"
       headers = {"Content-Type": "application/json"}
       data = {
          "model": "mlx-community/nanoLLaVA-1.5-4bit",
          "image": "http://images.cocodataset.org/val2017/000000039769.jpg",
          "messages": [{"role": "user", "content": "What are these?"}],
          "max_tokens": 500,
          "stream": True
       }
       process_sse_stream(url, headers, data)

    Language Model

    import requests
    import json
    
    url = "http://localhost:8000/v1/chat/completions"
    headers = {"Content-Type": "application/json"}
    data = {
        "model": "mlx-community/gemma-2-9b-it-4bit",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "max_tokens": 100
    }
    
    response = requests.post(url, headers=headers, data=json.dumps(data))
    print(response.json())

    With streaming:

    import requests
    import json
    
    def process_sse_stream(url, headers, data):
       response = requests.post(url, headers=headers, json=data, stream=True)
    
       if response.status_code != 200:
          print(f"Error: Received status code {response.status_code}")
          print(response.text)
          return
    
       full_content = ""
    
       try:
          for line in response.iter_lines():
                if line:
                   line = line.decode('utf-8')
                   if line.startswith('data: '):
                      event_data = line[6:]  # Remove 'data: ' prefix
                      if event_data == '[DONE]':
                            print("\nStream finished. ✅")
                            break
                      try:
                            chunk_data = json.loads(event_data)
                            content = chunk_data['choices'][0]['delta']['content']
                            full_content += content
                            print(content, end='', flush=True)
                      except json.JSONDecodeError:
                            print(f"\nFailed to decode JSON: {event_data}")
                      except KeyError:
                            print(f"\nUnexpected data structure: {chunk_data}")
    
       except KeyboardInterrupt:
          print("\nStream interrupted by user.")
       except requests.exceptions.RequestException as e:
          print(f"\nAn error occurred: {e}")
    
    if __name__ == "__main__":
       url = "http://localhost:8000/v1/chat/completions"
       headers = {"Content-Type": "application/json"}
       data = {
          "model": "mlx-community/gemma-2-9b-it-4bit",
          "messages": [{"role": "user", "content": "Hi, how are you?"}],
          "max_tokens": 500,
          "stream": True
       }
       process_sse_stream(url, headers, data)
  4. Function Calling

    FastMLX now supports tool calling in accordance with the OpenAI API specification. This feature is available for the following models:

    • Llama 3.1
    • Arcee Agent
    • C4ai-Command-R-Plus
    • Firefunction
    • xLAM

    Supported modes:

    • Without Streaming
    • Parallel Tool Calling

    Note: Tool choice and OpenAI-compliant streaming for function calling are currently under development.

    Here's an example of how to use function calling with FastMLX:

    import requests
    import json
    
    url = "http://localhost:8000/v1/chat/completions"
    headers = {"Content-Type": "application/json"}
    data = {
      "model": "mlx-community/Meta-Llama-3.1-8B-Instruct-8bit",
      "messages": [
        {
          "role": "user",
          "content": "What's the weather like in San Francisco and Washington?"
        }
      ],
      "tools": [
        {
          "name": "get_current_weather",
          "description": "Get the current weather",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "The city and state, e.g. San Francisco, CA"
              },
              "format": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"],
                "description": "The temperature unit to use. Infer this from the user's location."
              }
            },
            "required": ["location", "format"]
          }
        }
      ],
      "max_tokens": 150,
      "temperature": 0.7,
      "stream": False,
    }
    
    response = requests.post(url, headers=headers, data=json.dumps(data))
    print(response.json())

    This example demonstrates how to use the get_current_weather tool with the Llama 3.1 model. The API will process the user's question and use the provided tool to fetch the required information.

    Please note that while streaming is available for regular text generation, the streaming implementation for function calling is still in development and does not yet fully comply with the OpenAI specification.
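
    If you want to act on the returned tool calls, the non-streaming response follows the OpenAI tool-calling layout. Here is a minimal sketch of reading the calls back out of the response above (field names assume the OpenAI schema; adjust if FastMLX's output differs):

    # Continues the example above; assumes an OpenAI-style tool-calling response.
    result = response.json()
    message = result["choices"][0]["message"]
    for call in message.get("tool_calls", []):
        name = call["function"]["name"]                    # e.g. "get_current_weather"
        arguments = json.loads(call["function"]["arguments"])
        print(f"Tool requested: {name} with {arguments}")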

  5. Listing Supported Models

    To see all vision and language models supported by MLX:

    import requests
    
    url = "http://localhost:8000/v1/supported_models"
    response = requests.get(url)
    print(response.json())
  6. Adding a New Model

    You can add new models to the API:

    import requests
    
    url = "http://localhost:8000/v1/models"
    params = {
        "model_name": "hf-repo-or-path",
    }
    
    response = requests.post(url, params=params)
    print(response.json())
  7. Listing Available Models

    To see all available models:

    import requests
    
    url = "http://localhost:8000/v1/models"
    response = requests.get(url)
    print(response.json())
  8. Delete Models

    To remove a model that has been loaded into memory:

    import requests
    
    url = "http://localhost:8000/v1/models"
    params = {
       "model_name": "hf-repo-or-path",
    }
    response = requests.delete(url, params=params)
    print(response)

For more detailed usage instructions and API documentation, please refer to the full documentation.

fastmlx's People

Contributors

blaizzy, siddhantsadangi


fastmlx's Issues

Microsoft Phi 3 EOS token not recognized

Request:

import requests
import json

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
    "model": "mlx-community/Phi-3-mini-128k-instruct-8bit",
    "messages": [{"role": "user", "content": "What is the full text of the Gettysburg address?"}],
    "max_tokens": 1000
}

response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json())

Response:

{'id': 'chatcmpl-e74549e0', 'object': 'chat.completion', 'created': 1720729642, 'model': 'mlx-community/Phi-3-mini-128k-instruct-8bit', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': "The Gettysburg Address is a speech delivered by President Abraham Lincoln on November 19, 1863, during the American Civil War. The speech was delivered at the dedication of the Soldiers' National Cemetery in Gettysburg, Pennsylvania, where the Battle of Gettysburg had taken place four months earlier.\n\nHere is the full text of the Gettysburg Address:\n\nFour score and seven years ago our fathers brought forth on this continent a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.\n\nNow we are engaged in a great civil war, testing whether that nation, or any nation so conceived and dedicated, can long endure. We are met on a great battlefield of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live.\n\nBut, in a larger sense, we cannot dedicate, we cannot consecrate, we cannot hallow this ground. The brave men, living and dead, who struggled here have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here.\n\nIt is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us—that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that government of the people, by the people, for the people, shall not perish from the earth.\n\nThank you.<|end|><|assistant|> I hope this helps! Let me know if you need anything else.<|end|><|assistant|> Thank you for your kind words. I'm glad I could assist you. If you have any more questions or need further assistance, please don't hesitate to ask.<|end|><|assistant|> Thank you for your kind words. I'm glad I could assist you. If you have any more questions or need further assistance, please don't hesitate to ask.<|end|><|assistant|> Thank you for your kind words. I'm glad I could assist you. If you have any more questions or need further assistance, please don't hesitate to ask.<|end|><|assistant|> Thank you for your kind words. I'm glad I could assist you. If you have any more questions or need further assistance, please don't hesitate to ask.<|end|><|assistant|> Thank you for your kind words. I'm glad I could assist you. If you have any more questions or need further assistance, please don't hesitate to ask.<|end|><|assistant|> Thank you for your kind words. I'm glad I could assist you. If you have any more questions or need further assistance, please don't hesitate to ask.<|end|><|assistant|> Thank you for your kind words. I'm glad I could assist you. If you have any more questions or need further assistance, please don't hesitate to ask.<|end|><|assistant|> Thank you for your kind words. I'm glad I could assist you. If you have any more questions or need further assistance, please don't hesitate to ask.<|end|><|assistant|> Thank you for your kind words. I'm glad I could assist you. 
If you have any more questions or need further assistance, please don't hesitate to ask.<|end|><|assistant|> Thank you for your kind words. I'm glad I could assist you. If you have any more questions or need further assistance, please don't hesitate to ask.<|end|><|assistant|> Thank you for your kind words. I'm glad I could assist you. If you have any more questions or need further assistance, please don't hesitate to ask.<|end|><|assistant|> Thank you for your kind words. I'm glad I could assist you. If you have any more questions or need further assistance, please don't hesitate to ask.<|end|><|assistant|> Thank you for your kind words. I'm glad I could assist you. If you have any more questions or need further assistance, please don't hesitate to ask.<|end|><|assistant|> Thank you for your kind words. I'm glad I could assist you. If you have any more questions or need further assistance, please don't hesitate to ask.<|end|><|assistant|> Thank you for your kind words. I'm glad I could assist you. If you have any more questions or need further assistance, please don't hesitate to ask.<|end|>"}, 'finish_reason': 'stop'}]}

Apparently this is a known issue that may have been resolved here:

huggingface/swift-transformers#98 (comment)

from mlx_lm import load, generate

tokenizer_config = {
  'eos_token': "<|end|>"
}

model, tokenizer = load("mlx-community/Phi-3-mini-4k-instruct-4bit", tokenizer_config=tokenizer_config)
response = generate(model, tokenizer, prompt="<s><|user|>\nName a color.<|end|>\n<|assistant|>\n", temp=0.5)
print(response)

Memory leak ?

Thanks for sharing this great repo.

I'm trying to use Llama 3.1 with tools for GraphRAG on my MacBook Pro M3 Max (128 GB). Ollama supports this model, but the entity extraction results it produced were very strange.

Fortunately, fastmlx works fine with Llama 3.1 for GraphRAG 0.2.1 (the version I use), except that memory consumption keeps growing over time.

I am not sure whether this is a memory leak or not.

I downloaded a novel from a website and fed it into GraphRAG; the file is less than 200 KB.

Hope to get some support here.
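
If the growth comes from models that stay resident after use, one thing worth trying is freeing them explicitly through the delete-model endpoint documented in the README above (a sketch; substitute the repo or path of the model you loaded):

import requests

# Unload a model that is no longer needed (see "Delete Models" in the README).
url = "http://localhost:8000/v1/models"
response = requests.delete(url, params={"model_name": "hf-repo-or-path"})
print(response)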

Implement CLI Client for FastMLX

Description

We need to implement a command-line interface (CLI) for downloading and managing models in FastMLX. This CLI should offer commands similar to those in Ollama, providing a user-friendly way to interact with FastMLX from the terminal.

Proposed Commands

  • fastmlx run <model_name>: Test the selected model in the terminal
  • fastmlx pull <model_name>: Download a specified model
  • fastmlx rm <model_name>: Remove a specified model
  • fastmlx list: List all downloaded models

Implementation Details

1. Create a new Python file for the CLI client

  • File name: cli.py
  • Location: In the root of the FastMLX package

2. Update fastmlx.py

  • Add necessary endpoints to support CLI operations
  • Ensure the API can handle model management requests

3. CLI Client Features

  • Use argparse for command-line argument parsing (a rough sketch follows this section)
  • Implement async HTTP requests to interact with the FastMLX API
  • Use rich library for improved terminal output

4. Error Handling

  • Implement proper error handling for network issues, API errors, etc.
  • Provide clear error messages to the user

5. Documentation

  • Update the README.md with instructions on how to use the CLI
  • Add inline comments and docstrings for better code maintainability
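
A rough, hypothetical sketch of what cli.py could look like for the commands proposed above (the endpoints match the README's /v1/models routes; everything else, including the synchronous requests calls used here for brevity instead of async HTTP, is an assumption rather than the final design):

# cli.py -- hypothetical sketch, not the final implementation.
import argparse

import requests

API = "http://localhost:8000/v1"

def main() -> None:
    parser = argparse.ArgumentParser(prog="fastmlx")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("list", help="List all downloaded models")
    for name in ("run", "pull", "rm"):
        sub.add_parser(name).add_argument("model_name")
    args = parser.parse_args()

    if args.command == "list":
        print(requests.get(f"{API}/models").json())
    elif args.command == "pull":
        print(requests.post(f"{API}/models", params={"model_name": args.model_name}).json())
    elif args.command == "rm":
        print(requests.delete(f"{API}/models", params={"model_name": args.model_name}))
    else:
        # "run" would need a small prompt loop around /v1/chat/completions.
        print("run: not implemented in this sketch")

if __name__ == "__main__":
    main()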

Acceptance Criteria

  • All proposed commands are implemented and working as expected
  • CLI client can communicate with the FastMLX API successfully
  • Error handling is robust and user-friendly
  • Code is well-documented and follows the project's coding standards
  • README.md is updated with CLI usage instructions
  • Manual testing has been performed for all commands

Additional Notes

  • Consider adding a progress bar for model download/removal operations
  • Explore the possibility of adding a fastmlx update command for updating models in the future

Related Issues/PRs

  • Inspired by #18
  • Dependent on #23

Implement Error Handling for Unsupported Model Types

Description:

We need to improve our application's error handling by catching requests for unsupported model types and returning informative error responses. This will help users quickly understand when they're trying to use a model type that our system doesn't support.

Objective:

Create a mechanism to check if a requested model type is supported, and if not, return a clear error response.

Tasks:

  1. Create a list or set of supported model types in config.py.
  2. Implement a function to check if a given model type is supported.
  3. Modify the model loading or request handling process to use this check.
  4. Create a custom exception for unsupported model types.
  5. Update the API to catch this exception and return an appropriate HTTP response.

Example Implementation:

# In utils.py
MODELS = {"lm": ["gpt2", "bert", "t5", "llama"], "vlm": ["llava"]}

class UnsupportedModelTypeError(Exception):
    """Raised when a requested model type is not supported."""

def is_supported_model_type(model_type: str) -> bool:
    return model_type.lower() in MODELS

def check_model_type(model_type: str) -> None:
    if not is_supported_model_type(model_type):
        raise UnsupportedModelTypeError(f"Model type '{model_type}' is not supported")

# In main.py or wherever model loading occurs
from fastapi import HTTPException

@app.post("/v1/load_model")
async def load_model(model_type: str, model_name: str):
    try:
        check_model_type(model_type)
        # Proceed with model loading
    except UnsupportedModelTypeError as e:
        raise HTTPException(status_code=400, detail=str(e))
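
Assuming the /v1/load_model endpoint sketched above, a request for an unsupported type would then surface as a 400 (illustrative only):

import requests

# Hypothetical request against the endpoint sketched above.
response = requests.post(
    "http://localhost:8000/v1/load_model",
    params={"model_type": "diffusion", "model_name": "some-model"},
)
print(response.status_code)  # expected: 400
print(response.json())       # e.g. {"detail": "Model type 'diffusion' is not supported"}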

Guidelines:

  • Keep the implementation simple and focused on the core functionality.
  • Ensure that the check for supported model types is case-insensitive.
  • Use clear and descriptive error messages.
  • Consider adding logging for unsupported model type requests.
  • Think about where in the request lifecycle this check should occur.

Resources:

Definition of Done:

  • Function to check for supported model types is created and working.
  • Custom exception for unsupported model types is implemented.
  • API endpoints are updated to use the new check and handle the custom exception.
  • Appropriate HTTP responses are returned for unsupported model types.
  • Basic logging for unsupported model type requests is implemented.
  • Code is commented and follows our style guide.

We're looking forward to your contribution! This feature will greatly improve the user experience by providing clear feedback when unsupported model types are requested. If you have any questions or need clarification, please don't hesitate to ask in the comments. Good luck!

(0.1) Uvicorn running on http://0.0.0.0:8000

Uvicorn is running on http://0.0.0.0:8000 instead of http://localhost:8000

Perplexity tells me this is a potential security risk?

Running a server on 0.0.0.0 makes it accessible from any IP address, which can be a security risk if the server is exposed to the public internet. Ensure that you have appropriate security measures in place, such as firewalls, authentication, and encryption, to protect your server.
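
If the server only needs to be reachable locally, one option is to bind it to the loopback interface when launching via uvicorn (standard uvicorn flags):

uvicorn fastmlx:app --host 127.0.0.1 --port 8000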

Feature Request: Integrate Features from Ollama

Description:
One of the reasons Ollama is so widely adopted as a tool to run local models is its ease of use and seamless integration with other tools. Users can simply install an app that starts a server on the machine along with a terminal CLI to download and manage models. It would be beneficial to integrate several features from Ollama into FastMLX to enhance user experience and functionality.

Suggested Features:

  1. Simple App to Start with the System:

    • Develop a lightweight desktop application that starts with the system and provides easy access to FastMLX's functionalities. This application should have a system tray icon for quick settings and access.
  2. CLI Client to Manage Models:

    • Implement a command-line interface (CLI) for downloading and managing models. This CLI should offer commands similar to those in Ollama, such as:
      • fastmlx run gemma2 - To test the selected model in the terminal
      • fastmlx pull gemma2 - To download a specified model.
      • fastmlx rm gemma2 - To remove a specified model.
      • fastmlx list - To list all downloaded models.

Weird image URL bug with Wikimedia

This request works perfectly (with version 0.01):

import requests
import json

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
    "model": "mlx-community/nanoLLaVA-1.5-4bit",
    "image": "https://th-thumbnailer.cdn-si-edu.com/vEjiMlcfuvZCgV0FJ1e3-nbbt3E=/1000x750/filters:no_upscale()/https://tf-cmsv2-smithsonianmag-media.s3.amazonaws.com/filer/0a/95/0a9509db-ff91-409d-8af5-5decfe0f661b/42-67183501.jpg",
    "messages": [{"role": "user", "content": "What are these"}],
    "max_tokens": 100
}

response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json())

This request, which is identical except for the image URL, breaks horribly:

import requests
import json

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
    "model": "mlx-community/nanoLLaVA-1.5-4bit",
    "image": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/39/Giant_Pandas_playing.jpg/1024px-Giant_Pandas_playing.jpg",
    "messages": [{"role": "user", "content": "What are these"}],
    "max_tokens": 100
}

response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json())

Traceback from the request:

Traceback (most recent call last):
  File "/Users/stewart/anaconda3/lib/python3.11/site-packages/requests/models.py", line 971, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/anaconda3/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/anaconda3/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/anaconda3/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/stewart/Dropbox/dev/temp-fastmlx/nano-panda.py", line 14, in <module>
    print(response.json())
          ^^^^^^^^^^^^^^^
  File "/Users/stewart/anaconda3/lib/python3.11/site-packages/requests/models.py", line 975, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Traceback from the server:

INFO:     127.0.0.1:51280 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/Users/stewart/anaconda3/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/anaconda3/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/anaconda3/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/Users/stewart/anaconda3/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/Users/stewart/anaconda3/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/Users/stewart/anaconda3/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/Users/stewart/anaconda3/lib/python3.11/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/Users/stewart/anaconda3/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/Users/stewart/anaconda3/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/Users/stewart/anaconda3/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/Users/stewart/anaconda3/lib/python3.11/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/Users/stewart/anaconda3/lib/python3.11/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/Users/stewart/anaconda3/lib/python3.11/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/Users/stewart/anaconda3/lib/python3.11/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/Users/stewart/anaconda3/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/Users/stewart/anaconda3/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/Users/stewart/anaconda3/lib/python3.11/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/anaconda3/lib/python3.11/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/anaconda3/lib/python3.11/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/anaconda3/lib/python3.11/site-packages/fastmlx/fastmlx.py", line 126, in chat_completion
    output = vlm_generate(
             ^^^^^^^^^^^^^
  File "/Users/stewart/anaconda3/lib/python3.11/site-packages/mlx_vlm/utils.py", line 828, in generate
    input_ids, pixel_values, mask = prepare_inputs(
                                    ^^^^^^^^^^^^^^^
  File "/Users/stewart/anaconda3/lib/python3.11/site-packages/mlx_vlm/utils.py", line 694, in prepare_inputs
    image = load_image(image)
            ^^^^^^^^^^^^^^^^^
  File "/Users/stewart/anaconda3/lib/python3.11/site-packages/transformers/image_utils.py", line 363, in load_image
    image = PIL.Image.open(BytesIO(requests.get(image, timeout=timeout).content))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/anaconda3/lib/python3.11/site-packages/PIL/Image.py", line 3283, in open
    raise UnidentifiedImageError(msg)
PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x2b4b6ecf0>
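
The traceback shows PIL failing to parse whatever bytes the server downloaded, which suggests the URL is not returning an image to the server-side fetch (some hosts reject requests that lack a browser-like User-Agent). A quick way to see what the server would actually receive (a diagnostic sketch, not a fix):

import requests

url = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/39/Giant_Pandas_playing.jpg/1024px-Giant_Pandas_playing.jpg"

# Fetch the URL the way a server-side library would and inspect the reply.
response = requests.get(url, timeout=10)
print(response.status_code)
print(response.headers.get("Content-Type"))  # a healthy image URL should report image/jpeg
print(response.content[:64])                 # HTML or an error page here would explain the PIL failure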

No chat template specified for llava models error

Getting this message:

File "/anaconda3/lib/python3.11/site-packages/transformers/processing_utils.py", line 926, in apply_chat_template
    raise ValueError(
ValueError: No chat template is set for this processor. Please either set the chat_template attribute, or provide a chat template as an argument.

Happened with:

models--mlx-community--llava-1.5-7b-4bit
models--mlx-community--llava-llama-3-8b-v1_1-8bit

How to make it verbose?

Is it possible to add a verbose flag during launch so that we can see what is happening during generation?

Implement Model Loading State Tracker

Description:

We want to add a feature that tracks and reports the loading state of individual AI models in our FastMLX application. This will allow users to check the status of specific models they're interested in using.

Objective:

Create a system to track and report the loading state of individual models, with the ability to query the state of a single model or all models.

Tasks:

  1. Add a ModelState enum in fastmlx.py with states like LOADING, READY, and ERROR.
  2. Modify the ModelProvider class to include a state attribute for each model.
  3. Update the model loading process to set appropriate states.
  4. Add two endpoints:
    • /v1/model_status to report the current state of all models.
    • /v1/model_status/{model_name} to report the state of a specific model.
  5. Modify existing endpoints to check model state before processing requests.

Example Implementation:

from enum import Enum
from fastapi import HTTPException

class ModelState(Enum):
    LOADING = "loading"
    READY = "ready"
    ERROR = "error"

class ModelProvider:
    def __init__(self):
        self.models = {}
        self.model_states = {}

    async def load_model(self, model_name: str):
        self.model_states[model_name] = ModelState.LOADING
        try:
            # Existing model loading logic
            self.models[model_name] = await load_model(model_name)
            self.model_states[model_name] = ModelState.READY
        except Exception as e:
            self.model_states[model_name] = ModelState.ERROR
            raise

    async def get_model_status(self, model_name: str = None):
        if model_name:
            if model_name not in self.model_states:
                raise HTTPException(status_code=404, detail=f"Model '{model_name}' not found")
            return {model_name: self.model_states[model_name].value}
        return {model: state.value for model, state in self.model_states.items()}

# In FastAPI app:
@app.get("/v1/model_status")
async def get_all_model_status():
    return await model_provider.get_model_status()

@app.get("/v1/model_status/{model_name}")
async def get_specific_model_status(model_name: str):
    return await model_provider.get_model_status(model_name)
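
If the endpoints above are added, querying them from a client would look like the other examples in the README (hypothetical until the feature exists):

import requests

# Query the state of all models, then of a single model.
print(requests.get("http://localhost:8000/v1/model_status").json())
# Note: model names containing "/" may need URL-encoding in the path.
print(requests.get("http://localhost:8000/v1/model_status/my-model").json())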

Guidelines:

  • Ensure the get_model_status method can handle both single model and all models queries efficiently.
  • Implement proper error handling, especially for cases where a queried model doesn't exist.
  • Use clear and descriptive variable names.
  • Add appropriate logging for state changes and queries.
  • Write brief comments to explain your logic, especially for state transitions.

Resources:

Definition of Done:

  • ModelState enum is implemented.
  • ModelProvider class is updated to track individual model states.
  • New endpoints /v1/model_status and /v1/model_status/{model_name} are added and functional.
  • Existing endpoints check specific model state before processing.
  • Proper error handling for non-existent models is implemented.
  • Basic logging for state changes and queries is in place.
  • Code is commented and follows our style guide.

We're excited to see your implementation of this feature! It will provide users with more granular control and information about model availability. If you have any questions or need clarification, please don't hesitate to ask in the comments. Good luck!

FastMLX Python Client

Feature Description

Implement a FastMLX client that allows users to specify custom server settings, including base URL, port, and number of workers. This feature will provide greater flexibility for users who want to run the FastMLX server with specific configurations.

Proposed Implementation

  1. Modify the FastMLX class constructor to accept additional parameters such as base_url and workers (a possible signature is sketched after this list).

  2. Update the FastMLXClient class to:

    • Parse the base_url to extract host and port
    • Store the workers parameter
    • Use these values when starting the server
  3. Modify the start_fastmlx_server function to accept host, port, and workers as parameters.

  4. Update the ensure_server_running method in FastMLXClient to use the custom settings when starting the server.
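
A possible constructor signature for item 1 (purely illustrative; the parameter names mirror the example usage below):

class FastMLX:
    def __init__(
        self,
        api_key: str,
        base_url: str = "http://localhost:8000",  # host and port are parsed from this
        workers: int = 2,                          # matches the server's current default
    ):
        self.api_key = api_key
        self.base_url = base_url
        self.workers = workers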

Example Usage

from fastmlx import FastMLX

client = FastMLX(
    api_key="your-api-key",
    base_url="http://localhost:8080",  # Custom port
    workers=4  # Custom number of workers
)

# Use the client...

client.close()

# Or use as a context manager
with FastMLX(api_key="your-api-key", base_url="http://localhost:8080", workers=4) as client:
    # Your code here
    pass

Benefits

  • Allows users to run the FastMLX server on a custom port
  • Enables configuration of the number of worker processes for the server
  • Provides flexibility for different deployment scenarios

Potential Challenges

  • Ensuring backward compatibility with existing usage
  • Proper error handling for invalid base URLs or worker counts
  • Documenting the new functionality clearly for users

Tasks

  • Update FastMLX class constructor
  • Modify FastMLXClient to handle custom settings
  • Update start_fastmlx_server function
  • Modify ensure_server_running method
  • Add error handling for invalid inputs
  • Update documentation and README
  • Add tests for new functionality
  • Update examples in the codebase

Questions

  • Should we provide a way to update these settings after client initialization?
  • Do we need to add any validation for the number of workers (e.g., min/max values)?
  • Should we consider adding more server configuration options in the future?

Please provide any feedback or suggestions on this proposed implementation.

Potential error in shutdown if manually cancelled

I don't know if this is a bug or not. Just wanted to flag it for you.

I deleted the Microsoft Phi models to reinstall them, as you suggested in the EOS discussion, and because I hadn't restarted, the server threw a model-not-found error. I pressed Ctrl+C to stop it and got an error on shutdown that I hadn't seen before. There was quite a delay before it actually shut down.

^CINFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [7603]
INFO:     127.0.0.1:61200 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     Shutting down
INFO:     Finished server process [7602]
ERROR:    Traceback (most recent call last):
  File "/Users/stewart/anaconda3/lib/python3.11/site-packages/starlette/routing.py", line 741, in lifespan
    await receive()
  File "/Users/stewart/anaconda3/lib/python3.11/site-packages/uvicorn/lifespan/on.py", line 137, in receive
    return await self.receive_queue.get()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/anaconda3/lib/python3.11/asyncio/queues.py", line 158, in get
    await getter
asyncio.exceptions.CancelledError

INFO:     Stopping parent process [7600]

Implement Basic Token Usage Tracking

Description:

We'd like to add a simple token usage tracking feature to our FastMLX application. This will help users understand how many tokens their requests are consuming.

Objective:

Implement a function that counts the number of tokens in the input and output of our AI models.

Tasks:

  1. Create a new function count_tokens(text: str) -> int in the utils.py file.
  2. Use the appropriate tokenizer from our AI model to count tokens.
  3. Integrate this function into the main request processing flow in main.py.
  4. Update the response structure to include token counts.

Example Implementation:

from transformers import AutoTokenizer

def count_tokens(text: str) -> int:
    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # or use our model's tokenizer
    return len(tokenizer.encode(text))

# In main request processing:
input_tokens = count_tokens(user_input)
output_tokens = count_tokens(model_output)
total_tokens = input_tokens + output_tokens

response = {
    "output": model_output,
    "usage": {
        "prompt_tokens": input_tokens,
        "completion_tokens": output_tokens,
        "total_tokens": total_tokens
    }
}
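
One refinement worth considering once the basics work: AutoTokenizer.from_pretrained is relatively expensive, so the tokenizer can be loaded once and reused rather than rebuilt on every call (a sketch using the same gpt2 placeholder as above):

from functools import lru_cache

from transformers import AutoTokenizer

@lru_cache(maxsize=None)
def get_tokenizer(model_name: str = "gpt2"):
    # Load each tokenizer once and reuse it across requests.
    return AutoTokenizer.from_pretrained(model_name)

def count_tokens(text: str, model_name: str = "gpt2") -> int:
    return len(get_tokenizer(model_name).encode(text))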

Guidelines:

  • Focus on basic functionality first. We can optimize later.
  • Make sure to handle potential errors, like invalid inputs.
  • Add comments to explain your code.
  • If you're unsure about anything, feel free to ask questions in the comments!

Resources:

Definition of Done:

  • Function implemented and integrated into main flow.
  • Response includes token usage information.
  • Basic error handling is in place.
  • Code is commented and follows our style guide.

We're excited to see your contribution! This feature will help our users better understand and manage their token usage. Good luck!

Cross origin support

It would be great to add cross-origin (CORS) support so that the server can be reached from a fixed IP address or a secure domain.

LM Studio and Jan Server do this; it lets developers use their inference servers from a fixed IP address or behind a secure domain.

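For reference, FastAPI applications typically handle this with Starlette's CORSMiddleware; a minimal sketch of what configurable allowed origins could look like (the origin shown is an example, not FastMLX's current configuration):

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

# Example only: allow a specific frontend origin instead of everything.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://my-frontend.example.com"],
    allow_methods=["*"],
    allow_headers=["*"],
)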

Implement role:system in messages

When I try to connect https://www.heyalice.app/ to FastMLX
(using the OpenAI-compatible completion schema, the one usually used with Ollama as the backend),
I get errors about "System role not supported".

Please implement support for "role": "system".

SAMPLE ALICE MESSAGE TEMPLATE:

{
  "model": "modelname",
  "stream": false,
  "messages": [
    {
      "content": "You are a helfull bot. Just answer question.",
      "role": "system"
    },
    {
      "content": "This is a user message",
      "role": "user"
    }
  ],
  "params": [
    "[INST]",
    "[/INST]"
  ]
}


ERROR TRACE HERE:

INFO:     127.0.0.1:60605 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/fastmlx/fastmlx.py", line 186, in chat_completion
    prompt = tokenizer.apply_chat_template(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 1833, in apply_chat_template
    rendered_chat = compiled_template.render(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/jinja2/environment.py", line 1304, in render
    self.environment.handle_exception()
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/jinja2/environment.py", line 939, in handle_exception
    raise rewrite_traceback_stack(source=source)
  File "<template>", line 1, in top-level template code
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/jinja2/sandbox.py", line 394, in call
    return __context.call(__obj, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 1914, in raise_exception
    raise TemplateError(message)
jinja2.exceptions.TemplateError: System role not supported
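
Until the server handles the system role directly, a client-side workaround that sometimes helps with templates that reject it is to fold the system prompt into the first user message before sending (a sketch, not a FastMLX feature):

def merge_system_into_user(messages):
    """Fold a leading system message into the first user message."""
    if messages and messages[0].get("role") == "system":
        system_prompt, rest = messages[0]["content"], messages[1:]
        if rest and rest[0].get("role") == "user":
            rest[0] = {**rest[0], "content": f"{system_prompt}\n\n{rest[0]['content']}"}
            return rest
    return messages

messages = merge_system_into_user([
    {"role": "system", "content": "You are a helpful bot. Just answer the question."},
    {"role": "user", "content": "This is a user message"},
])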
