Git Product home page Git Product logo

multimodal-maestro's Introduction

multimodal-maestro


version license python-version Gradio Colab

πŸ‘‹ hello

Multimodal-Maestro gives you more control over large multimodal models to get the outputs you want. With more effective prompting tactics, you can get multimodal models to do tasks you didn't know (or think!) were possible. Curious how it works? Try our HF space!

πŸ’» install

⚠️ Our package has been renamed to maestro. Install the package in a 3.11>=Python>=3.8 environment.

pip install maestro

πŸ”Œ API

🚧 The project is still under construction. The redesigned API is coming soon.

maestro-docs-Snap

πŸ§‘β€πŸ³ prompting cookbooks

Description Colab
Prompt LMMs with Multimodal Maestro Colab
Manually annotate ONE image and let GPT-4V annotate ALL of them Colab

πŸš€ example

Find dog.

>>> The dog is prominently featured in the center of the image with the label [9].
πŸ‘‰ read more
  • load image

    import cv2
    
    image = cv2.imread("...")
  • create and refine marks

    import maestro
    
    generator = maestro.SegmentAnythingMarkGenerator(device='cuda')
    marks = generator.generate(image=image)
    marks = maestro.refine_marks(marks=marks)
  • visualize marks

    mark_visualizer = maestro.MarkVisualizer()
    marked_image = mark_visualizer.visualize(image=image, marks=marks)

    image-vs-marked-image

  • prompt

    prompt = "Find dog."
    
    response = maestro.prompt_image(api_key=api_key, image=marked_image, prompt=prompt)
    >>> "The dog is prominently featured in the center of the image with the label [9]."
    
  • extract related marks

    masks = maestro.extract_relevant_masks(text=response, detections=refined_marks)
    >>> {'6': array([
    ...     [False, False, False, ..., False, False, False],
    ...     [False, False, False, ..., False, False, False],
    ...     [False, False, False, ..., False, False, False],
    ...     ...,
    ...     [ True,  True,  True, ..., False, False, False],
    ...     [ True,  True,  True, ..., False, False, False],
    ...     [ True,  True,  True, ..., False, False, False]])
    ... }
    

multimodal-maestro

🚧 roadmap

  • Rewriting the maestro API.
  • Update HF space.
  • Documentation page.
  • Add GroundingDINO prompting strategy.
  • CovVLM demo.
  • Qwen-VL demo.

πŸ’œ acknowledgement

🦸 contribution

We would love your help in making this repository even better! If you noticed any bug, or if you have any suggestions for improvement, feel free to open an issue or submit a pull request.

multimodal-maestro's People

Contributors

capjamesg avatar deependujha avatar eltociear avatar josephofiowa avatar pedrogengo avatar skalskip avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

multimodal-maestro's Issues

Improve segmentation step to get single label and single marker for each object

Search before asking

  • I have searched the Multimodal Maestro issues and found no similar feature requests.

Description

I am trying to achieve segmentation of objects such that each object has only one label and clear segmentation boundary defined.
At the moment in the post-processing refiner step of the tutorial (Colab) notebook in the repo, the hard-coded 0.02 value isn’t perfect for most images and misses correct segmentation clusters. So misses most individual objects or they are clustered with the background.

The refiner function does 4 different tasks at once (hole filling, minimum area , max …) Good to isolate or please suggest a better way to isolate individual objects and their segmentation pixels perfectly.

Use case

No response

Additional

No response

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!

issue with masks_to_marks mapping

Search before asking

  • I have searched the Multimodal Maestro issues and found no similar bug report.

Bug

Hi,

First and foremost thanks for your nice work so far.

I was testing your code with your google collab tutorial, and the mark creation (SAM), visualization and refining goes smoothly. Also the prompt call with marks to gpt4 goes well without any issue and I get response back.

In the part that you try to extract and visualize relevant marks, the resultset of masks_to_marks throws the error shown below.

With the example I used I expect a large output (20-30 marks), if this helps.

Environment

0.1.0rc1
Google collab (T4 vm)

Minimal Reproducible Example

masks = maestro.extract_relevant_masks(text=response, detections=refined_marks)
masks = np.array([mask for mask in masks.values()])
detections = maestro.masks_to_marks(masks)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
[<ipython-input-61-a9e5dd9e84f7>](https://localhost:8080/#) in <cell line: 3>()
      1 masks = maestro.extract_relevant_masks(text=response, detections=marked_image)
      2 masks = np.array([mask for mask in masks.values()])
----> 3 detections = maestro.masks_to_marks(masks)

3 frames
[/usr/local/lib/python3.10/dist-packages/supervision/detection/core.py](https://localhost:8080/#) in _validate_mask(mask, n)
     27     )
     28     if not is_valid:
---> 29         raise ValueError("mask must be 3d np.ndarray with (n, H, W) shape")
     30 
     31 

ValueError: mask must be 3d np.ndarray with (n, H, W) shape

Additional

No response

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!

Proposed repository structure

Proposed Code Structure

Every prompting pipeline comes with prompt_creator and result_processor. You can manually instantiate instances of those classes or call pipeline function providing name argument.

from abc import ABC, abstractmethod
from typing import Tuple, List, Dict
import numpy as np
import supervision as sv


class BasePromptCreator(ABC):
    @abstractmethod
    def create(self, image: np.ndarray, *args, **kwargs) -> Tuple[np.ndarray, sv.Detections]:
        """
        Create a prompt from an image and additional arguments.

        Args:
            image (np.ndarray): The input image.
            *args, **kwargs: Additional arguments.

        Returns:
            Tuple[np.ndarray, sv.Detections]: A tuple containing a processed image and detections.
        """
        pass


class BaseResultProcessor(ABC):
    @abstractmethod
    def process(self, text: str, marks: sv.Detections, *args, **kwargs) -> Dict[str, str]:
        """
        Process the results with given text and detections.

        Args:
            text (str): The input text.
            marks (sv.Detections): Detections to be used in processing.
            *args, **kwargs: Additional arguments.

        Returns:
            Dict[str, str]: Processed results.
        """
        pass


    @abstractmethod
    def visualize(self, text: str, image: np.ndarray, marks: sv.Detections, *args, **kwargs) -> np.ndarray:
        """
        Visualize the results on an image.

        Args:
            text (str): The input text.
            image (np.ndarray): The input image.
            marks (sv.Detections): Detections to be visualized.
            *args, **kwargs: Additional arguments.

        Returns:
            np.ndarray: The image with visualizations.
        """
        pass


class SamPromptCreator(BasePromptCreator):
    def __init__(self, device: str):
        self.device = device

    def create(image: np.ndarray, mask: Optional[np.ndarray] = none) -> Tuple[image: np.ndarray, sv.Detections]:
        pass


class SamResultProcessor(BaseResultProcessor):
    
    def process(text: str, marks: sv.Detections) -> List[str]:
        pass

    def visualize(text: str, image: np.ndarray, marks: sv.Detections) -> np.ndarray:
        pass


class GroundingDinoPromptCreator(BasePromptCreator):
    def __init__(self, device: str):
        self.device = device

    def create(image: np.ndarray, categories: List[str]) -> Tuple[image: np.ndarray, sv.Detections]:
        pass


class GroundingDinoResultProcessor(BaseResultProcessor):
    
    def process(text: str, marks: sv.Detections) -> Dict[str, str]:
        pass

    def visualize(text: str, image: np.ndarray, marks: sv.Detections) -> np.ndarray:
        pass


PIPELINES = {
    'sam': (SamPromptCreator, SamResultProcessor),
    'grounding-dino': (GroundingDinoPromptCreator, GroundingDinoResultProcessor)
}


def pipeline(name: str, **kwargs) -> Tuple[BasePromptCreator, BaseResultProcessor]:
    """Retrieves the prompt creator and result processor for the specified pipeline.

    Args:
        name (str): The name of the pipeline.
        **kwargs: Additional keyword arguments for initializing the classes.

    Returns:
        Tuple[BasePromptCreator, BaseResultProcessor]: Instances of the prompt creator and result processor.

    Raises:
        ValueError: If the pipeline name is not in the PIPELINES dictionary.
    """
    pipeline_classes = PIPELINES.get(name)

    if pipeline_classes is None:
        raise ValueError(f"Pipeline '{name}' not found. Please choose from {list(PIPELINES.keys())}.")

    PromptCreatorClass, ResultProcessorClass = pipeline_classes

    prompt_creator = PromptCreatorClass(**kwargs)
    result_processor = ResultProcessorClass(**kwargs)

    return prompt_creator, result_processor

Example Usage

LMM inference gets sandwiched between prompt_creator and result_processor calls.

import cv2
from maestro import pipeline, prompt_gpt4_vision

prompt_creator, result_processor = pipeline('sam', device='cuda')

image_prompt, marks = prompt_creator(image=image)
text_prompt = 'Find dog.'
api_key = '...'

response = prompt_gpt4_vision(
    text_prompt=text_prompt, 
    image_prompt=image_prompt, 
    api_key=api_key)

visualization = result_processor.visualize(
    text=response, 
    image=image, 
    marks=marks)

Bug at generate marks

Search before asking

  • I have searched the Multimodal Maestro issues and found no similar bug report.

Bug

Traceback (most recent call last):
File "/data/megvii/projects/Qwen-VL/scripts/test_maestro.py", line 7, in
marks = generator.generate(image=image)
File "/data/Anaconda/anaconda3/envs/autogpt/lib/python3.10/site-packages/maestro/markers/sam.py", line 44, in generate
return masks_to_marks(masks=masks)
File "/data/Anaconda/anaconda3/envs/autogpt/lib/python3.10/site-packages/maestro/postprocessing/mask.py", line 187, in masks_to_marks
return sv.Detections(
File "", line 8, in init
File "/data/Anaconda/anaconda3/envs/autogpt/lib/python3.10/site-packages/supervision/detection/core.py", line 89, in post_init
_validate_mask(mask=self.mask, n=n)
File "/data/Anaconda/anaconda3/envs/autogpt/lib/python3.10/site-packages/supervision/detection/core.py", line 29, in _validate_mask
raise ValueError("mask must be 3d np.ndarray with (n, H, W) shape")
ValueError: mask must be 3d np.ndarray with (n, H, W) shape

Environment

Ubuntu 20.04
python=3.10.10

Minimal Reproducible Example

No response

Additional

No response

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!

refine_marks

Search before asking

  • I have searched the Multimodal Maestro issues and found no similar feature requests.

Description

Thanks for creating this amazing package! The refine function is not very helpful in most instances. It would be highly beneficial to prompt gpt with the marks of the smaller masks within larger ones and also visualize and extract them. Is it possible to bypass refine_marks?

Use case

No response

Additional

No response

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.