The multimodal-maestro from roboflow

Description	Colab
Prompt LMMs with Multimodal Maestro
Manually annotate ONE image and let GPT-4V annotate ALL of them

Improve segmentation step to get single label and single marker for each object

Search before asking

I have searched the Multimodal Maestro issues and found no similar feature requests.

Description

I am trying to segment objects so that each object has exactly one label and a clearly defined segmentation boundary.
At the moment, the post-processing refiner step in the repo's tutorial (Colab) notebook uses a hard-coded value of 0.02, which does not suit most images and misses correct segmentation clusters. As a result, most individual objects are either missed or merged with the background.

The refiner function does four different tasks at once (hole filling, minimum area, max …). It would be good to separate them; alternatively, please suggest a better way to isolate individual objects and their segmentation pixels accurately.
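
To make the request concrete, here is a minimal sketch of what one isolated step with a configurable threshold could look like. It assumes the hard-coded 0.02 is a minimum mask area expressed as a fraction of the image area, which the issue does not state. The helper names `relative_areas` and `filter_by_relative_area` are hypothetical and are not part of the Multimodal Maestro API.

```python
import numpy as np


def relative_areas(masks: np.ndarray) -> np.ndarray:
    """Return each mask's area as a fraction of the image area.

    masks: boolean array of shape (n, H, W).
    """
    _, height, width = masks.shape
    return masks.sum(axis=(1, 2)) / float(height * width)


def filter_by_relative_area(
    masks: np.ndarray,
    min_fraction: float = 0.02,  # assumed meaning of the hard-coded 0.02
    max_fraction: float = 1.0,
) -> np.ndarray:
    """Keep only masks whose relative area lies in [min_fraction, max_fraction]."""
    areas = relative_areas(masks)
    keep = (areas >= min_fraction) & (areas <= max_fraction)
    return masks[keep]
```

With a step like this exposed on its own, a caller could lower `min_fraction` for images with many small objects without touching hole filling or the other refinement tasks.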

Use case

No response

Additional

No response

Are you willing to submit a PR?

Yes I'd like to help by submitting a PR!


issue with masks_to_marks mapping

Search before asking

I have searched the Multimodal Maestro issues and found no similar bug report.

Bug

Hi,

First and foremost, thanks for your nice work so far.

I was testing your code with your Google Colab tutorial, and the mark creation (SAM), visualization, and refining all go smoothly. The prompt call with marks to GPT-4 also works without any issue, and I get a response back.

In the part where you extract and visualize the relevant marks, passing the result to masks_to_marks throws the error shown below.

With the example I used, I expect a large output (20-30 marks), if this helps.

Environment

0.1.0rc1
Google Colab (T4 VM)

Minimal Reproducible Example

masks = maestro.extract_relevant_masks(text=response, detections=refined_marks)
masks = np.array([mask for mask in masks.values()])
detections = maestro.masks_to_marks(masks)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
[<ipython-input-61-a9e5dd9e84f7>](https://localhost:8080/#) in <cell line: 3>()
      1 masks = maestro.extract_relevant_masks(text=response, detections=marked_image)
      2 masks = np.array([mask for mask in masks.values()])
----> 3 detections = maestro.masks_to_marks(masks)

3 frames
[/usr/local/lib/python3.10/dist-packages/supervision/detection/core.py](https://localhost:8080/#) in _validate_mask(mask, n)
     27     )
     28     if not is_valid:
---> 29         raise ValueError("mask must be 3d np.ndarray with (n, H, W) shape")
     30 
     31 

ValueError: mask must be 3d np.ndarray with (n, H, W) shape
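
One plausible cause, not confirmed in this report, is that extract_relevant_masks returned an empty dictionary. In that case `np.array([...])` over an empty list has shape `(0,)` rather than `(n, H, W)`, which would trip the validation shown above. The sketch below illustrates this with a hypothetical helper, `stack_masks`, that always produces a 3D array. Whether masks_to_marks accepts an empty `(0, H, W)` array has not been verified, so checking for an empty result before calling it is the safer option.

```python
import numpy as np
from typing import Dict, Tuple


def stack_masks(masks: Dict[str, np.ndarray], image_shape: Tuple[int, int]) -> np.ndarray:
    """Stack per-mark 2D masks into one (n, H, W) boolean array.

    Returns an empty (0, H, W) array when no masks were selected, so the
    result is always three-dimensional.
    """
    height, width = image_shape
    if len(masks) == 0:
        return np.zeros((0, height, width), dtype=bool)
    return np.stack([np.asarray(m, dtype=bool) for m in masks.values()], axis=0)


# An empty selection built the way the reproduction does it is only 1D.
empty_selection = np.array([mask for mask in {}.values()])
print(empty_selection.shape)  # (0,)
```

Printing `masks.shape` right before the masks_to_marks call would show quickly whether this is what happens in the notebook.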

Additional

No response

Are you willing to submit a PR?

Yes I'd like to help by submitting a PR!

Proposed repository structure

Proposed Code Structure

Every prompting pipeline comes with a prompt_creator and a result_processor. You can instantiate those classes manually or call the pipeline function with a name argument.

from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Tuple
import numpy as np
import supervision as sv


class BasePromptCreator(ABC):
    @abstractmethod
    def create(self, image: np.ndarray, *args, **kwargs) -> Tuple[np.ndarray, sv.Detections]:
        """
        Create a prompt from an image and additional arguments.

        Args:
            image (np.ndarray): The input image.
            *args, **kwargs: Additional arguments.

        Returns:
            Tuple[np.ndarray, sv.Detections]: A tuple containing a processed image and detections.
        """
        pass


class BaseResultProcessor(ABC):
    @abstractmethod
    def process(self, text: str, marks: sv.Detections, *args, **kwargs) -> Dict[str, str]:
        """
        Process the results with given text and detections.

        Args:
            text (str): The input text.
            marks (sv.Detections): Detections to be used in processing.
            *args, **kwargs: Additional arguments.

        Returns:
            Dict[str, str]: Processed results.
        """
        pass


    @abstractmethod
    def visualize(self, text: str, image: np.ndarray, marks: sv.Detections, *args, **kwargs) -> np.ndarray:
        """
        Visualize the results on an image.

        Args:
            text (str): The input text.
            image (np.ndarray): The input image.
            marks (sv.Detections): Detections to be visualized.
            *args, **kwargs: Additional arguments.

        Returns:
            np.ndarray: The image with visualizations.
        """
        pass


class SamPromptCreator(BasePromptCreator):
    def __init__(self, device: str):
        self.device = device

    def create(self, image: np.ndarray, mask: Optional[np.ndarray] = None) -> Tuple[np.ndarray, sv.Detections]:
        pass


class SamResultProcessor(BaseResultProcessor):

    def process(self, text: str, marks: sv.Detections) -> List[str]:
        pass

    def visualize(self, text: str, image: np.ndarray, marks: sv.Detections) -> np.ndarray:
        pass


class GroundingDinoPromptCreator(BasePromptCreator):
    def __init__(self, device: str):
        self.device = device

    def create(self, image: np.ndarray, categories: List[str]) -> Tuple[np.ndarray, sv.Detections]:
        pass


class GroundingDinoResultProcessor(BaseResultProcessor):

    def process(self, text: str, marks: sv.Detections) -> Dict[str, str]:
        pass

    def visualize(self, text: str, image: np.ndarray, marks: sv.Detections) -> np.ndarray:
        pass


PIPELINES = {
    'sam': (SamPromptCreator, SamResultProcessor),
    'grounding-dino': (GroundingDinoPromptCreator, GroundingDinoResultProcessor)
}


def pipeline(name: str, **kwargs) -> Tuple[BasePromptCreator, BaseResultProcessor]:
    """Retrieves the prompt creator and result processor for the specified pipeline.

    Args:
        name (str): The name of the pipeline.
        **kwargs: Additional keyword arguments for initializing the classes.

    Returns:
        Tuple[BasePromptCreator, BaseResultProcessor]: Instances of the prompt creator and result processor.

    Raises:
        ValueError: If the pipeline name is not in the PIPELINES dictionary.
    """
    pipeline_classes = PIPELINES.get(name)

    if pipeline_classes is None:
        raise ValueError(f"Pipeline '{name}' not found. Please choose from {list(PIPELINES.keys())}.")

    PromptCreatorClass, ResultProcessorClass = pipeline_classes

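    prompt_creator = PromptCreatorClass(**kwargs)
    # The result processors above define no __init__, so passing kwargs such as
    # device would raise a TypeError; instantiate them without arguments.
    result_processor = ResultProcessorClass()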

    return prompt_creator, result_processor

Example Usage

LMM inference gets sandwiched between prompt_creator and result_processor calls.

import cv2
from maestro import pipeline, prompt_gpt4_vision

prompt_creator, result_processor = pipeline('sam', device='cuda')

image = cv2.imread('image.jpg')  # placeholder path; load your own image here
image_prompt, marks = prompt_creator.create(image=image)
text_prompt = 'Find dog.'
api_key = '...'

response = prompt_gpt4_vision(
    text_prompt=text_prompt, 
    image_prompt=image_prompt, 
    api_key=api_key)

visualization = result_processor.visualize(
    text=response, 
    image=image, 
    marks=marks)

Bug at generate marks

Search before asking

I have searched the Multimodal Maestro issues and found no similar bug report.

Bug

Traceback (most recent call last):
  File "/data/megvii/projects/Qwen-VL/scripts/test_maestro.py", line 7, in <module>
    marks = generator.generate(image=image)
  File "/data/Anaconda/anaconda3/envs/autogpt/lib/python3.10/site-packages/maestro/markers/sam.py", line 44, in generate
    return masks_to_marks(masks=masks)
  File "/data/Anaconda/anaconda3/envs/autogpt/lib/python3.10/site-packages/maestro/postprocessing/mask.py", line 187, in masks_to_marks
    return sv.Detections(
  File "<string>", line 8, in __init__
  File "/data/Anaconda/anaconda3/envs/autogpt/lib/python3.10/site-packages/supervision/detection/core.py", line 89, in __post_init__
    _validate_mask(mask=self.mask, n=n)
  File "/data/Anaconda/anaconda3/envs/autogpt/lib/python3.10/site-packages/supervision/detection/core.py", line 29, in _validate_mask
    raise ValueError("mask must be 3d np.ndarray with (n, H, W) shape")
ValueError: mask must be 3d np.ndarray with (n, H, W) shape

Environment

Ubuntu 20.04
python=3.10.10

Minimal Reproducible Example

No response

Additional

No response

Are you willing to submit a PR?

Yes I'd like to help by submitting a PR!

refine_marks

Search before asking

I have searched the Multimodal Maestro issues and found no similar feature requests.

Description

Thanks for creating this amazing package! The refine function is not very helpful in most instances. It would be highly beneficial to prompt GPT with the marks of the smaller masks that lie within larger ones, and also to visualize and extract them. Is it possible to bypass refine_marks?
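
As an illustration of the "smaller masks within larger ones" idea, the sketch below finds nested masks directly from a raw `(n, H, W)` boolean mask array, without any refinement step. The helper `find_nested_masks` and its `threshold` parameter are hypothetical and are not part of Multimodal Maestro.

```python
import numpy as np
from typing import List, Tuple


def find_nested_masks(masks: np.ndarray, threshold: float = 0.9) -> List[Tuple[int, int]]:
    """Return (inner, outer) index pairs where mask `inner` lies mostly inside mask `outer`.

    masks: boolean array of shape (n, H, W).
    threshold: minimum fraction of the inner mask's pixels covered by the outer mask.
    """
    n, height, width = masks.shape
    flat = masks.reshape(n, height * width).astype(np.int64)
    areas = flat.sum(axis=1)
    intersections = flat @ flat.T  # (n, n) matrix of pixel overlap counts

    pairs = []
    for inner in range(n):
        if areas[inner] == 0:
            continue  # skip empty masks to avoid division by zero
        for outer in range(n):
            if inner == outer or areas[inner] >= areas[outer]:
                continue  # the inner mask must be strictly smaller
            if intersections[inner, outer] / areas[inner] >= threshold:
                pairs.append((inner, outer))
    return pairs
```

The returned pairs could then be used to keep the small masks as separate marks instead of letting a refinement step discard them.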

Use case

No response

Additional

No response

Are you willing to submit a PR?

Yes I'd like to help by submitting a PR!

Minor spelling mistake:
