Git Product home page Git Product logo

Comments (2)

dosubot avatar dosubot commented on July 17, 2024 2

To resolve the error "'HuggingFaceEmbedding' object has no attribute '_model'" when using the HuggingFaceEmbedding class in your DSPy training pipeline, ensure that the _model attribute is correctly initialized in the HuggingFaceEmbedding class. Here is the relevant part of the code that shows how _model should be initialized within the __init__ method:

class HuggingFaceEmbedding(BaseEmbedding):
    tokenizer_name: str = Field(description="Tokenizer name from HuggingFace.")
    max_length: int = Field(
        default=DEFAULT_HUGGINGFACE_LENGTH, description="Maximum length of input.", gt=0
    )
    pooling: Pooling = Field(default=None, description="Pooling strategy.")
    normalize: bool = Field(default=True, description="Normalize embeddings or not.")
    query_instruction: Optional[str] = Field(
        description="Instruction to prepend to query text."
    )
    text_instruction: Optional[str] = Field(
        description="Instruction to prepend to text."
    )
    cache_folder: Optional[str] = Field(
        description="Cache folder for huggingface files."
    )

    _model: Any = PrivateAttr()
    _tokenizer: Any = PrivateAttr()
    _device: str = PrivateAttr()

    def __init__(
        self,
        model_name: Optional[str] = None,
        tokenizer_name: Optional[str] = None,
        pooling: Optional[str] = None,
        max_length: Optional[int] = None,
        query_instruction: Optional[str] = None,
        text_instruction: Optional[str] = None,
        normalize: bool = True,
        model: Optional[Any] = None,
        tokenizer: Optional[Any] = None,
        embed_batch_size: int = DEFAULT_EMBED_BATCH_SIZE,
        cache_folder: Optional[str] = None,
        trust_remote_code: bool = False,
        device: Optional[str] = None,
        callback_manager: Optional[CallbackManager] = None,
    ):
        try:
            from transformers import AutoModel, AutoTokenizer
        except ImportError:
            raise ImportError(
                "HuggingFaceEmbedding requires transformers to be installed.\n"
                "Please install transformers with `pip install transformers`."
            )

        self._device = device or infer_torch_device()

        cache_folder = cache_folder or get_cache_dir()
        if model is None:  # Use model_name with AutoModel
            model_name = (
                model_name
                if model_name is not None
                else DEFAULT_HUGGINGFACE_EMBEDDING_MODEL
            )
            model = AutoModel.from_pretrained(
                model_name, cache_dir=cache_folder, trust_remote_code=trust_remote_code
            )
        elif model_name is None:  # Extract model_name from model
            model_name = model.name_or_path
        self._model = model.to(self._device)

        if tokenizer is None:  # Use tokenizer_name with AutoTokenizer
            tokenizer_name = (
                model_name or tokenizer_name or DEFAULT_HUGGINGFACE_EMBEDDING_MODEL
            )
            tokenizer = AutoTokenizer.from_pretrained(
                tokenizer_name, cache_dir=cache_folder
            )
        elif tokenizer_name is None:  # Extract tokenizer_name from model
            tokenizer_name = tokenizer.name_or_path
        self._tokenizer = tokenizer

        if max_length is None:
            try:
                max_length = int(self._model.config.max_position_embeddings)
            except AttributeError as exc:
                raise ValueError(
                    "Unable to find max_length from model config. Please specify max_length."
                ) from exc

        if not pooling:
            pooling = get_pooling_mode(model_name)
        try:
            pooling = Pooling(pooling)
        except ValueError as exc:
            raise NotImplementedError(
                f"Pooling {pooling} unsupported, please pick one in"
                f" {[p.value for p in Pooling]}."
            ) from exc

        super().__init__(
            embed_batch_size=embed_batch_size,
            callback_manager=callback_manager,
            model_name=model_name,
            tokenizer_name=tokenizer_name,
            max_length=max_length,
            pooling=pooling,
            normalize=normalize,
            query_instruction=query_instruction,
            text_instruction=text_instruction,
        )

Ensure that the AutoModel initialization is correctly placed within the __init__ method and that the _model attribute is properly set. This should resolve the attribute error [1][2].

To continue talking to Dosu, mention @dosu.

from llama_index.

theta-lin avatar theta-lin commented on July 17, 2024

@tituslhy Hi, I am also trying to use HuggingFaceEmbedding with DSPy. Different from the cookbook, I simply used vector retriever inside a DSPy module. Here is a even more minimal example to demonstrate this issue:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
import dspy
from dspy.teleprompt import BootstrapFewShot


class Rag(dspy.Module):
    def __init__(self):
        super().__init__()
        reader = SimpleDirectoryReader(input_files=["paul_graham_essay.txt"])
        docs = reader.load_data()
        index = VectorStoreIndex.from_documents(docs)
        self.retriever = index.as_retriever()

    def forward(self, question):
        return dspy.Prediction(answer=str(self.retriever.retrieve(question)))


Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5", trust_remote_code=True
)
Settings.llm = None

teleprompter = BootstrapFewShot()
train_examples = [
    dspy.Example(
        question="What did the author do growing up?",
        answer="The author wrote short stories and also worked on programming.",
    ).with_inputs("question"),
    dspy.Example(
        question="What did the author do during his time at YC?",
        answer="organizing a Summer Founders Program, funding startups, writing essays, working on a new version of Arc, creating Hacker News, and developing internal software for YC",
    ).with_inputs("question"),
]
teleprompter.compile(Rag(), trainset=train_examples)

Just download the dataset with

wget https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt -O paul_graham_essay.txt

and you should be able to run the example even without an LLM.

This gives the output of

2024-07-02T08:47:05.960695Z [error    ] Failed to run or to evaluate example Example({'question': 'What did the author do growing up?', 'answer': 'The author wrote short stories and also worked on programming.'}) (input_keys={'question'}) with None due to 'HuggingFaceEmbedding' object has no attribute '_model'. [dspy.teleprompt.bootstrap] filename=bootstrap.py lineno=211
2024-07-02T08:47:05.961326Z [error    ] Failed to run or to evaluate example Example({'question': 'What did the author do during his time at YC?', 'answer': 'organizing a Summer Founders Program, funding startups, writing essays, working on a new version of Arc, creating Hacker News, and developing internal software for YC'}) (input_keys={'question'}) with None due to 'HuggingFaceEmbedding' object has no attribute '_model'. [dspy.teleprompt.bootstrap] filename=bootstrap.py lineno=211

on my side.

While I haven't done more investigation yet, I think this should be related to another issue I opened some time ago #13956. On a related issue #14236, a developer mentioned that you cannot multiprocess with a local embedding model. Therefore, I suspect that maybe there is some kind of pickling (as multiprocessing is not used by DSPy, I think) in teleprompter.compile() that has something to do with this.

from llama_index.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.