xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings

License: Apache License 2.0

Python 100.00%

embeddings information-retrieval language-model text-classification text-clustering text-embedding text-evaluation text-semantic-similarity prompt-retrieval text-reranking

instructor-embedding's Introduction

My Personal Fork

This is a fork of the Instructor model repository, created because the original repository is no longer maintained. I've also made some improvements to the source code:

Fixed it to work with sentence-transformers versions above 2.2.2.
Models are now downloaded properly from Hugging Face using the new "snapshot download" API.
Added the ability to specify where the model is downloaded with the "cache_dir" parameter.

What follows is the original repository's README file. Ignore the quantization section, however, because PyTorch has changed its API since then.

One Embedder, Any Task: Instruction-Finetuned Text Embeddings

This repository contains the code and pre-trained models for our paper One Embedder, Any Task: Instruction-Finetuned Text Embeddings. Please refer to our project page for a quick project overview.

We introduce Instructor👨‍🏫, an instruction-finetuned text embedding model that can generate text embeddings tailored to any task (e.g., classification, retrieval, clustering, text evaluation, etc.) and domain (e.g., science, finance, etc.) by simply providing the task instruction, without any finetuning. Instructor👨‍🏫 achieves state-of-the-art (SOTA) results on 70 diverse embedding tasks!

**************************** Updates ****************************

01/21: We updated the code structure, which supports easy package installation.
12/28: We updated the checkpoint with hard negatives.
12/20: We released our paper, code, project page and checkpoint. Check them out!

Installation

INSTRUCTOR makes it easy to compute text embeddings. You can try it out in the Colab notebook. On your local machine, we recommend first creating a virtual environment:

conda create -n instructor python=3.7
git clone https://github.com/HKUNLP/instructor-embedding
cd instructor-embedding
pip install -r requirements.txt

That will create the instructor environment we used. To use the embedding tool, first install the InstructorEmbedding package from PyPI:

pip install InstructorEmbedding

or directly install it from our code

pip install -e .

Environment setup

Activate the environment by running

conda activate instructor

Getting Started

First, download a pretrained model (see the model list for all available models):

from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-large')

Then provide the sentence and customized instruction to the model.

# prepare texts with instructions
text_instruction_pairs = [
    {"instruction": "Represent the Science title:", "text": "3D ActionSLAM: wearable person tracking in multi-floor environments"},
    {"instruction": "Represent the Medicine sentence for retrieving a duplicate sentence:", "text": "Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear."}
]

# postprocess
texts_with_instructions = []
for pair in text_instruction_pairs:
    texts_with_instructions.append([pair["instruction"], pair["text"]])

# calculate embeddings
customized_embeddings = model.encode(texts_with_instructions)

And that's it already. We now have a list of numpy arrays with the embeddings.

for pair, embedding in zip(text_instruction_pairs, customized_embeddings):
    print("Instruction: ", pair["instruction"])
    print("text: ", pair["text"])
    print("Embedding: ", embedding)
    print("")
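The postprocessing step shown earlier can also be written as a small helper. The sketch below is only an illustration: `to_encode_input` is a hypothetical name, not part of the InstructorEmbedding package, and it does nothing more than reshape the dictionaries into the `[instruction, text]` lists that `encode` expects.

```python
from typing import Dict, List


def to_encode_input(pairs: List[Dict[str, str]]) -> List[List[str]]:
    """Convert {"instruction", "text"} dicts into [instruction, text] lists."""
    return [[pair["instruction"], pair["text"]] for pair in pairs]


# A single example pair taken from the snippet above.
example_pairs = [
    {
        "instruction": "Represent the Science title:",
        "text": "3D ActionSLAM: wearable person tracking in multi-floor environments",
    },
]

encode_input = to_encode_input(example_pairs)
print(encode_input)
```

The resulting list can then be passed to `model.encode(...)` in place of `texts_with_instructions`.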

The `encode` function

The users of the model need to use only the encode function:

model.encode( sentences,
              batch_size: int = 32,
              show_progress_bar: bool = None,
              output_value: str = 'sentence_embedding',
              convert_to_numpy: bool = True,
              convert_to_tensor: bool = False,
              device: str = None,
              normalize_embeddings: bool = False)

sentences: The sentences to be embedded. It should be in the format of [["instruction prompt 0", "text to be embedded 0"], ["instruction prompt 1", "text to be embedded 1"], ...].
batch_size (default: 32): The batch size used for the computation. It determines the number of sentences processed together in each batch.
show_progress_bar (default: None): If set to True, it displays a progress bar while encoding sentences, providing a visual indication of the encoding progress.
output_value (default: 'sentence_embedding'): Specifies the desired output type. The default value 'sentence_embedding' returns sentence embeddings. Setting it to 'token_embeddings' returns wordpiece token embeddings. Setting it to None returns all output values.
convert_to_numpy (default: True): If set to True, the output is a list of numpy vectors. If set to False, the output is a list of PyTorch tensors.
convert_to_tensor (default: False): If set to True, the function returns a stacked tensor as a single output. This parameter overrides any setting specified by convert_to_numpy.
device (default: None): Specifies the torch.device to use for the computation. If not specified, the function uses the default device.
normalize_embeddings (default: False): If set to True, the returned vectors will have a length of 1, indicating that they are normalized. In this case, similarity search would use the faster dot-product (util.dot_score), instead of cosine similarity.
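To see why normalized embeddings allow the dot product to stand in for cosine similarity, here is a minimal pure-Python sketch. The helper names and the toy vectors are made up for illustration; they are not part of the InstructorEmbedding or sentence-transformers APIs.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity computed on the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for two embeddings.
u = [3.0, 4.0]
v = [1.0, 2.0]
u_hat, v_hat = l2_normalize(u), l2_normalize(v)

# After normalization, the dot product matches the cosine similarity.
print(dot(u_hat, v_hat), cosine(u, v))
```

Because the norms are already divided out, a similarity search over normalized vectors can skip the per-pair norm computation.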

Model List

We released a series of INSTRUCTOR checkpoints of different sizes. You can easily load these models with the InstructorEmbedding package.

Model	Avg. Score
hkunlp/instructor-base	55.9
hkunlp/instructor-large	58.4
hkunlp/instructor-xl	58.8

Use Cases

We provide a few specific use cases below. For more examples and applications, refer to our paper.

Calculate embeddings for your customized texts

If you want to calculate customized embeddings for specific sentences, you may follow the unified template to write instructions:

Represent the domain text_type for task_objective:

domain is optional, and it specifies the domain of the text, e.g., science, finance, medicine, etc.
text_type is required, and it specifies the encoding unit, e.g., sentence, document, paragraph, etc.
task_objective is optional, and it specifies the objective of embedding, e.g., retrieve a document, classify the sentence, etc.
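The template above can be assembled programmatically. The following is a small sketch under the assumption that the three fields are simply joined with spaces; `build_instruction` is a hypothetical helper, not a function provided by the package.

```python
from typing import Optional


def build_instruction(text_type: str,
                      domain: Optional[str] = None,
                      task_objective: Optional[str] = None) -> str:
    """Fill the template 'Represent the domain text_type for task_objective:'."""
    parts = ["Represent the"]
    if domain:                      # optional: e.g. "Science", "Medicine"
        parts.append(domain)
    parts.append(text_type)         # required: e.g. "sentence", "document"
    if task_objective:              # optional: e.g. "retrieval", "clustering"
        parts.append(f"for {task_objective}")
    return " ".join(parts) + ":"


print(build_instruction("title", domain="Science"))
print(build_instruction("sentence", "Medicine", "retrieving a duplicate sentence"))
print(build_instruction("document"))
```

The first two calls reproduce the instructions used in the Getting Started example.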

Compute similarities between texts

You can use INSTRUCTOR to compute similarities between two groups of sentences, with customized embeddings.

from sklearn.metrics.pairwise import cosine_similarity
sentences_a = [['Represent the Science sentence: ','Parton energy loss in QCD matter'], 
               ['Represent the Financial statement: ','The Federal Reserve on Wednesday raised its benchmark interest rate.']]
sentences_b = [['Represent the Science sentence: ','The Chiral Phase Transition in Dissipative Dynamics'],
               ['Represent the Financial statement: ','The funds rose less than 0.5 per cent on Friday']]
embeddings_a = model.encode(sentences_a)
embeddings_b = model.encode(sentences_b)
similarities = cosine_similarity(embeddings_a,embeddings_b)

Use customized embeddings for information retrieval

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
query  = [['Represent the Wikipedia question for retrieving supporting documents: ','where is the food stored in a yam plant']]
corpus = [['Represent the Wikipedia document for retrieval: ','Capitalism has been dominant in the Western world since the end of feudalism, but most feel[who?] that the term "mixed economies" more precisely describes most contemporary economies, due to their containing both private-owned and state-owned enterprises. In capitalism, prices determine the demand-supply scale. For example, higher demand for certain goods and services lead to higher prices and lower demand for certain goods lead to lower prices.'],
          ['Represent the Wikipedia document for retrieval: ',"The disparate impact theory is especially controversial under the Fair Housing Act because the Act regulates many activities relating to housing, insurance, and mortgage loans—and some scholars have argued that the theory's use under the Fair Housing Act, combined with extensions of the Community Reinvestment Act, contributed to rise of sub-prime lending and the crash of the U.S. housing market and ensuing global economic recession"],
          ['Represent the Wikipedia document for retrieval: ','Disparate impact in United States labor law refers to practices in employment, housing, and other areas that adversely affect one group of people of a protected characteristic more than another, even though rules applied by employers or landlords are formally neutral. Although the protected classes vary by statute, most federal civil rights laws protect based on race, color, religion, national origin, and sex as protected traits, and some laws include disability status and other traits as well.']]
query_embeddings = model.encode(query)
corpus_embeddings = model.encode(corpus)
similarities = cosine_similarity(query_embeddings,corpus_embeddings)
retrieved_doc_id = np.argmax(similarities)
print(retrieved_doc_id)
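The snippet above returns only the single best document. If a ranked list is more useful, the scores can be sorted instead. The sketch below uses made-up similarity values in place of a row of `similarities`, and `rank_documents` is a hypothetical helper name.

```python
from typing import List, Tuple


def rank_documents(scores: List[float], top_k: int = 3) -> List[Tuple[int, float]]:
    """Return (doc_id, score) pairs sorted from most to least similar."""
    ranked = sorted(enumerate(scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Made-up scores standing in for similarities[0] (one query, three documents).
toy_scores = [0.71, 0.83, 0.78]
top_two = rank_documents(toy_scores, top_k=2)
print(top_two)  # [(1, 0.83), (2, 0.78)]
```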

Use customized embeddings for clustering

import sklearn.cluster
sentences = [['Represent the Medicine sentence for clustering: ','Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity'],
             ['Represent the Medicine sentence for clustering: ','Comparison of Atmospheric Neutrino Flux Calculations at Low Energies'],
             ['Represent the Medicine sentence for clustering: ','Fermion Bags in the Massive Gross-Neveu Model'],
             ['Represent the Medicine sentence for clustering: ',"QCD corrections to Associated t-tbar-H production at the Tevatron"],
             ['Represent the Medicine sentence for clustering: ','A New Analysis of the R Measurements: Resonance Parameters of the Higher,  Vector States of Charmonium']]
embeddings = model.encode(sentences)
clustering_model = sklearn.cluster.MiniBatchKMeans(n_clusters=2)
clustering_model.fit(embeddings)
cluster_assignment = clustering_model.labels_
print(cluster_assignment)
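The printed `cluster_assignment` is just an array of cluster ids, one per input. To inspect the result, it can help to group the original texts by their assigned id. This is a minimal sketch with toy texts and labels; `group_by_cluster` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_cluster(texts: Iterable[str], labels: Iterable[int]) -> Dict[int, List[str]]:
    """Collect the texts that share the same cluster label."""
    clusters: Dict[int, List[str]] = defaultdict(list)
    for text, label in zip(texts, labels):
        clusters[int(label)].append(text)  # int() also accepts numpy integer labels
    return dict(clusters)


# Toy inputs standing in for the sentence texts and clustering_model.labels_.
toy_texts = ["title A", "title B", "title C"]
toy_labels = [0, 1, 0]
grouped = group_by_cluster(toy_texts, toy_labels)
print(grouped)  # {0: ['title A', 'title C'], 1: ['title B']}
```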

Training

Data

We construct Multitask Embeddings Data with Instructions (MEDI), consisting of a collection of 330 datasets from Super-NI (Super-NaturalInstructions), sentence-transformer embedding training data, KILT and MedMCQA, spanning a wide range of domains and tasks. We construct positive and negative pairs if they are not provided, and store them in a unified format:

[
    {'query': ['Represent the Wikipedia question for retrieving relevant documents;', 'big little lies season 2 how many episodes'], 'pos': ['Represent the Wikipedia document for retrieval;', 'Big Little Lies (TV series) series garnered several accolades. It received 16 Emmy Award nominations and won eight, including Outstanding Limited Series and acting awards for Kidman, Skarsgård, and Dern. The trio also won Golden Globe Awards in addition to a Golden Globe Award for Best Miniseries or Television Film win for the series. Kidman and Skarsgård also received Screen Actors Guild Awards for their performances. Despite originally being billed as a miniseries, HBO renewed the series for a second season. Production on the second season began in March 2018 and is set to premiere in 2019. All seven episodes are being written by Kelley'], 'neg': ['Represent the Wikipedia document for retrieval;', 'Little People, Big World final minutes of the season two-A finale, "Farm Overload". A crowd had gathered around Jacob, who was lying on the ground near the trebuchet. The first two episodes of season two-B focus on the accident, and how the local media reacted to it. The first season of "Little People, Big World" generated solid ratings for TLC (especially in the important 18–49 demographic), leading to the show\'s renewal for a second season. Critical reviews of the series have been generally positive, citing the show\'s positive portrayal of little people. Conversely, other reviews have claimed that the show has a voyeuristic bend'], 'task_id': 1}
    {'query': ['Represent the Wikipedia question for retrieving relevant documents;', 'who sang waiting for a girl like you'], 'pos': ['Represent the Wikipedia document for retrieval;', 'Waiting for a Girl Like You Waiting for a Girl Like You "Waiting for a Girl Like You" is a 1981 power ballad by the British-American rock band Foreigner. The distinctive synthesizer theme was performed by the then-little-known Thomas Dolby, and this song also marked a major departure from their earlier singles because their previous singles were mid to upper tempo rock songs while this song was a softer love song with the energy of a power ballad. It was the second single released from the album "4" (1981) and was co-written by Lou Gramm and Mick Jones. It has become one of the band\'s most'], 'neg': ['Represent the Wikipedia document for retrieval;', 'Waiting for a Girl Like You held off the number 1 spot by Olivia Newton-John\'s single "Physical" for nine consecutive weeks, and then by Hall & Oates\' "I Can\'t Go for That (No Can Do)" for a tenth week on January 30, 1982. Because of its chart longevity, it ended up being the number 19 song on the Top 100 singles of 1982. The song was the band\'s biggest hit until "I Want to Know What Love Is" hit number 1 in 1985. The song lists at number 100 on ""Billboard"\'s Greatest Songs of All Time". Waiting for a Girl Like You "Waiting for a Girl'], 'task_id': 1}
    ...
    {'query': ['Represent the Wikipedia sentence for retrieving relevant documents;', 'i LOVE sweet martini drinks!'], 'pos': ['Represent the Wikipedia document for retrieval;', "Appletini Appletini\nAn Apple martini (Appletini for short) is a cocktail containing vodka and one or more of apple juice, apple cider, apple liqueur, or apple brandy.\nThis drink, originally called an Adam's Apple Martini because the bartender who created it was named Adam, was created in 1996 at Lola's West Hollywood restaurant.\nThe drink, Adam's Apple was advertised by Smirnoff in the July 1972 issue of Playboy Magazine to the inside front cover. The recipe called for an ounce or so of Smirnoff"], 'neg': ['Represent the Wikipedia document for retrieval;', "Aromatised wine similar beverages described in this legislation are 'aromatised wine-based drinks' (non-fortified) and 'aromatised wine-product cocktail' (blended, lower alcohol drink under 7% ABV).\nVarieties of aromatised wine.\nVarieties of aromatised wine Vermouth.\nVermouth is the most widely used aromatised wine due to its use in cocktails and famous commercial brands such as Martini and Cinzano which are commonplace around the world. Vermouth can be sweet or dry and red, white, pink or orange. It is traditionally"], 'task_id': 300}
]

Each instance consists of a query, a positive pair, a negative pair and the id of the task, which is used to ensure data in the same training batch are from the same task. The MEDI data is available to be downloaded at this link.
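To illustrate how `task_id` can keep each training batch within a single task, here is a simplified sketch. It is not the batching logic from `train.py`; the function name and the choice to keep incomplete batches are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List


def batches_by_task(examples: List[dict], batch_size: int) -> List[List[dict]]:
    """Group examples by task_id, then cut each group into batches."""
    by_task: Dict[int, List[dict]] = defaultdict(list)
    for example in examples:
        by_task[example["task_id"]].append(example)

    batches = []
    for task_examples in by_task.values():
        for start in range(0, len(task_examples), batch_size):
            # Assumption: the last, possibly smaller, batch is kept.
            batches.append(task_examples[start:start + batch_size])
    return batches


# Toy examples that carry only the task_id field.
toy_examples = [{"task_id": 1}, {"task_id": 300}, {"task_id": 1},
                {"task_id": 1}, {"task_id": 300}]
toy_batches = batches_by_task(toy_examples, batch_size=2)
print([[ex["task_id"] for ex in batch] for batch in toy_batches])
```

Every batch produced this way contains examples from exactly one task, so in-batch negatives come from the same task as the query.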

Train INSTRUCTOR

We provide the example script for training INSTRUCTOR. You may need to first download the MEDI data, unzip the folder and put medi-data.json under --cache_dir.

python train.py --model_name_or_path sentence-transformers/gtr-t5-large --output_dir {output_directory} --cache_dir {cache_directory} --max_source_length 512 --num_train_epochs 10 --save_steps 500 --cl_temperature 0.1 --warmup_ratio 0.1 --learning_rate 2e-5 --overwrite_output_dir

We explain the arguments in the following:

--model_name_or_path: Pretrained checkpoint to start with. We support either a model id (e.g., sentence-transformers/gtr-t5-large, sentence-transformers/sentence-t5-large) or a checkpoint path (e.g., a checkpoint saved by the transformers trainer).
--cl_temperature: Temperature for the contrastive loss.
--cache_dir: The directory to cache downloaded models and data. The downloaded MEDI data (medi-data.json) should be placed in this directory.
--output_dir: The directory to store the trained models (checkpoints) for evaluation.

All other arguments are standard Hugging Face transformers training arguments, such as --overwrite_output_dir, --num_train_epochs, and --learning_rate. For details, refer to the Hugging Face transformers documentation.
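To give a feel for what `--cl_temperature` controls, the sketch below computes a temperature-scaled contrastive loss for a single query with one positive and a few negatives. It is a simplified illustration, not the objective implemented in `train.py`, which may differ (for example, in how in-batch negatives are used). The function name and similarity values are made up.

```python
import math
from typing import List


def contrastive_loss(pos_score: float, neg_scores: List[float], temperature: float) -> float:
    """Negative log-softmax of the positive among positive + negatives."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[0]


# Same similarities, two temperatures: a lower temperature sharpens the softmax.
loss_sharp = contrastive_loss(0.8, [0.5, 0.3], temperature=0.01)
loss_soft = contrastive_loss(0.8, [0.5, 0.3], temperature=1.0)
print(loss_sharp, loss_soft)
```

With the positive already ranked first, the low-temperature loss is close to zero, while the high-temperature loss still penalizes the small margin.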

Evaluation

We evaluate INSTRUCTOR extensively on 70 diverse tasks spanning a wide range of domains. Specifically, we build our evaluation on three benchmarks: MTEB, Billboard, and Prompt Retrieval. We explain how to run the evaluation scripts below.

MTEB

To evaluate model performance on the MTEB benchmark, first install the MTEB library:

cd evaluation/MTEB
pip install -e .

Then run the following command:

python examples/evaluate_model.py --model_name hkunlp/instructor-large --output_dir outputs --task_name ArguAna --result_file results

You can evaluate your trained model checkpoints by specifying --model_name and run all MTEB datasets by changing --task_name. Check our paper or MTEB benchmark for evaluation metrics of all tasks.

Billboard

To evaluate the model performance on Billboard, run the following command:

cd evaluation/text_evaluation
python main.py --model_name hkunlp/instructor-large --task mscoco --add_prompt

You can evaluate your trained model checkpoints by specifying --model_name and run all Billboard datasets by changing --task. For all three Billboard datasets, we report the Pearson correlation.
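For readers unfamiliar with the metric, the Pearson correlation between model scores and human judgments can be computed as in the sketch below. This is a generic pure-Python implementation with made-up numbers, not the evaluation script's own code.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up scores: the human ratings are an exact linear function of the model scores.
model_scores = [0.2, 0.5, 0.9]
human_scores = [1.0, 2.5, 4.5]
print(pearson(model_scores, human_scores))
```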

Prompt Retrieval

To evaluate the model performance on Prompt Retrieval, run the following command:

cd evaluation/prompt_retrieval
python main.py --embedding_model hkunlp/instructor-large --task rte --model_cache_dir {cache_dir} --output_dir {output_dir} --add_prompt

You can evaluate your trained model checkpoints by specifying --embedding_model and run prompt retrieval datasets by changing --task. To have a consistent metric, we cast all tasks in Prompt Retrieval into a "text-to-text" format and report the Rouge-L score.
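As a rough illustration of the metric, Rouge-L is based on the longest common subsequence (LCS) between a prediction and a reference. The sketch below computes a simplified F1 variant with whitespace tokenization; the evaluation script may use a different tokenizer or weighting, so treat this only as an approximation of the idea.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """Simplified Rouge-L: F1 of LCS-based precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("entailment", "entailment"))
print(rouge_l_f1("not entailment", "entailment"))
```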

Quantization

To quantize the Instructor embedding model, run the following code:

# imports 
import torch
from InstructorEmbedding import INSTRUCTOR

# load the model 
model = INSTRUCTOR('hkunlp/instructor-large', device='cpu')  # you can use GPU

# quantize the model 
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)

# Inference 
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Science title:"

embeddings = qmodel.encode([[instruction,sentence]])  
# you can also normalize the embeddings:  normalize_embeddings=True 

print(f"Quantized Embeddings:\n {embeddings}")

This reduces the model size by 10x, and inference is faster than with the unquantized model :)

Bugs or questions?

If you have any questions related to the code or the paper, feel free to email Hongjin ([email protected]) and Weijia ([email protected]). Please describe the problem in detail so we can help you better and more quickly.

Citation

If you find our work helpful, please cite us:

@inproceedings{INSTRUCTOR,
  title={One Embedder, Any Task: Instruction-Finetuned Text Embeddings},
  author={Su, Hongjin and Shi, Weijia and Kasai, Jungo and Wang, Yizhong and Hu, Yushi and  Ostendorf, Mari and Yih, Wen-tau and Smith, Noah A. and  Zettlemoyer, Luke and Yu, Tao},
  url={https://arxiv.org/abs/2212.09741},
  year={2022},
}

INSTRUCTOR Elsewhere

We thank the community for its efforts in extending INSTRUCTOR!

LangChain supports InstructEmbeddings, which use the INSTRUCTOR model.
MosaicML has included Instructor-Large and Instructor-XL.
embaas integrated Instructor-Large.
Haystack includes InstructorTextEmbedder and InstructorDocumentEmbedder components.

instructor-embedding's Issues

Plans for releasing distilled or smaller models than base* ?

Congrats on the awesome work! Any plans to distill the models in the future?

GPU memory leak.

I am doing batch inference over a very large dataset, and I see that over time I slowly run out of memory (OOM) even though I am deleting all assigned variables. Here is the code:

import gc

for batch_number in tqdm(batch_numbers):
  ids = []
  inputs = []
  for x in sampled_input_batched.filter(col('batch') == batch_number).collect():
    ids.append(x[0])
    inputs.append(x[1])
  
  output = instructor_model.encode(inputs, batch_size=BATCH_SIZE)

  temp_df = spark.createDataFrame(pd.DataFrame({
  'reviewid' : ids,
  'instructor_embeddings' : output
  }))

  temp_df.write.parquet(f"{inference_folder}/temp_inference_output_batch_{batch_number}.parquet")

  with torch.no_grad():
    del ids
    del inputs
    del output
    del temp_df
    torch.cuda.empty_cache()
    gc.collect()

However, after each iteration of the loop, some residual memory is retained by the GPU. The model takes 5685 MB of memory, but after each loop this number increases slightly, so after enough loops I run out of memory. Could you tell me where the memory leak may be?

Code embeddings

I would like to finetune this model for code embeddings. Have you tried this before? Any suggestions on how to proceed? Do we need hard negatives, or can we use in-batch negatives?

Example for classification task

Hi,
I was wondering if you have an example code snippet to use on a text classification task. Thank you.

Is it possible to use the Instructor model directly via sentence-transformers?

It would be very useful, e.g. to train a cross-encoder.

KeyError: 'task_id'

Hi! I am trying to train instructor-embedding and ran into the error shown in the title. More specifically,

Traceback (most recent call last):
  File "train.py", line 577, in <module>
    main()
  File "train.py", line 450, in main
    print(f'one batch in task {old_train_examples_raw[idx1]["task_id"]} is skipped')
KeyError: 'task_id'

I have downloaded data and put them right in the cache_dir. And here is my running script:

# train the model
model_name=hkunlp/instructor-base
sentence_model_name=sentence-transformers/gtr-t5-base
output_dir=outputs
data_dir=medi-data

python train.py \
    --model_name_or_path=${sentence_model_name} \
    --output_dir=${output_dir} \
    --cache_dir=${data_dir} \
    --max_source_length=512 \
    --num_train_epochs=10 \
    --save_steps=500 \
    --cl_temperature=0.01 \
    --warmup_ratio=0.1 \
    --learning_rate=2e-5 \
    --overwrite_output_dir

show_progress_bar = True does not work properly in notebook

Hi,

Thank you for this model.
I'm running model.encode() in a Jupyter notebook on VSCode. It works fine at generating the embeddings, however when I try using the progress bar (show_progress_bar = True), no progress bar shows up in the VS Code notebook cell output, and the cell also hangs (it does not finish running, even when the embeddings have been generated).

When I try interrupting the cell in VSC, it says: "Interrupting the kernel timed out. Do you want to restart the kernel instead? All variables will be lost." After restarting the kernel, the output shows up.

Another potentially relevant piece of context is that when I initialize the model, I get the following warning

TqdmExperimentalWarning: Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)
  from tqdm.autonotebook import trange

I'm not sure what's causing this behavior, but wondering if you have any thoughts/suggestions. Thanks!

Details about the Training Configurations

Thanks for your very exciting work and nice publicly released code and data. It is a very very great work in sentence representation learning!

I would very much like to follow this work and re-implement the training process, but I am not clear about the per-device batch size and total batch size used when tuning the large and XL models. I am also a bit confused about the actual number of training steps for all the models. Could you please answer these questions? I would like to estimate how many V100 32G GPUs are required based on them :)

Model Fine-tune for Classification Mission?

Hello! We have an automatic classification task for engineering construction safety cases, and we want to fine-tune your model using our dataset. However, we could not find any instructions on your web page. Could you please tell me how to do that?

Does embedding pooling include instructions?

In section 2.1 of your paper, you indicate

Given an input text x and a task instruction I_x, INSTRUCTOR encodes their concatenation Ix ⊕ x. We then generate a fixed-sized, task-specific embedding E_I (I_x, x) by applying mean pooling to the last hidden representations over the tokens in x

I interpret this to mean that you average only over x and ignore the instruction embeddings.

I see two different kinds of model definitions in this repo: INSTRUCTOR_Transformer and INSTRUCTOR. It looks like the first one averages over x only and the second one will average over the instruction and x. Any reason why the second model averages over the whole sequence instead of just x? It appears that INSTRUCTOR_Transformer is not used anywhere else in this repo.
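To make the two pooling variants in this question concrete, here is a toy sketch of mean pooling with and without the instruction tokens. It uses made-up hidden states and does not reproduce the repository's INSTRUCTOR or INSTRUCTOR_Transformer classes; `mean_pool` is a hypothetical helper.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], mask: List[int]) -> List[float]:
    """Average only the token vectors whose mask value is 1."""
    kept = [vec for vec, keep in zip(token_embeddings, mask) if keep]
    dim = len(token_embeddings[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy hidden states: 2 instruction tokens followed by 2 text tokens.
hidden = [[1.0, 1.0], [3.0, 3.0], [10.0, 0.0], [20.0, 0.0]]
instruction_len = 2

all_tokens_mask = [1, 1, 1, 1]
text_only_mask = [0] * instruction_len + [1] * (len(hidden) - instruction_len)

pooled_all = mean_pool(hidden, all_tokens_mask)         # instruction + text
pooled_text_only = mean_pool(hidden, text_only_mask)    # text tokens only
print(pooled_all, pooled_text_only)
```

The two results differ whenever the instruction tokens' hidden states differ from those of the text, which is why the choice of mask matters for reproducing the paper's description.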

Input Length / Accuracy

Do you have any data on the performance given a range of input lengths? I'm working on neural search, and I came across instructor-xl as a potential replacement for text-embedding-ada-002, which has a context window of 8,191 tokens. Can instructor-xl handle that length without degrading? Any longer?

Issue 12 touched on this but didn't provide many details.

My immediate use is cosine similarity for search but I also have a need for clustering and categorization. Any info you can provide regarding the context length in relation to these use-cases will be super helpful and appreciated.

For anyone else reading this trying to compare the model to ada, here's a bit of discussion: UKPLab/sentence-transformers#1897

and related benchmarking: https://huggingface.co/spaces/mteb/leaderboard

hard negative dataset

is the dataset used in the hard negative commit changes available for download somewhere?

i'm assuming the MEDI instructions are different from the dataset originally released, since the eval instructions are different (e.g. dropping ; Input:)

Trouble running training jobs

Hi HKUNLP,

First off, really awesome paper on leveraging instructions to improve embedding quality across domains and tasks.

I am trying to train a model by following the training directions. I downloaded the MTEB dataset, installed the requirements, and ran the training job, but I keep running into this error:
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.

Any idea why this is happening?

Thanks!

Install fails, dependencies don't get installed

I installed this package via:

pip install InstructorEmbedding

Then I try to run the quickstart. However, this yields an error on import.

---------------------------------------------------------------------------

ModuleNotFoundError                       Traceback (most recent call last)

[<ipython-input-2-479956c3d0e9>](https://localhost:8080/#) in <module>
----> 1 from InstructorEmbedding import INSTRUCTOR
      2 model = INSTRUCTOR('hkunlp/instructor-large')
      3 sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
      4 instruction = "Represent the Science title:"
      5 

1 frames

[/usr/local/lib/python3.8/dist-packages/InstructorEmbedding/instructor.py](https://localhost:8080/#) in <module>
      7 from tqdm.autonotebook import trange
      8 from torch import Tensor, device
----> 9 from sentence_transformers import SentenceTransformer
     10 from sentence_transformers.models import Transformer
     11 from transformers import AutoConfig

ModuleNotFoundError: No module named 'sentence_transformers'

It seems this project doesn't install its dependencies.

multilingual model

Thank you for this wonderful model. I have a question for you: Do you have any plans to develop a multilingual version of INSTRUCTOR?

Install MTEB

Hi,
Great work! I'm trying to reproduce the evaluation results on MTEB, but an error occurs when I install evaluation/MTEB with the commands cd evaluation/MTEB and pip install -e . :
OSError: [Errno 2] No such file or directory: '/tmp/tmpni4r6sx4/output.json'

Can you give some advice to fix this error? Thank you!

Other model bases?

Thanks for this repo! I noticed that Instructor is initialised from T5 / GTR.

In your experiments, how crucial is this base? Has training, say, an MPNET or BERT model with this dataset been explored?

Just wondering before I bite the bullet on this!

Set embedding vector dimension

Hi, how are you?
Is there a way to set the dimension of the returned vector, or is it always fixed to 768?
(I'm very new to this)

Thanks,
Fran

docs: instructions syntax

Here is what I have noted for myself.

The syntax to write instructions: "Represent the domain text_type for task_objective: ", where:

domain is optional, and it specifies the domain of the text, e.g., science, finance, medicine, etc.
text_type is required, and it specifies the encoding unit, e.g., sentence, document, paragraph, etc.
task_objective is optional, and it specifies the objective of embedding, e.g., retrieve a document, classify the sentence, etc.

The questions:

what is the exhaustive list of possible values for domain?
what is the exhaustive list of possible values for text_type?
what is the exhaustive list of possible values for task_objective?

Fine tuning: Some weights of the model checkpoint were not used when initializing T5EncoderModel

Thanks for open sourcing this model! I have attempted to fine tune instructor-large and instructor-xl.

I now get an error when I try to load the output model: "Some weights of the model checkpoint at ../up-l/ were not used when initializing T5EncoderModel". The output model is also missing key files such as modules.json, which I attempted to copy from your existing model, given that it should be the same architecture.

I modified medi-data.json to have my training data.

docs: using the model with sagemaker

Hi,

I am following this guide to deploy instructor-embedding on Amazon SageMaker.

https://www.philschmid.de/custom-inference-huggingface-sagemaker

I've created model.tar.gz that contains cached version of the model.

drwxr-xr-x root/root         0 2023-06-20 15:40 model/
-rw-r--r-- root/root      1477 2023-06-20 15:33 model/.gitattributes
drwxr-xr-x root/root         0 2023-06-20 15:40 model/.ipynb_checkpoints/
-rw-r--r-- root/root     66318 2023-06-20 15:33 model/.ipynb_checkpoints/README-checkpoint.md
-rw-r--r-- root/root       122 2023-06-20 15:33 model/.ipynb_checkpoints/config_sentence_transformers-checkpoint.json
drwxr-xr-x root/root         0 2023-06-20 15:33 model/1_Pooling/
-rw-r--r-- root/root       270 2023-06-20 15:33 model/1_Pooling/config.json
drwxr-xr-x root/root         0 2023-06-20 15:33 model/2_Dense/
-rw-r--r-- root/root       116 2023-06-20 15:33 model/2_Dense/config.json
-rw-r--r-- root/root   3146603 2023-06-20 15:33 model/2_Dense/pytorch_model.bin
-rw-r--r-- root/root     66318 2023-06-20 15:33 model/README.md
-rw-r--r-- root/root      1529 2023-06-20 15:33 model/config.json
-rw-r--r-- root/root       122 2023-06-20 15:33 model/config_sentence_transformers.json
-rw-r--r-- root/root       461 2023-06-20 15:33 model/modules.json
-rw-r--r-- root/root 1339823867 2023-06-20 15:33 model/pytorch_model.bin
-rw-r--r-- root/root         53 2023-06-20 15:33 model/sentence_bert_config.json
-rw-r--r-- root/root       2201 2023-06-20 15:33 model/special_tokens_map.json
-rw-r--r-- root/root     791656 2023-06-20 15:33 model/spiece.model
-rw-r--r-- root/root    2422360 2023-06-20 15:33 model/tokenizer.json
-rw-r--r-- root/root       2407 2023-06-20 15:33 model/tokenizer_config.json
-rw-r--r-- root/root       2177 2023-06-22 18:17 code/inference.py
-rw-r--r-- root/root         70 2023-06-22 18:17 code/requirements.txt

In the inference.py I load the model from the model directory in model.tar.gz.

from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained(model_dir + "/model")
    model = AutoModel.from_pretrained(model_dir + "/model")

The model type for this one comes up as T5Model, and it does not have an encode method.

[INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - mms.service.PredictionException: 'T5Model' object has no attribute 'encode' : 400",

Which method and syntax do I use to perform the embedding?
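A possible shape for inference.py, assuming the `model_fn` and `predict_fn` hooks of the SageMaker Hugging Face inference toolkit and loading through the package's own `INSTRUCTOR` class (which provides `encode`) rather than `AutoModel`. The payload keys `instruction` and `inputs` are hypothetical, and this sketch has not been run on an endpoint.

```python
import os


def model_fn(model_dir):
    """Load the model once when the endpoint starts."""
    # Imported lazily so this module can be imported without the package.
    from InstructorEmbedding import INSTRUCTOR

    # The archive listed above stores the model files under "model/".
    return INSTRUCTOR(os.path.join(model_dir, "model"))


def predict_fn(data, model):
    """Embed every text in the request with the same instruction."""
    instruction = data["instruction"]  # hypothetical payload key
    texts = data["inputs"]             # hypothetical payload key
    pairs = [[instruction, text] for text in texts]
    embeddings = model.encode(pairs)
    # Convert to plain floats so the response can be serialized as JSON.
    return {"embeddings": [[float(x) for x in row] for row in embeddings]}
```

With this approach, code/requirements.txt would presumably need to list InstructorEmbedding so the import inside `model_fn` succeeds.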

Path to contributing to sentence-transformers upstream?

Kudos for the amazing work and hitting the top of the leaderboard! Thoughts on how to contribute changes to sentence-transformers upstream?

Why use DataCollatorForSeq2Seq to collect data?

I think this is a text retrieval task. Why use the seq2seq approach to construct the data?

file downloads on package import

When first importing the package, I noticed a number of downloads happening.

from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-large')
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Science title:"
embeddings = model.encode([[instruction,sentence]])
print(embeddings)

I am deploying the package in an environment where I don't have access to the Internet.

Is there a way to download all the required files ahead of time and tell INSTRUCTOR to use those files instead of downloading them at runtime?
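One approach that may work is sketched below. It assumes that `INSTRUCTOR` accepts a local directory in place of a hub name (as sentence-transformers models generally do) and that the directory is a complete copy of the model repository made on a machine with Internet access. `HF_HUB_OFFLINE` and `TRANSFORMERS_OFFLINE` are the Hugging Face environment variables for offline mode; the helper names are hypothetical, and whether this covers every download triggered at import time is not confirmed here.

```python
import os


def enable_offline_mode():
    """Ask the Hugging Face libraries not to contact the network."""
    os.environ["HF_HUB_OFFLINE"] = "1"
    os.environ["TRANSFORMERS_OFFLINE"] = "1"


def load_local_instructor(model_dir):
    """Load INSTRUCTOR from a directory copied from an online machine."""
    enable_offline_mode()
    # Imported only after the environment variables are set.
    from InstructorEmbedding import INSTRUCTOR

    # Assumption: model_dir contains modules.json, the pooling and dense
    # folders, the tokenizer files and the weights.
    return INSTRUCTOR(model_dir)
```

A call such as `load_local_instructor("/opt/models/instructor-large")` would then replace `INSTRUCTOR('hkunlp/instructor-large')`; the path is only an example.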

When does the training stop?

Hello,

Was there a specific reason for ending the training at 40K steps?
What criterion (i.e., metric) did you use to determine the stopping point?

How can we prune the INSTRUCTOR model?

How can we prune the INSTRUCTOR model?

special tokens in tokenizer

hi,

Is it possible to add special tokens to the tokenizer and retrain for a domain specific task?

License?

Hi, there is no license in this repo, and I was wondering if you would consider adding an explicit license? (Hopefully apache or other friendly license :)) Thanks, and thanks for the awesome work!

SummEval has incorrect description

Very minor problem, but I noticed that the description for SummEval in MTEB was incorrect, so the instructions in SummarizationEvaluator.py should also be changed.

Chinese support?

Has the model been trained on any Chinese corpus?

Multilingual capabilities

Dear authors,

It was a delight to read your work on instruction fine-tuned embeddings.

Are there any plans for extending the capabilities to a multilingual setup?

Can't load config for 'sentence-transformers/gtr-t5-large'

Running the following command:
python train.py --model_name_or_path sentence-transformers/gtr-t5-large --output_dir . --cache_dir medi-data/medi-data.json --max_source_length 512 --num_train_epochs 10 --save_steps 500 --cl_temperature 0.01 --warmup_ratio 0.1 --learning_rate 2e-5 --overwrite_output_dir

and receiving the following error repeatedly:

  File "/home/engineering/instructor-embedding/train.py", line 570, in <module>
    main()
  File "/home/engineering/instructor-embedding/train.py", line 423, in main
    tokenizer = AutoTokenizer.from_pretrained(
  File "/home/engineering/miniconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 535, in from_pretrained
    config = AutoConfig.from_pretrained(
  File "/home/engineering/miniconda3/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 705, in from_pretrained
    config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/engineering/miniconda3/lib/python3.10/site-packages/transformers/configuration_utils.py", line 553, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/engineering/miniconda3/lib/python3.10/site-packages/transformers/configuration_utils.py", line 641, in _get_config_dict
    raise EnvironmentError(
OSError: Can't load config for 'sentence-transformers/gtr-t5-large'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'sentence-transformers/gtr-t5-large' is the correct path to a directory containing a config.json file

I even tried copying the config.json (from https://huggingface.co/sentence-transformers/gtr-t5-large/blob/main/config.json) into a directory I created in sentence-transformers/gtr-t5-large, but I receive the same error.

The time cost for training this instructor?

Hi, thanks for your great work. I do not see the training time for this model in the paper. Could you publish the training time for the different model sizes (base, large, xl)?

Multi-gpu evaluation

Hi there,

Is there a way to evaluate models with multiple GPUs? It is intimidating to run over large datasets.

Thanks,
Rui

Do other languages also work?

Is it possible to use instructor on languages other than English and get meaningful results?

Cheers

hard coding problem of max sequence length

Due to this line, even if I run the code below, max_seq_length stays fixed at 512.
model.max_seq_length = 1024

Is it because max_seq_length can't be changed?
If it can be changed, it would be better to remove this line.

How influential is the "domain" and "task_objective" of each instruction on new data?

Hey Team, thank you for the awesome project and for sharing it with the wider community.

I have a question regarding the optimal strategy for attaching instructions to text we plan on embedding with InstructOR. If I have a set of documents I wish to embed, specifically for information retrieval, what would be the best way to write the instructions for these new documents and queries?

Additional Context:

In the appendices of your paper, I see that retrieval instructions were frequently tailored to the training dataset in question (e.g. for the gooaq_pairs dataset, you refer to each document as a "Google answer for retrieval"). This makes sense for training, but now for inference on new data, how much does the domain and task_objective influence the resulting embeddings? I see they are both optional parts of the unified template, but the examples you provide have domains and objectives that range from general "science sentence" to specific "Wikipedia question".

I figure using domains & task objectives leads to more specialized embeddings, but is there a curated list of domains/objectives somewhere to help guide end users?

Thanks for your time!

Can SetFit be incorporated with InstructorEmbedding?

@tomaarsen Hi Tom. I am wondering if SetFit can be incorporated with InstructorEmbedding. I would like to see a comparison for topic classification between SetFit and InstructorEmbedding.

Bib citation issue

Hi, thanks for making the model publicly available. 🙂

I wanted to cite it so I used the provided bibtex entry:

@inproceedings{INSTRUCTOR,
  title={One Embedder, Any Task: Instruction-Finetuned Text Embeddings},
  author={Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A. Smith, Luke Zettlemoyer, Tao Yu},
  url={https://arxiv.org/abs/2212.09741},
  year={2022},
}

However, the Overleaf BibTeX installation complains that there are too many commas in the author list. My understanding is that it treats "and" as the separator between authors and "," as the separator between last name and first names.

Changing it to the following makes it work.

@inproceedings{INSTRUCTOR,
  title={One Embedder, Any Task: {I}nstruction-Finetuned Text Embeddings},
  author={Su, Hongjin and Shi, Weijia and Kasai, Jungo and Wang, Yizhong and Hu, Yushi and  Ostendorf, Mari and Yih, Wen-tau and Smith, Noah A. and  Zettlemoyer, Luke and Yu, Tao},
  url={https://arxiv.org/abs/2212.09741},
  year={2022},
}
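For longer author lists, the same conversion can be sketched in a few lines. The function name `to_bibtex_authors` is hypothetical, and the code assumes that the final word of each name is the surname, which holds for the authors above but not for every name.

```python
def to_bibtex_authors(names: str) -> str:
    """Turn "First Last, First Last" into "Last, First and Last, First"."""
    authors = []
    for full_name in names.split(","):
        words = full_name.split()
        if not words:
            continue
        # Assumption: the final word is the surname.
        *given, surname = words
        authors.append(f"{surname}, {' '.join(given)}" if given else surname)
    return " and ".join(authors)


print(to_bibtex_authors("Hongjin Su, Weijia Shi, Noah A. Smith"))
# Su, Hongjin and Shi, Weijia and Smith, Noah A.
```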

can I implement this code in google colab?

https://github.com/HKUNLP/instructor-embedding/blob/39a590f431b028cbcc61f0e914e26e39de6f7a2a/InstructorEmbedding/instructor.py#L28

Replication of Instructor

Hey, we are currently trying to replicate the Instructor model. Issue #14 already asks this, but please report the exact training setup for the models.

Also, I am interested in the loss of your model. I didn't get your reported results by running the model for 100k steps. It is not clear to me how you used just 40k steps for the model, given that you mentioned in your paper that you trained it on the MEDI dataset.

I would appreciate your help here :)

Peculiar Cosine Similarity Values

Is there a reason the model seems only to output embeddings with cosine similarities all in a very narrow range?

(One way this can happen is if effectively only a very small subspace of the 768 dimensions is getting used)

I have tried a number of different tasks, with many different strings and types of strings, and find that the results are nearly always in a very narrow range from about +0.4 to +0.9. This is despite creating test sets that should generate lots of orthogonal embeddings and graded similarities from near to further. I have literally been unable to get a value under +0.4 for any two embeddings; I have only ever gotten above +0.9 when testing a vector with itself (as a test.)

I find this true for both base and XL models.

I find this even for the example given in the README. I created the following test code:

import sys

# Usage: python3 instructor_example_2.py <original> <q1>, each "True" or "False"
arguments = sys.argv
original=(arguments[1].lower()=="true")
q1=(arguments[2].lower()=="true")
crosscheck=True and ~original

#Below copy-pasted from https://github.com/HKUNLP/instructor-embedding except 
#  for if statements failing "original"

from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-base')

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


if (original or q1):
    query  = [['Represent the Wikipedia question for retrieving supporting documents: ','where is the food stored in a yam plant']]
else:
    query  = [['Represent the Wikipedia question for retrieving supporting documents: ','what is the dominant economic theory in the United States?']]
corpus = [['Represent the Wikipedia document for retrieval: ','Capitalism has been dominant in the Western world since the end of feudalism, but most feel[who?] that the term "mixed economies" more precisely describes most contemporary economies, due to their containing both private-owned and state-owned enterprises. In capitalism, prices determine the demand-supply scale. For example, higher demand for certain goods and services lead to higher prices and lower demand for certain goods lead to lower prices.'],
          ['Represent the Wikipedia document for retrieval: ',"The disparate impact theory is especially controversial under the Fair Housing Act because the Act regulates many activities relating to housing, insurance, and mortgage loans—and some scholars have argued that the theory's use under the Fair Housing Act, combined with extensions of the Community Reinvestment Act, contributed to rise of sub-prime lending and the crash of the U.S. housing market and ensuing global economic recession"],
          ['Represent the Wikipedia document for retrieval: ','Disparate impact in United States labor law refers to practices in employment, housing, and other areas that adversely affect one group of people of a protected characteristic more than another, even though rules applied by employers or landlords are formally neutral. Although the protected classes vary by statute, most federal civil rights laws protect based on race, color, religion, national origin, and sex as protected traits, and some laws include disability status and other traits as well.']]
if (crosscheck and ~original):
    corpus.append(query[0]+[])
query_embeddings = model.encode(query)
corpus_embeddings = model.encode(corpus)
similarities = cosine_similarity(query_embeddings,corpus_embeddings)
retrieved_doc_id = np.argmax(similarities)
print(retrieved_doc_id)
if (original):
    pass
else:
    print(similarities)
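A side note on the script above: in Python, `~` is the bitwise NOT operator, so `~original` evaluates to -2 or -1, both of which are truthy. If that reading is right, `crosscheck and ~original` holds in every run and the query is always appended to the corpus, which would be consistent with index 3 being printed each time. A minimal sketch of the difference, with `not` shown as one possible replacement:

```python
original = True

# `~` is bitwise NOT on the underlying integer: ~True == -2 and ~False == -1.
print(~original)        # -2
print(bool(~original))  # True, because any non-zero integer is truthy
print(not original)     # False, the logical negation

# Hypothetical rewrite of the flag used in the script:
crosscheck = not original
print(crosscheck)       # False
```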

I then ran it three times: exactly as posted on the README, adding a printout of similarities (with the query vector with itself added as a test), and then using a different query created to score very highly with the first corpus item and printing out the results:

$ python3 instructor_example_2.py True True
load INSTRUCTOR_Transformer
max_seq_length 512
3
$ python3 instructor_example_2.py False True
load INSTRUCTOR_Transformer
max_seq_length 512
3
[[0.7325637 0.71300924 0.7206404 1. ]]
$ python3 instructor_example_2.py False False
load INSTRUCTOR_Transformer
max_seq_length 512
3
[[0.86386305 0.83299637 0.8046411 0.9999999 ]]

In high dimensional spaces cosine similarity of 0.7 is very significant; however a question about a yam has nothing visibly to do with capitalism or disparate impact.

Two other embedding models both returned much more intuitive results (nearly all 0 for the first yam query; graduated similarities for the capitalism query) that were both pretty close to one another.

I've found the same thing with other queries and corpuses; the output of similarities is always in a very narrow range.
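As a toy illustration of how a narrow, high range of cosine similarities can arise, the sketch below adds one shared offset to three orthogonal vectors. It is not a diagnosis of INSTRUCTOR: the vectors and the offset are made up, and subtracting the mean vector is only one commonly discussed way to spread the values out again.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def center(vectors):
    """Subtract the mean vector of the set from every vector."""
    n = len(vectors)
    mean = [sum(col) / n for col in zip(*vectors)]
    return [[x - m for x, m in zip(v, mean)] for v in vectors]


basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
offset = [3.0, 3.0, 3.0]  # made-up shared component
shifted = [[x + o for x, o in zip(v, offset)] for v in basis]
centered = center(shifted)

print(round(cosine(basis[0], basis[1]), 3))        # 0.0
print(round(cosine(shifted[0], shifted[1]), 3))    # 0.971
print(round(cosine(centered[0], centered[1]), 3))  # -0.5
```

In this toy case the orthogonal vectors look almost identical once they share a large common component, and the ordering of similarities, rather than their absolute values, is what remains informative.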

Error when testing on the MSMARCO dataset

I tried to run the evaluation following the README. When I run:
python examples/evaluate_model.py --model_name hkunlp/instructor-large --output_dir msmarco.out --task_name MSMARCO --result_file msmarco.res
I get the following error:
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.
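The idiom the message refers to looks like the sketch below. It is a generic standard-library example, not the evaluation script itself; `square` and `main` are hypothetical names, and it assumes the failing script starts worker processes from top-level code.

```python
import multiprocessing


def square(x):
    """Work done in each worker process."""
    return x * x


def main():
    # Worker processes are only created from inside main().
    with multiprocessing.Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3]))


if __name__ == "__main__":
    # On platforms that spawn rather than fork (e.g. Windows), each child
    # re-imports this module; the guard keeps it from running main() again.
    main()
```

Applied to the case above, the equivalent change would be to move the top-level code of the script being run into a function and call it under the same guard.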

Data & training details

Hi, awesome work on text embeddings!

After reading your paper and the code, I have a few questions.

As stated in the paper (Section 2.3 data construction),

Following Ni et al. (2021), we use four negative pairs (hard or in-batch negatives) during the model finetuning process

But in the data downloaded from the link in your repo, each training instance from each task is accompanied by exactly 1 positive and 1 negative. Since some datasets from embedding-training-data do not contain negatives, I'm wondering how the negatives are sampled: randomly, or in the same way as for the superNI datasets? Also, the data construction code for the 300 datasets from superNI is missing.

In addition, I think the current ckpt is different from the first released ones, since it's trained with hard negatives. But details about how the hard negatives are sampled are missing...

Finally, according to the paper, many tasks are subsampled to balance the datasets. Would you mind sharing the whole data for each data source with all the hard negatives? Thanks.

How would you use instructor with DNA sequences?

Thank you for your work,

I was wondering if it would be possible to train instructor to do embedding on DNA sequences for clustering/classification.

best,
Loïc

from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-xl')
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Science title:"
embeddings = model.encode([[instruction,sentence]])
print(embeddings[0, :5])

Then I batch process using Hugging Face datasets:

from datasets import Dataset
ds = Dataset.from_parquet('../data/mydata.parquet')

import torch
torch.backends.cuda.matmul.allow_tf32 = True

BATCH_SIZE = 4

instruction = "Represent the news article for clustering"
def encode_text(item):
    # Pair the instruction with every text in the batch.
    pairs = [[instruction, subitem] for subitem in item['text']]
    return {"embedding": model.encode(pairs, batch_size=BATCH_SIZE)}

ds = ds.map(encode_text, batched=True, batch_size=BATCH_SIZE, remove_columns='text')

This will either crash my ipykernel or, worse, take my entire EC2 instance offline. This seems like it shouldn't be happening: the model needs 5GB of VRAM and my g5.xlarge instance has 24GB.

Am I doing the batching correctly? This is the only way I could make it work/make it make sense.
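For comparison, the pairing and batching can be written without datasets.map, which may help narrow down where the crash comes from. This is a sketch only: `chunks` and `encode_in_batches` are hypothetical helpers, `encode` stands for any callable with the same input shape as `model.encode`, and nothing here explains the crash itself.

```python
from typing import Callable, Iterator, List, Sequence


def chunks(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def encode_in_batches(
    texts: Sequence[str],
    instruction: str,
    encode: Callable[[List[List[str]]], Sequence],
    batch_size: int = 4,
) -> list:
    """Pair each text with the instruction and encode batch by batch."""
    embeddings = []
    for batch in chunks(texts, batch_size):
        pairs = [[instruction, text] for text in batch]
        embeddings.extend(encode(pairs))
    return embeddings
```

With the real model this would be called as `encode_in_batches(texts, instruction, model.encode)`, so that only one small batch is passed to the model at a time.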