
FlagEmbedding


English | 中文

FlagEmbedding focuses on retrieval-augmented LLMs and currently consists of the following projects:

News

  • 7/26/2024: Release a new embedding model bge-en-icl, an embedding model that incorporates in-context learning capabilities. By providing task-relevant query-response examples, it can encode semantically richer queries, further enhancing the semantic representation ability of the embeddings. 🔥
  • 7/26/2024: Release a new embedding model bge-multilingual-gemma2, a multilingual embedding model based on gemma-2-9b, which supports multiple languages and diverse downstream tasks, achieving new SOTA on multilingual benchmarks (MIRACL, MTEB-fr, and MTEB-pl). 🔥
  • 7/26/2024: Release a new lightweight reranker bge-reranker-v2.5-gemma2-lightweight, based on gemma-2-9b, which supports token compression and layerwise lightweight operations and can still ensure good performance while saving a significant amount of resources. 🔥
  • 6/7/2024: Release a new benchmark MLVU, the first comprehensive benchmark specifically designed for long video understanding. MLVU features an extensive range of video durations, a diverse collection of video sources, and a set of evaluation tasks uniquely tailored for long-form video understanding. 🔥
  • 5/21/2024: Release a new benchmark AIR-Bench together with Jina AI, Zilliz, HuggingFace, and other partners. AIR-Bench focuses on fair out-of-distribution evaluation for Neural IR & RAG. It generates synthetic data for benchmarking across diverse domains and languages, and it is dynamic and will be updated on a regular basis. Leaderboard 🔥
  • 4/30/2024: Release Llama-3-8B-Instruct-80K-QLoRA, extending the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA training on a small amount of synthesized long-context data. The model achieves remarkable performance on various long-context benchmarks. Code 🔥
  • 3/18/2024: Release new rerankers, built upon powerful M3 and LLM (GEMMA and MiniCPM, not so large actually 😃) backbones, supporting multilingual processing and larger inputs, with massive improvements in ranking performance on BEIR, C-MTEB/Retrieval, MIRACL, and LlamaIndex Evaluation 🔥
  • 3/18/2024: Release Visualized-BGE, equipping BGE with visual capabilities. Visualized-BGE can be utilized to generate embeddings for hybrid image-text data. 🔥
  • 1/30/2024: Release BGE-M3, a new member of the BGE model series! M3 stands for Multi-Linguality (100+ languages), Multi-Granularity (input length up to 8192), and Multi-Functionality (unification of dense, lexical, and multi-vector/ColBERT retrieval). It is the first embedding model that supports all three retrieval methods, achieving new SOTA on multi-lingual (MIRACL) and cross-lingual (MKQA) benchmarks. Technical Report and Code. 🔥
  • 1/9/2024: Release Activation-Beacon, an effective, efficient, compatible, and low-cost (training) method to extend the context length of LLM. Technical Report
  • 12/24/2023: Release LLaRA, a LLaMA-7B-based dense retriever that achieves state-of-the-art performance on MS MARCO and BEIR. The model and code will be open-sourced. Please stay tuned. Technical Report
  • 11/23/2023: Release LM-Cocktail, a method to maintain general capabilities during fine-tuning by merging multiple language models. Technical Report
  • 10/12/2023: Release LLM-Embedder, a unified embedding model to support diverse retrieval augmentation needs for LLMs. Technical Report
  • 09/15/2023: The technical report of BGE has been released
  • 09/15/2023: The massive training data of BGE has been released
  • 09/12/2023: New models:
    • New reranker models: release the cross-encoder models BAAI/bge-reranker-base and BAAI/bge-reranker-large, which are more powerful than the embedding models. We recommend using/fine-tuning them to re-rank the top-k documents returned by embedding models.
    • Update embedding models: release the bge-*-v1.5 embedding models to alleviate the issue of the similarity distribution and enhance retrieval ability without instructions.
More
  • 09/07/2023: Update the fine-tuning code: add a script to mine hard negatives and support adding instructions during fine-tuning.
  • 08/09/2023: BGE models are integrated into LangChain; you can use them like this. The C-MTEB leaderboard is available.
  • 08/05/2023: Release base-scale and small-scale models, with the best performance among models of the same size 🤗
  • 08/02/2023: Release the bge-large-* models (BGE, short for BAAI General Embedding), ranking 1st on the MTEB and C-MTEB benchmarks! 🎉 🎉
  • 08/01/2023: We release the Chinese Massive Text Embedding Benchmark (C-MTEB), consisting of 31 test datasets.

Projects

BGE-M3 (Paper, Code)

In this project, we introduce BGE-M3, the first embedding model that supports multi-functional, multilingual, and multi-granularity retrieval (a usage sketch follows the list below):

  • Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of embedding models: dense retrieval, multi-vector retrieval, and sparse retrieval.
  • Multi-Linguality: It can support more than 100 working languages.
  • Multi-Granularity: It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens.
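For context, here is a minimal usage sketch with the repo's BGEM3FlagModel class, assuming a current FlagEmbedding install (the flags and output keys may differ slightly across versions):

from FlagEmbedding import BGEM3FlagModel

# use_fp16 speeds up encoding at a small accuracy cost.
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

sentences = ["What is BGE M3?", "BGE M3 is a multi-functional, multilingual embedding model."]
output = model.encode(
    sentences,
    return_dense=True,          # dense vectors for standard semantic search
    return_sparse=True,         # token-level lexical weights for sparse retrieval
    return_colbert_vecs=True,   # per-token vectors for multi-vector (ColBERT-style) retrieval
)
print(output["dense_vecs"].shape)
print(output["lexical_weights"][0])
print(output["colbert_vecs"][0].shape)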

We propose a novel self-knowledge distillation approach to improve the performance of each single retrieval mode. We also optimize the batching strategy, enabling a large batch size, which can be used simply when fine-tuning with long texts or a large language model. In addition, we construct a dataset for document retrieval and propose a simple strategy to improve the ability to model long texts. The training code and fine-tuning data will be open-sourced in the near future.

Visualized-BGE

In this project, we introduce Visualized-BGE, which integrates image token embeddings into the BGE text embedding framework. Visualized-BGE can be used for various hybrid-modal retrieval tasks, such as Multi-Modal Knowledge Retrieval, Composed Image Retrieval, and Knowledge Retrieval with Multi-Modal Queries.

Our model delivers outstanding zero-shot performance across multiple hybrid modal retrieval tasks. It can also serve as a base model for downstream fine-tuning for hybrid modal retrieval tasks.

Llama-3-8B-Instruct-80K-QLoRA

We extend the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA fine-tuning. The entire training cycle is highly efficient, taking 8 hours on a single 8xA800 (80G) GPU machine. The resulting model exhibits superior performance across a broad range of evaluation tasks, such as NIHS, topic retrieval, and long-context language understanding, while also preserving the original capability over short contexts. The dramatic context extension is mainly attributed to merely 3.5K synthetic training samples generated by GPT-4, which indicates LLMs' inherent (yet largely underestimated) potential to extend their original context length. In fact, the context length could be extended far beyond 80K with more computing resources.

Activation Beacon

The utilization of long contexts poses a big challenge for large language models due to their limited context window length. Activation Beacon condenses the LLM's raw activations into more compact forms so that the model can perceive a much longer context within a limited context window. It is an effective, efficient, compatible, and low-cost (training) method to extend the context length of LLMs. For more details, please refer to our paper and code.

LM-Cocktail

Model merging has been used to improve the performance of a single model. We find this method is also useful for large language models and dense embedding models, and we design the LM-Cocktail strategy, which automatically merges fine-tuned models and the base model using a simple function to compute the merging weights. LM-Cocktail can be used to improve performance on a target domain without decreasing general capabilities beyond that domain. It can also be used to generate a model for new tasks without fine-tuning. You can use it to merge LLMs (e.g., Llama) or embedding models. For more details, please refer to our report: LM-Cocktail and code.
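As a rough illustration, merging a hypothetical fine-tuned BGE checkpoint with its base model via the LM_Cocktail package might look like the sketch below (the argument names follow the LM-Cocktail examples as I recall them and may differ by version; ./my-finetuned-bge, the output path, and the 0.5/0.5 weights are placeholders):

from LM_Cocktail import mix_models

# Merge a fine-tuned embedding model with its base model using fixed weights.
# "./my-finetuned-bge" is a placeholder for your own fine-tuned checkpoint.
model = mix_models(
    model_names_or_paths=["BAAI/bge-large-en-v1.5", "./my-finetuned-bge"],
    model_type="encoder",
    weights=[0.5, 0.5],
    output_path="./mixed-bge",
)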

LLM Embedder

LLM Embedder is fine-tuned based on feedback from LLMs. It supports the retrieval augmentation needs of large language models, including knowledge retrieval, memory retrieval, example retrieval, and tool retrieval. It is fine-tuned over 6 tasks: Question Answering, Conversational Search, Long Conversation, Long-Range Language Modeling, In-Context Learning, and Tool Learning. For more details, please refer to the report and ./FlagEmbedding/llm_embedder/README.md.

BGE Reranker

A cross-encoder performs full attention over the input pair, which is more accurate than an embedding model (i.e., bi-encoder) but more time-consuming. Therefore, it can be used to re-rank the top-k documents returned by an embedding model. We train the cross-encoder on multilingual pair data; the data format is the same as for the embedding model, so you can fine-tune it easily following our example. For more details, please refer to ./FlagEmbedding/reranker/README.md.
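For reference, here is a minimal re-ranking sketch using the FlagReranker class from this repo (assuming a current FlagEmbedding install; the model choice and inputs are illustrative):

from FlagEmbedding import FlagReranker

# use_fp16 speeds up scoring with a slight performance drop.
reranker = FlagReranker("BAAI/bge-reranker-large", use_fp16=True)

# Score query-passage pairs returned by a first-stage embedding retriever;
# higher scores mean the passage is more relevant to the query.
scores = reranker.compute_score([
    ["what is a panda?", "Pandas is a Python library for data analysis."],
    ["what is a panda?", "The giant panda is a bear species endemic to China."],
])
print(scores)  # re-rank the candidates by these scores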

We also provide a new version of the cross-encoder that supports more languages and longer inputs. The data format is similar to that of our embedding models, but now includes prompt data for fine-tuning and inference. You can perform inference using specific layers or all layers, and you can fine-tune it easily following our example. For more details, please refer to ./FlagEmbedding/llm_reranker/README.md.

BGE Embedding

BGE embedding is a general embedding model. We pre-train the models using RetroMAE and train them on large-scale pair data using contrastive learning. You can fine-tune the embedding model on your data following our examples. We also provide a pre-training example. Note that the goal of pre-training is to reconstruct the text; the pre-trained model cannot be used for similarity calculation directly and needs to be fine-tuned. Refer to our report, C-Pack, and code for more details.

BGE uses the last hidden state of [CLS] as the sentence embedding: sentence_embeddings = model_output[0][:, 0]. If you use mean pooling, there will be a significant decrease in performance.
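To make the pooling concrete, here is a short sketch of encoding with plain transformers and taking the [CLS] hidden state (the model name and sentences are illustrative):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-large-en-v1.5")
model.eval()

sentences = ["BGE uses CLS pooling.", "Mean pooling degrades its performance."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    model_output = model(**inputs)
    sentence_embeddings = model_output[0][:, 0]  # last hidden state of the [CLS] token

# Normalize so that dot products equal cosine similarities.
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print(sentence_embeddings.shape)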

C-MTEB

A benchmark for Chinese text embedding. This benchmark has been merged into MTEB. Refer to our report, C-Pack, and code for more details.

Model List

bge is short for BAAI general embedding.

Model | Language | Links | Description | Query instruction for retrieval
--- | --- | --- | --- | ---
BAAI/bge-en-icl | English | - | An LLM-based embedding model with in-context learning capabilities, which can fully leverage the model's potential given a few-shot examples | Provide instructions and few-shot examples freely based on the given task.
BAAI/bge-multilingual-gemma2 | Multilingual | - | An LLM-based multilingual embedding model, trained on a diverse range of languages and tasks. | Provide instructions based on the given task.
BAAI/bge-m3 | Multilingual | Inference, Fine-tune | Multi-Functionality (dense retrieval, sparse retrieval, multi-vector/ColBERT retrieval), Multi-Linguality, and Multi-Granularity (8192 tokens) |
LM-Cocktail | English | - | Fine-tuned models (Llama and BGE) which can be used to reproduce the results of LM-Cocktail |
BAAI/llm-embedder | English | Inference, Fine-tune | A unified embedding model to support diverse retrieval augmentation needs for LLMs | See README
BAAI/bge-reranker-v2-m3 | Multilingual | Inference, Fine-tune | A lightweight cross-encoder model with strong multilingual capabilities; easy to deploy, with fast inference. |
BAAI/bge-reranker-v2-gemma | Multilingual | Inference, Fine-tune | A cross-encoder model suitable for multilingual contexts; performs well in both English proficiency and multilingual capabilities. |
BAAI/bge-reranker-v2-minicpm-layerwise | Multilingual | Inference, Fine-tune | A cross-encoder model suitable for multilingual contexts; performs well in both English and Chinese; allows freedom to select layers for output, facilitating accelerated inference. |
BAAI/bge-reranker-v2.5-gemma2-lightweight | Multilingual | Inference | A cross-encoder model suitable for multilingual contexts; performs well in both English and Chinese; allows freedom to select layers, compression ratio, and compression layers for output, facilitating accelerated inference. |
BAAI/bge-reranker-large | Chinese and English | Inference, Fine-tune | A cross-encoder model which is more accurate but less efficient |
BAAI/bge-reranker-base | Chinese and English | Inference, Fine-tune | A cross-encoder model which is more accurate but less efficient |
BAAI/bge-large-en-v1.5 | English | Inference, Fine-tune | Version 1.5 with a more reasonable similarity distribution | Represent this sentence for searching relevant passages:
BAAI/bge-base-en-v1.5 | English | Inference, Fine-tune | Version 1.5 with a more reasonable similarity distribution | Represent this sentence for searching relevant passages:
BAAI/bge-small-en-v1.5 | English | Inference, Fine-tune | Version 1.5 with a more reasonable similarity distribution | Represent this sentence for searching relevant passages:
BAAI/bge-large-zh-v1.5 | Chinese | Inference, Fine-tune | Version 1.5 with a more reasonable similarity distribution | 为这个句子生成表示以用于检索相关文章:
BAAI/bge-base-zh-v1.5 | Chinese | Inference, Fine-tune | Version 1.5 with a more reasonable similarity distribution | 为这个句子生成表示以用于检索相关文章:
BAAI/bge-small-zh-v1.5 | Chinese | Inference, Fine-tune | Version 1.5 with a more reasonable similarity distribution | 为这个句子生成表示以用于检索相关文章:
BAAI/bge-large-en | English | Inference, Fine-tune | Embedding model which maps text into vectors | Represent this sentence for searching relevant passages:
BAAI/bge-base-en | English | Inference, Fine-tune | A base-scale model with similar ability to bge-large-en | Represent this sentence for searching relevant passages:
BAAI/bge-small-en | English | Inference, Fine-tune | A small-scale model with competitive performance | Represent this sentence for searching relevant passages:
BAAI/bge-large-zh | Chinese | Inference, Fine-tune | Embedding model which maps text into vectors | 为这个句子生成表示以用于检索相关文章:
BAAI/bge-base-zh | Chinese | Inference, Fine-tune | A base-scale model with similar ability to bge-large-zh | 为这个句子生成表示以用于检索相关文章:
BAAI/bge-small-zh | Chinese | Inference, Fine-tune | A small-scale model with competitive performance | 为这个句子生成表示以用于检索相关文章:


Citation

If you find this repository useful, please consider giving it a star ⭐ and a citation.

@misc{bge_m3,
      title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
      author={Chen, Jianlv and Xiao, Shitao and Zhang, Peitian and Luo, Kun and Lian, Defu and Liu, Zheng},
      year={2023},
      eprint={2309.07597},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{cocktail,
      title={LM-Cocktail: Resilient Tuning of Language Models via Model Merging}, 
      author={Shitao Xiao and Zheng Liu and Peitian Zhang and Xingrun Xing},
      year={2023},
      eprint={2311.13534},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{llm_embedder,
      title={Retrieve Anything To Augment Large Language Models}, 
      author={Peitian Zhang and Shitao Xiao and Zheng Liu and Zhicheng Dou and Jian-Yun Nie},
      year={2023},
      eprint={2310.07554},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}

@misc{bge_embedding,
      title={C-Pack: Packaged Resources To Advance General Chinese Embedding}, 
      author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff},
      year={2023},
      eprint={2309.07597},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

FlagEmbedding is licensed under the MIT License.


FlagEmbedding Issues

Two questions about adapting the network to our task

  1. We want faster CPU inference (reducing the embedding dimension from 768 to 128), so we plan to add a 768×128 pooling/projection layer after the last layer in your training code (a rough sketch of this idea follows below).
  2. We want similar sentences to get high similarity and dissimilar sentences to get low similarity, so we changed the loss function to cross-entropy.
    Is the code of your contrastive-learning network structure public? Could you point out where it is? If our attempt works, we will report back to you.
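Here is a rough sketch of the projection idea from point 1, built outside the training code so it can be prototyped quickly (the 768→128 sizes come from the question; the model name and everything else are illustrative, and the new layer would still need to be trained):

import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class ProjectedEncoder(nn.Module):
    """BGE encoder followed by a trainable projection that reduces 768 -> 128."""
    def __init__(self, name="BAAI/bge-base-zh-v1.5", out_dim=128):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        self.proj = nn.Linear(self.encoder.config.hidden_size, out_dim)

    def forward(self, **inputs):
        cls = self.encoder(**inputs).last_hidden_state[:, 0]   # [CLS] pooling
        emb = self.proj(cls)                                   # 768 -> 128
        return torch.nn.functional.normalize(emb, p=2, dim=1)

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-base-zh-v1.5")
model = ProjectedEncoder()
batch = tokenizer(["一个示例句子"], return_tensors="pt")
print(model(**batch).shape)  # torch.Size([1, 128])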

Hardware requirements

What are the minimum hardware requirements to run these models, and what hardware is recommended?

Some C-MTEB datasets get a score of 0 in evaluation

I just started running the evaluation and found that the scores for TNews and IFlyTek are 0. On the HF hub, the test labels are all -1; is this a data-upload problem?

INFO - mteb.evaluation.MTEB : Evaluation for TNews on test took 41.18 seconds
INFO - mteb.evaluation.MTEB : Scores: {'accuracy': 0.0, 'f1': 0.0, 'accuracy_stderr': 0.0, 'f1_stderr': 0.0, 'main_score': 0.0, 'evaluation_time': 41.18}

The test labels of waimai and the other datasets are normal, and my own evaluation on them produces normal scores.

https://huggingface.co/datasets/C-MTEB/TNews-classification
https://huggingface.co/datasets/C-MTEB/IFlyTek-classification

Increasing the model's maximum length limit

I need to retrieve over a large number of fairly long document chunks, but unfortunately many of them exceed the 512-token limit, some reaching more than 1000 tokens. Because of the contextual semantics, I do not want to split these texts.
I have a large amount of data consisting of questions paired with long passages. Can I simply change the model's max_length parameter to 2048 and fine-tune on my long-text data, hoping to strengthen the model's ability to retrieve long documents? Is this approach feasible?

Metrics for each sub-task

Great to see your team's work. Could you show the results for each individual test set in the benchmark?
Also, what evaluation metric is used for each of them?

What causes this error? It fails the same way on both Mac and Linux

OSError: Unable to load weights from pytorch checkpoint file for '/FlagEmbedding/BAAI/bge-large-zh/pytorch_model.bin' at '/FlagEmbedding/BAAI/bge-large-zh/pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

Cross-encoder Rerank?

Does BGE have a cross-encoder for reranking? I see that BGE also ranks first on the reranking task on the MTEB leaderboard; how is BGE used for reranking?

Fine-tuning error: RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

I have already ruled out data and GPU problems; have you encountered this before? The error points to this code:
def compute_loss(self, model, inputs, return_outputs=False):
    """
    How the loss is computed by Trainer. By default, all models return the loss in the first element.

    Subclass and override for custom behavior.
    """
    outputs = model(**inputs)
    loss = outputs.loss

    return (loss, outputs) if return_outputs else loss

self-hosting: what are the estimated GPU size requirements?

Hello, I primarily work on https://github.com/arguflow . We are looking to deploy the Data platform to provide semantic search for a broader range of use-cases. To that end, we need to self-host a production-level embedding model. Is it viable to host bge-large-en ? How big of a GPU will we need? What are the performance expectations? Thanks!

Information on our use-case

There is an important dataset in the Argument-Mining research space called DebateSum. Every year, thousands of competitive debate students around the world open source high quality argument mining data that gets buried in hard to parse docx files.

We built a flexible and open source system to parse those files, extract embedding segments, and semantically search them. Further, there is a neat UI to encourage more contributions to the dataset.

Check out the demo @ https://vault.arguflow.ai

Code @ https://github.com/arguflow

About fine-tuning

Hello, since I am new to this field, I would like to ask some questions that may be quite basic.

When I want to fine-tune my own model on top of your BGE-small model:

  1. If sentence-transformers and the related environment (PyTorch, etc.) are already set up in my conda environment, do I still need to install FlagEmbedding as described in the Installation section of the README, i.e., run pip install . or pip install -e . to install what is defined in setup.py?
  2. For the fine-tuning data, I used the JSON file from the example you provided. Do I only need to configure the parameters listed in the Train section in PyCharm's run configuration, or, on Linux, directly execute the training script shown in the Train section? Among these parameters, if I have no GPU, is it enough to simply remove the first line and the negatives_cross_device argument?
  3. Following point 2, I ran run.py in PyCharm and got a temp folder in the current directory containing several checkpoint folders and model files (config.json, pytorch_model.bin, etc.). How do I load my own model? I tried SentenceTransformer('./temp'), but it reports that the model cannot be found (see the loading sketch below).

Thank you very much for taking the time to reply and resolve my confusion.
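On point 3: the training script saves a transformers-format checkpoint rather than a sentence-transformers one, so one way to load it is through FlagEmbedding's FlagModel (a sketch; ./temp is the asker's output directory, and a specific checkpoint subfolder could be used instead):

from FlagEmbedding import FlagModel

# Point FlagModel at the fine-tuning output directory (transformers format).
model = FlagModel(
    "./temp",
    query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:",
)
embeddings = model.encode(["测试句子"])
print(embeddings.shape)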

Hard negative mining

Hello, thanks for the excellent work.
1. During training, in the fine-tuning stage, how exactly is hard negative mining done? Could you briefly describe it, and how many negatives are mined? The default seems to be 7 negatives (a generic mining sketch follows below).
2. During fine-tuning, is the passage argument in forward of shape [8, batch_size, max_length], where the first tensor [batch_size, max_length] contains the positives for the queries in the current batch and the remaining 7 tensors contain the 7 negatives for those queries?
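For reference, hard negatives are commonly mined by retrieving candidates with the current embedding model and sampling non-positives from a middle rank range. Below is a generic sketch of that idea (not the repo's mining script; the corpus and ranks are toy values):

import numpy as np
from FlagEmbedding import FlagModel

model = FlagModel("BAAI/bge-base-zh-v1.5",
                  query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")

query = "示例查询"
corpus = ["相关文档", "无关文档甲", "无关文档乙", "无关文档丙"]
positives = {"相关文档"}

q_emb = model.encode_queries([query])      # (1, dim), normalized
p_emb = model.encode(corpus)               # (N, dim), normalized
ranked = np.argsort(-(q_emb @ p_emb.T)[0]) # passages sorted by similarity to the query

# Skip the top-ranked results (likely positives or near-duplicates) and known
# positives, then keep up to 7 negatives to match the default group size of 8.
# In practice the sampling window is much deeper, e.g. ranks 10-100.
window = [i for i in ranked[1:] if corpus[i] not in positives]
hard_negatives = [corpus[i] for i in window[:7]]
print(hard_negatives)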

Question about pre-training and fine-tuning data volume

Hello! Excellent work! I am curious about the amount of data used in the pre-training and fine-tuning stages. You mention that the English fine-tuning corpus is roughly 230M pairs; was all of it used for training, possibly even for multiple epochs, or only part of it? And roughly how much data did the model see during pre-training?

Question about fine-tuning data

Hello, can a question-answering dataset be used to build the fine-tuning dataset? Looking forward to your reply.

Question about online inference performance

  1. Is conversion to ONNX supported to speed up inference?
  2. For online inference without a GPU, what is the minimum hardware required to run the large models on CPU within 100 ms, e.g., 4 cores and 8 GB of RAM?

hybrid search using both BGE and BM25

Dear Team,

Thank you for this great work.

Our team has been actively engaged in testing hybrid search strategies by integrating BM25 search with semantic search (we leverage search engines like Elasticsearch / opensearch / azure cognitive search). In our recent experimentation, we have consistently observed that the hybrid search approach outperforms the individual model performance for both the instructor (https://instructor-embedding.github.io/) and sentence-transformer (multi-qa-mpnet-base-dot-v1) models. This encouraging outcome underscores the efficacy of the hybrid strategy.

However, an interesting pattern emerged when we incorporated BGE into our hybrid search framework. In this specific scenario, the hybrid search results fell between the performance of the BM25-only search and the BGE-only search strategies and worse than hybrid search results from BM25 + instructor.

Given these intriguing findings, we are reaching out to inquire if any of you have hands-on experience or insights related to hybrid search strategies involving BGE. Your valuable input could significantly contribute to our ongoing research and experimentation.

Thank you for your time and assistance.

Multilingual Models

Do you plan to train and release multilingual embedding models in the near future?

Question about negatives_cross_device

After enabling negatives_cross_device, can it happen that the negatives from examples on other devices are exactly the positives on the current device, and would this affect the model's learning?
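For intuition, here is a conceptual sketch of cross-device in-batch negatives (a simplification, not the repo's implementation: one positive per query, and gradient flow through all_gather is ignored). The collision the question describes would show up as a duplicate of a query's correct passage inside all_p, i.e., a false negative:

import torch
import torch.distributed as dist
import torch.nn.functional as F

def cross_device_contrastive_loss(q_emb, p_emb, temperature=0.05):
    # Gather passage embeddings from every GPU so each query is contrasted
    # against the passages of the whole global batch, not just the local one.
    gathered = [torch.zeros_like(p_emb) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, p_emb)
    all_p = torch.cat(gathered, dim=0)                 # (world_size * local_batch, dim)

    scores = q_emb @ all_p.T / temperature             # (local_batch, world_size * local_batch)
    # The target of query i is its own positive, offset by this rank's slice.
    offset = dist.get_rank() * p_emb.size(0)
    target = torch.arange(q_emb.size(0), device=q_emb.device) + offset
    # If another rank's batch happens to contain the same passage as query i's
    # positive, it is treated as a negative here (the situation asked about).
    return F.cross_entropy(scores, target)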

A model supporting both Chinese and English?

Your models are split by language into bge-large-en and bge-large-zh, but in my tests bge-large-zh also computes similarity for English text quite well. I would like to confirm whether bge-large-zh supports both Chinese and English; if so, how does bge-large-zh perform on English datasets?

parameter: batch size per gpu

Hello, excellent work!
One point of confusion: you mention in the README:

We trained our model on 48 A100(40G) GPUs with a large batch size of 32,768 (so there are 65,535 negatives for each query in a batch)

As I understand it, negative num = (group size * batch size per gpu * gpu_num) - 1.
You mention that negative num = 65535 during fine-tuning, so group size * batch size per gpu = (65535 + 1) / 48, but that is clearly not an integer. Is something wrong somewhere?

Could you also share the exact parameters used to fine-tune BGE-base-zh? I would like to learn more. Thanks!

Hope to get in touch

Dear FlagEmbedding developer,
Greetings! I am Jimmy, a community developer and volunteer at InternLM. Your work has been immensely beneficial to me, and I believe it can be effectively utilized in InternLM as well. You are welcome to join our Discord: https://discord.gg/gF9ezcmtM3 . I hope to get in touch with you.

Best regards,
Jimmy

Is the data shuffled during fine-tuning?

Hello, a few questions; thank you for your answers!
1. From the code, it seems the data is not shuffled at each epoch. Does the code support shuffling, and how can it be enabled?
2. Without shuffling, will in-batch negative sampling see fewer distinct negatives?

What method does the Sentence Similarity widget on Hugging Face use?

I tried the two similarity computation methods from the examples, but both differ from the results computed on Hugging Face:
https://huggingface.co/BAAI/bge-large-zh

from sentence_transformers import SentenceTransformer
queries = ["那是 個快樂的人"]
passages = ["那是 條快樂的狗", "那是 個非常幸福的人", "今天是晴天", "那是个快乐的人", "那是 個悲伤的人"]
instruction = "为这个句子生成表示以用于检索相关文章:"
model = SentenceTransformer('BAAI/bge-large-zh')
q_embeddings = model.encode([instruction+q for q in queries], normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)
scores = q_embeddings @ p_embeddings.T
print(scores)
# [[0.7388883 0.7768293 0.5702877 0.8297329 0.7422564]]

from FlagEmbedding import FlagModel
model = FlagModel('BAAI/bge-large-zh', query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")
queries = ["那是 個快樂的人"]
passages = ["那是 條快樂的狗", "那是 個非常幸福的人", "今天是晴天", "那是个快乐的人", "那是 個悲伤的人"]
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode(passages)
scores = q_embeddings @ p_embeddings.T
print(scores)
# [[0.7388883 0.7768293 0.5702877 0.8297329 0.7422564]]

What causes error 104 on a cloud server? It runs fine locally; is it a network configuration problem?

Traceback (most recent call last):
File "/repo/app.py", line 7, in
model = FlagModel('BAAI/bge-large-zh', query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/FlagEmbedding/baai_general_embedding/flag_models.py", line 18, in init
self.model = AutoModel.from_pretrained(model_name_or_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 493, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2539, in from_pretrained
resolved_archive_file = cached_file(
^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/transformers/utils/hub.py", line 417, in cached_file
resolved_file = hf_hub_download(
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1364, in hf_hub_download
http_get(
File "/usr/local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 505, in http_get
r = _request_wrapper(
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 442, in _request_wrapper
return http_backoff(
^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/huggingface_hub/utils/_http.py", line 258, in http_backoff
response = session.request(method=method, url=url, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/huggingface_hub/utils/_http.py", line 63, in send
return super().send(request, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/requests/adapters.py", line 501, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: (ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')), '(Request ID: 4fdd90d4-fed9-44c7-8cf3-1cb8d1c56c6a)')

Question about the pre-training example

Is the provided pre-training example (FlagEmbedding.baai_general_embedding.retromae_pretrain.run) continued pre-training, or pre-training from scratch?

Error while trying to use custom DistilBERT model in the pretraining script

Hi,
I am trying to use a custom DistilBERT model with the pretraining script,
but I am facing the error below.
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


[2023-08-22 10:20:17,917] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-22 10:20:17,967] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
08/22/2023 10:20:18 - WARNING - main - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False
08/22/2023 10:20:18 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
08/22/2023 10:20:18 - INFO - main - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=True,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=bge_pretrained/runs/Aug22_10-20-18_experimentllm-0,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=10,
logging_strategy=steps,
lr_scheduler_type=linear,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=2.0,
optim=adamw_hf,
optim_args=None,
output_dir=bge_pretrained,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=20,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=False,
report_to=[],
resume_from_checkpoint=None,
run_name=bge_pretrained,
save_on_each_node=False,
save_safetensors=False,
save_steps=500,
save_strategy=steps,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
08/22/2023 10:20:18 - INFO - main - Model parameters ModelArguments(model_name_or_path='./upload/saved_models/handmade_tokenizer/azr_105epoch_mlm/resume_in_stag', config_name=None)
08/22/2023 10:20:18 - INFO - main - Data parameters DataTrainingArguments(train_data='bge_pretraining.jsonl', tokenizer_name=None, max_seq_length=512, encoder_mlm_probability=0.3, decoder_mlm_probability=0.5)
[INFO|configuration_utils.py:710] 2023-08-22 10:20:18,672 >> loading configuration file ./upload/saved_models/handmade_tokenizer/azr_105epoch_mlm/resume_in_stag/config.json
[INFO|configuration_utils.py:768] 2023-08-22 10:20:18,673 >> Model config DistilBertConfig {
"name_or_path": "./upload/saved_models/handmade_tokenizer/azr_105epoch_mlm/resume_in_stag",
"activation": "gelu",
"architectures": [
"DistilBertForMaskedLM"
],
"attention_dropout": 0.1,
"dim": 768,
"dropout": 0.1,
"hidden_dim": 3072,
"initializer_range": 0.02,
"max_position_embeddings": 512,
"model_type": "distilbert",
"n_heads": 12,
"n_layers": 6,
"pad_token_id": 0,
"qa_dropout": 0.1,
"seq_classif_dropout": 0.2,
"sinusoidal_pos_embds": false,
"tie_weights
": true,
"torch_dtype": "float32",
"transformers_version": "4.31.0",
"vocab_size": 42550
}

[INFO|modeling_utils.py:2600] 2023-08-22 10:20:18,688 >> loading weights file ./upload/saved_models/handmade_tokenizer/azr_105epoch_mlm/resume_in_stag/pytorch_model.bin
[INFO|modeling_utils.py:3329] 2023-08-22 10:20:19,301 >> All model checkpoint weights were used when initializing DistilBertForMaskedLM.

[INFO|modeling_utils.py:3337] 2023-08-22 10:20:19,301 >> All the weights of DistilBertForMaskedLM were initialized from the model checkpoint at ./upload/saved_models/handmade_tokenizer/azr_105epoch_mlm/resume_in_stag.
If your task is similar to the task the model of the checkpoint was trained on, you can already use DistilBertForMaskedLM for predictions without further training.
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/FlagEmbedding/baai_general_embedding/retromae_pretrain/run.py", line 128, in
main()
File "/opt/conda/lib/python3.8/site-packages/FlagEmbedding/baai_general_embedding/retromae_pretrain/run.py", line 93, in main
model = model_class.from_pretrained(model_args, model_args.model_name_or_path)
File "/opt/conda/lib/python3.8/site-packages/FlagEmbedding/baai_general_embedding/retromae_pretrain/modeling.py", line 91, in from_pretrained
model = cls(hf_model, model_args)
File "/opt/conda/lib/python3.8/site-packages/FlagEmbedding/baai_general_embedding/retromae_pretrain/modeling.py", line 24, in init
self.decoder_embeddings = self.lm.bert.embeddings
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1614, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistilBertForMaskedLM' object has no attribute 'bert'
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/FlagEmbedding/baai_general_embedding/retromae_pretrain/run.py", line 128, in
main()
File "/opt/conda/lib/python3.8/site-packages/FlagEmbedding/baai_general_embedding/retromae_pretrain/run.py", line 93, in main
model = model_class.from_pretrained(model_args, model_args.model_name_or_path)
File "/opt/conda/lib/python3.8/site-packages/FlagEmbedding/baai_general_embedding/retromae_pretrain/modeling.py", line 91, in from_pretrained
model = cls(hf_model, model_args)
File "/opt/conda/lib/python3.8/site-packages/FlagEmbedding/baai_general_embedding/retromae_pretrain/modeling.py", line 24, in init
self.decoder_embeddings = self.lm.bert.embeddings
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1614, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistilBertForMaskedLM' object has no attribute 'bert'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 588) of binary: /opt/conda/bin/python3.8
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in
sys.exit(main())
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

FlagEmbedding.baai_general_embedding.retromae_pretrain.run FAILED

Failures:
[1]:
time : 2023-08-22_10:20:20
host : experimentllm-0
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 589)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2023-08-22_10:20:20
host : experimentllm-0
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 588)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Relationship between text length and similarity

I am using an embedding model to build a document retrieval application. Documents are split along semantic paragraph boundaries, so some chunks are long and some contain very little text. When encoding with the BGE model, I find that chunks with less text often get higher similarity scores, even when chunks with more text are semantically more relevant. OpenAI's embeddings do not seem to have this problem. Is there any way to optimize for this?
