
FlagEmbedding


English | 中文

FlagEmbedding focuses on retrieval-augmented LLMs and currently consists of the following projects:

News

  • 7/26/2024: Release a new embedding model bge-en-icl, an embedding model that incorporates in-context learning capabilities. By providing task-relevant query-response examples, it can encode semantically richer queries, further enhancing the semantic representation ability of the embeddings. 🔥
  • 7/26/2024: Release a new embedding model bge-multilingual-gemma2, a multilingual embedding model based on gemma-2-9b, which supports multiple languages and diverse downstream tasks, achieving new SOTA on multilingual benchmarks (MIRACL, MTEB-fr, and MTEB-pl). 🔥
  • 7/26/2024: Release a new lightweight reranker bge-reranker-v2.5-gemma2-lightweight, based on gemma-2-9b, which supports token compression and layerwise lightweight operations and can still ensure good performance while saving a significant amount of resources. 🔥
  • 6/7/2024: Release a new benchmark MLVU, the first comprehensive benchmark specifically designed for long video understanding. MLVU features an extensive range of video durations, a diverse collection of video sources, and a set of evaluation tasks uniquely tailored for long-form video understanding. 🔥
  • 5/21/2024: Release a new benchmark AIR-Bench together with Jina AI, Zilliz, HuggingFace, and other partners. AIR-Bench focuses on fair out-of-distribution evaluation for Neural IR & RAG. It generates synthetic data for benchmarking across diverse domains and languages, and it is dynamic and will be updated on a regular basis. Leaderboard 🔥
  • 4/30/2024: Release Llama-3-8B-Instruct-80K-QLoRA, extending the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA training on a small amount of synthesized long-context data. The model achieves remarkable performance on various long-context benchmarks. Code 🔥
  • 3/18/2024: Release new rerankers, built upon powerful M3 and LLM (GEMMA and MiniCPM, not so large actually 😃) backbones, supporting multilingual processing and larger inputs, with massive improvements in ranking performance on BEIR, C-MTEB/Retrieval, MIRACL, and LlamaIndex Evaluation 🔥
  • 3/18/2024: Release Visualized-BGE, equipping BGE with visual capabilities. Visualized-BGE can be utilized to generate embeddings for hybrid image-text data. 🔥
  • 1/30/2024: Release BGE-M3, a new member of the BGE model series! M3 stands for Multi-Linguality (100+ languages), Multi-Granularity (input length up to 8192), and Multi-Functionality (unification of dense, lexical, and multi-vector/ColBERT retrieval). It is the first embedding model that supports all three retrieval methods, achieving new SOTA on multi-lingual (MIRACL) and cross-lingual (MKQA) benchmarks. Technical Report and Code. 🔥
  • 1/9/2024: Release Activation-Beacon, an effective, efficient, compatible, and low-cost (training) method to extend the context length of LLM. Technical Report
  • 12/24/2023: Release LLaRA, a LLaMA-7B-based dense retriever that achieves state-of-the-art performance on MS MARCO and BEIR. The model and code will be open-sourced. Please stay tuned. Technical Report
  • 11/23/2023: Release LM-Cocktail, a method to maintain general capabilities during fine-tuning by merging multiple language models. Technical Report
  • 10/12/2023: Release LLM-Embedder, a unified embedding model to support diverse retrieval augmentation needs for LLMs. Technical Report
  • 09/15/2023: The technical report of BGE has been released
  • 09/15/2023: The massive training data of BGE has been released
  • 09/12/2023: New models:
    • New reranker models: release the cross-encoder models BAAI/bge-reranker-base and BAAI/bge-reranker-large, which are more powerful than the embedding models. We recommend using/fine-tuning them to re-rank the top-k documents returned by embedding models.
    • Update embedding models: release the bge-*-v1.5 embedding models to alleviate the issue of the similarity distribution and enhance retrieval ability without instructions.
More
  • 09/07/2023: Update the fine-tuning code: add a script to mine hard negatives and support adding instructions during fine-tuning.
  • 08/09/2023: BGE models are integrated into LangChain; you can use them like this. The C-MTEB leaderboard is available.
  • 08/05/2023: Release base-scale and small-scale models, with the best performance among models of the same size 🤗
  • 08/02/2023: Release the bge-large-* models (BGE, short for BAAI General Embedding), ranking 1st on the MTEB and C-MTEB benchmarks! 🎉 🎉
  • 08/01/2023: We release the Chinese Massive Text Embedding Benchmark (C-MTEB), consisting of 31 test datasets.

Projects

BGE-M3 (Paper, Code)

In this project, we introduce BGE-M3, the first embedding model that supports multi-functional, multilingual, and multi-granularity retrieval (a usage sketch follows the list below):

  • Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of embedding models: dense retrieval, multi-vector retrieval, and sparse retrieval.
  • Multi-Linguality: It can support more than 100 working languages.
  • Multi-Granularity: It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens.
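For context, here is a minimal usage sketch with the repo's BGEM3FlagModel class, assuming a current FlagEmbedding install (the flags and output keys may differ slightly across versions):

from FlagEmbedding import BGEM3FlagModel

# use_fp16 speeds up encoding at a small accuracy cost.
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

sentences = ["What is BGE M3?", "BGE M3 is a multi-functional, multilingual embedding model."]
output = model.encode(
    sentences,
    return_dense=True,          # dense vectors for standard semantic search
    return_sparse=True,         # token-level lexical weights for sparse retrieval
    return_colbert_vecs=True,   # per-token vectors for multi-vector (ColBERT-style) retrieval
)
print(output["dense_vecs"].shape)
print(output["lexical_weights"][0])
print(output["colbert_vecs"][0].shape)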

We propose a novel self-knowledge distillation approach to improve the performance of each single retrieval mode. We also optimize the batching strategy, enabling a large batch size, which can be used simply when fine-tuning with long texts or a large language model. In addition, we construct a dataset for document retrieval and propose a simple strategy to improve the ability to model long texts. The training code and fine-tuning data will be open-sourced in the near future.

Visualized-BGE

In this project, we introduce Visualized-BGE, which integrates image token embeddings into the BGE text embedding framework. Visualized-BGE can be used for various hybrid-modal retrieval tasks, such as Multi-Modal Knowledge Retrieval, Composed Image Retrieval, and Knowledge Retrieval with Multi-Modal Queries.

Our model delivers outstanding zero-shot performance across multiple hybrid modal retrieval tasks. It can also serve as a base model for downstream fine-tuning for hybrid modal retrieval tasks.

Llama-3-8B-Instruct-80K-QLoRA

We extend the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA fine-tuning. The entire training cycle is highly efficient, taking 8 hours on a single 8xA800 (80G) GPU machine. The resulting model exhibits superior performance across a broad range of evaluation tasks, such as NIHS, topic retrieval, and long-context language understanding, while also preserving the original capability over short contexts. The dramatic context extension is mainly attributed to merely 3.5K synthetic training samples generated by GPT-4, which indicates LLMs' inherent (yet largely underestimated) potential to extend their original context length. In fact, the context length could be extended far beyond 80K with more computing resources.

Activation Beacon

The utilization of long contexts poses a big challenge for large language models due to their limited context window length. Activation Beacon condenses the LLM's raw activations into more compact forms so that the model can perceive a much longer context within a limited context window. It is an effective, efficient, compatible, and low-cost (training) method to extend the context length of LLMs. For more details, please refer to our paper and code.

LM-Cocktail

Model merging has been used to improve the performance of a single model. We find this method is also useful for large language models and dense embedding models, and we design the LM-Cocktail strategy, which automatically merges fine-tuned models and the base model using a simple function to compute the merging weights. LM-Cocktail can be used to improve performance on a target domain without decreasing general capabilities beyond that domain. It can also be used to generate a model for new tasks without fine-tuning. You can use it to merge LLMs (e.g., Llama) or embedding models. For more details, please refer to our report: LM-Cocktail and code.
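As a rough illustration, merging a hypothetical fine-tuned BGE checkpoint with its base model via the LM_Cocktail package might look like the sketch below (the argument names follow the LM-Cocktail examples as I recall them and may differ by version; ./my-finetuned-bge, the output path, and the 0.5/0.5 weights are placeholders):

from LM_Cocktail import mix_models

# Merge a fine-tuned embedding model with its base model using fixed weights.
# "./my-finetuned-bge" is a placeholder for your own fine-tuned checkpoint.
model = mix_models(
    model_names_or_paths=["BAAI/bge-large-en-v1.5", "./my-finetuned-bge"],
    model_type="encoder",
    weights=[0.5, 0.5],
    output_path="./mixed-bge",
)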

LLM Embedder

LLM Embedder is fine-tuned based on feedback from LLMs. It supports the retrieval augmentation needs of large language models, including knowledge retrieval, memory retrieval, example retrieval, and tool retrieval. It is fine-tuned over 6 tasks: Question Answering, Conversational Search, Long Conversation, Long-Range Language Modeling, In-Context Learning, and Tool Learning. For more details, please refer to the report and ./FlagEmbedding/llm_embedder/README.md.

BGE Reranker

A cross-encoder performs full attention over the input pair, which is more accurate than an embedding model (i.e., bi-encoder) but more time-consuming. Therefore, it can be used to re-rank the top-k documents returned by an embedding model. We train the cross-encoder on multilingual pair data; the data format is the same as for the embedding model, so you can fine-tune it easily following our example. For more details, please refer to ./FlagEmbedding/reranker/README.md.
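For reference, here is a minimal re-ranking sketch using the FlagReranker class from this repo (assuming a current FlagEmbedding install; the model choice and inputs are illustrative):

from FlagEmbedding import FlagReranker

# use_fp16 speeds up scoring with a slight performance drop.
reranker = FlagReranker("BAAI/bge-reranker-large", use_fp16=True)

# Score query-passage pairs returned by a first-stage embedding retriever;
# higher scores mean the passage is more relevant to the query.
scores = reranker.compute_score([
    ["what is a panda?", "Pandas is a Python library for data analysis."],
    ["what is a panda?", "The giant panda is a bear species endemic to China."],
])
print(scores)  # re-rank the candidates by these scores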

We also provide a new version of the cross-encoder that supports more languages and longer inputs. The data format is similar to that of our embedding models, but now includes prompt data for fine-tuning and inference. You can perform inference using specific layers or all layers, and you can fine-tune it easily following our example. For more details, please refer to ./FlagEmbedding/llm_reranker/README.md.

BGE Embedding

BGE embedding is a general embedding model. We pre-train the models using RetroMAE and train them on large-scale pair data using contrastive learning. You can fine-tune the embedding model on your data following our examples. We also provide a pre-training example. Note that the goal of pre-training is to reconstruct the text; the pre-trained model cannot be used for similarity calculation directly and needs to be fine-tuned. Refer to our report, C-Pack, and code for more details.

BGE uses the last hidden state of [CLS] as the sentence embedding: sentence_embeddings = model_output[0][:, 0]. If you use mean pooling, there will be a significant decrease in performance.
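To make the pooling concrete, here is a short sketch of encoding with plain transformers and taking the [CLS] hidden state (the model name and sentences are illustrative):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-large-en-v1.5")
model.eval()

sentences = ["BGE uses CLS pooling.", "Mean pooling degrades its performance."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    model_output = model(**inputs)
    sentence_embeddings = model_output[0][:, 0]  # last hidden state of the [CLS] token

# Normalize so that dot products equal cosine similarities.
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print(sentence_embeddings.shape)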

C-MTEB

A benchmark for Chinese text embedding. This benchmark has been merged into MTEB. Refer to our report, C-Pack, and code for more details.

Model List

bge is short for BAAI general embedding.

Model | Language | Links | Description | Query instruction for retrieval
--- | --- | --- | --- | ---
BAAI/bge-en-icl | English | - | An LLM-based embedding model with in-context learning capabilities, which can fully leverage the model's potential given a few-shot examples | Provide instructions and few-shot examples freely based on the given task.
BAAI/bge-multilingual-gemma2 | Multilingual | - | An LLM-based multilingual embedding model, trained on a diverse range of languages and tasks. | Provide instructions based on the given task.
BAAI/bge-m3 | Multilingual | Inference, Fine-tune | Multi-Functionality (dense retrieval, sparse retrieval, multi-vector/ColBERT retrieval), Multi-Linguality, and Multi-Granularity (8192 tokens) |
LM-Cocktail | English | - | Fine-tuned models (Llama and BGE) which can be used to reproduce the results of LM-Cocktail |
BAAI/llm-embedder | English | Inference, Fine-tune | A unified embedding model to support diverse retrieval augmentation needs for LLMs | See README
BAAI/bge-reranker-v2-m3 | Multilingual | Inference, Fine-tune | A lightweight cross-encoder model with strong multilingual capabilities; easy to deploy, with fast inference. |
BAAI/bge-reranker-v2-gemma | Multilingual | Inference, Fine-tune | A cross-encoder model suitable for multilingual contexts; performs well in both English proficiency and multilingual capabilities. |
BAAI/bge-reranker-v2-minicpm-layerwise | Multilingual | Inference, Fine-tune | A cross-encoder model suitable for multilingual contexts; performs well in both English and Chinese; allows freedom to select layers for output, facilitating accelerated inference. |
BAAI/bge-reranker-v2.5-gemma2-lightweight | Multilingual | Inference | A cross-encoder model suitable for multilingual contexts; performs well in both English and Chinese; allows freedom to select layers, compression ratio, and compression layers for output, facilitating accelerated inference. |
BAAI/bge-reranker-large | Chinese and English | Inference, Fine-tune | A cross-encoder model which is more accurate but less efficient |
BAAI/bge-reranker-base | Chinese and English | Inference, Fine-tune | A cross-encoder model which is more accurate but less efficient |
BAAI/bge-large-en-v1.5 | English | Inference, Fine-tune | Version 1.5 with a more reasonable similarity distribution | Represent this sentence for searching relevant passages:
BAAI/bge-base-en-v1.5 | English | Inference, Fine-tune | Version 1.5 with a more reasonable similarity distribution | Represent this sentence for searching relevant passages:
BAAI/bge-small-en-v1.5 | English | Inference, Fine-tune | Version 1.5 with a more reasonable similarity distribution | Represent this sentence for searching relevant passages:
BAAI/bge-large-zh-v1.5 | Chinese | Inference, Fine-tune | Version 1.5 with a more reasonable similarity distribution | 为这个句子生成表示以用于检索相关文章:
BAAI/bge-base-zh-v1.5 | Chinese | Inference, Fine-tune | Version 1.5 with a more reasonable similarity distribution | 为这个句子生成表示以用于检索相关文章:
BAAI/bge-small-zh-v1.5 | Chinese | Inference, Fine-tune | Version 1.5 with a more reasonable similarity distribution | 为这个句子生成表示以用于检索相关文章:
BAAI/bge-large-en | English | Inference, Fine-tune | Embedding model which maps text into vectors | Represent this sentence for searching relevant passages:
BAAI/bge-base-en | English | Inference, Fine-tune | A base-scale model with similar ability to bge-large-en | Represent this sentence for searching relevant passages:
BAAI/bge-small-en | English | Inference, Fine-tune | A small-scale model with competitive performance | Represent this sentence for searching relevant passages:
BAAI/bge-large-zh | Chinese | Inference, Fine-tune | Embedding model which maps text into vectors | 为这个句子生成表示以用于检索相关文章:
BAAI/bge-base-zh | Chinese | Inference, Fine-tune | A base-scale model with similar ability to bge-large-zh | 为这个句子生成表示以用于检索相关文章:
BAAI/bge-small-zh | Chinese | Inference, Fine-tune | A small-scale model with competitive performance | 为这个句子生成表示以用于检索相关文章:


Citation

If you find this repository useful, please consider giving it a star ⭐ and a citation.

@misc{bge_m3,
      title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
      author={Chen, Jianlv and Xiao, Shitao and Zhang, Peitian and Luo, Kun and Lian, Defu and Liu, Zheng},
      year={2023},
      eprint={2309.07597},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{cocktail,
      title={LM-Cocktail: Resilient Tuning of Language Models via Model Merging}, 
      author={Shitao Xiao and Zheng Liu and Peitian Zhang and Xingrun Xing},
      year={2023},
      eprint={2311.13534},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{llm_embedder,
      title={Retrieve Anything To Augment Large Language Models}, 
      author={Peitian Zhang and Shitao Xiao and Zheng Liu and Zhicheng Dou and Jian-Yun Nie},
      year={2023},
      eprint={2310.07554},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}

@misc{bge_embedding,
      title={C-Pack: Packaged Resources To Advance General Chinese Embedding}, 
      author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff},
      year={2023},
      eprint={2309.07597},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

FlagEmbedding is licensed under the MIT License.


FlagEmbedding Issues

Two questions about adapting the network to our task

  1. We want faster CPU inference (reducing the embedding dimension from 768 to 128), so we plan to add a 768×128 pooling/projection layer after the last layer in your training code (a rough sketch of this idea follows below).
  2. We want similar sentences to get high similarity and dissimilar sentences to get low similarity, so we changed the loss function to cross-entropy.
    Is the code of your contrastive-learning network structure public? Could you point out where it is? If our attempt works, we will report back to you.
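Here is a rough sketch of the projection idea from point 1, built outside the training code so it can be prototyped quickly (the 768→128 sizes come from the question; the model name and everything else are illustrative, and the new layer would still need to be trained):

import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class ProjectedEncoder(nn.Module):
    """BGE encoder followed by a trainable projection that reduces 768 -> 128."""
    def __init__(self, name="BAAI/bge-base-zh-v1.5", out_dim=128):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        self.proj = nn.Linear(self.encoder.config.hidden_size, out_dim)

    def forward(self, **inputs):
        cls = self.encoder(**inputs).last_hidden_state[:, 0]   # [CLS] pooling
        emb = self.proj(cls)                                   # 768 -> 128
        return torch.nn.functional.normalize(emb, p=2, dim=1)

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-base-zh-v1.5")
model = ProjectedEncoder()
batch = tokenizer(["一个示例句子"], return_tensors="pt")
print(model(**batch).shape)  # torch.Size([1, 128])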

Hardware requirements

What are the minimum hardware requirements to run these models, and what hardware is recommended?

Some C-MTEB datasets get a score of 0 in evaluation

I just started running the evaluation and found that the scores for TNews and IFlyTek are 0. On the HF hub, the test labels are all -1; is this a data-upload problem?

INFO - mteb.evaluation.MTEB : Evaluation for TNews on test took 41.18 seconds
INFO - mteb.evaluation.MTEB : Scores: {'accuracy': 0.0, 'f1': 0.0, 'accuracy_stderr': 0.0, 'f1_stderr': 0.0, 'main_score': 0.0, 'evaluation_time': 41.18}

The test labels of waimai and the other datasets are normal, and my own evaluation on them produces normal scores.

https://huggingface.co/datasets/C-MTEB/TNews-classification
https://huggingface.co/datasets/C-MTEB/IFlyTek-classification

Increasing the model's maximum length limit

I need to retrieve over a large number of fairly long document chunks, but unfortunately many of them exceed the 512-token limit, some reaching more than 1000 tokens. Because of the contextual semantics, I do not want to split these texts.
I have a large amount of data consisting of questions paired with long passages. Can I simply change the model's max_length parameter to 2048 and fine-tune on my long-text data, hoping to strengthen the model's ability to retrieve long documents? Is this approach feasible?

Metrics for each sub-task

Great to see your team's work. Could you show the results for each individual test set in the benchmark?
Also, what evaluation metric is used for each of them?

What causes this error? It fails the same way on both Mac and Linux

OSError: Unable to load weights from pytorch checkpoint file for '/FlagEmbedding/BAAI/bge-large-zh/pytorch_model.bin' at '/FlagEmbedding/BAAI/bge-large-zh/pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

Cross-encoder Rerank?

Does BGE have a cross-encoder for reranking? I see that BGE also ranks first on the reranking task on the MTEB leaderboard; how is BGE used for reranking?

Fine-tuning error: RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

I have already ruled out data and GPU problems; have you encountered this before? The error points to this code:
def compute_loss(self, model, inputs, return_outputs=False):
    """
    How the loss is computed by Trainer. By default, all models return the loss in the first element.

    Subclass and override for custom behavior.
    """
    outputs = model(**inputs)
    loss = outputs.loss

    return (loss, outputs) if return_outputs else loss

self-hosting: what are the estimated GPU size requirements?

Hello, I primarily work on https://github.com/arguflow . We are looking to deploy the Data platform to provide semantic search for a broader range of use-cases. To that end, we need to self-host a production-level embedding model. Is it viable to host bge-large-en ? How big of a GPU will we need? What are the performance expectations? Thanks!

Information on our use-case

There is an important dataset in the Argument-Mining research space called DebateSum. Every year, thousands of competitive debate students around the world open source high quality argument mining data that gets buried in hard to parse docx files.

We built a flexible and open source system to parse those files, extract embedding segments, and semantically search them. Further, there is a neat UI to encourage more contributions to the dataset.

Check out the demo @ https://vault.arguflow.ai

Code @ https://github.com/arguflow

About fine-tuning

Hello, since I am new to this field, I would like to ask some questions that may be quite basic.

When I want to fine-tune my own model on top of your BGE-small model:

  1. If sentence-transformers and the related environment (PyTorch, etc.) are already set up in my conda environment, do I still need to install FlagEmbedding as described in the Installation section of the README, i.e., run pip install . or pip install -e . to install what is defined in setup.py?
  2. For the fine-tuning data, I used the JSON file from the example you provided. Do I only need to configure the parameters listed in the Train section in PyCharm's run configuration, or, on Linux, directly execute the training script shown in the Train section? Among these parameters, if I have no GPU, is it enough to simply remove the first line and the negatives_cross_device argument?
  3. Following point 2, I ran run.py in PyCharm and got a temp folder in the current directory containing several checkpoint folders and model files (config.json, pytorch_model.bin, etc.). How do I load my own model? I tried SentenceTransformer('./temp'), but it reports that the model cannot be found (see the loading sketch below).

Thank you very much for taking the time to reply and resolve my confusion.
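On point 3: the training script saves a transformers-format checkpoint rather than a sentence-transformers one, so one way to load it is through FlagEmbedding's FlagModel (a sketch; ./temp is the asker's output directory, and a specific checkpoint subfolder could be used instead):

from FlagEmbedding import FlagModel

# Point FlagModel at the fine-tuning output directory (transformers format).
model = FlagModel(
    "./temp",
    query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:",
)
embeddings = model.encode(["测试句子"])
print(embeddings.shape)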

Hard negative mining

Hello, thanks for the excellent work.
1. During training, in the fine-tuning stage, how exactly is hard negative mining done? Could you briefly describe it, and how many negatives are mined? The default seems to be 7 negatives (a generic mining sketch follows below).
2. During fine-tuning, is the passage argument in forward of shape [8, batch_size, max_length], where the first tensor [batch_size, max_length] contains the positives for the queries in the current batch and the remaining 7 tensors contain the 7 negatives for those queries?
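For reference, hard negatives are commonly mined by retrieving candidates with the current embedding model and sampling non-positives from a middle rank range. Below is a generic sketch of that idea (not the repo's mining script; the corpus and ranks are toy values):

import numpy as np
from FlagEmbedding import FlagModel

model = FlagModel("BAAI/bge-base-zh-v1.5",
                  query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")

query = "示例查询"
corpus = ["相关文档", "无关文档甲", "无关文档乙", "无关文档丙"]
positives = {"相关文档"}

q_emb = model.encode_queries([query])      # (1, dim), normalized
p_emb = model.encode(corpus)               # (N, dim), normalized
ranked = np.argsort(-(q_emb @ p_emb.T)[0]) # passages sorted by similarity to the query

# Skip the top-ranked results (likely positives or near-duplicates) and known
# positives, then keep up to 7 negatives to match the default group size of 8.
# In practice the sampling window is much deeper, e.g. ranks 10-100.
window = [i for i in ranked[1:] if corpus[i] not in positives]
hard_negatives = [corpus[i] for i in window[:7]]
print(hard_negatives)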

Question about pre-training and fine-tuning data volume

Hello! Excellent work! I am curious about the amount of data used in the pre-training and fine-tuning stages. You mention that the English fine-tuning corpus is roughly 230M pairs; was all of it used for training, possibly even for multiple epochs, or only part of it? And roughly how much data did the model see during pre-training?

Question about fine-tuning data

Hello, can a question-answering dataset be used to build the fine-tuning dataset? Looking forward to your reply.

Question about online inference performance

  1. Is conversion to ONNX supported to speed up inference?
  2. For online inference without a GPU, what is the minimum hardware required to run the large models on CPU within 100 ms, e.g., 4 cores and 8 GB of RAM?

hybrid search using both BGE and BM25

Dear Team,

Thank you for this great work.

Our team has been actively engaged in testing hybrid search strategies by integrating BM25 search with semantic search (we leverage search engines like Elasticsearch / opensearch / azure cognitive search). In our recent experimentation, we have consistently observed that the hybrid search approach outperforms the individual model performance for both the instructor (https://instructor-embedding.github.io/) and sentence-transformer (multi-qa-mpnet-base-dot-v1) models. This encouraging outcome underscores the efficacy of the hybrid strategy.

However, an interesting pattern emerged when we incorporated BGE into our hybrid search framework. In this specific scenario, the hybrid search results fell between the performance of the BM25-only search and the BGE-only search strategies and worse than hybrid search results from BM25 + instructor.

Given these intriguing findings, we are reaching out to inquire if any of you have hands-on experience or insights related to hybrid search strategies involving BGE. Your valuable input could significantly contribute to our ongoing research and experimentation.

Thank you for your time and assistance.

Multilingual Models

Do you plan to train and release multilingual embedding models in the near future?

Question about negatives_cross_device

After enabling negatives_cross_device, can it happen that the negatives from examples on other devices are exactly the positives on the current device, and would this affect the model's learning?
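For intuition, here is a conceptual sketch of cross-device in-batch negatives (a simplification, not the repo's implementation: one positive per query, and gradient flow through all_gather is ignored). The collision the question describes would show up as a duplicate of a query's correct passage inside all_p, i.e., a false negative:

import torch
import torch.distributed as dist
import torch.nn.functional as F

def cross_device_contrastive_loss(q_emb, p_emb, temperature=0.05):
    # Gather passage embeddings from every GPU so each query is contrasted
    # against the passages of the whole global batch, not just the local one.
    gathered = [torch.zeros_like(p_emb) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, p_emb)
    all_p = torch.cat(gathered, dim=0)                 # (world_size * local_batch, dim)

    scores = q_emb @ all_p.T / temperature             # (local_batch, world_size * local_batch)
    # The target of query i is its own positive, offset by this rank's slice.
    offset = dist.get_rank() * p_emb.size(0)
    target = torch.arange(q_emb.size(0), device=q_emb.device) + offset
    # If another rank's batch happens to contain the same passage as query i's
    # positive, it is treated as a negative here (the situation asked about).
    return F.cross_entropy(scores, target)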

A model supporting both Chinese and English?

Your models are split by language into bge-large-en and bge-large-zh, but in my tests bge-large-zh also computes similarity for English text quite well. I would like to confirm whether bge-large-zh supports both Chinese and English; if so, how does bge-large-zh perform on English datasets?

parameter: batch size per gpu

Hello, excellent work!
One point of confusion: you mention in the README:

We trained our model on 48 A100(40G) GPUs with a large batch size of 32,768 (so there are 65,535 negatives for each query in a batch)

As I understand it, negative num = (group size * batch size per gpu * gpu_num) - 1.
You mention that negative num = 65535 during fine-tuning, so group size * batch size per gpu = (65535 + 1) / 48, but that is clearly not an integer. Is something wrong somewhere?

Could you also share the exact parameters used to fine-tune BGE-base-zh? I would like to learn more. Thanks!

Hope to get in touch

Dear FlagEmbedding developer,
Greetings! I am Jimmy, a community developer and volunteer at InternLM. Your work has been immensely beneficial to me, and I believe it can be effectively utilized in InternLM as well. You are welcome to join our Discord: https://discord.gg/gF9ezcmtM3 . I hope to get in touch with you.

Best regards,
Jimmy

Is the data shuffled during fine-tuning?

Hello, a few questions; thank you for your answers!
1. From the code, it seems the data is not shuffled at each epoch. Does the code support shuffling, and how can it be enabled?
2. Without shuffling, will in-batch negative sampling see fewer distinct negatives?

What method does the Sentence Similarity widget on Hugging Face use?

I tried the two similarity computation methods from the examples, but both differ from the results computed on Hugging Face:
https://huggingface.co/BAAI/bge-large-zh

from sentence_transformers import SentenceTransformer
queries = ["那是 個快樂的人"]
passages = ["那是 條快樂的狗", "那是 個非常幸福的人", "今天是晴天", "那是个快乐的人", "那是 個悲伤的人"]
instruction = "为这个句子生成表示以用于检索相关文章:"
model = SentenceTransformer('BAAI/bge-large-zh')
q_embeddings = model.encode([instruction+q for q in queries], normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)
scores = q_embeddings @ p_embeddings.T
print(scores)
# [[0.7388883 0.7768293 0.5702877 0.8297329 0.7422564]]

from FlagEmbedding import FlagModel
model = FlagModel('BAAI/bge-large-zh', query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")
queries = ["那是 個快樂的人"]
passages = ["那是 條快樂的狗", "那是 個非常幸福的人", "今天是晴天", "那是个快乐的人", "那是 個悲伤的人"]
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode(passages)
scores = q_embeddings @ p_embeddings.T
print(scores)
# [[0.7388883 0.7768293 0.5702877 0.8297329 0.7422564]]

What causes error 104 on a cloud server? It runs fine locally; is it a network configuration problem?

Traceback (most recent call last):
File "/repo/app.py", line 7, in
model = FlagModel('BAAI/bge-large-zh', query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/FlagEmbedding/baai_general_embedding/flag_models.py", line 18, in init
self.model = AutoModel.from_pretrained(model_name_or_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 493, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2539, in from_pretrained
resolved_archive_file = cached_file(
^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/transformers/utils/hub.py", line 417, in cached_file
resolved_file = hf_hub_download(
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1364, in hf_hub_download
http_get(
File "/usr/local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 505, in http_get
r = _request_wrapper(
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 442, in _request_wrapper
return http_backoff(
^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/huggingface_hub/utils/_http.py", line 258, in http_backoff
response = session.request(method=method, url=url, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/huggingface_hub/utils/_http.py", line 63, in send
return super().send(request, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/requests/adapters.py", line 501, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: (ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')), '(Request ID: 4fdd90d4-fed9-44c7-8cf3-1cb8d1c56c6a)')

Question about the pre-training example

Is the provided pre-training example (FlagEmbedding.baai_general_embedding.retromae_pretrain.run) continued pre-training, or pre-training from scratch?

Error while trying to use custom DistilBERT model in the pretraining script

Hi,
I am trying to use a custom DistilBERT model with the pretraining script,
but I am facing the error below.
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


[2023-08-22 10:20:17,917] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-22 10:20:17,967] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
08/22/2023 10:20:18 - WARNING - main - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False
08/22/2023 10:20:18 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
08/22/2023 10:20:18 - INFO - main - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=True,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=bge_pretrained/runs/Aug22_10-20-18_experimentllm-0,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=10,
logging_strategy=steps,
lr_scheduler_type=linear,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=2.0,
optim=adamw_hf,
optim_args=None,
output_dir=bge_pretrained,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=20,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=False,
report_to=[],
resume_from_checkpoint=None,
run_name=bge_pretrained,
save_on_each_node=False,
save_safetensors=False,
save_steps=500,
save_strategy=steps,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
08/22/2023 10:20:18 - INFO - main - Model parameters ModelArguments(model_name_or_path='./upload/saved_models/handmade_tokenizer/azr_105epoch_mlm/resume_in_stag', config_name=None)
08/22/2023 10:20:18 - INFO - main - Data parameters DataTrainingArguments(train_data='bge_pretraining.jsonl', tokenizer_name=None, max_seq_length=512, encoder_mlm_probability=0.3, decoder_mlm_probability=0.5)
[INFO|configuration_utils.py:710] 2023-08-22 10:20:18,672 >> loading configuration file ./upload/saved_models/handmade_tokenizer/azr_105epoch_mlm/resume_in_stag/config.json
[INFO|configuration_utils.py:768] 2023-08-22 10:20:18,673 >> Model config DistilBertConfig {
"name_or_path": "./upload/saved_models/handmade_tokenizer/azr_105epoch_mlm/resume_in_stag",
"activation": "gelu",
"architectures": [
"DistilBertForMaskedLM"
],
"attention_dropout": 0.1,
"dim": 768,
"dropout": 0.1,
"hidden_dim": 3072,
"initializer_range": 0.02,
"max_position_embeddings": 512,
"model_type": "distilbert",
"n_heads": 12,
"n_layers": 6,
"pad_token_id": 0,
"qa_dropout": 0.1,
"seq_classif_dropout": 0.2,
"sinusoidal_pos_embds": false,
"tie_weights
": true,
"torch_dtype": "float32",
"transformers_version": "4.31.0",
"vocab_size": 42550
}

[INFO|modeling_utils.py:2600] 2023-08-22 10:20:18,688 >> loading weights file ./upload/saved_models/handmade_tokenizer/azr_105epoch_mlm/resume_in_stag/pytorch_model.bin
[INFO|modeling_utils.py:3329] 2023-08-22 10:20:19,301 >> All model checkpoint weights were used when initializing DistilBertForMaskedLM.

[INFO|modeling_utils.py:3337] 2023-08-22 10:20:19,301 >> All the weights of DistilBertForMaskedLM were initialized from the model checkpoint at ./upload/saved_models/handmade_tokenizer/azr_105epoch_mlm/resume_in_stag.
If your task is similar to the task the model of the checkpoint was trained on, you can already use DistilBertForMaskedLM for predictions without further training.
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/FlagEmbedding/baai_general_embedding/retromae_pretrain/run.py", line 128, in
main()
File "/opt/conda/lib/python3.8/site-packages/FlagEmbedding/baai_general_embedding/retromae_pretrain/run.py", line 93, in main
model = model_class.from_pretrained(model_args, model_args.model_name_or_path)
File "/opt/conda/lib/python3.8/site-packages/FlagEmbedding/baai_general_embedding/retromae_pretrain/modeling.py", line 91, in from_pretrained
model = cls(hf_model, model_args)
File "/opt/conda/lib/python3.8/site-packages/FlagEmbedding/baai_general_embedding/retromae_pretrain/modeling.py", line 24, in init
self.decoder_embeddings = self.lm.bert.embeddings
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1614, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistilBertForMaskedLM' object has no attribute 'bert'
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/FlagEmbedding/baai_general_embedding/retromae_pretrain/run.py", line 128, in
main()
File "/opt/conda/lib/python3.8/site-packages/FlagEmbedding/baai_general_embedding/retromae_pretrain/run.py", line 93, in main
model = model_class.from_pretrained(model_args, model_args.model_name_or_path)
File "/opt/conda/lib/python3.8/site-packages/FlagEmbedding/baai_general_embedding/retromae_pretrain/modeling.py", line 91, in from_pretrained
model = cls(hf_model, model_args)
File "/opt/conda/lib/python3.8/site-packages/FlagEmbedding/baai_general_embedding/retromae_pretrain/modeling.py", line 24, in init
self.decoder_embeddings = self.lm.bert.embeddings
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1614, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistilBertForMaskedLM' object has no attribute 'bert'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 588) of binary: /opt/conda/bin/python3.8
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in
sys.exit(main())
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

FlagEmbedding.baai_general_embedding.retromae_pretrain.run FAILED

Failures:
[1]:
time : 2023-08-22_10:20:20
host : experimentllm-0
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 589)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2023-08-22_10:20:20
host : experimentllm-0
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 588)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Relationship between text length and similarity

I am using an embedding model to build a document retrieval application. Documents are split along semantic paragraph boundaries, so some chunks are long and some contain very little text. When encoding with the BGE model, I find that chunks with less text often get higher similarity scores, even when chunks with more text are semantically more relevant. OpenAI's embeddings do not seem to have this problem. Is there any way to optimize for this?
