
bcembedding's Introduction

BCEmbedding: Bilingual and Crosslingual Embedding for RAG

         

English | 简体中文


Bilingual and Crosslingual Embedding (BCEmbedding) in English and Chinese, developed by NetEase Youdao, encompasses EmbeddingModel and RerankerModel. The EmbeddingModel specializes in generating semantic vectors, playing a crucial role in semantic search and question-answering, and the RerankerModel excels at refining search results and ranking tasks.

BCEmbedding serves as the cornerstone of Youdao's Retrieval Augmented Generation (RAG) implementation, notably QAnything [github], an open-source implementation widely integrated in various Youdao products like Youdao Speed Reading and Youdao Translation.

Distinguished for its bilingual and crosslingual proficiency, BCEmbedding excels in bridging Chinese and English linguistic gaps, achieving strong results both in semantic representation evaluations (MTEB) and in RAG evaluations (LlamaIndex).

Our Goals

Provide a bilingual and crosslingual two-stage retrieval model repository for the RAG community, which can be used directly without finetuning, including EmbeddingModel and RerankerModel:

  • One Model: EmbeddingModel handles bilingual and crosslingual retrieval tasks in English and Chinese. RerankerModel supports English, Chinese, Japanese and Korean.
  • One Model: Covers common business application scenarios with RAG optimization, e.g. Education, Medical Scenario, Law, Finance, Literature, FAQ, Textbook, Wikipedia, General Conversation.
  • Easy to Integrate: We provide APIs in BCEmbedding for LlamaIndex and LangChain integrations.
  • Other Points:
    • RerankerModel supports reranking of long passages (more than 512 tokens, up to 32k tokens);
    • RerankerModel provides a meaningful relevance score that helps remove low-quality passages (a filtering sketch follows this list);
    • EmbeddingModel does not require specific instructions.
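
As referenced above, because the reranker's relevance scores are meaningful in absolute terms, low-quality passages can be dropped with a simple score threshold. The sketch below only uses the compute_score API documented later in this README; the 0.35 cut-off is an illustrative assumption, not an official recommendation, and should be tuned on your own data.

from BCEmbedding import RerankerModel

model = RerankerModel(model_name_or_path="maidalun1020/bce-reranker-base_v1")

query = 'input_query'
passages = ['passage_0', 'passage_1']
sentence_pairs = [[query, passage] for passage in passages]
scores = model.compute_score(sentence_pairs)

# keep only passages whose relevance score clears a (hypothetical) threshold
THRESHOLD = 0.35  # illustrative value; tune on your own data
kept_passages = [p for p, s in zip(passages, scores) if s >= THRESHOLD]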

Third-party Examples

🌐 Bilingual and Crosslingual Superiority

Existing embedding models often encounter performance challenges in bilingual and crosslingual scenarios, particularly in Chinese, English and their crosslingual tasks. BCEmbedding, leveraging the strength of Youdao's translation engine, excels in delivering superior performance across monolingual, bilingual, and crosslingual settings.

EmbeddingModel supports Chinese (ch) and English (en) (support for more languages is coming soon), while RerankerModel supports Chinese (ch), English (en), Japanese (ja) and Korean (ko).

💡 Key Features

  • Bilingual and Crosslingual Proficiency: Powered by Youdao's translation engine, excelling in Chinese, English and their crosslingual retrieval tasks, with upcoming support for additional languages.
  • RAG-Optimized: Tailored for diverse RAG tasks including translation, summarization, and question answering, ensuring accurate query understanding. See RAG Evaluations in LlamaIndex.
  • Efficient and Precise Retrieval: A dual-encoder (EmbeddingModel) for efficient first-stage retrieval, and a cross-encoder (RerankerModel) for enhanced precision and deeper semantic analysis in the second stage (see the sketch after this list).
  • Broad Domain Adaptability: Trained on diverse datasets for superior performance across various fields.
  • User-Friendly Design: Instruction-free, versatile use for multiple tasks without specifying a query instruction for each task.
  • Meaningful Reranking Scores: RerankerModel provides meaningful relevance scores to improve result quality and optimize large language model performance.
  • Proven in Production: Successfully implemented and validated in Youdao's products.
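
As a concrete illustration of the two-stage design described above, the sketch below uses EmbeddingModel for first-stage retrieval and RerankerModel for second-stage reranking. It assumes encode returns L2-normalized vectors (the library's default behaviour); the passages, the top_k value and the use of numpy are illustrative.

import numpy as np
from BCEmbedding import EmbeddingModel, RerankerModel

query = 'what do pandas eat?'
passages = ['The giant panda is a bear species endemic to China.',
            'Paris is the capital of France.',
            'Pandas mainly eat bamboo.']

# stage 1: dual-encoder retrieval with EmbeddingModel
embedder = EmbeddingModel(model_name_or_path="maidalun1020/bce-embedding-base_v1")
query_emb = embedder.encode([query])          # shape (1, dim), assumed L2-normalized
passage_embs = embedder.encode(passages)      # shape (n, dim), assumed L2-normalized
sims = (query_emb @ passage_embs.T)[0]        # dot product == cosine similarity for normalized vectors
top_k = 2                                     # illustrative value
candidates = [passages[i] for i in np.argsort(-sims)[:top_k]]

# stage 2: cross-encoder reranking with RerankerModel
reranker = RerankerModel(model_name_or_path="maidalun1020/bce-reranker-base_v1")
rerank_results = reranker.rerank(query, candidates)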

🚀 Latest Updates

🍎 Model List

Model Name            | Model Type     | Languages      | Parameters | Weights
bce-embedding-base_v1 | EmbeddingModel | ch, en         | 279M       | Huggingface, China mirror
bce-reranker-base_v1  | RerankerModel  | ch, en, ja, ko | 279M       | Huggingface, China mirror

📖 Manual

Installation

First, create a conda environment and activate it.

conda create --name bce python=3.10 -y
conda activate bce

Then install BCEmbedding with a minimal installation (to avoid CUDA version conflicts, first manually install a torch build compatible with your system's CUDA version):

pip install BCEmbedding==0.1.5

Or install from source (recommended):

git clone git@github.com:netease-youdao/BCEmbedding.git
cd BCEmbedding
pip install -v -e .

Quick Start

1. Based on BCEmbedding

Use EmbeddingModel; the cls pooler is the default.

from BCEmbedding import EmbeddingModel

# list of sentences
sentences = ['sentence_0', 'sentence_1']

# init embedding model
model = EmbeddingModel(model_name_or_path="maidalun1020/bce-embedding-base_v1")

# extract embeddings
embeddings = model.encode(sentences)

Use RerankerModel to calculate relevant scores and rerank:

from BCEmbedding import RerankerModel

# your query and corresponding passages
query = 'input_query'
passages = ['passage_0', 'passage_1']

# construct sentence pairs
sentence_pairs = [[query, passage] for passage in passages]

# init reranker model
model = RerankerModel(model_name_or_path="maidalun1020/bce-reranker-base_v1")

# method 0: calculate scores of sentence pairs
scores = model.compute_score(sentence_pairs)

# method 1: rerank passages
rerank_results = model.rerank(query, passages)

NOTE:

  • The RerankerModel.rerank method applies an advanced preprocessing step, the same one we use in production, to construct sentence_pairs when the passages are very long (a usage sketch follows this note).
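
For instance, rerank can be called directly with passages longer than 512 tokens; the preprocessing mentioned above splits them before scoring, so no manual chunking is needed. A minimal sketch (the long passage content is purely illustrative):

from BCEmbedding import RerankerModel

model = RerankerModel(model_name_or_path="maidalun1020/bce-reranker-base_v1")

long_passage = 'some very long passage text ' * 1000  # well beyond 512 tokens
rerank_results = model.rerank('input_query', ['passage_0', long_passage])
print(rerank_results)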

2. Based on transformers

For EmbeddingModel:

from transformers import AutoModel, AutoTokenizer

# list of sentences
sentences = ['sentence_0', 'sentence_1']

# init model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('maidalun1020/bce-embedding-base_v1')
model = AutoModel.from_pretrained('maidalun1020/bce-embedding-base_v1')

device = 'cuda'  # if no GPU, set "cpu"
model.to(device)

# get inputs
inputs = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")
inputs_on_device = {k: v.to(device) for k, v in inputs.items()}

# get embeddings
outputs = model(**inputs_on_device, return_dict=True)
embeddings = outputs.last_hidden_state[:, 0]  # cls pooler
embeddings = embeddings / embeddings.norm(dim=1, keepdim=True)  # normalize
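
Because the embeddings above are L2-normalized, cosine similarity between sentences reduces to a dot product. A minimal follow-up sketch using the embeddings tensor from the snippet above:

# cosine similarity between all sentence pairs (embeddings are already normalized)
similarity = embeddings @ embeddings.T
print(similarity)  # 2 x 2 matrix; the diagonal is ~1.0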

For RerankerModel:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# your query and corresponding passages (as in the Quick Start example)
query = 'input_query'
passages = ['passage_0', 'passage_1']
sentence_pairs = [[query, passage] for passage in passages]

# init model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('maidalun1020/bce-reranker-base_v1')
model = AutoModelForSequenceClassification.from_pretrained('maidalun1020/bce-reranker-base_v1')

device = 'cuda'  # if no GPU, set "cpu"
model.to(device)

# get inputs
inputs = tokenizer(sentence_pairs, padding=True, truncation=True, max_length=512, return_tensors="pt")
inputs_on_device = {k: v.to(device) for k, v in inputs.items()}

# calculate scores
scores = model(**inputs_on_device, return_dict=True).logits.view(-1,).float()
scores = torch.sigmoid(scores)
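
To turn these raw scores into a reranking, sort the passages by score. A minimal follow-up sketch using the passages list and scores tensor from the snippet above:

# sort passages by reranker score, highest first
ranked = sorted(zip(passages, scores.tolist()), key=lambda x: x[1], reverse=True)
for passage, score in ranked:
    print(f"{score:.4f}\t{passage}")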

3. Based on sentence_transformers

For EmbeddingModel:

from sentence_transformers import SentenceTransformer

# list of sentences
sentences = ['sentence_0', 'sentence_1', ...]

# init embedding model
## Note for the new sentence-transformers update: clean up your "`SENTENCE_TRANSFORMERS_HOME`/maidalun1020_bce-embedding-base_v1" or "~/.cache/torch/sentence_transformers/maidalun1020_bce-embedding-base_v1" first so that the new version is downloaded.
model = SentenceTransformer("maidalun1020/bce-embedding-base_v1")

# extract embeddings
embeddings = model.encode(sentences, normalize_embeddings=True)

For RerankerModel:

from sentence_transformers import CrossEncoder

# construct sentence pairs (as in the Quick Start example)
query = 'input_query'
passages = ['passage_0', 'passage_1']
sentence_pairs = [[query, passage] for passage in passages]

# init reranker model
model = CrossEncoder('maidalun1020/bce-reranker-base_v1', max_length=512)

# calculate scores of sentence pairs
scores = model.predict(sentence_pairs)

Embedding and Reranker Integrations for RAG Frameworks

1. Used in langchain

We provide BCERerank in BCEmbedding.tools.langchain, which inherits the advanced preprocessing tokenization of RerankerModel.

  • Install langchain first
pip install langchain==0.1.0
pip install langchain-community==0.0.9
pip install langchain-core==0.1.7
pip install langsmith==0.0.77
  • Demo
# We provide the advanced preproc tokenization for reranking.
from BCEmbedding.tools.langchain import BCERerank

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS

from langchain.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.utils import DistanceStrategy
from langchain.retrievers import ContextualCompressionRetriever


# init embedding model
embedding_model_name = 'maidalun1020/bce-embedding-base_v1'
embedding_model_kwargs = {'device': 'cuda:0'}
embedding_encode_kwargs = {'batch_size': 32, 'normalize_embeddings': True, 'show_progress_bar': False}

embed_model = HuggingFaceEmbeddings(
  model_name=embedding_model_name,
  model_kwargs=embedding_model_kwargs,
  encode_kwargs=embedding_encode_kwargs
)

reranker_args = {'model': 'maidalun1020/bce-reranker-base_v1', 'top_n': 5, 'device': 'cuda:1'}
reranker = BCERerank(**reranker_args)

# init documents
documents = PyPDFLoader("BCEmbedding/tools/eval_rag/eval_pdfs/Comp_en_llama2.pdf").load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
texts = text_splitter.split_documents(documents)

# example 1. retrieval with embedding and reranker
retriever = FAISS.from_documents(texts, embed_model, distance_strategy=DistanceStrategy.MAX_INNER_PRODUCT).as_retriever(search_type="similarity", search_kwargs={"score_threshold": 0.3, "k": 10})

compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker, base_retriever=retriever
)

response = compression_retriever.get_relevant_documents("What is Llama 2?")

2. Used in llama_index

We provide BCERerank in BCEmbedding.tools.llama_index, which inherits the advanced preprocessing tokenization of RerankerModel.

  • Install llama_index first
pip install llama-index==0.9.42.post2
  • Demo
# We provide the advanced preproc tokenization for reranking.
from BCEmbedding.tools.llama_index import BCERerank

import os
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index import VectorStoreIndex, ServiceContext, SimpleDirectoryReader
from llama_index.node_parser import SimpleNodeParser
from llama_index.llms import OpenAI
from llama_index.retrievers import VectorIndexRetriever

# init embedding model and reranker model
embed_args = {'model_name': 'maidalun1020/bce-embedding-base_v1', 'max_length': 512, 'embed_batch_size': 32, 'device': 'cuda:0'}
embed_model = HuggingFaceEmbedding(**embed_args)

reranker_args = {'model': 'maidalun1020/bce-reranker-base_v1', 'top_n': 5, 'device': 'cuda:1'}
reranker_model = BCERerank(**reranker_args)

# example #1. extract embeddings
query = 'apples'
passages = [
        'I like apples', 
        'I like oranges', 
        'Apples and oranges are fruits'
    ]
query_embedding = embed_model.get_query_embedding(query)
passages_embeddings = embed_model.get_text_embedding_batch(passages)

# example #2. rag example
llm = OpenAI(model='gpt-3.5-turbo-0613', api_key=os.environ.get('OPENAI_API_KEY'), api_base=os.environ.get('OPENAI_BASE_URL'))
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)

documents = SimpleDirectoryReader(input_files=["BCEmbedding/tools/eval_rag/eval_pdfs/Comp_en_llama2.pdf"]).load_data()
node_parser = SimpleNodeParser.from_defaults(chunk_size=400, chunk_overlap=80)
nodes = node_parser.get_nodes_from_documents(documents[0:36])
index = VectorStoreIndex(nodes, service_context=service_context)

query = "What is Llama 2?"

# example #2.1. retrieval with EmbeddingModel and RerankerModel
vector_retriever = VectorIndexRetriever(index=index, similarity_top_k=10, service_context=service_context)
retrieval_by_embedding = vector_retriever.retrieve(query)
retrieval_by_reranker = reranker_model.postprocess_nodes(retrieval_by_embedding, query_str=query)

# example #2.2. query with EmbeddingModel and RerankerModel
query_engine = index.as_query_engine(node_postprocessors=[reranker_model])
query_response = query_engine.query(query)

⚙️ Evaluation

Evaluate Semantic Representation by MTEB

We provide evaluation tools for embedding and reranker models, based on MTEB and C_MTEB.

First, install MTEB:

pip install mteb==1.1.1

1. Embedding Models

Just run the following command to evaluate your_embedding_model (e.g. maidalun1020/bce-embedding-base_v1) in bilingual and crosslingual settings (e.g. ["en", "zh", "en-zh", "zh-en"]):

python BCEmbedding/tools/eval_mteb/eval_embedding_mteb.py --model_name_or_path maidalun1020/bce-embedding-base_v1 --pooler cls

The evaluation covers 114 datasets across "Retrieval", "STS", "PairClassification", "Classification", "Reranking" and "Clustering".

NOTE:

  • All models are evaluated with their recommended pooling method (pooler):
    • mean pooler: "jina-embeddings-v2-base-en", "m3e-base", "m3e-large", "e5-large-v2", "multilingual-e5-base", "multilingual-e5-large" and "gte-large".
    • cls pooler: Other models.
  • "jina-embeddings-v2-base-en" model should be loaded with trust_remote_code.
python BCEmbedding/tools/eval_mteb/eval_embedding_mteb.py --model_name_or_path {mean_pooler_models} --pooler mean

python BCEmbedding/tools/eval_mteb/eval_embedding_mteb.py --model_name_or_path jinaai/jina-embeddings-v2-base-en --pooler mean --trust_remote_code

2. Reranker Models

Run the following command to evaluate your_reranker_model (e.g. "maidalun1020/bce-reranker-base_v1") in bilingual and crosslingual settings (e.g. ["en", "zh", "en-zh", "zh-en"]):

python BCEmbedding/tools/eval_mteb/eval_reranker_mteb.py --model_name_or_path maidalun1020/bce-reranker-base_v1

The evaluation tasks contain 12 datasets of "Reranking".

3. Metrics Visualization Tool

We provide a one-click script to summarize evaluation results of embedding and reranker models as Embedding Models Evaluation Summary and Reranker Models Evaluation Summary.

python BCEmbedding/evaluation/mteb/summarize_eval_results.py --results_dir {your_embedding_results_dir | your_reranker_results_dir}

Evaluate RAG by LlamaIndex

LlamaIndex is a well-known data framework for LLM-based applications, particularly in RAG. Recently, a LlamaIndex Blog post evaluated popular embedding and reranker models in a RAG pipeline and attracted much attention. We follow its pipeline to evaluate our BCEmbedding.

First, install LlamaIndex, and upgrade transformers to 4.36.0:

pip install transformers==4.36.0

pip install llama-index==0.9.22

Export your OpenAI and Cohere API keys, and the OpenAI base URL (e.g. "https://api.openai.com/v1"), as environment variables:

export OPENAI_BASE_URL={openai_base_url}  # https://api.openai.com/v1
export OPENAI_API_KEY={your_openai_api_key}
export COHERE_APPKEY={your_cohere_api_key}

1. Metrics Definition

  • Hit Rate:

    Hit rate calculates the fraction of queries where the correct answer is found within the top-k retrieved documents. In simpler terms, it's about how often our system gets it right within the top few guesses. The larger, the better.

  • Mean Reciprocal Rank (MRR):

    For each query, MRR evaluates the system's accuracy by looking at the rank of the highest-placed relevant document. Specifically, it's the average of the reciprocals of these ranks across all the queries. So, if the first relevant document is the top result, the reciprocal rank is 1; if it's second, the reciprocal rank is 1/2, and so on. The larger, the better.
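
As a worked sketch of both metrics (plain Python, assuming each query has a single relevant document id and a ranked list of retrieved ids):

def hit_rate_and_mrr(retrieved_ids_per_query, relevant_id_per_query, k=10):
    """Compute Hit Rate@k and MRR over a set of queries."""
    hits, reciprocal_ranks = 0, []
    for retrieved, relevant in zip(retrieved_ids_per_query, relevant_id_per_query):
        top_k = retrieved[:k]
        if relevant in top_k:
            hits += 1
            reciprocal_ranks.append(1.0 / (top_k.index(relevant) + 1))
        else:
            reciprocal_ranks.append(0.0)
    n = len(relevant_id_per_query)
    return hits / n, sum(reciprocal_ranks) / n

# toy example: 2 queries; the relevant doc is ranked 1st and 2nd respectively
hr, mrr = hit_rate_and_mrr([['d1', 'd2'], ['d3', 'd4']], ['d1', 'd4'])
# hr == 1.0, mrr == (1 + 1/2) / 2 == 0.75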

2. Reproduce LlamaIndex Blog

In order to compare our BCEmbedding fairly with other embedding and reranker models, we provide a one-click script to reproduce the results of the LlamaIndex Blog, with our BCEmbedding included:

# There should be two GPUs available at least.
CUDA_VISIBLE_DEVICES=0,1 python BCEmbedding/tools/eval_rag/eval_llamaindex_reproduce.py

Then, summarize the evaluation results by:

python BCEmbedding/tools/eval_rag/summarize_eval_results.py --results_dir BCEmbedding/results/rag_reproduce_results

Results reproduced from the LlamaIndex Blog can be checked in the Reproduced Summary of RAG Evaluation, with some clear conclusions:

  • In the WithoutReranker setting, our bce-embedding-base_v1 outperforms all the other embedding models.
  • With the embedding model fixed, our bce-reranker-base_v1 achieves the best performance.
  • The combination of bce-embedding-base_v1 and bce-reranker-base_v1 is SOTA.

3. Broad Domain Adaptability

The LlamaIndex Blog evaluation is monolingual, uses a small amount of data, and covers only a single domain (the "llama2" paper). To evaluate broad domain adaptability as well as bilingual and crosslingual capability, we follow the blog to build a multiple-domains evaluation dataset (including "Computer Science", "Physics", "Biology", "Economics", "Math", and "Quantitative Finance". Details), named CrosslingualMultiDomainsDataset:

  • To prevent test data leakage, English eval data is selected from the latest English articles in various fields on ArXiv, published up to December 30, 2023. Chinese eval data is selected from high-quality, recent Chinese articles in the corresponding fields on Semantic Scholar.
  • OpenAI gpt-4-1106-preview is used to produce high-quality eval data.

First, run the following command to evaluate the most popular and powerful embedding and reranker models:

# There should be two GPUs available at least.
CUDA_VISIBLE_DEVICES=0,1 python BCEmbedding/tools/eval_rag/eval_llamaindex_multiple_domains.py

Then, run the following script to summarize the evaluation results:

python BCEmbedding/tools/eval_rag/summarize_eval_results.py --results_dir BCEmbedding/results/rag_results

The summary of multiple domains evaluations can be seen in Multiple Domains Scenarios.

📈 Leaderboard

Semantic Representation Evaluations in MTEB

1. Embedding Models

Model | Dimensions | Pooler | Instructions | Retrieval (47) | STS (19) | PairClassification (5) | Classification (21) | Reranking (12) | Clustering (15) | AVG (119)
bge-base-en-v1.5 | 768 | cls | Need | 37.14 | 55.06 | 75.45 | 59.73 | 43.00 | 37.74 | 47.19
bge-base-zh-v1.5 | 768 | cls | Need | 47.63 | 63.72 | 77.40 | 63.38 | 54.95 | 32.56 | 53.62
bge-large-en-v1.5 | 1024 | cls | Need | 37.18 | 54.09 | 75.00 | 59.24 | 42.47 | 37.32 | 46.80
bge-large-zh-v1.5 | 1024 | cls | Need | 47.58 | 64.73 | 79.14 | 64.19 | 55.98 | 33.26 | 54.23
gte-large | 1024 | mean | Free | 36.68 | 55.22 | 74.29 | 57.73 | 42.44 | 38.51 | 46.67
gte-large-zh | 1024 | cls | Free | 41.15 | 64.62 | 77.58 | 62.04 | 55.62 | 33.03 | 51.51
jina-embeddings-v2-base-en | 768 | mean | Free | 31.58 | 54.28 | 74.84 | 58.42 | 41.16 | 34.67 | 44.29
m3e-base | 768 | mean | Free | 46.29 | 63.93 | 71.84 | 64.08 | 52.38 | 37.84 | 53.54
m3e-large | 1024 | mean | Free | 34.85 | 59.74 | 67.69 | 60.07 | 48.99 | 31.62 | 46.78
e5-large-v2 | 1024 | mean | Need | 35.98 | 55.23 | 75.28 | 59.53 | 42.12 | 36.51 | 46.52
multilingual-e5-base | 768 | mean | Need | 54.73 | 65.49 | 76.97 | 69.72 | 55.01 | 38.44 | 58.34
multilingual-e5-large | 1024 | mean | Need | 56.76 | 66.79 | 78.80 | 71.61 | 56.49 | 43.09 | 60.50
bce-embedding-base_v1 | 768 | cls | Free | 57.60 | 65.73 | 74.96 | 69.00 | 57.29 | 38.95 | 59.43

NOTE:

  • Our bce-embedding-base_v1 outperforms other open-source embedding models with comparable model sizes.
  • 114 datasets yielding 119 eval results (some datasets contain multiple languages) across "Retrieval", "STS", "PairClassification", "Classification", "Reranking" and "Clustering" in the ["en", "zh", "en-zh", "zh-en"] setting, covering MTEB and CMTEB.
  • The crosslingual evaluation datasets we released belong to the Retrieval task.
  • More evaluation details can be found in Embedding Models Evaluations.

2. Reranker Models

Model | Reranking (12) | AVG (12)
bge-reranker-base | 59.04 | 59.04
bge-reranker-large | 60.86 | 60.86
bce-reranker-base_v1 | 61.29 | 61.29

NOTE:

  • Our bce-reranker-base_v1 outperforms other open-source reranker models.
  • 12 datasets of "Reranking" in the ["en", "zh", "en-zh", "zh-en"] setting.
  • More evaluation details can be found in Reranker Models Evaluations.

RAG Evaluations in LlamaIndex

1. Multiple Domains Scenarios

NOTE:

  • Data Quality:
    • To prevent test data leakage, English eval data is selected from the latest English articles in various fields on ArXiv, published up to December 30, 2023. Chinese eval data is selected from high-quality, recent Chinese articles in the corresponding fields on Semantic Scholar.
    • OpenAI gpt-4-1106-preview is used to produce high-quality eval data.
  • Evaluated in the ["en", "zh", "en-zh", "zh-en"] setting. If you are interested in the monolingual setting, please check the Chinese RAG evaluations with the ["zh"] setting and the English RAG evaluations with the ["en"] setting.
  • Consistent with our Reproduced Results of the LlamaIndex Blog.
  • In the WithoutReranker setting, our bce-embedding-base_v1 outperforms all the other embedding models.
  • With the embedding model fixed, our bce-reranker-base_v1 achieves the best performance.
  • The combination of bce-embedding-base_v1 and bce-reranker-base_v1 is SOTA.

🛠 Youdao's BCEmbedding API

For users who prefer a hassle-free experience without the need to download and configure the model on their own systems, BCEmbedding is readily accessible through Youdao's API. This option offers a streamlined and efficient way to integrate BCEmbedding into your projects, bypassing the complexities of manual setup and maintenance. Detailed instructions and comprehensive API documentation are available at Youdao BCEmbedding API. Here, you'll find all the necessary guidance to easily implement BCEmbedding across a variety of use cases, ensuring a smooth and effective integration for optimal results.

🧲 WeChat Group

Scan the QR code below to join the WeChat group.

✏️ Citation

If you use BCEmbedding in your research or project, please feel free to cite and star it:

@misc{youdao_bcembedding_2023,
    title={BCEmbedding: Bilingual and Crosslingual Embedding for RAG},
    author={NetEase Youdao, Inc.},
    year={2023},
    howpublished={\url{https://github.com/netease-youdao/BCEmbedding}}
}

🔐 License

BCEmbedding is licensed under Apache 2.0 License

🔗 Related Links

Netease Youdao - QAnything

FlagEmbedding

MTEB

C_MTEB

LLama Index | LlamaIndex Blog

HuixiangDou

bcembedding's People

Contributors

codesmith-emmy, shenlei1020, tpoisonooo, yazooliu

bcembedding's Issues

BCEembedding max_length

What is the maximum input length supported by the BCEmbedding model? If I set it to 512 but the actual input is longer than 512, what happens to the result?

Is this model's output dimension 768? How can I change it to 1024?

model_name = 'maidalun1020/bce-embedding-base_v1'
model_kwargs = {'device': 'cuda'}
encode_kwargs = {'batch_size': 64, 'normalize_embeddings': True, 'show_progress_bar': False}

embed_model = HuggingFaceEmbeddings(
model_name=model_name,
model_kwargs=model_kwargs,
encode_kwargs=encode_kwargs
)
Following the example above, the resulting dimension is 768, but our system expects 1024. How should I adjust the parameters?

AttributeError: 'SequenceClassifierOutput' object has no attribute 'last_hidden_state'

from transformers import AutoModel, AutoTokenizer

# list of sentences
sentences = ['sentence_0', 'sentence_1']

# init model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('maidalun1020/bce-embedding-base_v1')
model = AutoModel.from_pretrained('maidalun1020/bce-embedding-base_v1')

device = 'cpu'  # if no GPU, set "cpu"
model.to(device)

# get inputs
inputs = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")
inputs_on_device = {k: v.to(device) for k, v in inputs.items()}

# get embeddings
outputs = model(**inputs_on_device, return_dict=True)
embeddings = outputs.last_hidden_state[:, 0]  # cls pooler
embeddings = embeddings / embeddings.norm(dim=1, keepdim=True)  # normalize

I downloaded the model from HF to a local directory; when running the embedding I got the following error:

AttributeError: 'SequenceClassifierOutput' object has no attribute 'last_hidden_state'

How should I fix this? Thanks!

I also see the following log messages; do I need to worry about them?

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at bce-embedding-base_v1 and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']

You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Encountered a very strange problem

Code:
reranker = BCERerank(model="./bce-reranker-base_v1", top_n=5, device='cuda:0')

Error message:

ValueError Traceback (most recent call last)
Cell In[6], line 39
31 embed_model = HuggingFaceEmbeddings(
32 model_name=embedding_model_name,
33 model_kwargs=embedding_model_kwargs,
34 encode_kwargs=embedding_encode_kwargs
35 )
36 # create a reranker model
37 # reranker_args = {'model': './bce-reranker-base_v1', 'top_n': 5, 'device': 'cuda:0'}
38 # reranker = BCERerank()
---> 39 reranker = BCERerank(model="./bce-reranker-base_v1", top_n=5, device='cuda:0')

File /mnt/workspace/BCEmbedding/BCEmbedding/tools/langchain/bce_rerank.py:55, in BCERerank.__init__(self, top_n, model, device, **kwargs)
50 except ImportError:
51 raise ImportError(
52 "Cannot import BCEmbedding package,",
53 "please pip install BCEmbedding>=0.1.2",
54 )
---> 55 self._model = RerankerModel(model_name_or_path=model, device=device, **kwargs)
56 super().__init__(top_n=top_n, model=model)

File /opt/conda/lib/python3.10/site-packages/pydantic/v1/main.py:357, in BaseModel.__setattr__(self, name, value)
354 return object_setattr(self, name, value)
356 if self.__config__.extra is not Extra.allow and name not in self.__fields__:
--> 357 raise ValueError(f'"{self.__class__.__name__}" object has no field "{name}"')
358 elif not self.__config__.allow_mutation or self.__config__.frozen:
359 raise TypeError(f'"{self.__class__.__name__}" is immutable and does not support item assignment')

ValueError: "BCERerank" object has no field "_model"

Why are the bce-reranker model's scores generally low?

I deployed it with sentence_transformers.
Input: [[你是谁,你是谁],[你是谁,今年几岁]] (the first pair is "who are you" twice; the second pairs "who are you" with "how old are you this year")
Output: [0.625, 0.425]
Why is the score computed for the first pair so low?

Question about the passage length natively supported by bce-reranker-base_v1

Hello,

I noticed that the base model used by bce-reranker-base_v1 only supports inputs of length 512 (the position embeddings limit the length); inputs longer than 512 are handled by splitting the long passage into multiple chunks, scoring each chunk separately, and taking the max. I worry that this approach still loses part of the original semantics of long texts. Is there a way to make the model natively support passage inputs longer than 512?

Question about model fine-tuning

Hi, I would like to fine-tune domain-specific embedding and reranker models on my own dataset. How should I fine-tune them, and which fine-tuning frameworks do you recommend?

What is the correspondence between QAnything and this model repository?

Using the triton deployment of the embed model model.onnx inside QAnything, the encoding produced for the same input query seems unrelated to maidalun1020/bce-embedding-base_v1; the cosine similarity is -0.026. Are they the same model?

Table support

How well do the bce embedding and bce rerank models capture the semantics of tables? Does the training data contain tables? If so, are they represented in markdown form?

Comparison of evaluation metrics in a pure-Chinese scenario

Hello, I see your work has achieved excellent evaluation results.

I would like to know how your embedding + reranker combination compares with other combinations, such as bge-zh plus bge-reranker, on pure-Chinese RAG evaluation sets.
In our current deployment requirements, documents are mostly Chinese, and the need for bilingual support should be very small.

Reranker speed issue

Regarding reranker speed: for one query, selecting 40 items out of 1000 candidates takes 20 s. Is there any way to speed this up, for example TensorRT?

Which CUDA version is required?

I get the following error when importing:

ImportError Traceback (most recent call last)
Cell In[1], line 1
----> 1 import torch
2 import os
3 import re

File ~/.conda/envs/bce/lib/python3.10/site-packages/torch/__init__.py:235
233 if USE_GLOBAL_DEPS:
234 _load_global_deps()
--> 235 from torch._C import * # noqa: F403
237 # Appease the type checker; ordinarily this binding is inserted by the
238 # torch._C module initialization code in C
239 if TYPE_CHECKING:

ImportError: /home/powerop/.conda/envs/bce/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so: undefined symbol: ncclCommInitRankConfig

A question about the evaluation.

What chunk size is currently used when evaluating the Chinese data? Are the Q and reference_context produced with GPT-4 strongly correlated? We ask because the results on our private evaluation set are not as outstanding.

Question about the evaluation setup and comparisons

Thanks for your excellent work!

I have a small question about the reported results: the headline summary table seems to be computed over MTEB and CMTEB plus some newly created bilingual tasks, rather than on a single leaderboard, yet the compared baselines are all monolingual models. Have you compared against multilingual models, such as multilingual-e5, under the same setting?

I hope you can clarify. Thanks!

Request for non-NVIDIA GPU compatibility

Problem: when running the "Embedding and Reranker Integrations for RAG Frameworks" examples, both approaches raise the following error:

Traceback (most recent call last):
  File "/home/zc/miniconda3/BCEmbedding/test2.py", line 18, in <module>
    embed_model = HuggingFaceEmbeddings(
  File "/home/zc/miniconda3/envs/bce/lib/python3.10/site-packages/langchain_community/embeddings/huggingface.py", line 65, in __init__
    self.client = sentence_transformers.SentenceTransformer(
  File "/home/zc/miniconda3/envs/bce/lib/python3.10/site-packages/sentence_transformers/SentenceTransformer.py", line 215, in __init__
    self.to(device)
  File "/home/zc/miniconda3/envs/bce/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1152, in to
    return self._apply(convert)
  File "/home/zc/miniconda3/envs/bce/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/home/zc/miniconda3/envs/bce/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/home/zc/miniconda3/envs/bce/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/home/zc/miniconda3/envs/bce/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
    param_applied = fn(param)
  File "/home/zc/miniconda3/envs/bce/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1150, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/home/zc/miniconda3/envs/bce/lib/python3.10/site-packages/torch/cuda/__init__.py", line 302, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

Request:
Is there an example that adds support for other types of GPUs, or is there some other workaround? Many thanks.

Crash when using the GPU

04/29/2024 15:58:03 - [INFO] -BCEmbedding.models.RerankerModel->>> Loading from /workspace/bce-reranker-base_v1.
04/29/2024 15:58:04 - [INFO] -BCEmbedding.models.RerankerModel->>> Execute device: cuda; gpu num: 2; use fp16: False
Calculate scores: 0%| | 0/1 [00:00<?, ?it/s]Bus error (core dumped)

As shown in the log above, with two 3090 GPUs, running inside Docker, the crash is reproducible every time.
If I switch to CPU it works, but CPU inference is slow, taking on the order of seconds.

Difference between rerank_server and RerankerModel

I'd like to ask: for the same query and similar_questions, the ranking results from the rerank_server in QAnything and from the RerankerModel in BCEmbedding are different. Are the underlying models different?

Question about setting the rerank threshold

Thanks for your team's work!

After retrieval we use the reranker model to rerank the results. What is a reasonable threshold for the rerank score? Is 0.5 appropriate, i.e. below 0.5 means irrelevant and above 0.5 means relevant?

Possibly a bad case?

Here are two examples:
['运费是多少?', '打电话0.1元每分钟,短信0.1元每条扣费的喔亲亲~,接听电话免费']
Score: 0.524
['运费是多少', '打电话0.1元每分钟,短信0.1元每条扣费的喔亲亲~,接听电话免费']
Score: 0.487
(The query asks about shipping fees; the passage describes phone-call and SMS charges.)

Questions:

  1. In both examples the scores are clearly too high; the texts are semantically completely unrelated.
  2. The only difference between the first and second query is an added question mark, yet the score is much higher, which also seems unreasonable.

About inference acceleration

Your work, especially the bilingual part, is a great contribution to the open-source RAG community! When deploying for inference, I converted the model to ONNX with no loss of precision, using opset version 17, torch 2.1.2 and onnx 1.14.1.
The ONNX conversion shows no precision loss, but when converting the ONNX model to TRT I get the warning:
[2024-04-05 03:07:58 WARNING] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
The resulting TRT model has a very large precision error; used for retrieval it only reaches about 3% accuracy. Is this caused by the INT64 weights in the model? Such a large precision loss seems unlikely. I'd appreciate any help, many thanks.
The conversion code is as follows:

model = AutoModel.from_pretrained('./bce-emb')

def make_train_dummy_input(seq_len):
    org_input_ids = torch.tensor(
        [[i for i in range(seq_len)]], dtype=torch.int32)
    org_input_mask = torch.tensor([[1 for i in range(int(
        seq_len/2))] + [1 for i in range(seq_len - int(seq_len/2))]], dtype=torch.int32)
    return (org_input_ids.to(device), org_input_mask.to(device))

model.eval()

with torch.no_grad():
    model=model.to(device)
    org_dummy_input = make_train_dummy_input(64)
    # print(org_dummy_input)
    output = torch.onnx.export(model,
                               org_dummy_input,
                               "model17.onnx",
                               verbose=True,
                               opset_version=17,
                               # note: the order matters and must not be changed, otherwise the results will not match expectations
                               input_names=[
                                   'input_ids', 'attention_mask'],
                               # note the order, otherwise the wrong output_names may be used at inference time
                               output_names=['logits'],
                               do_constant_folding=True,
                               dynamic_axes={"input_ids": {0: "batch_size", 1: "sequence_length"},
                                             "attention_mask": {0: "batch_size", 1: "sequence_length"},
                                             "logits": {0: "batch_size"}
                                            })

The TRT conversion environment is nvcr.io/nvidia/tensorrt:23.06-py3 (NVIDIA's official Docker image), TensorRT version 8.6.1.
The TRT conversion CLI:

trtexec --onnx=/workspace/bce-emb.onnx \
--saveEngine=/workspace/model.plan \
--minShapes=input_ids:1x1,attention_mask:1x1 \
--optShapes=input_ids:4x128,attention_mask:4x128 \
--maxShapes=input_ids:64x512,attention_mask:64x512 \
--memPoolSize=workspace:8192MiB \
--fp16

The log output is as follows:

[04/05/2024-12:27:54] [I] === Model Options ===
[04/05/2024-12:27:54] [I] Format: ONNX
[04/05/2024-12:27:54] [I] Model: /workspace/bce-emb.onnx
[04/05/2024-12:27:54] [I] Output:
[04/05/2024-12:27:54] [I] === Build Options ===
[04/05/2024-12:27:54] [I] Max batch: explicit batch
[04/05/2024-12:27:54] [I] Memory Pools: workspace: 8192 MiB, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[04/05/2024-12:27:54] [I] minTiming: 1
[04/05/2024-12:27:54] [I] avgTiming: 8
[04/05/2024-12:27:54] [I] Precision: FP32
[04/05/2024-12:27:54] [I] LayerPrecisions:
[04/05/2024-12:27:54] [I] Layer Device Types:
[04/05/2024-12:27:54] [I] Calibration:
[04/05/2024-12:27:54] [I] Refit: Disabled
[04/05/2024-12:27:54] [I] Version Compatible: Disabled
[04/05/2024-12:27:54] [I] TensorRT runtime: full
[04/05/2024-12:27:54] [I] Lean DLL Path:
[04/05/2024-12:27:54] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[04/05/2024-12:27:54] [I] Exclude Lean Runtime: Disabled
[04/05/2024-12:27:54] [I] Sparsity: Disabled
[04/05/2024-12:27:54] [I] Safe mode: Disabled
[04/05/2024-12:27:54] [I] Build DLA standalone loadable: Disabled
[04/05/2024-12:27:54] [I] Allow GPU fallback for DLA: Disabled
[04/05/2024-12:27:54] [I] DirectIO mode: Disabled
[04/05/2024-12:27:54] [I] Restricted mode: Disabled
[04/05/2024-12:27:54] [I] Skip inference: Disabled
[04/05/2024-12:27:54] [I] Save engine: /workspace/model.plan
[04/05/2024-12:27:54] [I] Load engine:
[04/05/2024-12:27:54] [I] Profiling verbosity: 0
[04/05/2024-12:27:54] [I] Tactic sources: Using default tactic sources
[04/05/2024-12:27:54] [I] timingCacheMode: local
[04/05/2024-12:27:54] [I] timingCacheFile:
[04/05/2024-12:27:54] [I] Heuristic: Disabled
[04/05/2024-12:27:54] [I] Preview Features: Use default preview flags.
[04/05/2024-12:27:54] [I] MaxAuxStreams: -1
[04/05/2024-12:27:54] [I] BuilderOptimizationLevel: -1
[04/05/2024-12:27:54] [I] Input(s)s format: fp32:CHW
[04/05/2024-12:27:54] [I] Output(s)s format: fp32:CHW
[04/05/2024-12:27:54] [I] Input build shape: input_ids=1x1+4x128+64x512
[04/05/2024-12:27:54] [I] Input build shape: attention_mask=1x1+4x128+64x512
[04/05/2024-12:27:54] [I] Input calibration shapes: model
[04/05/2024-12:27:54] [I] === System Options ===
[04/05/2024-12:27:54] [I] Device: 0
[04/05/2024-12:27:54] [I] DLACore:
[04/05/2024-12:27:54] [I] Plugins:
[04/05/2024-12:27:54] [I] setPluginsToSerialize:
[04/05/2024-12:27:54] [I] dynamicPlugins:
[04/05/2024-12:27:54] [I] ignoreParsedPluginLibs: 0
[04/05/2024-12:27:54] [I]
[04/05/2024-12:27:54] [I] === Inference Options ===
[04/05/2024-12:27:54] [I] Batch: Explicit
[04/05/2024-12:27:54] [I] Input inference shape: attention_mask=4x128
[04/05/2024-12:27:54] [I] Input inference shape: input_ids=4x128
[04/05/2024-12:27:54] [I] Iterations: 10
[04/05/2024-12:27:54] [I] Duration: 3s (+ 200ms warm up)
[04/05/2024-12:27:54] [I] Sleep time: 0ms
[04/05/2024-12:27:54] [I] Idle time: 0ms
[04/05/2024-12:27:54] [I] Inference Streams: 1
[04/05/2024-12:27:54] [I] ExposeDMA: Disabled
[04/05/2024-12:27:54] [I] Data transfers: Enabled
[04/05/2024-12:27:54] [I] Spin-wait: Disabled
[04/05/2024-12:27:54] [I] Multithreading: Disabled
[04/05/2024-12:27:54] [I] CUDA Graph: Disabled
[04/05/2024-12:27:54] [I] Separate profiling: Disabled
[04/05/2024-12:27:54] [I] Time Deserialize: Disabled
[04/05/2024-12:27:54] [I] Time Refit: Disabled
[04/05/2024-12:27:54] [I] NVTX verbosity: 0
[04/05/2024-12:27:54] [I] Persistent Cache Ratio: 0
[04/05/2024-12:27:54] [I] Inputs:
[04/05/2024-12:27:54] [I] === Reporting Options ===
[04/05/2024-12:27:54] [I] Verbose: Disabled
[04/05/2024-12:27:54] [I] Averages: 10 inferences
[04/05/2024-12:27:54] [I] Percentiles: 90,95,99
[04/05/2024-12:27:54] [I] Dump refittable layers:Disabled
[04/05/2024-12:27:54] [I] Dump output: Disabled
[04/05/2024-12:27:54] [I] Profile: Disabled
[04/05/2024-12:27:54] [I] Export timing to JSON file:
[04/05/2024-12:27:54] [I] Export output to JSON file:
[04/05/2024-12:27:54] [I] Export profile to JSON file:
[04/05/2024-12:27:54] [I]
[04/05/2024-12:27:54] [I] === Device Information ===
[04/05/2024-12:27:54] [I] Selected Device: NVIDIA A10
[04/05/2024-12:27:54] [I] Compute Capability: 8.6
[04/05/2024-12:27:54] [I] SMs: 72
[04/05/2024-12:27:54] [I] Device Global Memory: 22731 MiB
[04/05/2024-12:27:54] [I] Shared Memory per SM: 100 KiB
[04/05/2024-12:27:54] [I] Memory Bus Width: 384 bits (ECC enabled)
[04/05/2024-12:27:54] [I] Application Compute Clock Rate: 1.695 GHz
[04/05/2024-12:27:54] [I] Application Memory Clock Rate: 6.251 GHz
[04/05/2024-12:27:54] [I]
[04/05/2024-12:27:54] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[04/05/2024-12:27:54] [I]
[04/05/2024-12:27:54] [I] TensorRT version: 8.6.1
[04/05/2024-12:27:54] [I] Loading standard plugins
[04/05/2024-12:27:55] [I] [TRT] [MemUsageChange] Init CUDA: CPU +520, GPU +0, now: CPU 537, GPU 13924 (MiB)
[04/05/2024-12:28:01] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +1436, GPU +266, now: CPU 2050, GPU 14190 (MiB)
[04/05/2024-12:28:01] [W] [TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usageand speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
[04/05/2024-12:28:01] [I] Start parsing network model.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 1118829273
[04/05/2024-12:28:09] [I] [TRT] ----------------------------------------------------------------
[04/05/2024-12:28:09] [I] [TRT] Input filename:   /workspace/bce-emb.onnx
[04/05/2024-12:28:09] [I] [TRT] ONNX IR version:  0.0.8
[04/05/2024-12:28:09] [I] [TRT] Opset version:    17
[04/05/2024-12:28:09] [I] [TRT] Producer name:    pytorch
[04/05/2024-12:28:09] [I] [TRT] Producer version: 2.1.2
[04/05/2024-12:28:09] [I] [TRT] Domain:
[04/05/2024-12:28:09] [I] [TRT] Model version:    0
[04/05/2024-12:28:09] [I] [TRT] Doc string:
[04/05/2024-12:28:09] [I] [TRT] ----------------------------------------------------------------
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 1118829273
[04/05/2024-12:28:11] [W] [TRT] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[04/05/2024-12:28:12] [I] Finished parsing network model. Parse time: 10.2094
[04/05/2024-12:28:12] [I] [TRT] Graph optimization time: 0.0657748 seconds.
[04/05/2024-12:28:12] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[04/05/2024-12:28:28] [I] [TRT] Detected 2 inputs and 2 output network tensors.
[04/05/2024-12:28:31] [I] [TRT] Total Host Persistent Memory: 48
[04/05/2024-12:28:31] [I] [TRT] Total Device Persistent Memory: 0
[04/05/2024-12:28:31] [I] [TRT] Total Scratch Memory: 2114454528
[04/05/2024-12:28:31] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1060 MiB, GPU 3512MiB
[04/05/2024-12:28:31] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 2 steps to complete.
[04/05/2024-12:28:31] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.013715ms to assign 2 blocks to 2 nodes requiring 2114455040 bytes.
[04/05/2024-12:28:31] [I] [TRT] Total Activation Memory: 2114455040
[04/05/2024-12:28:31] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +2048, now: CPU 0, GPU 2048 (MiB)
[04/05/2024-12:28:39] [I] Engine built in 44.7706 sec.
[04/05/2024-12:28:40] [I] [TRT] Loaded engine size: 1063 MiB
[04/05/2024-12:28:40] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1060,now: CPU 0, GPU 1060 (MiB)
[04/05/2024-12:28:40] [I] Engine deserialized in 0.121818 sec.
[04/05/2024-12:28:40] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +2017, now: CPU 0, GPU 3077 (MiB)
[04/05/2024-12:28:40] [W] [TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usageand speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
[04/05/2024-12:28:40] [I] Setting persistentCacheLimit to 0 bytes.
[04/05/2024-12:28:40] [I] Using random values for input input_ids
[04/05/2024-12:28:40] [I] Input binding for input_ids with dimensions 4x128 is created.
[04/05/2024-12:28:40] [I] Using random values for input attention_mask
[04/05/2024-12:28:40] [I] Input binding for attention_mask with dimensions 4x128 is created.
[04/05/2024-12:28:40] [I] Output binding for logits with dimensions 4x128x768 is created.
[04/05/2024-12:28:40] [I] Output binding for 1488 with dimensions 4x768 is created.
[04/05/2024-12:28:40] [I] Starting inference
[04/05/2024-12:28:43] [I] Warmup completed 44 queries over 200 ms
[04/05/2024-12:28:43] [I] Timing trace has 634 queries over 3.01157 s
[04/05/2024-12:28:43] [I]
[04/05/2024-12:28:43] [I] === Trace details ===
[04/05/2024-12:28:43] [I] Trace averages of 10 runs:
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71777 ms - Host latency: 4.8111 ms (enqueue 4.68778 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.69606 ms - Host latency: 4.78887 ms (enqueue 4.66904 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71172 ms - Host latency: 4.80529 ms (enqueue 4.6842 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71255 ms - Host latency: 4.80536 ms (enqueue 4.68546 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71368 ms - Host latency: 4.8067 ms (enqueue 4.68657 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71378 ms - Host latency: 4.8076 ms (enqueue 4.68813 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.7095 ms - Host latency: 4.80175 ms (enqueue 4.68302 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71083 ms - Host latency: 4.80322 ms (enqueue 4.68549 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.70866 ms - Host latency: 4.80078 ms (enqueue 4.68127 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71213 ms - Host latency: 4.80558 ms (enqueue 4.68696 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.7102 ms - Host latency: 4.80333 ms (enqueue 4.68237 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.7098 ms - Host latency: 4.80272 ms (enqueue 4.68302 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71245 ms - Host latency: 4.80533 ms (enqueue 4.68322 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.79283 ms - Host latency: 4.88607 ms (enqueue 4.73782 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 5.08766 ms - Host latency: 5.18107 ms (enqueue 5.05737 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.84741 ms - Host latency: 4.94027 ms (enqueue 4.84573 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.74828 ms - Host latency: 4.84208 ms (enqueue 4.72477 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71704 ms - Host latency: 4.81016 ms (enqueue 4.68956 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.70856 ms - Host latency: 4.80144 ms (enqueue 4.6801 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.711 ms - Host latency: 4.80448 ms (enqueue 4.68986 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71707 ms - Host latency: 4.8125 ms (enqueue 4.68035 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71266 ms - Host latency: 4.80608 ms (enqueue 4.68488 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71327 ms - Host latency: 4.8064 ms (enqueue 4.68513 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.70937 ms - Host latency: 4.80234 ms (enqueue 4.684 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71021 ms - Host latency: 4.80277 ms (enqueue 4.68435 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.70916 ms - Host latency: 4.80209 ms (enqueue 4.68029 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71064 ms - Host latency: 4.80372 ms (enqueue 4.68414 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71089 ms - Host latency: 4.80433 ms (enqueue 4.68595 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71422 ms - Host latency: 4.80765 ms (enqueue 4.68622 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71155 ms - Host latency: 4.80374 ms (enqueue 4.68427 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.70978 ms - Host latency: 4.80316 ms (enqueue 4.68428 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71857 ms - Host latency: 4.81082 ms (enqueue 4.68978 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71797 ms - Host latency: 4.8114 ms (enqueue 4.68748 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.80277 ms - Host latency: 4.89617 ms (enqueue 4.76615 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.90116 ms - Host latency: 4.99402 ms (enqueue 4.87052 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.84751 ms - Host latency: 4.94182 ms (enqueue 4.83071 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.75146 ms - Host latency: 4.84404 ms (enqueue 4.72413 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.74745 ms - Host latency: 4.84027 ms (enqueue 4.72244 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.74814 ms - Host latency: 4.84119 ms (enqueue 4.72034 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.75085 ms - Host latency: 4.8439 ms (enqueue 4.72666 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.70796 ms - Host latency: 4.79934 ms (enqueue 4.68162 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71643 ms - Host latency: 4.80862 ms (enqueue 4.68887 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.70688 ms - Host latency: 4.79854 ms (enqueue 4.67922 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.70933 ms - Host latency: 4.8021 ms (enqueue 4.68403 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71401 ms - Host latency: 4.80779 ms (enqueue 4.68694 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.70757 ms - Host latency: 4.80063 ms (enqueue 4.67991 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.70662 ms - Host latency: 4.79973 ms (enqueue 4.67981 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.71475 ms - Host latency: 4.80798 ms (enqueue 4.68315 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.7467 ms - Host latency: 4.83914 ms (enqueue 4.71975 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.73523 ms - Host latency: 4.828 ms (enqueue 4.70801 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.74326 ms - Host latency: 4.83728 ms (enqueue 4.71604 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.72288 ms - Host latency: 4.81548 ms (enqueue 4.6989 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.74504 ms - Host latency: 4.83687 ms (enqueue 4.71489 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.76904 ms - Host latency: 4.86096 ms (enqueue 4.73633 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.82097 ms - Host latency: 4.91309 ms (enqueue 4.79429 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.75395 ms - Host latency: 4.84707 ms (enqueue 4.73091 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.78035 ms - Host latency: 4.87405 ms (enqueue 4.74929 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.76145 ms - Host latency: 4.85464 ms (enqueue 4.73542 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.77607 ms - Host latency: 4.86899 ms (enqueue 4.74966 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.76257 ms - Host latency: 4.85547 ms (enqueue 4.73748 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.75317 ms - Host latency: 4.84756 ms (enqueue 4.72434 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.7521 ms - Host latency: 4.84453 ms (enqueue 4.72729 ms)
[04/05/2024-12:28:43] [I] Average on 10 runs - GPU latency: 4.74985 ms - Host latency: 4.84331 ms (enqueue 4.72151 ms)
[04/05/2024-12:28:43] [I]
[04/05/2024-12:28:43] [I] === Performance summary ===
[04/05/2024-12:28:43] [I] Throughput: 210.521 qps
[04/05/2024-12:28:43] [I] Latency: min = 4.78314 ms, max = 5.24432 ms, mean = 4.8347 ms, median = 4.80896 ms, percentile(90%) = 4.88477 ms, percentile(95%) = 4.93652 ms, percentile(99%) = 5.16418 ms
[04/05/2024-12:28:43] [I] Enqueue Time: min = 4.5 ms, max = 5.13123 ms, mean = 4.71434 ms, median = 4.69315 ms, percentile(90%) = 4.76709 ms, percentile(95%) = 4.8573 ms, percentile(99%) = 5.03857 ms
[04/05/2024-12:28:43] [I] H2D Latency: min = 0.00610352 ms, max = 0.0211182 ms, mean = 0.00693844 ms, median = 0.00683594 ms, percentile(90%) = 0.00756836 ms, percentile(95%) = 0.0078125 ms, percentile(99%) = 0.00830078 ms
[04/05/2024-12:28:43] [I] GPU Compute Time: min = 4.68994 ms, max = 5.15076 ms, mean = 4.74169 ms, median = 4.71545 ms, percentile(90%) = 4.79224 ms, percentile(95%) = 4.84351 ms, percentile(99%) = 5.07086 ms
[04/05/2024-12:28:43] [I] D2H Latency: min = 0.081543 ms, max = 0.0933533 ms, mean = 0.0860713 ms, median = 0.0859375 ms, percentile(90%) = 0.0877686 ms, percentile(95%) = 0.0881348 ms, percentile(99%) = 0.0895996 ms
[04/05/2024-12:28:43] [I] Total Host Walltime: 3.01157 s
[04/05/2024-12:28:43] [I] Total GPU Compute Time: 3.00623 s
[04/05/2024-12:28:43] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[04/05/2024-12:28:43] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[04/05/2024-12:28:43] [W] * GPU compute time is unstable, with coefficient of variance = 1.33615%.
[04/05/2024-12:28:43] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[04/05/2024-12:28:43] [I] Explanations of the performance metrics are printed in the verbose logs.

Looking through the log, the only anomaly is the INT64-to-INT32 warning. Have you encountered this problem before? Could you offer some pointers? Many thanks.

Also, when converting bce-rerank, the resulting plan model behaves as follows for single-pair inference:

[["what is panda", "panda is an animal"]] gives inference results consistent with PyTorch.
But when running [["what is panda", "panda is an animal"],["what is panda", "panda is an animal"]], the raw results from Triton inference are completely inconsistent with PyTorch, which is very strange... Is there an open-source ONNX or plan model for the reranker? If I solve the above problem I am also willing to contribute a ready-to-use converted model.
Thanks a lot if you can help!

Cannot specify an auth token when downloading the tokenizer

The __init__ of models/embedding.py and models/reranker.py both contain the following code:

self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) # no **kwargs here
self.model = AutoModel.from_pretrained(model_name_or_path, **kwargs)

Because **kwargs is not passed when downloading the tokenizer files, the specified use_auth_token is not forwarded and the download fails:

Cannot access gated repo for url https://huggingface.co/maidalun1020/bce-embedding-base_v1/resolve/main/tokenizer_config.json.
Access to model maidalun1020/bce-embedding-base_v1 is restricted. You must be authenticated to access it.

Maximum length

What is the maximum text length for the bce reranker model? If the input exceeds that length, how is it handled?

Questions about reranker model deployment and usage

I understand that the embedding model can be deployed on a server via SentenceTransformer and accessed through an API; is it feasible to keep the rerank model inside the local application? Does it need GPU resources? I'm a bit unsure about this part.

BCERerank import error when used in LlamaIndex

I installed llama-index 0.10.14 and, following the demo, ran: from BCEmbedding.tools.llama_index import BCERerank
When I import it, it reports:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[24], line 14
12 from llama_index.vector_stores.postgres import PGVectorStore
13 import os
---> 14 from BCEmbedding.tools.llama_index import BCERerank
15 from flask import Flask, request, jsonify

File ~/anaconda3/lib/python3.11/site-packages/BCEmbedding/tools/llama_index/__init__.py:8
1 '''
2 @description:
3 @author: shenlei
(...)
6 @LastEditors: shenlei
7 '''
----> 8 from .bce_rerank import BCERerank

File ~/anaconda3/lib/python3.11/site-packages/BCEmbedding/tools/llama_index/bce_rerank.py:10
1 '''
2 @description:
3 @author: shenlei
(...)
6 @LastEditors: shenlei
7 '''
8 from typing import Any, List, Optional
---> 10 from llama_index.bridge.pydantic import Field, PrivateAttr
11 from llama_index.callbacks import CBEventType, EventPayload
12 from llama_index.postprocessor.types import BaseNodePostprocessor

ModuleNotFoundError: No module named 'llama_index.bridge'

I tried uninstalling and reinstalling llama-index, but it does not work. If you see this issue, please give me some suggestions.

BCE Fine tuning

Hello, I built an end-to-end text-matching system on top of your project, using BCEmbedding for feature extraction, cosine similarity for recall, and the Reranker for fine-grained ranking, with good results.

Now I would like to fine-tune the models so that they perform better in our specific domain. Do you have any plans to release fine-tuning related material?
