prithivirajdamodaran / flashrank

Ultra-lite & Super-fast re-ranking for your search & retrieval pipelines. Based on SoTA models like cross-encoders and more. Created by Prithivi Da, open for PRs & Collaborations.

License: Apache License 2.0

Python 100.00%
cross-encoder full-text-search hybrid-search lexical-search rag ranking reranking retrieval-augmented-generation semantic-search vector-database
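
For readers new to the library, a minimal usage sketch follows. It is assembled from the Ranker / RerankRequest calls that appear in the issues further down this page; the default model and the exact keys in the returned results may vary between versions.

# Quick-start sketch, based on the API usage shown in the issues below.
from flashrank import Ranker, RerankRequest

ranker = Ranker(cache_dir="/tmp")  # downloads a small default model into the cache

request = RerankRequest(
    query="Tricks to accelerate LLM inference",
    passages=[
        {"id": 1, "text": "vLLM is a fast and easy-to-use library for LLM inference and serving.", "meta": {}},
        {"id": 2, "text": "Introduce lookahead decoding: a parallel decoding algo to accelerate LLM inference.", "meta": {}},
    ],
)

for result in ranker.rerank(request):
    print(result)  # each entry carries the passage fields plus a relevance score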

flashrank's Introduction

👋 I am Prithivida!

25 Million+ Model downloads in 🤗 | Cited in NeurIPS, ICLR, ACL | 3K+ ⭐️ GitHub.

flashrank's People

Contributors

jnash10, prithivirajdamodaran

flashrank's Issues

Support for Custom Models like ce-esci-MiniLM-L12-v2 in FlashRank

I am currently integrating a reranking solution into my Haystack RAG pipeline. I have been using the MetaRank reranker, but I am exploring FlashRank as an alternative because of its operational efficiency and because MetaRank requires a dedicated container.

In my evaluations on a specific dataset (shared below), I observed that the MetaRank models, particularly ce-esci-MiniLM-L12-v2, perform better on this data.

I would like to know whether FlashRank supports integrating custom models such as ce-esci-MiniLM-L12-v2. The ability to use such models could greatly influence how effective FlashRank is in specific use cases, especially where a particular model has shown superior performance.

Looking forward to your guidance on this.

Thank you!

Used dataset and results:

from haystack import Document
documents = [
    Document(
        "A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena ."),
    Document("A wild animal races across an uncut field with a minimal amount of trees ."),
    Document(
        "People line the stands which advertise Freemont 's orthopedics , a cowboy rides a light brown bucking bronco ."),
    Document("A man who is riding a wild horse in the rodeo is very near to falling off ."),
    Document("A rodeo cowboy , wearing a cowboy hat , is being thrown off of a wild white horse ."),
]
...
query = "wild west"
flashranker = FlashRankReranker(cache_dir="/tmp", model_name="rank-T5-flan")
...

MetaRankReranker:
score: 0.008271879516541958, content: A wild animal races across an uncut field with a minimal amount of trees .
score: 0.0013235947117209435, content: A rodeo cowboy , wearing a cowboy hat , is being thrown off of a wild white horse .
score: 0.0004636832163669169, content: A man who is riding a wild horse in the rodeo is very near to falling off .
score: 0.00036893945070914924, content: A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena .
score: 2.5501691197860055e-05, content: People line the stands which advertise Freemont 's orthopedics , a cowboy rides a light brown bucking bronco .
Execution Time: 0.29767465591430664 seconds

FlashRankReranker:
score: 0.5397251844406128, content: People line the stands which advertise Freemont 's orthopedics , a cowboy rides a light brown bucking bronco .
score: 0.5245905518531799, content: A man who is riding a wild horse in the rodeo is very near to falling off .
score: 0.51319420337677, content: A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena .
score: 0.47907745838165283, content: A wild animal races across an uncut field with a minimal amount of trees .
score: 0.4261687099933624, content: A rodeo cowboy , wearing a cowboy hat , is being thrown off of a wild white horse .
Execution Time: 0.0816802978515625 seconds
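
For context, ce-esci-MiniLM-L12-v2 is a regular Hugging Face cross-encoder, so it can already be scored outside FlashRank with sentence-transformers. The sketch below is not FlashRank's API; it only illustrates the kind of scoring loop a custom-model integration would need to wrap, and the model id used is an assumption.

# Illustration only, not FlashRank's API: scoring query/passage pairs with a
# custom cross-encoder via sentence-transformers. The model id is an assumption.
from sentence_transformers import CrossEncoder

model = CrossEncoder("metarank/ce-esci-MiniLM-L12-v2")

query = "wild west"
passages = [
    "A wild animal races across an uncut field with a minimal amount of trees .",
    "A rodeo cowboy , wearing a cowboy hat , is being thrown off of a wild white horse .",
]

scores = model.predict([(query, p) for p in passages])
for score, passage in sorted(zip(scores, passages), reverse=True):
    print(f"{score:.4f}\t{passage}")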

Multi-lingual model results are not as expected

Hi! Thank you for open-sourcing a sleek and wonderful package. We ran a couple of tests and noticed that nano and small give good (expected) results, while medium (multi-lingual) does not.

Please find below the nano/small and medium results for the example from the README:
query = "Tricks to accelerate LLM inference"
passages = [
"Introduce lookahead decoding: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.",
"LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper",
"There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call model.to_bettertransformer() on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run. ",
"Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup. ",
"vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels"
]
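
For completeness, the two configurations were invoked roughly as sketched below, continuing from the query and passages defined above. This assumes the Ranker / RerankRequest API shown in the FastAPI issue further down; the nano model name is an assumption, and the result keys may differ between versions.

# Reproduction sketch; only ms-marco-MultiBERT-L-12 is confirmed elsewhere on
# this page. The nano model name is an assumption.
from flashrank import Ranker, RerankRequest

passage_dicts = [{"id": i, "text": p, "meta": {}} for i, p in enumerate(passages)]

for model_name in ("ms-marco-TinyBERT-L-2-v2",   # nano (name assumed)
                   "ms-marco-MultiBERT-L-12"):    # multilingual "medium"
    ranker = Ranker(model_name=model_name, cache_dir="/tmp")
    results = ranker.rerank(RerankRequest(query=query, passages=passage_dicts))
    print(model_name)
    for r in results:
        print(r)  # passage fields plus the relevance score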

Results:
Nano/Small:
[{'score': 0.9957617,
'passage': 'Introduce lookahead decoding: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.'},
{'score': 0.9336851,
'passage': "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call model.to_bettertransformer() on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run. "},
{'score': 0.50486594,
'passage': 'vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels'},
{'score': 0.3989764,
'passage': 'LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper'},
{'score': 0.05916641,
'passage': 'Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup. '}]

Medium:
[{'score': 0.9666068,
'passage': "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call model.to_bettertransformer() on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run. "},
{'score': 0.9641034,
'passage': 'vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels'},
{'score': 0.9625791,
'passage': 'Introduce lookahead decoding: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.'},
{'score': 0.95415944,
'passage': 'Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup. '},
{'score': 0.9465828,
'passage': 'LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper'}]

As you can see, the medium model's scores are all high and show very little variation, so the ranking is effectively uninformative.

Please let me know if I have missed anything in how I am using it.
Thank you!

Wrong scoring when the query and a passage are identical

Hi,
I'm trying to test your reranker with Vietnamese data and got a weird result like this:

Query:  [Đầu tư căn hộ TP. Thủ Đức - Kỳ 2] Tiềm năng lợi nhuận đầu tư căn hộ TP. Thủ Đức như thế nào??
0.94775605 	 [Chuyên đề căn hộ Quận Thủ Đức - Kỳ 2] Tiện ích - hạ tầng đang hiện hữu tại thị trường căn hộ chung cư quận Thủ Đức
0.9460485 	 Chuyên Đề: Tiềm Năng Đầu Tư Căn Hộ TP Thủ Đức Quý II/2023
0.9404418 	 Có nên đầu tư căn hộ Empire City Thủ Thiêm?
0.93990564 	 [Đầu tư căn hộ TP. Thủ Đức - Kỳ 1]: Mức độ đô thị hóa và cơ sở hạ tầng TP. Thủ Đức đang như thế nào?
0.937425 	 [Đầu tư căn hộ TP. Thủ Đức - Kỳ 3]: Nhận diện những rủi ro tiềm ẩn khi đầu tư căn hộ
0.92928797 	 Infographic: Phân tích tiềm năng đầu tư dự án nhà phố, biệt thự TP Thủ Đức năm 2021
0.924897 	 [Chuyên đề căn hộ Quận Thủ Đức - Kỳ 3] Tiềm năng thị trường căn hộ chung cư quận Thủ Đức
0.9248124 	 Các dự án nhà phố, biệt thự Quận 2 (TP Thủ Đức) hiện có mức giá ra sao?
0.92082846 	 [Chuyên đề căn hộ Quận Thủ Đức - Kỳ 6] Các chủ đầu tư đang triển khai dự án căn hộ chung cư quận Thủ Đức
0.9133441 	 4 lý do bạn nên đầu tư căn hộ tại Thành phố Thủ Đức
0.84951854 	 Thông tin chi tiết về TP. Thủ Đức và các dự án chung cư nổi bật
0.84106666 	 Phân tích chi tiết tiềm năng đầu tư bất động sản TP Thủ Đức năm 2022
0.8402047 	 Cập nhật giá bán các dự án biệt thự, nhà phố tại TP Thủ Đức (Mới nhất)
0.7789081 	 Mua căn hộ Thủ Thiêm: Những thông tin bạn cần phải biết!
0.1311889 	 [Đầu tư căn hộ TP. Thủ Đức - Kỳ 2] Tiềm năng lợi nhuận đầu tư căn hộ TP. Thủ Đức như thế nào??

The last passage is exactly the same as the query, yet it received the lowest score. This seems very strange to me.

0.1311889 	 [Đầu tư căn hộ TP. Thủ Đức - Kỳ 2] Tiềm năng lợi nhuận đầu tư căn hộ TP. Thủ Đức như thế nào??

Here is the code snippet that produces the result above:


import unittest


class RankerTestCase(unittest.TestCase):

    def test_flash_rank(self):
        passages = [
            "[Đầu tư căn hộ TP. Thủ Đức - Kỳ 2] Tiềm năng lợi nhuận đầu tư căn hộ TP. Thủ Đức như thế nào??",
            "[Đầu tư căn hộ TP. Thủ Đức - Kỳ 1]: Mức độ đô thị hóa và cơ sở hạ tầng TP. Thủ Đức đang như thế nào?",
            "[Đầu tư căn hộ TP. Thủ Đức - Kỳ 3]: Nhận diện những rủi ro tiềm ẩn khi đầu tư căn hộ",
            "Chuyên Đề: Tiềm Năng Đầu Tư Căn Hộ TP Thủ Đức Quý II/2023",
            "Infographic: Phân tích tiềm năng đầu tư dự án nhà phố, biệt thự TP Thủ Đức năm 2021",
            "Các dự án nhà phố, biệt thự Quận 2 (TP Thủ Đức) hiện có mức giá ra sao?",
            "[Chuyên đề căn hộ Quận Thủ Đức - Kỳ 3] Tiềm năng thị trường căn hộ chung cư quận Thủ Đức",
            "[Chuyên đề căn hộ Quận Thủ Đức - Kỳ 6] Các chủ đầu tư đang triển khai dự án căn hộ chung cư quận Thủ Đức",
            "[Chuyên đề căn hộ Quận Thủ Đức - Kỳ 2] Tiện ích - hạ tầng đang hiện hữu tại thị trường căn hộ chung cư quận Thủ Đức",
            "Có nên đầu tư căn hộ Empire City Thủ Thiêm?",
            "Phân tích chi tiết tiềm năng đầu tư bất động sản TP Thủ Đức năm 2022",
            "Cập nhật giá bán các dự án biệt thự, nhà phố tại TP Thủ Đức (Mới nhất)",
            "Thông tin chi tiết về TP. Thủ Đức và các dự án chung cư nổi bật",
            "Mua căn hộ Thủ Thiêm: Những thông tin bạn cần phải biết!",
            "4 lý do bạn nên đầu tư căn hộ tại Thành phố Thủ Đức"
        ]

        from flashrank.Ranker import Ranker
        ranker = Ranker(model_name="ms-marco-MultiBERT-L-12", cache_dir="./.pytest_cache")

        query = "[Đầu tư căn hộ TP. Thủ Đức - Kỳ 2] Tiềm năng lợi nhuận đầu tư căn hộ TP. Thủ Đức như thế nào??"
        results = ranker.rerank(query, passages)
        print('\nQuery: ', query)
        for r in results:
            print(r['score'], '\t', r['passage'])

Option to Use GPU, CUDA

I really appreciate this repository. It would be great if the rerank model could optionally run on a GPU to take full advantage of the available hardware, potentially even with multi-GPU support.

Thank you.
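
FlashRank runs its models with ONNX Runtime under the hood, so GPU support would essentially come down to creating the inference session with a CUDA execution provider. A rough sketch of what that could look like is below; this is not the current FlashRank API, the model path is a placeholder, and it requires the onnxruntime-gpu package.

# Sketch of GPU-backed ONNX Runtime inference (not FlashRank's current API).
# Requires onnxruntime-gpu; the model path is a placeholder.
import onnxruntime as ort

providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]  # CPU fallback
session = ort.InferenceSession("path/to/reranker.onnx", providers=providers)

print(session.get_providers())  # shows whether CUDA was actually selected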

Failing inside a FastAPI call

from fastapi import FastAPI
from flashrank import Ranker, RerankRequest
from pydantic import BaseModel
from typing import List

app = FastAPI()

ranker = Ranker()


class Passage(BaseModel):
    id: str
    text: str
    meta: dict


class RankRequest(BaseModel):
    query: str
    passages: List[Passage]

@app.post("/rank")
async def rank(request: RankRequest):
    print(request.query, request.passages)
    passages_dict = [passage.model_dump() for passage in request.passages]
    rerankrequest = RerankRequest(query=request.query, passages=passages_dict)
    result = ranker.rerank(rerankrequest)
    return result

The app is started with

uvicorn main:app --reload

and the /rank call then fails with:

Traceback (most recent call last):
  File "/Users/abcd/.local/share/virtualenvs/flashrank-api-aZfxAaAm/lib/python3.11/site-packages/fastapi/encoders.py", line 322, in jsonable_encoder
    data = dict(obj)
           ^^^^^^^^^
TypeError: 'numpy.float32' object is not iterable
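
jsonable_encoder is failing on the numpy.float32 scores in the rerank result. One workaround, sketched below under the assumption that each result item is a dict whose "score" is a NumPy scalar (as the other issues on this page suggest), is to cast the scores to native Python floats before returning them from the handler defined above.

# Workaround sketch: replace the /rank handler's return so numpy.float32
# scores become plain Python floats before FastAPI serializes the response.
@app.post("/rank")
async def rank(request: RankRequest):
    passages_dict = [passage.model_dump() for passage in request.passages]
    rerankrequest = RerankRequest(query=request.query, passages=passages_dict)
    result = ranker.rerank(rerankrequest)
    # Assumes each item in `result` is a dict with a "score" key.
    return [{**r, "score": float(r["score"])} for r in result]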
