prithivirajdamodaran / flashrank

Ultra-lite & Super-fast re-ranking for your search & retrieval pipelines. Based on SoTA models like cross-encoders and more. Created by Prithivi Da, open for PRs & Collaborations.

License: Apache License 2.0

Python 100.00%
cross-encoder full-text-search hybrid-search lexical-search rag ranking reranking retrieval-augmented-generation semantic-search vector-database
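
For readers new to the library, a minimal usage sketch follows. It is assembled from the Ranker / RerankRequest calls that appear in the issues further down this page; the default model and the exact keys in the returned results may vary between versions.

# Quick-start sketch, based on the API usage shown in the issues below.
from flashrank import Ranker, RerankRequest

ranker = Ranker(cache_dir="/tmp")  # downloads a small default model into the cache

request = RerankRequest(
    query="Tricks to accelerate LLM inference",
    passages=[
        {"id": 1, "text": "vLLM is a fast and easy-to-use library for LLM inference and serving.", "meta": {}},
        {"id": 2, "text": "Introduce lookahead decoding: a parallel decoding algo to accelerate LLM inference.", "meta": {}},
    ],
)

for result in ranker.rerank(request):
    print(result)  # each entry carries the passage fields plus a relevance score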

flashrank's Introduction

👋 I am Prithivida!

25 Million+ Model downloads in 🤗 | Cited in NeurIPS, ICLR, ACL | 3K+ ⭐️ GitHub.

flashrank's People

Contributors

jnash10, prithivirajdamodaran

flashrank's Issues

Support for Custom Models like ce-esci-MiniLM-L12-v2 in FlashRank

I am currently integrating a reranking solution into my Haystack RAG pipeline. I have been using the MetaRank reranker, but I am exploring FlashRank as an alternative because of its operational efficiency and because MetaRank requires a dedicated container.

In my evaluations on a specific dataset (shared below), I observed that the MetaRank models, particularly ce-esci-MiniLM-L12-v2, perform better on this data.

I would like to know whether FlashRank supports integrating custom models such as ce-esci-MiniLM-L12-v2. The ability to use such models could greatly influence how effective FlashRank is in specific use cases, especially where a particular model has shown superior performance.

Looking forward to your guidance on this.

Thank you!

Used dataset and results:

from haystack import Document
documents = [
    Document(
        "A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena ."),
    Document("A wild animal races across an uncut field with a minimal amount of trees ."),
    Document(
        "People line the stands which advertise Freemont 's orthopedics , a cowboy rides a light brown bucking bronco ."),
    Document("A man who is riding a wild horse in the rodeo is very near to falling off ."),
    Document("A rodeo cowboy , wearing a cowboy hat , is being thrown off of a wild white horse ."),
]
...
query = "wild west"
flashranker = FlashRankReranker(cache_dir="/tmp", model_name="rank-T5-flan")
...

MetaRankReranker:
score: 0.008271879516541958, content: A wild animal races across an uncut field with a minimal amount of trees .
score: 0.0013235947117209435, content: A rodeo cowboy , wearing a cowboy hat , is being thrown off of a wild white horse .
score: 0.0004636832163669169, content: A man who is riding a wild horse in the rodeo is very near to falling off .
score: 0.00036893945070914924, content: A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena .
score: 2.5501691197860055e-05, content: People line the stands which advertise Freemont 's orthopedics , a cowboy rides a light brown bucking bronco .
Execution Time: 0.29767465591430664 seconds

FlashRankReranker:
score: 0.5397251844406128, content: People line the stands which advertise Freemont 's orthopedics , a cowboy rides a light brown bucking bronco .
score: 0.5245905518531799, content: A man who is riding a wild horse in the rodeo is very near to falling off .
score: 0.51319420337677, content: A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena .
score: 0.47907745838165283, content: A wild animal races across an uncut field with a minimal amount of trees .
score: 0.4261687099933624, content: A rodeo cowboy , wearing a cowboy hat , is being thrown off of a wild white horse .
Execution Time: 0.0816802978515625 seconds
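
For context, ce-esci-MiniLM-L12-v2 is a regular Hugging Face cross-encoder, so it can already be scored outside FlashRank with sentence-transformers. The sketch below is not FlashRank's API; it only illustrates the kind of scoring loop a custom-model integration would need to wrap, and the model id used is an assumption.

# Illustration only, not FlashRank's API: scoring query/passage pairs with a
# custom cross-encoder via sentence-transformers. The model id is an assumption.
from sentence_transformers import CrossEncoder

model = CrossEncoder("metarank/ce-esci-MiniLM-L12-v2")

query = "wild west"
passages = [
    "A wild animal races across an uncut field with a minimal amount of trees .",
    "A rodeo cowboy , wearing a cowboy hat , is being thrown off of a wild white horse .",
]

scores = model.predict([(query, p) for p in passages])
for score, passage in sorted(zip(scores, passages), reverse=True):
    print(f"{score:.4f}\t{passage}")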

Multi-lingual model results are not as expected

Hi! Thank you for open-sourcing a sleek and wonderful package. We ran a couple of tests and noticed that nano and small give good (expected) results, while medium (multi-lingual) does not.

Please find below the nano/small and medium results for the example from the README:
query = "Tricks to accelerate LLM inference"
passages = [
"Introduce lookahead decoding: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.",
"LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper",
"There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call model.to_bettertransformer() on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run. ",
"Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup. ",
"vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels"
]
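
For completeness, the two configurations were invoked roughly as sketched below, continuing from the query and passages defined above. This assumes the Ranker / RerankRequest API shown in the FastAPI issue further down; the nano model name is an assumption, and the result keys may differ between versions.

# Reproduction sketch; only ms-marco-MultiBERT-L-12 is confirmed elsewhere on
# this page. The nano model name is an assumption.
from flashrank import Ranker, RerankRequest

passage_dicts = [{"id": i, "text": p, "meta": {}} for i, p in enumerate(passages)]

for model_name in ("ms-marco-TinyBERT-L-2-v2",   # nano (name assumed)
                   "ms-marco-MultiBERT-L-12"):    # multilingual "medium"
    ranker = Ranker(model_name=model_name, cache_dir="/tmp")
    results = ranker.rerank(RerankRequest(query=query, passages=passage_dicts))
    print(model_name)
    for r in results:
        print(r)  # passage fields plus the relevance score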

Results:
Nano/Small:
[{'score': 0.9957617,
'passage': 'Introduce lookahead decoding: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.'},
{'score': 0.9336851,
'passage': "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call model.to_bettertransformer() on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run. "},
{'score': 0.50486594,
'passage': 'vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels'},
{'score': 0.3989764,
'passage': 'LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper'},
{'score': 0.05916641,
'passage': 'Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup. '}]

Medium:
[{'score': 0.9666068,
'passage': "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call model.to_bettertransformer() on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run. "},
{'score': 0.9641034,
'passage': 'vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels'},
{'score': 0.9625791,
'passage': 'Introduce lookahead decoding: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.'},
{'score': 0.95415944,
'passage': 'Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup. '},
{'score': 0.9465828,
'passage': 'LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper'}]

As you can see, the medium model's scores are all high and show very little variation, so the ranking is effectively uninformative.

Please let me know if I have missed anything in how I am using it.
Thank you!

Wrong scoring when the query and a passage are identical

Hi,
I'm trying to test your reranker with Vietnamese data and got a weird result like this:

Query:  [Đầu tư căn hộ TP. Thủ Đức - Kỳ 2] Tiềm năng lợi nhuận đầu tư căn hộ TP. Thủ Đức như thế nào??
0.94775605 	 [Chuyên đề căn hộ Quận Thủ Đức - Kỳ 2] Tiện ích - hạ tầng đang hiện hữu tại thị trường căn hộ chung cư quận Thủ Đức
0.9460485 	 Chuyên Đề: Tiềm Năng Đầu Tư Căn Hộ TP Thủ Đức Quý II/2023
0.9404418 	 Có nên đầu tư căn hộ Empire City Thủ Thiêm?
0.93990564 	 [Đầu tư căn hộ TP. Thủ Đức - Kỳ 1]: Mức độ đô thị hóa và cơ sở hạ tầng TP. Thủ Đức đang như thế nào?
0.937425 	 [Đầu tư căn hộ TP. Thủ Đức - Kỳ 3]: Nhận diện những rủi ro tiềm ẩn khi đầu tư căn hộ
0.92928797 	 Infographic: Phân tích tiềm năng đầu tư dự án nhà phố, biệt thự TP Thủ Đức năm 2021
0.924897 	 [Chuyên đề căn hộ Quận Thủ Đức - Kỳ 3] Tiềm năng thị trường căn hộ chung cư quận Thủ Đức
0.9248124 	 Các dự án nhà phố, biệt thự Quận 2 (TP Thủ Đức) hiện có mức giá ra sao?
0.92082846 	 [Chuyên đề căn hộ Quận Thủ Đức - Kỳ 6] Các chủ đầu tư đang triển khai dự án căn hộ chung cư quận Thủ Đức
0.9133441 	 4 lý do bạn nên đầu tư căn hộ tại Thành phố Thủ Đức
0.84951854 	 Thông tin chi tiết về TP. Thủ Đức và các dự án chung cư nổi bật
0.84106666 	 Phân tích chi tiết tiềm năng đầu tư bất động sản TP Thủ Đức năm 2022
0.8402047 	 Cập nhật giá bán các dự án biệt thự, nhà phố tại TP Thủ Đức (Mới nhất)
0.7789081 	 Mua căn hộ Thủ Thiêm: Những thông tin bạn cần phải biết!
0.1311889 	 [Đầu tư căn hộ TP. Thủ Đức - Kỳ 2] Tiềm năng lợi nhuận đầu tư căn hộ TP. Thủ Đức như thế nào??

The last passage is exactly the same as the query, yet it received the lowest score. This seems very strange to me.

0.1311889 	 [Đầu tư căn hộ TP. Thủ Đức - Kỳ 2] Tiềm năng lợi nhuận đầu tư căn hộ TP. Thủ Đức như thế nào??

Here is the code snippet that produces the result above:


import unittest


class RankerTestCase(unittest.TestCase):

    def test_flash_rank(self):
        passages = [
            "[Đầu tư căn hộ TP. Thủ Đức - Kỳ 2] Tiềm năng lợi nhuận đầu tư căn hộ TP. Thủ Đức như thế nào??",
            "[Đầu tư căn hộ TP. Thủ Đức - Kỳ 1]: Mức độ đô thị hóa và cơ sở hạ tầng TP. Thủ Đức đang như thế nào?",
            "[Đầu tư căn hộ TP. Thủ Đức - Kỳ 3]: Nhận diện những rủi ro tiềm ẩn khi đầu tư căn hộ",
            "Chuyên Đề: Tiềm Năng Đầu Tư Căn Hộ TP Thủ Đức Quý II/2023",
            "Infographic: Phân tích tiềm năng đầu tư dự án nhà phố, biệt thự TP Thủ Đức năm 2021",
            "Các dự án nhà phố, biệt thự Quận 2 (TP Thủ Đức) hiện có mức giá ra sao?",
            "[Chuyên đề căn hộ Quận Thủ Đức - Kỳ 3] Tiềm năng thị trường căn hộ chung cư quận Thủ Đức",
            "[Chuyên đề căn hộ Quận Thủ Đức - Kỳ 6] Các chủ đầu tư đang triển khai dự án căn hộ chung cư quận Thủ Đức",
            "[Chuyên đề căn hộ Quận Thủ Đức - Kỳ 2] Tiện ích - hạ tầng đang hiện hữu tại thị trường căn hộ chung cư quận Thủ Đức",
            "Có nên đầu tư căn hộ Empire City Thủ Thiêm?",
            "Phân tích chi tiết tiềm năng đầu tư bất động sản TP Thủ Đức năm 2022",
            "Cập nhật giá bán các dự án biệt thự, nhà phố tại TP Thủ Đức (Mới nhất)",
            "Thông tin chi tiết về TP. Thủ Đức và các dự án chung cư nổi bật",
            "Mua căn hộ Thủ Thiêm: Những thông tin bạn cần phải biết!",
            "4 lý do bạn nên đầu tư căn hộ tại Thành phố Thủ Đức"
        ]

        from flashrank.Ranker import Ranker
        ranker = Ranker(model_name="ms-marco-MultiBERT-L-12", cache_dir="./.pytest_cache")

        query = "[Đầu tư căn hộ TP. Thủ Đức - Kỳ 2] Tiềm năng lợi nhuận đầu tư căn hộ TP. Thủ Đức như thế nào??"
        results = ranker.rerank(query, passages)
        print('\nQuery: ', query)
        for r in results:
            print(r['score'], '\t', r['passage'])

Option to Use GPU, CUDA

I really appreciate this repository. It would be great if the rerank model could optionally run on a GPU to take full advantage of the available hardware, potentially even with multi-GPU support.

Thank you.
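
FlashRank runs its models with ONNX Runtime under the hood, so GPU support would essentially come down to creating the inference session with a CUDA execution provider. A rough sketch of what that could look like is below; this is not the current FlashRank API, the model path is a placeholder, and it requires the onnxruntime-gpu package.

# Sketch of GPU-backed ONNX Runtime inference (not FlashRank's current API).
# Requires onnxruntime-gpu; the model path is a placeholder.
import onnxruntime as ort

providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]  # CPU fallback
session = ort.InferenceSession("path/to/reranker.onnx", providers=providers)

print(session.get_providers())  # shows whether CUDA was actually selected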

Failing inside a FastAPI call

from fastapi import FastAPI
from flashrank import Ranker, RerankRequest
from pydantic import BaseModel
from typing import List

app = FastAPI()

ranker = Ranker()


class Passage(BaseModel):
    id: str
    text: str
    meta: dict


class RankRequest(BaseModel):
    query: str
    passages: List[Passage]

@app.post("/rank")
async def rank(request: RankRequest):
    print(request.query, request.passages)
    passages_dict = [passage.model_dump() for passage in request.passages]
    rerankrequest = RerankRequest(query=request.query, passages=passages_dict)
    result = ranker.rerank(rerankrequest)
    return result

The app is started with

uvicorn main:app --reload

and the /rank call then fails with:

Traceback (most recent call last):
  File "/Users/abcd/.local/share/virtualenvs/flashrank-api-aZfxAaAm/lib/python3.11/site-packages/fastapi/encoders.py", line 322, in jsonable_encoder
    data = dict(obj)
           ^^^^^^^^^
TypeError: 'numpy.float32' object is not iterable
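
jsonable_encoder is failing on the numpy.float32 scores in the rerank result. One workaround, sketched below under the assumption that each result item is a dict whose "score" is a NumPy scalar (as the other issues on this page suggest), is to cast the scores to native Python floats before returning them from the handler defined above.

# Workaround sketch: replace the /rank handler's return so numpy.float32
# scores become plain Python floats before FastAPI serializes the response.
@app.post("/rank")
async def rank(request: RankRequest):
    passages_dict = [passage.model_dump() for passage in request.passages]
    rerankrequest = RerankRequest(query=request.query, passages=passages_dict)
    result = ranker.rerank(rerankrequest)
    # Assumes each item in `result` is a dict with a "score" key.
    return [{**r, "score": float(r["score"])} for r in result]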
