Comments (5)

MingLiiii commented on July 28, 2024

Firstly, thank you for your interest in our work. Calculating IFD scores requires inference with LLMs, so it is naturally time-consuming. However, we also proposed Superfiltering (ACL'24), which uses small language models like GPT-2 rather than LLMs to select the data; it tremendously lowers the time and cost of the data selection process. If efficiency is important to you, please try it.
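
For reference, the IFD score compares the loss of the answer conditioned on the question against the loss of the answer alone. A minimal sketch of that computation with Hugging Face transformers might look like the following (the model name and the plain-text prompt handling are illustrative assumptions, not the repo's exact scripts):

```python
# Minimal sketch of the IFD idea: ratio of the answer loss conditioned on the
# question to the answer loss without the question.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen1.5-7B-Chat"  # any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

@torch.no_grad()
def answer_loss(prompt: str, answer: str) -> float:
    """Average cross-entropy over the answer tokens, optionally given a prompt."""
    answer_ids = tok(answer, return_tensors="pt", add_special_tokens=False).input_ids
    if prompt:
        prompt_ids = tok(prompt, return_tensors="pt", add_special_tokens=False).input_ids
        input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
        n_prompt = prompt_ids.shape[1]
    else:
        input_ids, n_prompt = answer_ids, 0
    input_ids = input_ids.to(model.device)
    labels = input_ids.clone()
    labels[:, :n_prompt] = -100  # ignore prompt tokens; score only the answer
    return model(input_ids, labels=labels).loss.item()

def ifd_score(question: str, answer: str) -> float:
    # Higher ratio = the question provides less help in predicting the answer.
    return answer_loss(question, answer) / answer_loss("", answer)
```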

Secondly, you did not provide enough information about your setup:

  1. Since this method was originally designed for single-turn data, how did you apply it to multi-turn samples? Do you calculate the IFD score once per turn? When calculating, do you include the whole previous conversation or only the question at the current turn?
  2. How large are your 50k multi-turn samples? 50k is not a small number; even on the 50k relatively simple Alpaca examples, scoring takes several hours. If the questions and answers in your samples are long and each sample contains many turns, it will certainly take much longer. You may want to first estimate the total token count and number of inference passes (a rough estimation sketch follows this list).
  3. What base LLM did you use?
  4. What GPU did you use?
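
For example, a rough estimate of the token and inference counts could look like this (the file name and the `conversations`/`question`/`answer` JSON layout are hypothetical; adapt them to your own data format):

```python
# Rough workload estimate for per-turn IFD scoring on a multi-turn dataset.
import json
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B-Chat")

total_tokens, total_passes = 0, 0
with open("multi_turn_data.jsonl") as f:           # hypothetical file name
    for line in f:
        sample = json.loads(line)
        for turn in sample["conversations"]:        # hypothetical layout
            q_len = len(tok(turn["question"], add_special_tokens=False).input_ids)
            a_len = len(tok(turn["answer"], add_special_tokens=False).input_ids)
            # Two forward passes per turn: answer-given-question and answer-only.
            total_tokens += (q_len + a_len) + a_len
            total_passes += 2

print(f"~{total_tokens:,} tokens over {total_passes:,} forward passes")
```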

Again, thank you for your interest! We highly recommend trying our Superfiltering (ACL'24) if efficiency is important to you!

lihongxiacream commented on July 28, 2024

Thank you for your answer!
The dataset is indeed very large, about 458 MB. I use only the question and answer at each turn rather than the conversation history, with the Qwen1.5-7B-Chat model on an A800 GPU, and I calculate the loss once per turn during data analysis. Do you have any suggestions for accelerating inference?
Thank you again, and I will also try the Superfiltering method.
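
One generic way to speed up such per-turn loss computation, independent of the cherry_llm scripts, is to score several (question, answer) pairs per forward pass in bf16 under `torch.no_grad()`. A sketch, reusing the `model` and `tok` objects loaded earlier:

```python
# Batched variant of the per-turn answer loss: pad several (question, answer)
# pairs into one batch and compute each pair's average answer-token loss.
# This is a general batching sketch, not the repo's own pipeline.
import torch
import torch.nn.functional as F

@torch.no_grad()
def batched_answer_loss(model, tok, questions, answers):
    pad_id = tok.pad_token_id if tok.pad_token_id is not None else tok.eos_token_id
    seqs, starts = [], []
    for q, a in zip(questions, answers):
        q_ids = tok(q, add_special_tokens=False).input_ids
        a_ids = tok(a, add_special_tokens=False).input_ids
        starts.append(len(q_ids))
        seqs.append(q_ids + a_ids)
    max_len = max(len(s) for s in seqs)
    input_ids = torch.full((len(seqs), max_len), pad_id, dtype=torch.long)
    attn = torch.zeros_like(input_ids)
    labels = torch.full_like(input_ids, -100)
    for i, (s, start) in enumerate(zip(seqs, starts)):
        input_ids[i, : len(s)] = torch.tensor(s)
        attn[i, : len(s)] = 1
        labels[i, start : len(s)] = torch.tensor(s[start:])  # answer tokens only
    input_ids, attn, labels = (t.to(model.device) for t in (input_ids, attn, labels))
    logits = model(input_ids, attention_mask=attn).logits
    # Next-token shift, then per-sample mean loss over the answer tokens.
    shift_logits, shift_labels = logits[:, :-1, :].float(), labels[:, 1:]
    per_token = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
        reduction="none",
    ).view(shift_labels.shape)
    n_answer = (shift_labels != -100).sum(dim=1).clamp(min=1)
    return (per_token.sum(dim=1) / n_answer).tolist()
```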

lihongxiacream commented on July 28, 2024

Also, does this project support data selection on Chinese datasets?

MingLiiii commented on July 28, 2024

Thank you for your interest!

Based on your data, I think it is quite reasonable that it takes many hours. Although it contains only 50k samples, the dataset is almost 20 times the size of the Alpaca data. Unfortunately, I am no expert on accelerating inference, sorry about that.

As for whether this method supports Chinese datasets, I think the answer is yes. Our method is language-agnostic: it computes and compares the losses/perplexities produced by base models. So as long as the base model itself supports the language, our method should be useful.
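
For instance, with the hypothetical `ifd_score` sketch above and a base model trained on Chinese, a Chinese question/answer pair is scored in exactly the same way:

```python
# The score only involves tokenized losses, so the language does not matter
# as long as the base model covers it.
score = ifd_score(
    "请用一句话解释什么是指令微调。",
    "指令微调是用带有指令和回答的数据进一步训练语言模型，使其更好地遵循人类指令。",
)
print(score)
```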

MingLiiii commented on July 28, 2024

If you are interested in our method or have further questions, we can also connect on WeChat for easier communication.
Please send me an email if you are interested!

Thank you!
