
Comments (6)

ryusaeba commented on August 21, 2024

You edited the message several times while I was writing this answer (not complaining)

Sorry about that. I should have sent the message only after confirming all my questions were valid. I'm happy with and appreciate your detailed response.

To make this worth releasing, we'll need to upgrade the code so it actually runs outside of our specific case.

Sounds great. I will look forward to your valuable release.

About 10T,

After checking, you are actually using 10B instead of 10T. I didn't notice earlier that the dataset you are using has the Sample suffix.

RedPajama on Phi 3 instruct

When applying a pretraining dataset to an instruct model, did you see obvious quality degradation in chat?

llama-3 70B instruct that is not included in the PV paper. I don't know the exact dataset that he used, but IIRC he was experimenting with Cosmopedia and SlimOrca. I don't know the exact answer, but it is likely one of them.

It will be nice to see @Godofnothing's response.

Thank you again :)


justheuristic commented on August 21, 2024

Hi!

@disclaimer1 I am not an author, but regularly talk with the first two authors.

@disclaimer2 You edited the message several times while I was writing this answer (not complaining, thanks for the clarification). Below, I am quoting you on some of the words you said in the earlier versions of the message, before you edited it.

GPTQ-version of PV tuning

To the best of my knowledge, the authors do plan to release it, but it is not the top priority r/n. The GPTQ experiments used a very hacky version of the code that manually partitioned Hessians between devices and only works for llama-2 7B with that specific 2-bit configuration (on 8x A100, and very inefficiently). To make this worth releasing, we'll need to upgrade the code so it actually runs outside of our specific case.

In the previous version of the message (IIRC), you asked about adapting the current VQ code for GPTQ. There is one caveat to this: during the V step, the current code runs a beam search that considers all codes. This is very wasteful for GPTQ: instead, you obtain the new codes by rounding the update to the nearest integer, after taking scale and zero-point into account. Aside from that, the rest of the code should work.
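To illustrate the rounding step, here is a minimal sketch of what such a V step could look like for GPTQ-style uniform quantization. This is not the repository's actual code; the function names, shapes, and broadcasting are assumptions.

```python
import torch

# Minimal sketch (not the authors' implementation): for GPTQ-style uniform
# quantization, the V step reduces to rounding the updated weights to the
# nearest integer code, taking scale and zero-point into account.
def gptq_v_step(weight: torch.Tensor, scale: torch.Tensor,
                zero_point: torch.Tensor, bits: int = 2) -> torch.Tensor:
    qmax = 2 ** bits - 1
    # map continuous weights onto the integer grid, round, and clamp to range
    # (codes stay float here; cast to an integer dtype if needed)
    return torch.round(weight / scale + zero_point).clamp_(0, qmax)

def dequantize(codes: torch.Tensor, scale: torch.Tensor,
               zero_point: torch.Tensor) -> torch.Tensor:
    # map integer codes back to the continuous weight space
    return (codes - zero_point) * scale
```

With per-group quantization, scale and zero_point would simply broadcast over the group dimension; no beam search is needed.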

maximum tokens is 10T?

In theory, the training script would indeed stop after processing roughly 10T (EDIT: actually 10B, see below) tokens. However, all llama models either converged before completing 2 full epochs or showed very small improvements after 2 epochs (within 0.03 PPL on wiki2), and mistral / phi models require 2-4 epochs to converge.

If you want to get the best quality (as opposed to reproducing our exact setup), I'd recommend that you explore training on more data rather than repeating the same sample multiple times. In the paper, authors had to use the sample to make a fair comparison with other works that used that sample.

Instruct-version model

To the best of my knowledge, the only instruct version released in the PV paper is Phi 3 instruct. This model was fine-tuned on a sample of RedPajama data for fair comparison.

However, @Godofnothing also recently released a llama-3 70B instruct that is not included in the PV paper. I don't know the exact dataset that he used, but IIRC he was experimenting with Cosmopedia and SlimOrca. I don't know the exact answer, but it is likely one of them.

@Godofnothing if you're available, please educate us :)


justheuristic commented on August 21, 2024

When applying pretrain dataset on instruct model, did you see obvious quality degradation on chat?

To see degradation, one needs a comparison against something. To the best of my knowledge, the authors currently do not have such an alternative for Phi 3 models (e.g. they did not train on Cosmopedia yet). For Llama 3, @Godofnothing knows best.

10B instead of 10T

Yes, good catch :)
The entire RedPajama dataset is just 1T tokens; the PV-Tuning paper uses a 1B sample for calibration.
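Putting the two numbers together (a rough, illustrative calculation, not the paper's exact schedule): a ~10B-token cap over a ~1B-token calibration sample allows at most about 10 passes, while, per the earlier comment, the models converge within a few epochs.

```python
# Back-of-the-envelope sketch using the numbers mentioned in this thread.
token_budget = 10_000_000_000   # the ~10B-token cap discussed above
sample_tokens = 1_000_000_000   # a ~1B-token calibration sample
max_epochs = token_budget / sample_tokens
print(f"the budget allows at most ~{max_epochs:.0f} passes over the sample")
```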


ryusaeba commented on August 21, 2024

However, all llama models either converged before completing 2 full epochs or showed very small improvements after 2 epochs (within 0.03 PPL on wiki2), and mistral / phi models require 2-4 epochs to converge.

Achieving these results within 2 full epochs is pretty promising.


github-actions commented on August 21, 2024

This issue is stale because it has been open for 30 days with no activity.


github-actions commented on August 21, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

