Comments (2)
Hi @codertimo , usually W8A8 (SmoothQuant) is better for compute-bound scenarios (e.g., large batch sizes targeting high throughput), while W4A16 (AWQ) is better for memory-bound scenarios (smaller batch sizes, lower latency). Let me know if you have more questions.
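The compute-bound vs memory-bound distinction above can be sketched with a back-of-envelope roofline estimate for a single linear layer. The hardware numbers below are hypothetical round figures (not real GPU specs), and `bound_regime` is an illustrative helper, not part of either library:

```python
# Roofline-style estimate for a (batch_tokens x d_in) @ (d_in x d_out) GEMM.
# At small batch, streaming the weights dominates (memory-bound -> W4A16 helps);
# at large batch, the arithmetic dominates (compute-bound -> W8A8 helps).
# Hardware numbers are hypothetical round figures, not real GPU specs.

def bound_regime(batch_tokens, d_in=4096, d_out=4096,
                 flops_per_s=100e12, mem_bw=1e12, bytes_per_weight=2):
    """Return 'compute' or 'memory' depending on which dominates the layer time."""
    flops = 2 * batch_tokens * d_in * d_out          # one multiply-add = 2 FLOPs
    weight_bytes = d_in * d_out * bytes_per_weight   # weight traffic dominates at small batch
    compute_time = flops / flops_per_s
    memory_time = weight_bytes / mem_bw
    return "compute" if compute_time > memory_time else "memory"

print(bound_regime(batch_tokens=8))     # small batch  -> "memory"
print(bound_regime(batch_tokens=2048))  # large batch  -> "compute"
```

Lowering `bytes_per_weight` (e.g., 4-bit weights ≈ 0.5 bytes) shrinks `memory_time` directly, which is why W4A16 pays off exactly in the memory-bound regime.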
from llm-awq.
Question
We are very interested in two post-training quantization papers from the HAN Lab!
SmoothQuant uses W8A8 for efficient GPU computation, while AWQ uses W4/3A16 for lower memory requirements and higher effective memory throughput.
But which one is faster in actual production? If you have any data on this, could you share it with us?
W4A16 is the fastest. I believe this is discussed in the paper, something along the lines of "weights make up the majority of the latency." Most layers in transformers are linear layers, so naturally you see a large benefit from quantizing their weights.
I don't have benchmarks comparing it against SmoothQuant, as the authors seem to prefer AWQ for its usability and speed with TinyChat.
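A rough weight-footprint comparison shows why quantizing the linear layers matters so much for decode speed. This is a minimal sketch assuming a 7B-parameter model with essentially all parameters in weight matrices; real AWQ checkpoints also store per-group scales and zeros, so actual savings are slightly smaller:

```python
# Weight footprint of a 7B-parameter model at FP16 vs W4A16 (4-bit weights).
# Illustrative only: ignores per-group scales/zeros that AWQ stores alongside
# the packed weights.

def weight_gib(n_params, bits_per_weight):
    """Bytes of weight storage, expressed in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

n = 7e9
fp16 = weight_gib(n, 16)
w4 = weight_gib(n, 4)
print(f"FP16: {fp16:.1f} GiB, W4: {w4:.1f} GiB, ratio: {fp16 / w4:.0f}x")
```

Since single-batch decoding is largely bound by streaming the weights from memory, roughly 4x fewer weight bytes translates into a proportionally faster token-generation loop.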
from llm-awq.
Related Issues (20)
- reproduce Llama2 7b failure : RuntimeError: The expanded size of the tensor (4608) must match the existing size (4096) at non-singleton dimension 3. Target sizes: [65, 32, 512, 4608]. Tensor sizes: [65, 1, 512, 4096] HOT 3
- RuntimeError: Unknown Layout in CUDA Kernel Execution
- Use awq to quantize Deepseek-coder-33B-instruct model
- run_awq.<locals>.Catcher.forward() error
- KeyError: 'llava_llama' HOT 1
- Error while generating real quantized weights for VILA
- Weight int4 quantization, but actually it is int16 HOT 4
- Possible Bug in "_search_module_scale" Function
- AWQ for non-transformer layers?
- Out of memory in Jetson Orin NX 8GB
- Inquiry about Minimum GPU Requirements HOT 1
- when q-group-size = -1,the code will not run
- Weight Packing Format
- illegal memory access when input tokens < 8
- Grok-1 AWQ
- can awq support 3-bit,2-bit, 8-bit quantization? HOT 1
- awq_inference_engine is missing from source, so quantizing custom models fails HOT 2
- Support for Qwen models HOT 2
- AWQ for non-Transfomer Implementation HOT 3
- Error while Quantizing OWLv2 model