It seems TinyChat is currently very CPU-bound for all models other than LLaMa. On an A6000, 3090, and 4090 paired with an AMD EPYC 7-series CPU, performance is largely the same across all three GPUs because of the CPU's low single-threaded performance. However, upgrading the CPU to an i9-13900K (roughly double the single-threaded performance of the EPYC) also yields a 100% performance boost.
@Sakits Any plans for adding further speedups for TinyChat to make it less CPU-bound? I see the fused/optimized layers for LLaMa-2 helped with utilizing the GPU more.
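A rough way to check whether the decode loop is CPU- or GPU-bound (a sketch, not TinyChat code; the helper and the stand-in workload are my own): compare the wall-clock time of the launch loop alone against the time including a `torch.cuda.synchronize()`. If the two are close on a GPU, the kernels finish before the CPU can issue the next one, i.e. the CPU is the bottleneck.

```python
import time
import torch

def avg_step_ms(step, iters=50, sync=False):
    """Average wall-clock time per call of `step`, in milliseconds."""
    if sync and torch.cuda.is_available():
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        step()
    if sync and torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for all queued GPU work to finish
    return (time.perf_counter() - t0) / iters * 1000.0

# Stand-in workload; substitute one decode step of the real model here.
x = torch.randn(256, 256)
launch_ms = avg_step_ms(lambda: x @ x, sync=False)  # CPU-side cost only (on GPU)
total_ms = avg_step_ms(lambda: x @ x, sync=True)    # CPU + GPU execution
# On a GPU, launch_ms close to total_ms indicates a CPU-bound decode loop.
```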
Rough expectations for speedup:
- MLP: 0.5-1.0ms
- LayerNorm: ~3ms
- Attention: ~7ms
If all parts are optimized, we should see below 10 ms inference per token, even on slower CPUs, and could get close to 5-6 ms on better GPUs if TinyChat were optimized further.
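As a back-of-the-envelope check on those numbers (pure arithmetic over the rough estimates above, not measurements): a 0.5 ms gain corresponding to 2.7% implies a baseline around 18.5 ms/token, and saving the full ~11 ms across MLP, LayerNorm, and attention would land around 7.5 ms/token:

```python
# All inputs are the rough estimates quoted above, not measurements.
baseline_ms = 0.5 / 0.027                # implied current latency: ~18.5 ms/token
savings_ms = 1.0 + 3.0 + 7.0             # MLP + LayerNorm + attention (upper bounds)
optimized_ms = baseline_ms - savings_ms  # ~7.5 ms/token, i.e. "below 10 ms"
tokens_per_sec = 1000.0 / optimized_ms   # ~133 tok/s
print(f"{optimized_ms:.1f} ms/token -> {tokens_per_sec:.0f} tok/s")
```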
I got about a 0.5-1.0 ms (2.7%-5.5%) speedup by replacing the linear layers of MPT. You can see my fork/branch here.
```python
class QuantMPTMLP(nn.Module):
    def __init__(self, up_proj, act, down_proj):
        super().__init__()
        # Expose the quantized tensors as buffers so forward() can call the
        # fused CUDA kernel directly, skipping Python-side module overhead.
        self.register_buffer('up_proj_qweight', up_proj.qweight)
        self.register_buffer('up_proj_scales', up_proj.scales)
        self.register_buffer('up_proj_qzeros', up_proj.qzeros)
        self.act = act
        self.down_proj = down_proj

    def forward(self, x: torch.Tensor):
        orig_shape = x.shape[:-1]
        x = x.reshape(-1, x.shape[-1])
        x = awq_inference_engine.gemm_forward_cuda(
            x, self.up_proj_qweight, self.up_proj_scales, self.up_proj_qzeros, 8)
        x = self.down_proj(self.act(x))
        # Restore the original batch/sequence dimensions.
        return x.reshape(*orig_shape, x.shape[-1])
```
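To wire a module like `QuantMPTMLP` into an already-loaded model, the usual pattern is to walk the module tree and swap matching children in place. A minimal, generic sketch (the `replace_module` helper and the toy model are my own, not part of the fork):

```python
import torch
from torch import nn

def replace_module(model: nn.Module, target: type, factory) -> None:
    """Recursively replace every child of type `target` with `factory(old)`."""
    for name, child in model.named_children():
        if isinstance(child, target):
            setattr(model, name, factory(child))
        else:
            replace_module(child, target, factory)

# Toy demonstration: swap every Linear for a bias-free copy. For MPT you
# would match the model's MLP class and build a QuantMPTMLP from it instead.
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
replace_module(
    model, nn.Linear,
    lambda old: nn.Linear(old.in_features, old.out_features, bias=False),
)
```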
from llm-awq.
Hi @casperbh96,
Thank you for your suggestions and contributions!
The current version of AWQ library mainly focuses on usability, hence it hasn't been fully optimized for speed. However, we're planning a reimplementation based on a more efficient baseline (e.g. TGI). Please stay tuned for future updates! :)
> TGI
That sounds great! :)
Only thing to keep in mind is that TGI recently switched licenses, so be careful if you plan to use their code.
Edit: Looks like you can still use TGI commercially for 90%+ of use cases, so building on TGI might still be a good idea.
huggingface/text-generation-inference#744