
Comments (3)

casper-hansen commented on July 2, 2024

It seems TinyChat is currently very CPU-bound for all models other than LLaMa. On an A6000, 3090, or 4090 paired with an AMD EPYC 7-series CPU, performance is largely the same across all three GPUs due to the CPU's low single-threaded performance. However, upgrading the CPU to an i9-13900K (roughly double the single-threaded performance of the EPYC) also gives a roughly 100% boost in inference speed.

@Sakits Any plans to add further speedups to TinyChat to make it less CPU-bound? I see the fused/optimized layers for LLaMa-2 helped utilize the GPU more.

Rough expectations for speedup:

  • MLP: 0.5-1.0ms
  • LayerNorm: ~3ms
  • Attention: ~7ms

If all parts are optimized, we should see below 10ms per-token inference even on slower CPUs, and could get close to 5-6ms on better GPUs if TinyChat were optimized further.
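For reference, per-component numbers like the ones above can be measured with a small wall-clock harness. This is a generic sketch (the function and parameter names are mine, not TinyChat's); for CUDA kernels you would additionally call `torch.cuda.synchronize()` before reading the clock, since kernel launches are asynchronous.

```python
import time

def avg_latency_ms(fn, warmup=10, iters=100):
    """Average wall-clock latency of fn() in milliseconds."""
    # Warm-up runs exclude one-time costs (allocator warm-up, caches, JIT).
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) * 1000.0 / iters
```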

I got about a 0.5-1.0ms speedup (2.7%-5.5%) by replacing the linear layers of MPT. You can see my fork/branch here.

import torch
import torch.nn as nn

import awq_inference_engine  # fused CUDA kernels from llm-awq


class QuantMPTMLP(nn.Module):
    def __init__(self, up_proj, act, down_proj):
        super().__init__()
        # Register the quantized tensors as buffers so the fused GEMM kernel
        # can be called directly, skipping the Python-level linear forward.
        self.register_buffer('up_proj_qweight', up_proj.qweight)
        self.register_buffer('up_proj_scales', up_proj.scales)
        self.register_buffer('up_proj_qzeros', up_proj.qzeros)

        self.up_proj = up_proj
        self.act = act
        self.down_proj = down_proj

    def forward(self, x: torch.Tensor):
        orig_shape = x.shape
        # Flatten batch/sequence dims: the kernel expects a 2-D input.
        x = x.reshape(-1, orig_shape[-1])
        x = awq_inference_engine.gemm_forward_cuda(
            x, self.up_proj_qweight, self.up_proj_scales, self.up_proj_qzeros, 8)
        out = self.down_proj(self.act(x))
        # Restore the leading batch/sequence dims.
        return out.reshape(orig_shape[:-1] + (out.shape[-1],))

from llm-awq.
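To swap a fused module like the one above into an existing model, a small recursive helper is handy. This is a generic sketch (the helper name is mine); it is duck-typed so it works on anything exposing `named_children()`, such as a `torch.nn.Module` tree.

```python
def replace_modules(model, target_type, factory):
    """Recursively replace every submodule that is an instance of
    `target_type` with `factory(old_module)`.

    Works on anything exposing `named_children()`, e.g. torch.nn.Module.
    """
    for name, child in model.named_children():
        if isinstance(child, target_type):
            setattr(model, name, factory(child))
        else:
            replace_modules(child, target_type, factory)
    return model
```

For MPT, the factory would be something like `lambda mlp: QuantMPTMLP(mlp.up_proj, mlp.act, mlp.down_proj)`, with the attribute names adjusted to your checkpoint's MLP layout.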

Sakits commented on July 2, 2024

Hi @casperbh96,

Thank you for your suggestions and contributions!

The current version of the AWQ library mainly focuses on usability, so it hasn't been fully optimized for speed. However, we're planning a reimplementation based on a more efficient baseline (e.g. TGI). Please stay tuned for future updates! :)


casper-hansen commented on July 2, 2024

> TGI

That sounds great! :)

Only thing to keep in mind is that TGI has recently switched license, so be careful if you plan to use their code.

Edit: Looks like you can still use TGI commercially for 90%+ of use-cases, so it might still be a good idea to build on TGI.
huggingface/text-generation-inference#744

