Comments (7)

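tpoisonooo commented on May 29, 2024

The two options cannot be enabled at the same time; there is an assert in the code.

fmha is implemented with cutlass, and the full quantization scheme (w4a4) has not been finalized yet, so we should not write cutlass optimizations ahead of time. What we have now is not compatible.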

from lmdeploy.

tpoisonooo commented on May 29, 2024

The error “: an illegal memory access was encountered” does not seem to be related to quantization. Can you reproduce it with use_context_fmha=0 and fp16 only?

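senbeiasano commented on May 29, 2024

“The error “: an illegal memory access was encountered” does not seem to be related to quantization. Can you reproduce it with use_context_fmha=0 and fp16 only?”

With use_context_fmha=0, neither quant=0 nor quant=4 raises an error. With use_context_fmha=1, quant=4 raises the error and quant=0 does not.
My test data uses an input length of 4096 and an output length of 512.

If the incompatible combination is guarded by an assert, why did it run successfully on my side? Will the overall quantization scheme affect the current kv cache quantization?
Also, for kv cache you have so far reported memory and accuracy results. Will you report the speed overhead later?

Thanks for the reply!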

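tpoisonooo commented on May 29, 2024

It seems you may not know what these two options mean.

fmha: flash attention, a high-performance attention implementation:

  • 0: disabled, use the original attention implementation
  • 1: enabled, use the implementation from the flash attention paper

quant_policy:

  • 0x0: disabled, use fp16
  • 0x1: already taken by NVIDIA, reserved
  • 0x2: already taken by NVIDIA
  • 0x4: kvCache_int8

Work through the combinations.
“But with use_context_fmha=1, quant=0 does not.” With the fp16 optimization on and quantization off, of course it does not raise an error.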
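
To make the combinations concrete, here is a minimal Python sketch that decodes the two options and enumerates the four cases discussed in this thread. The bit values come from the list above; the names `QuantPolicy` and `describe` are hypothetical and are not part of lmdeploy.

```python
from enum import IntFlag
from itertools import product


class QuantPolicy(IntFlag):
    """Hypothetical names for the quant_policy bits listed in this thread."""
    FP16 = 0x0            # quantization off, fp16
    RESERVED_1 = 0x1      # already taken by NVIDIA, reserved
    RESERVED_2 = 0x2      # already taken by NVIDIA
    KV_CACHE_INT8 = 0x4   # kvCache_int8


def describe(use_context_fmha: int, quant_policy: int) -> str:
    """Return a readable description of one option combination."""
    attention = "flash attention" if use_context_fmha else "original attention"
    if quant_policy & QuantPolicy.KV_CACHE_INT8:
        kv_cache = "int8 kv cache"
    else:
        kv_cache = "fp16 kv cache"
    return f"{attention} + {kv_cache}"


# Enumerate the four combinations mentioned in the thread.
for fmha, quant in product((0, 1), (0, 4)):
    print(f"use_context_fmha={fmha}, quant_policy={quant}: {describe(fmha, quant)}")
```

Only the last combination (flash attention together with the int8 kv cache) is the one described as incompatible at the top of the thread.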

tpoisonooo commented on May 29, 2024

The point of LLM quantization is throughput: one machine can serve more users at the same time.

The response speed for a single user gets slower.

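senbeiasano commented on May 29, 2024

Thanks for the reply!

My earlier reply was actually meant to answer this ⬇️. I thought you were asking whether use_context_fmha=0 + quant=0 reproduces the error:
““: an illegal memory access was encountered” does not seem to be related to quantization. Can you reproduce it with use_context_fmha=0 and fp16 only?”

Also, when I said there was no error, I was referring to what I described at the start of the issue: on llama 7b, setting use_context_fmha = 1 + quant = 4 did not raise an error.

Understood on the speed question~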

tpoisonooo commented on May 29, 2024

Oh, let me add a PR that makes it raise an error when use_context_fmha = 1 + quant = 4.
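
As a rough illustration of what such a guard could look like, here is a sketch in Python. It is not the actual PR: the function name `check_attention_config` and the error message are assumptions, and the real check may well live in the C++ code instead.

```python
KV_CACHE_INT8 = 0x4  # quant_policy bit for kvCache_int8, as listed in this thread


def check_attention_config(use_context_fmha: int, quant_policy: int) -> None:
    """Hypothetical guard: reject flash attention combined with int8 kv cache."""
    if use_context_fmha and (quant_policy & KV_CACHE_INT8):
        raise ValueError(
            "use_context_fmha=1 cannot be combined with quant_policy=4 "
            "(kvCache_int8); disable one of them"
        )


# The three supported combinations pass silently.
check_attention_config(0, 0)
check_attention_config(0, 4)
check_attention_config(1, 0)

# The incompatible combination is rejected up front.
try:
    check_attention_config(1, 4)
except ValueError as err:
    print(f"rejected: {err}")
```

Failing early at configuration time would avoid the situation reported above, where the combination runs on some inputs and hits an illegal memory access on others.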
