tlc-pack / cutlass_fpa_intb_gemm



A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer

License: Apache License 2.0

CMake 1.10% C++ 86.59% Cuda 12.31%


cutlass_fpa_intb_gemm's Issues

Test example

Do you plan to provide a test example for this kernel? It would be very helpful.
I have noticed that recent state-of-the-art work such as GPTQ and AWQ uses 3-bit and 4-bit weight-only quantization, so I wonder whether the kernel in this repo can be used for 4-bit weight-only quantization.

compute result problem

This work is awesome, and I would like to ask for more details about how to use it. Have you tested the accuracy, or are there any test cases? I am not sure whether I am using the kernel correctly. Below is my wrapper for gemm_fp16_int_bias_act. I get weird results such as NaN, while the torch output is correct for the same inputs.

void fpA_intB_gemm_forward_cuda(torch::Tensor &input,
                                torch::Tensor &weight,
                                torch::Tensor &scale,
                                torch::Tensor &output,
                                int m, int n, int k)
{
    c10::cuda::CUDAGuard device_guard(input.device());
    const fastertransformer::half *input_ptr = reinterpret_cast<fastertransformer::half *>(input.data_ptr());
    const uint8_t *weight_ptr = reinterpret_cast<const uint8_t *>(weight.data_ptr());
    const fastertransformer::half *scale_ptr = reinterpret_cast<fastertransformer::half *>(scale.data_ptr());
    fastertransformer::half *output_ptr = reinterpret_cast<fastertransformer::half *>(output.data_ptr());

    fastertransformer::gemm_fp16_int_bias_act(
        input_ptr,
        weight_ptr,
        scale_ptr,
        nullptr,
        output_ptr,
        std::nullopt,
        m, n, k,
        0,
        nullptr,
        0,
        0);
}
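
One way to narrow down a NaN like this is to compare the kernel output against a plain CPU reference computed from the same inputs. The sketch below is only an illustration under stated assumptions: row-major float activations, row-major int8 weights, and one scale per output column. The names reference_weight_only_gemm, has_non_finite and max_abs_diff are hypothetical helpers, not part of this repo, and the kernel itself may expect a preprocessed weight layout that differs from the plain layout used here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference for a weight-only quantized GEMM:
//   C[i][j] = sum_p A[i][p] * (W[p][j] * scale[j])
// Assumed layout (illustration only): A is row-major m x k, W is row-major
// k x n int8, scale holds one value per output column.
std::vector<float> reference_weight_only_gemm(const std::vector<float>& a,
                                              const std::vector<int8_t>& w,
                                              const std::vector<float>& scale,
                                              int m, int n, int k) {
    std::vector<float> c(static_cast<std::size_t>(m) * n, 0.0f);
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                const std::size_t a_idx = static_cast<std::size_t>(i) * k + p;
                const std::size_t w_idx = static_cast<std::size_t>(p) * n + j;
                // Dequantize the weight on the fly, then accumulate.
                acc += a[a_idx] * (static_cast<float>(w[w_idx]) * scale[j]);
            }
            c[static_cast<std::size_t>(i) * n + j] = acc;
        }
    }
    return c;
}

// Returns true if any element is NaN or infinite.
bool has_non_finite(const std::vector<float>& v) {
    for (float x : v) {
        if (!std::isfinite(x)) return true;
    }
    return false;
}

// Largest absolute element-wise difference. Call has_non_finite first:
// a NaN difference never compares greater, so it would be skipped here.
float max_abs_diff(const std::vector<float>& x, const std::vector<float>& y) {
    assert(x.size() == y.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        worst = std::max(worst, std::fabs(x[i] - y[i]));
    }
    return worst;
}
```

Copying the GPU output back to the host, converting it to float, and running has_non_finite and max_abs_diff against this reference shows whether the mismatch comes from the inputs themselves (for example scales containing NaN) or from how the kernel is being called.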

Question about INT4 weight only GEMM

Thank you for your excellent work.

May I ask whether this project fully supports int4 weight-only quantized inference, such as AWQ's group-wise int4 quantization?

I've seen some int4-related features:

cutlass_fpA_intB_gemm/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h

Line 112 in ed951b0

if (group_size != 64 && group_size != 128)

cutlass_fpA_intB_gemm/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm.h

Line 29 in ed951b0

T in {half, __nv_bfloat} WeightType in {int8_t, cutlass::uint4b_t}

but I'm not sure how to use them specifically.
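
For readers unfamiliar with the term, group-wise int4 quantization can be sketched as follows: the weights along the reduction dimension are split into groups, each group gets its own scale, and every weight is stored as a 4-bit code. The code below is a minimal illustration, not this repo's format. The struct QuantizedColumn, the functions quantize_int4_groupwise and dequantize_int4, the symmetric scheme with a fixed offset of 8, and the low-nibble-first packing are all assumptions made for the example. AWQ additionally stores per-group zero points, which this sketch omits. Only the group-size check reflects the condition quoted above.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical container for one quantized weight column.
struct QuantizedColumn {
    std::vector<uint8_t> packed;  // two 4-bit codes per byte, low nibble first
    std::vector<float> scales;    // one scale per group
};

// Mirrors the quoted check: only group sizes 64 and 128 are accepted.
bool is_supported_group_size(int group_size) {
    return group_size == 64 || group_size == 128;
}

// Symmetric group-wise quantization to 4 bits (assumed scheme):
// code = round(w / scale) clamped to [-8, 7], stored as code + 8 in [0, 15].
QuantizedColumn quantize_int4_groupwise(const std::vector<float>& w, int group_size) {
    if (!is_supported_group_size(group_size)) {
        throw std::invalid_argument("group_size must be 64 or 128");
    }
    const std::size_t g = static_cast<std::size_t>(group_size);
    if (w.size() % g != 0) {
        throw std::invalid_argument("weight count must be a multiple of group_size");
    }
    QuantizedColumn q;
    q.packed.assign(w.size() / 2, 0);
    for (std::size_t start = 0; start < w.size(); start += g) {
        // Per-group scale maps the largest magnitude to code 7.
        float max_abs = 0.0f;
        for (std::size_t i = start; i < start + g; ++i) {
            max_abs = std::max(max_abs, std::fabs(w[i]));
        }
        const float scale = max_abs > 0.0f ? max_abs / 7.0f : 1.0f;
        q.scales.push_back(scale);
        for (std::size_t i = start; i < start + g; ++i) {
            long code = std::lround(w[i] / scale);
            code = std::max(-8L, std::min(7L, code));
            const uint8_t nibble = static_cast<uint8_t>(code + 8);
            const uint8_t shifted =
                (i % 2 == 0) ? nibble : static_cast<uint8_t>(nibble << 4);
            q.packed[i / 2] = static_cast<uint8_t>(q.packed[i / 2] | shifted);
        }
    }
    return q;
}

// Recovers the approximate float value of weight i.
float dequantize_int4(const QuantizedColumn& q, std::size_t i, int group_size) {
    const uint8_t byte = q.packed[i / 2];
    const int nibble = (i % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
    return static_cast<float>(nibble - 8) * q.scales[i / static_cast<std::size_t>(group_size)];
}
```

With a group size of 128, a column of 4096 weights would carry 32 scales and 2048 packed bytes. Whether the kernel in this repo accepts such a buffer directly, or needs the weights rearranged first, is exactly the open question in this issue.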

Windows-compatible

Have you tested the code on Windows? Could you publish a Windows-compatible version?
