
emll's Introduction


中文介绍 (Chinese introduction)

Edge ML Library - High-performance Compute Library for On-device Machine Learning Inference

Edge ML Library (EMLL) offers optimized basic routines, such as general matrix multiplication (GEMM) and quantization, to speed up machine learning (ML) inference on ARM-based devices. EMLL supports the fp32, fp16 and int8 data types. EMLL accelerates the on-device NMT, ASR and OCR engines of Youdao, Inc.

Features

Performance-Oriented Design

The matrix-multiplication routines are heavily optimized for the matrix shapes common in on-device ML tasks, including "skinny" ones. The matrix-multiplication kernels are tuned for specific CPUs with a large portion of inline assembly code.

Here are benchmarks of SGEMM on two machines[1]:

[Benchmark charts test1/test2: SGEMM performance with 4 threads on ARMv8-A Cortex-A35 and ARMv8-A Cortex-A53]

[1] The formula of GEMM: C[MxN] = A[MxK] B[KxN]. For each test case, the better performance of the all-row-major and all-column-major layouts is selected.

Facile Interface

Data and parameters are passed directly, without wrapper objects. Matrices and arrays are passed as a base address plus dimensions. GEMM parameters that are seldom used in on-device inference, such as the leading dimensions LDA-LDC, are excluded from the interface. There are no dependencies on third-party compute libraries.
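As a minimal sketch of what this looks like in practice (the parameter order follows Usage_EN.md and the list quoted in the issues below; the prototype is re-declared here for illustration rather than taken from an EMLL header, so verify it against the real one):

```c
#include <stdio.h>

/* Illustrative prototype only. Orders: nonzero = row-major.
   Computes C[MxN] = A[MxK] * B[KxN], with beta pre-multiplying C. */
int sgemm(int a_rowmajor, int b_rowmajor,
          const float *A, const float *B, float *C,
          unsigned int M, unsigned int N, unsigned int K,
          float beta, unsigned int num_threads);

int main(void) {
    float A[2 * 3] = {1, 2, 3,
                      4, 5, 6};   /* 2x3, row-major */
    float B[3 * 2] = {1, 0,
                      0, 1,
                      1, 1};      /* 3x2, row-major */
    float C[2 * 2] = {0};         /* 2x2 output */

    /* base addresses + dimensions only: no LDA/LDB/LDC, no transpose flags */
    int status = sgemm(1, 1, A, B, C, 2, 2, 3, 0.0f, 1);
    if (status != 0) {
        fprintf(stderr, "sgemm returned %d\n", status);
        return 1;
    }
    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```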

Extensibility

EMLL abstracts the core structures of CPU-based high-performance matrix-multiplication algorithms, as well as the bias/quantization functions, into general macros (see the files under include/common) that can be applied to a variety of processors. When porting to a new architecture, much of the coding work can be saved by using these macros.

EMLL APIs

EMLL provides a series of C functions; see Usage_EN.md for details. An illustrative sketch of the quantized GEMM path follows the table below.

| Type | Name | Parameters |
| --- | --- | --- |
| Matrix multiplication | data_type + "gemm" | matrix orders, addresses of matrices, M, N, K, beta, number of threads |
| Fully-connected layer (fp32) | "fc" | addresses of src/weight/bias/output, dimensions M/K/N, orders of source matrices, (number of threads) |
| Quantization | "quantize_" + "symmetric"/"asymmetric" + input_type + output_type | input array, output array, (zero point), scale, size of array, input range |
| Requantization | "requantize_" + "symmetric"/"asymmetric" + "_XtoY" | input array, output array, (zero point), output scale, size of array, input range |
| Bias | "bias" + data_type | the matrix to be biased, scalar bias for all elements, vector bias along the major direction, vector bias along the minor direction, dimensions of the matrix |

Supported Architectures and Data Types

| Target CPU | Matrix Multiplication | Bias | Quantization | Requantization |
| --- | --- | --- | --- | --- |
| ARMv7a 32-bit | fp32 -> fp32, (u)int8 -> (u)int32 | fp32, int32 | fp32 -> (u)int8/(u)int16 | int32 -> (u)int8/(u)int16, int16 -> (u)int8 |
| ARMv8a 64-bit | fp32 -> fp32, (u)int8 -> (u)int32, fp16 -> fp16 | fp32, fp16, int32 | fp32 -> (u)int8/(u)int16 | int32 -> (u)int8/(u)int16, int16 -> (u)int8 |

Supported OS: Linux & Android

Supported Compilers: GCC & Clang

Future Plan

EMLL may support on-device GPUs and NPUs in the future and expand its set of available functions, according to business requirements.

License

Apache 2.0

Reference

Eigen: https://eigen.tuxfamily.org

OpenBLAS: https://github.com/xianyi/OpenBLAS


emll's Issues

Question about a build error

I ran into the following errors while building; how can I fix them?

/data/home/.../EMLL/src/arm_neon/ARMCpuType.c:1:0: error: unknown value ‘armv8.2-a+dotprod+fp16’ for -march
/*****************************************************************************/

/data/home/.../EMLL/src/arm_neon/ARMCompareAndSwap.c:1:0: error: unknown value ‘armv8.2-a+dotprod+fp16’ for -march
/*****************************************************************************/

CMakeFiles/eml-armneon.dir/build.make:89: recipe for target 'CMakeFiles/eml-armneon.dir/src/arm_neon/ARMCpuType.c.o' failed
make[2]: *** [CMakeFiles/eml-armneon.dir/src/arm_neon/ARMCpuType.c.o] Error 1
make[2]: *** Waiting for unfinished jobs....
CMakeFiles/eml-armneon.dir/build.make:75: recipe for target 'CMakeFiles/eml-armneon.dir/src/arm_neon/ARMCompareAndSwap.c.o' failed
make[2]: *** [CMakeFiles/eml-armneon.dir/src/arm_neon/ARMCompareAndSwap.c.o] Error 1
CMakeFiles/Makefile2:88: recipe for target 'CMakeFiles/eml-armneon.dir/all' failed
make[1]: *** [CMakeFiles/eml-armneon.dir/all] Error 2
Makefile:135: recipe for target 'all' failed
make: *** [all] Error 2
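
This error usually indicates that the cross-compiler predates the armv8.2-a+dotprod+fp16 extension flags; GCC accepts +dotprod only from roughly version 8 onward. A workaround sketch, assuming the flag is hard-coded somewhere in EMLL's CMake files (paths and version here are illustrative):

```sh
# Option 1: point CMake at a newer toolchain (GCC >= 8 or a recent Clang)
cmake -DCMAKE_C_COMPILER=/path/to/aarch64-linux-gnu-gcc-8 ..

# Option 2: find where the flag is set and relax it to a level the
# compiler accepts (this disables the dot-product and fp16 kernels)
grep -rn "armv8.2-a+dotprod+fp16" .
```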

EMLL uses about 10 MB more memory than the OpenBLAS + Ruy approach

1. Background:
I adapted EMLL to the CTranslate2 framework and ran a translation demo (int8), comparing inference speed and memory usage between EMLL-based inference and OpenBLAS + Ruy-based inference.
EMLL version: #9 (comment)
Applied patch: #8 (comment)

2. C++ binary demo
① EMLL matrix computation
Memory: 102 MB
Speed: about 380-500 ms
② OpenBLAS + Ruy matrix computation
Memory: 107 MB
Speed: about 680-720 ms

3. Qt integration
Running the binary demo alone shows no problem in either speed or memory, but once integrated into the Qt application, the EMLL-based build uses about 10 MB more memory than the OpenBLAS + Ruy build. Apart from the matrix-computation code, both builds run the same program on the same inputs.

What could be causing this?

How to release the memory held by EMLL while the program keeps running?

I use EMLL as a library, and I noticed that after calling EMLL once, the memory it allocated is never released, even if the program makes no further EMLL calls. Is there a way to release the memory EMLL holds when it is not in use, without killing the process?

sgemm not found

Is anyone still maintaining this?
After building, I tried to test the sgemm function, but the symbol could not be found.

On the correspondence with GEMM in BLAS

Thank you very much for sharing this work. While replacing OpenBLAS with EMLL, I noticed some differences between their parameters. Could you give some advice on migrating the parameters, or provide a short README? That would also help promote EMLL.

The main difference seems to be that EMLL has no Transpose parameter; the other parameters should mean the same thing?

EMLL:

a_rowmajor: storage order of source matrix A; nonzero means row-major
b_rowmajor: storage order of source matrix B; nonzero means row-major
A: address of source matrix A
B: address of source matrix B
C: address of the output matrix C
M: number of rows of matrix A
N: number of columns of matrix B
K: number of columns of A; must equal the number of rows of B
beta: pre-multiplying factor applied to matrix C
num_threads: maximum number of threads available for parallel execution

OpenBLAS:
int an = a->dimSize[0];
int am = a->dimSize[1];
int bn = b->dimSize[0];
int bm = b->dimSize[1];
int cn = c->dimSize[0];
int cm = c->dimSize[1];
GEMM(CblasRowMajor, CblasNoTrans, CblasNoTrans, cn, cm, am, alpha, (DTYPE*)a->data, am, (DTYPE*)b->data, bm, beta, (DTYPE*)c->data, cm)
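
For the row-major, no-transpose call above, a hedged mapping to EMLL (using the parameter list earlier in this issue, and the sgemm prototype sketched near the top of this page) could look like the wrapper below. Note two assumptions: alpha has no EMLL counterpart, so it must be applied by pre-scaling an input when alpha != 1; and a CblasTrans operand can be expressed by flipping its row-major flag, since a transposed row-major matrix has the same memory layout as an untransposed column-major one (the issue code further down this page exploits the same duality).

```c
/* Hypothetical wrapper: the OpenBLAS call above, expressed with EMLL.
   M = cn (rows of C), N = cm (cols of C), K = am (cols of A). */
static int gemm_rowmajor_notrans(const float *a, const float *b, float *c,
                                 unsigned int cn, unsigned int cm,
                                 unsigned int am, float beta) {
    /* both sources row-major; no LDA/LDB/LDC, since EMLL assumes
       densely packed matrices; num_threads = 0 as in the issue code */
    return sgemm(1, 1, a, b, c, cn, cm, am, beta, 0);
}
```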

How to solve the following EMLL cross-compilation error

/EMLL/src/arm_neon/ARMCompareAndSwap.c:1:0: error: invalid feature modifier in '-march=armv8.2-a+dotprod+fp16'
/*****************************************************************************/

CMakeFiles/eml-armneon.dir/build.make:62: recipe for target 'CMakeFiles/eml-armneon.dir/src/arm_neon/ARMCompareAndSwap.c.o' failed
make[2]: *** [CMakeFiles/eml-armneon.dir/src/arm_neon/ARMCompareAndSwap.c.o] Error 1
CMakeFiles/Makefile2:109: recipe for target 'CMakeFiles/eml-armneon.dir/all' failed
make[1]: *** [CMakeFiles/eml-armneon.dir/all] Error 2
Makefile:129: recipe for target 'all' failed
make: *** [all] Error 2

How to bundle OpenMP into the .so

Hello. Following the usage guide, I added -fopenmp to the build configuration to enable OpenMP. Since all the compute libraries have to be shipped to the application as a single .so, I inspected the .so with objdump -x and found that it still depends on the OpenMP shared library. Is there a recommended configuration for bundling OpenMP into the .so?
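
A sketch of one common approach, assuming an Android NDK Clang toolchain and a hypothetical shared-library target named mylib that wraps EMLL; -static-openmp is the NDK Clang flag for linking the OpenMP runtime statically, and flag names differ on other toolchains (with GCC one would instead link the static libgomp.a archive explicitly):

```cmake
# compile with OpenMP, and link the NDK's static OpenMP runtime into
# the shared library so the resulting .so carries no libomp dependency
target_compile_options(mylib PRIVATE -fopenmp)
target_link_options(mylib PRIVATE -fopenmp -static-openmp)
```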

sgemm results differ from OpenBLAS, making inference results far off

As the title says: results computed with EMLL's sgemm differ slightly from those of OpenBLAS's cblas_sgemm, and the difference is enough to make the model's inference output wrong. However, using EMLL's dynamic quantization (s8s32gemm followed by dequantization) produces correct inference results. What is the difference between the two paths?
The code is as follows:

  enum QuantType
  {
    NO_QUANT = 0,
    SYMMETRIC,
    ASYMMETRIC
  };

inline int emll_s8s32gemm(bool transpose_a, bool transpose_b,
                            dim_t m, dim_t n, dim_t k,
                            const int8_t *a,
                            const int8_t *b,
                            float beta,
                            int32_t *c)
  {
    int status;
    if (!transpose_a && !transpose_b)
    {
      status = s8s32gemm(0, 0, b, a, c, n, m, k, beta, 0);
    }
    else if (transpose_a && !transpose_b)
    {
      status = s8s32gemm(0, 1, b, a, c, n, m, k, beta, 0);
    }
    else if (!transpose_a && transpose_b)
    {
      status = s8s32gemm(1, 0, b, a, c, n, m, k, beta, 0);
    }
    else // transpose_a && transpose_b
    {
      status = s8s32gemm(1, 1, b, a, c, n, m, k, beta, 0);
    }

    return status;
  }

  inline int emll_u8u32gemm(bool transpose_a, bool transpose_b,
                            dim_t m, dim_t n, dim_t k,
                            const uint8_t *a,
                            const uint8_t *b,
                            float beta,
                            uint32_t *c)
  {
    int status;
    if (!transpose_a && !transpose_b)
    {
      status = u8u32gemm(0, 0, b, a, c, n, m, k, beta, 0);
    }
    else if (transpose_a && !transpose_b)
    {
      status = u8u32gemm(0, 1, b, a, c, n, m, k, beta, 0);
    }
    else if (!transpose_a && transpose_b)
    {
      status = u8u32gemm(1, 0, b, a, c, n, m, k, beta, 0);
    }
    else // transpose_a && transpose_b
    {
      status = u8u32gemm(1, 1, b, a, c, n, m, k, beta, 0);
    }

    return status;
  }

  int emll_sgemm(bool transpose_a, bool transpose_b,
                 dim_t m, dim_t n, dim_t k,
                 float alpha,
                 const float *a,
                 const float *b,
                 float beta,
                 float *c,
                 QuantType quant_type)
  {
    int status;

    float *a_f = nullptr;
    if (alpha != 1.0f)
    {
      a_f = static_cast<float *>(allocator.allocate(m * k * sizeof(float)));
      cpu::parallel_for(0, m * k, cpu::GRAIN_SIZE / 2, [&](dim_t begin, dim_t end) {
        for (dim_t i = begin; i < end; ++i)
        {
          a_f[i] = static_cast<float>(alpha * a[i]);
        }
      });
    }

    if (quant_type == QuantType::NO_QUANT) // this path gives wrong results!!!
    {
      if (!transpose_a && !transpose_b)
      {
        // std::cout << "!!! !transpose_a && !transpose_b" << std::endl;
        if (a_f != nullptr)
        {
          status = sgemm(0, 0, b, a_f, c, n, m, k, beta, 0);
        }
        else
        {
          status = sgemm(0, 0, b, a, c, n, m, k, beta, 0);
        }
      }
      else if (transpose_a && !transpose_b)
      {
        // std::cout << "@@@ transpose_a && !transpose_b" << std::endl;
        if (a_f != nullptr)
        {
          status = sgemm(0, 1, b, a_f, c, n, m, k, beta, 0);
        }
        else
        {
          status = sgemm(0, 1, b, a, c, n, m, k, beta, 0);
        }
      }
      else if (!transpose_a && transpose_b)
      {
        // std::cout << "### !transpose_a && transpose_b" << std::endl;
        if (a_f != nullptr)
        {
          status = sgemm(1, 0, b, a_f, c, n, m, k, beta, 0);
        }
        else
        {
          status = sgemm(1, 0, b, a, c, n, m, k, beta, 0);
        }
      }
      else // transpose_a && transpose_b
      {
        // std::cout << "$$$ transpose_a && transpose_b" << std::endl;
        if (a_f != nullptr)
        {
          status = sgemm(1, 1, b, a_f, c, n, m, k, beta, 0);
        }
        else
        {
          status = sgemm(1, 1, b, a, c, n, m, k, beta, 0);
        }
      }
    }
    else if (quant_type == QuantType::SYMMETRIC)
    {
      int8_t *const a_s = static_cast<int8_t *>(allocator.allocate(m * k * sizeof(int8_t)));
      int8_t *const b_s = static_cast<int8_t *>(allocator.allocate(n * k * sizeof(int8_t)));
      int32_t *const c_qs = static_cast<int32_t *>(allocator.allocate(m * n * sizeof(int32_t)));

      float scale_a, scale_b;

      if (a_f != nullptr)
      {
        quantize_symmetric_f32_s8(a_f, a_s, &scale_a, m * k, 0, -1);
      }
      else
      {
        quantize_symmetric_f32_s8(a, a_s, &scale_a, m * k, 0, -1);
      }

      quantize_symmetric_f32_s8(b, b_s, &scale_b, n * k, 0, -1);

      status = emll_s8s32gemm(transpose_a, transpose_b,
                              m, n, k, a_s, b_s, beta, c_qs);
      if (status != 0)
      {
        fprintf(stderr, "s8s32gemm returns error code %d\n", status);
      }
      else
      {
        dequantize_symmetric_f32_s32(c_qs, c, scale_a * scale_b, m * n);
      }

      allocator.free(a_s);
      allocator.free(b_s);
      allocator.free(c_qs);
    }
    else // ASYMMETRIC  
    {
      uint8_t *const a_u = static_cast<uint8_t *>(allocator.allocate(m * k * sizeof(uint8_t)));
      uint8_t *const b_u = static_cast<uint8_t *>(allocator.allocate(n * k * sizeof(uint8_t)));
      int32_t *const c_qu = static_cast<int32_t *>(allocator.allocate(m * n * sizeof(int32_t)));

      uint32_t *const a_sum = (uint32_t *)(allocator.allocate(m * sizeof(uint32_t)));
      uint32_t *const b_sum = (uint32_t *)(allocator.allocate(n * sizeof(uint32_t)));

      float scale_a, scale_b;
      uint8_t zero_point_a, zero_point_b;

      if (a_f != nullptr)
      {
        quantize_asymmetric_f32_u8(a_f, a_u, &zero_point_a, &scale_a, m * k, 0, -1);
      }
      else
      {
        quantize_asymmetric_f32_u8(a, a_u, &zero_point_a, &scale_a, m * k, 0, -1);
      }

      quantize_asymmetric_f32_u8(b, b_u, &zero_point_b, &scale_b, n * k, 0, -1);

      status = emll_u8u32gemm(transpose_a, transpose_b,
                              m, n, k, a_u, b_u, beta, (uint32_t*)c_qu);

      if (status != 0)
      {
        fprintf(stderr, "u8u32gemm returns error code %d\n", status);
      }
      else
      {
        /* sum row/col of source matrices (along K dim) */
        u8u32_sum(a_u, (uint32_t*)(a_sum), m, k, 0);
        u8u32_sum(b_u, (uint32_t*)(b_sum), k, n, 1);
        /* bias the result of 8->32 bit GEMM */
        bias_int32_t(c_qu,
                     (int32_t)zero_point_a * (int32_t)zero_point_b * (int32_t)k,
                     (int32_t *)(a_sum), -(int32_t)zero_point_b,
                     (int32_t *)(b_sum), -(int32_t)zero_point_a, m, n);
        /* dequantize the result */
        /* dequant(input_addr, output_addr, scale, array_length) */
        dequantize_symmetric_f32_s32(c_qu, c, scale_a * scale_b, m * n);
      }

      allocator.free(a_u);
      allocator.free(b_u);
      allocator.free(c_qu);
      allocator.free(a_sum);
      allocator.free(b_sum);
    }

    if (a_f != nullptr)
    {
      allocator.free(a_f);
    }

    return status;
  }


Using EMLL in a multi-threaded process increases memory usage

A process creates four threads:
thread-1: runs EMLL's gemm (computing with 1 thread, i.e., the last gemm parameter is 1)
thread-2: sleep
thread-3: sleep
thread-4: sleep
This uses more memory than a process that creates only one thread:
thread-1: runs EMLL's gemm (computing with 1 thread, i.e., the last gemm parameter is 1)
The extra memory comes from EMLL. By my measurements, each additional thread adds roughly 768 KB. Could this be related to GEMM_STATIC_BUFFER in CommonDriver.h, which allocates exactly 768 KB (1024 x 192 x 4 / 1024)?
How can I work around this? Threads 2/3/4 never need EMLL's gemm; how do I avoid the extra memory?

How to selectively enable GEMM optimizations when building EMLL?

According to EMLL's introduction, GEMM is optimized mainly through three techniques: blocking, packing (rearrangement) and assembly kernels. How can I selectively enable one, or several, of these techniques?

I ask because in my own tests the same data runs fine in a standalone demo on the device, but after integrating it into the project, the program crashes (core dump) on the device. Analysis suggests the crash is most likely caused by insufficient memory, so I suspect that some of the three GEMM optimizations may be particularly memory-hungry?

Slower than the Arm Compute Library on the A53

I tried the following pipeline on Android aarch64 (Cortex-A53):
quantize_symmetric_f32_s8
s8s32gemm
requantize_symmetric_32to8
s8s32gemm
requantize_symmetric_32to8
... with s8s32gemm and requantize_symmetric_32to8 repeated like this for several layers
dequantize_symmetric_f32_s32
The matrices follow C[mxn] = A[mxk] B[kxn], with shapes roughly m=8, k=100, n=400 and m=8, k=400, n=100.
It runs about twice as slow as ACL. On the A76 platform, EMLL is a bit faster than ACL for several of the matrices, which is great.
I see that the A35, A53 and A7x have different optimized code paths.
