gemmlowp's Introduction

gemmlowp: a small self-contained low-precision GEMM library

This is not a full linear algebra library, only a GEMM library: it only does general matrix multiplication ("GEMM").

The meaning of "low precision" is detailed in this document: doc/low-precision.md

Some of the general design is explained in doc/design.md.

Warning: This library runs very slowly if compiled incorrectly; see below.

Disclaimer

This is not an official Google product (experimental or otherwise), it is just code that happens to be owned by Google.

Mailing list

gemmlowp-related discussion, about either development or usage, is welcome on this Google Group (mailing list / forum):

https://groups.google.com/forum/#!forum/gemmlowp

Portability, target platforms/architectures

gemmlowp should be portable to any platform with some C++11 and POSIX support; optional optimized code paths are provided for specific architectures.

Required:

  • C++11 (a small conservative subset of it)

Required for some features:

  • Some POSIX interfaces:
    • pthreads (for multi-threaded operation and for profiling).
    • sysconf (for multi-threaded operation to detect number of cores; may be bypassed).

Optional:

  • Architecture-specific code paths use intrinsics or inline assembly. See "Architecture-specific optimized code paths" below.

Architecture-specific optimized code paths

We have some optimized code paths for specific instruction sets. Some are written in inline assembly, some are written in C++ using intrinsics. Both GCC and Clang are supported.

Current optimized code paths:

  • ARM with NEON (both 32bit and 64bit).
  • Intel x86 with SSE 4.1 (both 32bit and 64bit).

When building for x86, it's very important to pass -msse4.1 to the compiler, otherwise gemmlowp will use slow reference code. Bazel users can compile by running bazel build --copt=-msse4.1 //gemmlowp:all. The compiled binary should work on all Intel CPUs since 2008 (including low power microarchitectures) as well as AMD CPUs since 2011.

Please note that when compiling binaries that don't need to be distributed, it's generally better to pass -march=native to the compiler. That flag implies -msse4.1, along with other flags that might be helpful. This of course assumes the host machine supports those instructions. Bazel users should prefer to run bazel build --config=opt //gemmlowp:all instead.

Details of what it takes to make an efficient port of gemmlowp, namely writing a suitable GEMM kernel and accompanying packing code, are explained in this file: doc/kernel.md.

Public interfaces

The gemmlowp public interface

gemmlowp's main public interface is in the public/ subdirectory.

This is a headers-only library, so there is nothing to link to.

Usage documentation, and comments on the deprecation status of each public entry point, may be found in doc/public.md.

A full, self-contained usage example, showing how to quantize float matrices and perform a quantized matrix multiplication approximating a float matrix multiplication, is given in doc/quantization_example.cc.
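For orientation, here is a condensed, unofficial sketch of what a call through the public interface looks like. It is paraphrased from doc/public.md and doc/quantization_example.cc; the offsets, shapes, and output-stage parameter values are placeholders, and the output-stage and bit-depth type names are recalled from the public headers rather than quoted from this page, so treat doc/public.md as authoritative.

#include <cstdint>
#include <tuple>

#include "public/gemmlowp.h"

// Multiply a uint8 LHS (rows x depth) by a uint8 RHS (depth x cols),
// producing a uint8 result (rows x cols) through an output pipeline that
// scales the int32 accumulators back down to uint8.
void ExampleGemm(const std::uint8_t* lhs_data, const std::uint8_t* rhs_data,
                 std::uint8_t* result_data, int rows, int depth, int cols) {
  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::RowMajor> lhs(
      lhs_data, rows, depth);
  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::ColMajor> rhs(
      rhs_data, depth, cols);
  gemmlowp::MatrixMap<std::uint8_t, gemmlowp::MapOrder::RowMajor> result(
      result_data, rows, cols);

  // Quantization parameters: placeholder values only. In real use they are
  // derived from the real-number ranges of the matrices, as explained in
  // doc/quantization.md and doc/quantization_example.cc.
  const int lhs_offset = -128;
  const int rhs_offset = -128;
  gemmlowp::OutputStageQuantizeDownInt32ToUint8Scale quantize_down_stage;
  quantize_down_stage.result_offset = 128;
  quantize_down_stage.result_mult_int = 1;
  quantize_down_stage.result_shift = 8;
  gemmlowp::OutputStageSaturatingCastToUint8 saturating_cast_stage;
  const auto output_pipeline =
      std::make_tuple(quantize_down_stage, saturating_cast_stage);

  gemmlowp::GemmContext gemm_context;
  gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::uint8_t,
                                   gemmlowp::DefaultL8R8BitDepthParams>(
      &gemm_context, lhs, rhs, &result, lhs_offset, rhs_offset,
      output_pipeline);
}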

Old EightBitIntGemm legacy deprecated interface

The eight_bit_int_gemm/ subdirectory contains an alternate interface that should be considered purely legacy and deprecated; it will be removed at some point in the future.

Building

Building by manually invoking your compiler

Because gemmlowp is so simple, working with it involves only single-command-line compiler invocations. Therefore we expect that most people working with gemmlowp will either manually invoke their compiler, or write their own rules for their own preferred build system.

Keep in mind (previous section) that gemmlowp itself is a pure-headers-only library so there is nothing to build.
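For instance, compiling a program that uses gemmlowp can be as simple as something along these lines (a sketch only: my_program.cc, the output name, and the include path are placeholders to adapt to your setup, and -msse4.1 applies to x86 builds as discussed above):

$ g++ -O3 -std=c++11 -msse4.1 -I /path/to/gemmlowp my_program.cc -o my_program -lpthread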

For an Android gemmlowp development workflow, the scripts/ directory contains a script to build and run a program on an Android device:

scripts/test-android.sh

Building using Bazel

That being said, we also maintain a Bazel BUILD system as part of gemmlowp. Its usage is not mandatory at all; it is only one possible way to build gemmlowp libraries and tests. If you are interested, Bazel's home page is http://bazel.build/. You can get started with using Bazel to build gemmlowp targets by first creating an empty WORKSPACE file in a parent directory, for instance:

$ cd gemmlowp/..  # change to parent directory containing gemmlowp/
$ touch WORKSPACE # declare that to be our workspace root
$ bazel build gemmlowp:all

Building gemmlowp - Using vcpkg

You can download and install gemmlowp using the vcpkg dependency manager:

git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
./vcpkg integrate install
./vcpkg install gemmlowp

The gemmlowp port in vcpkg is kept up to date by Microsoft team members and community contributors. If the version is out of date, please create an issue or pull request on the vcpkg repository.

Testing

Testing by manually building and running tests

The test/ directory contains unit tests. The primary unit test is

test/test.cc

Since it also covers the EightBitIntGemm interface, it needs to be linked against

eight_bit_int_gemm/eight_bit_int_gemm.cc

It also uses realistic data captured from a neural network run in

test/test_data.cc

Thus you'll want to pass the following list of source files to your compiler/linker:

test/test.cc
eight_bit_int_gemm/eight_bit_int_gemm.cc
test/test_data.cc
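For example, on an x86 Linux host, from the gemmlowp root directory, that invocation could look roughly like this (a sketch only; adjust the compiler, flags, and output name to your setup):

$ g++ -O2 -std=c++11 -msse4.1 test/test.cc eight_bit_int_gemm/eight_bit_int_gemm.cc test/test_data.cc -o test_gemmlowp -lpthread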

The scripts/ directory contains a script to build and run a program on an Android device:

scripts/test-android.sh

It expects the CXX environment variable to point to an Android toolchain's C++ compiler, and expects source files (and optionally, cflags) as command-line parameters. To build and run the above-mentioned main unit test, first set CXX e.g.:

$ export CXX=/some/toolchains/arm-linux-androideabi-4.8/bin/arm-linux-androideabi-g++

Then run:

$ ./scripts/test-android.sh \
test/test.cc \
eight_bit_int_gemm/eight_bit_int_gemm.cc \
test/test_data.cc

Testing using Bazel

Alternatively, you can use Bazel to build and run tests. See the Bazel instructions in the section on building above. Once your Bazel workspace is set up, you can for instance do:

$ bazel test gemmlowp:all

Troubleshooting Compilation

If you're having trouble finding the compiler, follow these instructions to build a standalone toolchain: https://developer.android.com/ndk/guides/standalone_toolchain.html

Here's an example of setting up Clang 3.5:

$ export INSTALL_DIR=~/toolchains/clang-21-stl-gnu
$ $NDK/build/tools/make-standalone-toolchain.sh \
--toolchain=arm-linux-androideabi-clang3.5 --platform=android-21 \
--install-dir=$INSTALL_DIR
$ export CXX="$INSTALL_DIR/bin/arm-linux-androideabi-g++ \
--sysroot=$INSTALL_DIR/sysroot"

Some compilers (e.g. the default clang++ in the same bin directory) don't support NEON assembly. The benchmark build process will issue a warning if support isn't detected, and you should make sure you're using a compiler like arm-linux-androideabi-g++ that does include NEON.

Benchmarking

The main benchmark is

test/benchmark.cc

It doesn't need to be linked to any other source file. We recommend building with assertions disabled (-DNDEBUG).

For example, the benchmark can be built and run on an Android device by doing:

$ ./scripts/test-android.sh test/benchmark.cc -DNDEBUG

If GEMMLOWP_TEST_PROFILE is defined then the benchmark will be built with profiling instrumentation (which makes it slower) and will dump profiles. See next section on profiling.
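For instance, since the script passes extra arguments through as cflags (see the Testing section above), a profiled benchmark run might look like the following; the exact flag spelling is the only assumption here:

$ ./scripts/test-android.sh test/benchmark.cc -DNDEBUG -DGEMMLOWP_TEST_PROFILE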

Profiling

The profiling/ subdirectory offers a very simple, naive, inaccurate, non-interrupting sampling profiler that only requires pthreads (no signals).

It relies on source code being instrumented with pseudo-stack labels. See profiling/instrumentation.h. A full example of using this profiler is given in the top comment of profiling/profiler.h.
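In rough outline, instrumented usage looks like the sketch below. This paraphrases the example described in those headers; the macro and function names are recalled from profiling/profiler.h and profiling/instrumentation.h rather than quoted here, so see the top comment of profiler.h for the authoritative version.

#define GEMMLOWP_PROFILING  // must be defined before including the headers
#include "profiling/instrumentation.h"
#include "profiling/profiler.h"

void SomeExpensiveFunction() {
  // Pseudo-stack label recorded by the sampling profiler.
  gemmlowp::ScopedProfilingLabel label("SomeExpensiveFunction");
  // ... work to be attributed to this label ...
}

int main() {
  // Each thread to be profiled must register itself.
  gemmlowp::RegisterCurrentThreadForProfiling();
  gemmlowp::StartProfiling();
  SomeExpensiveFunction();
  gemmlowp::FinishProfiling();  // prints the collected profile
}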

Contributing

Contribution-related discussion is always welcome on the gemmlowp mailing list (see above).

We try to keep a current list of TODO items in the todo/ directory. Prospective contributors are welcome to pick one to work on, and communicate about it on the gemmlowp mailing list.

Details of the contributing process, including legalese, are in CONTRIBUTING.

Performance goals

Our performance goals differ from typical GEMM performance goals in the following ways:

  1. We care not only about speed, but also about minimizing power usage. We specifically care about charge usage in mobile/embedded devices. This implies that we care doubly about minimizing memory bandwidth usage: we care about it, like any GEMM, because of the impact on speed, and we also care about it because it is a key factor of power usage.

  2. Most GEMMs are optimized primarily for large dense matrix sizes (>= 1000). We do care about large sizes, but we also care specifically about the typically smaller matrix sizes encountered in various mobile applications. This means that we have to optimize for all sizes, not just for large enough sizes.

gemmlowp's People

Contributors

ajtulloch, andreasgal, andrewharp, arcank, arritmic, bjacob, craig-chasseur, cwhipkey, davidmansell, echristo, guschmue, ianfhunter, jalexstark, jart, jdcormie, lamarrr, lee-bin, legrosbuffle, mariecwhite, mgouicem, miaowang14, mjmatthews, mrry, multiverse-tf, petewarden, planetmarshall, rongjiecomputer, skligys, tetsuok, yongtang


gemmlowp's Issues

A problem with the design of kernel in Arm64

I am studying the kernels implemented in gemmlowp for arm64, and noticed that the main kernel used is 12x8x2. I know the cell format is
KernelFormat<
KernelSideFormat<CellFormat<4, 2>, 3>,
KernelSideFormat<CellFormat<4, 2>, 2>>
and I'd like to know why the depth was chosen to be 2, instead of 1 or some other value. Is the reason that, under this condition, the registers can be used more efficiently? Or are there other technical reasons for choosing this kernel depth?
Thanks a lot if anyone can help.

How to calculate non-linear functions after 8-bit quantization?

For 8-bit quantization, a zero point and a scale are applied.
But in a non-linear function layer,
I want to know whether I can process the input data without converting it back to a real number.
Or is there any way to calibrate?
Please answer my question.

Non-linear functions: tanh / sigmoid / softmax / exp(x)

core dump in the neon-gemm-kernel-benchmark.cc.

Hardware: NVidia TX 2
OS: Ubuntu 16.04
GCC: 5.40
compile flag: -std=c++11 -O3

error message:
:~/workspace/test/test_neon$ ./bench_mm
kernel,Gop/s
Arithmetic error in kernel:
NEON_64bit_GEMM_Int425Operands
Wrong accumulator for depth=32, at l = 1, r = 0
reference value: -47
actual value: -94
Aborted (core dumped)

SIMD back-end for IBM Power and Z

This is not really an issue. I'd like to know whether there is interest in incorporating code to support IBM's Power and Z architectures as a back-end. In-house, a colleague and I worked on this, and we have extensions ready for gemmlowp to run optimized on P and Z, depending on compiler flags, when these architectures are detected. In principle this does not touch or disrupt any of the existing code.
Please comment on this issue and provide advice as to how best to proceed.

Reconsider GEMMLOWP_ALLOW_SLOW_SCALAR_FALLBACK

Is there any chance the authors of gemmlowp would consider removing the #error directive that halts non-optimized builds? It's not a common practice. I'd like for it to be possible to build TensorFlow without passing any special configuration directives. Perhaps consider printing a big ASCII art skull and crossbones #warning instead? Sort of like Gosling Emacs?

cc: @gunan

How to use gemmlowp in C project?

Hello, I'm using a framework named "darknet" which is written in the C programming language. I implemented parameter quantization in the conv layers, and am trying to use gemmlowp in darknet. When I try to wrap the function "EightBitIntGemm" from C++ for use from C, this error occurs: fatal error: cstdint: No such file or directory. Could you kindly give me some advice?

How to quantize accumulator from int32 to uint8

I am trying to implement the quantized version of MobileNet v1 in OpenCL. I have referenced the method that you provided in https://arxiv.org/pdf/1712.05877.pdf. I am using pretrained MobileNet weights from the tflite file, and I have obtained all the required quantization parameters (e.g. S1, S2, and S3) from the tflite file.
The only issue is converting the accumulator back from int32 to uint8.
In the gemmlowp kernel, the min and max of the output tensor are used to quantize the accumulator from int32 to uint8. But for my implementation, as I am using OpenCL, I cannot get the min and max values of the output tensor at runtime; I would have to write additional logic on the host side, which would incur additional execution time.

M = (S1*S2)/S3
To quantize the accumulator, I currently use q = ((int32 * M) + bias).
But this output does not match the intermediate output obtained from the TensorFlow Lite API.

int8*int8 -> float?

Hey,

I'm looking to perform int8 * int8 -> fp32, where at the output stage I dequantise the int32_t result into float (and then potentially add a bias). I was following the example from https://github.com/google/gemmlowp/blob/master/doc/quantization_example.cc#L305
But it seems that in order to dequantise to float you compute the quantisation parameters from the fp32 result that you had already computed before, which in practice I wouldn't know. I can compute it with a compensation factor, but that becomes incredibly complicated and computationally (and memory) expensive. Any alternatives?

If I am able to assume quantisation into int8 as opposed to uint8 as in the example, I would be able to quantise without the zero_point parameter (assuming a zero-centred distribution), which would massively simplify dequantisation. Do you support this? Do you have any examples in the codebase where something like this is done?

[BUG] Failed to build on several architectures

Is this product range of int8*int8 in the code comment expected?

Hi there,

In

// their range being [ -2^7 , 2^7 ), their products are in range
// [ -2^14 , 2^14 - 1 ), meaning that we can add two such values

I guess the product range should be [ (-2^7)*(2^7 - 1), (-2^7)*(-2^7) ], which is contained in (-2^14, 2^14] - i.e. closed at 2^14? If we are putting int8*int8 + int8*int8 into an int16, do we need the assumption that -128 is not included in int8 (from "B. Appendix: ARM NEON details" of the paper)? To me, if int8 can take the value -128, then int8*int8 + int8*int8 could be as large as 2^15, which cannot be held in an int16.

Thanks

what is "ab_x2_high32" in <func::SaturatingRoundingDoublingHighMul> stand for?

Hi,
In

inline std::int32_t SaturatingRoundingDoublingHighMul(std::int32_t a,
                                                      std::int32_t b) {
  bool overflow = a == b && a == std::numeric_limits<std::int32_t>::min();
  std::int64_t a_64(a);
  std::int64_t b_64(b);
  std::int64_t ab_64 = a_64 * b_64;
  std::int32_t nudge = ab_64 >= 0 ? (1 << 30) : (1 - (1 << 30));
  std::int32_t ab_x2_high32 =
      static_cast<std::int32_t>((ab_64 + nudge) / (1ll << 31));
  return overflow ? std::numeric_limits<std::int32_t>::max() : ab_x2_high32;
}

It seems this function computes a * b.
I wondered what the relationship between ab_x2_high32 and ab_64 is. Could you explain how ab_x2_high32 is computed?

Thank you!

Document code style/formatting

If you're using clang-format or something similar, please document the settings, so the contributors can format their code prior to making pull requests.

run ./correctness_meta_gemm ...............................Bus error

Hello, my device is an rk3288 (32-bit) and I use android-ndk-r14b.
I cd into gemmlowp/jni and use ndk-build to build it,
then adb push the executables to the rk3288 (32-bit).
I can successfully run ./benchmark_meta_gemm and ./benchmark,
but when I run ./correctness_meta_gemm, something goes wrong as below:
root@rk3288:/data/local/tmp # ./correctness_meta_gemm
WARNING: linker: ./correctness_meta_gemm: unused DT entry: type 0x6ffffffe arg 0x1198
WARNING: linker: ./correctness_meta_gemm: unused DT entry: type 0x6fffffff arg 0x1
Threads: 1
Quantized 8 bit.
Small.
...............................Bus error
135|root@rk3288:/data/local/tmp #

How should I solve it ?
Thanks very much, good luck to you

A question about result_scale and result_zero_point

I looked at doc/quantization_example.cc.

I have a question about the result_scale and result_zero_point: how are they determined?

Do you compute the real (float) result first, and then calculate them from it?

But if we had to do that every time in a real network, wouldn't the speed obviously be lower?

Can anyone help me solve this problem?

Suggestions for resources to understand gemmlowp

Hello,

I'm a noob trying to understand the theory and implementation of gemmlowp. Can you please share any resources that I can start with?
Any help in this regard is very much appreciated.

Thanks,
Thomas.

Bug in neon-gemm-kernel-benchmark.cc?

When I try to compile following the guide,
aarch64-linux-android-clang++ -mcpu=cortex-a55 -fPIE -static -O3 --std=c++11 neon-gemm-kernel-benchmark.cc -o bench -D__ARM_FEATURE_DOTPROD
it shows errors like below.
neon-gemm-kernel-benchmark.cc:2585:10: error: invalid operand for instruction
"udot v8.4s, v2.16b, v0.b[0]\n"
^
:29:21: note: instantiated into assembly here
udot v8.4s, v2.16b, v0.b[0]
^
neon-gemm-kernel-benchmark.cc:2586:10: error: invalid operand for instruction
"udot v9.4s, v2.16b, v0.b[1]\n"
^
:30:21: note: instantiated into assembly here
udot v9.4s, v2.16b, v0.b[1]
^
neon-gemm-kernel-benchmark.cc:2588:10: error: invalid operand for instruction
"udot v10.4s, v2.16b, v0.b[2]\n"
....

So I think the patch below should be applied to fix the compile issue.

diff --git a/standalone/neon-gemm-kernel-benchmark.cc b/standalone/neon-gemm-kernel-benchmark.cc
index aabeac9..e62115d 100644
--- a/standalone/neon-gemm-kernel-benchmark.cc
+++ b/standalone/neon-gemm-kernel-benchmark.cc
@@ -2103,11 +2103,11 @@ struct NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_noexpand_A57 {
 //
 //
 //                               +--------+--------+--------+--------+
-//                               |v0.b[0] |v1.b[0] |v2.b[0] |v3.b[0] |
+//                               |v0.4b[0] |v1.4b[0] |v2.b[0] |v3.b[0] |
 //                          Rhs  +--------+--------+--------+--------+
 //                               |  ...   |  ...   |  ...   |  ...   |
 //                               +--------+--------+--------+--------|
-//                               |v0.b[15]|v1.b[15]|v2.b[15]|v3.b[15]|
+//                               |v0.4b[15]|v1.4b[15]|v2.b[15]|v3.b[15]|
 //                               +--------+--------+--------+--------+
 //
 //                               |        |        |        |        |

@@ -2344,11 +2344,11 @@ struct NEON_64bit_GEMM_Int8Operands_AccumTwoWithin16Bits {
 // Register layout (ignoring the v8--v15 temporary 16bit accumulators):
 //
 //                               +--------+--------+--------+--------+
-//                               |v0.b[0] |v1.b[0] |v2.b[0] |v3.b[0] |
+//                               |v0.4b[0] |v1.4b[0] |v2.b[0] |v3.b[0] |
 //                          Rhs  +--------+--------+--------+--------+
 //                               |  ...   |  ...   |  ...   |  ...   |
 //                               +--------+--------+--------+--------|
-//                               |v0.b[15]|v1.b[15]|v2.b[15]|v3.b[15]|
+//                               |v0.4b[15]|v1.4b[15]|v2.b[15]|v3.b[15]|
 //                               +--------+--------+--------+--------+
 //
 //                               |        |        |        |        |

@@ -3197,41 +3197,41 @@ struct NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_dotproduct {

     // Start the MACs at the head of the loop - 1st cell from each side
     // already loaded.
-    "udot v8.4s, v2.16b, v0.b[0]\n"
-    "udot v9.4s, v2.16b, v0.b[1]\n"
+    "udot v8.4s, v2.16b, v0.4b[0]\n"
+    "udot v9.4s, v2.16b, v0.4b[1]\n"
     "ld1 {v1.16b}, [%[rhs_ptr]], #16\n"  // Load second Rhs cell.
-    "udot v10.4s, v2.16b, v0.b[2]\n"
-    "udot v11.4s, v2.16b, v0.b[3]\n"
+    "udot v10.4s, v2.16b, v0.4b[2]\n"
+    "udot v11.4s, v2.16b, v0.4b[3]\n"
     "ld1 {v3.16b}, [%[lhs_ptr]], #16\n"  // Load second Lhs cell.
-    "udot v12.4s, v2.16b, v1.b[0]\n"
-    "udot v13.4s, v2.16b, v1.b[1]\n"
+    "udot v12.4s, v2.16b, v1.4b[0]\n"
+    "udot v13.4s, v2.16b, v1.4b[1]\n"
     "ld1 {v4.16b}, [%[lhs_ptr]], #16\n"  // Load third Lhs cell.
-    "udot v14.4s, v2.16b, v1.b[2]\n"
-    "udot v15.4s, v2.16b, v1.b[3]\n"
+    "udot v14.4s, v2.16b, v1.4b[2]\n"
+    "udot v15.4s, v2.16b, v1.4b[3]\n"
     "ld1 {v2.16b}, [%[lhs_ptr]], #16\n"  // Done with first Lhs cell - load
     // for the next iteration early.
-    "udot v16.4s, v3.16b, v0.b[0]\n"
-    "udot v17.4s, v3.16b, v0.b[1]\n"
-    "udot v18.4s, v3.16b, v0.b[2]\n"
-    "udot v19.4s, v3.16b, v0.b[3]\n"
-    "udot v20.4s, v3.16b, v1.b[0]\n"
-    "udot v21.4s, v3.16b, v1.b[1]\n"
-    "udot v22.4s, v3.16b, v1.b[2]\n"
-    "udot v23.4s, v3.16b, v1.b[3]\n"
-    "udot v24.4s, v4.16b, v0.b[0]\n"
-    "udot v25.4s, v4.16b, v0.b[1]\n"
-    "udot v26.4s, v4.16b, v0.b[2]\n"
-    "udot v27.4s, v4.16b, v0.b[3]\n"
+    "udot v16.4s, v3.16b, v0.4b[0]\n"
+    "udot v17.4s, v3.16b, v0.4b[1]\n"
+    "udot v18.4s, v3.16b, v0.4b[2]\n"
+    "udot v19.4s, v3.16b, v0.4b[3]\n"
+    "udot v20.4s, v3.16b, v1.4b[0]\n"
+    "udot v21.4s, v3.16b, v1.4b[1]\n"
+    "udot v22.4s, v3.16b, v1.4b[2]\n"
+    "udot v23.4s, v3.16b, v1.4b[3]\n"
+    "udot v24.4s, v4.16b, v0.4b[0]\n"
+    "udot v25.4s, v4.16b, v0.4b[1]\n"
+    "udot v26.4s, v4.16b, v0.4b[2]\n"
+    "udot v27.4s, v4.16b, v0.4b[3]\n"
     "ld1 {v0.16b}, [%[rhs_ptr]], #16\n"  // Done with the first Rhs cell -
     // load for the next iteration early.
-    "udot v28.4s, v4.16b, v1.b[0]\n"
-    "udot v29.4s, v4.16b, v1.b[1]\n"
+    "udot v28.4s, v4.16b, v1.4b[0]\n"
+    "udot v29.4s, v4.16b, v1.4b[1]\n"

     // Loop.  Decrement loop index (depth) by 4 as udot processes 4
     // depth values.
     "subs %w[depth], %w[depth], #4\n"
-    "udot v30.4s, v4.16b, v1.b[2]\n"
-    "udot v31.4s, v4.16b, v1.b[3]\n"
+    "udot v30.4s, v4.16b, v1.4b[2]\n"
+    "udot v31.4s, v4.16b, v1.4b[3]\n"

     "bne " GEMMLOWP_LABEL_LOOP
     "b\n"

@@ -3327,53 +3327,53 @@ struct NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_dotproduct_A55r1 {
     GEMMLOWP_LABEL_LOOP
     ":\n"

-    "udot v8.4s, v2.16b, v0.b[0]\n"
+    "udot v8.4s, v2.16b, v0.4b[0]\n"
     "ldr d1, [%[rhs_ptr], #16]\n"         // Bottom half of v1
-    "udot v9.4s, v2.16b, v0.b[1]\n"
+    "udot v9.4s, v2.16b, v0.4b[1]\n"
     "ins v0.d[1], x18\n"                  // Finish loading v0
-    "udot v16.4s, v3.16b, v0.b[0]\n"      // out of sequence - used to reduce load/use pressure.
+    "udot v16.4s, v3.16b, v0.4b[0]\n"      // out of sequence - used to reduce load/use pressure.
     "ldr x18, [%[rhs_ptr], #24]\n"        // Top half of v1 to X register
-    "udot v17.4s, v3.16b, v0.b[1]\n"      // out of sequence - used to reduce load/use pressure.
+    "udot v17.4s, v3.16b, v0.4b[1]\n"      // out of sequence - used to reduce load/use pressure.
     "add %[rhs_ptr], %[rhs_ptr], #32\n"   // RHS loads complete - increment pointer.
-    "udot v10.4s, v2.16b, v0.b[2]\n"
+    "udot v10.4s, v2.16b, v0.4b[2]\n"
     "ldr d4, [%[lhs_ptr], #32]\n"         // Bottom half of v4
-    "udot v11.4s, v2.16b, v0.b[3]\n"
+    "udot v11.4s, v2.16b, v0.4b[3]\n"
     "ins v1.d[1], x18\n"                  // Finish loading v1
-    "udot v12.4s, v2.16b, v1.b[0]\n"
+    "udot v12.4s, v2.16b, v1.4b[0]\n"
     "ldr x18, [%[lhs_ptr], #40]\n"        // Top half of v4 to X register
-    "udot v13.4s, v2.16b, v1.b[1]\n"
+    "udot v13.4s, v2.16b, v1.4b[1]\n"
     "add %[lhs_ptr], %[lhs_ptr], #48\n"   // LHS loads complete - increment pointer.
-    "udot v14.4s, v2.16b, v1.b[2]\n"
+    "udot v14.4s, v2.16b, v1.4b[2]\n"
-    "udot v15.4s, v2.16b, v1.b[3]\n"
+    "udot v15.4s, v2.16b, v1.4b[3]\n"
     "ldr d2, [%[lhs_ptr]]\n"              // Bottom half of v2 (for next time)
-    "udot v18.4s, v3.16b, v0.b[2]\n"
+    "udot v18.4s, v3.16b, v0.4b[2]\n"
     "ins v4.d[1], x18\n"                  // Finish loading v4
-    "udot v19.4s, v3.16b, v0.b[3]\n"
+    "udot v19.4s, v3.16b, v0.4b[3]\n"
     "ldr x18, [%[lhs_ptr], #8]\n"         // Top half of next v2 to X register
-    "udot v20.4s, v3.16b, v1.b[0]\n"
+    "udot v20.4s, v3.16b, v1.4b[0]\n"
     "subs %w[depth], %w[depth], #4\n"
-    "udot v21.4s, v3.16b, v1.b[1]\n"
+    "udot v21.4s, v3.16b, v1.4b[1]\n"
-    "udot v22.4s, v3.16b, v1.b[2]\n"
+    "udot v22.4s, v3.16b, v1.4b[2]\n"
-    "udot v23.4s, v3.16b, v1.b[3]\n"
+    "udot v23.4s, v3.16b, v1.4b[3]\n"
     "ldr d3, [%[lhs_ptr], #16]\n"         // Bottom half of v3 (for next time)
-    "udot v24.4s, v4.16b, v0.b[0]\n"
+    "udot v24.4s, v4.16b, v0.4b[0]\n"
     "ins v2.d[1], x18\n"                  // Finish loading next v2
-    "udot v25.4s, v4.16b, v0.b[1]\n"
+    "udot v25.4s, v4.16b, v0.4b[1]\n"
     "ldr x18, [%[lhs_ptr], #24]\n"        // Top half of next v3 to X register
-    "udot v26.4s, v4.16b, v0.b[2]\n"
+    "udot v26.4s, v4.16b, v0.4b[2]\n"
-    "udot v27.4s, v4.16b, v0.b[3]\n"
+    "udot v27.4s, v4.16b, v0.4b[3]\n"
     "ldr d0, [%[rhs_ptr]]\n"              // Bottom half of v0 (for next time)
-    "udot v28.4s, v4.16b, v1.b[0]\n"
+    "udot v28.4s, v4.16b, v1.4b[0]\n"
     "ins v3.d[1], x18\n"                  // Finish loading next v3
-    "udot v29.4s, v4.16b, v1.b[1]\n"
+    "udot v29.4s, v4.16b, v1.4b[1]\n"
     "ldr x18, [%[rhs_ptr], #8]\n"         // Top half of next v0 to X register
-    "udot v30.4s, v4.16b, v1.b[2]\n"
+    "udot v30.4s, v4.16b, v1.4b[2]\n"
-    "udot v31.4s, v4.16b, v1.b[3]\n"
+    "udot v31.4s, v4.16b, v1.4b[3]\n"
     "bne " GEMMLOWP_LABEL_LOOP "b\n"

CellFormat in AVX2 kernel incorrect? Question for clarification

I am studying the kernels implemented in gemmlowp, and noticed a possible discrepancy in the KernelFormat of the AVX2 kernel, i.e. here:

KernelSideFormat<CellFormat<4, 2, CellOrder::WidthMajor>, 1>>

So if I understand CellFormat<4, 2> correctly, width is 4 (corresponding to the number of columns on the RHS) and depth is 2 (rows in the RHS), meaning we have a 2x4 matrix. Further, CellOrder::WidthMajor for the RHS implies column-order storage, so consecutive increments should be placed in consecutive rows:

1 3 5 7
2 4 6 8

The comment in kernel_avx.h says:

A 2x8 cell of Rhs is stored in 16bit in ymm1 ,

So here is already one discrepancy: the format described by the template is 2x4, not 2x8.

We see the next discrepancy in the inline asm code in kernel_avx.h contains this:

"vpmovzxbw (%[rhs_ptr]), %%ymm1 \n\t" // mov rhs to ymm1

The instruction vpmovzxbw corresponds to _mm256_cvtepu8_epi16(__m128i a), which reads 16 bytes (16 * 8 = 128 bits) and zero-extends them to 16 i16 values (16 * 16 = 256 bits, the amount that fits in a ymm* register). But the template only allows for 8 bytes! So my intuition would be that the comment is correct, given that it defines a matrix of 16 elements. Yet I don't understand how the code apparently works if we only have 8 bytes for the RHS.

Finally, we have:

"vpmovzxbw 0x08(%[rhs_ptr]), %%ymm1 \n\t" // mov rhs to ymm1

Here we jump to the 8th element in the kernel. But according to the template, this should be outside the defined region of the RHS?

I tried to correct the template to work according to my expectations and according to the comments, i.e. I set KernelSideFormat<CellFormat<8, 2, CellOrder::WidthMajor>, 1>>, but that led to the tests failing.

Can somebody comment on my observation and rectify my misunderstanding of the code? It seems to work, but I don't understand how. Specifically, how is the code not reading past the defined memory regions?

On the other hand, given that we have 16 ymm* registers on AVX machines, I also understand what the code itself tries to accomplish: by having 4 columns on the RHS, we can broadcast two rows at a time into a packed vector and combine them with one cell of the LHS. This way, we only need 4 registers to accumulate one cell block of the LHS against all of the RHS, which only has one cell and will thus always stay in ymm1. This allows the entire operation to stay within the 16 ymm* registers without a single register spill.

I just don't understand how an AVX load works with 8 bytes only.

Issues compiling for bare metal application

Hi,

I am trying to compile TFLite for a bare metal application, and have run into issues with gemmlowp while doing so. For my target platform I do not have unistd.h, can anyone help me find a workaround?

Error compilation for Windows x64 using MingGW 64

wintime -= 116444736000000000i64; // 1jan1601 to 1jan1970

When I try to compile with MinGW 64-bit, it fails here. I understand that i64 is a Microsoft-specific suffix. Could it be replaced by a more standard version compatible with both the MSVC compiler and MinGW?

I think that for that function "wintime -= 116444736000000000LL" would be OK?

How can I use a new gemm-kernel in tensorflow or other machine learning framework?

Hello everyone. I've modified some content in the file kernel_neon.h: I imitated the original kernel (12x8x2) and wrote a new kernel (8x8x8). I've run the benchmark, and the results are not very different in my arm64 environment, so I want to use it in a machine learning framework such as TensorFlow or MXNet. I've tried many methods but failed so far. Is this feasible? Can anyone help? Thanks a lot.

issue with aligned_alloc on macOS 10.12 with clang 802.0.42 cannot find <malloc.h>

It seems that under /usr/include malloc.h is at /usr/include/malloc/malloc.h

I thought we could use stdlib.h instead, but after some reading I found that this has other issues.

Anyway, the use case was to compile tensorflow r1.6 from tensorflow/contrib/cmake on macOS 10.12 with clang 802.0.42, and everything worked apart from gemmlowp, so I thought I ought to let you know.

Mixing openmp with gemmlowp multithreading causes low performance

If I run a loop with multiple threads using OpenMP and then call gemmlowp, the performance of the GEMM is affected. Any clue?

e.g.

  #pragma omp parallel for
  for (int i = 0; i < 100; ++i) {
  }

  gemmlowp::GemmContext gemm_context;
  gemm_context.set_max_num_threads(4);
  using BitDepthParams = gemmlowp::L8R8WithLhsNonzeroBitDepthParams;
  while (iters--) {
    gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::int32_t,
                                     BitDepthParams>(
        &gemm_context, lhs.const_map(), rhs.const_map(), &result.map(), -128,
        -128, output_pipeline);
  }

Is only the conv layer supported?

Great thanks for sharing the knowledge about gemmlowp.
I have a question, described below.
How do you do fixed-point arithmetic for the other layers in a network (such as shortcut/routine/upsample...), not only the conv layers?
For example, consider the arithmetic in a shortcut layer: the inputs are layer A and layer B, but the ranges of the two layers are not the same. Is there any way to handle such scenarios?

Great thanks if someone can help.

error result when W and X don't range from -1 to 1

When I change the W and X range to (-20, 20), the GEMM result loses too much precision.
diff --git a/doc/quantization_example.cc b/doc/quantization_example.cc
index d7b147d..f7178b9 100644
--- a/doc/quantization_example.cc
+++ b/doc/quantization_example.cc
@@ -157,7 +157,7 @@ class MatrixWithStorage {
       : storage(rows * cols), matrix_map(storage.data(), rows, cols) {}
   void MakeRandom() {
     static std::mt19937 random_engine;
-    std::uniform_real_distribution<float> distribution(-1, 1);
+    std::uniform_real_distribution<float> distribution(-20, 20);
     for (auto& x : storage) {
       x = static_cast<float>(distribution(random_engine));
     }

the gemm result is:
Difference between ACTUAL and REFERENCE float results:
-0.27 3.05 -0.269
-0.269 0.881 1.47

iOS gemmlowp_test failed with linker

Hi all at gemmlowp,

When I was playing with the gemmlowp_test folder on iOS with Xcode 9.4, I got a linker issue with RandomEngine (probably the one in test.h in the gemmlowp_test folder):

duplicate symbol __ZN8gemmlowp12RandomEngineEv in:
    /Users/wyiming/Library/Developer/Xcode/DerivedData/gemmlowp_test-fuolktnjvyekdvhaevdpoedkksum/Build/Intermediates.noindex/gemmlowp_test.build/Debug-iphonesimulator/gemmlowp_test.build/Objects-normal/x86_64/test.o
    /Users/wyiming/Library/Developer/Xcode/DerivedData/gemmlowp_test-fuolktnjvyekdvhaevdpoedkksum/Build/Intermediates.noindex/gemmlowp_test.build/Debug-iphonesimulator/gemmlowp_test.build/Objects-normal/x86_64/benchmark.o
ld: 2 duplicate symbols for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)

would you mind help looking into this? :)

Add two feature maps

Adding two feature maps in ResNet is a common operation, but how do you add the outputs of two layers with different scales and zero_points?
Let's say r3 = r1 + r2, with
r1 = s1*(q1-z1),
r2 = s2*(q2-z2),
r3 = s3*(q3-z3).
So how do we get q3?
Obviously,
q3 = t1*(q1-z1) + t2*(q2-z2) + z3, where
t1 = s1/s3,
t2 = s2/s3.
How do we get the result of t1*(q1-z1)?

There is a problem about how to quantize the accumulator(int32) into uint8

Thank you for your contribution. I have a problem with how to quantize the accumulator (int32) into uint8.
When I run your quantization_example.cc:
Quantized uint8 LHS matrix:
208 236 0 238
3 214 255 29
Quantized uint8 RHS matrix:
152 51 244
60 26 255
0 127 246
127 254 247
I compute the LHS matrix * RHS matrix as
76002 77196 169718
16979 45468 125195
Quantized uint8 result matrix obtained by quantized multiplication:
168 115 255
0 66 151
In your paper, you said that "The down-scaling corresponds to multiplication by the multiplier M in equation (7)", but how do you quantize
76002 77196 169718
16979 45468 125195
into
168 115 255
0 66 151
The quantized_multiplier is 1200097792 and the right_shift is 7; how are these parameters used?
In your paper, you said "The down-scaling corresponds to multiplication by the multiplier M in equation (7)." M := S1*S2/S3 = 0.006603*0.007050/0.010663 = 0.004366, but 76002*M != 168.
Could you tell me how to quantize the accumulator (int32) into uint8?
Looking forward to your reply, thanks a lot.

run dotprod instruction failed on apple A12 and qualcomm 845

Hi, I tested the udot GEMM kernel in "kernel_neon.h" on a Qualcomm 845 big core and on an iPhone A12, but I get errors like the following:
Android, Qualcomm 845: "Illegal instruction"
iPhone XR, A12: "Thread 1: EXC_BAD_INSTRUCTION (code=1, subcode=0x6f80e048)"
I would like to ask how to enable or run sdot on these devices.

I also tested the instruction "MRS %[id], ID_AA64ISAR0_EL1" and got "Illegal instruction".

Does the AVX2 feature of gemmlowp support gcc 4.8.5?

When I want to use the AVX2 path of gemmlowp, I add GEMMLOWP_ENABLE_AVX2 as #122 said. However, I get the error below:

"internal/pack_avx.h:83:51: error: there are no arguments to ‘_mm256_set_m128i’ that depend on a template parameter, so a declaration of ‘_mm256_set_m128i’ must be available"

My gcc is gcc 4.8.5. Does gcc not support _mm256_set_m128i? If I want to use the gemmlowp AVX2 feature under gcc, what should I do? Thank you!

eight_bit_int_gemm gets all-zero output and a segmentation fault

I am a rookie at gemmlowp, and I used EightBitIntGemm as:

EightBitIntGemm(
            false, false, false,
            m, n, k,
            aptr, 0, Atrd,
            bptr, 0, Btrd,
            cptr, 0.f, Ctrd,
            BitDepthSetting::A8B8);

but I got all-zero output and a segmentation fault. Is EightBitIntGemm an async function? I did not find any sync function.
