
gemmlowp's Issues

Does the AVX2 feature of gemmlowp support gcc 4.8.5?

When I want to use gemmlowp with AVX2, I add GEMMLOWP_ENABLE_AVX2 as #122 says. However, I get the error below:

"internal/pack_avx.h:83:51: error: there are no arguments to ‘_mm256_set_m128i’ that depend on a template parameter, so a declaration of ‘_mm256_set_m128i’ must be available"

My gcc is 4.8.5. Does this gcc not support ‘_mm256_set_m128i’? If I want to use gemmlowp's AVX2 feature under gcc, what should I do? Thank you!
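A possible workaround sketch, not an official gemmlowp recommendation: older GCC releases (including 4.8.x) ship the AVX2 intrinsics but are missing the _mm256_set_m128i helper, which is why the compiler reports it as an undeclared name inside a template. Defining an equivalent before including the gemmlowp headers may unblock the build; upgrading to a newer GCC is the cleaner fix.

#include <immintrin.h>

// Emulate the missing helper with intrinsics that GCC 4.8 does provide:
// place lo in the lower 128 bits and hi in the upper 128 bits.
#ifndef _mm256_set_m128i
#define _mm256_set_m128i(hi, lo) \
  _mm256_inserti128_si256(_mm256_castsi128_si256(lo), (hi), 1)
#endif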

Is only the conv layer supported?

Great thanks for sharing the knowledge about gemmlowp.
I have a question, described below.
How can the fixed-point arithmetic be done in the other layers of a network (such as shortcut/routine/upsample...), not only in the conv layers?
For example, in a shortcut layer the inputs are layer A and layer B, but the ranges of the two layers are not the same. Is there any way to handle such scenarios?

great thanks if someone can help.

Add two feature maps

Adding two feature maps in ResNet is a common operation, but how do we add the outputs of two layers that have different scales and zero_points?
Let's say r3 = r1 + r2, with
r1 = s1*(q1-z1),
r2 = s2*(q2-z2),
r3 = s3*(q3-z3).
So how do we get q3?
Obviously,
q3 = t1*(q1-z1) + t2*(q2-z2) + z3, where
t1 = s1/s3,
t2 = s2/s3.
How do we compute t1*(q1-z1)?
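A minimal sketch of one way to compute t1*(q1-z1) in integer-only arithmetic (my own illustration, not gemmlowp's API, assuming 0 < t1 < 1): precompute t1 offline as a fixed-point integer with, say, 16 fractional bits, then use an integer multiply plus a rounding shift at inference time. gemmlowp's fixedpoint.h implements a more careful version of the same idea (a Q31 multiplier plus an exponent).

#include <cmath>
#include <cstdint>

// Offline: encode t1 = s1/s3 as round(t1 * 2^16).
std::int32_t EncodeQ16(double t1) {
  return static_cast<std::int32_t>(std::round(t1 * (1 << 16)));
}

// Online, integers only: t1 * (q1 - z1), rounded to nearest.
std::int32_t ScaleDiffQ16(std::int32_t q1, std::int32_t z1, std::int32_t t1_q16) {
  const std::int64_t prod = static_cast<std::int64_t>(q1 - z1) * t1_q16;
  return static_cast<std::int32_t>((prod + (1 << 15)) >> 16);
}

// q3 = ScaleDiffQ16(q1, z1, t1_q16) + ScaleDiffQ16(q2, z2, t2_q16) + z3,
// then clamp to [0, 255].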

SIMD back-end for IBM Power and Z

This is not really an issue. I'd like to know whether there is interest in incorporating code to support IBM's Power and Z architectures as a back-end. In-house, a colleague and I have worked on this, and we have extensions ready for gemmlowp to run optimized on P and Z, selected by compiler flags when these architectures are detected. In principle this does not touch or disrupt any of the existing code.
Please comment on this issue and advise on how best to proceed.

How to calculate non-linear function after 8bit quantization ?

For 8-bit quantization, a zero point and a scale are applied.
For a non-linear function layer, I want to know whether I can process the input data without converting it back to real numbers.
Or is there any way to calibrate?
Please answer my question.

Non-linear functions: tanh / sigmoid / softmax / exp(x)
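One common approach, sketched here as an illustration rather than a gemmlowp API: a quantized uint8 activation can only take 256 distinct values, so any elementwise non-linearity (tanh, sigmoid, exp) can be applied through a 256-entry lookup table built offline from the input and output quantization parameters; softmax additionally needs a normalization over the whole vector. The function names below are my own.

#include <algorithm>
#include <array>
#include <cmath>
#include <cstdint>

std::array<std::uint8_t, 256> BuildTanhTable(float in_scale, int in_zero_point,
                                             float out_scale, int out_zero_point) {
  std::array<std::uint8_t, 256> table;
  for (int q = 0; q < 256; ++q) {
    const float real_in = in_scale * (q - in_zero_point);  // dequantize input
    const float real_out = std::tanh(real_in);             // apply the function
    const int q_out =
        out_zero_point + static_cast<int>(std::round(real_out / out_scale));
    table[q] = static_cast<std::uint8_t>(std::min(255, std::max(0, q_out)));
  }
  return table;
}

// At inference time: output[i] = table[input[i]]; no per-element float math.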

A problem with the design of kernel in Arm64

I am studying the kernels implemented in gemmlowp for arm64, and noticed that the main kernel used is 12x8x2. I know the cell format is
KernelFormat<
KernelSideFormat<CellFormat<4, 2>, 3>,
KernelSideFormat<CellFormat<4, 2>, 2>>
and I'd like to know why the depth is chosen to be 2 instead of 1 or some other value. Is the reason that registers can be used more efficiently in this configuration, or are there other technical reasons behind the choice of kernel depth?
Thanks a lot if anyone can help.

core dump in the neon-gemm-kernel-benchmark.cc.

Hardware: NVIDIA TX2
OS: Ubuntu 16.04
GCC: 5.4.0
Compile flags: -std=c++11 -O3

error message:
:~/workspace/test/test_neon$ ./bench_mm
kernel,Gop/s
Arithmetic error in kernel:
NEON_64bit_GEMM_Int425Operands
Wrong accumulator for depth=32, at l = 1, r = 0
reference value: -47
actual value: -94
Aborted (core dumped)

Mixing openmp with gemmlowp multithreading causes low performance

If I run a loop with multiple threads using OpenMP and then call gemmlowp, the performance of the gemm is affected. Any clue?

e.g.

  #pragma omp parallel for
  for (int i = 0; i < 100; ++i) {
  }

  gemmlowp::GemmContext gemm_context;
  gemm_context.set_max_num_threads(4);
  using BitDepthParams = gemmlowp::L8R8WithLhsNonzeroBitDepthParams;
  while (iters--) {
    gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::int32_t,
                                     BitDepthParams>(
        &gemm_context, lhs.const_map(), rhs.const_map(), &result.map(), -128,
        -128, output_pipeline);
  }
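Not a root-cause diagnosis, just a sketch of one thing worth ruling out: if the OpenMP runtime keeps its worker threads spin-waiting after the parallel loop, they compete for cores with gemmlowp's own worker pool. Capping the OpenMP team size (or exporting OMP_WAIT_POLICY=passive) keeps the total number of busy threads within the core count.

#include <omp.h>

void RunOpenMpWorkBeforeGemm() {
  omp_set_num_threads(2);  // leave cores free for gemmlowp's 4 worker threads
#pragma omp parallel for
  for (int i = 0; i < 100; ++i) {
    // ... per-element work ...
  }
  // ... then call GemmWithOutputPipeline exactly as in the snippet above ...
}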

eight_bit_int_gemm gives all-zero output and a segmentation fault

I am a rookie with gemmlowp, and I used EightBitIntGemm as:

EightBitIntGemm(
            false, false, false,
            m, n, k,
            aptr, 0, Atrd,
            bptr, 0, Btrd,
            cptr, 0.f, Ctrd,
            BitDepthSetting::A8B8);

but I got all-zero output and a segmentation fault. Is EightBitIntGemm an async function? I did not find any sync function.

Is the product range of int8*int8 in the code comment expected?

Hi there,

In

// their range being [ -2^7 , 2^7 ), their products are in range
// [ -2^14 , 2^14 - 1 ), meaning that we can add two such values

I guess the product range should be [ (-2^7)*(2^7 - 1), (-2^7)*(-2^7) ], which is contained in ( -2^14 , 2^14 ] - closed at 2^14. If we are putting int8*int8 + int8*int8 into an int16, do we need the assumption that -128 is not included in int8 (from "B. Appendix: ARM NEON details" of the paper)? To me, if an int8 can take -128, then int8*int8 + int8*int8 can be as large as 2^15, which cannot be held in an int16.

Thanks
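A quick numeric check of the ranges discussed above (my own verification, not part of gemmlowp):

#include <climits>
#include <cstdio>

int main() {
  int max_prod = INT_MIN, min_prod = INT_MAX;
  for (int a = -128; a <= 127; ++a) {
    for (int b = -128; b <= 127; ++b) {
      const int p = a * b;
      max_prod = p > max_prod ? p : max_prod;
      min_prod = p < min_prod ? p : min_prod;
    }
  }
  // Prints max=16384 min=-16256: the maximum 2^14 = (-128)*(-128) is attained,
  // and 2 * 16384 = 32768 exceeds the int16 maximum of 32767.
  std::printf("max=%d min=%d\n", max_prod, min_prod);
  return 0;
}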

[BUG] Failed to build on several architectures

Reconsider GEMMLOWP_ALLOW_SLOW_SCALAR_FALLBACK

Is there any chance the authors of gemmlowp would consider removing the #error directive that halts non-optimized builds? It's not a common practice. I'd like for it to be possible to build TensorFlow without passing any special configuration directives. Perhaps consider printing a big ASCII art skull and crossbones #warning instead? Sort of like Gosling Emacs?

cc: @gunan
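For reference, my understanding of the current escape hatch (hedged, since I have not re-checked every code path): defining the macro before including gemmlowp, or passing -DGEMMLOWP_ALLOW_SLOW_SCALAR_FALLBACK on the compiler command line, is expected to silence the #error and let the slow scalar fallback build.

#define GEMMLOWP_ALLOW_SLOW_SCALAR_FALLBACK
#include "public/gemmlowp.h"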

iOS gemmlowp_test failed with linker

Hi all at gemmlowp,

When I was playing with the gemmlowp_test folder on iOS with Xcode 9.4, I hit a linker issue with RandomEngine (probably the one in test.h in the gemmlowp_test folder).

duplicate symbol __ZN8gemmlowp12RandomEngineEv in:
    /Users/wyiming/Library/Developer/Xcode/DerivedData/gemmlowp_test-fuolktnjvyekdvhaevdpoedkksum/Build/Intermediates.noindex/gemmlowp_test.build/Debug-iphonesimulator/gemmlowp_test.build/Objects-normal/x86_64/test.o
    /Users/wyiming/Library/Developer/Xcode/DerivedData/gemmlowp_test-fuolktnjvyekdvhaevdpoedkksum/Build/Intermediates.noindex/gemmlowp_test.build/Debug-iphonesimulator/gemmlowp_test.build/Objects-normal/x86_64/benchmark.o
ld: 2 duplicate symbols for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Would you mind helping to look into this? :)
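A hedged guess at the cause rather than a confirmed fix: the duplicate symbol __ZN8gemmlowp12RandomEngineEv demangles to gemmlowp::RandomEngine(), so a function with that name appears to be defined (not merely declared) in a header included by both test.o and benchmark.o. Marking such a header-defined function inline is the usual remedy; the exact signature below is illustrative, not copied from test.h.

#include <random>

namespace gemmlowp {

// In the shared header: inline lets every translation unit share one definition.
inline std::mt19937& RandomEngine() {
  static std::mt19937 engine;
  return engine;
}

}  // namespace gemmlowp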

Document code style/formatting

If you're using clang-format or something similar, please document the settings, so the contributors can format their code prior to making pull requests.

Bug in neon-gemm-kernel-benchmark.cc?

When I try to compile following the guide,
aarch64-linux-android-clang++ -mcpu=cortex-a55 -fPIE -static -O3 --std=c++11 neon-gemm-kernel-benchmark.cc -o bench -D__ARM_FEATURE_DOTPROD
it shows errors like below.
neon-gemm-kernel-benchmark.cc:2585:10: error: invalid operand for instruction
"udot v8.4s, v2.16b, v0.b[0]\n"
^
:29:21: note: instantiated into assembly here
udot v8.4s, v2.16b, v0.b[0]
^
neon-gemm-kernel-benchmark.cc:2586:10: error: invalid operand for instruction
"udot v9.4s, v2.16b, v0.b[1]\n"
^
:30:21: note: instantiated into assembly here
udot v9.4s, v2.16b, v0.b[1]
^
neon-gemm-kernel-benchmark.cc:2588:10: error: invalid operand for instruction
"udot v10.4s, v2.16b, v0.b[2]\n"
....

So I think the patch below should be applied to fix the compile issue.

diff --git a/standalone/neon-gemm-kernel-benchmark.cc b/standalone/neon-gemm-kernel-benchmark.cc
index aabeac9..e62115d 100644
--- a/standalone/neon-gemm-kernel-benchmark.cc
+++ b/standalone/neon-gemm-kernel-benchmark.cc
@@ -2103,11 +2103,11 @@ struct NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_noexpand_A57 {
 //
 //
 //                               +--------+--------+--------+--------+
-//                               |v0.b[0] |v1.b[0] |v2.b[0] |v3.b[0] |
+//                               |v0.4b[0] |v1.4b[0] |v2.b[0] |v3.b[0] |
 //                          Rhs  +--------+--------+--------+--------+
 //                               |  ...   |  ...   |  ...   |  ...   |
 //                               +--------+--------+--------+--------|
-//                               |v0.b[15]|v1.b[15]|v2.b[15]|v3.b[15]|
+//                               |v0.4b[15]|v1.4b[15]|v2.b[15]|v3.b[15]|
 //                               +--------+--------+--------+--------+
 //
 //                               |        |        |        |        |

@@ -2344,11 +2344,11 @@ struct NEON_64bit_GEMM_Int8Operands_AccumTwoWithin16Bits {
 // Register layout (ignoring the v8--v15 temporary 16bit accumulators):
 //
 //                               +--------+--------+--------+--------+
-//                               |v0.b[0] |v1.b[0] |v2.b[0] |v3.b[0] |
+//                               |v0.4b[0] |v1.4b[0] |v2.b[0] |v3.b[0] |
 //                          Rhs  +--------+--------+--------+--------+
 //                               |  ...   |  ...   |  ...   |  ...   |
 //                               +--------+--------+--------+--------|
-//                               |v0.b[15]|v1.b[15]|v2.b[15]|v3.b[15]|
+//                               |v0.4b[15]|v1.4b[15]|v2.b[15]|v3.b[15]|
 //                               +--------+--------+--------+--------+
 //
 //                               |        |        |        |        |

@@ -3197,41 +3197,41 @@ struct NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_dotproduct {
     // Start the MACs at the head of the loop - 1st cell from each side
     // already loaded.
-    "udot v8.4s, v2.16b, v0.b[0]\n"
-    "udot v9.4s, v2.16b, v0.b[1]\n"
+    "udot v8.4s, v2.16b, v0.4b[0]\n"
+    "udot v9.4s, v2.16b, v0.4b[1]\n"
     "ld1 {v1.16b}, [%[rhs_ptr]], #16\n"  // Load second Rhs cell.
-    "udot v10.4s, v2.16b, v0.b[2]\n"
-    "udot v11.4s, v2.16b, v0.b[3]\n"
+    "udot v10.4s, v2.16b, v0.4b[2]\n"
+    "udot v11.4s, v2.16b, v0.4b[3]\n"
     "ld1 {v3.16b}, [%[lhs_ptr]], #16\n"  // Load second Lhs cell.
-    "udot v12.4s, v2.16b, v1.b[0]\n"
-    "udot v13.4s, v2.16b, v1.b[1]\n"
+    "udot v12.4s, v2.16b, v1.4b[0]\n"
+    "udot v13.4s, v2.16b, v1.4b[1]\n"
     "ld1 {v4.16b}, [%[lhs_ptr]], #16\n"  // Load third Lhs cell.
-    "udot v14.4s, v2.16b, v1.b[2]\n"
-    "udot v15.4s, v2.16b, v1.b[3]\n"
+    "udot v14.4s, v2.16b, v1.4b[2]\n"
+    "udot v15.4s, v2.16b, v1.4b[3]\n"
     "ld1 {v2.16b}, [%[lhs_ptr]], #16\n"  // Done with first Lhs cell - load
     // for the next iteration early.
-    "udot v16.4s, v3.16b, v0.b[0]\n"
-    "udot v17.4s, v3.16b, v0.b[1]\n"
-    "udot v18.4s, v3.16b, v0.b[2]\n"
-    "udot v19.4s, v3.16b, v0.b[3]\n"
-    "udot v20.4s, v3.16b, v1.b[0]\n"
-    "udot v21.4s, v3.16b, v1.b[1]\n"
-    "udot v22.4s, v3.16b, v1.b[2]\n"
-    "udot v23.4s, v3.16b, v1.b[3]\n"
-    "udot v24.4s, v4.16b, v0.b[0]\n"
-    "udot v25.4s, v4.16b, v0.b[1]\n"
-    "udot v26.4s, v4.16b, v0.b[2]\n"
-    "udot v27.4s, v4.16b, v0.b[3]\n"
+    "udot v16.4s, v3.16b, v0.4b[0]\n"
+    "udot v17.4s, v3.16b, v0.4b[1]\n"
+    "udot v18.4s, v3.16b, v0.4b[2]\n"
+    "udot v19.4s, v3.16b, v0.4b[3]\n"
+    "udot v20.4s, v3.16b, v1.4b[0]\n"
+    "udot v21.4s, v3.16b, v1.4b[1]\n"
+    "udot v22.4s, v3.16b, v1.4b[2]\n"
+    "udot v23.4s, v3.16b, v1.4b[3]\n"
+    "udot v24.4s, v4.16b, v0.4b[0]\n"
+    "udot v25.4s, v4.16b, v0.4b[1]\n"
+    "udot v26.4s, v4.16b, v0.4b[2]\n"
+    "udot v27.4s, v4.16b, v0.4b[3]\n"
     "ld1 {v0.16b}, [%[rhs_ptr]], #16\n"  // Done with the first Rhs cell -
     // load for the next iteration early.
-    "udot v28.4s, v4.16b, v1.b[0]\n"
-    "udot v29.4s, v4.16b, v1.b[1]\n"
+    "udot v28.4s, v4.16b, v1.4b[0]\n"
+    "udot v29.4s, v4.16b, v1.4b[1]\n"

     // Loop.  Decrement loop index (depth) by 4 as udot processes 4
     // depth values.
     "subs %w[depth], %w[depth], #4\n"
-    "udot v30.4s, v4.16b, v1.b[2]\n"
-    "udot v31.4s, v4.16b, v1.b[3]\n"
+    "udot v30.4s, v4.16b, v1.4b[2]\n"
+    "udot v31.4s, v4.16b, v1.4b[3]\n"

     "bne " GEMMLOWP_LABEL_LOOP
     "b\n"

@@ -3327,53 +3327,53 @@ struct NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_dotproduct_A55r1 {
     GEMMLOWP_LABEL_LOOP
     ":\n"

-    "udot v8.4s, v2.16b, v0.b[0]\n"
+    "udot v8.4s, v2.16b, v0.4b[0]\n"
     "ldr d1, [%[rhs_ptr], #16]\n"         // Bottom half of v1
-    "udot v9.4s, v2.16b, v0.b[1]\n"
+    "udot v9.4s, v2.16b, v0.4b[1]\n"
     "ins v0.d[1], x18\n"                  // Finish loading v0
-    "udot v16.4s, v3.16b, v0.b[0]\n"      // out of sequence - used to reduce load/use pressure.
+    "udot v16.4s, v3.16b, v0.4b[0]\n"     // out of sequence - used to reduce load/use pressure.
     "ldr x18, [%[rhs_ptr], #24]\n"        // Top half of v1 to X register
-    "udot v17.4s, v3.16b, v0.b[1]\n"      // out of sequence - used to reduce load/use pressure.
+    "udot v17.4s, v3.16b, v0.4b[1]\n"     // out of sequence - used to reduce load/use pressure.
     "add %[rhs_ptr], %[rhs_ptr], #32\n"   // RHS loads complete - increment pointer.
-    "udot v10.4s, v2.16b, v0.b[2]\n"
+    "udot v10.4s, v2.16b, v0.4b[2]\n"
     "ldr d4, [%[lhs_ptr], #32]\n"         // Bottom half of v4
-    "udot v11.4s, v2.16b, v0.b[3]\n"
+    "udot v11.4s, v2.16b, v0.4b[3]\n"
     "ins v1.d[1], x18\n"                  // Finish loading v1
-    "udot v12.4s, v2.16b, v1.b[0]\n"
+    "udot v12.4s, v2.16b, v1.4b[0]\n"
     "ldr x18, [%[lhs_ptr], #40]\n"        // Top half of v4 to X register
-    "udot v13.4s, v2.16b, v1.b[1]\n"
+    "udot v13.4s, v2.16b, v1.4b[1]\n"
     "add %[lhs_ptr], %[lhs_ptr], #48\n"   // LHS loads complete - increment pointer.
-    "udot v14.4s, v2.16b, v1.b[2]\n"
+    "udot v14.4s, v2.16b, v1.4b[2]\n"
-    "udot v15.4s, v2.16b, v1.b[3]\n"
+    "udot v15.4s, v2.16b, v1.4b[3]\n"
     "ldr d2, [%[lhs_ptr]]\n"              // Bottom half of v2 (for next time)
-    "udot v18.4s, v3.16b, v0.b[2]\n"
+    "udot v18.4s, v3.16b, v0.4b[2]\n"
     "ins v4.d[1], x18\n"                  // Finish loading v4
-    "udot v19.4s, v3.16b, v0.b[3]\n"
+    "udot v19.4s, v3.16b, v0.4b[3]\n"
     "ldr x18, [%[lhs_ptr], #8]\n"         // Top half of next v2 to X register
-    "udot v20.4s, v3.16b, v1.b[0]\n"
+    "udot v20.4s, v3.16b, v1.4b[0]\n"
     "subs %w[depth], %w[depth], #4\n"
-    "udot v21.4s, v3.16b, v1.b[1]\n"
+    "udot v21.4s, v3.16b, v1.4b[1]\n"
-    "udot v22.4s, v3.16b, v1.b[2]\n"
+    "udot v22.4s, v3.16b, v1.4b[2]\n"
-    "udot v23.4s, v3.16b, v1.b[3]\n"
+    "udot v23.4s, v3.16b, v1.4b[3]\n"
     "ldr d3, [%[lhs_ptr], #16]\n"         // Bottom half of v3 (for next time)
-    "udot v24.4s, v4.16b, v0.b[0]\n"
+    "udot v24.4s, v4.16b, v0.4b[0]\n"
     "ins v2.d[1], x18\n"                  // Finish loading next v2
-    "udot v25.4s, v4.16b, v0.b[1]\n"
+    "udot v25.4s, v4.16b, v0.4b[1]\n"
     "ldr x18, [%[lhs_ptr], #24]\n"        // Top half of next v3 to X register
-    "udot v26.4s, v4.16b, v0.b[2]\n"
+    "udot v26.4s, v4.16b, v0.4b[2]\n"
-    "udot v27.4s, v4.16b, v0.b[3]\n"
+    "udot v27.4s, v4.16b, v0.4b[3]\n"
     "ldr d0, [%[rhs_ptr]]\n"              // Bottom half of v0 (for next time)
-    "udot v28.4s, v4.16b, v1.b[0]\n"
+    "udot v28.4s, v4.16b, v1.4b[0]\n"
     "ins v3.d[1], x18\n"                  // Finish loading next v3
-    "udot v29.4s, v4.16b, v1.b[1]\n"
+    "udot v29.4s, v4.16b, v1.4b[1]\n"
     "ldr x18, [%[rhs_ptr], #8]\n"         // Top half of next v0 to X register
-    "udot v30.4s, v4.16b, v1.b[2]\n"
+    "udot v30.4s, v4.16b, v1.4b[2]\n"
-    "udot v31.4s, v4.16b, v1.b[3]\n"
+    "udot v31.4s, v4.16b, v1.4b[3]\n"
     "bne " GEMMLOWP_LABEL_LOOP "b\n"

run ./correctness_meta_gemm ...............................Bus error

Hello, my device is an RK3288 (32-bit) and I use android-ndk-r14b.
I cd into gemmlowp/jni and use ndk-build to build it,
then adb push the executables to the RK3288 (32-bit).
I can successfully run ./benchmark_meta_gemm and ./benchmark,
but when I run ./correctness_meta_gemm, something goes wrong as below:
root@rk3288:/data/local/tmp # ./correctness_meta_gemm
WARNING: linker: ./correctness_meta_gemm: unused DT entry: type 0x6ffffffe arg 0x1198
WARNING: linker: ./correctness_meta_gemm: unused DT entry: type 0x6fffffff arg 0x1
Threads: 1
Quantized 8 bit.
Small.
...............................Bus error
135|root@rk3288:/data/local/tmp #

How should I solve this?
Thanks very much, and good luck to you.

Running dotprod instructions fails on Apple A12 and Qualcomm 845

Hi, I tested the udot gemm kernel in "kernel_neon.h" on a Qualcomm 845 big core and on an iPhone A12, but get errors like the following:
Android, Qualcomm 845: "Illegal instruction"
iPhone XR, A12: "Thread 1: EXC_BAD_INSTRUCTION (code=1, subcode=0x6f80e048)"
I would like to ask how to enable or run sdot on these devices.

I also tested the instruction "MRS %[id], ID_AA64ISAR0_EL1" and got "Illegal instruction".
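A sketch of a safer capability check on Linux/Android (it does not apply to iOS): reading ID_AA64ISAR0_EL1 with MRS from user space may fault unless the kernel traps and emulates the access, which would explain the "Illegal instruction" from that test. The kernel instead exposes the dot-product feature bit through the auxiliary vector.

#include <sys/auxv.h>

#ifndef HWCAP_ASIMDDP
#define HWCAP_ASIMDDP (1 << 20)  // from <asm/hwcap.h> on recent arm64 kernels
#endif

bool CpuHasDotProduct() {
  return (getauxval(AT_HWCAP) & HWCAP_ASIMDDP) != 0;
}

// If udot/sdot itself raises "Illegal instruction", the core (or the kernel)
// does not support the dot-product extension, and the kernel selection needs
// to fall back to the non-dotproduct path.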

CellFormat in AVX2 kernel incorrect? Question for clarification

I am studying the kernels implemented in gemmlowp, and noticed a possible discrepancy in the KernelFormat of the AVX2 kernel, i.e. here:

KernelSideFormat<CellFormat<4, 2, CellOrder::WidthMajor>, 1>>

So if I understand CellFormat<4,2> correctly, the width is 4 (corresponding to the number of columns of the RHS) and the depth is 2 (rows of the RHS), meaning we have a 2x4 matrix. Further, CellOrder::WidthMajor for the RHS implies column-major storage, so consecutive values should be placed in consecutive rows:

1 3 5 7
2 4 6 8

The comment in kernel_avx.h says:

A 2x8 cell of Rhs is stored in 16bit in ymm1 ,

So here is already one discrepancy: the format described by the template is 2x4, not 2x8.

We see the next discrepancy in the inline asm code in kernel_avx.h contains this:

"vpmovzxbw (%[rhs_ptr]), %%ymm1 \n\t" // mov rhs to ymm1

The instruction vpmovzxbw corresponds to _mm256_cvtepu8_epi16(__m128i a), which reads 16 bytes (16 * 8 = 128 bits) and zero-extends them to 16 i16 values (16 * 16 = 256, the number of bits fitting in a ymm* register). But the template only allows for 8 bytes! So my intuition would be that the comment is correct, given that it describes a matrix of 16 elements. Yet I don't understand how the code apparently works if we only have 8 bytes for the RHS.

Finally, we have:

"vpmovzxbw 0x08(%[rhs_ptr]), %%ymm1 \n\t" // mov rhs to ymm1

Here we jump to the 8th element in the kernel. But according to the template this should be outside of the defined region of RHS?

I tried to correct the template to work according to my expectations and according to the comments, i.e. I set KernelSideFormat<CellFormat<8, 2, CellOrder::WidthMajor>, 1>>, but that led to the tests failing.

Can somebody comment on my observation and rectify my misunderstanding of the code? I guess it seems to work, but I don't understand how. Specifically, how is the code not reading past the defined memory regions?

On the other hand, given that we have 16 ymm* registers on AVX machines, I also understand what the code itself tries to accomplish: by having 4 columns on the RHS, we can broadcast two rows at a time into a packed vector and combine them with one cell of the LHS. This way, we only need 4 registers to accumulate one cell block of the LHS against all of the RHS, which has only one cell and will thus always stay in ymm1. This allows the entire operation to be carried out in the 16 ymm* registers without a single register spill.

I just don't understand how an AVX load works with 8 bytes only.

A question about result_scale and result_zero_point

I saw doc/quantization_example.cc, and I have a question about the result_scale and result_zero_point: how are they determined?

In the example, you compute the real (float) result and then calculate the quantization parameters from it. But if we do that every time in a real network, the speed will obviously be lower.

Can anyone help me solve this problem?

How can I use a new gemm-kernel in tensorflow or other machine learning framework?

Hello everyone. I've modified the file kernel_neon.h: I imitated the original kernel (12x8x2) and wrote a new kernel (8x8x8), and I've run the benchmark. The results are not very different in my arm64 environment, so I want to use the new kernel in a machine learning framework such as TensorFlow or MXNet. I've tried many methods but failed in the end. Is this feasible? Can anyone help? Thanks a lot.

How to quantize accumulator from int32 to uint8

I am trying to implement the quantized version of MobileNet v1 in OpenCL. I have referenced the method provided in https://arxiv.org/pdf/1712.05877.pdf . I am using pretrained MobileNet weights from the tflite file, and I have all the required quantization parameters, e.g. S1, S2 and S3, from the tflite file.
The only issue is converting the accumulator back from int32 to uint8.
The gemmlowp kernel uses the min and max of the output tensor to quantize the accumulator from int32 to uint8, but since my implementation uses OpenCL, I cannot get the min and max values of the output tensor at runtime; I would have to write additional logic on the host side, which would incur additional execution time.

M = (S1*S2)/S3
To quantize the accumulator, I am currently using q = (int32 * M) + bias,
but this output does not match the intermediate output obtained from the TensorFlow Lite API.

A question about how to quantize the accumulator (int32) into uint8

Thank you for your contribution. I have a question about how to quantize the accumulator (int32) into uint8.
When I run your quantization_example.cc:
Quantized uint8 LHS matrix:
208 236 0 238
3 214 255 29
Quantized uint8 RHS matrix:
152 51 244
60 26 255
0 127 246
127 254 247
I compute the product of the LHS matrix and the RHS matrix as
76002 77196 169718
16979 45468 125195
Quantized uint8 result matrix obtained by quantized multiplication:
168 115 255
0 66 151
In your paper, you say that "The down-scaling corresponds to multiplication by the multiplier M in equation (7)", but how do I quantize
76002 77196 169718
16979 45468 125195
into
168 115 255
0 66 151 ?
The quantized_multiplier is 1200097792 and the right_shift is 7; how are these parameters used?
The paper defines M := S1*S2/S3; with my values this gives 0.006603*0.007050/0.010663 = 0.004366, but 76002*M != 168.
Could you tell me how to quantize the accumulator (int32) into uint8?
Looking forward to your reply, thanks a lot
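A sketch of how a (quantized_multiplier, right_shift) pair is typically applied, based on my reading of the output pipeline rather than an official gemmlowp snippet. Note that the int32 accumulator being down-scaled is the sum of (lhs_q - lhs_zero_point) * (rhs_q - rhs_zero_point) terms, i.e. the zero-point offsets are already folded in; that is why 76002 * M does not reproduce 168, since 76002 is the raw product of the uint8 entries without the offsets.

#include <algorithm>
#include <cstdint>
#include <limits>

std::int32_t SaturatingRoundingDoublingHighMul(std::int32_t a, std::int32_t b) {
  const bool overflow = a == b && a == std::numeric_limits<std::int32_t>::min();
  const std::int64_t ab_64 = static_cast<std::int64_t>(a) * b;
  const std::int32_t nudge = ab_64 >= 0 ? (1 << 30) : (1 - (1 << 30));
  const std::int32_t ab_x2_high32 =
      static_cast<std::int32_t>((ab_64 + nudge) / (1ll << 31));
  return overflow ? std::numeric_limits<std::int32_t>::max() : ab_x2_high32;
}

std::int32_t RoundingDivideByPOT(std::int32_t x, int exponent) {
  const std::int32_t mask = (1 << exponent) - 1;
  const std::int32_t remainder = x & mask;
  const std::int32_t threshold = (mask >> 1) + (x < 0 ? 1 : 0);
  return (x >> exponent) + (remainder > threshold ? 1 : 0);
}

// acc -> uint8: multiply by M = (quantized_multiplier / 2^31) * 2^(-right_shift),
// add the result zero point, clamp to [0, 255].
std::uint8_t Requantize(std::int32_t acc, std::int32_t quantized_multiplier,
                        int right_shift, std::int32_t result_zero_point) {
  std::int32_t x = SaturatingRoundingDoublingHighMul(acc, quantized_multiplier);
  x = RoundingDivideByPOT(x, right_shift) + result_zero_point;
  return static_cast<std::uint8_t>(std::min(255, std::max(0, x)));
}

// With quantized_multiplier = 1200097792 and right_shift = 7, the effective
// factor is (1200097792 / 2^31) * 2^-7 ≈ 0.004366, matching M = S1*S2/S3.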

what is "ab_x2_high32" in <func::SaturatingRoundingDoublingHighMul> stand for?

Hi,
In

inline std::int32_t SaturatingRoundingDoublingHighMul(std::int32_t a,
                                                      std::int32_t b) {
  bool overflow = a == b && a == std::numeric_limits<std::int32_t>::min();
  std::int64_t a_64(a);
  std::int64_t b_64(b);
  std::int64_t ab_64 = a_64 * b_64;
  std::int32_t nudge = ab_64 >= 0 ? (1 << 30) : (1 - (1 << 30));
  std::int32_t ab_x2_high32 =
      static_cast<std::int32_t>((ab_64 + nudge) / (1ll << 31));
  return overflow ? std::numeric_limits<std::int32_t>::max() : ab_x2_high32;
}

It seems that this function is computing a * b.
I wondered what the relationship between ab_x2_high32 and ab_64 is. Could you explain how ab_x2_high32 is computed?

Thank you!
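A small illustration of my reading of the code, not an authoritative answer: ab_x2_high32 is a round-to-nearest evaluation of ab_64 / 2^31, i.e. the high 32 bits of the doubled product 2*a*b. If a and b are interpreted as Q31 fixed-point fractions in [-1, 1), this is exactly their fixed-point product (the same operation as ARM's SQRDMULH instruction).

#include <cstdint>
#include <cstdio>

int main() {
  const std::int32_t a = 3 << 29;  // 0.75 in Q31
  const std::int32_t b = 1 << 30;  // 0.50 in Q31
  const std::int64_t ab_64 = static_cast<std::int64_t>(a) * b;
  const std::int32_t nudge = ab_64 >= 0 ? (1 << 30) : (1 - (1 << 30));
  const std::int32_t ab_x2_high32 =
      static_cast<std::int32_t>((ab_64 + nudge) / (1ll << 31));
  std::printf("%f\n", ab_x2_high32 / 2147483648.0);  // prints 0.375 = 0.75 * 0.50
  return 0;
}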

How to use gemmlowp in a C project?

Hello, I'm using a framework named "darknet" which is written in the C programming language. I implemented parameter quantization in the conv layers and am trying to use gemmlowp in darknet. When I try to wrap the function "EightBitIntGemm" from C++ for C, this error occurs: fatal error: cstdint: No such file or directory. Could you kindly give me some advice?
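A sketch of the usual C/C++ interop pattern rather than a gemmlowp-specific API: the gemmlowp headers are C++ (hence <cstdint> failing under a C compiler), so they cannot be included from C sources directly. Instead, put the call in a small .cc file compiled with a C++ compiler and expose a C ABI via extern "C"; the wrapper name and argument list below are illustrative placeholders.

// gemmlowp_c_wrapper.cc -- compile with g++ and link into the C program.
#include <cstdint>

extern "C" void my_quantized_gemm(int m, int n, int k, const unsigned char* lhs,
                                  const unsigned char* rhs, unsigned char* result) {
  // ... include the gemmlowp headers in this file only, and call
  // EightBitIntGemm (or GemmWithOutputPipeline) here as from any C++ code ...
  (void)m; (void)n; (void)k; (void)lhs; (void)rhs; (void)result;
}

/* In the C code (darknet), only a plain declaration is needed:
   void my_quantized_gemm(int m, int n, int k, const unsigned char* lhs,
                          const unsigned char* rhs, unsigned char* result);  */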

issue with aligned_alloc on macOS 10.12 with clang 802.0.42 cannot find <malloc.h>

It seems that under /usr/include malloc.h is at /usr/include/malloc/malloc.h

I thought we could use stdlib.h instead, but after some reading I found that this has other issues.

Anyway, the use case was to compile tensorflow r1.6 from tensorflow/contrib/cmake on macOS 10.12 with clang 802.0.42, and everything worked apart from gemmlowp, so I thought I ought to let you know.

Suggestions for resources to understand gemmlowp

Hello,

I'm a noob trying to understand the theory and implementation of gemmlowp. Can you please share any resources that I can start with?
Any help in this regard is very much appreciated.

Thanks,
Thomas.

Compilation error for Windows x64 using MinGW 64

wintime -= 116444736000000000i64; // 1jan1601 to 1jan1970

When I try to compile with MinGW 64-bit it fails here. I understand that i64 is a Microsoft-specific suffix. Could it be replaced by a more standard version compatible with both the MSVC compiler and MinGW?

I think that for that function "wintime -= 116444736000000000LL" would be OK?
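A minimal sketch, assuming the constant is subtracted from a 64-bit FILETIME-style value: the standard INT64_C macro (or the LL suffix, as suggested above) is portable across MSVC and MinGW, unlike the MSVC-specific i64 suffix.

#include <cstdint>

std::int64_t FiletimeToUnixEpoch100ns(std::int64_t wintime) {
  // Offset from 1 Jan 1601 to 1 Jan 1970, in 100-nanosecond ticks.
  return wintime - INT64_C(116444736000000000);
}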

Issues compiling for bare metal application

Hi,

I am trying to compile TFLite for a bare-metal application, and have run into issues with gemmlowp while doing so. For my target platform I do not have unistd.h; can anyone help me find a workaround?

Erroneous results when W and X don't range from -1 to 1

When I change the range of W and X to (-20, 20), the gemm result loses too much precision.
diff --git a/doc/quantization_example.cc b/doc/quantization_example.cc
index d7b147d..f7178b9 100644
--- a/doc/quantization_example.cc
+++ b/doc/quantization_example.cc
@@ -157,7 +157,7 @@ class MatrixWithStorage {
       : storage(rows * cols), matrix_map(storage.data(), rows, cols) {}
   void MakeRandom() {
     static std::mt19937 random_engine;
-    std::uniform_real_distribution<float> distribution(-1, 1);
+    std::uniform_real_distribution<float> distribution(-20, 20);
     for (auto& x : storage) {
       x = static_cast<float>(distribution(random_engine));
     }

the gemm result is:
Difference between ACTUAL and REFERENCE float results:
-0.27 3.05 -0.269
-0.269 0.881 1.47

int8*int8 -> float?

Hey,

I'm looking to perform int8 * int8 -> fp32, where at the output stage I dequantise the int32_t result into float (and then potentially add a bias). I was following the example from https://github.com/google/gemmlowp/blob/master/doc/quantization_example.cc#L305
But it seems that in order to dequantise to float you compute the quantisation parameters from the fp32 result that you had already computed beforehand, which in practice I wouldn't know. I can compute it with a compensation factor, but it becomes incredibly complicated and computationally (and memory) expensive. Are there any alternatives?

If I am able to assume quantisation into int8 as opposed to uint8 as in the example, I would be able to have quantisation without the zero_point parameter (assuming a zero-centred distribution), which would massively simplify dequantisation. Do you support this? Do you have any examples in the codebase where something like this is done?
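A sketch of the algebra only, not an existing gemmlowp output stage: if the int32 accumulator already has the zero points folded in, i.e. acc = sum_k (q1 - z1) * (q2 - z2), then dequantising to float needs no result quantisation parameters at all, only the two input scales. The symmetric-int8 case (z1 = z2 = 0) simplifies things exactly as suggested above.

#include <cstdint>

float DequantizeAccumulator(std::int32_t acc, float lhs_scale, float rhs_scale,
                            float bias = 0.0f) {
  // r = s1 * s2 * sum_k (q1 - z1) * (q2 - z2), plus an optional float bias.
  return lhs_scale * rhs_scale * static_cast<float>(acc) + bias;
}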
