
High-efficiency floating-point neural network inference operators for mobile, server, and Web

License: Other

C++ 22.63% C 72.45% Shell 1.04% Assembly 2.37% Python 0.39% CMake 0.54% Starlark 0.57% Batchfile 0.01%
neural-networks inference inference-optimization simd cpu multithreading matrix-multiplication convolutional-neural-networks convolutional-neural-network neural-network

xnnpack's Introduction

XNNPACK

XNNPACK is a highly optimized solution for neural network inference on ARM, x86, WebAssembly, and RISC-V platforms. XNNPACK is not intended for direct use by deep learning practitioners and researchers; instead it provides low-level performance primitives for accelerating high-level machine learning frameworks, such as TensorFlow Lite, TensorFlow.js, PyTorch, ONNX Runtime, and MediaPipe.

Supported Architectures

  • ARM64 on Android, iOS, macOS, Linux, and Windows
  • ARMv7 (with NEON) on Android
  • ARMv6 (with VFPv2) on Linux
  • x86 and x86-64 (up to AVX512) on Windows, Linux, macOS, Android, and iOS simulator
  • WebAssembly MVP
  • WebAssembly SIMD
  • WebAssembly Relaxed SIMD (experimental)
  • RISC-V (RV32GC and RV64GC)

Operator Coverage

XNNPACK implements the following neural network operators:

  • 2D Convolution (including grouped and depthwise)
  • 2D Deconvolution (AKA Transposed Convolution)
  • 2D Average Pooling
  • 2D Max Pooling
  • 2D ArgMax Pooling (Max Pooling + indices)
  • 2D Unpooling
  • 2D Bilinear Resize
  • 2D Depth-to-Space (AKA Pixel Shuffle)
  • Add (including broadcasting, two inputs only)
  • Subtract (including broadcasting)
  • Divide (including broadcasting)
  • Maximum (including broadcasting)
  • Minimum (including broadcasting)
  • Multiply (including broadcasting)
  • Squared Difference (including broadcasting)
  • Global Average Pooling
  • Channel Shuffle
  • Fully Connected
  • Abs (absolute value)
  • Bankers' Rounding (rounding to nearest, ties to even)
  • Ceiling (rounding to integer above)
  • Clamp (includes ReLU and ReLU6)
  • Convert (includes fixed-point and half-precision quantization and dequantization)
  • Copy
  • ELU
  • Floor (rounding to integer below)
  • HardSwish
  • Leaky ReLU
  • Negate
  • Sigmoid
  • Softmax
  • Square
  • Tanh
  • Transpose
  • Truncation (rounding to integer towards zero)
  • PReLU

All operators in XNNPACK support NHWC layout, but additionally allow a custom stride along the Channel dimension. Thus, operators can consume a subset of channels in the input tensor and produce a subset of channels in the output tensor, providing zero-cost Channel Split and Channel Concatenation operations.
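
As a concrete illustration of the channel-stride mechanism, here is a minimal plain-C sketch (illustrative only, not an XNNPACK API call, and the helper name is hypothetical): with a per-pixel stride equal to the tensor's full channel count plus a starting channel offset, an operator can address just a subset of channels in place, which is what makes Channel Split zero-cost.

#include <stddef.h>

/* Hedged illustration: read element (y, x, c) of a channel subset starting at
 * first_channel inside a larger NHWC tensor, without copying. The "channel
 * stride" stays equal to the full channel count of the underlying tensor. */
static float read_subset_element(const float* nhwc, size_t width,
                                 size_t total_channels, size_t first_channel,
                                 size_t y, size_t x, size_t c) {
  const size_t pixel_stride = total_channels;   /* custom channel stride */
  const size_t pixel_index = y * width + x;
  return nhwc[pixel_index * pixel_stride + first_channel + c];
}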

Performance

Mobile phones

The table below presents single-threaded performance of the XNNPACK library on three generations of MobileNet models and three generations of Pixel phones.

Model                   | Pixel, ms | Pixel 2, ms | Pixel 3a, ms
FP32 MobileNet v1 1.0X  |        82 |          86 |           88
FP32 MobileNet v2 1.0X  |        49 |          53 |           55
FP32 MobileNet v3 Large |        39 |          42 |           44
FP32 MobileNet v3 Small |        12 |          14 |           14

The following table presents multi-threaded (using as many threads as there are big cores) performance of the XNNPACK library on three generations of MobileNet models and three generations of Pixel phones.

Model                   | Pixel, ms | Pixel 2, ms | Pixel 3a, ms
FP32 MobileNet v1 1.0X  |        43 |          27 |           46
FP32 MobileNet v2 1.0X  |        26 |          18 |           28
FP32 MobileNet v3 Large |        22 |          16 |           24
FP32 MobileNet v3 Small |         7 |           6 |            8

Benchmarked on March 27, 2020 with end2end_bench --benchmark_min_time=5 on an Android/ARM64 build with Android NDK r21 (bazel build -c opt --config android_arm64 :end2end_bench) and neural network models with randomized weights and inputs.

Raspberry Pi

The table below presents multi-threaded performance of the XNNPACK library on three generations of MobileNet models and three generations of Raspberry Pi boards.

Model                   | RPi Zero W (BCM2835), ms | RPi 2 (BCM2836), ms | RPi 3+ (BCM2837B0), ms | RPi 4 (BCM2711), ms | RPi 4 (BCM2711, ARM64), ms
FP32 MobileNet v1 1.0X  |                     3919 |                 302 |                    114 |                  72 |                         77
FP32 MobileNet v2 1.0X  |                     1987 |                 191 |                     79 |                  41 |                         46
FP32 MobileNet v3 Large |                     1658 |                 161 |                     67 |                  38 |                         40
FP32 MobileNet v3 Small |                      474 |                  50 |                     22 |                  13 |                         15
INT8 MobileNet v1 1.0X  |                     2589 |                 128 |                     46 |                  29 |                         24
INT8 MobileNet v2 1.0X  |                     1495 |                  82 |                     30 |                  20 |                         17

Benchmarked on Feb 8, 2022 with end2end-bench --benchmark_min_time=5 on a Raspbian Buster build with CMake (./scripts/build-local.sh) and neural network models with randomized weights and inputs. INT8 inference was evaluated with a per-channel quantization schema.

Minimum build requirements

  • C11
  • C++14
  • Python 3

Publications

Ecosystem

Machine Learning Frameworks

Acknowledgements

XNNPACK is based on the QNNPACK library. Over time its codebase has diverged considerably, and the XNNPACK API is no longer compatible with QNNPACK.

xnnpack's People

Contributors

ablavatski, akopich, alankelly, bhbruce, bjacob, digantdesai, dsharletg, ejparkqc, fbarchard, ferev, fredrec, geng-yan, gonnet, grantjensen, gregorycomer, iskunk, kartynnik, lk-chen, malfet, manojimg, maratyszcza, mattn, mcr229, multiverse-tf, ngzhian, phoebesv, qukhan, simonmaurer, waterdropw, xnnpack-bot


xnnpack's Issues

'vdotq_lane_s32' is invalid in C99 [-Wimplicit-function-declaration]

[ 10%] Building C object _deps/xnnpack-build/CMakeFiles/XNNPACK.dir/src/qs8-igemm/gen/6x8c4-minmax-neondot.c.o
/home/chendongmin/project/tensorflow_lite_cmake/dtln_aec_android_build/xnnpack/src/qs8-gemm/gen/1x16c4-minmax-neondot.c:64:20: warning: implicit declaration of function
'vdotq_lane_s32' is invalid in C99 [-Wimplicit-function-declaration]
vacc0x0123 = vdotq_lane_s32(vacc0x0123, vb0123x0123, va0x01234567, 0);
^
/home/chendongmin/project/tensorflow_lite_cmake/dtln_aec_android_build/xnnpack/src/qs8-gemm/gen/1x16c4-minmax-neondot.c:64:18: error: assigning to 'int32x4_t'
(vector of 4 'int32_t' values) from incompatible type 'int'
vacc0x0123 = vdotq_lane_s32(vacc0x0123, vb0123x0123, va0x01234567, 0);
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/chendongmin/project/tensorflow_lite_cmake/dtln_aec_android_build/xnnpack/src/qs8-gemm/gen/1x16c4-minmax-neondot.c:65:18: error: assigning to 'int32x4_t'
(vector of 4 'int32_t' values) from incompatible type 'int'
vacc0x4567 = vdotq_lane_s32(vacc0x4567, vb0123x4567, va0x01234567, 0);
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
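
For context, a hedged sketch of why this error appears (assumption: arm_neon.h only declares vdotq_lane_s32 when the compiler targets a dot-product-capable ISA such as -march=armv8.2-a+dotprod, so building the neondot sources without that flag yields exactly the implicit-declaration and int32x4_t errors above):

#include <arm_neon.h>

int32x4_t accumulate_dot(int32x4_t acc, int8x16_t vb, int8x8_t va) {
#if defined(__ARM_FEATURE_DOTPROD)
  /* Multiplies groups of four int8 values and accumulates into int32 lanes. */
  return vdotq_lane_s32(acc, vb, va, 0);
#else
  /* Without the dotprod target feature the intrinsic is not declared, so a
   * build should exclude the neondot microkernels instead of compiling them. */
  return acc;
#endif
}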

neondot on armv7-32bit breaks the build

Hi, a recent change breaks the build for ARMv7 32-bit (e.g. RPi2, or RPi3 32-bit). This is what I set for CMake:

set(CMAKE_SYSTEM_PROCESSOR armv7)
FAILED: ...  -c ../../src/qs8-gemm/gen/8x8c4-minmax-neondot.c
arm-linux-gnueabihf-gcc: error: unrecognized argument in option '-march=armv8.2-a+dotprod'

I think the XNNPACK_NEONDOT_MICROKERNEL_SRCS should only be built if the user specifies the dotprod modifier in CMAKE_SYSTEM_PROCESSOR, e.g.

set(CMAKE_SYSTEM_PROCESSOR armv8.2-a+dotprod)

Unable to build the tflite & xnnpack from CMakeLists.txt

The CMakeLists doesn't support Emscripten. I am building TFLite with Emscripten and am able to build all the required static libraries of libtensorflowlite. I need a delegate to launch the interpreter, but the Emscripten build also has source code dependencies on xnnpack_delegate.h/cc.
So can you please guide me on how to build XNNPACK and xnnpack_delegate.cc so that I can register XNNPACK as a delegate using the low-level delegate API (or any other means) listed in https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/delegates/xnnpack

output and input address change dynamically in operators

Dear all,

we are evaluating the use of XNNPACK for our own development. I have seen that the input and output vectors are set in the *_setup_* method that constructs the operator.

I wonder if it is possible to extend the API to set the output and input addresses after ***_setup*** has been called?
We are happy to develop the change ourselves, but wanted to be sure whether this is altogether possible.

Thanks,

Pablo.
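
As a hedged sketch of what is possible with the quoted API (the argument order mirrors the xnn_setup_convolution2d_nchw_f32 call quoted in the SPMM issue further down this page; run_with_new_buffers is a hypothetical name and the signature may differ across XNNPACK versions), one can point an already-created operator at new input/output buffers by repeating the setup call before running it again:

#include <xnnpack.h>
#include <pthreadpool.h>

static enum xnn_status run_with_new_buffers(
    xnn_operator_t op, size_t spatial_width,
    const float* new_input, float* new_output, pthreadpool_t threadpool) {
  /* Re-bind the operator to the new addresses... */
  enum xnn_status status = xnn_setup_convolution2d_nchw_f32(
      op, /*batch_size=*/1, /*input_height=*/1, /*input_width=*/spatial_width,
      new_input, new_output, threadpool);
  if (status != xnn_status_success) {
    return status;
  }
  /* ...and run it again. */
  return xnn_run_operator(op, threadpool);
}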

Wrong download link for clog

Hi, thanks for contributing such a good project. I'm trying to build XNNPACK on an NVIDIA Jetson TX2 using CMake, but it seems the download links for clog and cpuinfo are identical. Is this a mistake?

in cmake/DownloadCLog.cmake:

ExternalProject_Add(clog
URL https://github.com/pytorch/cpuinfo/archive/d5e37adf1406cf899d7d9ec1d317c47506ccb970.tar.gz
URL_HASH SHA256=3f2dc1970f397a0e59db72f9fca6ff144b216895c1d606f6c94a507c1e53a025
SOURCE_DIR "${CMAKE_BINARY_DIR}/clog-source"
BINARY_DIR "${CMAKE_BINARY_DIR}/clog"
CONFIGURE_COMMAND ""
BUILD_COMMAND ""
INSTALL_COMMAND ""
TEST_COMMAND ""
)

in cmake/DownloadCpuinfo.cmake:

ExternalProject_Add(cpuinfo
URL https://github.com/pytorch/cpuinfo/archive/d5e37adf1406cf899d7d9ec1d317c47506ccb970.tar.gz
URL_HASH SHA256=3f2dc1970f397a0e59db72f9fca6ff144b216895c1d606f6c94a507c1e53a025
SOURCE_DIR "${CMAKE_BINARY_DIR}/cpuinfo-source"
BINARY_DIR "${CMAKE_BINARY_DIR}/cpuinfo"
CONFIGURE_COMMAND ""
BUILD_COMMAND ""
INSTALL_COMMAND ""
TEST_COMMAND ""
)

iOS support

Hi,

Will XNNPACK support iOS in the future?
Currently, XNNPACK does not seem to compile for iOS.

Is it possible to build XNNPACK without AVX ?

I'm building some MediaPipe examples (a Windows build) and have noticed that they use AVX512 / AVX2 functions from XNNPACK (depending on the CPU capabilities).
Is there a good way to build XNNPACK so that it won't build the AVX parts? Modifying BUILD.bazel in XNNPACK throws some linker errors if I just comment out sections related to AVX like the following:

xnnpack_cc_library(
    name = "avx2_ukernels",
    hdrs = INTERNAL_HDRS,
    gcc_copts = xnnpack_gcc_std_copts(),
    gcc_x86_copts = [
        "-mfma",
        "-mavx2",
    ],
    msvc_copts = xnnpack_msvc_std_copts(),
    msvc_x86_32_copts = ["/arch:AVX2"],
    msvc_x86_64_copts = ["/arch:AVX2"],
    x86_srcs = AVX2_UKERNELS,
    deps = [
        ":tables",
        "@FP16",
        "@pthreadpool",
    ],
)

so I wanted to see if there is another way?

XNNPACK slower than NNPACK?

I tried to switch engines from NNPACK to XNNPACK, but found that XNNPACK is 3-5x slower than NNPACK on my nets on both arm64 and x86 devices. I took some layers from the net and ran benchmarks on Ubuntu, and got even worse results:

XNNPACK (bazel run //:convolution_bench):

Run on (6 X 4300 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 9216 KiB (x1)
Load Average: 3.69, 1.76, 1.46
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                               Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
xnnpack_convolution_f32/some_test/N:1/H:128/W:128/KH:3/KW:3/PH:1/PW:1/S:1/D:1/G:1/GCin:256/GCout:128/real_time 1345810175 ns   1345571878 ns            1 FLOPS=7.06881G/s Freq=4.19863G
xnnpack_convolution_f32/some_test/N:1/H:256/W:256/KH:3/KW:3/PH:1/PW:1/S:1/D:1/G:1/GCin:192/GCout:96/real_time  3060002089 ns   3059821942 ns            1 FLOPS=7.05024G/s Freq=1.17911G

And NNPACK:

./benchmark_conv -ic 256 -oc 128 -is 128 128 -ks 3 3 -m inference -ip 1 -t 1 -a wt8x8
Batch size: 1
Input channels: 256
Output channels: 128
Input: 128x128 with implicit padding 1
Kernel: 3x3
Subsampling: 1x1
Algorithm: WT8x8
Threads: 1
Iterations: 3
Time: 31.948 ms
Input transform: 7.692 ms (24.1%) [6.3 GB/s]
Kernel transform: 0.460 ms (1.4%) [20.8 GB/s]
Output transform: 1.587 ms (5.0%) [15.3 GB/s]
Block multiplication: 22.206 ms (69.5%) [91.4 GFLOPS]
Overhead: 0.002 ms (0.0%)

./benchmark_conv -ic 192 -oc 96 -is 256 256 -ks 3 3 -m inference -ip 1 -t 1 -a wt8x8
Batch size: 1
Input channels: 192
Output channels: 96
Input: 256x256 with implicit padding 1
Kernel: 3x3
Subsampling: 1x1
Algorithm: WT8x8
Threads: 1
Iterations: 3
Time: 76.170 ms
Input transform: 22.479 ms (29.5%) [6.3 GB/s]
Kernel transform: 0.258 ms (0.3%) [20.8 GB/s]
Output transform: 4.566 ms (6.0%) [15.5 GB/s]
Block multiplication: 48.866 ms (64.2%) [89.3 GFLOPS]
Overhead: 0.001 ms (0.0%)

Why might it be so much slower?

Apply Google style?

Since the author is at Google now, may I ask that clang-format be run over the code with Google settings? This is trivial to do and it would significantly improve readability. I could send a PR if you'd like.

Thanks!

Does XNNPACK support SeparableConv2D?

Hi,
I have a model that uses SeparableConv2D layers extensively. The results from the model with XNNPACK are very different from those without XNNPACK, i.e. I am unable to reproduce the original results with XNNPACK. Does XNNPACK support SeparableConv2D?

Issue when building iOS for arm64

I'm running into an error when compiling asm files in XNNPACK_AARCH64_ASM_MICROKERNEL_SRCS for arm64

error: /Users/taox/Projects/XNNPACK/src/f32-dwconv/up4x9-aarch64-neonfma.S:21:23: error: error: unknown token in expression
error: brackets expression not supported on this target
brackets expression not supported on this target
        LD2R {v30.4s, v31.4s}, [x8]
        STP d10, d11, [sp, 16]
        STP d10, d11, [sp, 16]

Seems like the compiler doesn't recognize the syntax. I'm not an expert in ASM, but my guess is that the compiler flag -march=armv8.2-a+fp16 is not supported by Clang? However, I did find a link that discussed adding such support - https://reviews.llvm.org/D41792.

SPMM slower with higher sparsity rate.

I read the excellent paper 'Fast Sparse ConvNets' (CVPR 2020) and I'm very interested in it. However, when I run the SPMM benchmark implemented in XNNPACK, it seems that in some cases it is even slower with higher sparsity. My code is as follows:

  pthreadpool_t threadpool = pthreadpool_create(1);
  status = xnn_initialize(NULL);
  fprintf(stderr, "mr = %d, nr = %d\n", xnn_params.f32.spmm.mr, xnn_params.f32.spmm.nr);
  xnn_operator_t spmm_op;
  status = xnn_create_convolution2d_nchw_f32(
    // padding
    0, 0, 0, 0,
    // kernel size
    1, 1,
    // stride
    1, 1,
    // dilation
    1, 1,
    // groups
    1,
    // input/output channels per group
    K, N,
    // input/output channel stride
    K, N,
    // kernel, bias     
    weight, bias,
    // min/max value of output
    0, FLT_MAX,
    // input tensor stored in NCHW order
    0,
    &spmm_op
  );
  status = xnn_setup_convolution2d_nchw_f32(
    spmm_op,
    1,
    1, M,
    input, output,
    threadpool
  );

According to the implementation of xnn_create_convolution2d_nchw_f32 in XNNPACK/src/operators/convolution-nchw.c, it will run the convolution with SPMM when the kernel size/stride/dilation are all 1 and the input/output tensors are stored in NCHW layout. I ran the operator with M = 49 [spatial dimension of the feature map], N = 512 [output channels], K = 1024 [input channels] under different weight sparsities; the results are shown as follows:
sparsity | time (ms)
0.0      | 12.60
0.1      | 12.64
0.2      | 25.52
0.3      | 22.94
0.4      | 19.84
0.5      | 16.77
0.6      | 13.36
0.7      |  9.76
0.8      |  5.80
0.9      |  2.06

Is there any plan to support lstm operation?

Hi,

I created a model using LSTM operations.

I used the performance tools with the XNNPACK delegate and found that the LSTM operation is not supported by the delegate. Are there any plans to support this operation?

Thanks.

Build crashes on ios_armv7 with bazel

Hi, thank you for your great project.

I tried to build the XNNPACK end2end_bench with the ios_armv7 config using the command below:

$ bazel build -c opt --config ios_armv7 :end2end_bench

and the bazel build log says:

console outputs
$ bazel build -c opt --config ios_armv7 :end2end_bench
INFO: Analyzed target //:end2end_bench (24 packages loaded, 1387 targets configured).
INFO: Found 1 target...
ERROR: /Users/jhyoo/workspace/src/XNNPACK/BUILD.bazel:3122:19: C++ compilation of rule '//:tables' failed (Exit 1): wrapped_clang failed: error executing command external/local_config_cc/wrapped_clang '-D_FORTIFY_SOURCE=1' -fstack-protector -fcolor-diagnostics -Wall -Wthread-safety -Wself-assign -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG ... (remaining 37 argument(s) skipped)

Use --sandbox_debug to see verbose messages from the sandbox wrapped_clang failed: error executing command external/local_config_cc/wrapped_clang '-D_FORTIFY_SOURCE=1' -fstack-protector -fcolor-diagnostics -Wall -Wthread-safety -Wself-assign -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG ... (remaining 37 argument(s) skipped)

Use --sandbox_debug to see verbose messages from the sandbox
clang: error: invalid iOS deployment version '--target=armv7-apple-ios', iOS 10 is the maximum deployment target for 32-bit targets [-Winvalid-ios-deployment-target]
Target //:end2end_bench failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 12.155s, Critical Path: 0.37s
INFO: 133 processes: 133 internal.
FAILED: Build did NOT complete successfully

The main reason seems to be clang: error: invalid iOS deployment version '--target=armv7-apple-ios', iOS 10 is the maximum deployment target for 32-bit. My iOS SDK version is 13.7, so I guess local_config_cc puts -miphoneos-version-min=13.7 on the ios_armv7 build. To avoid this I tried:

$ bazel build --ios_minimum_os='10.0' -c opt --config ios_armv7 :end2end_bench

And then I faced a compile error because ARMv7 doesn't support the dot-product SIMD instructions:

console outputs
$ bazel build --ios_minimum_os='10.0' -c opt --config ios_armv7 :end2end_bench
INFO: Build option --ios_minimum_os has changed, discarding analysis cache.
INFO: Analyzed target //:end2end_bench (0 packages loaded, 1387 targets configured).
INFO: Found 1 target...
ERROR: /Users/jhyoo/workspace/src/XNNPACK/BUILD.bazel:3470:19: C++ compilation of rule '//:neondot_ukernels' failed (Exit 1): wrapped_clang failed: error executing command external/local_config_cc/wrapped_clang '-D_FORTIFY_SOURCE=1' -fstack-protector -fcolor-diagnostics -Wall -Wthread-safety -Wself-assign -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG ... (remaining 65 argument(s) skipped)

Use --sandbox_debug to see verbose messages from the sandbox wrapped_clang failed: error executing command external/local_config_cc/wrapped_clang '-D_FORTIFY_SOURCE=1' -fstack-protector -fcolor-diagnostics -Wall -Wthread-safety -Wself-assign -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG ... (remaining 65 argument(s) skipped)

Use --sandbox_debug to see verbose messages from the sandbox
src/qs8-gemm/gen/1x8c4-minmax-neondot.c:62:20: warning: implicit declaration of function 'vdotq_lane_s32' is invalid in C99 [-Wimplicit-function-declaration]
      vacc0x0123 = vdotq_lane_s32(vacc0x0123, vb0123x0123, va0x01234567, 0);
                   ^
src/qs8-gemm/gen/1x8c4-minmax-neondot.c:62:18: error: assigning to 'int32x4_t' (vector of 4 'int32_t' values) from incompatible type 'int'
      vacc0x0123 = vdotq_lane_s32(vacc0x0123, vb0123x0123, va0x01234567, 0);
                 ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/qs8-gemm/gen/1x8c4-minmax-neondot.c:63:18: error: assigning to 'int32x4_t' (vector of 4 'int32_t' values) from incompatible type 'int'
      vacc0x4567 = vdotq_lane_s32(vacc0x4567, vb0123x4567, va0x01234567, 0);
                 ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/qs8-gemm/gen/1x8c4-minmax-neondot.c:64:18: error: assigning to 'int32x4_t' (vector of 4 'int32_t' values) from incompatible type 'int'
      vacc0x0123 = vdotq_lane_s32(vacc0x0123, vb4567x0123, va0x01234567, 1);
                 ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/qs8-gemm/gen/1x8c4-minmax-neondot.c:65:18: error: assigning to 'int32x4_t' (vector of 4 'int32_t' values) from incompatible type 'int'
      vacc0x4567 = vdotq_lane_s32(vacc0x4567, vb4567x4567, va0x01234567, 1);
                 ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/qs8-gemm/gen/1x8c4-minmax-neondot.c:79:20: warning: implicit declaration of function 'vdotq_lane_s32' is invalid in C99 [-Wimplicit-function-declaration]
      vacc0x0123 = vdotq_lane_s32(vacc0x0123, vb0123x0123, va0x01234567, 0);
                   ^
src/qs8-gemm/gen/1x8c4-minmax-neondot.c:79:18: error: assigning to 'int32x4_t' (vector of 4 'int32_t' values) from incompatible type 'int'
      vacc0x0123 = vdotq_lane_s32(vacc0x0123, vb0123x0123, va0x01234567, 0);
                 ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/qs8-gemm/gen/1x8c4-minmax-neondot.c:80:18: error: assigning to 'int32x4_t' (vector of 4 'int32_t' values) from incompatible type 'int'
      vacc0x4567 = vdotq_lane_s32(vacc0x4567, vb0123x4567, va0x01234567, 0);
                 ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/qs8-gemm/gen/1x8c4-minmax-neondot.c:88:20: error: assigning to 'int32x4_t' (vector of 4 'int32_t' values) from incompatible type 'int'
        vacc0x0123 = vdotq_lane_s32(vacc0x0123, vb4567x0123, va0x01234567, 1);
                   ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/qs8-gemm/gen/1x8c4-minmax-neondot.c:89:20: error: assigning to 'int32x4_t' (vector of 4 'int32_t' values) from incompatible type 'int'
        vacc0x4567 = vdotq_lane_s32(vacc0x4567, vb4567x4567, va0x01234567, 1);
                   ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2 warnings and 8 errors generated.
Target //:end2end_bench failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 7.145s, Critical Path: 6.64s
INFO: 215 processes: 130 internal, 85 darwin-sandbox.
FAILED: Build did NOT complete successfully

I'm not sure whether iOS armv7 is still widely used, but it was necessary for my work, so I fixed this by adding 'apple_aarch32_copt' to the NEON dot-product targets in BUILD.bazel and adding an iOS option to the ios_armv7 config.

If you're ok, may I make a PR for this?

Thank you. :)

Quantized inference performance

Have you compared quantized (unsigned 8-bit) inference with QNNPACK? Given that this library was forked off of QNNPACK, are there any optimizations in XNNPACK on top which could make it faster for quantized inference (maybe for some particular layers)?

Build is broken on Ubuntu Xenial with CMake 3.5 and GCC 5.4

Reproducible steps:

docker run -it --name xnnpack_test ubuntu:16.04
apt install -y cmake git build-essential
# cmake --version shows 3.5.1
# gcc --version shows 5.4.0

git clone https://github.com/google/XNNPACK.git
cd XNNPACK
mkdir build
cd build
cmake -DXNNPACK_BUILD_TESTS=OFF -DXNNPACK_BUILD_BENCHMARKS=OFF ..
make -j8

Logs:

[  4%] Building C object CMakeFiles/XNNPACK.dir/src/operators/convolution-nhwc.c.o
cc: error: ../src/operators/average-pooling-nhwc.c../src/operators/average-pooling-nhwc.cNOT:../src/operators/average-pooling-nhwc.cCONFIG:Debug: No such file or directory
cc: error: CMakeFiles/XNNPACK.dir/src/operators/average-pooling-nhwc.c.o: No such file or directory
[  4%] Building C object CMakeFiles/XNNPACK.dir/src/operators/convolution-nchw.c.o
CMakeFiles/XNNPACK.dir/build.make:86: recipe for target 'CMakeFiles/XNNPACK.dir/src/operators/average-pooling-nhwc.c.o' failed
make[2]: *** [CMakeFiles/XNNPACK.dir/src/operators/average-pooling-nhwc.c.o] Error 1
make[2]: *** Waiting for unfinished jobs....
cc: error: ../src/operators/binary-elementwise-nd.c../src/operators/binary-elementwise-nd.cNOT:../src/operators/binary-elementwise-nd.cCONFIG:Debug: No such file or directory
cc: error: CMakeFiles/XNNPACK.dir/src/operators/binary-elementwise-nd.c.o: No such file or directory
cc: error: ../src/operators/constant-pad-nd.c../src/operators/constant-pad-nd.cNOT:../src/operators/constant-pad-nd.cCONFIG:Debug: No such file or directory
cc: error: CMakeFiles/XNNPACK.dir/src/operators/constant-pad-nd.c.o: No such file or directory
CMakeFiles/XNNPACK.dir/build.make:158: recipe for target 'CMakeFiles/XNNPACK.dir/src/operators/constant-pad-nd.c.o' failed
make[2]: *** [CMakeFiles/XNNPACK.dir/src/operators/constant-pad-nd.c.o] Error 1
CMakeFiles/XNNPACK.dir/build.make:110: recipe for target 'CMakeFiles/XNNPACK.dir/src/operators/binary-elementwise-nd.c.o' failed
make[2]: *** [CMakeFiles/XNNPACK.dir/src/operators/binary-elementwise-nd.c.o] Error 1
cc: error: ../src/operators/argmax-pooling-nhwc.c../src/operators/argmax-pooling-nhwc.cNOT:../src/operators/argmax-pooling-nhwc.cCONFIG:Debug: No such file or directory
cc: error: CMakeFiles/XNNPACK.dir/src/operators/argmax-pooling-nhwc.c.o: No such file or directory
cc: error: ../src/operators/convolution-nhwc.c../src/operators/convolution-nhwc.cNOT:../src/operators/convolution-nhwc.cCONFIG:Debug: No such file or directory
cc: error: CMakeFiles/XNNPACK.dir/src/operators/convolution-nhwc.c.o: No such file or directory
cc: error: ../src/operators/convolution-nchw.c../src/operators/convolution-nchw.cNOT:../src/operators/convolution-nchw.cCONFIG:Debug: No such file or directory
cc: error: CMakeFiles/XNNPACK.dir/src/operators/convolution-nchw.c.o: No such file or directory
[  4%] Building C object CMakeFiles/XNNPACK.dir/src/operators/channel-shuffle-nc.c.o
CMakeFiles/XNNPACK.dir/build.make:62: recipe for target 'CMakeFiles/XNNPACK.dir/src/operators/argmax-pooling-nhwc.c.o' failed
make[2]: *** [CMakeFiles/XNNPACK.dir/src/operators/argmax-pooling-nhwc.c.o] Error 1
CMakeFiles/XNNPACK.dir/build.make:206: recipe for target 'CMakeFiles/XNNPACK.dir/src/operators/convolution-nhwc.c.o' failed
make[2]: *** [CMakeFiles/XNNPACK.dir/src/operators/convolution-nhwc.c.o] Error 1
CMakeFiles/XNNPACK.dir/build.make:182: recipe for target 'CMakeFiles/XNNPACK.dir/src/operators/convolution-nchw.c.o' failed
make[2]: *** [CMakeFiles/XNNPACK.dir/src/operators/convolution-nchw.c.o] Error 1
cc: error: ../src/operators/channel-shuffle-nc.c../src/operators/channel-shuffle-nc.cNOT:../src/operators/channel-shuffle-nc.cCONFIG:Debug: No such file or directory
cc: error: CMakeFiles/XNNPACK.dir/src/operators/channel-shuffle-nc.c.o: No such file or directory
CMakeFiles/XNNPACK.dir/build.make:134: recipe for target 'CMakeFiles/XNNPACK.dir/src/operators/channel-shuffle-nc.c.o' failed
make[2]: *** [CMakeFiles/XNNPACK.dir/src/operators/channel-shuffle-nc.c.o] Error 1
[  4%] Building C object CMakeFiles/XNNPACK.dir/src/operators/deconvolution-nhwc.c.o
cc: error: ../src/operators/deconvolution-nhwc.c../src/operators/deconvolution-nhwc.cNOT:../src/operators/deconvolution-nhwc.cCONFIG:Debug: No such file or directory
cc: error: CMakeFiles/XNNPACK.dir/src/operators/deconvolution-nhwc.c.o: No such file or directory
CMakeFiles/XNNPACK.dir/build.make:230: recipe for target 'CMakeFiles/XNNPACK.dir/src/operators/deconvolution-nhwc.c.o' failed
make[2]: *** [CMakeFiles/XNNPACK.dir/src/operators/deconvolution-nhwc.c.o] Error 1
CMakeFiles/Makefile2:69: recipe for target 'CMakeFiles/XNNPACK.dir/all' failed
make[1]: *** [CMakeFiles/XNNPACK.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2

[question] Pooling with filter 1×1

It seems none of the pooling operators support a 1×1 filter size, judging by the code here, since you consider 1×1 pooling meaningless. But recently I have been trying to run inference with tfjs-backend-wasm, and the model includes max pooling with a 1×1 filter and a 2×2 stride. This seems more like a downsampling process and looks meaningful. Do you think the pooling operators should support a case like that?

Question about "ModifyGraphWithDelegate is disallowed"

Hi

I have built XNNPACK successfully for the target device (RPi2) and can run the benchmark tests. However, when I built libtensorflow-lite.a (or .so) with XNNPACK and ran it with the prebuilt models, I ran into a ModifyGraphWithDelegate is disallowed error:

INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
ERROR: ModifyGraphWithDelegate is disallowed when graph is immutable.

The inference still ran successfully with correct results, but I guess it didn't benefit from the XNNPACK speedup because of the error.

In my testing, I used these two prebuilt models from Google:

ssd_mobilenet_v3_small_coco_2020_01_14.tflite 
coco_ssd_mobilenet_v1_1.0_quant_2018_06_29.tflite

Is this ModifyGraphWithDelegate is disallowed error expected when the model I used doesn't comply with certain requirements? Can you give some pointers on how to get around this issue?

You can also find more details of my setup at this tensorflow issue I created.

Thanks!

macOS_arm64 build

hi y'all

with the aim of compiling the benchmark_model in TF (as this depends on XNNPACK) on commit c2db3a8fae0f6558e9dbdee79e67e74c1e95981c I was trying to build the end2end_bench using bazel 4.0.0 (ARM64)

The docs state macOS support for arm64; I assume this only holds true when using CMake.
So I added a macos_arm64 config by updating .bazelrc, build_defs.bzl, cpuinfo.BUILD, and BUILD.bazel, and then ran:

bazel build --config=macos_arm64 :end2end_bench

Compiling with ios_arm64 as the build config works fine; however, it does not work with macos_arm64, even though macOS should be using the iOS kernels.

Could you give me a hint on how to build for the macos_arm64 platform with bazel? @Maratyszcza

Windows clang-cl doesn't build xop sources

The thing with clang-cl.exe on Windows is that it doesn't predefine the __GNUC__ macro as clang does. Instead it defines _MSC_VER and __clang__. In order to build the XOP sources with clang-cl, could we add an OR condition to include x86intrin.h?

 #include <assert.h>

$if SSE == 5:
  -#ifdef __GNUC__
  +#if defined(__GNUC__) || defined(__clang__)
    #include <x86intrin.h>
  #else
    #include <immintrin.h>
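
For reference, the proposed guard from the diff above, written out as plain C preprocessor logic (a sketch of the suggestion, not necessarily how XNNPACK ultimately resolved it):

/* clang-cl defines __clang__ and _MSC_VER but not __GNUC__, so accept either
 * GCC-compatible compilers or clang before falling back to immintrin.h. */
#if defined(__GNUC__) || defined(__clang__)
  #include <x86intrin.h>
#else
  #include <immintrin.h>
#endif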

Build fail 4x16c4-minmax-neondot.c

I'm trying to build neondot using NDK, but it failed:
error: assigning to 'int32x4_t' (vector of 4 'int32_t' values) from incompatible type 'int'
vacc0x0123 = vdotq_lane_s32(vacc0x0123, vb0123x0123, va0x01234567, 0);

ANDROID_ABI="arm64-v8a", ndk 21.0.6113669

question about notation of micro kernel's arguments

I have looked into the microkernel implementations for personal study.

Due to the lack of documentation, I am a little confused about the notation.

For example, xnn_qu8_gemm_ukernel_function benchmark in

static void GEMMBenchmark(benchmark::State& state,
xnn_qu8_gemm_ukernel_function gemm,
size_t mr, size_t nr, size_t kr, size_t sr,
benchmark::utils::IsaCheckFunction isa_check = nullptr)

takes mr, nr, kr as arguments.

My question is: which dimensions correspond to the fully-connected operation's kernel (weights)? I think nr*kr is the size of the FC kernel tile and mr*kr is the size of the FC input tile. Please let me know if this is incorrect.

Thank you.
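
For orientation, a hedged reference sketch (plain C, not an actual XNNPACK microkernel) of a fully connected layer expressed as a GEMM; my understanding is that mr tiles the M (batch/row) loop, nr tiles the N (output channel) loop, and kr is the number of K elements consumed per step, so an mr x nr output tile reads an mr x kr input slice and an nr x kr weight slice at a time.

#include <stddef.h>

static void reference_fc_f32(size_t M, size_t N, size_t K,
                             const float* input,   /* M x K */
                             const float* weights, /* N x K */
                             const float* bias,    /* N     */
                             float* output /* M x N */) {
  for (size_t m = 0; m < M; m++) {      /* a microkernel tiles this loop by mr */
    for (size_t n = 0; n < N; n++) {    /* ...and this loop by nr */
      float acc = bias[n];
      for (size_t k = 0; k < K; k++) {  /* ...and steps this loop by kr */
        acc += input[m * K + k] * weights[n * K + k];
      }
      output[m * N + n] = acc;
    }
  }
}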

Sparse Model Benchmark

Dear authors,
In the "Fast Sparse ConvNets" paper, it says: "Instead, we implement a dense convolutional kernel which takes as input the image in the standard HWC layout and outputs the CHW layout consumed by the sparse operations in the rest of the network.' However, its seems to me that the layout of the feature maps are not changed after the Conv2d layer when I was inspecting the pre-trained sparse model with Netron. Could you please explain to me about this?

Another issue: I encountered some problems when I tried to run the benchmark using a bazel build with TensorFlow Lite on ARMv7 Linux (I have raised an issue in their repo), so I am trying to run the end-to-end benchmark in this repository. Do you have a C++ implementation of the pre-trained sparse model (like the ones in the models folder) so that I can run it directly with this repo? It is hard to extract the parameters used in the sparse model (subsampling size, ReLU, etc.).

Thank you very much!

Raspberry Pi 4B configuration for performance test data

Hi
Could you tell me the detailed configuration of the Raspberry Pi 4 (RPi 4 (BCM2711)) in the performance data table?
For example, is the memory of the RPi 4 1G, 2G, 4G, or 8G? Is the OS Raspbian Buster 32-bit or 64-bit, and what is its release date?
I ran the benchmark command but did not get the same performance results,
and I'm not sure which result from the end2end-bench output you are checking.
I'm running it in a 64-bit Ubuntu build.
Here is the result:

ubuntu@ubuntu:~$ sudo cpupower frequency-set --governor performance
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3
ubuntu@ubuntu:/mnt/home/ubuntu/XNNPACK/build/local$ ./end2end-bench --benchmark_min_time=5
2020-07-09 13:45:46
Running ./end2end-bench
Run on (4 X 1500 MHz CPU s)

Benchmark Time CPU Iterations UserCounters...

FP32MobileNetV1/T:1/real_time 312008 us 311970 us 22 Freq=1.5G
FP32MobileNetV1/T:2/real_time 186397 us 186380 us 37 Freq=1.5G
FP32MobileNetV1/T:3/real_time 147883 us 147872 us 48 Freq=1.5G
FP32MobileNetV1/T:4/real_time 142362 us 142349 us 49 Freq=1.5G
FP32MobileNetV2/T:1/real_time 193028 us 193004 us 36 Freq=1.5G
FP32MobileNetV2/T:2/real_time 106852 us 106843 us 65 Freq=1.5G
FP32MobileNetV2/T:3/real_time 81655 us 81648 us 85 Freq=1.5G
FP32MobileNetV2/T:4/real_time 72311 us 72304 us 97 Freq=1.5G
FP32MobileNetV3Large/T:1/real_time 156868 us 156850 us 45 Freq=1.5G
FP32MobileNetV3Large/T:2/real_time 91508 us 91499 us 76 Freq=1.5G
FP32MobileNetV3Large/T:3/real_time 71158 us 71150 us 98 Freq=1.5G
FP32MobileNetV3Large/T:4/real_time 65070 us 65061 us 107 Freq=1.5G
FP32MobileNetV3Small/T:1/real_time 48827 us 48821 us 143 Freq=1.5G
FP32MobileNetV3Small/T:2/real_time 31378 us 31375 us 223 Freq=1.5G
FP32MobileNetV3Small/T:3/real_time 24950 us 24947 us 280 Freq=1.5G
FP32MobileNetV3Small/T:4/real_time 22732 us 22729 us 309 Freq=1.5G
failed to create operation #0
FP16MobileNetV1/T:1/real_time ERROR OCCURRED: 'failed to create a model'
failed to create operation #0
FP16MobileNetV1/T:2/real_time ERROR OCCURRED: 'failed to create a model'
failed to create operation #0
FP16MobileNetV1/T:3/real_time ERROR OCCURRED: 'failed to create a model'
failed to create operation #0
FP16MobileNetV1/T:4/real_time ERROR OCCURRED: 'failed to create a model'
failed to create operation #0
FP16MobileNetV2/T:1/real_time ERROR OCCURRED: 'failed to create a model'
failed to create operation #0
FP16MobileNetV2/T:2/real_time ERROR OCCURRED: 'failed to create a model'
failed to create operation #0
FP16MobileNetV2/T:3/real_time ERROR OCCURRED: 'failed to create a model'
failed to create operation #0
FP16MobileNetV2/T:4/real_time ERROR OCCURRED: 'failed to create a mode

Question About Pthreadpool (Seems Bug)

Hi,

I built a CNN model which has three convolution layers.

I use pthreadpool as follows (the platform is ARM architecture):

auto num_cores = 1;
auto threads = pthreadpool_create(num_cores);

// init
xnn_initialize(nullptr /* allocator */);

// create
xnn_status status = xnn_create_convolution2d_nchw_f32(...);

// setup
xnn_setup_convolution2d_nchw_f32(..., threads /* thread pool */)

// inference
xnn_run_operator(..., threads /* thread pool */);

When num_cores is 1 (if I pass threads as nullptr, it uses 1 core by default), the result is correct and everything is fine.

However, if I set num_cores to a value larger than 1 (whether 2, 6, or another value), the result is wrong.

There are two things I want to highlight in this error:

  1. In my case, the result res has dims 1, 16, 240, 240. res[0][0][:][:] is correct, and the others are wrong (most of them are zeros).
  2. The first and second convolutions' outputs are correct. The incorrect values first appear at the third convolution.

According to my observations, I have three questions:

  1. Did I miss something, or am I using pthreadpool in the wrong way?
  2. Do you have any tests/benchmarks covering multi-core usage?
  3. Is synchronization needed when using multiple CPU cores?

Thanks a lot!

Issue with multithread on Windows/mingw64

Hi, I cross-compiled TFLite (v2.4.1 and pre-release 2.5.0) with XNNPACK for Windows using Mingw-w64 cmake. On a single thread, the model inference works as expected. When choosing more than 1 thread (example: 2 or 4), the program quits during Invoke() unexpectedly (no errors printed).

I used the following command to set number of threads: InterpreterBuilder (*model, resolver)(&interpreter, num_threads)

A direct compile for Linux works fine when num_threads is greater than 1. Inference, as expected, is faster on 2 threads than 1.

When using default TFLite kernels on Windows (cross compiled as well), the model works fine for any number of threads (Threads set via SetNumThreads(num_threads)).

Am I missing any configuration steps when trying to cross-compile? Any assistance is appreciated. Thank you.

Does iPhoneOS armv7 support neon dot and neon v8?

In CMakeLists.txt we have:

IF(CMAKE_SYSTEM_PROCESSOR MATCHES "^armv[5-8]" OR IOS_ARCH MATCHES "^armv7")
...
  SET_PROPERTY(SOURCE ${XNNPACK_NEONV8_MICROKERNEL_SRCS} APPEND_STRING PROPERTY COMPILE_FLAGS " -march=armv8-a -mfpu=neon-fp-armv8 ")
  SET_PROPERTY(SOURCE ${XNNPACK_NEONDOT_MICROKERNEL_SRCS} APPEND_STRING PROPERTY COMPILE_FLAGS " -march=armv8.2-a+dotprod -mfpu=neon-fp-armv8 ")
...
ENDIF()

In PyTorch we are not able to build this for iPhoneOS armv7 because arm_neon.h (clang 10.0.0) fails these checks:

#if __ARM_ARCH >= 8 && defined(__ARM_FEATURE_DIRECTED_ROUNDING)
...
#if defined(__ARM_FEATURE_DOTPROD)

Is it possible for us to not include neon_dot and neon_v8 for iphoneos armv7? Or can we have a macro to exclude those two features?

[question] Handling NaN for min/max instruction

What is the design policy of handling NaN for min/max instruction in XNNPACK?

vminq_f32/vmaxq_f32 : Not IEEE754-2008 aware(NaN propagates when either input is NaN) http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802b/CIHDEEBE.html

vminnmq_f32/vmaxnmq_f32 : IEEE754-2008 aware(= matches with SSE2's min/max) http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802b/CIHFCJCF.html
(available only in ARMv8(or AARCH64))

SSE2 _mm_min_ps/_mm_max_ps : IEEE754-2008 aware
https://www.felixcloutier.com/x86/maxps

Currently, XNNPACK uses VMIN/VMAX for the min/max instructions, so there is at least an inconsistency between the ARM and x86 code paths when handling NaN values.

Related:

Implement NaN-propagating max/min on Vec256
pytorch/pytorch#13399

tflite_with_xnnpack=true

I tried to compile with tflite_with_xnnpack=true in the tensorflow folder using the following command line on aarch64:

bazel build --define tflite_with_xnnpack=true //tensorflow/tools/pip_package:build_pip_package --discard_analysis_cache --notrack_incremental_state --jobs=1

After quite a long compilation, I got the following error information:

bazel-out/aarch64-opt-exec-50AE0418/bin/_solib_aarch64/_U_S_Stensorflow_Spython_Cgen_Ustate_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.2: error: undefined reference to 'aws_checksums_do_cpu_id'
bazel-out/aarch64-opt-exec-50AE0418/bin/_solib_aarch64/_U_S_Stensorflow_Spython_Cgen_Ustate_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.2: error: undefined reference to 'aws_checksums_crc32c_hw'
collect2: error: ld returned 1 exit status
Target //tensorflow/tools/pip_package:build_pip_package failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 8709.447s, Critical Path: 152.23s
INFO: 2105 processes: 2105 local.
FAILED: Build did NOT complete successfully
 

I wonder if this results from some flags that were not set up correctly.

Thanks in advance.

MacOS

By default the macOS filesystem is case-insensitive, which means a 'build' directory cannot be created in the root of the repository, since a file with the same uppercase name is already present. This renders some of the scripts unusable.

two questions about the indirection in dwconv

I'm trying to understand the indirect convolution algorithm used in XNNPACK. It's a cool idea for implementing convolution, and thanks for contributing this project!
While reading the code, I came up with a few questions about the implementation of indirect convolution. I list them below.

  1. in XNNPACK/bench/f32-dwconv.cc, line 69
    I think the step_height represents how many pointers are needed for a single row of the output. But I cannot understand why it is calculated as kernel_size + (output_width * step_width - 1) * kernel_height. If I understand it correctly, it should be kernel_size + ((output_width-1) * step_width) * kernel_height. The first kernel_size is for one complete convolution window and the following part computes how many new pointers are needed in each step. Please correct me if I am wrong.

  2. in XNNPACK/src/indirection.c: xnn_indirection_init_dwconv2d
    In this function, we compute the input spatial location (input_x, input_y). The code checks whether it is outside the input (input_x < input_width, input_y < input_height) but it does not compare input_x and input_y to zero. For example, when we compute the input spatial location for the output location (0, 0) and the padding size is not zero (input_padding_top > 0, input_padding_left > 0), input_x and input_y will be negative. Is the corresponding address used to set the indirection_buffer correct in this situation?
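
To illustrate the indirection-buffer idea being asked about, here is a hedged sketch in plain C (not the actual xnn_indirection_init_dwconv2d code; dilation is omitted and the function name is hypothetical). With unsigned size_t coordinates, a tap that would be negative wraps around to a huge value and fails the < input_height / < input_width checks, so it can be pointed at a shared zero row instead of the input:

#include <stddef.h>

static void init_indirection_dwconv2d_sketch(
    const float** indirection, const float* input, const float* zero_row,
    size_t input_height, size_t input_width, size_t channels,
    size_t output_height, size_t output_width,
    size_t kernel_height, size_t kernel_width,
    size_t stride, size_t padding_top, size_t padding_left) {
  size_t index = 0;
  for (size_t oy = 0; oy < output_height; oy++) {
    for (size_t ox = 0; ox < output_width; ox++) {
      for (size_t ky = 0; ky < kernel_height; ky++) {
        for (size_t kx = 0; kx < kernel_width; kx++) {
          /* Unsigned arithmetic: a would-be-negative coordinate wraps around
           * and fails the bounds checks below. */
          const size_t iy = oy * stride + ky - padding_top;
          const size_t ix = ox * stride + kx - padding_left;
          if (iy < input_height && ix < input_width) {
            indirection[index++] = input + (iy * input_width + ix) * channels;
          } else {
            indirection[index++] = zero_row;  /* padding tap */
          }
        }
      }
    }
  }
}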

Question about raddstoreexpminusmax_ukernel

Hi,

I am running a TFLite model which has a final softmax layer whose input is a heatmap of dimension 1x64x64x3, where 3 is the number of channels. The output dimension is 1x64x64. TFLite is built on a Mac and I use the C API with the XNNPACK delegate. When running this on an iPad, I get an EXC_BAD_ACCESS error and the program crashes.

I was able to narrow the error down to the 'raddstoreexpminusmax_ukernel' function inside 'xnn_compute_f32_three_pass_softmax'. Based on the device, the function 'xnn_f32_raddstoreexpminusmax_ukernel__neonfma_lut64_p2_x16' is called.

Within this function, the crash happens at the last call to 'xnn_compute_f32_three_pass_softmax' on line 263 of function 'xnn_f32_raddstoreexpminusmax_ukernel__neonfma_lut64_p2_x16' which is
const float32x4_t vi = vld1q_f32(input); input += 4;

The two things I want to highlight are that the crash occurs at the last call of 'xnn_compute_f32_three_pass_softmax' (i.e. batch_index = 4095, out of 64x64 calls in total) and that the value of the variable 'elements' inside raddstoreexpminusmax_ukernel is 12, i.e. 3 (number of channels) * sizeof(float).

From my observation, in the above line we are loading 4 inputs. However, in my case the number of inputs is 3. My question is: would this lead to an out-of-bounds read, especially for the last iteration of the function?

Thank you!
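
For what it's worth, here is a hedged sketch of one remainder-handling pattern that avoids reading past the end of a buffer when fewer than 4 floats remain (illustrative only; whether the actual microkernel needs this is exactly the question above):

#include <stddef.h>
#include <string.h>

/* Copy the last `elements` (< 4) floats into a zero-padded temporary so a
 * subsequent full 4-wide vector load cannot read out of bounds. */
static void load_tail_f32x4(const float* input, size_t elements, float out[4]) {
  float padded[4] = {0.0f, 0.0f, 0.0f, 0.0f};
  memcpy(padded, input, elements * sizeof(float));
  memcpy(out, padded, sizeof(padded));
}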

aarch64 build failure {incompatible type for argument}

Hi, I ran into this error when building for aarch64. The argument expects f32 but is given f16, and the types are incompatible (alignment):

../../src/f16-hswish/gen/hswish-neonfp16arith-x8.c:32:48: error: incompatible type for argument 1 of ‘vreinterpretq_s16_f32’
   const int16x8_t vsix = vreinterpretq_s16_f32(vld1q_dup_f16(&params->six));
                                                ^~~~~~~~~~~~~

xnn_params.f32.hwc2spchw_dconv3x3c3s2.ukernel_with_symm_padding is NULL

Hi!
I am using an x86 desktop.
When I try to create the full convolution which takes NHWC as input and outputs NCHW:
xnn_create_convolution2d_nchw_f32(
  1 /* top padding */, 1 /* right padding */,
  1 /* bottom padding */, 1 /* left padding */,
  3 /* kernel height */, 3 /* kernel width */,
  2 /* subsampling height */, 2 /* subsampling width */,
  1 /* dilation_height */, 1 /* dilation_width */,
  1 /* groups */,
  3 /* input channels per group */,
  24 /* output_channels_per_group */,
  w0, w1,
  0.0f /* output min */, 6.0f /* output max */,
  XNN_FLAG_INPUT_NHWC /* flags */,
  &op0);

An error occurred: failed to create Convolution operator: only selected Convolution parameters are supported,
and I found that it is because xnn_params.f32.hwc2spchw_dconv3x3c3s2.ukernel_with_symm_padding == NULL, which should be initialized for x86.

I tried to exclude 'XNN_NO_NCHW_OPERATORS' in BUILD.bazel for the "xnnpack_operators_nhwc_f32" library but received the same error.

Compilation error for RPI 4

Hi all,

Just got this error compiling on a Raspberry Pi 4 and on Amazon EC2 ARM64:

./XNNPACK/src/qs8-gemm/2x8c16-aarch64-neon-mlal-padal.S: Assembler messages:
./XNNPACK/src/qs8-gemm/2x8c16-aarch64-neon-mlal-padal.S:51: Error: operand mismatch -- `mov v17.4s,v16.4s'

Could you please let me know if I am doing something wrong, or is it an actual compilation error in the repo?

Thanks,

Pablo.

end2end-bench failure on RPI2: "FP16MobileNetV1...failed to create a model"

Hi, I cross-compiled XNNPACK for the RPi2. When I ran end2end-bench I got some failures on FP16MobileNetV1; the others passed.

 $ ./end2end-bench

2020-08-13 01:15:36
Running ./end2end-bench
Run on (4 X 900 MHz CPU s)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
------------------------------------------------------------------------------------------
Benchmark                                   Time           CPU Iterations UserCounters...
------------------------------------------------------------------------------------------
FP32MobileNetV1/T:1/real_time         1068462 us    1067627 us          1 Freq=900M


FP32MobileNetV1/T:2/real_time          556305 us     555739 us          1 Freq=900M
FP32MobileNetV1/T:3/real_time          399058 us     396603 us          2 Freq=900M
FP32MobileNetV1/T:4/real_time          322151 us     321932 us          2 Freq=900M
FP32MobileNetV2/T:1/real_time          661108 us     658790 us          1 Freq=900M
FP32MobileNetV2/T:2/real_time          354653 us     344778 us          2 Freq=900M
FP32MobileNetV2/T:3/real_time          251801 us     251222 us          3 Freq=900M
FP32MobileNetV2/T:4/real_time          220476 us     219576 us          3 Freq=900M
FP32MobileNetV3Large/T:1/real_time     514123 us     512953 us          1 Freq=900M
FP32MobileNetV3Large/T:2/real_time     287760 us     287761 us          2 Freq=900M
FP32MobileNetV3Large/T:3/real_time     216038 us     215373 us          3 Freq=900M
FP32MobileNetV3Large/T:4/real_time     191086 us     190798 us          4 Freq=900M
FP32MobileNetV3Small/T:1/real_time     156918 us     156769 us          4 Freq=900M
FP32MobileNetV3Small/T:2/real_time      89814 us      89439 us          8 Freq=900M
FP32MobileNetV3Small/T:3/real_time      67036 us      66751 us         10 Freq=900M
FP32MobileNetV3Small/T:4/real_time      56187 us      55934 us         12 Freq=900M
failed to create operation #0
FP16MobileNetV1/T:1/real_time      ERROR OCCURRED: 'failed to create a model'
failed to create operation #0
FP16MobileNetV1/T:2/real_time      ERROR OCCURRED: 'failed to create a model'
failed to create operation #0
FP16MobileNetV1/T:3/real_time      ERROR OCCURRED: 'failed to create a model'
failed to create operation #0
FP16MobileNetV1/T:4/real_time      ERROR OCCURRED: 'failed to create a model'
failed to create operation #0
FP16MobileNetV2/T:1/real_time      ERROR OCCURRED: 'failed to create a model'
failed to create operation #0
FP16MobileNetV2/T:2/real_time      ERROR OCCURRED: 'failed to create a model'
failed to create operation #0
FP16MobileNetV2/T:3/real_time      ERROR OCCURRED: 'failed to create a model'
failed to create operation #0
FP16MobileNetV2/T:4/real_time      ERROR OCCURRED: 'failed to create a model'
QS8MobileNetV1/T:1/real_time           747497 us     747360 us          1 Freq=900M
QS8MobileNetV1/T:2/real_time           379634 us     379594 us          2 Freq=900M
QS8MobileNetV1/T:3/real_time           256996 us     256934 us          3 Freq=900M
QS8MobileNetV1/T:4/real_time           196462 us     196338 us          4 Freq=900M

Can you shed some light on what could be the cause of the failure?

_cvtu32_mask16 is missing on macOS 10.13.6 + Xcode 10.1

Dear colleagues:

I tried building PyTorch on macOS 10.13.6 + Xcode 10.1, and one showstopper during this process is that
XNNPACK fails to link because the symbol "__cvtu32_mask16" is missing.

I double-checked the clang version, which is 10, not 11, so when the compiler links the objects into the executable, it can't find "__cvtu32_mask16" in the clang headers.

Also refer to discussion in https://discuss.pytorch.org/t/pytorch-build-almost-succeeds-but-fails-undefined-symbols-for-architecture-x86-64-cvtu32-mask16/73000.

My question is:
if we try to make the build work on macOS 10.13.6 + Xcode 10.1, are there some tweaks that could solve this issue? I mean telling the compiler to use the __cvtu32_mask16 defined within XNNPACK instead of the clang library, where such a function does not exist.

Any hints?
https://github.com/google/XNNPACK/blob/master/src/xnnpack/intrinsics-polyfill.h#L36
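
For context, a hedged sketch of the kind of polyfill that intrinsics-polyfill.h provides (the guard and naming here are assumptions, and my_cvtu32_mask16 is a hypothetical name, not XNNPACK's actual definition): _cvtu32_mask16 essentially reinterprets a 32-bit integer as an AVX-512 mask, so an older compiler that lacks the intrinsic can substitute a cast.

#include <immintrin.h>

#if defined(__AVX512F__)
/* Hypothetical stand-in for _cvtu32_mask16 on compilers that do not declare it. */
static inline __mmask16 my_cvtu32_mask16(unsigned int mask) {
  return (__mmask16) mask;
}
#endif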

support for quantized tflite models

Hi, currently it seems only F32 delegation to XNNPACK is supported, even though QS8/QU8 operators are available in XNNPACK.
Is it on the roadmap to add runtime support so that quantized TFLite models can also delegate to the XNNPACK qs8*/qu8* operators?

Thanks!

Enabling XNNPACK in TFLite for ARM64?

Hey my fellow developers,

I was peeking around the build instructions, and upon inspecting the bash script download_dependencies.sh, nothing shows XNNPACK being downloaded from anywhere.

I wasn't sure whether this is a TensorFlow issue or whether I should post this issue directly to XNNPACK.

Thank you for any support I can get.

-Montana

Commit #07feec8df3927f3c150dd0cf7db9a54927bd2569 causes build errors on Intel

The latest commit causes the following build errors on Intel machines (pasting only the first few lines here; I can provide more if required):

[141/425] Building C object CMakeFiles/XNNPACK.dir/src/qs8-dwconv/gen/up8x9-minmax-sse41-mul32.c.o
FAILED: CMakeFiles/XNNPACK.dir/src/qs8-dwconv/gen/up8x9-minmax-sse41-mul32.c.o 
/usr/bin/cc -DCPUINFO_SUPPORTED_PLATFORM=1 -DFXDIV_USE_INLINE_ASSEMBLY=0 -DPTHREADPOOL_NO_DEPRECATED_API=1 -DXNN_ENABLE_ASSEMBLY=1 -DXNN_ENABLE_MEMOPT=1 -DXNN_ENABLE_SPARSE=1 -DXNN_LOG_LEVEL=0 -I../../include -I../../src -Iclog-source/deps/clog/include -Icpuinfo-source/include -Ipthreadpool-source/include -IFXdiv-source/include -IFP16-source/include -O3 -DNDEBUG -fPIC   -Wno-psabi -pthread -std=gnu99  -msse4.1  -O2 -MD -MT CMakeFiles/XNNPACK.dir/src/qs8-dwconv/gen/up8x9-minmax-sse41-mul32.c.o -MF CMakeFiles/XNNPACK.dir/src/qs8-dwconv/gen/up8x9-minmax-sse41-mul32.c.o.d -o CMakeFiles/XNNPACK.dir/src/qs8-dwconv/gen/up8x9-minmax-sse41-mul32.c.o   -c ../../src/qs8-dwconv/gen/up8x9-minmax-sse41-mul32.c
../../src/qs8-dwconv/gen/up8x9-minmax-sse41-mul32.c: In function ‘xnn_qs8_dwconv_minmax_ukernel_up8x9__sse41_mul32’:
../../src/qs8-dwconv/gen/up8x9-minmax-sse41-mul32.c:87:50: warning: implicit declaration of function ‘_mm_loadu_si32’; did you mean ‘_mm_loadu_si128’? [-Wimplicit-function-declaration]
       const __m128i vi0x0123 = _mm_cvtepi8_epi32(_mm_loadu_si32(i0));
                                                  ^~~~~~~~~~~~~~
                                                  _mm_loadu_si128
../../src/qs8-dwconv/gen/up8x9-minmax-sse41-mul32.c:87:50: error: incompatible type for argument 1 of ‘_mm_cvtepi8_epi32’
In file included from /usr/lib/gcc/x86_64-linux-gnu/7/include/immintrin.h:37:0,
                 from ../../src/qs8-dwconv/gen/up8x9-minmax-sse41-mul32.c:12:
/usr/lib/gcc/x86_64-linux-gnu/7/include/smmintrin.h:482:1: note: expected ‘__m128i {aka __vector(2) long long int}’ but argument is of type ‘int’
 _mm_cvtepi8_epi32 (__m128i __X)
 ^~~~~~~~~~~~~~~~~
../../src/qs8-dwconv/gen/up8x9-minmax-sse41-mul32.c:88:50: error: incompatible type for argument 1 of ‘_mm_cvtepi8_epi32’

I have verified that the previous commit works well.
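
For context, a hedged sketch of a portable stand-in for _mm_loadu_si32 on compilers that lack it (loadu_si32_compat is a hypothetical helper name, not XNNPACK's actual polyfill): load four unaligned bytes into the low 32 bits of an __m128i via memcpy and _mm_cvtsi32_si128.

#include <string.h>
#include <emmintrin.h>

static __m128i loadu_si32_compat(const void* address) {
  int value;
  memcpy(&value, address, sizeof(value));  /* unaligned 32-bit load */
  return _mm_cvtsi32_si128(value);         /* place it in the low lane */
}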

Two questions about the library

  1. It seems the F32 GEMM implementation quantizes the input and output? I got that from the usage here, where one has to pass min/max values for the output. I'm worried that the approximation will degrade accuracy significantly (I'm already quantising all the layers I can to int8). I still have to test that, but just to confirm: is there no SGEMM implementation that doesn't quantise the input and output?

  2. The readme says

XNNPACK is a highly optimized library of floating-point neural network inference operators

However, in the code there seems to be an implementation of GEMM with int8 weights, etc.? I'm using QNNPACK at the moment for that; would it make sense to switch to XNNPACK for int8 layers?

compilation for x86 and armv8 platforms

Hi, I tried to build XNNPACK on my devices, an NVIDIA Jetson TX2 and a MacBook Pro (2015), but encountered some problems. I use scripts/build-local.sh to build.
For the TX2, the detected arch is aarch64, which is set in CMAKE_SYSTEM_PROCESSOR. In this situation the -march=armv8.2-a+fp16 flag is added, but the TX2 does not implement the ARMv8.2 instruction set. Similarly, on x86 the arch is x86_64 and AVX512 is used in compilation. Even though I comment out the related source files (XNNPACK_AVX512F_MICROKERNEL_SRCS) and compilation flag (-mavx512f) in CMakeLists.txt, AVX512 code still exists in files like f32-rmax.cc in the benchmarks, which is activated if the macro XNN_ARCH_X86 or XNN_ARCH_X86_64 is defined.

It seems ARMv8.2 and AVX512 support is required by default for aarch64 and x86, respectively. Will XNNPACK support older ARM archs like ARMv8, and x86 without AVX512?

Sorry to open a new issue, but I think describing the problem under a more relevant title helps.

Performing coordinate regression, this optimization causes errors

Hi,

I built a coordinate regression network based on MobileNetV2.
When I used the XNNPACK delegate, I found that the inference results had a large error, and I traced the error to this optimization in XNNPACK:

if (producer->type == xnn_node_type_static_constant_pad) {

Through more detailed research, I found that when the input size is (384, 384), MobileNetV2 performs a ZeroPadding2D operation with a pad size of (0,1)(0,1), similar to the attached image.

I think the error is caused by this unconventional operation.

Thanks.
