larq / compute-engine

Highly optimized inference engine for Binarized Neural Networks

Home Page: https://docs.larq.dev/compute-engine

License: Apache License 2.0

Languages: Python 10.36%, Shell 1.09%, C++ 72.02%, Starlark 5.66%, MLIR 10.17%, Dockerfile 0.08%, C 0.06%, CMake 0.55%
Topics: aarch32, aarch64, android, armv7, armv8, binarized-neural-networks, bnn, inference, keras, larq, mlir, raspberry-pi, simd, tensorflow, tflite

compute-engine's Introduction


Larq is an open-source deep learning library for training neural networks with extremely low precision weights and activations, such as Binarized Neural Networks (BNNs).

Existing deep neural networks use 32, 16 or 8 bits to encode each weight and activation, making them large, slow and power-hungry. This prohibits many applications in resource-constrained environments. Larq is the first step towards solving this. It is designed to provide an easy-to-use, composable way to train BNNs (1 bit) and other types of Quantized Neural Networks (QNNs) and is based on the tf.keras interface. Note that efficient inference using a trained BNN requires the use of an optimized inference engine; we provide these for several platforms in Larq Compute Engine.

Larq is part of a family of libraries for BNN development; you can also check out Larq Zoo for pretrained models and Larq Compute Engine for deployment on mobile and edge devices.

Getting Started

To build a QNN, Larq introduces the concept of quantized layers and quantizers. A quantizer defines the way of transforming a full-precision input to a quantized output and the pseudo-gradient method used for the backwards pass. Each quantized layer requires an input_quantizer and a kernel_quantizer that describe the way of quantizing the incoming activations and the weights of the layer, respectively. If both input_quantizer and kernel_quantizer are None, the layer is equivalent to a full-precision layer.

You can define a simple binarized fully-connected Keras model using the Straight-Through Estimator as follows:

import tensorflow as tf
import larq

model = tf.keras.models.Sequential(
    [
        tf.keras.layers.Flatten(),
        # First layer: binarized weights, full-precision input activations.
        larq.layers.QuantDense(
            512, kernel_quantizer="ste_sign", kernel_constraint="weight_clip"
        ),
        # Second layer: binarized input activations and binarized weights.
        larq.layers.QuantDense(
            10,
            input_quantizer="ste_sign",
            kernel_quantizer="ste_sign",
            kernel_constraint="weight_clip",
            activation="softmax",
        ),
    ]
)

These quantized layers can be used inside any Keras model or with a custom training loop.
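As a rough usage sketch (standard Keras only; the MNIST data and hyperparameters below are illustrative, not taken from the Larq docs), the model above can be compiled and trained like any other Keras model:

import tensorflow as tf

# Illustrative training loop for the binarized model defined above.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixel values to [0, 1]

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))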

Examples

Check out our examples to learn how to train a Binarized Neural Network in just a few lines of code.

Installation

Before installing Larq, please install:

  • Python version 3.7, 3.8, 3.9, or 3.10
  • TensorFlow version 1.14, 1.15, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, or 2.10:
    pip install tensorflow  # or tensorflow-gpu

You can install Larq with Python's pip package manager:

pip install larq

About

Larq is being developed by a team of deep learning researchers and engineers at Plumerai to help accelerate both our own research and the general adoption of Binarized Neural Networks.

compute-engine's People

Contributors

adamhillier, andrewstanfordjason, arashb, cnugteren, dependabot-preview[bot], dependabot[bot], honglh, jamescook106, jneeven, leonoverweel, lgeiger, luciengaitskell, panickal-xmos, sib1, simonmaurer, timdebruin, tombana


compute-engine's Issues

add support of zero-padding (padding type 'SAME' in TF) in binary convolution op

Doing zero-padding in {-1, 1} space with the im2col algorithm is not trivial: the zeros injected into the im2col buffer will, depending on the implementation, be interpreted as -1 or +1, which leads to wrong results.

Currently I propose the following solution: store the negative value of the corresponding kernel cell in each padding cell, so that after im2col the elements cancel out in the XNOR operation of bgemm. However, this solution results in a slower im2col algorithm, since the padding elements cannot simply be set to zero!
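A minimal NumPy sketch (not LCE code) of the problem: in the XNOR/popcount formulation every element must be -1 or +1, so a zero injected by im2col is forced to act as one of the two and shifts the accumulator:

import numpy as np

patch = np.array([0, 1, -1, 1])        # first element is an injected padding zero
kernel = np.array([1, 1, -1, -1])

true_acc = int(np.dot(patch, kernel))  # the zero contributes nothing -> 1

# Bitpacked path: encode sign bits (negative -> 1, non-negative -> 0),
# so the padding zero silently becomes +1.
a_bits = (patch < 0).astype(np.uint8)
b_bits = (kernel < 0).astype(np.uint8)
mismatches = int(np.count_nonzero(np.bitwise_xor(a_bits, b_bits)))
binary_acc = len(patch) - 2 * mismatches  # standard XNOR-popcount accumulator -> 2

print(true_acc, binary_acc)            # 1 vs 2: the padding element is miscounted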

Future optimizations

Things we know might be faster, but that we are leaving for now because it simplifies the code:

  • Move the bitpack_array C++ loop inside the assembly part. This gives a speedup of about 5% for the bitpack_array function on a Raspberry Pi.
  • Remove the bitpack_matrix loop when there's no padding (i.e. bitpack_matrix becomes a single bitpack_array call).
  • Decouple bitpacking bitwidth and bgemm bitwidth: we can use the 64-bit bitpacking code for 32-bit bgemm code.
  • Cache RUY prepacking for weights

add larq compute engine to PyPI

We need to reserve the name larq-compute-engine on PyPI. I suggest we push this initial version, which does not have any functionality yet, before going completely public. What do you think @lgeiger?

Benchmark a version of LCE_BMLA with less temporaries.

The current LCE_BMLA macro in PR #128 uses 4 temporary registers v26, v27, v28, v29.
We could do it with only 2 temporaries v26, v27, by reordering the instructions (same total number of instructions):

// XOR the RHS vector with LHS row 1, popcount each byte, then horizontally
// add into a single byte; the result lands in lane 0 of v26.
eor v26.16b,  Vr.16b,  Vl1.16b
cnt v26.16b, v26.16b
addv b26, v26.16b

// Rows 2-4 reuse the single temporary v27 and insert their results
// into lanes 1-3 of v26.
eor v27.16b,  Vr.16b,  Vl2.16b
cnt v27.16b, v27.16b
addv b27, v27.16b
ins v26.s[1], v27.s[0]

eor v27.16b,  Vr.16b,  Vl3.16b
cnt v27.16b, v27.16b
addv b27, v27.16b
ins v26.s[2], v27.s[0]

eor v27.16b,  Vr.16b,  Vl4.16b
cnt v27.16b, v27.16b
addv b27, v27.16b
ins v26.s[3], v27.s[0]

If the eor, cnt, addv instructions can run in parallel (on out-of-order cores), then this version might be slower. But maybe they can only run in parallel with memory read/write instructions, in which case this version is equally fast with fewer temporaries (we should benchmark that). Right now we don't need the extra registers, but we can keep this idea around for future versions.

add a procedure to test each new op with TF lite

Converting ops from TF to TF Lite with the TF Lite converter does not work for custom ops, since each new op needs its own kernel and op implementation registered with the TF Lite OpResolver. We therefore need a procedure to add and test the new custom ops. This requires building TF Lite with the new ops and using the binary in a C++ program that executes the op exactly the same way our users would use our ops in their mobile applications.

TF lite converter weight bitpacking

Ideally we would like the TF lite converter to store the binary weights in a packed way so that the tflite model file stays small.

Another option might be that we write a tool ourselves that takes a converted .tflite model and does another pass on it, transforming the weights appropriately.

C++ testsuite for TF lite code

We need a C++ test suite for the TF Lite code (similar to the one for TF) so we can test individual components (currently there are only Python tests at the op level, not at the component level).

Use TFlite conv2d setup

We should use the multithreading setup of tflite, as well as their im2col algorithm and all that.
Basically we can copy almost everything and only change the bgemm part.

create bazel target to build tflite pip package and library

The original bash script to build a pip package for TF is available here: https://github.com/plumerai/compute-engine/blob/master/build_pip_pkg.sh
This script is used in Bazel to create the pip package for TF through Bazel commands:
https://github.com/plumerai/compute-engine/blob/master/BUILD

We can use the exact same approach to build the TF Lite library and pip package by running a single Bazel command that executes our build script based on our custom makefile for TF Lite. By doing so we don't need to manually run a couple of commands to build the TF Lite library every time.

TF lite Android deployment

We currently support Raspberry Pi and iOS but not Android.
This is because the build system for Android uses Bazel, whereas the other targets work with a Makefile.
We have to figure out whether we can have a Bazel file that has tflite Android as a dependency, with our library added to it.

Use op hints to simplify model TFLite conversion

TF 2.1 introduced @tf.function(experimental_implements="larq.bsign") to annotate custom implementations that can be represented by standard TF ops, but would benefit from custom implementations during deployment.
It would be great to integrate this into larq to make our TFLite conversion simpler.

Using experimental_implements might be a bit early; at least I can't find how we can retrieve this information easily from TFLite (just from a grep in the code). It might be worth using tf.compat.v1.lite.OpHint for now, which (judging from the code) seems to be supported both by TOCO and the new MLIR converter, and should be easy to integrate as well.

The benefit of doing something like this is that we don't need to build and maintain high-performance TF ops for things like BSign, which likely don't impact training performance much.
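A rough sketch of the annotation (the "larq.bsign" name comes from the snippet above; the function body is illustrative, and whether the converter actually picks the hint up is exactly what needs investigating):

import tensorflow as tf

@tf.function(experimental_implements="larq.bsign")
def bsign(x):
    # Built from standard TF ops so training and saving work as usual; the
    # annotation tells the converter that this whole function "implements"
    # larq.bsign, so it can be swapped for an optimized kernel at deployment.
    return tf.sign(tf.sign(x) + 0.1)  # map sign(0) = 0 to +1 instead of 0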

Switch to a single build system (bazel)

Maintaining two build systems (Makefile and Bazel) is too much work.

Dependency handling, such as downloading google-test and so on, is easier in Bazel. The Azure system should also use the Bazel setup.

We can leave a Makefile that only includes a simple build command for the library (which assumes all the dependencies are already present), for easier development when you are locally recompiling the C++ code. The Makefile should not include things like building the Python package; that should all be in the Bazel system.

the pip package does not include .so library

>>> import larq_compute_engine as lce
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/larq_compute_engine/__init__.py", line 3, in <module>
    from larq_compute_engine.python.ops.compute_engine_ops import bgemm
  File "/usr/local/lib/python2.7/dist-packages/larq_compute_engine/python/ops/compute_engine_ops.py", line 25, in <module>
    resource_loader.get_path_to_datafile('_larq_compute_engine_ops.so'))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /usr/local/lib/python2.7/dist-packages/larq_compute_engine/python/ops/_larq_compute_engine_ops.so: cannot open shared object file: No such file or directory

implement optimized BGEMM for ARM architecture

The current reference GEMM implementation does not support multi-threading, cache-optimization techniques, SIMD, or the many other techniques used in efficient GEMM implementations.

There are multiple projects we can look at for inspiration, extending their implementations for binary GEMM.

TODO:

  • understanding the RUY codebase
  • extending the 8-bit assembly kernels (32/64-bit NEONs) for binary gemm with 8bit bitpacking
  • writing 32-bit assembly kernels (32/64-bit NEONs) for binary gemm with 32bit bitpacking
  • writing 64-bit assembly kernels (32/64-bit NEONs) for binary gemm with 64bit bitpacking
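For reference, a plain NumPy sketch of the semantics the optimized kernels have to reproduce (not LCE code; byte-level packing and a lookup-table popcount stand in for the 32/64-bit words and the NEON cnt instruction):

import numpy as np

# Byte-wise popcount lookup table (the real kernels use hardware popcount).
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.int32)

def bitpack_rows(m):
    """Pack a {-1, +1} matrix row-wise, one sign bit per element (-1 -> 1, +1 -> 0)."""
    bits = (m < 0).astype(np.uint8)
    bits = np.pad(bits, ((0, 0), (0, (-bits.shape[1]) % 8)))  # bitpad to whole bytes
    return np.packbits(bits, axis=1, bitorder="little")

def bgemm_ref(a_packed, bt_packed, depth):
    """C[m, k] = sum_i A[m, i] * B[i, k], via dot = depth - 2 * popcount(a XOR b)."""
    c = np.empty((a_packed.shape[0], bt_packed.shape[0]), dtype=np.int32)
    for m in range(a_packed.shape[0]):
        for k in range(bt_packed.shape[0]):
            mismatches = POPCOUNT[np.bitwise_xor(a_packed[m], bt_packed[k])].sum()
            c[m, k] = depth - 2 * mismatches
    return c

rng = np.random.default_rng(0)
A = rng.choice([-1, 1], size=(4, 70))
B = rng.choice([-1, 1], size=(70, 3))
assert np.array_equal(A @ B, bgemm_ref(bitpack_rows(A), bitpack_rows(B.T), depth=70))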

Bit-packing of input tensor along channel dimension

The current implementation in compute-engine is as follows:

  • im2col
  • bitpack the im2col matrix and filter matrix
  • BGEMM

However, it makes sense to bitpack the input tensor first along the channel dimension (which might require extra bitpadding) and do the im2col afterwards (or use a fused bitpacking-im2col algorithm).
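A NumPy sketch of the proposed packing order (illustrative; the 32-bit word size and the shapes are assumptions): pack the NHWC input along the channel axis first, with bitpadding when the channel count is not a multiple of the word size, so that im2col afterwards only has to move the packed words:

import numpy as np

def bitpack_channels(x, bits=32):
    """Pack an NHWC tensor of {-1, +1} values along the channel axis into uint32 words."""
    n, h, w, c = x.shape
    sign_bits = (x < 0).astype(np.uint8)
    sign_bits = np.pad(sign_bits, ((0, 0),) * 3 + ((0, (-c) % bits),))  # channel bitpadding
    packed = np.packbits(sign_bits, axis=-1, bitorder="little")         # uint8 bytes
    return packed.reshape(n, h, w, -1).view(np.uint32)                  # 4 bytes per word

x = np.random.default_rng(0).choice([-1.0, 1.0], size=(1, 28, 28, 70))
print(bitpack_channels(x).shape)  # (1, 28, 28, 3): im2col now runs on 3 packed "channels"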

add binary fully connected operator

The binary fully connected operator is in essence a binary matrix-matrix multiplication (BGEMM). Assume that the input is M × N and the weight is N × K, where M is the batch size, N is the number of neurons in the previous layer, and K is the number of neurons in the current layer.

adding testsuite for ARM specific codebase

Code written in ARM assembly can only be tested on an ARM machine or an ARM emulator. Currently we have no test suite for the ARM architecture that we can run either on a Raspberry Pi or on our GitHub Actions with an ARM emulator.

flaky TF Lite inference tests

The TFLite Python inference tests are failing in PR #61, despite the fact that nothing related to those test cases is touched in that PR. So I guess the tests are flaky!

ReorderAxesOperator in ModelConverter

When converting a model to tflite, the converter will generally leave the tensors (inputs, weights) as they are, so that one can use the same code in tflite as in tf.
However, for Conv2D in particular, the most efficient memory layout for the weights is [Out channels, H, W, In channels]. In TF, however, they are stored as [H, W, In channels, Out channels]. One would therefore want to store the weights in the more efficient format. This could be done during the graph setup phase of tflite, but also already during the conversion process which is what they do. The way they accomplish this is as follows: when the converter sees a Conv2D op, it will insert a ReorderAxesOperator between the weights tensor and the Conv2D op. This is done in the file tflite/toco/import_tensorflow.cc (search for "conv2d" to find it). Then later in the converter, in particular in tflite/toco/graph_transformations/{convert,resolve}_reorder_axes.cc, if it finds a ReorderAxesOperator of which the input is constant (like weights instead of inputs) then it will apply that operator during the conversion process.

Therefore, all we have to do in our ModelConverter, is add this ReorderAxesOperator ourselves, and the converter will do the conversion for us.
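In NumPy terms the reorder is just a constant transpose of the weight tensor (a sketch of the layout change, not the converter code):

import numpy as np

# TF stores Conv2D weights as [H, W, In channels, Out channels];
# the layout we want for inference is [Out channels, H, W, In channels].
w_hwio = np.random.rand(3, 3, 64, 128).astype(np.float32)
w_ohwi = np.transpose(w_hwio, (3, 0, 1, 2))  # what ReorderAxesOperator applies to constants
print(w_ohwi.shape)  # (128, 3, 3, 64)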

TF lite deployment options

The normal TF Lite system consists of a C++ library which can be used directly (on Android, iOS, Raspberry Pi, etc.), and wrappers around it for Python (laptops and Raspberry Pi), Java (Android apps) and iOS apps. All wrappers use the same C++ library as their core.

There are several ways we can add our custom ops to this:

  1. (current method) First compile the tflite C++ library. Then compile our own functions and append them to the library file. Now any wrapper (Python, Android, iOS) can be built using the normal methods and will have our additions built in.
    Pros: the build method is very easy, with no altering of the wrappers. Usage is also easy: just replace the normal tflite package by our tflite package.
    Cons: every time tflite updates, we have to release an updated package as well to pick up those changes.

  2. Do not alter the C++ library. Instead, compile our additions to a separate library, say lqce_lite.so, and modify the wrappers to make them add our custom ops.
    Pros: none.
    Cons: this is much more work than the first option, and users still have to replace the tflite package by ours.

  3. Python wrapper only; this can be done as an additional deployment option next to the existing ones. This option is marked as "experimental" in the tflite source code. For this option, we do not alter the C++ library. Instead we compile our additions to a separate lqce_lite.so and write a single Python file lqce_lite.py that only loads this lqce_lite.so file and nothing else. A user then downloads the original tflite pip package as well as our lqce_lite package and uses it as shown in the code snippet below.
    Pros: the user can use the original tflite package; when the original tflite updates, we don't have to change anything; the user can combine our custom ops with other custom ops.
    Cons: it seems that only the Python wrapper has this InterpreterWithCustomOps mechanism, so this doesn't work for Android and iOS.

from tflite_runtime.interpreter import InterpreterWithCustomOps
import lqce_lite  # hypothetical LCE ops package from this issue

interpreter = InterpreterWithCustomOps(
    model_path="model.tflite",  # placeholder path to a converted model
    custom_op_registerers=["lqce_lite_op_loader"],
)

We could add option 3 to the larq compute engine.

TFLite conversion bug in TF1

These are some notes and thoughts on tflite conversion.

When converting models, we could put either tf.sign or lqce.bsign in the graph. It doesn't matter which one we choose, because both become custom ops ("Sign" or "Bsign") and in tflite we can handle them appropriately. We can even register both strings in tflite and point them to the same op.

There is still a weird issue about data types: in BinaryAlexNet near the end there is
BatchNorm -> Flatten -> Sign -> Dense

When using the tflite converter from tensorflow 2, this is all fine and the Flatten becomes a Reshape op.

When converting with the TensorFlow 1 converter, the Flatten layer becomes a very weird subgraph with ("Shape", "StridedSlice", "Pack") ops and a shortcut. Even though this should not be there, it revealed an issue:
When putting tf.sign after this Flatten layer, the "Shape" op goes from float32 to int32.
When putting our bsign after this Flatten layer, the "Shape" op goes from float32 to float32, yielding an error because "StridedSlice" expected an int32.

So this shows that our bsign op is somehow different from tf.sign regarding type info.
Maybe we should try to investigate this difference.
As suggested by Arash, maybe we can try putting the bsign dtypes in a different order.

Add 'packed' datatype.

It might make sense to have a packed datatype, or maybe packed8, packed32, etc., which is really just uint8/uint32 under the hood.
This way, we can have a bitpacking op that sends float to packed32, while the bgemm op only accepts packedXX as inputs and produces intYY as outputs. It will then properly give an error if you try to run bgemm on non-packed data, and if we have some operation that is supposed to support both normal uint8 and bitpacked data, it can distinguish the two.

We can rename the current bitpacking operation to bsign, because it is essentially the sign operator but with a different result type, namely packedXX.

EDIT: The way to implement these in tensorflow (not lite) is by using their DT_VARIANT datatype, which supports arbitrary C++ structs. So we can use something like

struct Packed32 {
    uint32_t x;
};

It is somewhat explained in this blogpost

LICENSE

Before we make the compute engine public, we should make sure we include proper licenses for all the things we are using, which for now is TensorFlow and everything it depends on.
For example, daBNN includes all kinds of indirect dependencies in its LICENSE file, such as Eigen and Flatbuffers.
