larq / compute-engine

Highly optimized inference engine for Binarized Neural Networks

Home Page: https://docs.larq.dev/compute-engine

License: Apache License 2.0

Languages: Python 10.36%, Shell 1.09%, C++ 72.02%, Starlark 5.66%, MLIR 10.17%, Dockerfile 0.08%, C 0.06%, CMake 0.55%
Topics: aarch32, aarch64, android, armv7, armv8, binarized-neural-networks, bnn, inference, keras, larq, mlir, raspberry-pi, simd, tensorflow, tflite

compute-engine's Introduction


Larq is an open-source deep learning library for training neural networks with extremely low precision weights and activations, such as Binarized Neural Networks (BNNs).

Existing deep neural networks use 32, 16 or 8 bits to encode each weight and activation, making them large, slow and power-hungry. This prohibits many applications in resource-constrained environments. Larq is the first step towards solving this. It is designed to provide an easy-to-use, composable way to train BNNs (1 bit) and other types of Quantized Neural Networks (QNNs) and is based on the tf.keras interface. Note that efficient inference using a trained BNN requires the use of an optimized inference engine; we provide these for several platforms in Larq Compute Engine.

Larq is part of a family of libraries for BNN development; you can also check out Larq Zoo for pretrained models and Larq Compute Engine for deployment on mobile and edge devices.

Getting Started

To build a QNN, Larq introduces the concept of quantized layers and quantizers. A quantizer defines the way of transforming a full-precision input to a quantized output and the pseudo-gradient method used for the backwards pass. Each quantized layer requires an input_quantizer and a kernel_quantizer that describe the way of quantizing the incoming activations and the weights of the layer, respectively. If both input_quantizer and kernel_quantizer are None, the layer is equivalent to a full-precision layer.

You can define a simple binarized fully-connected Keras model using the Straight-Through Estimator as follows:

import tensorflow as tf
import larq

model = tf.keras.models.Sequential(
    [
        tf.keras.layers.Flatten(),
        # First layer: binarized weights, full-precision input activations.
        larq.layers.QuantDense(
            512, kernel_quantizer="ste_sign", kernel_constraint="weight_clip"
        ),
        # Second layer: binarized input activations and binarized weights.
        larq.layers.QuantDense(
            10,
            input_quantizer="ste_sign",
            kernel_quantizer="ste_sign",
            kernel_constraint="weight_clip",
            activation="softmax",
        ),
    ]
)

These quantized layers can be used inside any Keras model or with a custom training loop.
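As a rough usage sketch (standard Keras only; the MNIST data and hyperparameters below are illustrative, not taken from the Larq docs), the model above can be compiled and trained like any other Keras model:

import tensorflow as tf

# Illustrative training loop for the binarized model defined above.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixel values to [0, 1]

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))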

Examples

Check out our examples to learn how to train a Binarized Neural Network in just a few lines of code.

Installation

Before installing Larq, please install:

  • Python version 3.7, 3.8, 3.9, or 3.10
  • TensorFlow version 1.14, 1.15, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, or 2.10:
    pip install tensorflow  # or tensorflow-gpu

You can install Larq with Python's pip package manager:

pip install larq

About

Larq is being developed by a team of deep learning researchers and engineers at Plumerai to help accelerate both our own research and the general adoption of Binarized Neural Networks.

compute-engine's People

Contributors

adamhillier, andrewstanfordjason, arashb, cnugteren, dependabot-preview[bot], dependabot[bot], honglh, jamescook106, jneeven, leonoverweel, lgeiger, luciengaitskell, panickal-xmos, sib1, simonmaurer, timdebruin, tombana


compute-engine's Issues

add support of zero-padding (padding type 'SAME' in TF) in binary convolution op

Doing zero-padding in {-1, 1} space with the im2col algorithm is not trivial: the zeros injected into the im2col buffer will, depending on the implementation, be interpreted as -1 or +1, which leads to wrong results.

Currently I propose the following solution: store the negative value of the corresponding kernel cell in each padding cell, so that after im2col the elements cancel out in the XNOR operation of bgemm. However, this solution results in a slower im2col algorithm, since the padding elements cannot simply be set to zero!
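A minimal NumPy sketch (not LCE code) of the problem: in the XNOR/popcount formulation every element must be -1 or +1, so a zero injected by im2col is forced to act as one of the two and shifts the accumulator:

import numpy as np

patch = np.array([0, 1, -1, 1])        # first element is an injected padding zero
kernel = np.array([1, 1, -1, -1])

true_acc = int(np.dot(patch, kernel))  # the zero contributes nothing -> 1

# Bitpacked path: encode sign bits (negative -> 1, non-negative -> 0),
# so the padding zero silently becomes +1.
a_bits = (patch < 0).astype(np.uint8)
b_bits = (kernel < 0).astype(np.uint8)
mismatches = int(np.count_nonzero(np.bitwise_xor(a_bits, b_bits)))
binary_acc = len(patch) - 2 * mismatches  # standard XNOR-popcount accumulator -> 2

print(true_acc, binary_acc)            # 1 vs 2: the padding element is miscounted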

Future optimizations

Things we know might be faster, but that we are leaving for now because it simplifies the code:

  • Move the bitpack_array C++ loop inside the assembly part. This gives a speedup of about 5% for the bitpack_array function on a Raspberry Pi.
  • Remove the bitpack_matrix loop when there's no padding (i.e. bitpack_matrix becomes a single bitpack_array call).
  • Decouple bitpacking bitwidth and bgemm bitwidth: we can use the 64-bit bitpacking code for 32-bit bgemm code.
  • Cache RUY prepacking for weights

add larq compute engine to PyPI

We need to reserve the name larq-compute-engine on PyPI. I suggest we push this initial version, which does not have any functionality yet, before going completely public. What do you think @lgeiger?

Benchmark a version of LCE_BMLA with less temporaries.

The current LCE_BMLA macro in PR #128 uses 4 temporary registers v26, v27, v28, v29.
We could do it with only 2 temporaries v26, v27, by reordering the instructions (same total number of instructions):

// XOR the RHS vector with LHS row 1, popcount each byte, then horizontally
// add into a single byte; the result lands in lane 0 of v26.
eor v26.16b,  Vr.16b,  Vl1.16b
cnt v26.16b, v26.16b
addv b26, v26.16b

// Rows 2-4 reuse the single temporary v27 and insert their results
// into lanes 1-3 of v26.
eor v27.16b,  Vr.16b,  Vl2.16b
cnt v27.16b, v27.16b
addv b27, v27.16b
ins v26.s[1], v27.s[0]

eor v27.16b,  Vr.16b,  Vl3.16b
cnt v27.16b, v27.16b
addv b27, v27.16b
ins v26.s[2], v27.s[0]

eor v27.16b,  Vr.16b,  Vl4.16b
cnt v27.16b, v27.16b
addv b27, v27.16b
ins v26.s[3], v27.s[0]

If the eor, cnt, addv instructions can run in parallel (on out-of-order cores), then this version might be slower. But maybe they can only run in parallel with memory read/write instructions, in which case this version is equally fast with fewer temporaries (we should benchmark that). Right now we don't need the extra registers, but we can keep this idea around for future versions.

add a procedure to test each new op with TF lite

Converting ops from TF to TF Lite with the TF Lite converter does not work for custom ops, since each new op needs its own kernel and op implementation registered with the TF Lite OpResolver. We therefore need a procedure to add and test the new custom ops. This requires building TF Lite with the new ops and using the binary in a C++ program that executes the op exactly the same way our users would use our ops in their mobile applications.

TF lite converter weight bitpacking

Ideally we would like the TF lite converter to store the binary weights in a packed way so that the tflite model file stays small.

Another option might be that we write a tool ourselves that takes a converted .tflite model and does another pass on it, transforming the weights appropriately.

C++ testsuite for TF lite code

We need a C++ test suite for the TF Lite code (similar to the one for TF) so we can test individual components (currently there are only Python tests at the op level, not at the component level).

Use TFlite conv2d setup

We should use the multithreading setup of tflite, as well as their im2col algorithm and all that.
Basically we can copy almost everything and only change the bgemm part.

create bazel target to build tflite pip package and library

The original bash script to build a pip package for TF is available here: https://github.com/plumerai/compute-engine/blob/master/build_pip_pkg.sh
This script is used in Bazel to create the pip package for TF through Bazel commands:
https://github.com/plumerai/compute-engine/blob/master/BUILD

We can use the exact same approach to build the TF Lite library and pip package by running a single Bazel command that executes our build script based on our custom makefile for TF Lite. By doing so we don't need to manually run a couple of commands to build the TF Lite library every time.

TF lite Android deployment

We currently support Raspberry Pi and iOS but not Android.
This is because the build system for Android uses Bazel, whereas the other targets work with a Makefile.
We have to figure out whether we can have a Bazel file that has tflite Android as a dependency, with our library added to it.

Use op hints to simplify model TFLite conversion

TF 2.1 introduced @tf.function(experimental_implements="larq.bsign") to annotate custom implementations that can be represented by standard TF ops, but would benefit from custom implementations during deployment.
It would be great to integrate this into larq to make our TFLite conversion simpler.

Using experimental_implements might be a bit early; at least I can't find how we can retrieve this information easily from TFLite (just from a grep in the code). It might be worth using tf.compat.v1.lite.OpHint for now, which (judging from the code) seems to be supported both by TOCO and the new MLIR converter, and should be easy to integrate as well.

The benefit of doing something like this is that we don't need to build and maintain high-performance TF ops for things like BSign, which likely don't impact training performance much.
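A rough sketch of the annotation (the "larq.bsign" name comes from the snippet above; the function body is illustrative, and whether the converter actually picks the hint up is exactly what needs investigating):

import tensorflow as tf

@tf.function(experimental_implements="larq.bsign")
def bsign(x):
    # Built from standard TF ops so training and saving work as usual; the
    # annotation tells the converter that this whole function "implements"
    # larq.bsign, so it can be swapped for an optimized kernel at deployment.
    return tf.sign(tf.sign(x) + 0.1)  # map sign(0) = 0 to +1 instead of 0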

Switch to a single build system (bazel)

Maintaining two build systems (Makefile and Bazel) is too much work.

Dependency handling, such as downloading google-test and so on, is easier in Bazel. The Azure system should also use the Bazel setup.

We can leave a Makefile that only includes a simple build command for the library (which assumes all the dependencies are already present), for easier development when you are locally recompiling the C++ code. The Makefile should not include things like building the Python package; that should all be in the Bazel system.

the pip package does not include .so library

>>> import larq_compute_engine as lce
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/larq_compute_engine/__init__.py", line 3, in <module>
    from larq_compute_engine.python.ops.compute_engine_ops import bgemm
  File "/usr/local/lib/python2.7/dist-packages/larq_compute_engine/python/ops/compute_engine_ops.py", line 25, in <module>
    resource_loader.get_path_to_datafile('_larq_compute_engine_ops.so'))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /usr/local/lib/python2.7/dist-packages/larq_compute_engine/python/ops/_larq_compute_engine_ops.so: cannot open shared object file: No such file or directory

implement optimized BGEMM for ARM architecture

The current reference GEMM implementation does not support multi-threading, cache-optimization techniques, SIMD, or the many other techniques used in efficient GEMM implementations.

There are multiple projects we can look at for inspiration, extending their implementations for binary GEMM.

TODO:

  • understanding the RUY codebase
  • extending the 8-bit assembly kernels (32/64-bit NEONs) for binary gemm with 8bit bitpacking
  • writing 32-bit assembly kernels (32/64-bit NEONs) for binary gemm with 32bit bitpacking
  • writing 64-bit assembly kernels (32/64-bit NEONs) for binary gemm with 64bit bitpacking
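For reference, a plain NumPy sketch of the semantics the optimized kernels have to reproduce (not LCE code; byte-level packing and a lookup-table popcount stand in for the 32/64-bit words and the NEON cnt instruction):

import numpy as np

# Byte-wise popcount lookup table (the real kernels use hardware popcount).
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.int32)

def bitpack_rows(m):
    """Pack a {-1, +1} matrix row-wise, one sign bit per element (-1 -> 1, +1 -> 0)."""
    bits = (m < 0).astype(np.uint8)
    bits = np.pad(bits, ((0, 0), (0, (-bits.shape[1]) % 8)))  # bitpad to whole bytes
    return np.packbits(bits, axis=1, bitorder="little")

def bgemm_ref(a_packed, bt_packed, depth):
    """C[m, k] = sum_i A[m, i] * B[i, k], via dot = depth - 2 * popcount(a XOR b)."""
    c = np.empty((a_packed.shape[0], bt_packed.shape[0]), dtype=np.int32)
    for m in range(a_packed.shape[0]):
        for k in range(bt_packed.shape[0]):
            mismatches = POPCOUNT[np.bitwise_xor(a_packed[m], bt_packed[k])].sum()
            c[m, k] = depth - 2 * mismatches
    return c

rng = np.random.default_rng(0)
A = rng.choice([-1, 1], size=(4, 70))
B = rng.choice([-1, 1], size=(70, 3))
assert np.array_equal(A @ B, bgemm_ref(bitpack_rows(A), bitpack_rows(B.T), depth=70))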

Bit-packing of input tensor along channel dimension

The current implementation in compute-engine is as follows:

  • im2col
  • bitpack the im2col matrix and filter matrix
  • BGEMM

However, it makes sense to bitpack the input tensor first along the channel dimension (which might require extra bitpadding) and do the im2col afterwards (or use a fused bitpacking-im2col algorithm).
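A NumPy sketch of the proposed packing order (illustrative; the 32-bit word size and the shapes are assumptions): pack the NHWC input along the channel axis first, with bitpadding when the channel count is not a multiple of the word size, so that im2col afterwards only has to move the packed words:

import numpy as np

def bitpack_channels(x, bits=32):
    """Pack an NHWC tensor of {-1, +1} values along the channel axis into uint32 words."""
    n, h, w, c = x.shape
    sign_bits = (x < 0).astype(np.uint8)
    sign_bits = np.pad(sign_bits, ((0, 0),) * 3 + ((0, (-c) % bits),))  # channel bitpadding
    packed = np.packbits(sign_bits, axis=-1, bitorder="little")         # uint8 bytes
    return packed.reshape(n, h, w, -1).view(np.uint32)                  # 4 bytes per word

x = np.random.default_rng(0).choice([-1.0, 1.0], size=(1, 28, 28, 70))
print(bitpack_channels(x).shape)  # (1, 28, 28, 3): im2col now runs on 3 packed "channels"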

add binary fully connected operator

The binary fully connected operator is in essence a binary matrix-matrix multiplication (BGEMM). Assume that the input is M × N and the weight is N × K, where M is the batch size, N is the number of neurons in the previous layer, and K is the number of neurons in the current layer.

adding testsuite for ARM specific codebase

Code written in ARM assembly can only be tested on an ARM machine or an ARM emulator. Currently we have no test suite for the ARM architecture that we can run either on a Raspberry Pi or on our GitHub Actions with an ARM emulator.

flaky TF Lite inference tests

The TFLite Python inference tests are failing in PR #61, despite the fact that nothing related to those test cases is touched in that PR. So I guess the tests are flaky!

ReorderAxesOperator in ModelConverter

When converting a model to tflite, the converter will generally leave the tensors (inputs, weights) as they are, so that one can use the same code in tflite as in tf.
However, for Conv2D in particular, the most efficient memory layout for the weights is [Out channels, H, W, In channels]. In TF, however, they are stored as [H, W, In channels, Out channels]. One would therefore want to store the weights in the more efficient format. This could be done during the graph setup phase of tflite, but also already during the conversion process which is what they do. The way they accomplish this is as follows: when the converter sees a Conv2D op, it will insert a ReorderAxesOperator between the weights tensor and the Conv2D op. This is done in the file tflite/toco/import_tensorflow.cc (search for "conv2d" to find it). Then later in the converter, in particular in tflite/toco/graph_transformations/{convert,resolve}_reorder_axes.cc, if it finds a ReorderAxesOperator of which the input is constant (like weights instead of inputs) then it will apply that operator during the conversion process.

Therefore, all we have to do in our ModelConverter, is add this ReorderAxesOperator ourselves, and the converter will do the conversion for us.
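In NumPy terms the reorder is just a constant transpose of the weight tensor (a sketch of the layout change, not the converter code):

import numpy as np

# TF stores Conv2D weights as [H, W, In channels, Out channels];
# the layout we want for inference is [Out channels, H, W, In channels].
w_hwio = np.random.rand(3, 3, 64, 128).astype(np.float32)
w_ohwi = np.transpose(w_hwio, (3, 0, 1, 2))  # what ReorderAxesOperator applies to constants
print(w_ohwi.shape)  # (128, 3, 3, 64)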

TF lite deployment options

The normal TF Lite system consists of a C++ library which can be used directly (on Android, iOS, Raspberry Pi, etc.), and wrappers around it for Python (laptops and Raspberry Pi), Java (Android apps) and iOS apps. All wrappers use the same C++ library as their core.

There are several ways we can add our custom ops to this:

  1. (current method) First compile the tflite C++ library. Then compile our own functions and append them to the library file. Now any wrapper (Python, Android, iOS) can be built using the normal methods and will have our additions built in.
    Pros: the build method is very easy, with no altering of the wrappers. Usage is also easy: just replace the normal tflite package by our tflite package.
    Cons: every time tflite updates, we have to release an updated package as well to pick up those changes.

  2. Do not alter the C++ library. Instead, compile our additions to a separate library, say lqce_lite.so, and modify the wrappers to make them add our custom ops.
    Pros: none.
    Cons: this is much more work than the first option, and users still have to replace the tflite package by ours.

  3. Python wrapper only; this can be done as an additional deployment option next to the existing ones. This option is marked as "experimental" in the tflite source code. For this option, we do not alter the C++ library. Instead we compile our additions to a separate lqce_lite.so and write a single Python file lqce_lite.py that only loads this lqce_lite.so file and nothing else. A user then downloads the original tflite pip package as well as our lqce_lite package and uses it as shown in the code snippet below.
    Pros: the user can use the original tflite package; when the original tflite updates, we don't have to change anything; the user can combine our custom ops with other custom ops.
    Cons: it seems that only the Python wrapper has this InterpreterWithCustomOps mechanism, so this doesn't work for Android and iOS.

from tflite_runtime.interpreter import InterpreterWithCustomOps
import lqce_lite  # hypothetical LCE ops package from this issue

interpreter = InterpreterWithCustomOps(
    model_path="model.tflite",  # placeholder path to a converted model
    custom_op_registerers=["lqce_lite_op_loader"],
)

We could add option 3 to the larq compute engine.

TFLite conversion bug in TF1

These are some notes and thoughts on tflite conversion.

When converting models, we could put either tf.sign or lqce.bsign in the graph. It doesn't matter which one we choose, because both become custom ops ("Sign" or "Bsign") and in tflite we can handle them appropriately. We can even register both strings in tflite and point them to the same op.

There is still a weird issue about data types: in BinaryAlexNet near the end there is
BatchNorm -> Flatten -> Sign -> Dense

When using the tflite converter from tensorflow 2, this is all fine and the Flatten becomes a Reshape op.

When converting with the TensorFlow 1 converter, the Flatten layer becomes a very weird subgraph with ("Shape", "StridedSlice", "Pack") ops and a shortcut. Even though this should not be there, it revealed an issue:
When putting tf.sign after this Flatten layer, the "Shape" op goes from float32 to int32.
When putting our bsign after this Flatten layer, the "Shape" op goes from float32 to float32, yielding an error because "StridedSlice" expected an int32.

So this shows that our bsign op is somehow different from tf.sign regarding type info.
Maybe we should try to investigate this difference.
As suggested by Arash, maybe we can try putting the bsign dtypes in a different order.

Add 'packed' datatype.

It might make sense to have a packed datatype, or maybe packed8, packed32, etc., which is really just uint8/uint32 under the hood.
This way, we can have a bitpacking op that sends float to packed32, while the bgemm op only accepts packedXX as inputs and produces intYY as outputs. It will then properly give an error if you try to run bgemm on non-packed data, and if we have some operation that is supposed to support both normal uint8 and bitpacked data, it can distinguish the two.

We can rename the current bitpacking operation to bsign, because it is essentially the sign operator but with a different result type, namely packedXX.

EDIT: The way to implement these in tensorflow (not lite) is by using their DT_VARIANT datatype, which supports arbitrary C++ structs. So we can use something like

struct Packed32 {
    uint32_t x;
};

It is somewhat explained in this blogpost

LICENSE

Before we make the compute engine public, we should make sure we include proper licenses for all the things we are using, which for now is TensorFlow and everything it depends on.
For example, daBNN includes all kinds of indirect dependencies in its LICENSE file, such as Eigen and Flatbuffers.
