tum-ei-eda / muriscv-nn
muRISCV-NN is a collection of efficient deep learning kernels for embedded platforms and microcontrollers.
License: Apache License 2.0
Some of the CI status badges in the README are broken...
Need to modify muriscv-nn/Sim/Vicuna/vicuna/sim/verilator_main.cpp
Due to upstream changes in riscv-isa-sim (riscv-software-src/riscv-isa-sim@f9c78b8), the mtime CSR can no longer be accessed without first enabling its bit in mcounteren. As MLonMCU and the unit tests do not use mtime, we just need to fix this in the TFLM patches.
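For reference, a minimal sketch of what enabling that bit looks like (whether it can be set here depends on the privilege mode the code runs in; the actual fix belongs in the TFLM patches and is not reproduced here):

```c
/* Minimal sketch, not the actual TFLM patch. Writing mcounteren requires M-mode;
 * mcounteren.TM (bit 1) gates access to the time CSR from privilege modes below M. */
static inline void enable_time_csr_access(void)
{
    __asm__ volatile("csrsi mcounteren, 2"); /* set mcounteren.TM */
}

static inline unsigned long read_time_csr(void)
{
    unsigned long t;
    __asm__ volatile("csrr %0, time" : "=r"(t));
    return t;
}
```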
The TFLM integration requires checking out TFLM and applying a patch, but it doesn't state onto which commit.
Specifying this would be the bare minimum, but the cleaner solution would be to provide an already patched fork on GitHub (and potentially include it as a submodule).
CMake Warning (dev) at /usr/share/cmake-3.21/Modules/GNUInstallDirs.cmake:236 (message):
Unable to determine default CMAKE_INSTALL_LIBDIR directory because no
target architecture is known. Please enable at least one language before
including GNUInstallDirs.
Call Stack (most recent call first):
build/_deps/unity-src/CMakeLists.txt:64 (include)
This warning is for project developers. Use -Wno-dev to suppress it.
No idea why this is coming up. Both my CMakeLists.txt and Unity's CMakeLists.txt set the language (to C) using project().
Related links:
https://www.spinics.net/lists/fedora-devel/msg301866.html refers to this logic here https://chromium.googlesource.com/external/github.com/g-truc/glm/+/0.9.5/cmake/GNUInstallDirs.cmake#90
It seems like most of the proposed RVP extensions are getting renamed to be more consistent overall.
Examples:
See http://www.jhauser.us/RISCV/ext-P/RVP-baseInstrs-A-002.pdf for more details.
In addition, new instructions will be added to the spec and some of the proposed ones might be dropped. Since none of this is finalized, I would not put effort into changing muRISCV-NN's PEXT implementation because it might change again later. We should really wait for its ratification and until a proper intrinsics API + usable toolchain (GCC/LLVM) is available.
See "18.2. Zve*: Vector Extensions for Embedded Processors" in the vector spec.
While the RVV extension is already ratified, its C-intrinsics API is still in the works. This year it got a major refactoring: https://github.com/riscv-non-isa/rvv-intrinsic-doc
Our current implementation still uses this older version: https://raw.githubusercontent.com/riscv-non-isa/rvv-intrinsic-doc/8dadca57e220f7eca5936fb1c76169678a2832e7/intrinsic_funcs.md
I am not sure if the latest development LLVM version already uses the new intrinsics, but I am pretty sure the RVV GCC does not. Therefore, let's delay any efforts to use the newer intrinsics until both toolchains support them.
When I ran build.sh I got this Unity error
Using the workarounds explained in #31, the validation in the integration tests using the Vicuna simulator seems to fail. We have to figure out whether this is an RTL or muRISCV-NN bug.
In the average pooling function, you use a while loop on chCnt and also set the vector length to it: size_t vl = vsetvl_e32m8(chCnt);. At the end of the loop, you decrement it with chCnt--;.
I had a hard time understanding why this would be done. For one, the loop could be converted to a for loop, but you are also just decrementing the vector length until you are working with a vector length of 1.
Was it maybe intended to decrement the chCnt variable by the vector length at the end of the while loop? Like:
chCnt -= vl; pSrc += vl;
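For illustration, a minimal stripmining sketch of what that decrement-by-vl loop would look like (the function name and the per-channel operation are hypothetical, not the actual muRISCV-NN kernel):

```c
#include <riscv_vector.h>
#include <stdint.h>

/* Hypothetical per-channel operation, only to illustrate the loop structure. */
void scale_channels(int32_t *pSrc, int32_t scale, uint32_t chCnt)
{
    while (chCnt > 0)
    {
        size_t vl = vsetvl_e32m8(chCnt);        /* elements handled in this iteration */
        vint32m8_t v = vle32_v_i32m8(pSrc, vl); /* load vl channels */
        v = vmul_vx_i32m8(v, scale, vl);
        vse32_v_i32m8(pSrc, v, vl);             /* store back */
        pSrc += vl;                             /* advance by the processed elements */
        chCnt -= vl;                            /* instead of chCnt--; */
    }
}
```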
On a side note, you might not be able to get super long vector length anyway, because the channel count is likely to be low, right? Wouldn't it make sense to apply the vector operation along a different dimension?
There are a number of academic vector cores out there:
On the commercial side, we have the following:
The larger vector cores probably compare to ARM's Scalable Vector Extension (SVE), while the smaller vector cores appear to target a similar domain as ARM's embedded vector extension for Cortex-M microcontrollers (MVE, aka Helium). This is also the vector extension used in CMSIS-NN. However, the only core with MVE appears to be the Cortex-M55, and there are, to date, no open performance numbers available.
ARM's DSP extension for Cortex-M can be compared to RISC-V's packed extension.
Upstream CMSIS-NN recently implemented support for running LSTMs with CMSIS-NN. Here are a few tasks related to that:
It seems like Vicuna currently does not support the complete Zve32x spec. Let's use this issue to list all of the unsupported instructions (the Vicuna docs do not mention all of them).
Seems to be more or less the same file apart from the headers. Looks like something that shouldn't be this way.
Check if the V extension is supported with __riscv_v (from here). Maybe also print its value / arch version.
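A minimal sketch of such a check, assuming the compiler defines __riscv_v when the V extension is enabled, with its value encoding the architecture version:

```c
#include <stdio.h>

int main(void)
{
#ifdef __riscv_v
    /* The macro value encodes the implemented version, e.g. 1000000 for v1.0. */
    printf("V extension available, __riscv_v = %d\n", __riscv_v);
#else
    printf("V extension not available\n");
#endif
    return 0;
}
```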
Suggested by @PhilippvK in order to keep things tidier.
We should do some cleanup after downloading the Toolchain/Simulator archives...
The instruction-level simulators do not take into account the individual number of cycles per instruction. Thus, using the vsetvl instruction in every loop iteration might appear very inefficient. However, this does not correctly reflect real-world implementation costs. In most uArch implementations, the vsetvl instruction would actually incur very little extra overhead. See this for more info.
Additionally, the vsetvl instructions can be fused internally into a single vector microop. From the rvv1.0 spec:
The primary motivation for the vtype CSR is to allow the vector instruction set to fit into a 32-bit instruction encoding space. A separate vset{i}vl{i} instruction can be used to set vl and/or vtype fields before execution of a vector instruction, and implementations may choose to fuse these two instructions into a single internal vector microop. In many cases, the vl and vtype values can be reused across multiple instructions, reducing the static and dynamic instruction overhead from the vset{i}vl{i} instructions. It is anticipated that a future extended 64-bit instruction encoding would allow these fields to be specified statically in the instruction encoding.
Additionally, when tuning the performance of muRISCV-NN kernels, it is important that vector instructions are correctly weighted according to their relative cost in actual implementations. For more info on an actual implementation example with some ballpark numbers, look here.
This is just documentation that vww (visual wake words) integration tests on Vicuna are supposed to fail due to the memory being too small to fit all data.
We have to evaluate if increasing the memory size in the linker script is feasible without major changes to the RTL and CRT.
I tried running the provided scripts, but this proves quite difficult due to policies, proxies, etc. - just too many scripts downloading stuff from somewhere.
Is there a chance that when you create a release, you build a self-contained package that contains the necessary toolchains and simulators etc?
In the innermost accumulation loops, we require the vadd and vmacc operations to be length agnostic in a tail-undisturbed way. This can be indicated by appending _tu to the intrinsics. However, the current LLVM 14 does not seem to support this. It seems to have gotten removed in LLVM due to binary size/compilation time, at least for now. The GCC vector fork we are using supports _tu just fine.
In order to stay compatible with both GCC and LLVM, we will resort to the vmacc intrinsic. Thankfully, vmaccs are _tu by default, so we can use them as they are. We are replacing vadds with vmaccs (using multiplication by 1) for now. This has some slight performance implications, but it's the best we can do for now!
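A minimal sketch of the workaround, assuming the older non-prefixed RVV intrinsic naming the repository currently targets (the helper name is hypothetical):

```c
#include <riscv_vector.h>

/* acc += x with the tail elements of acc left undisturbed.
 * A _tu variant of vadd would express this directly, but LLVM 14 lacks it,
 * so we use vmacc with a scalar multiplier of 1, which is tail undisturbed
 * by default: acc = acc + 1 * x. */
static inline vint32m8_t add_tail_undisturbed(vint32m8_t acc, vint32m8_t x, size_t vl)
{
    return vmacc_vx_i32m8(acc, 1, x, vl);
}
```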
TFLM person_detection_benchmark calls mostly ds_conv and 1x1 conv. Tested on GCC. With "real" vadds:
WithPersonDataIterations(10) took 939686 ticks (939 ms)
NoPersonDataIterations(10) took 939673 ticks (939 ms)
With vmaccs simulating vadds:
WithPersonDataIterations(10) took 945601 ticks (945 ms)
NoPersonDataIterations(10) took 945601 ticks (945 ms)
LLVM 14 with vmaccs simulating vadds:
WithPersonDataIterations(10) took 846704 ticks (846 ms)
NoPersonDataIterations(10) took 846703 ticks (846 ms)
Two directories with the same name in different case can create issues depending on the OS, and it certainly doesn't help with understanding what's beneath. Suggest changing this.
This week the TFLM integration test broke due to upstream changes. (https://github.com/tum-ei-eda/muriscv-nn/actions/runs/4311825732)
TODO:
The fixed commit should be (automatically) updated to the latest commit which is known to be compatible.
In #21 there have been issues with the unit tests. Since the timeout only occurs for the first test, it seems like a server-side issue.
Proposed solution:
matrix:
which should help to tell whether there was a simulator-related problem or not. This depends on making the provided scripts for unit testing more flexible, which is currently being tracked in #15.
It's extremely hard (or at least very verbose) to reproduce the bit-exact rounding behaviour of ARM's CMSIS-NN library. This is because their integer C implementation mimics ARM instructions. Take the arm_nn_requantize() function as an example: it in turn calls arm_nn_doubling_high_mult_no_sat() and arm_nn_divide_by_power_of_two(). These translate, more or less, directly into ARM instructions when using the ARM vector extension (Helium / MVE). However, I was unable to reproduce the behaviour using RISC-V vector instructions and the available RISC-V vector rounding modes. I am, in about 5% of the results of the test, off by one bit. This is, as far as I can judge, due to the different …
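For context, a rough scalar sketch of the rounding behaviour involved, based on my reading of the CMSIS-NN doubling-high-multiply and divide-by-power-of-two helpers (not the exact upstream code, and the muRISCV-NN vector version is not shown):

```c
#include <stdint.h>

/* Sketch of a doubling high multiply: (2 * a * b) >> 32, with rounding at bit 30. */
static int32_t doubling_high_mult(int32_t a, int32_t b)
{
    int64_t prod = (int64_t)a * (int64_t)b;
    prod += (int64_t)1 << 30;              /* rounding offset for the >> 31 below */
    return (int32_t)(prod >> 31);
}

/* Sketch of a rounding divide by 2^exponent, rounding the midpoint away from zero. */
static int32_t divide_by_power_of_two(int32_t dividend, int32_t exponent)
{
    const int32_t remainder_mask = ((int32_t)1 << exponent) - 1;
    int32_t result = dividend >> exponent;
    int32_t remainder = dividend & remainder_mask;
    int32_t threshold = remainder_mask >> 1;
    if (result < 0)
        threshold++;
    if (remainder > threshold)
        result++;
    return result;
}

/* Two separate rounding steps like these are hard to match bit-exactly with a
 * single RVV fixed-point rounding mode applied to a fused multiply/shift. */
```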
Some similar issues were faced by TVM, see here, here, and here. It appears that they have not yet solved the issue.
This recent PR in CMSIS-NN as a response to this PR in TF has made the whole thing even more interesting. More links with similar content: ruy matrix multiplication library PR, TF issue on this.
I will need to dig into this rabbit hole some more. But the way the rounding is currently implemented in muRISCV-NN using the vector intrinsics is far from optimal. In terms of both readability/maintainability and performance!
However: to what extent is it actually important that our kernels are bit-exact to the CMSIS-NN kernels? According to this comment it appears that it is not that critical.