tum-ei-eda / muriscv-nn
muRISCV-NN is a collection of efficient deep learning kernels for embedded platforms and microcontrollers.
License: Apache License 2.0
Some of the CI status badges in the README are broken...
Need to modify muriscv-nn/Sim/Vicuna/vicuna/sim/verilator_main.cpp
Due to upstream changes in riscv-isa-sim (riscv-software-src/riscv-isa-sim@f9c78b8), the mtime CSR can no longer be accessed without first enabling its bit in mcounteren. As MLonMCU and the unit tests do not use mtime, we just need to fix this in the TFLM patches.
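For reference, a minimal sketch of what enabling that bit looks like (whether it can be set here depends on the privilege mode the code runs in; the actual fix belongs in the TFLM patches and is not reproduced here):

```c
/* Minimal sketch, not the actual TFLM patch. Writing mcounteren requires M-mode;
 * mcounteren.TM (bit 1) gates access to the time CSR from privilege modes below M. */
static inline void enable_time_csr_access(void)
{
    __asm__ volatile("csrsi mcounteren, 2"); /* set mcounteren.TM */
}

static inline unsigned long read_time_csr(void)
{
    unsigned long t;
    __asm__ volatile("csrr %0, time" : "=r"(t));
    return t;
}
```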
The TFLM integration requires checking out TFLM and applying a patch, but it doesn't state onto which commit.
Specifying this would be the bare minimum, but the cleaner solution would be to provide an already patched fork on GitHub (and potentially include it as a submodule).
CMake Warning (dev) at /usr/share/cmake-3.21/Modules/GNUInstallDirs.cmake:236 (message):
Unable to determine default CMAKE_INSTALL_LIBDIR directory because no
target architecture is known. Please enable at least one language before
including GNUInstallDirs.
Call Stack (most recent call first):
build/_deps/unity-src/CMakeLists.txt:64 (include)
This warning is for project developers. Use -Wno-dev to suppress it.
No idea why this is coming up. Both my CMakeLists.txt and Unity's CMakeLists.txt set the language (to C) using project().
Related links:
https://www.spinics.net/lists/fedora-devel/msg301866.html refers to this logic here https://chromium.googlesource.com/external/github.com/g-truc/glm/+/0.9.5/cmake/GNUInstallDirs.cmake#90
It seems like most of the proposed RVP extensions are getting renamed to be more consistent overall.
Examples:
See http://www.jhauser.us/RISCV/ext-P/RVP-baseInstrs-A-002.pdf for more details.
In addition, new instructions will be added to the spec and some of the proposed ones might be dropped. Since none of this is finalized, I would not put effort into changing muRISCV-NN's PEXT implementation because it might change again later. We should really wait for its ratification and until a proper intrinsics API + usable toolchain (GCC/LLVM) is available.
See "18.2. Zve*: Vector Extensions for Embedded Processors" in the vector spec.
While the RVV extension is already ratified, its C-intrinsics API is still in the works. This year it got a major refactoring: https://github.com/riscv-non-isa/rvv-intrinsic-doc
Our current implementation still uses this older version: https://raw.githubusercontent.com/riscv-non-isa/rvv-intrinsic-doc/8dadca57e220f7eca5936fb1c76169678a2832e7/intrinsic_funcs.md
I am not sure if the latest development LLVM version already uses the new intrinsics, but I am pretty sure the RVV GCC does not. Therefore, let's delay any efforts to use the newer intrinsics until both toolchains support them.
When I ran build.sh I got this Unity error
Using the workarounds explained in #31, the validation in the integration tests using the Vicuna simulator seems to fail. We have to figure out whether this is an RTL or muRISCV-NN bug.
In the average pooling function, you use a while loop on chCnt and also set the vector length to it: size_t vl = vsetvl_e32m8(chCnt);. At the end of the loop, you decrement it with chCnt--;.
I had a hard time understanding why this would be done. For one, the loop could be converted to a for loop, but you are also just decrementing the vector length until you are working with a vector length of 1.
Was it maybe intended to decrement the chCnt variable by the vector length at the end of the while loop? Like:
chCnt -= vl; pSrc += vl;
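For illustration, a minimal stripmining sketch of what that decrement-by-vl loop would look like (the function name and the per-channel operation are hypothetical, not the actual muRISCV-NN kernel):

```c
#include <riscv_vector.h>
#include <stdint.h>

/* Hypothetical per-channel operation, only to illustrate the loop structure. */
void scale_channels(int32_t *pSrc, int32_t scale, uint32_t chCnt)
{
    while (chCnt > 0)
    {
        size_t vl = vsetvl_e32m8(chCnt);        /* elements handled in this iteration */
        vint32m8_t v = vle32_v_i32m8(pSrc, vl); /* load vl channels */
        v = vmul_vx_i32m8(v, scale, vl);
        vse32_v_i32m8(pSrc, v, vl);             /* store back */
        pSrc += vl;                             /* advance by the processed elements */
        chCnt -= vl;                            /* instead of chCnt--; */
    }
}
```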
On a side note, you might not be able to get super long vector length anyway, because the channel count is likely to be low, right? Wouldn't it make sense to apply the vector operation along a different dimension?
There are a number of academic vector cores out there:
On the commercial side, we have the following:
The larger vector cores probably compare to ARM's Scalable Vector Extension (SVE), while the smaller vector cores appear to target a similar domain as ARM's embedded vector extension for Cortex-M microcontrollers (MVE, aka Helium). This is also the vector extension used in CMSIS-NN. However, the only core with MVE appears to be the Cortex-M55, and there are, to date, no open performance numbers available.
ARM's DSP extension for Cortex-M can be compared to RISC-V's packed extension.
Upstream CMSIS-NN recently implemented support for running LSTMs with CMSIS-NN. Here are a few tasks related to that:
It seems like Vicuna currently does not support the complete Zve32x spec. Let's use this issue to list all of the unsupported instructions (the Vicuna docs do not mention all of them).
Seems to be more or less the same file apart from the headers. Looks like something that shouldn't be this way.
Check if the V extension is supported with __riscv_v (from here). Maybe also print its value / arch version.
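A minimal sketch of such a check, assuming the compiler defines __riscv_v when the V extension is enabled, with its value encoding the architecture version:

```c
#include <stdio.h>

int main(void)
{
#ifdef __riscv_v
    /* The macro value encodes the implemented version, e.g. 1000000 for v1.0. */
    printf("V extension available, __riscv_v = %d\n", __riscv_v);
#else
    printf("V extension not available\n");
#endif
    return 0;
}
```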
Suggested by @PhilippvK in order to keep things tidier.
We should do some cleanup after downloading the Toolchain/Simulator archives...
The instruction-level simulators do not take into account the individual number of cycles per instruction. Thus, using the vsetvl instruction in every loop iteration might appear very inefficient. However, this does not correctly reflect real-world implementation costs. In most uArch implementations, the vsetvl instruction would actually incur very little extra overhead. See this for more info.
Additionally, the vsetvl instructions can be fused internally into a single vector microop. From the rvv1.0 spec:
The primary motivation for the vtype CSR is to allow the vector instruction set to fit into a 32-bit instruction encoding space. A separate vset{i}vl{i} instruction can be used to set vl and/or vtype fields before execution of a vector instruction, and implementations may choose to fuse these two instructions into a single internal vector microop. In many cases, the vl and vtype values can be reused across multiple instructions, reducing the static and dynamic instruction overhead from the vset{i}vl{i} instructions. It is anticipated that a future extended 64-bit instruction encoding would allow these fields to be specified statically in the instruction encoding.
Additionally, when tuning the performance of muRISCV-NN kernels, it is important that vector instructions are correctly weighted according to their relative cost in actual implementations. For more info on an actual implementation example with some ballpark numbers, look here.
This is just documentation that vww (visual wake words) integration tests on Vicuna are supposed to fail due to the memory being too small to fit all data.
We have to evaluate if increasing the memory size in the linker script is feasible without major changes to the RTL and CRT.
I tried running the provided scripts, but this proves quite difficult due to policies, proxies, etc. - just too many scripts downloading stuff from somewhere.
Is there a chance that when you create a release, you build a self-contained package that contains the necessary toolchains and simulators etc?
In the innermost accumulation loops, we require the vadd and vmacc operations to be length agnostic in a tail-undisturbed way. This can be indicated by appending _tu to the intrinsics. However, the current LLVM 14 does not seem to support this. It seems to have gotten removed in LLVM due to binary size/compilation time, at least for now. The GCC vector fork we are using supports _tu just fine.
In order to stay compatible with both GCC and LLVM, we will resort to the vmacc intrinsic. Thankfully, vmaccs are _tu by default, so we can use them as they are. We are replacing vadds with vmaccs (using multiplication by 1) for now. This has some slight performance implications, but it's the best we can do for now!
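A minimal sketch of the workaround, assuming the older non-prefixed RVV intrinsic naming the repository currently targets (the helper name is hypothetical):

```c
#include <riscv_vector.h>

/* acc += x with the tail elements of acc left undisturbed.
 * A _tu variant of vadd would express this directly, but LLVM 14 lacks it,
 * so we use vmacc with a scalar multiplier of 1, which is tail undisturbed
 * by default: acc = acc + 1 * x. */
static inline vint32m8_t add_tail_undisturbed(vint32m8_t acc, vint32m8_t x, size_t vl)
{
    return vmacc_vx_i32m8(acc, 1, x, vl);
}
```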
TFLM person_detection_benchmark calls mostly ds_conv and 1x1 conv. Tested on GCC. With "real" vadds:
WithPersonDataIterations(10) took 939686 ticks (939 ms)
NoPersonDataIterations(10) took 939673 ticks (939 ms)
With vmaccs simulating vadds:
WithPersonDataIterations(10) took 945601 ticks (945 ms)
NoPersonDataIterations(10) took 945601 ticks (945 ms)
LLVM 14 with vmaccs simulating vadds:
WithPersonDataIterations(10) took 846704 ticks (846 ms)
NoPersonDataIterations(10) took 846703 ticks (846 ms)
Two directories with the same name in different case can create issues depending on the OS, and it certainly doesn't help with understanding what's beneath. Suggest changing this.
This week the TFLM integration test broke due to upstream changes. (https://github.com/tum-ei-eda/muriscv-nn/actions/runs/4311825732)
TODO:
The fixed commit should be (automatically) updated to the latest commit which is known to be compatible.
In #21 there have been issues with the unit tests. Since the timeout only occurs for the first test, it seems like a server-side issue.
Proposed solution:
matrix:
which should help to tell whether there was a simulator-related problem or not. This depends on making the provided scripts for unit testing more flexible, which is currently being tracked in #15.
It's extremely hard (or at least very verbose) to reproduce the bit-exact rounding behaviour of ARM's CMSIS-NN library. This is because their integer C implementation mimics ARM instructions. Take the arm_nn_requantize() function as an example: it in turn calls arm_nn_doubling_high_mult_no_sat() and arm_nn_divide_by_power_of_two(). These translate, more or less, directly into ARM instructions when using the ARM vector extension (Helium / MVE). However, I was unable to reproduce the behaviour using RISC-V vector instructions and the available RISC-V vector rounding modes. I am, in about 5% of the results of the test, off by one bit. This is, as far as I can judge, due to the different …
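For context, a rough scalar sketch of the rounding behaviour involved, based on my reading of the CMSIS-NN doubling-high-multiply and divide-by-power-of-two helpers (not the exact upstream code, and the muRISCV-NN vector version is not shown):

```c
#include <stdint.h>

/* Sketch of a doubling high multiply: (2 * a * b) >> 32, with rounding at bit 30. */
static int32_t doubling_high_mult(int32_t a, int32_t b)
{
    int64_t prod = (int64_t)a * (int64_t)b;
    prod += (int64_t)1 << 30;              /* rounding offset for the >> 31 below */
    return (int32_t)(prod >> 31);
}

/* Sketch of a rounding divide by 2^exponent, rounding the midpoint away from zero. */
static int32_t divide_by_power_of_two(int32_t dividend, int32_t exponent)
{
    const int32_t remainder_mask = ((int32_t)1 << exponent) - 1;
    int32_t result = dividend >> exponent;
    int32_t remainder = dividend & remainder_mask;
    int32_t threshold = remainder_mask >> 1;
    if (result < 0)
        threshold++;
    if (remainder > threshold)
        result++;
    return result;
}

/* Two separate rounding steps like these are hard to match bit-exactly with a
 * single RVV fixed-point rounding mode applied to a fused multiply/shift. */
```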
Some similar issues were faced by TVM, see here, here, and here. It appears that they have not yet solved the issue.
This recent PR in CMSIS-NN as a response to this PR in TF has made the whole thing even more interesting. More links with similar content: ruy matrix multiplication library PR, TF issue on this.
I will need to dig into this rabbit hole some more. But the way the rounding is currently implemented in muRISCV-NN using the vector intrinsics is far from optimal. In terms of both readability/maintainability and performance!
However: to what extent is it actually important that our kernels are bit-exact to the CMSIS-NN kernels? According to this comment it appears that it is not that critical.