
muriscv-nn's Issues

tflm no commit specified

The TFLM integration requires checking out TFLM and applying a patch, but does not state which commit the patch applies to.
Specifying this would be the bare minimum; the cleaner solution would be to provide an already patched fork on GitHub (and potentially include it as a submodule).

CMake warning `Unable to determine default CMAKE_INSTALL_LIBDIR`

CMake Warning (dev) at /usr/share/cmake-3.21/Modules/GNUInstallDirs.cmake:236 (message):
  Unable to determine default CMAKE_INSTALL_LIBDIR directory because no
  target architecture is known.  Please enable at least one language before
  including GNUInstallDirs.
Call Stack (most recent call first):
  build/_deps/unity-src/CMakeLists.txt:64 (include)
This warning is for project developers.  Use -Wno-dev to suppress it.

No idea why this is coming up. Both my CMakeLists.txt and Unity's CMakeLists.txt set the language (to C) via the project() command.

Related links:
https://www.spinics.net/lists/fedora-devel/msg301866.html refers to the logic in https://chromium.googlesource.com/external/github.com/g-truc/glm/+/0.9.5/cmake/GNUInstallDirs.cmake#90

PEXT is getting a major refactoring

It seems like most of the proposed RVP instructions are being renamed to make the naming more consistent overall.

Examples:

  • ADD8 -> ADD.B
  • ADD16 -> ADD.H

See http://www.jhauser.us/RISCV/ext-P/RVP-baseInstrs-A-002.pdf for more details.

In addition, new instructions will be added to the spec and some of the proposed ones might be dropped. Since none of this is finalized, I would not put effort into changing muRISCV-NN's PEXT implementation, because it might change again later. We should wait for ratification and until a proper intrinsics API plus a usable toolchain (GCC/LLVM) are available.

Use updated Intrinsics API for V-EXT

While the RVV extension is already ratified, its C intrinsics API is still in the works. This year it got a major refactoring: https://github.com/riscv-non-isa/rvv-intrinsic-doc

Our current implementation still uses this older version: https://raw.githubusercontent.com/riscv-non-isa/rvv-intrinsic-doc/8dadca57e220f7eca5936fb1c76169678a2832e7/intrinsic_funcs.md

I am not sure whether the latest development version of LLVM already uses the new intrinsics, but I am pretty sure the RVV GCC does not. Therefore, let's delay any effort to adopt the newer intrinsics until both toolchains support them.
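
To illustrate the difference, here is a minimal sketch (not taken from the muRISCV-NN sources; the loop is made up, and each variant only compiles against the matching toolchain generation). The most visible change is the __riscv_ prefix on every intrinsic; the refactored API also makes tail/mask policies explicit via suffixes such as _tu:

#include <stddef.h>
#include <stdint.h>
#include <riscv_vector.h>

/* Old-style intrinsic names (the version muRISCV-NN currently targets). */
void add_old(const int32_t *a, const int32_t *b, int32_t *c, size_t n) {
    while (n > 0) {
        size_t vl = vsetvl_e32m8(n);
        vint32m8_t va = vle32_v_i32m8(a, vl);
        vint32m8_t vb = vle32_v_i32m8(b, vl);
        vse32_v_i32m8(c, vadd_vv_i32m8(va, vb, vl), vl);
        a += vl; b += vl; c += vl; n -= vl;
    }
}

/* The same loop against the refactored API: note the __riscv_ prefix. */
void add_new(const int32_t *a, const int32_t *b, int32_t *c, size_t n) {
    while (n > 0) {
        size_t vl = __riscv_vsetvl_e32m8(n);
        vint32m8_t va = __riscv_vle32_v_i32m8(a, vl);
        vint32m8_t vb = __riscv_vle32_v_i32m8(b, vl);
        __riscv_vse32_v_i32m8(c, __riscv_vadd_vv_i32m8(va, vb, vl), vl);
        a += vl; b += vl; c += vl; n -= vl;
    }
}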

muriscv_nn_avgpool_s16 channel loop

In the average pooling function, you use a while loop over chCnt and also derive the vector length from it: size_t vl = vsetvl_e32m8(chCnt);.

At the end of the loop, you decrement it: chCnt--;.

I had a hard time understanding why this would be done. For one, the loop could be converted to a for loop, but more importantly, you are just decrementing the requested vector length by one each iteration until you are working with a vector length of 1.

Was it maybe intended to decrement the chCnt variable by the vector length at the end of the while loop?
Like:
chCnt -= vl; pSrc += vl;
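
For illustration, a minimal sketch of what I assume was intended (chCnt and pSrc are the names from this issue; the surrounding kernel is heavily simplified and hypothetical):

#include <stddef.h>
#include <stdint.h>
#include <riscv_vector.h>

/* Strip-mined channel loop: each iteration processes vl elements and
   advances by vl, instead of shrinking the vector length by one. */
void accumulate_channels(const int32_t *pSrc, int32_t *pDst, size_t chCnt) {
    while (chCnt > 0) {
        size_t vl = vsetvl_e32m8(chCnt);   /* vl = min(chCnt, VLMAX) */
        vint32m8_t v   = vle32_v_i32m8(pSrc, vl);
        vint32m8_t acc = vle32_v_i32m8(pDst, vl);
        vse32_v_i32m8(pDst, vadd_vv_i32m8(acc, v, vl), vl);
        pSrc  += vl;                       /* advance by vl ...      */
        pDst  += vl;
        chCnt -= vl;                       /* ... not by one         */
    }
}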

On a side note, you might not benefit from long vector lengths here anyway, because the channel count is likely to be low, right? Wouldn't it make sense to apply the vector operation along a different dimension?

Possible Targets

There are a number of academic vector cores out there:

Low Power / Embedded

Application Level


On the commercial side, we have the following:

Commercial IP Cores:

Actual Hardware:


General Notes

The larger vector cores are probably comparable to ARM's Scalable Vector Extension (SVE, the successor to Neon), while the smaller vector cores appear to target a similar domain as ARM's embedded vector extension for Cortex-M microcontrollers (MVE, aka Helium). This is also the vector extension used in CMSIS-NN. However, the only core with MVE appears to be the Cortex-M55, and to date there are no open performance numbers available for it.
ARM's DSP extension for Cortex-M can be compared to RISC-V's packed extension.

Tracking Issue: LSTM Support

Upstream CMSIS-NN recently implemented support for running LSTMs. Here are a few tasks related to that:

  • Fix broken TFLM integration test due to LSTM changes (see #22)
  • Support scalar (unoptimized) versions of LSTM functions
  • Add a reference LSTM model to the benchmarking flow (does ARM provide one?)
  • Look into optimizing LSTM performance using VEXT and PEXT specific code (low priority)

Document workarounds to support VEXT on Vicuna

It seems like Vicuna currently does not support the complete Zve32x spec. Let's use this issue to list all of the unsupported instructions (the Vicuna docs do not mention all of them).

Cost of instructions and impact of `vsetvl`

The instruction-level simulators do not take into account the individual number of cycles per instruction. Thus, using the vsetvl instruction in every loop iteration might appear very inefficient. However, this does not correctly reflect real-world implementation costs: in most uArch implementations, the vsetvl instruction would actually incur very little extra overhead. See this for more info.

Additionally, a vsetvl instruction can be fused with the following vector instruction into a single internal vector microop. From the RVV 1.0 spec:

The primary motivation for the vtype CSR is to allow the vector instruction set to fit into a 32-bit instruction encoding space. A separate vset{i}vl{i} instruction can be used to set vl and/or vtype fields before execution of a vector instruction, and implementations may choose to fuse these two instructions into a single internal vector microop. In many cases, the vl and vtype values can be reused across multiple instructions, reducing the static and dynamic instruction overhead from the vset{i}vl{i} instructions. It is anticipated that a future extended 64-bit instruction encoding would allow these fields to be specified statically in the instruction encoding.

Additionally, when tuning the performance of muRISCV-NN kernels, it is important that vector instructions are correctly weighted according to their relative cost in actual implementations. For more info on an actual implementation example with some ballpark numbers, look here.

VWW model does not fit on Vicuna

This is just documentation that the VWW (visual wake words) integration tests on Vicuna are supposed to fail, due to the memory being too small to fit all the data.

We have to evaluate if increasing the memory size in the linker script is feasible without major changes to the RTL and CRT.

Self-contained releases

I tried running the provided scripts, but this proves quite difficult due to policies, proxies, etc. There are simply too many scripts downloading things from various places.
Is there a chance that, when you create a release, you could build a self-contained package that contains the necessary toolchains, simulators, etc.?

LLVM 14 has no support for Tail Undisturbed Intrinsics

In the innermost accumulation loops, we require the vadd and vmacc operations to be length agnostic in a tail-undisturbed way. This can be requested by appending _tu to the intrinsic names. However, the current LLVM 14 does not seem to support this; the _tu variants appear to have been removed from LLVM due to binary size and compilation time concerns, at least for now. The GCC vector fork we are using supports _tu just fine.

In order to stay compatible with both GCC and LLVM, we will resort to the vmacc intrinsic. Thankfully, vmaccs are tail undisturbed by default, so we can use them as they are. We are replacing vadds with vmaccs (multiplying by 1) for now, as sketched below. This has some slight performance implications, but it's the best we can do for now!
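
A simplified sketch of the workaround, using the old-style intrinsic names (not the actual kernel code):

#include <stddef.h>
#include <stdint.h>
#include <riscv_vector.h>

/* One accumulation step: acc[i] += x[i] for i < vl, with the tail
   elements of acc left undisturbed. */
vint32m8_t acc_step(vint32m8_t acc, vint32m8_t x, size_t vl) {
    /* What we would like to write (unsupported by LLVM 14):
       return vadd_vv_i32m8_tu(acc, acc, x, vl);                     */

    /* Workaround: vmacc's destination is also a source operand, so
       the tail elements of acc are undisturbed by construction.
       Multiplying by 1 turns the multiply-accumulate into a plain add. */
    return vmacc_vx_i32m8(acc, 1, x, vl);
}

The vmacc version issues a multiply per element, which is presumably the source of the slight tick difference measured below.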

TFLM's person_detection_benchmark mostly calls ds_conv and 1x1 conv. Tested with GCC. With "real" vadds:

WithPersonDataIterations(10) took 939686 ticks (939 ms)
NoPersonDataIterations(10) took 939673 ticks (939 ms)

With vmaccs simulating vadds:

WithPersonDataIterations(10) took 945601 ticks (945 ms)
NoPersonDataIterations(10) took 945601 ticks (945 ms)

LLVM 14 with vmaccs simulating vadds:

WithPersonDataIterations(10) took 846704 ticks (846 ms)
NoPersonDataIterations(10) took 846703 ticks (846 ms)

Fix broken compatibility with upstream TFLM

This week the TFLM integration test broke due to upstream changes. (https://github.com/tum-ei-eda/muriscv-nn/actions/runs/4311825732)

TODO:

  • Fix compatibility
  • Instead of only running that CI job against the latest upstream codebase, I would propose deploying a "matrix" of two runs:
    • Test vs. upstream: detects if something breaks due to upstream changes.
    • Test vs. a fixed commit chosen by us: this way we find out if something broke on our side.

The fixed commit should be (automatically) updated to the latest commit which is known to be compatible.

Refactor Unit Tests (Fix OVPSim)

In #21 there have been issues with the unit tests. Since the timeout only occurs for the first test, it seems like a server-side issue.

Proposed solution:

  • Try to add delays to make sure that the OVPSim server does not get too many requests at the same time
  • Split the Spike and OVPSim unit test workflows into two separate runs using a CI matrix, which should help tell whether a problem is simulator-related or not.

This depends on making the provided scripts for unit testing more flexible, which is currently being tracked in #15.

ARM CMSIS-NN / TFLite Rounding Mode

It's extremely hard (or at least very verbose) to reproduce the bit-exact rounding behaviour of ARM's CMSIS-NN library. This is because their integer C implementation mimics ARM instructions. Take the arm_nn_requantize() function as an example: it in turn calls arm_nn_doubling_high_mult_no_sat() and arm_nn_divide_by_power_of_two(). These functions translate, more or less, directly into ARM instructions when using the ARM vector extension (Helium / MVE). However, I was unable to reproduce the behaviour using RISC-V vector instructions and the available RISC-V vector rounding modes. In about 5% of the test results, I am off by one bit. This is, as far as I can judge, due to the different rounding behaviour.
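
To make this concrete, here is my simplified reading of the scalar reference semantics (hypothetical helper names; the real CMSIS-NN helpers additionally handle saturation and sign-dependent rounding nudges, which is exactly where the one-bit differences can creep in):

#include <stdint.h>

/* High 32 bits of the doubled 64-bit product, with round-half-up. */
static int32_t doubling_high_mult(int32_t a, int32_t b) {
    int64_t prod = (int64_t)a * (int64_t)b;
    return (int32_t)((prod + (1LL << 30)) >> 31);
}

/* Arithmetic right shift with round-to-nearest (assumes exponent >= 1). */
static int32_t divide_by_power_of_two(int32_t x, int32_t exponent) {
    int32_t round = (int32_t)1 << (exponent - 1);
    return (x + round) >> exponent;
}

/* Requantize val by (multiplier * 2^shift). The positive-shift branch is
   a simplification; the real code handles left shifts separately. */
static int32_t requantize(int32_t val, int32_t multiplier, int32_t shift) {
    int32_t v = doubling_high_mult(val, multiplier);
    return (shift < 0) ? divide_by_power_of_two(v, -shift) : v << shift;
}

Note that the RISC-V vector fixed-point rounding modes (vxrm: round-to-nearest-up, round-to-nearest-even, round-down, round-to-odd) presumably do not exactly match the sign-dependent rounding nudge used by the reference implementation in all cases.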

Some similar issues were faced by TVM, see here, here, and here. It appears that they have not yet solved the issue.

This recent PR in CMSIS-NN, created in response to this PR in TF, has made the whole thing even more interesting. More links with similar content: the ruy matrix multiplication library PR, and a TF issue on this.

I will need to dig into this rabbit hole some more. But the way the rounding is currently implemented in muRISCV-NN using the vector intrinsics is far from optimal, in terms of both readability/maintainability and performance!

However: how important is it actually that our kernels are bit-exact with the CMSIS-NN kernels? According to this comment, it appears that it is not that critical.
