vproc / vicuna

RISC-V Zve32x Vector Coprocessor

License: Other

SystemVerilog 39.06% Makefile 2.52% Tcl 0.69% C++ 0.73% Assembly 56.61% C 0.40%

risc-v vector-processor coprocessor systemverilog

vicuna's Introduction

Vicuna - a RISC-V Zve32x Vector Coprocessor

Vicuna is an open-source 32-bit integer vector coprocessor written in SystemVerilog that implements version 1.0 of the RISC-V "V" Vector extension specification. More precisely, Vicuna complies with the Zve32x extension, a variant of the V extension aimed at embedded processors that do not require 64-bit elements or floating-point support (see Sect. 18.2 of the specification for details). As such, Vicuna supports vector element widths of 8, 16, and 32 bits and implements all vector load and store, vector integer¹, vector fixed-point, vector integer reduction, vector mask, and vector permutation instructions.

Vicuna is a coprocessor and thus requires a main processor to function. It uses the OpenHW Group's CORE-V eXtension Interface as the interface to the main core. Currently, a modified version of the Ibex core or the CV32E40X core serves as the main core. Support for further RISC-V CPUs is under development.

Vicuna is extensively configurable. For instance, the width of the vector registers, the number and layout of its execution pipelines and the width of its memory interface are configurable. The following figure gives a high-level overview of Vicuna.

Vicuna is under active development, and contributions are welcome!

Documentation

A high-level user guide for using Vicuna can be read online at ReadTheDocs.

Publication

If you use Vicuna in academic work, please cite our publication:

@InProceedings{platzer_et_al:LIPIcs.ECRTS.2021.1,
  author =  {Platzer, Michael and Puschner, Peter},
  title =   {{Vicuna: A Timing-Predictable RISC-V Vector Coprocessor for Scalable Parallel Computation}},
  booktitle =   {33rd Euromicro Conference on Real-Time Systems (ECRTS 2021)},
  pages =   {1:1--1:18},
  series =  {Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =    {978-3-95977-192-4},
  ISSN =    {1868-8969},
  year =    {2021},
  volume =  {196},
  editor =  {Brandenburg, Bj\"{o}rn B.},
  publisher =   {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address = {Dagstuhl, Germany},
  URL =     {https://drops.dagstuhl.de/opus/volltexte/2021/13932},
  URN =     {urn:nbn:de:0030-drops-139323},
  doi =     {10.4230/LIPIcs.ECRTS.2021.1},
  annote =  {Keywords: Real-time Systems, Vector Processors, RISC-V}
}

Getting Started

This repository uses submodules. After cloning the repository, run the following command in the top directory to initialize the submodules:

git submodule update --init --recursive

Compiling programs

The sw/ subdirectory contains utilities for generating programs that can be executed on Vicuna.

Simulation

The sim/ subdirectory contains scripts for simulating Vicuna with either Verilator, xsim (the default simulator in Vivado), or Questasim. For Verilator, version 4.210 or newer is required.

Synthesis

The demo/ subdirectory contains a minimalist demo design for Xilinx FPGAs.

Configuration

Vicuna allows for extensive parametrization. In particular, the width of the vector registers, of the memory interface, and of the datapaths of the functional units can be configured independently.

License

Unless otherwise noted, everything in this repository is licensed under the Solderpad Hardware License v2.1, a permissive free software license that is based on the Apache-2.0 license.

The Ibex core (included in this repository as a submodule) is licensed under the Apache License, see the Ibex repository for details.

The CV32E40X core (included in this repository as a submodule) is licensed under the Solderpad Hardware License, see the CV32E40X repository for details.

¹ Currently, the vector integer divide instructions (i.e., vdiv, vdivu, vrem, and vremu) are still missing.


vicuna's Issues

Question about implemented codes on Vicuna

Hi @michael-platzer , @stevobailey and @moimfeld ....
In demo project you have added this test code:

#include <uart.h>
int main(void) {
    uart_puts("Hello world!\n");
    return 0;
}

I ran this code on the demo project a while back, as you may remember, but I forgot to ask: Tera Term displays Hello world! continuously, as if the code were inside a while(1) loop (as in the picture below). Why is that? I expected to see a single Hello world!

Even now, when I run my vector code, I see continuous prints of "pass" (I check whether my code fails or succeeds), as if everything were inside an implicit while(1) loop. Because of this, the timer I added never stopped, so I put an empty while(1) at the end of my code to force the program to stop there and let me measure the run time. Can you clarify?

About test file under test/kernel folder

Hello,
I have found out that there is no test data in conv_3x3.S file. Does it mean that it is still under development or it can work properly now? I have run the program with my own data but didn't receive expected result. If it can work properly now, could you please offer your test data?

Thank you

Updates for newer verilator

Which version of verilator are you using? With 4.210, I need to make a couple of changes to verilator_main.cpp for it to work. First, I include the following:

#include "Vvproc_top___024root.h"

I also have to change the following line:

vicuna/sim/verilator_main.cpp

Line 239 in beb86f9

 top->rst_ni, top->mem_req_o, top->mem_addr_o, top->vproc_top__DOT__v_core__DOT__vreg_rd_hazard_map_q, top->vproc_top__DOT__v_core__DOT__vreg_wr_hazard_map_q, 0); 

to become (adds rootp in the hierarchy):

top->rst_ni, top->mem_req_o, top->mem_addr_o, top->rootp->vproc_top__DOT__v_core__DOT__vreg_rd_hazard_map_q, top->rootp->vproc_top__DOT__v_core__DOT__vreg_wr_hazard_map_q, 0);

About MakeFile and Linker

Hi @michael-platzer I have tried to compile 2 simple CNN codes with your Makefile and link.ld file and boot them to Bram of my board using bootserdow . After running make -f /path/to/vicuna/sw/Makefile PROG=test OBJ=test.o for the first program I got the following error:

I searched about the error one solution was adding this code:

void *memcpy(void *dest, const void *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        ((char*)dest)[i] = ((char*)src)[i];
    }
    return dest; /* memcpy must return the destination pointer */
}

I did that, the error went away, and the output is correct. Is this the only way, meaning I should always add this code at the beginning of all my programs, or is there a better way? And, most importantly, what is this error?
Oddly, I also found the following code, but its output differs. This one's output is wrong!

void * memcpy ( void * destination, const void * source, int num ){
  int i=0;
  *((int*)destination) = *((int*)source);
}

test.md
I had to change the format from .c to .md in order to send it...
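
A note on the two memcpy variants above. The original error message is not reproduced here, so this assumes it was an undefined reference to memcpy. GCC may emit calls to memcpy on its own (for example for struct copies or array initializers), and its documentation expects even a freestanding environment to provide memcpy, memmove, memset and memcmp. The second variant above copies a single int no matter what num is, which would explain the wrong output. A minimal sketch of a byte-wise copy with the standard contract follows; the name vicuna_memcpy is hypothetical and only chosen so the example can be compiled next to a hosted C library, whereas in a bare-metal build the function would be named memcpy.

```c
#include <stddef.h>

/* Byte-wise copy with the same contract as the standard memcpy:
 * copies n bytes from src to dest (the regions must not overlap)
 * and returns dest. The name is hypothetical, see the note above. */
void *vicuna_memcpy(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    for (size_t i = 0; i < n; i++) {
        d[i] = s[i];
    }
    return dest;
}
```

Placing such a function in one source file that is linked into every program avoids pasting it at the top of each test. At higher optimization levels GCC can recognize a copy loop and replace it with a call to memcpy; if that happens inside memcpy itself, building that one file with -fno-tree-loop-distribute-patterns is a commonly used workaround.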

[Question] Out-of-order result transaction

Hi @michael-platzer,

I have a few questions regarding an observation I made:

Observation

While simulating vicuna I found that some instruction sequences (for example vle8_8.S) have out-of-order (OoO) result transactions. I know that this is allowed by the x-interface, but I am not sure whether this is intended on vicuna.

The figure below shows the waveforms of the x-interface while executing vle8_8.S. It can be seen that the third offloaded vector instruction (vmv.v.v v2, v0) with id = 2 has its result transaction before the second offloaded vector instruction (vle8.v v0, (a0)) with id = 1. The same is true for the fourth and fifth instructions.

How to replicate

The following command and configuration was used (and the simulation completed successfully):

make lsu/vle8_8 SIMULATOR=questa COMPILER=llvm
[CONFIG ] lsu/vle8_8 VREG_W=128  VMEM_W=32   VMUL_W=32
[SUCCESS] lsu/vle8_8/vle8_8                    293 cycles (       12 -       305)

Additionally the ram_type was set to RAM_ASIC.

Questions

Is this behaviour expected (I mean OoO result transactions)?
If this is not expected, can you replicate the observation stated above (just to be sure that the issue was not introduced by my asic vector register)?
I thought vle8.v v0, (a0) followed by vmv.v.v v2, v0 would cause a data hazard (i.e. hindering the move instruction from executing before the load instruction has completed), why does this not happen here?

Unaligned data error

Can Vicuna handle misaligned data? When simulating an 8-bit vector add using Spike and unaligned data, it works correctly. But when the same program is run on Vicuna RTL, it fails. I think it should work even on misaligned data.

I can recreate this by running your vector add test (alu/vadd_8.S), but adding an offset to the data:

...
    .data
    .align 10
    .global vdata_start
    .global vdata_end
    .byte           0
vdata_start:
    .word           0x323b3f47
    .word           0x47434b3a
    .word           0x302f2e32
...

[CONFIG ] alu/vadd_8 VREG_W=128  VMEM_W=32   VMUL_W=32
[ ERROR ] alu/vadd_8/vadd_8                    607 cycles (       12 -       619)
incorrect memory content; diff:
--- vadd_8.ref.vmem
+++ vadd_8.dump.vmem
@@ -1,7 +1,7 @@
 333c4048
 48444c3b
 31302f33
-e9414b52
+e8414b52
 3f44383b
 37424d54
 5e4b5049
make: *** [Makefile:64: alu/vadd_8] Error 1

Coprocessor stalls indefinitely if result transactions are not accepted immediately

Hi @michael-platzer

This issue might not be reproducible without the UVM environment. It is planned to open-source the environment in the next week.

UVM environment

As discussed yesterday, I am in the process of setting up a UVM environment to verify Vicuna. The environment drives the core-side signals of the x-interface channels and therefore "emulates" a core. Any handshake signal controlled by the environment can be configured to have a random delay. Here are a few examples of what this means:

commit_valid can be configured to have a random delay (in clock cycles) w.r.t. the issue handshake
mem_ready can be configured to have a random delay (in clock cycles) w.r.t. the assertion of the mem_valid signal
mem_result_valid can be configured to have a random delay (in clock cycles) w.r.t. the memory request/response handshake

(This is not a complete list of all signals with random delay)

Note: Even though there is random delay on certain transactions, the core-side is strictly in-order. So no transaction initiated by the environment is OoO.

Issue

When turning on random delay for the result_ready signal of the result interface, the coprocessor stalls indefinitely when result transactions are not immediately accepted. Below you can find a picture of the x-interface signals. After 590 ns the coprocessor stalls indefinitely. I have not further investigated this observation.

Problematic Instruction sequence

#  ------------------------------------------------------------
# |
# |
# |	Next Instruction Sequence Info
# |
# | 		Number of Instructions:           7
# |
# | 		 1. Instruction: 	0002f2d7
# | 		 2. Instruction: 	02050007
# | 		 3. Instruction: 	5e000157
# | 		 4. Instruction: 	0002f2d7
# | 		 5. Instruction: 	00058107
# | 		 6. Instruction: 	0002f2d7
# | 		 7. Instruction: 	02050127
# |
# |
#  ------------------------------------------------------------

This sequence corresponds to the following assembly (vle8_8.S), where only the vector instructions are offloaded:

    la              a0, vdata_start     
    li              t0, 16              
    vsetvli         t0, t0, e8,m1,tu,mu
    vle8.v          v0, (a0)
    vmv.v.v         v2, v0
    vsetvli         t0, t0, e8,m1,tu,mu
    vle8.v          v2, (a1), v0.t
    li              t0, 16              
    vsetvli         t0, t0, e8,m1,tu,mu
    vse8.v          v2, (a0)
    la              a0, vdata_start     
    la              a1, vdata_end       
    j               spill_cache

Expected execution (where random `result_valid` delay is disabled)

LMUL is global or per register group?

I'm unsure if the following code is legal or not.

I am attempting to add reduce a vector group down to a single element. I get something like below when compiling with GCC:

# initialize v1 to 32-bit element, LMUL=1 value 0
0107f7d7            vsetvli a5,a5,e32,m1,tu,mu
5e0030d7            vmv.v.i v1,0

# now add reduce v4, which is also 32-bit elements but has LMUL=4, into v1
0127f757            vsetvli a4,a5,e32,m4,tu,mu
0240a0d7            vredsum.vs  v1,v4,v1

According to the 0.10 spec:

When LMUL=4, the vector register group contains four vector registers, and instructions specifying an LMUL=4 vector register
group using vector register numbers that are not multiples of four are reserved.

So, since LMUL=4, the source v1 and destination v1 are both flagged by Vicuna as invalid registers, since v1 is not a multiple of 4. But this is a strange case. While LMUL=4 for v4, LMUL=1 for v1. This assumes we can associate different LMUL values for different registers/register groups. So there's no fundamental reason why this instruction is invalid.

I do not know if this is valid or not. Spike correctly simulates this. But LLVM compiles using v8 and v16 instead of v1 and v4, so that avoids this issue. However, LLVM produces the following code elsewhere, which is the same problem. Here, I load an 8-bit vector into v9 using LMUL=1, then load a 16-bit vector into v10 using LMUL=2. Then I sign extend the 8-bit register into 16 bits, from v9 to v12. But LMUL=2 for this instruction, so Vicuna flags v9 as an invalid register.

040576d7            vsetvli a3,a0,e8,m1,ta,mu
02058487            vle8.v  v9,(a1)
049576d7            vsetvli a3,a0,e16,m2,ta,mu
02065507            vle16.v v10,(a2)
4a93a657            vsext.vf2 v12,v9
eec52857            vwmul.vv  v16,v12,v10
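
For reference, the alignment rule quoted above can be written as a small check. This is only a sketch under one reading of the v1.0 specification, not Vicuna's decoder logic: a register group of EMUL registers must start at a register number that is a multiple of EMUL; a reduction keeps its scalar operand (vs1) and its result (vd) in element 0 of a single register, so those two operands are checked as if EMUL were 1; and the source of vsext.vf2 has EMUL = LMUL/2. The function names are hypothetical.

```c
#include <stdbool.h>

/* Returns true if vector register number vreg (0..31) is a legal base
 * register for a group of emul registers. emul must be 1, 2, 4 or 8;
 * a fractional EMUL still occupies one register and is passed as 1. */
bool vreg_group_aligned(unsigned vreg, unsigned emul)
{
    return vreg < 32 && (vreg % emul) == 0;
}

/* EMUL of the source operand of vsext.vf2 / vzext.vf2: half of the
 * destination LMUL, clamped to 1 for the register-number check. */
unsigned ext_vf2_source_emul(unsigned lmul)
{
    return lmul >= 2 ? lmul / 2 : 1;
}
```

Under these assumptions, v1 is a valid scalar operand and destination for vredsum.vs at LMUL=4 (checked with EMUL 1) while v4 is a valid LMUL=4 source group, and v9 is a valid vsext.vf2 source at LMUL=2 because its EMUL is 1.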

Demo Project

Hi @michael-platzer, about your demo project: I couldn't generate it via the .tcl files, probably because my Vivado installation has problems, so I manually added the required RTL files. My board is a Nexys 4 DDR and it doesn't have a differential clock as assumed in the top module of the demo. What can I do? Besides, I built all of the tests in the sim directory by running make and I have all of the .vmem files. As a start, I want to test one of them in this demo project by adding a .mem file and getting a simple output. I am somewhat familiar with Ibex, so I know how to do that, but after generating the .bit file, what should I expect as output? I installed Tera Term (https://osdn.net/projects/ttssh2/releases/) because I saw UART support, but I still don't know what will happen after running, let's say, alu/vadd_8.vmem on the core. Can you clarify? Sorry if this is too basic, I just started the project.

Error in makefile in test directory

Hi I am working on Ibex and I want to attach vicuna as co processor to it... I managed to install rvv branch of gcc but when I run make alu/vadd_8 I get this error below:

How can I solve it?

Thank you in advance...

Illegal instruction?

I compiled some code and ended up with the following instruction. It simulates in Spike as expected, but Vicuna returns an illegal instruction for it:

c6a40457            vwredsum.vs v8,v10,v8

When manually decoded, I think it looks like:

[6:0]   OP-V   = 1010111 (0x57 = vector arithmetic instruction)
[11:7]  vd     =   01000 (0x08 = v8)
[14:12] funct3 =     000 (0x00 = OPIVV)
[19:15] vs1    =   01000 (0x08 = v8)
[24:20] vs2    =   01010 (0x0A = v10)
[25]    vm     =       1 (0x01 = unmasked)
[31:26] funct6 =  110001 (0x31 = vwredsum)

This looks good to me, though I'm no expert on the RISC-V vector extension. I decoded this using the 1.0 spec, but I assume this is the same in version 0.10.

At a high level, this is just part of an add reduction function. I'm using Vicuna to sum all the elements of a vector.
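
The manual decoding above can be double-checked mechanically. The following sketch extracts the same bit fields from a 32-bit instruction word using the field positions listed in the issue; the type and function names are hypothetical, and the code only splits the word into fields without judging whether the instruction is legal.

```c
#include <stdint.h>

/* Bit fields of an OP-V (vector arithmetic) instruction word. */
typedef struct {
    unsigned opcode; /* bits  6:0  */
    unsigned vd;     /* bits 11:7  */
    unsigned funct3; /* bits 14:12 */
    unsigned vs1;    /* bits 19:15 */
    unsigned vs2;    /* bits 24:20 */
    unsigned vm;     /* bit  25    */
    unsigned funct6; /* bits 31:26 */
} opv_fields_t;

/* Split a 32-bit instruction word into its OP-V fields. */
opv_fields_t decode_opv(uint32_t insn)
{
    opv_fields_t f;

    f.opcode = insn & 0x7f;
    f.vd     = (insn >> 7)  & 0x1f;
    f.funct3 = (insn >> 12) & 0x07;
    f.vs1    = (insn >> 15) & 0x1f;
    f.vs2    = (insn >> 20) & 0x1f;
    f.vm     = (insn >> 25) & 0x01;
    f.funct6 = (insn >> 26) & 0x3f;
    return f;
}
```

Calling decode_opv(0xc6a40457) yields opcode 0x57, vd 8, funct3 0, vs1 8, vs2 10, vm 1 and funct6 0x31, which agrees with the table above.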

Simulation fails when memory width is wider than 32 bits without a DCACHE

I changed the alu/test_configs.conf file to

VREG_W=128  VMEM_W=64   VMUL_W=64

and now simulation fails. I believe the data cache is performing a width conversion to reduce a wide Vicuna data port into a 32-bit port for arbitration with the Ibex instruction port. So when you remove the data cache, you cannot have a wide Vicuna data port.

Can you add support for a wide Vicuna data port without a data cache? Or, at the very least, check the input parameters and error when using a wide Vicuna data port without a data cache?

[CONFIG ] alu/vxor_8 VREG_W=128  VMEM_W=64   VMUL_W=64
[ ERROR ] alu/vxor_8/vxor_8                    527 cycles (       12 -       539)
incorrect memory content; diff:
...
@@ -1,7 +1,7 @@
 6861651d
-1d191160
+47434b3a
 6a757468
-b21a100b
+e8404a51
 3f44383b
 37424d54
 5e4b5049
make: *** [Makefile:35: alu/vxor_8] Error 1

Module parameters should have default values

Not sure if this is a Verilog requirement or just good practice, but I'm having issues because not all module parameters have default values. Can you add them? The default can be 0, which may fail gracefully if not overwritten.

vproc_alu.sv
vproc_elem.sv
vproc_lsu.sv
vproc_mul_block.sv
vproc_mul.sv
vproc_sld.sv
vproc_vregfile.sv
vproc_vregpack.sv
vproc_vregunpack.sv

X prop issue

vicuna/rtl/vproc_lsu.sv

Line 798 in 224f33a

.deq_data_o ( {rdata_off_d, rmask_buf_d, deq_state} ),

I see an X prop issue that starts from here. The deq_state is initialized to X because the queue data are not reset. This eventually drives the Ibex CPI response to X, and Ibex flags this as an invalid instruction. This occurs on the first instruction issued by Ibex to Vicuna. When I initialize the queue data to 0 on reset, my test passes. Thoughts?

Intrinsic Code on Vicuna

Hi @michael-platzer and @stevobailey I hope you are doing well... I have been working on RiscV V ISA and its intrinsic and I want to start coding in intrinsic level on Vicuna ..
I have written this simple vector add code:

#include <uart.h>
#include <riscv_vector.h>
#include <stddef.h>

void* memcpy(void* dest, const void* src, size_t n)
{
	for (size_t i = 0; i < n; i++)
	{
		((char*)dest)[i] = ((char*)src)[i];
	}
	return dest; /* memcpy must return the destination pointer */
}

int compare(int* ref, int* actual, int n) {
	int r;
	for (int i = 0; i < n; ++i) {
		if (ref[i] - actual[i] == 0) {
			r = 1;
		}
		else {
			r = 0;
			break;
		}
	}
	return r;
}

// index arithmetic
void add(int* a, int* b, int* c, int n) {
	for (int i = 0; i < n; ++i) {
		a[i] = b[i] + c[i];
	}
}

void vec_add(int* a, int* b, int* c, int n) {
	while (n > 0) {
		size_t vl = vsetvl_e32m1(n);
		vint32m1_t vb = vle32_v_i32m1(b, vl);
		vint32m1_t vc = vle32_v_i32m1(c, vl);
		vint32m1_t va = vadd_vv_i32m1(vb, vc, vl);
		vse32_v_i32m1(a, va, vl);
		a += vl;
		b += vl;
		c += vl;
		n -= vl;
	}

}

int main(void) {
	// data
	int a[31] = { 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31 };
	int b[31] = { 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31 };

	// compute
	int result[31];
	int vecresult[31];
	add(result, a, b, 31);
	vec_add(vecresult, a, b, 31);

	// compare
	uart_puts(compare(vecresult, result, 31) ? "pass" : "fail");
	return 0;
}

All I want to do is check whether the scalar add and the vector add using the V extension give the same result.
When I compile the code, it gives warnings about pointers in the vector add function but no errors. But when I run it on the board, the UART doesn't give any output, not even "fail".
Do you know what the problem is?

LSU transaction ID width

vicuna/rtl/vproc_core.sv

Line 765 in 66ed264

logic lsu_trans_complete_id;

I think this signal should be XIF_ID_W bits wide, not 1 bit wide.

Combinational loop

There's a combinational loop when using the Ibex core:

vproc_top.vect_instr_gnt
vproc_top.vect_instr_commit
vproc_top.v_core.queue_push
vproc_top.v_core.dec_ready

Line numbers:
https://github.com/vproc/vicuna/blob/main/rtl/vproc_top.sv#L172
https://github.com/vproc/vicuna/blob/main/rtl/vproc_top.sv#L335
https://github.com/vproc/vicuna/blob/main/rtl/vproc_core.sv#L263
https://github.com/vproc/vicuna/blob/main/rtl/vproc_core.sv#L269
https://github.com/vproc/vicuna/blob/main/rtl/vproc_core.sv#L279
https://github.com/vproc/vicuna/blob/main/rtl/vproc_core.sv#L227

Vicuna I$ compilation error when Memory Width and Way Length are same

The I$ is configured with a size of 4 KB. If the parameter WAY_LEN is 128 bits and MEM_BYTE_W is also configured as 128 bits, there is a compilation error:
parameter int unsigned MEM_BYTE_W = 4, // Memory data width (bytes)
parameter int unsigned WAY_LEN = 256 // Cache way length (lines)

Compilation Error
/data/shared/mulberry/users/kwali/nna_icache_exp/nna/external/vicuna/rtl/vproc_cache.sv, 45
"val"
Packed union members must have same size.
Member "val" has different size (1 bits) from next member (3 bits).

Syntax error - vcore_xif

vicuna/rtl/vproc_top.sv

Line 180 in 66ed264

  logic [vcore_xif.X_ID_WIDTH-1:0] cpi_instr_id_q, cpi_instr_id_q2, cpi_instr_id_d; 

A commercial linting tool flags this as an error. Can you make X_ID_WIDTH a localparam, then pass it to vcore_xif and use it here and on line 196?

Thanks!

Bitwise OR of different signal widths

vicuna/rtl/vproc_hazards.sv

Line 203 in 691857c

UNIT_ALU: masked = mode_i.alu.masked | mode_i.alu.op_mask;

Is this correct? mode_i.alu.op_mask is 2 bits while the other masked signals are only 1 bit.

Aligning memory requests with Ibex

I need Vicuna + Ibex to support unaligned memory requests, but I do not need the predictable timing that Vicuna guarantees. Can you let me know if the following proposal will work? Assume a memory width of 32 bits.

If I pass 0 to the ADDR_ALIGNED parameter, then Vicuna memory requests will output the full address. I will write an adapter that aligns the memory data to the address. So unaligned reads will read two words and shift them down to become aligned. Unaligned writes will write two words by shifting up the write data and byte enable. I can pipeline this to reduce the latency overhead. Do you see an issue with this?

Does Vicuna expect all read data to be shifted down? For example, if I read one byte at offset 0x1, will Vicuna expect this to be in the lowest byte of rdata?
Does Vicuna expect all reads to be packed? If I read two bytes at offset 0x3, these two bytes span two different words. Will Vicuna expect one valid rdata word with both bytes in the lowest two bytes? Or will it expect two valid rdata words, with each byte in its respective lowest byte?
Does Vicuna align write data? If it writes one byte at offset 0x1, will it place this data in the second byte of the word? Or will it keep it in the lowest byte of the wdata word?
Does Vicuna pack writes? If it writes two bytes at offset 0x3, will it send two write requests at the two different addresses, or will it send one write request with aligned write data?
Same questions as 3 and 4 for the byte enable signal.
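
To make the proposal concrete, here is a small software model of the adapter described above. It is a sketch only and says nothing about what Vicuna actually expects on its memory port: it assumes a 32-bit little-endian memory, a read that fetches the two aligned words around the unaligned address, and a write that is split into two aligned beats. All names are hypothetical.

```c
#include <stdint.h>

/* Read path: lo is the aligned word containing the first byte, hi the
 * next aligned word. Returns the 32-bit value that starts at byte
 * offset off (0..3) within lo, shifted down to bit 0. */
uint32_t align_read32(uint32_t lo, uint32_t hi, unsigned off)
{
    if (off == 0) {
        return lo; /* avoids an undefined shift by 32 below */
    }
    return (lo >> (8 * off)) | (hi << (32 - 8 * off));
}

/* Two aligned write beats: data and byte enables for the word at the
 * aligned address (index 0) and for the following word (index 1). */
typedef struct {
    uint32_t data[2];
    uint8_t  be[2];
} split_write_t;

/* Write path: shift the write data and the 4-bit byte enable up by
 * off bytes and split the result across the two aligned words. */
split_write_t split_write32(uint32_t wdata, uint8_t be, unsigned off)
{
    split_write_t w;
    uint64_t d = (uint64_t)wdata << (8 * off);
    unsigned b = (unsigned)(be & 0xf) << off;

    w.data[0] = (uint32_t)d;
    w.data[1] = (uint32_t)(d >> 32);
    w.be[0]   = (uint8_t)(b & 0xf);
    w.be[1]   = (uint8_t)((b >> 4) & 0xf);
    return w;
}
```

In this model a second beat is only needed when be[1] is non-zero, so aligned accesses cost no extra memory request.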

Used before declaration

vicuna/rtl/vproc_alu.sv

Line 524 in 66ed264

(state_init.first_cycle & state_init_masked & vreg_pend_wr_q[0]);

Here, state_init_masked is used before its declaration on line 562. Can you move the declaration above line 524?

Thanks!

Attaching DDR as external memory

Hi @michael-platzer, I have another question. I have read your paper (https://publik.tuwien.ac.at/files/publik_296583.pdf) multiple times, and Figure 2 shows Ibex and Vicuna with both data and instruction caches and an external memory. Currently, the demo project uses a RAM built from FPGA BRAM, and no caches are enabled by default. I would like to replace the BRAM with the DDR2 memory of my Nexys 4 DDR board so that I can run much larger applications such as CNNs. I am somewhat familiar with the MIG IP of Vivado and have instantiated it, but attaching it to the core as external memory and, most importantly, filling it with a .vmem file is a problem (unlike MicroBlaze, there is no bootloader for Ibex). Since you have done this, can you help? Would it be possible to add an example to this repo as well? I don't think anyone has raised this before, and I have not found anything useful. I really tried, but initializing the DDR with a .vmem file is the part I cannot figure out (as you know, $readmemh cannot be used to fill DDR with data). Here is my discussion with the Ibex developers (I used my other GitHub account):
https://github.com/lowRISC/ibex/issues/1466
Thank you...

LLVM Installation Error

Hi @moimfeld I have noticed that you added llvm installation in sw directory... So, I tried to make llvm with make llvm LLVM_DIR=/opt/llvm but first I got cmake version error (I had old version of cmake but llvm installation required newer version)... After I updated my cmake version I tried again and this time I get this error:

HEAD is now at 5177676... Updated MLIR type stubs to work with pytype
/bin/sh: 5: cmake: not found
/home/kiian/Desktop/VicunaRepos/vicuna/sw//toolchain.mk:22: recipe for target 'llvm' failed
make: *** [llvm] Error 127

It's weird because I have cmake, and when I run cmake --version the terminal responds with this:

cmake version 3.22.3

CMake suite maintained and supported by Kitware (kitware.com/cmake).

I have ubuntu 16.04 on vmware. Can you help me solve this problem? I tried more than couple of times and I can't solve it

Coprocessor stalls indefinitely if core issues random memory result transactions

Hi @michael-platzer,

Issue

The x-interface specification states:

Note that a coprocessor shall be able to tolerate memory result transactions for which it did not perform the corresponding memory request handshake itself.

It might be beneficial to understand/ask what the reason behind this rule is before addressing this issue. If you don't know the reason then I can open an issue on the x-interface repository.

My UVM environment can be configured to generate random memory result transactions (i.e. asserting mem_result_valid with an unrelated id and random data / exc / dbg signals). At the moment there is a problem with generating an "unrelated" id because the memory request transaction does not have a defined id (see #61).

Still, what I find is that the coprocessor stalls indefinitely when it is "bombarded" with random unrelated memory transactions. You can see the coprocessor stall in the screenshot below.

Problematic Instruction Sequence

The problematic Instruction Sequence is the same as in #59

Asynchronous reset in logic input path of data in queue

vicuna/rtl/vproc_queue.sv

Lines 27 to 39 in 4947abf

always_ff @(posedge clk_i or negedge async_rst_ni) begin
    if (~async_rst_ni) begin
        rd_pos  <= '0;
        wr_pos  <= '0;
        last_wr <= '0;
    end
    else if (~sync_rst_ni) begin
        rd_pos  <= '0;
        wr_pos  <= '0;
        last_wr <= '0;
    end else begin
        if (enq_ready_o & enq_valid_i) begin
            data[wr_pos] <= enq_data_i;

There's an asynchronous reset issue since data is assigned here but not reset in either of the blocks above. This puts the asynchronous reset into combinational logic for the input of data.

The assignment to data should either

be in a separate always_ff block with just clk_i in the sensitivity list.
be reset in the asynchronous reset if statement above, line 28.

Illegal instruction?

My compiler produced the following instruction, which Vicuna reports as illegal. It simulates correctly in spike, though:

42802557 vmv.x.s a0,v8

Scalar core stalling logic is incorrect?

From my understanding, the memory arbiter stalls requests from the scalar core if the vector core has outstanding loads or stores. But I do not think this stalling is correct. Consider the following assembly code:

    .text
    .global main
main:
    li              a0, 64
    la              a1, vdata_start
    la              a2, vdata_mid
    la              a3, vref_end

    vsetvli         a4,a0,e32,m8,tu,mu
    vle32.v         v8,(a1)
    vle32.v         v16,(a2)
    vadd.vv         v8,v8,v16
    vse32.v         v8,(a3)

    lw              a4, 0(a3)

    la              a0, vdata_start
    la              a1, vdata_end
    j               spill_cache

    .data
    .align 10
    .global vdata_start
    .global vdata_mid
    .global vdata_end
vdata_start:
    .rept 64
    .word           0x323b3f47
    .endr
vdata_mid:
    .rept 64
    .word           0x47434b3a
    .endr
vdata_end:

    .align 10
    .global vref_start
    .global vref_end
vref_start:
    // not correct, but we don't care for this test
    .rept 64
    .word           0xe2fa599a
    .endr
vref_end:

It sets the vector length to 64. It loads 2 vectors, each with 64 32b entries. Then it adds them and stores the result. Then the scalar core immediately tries to read the first word in the result. This read should be stalled until the vector core completes its store instruction. But the waveform shows otherwise.

Higher resolution:

Arrow points to Ibex memory read, which occurs too early:

The top group is the data memory interface coming from the scalar core (Ibex). The middle group is the data memory interface coming from the vector core (Vicuna). You can see the memory read from Ibex is granted before Vicuna starts loading its vectors.

I believe this is because pending_store_o is not registered, so it is low between the time Vicuna receives the store instruction and the time it actually starts executing it. You can see pending_store_o in the bottom group of the above image.

The assembly code above should work in your test framework, though it will fail. But you can run it and generate a waveform to reproduce this issue.

Issue executing vloxei8.v

Hi @michael-platzer

Issue

I am mostly done with setting up the simulation environment for Questasim. I can run most test cases successfully except the vloxei8.v instructions.

The issue arises when the vproc_tb is in the following configuration:

MEM_W=32
MEM_SZ=262144
MEM_LATENCY=5
VREG_W=512
VMEM_W=256
VMUL_W=128
VALU_W=128
VSLD_W=128
ICACHE_SZ=8192
ICACHE_LINE_W=128
DCACHE_SZ=65536
DCACHE_LINE_W=512

*this is the second configuration defined in the .config files in the
test directories (and the default values set in the sim Makefile)

while simulating the execution of the vloxei8.S code (Main core is ibex). The issue is that the coprocessor (and ibex) gets stuck.

Findings

The coprocessor completes five out of eight instructions (I concluded this because the ready/valid handshake of the result interface happens five times until the coprocessor gets stuck)
This means the coprocessor gets stuck during the following instruction: vloxei8.v v16, (a1), v8, v0.t
During the execution of this particular instruction the valid signal of the memory request/response channel stays high for 23 clock cycles. For the first 16 clock cycles a "valid" address is assigned to the mem_req.addr signal. Afterwards, the mem_req.addr signal goes to an undefined state (e.g. 32'hXXXXXXX0)
This "invalid" address causes unwanted behavior in the cache module (i.e. the cpu_gnt_o signal goes to 1'bX)
In total, the handshake of the memory request/response channel is completed (both valid and ready are high) 19 times during the execution of this instruction. Four times valid is high while ready is in an undefined state.

I have not further investigated this issue, but the problem is either:

the assignment of the valid signal of the memory request/response channel
the address calculation/assignment
- if the issue is in the address calculation, then there could also be an issue with the operands or the register file (e.g. my asic register file)

Question

Before I continue investigating, I would like to ask: Is this a known issue? If not, then I might have introduced it with my ASIC vector register or it is caused by using Questasim.

Coprocessor must set the associated id for each memory transaction

Hi @michael-platzer,

In order to be compliant to the x-interface, the coprocessor must set the associated id for each memory request transaction. This is not the case at the moment as you can see from the screenshot below (id stays "don't care", even during memory transactions).

Type issue in struct default assignment

In numerous places you assign a default of X, for example, in vproc_lsu.sv line 134:

            state_init_q <= '{busy: 1'b0, default: 'x};

From my understanding, this sets the busy member of state_init_q to 0 on reset and defaults all the other members to X for simulation. The issue is that some members of state_init_q are of an enum type (if you dive down into all the types). The enums have base type logic, which should be able to accept X, but for some reason I'm getting an error here when trying to assign X to an enum. I even added X as one of the enum entries, but it didn't help.

Do you need to default all these signals to X?

Reduction Sum Intrinsic

Hi @michael-platzer, @stevobailey and @moimfeld, I have a question about the reduction sum intrinsic function.
I have written the following code in plain C:

signed short int dense(signed char* inDense, signed char* wf, signed short int inDenseSize, signed short int biaseDense) {
	signed short int i;
	signed short int outDense = 0;
	for (i = 0; i < inDenseSize; i++) {
		outDense += inDense[i] * wf[i];
	}
	outDense += biaseDense;
	return outDense;
}

I want to convert it to a vectorized version. I have written the following for this purpose, but it seems to have a problem, since I cannot execute it on Vicuna.

signed short int vec_dense(signed char* in1, signed char* in2, signed short int n, signed short int bias) {
	signed short int out;
	size_t vlmax = vsetvlmax_e16m1();
	vint16m1_t vec_zero = vmv_v_x_i16m1(0, vlmax);
	vint16m2_t vout = vmv_v_x_i16m2(0, vlmax);
	while (n > 0) {
		size_t vl = vsetvl_e8m1(n);
		vint8m1_t vin1 = vle8_v_i8m1(in1, vl);
		vint8m1_t vin2 = vle8_v_i8m1(in2, vl);
		vout = vwmul_vv_i16m2(vin1, vin2, vl);
		in1 += vl;
		in2 += vl;
		n -= vl;
	}
	vint16m1_t vec_sum = vredsum_vs_i16m2_i16m1(vec_zero, vout, vec_zero, vlmax);
	out = vmv_x_s_i16m1_i16(vec_sum);
	return out + bias;
}

Can you suggest?
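One thing that stands out in the vectorized listing above is that vout is overwritten by the widening multiply in every loop iteration, so only the products of the last strip reach the reduction. A widening multiply-accumulate (the vwmacc.vv instruction) would be the natural candidate instead. The plain-C model below is only a sketch of that accumulate-then-reduce structure, not a drop-in fix: LANES and dense_stripmined are hypothetical names, and LANES stands in for whatever vl the hardware would return.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical strip length; stands in for the vl returned by vsetvl. */
#define LANES 16

/* Scalar model of a strip-mined dot product with per-lane accumulators. */
int16_t dense_stripmined(const int8_t *in1, const int8_t *in2,
                         size_t n, int16_t bias)
{
    int16_t acc[LANES] = {0};            /* models the 16-bit accumulator vector */

    while (n > 0) {
        size_t vl = (n < LANES) ? n : LANES;
        for (size_t i = 0; i < vl; i++) {
            /* accumulate (multiply-add), do not overwrite */
            acc[i] = (int16_t)(acc[i] + in1[i] * in2[i]);
        }
        in1 += vl;
        in2 += vl;
        n   -= vl;
    }

    /* Single reduction over all lanes after the loop. */
    int16_t sum = 0;
    for (size_t i = 0; i < LANES; i++) {
        sum = (int16_t)(sum + acc[i]);
    }
    return (int16_t)(sum + bias);
}
```

The reduction covers every accumulator lane, so in the vector version the vl passed to the reduction would have to span the whole 16-bit register group, not only the first register.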

LSU rs1 argument

vicuna/rtl/vproc_core.sv

Line 790 in 66ed264

.rs1_i ( queue_data_q.rs1.r.xval ),

Sorry to bombard you with issues, but I'm finally getting around to running some lint/simulations on recent updates you made.

This connects a 32-bit signal to a 38-bit input port. Shouldn't this be queue_data_q.rs1 instead of queue_data_q.rs1.r.xval?

Add support for fractional LMUL

Fractional LMUL is not working, and indeed:

vicuna/rtl/vproc_core.sv

Line 388 in 6816693

// TODO support fractional LMUL

According to the 0.10 spec:

I believe the minimum SEW is 8 bits and the maximum SEW is 32 bits, so Vicuna is supposed to support fractional LMUL down to 1/4? Please correct me if I am wrong.
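The arithmetic behind that bound can be sketched as follows. This assumes the rule, as I read the vector specification, that fractional LMUL must be supported down to SEW_min/ELEN, and that VLMAX = LMUL * VLEN / SEW. The function names are hypothetical and VLEN = 128 in the example is only the default VREG_W.

```c
/* Denominator of the smallest required fractional LMUL (1/den),
 * assuming the rule LMUL >= SEW_min / ELEN. */
unsigned min_lmul_denominator(unsigned sew_min, unsigned elen)
{
    return elen / sew_min;
}

/* VLMAX = LMUL * VLEN / SEW, with LMUL given as the fraction num/den. */
unsigned vlmax(unsigned vlen, unsigned sew, unsigned lmul_num, unsigned lmul_den)
{
    return (vlen * lmul_num) / (sew * lmul_den);
}
```

With SEW_min = 8 and ELEN = 32, min_lmul_denominator(8, 32) gives 4, i.e. LMUL = 1/4, and vlmax(128, 8, 1, 4) gives 4 elements.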

Coprocessor stalls indefinitely if memory result transaction happens in the same clock cycle as memory request transaction

Hi @michael-platzer

Issue

This issue is related to #59, but for delay on a different signal. The problematic delay here is mem_result_valid. When a memory request is served in the same clock cycle as it was issued (i.e. no delay), the coprocessor will stall forever. Below you can find a picture of the x-interface signals. After 490 ns the coprocessor stalls indefinitely. I have not further investigated this observation.

This is the corresponding snippet from the x-interface documentation which states that memory result transactions can happen in the same cycle as the memory request transaction.

Memory result interface transactions cannot be initiated before the corresponding memory request interface handshake is completed. They are allowed to be initiated at the same time as or after completion of the memory request interface handshake.

Problematic Instruction Sequence

The problematic instruction sequence is the same as in #59.

Out-of-bounds array access

vicuna/rtl/vproc_vregfile.sv

Line 92 in dd20efa

 wr_data = wr_data ^ ((i < gw) ? rd_data[PORTS_RD+gw-1][i] : rd_data[PORTS_RD+gw][i+1]); 

I'm seeing an error that one of these accesses is out-of-bounds. I'm guessing it's not smart enough to ignore one of these outputs based off the ternary operator condition.

Blocking assignments used in always_ff blocks

Please use non-blocking assignments in all always_ff blocks. I see blocking assignments in these locations:

vproc_alu.sv     lines 370-380
vproc_elem.sv    lines 380-394
vproc_lsu.sv     lines 502-510
vproc_mul.sv     lines 359-367
vproc_sld.sv     lines 330-344

Test Directory

Hi @michael-platzer, since the topic is different, I thought it better to open a new issue. Sorry if I am asking a lot, but I hope my basic questions also help make the repo more complete :). In the test directory we can test Vicuna and see the number of clock cycles in Verilator with two different configs. I would like to write the same set of programs (for example vadd_8.s) in C without the vector extension, run them on Ibex alone, and see how many clock cycles Ibex needs for them, so that I can compare Ibex with Ibex+Vicuna for sample vector code. Any suggestions?
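A scalar baseline for such a comparison could look like the sketch below. It assumes the vector test performs an element-wise 8-bit addition with wrap-around; the function name vadd8_scalar is hypothetical and not taken from the test directory.

```c
#include <stddef.h>
#include <stdint.h>

/* Element-wise 8-bit addition, wrapping modulo 256. */
void vadd8_scalar(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = (uint8_t)(a[i] + b[i]);
    }
}
```

On the target, one way to get a cycle count would be to read the mcycle CSR before and after the call and take the difference, using the same compiler optimization level as for the vector build so the comparison stays fair.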

Misc. error

Not sure what the exact error is, but see #47 for a new test case that fails. The provided assembly passes when VREG_W is 128, but not when it is larger. Can you first confirm that it is a bug?

Attaching DDR to an FPGA Board

Hi,
as you have mentioned in #16 attaching DDR is really board specific and I see in that issue other people like me are struggling to attach it to the core... I have nexys 4 DDR as well and I think it is really good to have at least one example of how to attach DDR to one specific board (any board even nexys video of your demo project).

Question about registerfile

Hi @michael-platzer

This is Moritz, I will be doing the verification of Vicuna. I am currently setting up a simulation environment for Questasim to replicate your tests, and now I have a question about the register file.

Since I am targeting ASIC I also want to add an option for an ASIC register file. I have a working version of the ASIC registerfile here: https://github.com/moimfeld/vicuna/blob/asic_dev/rtl/vproc_vregfile.sv .

But my question is if the XORs in the following lines are only there for the correct functionality of the "XOR" RAM, or if they have some other purpose (like for example masking)?

vicuna/rtl/vproc_vregfile.sv

Line 92 in 94a7f47

 wr_data = wr_data ^ rd_data[PORTS_RD + gw - ((i < gw) ? 1 : 0)][i + ((i < gw) ? 0 : 1)]; 

vicuna/rtl/vproc_vregfile.sv

Line 164 in 94a7f47

rd_data_o[i] = rd_data_o[i] ^ rd_data[i][j];

I ask because in my ASIC version I don't use these XORs at the moment, and I want to make sure that I don't lose any functionality.

Not implemented instruction list

Hi,

Where can I find which instructions are not implemented?

Linter doesn't like pulling parameters from interfaces

vicuna/rtl/vproc_core.sv

Line 69 in 66ed264

localparam int unsigned XIF_ID_W = xif_issue_if.X_ID_WIDTH;

Unfortunately, some linters and synthesis tools don't accept this syntax. It seems silly, but I usually have to pass in the parameter at the module level (in addition to it being in the interface).

Illegal instruction?

Another instruction that needs to be implemented:

62850427   vs4r.v  v8,(a0)
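For reference, the instruction word can be split into its fields with a few shifts and masks. The sketch below assumes the vector store layout as I read it in the V specification (nf[31:29], mew[28], mop[27:26], vm[25], sumop[24:20], rs1[19:15], width[14:12], vs3[11:7], opcode[6:0]) and that a whole-register store is a unit-stride store with sumop = 01000. The struct and function names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    unsigned opcode, vs3, width, rs1, sumop, vm, mop, mew, nf;
} vstore_fields_t;

/* Split a 32-bit vector store instruction word into its fields. */
vstore_fields_t decode_vstore(uint32_t insn)
{
    vstore_fields_t f;
    f.opcode = insn & 0x7f;
    f.vs3    = (insn >> 7)  & 0x1f;
    f.width  = (insn >> 12) & 0x7;
    f.rs1    = (insn >> 15) & 0x1f;
    f.sumop  = (insn >> 20) & 0x1f;
    f.vm     = (insn >> 25) & 0x1;
    f.mop    = (insn >> 26) & 0x3;
    f.mew    = (insn >> 28) & 0x1;
    f.nf     = (insn >> 29) & 0x7;
    return f;
}

/* Number of registers written by a whole-register store (nf + 1),
 * or 0 if the word is not a whole-register store. */
unsigned whole_reg_store_count(uint32_t insn)
{
    vstore_fields_t f = decode_vstore(insn);
    if (f.opcode == 0x27 && f.mop == 0 && f.sumop == 0x08) {
        return f.nf + 1;
    }
    return 0;
}
```

For 0x62850427 this yields opcode 0x27, vs3 = 8 (v8), rs1 = 10 (a0), sumop = 01000 and nf = 3, i.e. a whole-register store of four registers, which matches the vs4r.v disassembly.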

CSR changes for Vector extension

The following CSR registers in the CPU probably need to be changed to properly flag having a vector extension. See the privileged spec:

Machine ISA register, misa, bit 21 should be set
Machine status register, mstatus, VS bits 10:9, should be used as vector status bits

There may be others. Note that setting the VS bits in the mstatus register was required to get simulation working with spike. I'm not sure what hardware is supposed to do with the VS bits.
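The two fields named above can be written down as bit masks. This is a sketch with hypothetical macro and function names; the VS encoding (Off, Initial, Clean, Dirty) is assumed to mirror the FS field. As I read the vector specification, executing a vector instruction while VS is Off should raise an illegal-instruction exception, which would explain why spike needed the bits set.

```c
#include <stdint.h>

/* misa: one bit per extension letter, 'A' = bit 0, so 'V' = bit 21. */
#define MISA_V_BIT        ('V' - 'A')
#define MISA_V            (UINT32_C(1) << MISA_V_BIT)

/* mstatus.VS occupies bits 10:9. */
#define MSTATUS_VS_SHIFT  9
#define MSTATUS_VS_MASK   (UINT32_C(3) << MSTATUS_VS_SHIFT)

/* Assumed encoding, mirroring mstatus.FS. */
enum { VS_OFF = 0, VS_INITIAL = 1, VS_CLEAN = 2, VS_DIRTY = 3 };

/* Return mstatus with the VS field replaced by vs. */
uint32_t mstatus_set_vs(uint32_t mstatus, unsigned vs)
{
    return (mstatus & ~MSTATUS_VS_MASK)
         | (((uint32_t)vs << MSTATUS_VS_SHIFT) & MSTATUS_VS_MASK);
}

/* Extract the VS field from mstatus. */
unsigned mstatus_get_vs(uint32_t mstatus)
{
    return (unsigned)((mstatus & MSTATUS_VS_MASK) >> MSTATUS_VS_SHIFT);
}
```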

Adding other peripherals to the demo_top.sv

Hi @michael-platzer and @stevobailey sorry I was unfortunately dealing with covid-19 and couldn't bother you more with my basic questions :p... but here is one:
In demo_top.sv you attached UART, ram32 and hwreg_iface. I want to have Ibex as the main core, Vicuna as the vector coprocessor, and VGA, a camera module and LEDs attached as well. I have done this before using only Ibex and the Wishbone B4 interface for the mentioned peripherals with the help of https://github.com/pbing/ibex_wb. I wonder if I can do it here as well? Even without the Wishbone interface, is it possible to treat those peripherals like the UART and ram32 of your top module code and attach them to the Ibex memory bus? Any suggestions? (Suppose I will finally attach the DDR2 memory of my FPGA board as well.)

Help in understanding operand widths

Can you help me understand what the operand widths mean? Let's take the default parameters as a starting point:

        parameter int unsigned        VREG_W        = 128, // vector register width in bits
        parameter int unsigned        VMUL_W        = 64,  // MUL unit operand width in bits
        parameter int unsigned        VALU_W        = 64,  // ALU unit operand width in bits
        parameter int unsigned        VSLD_W        = 64,  // SLD unit operand width in bits

From a RISC-V standards perspective, VREG_W equals VLEN, so VLEN is 128. ELEN is 32, since this is a 32-bit vector unit. If VALU_W is 64, the operands are 64 bits. If I want to do vector add on 32-bit vectors, can it only add two elements at a time (because 64-bit operands / 32-bit elements = 2 elements per operand)? So doubling VALU_W to 128 would require twice the hardware but take half as many cycles to perform the addition?

In your paper, this statement

The ability to individually configure the throughput for each unit improves the performance of heavily used operations by increasing the respective unit’s data-path width (e.g., widening the data-path of the multiplier unit).

means, if I use lots of multiplies in my application, I should increase VMUL_W to improve performance at the cost of HW, right? But if I use lots of adds, I should increase VALU_W to improve performance at the cost of HW.

Put another way, VMUL_W, VALU_W, and VSLD_W don't affect functionality (i.e. software), just performance and overhead. Yes?
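The interpretation in this question can be put into numbers with a first-order model. This is only a sketch of the reasoning above, not a statement about Vicuna's actual timing: it ignores pipeline latency and register file port contention, and the function names are hypothetical.

```c
/* Elements a unit can process per cycle for a given operand width and SEW. */
unsigned elems_per_cycle(unsigned unit_w, unsigned sew)
{
    return unit_w / sew;
}

/* Cycles to stream vl elements of width sew through a unit that is
 * unit_w bits wide, rounded up. */
unsigned cycles_for_vl(unsigned vl, unsigned sew, unsigned unit_w)
{
    return (vl * sew + unit_w - 1) / unit_w;
}
```

Under this model, a vector add of four 32-bit elements (one full 128-bit register) takes cycles_for_vl(4, 32, 64) = 2 cycles with VALU_W = 64 and cycles_for_vl(4, 32, 128) = 1 cycle with VALU_W = 128.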

Help debugging dot product

Can you help me debug a dot product issue? I am taking two vectors, 9 16-bit elements each, and performing a dot product on them. I am using a 16-bit to 32-bit multiply, followed by a regular reduction sum (not width widening). Below is the relevant assembly.

800129cc <vect_dotProduct>:
800129cc: c500f757            vsetivli  a4,1,e32,m1,ta,mu
800129d0: 5e003457            vmv.v.i v8,0 
800129d4: c50d                  beqz  a0,800129fe <vect_dotProduct+0x32>
800129d6: 04057757            vsetvli a4,a0,e8,m1,ta,mu
800129da: 04877057            vsetvli zero,a4,e16,m1,ta,mu
800129de: 0205d487            vle16.v v9,(a1)
800129e2: 02065507            vle16.v v10,(a2)
800129e6: ee952657            vwmul.vv  v12,v9,v10
800129ea: 01107057            vsetvli zero,zero,e32,m2,tu,mu
800129ee: 02c42457            vredsum.vs  v8,v12,v8
800129f2: 00171793            slli  a5,a4,0x1
800129f6: 95be                  add a1,a1,a5
800129f8: 8d19                  sub a0,a0,a4
800129fa: 963e                  add a2,a2,a5
800129fc: fd69                  bnez  a0,800129d6 <vect_dotProduct+0xa>
800129fe: c500f057            vsetivli  zero,1,e32,m1,ta,mu
80012a02: 0206e427            vse32.v v8,(a3)
80012a06: 8082                  ret

I see it loading the vectors correctly, then multiplying them correctly. However, when it gets to the reduction sum, the pipe_in_ctrl_i.vl_part_0 signal is high. This means the following line of RTL never actually adds the elements into the result. I see the elements appearing one-by-one in the elem_q signal, but the result stays 0.

vicuna/rtl/vproc_elem.sv

Line 229 in 06f1d2c

result_d = ~pipe_in_ctrl_i.vl_part_0 ? (elem_q + reduct_val) : reduct_val;

What is the next step in debugging this? What sets the vl_part_0 signal high? Thanks!

Incomplete unique case statements

There are numerous unique case statements that have incomplete entries. I see that verilator simulation ignores them, but it would be better to fix them. Possible solutions:

Enumerate all possible values
Use a default case
Remove unique from the case statement

Thoughts?

Asynchronous reset sensitivity list issue

In several places, the sensitivity list for reset fails linting. For example, in vproc_lsu.sv, line 413:

always_ff @(posedge clk_i or negedge async_rst_n) begin : vproc_lsu_stage_vreg
                if (~async_rst_n | (~ASYNC_RESET & ~rst_ni)) begin
                    state_vreg_q <= '{busy: 1'b0, default: 'x};
                end

Having async_rst_n with other signals in the sensitivity list is the problem. I suggest you find a linting tool. Verilator has one, though I'm not sure if it catches this.

This page shows you how to fix the issue:
https://www.intel.com/content/www/us/en/programmable/quartushelp/13.0/mergedProjects/msgs/msgs/evrfx_veri_if_condition_does_not_match_sensitivity_list_edge.htm

always_ff @(posedge clk_i or negedge async_rst_ni) begin
    if (~async_rst_ni) begin
        rd_pos  <= '0;
        wr_pos  <= '0;
        last_wr <= '0;
    end
    else if (~sync_rst_ni) begin
        rd_pos  <= '0;
        wr_pos  <= '0;
        last_wr <= '0;
    end else begin
        if (enq_ready_o & enq_valid_i) begin
            data[wr_pos] <= enq_data_i;
