pulp-platform / hwpe-mac-engine

An example Hardware Processing Engine

License: Other

SystemVerilog 81.01% Python 18.99%

hwpe-mac-engine's Issues

How to change the data width of the input and the output of the hwpe-mac-engine?

Hello,
I am using an HWPE to connect my hardware accelerator to Pulpissimo. To do so, I decided to study hwpe-mac-engine and modify it to match the requirements of my accelerator.

The first modification I am trying to make is to change the input and output data width from 32 bits to 128 bits while keeping the rest of the hwpe-mac-engine structure as it is. That is, the data width of a, b, c, and d changes from 32 bits to 128 bits.
To do this, I changed the data width parameter in mac_streamer.sv and mac_top.sv, and also resized mult, r_mult, d_nonshifted, etc. in the hwpe_mac_engine file accordingly.

I was then reading the following master's thesis ( https://webthesis.biblio.polito.it/11015/1/tesi.pdf ), which states:

Another change, compared to the provided example, was to “virtually” extend the number of ports to the TCDM. These are considered “virtual” as it is like instantiating eight ports (eight in the SMAC-Engine case) but only four of these are physically there: all these eight ports will think they are attached to the memory but they are not. This is necessary to handle a 128-bit stream at the input and the output. Indeed, a TCDM multiplexer can then be used to funnel more input “virtual” TCDM channels (eight) into a smaller set of master ports (four). Hence, together with the definition of a virtual_tcdm interface, a hwpe_stream_tcdm_mux has been allocated to handle this.

Based on my understanding, I have a few questions about virtually extending the number of ports.

Is the data width of a TCDM port fixed (i.e., 32 bits)? Is it possible to increase it to 128 bits?
Is the number of TCDM ports in an HWPE fixed (i.e., a maximum of 4)? If the data width of a TCDM port is fixed to 32 bits, then I will need 16 ports. Can I have that many ports without virtually extending them?
If I have to virtually extend them, can I use the following code (using hwpe_stream_tcdm_mux) and change 8 to 16 and 7 to 15 to meet my requirements? How much delay or latency should I expect, and how can I mitigate it?

hwpe_stream_intf_tcdm virtual_tcdm [7:0] (
  .clk ( clk_i )
);
// mode 1 - less efficient
hwpe_stream_tcdm_mux #(
  .NB_IN_CHAN  ( 8 ),
  .NB_OUT_CHAN ( 4 )
) i_mux (
  .clk_i,
  .rst_ni,
  .clear_i,
  .in  ( virtual_tcdm[7:0] ),
  .out ( tcdm[3:0]         )
);

Apart from that, are any other changes required in the other source files, including the testbench files, to change the data width?

Thank you.

Regards,
Zeal

How does a microcode loop know its number of iterations?

I've been studying the MAC engine because it's a great starting point for what I need, even though, as the developers themselves note, the implementation is suboptimal.

I understand that the microcode can be generated with the Python script and I understand the very simple instruction set. I understand that the LOOPSX_OFFS registers hold a loop instruction size and the loop instruction offset, also generated by the Python script.

What I don't understand is how each loop knows where to look for its number of iterations.

In the hwme example in the pulp-rt-examples repo, the NB_ITER register (numbered 4) holds the number of iterations. Indeed, when I change the value in NB_ITER from 4 to 3 in the example, only three iterations are executed, so the fourth result is wrong. How does loop0 know to look into register 4? If I used loop1, which register would hold its number of iterations?

One more thing: is the operation d = a * b + c performed in every iteration of the innermost loop?

Error in vector stride?

I'm trying out the HWME in scalar product mode and have started with some very simple examples.
I am using the default microcode supplied in the pulp-rt-example/accelerators/hwme example.

The job-dependent parameters are:

hwme_a_addr_set((unsigned int) a);
hwme_b_addr_set((unsigned int) b);
hwme_c_addr_set((unsigned int) c);
hwme_d_addr_set((unsigned int) d);
hwme_nb_iter_set(10);
hwme_len_iter_set(2 - 1 /* My vectors are of length 2. Have to offset by -1 if I understand things correctly. */);
hwme_vectstride_set(0 * sizeof(uint32_t));
hwme_shift_simplemul_set(hwme_shift_simplemul_value(10, 0));

All the results should be the same because 0 for vectstride means I'm always reading the same (first) vector from my a and b inputs. The a, b, c and d arrays are more than 300 bytes apart, so there's no overlap.

This is what the first 15 elements of a, b, c and d look like before and after running the HWME.

Before.
a:  1968 b:  4068 c:     0 d:     0
a:   741 b: -3079 c:     0 d:     0
a:  1968 b: -4369 c:     0 d:     0
a:   741 b: -2234 c:     0 d:     0
a:  1968 b: -2579 c:     0 d:     0
a:   741 b:  1415 c:     0 d:     0
a:  -222 b:  4068 c:     0 d:     0
a:   542 b: -3079 c:     0 d:     0
a:  -222 b: -4369 c:     0 d:     0
a:   542 b: -2234 c:     0 d:     0
a:  -222 b: -2579 c:     0 d:     0
a:   542 b:  1415 c:     0 d:     0
a:  -597 b:  4068 c:     0 d:     0
a:   664 b: -3079 c:     0 d:     0
a:  -597 b: -4369 c:     0 d:     0


After.
a:  1968 b:  4068 c:     0 d: -2229
a:   741 b: -3079 c:     0 d:  5590
a:  1968 b: -4369 c:     0 d: -2229
a:   741 b: -2234 c:     0 d:  5590
a:  1968 b: -2579 c:     0 d:  5590
a:   741 b:  1415 c:     0 d:  5590
a:  -222 b:  4068 c:     0 d: -2229
a:   542 b: -3079 c:     0 d:  5590
a:  -222 b: -4369 c:     0 d:  5590
a:   542 b: -2234 c:     0 d:  5590
a:  -222 b: -2579 c:     0 d:     0
a:   542 b:  1415 c:     0 d:     0
a:  -597 b:  4068 c:     0 d:     0
a:   664 b: -3079 c:     0 d:     0
a:  -597 b: -4369 c:     0 d:     0

The scalar product of the two vectors is 5590: (4068 * 1968 - 3079 * 741) / 1024 = 5590. What I don't understand is where -2229 comes from.

Did I misunderstand any of the configuration options?

How to Run HWPE_MAC_Engine on Zedboard using Pulpissimo

Hello @FrancescoConti ,

I want to run the Pulpissimo-based HWPE-MAC-Engine example on a Zedboard. I tried running the HWME code provided in pulp-rt-examples, but I am stuck. Here is the hwme.c file as I have modified it:

/*
 * Copyright (C) 2018 ETH Zurich and University of Bologna
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/* 
 * Authors:  Francesco Conti <[email protected]>
 */

#include "pulp.h"
#include <stdint.h>
#include <stdio.h>
#include "archi/hwme/hwme_v1.h"
#include "hal/hwme/hwme_v1.h"
#include <rt/rt_api.h>

#define ARCHI_SOC_EVENT_FCHWPE0 140

#define USE_STIMULI 
// comment below line to run only dot product with bias
//#define DO_MATVEC_MULT 
#ifndef DO_MATVEC_MULT
    #define DO_DOT_PROD
#endif

#include "hwme_stimuli_a.h"
#include "hwme_stimuli_b.h"
#include "hwme_stimuli_c.h"
#include "hwme_stimuli_d.h"

unsigned int __rt_iodev_uart_baudrate = 115200;
int __rt_fpga_fc_frequency = 20000000; // e.g. 20000000 for 20MHz;
int __rt_fpga_periph_frequency = 10000000; // e.g. 10000000 for 10MHz;

int main() {

  // cast to the pointer type actually declared (was uint8_t *)
  uint32_t *a = (uint32_t *) 0x1c010000;
  uint32_t *b = (uint32_t *) 0x1c010200;
  uint32_t *c = (uint32_t *) 0x1c010400;
  uint32_t *d = (uint32_t *) 0x1c010600;

  int coreID = get_core_id();
#ifdef DO_MATVEC_MULT
  // define dimensions
  uint32_t in_vec_len = 8;
  uint32_t out_vec_len = 10;
#endif

  volatile int errors = 0;
  int gold_sum = 0, check_sum = 0;
  int i,j;
  
  int offload_id_tmp, offload_id;
  
  printf("%d\n", get_core_id()); // an int must not be passed as the format string
  if(get_core_id() == 0) {

#ifdef USE_STIMULI
    for(int i=0; i<512; i++) {
      ((uint8_t *) a)[i] = stim_a[i];
    }
//printing input value of a
    for(int loop = 0; loop < 1; loop++){
      printf("a %d \n", a[loop]);
    }

    for(int i=0; i<512; i++) {
      ((uint8_t *) b)[i] = stim_b[i];
    }

//printing b
    for(int loop = 0; loop < 1; loop++){
      printf("b %d \n", b[loop]);
    }

    for(int i=0; i<512; i++) {
#ifdef DO_MATVEC_MULT
      ((uint8_t *) c)[i] = 0; // no bias for matrix vector multiplication
#else
      ((uint8_t *) c)[i] = stim_c[i];
#endif
    }

//printing c
    for(int loop = 0; loop < 1; loop++){
      printf("c %d \n", c[loop]);
    }

    for(int i=0; i<512; i++) {
      ((uint8_t *) d)[i] = stim_d[i];
    }

//printing d
    for(int loop = 0; loop < 1; loop++){
      printf("d %d \n", d[loop]);
    }

#else
    for(int i=0; i<127; i++) {
      a[i] = i;
    }

    for(int loop = 0; loop < 1; loop++){
      printf("%d \n", a[loop]);
    }


    for(int i=0; i<127; i++) {
      b[i] = i;
    }
    
    for(int loop = 0; loop < 1; loop++){
      printf("%d \n", b[loop]);
    }


    for(int i=0; i<127; i++) {
      c[i] = i;
    }
    
    for(int loop = 0; loop < 1; loop++){
      printf("%d \n", c[loop]);
    }


    for(int i=0; i<127; i++) {
      d[i] = i;
    }
 
    for(int loop = 0; loop < 1; loop++){
      printf("%d \n", d[loop]);
    }

#endif

    /* convolution-accumulation - HW */
    plp_hwme_enable();

    while((offload_id_tmp = hwme_acquire_job()) < 0);

    // set up bytecode
    hwme_bytecode_set(HWME_LOOPS1_OFFS,           0x00000000);
    hwme_bytecode_set(HWME_BYTECODE5_LOOPS0_OFFS, 0x00040000);
    hwme_bytecode_set(HWME_BYTECODE4_OFFS,        0x00000000);
    hwme_bytecode_set(HWME_BYTECODE3_OFFS,        0x00000000);
    hwme_bytecode_set(HWME_BYTECODE2_OFFS,        0x00000000);
    hwme_bytecode_set(HWME_BYTECODE1_OFFS,        0x000008cd);
    hwme_bytecode_set(HWME_BYTECODE0_OFFS,        0x11a13c05);
    
    // job-dependent registers
    hwme_a_addr_set((unsigned int) a);
    hwme_b_addr_set((unsigned int) b);
    hwme_c_addr_set((unsigned int) c);
    hwme_d_addr_set((unsigned int) d);
#ifdef DO_MATVEC_MULT
    hwme_nb_iter_set(out_vec_len);
    hwme_len_iter_set(in_vec_len-1);
    hwme_vectstride_set(in_vec_len*4); // stride for the matrix is equal to in_vec length * wordsize
    hwme_vectstride2_set(0); // stride for the vector is zero
#else
    hwme_nb_iter_set(4);
    hwme_len_iter_set(32-1);
    hwme_vectstride_set(32*4);
    hwme_vectstride2_set(32*4); // same stride for both streams
#endif
    hwme_shift_simplemul_set(hwme_shift_simplemul_value(0, 0));

    // start HWME operation
    printf("Start HWME operation !\n");
    hwme_trigger_job();
    printf("Started HWME operation !\n");

    // wait for end of computation
    soc_eu_fcEventMask_setEvent(ARCHI_SOC_EVENT_FCHWPE0);

    __rt_periph_wait_event(ARCHI_SOC_EVENT_FCHWPE0, 1);

    printf("Archi detection ! \n");

    plp_hwme_disable();
    printf("Wait for end of computation !\n");

    // check

#ifndef USE_STIMULI
    if(d[0] != 0x000028b0) errors++;
    if(d[1] != 0x000124b1) errors++;
    if(d[2] != 0x000320b2) errors++;
    if(d[3] != 0x00061cb3) errors++;

#else
    #ifdef DO_MATVEC_MULT
        if(d[0] != 0x7CB12A38) errors++;
        if(d[1] != 0xCD4F4DCB) errors++;
        if(d[2] != 0x49CD5D5C) errors++;
        if(d[3] != 0x2A1D8706) errors++;

    #else
        if(d[0] != 0x7f228fd6) errors++;
        if(d[1] != 0x23a7d5c2) errors++;
        if(d[2] != 0x7f281848) errors++;
        if(d[3] != 0x6127d834) errors++;

    #endif
#endif /* USE_STIMULI */ 

    printf("errors=%d\n", errors);
    
    printf("Done with computation !\n");

    //printing output
    printf("printing d after computation: \n");
    for(int loop = 0; loop < 1; loop++){
      printf("d %d \n", d[loop]);
    }

   }

   printf("Sync barrier !\n");
   synch_barrier();
   printf("Done with everything !\n");
   return errors;
}

I updated the code file by adding the following lines:

#define ARCHI_SOC_EVENT_FCHWPE0 140
unsigned int __rt_iodev_uart_baudrate = 115200;
int __rt_fpga_fc_frequency = 20000000; // e.g. 20000000 for 20MHz;
int __rt_fpga_periph_frequency = 10000000; // e.g. 10000000 for 10MHz;

I also added a few print statements to trace the execution.

On minicom, nothing is printed after "Started HWME operation !", that is, execution does not get past the following lines:

    soc_eu_fcEventMask_setEvent(ARCHI_SOC_EVENT_FCHWPE0);
    __rt_periph_wait_event(ARCHI_SOC_EVENT_FCHWPE0, 1);

I looked through various issues in the pulpissimo repository but have not found a solution yet. Could you help me with this? Thank you.

Regards,
Zeal

Alternative Simulator Options for Running Tests Without QuestaSim

Hey everyone,

I'm currently facing a challenge as I don't have access to QuestaSim for running tests. Are there any alternative simulators that can serve the same purpose? If so, could you please suggest some options? Additionally, any guidance on how to integrate these alternative simulators into my workflow would be immensely helpful.

Thanks!

Bender.yml does not work

Hello everyone,
I'm trying to set up this repository to see how the HWPE works.
These are the steps:

$ git clone --recursive https://github.com/pulp-platform/hwpe-mac-engine

$ cd hwpe-mac-engine/

$ make checkout
make: *** No rule to make target 'checkout'.  Stop.

Any suggestions?
Thanks in advance.
