pulp-platform / hwpe-mac-engine
An example Hardware Processing Engine
License: Other
Hello,
I am using HWPE to connect my hardware accelerator to the Pulpissimo. For that, I decided to study and make modifications in the hwpe-mac-engine file based on the requirements of my accelerator.
The first modification that I am trying to make is to change the input and output data width to 128 bits from 32 bits, keeping the rest of the structure of hwpe-mac-engine as it is. That is, a, b, c, and d's data width is changed from 32 bits to 128 bits.
For that, I changed the data width parameter in mac_streamer.sv and mac_top.sv, and also made the necessary changes to the sizes of mult, r_mult, d_nonshifted, etc. in the hwpe_mac_engine file.
I was then going through the following master's thesis (https://webthesis.biblio.polito.it/11015/1/tesi.pdf), where it is specified:
Another change, compared to the provided example, was to “virtually” extend the number of ports to the TCDM. These are considered “virtual” as it is like instantiating eight ports (eight in the SMAC-Engine case) but only four of these are physically there: all these eight ports will think they are attached to the memory but they are not. This is necessary to handle a 128-bit stream at the input and the output. Indeed, a TCDM multiplexer can then be used to funnel more input “virtual” TCDM channels (eight) into a smaller set of master ports (four). Hence, together with the definition of a virtual_tcdm interface, a hwpe_stream_tcdm_mux has been allocated to handle this.
Based on my understanding, I have a few questions regarding virtually extending the number of ports.
hwpe_stream_intf_tcdm virtual_tcdm [7:0] (
.clk ( clk_i )
);
// mode 1 - less efficient
hwpe_stream_tcdm_mux #(
.NB_IN_CHAN ( 8 ),
.NB_OUT_CHAN ( 4 )
) i_mux (
.clk_i,
.rst_ni,
.clear_i,
.in ( virtual_tcdm[7:0] ),
.out ( tcdm[3:0] )
);
Thanking You.
Regards,
Zeal
I've been studying the MAC as it's a great starting point for what I need, despite the suboptimal implementation as the devs claim.
I understand that the microcode can be generated with the Python script and I understand the very simple instruction set. I understand that the LOOPSX_OFFS registers hold a loop instruction size and the loop instruction offset, also generated by the Python script.
What I don't understand is how each loop knows where to look for its number of iterations.
In the hwme example in the pulp-rt-examples repo, the NB_ITER register (numbered 4) holds the number of iterations. Indeed, when I change the value in NB_ITER from 4 to 3 in the example, only three iterations get executed, thus the fourth result is wrong. How does the loop0 know to look into register 4? If I used loop1, which register would be used for its number of iterations?
One more thing: does the operation d = a * b + c get performed in every iteration of the most-nested loop?
I'm trying out the HWME in scalar product mode and have started with some very simple examples.
I am using the default microcode supplied in the pulp-rt-example/accelerators/hwme example.
Job dependent parameters are:
hwme_a_addr_set((unsigned int) a);
hwme_b_addr_set((unsigned int) b);
hwme_c_addr_set((unsigned int) c);
hwme_d_addr_set((unsigned int) d);
hwme_nb_iter_set(10);
hwme_len_iter_set(2 - 1 /* My vectors are of length 2. Have to offset by -1 if I understand things correctly. */);
hwme_vectstride_set(0 * sizeof(uint32_t));
hwme_shift_simplemul_set(hwme_shift_simplemul_value(10, 0));
All the results should be the same because 0 for vectstride means I'm always reading the same (first) vector from my a and b inputs. The a, b, c and d arrays are more than 300 bytes apart, so there's no overlap.
This is what the first 15 elements of a, b, c and d look like before and after running the HWME.
Before.
a: 1968 b: 4068 c: 0 d: 0
a: 741 b: -3079 c: 0 d: 0
a: 1968 b: -4369 c: 0 d: 0
a: 741 b: -2234 c: 0 d: 0
a: 1968 b: -2579 c: 0 d: 0
a: 741 b: 1415 c: 0 d: 0
a: -222 b: 4068 c: 0 d: 0
a: 542 b: -3079 c: 0 d: 0
a: -222 b: -4369 c: 0 d: 0
a: 542 b: -2234 c: 0 d: 0
a: -222 b: -2579 c: 0 d: 0
a: 542 b: 1415 c: 0 d: 0
a: -597 b: 4068 c: 0 d: 0
a: 664 b: -3079 c: 0 d: 0
a: -597 b: -4369 c: 0 d: 0
After.
a: 1968 b: 4068 c: 0 d: -2229
a: 741 b: -3079 c: 0 d: 5590
a: 1968 b: -4369 c: 0 d: -2229
a: 741 b: -2234 c: 0 d: 5590
a: 1968 b: -2579 c: 0 d: 5590
a: 741 b: 1415 c: 0 d: 5590
a: -222 b: 4068 c: 0 d: -2229
a: 542 b: -3079 c: 0 d: 5590
a: -222 b: -4369 c: 0 d: 5590
a: 542 b: -2234 c: 0 d: 5590
a: -222 b: -2579 c: 0 d: 0
a: 542 b: 1415 c: 0 d: 0
a: -597 b: 4068 c: 0 d: 0
a: 664 b: -3079 c: 0 d: 0
a: -597 b: -4369 c: 0 d: 0
The scalar product of the two vectors is 5590: (4068 * 1968 - 3079 * 741) / 1024 = 5590. What I don't understand is where -2229 comes from.
Did I misunderstand any of the configuration options?
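For cross-checking numbers like these on the host, here is a minimal software sketch of the dot product followed by a right shift. It assumes the shift is an arithmetic (rounding toward minus infinity) shift applied to the accumulated sum, which is consistent with the 5590 quoted above but is not confirmed against the RTL. The names floor_shift_right and ref_dot_shift are hypothetical and are not part of the HWME HAL.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: right shift that rounds toward minus infinity.
 * Written out explicitly so the result does not depend on how the
 * compiler implements >> for negative signed values. */
int64_t floor_shift_right(int64_t x, unsigned shift)
{
    if (x >= 0)
        return x >> shift;
    /* For negative x: floor(x / 2^shift) == -ceil(-x / 2^shift). */
    return -((-x + ((int64_t)1 << shift) - 1) >> shift);
}

/* Hypothetical reference model (an assumption, not the hardware spec):
 * accumulate len products in 64 bits, then shift the sum right. */
int32_t ref_dot_shift(const int32_t *a, const int32_t *b,
                      size_t len, unsigned shift)
{
    int64_t acc = 0;
    for (size_t i = 0; i < len; i++)
        acc += (int64_t)a[i] * (int64_t)b[i];
    return (int32_t)floor_shift_right(acc, shift);
}
```

With a = {1968, 741}, b = {4068, -3079} and a shift of 10, ref_dot_shift over both elements gives 5590. Calling it on the second element alone (741 * -3079, length 1) gives -2229, so the unexpected value coincides with the second product on its own. This only shows which arithmetic reproduces the number; it does not explain why the first product would be missing in some iterations.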
Hello @FrancescoConti ,
I want to run the Pulpissimo-based HWPE-MAC-Engine example on a Zedboard. I tried running the HWME code provided in pulp-rt-examples, but unfortunately I am stuck at a point. I will share the hwme.c file that I have modified:
/*
* Copyright (C) 2018 ETH Zurich and University of Bologna
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Authors: Francesco Conti <[email protected]>
*/
#include "pulp.h"
#include <stdint.h>
#include <stdio.h>
#include "archi/hwme/hwme_v1.h"
#include "hal/hwme/hwme_v1.h"
#include <rt/rt_api.h>
#define ARCHI_SOC_EVENT_FCHWPE0 140
#define USE_STIMULI
// comment below line to run only dot product with bias
//#define DO_MATVEC_MULT
#ifndef DO_MATVEC_MULT
#define DO_DOT_PROD
#endif
#include "hwme_stimuli_a.h"
#include "hwme_stimuli_b.h"
#include "hwme_stimuli_c.h"
#include "hwme_stimuli_d.h"
unsigned int __rt_iodev_uart_baudrate = 115200;
int __rt_fpga_fc_frequency = 20000000; // e.g. 20000000 for 20MHz;
int __rt_fpga_periph_frequency = 10000000; // e.g. 10000000 for 10MHz;
int main() {
uint32_t *a = (uint32_t *) 0x1c010000;
uint32_t *b = (uint32_t *) 0x1c010200;
uint32_t *c = (uint32_t *) 0x1c010400;
uint32_t *d = (uint32_t *) 0x1c010600;
int coreID = get_core_id();
#ifdef DO_MATVEC_MULT
// define dimensions
uint32_t in_vec_len = 8;
uint32_t out_vec_len = 10;
#endif
volatile int errors = 0;
int gold_sum = 0, check_sum = 0;
int i,j;
int offload_id_tmp, offload_id;
printf("%d\n", get_core_id()); // printf needs a format string; passing the integer ID directly is undefined behavior
if(get_core_id() == 0) {
#ifdef USE_STIMULI
for(int i=0; i<512; i++) {
((uint8_t *) a)[i] = stim_a[i];
}
//printing input value of a
for(int loop = 0; loop < 1; loop++){
printf("a %d ", a[loop], "\n");
}
for(int i=0; i<512; i++) {
((uint8_t *) b)[i] = stim_b[i];
}
//printing b
for(int loop = 0; loop < 1; loop++){
printf("b %d ", b[loop], "\n");
}
for(int i=0; i<512; i++) {
#ifdef DO_MATVEC_MULT
((uint8_t *) c)[i] = 0; // no bias for matrix vector multiplication
#else
((uint8_t *) c)[i] = stim_c[i];
#endif
}
//printing c
for(int loop = 0; loop < 1; loop++){
printf("c %d ", c[loop], "\n");
}
for(int i=0; i<512; i++) {
((uint8_t *) d)[i] = stim_d[i];
}
//printing d
for(int loop = 0; loop < 1; loop++){
printf("d %d ", d[loop], "\n");
}
#else
for(int i=0; i<127; i++) {
a[i] = i;
}
for(int loop = 0; loop < 1; loop++){
printf("%d ", a[loop], "\n");
}
for(int i=0; i<127; i++) {
b[i] = i;
}
for(int loop = 0; loop < 1; loop++){
printf("%d ", b[loop], "\n");
}
for(int i=0; i<127; i++) {
c[i] = i;
}
for(int loop = 0; loop < 1; loop++){
printf("%d ", c[loop], "\n");
}
for(int i=0; i<127; i++) {
d[i] = i;
}
for(int loop = 0; loop < 1; loop++){
printf("%d ", d[loop], "\n");
}
#endif
/* convolution-accumulation - HW */
plp_hwme_enable();
while((offload_id_tmp = hwme_acquire_job()) < 0);
// set up bytecode
hwme_bytecode_set(HWME_LOOPS1_OFFS, 0x00000000);
hwme_bytecode_set(HWME_BYTECODE5_LOOPS0_OFFS, 0x00040000);
hwme_bytecode_set(HWME_BYTECODE4_OFFS, 0x00000000);
hwme_bytecode_set(HWME_BYTECODE3_OFFS, 0x00000000);
hwme_bytecode_set(HWME_BYTECODE2_OFFS, 0x00000000);
hwme_bytecode_set(HWME_BYTECODE1_OFFS, 0x000008cd);
hwme_bytecode_set(HWME_BYTECODE0_OFFS, 0x11a13c05);
// job-dependent registers
hwme_a_addr_set((unsigned int) a);
hwme_b_addr_set((unsigned int) b);
hwme_c_addr_set((unsigned int) c);
hwme_d_addr_set((unsigned int) d);
#ifdef DO_MATVEC_MULT
hwme_nb_iter_set(out_vec_len);
hwme_len_iter_set(in_vec_len-1);
hwme_vectstride_set(in_vec_len*4); // stride for the matrix is equal to in_vec length * wordsize
hwme_vectstride2_set(0); // stride for the vector is zero
#else
hwme_nb_iter_set(4);
hwme_len_iter_set(32-1);
hwme_vectstride_set(32*4);
hwme_vectstride2_set(32*4); // same stride for both streams
#endif
hwme_shift_simplemul_set(hwme_shift_simplemul_value(0, 0));
// start HWME operation
printf("Start HWME operation !\n");
hwme_trigger_job();
printf("Started HWME operation !\n");
// wait for end of computation
soc_eu_fcEventMask_setEvent(ARCHI_SOC_EVENT_FCHWPE0);
__rt_periph_wait_event(ARCHI_SOC_EVENT_FCHWPE0, 1);
printf("Archi detection ! \n");
plp_hwme_disable();
printf("Wait for end of computation !\n");
// check
#ifndef USE_STIMULI
if(d[0] != 0x000028b0) errors++;
if(d[1] != 0x000124b1) errors++;
if(d[2] != 0x000320b2) errors++;
if(d[3] != 0x00061cb3) errors++;
#else
#ifdef DO_MATVEC_MULT
if(d[0] != 0x7CB12A38) errors++;
if(d[1] != 0xCD4F4DCB) errors++;
if(d[2] != 0x49CD5D5C) errors++;
if(d[3] != 0x2A1D8706) errors++;
#else
if(d[0] != 0x7f228fd6) errors++;
if(d[1] != 0x23a7d5c2) errors++;
if(d[2] != 0x7f281848) errors++;
if(d[3] != 0x6127d834) errors++;
#endif
#endif /* USE_STIMULI */
printf("errors=%d\n", errors);
printf("Done with computation !\n");
//printing output
printf("printing d after computation: \n");
for(int loop = 0; loop < 1; loop++){
printf("d %d ", d[loop], "\n");
}
}
printf("Sync barrier !\n");
synch_barrier();
printf("Done with everything !\n");
return errors;
}
I updated the code file by adding the following lines:
#define ARCHI_SOC_EVENT_FCHWPE0 140
unsigned int __rt_iodev_uart_baudrate = 115200;
int __rt_fpga_fc_frequency = 20000000; // e.g. 20000000 for 20MHz;
int __rt_fpga_periph_frequency = 10000000; // e.g. 10000000 for 10MHz;
And added a few print statements to view the output.
This is the output observed on minicom.
It does not print anything after "Started HWME operation !", that is, execution appears to stop at the following lines:
soc_eu_fcEventMask_setEvent(ARCHI_SOC_EVENT_FCHWPE0);
__rt_periph_wait_event(ARCHI_SOC_EVENT_FCHWPE0, 1);
I checked out various issues in pulpissimo but haven't found any proper solution yet. Could you help me out with it? Thanking You.
Regards,
Zeal
Hey everyone,
I'm currently facing a challenge as I don't have access to QuestaSim for running tests. Are there any alternative simulators that can serve the same purpose? If so, could you please suggest some options? Additionally, any guidance on how to integrate these alternative simulators into my workflow would be immensely helpful.
Thanks!
Hello to everyone,
I'm trying to install this repository to see how the HWPE works.
These are the steps:
$ git clone --recursive https://github.com/pulp-platform/hwpe-mac-engine
$ cd hwpe-mac-engine/
$ make checkout
make: *** No rule to make target 'checkout'. Stop.
Any suggestion?
Thanks in advance.