Comments (25)
You could test something like this, still without changing anything in Yosys:
A synthesizer should be able to handle any code construct efficiently, including the ones mentioned in the issue above. Having to engineer sourcecode and adding pragmas to obtain good results is a non-sustainable solution.
Welcome to reality!
from yosys.
Applying both #3875 and #3877 should now yield something like this:
Chip area for module '\reg_demux_noshift': 18548.573400
Chip area for module '\reg_demux_nowrshmsk_mdim_arr': 37543.224600
Chip area for module '\reg_demux_nowrshmsk_typedef': 37543.224600
Chip area for module '\reg_demux_pow2': 37214.742600
Chip area for module '\reg_demux': 71352.640800
Later optimization passes can yield different results depending on unrelated changes in the input file (and the phase of the moon, for all I know), so results will vary somewhat.
I've disabled the removal of multiplication for (* nowrshmsk *)
, since this for some reason only seemed to yield good results when combined with (* nordshift *)
.
from yosys.
You could test something like this, still without changing anything in Yosys:
A synthesizer should be able to handle any code construct efficiently, including the ones mentioned in the issue above. Having to engineer sourcecode and adding pragmas to obtain good results is a non-sustainable solution.
from yosys.
For writes, it could be possible to expand on the work I did earlier to optimize access to multirange packed arrays (within structs).
This takes the relevant bit stride into account when generating the CASE block requested via the (* nowrshmsk *)
attribute. This CASE block is a substitute for $shift/$shiftx
.
I actually also experimented with a similar optimization for reads, using a new attribute (* nordshift *)
. In my case, this didn't yield any resource savings, however it is not impossible that an expanded version of this could be beneficial in your case. I have rebased the branch for this in case it could be helpful: https://github.com/daglem/yosys/tree/nordshift
from yosys.
However, if we have something like array[i*33+:33] this optimization is no longer possible. Since the LSB of the multiplier is set, any of the following bits could change depending on the value of i. This means no log-shifter stages can be removed and we generate a significant overhead, as we use a much more powerful operation (a generic shift) to implement the part selection.
There's a shiftmul
rule in the peepopt
pass that is supposed to address exactly that. When it sees a shift driven by a mul that effectively work as a group select, the rule will round up the shift stride to the next power-of-2 and rearrange the input to the shift appropriately, so that those optimizations you describe can take place. It only matches on right shifts, so it will only optimize part-selects working as muxes, not the demux you have in your sample as well.
The reason why that rule doesn't fire off in your case seems to be due an earlier opt_expr
pass replacing the bottom bit of the multiplication result with zero since 34
is even, and this way interfering with the detection in peepopt
. (Both opt_expr
and peepopt
are called by the coarse
block in synth
).
I think the cleanest solution would be to have a dedicated $select (or whatever it will be called) construct that represents a selection of a group of bits from an array of groups, this would explicitly implement this concept.
This would give a significant amount of flexibility how it will be mapped later on and for different targets.
FWIW there's the $bmux
cell which I think is exactly that. Maybe that's what the part-select in your sample would synthesize to through the Verific frontend? Who knows.
from yosys.
For writes, it could be possible to expand on the work I did earlier to optimize access to multirange packed arrays (within structs). This takes the relevant bit stride into account when generating the CASE block requested via the
(* nowrshmsk *)
attribute. This CASE block is a substitute for$shift/$shiftx
.
Doy you by chance have a simple example for this?
We actually tried this before but didn't get (* nowrshmsk *)
to do anything but its possible we just didn't use it properly.
from yosys.
For writes, it could be possible to expand on the work I did earlier to optimize access to multirange packed arrays (within structs). This takes the relevant bit stride into account when generating the CASE block requested via the
(* nowrshmsk *)
attribute. This CASE block is a substitute for$shift/$shiftx
.
You could actually try out (* nowrshmsk *)
without making any changes. Even though this is currently not ideal in that it will create an enormous CASE, it could just be that this will be properly optimized. When I implemented (* nowrshmsk *)
for packed arrays within structs, the stride functionality only served to generate a smaller/nicer AST, it didn't really make any difference on the final resource usage.
I.e. (* nowrshmsk *) output reg [(NoPorts * 117) - 1:0] out_req_o;
from yosys.
There's a
shiftmul
rule in thepeepopt
pass...
Damn, pmgen
looks pretty damn complex, I am not quite sure what I am looking at there.
This is indeed extremely close to what I am looking for.
The reason why that rule doesn't fire off in your case seems to be due an earlier
opt_expr
pass replacing the bottom bit of the multiplication result with zero since34
is even, and this way interfering with the detection inpeepopt
. (Bothopt_expr
andpeepopt
are called by thecoarse
block insynth
).
I just played around with this for a bit and I think I found out why it didn't do anything in our case:
- making the bottom bit constant seems to be fine, at least I see no indication in the code why this shouldn't work
- The
$mul
needs to be connected directly to the$shiftx
. In our case it often has an offset (ie something likedata[3+33*index+:5]
which can be read as "from packs of 33bits, select a pack, then read starting from bit pos 3, read 5 consecutive bits")
Bonus: this also means you must runopt_expr
before it since the initial mapping of a part select will always have an$add
cell for the offset (without offset its just+0
but its still there).opt_expr
will optimize this cell away. - It doesn't like slicing downwards (
data[33*index-:5]
; even if you make sure index is >0) - It has a seemingly arbitrary maximum shift-size of 20. Everything above will not be optimized this way (
if (GetSize(shamt) > 20) reject;
) - Running
wreduce
before it is absolutely necessary, this was already the case but still worth mentioning. Otherwise the$mul
will use the default width of the index and it again rejects the optimization.
So in conclusion, this should solve the simplest cases of this problem but anything moderately complex will be left as-is.
Still good to know it exists to help me with the pass I am currently working on. In the end it might be worth to replace this simple implementation with a more holistic optimization, though that means I need to understand what happens in pmgen
, which I am not fond of xD
from yosys.
You could actually try out
(* nowrshmsk *)
without making any changes. Even though this is currently not ideal in that it will create an enormous CASE, it could just be that this will be properly optimized. When I implemented(* nowrshmsk *)
for packed arrays within structs, the stride functionality only served to generate a smaller/nicer AST, it didn't really make any difference on the final resource usage.I.e.
(* nowrshmsk *) output reg [(NoPorts * 117) - 1:0] out_req_o;
Quality-of-Results seems to be a lot better, not quite as good as my manual fix but pretty close to it.
What worries me is that CPU time increases to 18x and memory footprint to 10x of the initial run.
Since our memory footprint without a fix is already >300GB, I don't see us running this on the large modules.
Still, it shows that the general approach of replacing $shift/$shiftx
cells is a good one and I can likely use it to get better estimates for the expected QoR with a fix.
from yosys.
I happen to be writing some patches for the shiftmul matcher, I sure invite you to take a look and provide feedback once it's posted.
just played around with this for a bit and I think I found out why it didn't do anything in our case:
- making the bottom bit constant seems to be fine, at least I see no indication in the code why this shouldn't work
It's the index <SigSpec> port(mul, \Y) === shamt
instruction that restricts it. It means the shift amount in shamt
must match what's on the output port of the $mul
cell exactly (modulo wire aliases, since those are dealt with in the port()
helper, and zero-padding from above, since that's explicitly stripped some lines above).
- The
$mul
needs to be connected directly to the$shiftx
. In our case it often has an offset (ie something likedata[3+33*index+:5]
which can be read as "from packs of 33bits, select a pack, then read starting from bit pos 3, read 5 consecutive bits")
Bonus: this also means you must runopt_expr
before it since the initial mapping of a part select will always have an$add
cell for the offset (without offset its just+0
but its still there).opt_expr
will optimize this cell away.
If there isn't a pass already that deals with the constant offset in the shift and implements it as two consecutive shifts maybe we can add that as a separate pattern to peepopt
, once that pattern is applied the regular shiftmul
can deal with what's left.
- It has a seemingly arbitrary maximum shift-size of 20. Everything above will not be optimized this way (
if (GetSize(shamt) > 20) reject;
)
That's restriction on the bitsize of the shift amount, so it restricts the shift to 2**20
.
from yosys.
Good to know that someone with more experience is also working on this :-)
Sorry for not seeing the GetSize()
, that was stupid.
Below you can find what I have been working on so far, feel free to take whatever you want from it, though I doubt its 'good code' since I am still getting my grasp on the code-base.
As a note, it currently doesn't actually do anything since the final replacement is still missing but it does identify the cells and the necessary signals, parameters and so on.
https://github.com/phsauter/yosys/tree/opt-part-select
from yosys.
@phsauter No worries! See #3873
from yosys.
Quality-of-Results seems to be a lot better, not quite as good as my manual fix but pretty close to it. What worries me is that CPU time increases to 18x and memory footprint to 10x of the initial run. Since our memory footprint without a fix is already >300GB, I don't see us running this on the large modules. Still, it shows that the general approach of replacing
$shift/$shiftx
cells is a good one and I can likely use it to get better estimates for the expected QoR with a fix.
You could test something like this, still without changing anything in Yosys:
typedef struct packed {
logic [NoPorts][117] port;
} out_req_t;
...
(* nowrshmsk *) output out_req_t out_req_o;
...
out_req_o.port[in_select_i] = in_req_i;
Unfortunately packed multirange arrays are currently only available within packed structs in Yosys - otherwise this code could be further simplified.
from yosys.
You could test something like this, still without changing anything in Yosys ...
Maybe I am still doing it wrong but if I do that, it just reduces everything to one single out_req_o
wire instead of it being a 117bit 1:13 demultiplexer.
from yosys.
You could test something like this, still without changing anything in Yosys ...
Maybe I am still doing it wrong but if I do that, it just reduces everything to one single
out_req_o
wire instead of it being a 117bit 1:13 demultiplexer.
Doesn't this wire have wiretype out_req_t? It looks plausible to me when I run yosys -p 'read_verilog -sv -dump_ast2 reg_demux.v'
- the output contains the appropriate number of AST_CASE / AST_COND. Here is the complete modified module for reference:
module reg_demux (
clk_i,
rst_ni,
in_select_i,
in_req_i,
in_rsp_o,
out_req_o,
out_rsp_i
);
parameter [31:0] NoPorts = 32'd13;
typedef struct packed {
logic [NoPorts][117] port;
} out_req_t;
input wire clk_i;
input wire rst_ni;
input wire [3:0] in_select_i;
input wire [116:0] in_req_i;
output reg [33:0] in_rsp_o;
// output reg [(NoPorts * 117) - 1:0] out_req_o;
(* nowrshmsk *) output out_req_t out_req_o;
input wire [(NoPorts * 34) - 1:0] out_rsp_i;
always @(*) begin
out_req_o = 1'sb0;
in_rsp_o = 1'sb0;
// out_req_o[in_select_i * 117+:117] = in_req_i;
out_req_o.port[in_select_i] = in_req_i;
in_rsp_o = out_rsp_i[in_select_i * 34+:34];
end
endmodule
from yosys.
In the final netlist I am getting:
(* wiretype = "\\out_req_t" *)
output out_req_o;
wire out_req_o;
It notes the wiretype but the size is still only one bit. Is there some additional command I need to run after read_verilog
?
from yosys.
Heh, sorry, I have no idea what is going on there. Could the netlist be wrong (do you use it for anything later)?
Hopefully someone else can explain.
from yosys.
In the final netlist I am getting:
(* wiretype = "\\out_req_t" *) output out_req_o; wire out_req_o;It notes the wiretype but the size is still only one bit. Is there some additional command I need to run after
read_verilog
?
Evidently I've been on vacation for too long. Change the typedef to this, and things should hopefully start working:
typedef struct packed {
logic [NoPorts-1:0][117-1:0] port;
} out_req_t;
from yosys.
Evidently I've been on vacation for too long. Change the typedef to this, and things should hopefully start working:
And I shouldn't just copy stuff blindly ^^
Sadly if I do this I now get the AST_CASE
and AST_COND
as I would expect but they only seem to checks whether in_select_i
is equal to zero. Meaning it probably doesn't like multi-dimensional packed arrays.
If I change it to:
typedef struct packed {
logic [117-1:0] port[NoPorts-1:0];
} out_req_t;
This then does actually give me the correct AST and initial RTLIL (even though it is kinda scary since now it isn't packed anymore).
But then somewhere inside synth -run begin:fine (ie in coarse synthesis) it converts it back to a shift.
The peepopt
shiftmul
approach looks pretty promising though so unless you are curious, I don't think its necessary to debug the nowrshmsk
approach.
For documentation purposes, here is a zip file with the initial, manual fix and nowrshmsk approaches:
shift_issue2.zip
from yosys.
Sadly if I do this I now get the
AST_CASE
andAST_COND
as I would expect but they only seem to checks whetherin_select_i
is equal to zero. Meaning it probably doesn't like multi-dimensional packed arrays.
I'm using (* nowrshmsk *)
for multidimensional packed arrays in https://github.com/daglem/reDIP-SID to good effect, so it should work. I don't see anything wrong in the AST offhand - e.g. the output of yosys -p 'read_verilog -sv -dump_vlog2 reg_demux.v'
looks like this:
...
case ((in_select_i)*(117))
0:
(* wiretype = /** AST_STRUCT_ITEM **/ *)out_req_o[1520:0][116:0] = in_req_i;
117:
(* wiretype = /** AST_STRUCT_ITEM **/ *)out_req_o[1520:0][233:117] = in_req_i;
234:
(* wiretype = /** AST_STRUCT_ITEM **/ *)out_req_o[1520:0][350:234] = in_req_i;
...
I'd better look into avoiding the multiplication with a non-power-of-two number, but it should still work.
If I change it to:
typedef struct packed { logic [117-1:0] port[NoPorts-1:0]; } out_req_t;This then does actually give me the correct AST and initial RTLIL (even though it is kinda scary since now it isn't packed anymore). But then somewhere inside synth -run begin:fine (ie in coarse synthesis) it converts it back to a shift.
Well, unpacked arrays within packed structs are not allowed by the LRM (it's a Yosys extension which should probably be removed), so I wouldn't use that in any case.
The
peepopt
shiftmul
approach looks pretty promising though so unless you are curious, I don't think its necessary to debug thenowrshmsk
approach.
I'd actually be very interested in having nowrshmsk
work in your case, comparing it to the shift approach. Could you please have another look, and show how you get an incorrect AST?
from yosys.
PR #3875 gets rid of the multiplication by 117 in the (* nowrshmsk *)
AST_CASE
generated from reg_demux.v
.
from yosys.
I reworked #3875 so that the optimizations for two-dimensional packed ranges also covers variable slices on the form dst[i*w +: w] = src, i.e. you should only have to make the following change to generate an optimized AST:
(* nowrshmsk *) output reg [(NoPorts * 117) - 1:0] out_req_o;
Does this work for you?
from yosys.
This weekend I don‘t have access to a computer, I will try it out on monday.
But you should be able to check as well by downloading the zip file and running make all
and then checking the RTLIL/netlist.
from yosys.
This weekend I don‘t have access to a computer, I will try it out on monday.
Please do! I also made PR #3877 so that indexing of rvalues can be similarily optimized as well.
After adding both #3875 and #3877, you can do:
(* nowrshmsk *) output reg [(NoPorts * 117) - 1:0] out_req_o;
(* nordshift *) input wire [(NoPorts * 34) - 1:0] out_rsp_i;
But you should be able to check as well by downloading the zip file and running
make all
and then checking the RTLIL/netlist.
I'll have to leave it to you to check for correctness - perhaps the result is too good to be true?
$ cat output/*.rpt | grep "Chip area"
Chip area for module '\reg_demux_noshift': 21636.720000
Chip area for module '\reg_demux_pow2': 37214.742600
Chip area for module '\reg_demux': 74589.114600
from yosys.
Sadly if I do this I now get the
AST_CASE
andAST_COND
as I would expect but they only seem to checks whetherin_select_i
is equal to zero. Meaning it probably doesn't like multi-dimensional packed arrays.
Okay this issue was fixed by using a newer version of yosys (before it was 0.27 or 0.28, can't remember).
This gives me a correct solution with 32'700um^2 compared to the 37'200um^2 of the pow2
solution. This is roughly what I would expect, even a bit better.
If I apply your PRs and use:
module reg_demux_nowrshmsk_mdim_arr
// ...
(* nowrshmsk *) output reg [(NoPorts * 117) - 1:0] out_req_o;
// ...
This results in an area of 50'000um^2, though for some reason now even the basic version is slightly worse (above 66k).
Same result for the typedef solution, a significant increase compared to the version without the PRs.
using:
module reg_demux_noshift
// ...
(* nowrshmsk *) output reg [(NoPorts * 117) - 1:0] out_req_o;
(* nordshift *) input wire [(NoPorts * 34) - 1:0] out_rsp_i;
// ...
I also get 19k, this is both in-line with expectations from our commercial tools and it is also functionally equivalent, meaning this is a significant improvement over other solutions and should be seen as the gold standard.
Maybe we should move the following discussion over to #3875:
If I apply both PRs I get a base of 66k, with just PR #3875 its even higher at 75k and both the typedef and strided-array access are lower at 50k but only compared to this worse base, compared to typedef and yosys 0.32 which achieves 32k this is too high.
So I think there may be some bug in this PR.
TL;DR: Using both attributes generates fantastic results, PR #3875 diminishes the results from nowrshmsk
alone, my issue was likely an old yosys version.
from yosys.
Related Issues (20)
- Parameters in other packages HOT 2
- Reduce default severity of Verific messages that produce warnings on commonly used coding styles
- Latch inferred for x signal HOT 4
- Assertion Failure in AST Processing: node->bits == v at frontends/ast/ast.cc:855
- Inconsistency in Verilog Synthesis: Yosys Successfully Synthesizes Code That Fails in Vivado and Quartus Due to Syntax Errors HOT 3
- Yosys Fails to Synthesize Tri-State Logic Correctly for inout Ports HOT 1
- Another out-of-memory problem with for loop
- synth_* passes should call `check -mapped`
- "ERROR: Assert `count_id(wire->name) == 0' failed in kernel/rtlil.cc:2143" when using synth_{ice40,ecp5} on simple design HOT 2
- Crash in yosys-abc
- Manual title page should have yosys version number
- Should -nomx8 be the default for the GateMate?
- opt: no "-purge" option but public names removed HOT 2
- Tests fails on Debian GNU/Linux on ppc64 HOT 6
- write_smt2: "-wires" option leads to inequivalent descriptions
- Spurious warnings "select out of bounds on signal" when there is no such thing ... HOT 1
- Inconsistent simulation before and after yosys synthesis HOT 1
- Inout can't be read with constant value HOT 3
- Inout port not working with array replication operator HOT 6
- Add support for SystemVerilog's `==?` and `!=?` operators
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from yosys.