Git Product home page Git Product logo

Comments (25)

daglem avatar daglem commented on July 19, 2024 2

You could test something like this, still without changing anything in Yosys:

A synthesizer should be able to handle any code construct efficiently, including the ones mentioned in the issue above. Having to engineer sourcecode and adding pragmas to obtain good results is a non-sustainable solution.

Welcome to reality!

from yosys.

daglem avatar daglem commented on July 19, 2024 2

Applying both #3875 and #3877 should now yield something like this:

   Chip area for module '\reg_demux_noshift': 18548.573400
   Chip area for module '\reg_demux_nowrshmsk_mdim_arr': 37543.224600
   Chip area for module '\reg_demux_nowrshmsk_typedef': 37543.224600
   Chip area for module '\reg_demux_pow2': 37214.742600
   Chip area for module '\reg_demux': 71352.640800

Later optimization passes can yield different results depending on unrelated changes in the input file (and the phase of the moon, for all I know), so results will vary somewhat.
I've disabled the removal of multiplication for (* nowrshmsk *), since this for some reason only seemed to yield good results when combined with (* nordshift *).

from yosys.

thommythomaso avatar thommythomaso commented on July 19, 2024 1

You could test something like this, still without changing anything in Yosys:

A synthesizer should be able to handle any code construct efficiently, including the ones mentioned in the issue above. Having to engineer sourcecode and adding pragmas to obtain good results is a non-sustainable solution.

from yosys.

daglem avatar daglem commented on July 19, 2024

For writes, it could be possible to expand on the work I did earlier to optimize access to multirange packed arrays (within structs).
This takes the relevant bit stride into account when generating the CASE block requested via the (* nowrshmsk *) attribute. This CASE block is a substitute for $shift/$shiftx.

I actually also experimented with a similar optimization for reads, using a new attribute (* nordshift *). In my case, this didn't yield any resource savings, however it is not impossible that an expanded version of this could be beneficial in your case. I have rebased the branch for this in case it could be helpful: https://github.com/daglem/yosys/tree/nordshift

from yosys.

povik avatar povik commented on July 19, 2024

However, if we have something like array[i*33+:33] this optimization is no longer possible. Since the LSB of the multiplier is set, any of the following bits could change depending on the value of i. This means no log-shifter stages can be removed and we generate a significant overhead, as we use a much more powerful operation (a generic shift) to implement the part selection.

There's a shiftmul rule in the peepopt pass that is supposed to address exactly that. When it sees a shift driven by a mul that effectively work as a group select, the rule will round up the shift stride to the next power-of-2 and rearrange the input to the shift appropriately, so that those optimizations you describe can take place. It only matches on right shifts, so it will only optimize part-selects working as muxes, not the demux you have in your sample as well.

The reason why that rule doesn't fire off in your case seems to be due an earlier opt_expr pass replacing the bottom bit of the multiplication result with zero since 34 is even, and this way interfering with the detection in peepopt. (Both opt_expr and peepopt are called by the coarse block in synth).

I think the cleanest solution would be to have a dedicated $select (or whatever it will be called) construct that represents a selection of a group of bits from an array of groups, this would explicitly implement this concept.
This would give a significant amount of flexibility how it will be mapped later on and for different targets.

FWIW there's the $bmux cell which I think is exactly that. Maybe that's what the part-select in your sample would synthesize to through the Verific frontend? Who knows.

from yosys.

phsauter avatar phsauter commented on July 19, 2024

For writes, it could be possible to expand on the work I did earlier to optimize access to multirange packed arrays (within structs). This takes the relevant bit stride into account when generating the CASE block requested via the (* nowrshmsk *) attribute. This CASE block is a substitute for $shift/$shiftx.
Doy you by chance have a simple example for this?

We actually tried this before but didn't get (* nowrshmsk *) to do anything but its possible we just didn't use it properly.

from yosys.

daglem avatar daglem commented on July 19, 2024

For writes, it could be possible to expand on the work I did earlier to optimize access to multirange packed arrays (within structs). This takes the relevant bit stride into account when generating the CASE block requested via the (* nowrshmsk *) attribute. This CASE block is a substitute for $shift/$shiftx.

You could actually try out (* nowrshmsk *) without making any changes. Even though this is currently not ideal in that it will create an enormous CASE, it could just be that this will be properly optimized. When I implemented (* nowrshmsk *) for packed arrays within structs, the stride functionality only served to generate a smaller/nicer AST, it didn't really make any difference on the final resource usage.

I.e. (* nowrshmsk *) output reg [(NoPorts * 117) - 1:0] out_req_o;

from yosys.

phsauter avatar phsauter commented on July 19, 2024

There's a shiftmul rule in the peepopt pass...

Damn, pmgen looks pretty damn complex, I am not quite sure what I am looking at there.
This is indeed extremely close to what I am looking for.

The reason why that rule doesn't fire off in your case seems to be due an earlier opt_expr pass replacing the bottom bit of the multiplication result with zero since 34 is even, and this way interfering with the detection in peepopt. (Both opt_expr and peepopt are called by the coarse block in synth).

I just played around with this for a bit and I think I found out why it didn't do anything in our case:

  1. making the bottom bit constant seems to be fine, at least I see no indication in the code why this shouldn't work
  2. The $mul needs to be connected directly to the $shiftx. In our case it often has an offset (ie something like data[3+33*index+:5] which can be read as "from packs of 33bits, select a pack, then read starting from bit pos 3, read 5 consecutive bits")
    Bonus: this also means you must run opt_expr before it since the initial mapping of a part select will always have an $add cell for the offset (without offset its just +0 but its still there). opt_expr will optimize this cell away.
  3. It doesn't like slicing downwards (data[33*index-:5]; even if you make sure index is >0)
  4. It has a seemingly arbitrary maximum shift-size of 20. Everything above will not be optimized this way (if (GetSize(shamt) > 20) reject;)
  5. Running wreduce before it is absolutely necessary, this was already the case but still worth mentioning. Otherwise the $mul will use the default width of the index and it again rejects the optimization.

So in conclusion, this should solve the simplest cases of this problem but anything moderately complex will be left as-is.

Still good to know it exists to help me with the pass I am currently working on. In the end it might be worth to replace this simple implementation with a more holistic optimization, though that means I need to understand what happens in pmgen, which I am not fond of xD

from yosys.

phsauter avatar phsauter commented on July 19, 2024

You could actually try out (* nowrshmsk *) without making any changes. Even though this is currently not ideal in that it will create an enormous CASE, it could just be that this will be properly optimized. When I implemented (* nowrshmsk *) for packed arrays within structs, the stride functionality only served to generate a smaller/nicer AST, it didn't really make any difference on the final resource usage.

I.e. (* nowrshmsk *) output reg [(NoPorts * 117) - 1:0] out_req_o;

Quality-of-Results seems to be a lot better, not quite as good as my manual fix but pretty close to it.
What worries me is that CPU time increases to 18x and memory footprint to 10x of the initial run.
Since our memory footprint without a fix is already >300GB, I don't see us running this on the large modules.
Still, it shows that the general approach of replacing $shift/$shiftx cells is a good one and I can likely use it to get better estimates for the expected QoR with a fix.

from yosys.

povik avatar povik commented on July 19, 2024

@phsauter

I happen to be writing some patches for the shiftmul matcher, I sure invite you to take a look and provide feedback once it's posted.

just played around with this for a bit and I think I found out why it didn't do anything in our case:

  1. making the bottom bit constant seems to be fine, at least I see no indication in the code why this shouldn't work

It's the index <SigSpec> port(mul, \Y) === shamt instruction that restricts it. It means the shift amount in shamt must match what's on the output port of the $mul cell exactly (modulo wire aliases, since those are dealt with in the port() helper, and zero-padding from above, since that's explicitly stripped some lines above).

  1. The $mul needs to be connected directly to the $shiftx. In our case it often has an offset (ie something like data[3+33*index+:5] which can be read as "from packs of 33bits, select a pack, then read starting from bit pos 3, read 5 consecutive bits")
    Bonus: this also means you must run opt_expr before it since the initial mapping of a part select will always have an $add cell for the offset (without offset its just +0 but its still there). opt_expr will optimize this cell away.

If there isn't a pass already that deals with the constant offset in the shift and implements it as two consecutive shifts maybe we can add that as a separate pattern to peepopt, once that pattern is applied the regular shiftmul can deal with what's left.

  1. It has a seemingly arbitrary maximum shift-size of 20. Everything above will not be optimized this way (if (GetSize(shamt) > 20) reject;)

That's restriction on the bitsize of the shift amount, so it restricts the shift to 2**20.

from yosys.

phsauter avatar phsauter commented on July 19, 2024

Good to know that someone with more experience is also working on this :-)

Sorry for not seeing the GetSize(), that was stupid.

Below you can find what I have been working on so far, feel free to take whatever you want from it, though I doubt its 'good code' since I am still getting my grasp on the code-base.
As a note, it currently doesn't actually do anything since the final replacement is still missing but it does identify the cells and the necessary signals, parameters and so on.
https://github.com/phsauter/yosys/tree/opt-part-select

from yosys.

povik avatar povik commented on July 19, 2024

@phsauter No worries! See #3873

from yosys.

daglem avatar daglem commented on July 19, 2024

Quality-of-Results seems to be a lot better, not quite as good as my manual fix but pretty close to it. What worries me is that CPU time increases to 18x and memory footprint to 10x of the initial run. Since our memory footprint without a fix is already >300GB, I don't see us running this on the large modules. Still, it shows that the general approach of replacing $shift/$shiftx cells is a good one and I can likely use it to get better estimates for the expected QoR with a fix.

You could test something like this, still without changing anything in Yosys:

	typedef struct packed {
	    logic [NoPorts][117] port;
	} out_req_t;
	...
	(* nowrshmsk *) output out_req_t out_req_o;
	...
		out_req_o.port[in_select_i] = in_req_i;

Unfortunately packed multirange arrays are currently only available within packed structs in Yosys - otherwise this code could be further simplified.

from yosys.

phsauter avatar phsauter commented on July 19, 2024

You could test something like this, still without changing anything in Yosys ...

Maybe I am still doing it wrong but if I do that, it just reduces everything to one single out_req_o wire instead of it being a 117bit 1:13 demultiplexer.

from yosys.

daglem avatar daglem commented on July 19, 2024

You could test something like this, still without changing anything in Yosys ...

Maybe I am still doing it wrong but if I do that, it just reduces everything to one single out_req_o wire instead of it being a 117bit 1:13 demultiplexer.

Doesn't this wire have wiretype out_req_t? It looks plausible to me when I run yosys -p 'read_verilog -sv -dump_ast2 reg_demux.v' - the output contains the appropriate number of AST_CASE / AST_COND. Here is the complete modified module for reference:

module reg_demux (
	clk_i,
	rst_ni,
	in_select_i,
	in_req_i,
	in_rsp_o,
	out_req_o,
	out_rsp_i
);
	parameter [31:0] NoPorts = 32'd13;
	typedef struct packed {
	    logic [NoPorts][117] port;
	} out_req_t;
	input wire clk_i;
	input wire rst_ni;
	input wire [3:0] in_select_i;
	input wire [116:0] in_req_i;
	output reg [33:0] in_rsp_o;
//	output reg [(NoPorts * 117) - 1:0] out_req_o;
	(* nowrshmsk *) output out_req_t out_req_o;
	input wire [(NoPorts * 34) - 1:0] out_rsp_i;
	always @(*) begin
		out_req_o = 1'sb0;
		in_rsp_o = 1'sb0;
		
//		out_req_o[in_select_i * 117+:117] = in_req_i;
		out_req_o.port[in_select_i] = in_req_i;
		in_rsp_o = out_rsp_i[in_select_i * 34+:34];
	end
endmodule

from yosys.

phsauter avatar phsauter commented on July 19, 2024

In the final netlist I am getting:

  (* wiretype = "\\out_req_t" *)
  output out_req_o;
  wire out_req_o;

It notes the wiretype but the size is still only one bit. Is there some additional command I need to run after read_verilog?

from yosys.

daglem avatar daglem commented on July 19, 2024

Heh, sorry, I have no idea what is going on there. Could the netlist be wrong (do you use it for anything later)?
Hopefully someone else can explain.

from yosys.

daglem avatar daglem commented on July 19, 2024

In the final netlist I am getting:

  (* wiretype = "\\out_req_t" *)
  output out_req_o;
  wire out_req_o;

It notes the wiretype but the size is still only one bit. Is there some additional command I need to run after read_verilog?

Evidently I've been on vacation for too long. Change the typedef to this, and things should hopefully start working:

	typedef struct packed {
	    logic [NoPorts-1:0][117-1:0] port;
	} out_req_t;

from yosys.

phsauter avatar phsauter commented on July 19, 2024

Evidently I've been on vacation for too long. Change the typedef to this, and things should hopefully start working:

And I shouldn't just copy stuff blindly ^^

Sadly if I do this I now get the AST_CASE and AST_COND as I would expect but they only seem to checks whether in_select_i is equal to zero. Meaning it probably doesn't like multi-dimensional packed arrays.

If I change it to:

typedef struct packed {
	    logic [117-1:0] port[NoPorts-1:0];
	} out_req_t;

This then does actually give me the correct AST and initial RTLIL (even though it is kinda scary since now it isn't packed anymore).
But then somewhere inside synth -run begin:fine (ie in coarse synthesis) it converts it back to a shift.

The peepopt shiftmul approach looks pretty promising though so unless you are curious, I don't think its necessary to debug the nowrshmsk approach.

For documentation purposes, here is a zip file with the initial, manual fix and nowrshmsk approaches:
shift_issue2.zip

from yosys.

daglem avatar daglem commented on July 19, 2024

Sadly if I do this I now get the AST_CASE and AST_COND as I would expect but they only seem to checks whether in_select_i is equal to zero. Meaning it probably doesn't like multi-dimensional packed arrays.

I'm using (* nowrshmsk *) for multidimensional packed arrays in https://github.com/daglem/reDIP-SID to good effect, so it should work. I don't see anything wrong in the AST offhand - e.g. the output of yosys -p 'read_verilog -sv -dump_vlog2 reg_demux.v' looks like this:

...
          case ((in_select_i)*(117))
            0:
              (* wiretype = /** AST_STRUCT_ITEM **/ *)out_req_o[1520:0][116:0] = in_req_i;
            117:
              (* wiretype = /** AST_STRUCT_ITEM **/ *)out_req_o[1520:0][233:117] = in_req_i;
            234:
              (* wiretype = /** AST_STRUCT_ITEM **/ *)out_req_o[1520:0][350:234] = in_req_i;
...

I'd better look into avoiding the multiplication with a non-power-of-two number, but it should still work.

If I change it to:

typedef struct packed {
	    logic [117-1:0] port[NoPorts-1:0];
	} out_req_t;

This then does actually give me the correct AST and initial RTLIL (even though it is kinda scary since now it isn't packed anymore). But then somewhere inside synth -run begin:fine (ie in coarse synthesis) it converts it back to a shift.

Well, unpacked arrays within packed structs are not allowed by the LRM (it's a Yosys extension which should probably be removed), so I wouldn't use that in any case.

The peepopt shiftmul approach looks pretty promising though so unless you are curious, I don't think its necessary to debug the nowrshmsk approach.

I'd actually be very interested in having nowrshmsk work in your case, comparing it to the shift approach. Could you please have another look, and show how you get an incorrect AST?

from yosys.

daglem avatar daglem commented on July 19, 2024

PR #3875 gets rid of the multiplication by 117 in the (* nowrshmsk *) AST_CASE generated from reg_demux.v.

from yosys.

daglem avatar daglem commented on July 19, 2024

I reworked #3875 so that the optimizations for two-dimensional packed ranges also covers variable slices on the form dst[i*w +: w] = src, i.e. you should only have to make the following change to generate an optimized AST:

(* nowrshmsk *) output reg [(NoPorts * 117) - 1:0] out_req_o;

Does this work for you?

from yosys.

phsauter avatar phsauter commented on July 19, 2024

This weekend I don‘t have access to a computer, I will try it out on monday.

But you should be able to check as well by downloading the zip file and running make all and then checking the RTLIL/netlist.

from yosys.

daglem avatar daglem commented on July 19, 2024

This weekend I don‘t have access to a computer, I will try it out on monday.

Please do! I also made PR #3877 so that indexing of rvalues can be similarily optimized as well.

After adding both #3875 and #3877, you can do:

        (* nowrshmsk *) output reg [(NoPorts * 117) - 1:0] out_req_o;
        (* nordshift *) input wire [(NoPorts * 34) - 1:0] out_rsp_i;

But you should be able to check as well by downloading the zip file and running make all and then checking the RTLIL/netlist.

I'll have to leave it to you to check for correctness - perhaps the result is too good to be true?

$ cat output/*.rpt | grep "Chip area"
   Chip area for module '\reg_demux_noshift': 21636.720000
   Chip area for module '\reg_demux_pow2': 37214.742600
   Chip area for module '\reg_demux': 74589.114600

from yosys.

phsauter avatar phsauter commented on July 19, 2024

Sadly if I do this I now get the AST_CASE and AST_COND as I would expect but they only seem to checks whether in_select_i is equal to zero. Meaning it probably doesn't like multi-dimensional packed arrays.

Okay this issue was fixed by using a newer version of yosys (before it was 0.27 or 0.28, can't remember).
This gives me a correct solution with 32'700um^2 compared to the 37'200um^2 of the pow2 solution. This is roughly what I would expect, even a bit better.

If I apply your PRs and use:

module reg_demux_nowrshmsk_mdim_arr
// ...
(* nowrshmsk *) output reg [(NoPorts * 117) - 1:0] out_req_o;
// ...

This results in an area of 50'000um^2, though for some reason now even the basic version is slightly worse (above 66k).
Same result for the typedef solution, a significant increase compared to the version without the PRs.

using:

module reg_demux_noshift
// ...
(* nowrshmsk *) output reg [(NoPorts * 117) - 1:0] out_req_o;
(* nordshift *) input wire [(NoPorts * 34) - 1:0] out_rsp_i;
// ...

I also get 19k, this is both in-line with expectations from our commercial tools and it is also functionally equivalent, meaning this is a significant improvement over other solutions and should be seen as the gold standard.

Maybe we should move the following discussion over to #3875:
If I apply both PRs I get a base of 66k, with just PR #3875 its even higher at 75k and both the typedef and strided-array access are lower at 50k but only compared to this worse base, compared to typedef and yosys 0.32 which achieves 32k this is too high.
So I think there may be some bug in this PR.

TL;DR: Using both attributes generates fantastic results, PR #3875 diminishes the results from nowrshmsk alone, my issue was likely an old yosys version.

from yosys.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.