While recording benchmarks for the s-step method, I noticed some strange behavior: relatively low throughput for small to intermediate sizes. On the AMD Epyc system, I get up to 4.6 GDoFs/s when running the merged operations from cache, but only around 1.5 GDoFs/s for s-step with 4 steps (I have already multiplied the numbers by 4). The problem is visible, e.g., for a size of around a million DoFs:
```
5 7 1536 610203 0.0004661 1.309e+09 324 0.0001763 0.0001968 4.29e-05 4.829e-05 5.086e-06 5.058e-05 2.524e-06
5 7 3072 1205523 0.0008053 1.497e+09 312 0.0003057 0.0003585 8.127e-05 7.769e-05 1.039e-05 8.959e-05 3.427e-06
```
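As an aside for reading these rows: the column assignment is my inference, not a header printed by the harness, but column 6 is exactly column 4 divided by column 5, which suggests number of DoFs, time per iteration, and throughput (column 7 then matches the iteration counts). A minimal sanity check:

```cpp
#include <cstdio>

int main()
{
  // Assumed column reading for the 1.2M-DoF row (my guess, not the
  // harness documentation): column 4 = DoFs, column 5 = time/iteration.
  const double n_dofs = 1205523;   // column 4
  const double t_it   = 0.0008053; // column 5 [s]

  // Should reproduce column 6 (1.497e+09 DoFs/s):
  std::printf("throughput = %.4g DoFs/s\n", n_dofs / t_it);
  return 0;
}
```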
I can't really explain the gap between the mat-vec (0.3585 ms) plus the other operations, which sum to about 0.62 ms/iteration, and the 0.8 ms it takes for a full CG iteration. @peterrum have you seen something like that before? As you can see from the iteration counts, I have already increased the maximum number of iterations to check whether some very expensive initialization is being amortized. Increasing the iteration count did bring the timings down a bit: with only 100 iterations I get 1.16 ms/it, so even more than the 0.8 ms and almost a factor of 2 loss in performance compared to the parts that we have inside timers:
```
5 7 3072 1205523 0.001158 1.041e+09 100 0.0003099 0.0003517 7.086e-05 7.33e-05 1.036e-05 8.898e-05 3.338e-06
```
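To make the gap explicit, here is a minimal sketch that sums the per-section timers of the 312-iteration row and subtracts them from the reported time per iteration. Which of the trailing columns are non-overlapping sections is my assumption, chosen so that the sum reproduces the 0.62 ms mentioned above:

```cpp
#include <cstdio>
#include <numeric>
#include <vector>

int main()
{
  // Per-iteration section timers [s] of the 312-iteration row: mat-vec
  // plus the five remaining (assumed non-overlapping) sections.
  const std::vector<double> sections = {
    0.0003585, 8.127e-05, 7.769e-05, 1.039e-05, 8.959e-05, 3.427e-06};
  const double timed = std::accumulate(sections.begin(), sections.end(), 0.0);
  const double total = 0.0008053; // reported time per iteration [s]

  std::printf("timed sections: %.4g s/it, unaccounted: %.4g s/it\n",
              timed, total - timed);
  return 0;
}
```

This puts the unaccounted part at roughly 0.18 ms per iteration, i.e., almost a third of the timed work.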
But nothing there would explain the gap, so there seems to be something very weird going on.
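One check that might narrow this down: if the unaccounted time were a one-time setup cost, the two runs of the 1.2M-DoF case should lie on a line t_total(n) = t_setup + n * t_it. A minimal two-point fit, assuming the reported ms/it values are simply total run time divided by iteration count:

```cpp
#include <cstdio>

int main()
{
  // Two runs of the same 1.2M-DoF case (from the rows above).
  const double n1 = 100, per_it1 = 0.001158;  // 100-iteration run
  const double n2 = 312, per_it2 = 0.0008053; // 312-iteration run

  // Assumption: per-iteration numbers are total time / iteration count.
  const double t1 = n1 * per_it1, t2 = n2 * per_it2; // total times [s]

  // Linear model t_total(n) = t_setup + n * t_it, solved from two points.
  const double t_it    = (t2 - t1) / (n2 - n1); // marginal cost/iteration
  const double t_setup = t1 - n1 * t_it;        // fixed startup cost

  std::printf("t_it = %.4g s/it, t_setup = %.4g s\n", t_it, t_setup);
  return 0;
}
```

If the resulting t_it lands near the 0.62 ms timed sum, the gap would be a fixed startup cost being amortized; if it stays well above it, the overhead is genuinely per-iteration and hides outside the timed sections.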