Comments (27)

ChrisRackauckas commented on May 23, 2024

With array symbolics now merged (https://symbolics.juliasymbolics.org/dev/manual/arrays/), we have a better idea of where this is all going. Pulling @shashi, @chriselrod, and @YingboMa into this as well.

From https://tutorials.sciml.ai/html/introduction/03-optimizing_diffeq_code.html we know that fusing the operators has a pretty major effect on performance. So we do not want the final code to have sparse matrix multiplications; instead we want to make sure that it all lowers to something that uses LoopVectorization.jl over contiguous loop indices. In order to support huge equations, we probably do not want this to be scalarized. We also want to make sure that this uses a cache-optimized, or at least cache-oblivious, loop iteration, which should be the case when using LoopVectorization.jl. It would be good for the code to be GPU friendly as well.
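
To make that concrete, here is a minimal sketch (my own illustration on a made-up 1-D diffusion-reaction right-hand side, interior points only) of the difference between lowering to a sparse matvec plus broadcasts versus lowering to one fused loop over contiguous indices:

using SparseArrays, LoopVectorization

N = 1000
A = spdiagm(-1 => ones(N - 1), 0 => fill(-2.0, N), 1 => ones(N - 1))  # 1-D Laplacian

# Unfused lowering: a sparse matvec plus broadcasts, an allocation and several passes over memory.
rhs_unfused!(du, u, p) = (du .= p.D .* (A * u) .+ p.a .* u .^ 2)

# Fused lowering: a single contiguous loop, the kind of code we would want to generate.
function rhs_fused!(du, u, p)
    D, a = p.D, p.a
    @turbo for i in 2:length(u)-1
        du[i] = D * (u[i-1] - 2u[i] + u[i+1]) + a * u[i]^2
    end
    du
end

p = (D = 0.01, a = 1.0)
u = rand(N); du = similar(u)
rhs_fused!(du, u, p)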

So on that note, it would be good to see a test of LoopVectorization against something like NNlib's conv for the nonlinear Laplacian. I have faith it could match or surpass it, but that's a thing we should benchmark.

Additionally, we need to think about how to lower symbolic maps and such down to something that could be GPU friendly, and what to do about the boundary conditions in this case.

It would also be good to note the effect of the matrix-vector form on the index reduction algorithm. IIUC one would need to write the boundary terms separately from the matvecs in order to allow the tearing to work, since otherwise that would break the full incidence assumption. If the boundary conditions already have to be written out in index form, I don't know whether that saves any simplicity.

@YingboMa can you share the suggested MTK code for the current form?

chriselrod commented on May 23, 2024

So on that note, it would be good to see a test of LoopVectorization against something like NNlib's conv for the nonlinear Laplacian. I have faith it could match or surpass it, but that's a thing we should benchmark.

Have an example?

using NNlib, Polyester, LoopVectorization, Static
function kernaxes(::DenseConvDims{2,K,C_in, C_out}) where {K,C_in, C_out} # LoopVectorization can take advantage of static size information
  Kโ‚ =  StaticInt(1):StaticInt(K[1])
  Kโ‚‚ =  StaticInt(1):StaticInt(K[2])
  Cแตขโ‚™ =  StaticInt(1):StaticInt(C_in)
  Cโ‚’แตคโ‚œ = StaticInt(1):StaticInt(C_out)
  (Kโ‚, Kโ‚‚, Cแตขโ‚™, Cโ‚’แตคโ‚œ)
end
function convlayer!(
  out::AbstractArray{<:Any,4}, img, kern,
  dcd::DenseConvDims{2, <:Any, <:Any, <:Any, (1, 1), (0, 0, 0, 0), (1, 1), true}
  )
  (Kโ‚, Kโ‚‚, Cแตขโ‚™, Cโ‚’แตคโ‚œ) = kernaxes(dcd)
  @batch for d โˆˆ axes(out,4)
    @turbo for jโ‚ โˆˆ axes(out,1), jโ‚‚ โˆˆ axes(out,2), o โˆˆ Cโ‚’แตคโ‚œ
      s = zero(eltype(out))
      for kโ‚ โˆˆ Kโ‚, kโ‚‚ โˆˆ Kโ‚‚, i โˆˆ Cแตขโ‚™
        s += img[jโ‚ + kโ‚ - 1, jโ‚‚ + kโ‚‚ - 1, i, d] * kern[kโ‚, kโ‚‚, i, o]
      end
      out[jโ‚, jโ‚‚, o, d] = s
    end
  end
  out
end
img = rand(Float32, 260, 260, 3, 100);
kern = rand(Float32, 5, 5, 3, 6);
out1 = Array{Float32}(undef, size(img,1)+1-size(kern,1), size(img,2)+1-size(kern,2), size(kern,4), size(img,4));
out2 = similar(out1);
dcd = NNlib.DenseConvDims(img, kern, flipkernel = true);
@time NNlib.conv!(out1, img, kern, dcd);
@time convlayer!(out2, img, kern, dcd);
out1 ≈ out2
@benchmark NNlib.conv!($out1, $img, $kern, $dcd)
@benchmark convlayer!($out1, $img, $kern, $dcd)

LoopVectorization's algorithms scale very badly with the number of loops. Seven loops work, but take a minute to compile, so I'm using six above and @batching across the seventh. For a CPU-optimized NNlib, consistent threading (e.g., always across batches) would also have the advantage of improving locality, i.e. data will hopefully be able to stay in the same L2 cache from one layer to the next (assuming the data is small enough to fit, as may be the case when training on the CPU).

This is what I get on an 18 core system with AVX512:

julia> @time NNlib.conv!(out1, img, kern, dcd);
  0.802161 seconds (2.06 M allocations: 790.165 MiB, 20.41% gc time)

julia> @time convlayer!(out2, img, kern, dcd);
  9.553409 seconds (21.51 M allocations: 1.149 GiB, 1.06% gc time)

julia> out1 โ‰ˆ out2
true

julia> @benchmark NNlib.conv!($out1, $img, $kern, $dcd)
samples: 27; evals/sample: 1; memory estimate: 675.02 MiB; allocs estimate: 195
ns

 (1.8e8 - 1.9e8]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 24
 (1.9e8 - 2.1e8]  โ–ˆโ–Ž1
 (2.1e8 - 2.3e8]   0
 (2.3e8 - 2.5e8]   0
 (2.5e8 - 2.7e8]  โ–ˆโ–Ž1
 (2.7e8 - 2.9e8]  โ–ˆโ–Ž1

                  Counts

min: 175.652 ms (0.00% GC); mean: 191.870 ms (4.80% GC); median: 184.447 ms (0.44% GC); max: 288.455 ms (36.61% GC).

julia> @benchmark convlayer!($out1, $img, $kern, $dcd)
samples: 944; evals/sample: 1; memory estimate: 32 bytes; allocs estimate: 4
ns

 (5.233e6 - 5.256e6]  โ–ˆโ–ˆโ–31
 (5.256e6 - 5.278e6]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–193
 (5.278e6 - 5.301e6]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ412
 (5.301e6 - 5.324e6]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Š284
 (5.324e6 - 5.347e6]  โ–ˆโ–Œ20
 (5.347e6 - 5.37e6 ]   0
 (5.37e6  - 5.392e6]   0
 (5.392e6 - 5.415e6]  โ–1
 (5.415e6 - 5.438e6]   0
 (5.438e6 - 5.461e6]  โ–1
 (5.461e6 - 5.484e6]  โ–Ž2

                  Counts

min: 5.233 ms (0.00% GC); mean: 5.292 ms (0.00% GC); median: 5.294 ms (0.00% GC); max: 5.484 ms (0.00% GC).

julia> versioninfo()
Julia Version 1.7.0-DEV.1208
Commit 0a133ff22b* (2021-05-30 11:28 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.0 (ORCJIT, cascadelake)
Environment:
  JULIA_NUM_THREADS = 36

LoopVectorization was over 30x faster.

This was also a generic convolution; the Laplacian has more structure you could exploit.
If it's only a 1-D Laplacian, it'd probably be harder to do well. The more compute there is, the less it'll be bottlenecked by memory.

EDIT:
Just realized convlayer! is allocating...
Apparently it isn't making much use of threading -- only about 4x faster, so I guess it's already pretty constrained by memory:

julia> function convlayer_st!(
         out::AbstractArray{<:Any,4}, img, kern,
         dcd::DenseConvDims{2, <:Any, <:Any, <:Any, (1, 1), (0, 0, 0, 0), (1, 1), true}
         )
         (Kโ‚, Kโ‚‚, Cแตขโ‚™, Cโ‚’แตคโ‚œ) = kernaxes(dcd)
       for d โˆˆ axes(out,4)
           @turbo for jโ‚ โˆˆ axes(out,1), jโ‚‚ โˆˆ axes(out,2), o โˆˆ Cโ‚’แตคโ‚œ
             s = zero(eltype(out))
             for kโ‚ โˆˆ Kโ‚, kโ‚‚ โˆˆ Kโ‚‚, i โˆˆ Cแตขโ‚™
               s += img[jโ‚ + kโ‚ - 1, jโ‚‚ + kโ‚‚ - 1, i, d] * kern[kโ‚, kโ‚‚, i, o]
             end
             out[jโ‚, jโ‚‚, o, d] = s
           end
           out
         end
       end
convlayer_st! (generic function with 1 method)

julia> @benchmark convlayer_st!($out1, $img, $kern, $dcd)
samples: 180; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns

 (2.754e7 - 2.758e7]  โ–‹2
 (2.758e7 - 2.763e7]   0
 (2.763e7 - 2.768e7]   0
 (2.768e7 - 2.772e7]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–19
 (2.772e7 - 2.777e7]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–37
 (2.777e7 - 2.781e7]  โ–‹2
 (2.781e7 - 2.786e7]  โ–ˆโ–ˆโ–Œ9
 (2.786e7 - 2.79e7 ]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 108
 (2.79e7  - 2.795e7]  โ–‰3

                  Counts

min: 27.539 ms (0.00% GC); mean: 27.824 ms (0.00% GC); median: 27.865 ms (0.00% GC); max: 27.949 ms (0.00% GC).

Apparently Intel's next generation of server chips (Sapphire Rapids) will come with up to 64 GiB of HBM2e memory. Maybe it'll only be a select few extremely expensive models, but if there are affordable options, that'd be exciting.
AMD also announced 3d stacking for 192 MiB of L3 cache, which isn't quite enough for problems of this size:

julia> (sizeof(out1) + sizeof(img) + sizeof(kern)) / (1<<20) # bytes of data / (bytes/MiB)
227.36377716064453

The three arrays here are 227 MiB. Making the problem just a little smaller would then let them all fit.
For comparison, the 10980XE I ran these benchmarks on has only 24.75 MiB of L3.
These were 100 260x260 input images (and 100 256x256 outputs), so the real use cases here are hopefully smaller.

ChrisRackauckas commented on May 23, 2024

Look at the PDE example in https://tutorials.sciml.ai/html/introduction/03-optimizing_diffeq_code.html for an example of it.

  mul!(Ayu,Ay,u)
  mul!(uAx,u,Ax)
  mul!(Ayv,Ay,v)
  mul!(vAx,v,Ax)

is just a cheap way to do 2D convolutions with dense matmuls, but it would be good to transcribe that into conv!
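
For reference, a minimal sketch of that transcription (my own, assuming NNlib's DenseConvDims/conv with zero padding, which is not the Neumann handling baked into Ax/Ay, so only the interior matches):

using NNlib

u = rand(100, 100)
lap = Float64[0 1 0; 1 -4 1; 0 1 0]              # the 2-D Laplacian stencil
U    = reshape(u, size(u, 1), size(u, 2), 1, 1)  # WHCN layout NNlib expects
kern = reshape(lap, 3, 3, 1, 1)
dcd  = NNlib.DenseConvDims(U, kern; padding=1, flipkernel=true)  # kernel is symmetric, so flipping is moot
Lu   = NNlib.conv(U, kern, dcd)                  # interior of Lu[:, :, 1, 1] matches (Ay*u + u*Ax) there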

chriselrod commented on May 23, 2024

I played with it a bit, and haven't been able to get any major improvements over the explicit loop version:

using LinearAlgebra, OrdinaryDiffEq
# Generate the constants
p = (1.0,1.0,1.0,10.0,0.001,100.0) # a,α,ubar,β,D1,D2
N = 100
const Ax = Array(Tridiagonal([1.0 for i in 1:N-1],[-2.0 for i in 1:N],[1.0 for i in 1:N-1]));
const Ay = copy(Ax);
Ax[2,1] = 2.0;
Ax[end-1,end] = 2.0;
Ay[1,2] = 2.0;
Ay[end,end-1] = 2.0;

function basic_version!(dr,r,p,t)
  a,α,ubar,β,D1,D2 = p
  u = r[:,:,1]
  v = r[:,:,2]
  Du = D1*(Ay*u + u*Ax)
  Dv = D2*(Ay*v + v*Ax)
  dr[:,:,1] = Du .+ a.*u.*u./v .+ ubar .- α*u
  dr[:,:,2] = Dv .+ a.*u.*u .- β*v
end

a,α,ubar,β,D1,D2 = p;
uss = (ubar+β)/α;
vss = (a/β)*uss^2;
r0 = zeros(100,100,2);
r0[:,:,1] .= uss.+0.1.*rand.();
r0[:,:,2] .= vss;

prob = ODEProblem(basic_version!,r0,(0.0,0.1),p);
@benchmark solve($prob,Tsit5())
# samples: 123; evals/sample: 1; memory estimate: 186.80 MiB; allocs estimate: 4555
# ns

#  (3.0e7 - 5.0e7]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 121
#  (5.0e7 - 7.0e7]   0
#  (7.0e7 - 8.0e7]   0
#  (8.0e7 - 1.0e8]   0
#  (1.0e8 - 1.2e8]   0
#  (1.2e8 - 1.3e8]  โ–Ž1
#  (1.3e8 - 1.5e8]   0
#  (1.5e8 - 1.6e8]  โ–Ž1

#                   Counts

# min: 34.770 ms (7.03% GC); mean: 40.722 ms (9.64% GC); median: 39.011 ms (5.79% GC); max: 164.571 ms (77.65% GC).
  
function gm2!(dr,r,p,t)
  a,α,ubar,β,D1,D2 = p
  u = @view r[:,:,1]
  v = @view r[:,:,2]
  du = @view dr[:,:,1]
  dv = @view dr[:,:,2]
  Du = D1*(Ay*u + u*Ax)
  Dv = D2*(Ay*v + v*Ax)
  @. du = Du + a.*u.*u./v + ubar - α*u
  @. dv = Dv + a.*u.*u - β*v
end
prob = ODEProblem(gm2!,r0,(0.0,0.1),p);
@benchmark solve($prob,Tsit5())
# samples: 171; evals/sample: 1; memory estimate: 119.44 MiB; allocs estimate: 2790
# ns

#  (2.68e7 - 2.75e7]  โ–‹1
#  (2.75e7 - 2.82e7]  โ–ˆโ–ˆโ–ˆโ–6
#  (2.82e7 - 2.9e7 ]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–45
#  (2.9e7  - 2.97e7]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–58
#  (2.97e7 - 3.04e7]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Œ55
#  (3.04e7 - 3.11e7]  โ–ˆโ–ˆโ–‹5
#  (3.11e7 - 3.18e7]   0
#  (3.18e7 - 3.25e7]   0
#  (3.25e7 - 3.32e7]  โ–‹1

#                   Counts

# min: 26.824 ms (7.47% GC); mean: 29.393 ms (6.39% GC); median: 29.594 ms (7.36% GC); max: 33.241 ms (27.63% GC).


const Ayu = zeros(N,N);
const uAx = zeros(N,N);
const Du = zeros(N,N);
const Ayv = zeros(N,N);
const vAx = zeros(N,N);
const Dv = zeros(N,N);
function gm3!(dr,r,p,t)
  a,α,ubar,β,D1,D2 = p
  u = @view r[:,:,1]
  v = @view r[:,:,2]
  du = @view dr[:,:,1]
  dv = @view dr[:,:,2]
  mul!(Ayu,Ay,u)
  mul!(uAx,u,Ax)
  mul!(Ayv,Ay,v)
  mul!(vAx,v,Ax)
  @. Du = D1*(Ayu + uAx)
  @. Dv = D2*(Ayv + vAx)
  @. du = Du + a*u*u./v + ubar - α*u
  @. dv = Dv + a*u*u - β*v
end
prob_gm3 = ODEProblem(gm3!,r0,(0.0,0.1),p);
@benchmark solve($prob_gm3,Tsit5())
# samples: 181; evals/sample: 1; memory estimate: 29.63 MiB; allocs estimate: 438
# ns

#  (2.1e7 - 2.2e7]  โ–Œ1
#  (2.2e7 - 2.3e7]  โ–Œ1
#  (2.3e7 - 2.5e7]   0
#  (2.5e7 - 2.6e7]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‹14
#  (2.6e7 - 2.7e7]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Ž36
#  (2.7e7 - 2.8e7]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 76
#  (2.8e7 - 3.0e7]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Œ39
#  (3.0e7 - 3.1e7]  โ–ˆโ–ˆโ–ˆโ–ˆ10
#  (3.1e7 - 3.2e7]  โ–ˆโ–‹4

#                   Counts

# min: 20.760 ms (0.00% GC); mean: 27.682 ms (2.07% GC); median: 27.566 ms (0.00% GC); max: 32.108 ms (0.00% GC).

using LoopVectorization
function matmul!(C::AbstractMatrix, A::AbstractMatrix, B::AbstractMatrix)
  @tturbo for n ∈ indices((C,B),2), m ∈ indices((C,A),1)
    Cₘₙ = zero(eltype(C))
    for k ∈ indices((A,B),(2,1))
      Cₘₙ += A[m,k]*B[k,n]
    end
    C[m,n] = Cₘₙ
  end
end
function matmuladd!(C::AbstractMatrix, A::AbstractMatrix, B::AbstractMatrix)
  @tturbo for n ∈ indices((C,B),2), m ∈ indices((C,A),1)
    Cₘₙ = zero(eltype(C))
    for k ∈ indices((A,B),(2,1))
      Cₘₙ += A[m,k]*B[k,n]
    end
    C[m,n] += Cₘₙ
  end
end

function gm5!(dr,r,p,t)
  a,α,ubar,β,D1,D2 = p
  u = @view r[:,:,1];
  v = @view r[:,:,2];
  du = @view dr[:,:,1];
  dv = @view dr[:,:,2];
  matmul!(Ayu,Ay,u)
  matmuladd!(Ayu,u,Ax)
  # @turbo @. du = D1*Ayu + a*u*u./v + ubar - α*u
  @turbo for i ∈ eachindex(u)
    du[i] = D1 * Ayu[i] + a * u[i]*u[i] / v[i] + ubar - α*u[i]
  end
  matmul!(Ayu,Ay,v)
  matmuladd!(Ayu,v,Ax)
  # @turbo @. dv = D2*Ayu + a*u*u - β*v
  @turbo for i ∈ eachindex(u)
    dv[i] = D2 * Ayu[i] + a * u[i]*u[i] - β*v[i]
  end
end
prob_gm5 = ODEProblem(gm5!,r0,(0.0,0.1),p);
@benchmark solve($prob_gm5,Tsit5())
# samples: 306; evals/sample: 1; memory estimate: 29.62 MiB; allocs estimate: 433
# ns

#  (1.1e7 - 1.3e7]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 211
#  (1.3e7 - 1.5e7]  โ–ˆโ–ˆโ–Ž15
#  (1.5e7 - 1.7e7]  โ–2
#  (1.7e7 - 1.9e7]  โ–2
#  (1.9e7 - 2.1e7]   0
#  (2.1e7 - 2.3e7]   0
#  (2.3e7 - 2.5e7]  โ–‰6
#  (2.5e7 - 2.8e7]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Š54
#  (2.8e7 - 3.0e7]  โ–ˆ7
#  (3.0e7 - 3.2e7]  โ–ˆโ–9

#                   Counts

# min: 10.999 ms (0.00% GC); mean: 16.355 ms (3.53% GC); median: 12.869 ms (0.00% GC); max: 31.714 ms (6.63% GC).

function matmul!(C::AbstractArray{<:Any,3}, A::AbstractMatrix, B::AbstractArray{<:Any,3})
  @tturbo for n ∈ indices((C,B),2), m ∈ indices((C,A),1), d ∈ 1:2
    Cₘₙ = zero(eltype(C))
    for k ∈ indices((A,B),(2,1))
      Cₘₙ += A[m,k]*B[k,n,d]
    end
    C[m,n,d] = Cₘₙ
  end
end
function matmuladd!(C::AbstractArray{<:Any,3}, A::AbstractArray{<:Any,3}, B::AbstractMatrix)
  @tturbo for n ∈ indices((C,B),2), m ∈ indices((C,A),1), d ∈ 1:2
    Cₘₙ = zero(eltype(C))
    for k ∈ indices((A,B),(2,1))
      Cₘₙ += A[m,k,d]*B[k,n]
    end
    C[m,n,d] += Cₘₙ
  end
end
const Ayuv = similar(Ayu, (size(Ayu,1),size(Ayu,2),2));
function gm6!(dr,r,p,t)
  a,α,ubar,β,D1,D2 = p
  matmul!(Ayuv, Ay, r)
  matmuladd!(Ayuv, r, Ax)
  N² = size(r,1)*size(r,2)
  u = reshape(r, (N²,2))
  du = reshape(dr, (N²,2))
  Ayuvr = reshape(Ayuv, (N²,2))
  @turbo for i ∈ 1:N²
    du[i,1] = D1 * Ayuvr[i,1] + a * u[i,1]*u[i,1] / u[i,2] + ubar - α*u[i,1]
    du[i,2] = D2 * Ayuvr[i,2] + a * u[i,1]*u[i,1] - β*u[i,2]
  end
end
prob_gm6 = ODEProblem(gm6!,r0,(0.0,0.1),p);
@benchmark solve($prob_gm6, Tsit5())
# samples: 310; evals/sample: 1; memory estimate: 29.67 MiB; allocs estimate: 1315
# ns

#  (1.1e7 - 1.3e7]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Œ37
#  (1.3e7 - 1.4e7]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ205
#  (1.4e7 - 1.6e7]   0
#  (1.6e7 - 1.8e7]   0
#  (1.8e7 - 2.0e7]   0
#  (2.0e7 - 2.1e7]   0
#  (2.1e7 - 2.3e7]   0
#  (2.3e7 - 2.5e7]   0
#  (2.5e7 - 2.6e7]  โ–Ž1
#  (2.6e7 - 2.8e7]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‰67

#                   Counts

# min: 11.091 ms (0.00% GC); mean: 16.126 ms (3.49% GC); median: 13.182 ms (0.00% GC); max: 27.934 ms (9.08% GC).
  
pdevec = (1.0,1.0,1.0,10.0,0.001,100.0,N);
function fast_gm!(du,u,p,t)
  a,α,ubar,β,D1,D2,N = p

  @inbounds for j in 2:N-1, i in 2:N-1
    du[i,j,1] = D1*(u[i-1,j,1] + u[i+1,j,1] + u[i,j+1,1] + u[i,j-1,1] - 4u[i,j,1]) +
              a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
  end

  @inbounds for j in 2:N-1, i in 2:N-1
    du[i,j,2] = D2*(u[i-1,j,2] + u[i+1,j,2] + u[i,j+1,2] + u[i,j-1,2] - 4u[i,j,2]) +
            a*u[i,j,1]^2 - β*u[i,j,2]
  end

  @inbounds for j in 2:N-1
    i = 1
    du[1,j,1] = D1*(2u[i+1,j,1] + u[i,j+1,1] + u[i,j-1,1] - 4u[i,j,1]) +
            a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
  end
  @inbounds for j in 2:N-1
    i = 1
    du[1,j,2] = D2*(2u[i+1,j,2] + u[i,j+1,2] + u[i,j-1,2] - 4u[i,j,2]) +
            a*u[i,j,1]^2 - β*u[i,j,2]
  end
  @inbounds for j in 2:N-1
    i = N
    du[end,j,1] = D1*(2u[i-1,j,1] + u[i,j+1,1] + u[i,j-1,1] - 4u[i,j,1]) +
           a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
  end
  @inbounds for j in 2:N-1
    i = N
    du[end,j,2] = D2*(2u[i-1,j,2] + u[i,j+1,2] + u[i,j-1,2] - 4u[i,j,2]) +
           a*u[i,j,1]^2 - β*u[i,j,2]
  end

  @inbounds for i in 2:N-1
    j = 1
    du[i,1,1] = D1*(u[i-1,j,1] + u[i+1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
              a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
  end
  @inbounds for i in 2:N-1
    j = 1
    du[i,1,2] = D2*(u[i-1,j,2] + u[i+1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
              a*u[i,j,1]^2 - β*u[i,j,2]
  end
  @inbounds for i in 2:N-1
    j = N
    du[i,end,1] = D1*(u[i-1,j,1] + u[i+1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
             a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
  end
  @inbounds for i in 2:N-1
    j = N
    du[i,end,2] = D2*(u[i-1,j,2] + u[i+1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
             a*u[i,j,1]^2 - β*u[i,j,2]
  end

  @inbounds begin
    i = 1; j = 1
    du[1,1,1] = D1*(2u[i+1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
              a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    du[1,1,2] = D2*(2u[i+1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
              a*u[i,j,1]^2 - β*u[i,j,2]

    i = 1; j = N
    du[1,N,1] = D1*(2u[i+1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
             a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    du[1,N,2] = D2*(2u[i+1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
             a*u[i,j,1]^2 - β*u[i,j,2]

    i = N; j = 1
    du[N,1,1] = D1*(2u[i-1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
             a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    du[N,1,2] = D2*(2u[i-1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
             a*u[i,j,1]^2 - β*u[i,j,2]

    i = N; j = N
    du[end,end,1] = D1*(2u[i-1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
             a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    du[end,end,2] = D2*(2u[i-1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
             a*u[i,j,1]^2 - β*u[i,j,2]
   end
end
prob_devec = ODEProblem(fast_gm!,r0,(0.0,0.1),pdevec);
@benchmark solve($prob_devec,Tsit5())
# samples: 600; evals/sample: 1; memory estimate: 29.63 MiB; allocs estimate: 438
# ns

#  (7.2e6   - 7.74e6 ]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–337
#  (7.74e6  - 8.27e6 ]  โ–ˆโ–ˆโ–‹29
#  (8.27e6  - 8.81e6 ]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–68
#  (8.81e6  - 9.34e6 ]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‹74
#  (9.34e6  - 9.88e6 ]  โ–ˆโ–ˆโ–ˆ33
#  (9.88e6  - 1.041e7]  โ–4
#  (1.041e7 - 1.095e7]   0
#  (1.095e7 - 1.149e7]  โ–ˆโ–ˆ22
#  (1.149e7 - 1.202e7]  โ–‹7
#  (1.202e7 - 1.256e7]  โ–ˆโ–ˆโ–Ž24
#  (1.256e7 - 1.309e7]  โ–Ž2

#                   Counts

# min: 7.200 ms (0.00% GC); mean: 8.335 ms (7.27% GC); median: 7.482 ms (0.00% GC); max: 13.092 ms (43.21% GC).

  
function turbo_gm!(du,u,p,t)
  a,α,ubar,β,D1,D2,N = p

  @turbo for j in 2:N-1, i in 2:N-1
    du[i,j,1] = D1*(u[i-1,j,1] + u[i+1,j,1] + u[i,j+1,1] + u[i,j-1,1] - 4u[i,j,1]) +
              a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
  end

  @turbo for j in 2:N-1, i in 2:N-1
    du[i,j,2] = D2*(u[i-1,j,2] + u[i+1,j,2] + u[i,j+1,2] + u[i,j-1,2] - 4u[i,j,2]) +
            a*u[i,j,1]^2 - β*u[i,j,2]
  end

  @inbounds for j in 2:N-1
    i=1
    du[1,j,1] = D1*(2u[2,j,1] + u[1,j+1,1] + u[1,j-1,1] - 4u[1,j,1]) +
            a*u[1,j,1]^2/u[1,j,2] + ubar - α*u[1,j,1]
  end
  @inbounds for j in 2:N-1
    i = 1
    du[1,j,2] = D2*(2u[i+1,j,2] + u[i,j+1,2] + u[i,j-1,2] - 4u[i,j,2]) +
            a*u[i,j,1]^2 - β*u[i,j,2]
  end
  @inbounds for j in 2:N-1
    i = N
    du[end,j,1] = D1*(2u[i-1,j,1] + u[i,j+1,1] + u[i,j-1,1] - 4u[i,j,1]) +
           a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
  end
  @inbounds for j in 2:N-1
    i = N
    du[end,j,2] = D2*(2u[i-1,j,2] + u[i,j+1,2] + u[i,j-1,2] - 4u[i,j,2]) +
           a*u[i,j,1]^2 - β*u[i,j,2]
  end

  @inbounds for i in 2:N-1
    j = 1
    du[i,1,1] = D1*(u[i-1,j,1] + u[i+1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
              a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
  end
  @inbounds for i in 2:N-1
    j = 1
    du[i,1,2] = D2*(u[i-1,j,2] + u[i+1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
              a*u[i,j,1]^2 - β*u[i,j,2]
  end
  @inbounds for i in 2:N-1
    j = N
    du[i,end,1] = D1*(u[i-1,j,1] + u[i+1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
             a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
  end
  @inbounds for i in 2:N-1
    j = N
    du[i,end,2] = D2*(u[i-1,j,2] + u[i+1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
             a*u[i,j,1]^2 - β*u[i,j,2]
  end

  @inbounds begin
    i = 1; j = 1
    du[1,1,1] = D1*(2u[i+1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
              a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    du[1,1,2] = D2*(2u[i+1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
              a*u[i,j,1]^2 - β*u[i,j,2]

    i = 1; j = N
    du[1,N,1] = D1*(2u[i+1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
             a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    du[1,N,2] = D2*(2u[i+1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
             a*u[i,j,1]^2 - β*u[i,j,2]

    i = N; j = 1
    du[N,1,1] = D1*(2u[i-1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
             a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    du[N,1,2] = D2*(2u[i-1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
             a*u[i,j,1]^2 - β*u[i,j,2]

    i = N; j = N
    du[end,end,1] = D1*(2u[i-1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
             a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    du[end,end,2] = D2*(2u[i-1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
             a*u[i,j,1]^2 - β*u[i,j,2]
   end
end
prob_turbo_devec = ODEProblem(turbo_gm!,r0,(0.0,0.1),pdevec);
@benchmark solve($prob_turbo_devec,Tsit5())
# samples: 600; evals/sample: 1; memory estimate: 29.63 MiB; allocs estimate: 438
# ns

#  (7.11e6  - 7.78e6 ]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 324
#  (7.78e6  - 8.45e6 ]  โ–4
#  (8.45e6  - 9.12e6 ]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Ž88
#  (9.12e6  - 9.79e6 ]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‰95
#  (9.79e6  - 1.046e7]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‹71
#  (1.046e7 - 1.113e7]  โ–Œ5
#  (1.113e7 - 1.18e7 ]  โ–3
#  (1.18e7  - 1.247e7]  โ–3
#  (1.247e7 - 1.314e7]  โ–Ž2
#  (1.314e7 - 1.381e7]  โ–Ž2
#  (1.381e7 - 1.448e7]  โ–3

#                   Counts

# min: 7.107 ms (0.00% GC); mean: 8.331 ms (7.72% GC); median: 7.374 ms (0.00% GC); max: 14.477 ms (0.00% GC).
function turbo_gm_fuse!(du,u,p,t)
  a,α,ubar,β,D1,D2,N = p

  @turbo for j in 2:N-1, i in 2:N-1
    du[i,j,1] = D1*(u[i-1,j,1] + u[i+1,j,1] + u[i,j+1,1] + u[i,j-1,1] - 4u[i,j,1]) +
              a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    du[i,j,2] = D2*(u[i-1,j,2] + u[i+1,j,2] + u[i,j+1,2] + u[i,j-1,2] - 4u[i,j,2]) +
            a*u[i,j,1]^2 - β*u[i,j,2]
  end

  @turbo for j in 2:N-1
    du[1,j,1] = D1*(2u[2,j,1] + u[1,j+1,1] + u[1,j-1,1] - 4u[1,j,1]) +
            a*u[1,j,1]^2/u[1,j,2] + ubar - α*u[1,j,1]
  # end
  # @turbo for j in 2:N-1
    du[1,j,2] = D2*(2u[2,j,2] + u[1,j+1,2] + u[1,j-1,2] - 4u[1,j,2]) +
            a*u[1,j,1]^2 - β*u[1,j,2]
  # end
  # @turbo for j in 2:N-1
    du[N,j,1] = D1*(2u[N-1,j,1] + u[N,j+1,1] + u[N,j-1,1] - 4u[N,j,1]) +
           a*u[N,j,1]^2/u[N,j,2] + ubar - α*u[N,j,1]
  # end
  # @turbo for j in 2:N-1
    du[N,j,2] = D2*(2u[N-1,j,2] + u[N,j+1,2] + u[N,j-1,2] - 4u[N,j,2]) +
      a*u[N,j,1]^2 - β*u[N,j,2]
  end
  @turbo for j in 2:N-1
    du[j,1,1] = D1*(u[j-1,1,1] + u[j+1,1,1] + 2u[j,2,1] - 4u[j,1,1]) +
              a*u[j,1,1]^2/u[j,1,2] + ubar - α*u[j,1,1]
  # end
  # @turbo for j in 2:N-1
    du[j,1,2] = D2*(u[j-1,1,2] + u[j+1,1,2] + 2u[j,2,2] - 4u[j,1,2]) +
              a*u[j,1,1]^2 - β*u[j,1,2]
  # end
  # @turbo for j in 2:N-1
    du[j,N,1] = D1*(u[j-1,N,1] + u[j+1,N,1] + 2u[j,N-1,1] - 4u[j,N,1]) +
             a*u[j,N,1]^2/u[j,N,2] + ubar - α*u[j,N,1]
  # end
  # @turbo for j in 2:N-1
    du[j,N,2] = D2*(u[j-1,N,2] + u[j+1,N,2] + 2u[j,N-1,2] - 4u[j,N,2]) +
             a*u[j,N,1]^2 - β*u[j,N,2]
  end

  @inbounds begin
    i = 1; j = 1
    du[1,1,1] = D1*(2u[i+1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
              a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    du[1,1,2] = D2*(2u[i+1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
              a*u[i,j,1]^2 - β*u[i,j,2]

    i = 1; j = N
    du[1,N,1] = D1*(2u[i+1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
             a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    du[1,N,2] = D2*(2u[i+1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
             a*u[i,j,1]^2 - β*u[i,j,2]

    i = N; j = 1
    du[N,1,1] = D1*(2u[i-1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
             a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    du[N,1,2] = D2*(2u[i-1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
             a*u[i,j,1]^2 - β*u[i,j,2]

    i = N; j = N
    du[N,N,1] = D1*(2u[i-1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
             a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    du[N,N,2] = D2*(2u[i-1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
             a*u[i,j,1]^2 - β*u[i,j,2]
   end
end
prob_turbo_fuse = ODEProblem(turbo_gm_fuse!,r0,(0.0,0.1),pdevec);
@benchmark solve($prob_turbo_fuse,Tsit5())
# samples: 629; evals/sample: 1; memory estimate: 29.62 MiB; allocs estimate: 433
# ns

#  (6.8e6  - 7.9e6 ]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ427
#  (7.9e6  - 8.9e6 ]  โ–ˆโ–ˆโ–‹36
#  (8.9e6  - 1.0e7 ]  โ–ˆโ–ˆโ–ˆโ–ˆโ–Š66
#  (1.0e7  - 1.11e7]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‰97
#  (1.11e7 - 1.22e7]   0
#  (1.22e7 - 1.33e7]   0
#  (1.33e7 - 1.44e7]   0
#  (1.44e7 - 1.54e7]  โ–1
#  (1.54e7 - 1.65e7]  โ–1
#  (1.65e7 - 1.76e7]   0
#  (1.76e7 - 2.12e7]  โ–1

#                   Counts

# min: 6.786 ms (0.00% GC); mean: 7.938 ms (7.63% GC); median: 7.014 ms (0.00% GC); max: 21.180 ms (66.47% GC).

chriselrod commented on May 23, 2024

Those loops are easy enough for LLVM to SIMD, and there's not too much that can be done to make them faster.
I.e., there isn't much reuse you can get through blocking. You're just streaming through memory.

It is unrolling the j loop to try and save a bit on the j loads (i.e., the u[i,j+1] loaded on one j iteration is u[i,j] on the next and u[i,j-1] on the one after), but that doesn't seem to help much.

ChrisRackauckas commented on May 23, 2024

Yeah, I think you need to not have the nonlinear portion of the PDE (+ a*u[i,j,1]^2 - β*u[i,j,2]) for a stencil compiler to work well. This is what I was mentioning in https://discourse.julialang.org/t/how-to-tackle-2d-reaction-diffusion-equation/54110/15?u=chrisrackauckas . The authors of that seemed to get mad at me, but it seems to be the simple fact that once you have a term that is outside the original range, u[i,j,2], or a term which is sufficiently difficult to compute so that it's not too memory bound, u[i,j,1]^2, then fusion is simply better than having a good stencil compiler. That LoopVectorization demo above is precisely what I needed to see to fully confirm that the fused version indeed outperforms even a good stencil compiler in that case, and that @turbo has an effect, but not a massive one.

It would still be good to investigate multithreading between those when the space is really large, but that's a separate thing and only matters for the huge PDE case. Essentially, those benchmarks are my argument for why we shouldn't be generating A*u operators, and should instead be generating (and fusing) map operations.

chriselrod commented on May 23, 2024

Making them dense:

Ax = Array(Tridiagonal([1.0 for i in 1:N-1],[-2.0 for i in 1:N],[1.0 for i in 1:N-1]))

adds a huge amount of computation, making it go from O(N^2) to O(N^3).
Seems like an obviously bad idea.
We could implement specialized Tridiagonal multiplication routines, e.g. like
JuliaSIMD/LoopVectorization.jl#266 (comment)
but at that point we may as well use the fusing map operations.
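
For reference, something along these lines (my own sketch, not the routine from the linked comment; it applies the 1, -2, 1 stencil of Ay directly and leaves out the modified Neumann boundary rows):

using LoopVectorization

# C = Ay*u for Ay = Tridiagonal(1, -2, 1): an O(N^2) pass instead of an O(N^3) dense matmul.
function trimul!(C::AbstractMatrix, u::AbstractMatrix)
    N = size(u, 1)
    @turbo for j in axes(u, 2), i in 2:N-1
        C[i, j] = u[i-1, j] - 2u[i, j] + u[i+1, j]
    end
    @inbounds for j in axes(u, 2)   # plain tridiagonal end rows; the BC-modified rows would differ
        C[1, j] = -2u[1, j] + u[2, j]
        C[N, j] = u[N-1, j] - 2u[N, j]
    end
    C
end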

For the mapped version, 100x100 isn't large enough to benefit from multithreading yet, but 1000x1000 would be.

ChrisRackauckas commented on May 23, 2024

The first part is the same as an image filter.

chriselrod commented on May 23, 2024

or a term which is sufficiently difficult to compute so that it's not too memory bound, u[i,j,1]^2

Not sure if I follow here, but u[i,j,1] is easy to compute.
Fusion is great for memory bound applications though. Passing over the data once is better than passing over it twice.
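
A trivial illustration of that point, with hypothetical arrays:

a = rand(10^6); b = similar(a); c = similar(a)
# Two passes: b is written out to memory and then read back in.
b .= 2 .* a
c .= b .+ 1
# Fused: one pass over a and c, no traffic through an intermediate.
c .= 2 .* a .+ 1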

We shouldn't be doing dense Ax::Matrix operations.
I assume ParallelStencil wouldn't be?

If we let Ax be tridiagonal, it'd do much better.
Still probably not as well as the loops, but better.

ChrisRackauckas commented on May 23, 2024

Not sure if I follow here, but u[i,j,1] is easy to compute.

I mean, when you're looping over u[i,j,2] and need to grab u[i,j,1], you very quickly expand the cache lines like that. And this is still a very small example with very few nonlinearities!

We shouldn't be doing dense Ax::Matrix operations. I assume ParallelStencil wouldn't be?

It wouldn't, I thought you were using the same setup as the image filter you had? A 2D Laplacian is just a 2D filter.

chriselrod commented on May 23, 2024

Ha, let me do that. Above it was just matmul!.

Note that the image filtering example doesn't do boundaries...

I mean, when you're looping over u[i,j,2] and need to grab u[i,j,1], you very quickly expand the cache lines like that.

Reuse tends to be fairly immediate (e.g. u[i,j,1], u[i+1,j,1]), and after that there aren't any more opportunities for reuse, so you don't mind those loads getting evicted.
There is some concern over u[i,j,1] vs. u[i,j+1,1], though. It'd probably take some experimenting to see whether a larger cache tiling helps more than the j unrolling does.

chriselrod commented on May 23, 2024

@ChrisRackauckas here is the example I think you were looking for:

using StaticArrays, LoopVectorization, Static
const LAPLACE_KERNEL = @MMatrix([0.0 1.0 0.0; 1.0 -4.0 1.0; 0.0 1.0 0.0]);
function filter2davx!(out::AbstractMatrix, A::AbstractMatrix, kern::MMatrix{3,3})
  N = size(out,1)
  @inbounds @fastmath let j_2 = j_1 = 1
    tmp  = A[1+j_1, 1+j_2] * kern[-1+2, -1+2]
    tmp += A[1+j_1, 0+j_2] * kern[-1+2, 0+2]
    tmp += A[1+j_1, 1+j_2] * kern[-1+2, 1+2]
    tmp += A[0+j_1, 1+j_2] * kern[0+2, -1+2]
    tmp += A[1+j_1, 1+j_2] * kern[1+2, -1+2]
    for i_1 ∈ 0:1, i_2 ∈ 0:1
      tmp += A[i_1+j_1, i_2+j_2] * kern[i_1+2, i_2+2]
    end
    out[j_1,j_2] = tmp
  end
  @inbounds @fastmath let j_1 = 1; j_2 = N
    tmp  = A[1+j_1, -1+j_2] * kern[-1+2, -1+2]
    tmp += A[1+j_1, 0+j_2] * kern[-1+2, 0+2]
    tmp += A[1+j_1, -1+j_2] * kern[-1+2, 1+2]
    tmp += A[0+j_1, -1+j_2] * kern[0+2, 1+2]
    tmp += A[1+j_1, -1+j_2] * kern[1+2, 1+2]
    for i_1 ∈ 0:1, i_2 ∈ -1:0
      tmp += A[i_1+j_1, i_2+j_2] * kern[i_1+2, i_2+2]
    end
    out[j_1,j_2] = tmp
  end
  @inbounds @fastmath let j_2 = 1; j_1 = N
    tmp  = A[-1+j_1, 1+j_2] * kern[1+2, -1+2]
    tmp += A[-1+j_1, 0+j_2] * kern[1+2, 0+2]
    tmp += A[-1+j_1, 1+j_2] * kern[-1+2, 1+2]
    tmp += A[0+j_1, 1+j_2] * kern[0+2, -1+2]
    tmp += A[-1+j_1, 1+j_2] * kern[1+2, -1+2]
    for i_1 ∈ -1:0, i_2 ∈ 0:1
      tmp += A[i_1+j_1, i_2+j_2] * kern[i_1+2, i_2+2]
    end
    out[j_1,j_2] = tmp
  end  
  @turbo for j ∈ 2:N-1
  # for j ∈ 2:N-1
    tmp_1 = zero(eltype(out))
    tmp_2 = zero(eltype(out))
    for i_1 ∈ -1:1
      tmp_1 += A[i_1+j, 2] * kern[i_1+2, 3]
      tmp_2 += A[2, i_1+j] * kern[3, i_1+2]
      for i_2 ∈ 0:1
        tmp_1 += A[i_1+j, 1+i_2] * kern[i_1+2, i_2+2]
        tmp_2 += A[1+i_2, i_1+j] * kern[i_2+2, i_1+2]
      end
    end
    out[j,1] = tmp_1
    out[1,j] = tmp_2
  end
  @turbo for j_2 ∈ 2:N-1, j_1 ∈ 2:N-1
  # for j_2 ∈ 2:N-1, j_1 ∈ 2:N-1
    tmp = zero(eltype(out))
    for i_1 ∈ -1:1, i_2 ∈ -1:1
      tmp += A[i_1+j_1, i_2+j_2] * kern[i_1+2, i_2+2]
    end
    out[j_1,j_2] = tmp
  end
  @turbo for j ∈ 2:N-1
  # for j ∈ 2:N-1
    tmp_1 = zero(eltype(out))
    tmp_2 = zero(eltype(out))
    for i_1 ∈ -1:1
      tmp_1 += A[i_1+j, N-1] * kern[i_1+2, 3]
      tmp_2 += A[N-1, i_1+j] * kern[3, i_1+2]
      for i_2 ∈ -1:0
        tmp_1 += A[i_1+j, N+i_2] * kern[i_1+2, i_2+2]
        tmp_2 += A[N+i_2, i_1+j] * kern[i_2+2, i_1+2]
      end
    end
    out[j,N] = tmp_1
    out[N,j] = tmp_2
  end
  @inbounds @fastmath let j_2 = j_1 = N
    tmp  = A[-1+j_1, -1+j_2] * kern[1+2, -1+2]
    tmp += A[-1+j_1, 0+j_2] * kern[1+2, 0+2]
    tmp += A[-1+j_1, -1+j_2] * kern[1+2, 1+2]
    tmp += A[-1+j_1, -1+j_2] * kern[-1+2, 1+2]
    tmp += A[0+j_1, -1+j_2] * kern[0+2, 1+2]
    for i_1 ∈ -1:0, i_2 ∈ -1:0
      tmp += A[i_1+j_1, i_2+j_2] * kern[i_1+2, i_2+2]
    end
    out[j_1,j_2] = tmp
  end
  nothing
end
function gm7!(dr,r,p,t)
  a,α,ubar,β,D1,D2 = p
  u = @view r[:,:,1];
  v = @view r[:,:,2];
  du = @view dr[:,:,1];
  dv = @view dr[:,:,2];
  filter2davx!(Ayu, u, LAPLACE_KERNEL)
  # @turbo @. du = D1*Ayu + a*u*u./v + ubar - α*u
  @turbo for i ∈ eachindex(u)
    du[i] = D1 * Ayu[i] + a * u[i]*u[i] / v[i] + ubar - α*u[i]
  end
  filter2davx!(Ayu, v, LAPLACE_KERNEL)
  # @turbo @. dv = D2*Ayu + a*u*u - β*v
  @turbo for i ∈ eachindex(u)
    dv[i] = D2 * Ayu[i] + a * u[i]*u[i] - β*v[i]
  end
end
prob_gm7 = ODEProblem(gm7!,r0,(0.0,0.1),p);
@benchmark solve($prob_gm7,Tsit5())
samples: 546; evals/sample: 1; memory estimate: 29.62 MiB; allocs estimate: 433
ns

 (7.96e6  - 8.63e6 ]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ311
 (8.63e6  - 9.29e6 ]  โ–ˆโ–ˆโ–‹26
 (9.29e6  - 9.95e6 ]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Š101
 (9.95e6  - 1.062e7]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ51
 (1.062e7 - 1.128e7]  โ–3
 (1.128e7 - 1.194e7]  โ–ˆโ–ˆโ–Œ25
 (1.194e7 - 1.26e7 ]  โ–1
 (1.26e7  - 1.327e7]  โ–1
 (1.327e7 - 1.393e7]   0
 (1.393e7 - 1.459e7]  โ–ˆโ–13
 (1.459e7 - 1.525e7]  โ–ˆโ–14

                  Counts

min: 7.965 ms (0.00% GC); mean: 9.162 ms (6.71% GC); median: 8.165 ms (0.00% GC); max: 15.255 ms (10.53% GC).

If we wanted to optimize this a bit more, we could

  1. take advantage of the sparsity in the laplacian kernel
  2. make the values from the kernel constant
  3. Fuse operations to pass over data less often.

Oops, we now have fast_gm!/turbo_gm_fuse!.

ChrisRackauckas commented on May 23, 2024

You can just hardcode that it's the stencil

0 1 0
1 -4 1
0 1 0
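
i.e., for the interior, something like the following sketch, which assumes the same out/A naming as filter2davx! above and hardcodes the 0/1/-4 entries so the kernel's sparsity and constants cost nothing:

using LoopVectorization

function laplace_interior!(out, A)
    N = size(A, 1)
    @turbo for j in 2:N-1, i in 2:N-1
        out[i, j] = A[i-1, j] + A[i+1, j] + A[i, j-1] + A[i, j+1] - 4A[i, j]
    end
    out
end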

chriselrod commented on May 23, 2024

Sure, but at that point why not go the rest of the way to turbo_gm_fuse!?

ChrisRackauckas commented on May 23, 2024

Yeah, go all of the way.

chriselrod commented on May 23, 2024

The one thing fusing inhibits is easily threading across the third axis.
I tried that, and it didn't help. 8.2ms solve time.

I should add a version of @spawn to Polyester so we can parallelize different functions.
Maybe it would help if the third axis were huge (and you're still doing something different), but that seems contrived.

Ideally, we'd have a loop compiler that does the right thing with whatever loops we give it. There isn't really anything that special about stencils.

EDIT: To give ParallelStencil.jl/stencil compilers credit (aside from the fact that I didn't benchmark it at all so there is no "discredit" either), working on a smaller problem domain (just stencils) makes it easier to scale out, e.g. to all the GPUs.

ChrisRackauckas commented on May 23, 2024

Yeah, they do have a lot of uses. http://supertech.csail.mit.edu/papers/pochoir_spaa11.pdf is a good read if you haven't seen it.

chriselrod commented on May 23, 2024

Thanks, that's interesting.
So the key to a stencil compiler is how it optimizes cache locality across time.

That is, it cuts up the iteration space/dependencies in a way so that multiple time steps -- and hence lots of reuse -- can be computed on sub (trapezoidal) regions. These regions can also be computed in parallel.

If you want to compute only a single time step update, you need to iterate over the entire arrays. Then you may as well do what you can to minimize the amount of iteration, e.g. fuse it with other operations.

However, if you want to compute, say, 1000 time steps, you don't need to pass over the arrays 1000 times. It takes many time steps for changes at index u[i,j] to fan out and influence iterations at index u[i+100, j-100].
So starting at t=0, you can take a chunk of the array u that fits nicely in cache and perform a large number of time updates on it. This chunk shrinks over time, because you lose some along the boundaries, hence the trapezoidal shape.

Eventually, you'd stop and instead work on another region. For example, one now available because the region(s) it depends on have been finished.

Stencil compilers figure out clever ways to carve the space up in this way, running those separate chunks on different cores/GPUs/etc., and also scheduling the leftovers to fill in the missing pieces.
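
To make the reuse concrete, here is a crude sketch of the "ghost zone" flavor of that idea for a 1-D three-point stencil (my own illustration; it does redundant halo work instead of the trapezoidal cuts a real stencil compiler would use):

function step!(v, u)                       # one explicit diffusion step, endpoints held fixed
    @inbounds for i in 2:length(u)-1
        v[i] = u[i] + 0.1 * (u[i-1] - 2u[i] + u[i+1])
    end
    v[1] = u[1]; v[end] = u[end]
    v
end

# Advance T steps in blocks of width B. Each block copies a halo of width T, so wrong values
# at the artificial block edges cannot reach the block interior within T steps, and the big
# arrays are streamed through once per T steps instead of once per step.
function blocked_steps!(unew, u, T, B)
    N = length(u)
    for lo in 1:B:N
        hi = min(lo + B - 1, N)
        elo, ehi = max(lo - T, 1), min(hi + T, N)
        a = u[elo:ehi]; b = similar(a)
        for _ in 1:T
            step!(b, a); a, b = b, a       # a always holds the latest time level
        end
        unew[lo:hi] .= @view a[(lo - elo + 1):(hi - elo + 1)]
    end
    unew
end

u = rand(10_000); unew = similar(u)
blocked_steps!(unew, u, 64, 1024)          # matches 64 whole-array applications of step!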

Polyhedral optimizers can also do this sort of analysis.
Hmm.

Anyway, regarding the computations here where we're doing a single time step at a time, we can't benefit from these optimizations.
If we were able to do many iterations, then ideally we'd be able to fuse operations like a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1] with it. That shouldn't cause any scheduling problems (e.g. for a polyhedral optimizer).

YingboMa commented on May 23, 2024

The current MTK code example of scalarized code is

using ModelingToolkit
l = 1
n = 5
h = l / n
@variables t u[1:n](t)
D = Differential(t)
# periodic boundary condition version 1
eqs = [
       h^2 * D(u[1]) ~ u[end] - 2u[1] + u[2]
       map(2:n-1) do i
           h^2 * D(u[i]) ~ u[i-1] - 2u[i] + u[i+1] # ifelse(u[i]>0, u[i+1] - u[i], u[i] - u[i-1])
       end
       h^2 * D(u[end]) ~ u[end-1] - 2u[end] + u[1]
      ]

My tentative syntax for the interior is

map(u[1:n-1], u[2:n], u[3:n+1]) do um1, ui, up1
    um1 - 2ui + up1
end

I am not sure how to scale this to high dims though.

ChrisRackauckas commented on May 23, 2024

Let's talk about that high dim case today. That's going to be important to this. And get @chriselrod and @shashi on too. 1pm?

valentinsulzer commented on May 23, 2024

Happy to help with this but the discussions so far are above my pay grade 😅

ChrisRackauckas commented on May 23, 2024

using OrdinaryDiffEq, LinearAlgebra, SparseArrays, BenchmarkTools
using LoopVectorization
const α₂ = 1.0
const α₃ = 1.0
const β₁ = 1.0
const β₂ = 1.0
const β₃ = 1.0
const r₁ = 1.0
const r₂ = 1.0
const D = 100.0
const γ₁ = 0.1
const γ₂ = 0.1
const γ₃ = 0.1
const N = 512
const X = reshape([i for i in 1:N for j in 1:N],N,N)
const Y = reshape([j for i in 1:N for j in 1:N],N,N)
const ฮฑโ‚ = 1.0.*(X.>=4*N/5)
const Mx = Tridiagonal([1.0 for i in 1:N-1],[-2.0 for i in 1:N],[1.0 for i in 1:N-1])
const My = copy(Mx)
Mx[2,1] = 2.0
Mx[end-1,end] = 2.0
My[1,2] = 2.0
My[end,end-1] = 2.0
# Define the discretized PDE as an ODE function
const MyA = zeros(N,N)
const AMx = zeros(N,N)
const DA = zeros(N,N)
function f(du,u,p,t)
   A = @view  u[:,:,1]
   B = @view  u[:,:,2]
   C = @view  u[:,:,3]
  dA = @view du[:,:,1]
  dB = @view du[:,:,2]
  dC = @view du[:,:,3]
  mul!(MyA,My,A)
  mul!(AMx,A,Mx)
  @. DA = D*(MyA + AMx)
  @. dA = DA + α₁ - β₁*A - r₁*A*B + r₂*C
  @. dB = α₂ - β₂*B - r₁*A*B + r₂*C
  @. dC = α₃ - β₃*C + r₁*A*B - r₂*C
end
function f!(du,u,p,t)
    A = @view  u[:,:,1]
    B = @view  u[:,:,2]
    C = @view  u[:,:,3]
   dA = @view du[:,:,1]
   dB = @view du[:,:,2]
   dC = @view du[:,:,3]
  @turbo for j in 2:N-1, i in 2:N-1
    dA[i,j] = D*(A[i-1,j] + A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
              ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
  end
  @inbounds for j in 2:N-1
    i = 1
    dA[1,j] = D*(2A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
              ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
  end
  @inbounds for j in 2:N-1
    i = N
    dA[end,j] = D*(2A[i-1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
              ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
  end
  @inbounds for i in 2:N-1
    j = 1
    dA[i,j] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
              ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
  end
  @inbounds for i in 2:N-1
    j = N
    dA[i,end] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
                ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
  end
  @inbounds begin
     i = 1; j = 1
     dA[i,j] = D*(2A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
               ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
     dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
     dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
     i = 1; j = N
     dA[i,j] = D*(2A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
               ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
     dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
     dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
     i = N; j = 1
     dA[i,j] = D*(2A[i-1,j] + 2A[i,j+1] - 4A[i,j]) +
               ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
     dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
     dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
     i = N; j = N
     dA[i,j] = D*(2A[i-1,j] + 2A[i,j-1] - 4A[i,j]) +
               ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
     dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
     dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
   end
end
#=
# Double check:
u = rand(N,N,3)
du  = similar(u)
du2 = similar(u)
f( du ,u,nothing,0.0)
f!(du2,u,nothing,0.0)
=#
u0 = zeros(N,N,3)
prob = ODEProblem(f!,u0,(0.0,10.0))
@btime solve(prob, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1); # 13.300 s (3406 allocations: 4.34 GiB)
@btime solve(prob, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1); # 38.481 s (270 allocations: 708.02 MiB)
@btime solve(prob, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1); # 40.445 s (267 allocations: 690.02 MiB)

There's an N to tone down to test with. I'm pretty confused by this. Is it just that the cache lines would have to be too long?

@chriselrod found:

Locally:
julia> u0 = randn(N,N,3);
julia> du0 = similar(u0);
julia> @benchmark f!($du0,$u0,nothing,0.0) # @turbo
samples: 7479; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (641000.0 - 644000.0]  โ–ˆโ–ˆโ–57
 (644000.0 - 647000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Œ291
 (647000.0 - 650000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‰552
 (650000.0 - 653100.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‹683
 (653100.0 - 656100.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ778
 (656100.0 - 659100.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ749
 (659100.0 - 662100.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ750
 (662100.0 - 665100.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Œ653
 (665100.0 - 668200.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Ž675
 (668200.0 - 671200.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‰693
 (671200.0 - 674200.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 836
 (674200.0 - 677200.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‰663
 (677200.0 - 680200.0]  โ–ˆโ–ˆโ–ˆโ–91
 (680200.0 - 1.0048e6]  โ–8
                  Counts
min: 640.984 μs (0.00% GC); mean: 661.914 μs (0.00% GC); median: 661.578 μs (0.00% GC); max: 1.005 ms (0.00% GC).
julia> @benchmark g!($du0,$u0,nothing,0.0) # @inbounds @fastmath
samples: 4784; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (1.0269e6 - 1.0287e6]  โ–Ž15
 (1.0287e6 - 1.0306e6]  โ–10
 (1.0306e6 - 1.0324e6]  โ–7
 (1.0324e6 - 1.0342e6]  โ–Ž20
 (1.0342e6 - 1.036e6 ]  โ–7
 (1.036e6  - 1.0379e6]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‹953
 (1.0379e6 - 1.0397e6]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ3337
 (1.0397e6 - 1.0415e6]  โ–ˆโ–ˆโ–ˆโ–Œ377
 (1.0415e6 - 1.0433e6]  โ–30
 (1.0433e6 - 1.0452e6]  โ–13
 (1.0452e6 - 1.047e6 ]  โ–6
 (1.047e6  - 1.0488e6]  โ–3
 (1.0488e6 - 1.0506e6]  โ–1
 (1.0506e6 - 1.2524e6]  โ–5
                  Counts
min: 1.027 ms (0.00% GC); mean: 1.039 ms (0.00% GC); median: 1.038 ms (0.00% GC); max: 1.252 ms (0.00% GC).
julia> @show N
N = 512
LV doesn't do cache tiling currently, so performance suffers at larger sizes.
I have a good idea of how I want it to do that now, but want to wait for the rewrite as otherwise the rewrite will never happen (there's no end to the stuff I could still do).
g! was using @inbounds @fastmath and took 1ms, while f! was using @turbo and took 0.6ms.
I could try more or on a different computer.
With @tturbo:
julia> @benchmark fthreads!($du0,$u0,nothing,0.0) # @tturbo
samples: 10000; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (89800.0  - 91100.0 ]  โ–7
 (91100.0  - 92400.0 ]  โ–ˆ80
 (92400.0  - 93700.0 ]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–453
 (93700.0  - 95000.0 ]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Š1405
 (95000.0  - 96300.0 ]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ2681
 (96300.0  - 97600.0 ]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 2692
 (97600.0  - 98900.0 ]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Œ1658
 (98900.0  - 100200.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–629
 (100200.0 - 101500.0]  โ–ˆโ–ˆโ–‹234
 (101500.0 - 102800.0]  โ–‰78
 (102800.0 - 104000.0]  โ–Œ37
 (104000.0 - 105300.0]  โ–Ž15
 (105300.0 - 106600.0]  โ–11
 (106600.0 - 107900.0]  โ–10
 (107900.0 - 349100.0]  โ–10
                  Counts
min: 89.756 μs (0.00% GC); mean: 96.586 μs (0.00% GC); median: 96.428 μs (0.00% GC); max: 349.074 μs (0.00% GC).

and from the thread:


Chris Elrod  1 day ago
Seems to be a problem in LoopVectorization's cost modeling.
It is unrolling the j loop to save on reloads of A[i,j-1],A[i,j] ,A[i,j+1].
But it seems faster not to unroll at all.

Chris Elrod  1 day ago
You can try to confirm on your machine with @turbo unroll=1 for ...

Chris Elrod  1 day ago
Unrolling does seem faster for very small N, e.g. I tried 30.

Chris Elrod  1 day ago
This was also true on my old (AVX2, Haswell) laptop.

Chris Elrod  18 hours ago
I tweaked some heuristics in LoopVectorization 0.12.38.
It should do better now, but I still saw some anomalously worse performance on my Cascadelake-X system from the objectively better assembly: reloading the same data multiple times instead of reusing a register was actually faster on that machine, so it's generating the objectively better but (there) slower code.
I could perhaps add heuristics to identify that case and add prefetch instructions, which should give the same (but larger) benefit as whatever benefit the extra loads are having.
On my Haswell (AVX2) laptop, the code without extra loads seemed slightly faster.
Code with extra loads == @inbounds @fastmath for ...; @simd ivdep for ...
I can reproduce the problem with LLVM by using @simd ivdep like above, but moving all the stores after all the loads in the loop. That eliminates the redundant loads, and worsens performance.

Chris Elrod  18 hours ago
Anyway, the loop is memory bound and the key seems to be making it as easy on the hardware prefetchers as possible. Any sort of fanciness -- including using TiledIteration.jl to block -- just makes performance worse.
So it may be worth getting LoopVectorization to emit software prefetches there.
EDIT: Played around with software prefetching there. Made it slower.

Chris Elrod  16 hours ago
Also note that 512 is a magic bad number. Benchmarks with 512x512 matrices:
[attached: stencilbench.jl]



Chris Elrod  16 hours ago
julia> @benchmark fbase!($du,$u,$α₁)
samples: 4373; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (1.06e6  - 1.102e6]  โ–ˆโ–122
 (1.102e6 - 1.144e6]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 3397
 (1.144e6 - 1.186e6]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Ž591
 (1.186e6 - 1.229e6]  โ–ˆโ–Š195
 (1.229e6 - 1.271e6]  โ–32
 (1.271e6 - 1.313e6]  โ–10
 (1.313e6 - 1.355e6]  โ–5
 (1.355e6 - 1.397e6]  โ–9
 (1.397e6 - 1.439e6]  โ–1
 (1.439e6 - 1.481e6]   0
 (1.481e6 - 1.523e6]  โ–3
 (1.523e6 - 1.565e6]  โ–1
 (1.565e6 - 1.608e6]  โ–2
 (1.608e6 - 1.807e6]  โ–5
                  Counts
min: 1.060 ms (0.00% GC); mean: 1.134 ms (0.00% GC); median: 1.126 ms (0.00% GC); max: 1.807 ms (0.00% GC).
julia> @benchmark fturbo!($du,$u,$α₁)
samples: 6310; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (740000.0 - 787000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 4631
 (787000.0 - 834000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Œ1139
 (834000.0 - 881000.0]  โ–ˆโ–ˆโ–321
 (881000.0 - 928000.0]  โ–ˆ139
 (928000.0 - 975000.0]  โ–40
 (975000.0 - 1.022e6 ]  โ–19
 (1.022e6  - 1.069e6 ]  โ–6
 (1.069e6  - 1.116e6 ]  โ–3
 (1.116e6  - 1.163e6 ]  โ–4
 (1.163e6  - 1.211e6 ]   0
 (1.211e6  - 1.258e6 ]   0
 (1.258e6  - 1.305e6 ]  โ–1
 (1.305e6  - 1.352e6 ]   0
 (1.352e6  - 1.577e6 ]  โ–7
                  Counts
min: 739.585 μs (0.00% GC); mean: 781.544 μs (0.00% GC); median: 767.157 μs (0.00% GC); max: 1.577 ms (0.00% GC).
julia> @benchmark funroll1!($du,$u,$α₁)
samples: 6365; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (739000.0 - 756000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‹1365
 (756000.0 - 772000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 2623
 (772000.0 - 788000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‹921
 (788000.0 - 804000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Š580
 (804000.0 - 821000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–Š407
 (821000.0 - 837000.0]  โ–ˆโ–ˆโ–201
 (837000.0 - 853000.0]  โ–ˆโ–Ž109
 (853000.0 - 870000.0]  โ–‰73
 (870000.0 - 886000.0]  โ–Œ34
 (886000.0 - 902000.0]  โ–25
 (902000.0 - 918000.0]  โ–9
 (918000.0 - 935000.0]  โ–6
 (935000.0 - 951000.0]  โ–5
 (951000.0 - 1.137e6 ]  โ–7
                  Counts
min: 739.437 μs (0.00% GC); mean: 775.172 μs (0.00% GC); median: 764.959 μs (0.00% GC); max: 1.137 ms (0.00% GC).
julia> @benchmark funroll1_v2!($du,$u,$α₁)
samples: 6431; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (730000.0 - 746000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–806
 (746000.0 - 761000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–2902
 (761000.0 - 777000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Š1229
 (777000.0 - 792000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–605
 (792000.0 - 808000.0]  โ–ˆโ–ˆโ–ˆโ–‰369
 (808000.0 - 823000.0]  โ–ˆโ–ˆโ–194
 (823000.0 - 839000.0]  โ–ˆโ–125
 (839000.0 - 855000.0]  โ–‰75
 (855000.0 - 870000.0]  โ–Œ43
 (870000.0 - 886000.0]  โ–33
 (886000.0 - 901000.0]  โ–Ž18
 (901000.0 - 917000.0]  โ–9
 (917000.0 - 932000.0]  โ–Ž16
 (932000.0 - 1.133e6 ]  โ–7
                  Counts
min: 730.097 μs (0.00% GC); mean: 767.196 μs (0.00% GC); median: 757.611 μs (0.00% GC); max: 1.133 ms (0.00% GC).
julia> @benchmark fbase_ivdep!($du,$u,$α₁)
samples: 6693; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (688000.0 - 709000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Ž1779
 (709000.0 - 730000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 1959
 (730000.0 - 751000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‰1163
 (751000.0 - 773000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Ž731
 (773000.0 - 794000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‰441
 (794000.0 - 815000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–Œ291
 (815000.0 - 836000.0]  โ–ˆโ–ˆโ–154
 (836000.0 - 857000.0]  โ–ˆโ–Ž81
 (857000.0 - 878000.0]  โ–Œ30
 (878000.0 - 900000.0]  โ–Œ27
 (900000.0 - 921000.0]  โ–19
 (921000.0 - 942000.0]  โ–7
 (942000.0 - 963000.0]  โ–4
 (963000.0 - 1.17e6  ]  โ–7
                  Counts
min: 687.778 μs (0.00% GC); mean: 736.942 μs (0.00% GC); median: 724.632 μs (0.00% GC); max: 1.170 ms (0.00% GC).
julia> @benchmark fbase_ivdep_v2!($du,$u,$α₁)
samples: 6249; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (749000.0 - 766000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–1134
 (766000.0 - 783000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 2424
 (783000.0 - 801000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Œ1089
 (801000.0 - 818000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Ž661
 (818000.0 - 835000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–‰392
 (835000.0 - 852000.0]  โ–ˆโ–ˆโ–ˆ241
 (852000.0 - 869000.0]  โ–ˆโ–109
 (869000.0 - 886000.0]  โ–‰61
 (886000.0 - 903000.0]  โ–Š51
 (903000.0 - 920000.0]  โ–Œ40
 (920000.0 - 937000.0]  โ–24
 (937000.0 - 954000.0]  โ–8
 (954000.0 - 971000.0]  โ–8
 (971000.0 - 1.177e6 ]  โ–7
                  Counts
min: 749.351 μs (0.00% GC); mean: 789.853 μs (0.00% GC); median: 779.491 μs (0.00% GC); max: 1.177 ms (0.00% GC).
[attached: stencilbench.jl]



Chris Elrod  16 hours ago
Increasing N from 512 to 528:
[attached: stencilbench.jl]



Chris Elrod  16 hours ago
julia> @benchmark fbase!($du,$u,$α₁)
samples: 4993; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (911000.0 - 924000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‰477
 (924000.0 - 936000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–904
 (936000.0 - 949000.0]  โ–ˆโ–ˆโ–Š183
 (949000.0 - 961000.0]  โ–ˆโ–‹112
 (961000.0 - 974000.0]  โ–‹37
 (974000.0 - 987000.0]  โ–Ž9
 (987000.0 - 999000.0]  โ–2
 (999000.0 - 1.012e6 ]  โ–1
 (1.012e6  - 1.025e6 ]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 2086
 (1.025e6  - 1.037e6 ]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–718
 (1.037e6  - 1.05e6  ]  โ–ˆโ–ˆโ–ˆโ–ˆโ–‹319
 (1.05e6   - 1.063e6 ]  โ–ˆโ–‰128
 (1.063e6  - 1.075e6 ]  โ–Ž12
 (1.075e6  - 1.192e6 ]  โ–5
                  Counts
min: 910.853 μs (0.00% GC); mean: 993.422 μs (0.00% GC); median: 1.020 ms (0.00% GC); max: 1.192 ms (0.00% GC).
julia> @benchmark fturbo!($du,$u,$α₁)
samples: 6683; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (709000.0 - 719000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Ž552
 (719000.0 - 730000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 2323
 (730000.0 - 740000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Ž1717
 (740000.0 - 751000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Ž637
 (751000.0 - 762000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–‹353
 (762000.0 - 772000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–‹356
 (772000.0 - 783000.0]  โ–ˆโ–ˆโ–ˆโ–ˆ305
 (783000.0 - 793000.0]  โ–ˆโ–ˆโ–ˆ232
 (793000.0 - 804000.0]  โ–ˆโ–‹118
 (804000.0 - 814000.0]  โ–Œ35
 (814000.0 - 825000.0]  โ–27
 (825000.0 - 836000.0]  โ–Ž16
 (836000.0 - 846000.0]  โ–5
 (846000.0 - 1.136e6 ]  โ–7
                  Counts
min: 708.509 ฮผs (0.00% GC); mean: 739.460 ฮผs (0.00% GC); median: 731.846 ฮผs (0.00% GC); max: 1.136 ms (0.00% GC).
julia> @benchmark funroll1!($du,$u,$ฮฑโ‚)
samples: 6669; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (707000.0 - 719000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Š429
 (719000.0 - 730000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 2241
 (730000.0 - 742000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‹1905
 (742000.0 - 754000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–680
 (754000.0 - 765000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Œ409
 (765000.0 - 777000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–‰359
 (777000.0 - 788000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–325
 (788000.0 - 800000.0]  โ–ˆโ–ˆโ–Ž163
 (800000.0 - 811000.0]  โ–ˆโ–79
 (811000.0 - 823000.0]  โ–28
 (823000.0 - 834000.0]  โ–Ž18
 (834000.0 - 846000.0]  โ–19
 (846000.0 - 857000.0]  โ–7
 (857000.0 - 1.143e6 ]  โ–7
                  Counts
min: 707.473 ฮผs (0.00% GC); mean: 740.984 ฮผs (0.00% GC); median: 733.445 ฮผs (0.00% GC); max: 1.143 ms (0.00% GC).
julia> @benchmark funroll1_v2!($du,$u,$ฮฑโ‚)
samples: 6594; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (717000.0 - 727000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–285
 (727000.0 - 736000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Š1707
 (736000.0 - 746000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 1998
 (746000.0 - 756000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–950
 (756000.0 - 766000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Š377
 (766000.0 - 775000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–‹305
 (775000.0 - 785000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–Œ293
 (785000.0 - 795000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–‹307
 (795000.0 - 805000.0]  โ–ˆโ–ˆโ–ˆ196
 (805000.0 - 814000.0]  โ–ˆโ–‹101
 (814000.0 - 824000.0]  โ–Š45
 (824000.0 - 834000.0]  โ–17
 (834000.0 - 844000.0]  โ–6
 (844000.0 - 1.147e6 ]  โ–7
                  Counts
min: 716.736 ฮผs (0.00% GC); mean: 749.543 ฮผs (0.00% GC); median: 742.499 ฮผs (0.00% GC); max: 1.147 ms (0.00% GC).
julia> @benchmark fbase_ivdep!($du,$u,$ฮฑโ‚)
samples: 7403; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (635000.0 - 649000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–2645
 (649000.0 - 664000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ2292
 (664000.0 - 678000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Š591
 (678000.0 - 693000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–Ž374
 (693000.0 - 707000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‹490
 (707000.0 - 722000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‰508
 (722000.0 - 736000.0]  โ–ˆโ–ˆโ–ˆโ–Ž276
 (736000.0 - 751000.0]  โ–ˆโ–119
 (751000.0 - 765000.0]  โ–‹49
 (765000.0 - 780000.0]  โ–25
 (780000.0 - 794000.0]  โ–Ž18
 (794000.0 - 809000.0]  โ–3
 (809000.0 - 823000.0]  โ–5
 (823000.0 - 1.106e6 ]  โ–8
                  Counts
min: 634.654 ฮผs (0.00% GC); mean: 666.727 ฮผs (0.00% GC); median: 652.383 ฮผs (0.00% GC); max: 1.106 ms (0.00% GC).
julia> @benchmark fbase_ivdep_v2!($du,$u,$ฮฑโ‚)
samples: 6610; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (717000.0 - 730000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Š893
 (730000.0 - 742000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–3102
 (742000.0 - 754000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ1031
 (754000.0 - 766000.0]  โ–ˆโ–ˆโ–ˆโ–‹363
 (766000.0 - 779000.0]  โ–ˆโ–ˆโ–ˆโ–Œ358
 (779000.0 - 791000.0]  โ–ˆโ–ˆโ–ˆโ–Ž331
 (791000.0 - 803000.0]  โ–ˆโ–ˆโ–Š278
 (803000.0 - 815000.0]  โ–ˆโ–Ž125
 (815000.0 - 827000.0]  โ–‹63
 (827000.0 - 840000.0]  โ–27
 (840000.0 - 852000.0]  โ–12
 (852000.0 - 864000.0]  โ–11
 (864000.0 - 876000.0]  โ–9
 (876000.0 - 1.126e6 ]  โ–7
                  Counts
min: 717.473 ฮผs (0.00% GC); mean: 747.755 ฮผs (0.00% GC); median: 738.879 ฮผs (0.00% GC); max: 1.126 ms (0.00% GC).
[attached file: stencilbench.jl, 4 kB]



Chris Elrod  16 hours ago
I attached the script.
512 is bad because many CPUs (including the 7900X Skylake-X CPU I benchmarked on) have 32 KiB L1 caches with 8-way associativity.
That means data
julia> 32*(2^10) ÷ 8
4096
bytes apart would contend for the same slot of the L1 cache.
I.e., data
julia> (32*(2^10) ÷ 8) ÷ sizeof(Float64)
512
Float64 apart contend for the same slot.
With 512x512 matrices, this means data from A[i,j-1], A[i,j], and A[i,j+1] all want to occupy the same place in the L1 cache.
This also makes prefetching the next cacheline (i.e. A[i+8,j-1], A[i+8,j], and A[i+8,j+1]) at the same time harder.
Hence increasing the sizes of these arrays can actually make operating on them take less time overall.
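To make that arithmetic concrete, here is a small sketch (mine, not from the thread) that recomputes the critical stride and shows one way to sidestep the conflict by padding the leading dimension, which has the same effect as bumping N from 512 to 528:

# Critical-stride arithmetic for an assumed 32 KiB, 8-way associative L1 data cache.
l1_bytes = 32 * 2^10
ways     = 8
critical_stride_bytes = l1_bytes ÷ ways                          # 4096
critical_stride_f64   = critical_stride_bytes ÷ sizeof(Float64)  # 512

# Pad the leading dimension so adjacent columns no longer map to the same L1 set:
N    = 512
Npad = 528                    # column stride of 528*8 = 4224 bytes, not a multiple of 4096
Apad = zeros(Npad, N)         # padded storage
A    = view(Apad, 1:N, 1:N)   # work on the N×N window; A[i,j-1], A[i,j], A[i,j+1] now land in different sets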
[attached file: stencilbench.jl, 4 kB]

from methodoflines.jl.

chriselrod avatar chriselrod commented on May 23, 2024

Solve times, prob is @turbo, prob_threads is @tturbo, and prob_simdivdep is @inbounds @fastmath @simd ivdep:

julia> @time solve(prob, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 46.533658 seconds (14.03 k allocations: 35.467 GiB, 0.49% gc time)

julia> @time solve(prob_threads, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 42.405112 seconds (14.86 k allocations: 35.467 GiB, 0.93% gc time)

julia> @time solve(prob_simdivdep, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 46.475041 seconds (14.12 k allocations: 35.725 GiB, 0.14% gc time)

julia> @time solve(prob, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 43.199515 seconds (266 allocations: 708.018 MiB)

julia> @time solve(prob_threads, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 38.143093 seconds (283 allocations: 708.018 MiB, 0.07% gc time)

julia> @time solve(prob_simdivdep, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 43.644268 seconds (266 allocations: 708.018 MiB, 0.24% gc time)

julia> @time solve(prob, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 43.931490 seconds (265 allocations: 690.017 MiB)

julia> @time solve(prob_threads, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 37.560453 seconds (282 allocations: 690.018 MiB, 0.01% gc time)

julia> @time solve(prob_simdivdep, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 43.625376 seconds (265 allocations: 690.017 MiB, 0.01% gc time)

Testing just the functions:

julia> @benchmark f!($du, $u0, nothing, 0.0) # @turbo
samples: 9231; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns

 (530100.0 - 531000.0]  โ–Š31
 (531000.0 - 532000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–Œ207
 (532000.0 - 532900.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–575
 (532900.0 - 533800.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–991
 (533800.0 - 534700.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Œ1335
 (534700.0 - 535700.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ1408
 (535700.0 - 536600.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ1404
 (536600.0 - 537500.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Œ1333
 (537500.0 - 538400.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ1125
 (538400.0 - 539400.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‹589
 (539400.0 - 540300.0]  โ–ˆโ–ˆโ–ˆโ–ˆ184
 (540300.0 - 541200.0]  โ–Š31
 (541200.0 - 542100.0]  โ–Ž6
 (542100.0 - 543100.0]  โ–2
 (543100.0 - 895600.0]  โ–Ž10

                  Counts

min: 530.115 ฮผs (0.00% GC); mean: 535.779 ฮผs (0.00% GC); median: 535.711 ฮผs (0.00% GC); max: 895.599 ฮผs (0.00% GC).

julia> @benchmark fthreads!($du, $u0, nothing, 0.0) # @tturbo
samples: 10000; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns

 (81000.0 - 81900.0 ]  โ–1
 (81900.0 - 82800.0 ]  โ–5
 (82800.0 - 83700.0 ]  โ–Š50
 (83700.0 - 84500.0 ]  โ–ˆโ–ˆโ–ˆโ–260
 (84500.0 - 85400.0 ]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ948
 (85400.0 - 86300.0 ]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Ž2006
 (86300.0 - 87200.0 ]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ2384
 (87200.0 - 88100.0 ]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Š1964
 (88100.0 - 89000.0 ]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Š1322
 (89000.0 - 89800.0 ]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Ž650
 (89800.0 - 90700.0 ]  โ–ˆโ–ˆโ–ˆ238
 (90700.0 - 91600.0 ]  โ–ˆโ–107
 (91600.0 - 92500.0 ]  โ–‹41
 (92500.0 - 93400.0 ]  โ–Ž14
 (93400.0 - 327700.0]  โ–Ž10

                  Counts

min: 81.006 ฮผs (0.00% GC); mean: 87.097 ฮผs (0.00% GC); median: 86.934 ฮผs (0.00% GC); max: 327.730 ฮผs (0.00% GC).

julia> @benchmark fsimdivdep!($du, $u0, nothing, 0.0) # @simd ivdep
samples: 9821; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns

 (498800.0 - 499600.0]  โ–Ž10
 (499600.0 - 500400.0]  โ–ˆโ–‰109
 (500400.0 - 501200.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Š688
 (501200.0 - 502000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‹1512
 (502000.0 - 502800.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‹1571
 (502800.0 - 503500.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‹1508
 (503500.0 - 504300.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–1772
 (504300.0 - 505100.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Š1517
 (505100.0 - 505900.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Ž781
 (505900.0 - 506700.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–‹266
 (506700.0 - 507500.0]  โ–ˆ54
 (507500.0 - 508200.0]  โ–Ž13
 (508200.0 - 509000.0]  โ–5
 (509000.0 - 509800.0]  โ–5
 (509800.0 - 897000.0]  โ–Ž10

                  Counts

min: 498.833 ฮผs (0.00% GC); mean: 503.346 ฮผs (0.00% GC); median: 503.297 ฮผs (0.00% GC); max: 897.010 ฮผs (0.00% GC).

So most of the performance gain we see from threading doesn't translate to the actual solve.
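As a crude sanity check of that (my arithmetic, not from the thread; it attributes the entire difference in solve time to the RHS and ignores threading overhead and GC):

t_turbo      = 536e-6   # ≈ mean time per f! call with @turbo, from the benchmark above
t_tturbo     = 87e-6    # ≈ mean time per fthreads! call with @tturbo
solve_turbo  = 46.5     # ROCK4 solve time with @turbo, seconds
solve_tturbo = 42.4     # ROCK4 solve time with @tturbo, seconds
ncalls   = (solve_turbo - solve_tturbo) / (t_turbo - t_tturbo)   # ≈ 9e3 implied RHS calls
rhs_frac = ncalls * t_turbo / solve_turbo                        # ≈ 0.1
# i.e. only on the order of 10% of the ROCK4 solve is spent inside the RHS itself,
# so even a ~6x faster kernel can only shave a few seconds off the total.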

For reference, here is the assembly from the main loop using @inbounds @fastmath @simd ivdep, along with llvm-mca's report:

Iterations:        100
Instructions:      2400
Total Cycles:      878
Total uOps:        4100

Dispatch Width:    6
uOps Per Cycle:    4.67
IPC:               2.73
Block RThroughput: 7.0


Cycles with backend pressure increase [ 45.22% ]
Throughput Bottlenecks:
  Resource Pressure       [ 29.16% ]
  - SKXPort0  [ 21.30% ]
  - SKXPort1  [ 0.80% ]
  - SKXPort2  [ 14.58% ]
  - SKXPort3  [ 14.58% ]
  - SKXPort4  [ 4.78% ]
  - SKXPort5  [ 21.30% ]
  Data Dependencies:      [ 39.41% ]
  - Register Dependencies [ 39.41% ]
  - Memory Dependencies   [ 0.00% ]

Critical sequence based on the simulation:

              Instruction                                 Dependency Information
 +----< 22.   add       r8, 64
 |
 |    < loop carried >
 |
 +----> 0.    vmovupd   zmm6, zmmword ptr [r14 + r8 - 8]  ## REGISTER dependency:  r8
 |      1.    vmovupd   zmm7, zmmword ptr [r14 + r8]
 |      2.    vaddpd    zmm7, zmm7, zmmword ptr [r14 + r8 - 16]
 |      3.    vaddpd    zmm7, zmm7, zmmword ptr [rbx + r8]
 |      4.    vaddpd    zmm7, zmm7, zmmword ptr [r15 + r8 + 8]
 |      5.    vfmadd231pd       zmm7, zmm6, zmm3
 |      6.    vfmadd213pd       zmm7, zmm4, zmmword ptr [rdx + r8]
 +----> 7.    vfnmsub132pd      zmm6, zmm6, zmmword ptr [r10 + r8] ## REGISTER dependency:  zmm6
 |      8.    vaddpd    zmm6, zmm7, zmm6
 |      9.    vaddpd    zmm6, zmm6, zmmword ptr [r9 + r8]
 |      10.   vmovupd   zmmword ptr [r13 + r8], zmm6
 +----> 11.   vmovupd   zmm6, zmmword ptr [r10 + r8]      ## RESOURCE interference:  SKXPort3 [ probability: 2% ]
 +----> 12.   vfnmsub231pd      zmm6, zmm6, zmmword ptr [r14 + r8 - 8] ## REGISTER dependency:  zmm6
 |      13.   vaddpd    zmm6, zmm6, zmm5
 |      14.   vaddpd    zmm6, zmm6, zmmword ptr [r9 + r8]
 |      15.   vmovupd   zmmword ptr [r12 + r8], zmm6
 +----> 16.   vmovupd   zmm6, zmmword ptr [r9 + r8]       ## RESOURCE interference:  SKXPort3 [ probability: 11% ]
 |      17.   vmovupd   zmm7, zmmword ptr [r10 + r8]
 |      18.   vfmadd132pd       zmm7, zmm5, zmmword ptr [r14 + r8 - 8]
 +----> 19.   vaddpd    zmm6, zmm6, zmm6                  ## REGISTER dependency:  zmm6
 +----> 20.   vsubpd    zmm6, zmm7, zmm6                  ## REGISTER dependency:  zmm6
 +----> 21.   vmovupd   zmmword ptr [rdi + r8], zmm6      ## REGISTER dependency:  zmm6
        22.   add       r8, 64
        23.   cmp       r8, 4032


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 2      8     0.50    *                   vmovupd       zmm6, zmmword ptr [r14 + r8 - 8]
 2      8     0.50    *                   vmovupd       zmm7, zmmword ptr [r14 + r8]
 2      11    0.50    *                   vaddpd        zmm7, zmm7, zmmword ptr [r14 + r8 - 16]
 2      11    0.50    *                   vaddpd        zmm7, zmm7, zmmword ptr [rbx + r8]
 2      11    0.50    *                   vaddpd        zmm7, zmm7, zmmword ptr [r15 + r8 + 8]
 1      4     0.50                        vfmadd231pd   zmm7, zmm6, zmm3
 2      11    0.50    *                   vfmadd213pd   zmm7, zmm4, zmmword ptr [rdx + r8]
 2      11    0.50    *                   vfnmsub132pd  zmm6, zmm6, zmmword ptr [r10 + r8]
 1      4     0.50                        vaddpd        zmm6, zmm7, zmm6
 2      11    0.50    *                   vaddpd        zmm6, zmm6, zmmword ptr [r9 + r8]
 2      1     1.00           *            vmovupd       zmmword ptr [r13 + r8], zmm6
 2      8     0.50    *                   vmovupd       zmm6, zmmword ptr [r10 + r8]
 2      11    0.50    *                   vfnmsub231pd  zmm6, zmm6, zmmword ptr [r14 + r8 - 8]
 1      4     0.50                        vaddpd        zmm6, zmm6, zmm5
 2      11    0.50    *                   vaddpd        zmm6, zmm6, zmmword ptr [r9 + r8]
 2      1     1.00           *            vmovupd       zmmword ptr [r12 + r8], zmm6
 2      8     0.50    *                   vmovupd       zmm6, zmmword ptr [r9 + r8]
 2      8     0.50    *                   vmovupd       zmm7, zmmword ptr [r10 + r8]
 2      11    0.50    *                   vfmadd132pd   zmm7, zmm5, zmmword ptr [r14 + r8 - 8]
 1      4     0.50                        vaddpd        zmm6, zmm6, zmm6
 1      4     0.50                        vsubpd        zmm6, zmm7, zmm6
 2      1     1.00           *            vmovupd       zmmword ptr [rdi + r8], zmm6
 1      1     0.25                        add   r8, 64
 1      1     0.25                        cmp   r8, 4032


Resources:
[0]   - SKXDivider
[1]   - SKXFPDivider
[2]   - SKXPort0
[3]   - SKXPort1
[4]   - SKXPort2
[5]   - SKXPort3
[6]   - SKXPort4
[7]   - SKXPort5
[8]   - SKXPort6
[9]   - SKXPort7


Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]
 -      -     7.44   4.25   7.06   7.06   3.00   7.31   2.00   2.88

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
 -      -     0.01   0.98   0.41   0.59    -     0.01    -      -     vmovupd   zmm6, zmmword ptr [r14 + r8 - 8]
 -      -     0.13   0.84   0.55   0.45    -     0.03    -      -     vmovupd   zmm7, zmmword ptr [r14 + r8]
 -      -     0.61    -     0.59   0.41    -     0.39    -      -     vaddpd    zmm7, zmm7, zmmword ptr [r14 + r8 - 16]
 -      -     0.47    -     0.61   0.39    -     0.53    -      -     vaddpd    zmm7, zmm7, zmmword ptr [rbx + r8]
 -      -     0.51    -     0.55   0.45    -     0.49    -      -     vaddpd    zmm7, zmm7, zmmword ptr [r15 + r8 + 8]
 -      -     0.46    -      -      -      -     0.54    -      -     vfmadd231pd       zmm7, zmm6, zmm3
 -      -     0.54    -     0.46   0.54    -     0.46    -      -     vfmadd213pd       zmm7, zmm4, zmmword ptr [rdx + r8]
 -      -     0.44    -     0.66   0.34    -     0.56    -      -     vfnmsub132pd      zmm6, zmm6, zmmword ptr [r10 + r8]
 -      -     0.48    -      -      -      -     0.52    -      -     vaddpd    zmm6, zmm7, zmm6
 -      -     0.52    -     0.69   0.31    -     0.48    -      -     vaddpd    zmm6, zmm6, zmmword ptr [r9 + r8]
 -      -      -      -     0.01    -     1.00    -      -     0.99   vmovupd   zmmword ptr [r13 + r8], zmm6
 -      -     0.11   0.88   0.49   0.51    -     0.01    -      -     vmovupd   zmm6, zmmword ptr [r10 + r8]
 -      -     0.66    -     0.32   0.68    -     0.34    -      -     vfnmsub231pd      zmm6, zmm6, zmmword ptr [r14 + r8 - 8]
 -      -     0.50    -      -      -      -     0.50    -      -     vaddpd    zmm6, zmm6, zmm5
 -      -     0.50    -     0.47   0.53    -     0.50    -      -     vaddpd    zmm6, zmm6, zmmword ptr [r9 + r8]
 -      -      -      -      -     0.02   1.00    -      -     0.98   vmovupd   zmmword ptr [r12 + r8], zmm6
 -      -     0.01   0.98   0.24   0.76    -     0.01    -      -     vmovupd   zmm6, zmmword ptr [r9 + r8]
 -      -     0.25   0.57   0.57   0.43    -     0.18    -      -     vmovupd   zmm7, zmmword ptr [r10 + r8]
 -      -     0.45    -     0.36   0.64    -     0.55    -      -     vfmadd132pd       zmm7, zmm5, zmmword ptr [r14 + r8 - 8]
 -      -     0.39    -      -      -      -     0.61    -      -     vaddpd    zmm6, zmm6, zmm6
 -      -     0.40    -      -      -      -     0.60    -      -     vsubpd    zmm6, zmm7, zmm6
 -      -      -      -     0.08   0.01   1.00    -      -     0.91   vmovupd   zmmword ptr [rdi + r8], zmm6
 -      -      -      -      -      -      -      -     1.00    -     add       r8, 64
 -      -      -      -      -      -      -      -     1.00    -     cmp       r8, 4032

Versus @turbo:

Iterations:        100
Instructions:      3100
Total Cycles:      783
Total uOps:        4200

Dispatch Width:    6
uOps Per Cycle:    5.36
IPC:               3.96
Block RThroughput: 7.0


No resource or data dependency bottlenecks discovered.


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 2      8     0.50    *                   vmovupd       zmm4, zmmword ptr [rbx - 8]
 2      8     0.50    *                   vmovupd       zmm5, zmmword ptr [rbx]
 2      11    0.50    *                   vaddpd        zmm4, zmm4, zmmword ptr [rbx + 8]
 2      11    0.50    *                   vaddpd        zmm4, zmm4, zmmword ptr [r12 + rcx]
 2      11    0.50    *                   vaddpd        zmm4, zmm4, zmmword ptr [r12]
 1      4     0.50                        vfmadd231pd   zmm4, zmm5, zmm1
 2      8     0.50    *                   vmovupd       zmm6, zmmword ptr [rbp]
 2      8     0.50    *                   vmovupd       zmm7, zmmword ptr [rdx]
 2      11    0.50    *                   vaddpd        zmm8, zmm7, zmmword ptr [rdi]
 1      4     0.50                        vfmsub231pd   zmm8, zmm5, zmm6
 1      4     0.50                        vfnmadd231pd  zmm8, zmm5, zmm0
 1      4     0.50                        vfmsub231pd   zmm8, zmm4, zmm2
 1      4     0.50                        vaddpd        zmm4, zmm7, zmm3
 1      4     0.50                        vfmsub231pd   zmm4, zmm5, zmm6
 1      4     0.50                        vfmsub231pd   zmm4, zmm6, zmm0
 1      1     0.33                        vmovapd       zmm9, zmm0
 1      4     0.50                        vfmadd213pd   zmm9, zmm7, zmm3
 1      4     0.50                        vfmadd231pd   zmm9, zmm5, zmm6
 2      1     1.00           *            vmovupd       zmmword ptr [r9], zmm8
 2      1     1.00           *            vmovupd       zmmword ptr [rax], zmm4
 1      4     0.50                        vfmadd231pd   zmm9, zmm0, zmm7
 2      1     1.00           *            vmovupd       zmmword ptr [r14], zmm9
 1      1     0.25                        add   rbx, 64
 1      1     0.25                        add   rbp, 64
 1      1     0.25                        add   rdx, 64
 1      1     0.25                        add   rdi, 64
 1      1     0.25                        add   r9, 64
 1      1     0.25                        add   rax, 64
 1      1     0.25                        add   r14, 64
 1      1     0.25                        add   r12, 64
 1      1     0.25                        cmp   r10, rbx


Resources:
[0]   - SKXDivider
[1]   - SKXFPDivider
[2]   - SKXPort0
[3]   - SKXPort1
[4]   - SKXPort2
[5]   - SKXPort3
[6]   - SKXPort4
[7]   - SKXPort5
[8]   - SKXPort6
[9]   - SKXPort7


Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]
 -      -     7.52   7.33   4.06   4.06   3.00   7.53   5.62   2.88

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
 -      -     0.07   0.84   0.61   0.39    -     0.09    -      -     vmovupd   zmm4, zmmword ptr [rbx - 8]
 -      -      -     0.97   0.45   0.55    -     0.03    -      -     vmovupd   zmm5, zmmword ptr [rbx]
 -      -     0.65    -     0.39   0.61    -     0.35    -      -     vaddpd    zmm4, zmm4, zmmword ptr [rbx + 8]
 -      -     0.48    -     0.37   0.63    -     0.52    -      -     vaddpd    zmm4, zmm4, zmmword ptr [r12 + rcx]
 -      -     0.35    -     0.80   0.20    -     0.65    -      -     vaddpd    zmm4, zmm4, zmmword ptr [r12]
 -      -     0.27    -      -      -      -     0.73    -      -     vfmadd231pd       zmm4, zmm5, zmm1
 -      -     0.02   0.83   0.62   0.38    -     0.15    -      -     vmovupd   zmm6, zmmword ptr [rbp]
 -      -     0.08   0.91   0.29   0.71    -     0.01    -      -     vmovupd   zmm7, zmmword ptr [rdx]
 -      -     0.56    -     0.49   0.51    -     0.44    -      -     vaddpd    zmm8, zmm7, zmmword ptr [rdi]
 -      -     0.39    -      -      -      -     0.61    -      -     vfmsub231pd       zmm8, zmm5, zmm6
 -      -     0.40    -      -      -      -     0.60    -      -     vfnmadd231pd      zmm8, zmm5, zmm0
 -      -     0.29    -      -      -      -     0.71    -      -     vfmsub231pd       zmm8, zmm4, zmm2
 -      -     0.55    -      -      -      -     0.45    -      -     vaddpd    zmm4, zmm7, zmm3
 -      -     0.31    -      -      -      -     0.69    -      -     vfmsub231pd       zmm4, zmm5, zmm6
 -      -     0.32    -      -      -      -     0.68    -      -     vfmsub231pd       zmm4, zmm6, zmm0
 -      -     0.20   0.78    -      -      -     0.02    -      -     vmovapd   zmm9, zmm0
 -      -     0.75    -      -      -      -     0.25    -      -     vfmadd213pd       zmm9, zmm7, zmm3
 -      -     0.77    -      -      -      -     0.23    -      -     vfmadd231pd       zmm9, zmm5, zmm6
 -      -      -      -     0.01    -     1.00    -      -     0.99   vmovupd   zmmword ptr [r9], zmm8
 -      -      -      -      -     0.03   1.00    -      -     0.97   vmovupd   zmmword ptr [rax], zmm4
 -      -     0.74    -      -      -      -     0.26    -      -     vfmadd231pd       zmm9, zmm0, zmm7
 -      -      -      -     0.03   0.05   1.00    -      -     0.92   vmovupd   zmmword ptr [r14], zmm9
 -      -      -      -      -      -      -      -     1.00    -     add       rbx, 64
 -      -     0.12   0.03    -      -      -     0.01   0.84    -     add       rbp, 64
 -      -     0.05   0.34    -      -      -     0.02   0.59    -     add       rdx, 64
 -      -     0.06   0.24    -      -      -      -     0.70    -     add       rdi, 64
 -      -     0.02   0.69    -      -      -     0.01   0.28    -     add       r9, 64
 -      -      -     0.29    -      -      -      -     0.71    -     add       rax, 64
 -      -     0.02   0.30    -      -      -     0.02   0.66    -     add       r14, 64
 -      -     0.01   0.69    -      -      -      -     0.30    -     add       r12, 64
 -      -     0.04   0.42    -      -      -      -     0.54    -     cmp       r10, rbx

All the extra adds here for LoopVectorization are because, while LLVM can figure out that the views all come from the same array, LoopVectorization doesn't know that.

If you want just a single number from each report, LLVM-MCA estimates 100 iterations of the @turbo loop will take 783 clock cycles, while 100 iterations of the @simd ivdep loop will take 878 clock cycles.
If we replace the views with indexing, @turbo is down to 731 cycles:

Iterations:        100
Instructions:      2700
Total Cycles:      731
Total uOps:        3800

Dispatch Width:    6
uOps Per Cycle:    5.20
IPC:               3.69
Block RThroughput: 7.0


Cycles with backend pressure increase [ 51.44% ]
Throughput Bottlenecks:
  Resource Pressure       [ 50.21% ]
  - SKXPort0  [ 49.93% ]
  - SKXPort1  [ 49.11% ]
  - SKXPort2  [ 13.13% ]
  - SKXPort3  [ 13.13% ]
  - SKXPort4  [ 5.06% ]
  - SKXPort5  [ 49.93% ]
  - SKXPort6  [ 9.58% ]
  - SKXPort7  [ 3.15% ]
  Data Dependencies:      [ 39.53% ]
  - Register Dependencies [ 39.53% ]
  - Memory Dependencies   [ 0.00% ]

Critical sequence based on the simulation:

              Instruction                                 Dependency Information
 +----< 13.   vaddpd    zmm5, zmm7, zmm3
 |
 |    < loop carried >
 |
 |      0.    mov       r11, rbx
 +----> 1.    vmovupd   zmm4, zmmword ptr [rbx]           ## RESOURCE interference:  SKXPort5 [ probability: 25% ]
 +----> 2.    vmovupd   zmm5, zmmword ptr [rbx - 8]       ## RESOURCE interference:  SKXPort1 [ probability: 91% ]
 +----> 3.    vaddpd    zmm5, zmm5, zmmword ptr [rbx + 8] ## RESOURCE interference:  SKXPort2 [ probability: 13% ]
 +----> 4.    vaddpd    zmm5, zmm5, zmmword ptr [rbx + r9] ## REGISTER dependency:  zmm5
 +----> 5.    vaddpd    zmm5, zmm5, zmmword ptr [rbx + rax] ## REGISTER dependency:  zmm5
 +----> 6.    vmovupd   zmm6, zmmword ptr [rbx + rdi]     ## RESOURCE interference:  SKXPort5 [ probability: 1% ]
 |      7.    vfmadd231pd       zmm5, zmm4, zmm0
 +----> 8.    vmovupd   zmm7, zmmword ptr [rbx + 2*rdi]   ## RESOURCE interference:  SKXPort1 [ probability: 93% ]
 |      9.    vaddpd    zmm8, zmm7, zmmword ptr [rbp]
 |      10.   vfmsub231pd       zmm8, zmm4, zmm6
 |      11.   vfnmadd231pd      zmm8, zmm4, zmm1
 |      12.   vfmsub231pd       zmm8, zmm5, zmm2
 +----> 13.   vaddpd    zmm5, zmm7, zmm3                  ## REGISTER dependency:  zmm7
 |      14.   vfmsub231pd       zmm5, zmm4, zmm6
 |      15.   vfmsub231pd       zmm5, zmm6, zmm1
 |      16.   vmovapd   zmm9, zmm1
 +----> 17.   vfmadd213pd       zmm9, zmm7, zmm3          ## RESOURCE interference:  SKXPort5 [ probability: 51% ]
 |      18.   vfmadd231pd       zmm9, zmm4, zmm6
 |      19.   vfmadd231pd       zmm9, zmm1, zmm7
 |      20.   vmovupd   zmmword ptr [r12], zmm8
 |      21.   vmovupd   zmmword ptr [r12 + rdx], zmm5
 |      22.   vmovupd   zmmword ptr [r12 + 2*rdx], zmm9
 |      23.   add       rbx, 64
 |      24.   add       rbp, 64
 |      25.   add       r12, 64
 |      26.   cmp       rsi, rbx
 |
 |    < loop carried >
 |
 +----> 3.    vaddpd    zmm5, zmm5, zmmword ptr [rbx + 8] ## RESOURCE interference:  SKXPort5 [ probability: 38% ]


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.25                        mov   r11, rbx
 2      8     0.50    *                   vmovupd       zmm4, zmmword ptr [rbx]
 2      8     0.50    *                   vmovupd       zmm5, zmmword ptr [rbx - 8]
 2      11    0.50    *                   vaddpd        zmm5, zmm5, zmmword ptr [rbx + 8]
 2      11    0.50    *                   vaddpd        zmm5, zmm5, zmmword ptr [rbx + r9]
 2      11    0.50    *                   vaddpd        zmm5, zmm5, zmmword ptr [rbx + rax]
 2      8     0.50    *                   vmovupd       zmm6, zmmword ptr [rbx + rdi]
 1      4     0.50                        vfmadd231pd   zmm5, zmm4, zmm0
 2      8     0.50    *                   vmovupd       zmm7, zmmword ptr [rbx + 2*rdi]
 2      11    0.50    *                   vaddpd        zmm8, zmm7, zmmword ptr [rbp]
 1      4     0.50                        vfmsub231pd   zmm8, zmm4, zmm6
 1      4     0.50                        vfnmadd231pd  zmm8, zmm4, zmm1
 1      4     0.50                        vfmsub231pd   zmm8, zmm5, zmm2
 1      4     0.50                        vaddpd        zmm5, zmm7, zmm3
 1      4     0.50                        vfmsub231pd   zmm5, zmm4, zmm6
 1      4     0.50                        vfmsub231pd   zmm5, zmm6, zmm1
 1      1     0.33                        vmovapd       zmm9, zmm1
 1      4     0.50                        vfmadd213pd   zmm9, zmm7, zmm3
 1      4     0.50                        vfmadd231pd   zmm9, zmm4, zmm6
 1      4     0.50                        vfmadd231pd   zmm9, zmm1, zmm7
 2      1     1.00           *            vmovupd       zmmword ptr [r12], zmm8
 2      1     1.00           *            vmovupd       zmmword ptr [r12 + rdx], zmm5
 2      1     1.00           *            vmovupd       zmmword ptr [r12 + 2*rdx], zmm9
 1      1     0.25                        add   rbx, 64
 1      1     0.25                        add   rbp, 64
 1      1     0.25                        add   r12, 64
 1      1     0.25                        cmp   rsi, rbx


Resources:
[0]   - SKXDivider
[1]   - SKXFPDivider
[2]   - SKXPort0
[3]   - SKXPort1
[4]   - SKXPort2
[5]   - SKXPort3
[6]   - SKXPort4
[7]   - SKXPort5
[8]   - SKXPort6
[9]   - SKXPort7


Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]
 -      -     7.06   6.79   4.04   4.05   3.00   7.07   3.08   2.91

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
 -      -      -     0.85    -      -      -     0.01   0.14    -     mov       r11, rbx
 -      -      -     0.99   0.46   0.54    -     0.01    -      -     vmovupd   zmm4, zmmword ptr [rbx]
 -      -     0.01   0.98   0.75   0.25    -     0.01    -      -     vmovupd   zmm5, zmmword ptr [rbx - 8]
 -      -     0.19    -     0.67   0.33    -     0.81    -      -     vaddpd    zmm5, zmm5, zmmword ptr [rbx + 8]
 -      -     0.19    -     0.53   0.47    -     0.81    -      -     vaddpd    zmm5, zmm5, zmmword ptr [rbx + r9]
 -      -     0.53    -     0.66   0.34    -     0.47    -      -     vaddpd    zmm5, zmm5, zmmword ptr [rbx + rax]
 -      -     0.01   0.97   0.31   0.69    -     0.02    -      -     vmovupd   zmm6, zmmword ptr [rbx + rdi]
 -      -     0.94    -      -      -      -     0.06    -      -     vfmadd231pd       zmm5, zmm4, zmm0
 -      -      -     0.99   0.29   0.71    -     0.01    -      -     vmovupd   zmm7, zmmword ptr [rbx + 2*rdi]
 -      -     0.10    -     0.33   0.67    -     0.90    -      -     vaddpd    zmm8, zmm7, zmmword ptr [rbp]
 -      -     0.73    -      -      -      -     0.27    -      -     vfmsub231pd       zmm8, zmm4, zmm6
 -      -     0.93    -      -      -      -     0.07    -      -     vfnmadd231pd      zmm8, zmm4, zmm1
 -      -     0.96    -      -      -      -     0.04    -      -     vfmsub231pd       zmm8, zmm5, zmm2
 -      -     0.09    -      -      -      -     0.91    -      -     vaddpd    zmm5, zmm7, zmm3
 -      -     0.55    -      -      -      -     0.45    -      -     vfmsub231pd       zmm5, zmm4, zmm6
 -      -     0.34    -      -      -      -     0.66    -      -     vfmsub231pd       zmm5, zmm6, zmm1
 -      -      -     0.98    -      -      -     0.02    -      -     vmovapd   zmm9, zmm1
 -      -     0.30    -      -      -      -     0.70    -      -     vfmadd213pd       zmm9, zmm7, zmm3
 -      -     0.55    -      -      -      -     0.45    -      -     vfmadd231pd       zmm9, zmm4, zmm6
 -      -     0.62    -      -      -      -     0.38    -      -     vfmadd231pd       zmm9, zmm1, zmm7
 -      -      -      -      -      -     1.00    -      -     1.00   vmovupd   zmmword ptr [r12], zmm8
 -      -      -      -      -     0.03   1.00    -      -     0.97   vmovupd   zmmword ptr [r12 + rdx], zmm5
 -      -      -      -     0.04   0.02   1.00    -      -     0.94   vmovupd   zmmword ptr [r12 + 2*rdx], zmm9
 -      -      -      -      -      -      -      -     1.00    -     add       rbx, 64
 -      -      -     0.04    -      -      -     0.01   0.95    -     add       rbp, 64
 -      -     0.01   0.87    -      -      -      -     0.12    -     add       r12, 64
 -      -     0.01   0.12    -      -      -      -     0.87    -     cmp       rsi, rbx

This code looks better to me, and LLVM says it is faster. But it's slower in benchmarks (on my computers) anyway, and I'm a bit at a loss.
Maybe I could add architecture-specific hacks to do bad things sometimes, but in general worse probably isn't better.
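For anyone who wants to reproduce these reports, the rough workflow is: dump the kernel's native code, hand-trim it to just the hot loop, and feed that snippet to llvm-mca. A minimal sketch of that, assuming llvm-mca from an LLVM toolchain is on the PATH (these exact commands are my reconstruction, not the ones used above):

using InteractiveUtils  # for code_native

# Dump the native code of the @simd ivdep kernel to a file.
open("fsimdivdep.s", "w") do io
    code_native(io, fsimdivdep!, (Array{Float64,3}, Array{Float64,3}, Nothing, Float64);
                debuginfo = :none)
end
# Hand-edit fsimdivdep.s down to the inner-loop body (and strip any comments/directives the
# assembler rejects), then analyze it; -mcpu should match the target, e.g. Skylake-X here.
run(`llvm-mca -mcpu=skylake-avx512 fsimdivdep.s`)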

Would be interesting to see more benchmarks.
Code:

using OrdinaryDiffEq, LinearAlgebra, SparseArrays, BenchmarkHistograms, LoopVectorization
const ฮฑโ‚‚ = 1.0
const ฮฑโ‚ƒ = 1.0
const ฮฒโ‚ = 1.0
const ฮฒโ‚‚ = 1.0
const ฮฒโ‚ƒ = 1.0
const rโ‚ = 1.0
const rโ‚‚ = 1.0
const D = 100.0
const ฮณโ‚ = 0.1
const ฮณโ‚‚ = 0.1
const ฮณโ‚ƒ = 0.1
const N = 512
const X = reshape([i for i in 1:N for j in 1:N],N,N);
const Y = reshape([j for i in 1:N for j in 1:N],N,N);
const ฮฑโ‚ = 1.0.*(X.>=4*N/5);
const Mx = Tridiagonal([1.0 for i in 1:N-1],[-2.0 for i in 1:N],[1.0 for i in 1:N-1]);
const My = copy(Mx);
Mx[2,1] = 2.0;
Mx[end-1,end] = 2.0;
My[1,2] = 2.0;
My[end,end-1] = 2.0;
# Define the discretized PDE as an ODE function
const MyA = zeros(N,N);
const AMx = zeros(N,N);
const DA = zeros(N,N);
function f!(du,u,p,t)
  A = @view  u[:,:,1]
  B = @view  u[:,:,2]
  C = @view  u[:,:,3]
  dA = @view du[:,:,1]
  dB = @view du[:,:,2]
  dC = @view du[:,:,3]
  @turbo for j in 2:N-1, i in 2:N-1
    dA[i,j] = D*(A[i-1,j] + A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
  end
  @inbounds for j in 2:N-1
    i = 1
    dA[1,j] = D*(2A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
  end
  @inbounds for j in 2:N-1
    i = N
    dA[end,j] = D*(2A[i-1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
  end
  @inbounds for i in 2:N-1
    j = 1
    dA[i,j] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
  end
  @inbounds for i in 2:N-1
    j = N
    dA[i,end] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
  end
  @inbounds begin
    i = 1; j = 1
    dA[i,j] = D*(2A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
    i = 1; j = N
    dA[i,j] = D*(2A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
    i = N; j = 1
    dA[i,j] = D*(2A[i-1,j] + 2A[i,j+1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
    i = N; j = N
    dA[i,j] = D*(2A[i-1,j] + 2A[i,j-1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
  end
end
function fthreads!(du,u,p,t)
  A = @view  u[:,:,1]
  B = @view  u[:,:,2]
  C = @view  u[:,:,3]
  dA = @view du[:,:,1]
  dB = @view du[:,:,2]
  dC = @view du[:,:,3]
  @tturbo for j in 2:N-1, i in 2:N-1
    dA[i,j] = D*(A[i-1,j] + A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
  end
  @inbounds for j in 2:N-1
    i = 1
    dA[1,j] = D*(2A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
  end
  @inbounds for j in 2:N-1
    i = N
    dA[end,j] = D*(2A[i-1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
  end
  @inbounds for i in 2:N-1
    j = 1
    dA[i,j] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
  end
  @inbounds for i in 2:N-1
    j = N
    dA[i,end] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
  end
  @inbounds begin
    i = 1; j = 1
    dA[i,j] = D*(2A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
    i = 1; j = N
    dA[i,j] = D*(2A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
    i = N; j = 1
    dA[i,j] = D*(2A[i-1,j] + 2A[i,j+1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
    i = N; j = N
    dA[i,j] = D*(2A[i-1,j] + 2A[i,j-1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
  end
end
function fsimdivdep!(du,u,p,t)
  A = @view  u[:,:,1]
  B = @view  u[:,:,2]
  C = @view  u[:,:,3]
  dA = @view du[:,:,1]
  dB = @view du[:,:,2]
  dC = @view du[:,:,3]
  @inbounds @fastmath for j in 2:N-1; @simd ivdep for i in 2:N-1
    dA[i,j] = D*(A[i-1,j] + A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
  end; end
  @inbounds for j in 2:N-1
    i = 1
    dA[1,j] = D*(2A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
  end
  @inbounds for j in 2:N-1
    i = N
    dA[end,j] = D*(2A[i-1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
  end
  @inbounds for i in 2:N-1
    j = 1
    dA[i,j] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
  end
  @inbounds for i in 2:N-1
    j = N
    dA[i,end] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
  end
  @inbounds begin
    i = 1; j = 1
    dA[i,j] = D*(2A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
    i = 1; j = N
    dA[i,j] = D*(2A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
    i = N; j = 1
    dA[i,j] = D*(2A[i-1,j] + 2A[i,j+1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
    i = N; j = N
    dA[i,j] = D*(2A[i-1,j] + 2A[i,j-1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
  end
end
function fnoviews!(du,u,p,t) # direct u/du indexing (no views) inside the @turbo loop; boundaries still use the views
  A = @view  u[:,:,1]
  B = @view  u[:,:,2]
  C = @view  u[:,:,3]
  dA = @view du[:,:,1]
  dB = @view du[:,:,2]
  dC = @view du[:,:,3]
  @turbo for j in 2:N-1, i in 2:N-1
    du[i,j,1] = D*(u[i-1,j,1] + u[i+1,j,1] + u[i,j+1,1] + u[i,j-1,1] - 4u[i,j,1]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*u[i,j,1] - rโ‚*u[i,j,1]*u[i,j,2] + rโ‚‚*u[i,j,3]
    du[i,j,2] = ฮฑโ‚‚ - ฮฒโ‚‚*u[i,j,2] - rโ‚*u[i,j,1]*u[i,j,2] + rโ‚‚*u[i,j,3]
    du[i,j,3] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*u[i,j,3] + rโ‚*u[i,j,1]*u[i,j,2] - rโ‚‚*u[i,j,3]
  end
  @inbounds for j in 2:N-1
    i = 1
    dA[1,j] = D*(2A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
  end
  @inbounds for j in 2:N-1
    i = N
    dA[end,j] = D*(2A[i-1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
  end
  @inbounds for i in 2:N-1
    j = 1
    dA[i,j] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
  end
  @inbounds for i in 2:N-1
    j = N
    dA[i,end] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
  end
  @inbounds begin
    i = 1; j = 1
    dA[i,j] = D*(2A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
    i = 1; j = N
    dA[i,j] = D*(2A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
    i = N; j = 1
    dA[i,j] = D*(2A[i-1,j] + 2A[i,j+1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
    i = N; j = N
    dA[i,j] = D*(2A[i-1,j] + 2A[i,j-1] - 4A[i,j]) +
      ฮฑโ‚[i,j] - ฮฒโ‚*A[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dB[i,j] = ฮฑโ‚‚ - ฮฒโ‚‚*B[i,j] - rโ‚*A[i,j]*B[i,j] + rโ‚‚*C[i,j]
    dC[i,j] = ฮฑโ‚ƒ - ฮฒโ‚ƒ*C[i,j] + rโ‚*A[i,j]*B[i,j] - rโ‚‚*C[i,j]
  end
end
u0 = zeros(N,N,3); du = similar(u0);
prob = ODEProblem(f!,u0,(0.0,10.0));
prob_threads = ODEProblem(fthreads!,u0,(0.0,10.0));
prob_simdivdep = ODEProblem(fsimdivdep!,u0,(0.0,10.0));

@time solve(prob, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
@time solve(prob_threads, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
@time solve(prob_simdivdep, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);

@time solve(prob, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
@time solve(prob_threads, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
@time solve(prob_simdivdep, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);

@time solve(prob, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
@time solve(prob_threads, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
@time solve(prob_simdivdep, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);

from methodoflines.jl.

chriselrod avatar chriselrod commented on May 23, 2024

Oh, and benchmarks on the M1 (running on a native Julia 1.8 build):

julia> @time solve(prob, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 21.625047 seconds (14.09 k allocations: 35.655 GiB, 0.71% gc time)

julia> @time solve(prob_threads, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 91.950376 seconds (14.24 k allocations: 35.655 GiB, 0.24% gc time)

julia> @time solve(prob_simdivdep, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 23.505749 seconds (14.01 k allocations: 35.421 GiB, 0.23% gc time)

julia> @time solve(prob, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 19.860662 seconds (267 allocations: 708.018 MiB)

julia> @time solve(prob_threads, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 22.793870 seconds (270 allocations: 708.018 MiB, 0.06% gc time)

julia> @time solve(prob_simdivdep, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 21.074039 seconds (270 allocations: 708.018 MiB, 0.53% gc time)

julia> @time solve(prob, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 20.464703 seconds (264 allocations: 690.017 MiB, 0.34% gc time)

julia> @time solve(prob_threads, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 21.983172 seconds (267 allocations: 690.017 MiB, 0.07% gc time)

julia> @time solve(prob_simdivdep, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 21.848427 seconds (267 allocations: 690.017 MiB, 0.39% gc time)

Not sure what's going on with ROCK4 and multithreading, but otherwise -- ouch Intel.
The benchmarks from my previous post were on a 10980XE, a much higher-end chip than the M1.
Would be interesting to see how a Zen3 chip compares.

Benchmarks of the individual functions:

julia> @benchmark f!($du, $u0, nothing, 0.0) # @turbo
samples: 10000; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns

 (409000.0 - 411200.0]  โ–3
 (411200.0 - 413500.0]  โ–23
 (413500.0 - 415700.0]  โ–ˆโ–ˆโ–Ž191
 (415700.0 - 418000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–894
 (418000.0 - 420300.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Š1974
 (420300.0 - 422500.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–2614
 (422500.0 - 424800.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‹2228
 (424800.0 - 427100.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–1220
 (427100.0 - 429300.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‰506
 (429300.0 - 431600.0]  โ–ˆโ–ˆโ–180
 (431600.0 - 433800.0]  โ–‰76
 (433800.0 - 436100.0]  โ–‹46
 (436100.0 - 438400.0]  โ–25
 (438400.0 - 440600.0]  โ–10
 (440600.0 - 534200.0]  โ–10

                  Counts

min: 408.958 ฮผs (0.00% GC); mean: 422.196 ฮผs (0.00% GC); median: 421.917 ฮผs (0.00% GC); max: 534.208 ฮผs (0.00% GC).

julia> @benchmark fthreads!($du, $u0, nothing, 0.0) # @tturbo
samples: 10000; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns

 (218400.0 - 219900.0]  โ–Ž16
 (219900.0 - 221500.0]  โ–ˆ73
 (221500.0 - 223000.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‰473
 (223000.0 - 224500.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–1485
 (224500.0 - 226100.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‰2415
 (226100.0 - 227600.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 2433
 (227600.0 - 229100.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ1865
 (229100.0 - 230700.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Š871
 (230700.0 - 232200.0]  โ–ˆโ–ˆโ–ˆโ–244
 (232200.0 - 233800.0]  โ–Š55
 (233800.0 - 235300.0]  โ–Ž16
 (235300.0 - 236800.0]  โ–22
 (236800.0 - 238400.0]  โ–Ž16
 (238400.0 - 239900.0]  โ–6
 (239900.0 - 411600.0]  โ–10

                  Counts

min: 218.375 ฮผs (0.00% GC); mean: 226.536 ฮผs (0.00% GC); median: 226.375 ฮผs (0.00% GC); max: 411.584 ฮผs (0.00% GC).

julia> @benchmark fsimdivdep!($du, $u0, nothing, 0.0) # @simd ivdep
samples: 8948; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns

 (548500.0 - 550300.0]  โ–27
 (550300.0 - 552100.0]  โ–ˆโ–ˆโ–ˆโ–‹290
 (552100.0 - 553900.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Œ1252
 (553900.0 - 555700.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 2424
 (555700.0 - 557500.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Š2315
 (557500.0 - 559300.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‰1360
 (559300.0 - 561100.0]  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Ž582
 (561100.0 - 562900.0]  โ–ˆโ–ˆโ–‰227
 (562900.0 - 564700.0]  โ–ˆโ–‰144
 (564700.0 - 566400.0]  โ–ˆโ–111
 (566400.0 - 568200.0]  โ–ˆโ–88
 (568200.0 - 570000.0]  โ–ˆ76
 (570000.0 - 571800.0]  โ–27
 (571800.0 - 573600.0]  โ–Ž16
 (573600.0 - 604700.0]  โ–9

                  Counts

min: 548.500 ฮผs (0.00% GC); mean: 556.582 ฮผs (0.00% GC); median: 556.000 ฮผs (0.00% GC); max: 604.667 ฮผs (0.00% GC).

Julia crashes a lot on the M1, though.

from methodoflines.jl.

valentinsulzer avatar valentinsulzer commented on May 23, 2024

Some more benchmarks (also on M1) for linear diffusion that might be of interest:

using DifferentialEquations

using LinearAlgebra, SparseArrays

# Naive implementation

n = 100
A = spdiagm(-1 => ones(n-1), 0 => reduce(vcat, [-1,-2ones(n-2),-1]), 1=> ones(n-1))

function pde!(dy, y, p, t)
    dy .= A * y 
    nothing
end

u0 = sin.(range(0,stop=pi,length=n))
dy = similar(u0)
tspan = (0.0,1.0)

using BenchmarkTools
@btime pde!(dy,u0,0,0) # 728.531 ns (3 allocations: 928 bytes)

prob = ODEProblem(pde!, u0, tspan)

# Benchmarks
using Sundials
@btime solve(prob, KenCarp47(autodiff=false), reltol=1e-6, abstol=1e-6); # 1.848 ms (2358 allocations: 806.56 KiB)
@btime solve(prob, CVODE_BDF(), reltol=1e-6, abstol=1e-6); # 367.167 ฮผs (899 allocations: 190.02 KiB)
@btime solve(prob, CVODE_BDF(linear_solver=:Band,jac_upper=1,jac_lower=1), reltol=1e-6, abstol=1e-6);# 118.416 ฮผs (412 allocations: 86.91 KiB)


# Non-allocating implementation

pde_noalloc! = let A = spdiagm(-1 => ones(n-1), 0 => reduce(vcat, [-1,-2ones(n-2),-1]), 1=> ones(n-1))
    
    function pde!(dy, y, p, t)
        mul!(dy, A, y)
        nothing
    end
    
end

@btime pde_noalloc!(dy,u0,0,0) # 371.079 ns (0 allocations: 0 bytes)
prob_noalloc = ODEProblem(pde_noalloc!, u0, tspan)
@btime solve(prob_noalloc, KenCarp47(autodiff=false), reltol=1e-6, abstol=1e-6); # 1.283 ms (388 allocations: 211.94 KiB)
@btime solve(prob_noalloc, CVODE_BDF(), reltol=1e-6, abstol=1e-6); # 305.750 ฮผs (522 allocations: 77.14 KiB)
@btime solve(prob_noalloc, CVODE_BDF(linear_solver=:Band,jac_upper=1,jac_lower=1), reltol=1e-6, abstol=1e-6); # 105.125 ฮผs (328 allocations: 61.98 KiB)

# Analytical Jacobian
using ModelingToolkit  # needed for modelingtoolkitize and generate_jacobian below
de = modelingtoolkitize(prob)
jac = eval(ModelingToolkit.generate_jacobian(de)[2])
f = ODEFunction(pde!, jac=jac)
prob_jac = ODEProblem(f,u0,tspan)
@btime solve(prob_jac, KenCarp47(autodiff=false), reltol=1e-6, abstol=1e-6); # 1.403 ms (499 allocations: 256.42 KiB)
@btime solve(prob_jac, CVODE_BDF(), reltol=1e-6, abstol=1e-6); # 298.291 ฮผs (403 allocations: 83.95 KiB)

f = ODEFunction(pde_noalloc!, jac=jac)
prob_noalloc_jac = ODEProblem(f,u0,tspan)
@btime solve(prob_noalloc_jac, KenCarp47(autodiff=false), reltol=1e-6, abstol=1e-6); # 1.351 ms (347 allocations: 210.91 KiB)
@btime solve(prob_noalloc_jac, CVODE_BDF(), reltol=1e-6, abstol=1e-6); # 278.792 ฮผs (328 allocations: 61.75 KiB)

prob_mtk = ODEProblem(de, Pair[], tspan)
@btime solve(prob_mtk, KenCarp47(autodiff=false), reltol=1e-6, abstol=1e-6); # 1.028 ms (388 allocations: 214.03 KiB)
@btime solve(prob_mtk, CVODE_BDF(), reltol=1e-6, abstol=1e-6); # 266.958 ฮผs (522 allocations: 78.38 KiB)

# Sparse jacobian
using SparsityDetection
input = rand(length(u0))
output = similar(input)
sparsity_pattern = jacobian_sparsity(pde!,output,input,0.0,0.0)
jac_sparsity = Float64.(sparse(sparsity_pattern))

using Plots
spy(jac_sparsity)

f = ODEFunction(pde!;jac_prototype=jac_sparsity)
prob_sparse = ODEProblem(f,u0,tspan)
@btime solve(prob_sparse, KenCarp47(), reltol=1e-6, abstol=1e-6); # 785.375 ฮผs (2368 allocations: 1.51 MiB)
@btime solve(prob_sparse, CVODE_BDF(), reltol=1e-6, abstol=1e-6); # 375.833 ฮผs (934 allocations: 199.62 KiB)

f = ODEFunction(pde_noalloc!;jac_prototype=jac_sparsity)
prob_noalloc_sparse = ODEProblem(f,u0,tspan)
@btime solve(prob_noalloc_sparse, KenCarp47(), reltol=1e-6, abstol=1e-6); # 751.666 ฮผs (2200 allocations: 1.45 MiB)
@btime solve(prob_noalloc_sparse, CVODE_BDF(), reltol=1e-6, abstol=1e-6); # 314.375 ฮผs (561 allocations: 87.88 KiB)

from methodoflines.jl.

xtalax avatar xtalax commented on May 23, 2024

@ChrisRackauckas is this still relevant?

from methodoflines.jl.

ChrisRackauckas avatar ChrisRackauckas commented on May 23, 2024

I don't think so. It's the justification for going with the stencil composition form.

from methodoflines.jl.
