Comments (27)
With array symbolics now merged (https://symbolics.juliasymbolics.org/dev/manual/arrays/) we have a better idea of where this is all going. Pulling in @shashi, @chriselrod, and @YingboMa into this as well.
From https://tutorials.sciml.ai/html/introduction/03-optimizing_diffeq_code.html we know that fusing the operators has a pretty major effect on performance. So we do not want the final code to have sparse matrix multiplications; instead we want to make sure that this all lowers to something that uses LoopVectorization.jl over contiguous loop indices. In order to support huge equations, we probably do not want this to be scalarized. We also want to make sure that this uses a cache-optimized, or at least cache-oblivious, loop iteration, which should be the case when using LoopVectorization.jl. It would be good for the code to be GPU friendly as well.
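To make the fusion argument concrete, here is a minimal sketch (a toy 1-D diffusion–reaction update; the names `unfused!` and `fused!` are illustrative, not MethodOfLines output) of the difference between materializing the operator application and fusing it with the reaction term:

```julia
# Sketch: `unfused!` materializes the Laplacian into a temporary before applying
# the reaction term (two passes over the data); `fused!` does both in one pass.
function unfused!(du, u, D, tmp)
    N = length(u)
    @inbounds for i in 2:N-1            # interior Laplacian, written to a temporary
        tmp[i] = u[i-1] - 2u[i] + u[i+1]
    end
    @inbounds for i in 2:N-1            # second pass over the data
        du[i] = D * tmp[i] + u[i]^2
    end
    du
end

function fused!(du, u, D)
    N = length(u)
    @inbounds for i in 2:N-1            # one pass: stencil and nonlinearity together
        du[i] = D * (u[i-1] - 2u[i] + u[i+1]) + u[i]^2
    end
    du
end
```

Both compute the same `du`; the fused version touches `u` once instead of writing and then re-reading a temporary.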
So on that note, it would be good to see a test of LoopVectorization against something like NNLib's conv for the nonlinear Laplacian. I have faith it could match or surpass it, but that's a thing we should benchmark.
Additionally, we need to think about how to lower symbolic maps and such down to something that could be GPU friendly, and what to do about the boundary conditions in this case.
It would also be good to note the effect of the matrix-vector form on the index reduction algorithm. IIUC one would need to write the boundary terms separately from the matvecs in order to allow the tearing to work, since otherwise that would break the full-incidence assumption. If the boundary conditions already have to be written out in index form, I don't know if that saves any simplicity.
@YingboMa can you share the suggested MTK code for the current form?
from methodoflines.jl.
So on that note, it would be good to see a test of LoopVectorization against something like NNLib's conv for the nonlinear Laplacian. I have faith it could match or surpass it, but that's a thing we should benchmark.
Have an example?
using NNlib, Polyester, LoopVectorization, Static
function kernaxes(::DenseConvDims{2,K,C_in,C_out}) where {K,C_in,C_out} # LoopVectorization can take advantage of static size information
    K₁ = StaticInt(1):StaticInt(K[1])
    K₂ = StaticInt(1):StaticInt(K[2])
    Cᵢₙ = StaticInt(1):StaticInt(C_in)
    Cₒᵤₜ = StaticInt(1):StaticInt(C_out)
    (K₁, K₂, Cᵢₙ, Cₒᵤₜ)
end
function convlayer!(
    out::AbstractArray{<:Any,4}, img, kern,
    dcd::DenseConvDims{2, <:Any, <:Any, <:Any, (1, 1), (0, 0, 0, 0), (1, 1), true}
)
    (K₁, K₂, Cᵢₙ, Cₒᵤₜ) = kernaxes(dcd)
    @batch for d ∈ axes(out,4)
        @turbo for j₁ ∈ axes(out,1), j₂ ∈ axes(out,2), o ∈ Cₒᵤₜ
            s = zero(eltype(out))
            for k₁ ∈ K₁, k₂ ∈ K₂, i ∈ Cᵢₙ
                s += img[j₁ + k₁ - 1, j₂ + k₂ - 1, i, d] * kern[k₁, k₂, i, o]
            end
            out[j₁, j₂, o, d] = s
        end
    end
    out
end
img = rand(Float32, 260, 260, 3, 100);
kern = rand(Float32, 5, 5, 3, 6);
out1 = Array{Float32}(undef, size(img,1)+1-size(kern,1), size(img,2)+1-size(kern,2), size(kern,4), size(img,4));
out2 = similar(out1);
dcd = NNlib.DenseConvDims(img, kern, flipkernel = true);
@time NNlib.conv!(out1, img, kern, dcd);
@time convlayer!(out2, img, kern, dcd);
out1 ≈ out2
@benchmark NNlib.conv!($out1, $img, $kern, $dcd)
@benchmark convlayer!($out1, $img, $kern, $dcd)
LoopVectorization's algorithms scale very badly with the number of loops. Seven works, but takes a minute to compile, so I'm using six above and `@batch`ing across the seventh. For a CPU-optimized NNlib, consistent threading (e.g., always across batches) would also have the advantage of improving locality; i.e., data will hopefully be able to stay in the same L2 cache from one layer to the next (assuming the data is small enough to fit, as may be the case when training on the CPU).
This is what I get on an 18 core system with AVX512:
julia> @time NNlib.conv!(out1, img, kern, dcd);
0.802161 seconds (2.06 M allocations: 790.165 MiB, 20.41% gc time)
julia> @time convlayer!(out2, img, kern, dcd);
9.553409 seconds (21.51 M allocations: 1.149 GiB, 1.06% gc time)
julia> out1 โ out2
true
julia> @benchmark NNlib.conv!($out1, $img, $kern, $dcd)
samples: 27; evals/sample: 1; memory estimate: 675.02 MiB; allocs estimate: 195
min: 175.652 ms (0.00% GC); mean: 191.870 ms (4.80% GC); median: 184.447 ms (0.44% GC); max: 288.455 ms (36.61% GC).
julia> @benchmark convlayer!($out1, $img, $kern, $dcd)
samples: 944; evals/sample: 1; memory estimate: 32 bytes; allocs estimate: 4
min: 5.233 ms (0.00% GC); mean: 5.292 ms (0.00% GC); median: 5.294 ms (0.00% GC); max: 5.484 ms (0.00% GC).
julia> versioninfo()
Julia Version 1.7.0-DEV.1208
Commit 0a133ff22b* (2021-05-30 11:28 UTC)
Platform Info:
OS: Linux (x86_64-generic-linux)
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.0 (ORCJIT, cascadelake)
Environment:
JULIA_NUM_THREADS = 36
LoopVectorization was over 30x faster.
This was also a generic convolution. The Laplacian has more structure you could exploit.
If it's only a 1-D Laplacian, it'd probably be harder to do well. The more compute, the less it'll be bottlenecked by memory.
EDIT:
Just realized `convlayer!` is allocating...
Apparently it isn't making much use of threading -- only about 4x faster, so I guess it's already pretty constrained by memory:
julia> function convlayer_st!(
           out::AbstractArray{<:Any,4}, img, kern,
           dcd::DenseConvDims{2, <:Any, <:Any, <:Any, (1, 1), (0, 0, 0, 0), (1, 1), true}
       )
           (K₁, K₂, Cᵢₙ, Cₒᵤₜ) = kernaxes(dcd)
           for d ∈ axes(out,4)
               @turbo for j₁ ∈ axes(out,1), j₂ ∈ axes(out,2), o ∈ Cₒᵤₜ
                   s = zero(eltype(out))
                   for k₁ ∈ K₁, k₂ ∈ K₂, i ∈ Cᵢₙ
                       s += img[j₁ + k₁ - 1, j₂ + k₂ - 1, i, d] * kern[k₁, k₂, i, o]
                   end
                   out[j₁, j₂, o, d] = s
               end
           end
           out
       end
convlayer_st! (generic function with 1 method)
julia> @benchmark convlayer_st!($out1, $img, $kern, $dcd)
samples: 180; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
min: 27.539 ms (0.00% GC); mean: 27.824 ms (0.00% GC); median: 27.865 ms (0.00% GC); max: 27.949 ms (0.00% GC).
Apparently Intel's next generation of server chips (Sapphire Rapids) will come with up to 64 GiB of HBM2e memory. Maybe it'll only be a select few extremely expensive models, but if there are affordable options, that'd be exciting.
AMD also announced 3d stacking for 192 MiB of L3 cache, which isn't quite enough for problems of this size:
julia> (sizeof(out1) + sizeof(img) + sizeof(kern)) / (1<<20) # bytes of data / (bytes/MiB)
227.36377716064453
The three arrays here are 227 MiB. Making the problem just a little smaller would then let them all fit.
For comparison, the 10980XE I ran these benchmarks on has only 24.75 MiB of L3.
These were 100 260x260 input images (and 100 256x256 outputs), so the real use cases here are hopefully smaller.
from methodoflines.jl.
Look at the PDE example in https://tutorials.sciml.ai/html/introduction/03-optimizing_diffeq_code.html for an example of it.
mul!(Ayu,Ay,u)
mul!(uAx,u,Ax)
mul!(Ayv,Ay,v)
mul!(vAx,v,Ax)
is just a cheap way to do 2D convolutions with dense matmuls, but it would be good to transcribe that into conv!
from methodoflines.jl.
I played with it a bit, and haven't been able to get any major improvements over the explicit loop version:
using LinearAlgebra, OrdinaryDiffEq
# Generate the constants
p = (1.0,1.0,1.0,10.0,0.001,100.0) # a,α,ubar,β,D1,D2
N = 100
const Ax = Array(Tridiagonal([1.0 for i in 1:N-1],[-2.0 for i in 1:N],[1.0 for i in 1:N-1]));
const Ay = copy(Ax);
Ax[2,1] = 2.0;
Ax[end-1,end] = 2.0;
Ay[1,2] = 2.0;
Ay[end,end-1] = 2.0;
function basic_version!(dr,r,p,t)
    a,α,ubar,β,D1,D2 = p
    u = r[:,:,1]
    v = r[:,:,2]
    Du = D1*(Ay*u + u*Ax)
    Dv = D2*(Ay*v + v*Ax)
    dr[:,:,1] = Du .+ a.*u.*u./v .+ ubar .- α*u
    dr[:,:,2] = Dv .+ a.*u.*u .- β*v
end
a,α,ubar,β,D1,D2 = p;
uss = (ubar+β)/α;
vss = (a/β)*uss^2;
r0 = zeros(100,100,2);
r0[:,:,1] .= uss.+0.1.*rand.();
r0[:,:,2] .= vss;
prob = ODEProblem(basic_version!,r0,(0.0,0.1),p);
@benchmark solve($prob,Tsit5())
# samples: 123; evals/sample: 1; memory estimate: 186.80 MiB; allocs estimate: 4555
# min: 34.770 ms (7.03% GC); mean: 40.722 ms (9.64% GC); median: 39.011 ms (5.79% GC); max: 164.571 ms (77.65% GC).
function gm2!(dr,r,p,t)
    a,α,ubar,β,D1,D2 = p
    u = @view r[:,:,1]
    v = @view r[:,:,2]
    du = @view dr[:,:,1]
    dv = @view dr[:,:,2]
    Du = D1*(Ay*u + u*Ax)
    Dv = D2*(Ay*v + v*Ax)
    @. du = Du + a*u*u/v + ubar - α*u
    @. dv = Dv + a*u*u - β*v
end
prob = ODEProblem(gm2!,r0,(0.0,0.1),p);
@benchmark solve($prob,Tsit5())
# samples: 171; evals/sample: 1; memory estimate: 119.44 MiB; allocs estimate: 2790
# min: 26.824 ms (7.47% GC); mean: 29.393 ms (6.39% GC); median: 29.594 ms (7.36% GC); max: 33.241 ms (27.63% GC).
const Ayu = zeros(N,N);
const uAx = zeros(N,N);
const Du = zeros(N,N);
const Ayv = zeros(N,N);
const vAx = zeros(N,N);
const Dv = zeros(N,N);
function gm3!(dr,r,p,t)
    a,α,ubar,β,D1,D2 = p
    u = @view r[:,:,1]
    v = @view r[:,:,2]
    du = @view dr[:,:,1]
    dv = @view dr[:,:,2]
    mul!(Ayu,Ay,u)
    mul!(uAx,u,Ax)
    mul!(Ayv,Ay,v)
    mul!(vAx,v,Ax)
    @. Du = D1*(Ayu + uAx)
    @. Dv = D2*(Ayv + vAx)
    @. du = Du + a*u*u/v + ubar - α*u
    @. dv = Dv + a*u*u - β*v
end
prob_gm3 = ODEProblem(gm3!,r0,(0.0,0.1),p);
@benchmark solve($prob_gm3,Tsit5())
# samples: 181; evals/sample: 1; memory estimate: 29.63 MiB; allocs estimate: 438
# min: 20.760 ms (0.00% GC); mean: 27.682 ms (2.07% GC); median: 27.566 ms (0.00% GC); max: 32.108 ms (0.00% GC).
using LoopVectorization
function matmul!(C::AbstractMatrix, A::AbstractMatrix, B::AbstractMatrix)
    @tturbo for n ∈ indices((C,B),2), m ∈ indices((C,A),1)
        Cₘₙ = zero(eltype(C))
        for k ∈ indices((A,B),(2,1))
            Cₘₙ += A[m,k]*B[k,n]
        end
        C[m,n] = Cₘₙ
    end
end
function matmuladd!(C::AbstractMatrix, A::AbstractMatrix, B::AbstractMatrix)
    @tturbo for n ∈ indices((C,B),2), m ∈ indices((C,A),1)
        Cₘₙ = zero(eltype(C))
        for k ∈ indices((A,B),(2,1))
            Cₘₙ += A[m,k]*B[k,n]
        end
        C[m,n] += Cₘₙ
    end
end
function gm5!(dr,r,p,t)
    a,α,ubar,β,D1,D2 = p
    u = @view r[:,:,1];
    v = @view r[:,:,2];
    du = @view dr[:,:,1];
    dv = @view dr[:,:,2];
    matmul!(Ayu,Ay,u)
    matmuladd!(Ayu,u,Ax)
    # @turbo @. du = D1*Ayu + a*u*u./v + ubar - α*u
    @turbo for i ∈ eachindex(u)
        du[i] = D1 * Ayu[i] + a * u[i]*u[i] / v[i] + ubar - α*u[i]
    end
    matmul!(Ayu,Ay,v)
    matmuladd!(Ayu,v,Ax)
    # @turbo @. dv = D2*Ayu + a*u*u - β*v
    @turbo for i ∈ eachindex(u)
        dv[i] = D2 * Ayu[i] + a * u[i]*u[i] - β*v[i]
    end
end
prob_gm5 = ODEProblem(gm5!,r0,(0.0,0.1),p);
@benchmark solve($prob_gm5,Tsit5())
# samples: 306; evals/sample: 1; memory estimate: 29.62 MiB; allocs estimate: 433
# min: 10.999 ms (0.00% GC); mean: 16.355 ms (3.53% GC); median: 12.869 ms (0.00% GC); max: 31.714 ms (6.63% GC).
function matmul!(C::AbstractArray{<:Any,3}, A::AbstractMatrix, B::AbstractArray{<:Any,3})
    @tturbo for n ∈ indices((C,B),2), m ∈ indices((C,A),1), d ∈ 1:2
        Cₘₙ = zero(eltype(C))
        for k ∈ indices((A,B),(2,1))
            Cₘₙ += A[m,k]*B[k,n,d]
        end
        C[m,n,d] = Cₘₙ
    end
end
function matmuladd!(C::AbstractArray{<:Any,3}, A::AbstractArray{<:Any,3}, B::AbstractMatrix)
    @tturbo for n ∈ indices((C,B),2), m ∈ indices((C,A),1), d ∈ 1:2
        Cₘₙ = zero(eltype(C))
        for k ∈ indices((A,B),(2,1))
            Cₘₙ += A[m,k,d]*B[k,n]
        end
        C[m,n,d] += Cₘₙ
    end
end
const Ayuv = similar(Ayu, (size(Ayu,1),size(Ayu,2),2));
function gm6!(dr,r,p,t)
    a,α,ubar,β,D1,D2 = p
    matmul!(Ayuv, Ay, r)
    matmuladd!(Ayuv, r, Ax)
    N² = size(r,1)*size(r,2)
    u = reshape(r, (N²,2))
    du = reshape(dr, (N²,2))
    Ayuvr = reshape(Ayuv, (N²,2))
    @turbo for i ∈ 1:N²
        du[i,1] = D1 * Ayuvr[i,1] + a * u[i,1]*u[i,1] / u[i,2] + ubar - α*u[i,1]
        du[i,2] = D2 * Ayuvr[i,2] + a * u[i,1]*u[i,1] - β*u[i,2]
    end
end
prob_gm6 = ODEProblem(gm6!,r0,(0.0,0.1),p);
@benchmark solve($prob_gm6, Tsit5())
# samples: 310; evals/sample: 1; memory estimate: 29.67 MiB; allocs estimate: 1315
# min: 11.091 ms (0.00% GC); mean: 16.126 ms (3.49% GC); median: 13.182 ms (0.00% GC); max: 27.934 ms (9.08% GC).
pdevec = (1.0,1.0,1.0,10.0,0.001,100.0,N);
function fast_gm!(du,u,p,t)
    a,α,ubar,β,D1,D2,N = p
    @inbounds for j in 2:N-1, i in 2:N-1
        du[i,j,1] = D1*(u[i-1,j,1] + u[i+1,j,1] + u[i,j+1,1] + u[i,j-1,1] - 4u[i,j,1]) +
                    a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    end
    @inbounds for j in 2:N-1, i in 2:N-1
        du[i,j,2] = D2*(u[i-1,j,2] + u[i+1,j,2] + u[i,j+1,2] + u[i,j-1,2] - 4u[i,j,2]) +
                    a*u[i,j,1]^2 - β*u[i,j,2]
    end
    @inbounds for j in 2:N-1
        i = 1
        du[1,j,1] = D1*(2u[i+1,j,1] + u[i,j+1,1] + u[i,j-1,1] - 4u[i,j,1]) +
                    a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    end
    @inbounds for j in 2:N-1
        i = 1
        du[1,j,2] = D2*(2u[i+1,j,2] + u[i,j+1,2] + u[i,j-1,2] - 4u[i,j,2]) +
                    a*u[i,j,1]^2 - β*u[i,j,2]
    end
    @inbounds for j in 2:N-1
        i = N
        du[end,j,1] = D1*(2u[i-1,j,1] + u[i,j+1,1] + u[i,j-1,1] - 4u[i,j,1]) +
                      a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    end
    @inbounds for j in 2:N-1
        i = N
        du[end,j,2] = D2*(2u[i-1,j,2] + u[i,j+1,2] + u[i,j-1,2] - 4u[i,j,2]) +
                      a*u[i,j,1]^2 - β*u[i,j,2]
    end
    @inbounds for i in 2:N-1
        j = 1
        du[i,1,1] = D1*(u[i-1,j,1] + u[i+1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
                    a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    end
    @inbounds for i in 2:N-1
        j = 1
        du[i,1,2] = D2*(u[i-1,j,2] + u[i+1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
                    a*u[i,j,1]^2 - β*u[i,j,2]
    end
    @inbounds for i in 2:N-1
        j = N
        du[i,end,1] = D1*(u[i-1,j,1] + u[i+1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
                      a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    end
    @inbounds for i in 2:N-1
        j = N
        du[i,end,2] = D2*(u[i-1,j,2] + u[i+1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
                      a*u[i,j,1]^2 - β*u[i,j,2]
    end
    @inbounds begin
        i = 1; j = 1
        du[1,1,1] = D1*(2u[i+1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
                    a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
        du[1,1,2] = D2*(2u[i+1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
                    a*u[i,j,1]^2 - β*u[i,j,2]
        i = 1; j = N
        du[1,N,1] = D1*(2u[i+1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
                    a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
        du[1,N,2] = D2*(2u[i+1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
                    a*u[i,j,1]^2 - β*u[i,j,2]
        i = N; j = 1
        du[N,1,1] = D1*(2u[i-1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
                    a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
        du[N,1,2] = D2*(2u[i-1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
                    a*u[i,j,1]^2 - β*u[i,j,2]
        i = N; j = N
        du[end,end,1] = D1*(2u[i-1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
                        a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
        du[end,end,2] = D2*(2u[i-1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
                        a*u[i,j,1]^2 - β*u[i,j,2]
    end
end
prob_devec = ODEProblem(fast_gm!,r0,(0.0,0.1),pdevec);
@benchmark solve($prob_devec,Tsit5())
# samples: 600; evals/sample: 1; memory estimate: 29.63 MiB; allocs estimate: 438
# min: 7.200 ms (0.00% GC); mean: 8.335 ms (7.27% GC); median: 7.482 ms (0.00% GC); max: 13.092 ms (43.21% GC).
function turbo_gm!(du,u,p,t)
    a,α,ubar,β,D1,D2,N = p
    @turbo for j in 2:N-1, i in 2:N-1
        du[i,j,1] = D1*(u[i-1,j,1] + u[i+1,j,1] + u[i,j+1,1] + u[i,j-1,1] - 4u[i,j,1]) +
                    a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    end
    @turbo for j in 2:N-1, i in 2:N-1
        du[i,j,2] = D2*(u[i-1,j,2] + u[i+1,j,2] + u[i,j+1,2] + u[i,j-1,2] - 4u[i,j,2]) +
                    a*u[i,j,1]^2 - β*u[i,j,2]
    end
    @inbounds for j in 2:N-1
        du[1,j,1] = D1*(2u[2,j,1] + u[1,j+1,1] + u[1,j-1,1] - 4u[1,j,1]) +
                    a*u[1,j,1]^2/u[1,j,2] + ubar - α*u[1,j,1]
    end
    @inbounds for j in 2:N-1
        i = 1
        du[1,j,2] = D2*(2u[i+1,j,2] + u[i,j+1,2] + u[i,j-1,2] - 4u[i,j,2]) +
                    a*u[i,j,1]^2 - β*u[i,j,2]
    end
    @inbounds for j in 2:N-1
        i = N
        du[end,j,1] = D1*(2u[i-1,j,1] + u[i,j+1,1] + u[i,j-1,1] - 4u[i,j,1]) +
                      a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    end
    @inbounds for j in 2:N-1
        i = N
        du[end,j,2] = D2*(2u[i-1,j,2] + u[i,j+1,2] + u[i,j-1,2] - 4u[i,j,2]) +
                      a*u[i,j,1]^2 - β*u[i,j,2]
    end
    @inbounds for i in 2:N-1
        j = 1
        du[i,1,1] = D1*(u[i-1,j,1] + u[i+1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
                    a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    end
    @inbounds for i in 2:N-1
        j = 1
        du[i,1,2] = D2*(u[i-1,j,2] + u[i+1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
                    a*u[i,j,1]^2 - β*u[i,j,2]
    end
    @inbounds for i in 2:N-1
        j = N
        du[i,end,1] = D1*(u[i-1,j,1] + u[i+1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
                      a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    end
    @inbounds for i in 2:N-1
        j = N
        du[i,end,2] = D2*(u[i-1,j,2] + u[i+1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
                      a*u[i,j,1]^2 - β*u[i,j,2]
    end
    @inbounds begin
        i = 1; j = 1
        du[1,1,1] = D1*(2u[i+1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
                    a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
        du[1,1,2] = D2*(2u[i+1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
                    a*u[i,j,1]^2 - β*u[i,j,2]
        i = 1; j = N
        du[1,N,1] = D1*(2u[i+1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
                    a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
        du[1,N,2] = D2*(2u[i+1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
                    a*u[i,j,1]^2 - β*u[i,j,2]
        i = N; j = 1
        du[N,1,1] = D1*(2u[i-1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
                    a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
        du[N,1,2] = D2*(2u[i-1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
                    a*u[i,j,1]^2 - β*u[i,j,2]
        i = N; j = N
        du[end,end,1] = D1*(2u[i-1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
                        a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
        du[end,end,2] = D2*(2u[i-1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
                        a*u[i,j,1]^2 - β*u[i,j,2]
    end
end
prob_turbo_devec = ODEProblem(turbo_gm!,r0,(0.0,0.1),pdevec);
@benchmark solve($prob_turbo_devec,Tsit5())
# samples: 600; evals/sample: 1; memory estimate: 29.63 MiB; allocs estimate: 438
# min: 7.107 ms (0.00% GC); mean: 8.331 ms (7.72% GC); median: 7.374 ms (0.00% GC); max: 14.477 ms (0.00% GC).
function turbo_gm_fuse!(du,u,p,t)
    a,α,ubar,β,D1,D2,N = p
    @turbo for j in 2:N-1, i in 2:N-1
        du[i,j,1] = D1*(u[i-1,j,1] + u[i+1,j,1] + u[i,j+1,1] + u[i,j-1,1] - 4u[i,j,1]) +
                    a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
        du[i,j,2] = D2*(u[i-1,j,2] + u[i+1,j,2] + u[i,j+1,2] + u[i,j-1,2] - 4u[i,j,2]) +
                    a*u[i,j,1]^2 - β*u[i,j,2]
    end
    @turbo for j in 2:N-1
        du[1,j,1] = D1*(2u[2,j,1] + u[1,j+1,1] + u[1,j-1,1] - 4u[1,j,1]) +
                    a*u[1,j,1]^2/u[1,j,2] + ubar - α*u[1,j,1]
        du[1,j,2] = D2*(2u[2,j,2] + u[1,j+1,2] + u[1,j-1,2] - 4u[1,j,2]) +
                    a*u[1,j,1]^2 - β*u[1,j,2]
        du[N,j,1] = D1*(2u[N-1,j,1] + u[N,j+1,1] + u[N,j-1,1] - 4u[N,j,1]) +
                    a*u[N,j,1]^2/u[N,j,2] + ubar - α*u[N,j,1]
        du[N,j,2] = D2*(2u[N-1,j,2] + u[N,j+1,2] + u[N,j-1,2] - 4u[N,j,2]) +
                    a*u[N,j,1]^2 - β*u[N,j,2]
    end
    @turbo for j in 2:N-1
        du[j,1,1] = D1*(u[j-1,1,1] + u[j+1,1,1] + 2u[j,2,1] - 4u[j,1,1]) +
                    a*u[j,1,1]^2/u[j,1,2] + ubar - α*u[j,1,1]
        du[j,1,2] = D2*(u[j-1,1,2] + u[j+1,1,2] + 2u[j,2,2] - 4u[j,1,2]) +
                    a*u[j,1,1]^2 - β*u[j,1,2]
        du[j,N,1] = D1*(u[j-1,N,1] + u[j+1,N,1] + 2u[j,N-1,1] - 4u[j,N,1]) +
                    a*u[j,N,1]^2/u[j,N,2] + ubar - α*u[j,N,1]
        du[j,N,2] = D2*(u[j-1,N,2] + u[j+1,N,2] + 2u[j,N-1,2] - 4u[j,N,2]) +
                    a*u[j,N,1]^2 - β*u[j,N,2]
    end
    @inbounds begin
        i = 1; j = 1
        du[1,1,1] = D1*(2u[i+1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
                    a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
        du[1,1,2] = D2*(2u[i+1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
                    a*u[i,j,1]^2 - β*u[i,j,2]
        i = 1; j = N
        du[1,N,1] = D1*(2u[i+1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
                    a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
        du[1,N,2] = D2*(2u[i+1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
                    a*u[i,j,1]^2 - β*u[i,j,2]
        i = N; j = 1
        du[N,1,1] = D1*(2u[i-1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
                    a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
        du[N,1,2] = D2*(2u[i-1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
                    a*u[i,j,1]^2 - β*u[i,j,2]
        i = N; j = N
        du[N,N,1] = D1*(2u[i-1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
                    a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
        du[N,N,2] = D2*(2u[i-1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
                    a*u[i,j,1]^2 - β*u[i,j,2]
    end
end
prob_turbo_fuse = ODEProblem(turbo_gm_fuse!,r0,(0.0,0.1),pdevec);
@benchmark solve($prob_turbo_fuse,Tsit5())
# samples: 629; evals/sample: 1; memory estimate: 29.62 MiB; allocs estimate: 433
# min: 6.786 ms (0.00% GC); mean: 7.938 ms (7.63% GC); median: 7.014 ms (0.00% GC); max: 21.180 ms (66.47% GC).
from methodoflines.jl.
Those loops are easy enough for LLVM to SIMD, and there's not too much that can be done to make them faster.
I.e., there isn't much reuse you can get through blocking. You're just streaming through memory.
It is unrolling the `j` loop to try and save on `j` loads a bit (i.e., `j+1` on one `j` iteration is `j` on the next, and `j-1` on the one after that), but that doesn't seem to help much.
from methodoflines.jl.
Yeah, I think you need to not have the nonlinear portion of the PDE (`+ a*u[i,j,1]^2 - β*u[i,j,2]`) for a stencil compiler to work well. This is what I was mentioning in https://discourse.julialang.org/t/how-to-tackle-2d-reaction-diffusion-equation/54110/15?u=chrisrackauckas . The authors of that seemed to get mad at me, but it seems to be the simple fact that once you have a term that is outside the original range, `u[i,j,2]`, or a term which is sufficiently difficult to compute so that it's not too memory bound, `u[i,j,1]^2`, then fusion is simply better than having a good stencil compiler. That LoopVectorization demo above is precisely what I needed to see to fully prove that the fused version indeed outperforms even a good stencil compiler in that case, and that `@turbo` has an effect, but it's not massive.
It would still be good to investigate multithreading between those when the space is really large, but that's a separate thing and only matters for the huge PDE case. Essentially, those benchmarks are my argument for why we shouldn't be generating `A*u` operators but instead should be generating (and fusing) map operations.
from methodoflines.jl.
Making them dense:
Ax = Array(Tridiagonal([1.0 for i in 1:N-1],[-2.0 for i in 1:N],[1.0 for i in 1:N-1]))
adds a huge amount of computation, taking it from O(N^2) to O(N^3). Seems like an obviously bad idea.
We could implement specialized `Tridiagonal` multiplication routines, e.g. like JuliaSIMD/LoopVectorization.jl#266 (comment), but at that point we may as well use the fusing map operations.
For the mapped version, 100x100 isn't large enough to benefit from multithreading yet, but 1000x1000 would be.
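For reference, a specialized tridiagonal multiply is short to sketch (the name `trimul!` and the three-vector representation are illustrative, not the routine from the linked LoopVectorization comment):

```julia
# Sketch: A is tridiagonal with subdiagonal `dl`, diagonal `d`, superdiagonal `du`.
# Computing C = A*B column by column costs O(N^2), versus O(N^3) for a dense matmul.
function trimul!(C::AbstractMatrix, dl, d, du, B::AbstractMatrix)
    N = length(d)
    @inbounds for n in axes(B, 2)
        C[1, n] = d[1]*B[1, n] + du[1]*B[2, n]          # first row: no subdiagonal
        for m in 2:N-1
            C[m, n] = dl[m-1]*B[m-1, n] + d[m]*B[m, n] + du[m]*B[m+1, n]
        end
        C[N, n] = dl[N-1]*B[N-1, n] + d[N]*B[N, n]      # last row: no superdiagonal
    end
    C
end
```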
from methodoflines.jl.
The first part is the same as an image filter.
from methodoflines.jl.
or a term which is sufficiently difficult to compute so that it's not too memory bound, `u[i,j,1]^2`

Not sure if I follow here, but `u[i,j,1]` is easy to compute.
Fusion is great for memory-bound applications though. Passing over the data once is better than passing over it twice.
We shouldn't be doing dense `Ax::Matrix` operations. I assume ParallelStencil wouldn't be?
If we let `Ax` be tridiagonal, it'd do much better. Still probably not as well as the loops, but better.
from methodoflines.jl.
Not sure if I follow here, but u[i,j,1] is easy to compute.
I mean, when you're looping over `u[i,j,2]` and need to grab `u[i,j,1]`, you very quickly expand the cache lines like that. And this is still a very small example with very few nonlinearities!
We shouldn't be doing dense Ax::Matrix operations. I assume ParallelStencil wouldn't be?
It wouldn't. I thought you were using the same setup as the image filter you had? A 2D Laplacian is just a 2D filter.
from methodoflines.jl.
Ha, let me do that. Above it was just `matmul!`.
Note that the image filtering example doesn't do boundaries...
I mean, when you're looping over u[i,j,2] and need to grab u[i,j,1], you very quickly expand the cache lines like that.
Reuse tends to be fairly immediate (e.g. `u[i,j,1]`, `u[i+1,j,1]`), and after that there aren't any more opportunities for reuse, so you don't mind those loads getting evicted. There is some concern over `u[i,j,1]`, `u[i,j+1,1]`, though. It'd probably take some experimenting to see whether some larger cache tiling helps more than the `j` unrolling does.
from methodoflines.jl.
@ChrisRackauckas here is the example I think you were looking for:
using StaticArrays, LoopVectorization, Static
const LAPLACE_KERNEL = @MMatrix([0.0 1.0 0.0; 1.0 -4.0 1.0; 0.0 1.0 0.0]);
function filter2davx!(out::AbstractMatrix, A::AbstractMatrix, kern::MMatrix{3,3})
    N = size(out,1)
    @inbounds @fastmath let j_2 = j_1 = 1
        tmp = A[1+j_1, 1+j_2] * kern[-1+2, -1+2]
        tmp += A[1+j_1, 0+j_2] * kern[-1+2, 0+2]
        tmp += A[1+j_1, 1+j_2] * kern[-1+2, 1+2]
        tmp += A[0+j_1, 1+j_2] * kern[0+2, -1+2]
        tmp += A[1+j_1, 1+j_2] * kern[1+2, -1+2]
        for i_1 ∈ 0:1, i_2 ∈ 0:1
            tmp += A[i_1+j_1, i_2+j_2] * kern[i_1+2, i_2+2]
        end
        out[j_1,j_2] = tmp
    end
    @inbounds @fastmath let j_1 = 1; j_2 = N
        tmp = A[1+j_1, -1+j_2] * kern[-1+2, -1+2]
        tmp += A[1+j_1, 0+j_2] * kern[-1+2, 0+2]
        tmp += A[1+j_1, -1+j_2] * kern[-1+2, 1+2]
        tmp += A[0+j_1, -1+j_2] * kern[0+2, 1+2]
        tmp += A[1+j_1, -1+j_2] * kern[1+2, 1+2]
        for i_1 ∈ 0:1, i_2 ∈ -1:0
            tmp += A[i_1+j_1, i_2+j_2] * kern[i_1+2, i_2+2]
        end
        out[j_1,j_2] = tmp
    end
    @inbounds @fastmath let j_2 = 1; j_1 = N
        tmp = A[-1+j_1, 1+j_2] * kern[1+2, -1+2]
        tmp += A[-1+j_1, 0+j_2] * kern[1+2, 0+2]
        tmp += A[-1+j_1, 1+j_2] * kern[-1+2, 1+2]
        tmp += A[0+j_1, 1+j_2] * kern[0+2, -1+2]
        tmp += A[-1+j_1, 1+j_2] * kern[1+2, -1+2]
        for i_1 ∈ -1:0, i_2 ∈ 0:1
            tmp += A[i_1+j_1, i_2+j_2] * kern[i_1+2, i_2+2]
        end
        out[j_1,j_2] = tmp
    end
    @turbo for j ∈ 2:N-1
        tmp_1 = zero(eltype(out))
        tmp_2 = zero(eltype(out))
        for i_1 ∈ -1:1
            tmp_1 += A[i_1+j, 2] * kern[i_1+2, 3]
            tmp_2 += A[2, i_1+j] * kern[3, i_1+2]
            for i_2 ∈ 0:1
                tmp_1 += A[i_1+j, 1+i_2] * kern[i_1+2, i_2+2]
                tmp_2 += A[1+i_2, i_1+j] * kern[i_2+2, i_1+2]
            end
        end
        out[j,1] = tmp_1
        out[1,j] = tmp_2
    end
    @turbo for j_2 ∈ 2:N-1, j_1 ∈ 2:N-1
        tmp = zero(eltype(out))
        for i_1 ∈ -1:1, i_2 ∈ -1:1
            tmp += A[i_1+j_1, i_2+j_2] * kern[i_1+2, i_2+2]
        end
        out[j_1,j_2] = tmp
    end
    @turbo for j ∈ 2:N-1
        tmp_1 = zero(eltype(out))
        tmp_2 = zero(eltype(out))
        for i_1 ∈ -1:1
            tmp_1 += A[i_1+j, N-1] * kern[i_1+2, 3]
            tmp_2 += A[N-1, i_1+j] * kern[3, i_1+2]
            for i_2 ∈ -1:0
                tmp_1 += A[i_1+j, N+i_2] * kern[i_1+2, i_2+2]
                tmp_2 += A[N+i_2, i_1+j] * kern[i_2+2, i_1+2]
            end
        end
        out[j,N] = tmp_1
        out[N,j] = tmp_2
    end
    @inbounds @fastmath let j_2 = j_1 = N
        tmp = A[-1+j_1, -1+j_2] * kern[1+2, -1+2]
        tmp += A[-1+j_1, 0+j_2] * kern[1+2, 0+2]
        tmp += A[-1+j_1, -1+j_2] * kern[1+2, 1+2]
        tmp += A[-1+j_1, -1+j_2] * kern[-1+2, 1+2]
        tmp += A[0+j_1, -1+j_2] * kern[0+2, 1+2]
        for i_1 ∈ -1:0, i_2 ∈ -1:0
            tmp += A[i_1+j_1, i_2+j_2] * kern[i_1+2, i_2+2]
        end
        out[j_1,j_2] = tmp
    end
    nothing
end
function gm7!(dr,r,p,t)
    a,α,ubar,β,D1,D2 = p
    u = @view r[:,:,1];
    v = @view r[:,:,2];
    du = @view dr[:,:,1];
    dv = @view dr[:,:,2];
    filter2davx!(Ayu, u, LAPLACE_KERNEL)
    # @turbo @. du = D1*Ayu + a*u*u./v + ubar - α*u
    @turbo for i ∈ eachindex(u)
        du[i] = D1 * Ayu[i] + a * u[i]*u[i] / v[i] + ubar - α*u[i]
    end
    filter2davx!(Ayu, v, LAPLACE_KERNEL)
    # @turbo @. dv = D2*Ayu + a*u*u - β*v
    @turbo for i ∈ eachindex(u)
        dv[i] = D2 * Ayu[i] + a * u[i]*u[i] - β*v[i]
    end
end
prob_gm7 = ODEProblem(gm7!,r0,(0.0,0.1),p);
@benchmark solve($prob_gm7,Tsit5())
samples: 546; evals/sample: 1; memory estimate: 29.62 MiB; allocs estimate: 433
min: 7.965 ms (0.00% GC); mean: 9.162 ms (6.71% GC); median: 8.165 ms (0.00% GC); max: 15.255 ms (10.53% GC).
If we wanted to optimize this a bit more, we could
- take advantage of the sparsity in the Laplacian kernel
- make the values from the kernel constant
- fuse operations to pass over the data less often

Oops, we now have `fast_gm!`/`turbo_gm_fuse!`.
from methodoflines.jl.
You can just hardcode that it's the stencil
0 1 0
1 -4 1
0 1 0
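Concretely, hardcoding that stencil collapses the 3x3 kernel loops into a single expression; a sketch for the interior only (the name `laplace_interior!` is illustrative, with boundaries left to separate loops as above):

```julia
# Sketch: interior-only pass with the 5-point Laplacian hardcoded, rather than
# looping over a 3x3 kernel that is mostly zeros. Boundary rows/columns untouched.
function laplace_interior!(out, A)
    N = size(A, 1)
    @inbounds for j in 2:N-1, i in 2:N-1
        out[i, j] = A[i-1, j] + A[i+1, j] + A[i, j-1] + A[i, j+1] - 4A[i, j]
    end
    out
end
```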
from methodoflines.jl.
Sure, but at that point why not go the rest of the way to `turbo_gm_fuse!`?
from methodoflines.jl.
Yeah, go all of the way.
from methodoflines.jl.
The one thing fusing inhibits is easily threading across the third axis. I tried that, and it didn't help: 8.2 ms solve time. I should add a version of `@spawn` to Polyester so we can parallelize different functions. Maybe if the third axis was huge (and you're still doing something different per slice) it would pay off, but that seems contrived.
Ideally, we'd have a loop compiler that does the right thing with whatever loops we give it. There isn't really anything that special about stencils.
EDIT: To give `ParallelStencil.jl`/stencil compilers credit (aside from the fact that I didn't benchmark it at all, so there is no "discredit" either), working on a smaller problem domain (just stencils) makes it easier to scale out, e.g. to all the GPUs.
from methodoflines.jl.
Yeah, they do have a lot of uses. http://supertech.csail.mit.edu/papers/pochoir_spaa11.pdf is a good read if you haven't seen it.
from methodoflines.jl.
Thanks, that's interesting.
So the key to a stencil compiler is how it optimizes cache locality across time.
That is, it cuts up the iteration space/dependencies in a way so that multiple time steps -- and hence lots of reuse -- can be computed on sub (trapezoidal) regions. These regions can also be computed in parallel.
If you want to compute only a single time step update, you need to iterate over the entire arrays. Then you may as well do what you can to minimize the amount of iteration, e.g. fuse it with other operations.
However, if you want to compute, say, 1000 time steps, you don't need to pass over the arrays 1000 times. It takes many time steps for changes at index u[i,j] to fan out and influence iterations at index u[i+100, j-100].
So starting at t=0, you can take a chunk of the array u that fits nicely in cache, and then perform a large number of time updates. This chunk shrinks over time, because you lose some along the boundaries, hence the trapezoidal shape.
Eventually, you'd stop and instead work on another region. For example, one now available because the region(s) it depends on have been finished.
Stencil compilers figure out clever ways to carve the space up in this way, running those separate chunks on different cores/GPUs/etc., and also scheduling the leftovers to fill in the missing pieces.
Polyhedral optimizers can also do this sort of analysis.
Hmm.
Anyway, regarding the computations here where we're doing a single time step at a time, we can't benefit from these optimizations.
If we were able to do many iterations, then ideally we'd be able to fuse operations like a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1] with it. That shouldn't cause any scheduling problems (e.g. for a polyhedral optimizer).
from methodoflines.jl.
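The trapezoidal time blocking described above can be sketched in plain Julia for a 1D heat stencil. heat_step! and trapezoid_region are made-up names, and this illustrates only the shrinking valid region of a single trapezoid, not a real stencil compiler's scheduling:

```julia
# Reference update: one explicit Euler step of u_t = u_xx with
# v[i] = u[i] + λ*(u[i-1] - 2u[i] + u[i+1]); endpoints held fixed.
function heat_step!(v, u, λ)
    @inbounds for i in 2:length(u)-1
        v[i] = u[i] + λ * (u[i-1] - 2u[i] + u[i+1])
    end
    v[1] = u[1]; v[end] = u[end]
    return v
end

# Trapezoid: to know u on [lo, hi] after T steps, load the widened
# chunk [lo-T, hi+T] once (small enough to stay in cache) and run T
# local steps; the valid region shrinks by one point per side per step.
function trapezoid_region(u, lo, hi, T, λ)
    buf = u[lo-T:hi+T]
    tmp = copy(buf)
    n = length(buf)
    for t in 1:T
        for i in (1+t):(n-t)
            tmp[i] = buf[i] + λ * (buf[i-1] - 2buf[i] + buf[i+1])
        end
        buf[1+t:n-t] .= @view tmp[1+t:n-t]
    end
    return buf[T+1:n-T]  # exactly the values on [lo, hi] after T steps
end
```

Running trapezoid_region(u, 20, 30, 4, 0.25) reproduces what four global heat_step! sweeps give on indices 20:30, without touching the rest of the array after the initial chunk copy; that reuse across time steps is what the blocking buys.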
The current MTK code example of scalarized code is
using ModelingToolkit
l = 1
n = 5
h = l / n
@variables t u[1:n](t)
D = Differential(t)
# periodic boundary condition version 1
eqs = [
h^2 * D(u[1]) ~ u[end] - 2u[1] + u[2]
map(2:n-1) do i
h^2 * D(u[i]) ~ u[i-1] - 2u[i] + u[i+1] # ifelse(u[i]>0, u[i+1] - u[i], u[i] - u[i-1])
end
h^2 * D(u[end]) ~ u[end-1] - 2u[end] + u[1]
]
My tentative syntax for the interior is
map(u[1:n-1], u[2:n], u[3:n+1]) do um1, ui, up1
um1 - 2ui + up1
end
I am not sure how to scale this to high dims though.
from methodoflines.jl.
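One way the interior map could scale to higher dimensions (a sketch of mine, not proposed MTK syntax) is to iterate CartesianIndices and sum over unit offsets, so the same code covers any dimension:

```julia
# Dimension-agnostic interior Laplacian: for each interior index I,
# sum u over the 2N axis neighbors and subtract 2N*u[I]. The name
# `laplacian` is illustrative; boundary entries are left at zero.
function laplacian(u::AbstractArray{T,N}) where {T,N}
    out = zeros(T, size(u))
    interior = CartesianIndices(ntuple(d -> 2:size(u, d)-1, N))
    units = ntuple(d -> CartesianIndex(ntuple(i -> i == d ? 1 : 0, N)), N)
    for I in interior
        acc = -2N * u[I]
        for e in units
            acc += u[I-e] + u[I+e]
        end
        out[I] = acc
    end
    return out
end
```

Symbolically, the analogue would be mapping over shifted CartesianIndices ranges instead of the 1D slices u[1:n-1], u[2:n], u[3:n+1]; the boundary closure would still have to be written separately.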
Let's talk about that high dim case today. That's going to be important to this. And get @chriselrod and @shashi on too. 1pm?
from methodoflines.jl.
Happy to help with this, but the discussions so far are above my pay grade.
from methodoflines.jl.
using OrdinaryDiffEq, LinearAlgebra, SparseArrays, BenchmarkTools
using LoopVectorization
const α₂ = 1.0
const α₃ = 1.0
const β₁ = 1.0
const β₂ = 1.0
const β₃ = 1.0
const r₁ = 1.0
const r₂ = 1.0
const D = 100.0
const γ₁ = 0.1
const γ₂ = 0.1
const γ₃ = 0.1
const N = 512
const X = reshape([i for i in 1:N for j in 1:N],N,N)
const Y = reshape([j for i in 1:N for j in 1:N],N,N)
const α₁ = 1.0.*(X.>=4*N/5)
const Mx = Tridiagonal([1.0 for i in 1:N-1],[-2.0 for i in 1:N],[1.0 for i in 1:N-1])
const My = copy(Mx)
Mx[2,1] = 2.0
Mx[end-1,end] = 2.0
My[1,2] = 2.0
My[end,end-1] = 2.0
# Define the discretized PDE as an ODE function
const MyA = zeros(N,N)
const AMx = zeros(N,N)
const DA = zeros(N,N)
function f(du,u,p,t)
A = @view u[:,:,1]
B = @view u[:,:,2]
C = @view u[:,:,3]
dA = @view du[:,:,1]
dB = @view du[:,:,2]
dC = @view du[:,:,3]
mul!(MyA,My,A)
mul!(AMx,A,Mx)
@. DA = D*(MyA + AMx)
@. dA = DA + α₁ - β₁*A - r₁*A*B + r₂*C
@. dB = α₂ - β₂*B - r₁*A*B + r₂*C
@. dC = α₃ - β₃*C + r₁*A*B - r₂*C
end
function f!(du,u,p,t)
A = @view u[:,:,1]
B = @view u[:,:,2]
C = @view u[:,:,3]
dA = @view du[:,:,1]
dB = @view du[:,:,2]
dC = @view du[:,:,3]
@turbo for j in 2:N-1, i in 2:N-1
dA[i,j] = D*(A[i-1,j] + A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
@inbounds for j in 2:N-1
i = 1
dA[1,j] = D*(2A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
@inbounds for j in 2:N-1
i = N
dA[end,j] = D*(2A[i-1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
@inbounds for i in 2:N-1
j = 1
dA[i,j] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
@inbounds for i in 2:N-1
j = N
dA[i,end] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
@inbounds begin
i = 1; j = 1
dA[i,j] = D*(2A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
i = 1; j = N
dA[i,j] = D*(2A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
i = N; j = 1
dA[i,j] = D*(2A[i-1,j] + 2A[i,j+1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
i = N; j = N
dA[i,j] = D*(2A[i-1,j] + 2A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
end
#=
# Double check:
u = rand(N,N,3)
du = similar(u)
du2 = similar(u)
f( du ,u,nothing,0.0)
f!(du2,u,nothing,0.0)
=#
u0 = zeros(N,N,3)
prob = ODEProblem(f!,u0,(0.0,10.0))
@btime solve(prob, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1); # 13.300 s (3406 allocations: 4.34 GiB)
@btime solve(prob, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1); # 38.481 s (270 allocations: 708.02 MiB)
@btime solve(prob, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1); # 40.445 s (267 allocations: 690.02 MiB)
There's an N to tone down to test with. I'm pretty confused by this. Is it just that the cache lines would have to be too long?
@chriselrod found:
Locally:
julia> u0 = randn(N,N,3);
julia> du0 = similar(u0);
julia> @benchmark f!($du0,$u0,nothing,0.0) # @turbo
samples: 7479; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(641000.0 - 644000.0] โโโ57
(644000.0 - 647000.0] โโโโโโโโโโโ291
(647000.0 - 650000.0] โโโโโโโโโโโโโโโโโโโโ552
(650000.0 - 653100.0] โโโโโโโโโโโโโโโโโโโโโโโโโ683
(653100.0 - 656100.0] โโโโโโโโโโโโโโโโโโโโโโโโโโโโ778
(656100.0 - 659100.0] โโโโโโโโโโโโโโโโโโโโโโโโโโโ749
(659100.0 - 662100.0] โโโโโโโโโโโโโโโโโโโโโโโโโโโ750
(662100.0 - 665100.0] โโโโโโโโโโโโโโโโโโโโโโโโ653
(665100.0 - 668200.0] โโโโโโโโโโโโโโโโโโโโโโโโโ675
(668200.0 - 671200.0] โโโโโโโโโโโโโโโโโโโโโโโโโ693
(671200.0 - 674200.0] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 836
(674200.0 - 677200.0] โโโโโโโโโโโโโโโโโโโโโโโโ663
(677200.0 - 680200.0] โโโโ91
(680200.0 - 1.0048e6] โ8
Counts
min: 640.984 μs (0.00% GC); mean: 661.914 μs (0.00% GC); median: 661.578 μs (0.00% GC); max: 1.005 ms (0.00% GC).
julia> @benchmark g!($du0,$u0,nothing,0.0) # @inbounds @fastmath
samples: 4784; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(1.0269e6 - 1.0287e6] โ15
(1.0287e6 - 1.0306e6] โ10
(1.0306e6 - 1.0324e6] โ7
(1.0324e6 - 1.0342e6] โ20
(1.0342e6 - 1.036e6 ] โ7
(1.036e6 - 1.0379e6] โโโโโโโโโ953
(1.0379e6 - 1.0397e6] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ3337
(1.0397e6 - 1.0415e6] โโโโ377
(1.0415e6 - 1.0433e6] โ30
(1.0433e6 - 1.0452e6] โ13
(1.0452e6 - 1.047e6 ] โ6
(1.047e6 - 1.0488e6] โ3
(1.0488e6 - 1.0506e6] โ1
(1.0506e6 - 1.2524e6] โ5
Counts
min: 1.027 ms (0.00% GC); mean: 1.039 ms (0.00% GC); median: 1.038 ms (0.00% GC); max: 1.252 ms (0.00% GC).
julia> @show N
N = 512
LV doesn't do cache tiling currently, so performance suffers at larger sizes.
I have a good idea of how I want it to do that now, but want to wait for the rewrite as otherwise the rewrite will never happen (there's no end to the stuff I could still do). (edited)
g! was using @inbounds @fastmath and took 1ms, while f! was using @turbo and took 0.6ms.
I could try more or on a different computer. (edited)
With @tturbo:
julia> @benchmark fthreads!($du0,$u0,nothing,0.0) # @tturbo
samples: 10000; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(89800.0 - 91100.0 ] โ7
(91100.0 - 92400.0 ] โ80
(92400.0 - 93700.0 ] โโโโโโ453
(93700.0 - 95000.0 ] โโโโโโโโโโโโโโโโ1405
(95000.0 - 96300.0 ] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ2681
(96300.0 - 97600.0 ] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 2692
(97600.0 - 98900.0 ] โโโโโโโโโโโโโโโโโโโ1658
(98900.0 - 100200.0] โโโโโโโโ629
(100200.0 - 101500.0] โโโ234
(101500.0 - 102800.0] โ78
(102800.0 - 104000.0] โ37
(104000.0 - 105300.0] โ15
(105300.0 - 106600.0] โ11
(106600.0 - 107900.0] โ10
(107900.0 - 349100.0] โ10
Counts
min: 89.756 μs (0.00% GC); mean: 96.586 μs (0.00% GC); median: 96.428 μs (0.00% GC); max: 349.074 μs (0.00% GC).
And in the thread:
Chris Elrod 1 day ago
Seems to be a problem in LoopVectorization's cost modeling.
It is unrolling the j loop to save on reloads of A[i,j-1], A[i,j], A[i,j+1].
But it seems faster not to unroll at all.
Chris Elrod 1 day ago
You can try to confirm on your machine with @turbo unroll=1 for ...
Chris Elrod 1 day ago
Unrolling does seem faster for very small N, e.g. I tried 30.
Chris Elrod 1 day ago
This was also true on my old (AVX2, Haswell) laptop.
Chris Elrod 18 hours ago
I tweaked some heuristics in LoopVectorization 0.12.38.
It should do better now, but I still saw some anomalously worse performance on my Cascadelake-X system from objectively better assembly. I.e., reusing a register instead of reloading the same data multiple times -> worse performance. So it's generating the objectively better but slower (on that machine) code.
I could perhaps add heuristics to identify that case and add prefetch instructions, which should have the same (but larger) benefit as whatever benefit the extra loads are providing.
On my Haswell (AVX2) laptop, the code without extra loads seemed slightly faster.
Code with extra loads == @inbounds @fastmath for ...; @simd ivdep for ...
I can reproduce the problem with LLVM by using @simd ivdep like above, but moving all the stores after all the loads in the loop. That eliminates the redundant loads, and worsens performance. (edited)
Chris Elrod 18 hours ago
Anyway, the loop is memory bound and the key seems to be making it as easy on the hardware prefetchers as possible. Any sort of fanciness -- including using TiledIteration.jl to block -- just makes performance worse.
So it may be worth getting LoopVectorization to emit software prefetches there.
EDIT: Played around with software prefetching there. Made it slower. (edited)
Chris Elrod 16 hours ago
Also note that 512 is a magic bad number. Benchmarks with 512x512 matrices:
(attached: stencilbench.jl)
Chris Elrod 16 hours ago
julia> @benchmark fbase!($du,$u,$α₁)
samples: 4373; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(1.06e6 - 1.102e6] โโ122
(1.102e6 - 1.144e6] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 3397
(1.144e6 - 1.186e6] โโโโโโ591
(1.186e6 - 1.229e6] โโ195
(1.229e6 - 1.271e6] โ32
(1.271e6 - 1.313e6] โ10
(1.313e6 - 1.355e6] โ5
(1.355e6 - 1.397e6] โ9
(1.397e6 - 1.439e6] โ1
(1.439e6 - 1.481e6] 0
(1.481e6 - 1.523e6] โ3
(1.523e6 - 1.565e6] โ1
(1.565e6 - 1.608e6] โ2
(1.608e6 - 1.807e6] โ5
Counts
min: 1.060 ms (0.00% GC); mean: 1.134 ms (0.00% GC); median: 1.126 ms (0.00% GC); max: 1.807 ms (0.00% GC).
julia> @benchmark fturbo!($du,$u,$α₁)
samples: 6310; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(740000.0 - 787000.0] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 4631
(787000.0 - 834000.0] โโโโโโโโ1139
(834000.0 - 881000.0] โโโ321
(881000.0 - 928000.0] โ139
(928000.0 - 975000.0] โ40
(975000.0 - 1.022e6 ] โ19
(1.022e6 - 1.069e6 ] โ6
(1.069e6 - 1.116e6 ] โ3
(1.116e6 - 1.163e6 ] โ4
(1.163e6 - 1.211e6 ] 0
(1.211e6 - 1.258e6 ] 0
(1.258e6 - 1.305e6 ] โ1
(1.305e6 - 1.352e6 ] 0
(1.352e6 - 1.577e6 ] โ7
Counts
min: 739.585 μs (0.00% GC); mean: 781.544 μs (0.00% GC); median: 767.157 μs (0.00% GC); max: 1.577 ms (0.00% GC).
julia> @benchmark funroll1!($du,$u,$α₁)
samples: 6365; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(739000.0 - 756000.0] โโโโโโโโโโโโโโโโ1365
(756000.0 - 772000.0] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 2623
(772000.0 - 788000.0] โโโโโโโโโโโ921
(788000.0 - 804000.0] โโโโโโโ580
(804000.0 - 821000.0] โโโโโ407
(821000.0 - 837000.0] โโโ201
(837000.0 - 853000.0] โโ109
(853000.0 - 870000.0] โ73
(870000.0 - 886000.0] โ34
(886000.0 - 902000.0] โ25
(902000.0 - 918000.0] โ9
(918000.0 - 935000.0] โ6
(935000.0 - 951000.0] โ5
(951000.0 - 1.137e6 ] โ7
Counts
min: 739.437 μs (0.00% GC); mean: 775.172 μs (0.00% GC); median: 764.959 μs (0.00% GC); max: 1.137 ms (0.00% GC).
julia> @benchmark funroll1_v2!($du,$u,$α₁)
samples: 6431; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(730000.0 - 746000.0] โโโโโโโโโ806
(746000.0 - 761000.0] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ2902
(761000.0 - 777000.0] โโโโโโโโโโโโโ1229
(777000.0 - 792000.0] โโโโโโโ605
(792000.0 - 808000.0] โโโโ369
(808000.0 - 823000.0] โโโ194
(823000.0 - 839000.0] โโ125
(839000.0 - 855000.0] โ75
(855000.0 - 870000.0] โ43
(870000.0 - 886000.0] โ33
(886000.0 - 901000.0] โ18
(901000.0 - 917000.0] โ9
(917000.0 - 932000.0] โ16
(932000.0 - 1.133e6 ] โ7
Counts
min: 730.097 μs (0.00% GC); mean: 767.196 μs (0.00% GC); median: 757.611 μs (0.00% GC); max: 1.133 ms (0.00% GC).
julia> @benchmark fbase_ivdep!($du,$u,$α₁)
samples: 6693; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(688000.0 - 709000.0] โโโโโโโโโโโโโโโโโโโโโโโโโโโโ1779
(709000.0 - 730000.0] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 1959
(730000.0 - 751000.0] โโโโโโโโโโโโโโโโโโ1163
(751000.0 - 773000.0] โโโโโโโโโโโโ731
(773000.0 - 794000.0] โโโโโโโ441
(794000.0 - 815000.0] โโโโโ291
(815000.0 - 836000.0] โโโ154
(836000.0 - 857000.0] โโ81
(857000.0 - 878000.0] โ30
(878000.0 - 900000.0] โ27
(900000.0 - 921000.0] โ19
(921000.0 - 942000.0] โ7
(942000.0 - 963000.0] โ4
(963000.0 - 1.17e6 ] โ7
Counts
min: 687.778 μs (0.00% GC); mean: 736.942 μs (0.00% GC); median: 724.632 μs (0.00% GC); max: 1.170 ms (0.00% GC).
julia> @benchmark fbase_ivdep_v2!($du,$u,$α₁)
samples: 6249; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(749000.0 - 766000.0] โโโโโโโโโโโโโโโ1134
(766000.0 - 783000.0] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 2424
(783000.0 - 801000.0] โโโโโโโโโโโโโโ1089
(801000.0 - 818000.0] โโโโโโโโโ661
(818000.0 - 835000.0] โโโโโ392
(835000.0 - 852000.0] โโโ241
(852000.0 - 869000.0] โโ109
(869000.0 - 886000.0] โ61
(886000.0 - 903000.0] โ51
(903000.0 - 920000.0] โ40
(920000.0 - 937000.0] โ24
(937000.0 - 954000.0] โ8
(954000.0 - 971000.0] โ8
(971000.0 - 1.177e6 ] โ7
Counts
min: 749.351 μs (0.00% GC); mean: 789.853 μs (0.00% GC); median: 779.491 μs (0.00% GC); max: 1.177 ms (0.00% GC).
(attached: stencilbench.jl)
Chris Elrod 16 hours ago
Increasing N from 512 to 528:
(attached: stencilbench.jl)
Chris Elrod 16 hours ago
julia> @benchmark fbase!($du,$u,$α₁)
samples: 4993; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(911000.0 - 924000.0] โโโโโโโ477
(924000.0 - 936000.0] โโโโโโโโโโโโโโ904
(936000.0 - 949000.0] โโโ183
(949000.0 - 961000.0] โโ112
(961000.0 - 974000.0] โ37
(974000.0 - 987000.0] โ9
(987000.0 - 999000.0] โ2
(999000.0 - 1.012e6 ] โ1
(1.012e6 - 1.025e6 ] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 2086
(1.025e6 - 1.037e6 ] โโโโโโโโโโโ718
(1.037e6 - 1.05e6 ] โโโโโ319
(1.05e6 - 1.063e6 ] โโ128
(1.063e6 - 1.075e6 ] โ12
(1.075e6 - 1.192e6 ] โ5
Counts
min: 910.853 μs (0.00% GC); mean: 993.422 μs (0.00% GC); median: 1.020 ms (0.00% GC); max: 1.192 ms (0.00% GC).
julia> @benchmark fturbo!($du,$u,$α₁)
samples: 6683; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(709000.0 - 719000.0] โโโโโโโโ552
(719000.0 - 730000.0] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 2323
(730000.0 - 740000.0] โโโโโโโโโโโโโโโโโโโโโโโ1717
(740000.0 - 751000.0] โโโโโโโโโ637
(751000.0 - 762000.0] โโโโโ353
(762000.0 - 772000.0] โโโโโ356
(772000.0 - 783000.0] โโโโ305
(783000.0 - 793000.0] โโโ232
(793000.0 - 804000.0] โโ118
(804000.0 - 814000.0] โ35
(814000.0 - 825000.0] โ27
(825000.0 - 836000.0] โ16
(836000.0 - 846000.0] โ5
(846000.0 - 1.136e6 ] โ7
Counts
min: 708.509 μs (0.00% GC); mean: 739.460 μs (0.00% GC); median: 731.846 μs (0.00% GC); max: 1.136 ms (0.00% GC).
julia> @benchmark funroll1!($du,$u,$α₁)
samples: 6669; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(707000.0 - 719000.0] โโโโโโ429
(719000.0 - 730000.0] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 2241
(730000.0 - 742000.0] โโโโโโโโโโโโโโโโโโโโโโโโโโ1905
(742000.0 - 754000.0] โโโโโโโโโโ680
(754000.0 - 765000.0] โโโโโโ409
(765000.0 - 777000.0] โโโโโ359
(777000.0 - 788000.0] โโโโโ325
(788000.0 - 800000.0] โโโ163
(800000.0 - 811000.0] โโ79
(811000.0 - 823000.0] โ28
(823000.0 - 834000.0] โ18
(834000.0 - 846000.0] โ19
(846000.0 - 857000.0] โ7
(857000.0 - 1.143e6 ] โ7
Counts
min: 707.473 μs (0.00% GC); mean: 740.984 μs (0.00% GC); median: 733.445 μs (0.00% GC); max: 1.143 ms (0.00% GC).
julia> @benchmark funroll1_v2!($du,$u,$α₁)
samples: 6594; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(717000.0 - 727000.0] โโโโโ285
(727000.0 - 736000.0] โโโโโโโโโโโโโโโโโโโโโโโโโโ1707
(736000.0 - 746000.0] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 1998
(746000.0 - 756000.0] โโโโโโโโโโโโโโโ950
(756000.0 - 766000.0] โโโโโโ377
(766000.0 - 775000.0] โโโโโ305
(775000.0 - 785000.0] โโโโโ293
(785000.0 - 795000.0] โโโโโ307
(795000.0 - 805000.0] โโโ196
(805000.0 - 814000.0] โโ101
(814000.0 - 824000.0] โ45
(824000.0 - 834000.0] โ17
(834000.0 - 844000.0] โ6
(844000.0 - 1.147e6 ] โ7
Counts
min: 716.736 μs (0.00% GC); mean: 749.543 μs (0.00% GC); median: 742.499 μs (0.00% GC); max: 1.147 ms (0.00% GC).
julia> @benchmark fbase_ivdep!($du,$u,$α₁)
samples: 7403; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(635000.0 - 649000.0] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ2645
(649000.0 - 664000.0] โโโโโโโโโโโโโโโโโโโโโโโโโโ2292
(664000.0 - 678000.0] โโโโโโโ591
(678000.0 - 693000.0] โโโโโ374
(693000.0 - 707000.0] โโโโโโ490
(707000.0 - 722000.0] โโโโโโ508
(722000.0 - 736000.0] โโโโ276
(736000.0 - 751000.0] โโ119
(751000.0 - 765000.0] โ49
(765000.0 - 780000.0] โ25
(780000.0 - 794000.0] โ18
(794000.0 - 809000.0] โ3
(809000.0 - 823000.0] โ5
(823000.0 - 1.106e6 ] โ8
Counts
min: 634.654 μs (0.00% GC); mean: 666.727 μs (0.00% GC); median: 652.383 μs (0.00% GC); max: 1.106 ms (0.00% GC).
julia> @benchmark fbase_ivdep_v2!($du,$u,$α₁)
samples: 6610; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(717000.0 - 730000.0] โโโโโโโโโ893
(730000.0 - 742000.0] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ3102
(742000.0 - 754000.0] โโโโโโโโโโ1031
(754000.0 - 766000.0] โโโโ363
(766000.0 - 779000.0] โโโโ358
(779000.0 - 791000.0] โโโโ331
(791000.0 - 803000.0] โโโ278
(803000.0 - 815000.0] โโ125
(815000.0 - 827000.0] โ63
(827000.0 - 840000.0] โ27
(840000.0 - 852000.0] โ12
(852000.0 - 864000.0] โ11
(864000.0 - 876000.0] โ9
(876000.0 - 1.126e6 ] โ7
Counts
min: 717.473 μs (0.00% GC); mean: 747.755 μs (0.00% GC); median: 738.879 μs (0.00% GC); max: 1.126 ms (0.00% GC).
(attached: stencilbench.jl)
Chris Elrod 16 hours ago
I attached the script.
512 is bad because many CPUs (including the 7900X Skylake-X CPU I benchmarked on) have 32 KiB L1 caches with 8-way associativity.
That means data
julia> 32*(2^10) ÷ 8
4096
bytes apart would contend for the same slot of the L1 cache. I.e., data
julia> (32*(2^10) ÷ 8) ÷ sizeof(Float64)
512
Float64s apart contend for the same slot.
With 512x512 matrices, this means data from A[i,j-1] , A[i,j], and A[i,j+1] all want to occupy the same place in the L1 cache.
This also makes prefetching the next cacheline (i.e., A[i+8,j-1], A[i+8,j], and A[i+8,j+1]) at the same time harder.
Hence why increasing the sizes of these arrays can actually make operating on them take overall less time.
(attached: stencilbench.jl)
from methodoflines.jl.
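The 4096-byte aliasing described above can be checked with a few lines of Julia. The cache parameters (32 KiB, 8-way, 64-byte lines) are the assumed Skylake-X values from the comment, and set_of/colstride/conflict are made-up helper names:

```julia
# A 32 KiB, 8-way L1 with 64-byte lines has 32768 ÷ (8*64) = 64 sets;
# addresses map to the same set whenever their byte offsets differ by
# a multiple of 64 sets * 64 bytes = 4096 bytes.
l1_bytes, ways, linesize = 32 * 1024, 8, 64   # assumed Skylake-X-like L1
nsets = l1_bytes ÷ (ways * linesize)
set_of(offset) = (offset ÷ linesize) % nsets
colstride(n) = 8n                              # bytes between u[i,j] and u[i,j+1] for n×n Float64
conflict(n) = set_of(0) == set_of(colstride(n)) == set_of(2colstride(n))
```

conflict(512) is true (all three columns of the stencil land in the same L1 set) while conflict(528) is false, matching the observation that padding N from 512 to 528 makes the sweep faster.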
Solve times: prob is @turbo, prob_threads is @tturbo, and prob_simdivdep is @inbounds @fastmath @simd ivdep:
julia> @time solve(prob, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
46.533658 seconds (14.03 k allocations: 35.467 GiB, 0.49% gc time)
julia> @time solve(prob_threads, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
42.405112 seconds (14.86 k allocations: 35.467 GiB, 0.93% gc time)
julia> @time solve(prob_simdivdep, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
46.475041 seconds (14.12 k allocations: 35.725 GiB, 0.14% gc time)
julia> @time solve(prob, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
43.199515 seconds (266 allocations: 708.018 MiB)
julia> @time solve(prob_threads, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
38.143093 seconds (283 allocations: 708.018 MiB, 0.07% gc time)
julia> @time solve(prob_simdivdep, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
43.644268 seconds (266 allocations: 708.018 MiB, 0.24% gc time)
julia> @time solve(prob, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
43.931490 seconds (265 allocations: 690.017 MiB)
julia> @time solve(prob_threads, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
37.560453 seconds (282 allocations: 690.018 MiB, 0.01% gc time)
julia> @time solve(prob_simdivdep, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
43.625376 seconds (265 allocations: 690.017 MiB, 0.01% gc time)
Testing just the functions:
julia> @benchmark f!($du, $u0, nothing, 0.0) # @turbo
samples: 9231; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(530100.0 - 531000.0] โ31
(531000.0 - 532000.0] โโโโโ207
(532000.0 - 532900.0] โโโโโโโโโโโโโ575
(532900.0 - 533800.0] โโโโโโโโโโโโโโโโโโโโโโ991
(533800.0 - 534700.0] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ1335
(534700.0 - 535700.0] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ1408
(535700.0 - 536600.0] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ1404
(536600.0 - 537500.0] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ1333
(537500.0 - 538400.0] โโโโโโโโโโโโโโโโโโโโโโโโ1125
(538400.0 - 539400.0] โโโโโโโโโโโโโ589
(539400.0 - 540300.0] โโโโ184
(540300.0 - 541200.0] โ31
(541200.0 - 542100.0] โ6
(542100.0 - 543100.0] โ2
(543100.0 - 895600.0] โ10
Counts
min: 530.115 μs (0.00% GC); mean: 535.779 μs (0.00% GC); median: 535.711 μs (0.00% GC); max: 895.599 μs (0.00% GC).
julia> @benchmark fthreads!($du, $u0, nothing, 0.0) # @tturbo
samples: 10000; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(81000.0 - 81900.0 ] โ1
(81900.0 - 82800.0 ] โ5
(82800.0 - 83700.0 ] โ50
(83700.0 - 84500.0 ] โโโโ260
(84500.0 - 85400.0 ] โโโโโโโโโโโโ948
(85400.0 - 86300.0 ] โโโโโโโโโโโโโโโโโโโโโโโโโโ2006
(86300.0 - 87200.0 ] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ2384
(87200.0 - 88100.0 ] โโโโโโโโโโโโโโโโโโโโโโโโโ1964
(88100.0 - 89000.0 ] โโโโโโโโโโโโโโโโโ1322
(89000.0 - 89800.0 ] โโโโโโโโโ650
(89800.0 - 90700.0 ] โโโ238
(90700.0 - 91600.0 ] โโ107
(91600.0 - 92500.0 ] โ41
(92500.0 - 93400.0 ] โ14
(93400.0 - 327700.0] โ10
Counts
min: 81.006 μs (0.00% GC); mean: 87.097 μs (0.00% GC); median: 86.934 μs (0.00% GC); max: 327.730 μs (0.00% GC).
julia> @benchmark fsimdivdep!($du, $u0, nothing, 0.0) # @simd ivdep
samples: 9821; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(498800.0 - 499600.0] โ10
(499600.0 - 500400.0] โโ109
(500400.0 - 501200.0] โโโโโโโโโโโโ688
(501200.0 - 502000.0] โโโโโโโโโโโโโโโโโโโโโโโโโโ1512
(502000.0 - 502800.0] โโโโโโโโโโโโโโโโโโโโโโโโโโโ1571
(502800.0 - 503500.0] โโโโโโโโโโโโโโโโโโโโโโโโโโ1508
(503500.0 - 504300.0] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ1772
(504300.0 - 505100.0] โโโโโโโโโโโโโโโโโโโโโโโโโโ1517
(505100.0 - 505900.0] โโโโโโโโโโโโโโ781
(505900.0 - 506700.0] โโโโโ266
(506700.0 - 507500.0] โ54
(507500.0 - 508200.0] โ13
(508200.0 - 509000.0] โ5
(509000.0 - 509800.0] โ5
(509800.0 - 897000.0] โ10
Counts
min: 498.833 μs (0.00% GC); mean: 503.346 μs (0.00% GC); median: 503.297 μs (0.00% GC); max: 897.010 μs (0.00% GC).
So most of the performance gain we see from threading doesn't translate to the actual solve.
For reference, here is the assembly from the main loop using @inbounds @fastmath @simd ivdep, along with llvm-mca's report:
Iterations: 100
Instructions: 2400
Total Cycles: 878
Total uOps: 4100
Dispatch Width: 6
uOps Per Cycle: 4.67
IPC: 2.73
Block RThroughput: 7.0
Cycles with backend pressure increase [ 45.22% ]
Throughput Bottlenecks:
Resource Pressure [ 29.16% ]
- SKXPort0 [ 21.30% ]
- SKXPort1 [ 0.80% ]
- SKXPort2 [ 14.58% ]
- SKXPort3 [ 14.58% ]
- SKXPort4 [ 4.78% ]
- SKXPort5 [ 21.30% ]
Data Dependencies: [ 39.41% ]
- Register Dependencies [ 39.41% ]
- Memory Dependencies [ 0.00% ]
Critical sequence based on the simulation:
Instruction Dependency Information
+----< 22. add r8, 64
|
| < loop carried >
|
+----> 0. vmovupd zmm6, zmmword ptr [r14 + r8 - 8] ## REGISTER dependency: r8
| 1. vmovupd zmm7, zmmword ptr [r14 + r8]
| 2. vaddpd zmm7, zmm7, zmmword ptr [r14 + r8 - 16]
| 3. vaddpd zmm7, zmm7, zmmword ptr [rbx + r8]
| 4. vaddpd zmm7, zmm7, zmmword ptr [r15 + r8 + 8]
| 5. vfmadd231pd zmm7, zmm6, zmm3
| 6. vfmadd213pd zmm7, zmm4, zmmword ptr [rdx + r8]
+----> 7. vfnmsub132pd zmm6, zmm6, zmmword ptr [r10 + r8] ## REGISTER dependency: zmm6
| 8. vaddpd zmm6, zmm7, zmm6
| 9. vaddpd zmm6, zmm6, zmmword ptr [r9 + r8]
| 10. vmovupd zmmword ptr [r13 + r8], zmm6
+----> 11. vmovupd zmm6, zmmword ptr [r10 + r8] ## RESOURCE interference: SKXPort3 [ probability: 2% ]
+----> 12. vfnmsub231pd zmm6, zmm6, zmmword ptr [r14 + r8 - 8] ## REGISTER dependency: zmm6
| 13. vaddpd zmm6, zmm6, zmm5
| 14. vaddpd zmm6, zmm6, zmmword ptr [r9 + r8]
| 15. vmovupd zmmword ptr [r12 + r8], zmm6
+----> 16. vmovupd zmm6, zmmword ptr [r9 + r8] ## RESOURCE interference: SKXPort3 [ probability: 11% ]
| 17. vmovupd zmm7, zmmword ptr [r10 + r8]
| 18. vfmadd132pd zmm7, zmm5, zmmword ptr [r14 + r8 - 8]
+----> 19. vaddpd zmm6, zmm6, zmm6 ## REGISTER dependency: zmm6
+----> 20. vsubpd zmm6, zmm7, zmm6 ## REGISTER dependency: zmm6
+----> 21. vmovupd zmmword ptr [rdi + r8], zmm6 ## REGISTER dependency: zmm6
22. add r8, 64
23. cmp r8, 4032
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
2 8 0.50 * vmovupd zmm6, zmmword ptr [r14 + r8 - 8]
2 8 0.50 * vmovupd zmm7, zmmword ptr [r14 + r8]
2 11 0.50 * vaddpd zmm7, zmm7, zmmword ptr [r14 + r8 - 16]
2 11 0.50 * vaddpd zmm7, zmm7, zmmword ptr [rbx + r8]
2 11 0.50 * vaddpd zmm7, zmm7, zmmword ptr [r15 + r8 + 8]
1 4 0.50 vfmadd231pd zmm7, zmm6, zmm3
2 11 0.50 * vfmadd213pd zmm7, zmm4, zmmword ptr [rdx + r8]
2 11 0.50 * vfnmsub132pd zmm6, zmm6, zmmword ptr [r10 + r8]
1 4 0.50 vaddpd zmm6, zmm7, zmm6
2 11 0.50 * vaddpd zmm6, zmm6, zmmword ptr [r9 + r8]
2 1 1.00 * vmovupd zmmword ptr [r13 + r8], zmm6
2 8 0.50 * vmovupd zmm6, zmmword ptr [r10 + r8]
2 11 0.50 * vfnmsub231pd zmm6, zmm6, zmmword ptr [r14 + r8 - 8]
1 4 0.50 vaddpd zmm6, zmm6, zmm5
2 11 0.50 * vaddpd zmm6, zmm6, zmmword ptr [r9 + r8]
2 1 1.00 * vmovupd zmmword ptr [r12 + r8], zmm6
2 8 0.50 * vmovupd zmm6, zmmword ptr [r9 + r8]
2 8 0.50 * vmovupd zmm7, zmmword ptr [r10 + r8]
2 11 0.50 * vfmadd132pd zmm7, zmm5, zmmword ptr [r14 + r8 - 8]
1 4 0.50 vaddpd zmm6, zmm6, zmm6
1 4 0.50 vsubpd zmm6, zmm7, zmm6
2 1 1.00 * vmovupd zmmword ptr [rdi + r8], zmm6
1 1 0.25 add r8, 64
1 1 0.25 cmp r8, 4032
Resources:
[0] - SKXDivider
[1] - SKXFPDivider
[2] - SKXPort0
[3] - SKXPort1
[4] - SKXPort2
[5] - SKXPort3
[6] - SKXPort4
[7] - SKXPort5
[8] - SKXPort6
[9] - SKXPort7
Resource pressure per iteration:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
- - 7.44 4.25 7.06 7.06 3.00 7.31 2.00 2.88
Resource pressure by instruction:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9] Instructions:
- - 0.01 0.98 0.41 0.59 - 0.01 - - vmovupd zmm6, zmmword ptr [r14 + r8 - 8]
- - 0.13 0.84 0.55 0.45 - 0.03 - - vmovupd zmm7, zmmword ptr [r14 + r8]
- - 0.61 - 0.59 0.41 - 0.39 - - vaddpd zmm7, zmm7, zmmword ptr [r14 + r8 - 16]
- - 0.47 - 0.61 0.39 - 0.53 - - vaddpd zmm7, zmm7, zmmword ptr [rbx + r8]
- - 0.51 - 0.55 0.45 - 0.49 - - vaddpd zmm7, zmm7, zmmword ptr [r15 + r8 + 8]
- - 0.46 - - - - 0.54 - - vfmadd231pd zmm7, zmm6, zmm3
- - 0.54 - 0.46 0.54 - 0.46 - - vfmadd213pd zmm7, zmm4, zmmword ptr [rdx + r8]
- - 0.44 - 0.66 0.34 - 0.56 - - vfnmsub132pd zmm6, zmm6, zmmword ptr [r10 + r8]
- - 0.48 - - - - 0.52 - - vaddpd zmm6, zmm7, zmm6
- - 0.52 - 0.69 0.31 - 0.48 - - vaddpd zmm6, zmm6, zmmword ptr [r9 + r8]
- - - - 0.01 - 1.00 - - 0.99 vmovupd zmmword ptr [r13 + r8], zmm6
- - 0.11 0.88 0.49 0.51 - 0.01 - - vmovupd zmm6, zmmword ptr [r10 + r8]
- - 0.66 - 0.32 0.68 - 0.34 - - vfnmsub231pd zmm6, zmm6, zmmword ptr [r14 + r8 - 8]
- - 0.50 - - - - 0.50 - - vaddpd zmm6, zmm6, zmm5
- - 0.50 - 0.47 0.53 - 0.50 - - vaddpd zmm6, zmm6, zmmword ptr [r9 + r8]
- - - - - 0.02 1.00 - - 0.98 vmovupd zmmword ptr [r12 + r8], zmm6
- - 0.01 0.98 0.24 0.76 - 0.01 - - vmovupd zmm6, zmmword ptr [r9 + r8]
- - 0.25 0.57 0.57 0.43 - 0.18 - - vmovupd zmm7, zmmword ptr [r10 + r8]
- - 0.45 - 0.36 0.64 - 0.55 - - vfmadd132pd zmm7, zmm5, zmmword ptr [r14 + r8 - 8]
- - 0.39 - - - - 0.61 - - vaddpd zmm6, zmm6, zmm6
- - 0.40 - - - - 0.60 - - vsubpd zmm6, zmm7, zmm6
- - - - 0.08 0.01 1.00 - - 0.91 vmovupd zmmword ptr [rdi + r8], zmm6
- - - - - - - - 1.00 - add r8, 64
- - - - - - - - 1.00 - cmp r8, 4032
Versus @turbo:
Iterations: 100
Instructions: 3100
Total Cycles: 783
Total uOps: 4200
Dispatch Width: 6
uOps Per Cycle: 5.36
IPC: 3.96
Block RThroughput: 7.0
No resource or data dependency bottlenecks discovered.
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
2 8 0.50 * vmovupd zmm4, zmmword ptr [rbx - 8]
2 8 0.50 * vmovupd zmm5, zmmword ptr [rbx]
2 11 0.50 * vaddpd zmm4, zmm4, zmmword ptr [rbx + 8]
2 11 0.50 * vaddpd zmm4, zmm4, zmmword ptr [r12 + rcx]
2 11 0.50 * vaddpd zmm4, zmm4, zmmword ptr [r12]
1 4 0.50 vfmadd231pd zmm4, zmm5, zmm1
2 8 0.50 * vmovupd zmm6, zmmword ptr [rbp]
2 8 0.50 * vmovupd zmm7, zmmword ptr [rdx]
2 11 0.50 * vaddpd zmm8, zmm7, zmmword ptr [rdi]
1 4 0.50 vfmsub231pd zmm8, zmm5, zmm6
1 4 0.50 vfnmadd231pd zmm8, zmm5, zmm0
1 4 0.50 vfmsub231pd zmm8, zmm4, zmm2
1 4 0.50 vaddpd zmm4, zmm7, zmm3
1 4 0.50 vfmsub231pd zmm4, zmm5, zmm6
1 4 0.50 vfmsub231pd zmm4, zmm6, zmm0
1 1 0.33 vmovapd zmm9, zmm0
1 4 0.50 vfmadd213pd zmm9, zmm7, zmm3
1 4 0.50 vfmadd231pd zmm9, zmm5, zmm6
2 1 1.00 * vmovupd zmmword ptr [r9], zmm8
2 1 1.00 * vmovupd zmmword ptr [rax], zmm4
1 4 0.50 vfmadd231pd zmm9, zmm0, zmm7
2 1 1.00 * vmovupd zmmword ptr [r14], zmm9
1 1 0.25 add rbx, 64
1 1 0.25 add rbp, 64
1 1 0.25 add rdx, 64
1 1 0.25 add rdi, 64
1 1 0.25 add r9, 64
1 1 0.25 add rax, 64
1 1 0.25 add r14, 64
1 1 0.25 add r12, 64
1 1 0.25 cmp r10, rbx
Resources:
[0] - SKXDivider
[1] - SKXFPDivider
[2] - SKXPort0
[3] - SKXPort1
[4] - SKXPort2
[5] - SKXPort3
[6] - SKXPort4
[7] - SKXPort5
[8] - SKXPort6
[9] - SKXPort7
Resource pressure per iteration:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
- - 7.52 7.33 4.06 4.06 3.00 7.53 5.62 2.88
Resource pressure by instruction:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9] Instructions:
- - 0.07 0.84 0.61 0.39 - 0.09 - - vmovupd zmm4, zmmword ptr [rbx - 8]
- - - 0.97 0.45 0.55 - 0.03 - - vmovupd zmm5, zmmword ptr [rbx]
- - 0.65 - 0.39 0.61 - 0.35 - - vaddpd zmm4, zmm4, zmmword ptr [rbx + 8]
- - 0.48 - 0.37 0.63 - 0.52 - - vaddpd zmm4, zmm4, zmmword ptr [r12 + rcx]
- - 0.35 - 0.80 0.20 - 0.65 - - vaddpd zmm4, zmm4, zmmword ptr [r12]
- - 0.27 - - - - 0.73 - - vfmadd231pd zmm4, zmm5, zmm1
- - 0.02 0.83 0.62 0.38 - 0.15 - - vmovupd zmm6, zmmword ptr [rbp]
- - 0.08 0.91 0.29 0.71 - 0.01 - - vmovupd zmm7, zmmword ptr [rdx]
- - 0.56 - 0.49 0.51 - 0.44 - - vaddpd zmm8, zmm7, zmmword ptr [rdi]
- - 0.39 - - - - 0.61 - - vfmsub231pd zmm8, zmm5, zmm6
- - 0.40 - - - - 0.60 - - vfnmadd231pd zmm8, zmm5, zmm0
- - 0.29 - - - - 0.71 - - vfmsub231pd zmm8, zmm4, zmm2
- - 0.55 - - - - 0.45 - - vaddpd zmm4, zmm7, zmm3
- - 0.31 - - - - 0.69 - - vfmsub231pd zmm4, zmm5, zmm6
- - 0.32 - - - - 0.68 - - vfmsub231pd zmm4, zmm6, zmm0
- - 0.20 0.78 - - - 0.02 - - vmovapd zmm9, zmm0
- - 0.75 - - - - 0.25 - - vfmadd213pd zmm9, zmm7, zmm3
- - 0.77 - - - - 0.23 - - vfmadd231pd zmm9, zmm5, zmm6
- - - - 0.01 - 1.00 - - 0.99 vmovupd zmmword ptr [r9], zmm8
- - - - - 0.03 1.00 - - 0.97 vmovupd zmmword ptr [rax], zmm4
- - 0.74 - - - - 0.26 - - vfmadd231pd zmm9, zmm0, zmm7
- - - - 0.03 0.05 1.00 - - 0.92 vmovupd zmmword ptr [r14], zmm9
- - - - - - - - 1.00 - add rbx, 64
- - 0.12 0.03 - - - 0.01 0.84 - add rbp, 64
- - 0.05 0.34 - - - 0.02 0.59 - add rdx, 64
- - 0.06 0.24 - - - - 0.70 - add rdi, 64
- - 0.02 0.69 - - - 0.01 0.28 - add r9, 64
- - - 0.29 - - - - 0.71 - add rax, 64
- - 0.02 0.30 - - - 0.02 0.66 - add r14, 64
- - 0.01 0.69 - - - - 0.30 - add r12, 64
- - 0.04 0.42 - - - - 0.54 - cmp r10, rbx
All the extra `add`s here for LoopVectorization are because, while LLVM can figure out the views are the same array, LoopVectorization doesn't know that.
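As a minimal illustration of the aliasing point (a toy reduction, not the benchmark kernel — the names here are invented for the example): both loops below compute the same thing, but the view-based form hands the loop three distinct base pointers to track, while the direct form indexes one parent array with constant slice offsets.

```julia
u = rand(8, 8, 3)
du_views  = zeros(8, 8, 3)
du_direct = zeros(8, 8, 3)

# View-based form: each view carries its own base pointer, so a vectorizer
# that can't see through the views must increment them all separately.
A  = @view u[:, :, 1]
B  = @view u[:, :, 2]
dA = @view du_views[:, :, 1]
@inbounds for j in 1:8, i in 1:8
    dA[i, j] = A[i, j] + B[i, j]
end

# Direct form: one base pointer per parent array, with constant slice offsets,
# which is what LLVM can prove about the view-based loop but LV cannot.
@inbounds for j in 1:8, i in 1:8
    du_direct[i, j, 1] = u[i, j, 1] + u[i, j, 2]
end

du_views == du_direct  # true
```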
If you want just a single number from each report, LLVM-MCA estimates 100 iterations of the `@turbo` loop will take 783 clock cycles, while 100 iterations of the `@simd ivdep` loop will take 878 clock cycles.
If we replace the views with indexing, `@turbo` is down to 731 cycles:
Iterations: 100
Instructions: 2700
Total Cycles: 731
Total uOps: 3800
Dispatch Width: 6
uOps Per Cycle: 5.20
IPC: 3.69
Block RThroughput: 7.0
Cycles with backend pressure increase [ 51.44% ]
Throughput Bottlenecks:
Resource Pressure [ 50.21% ]
- SKXPort0 [ 49.93% ]
- SKXPort1 [ 49.11% ]
- SKXPort2 [ 13.13% ]
- SKXPort3 [ 13.13% ]
- SKXPort4 [ 5.06% ]
- SKXPort5 [ 49.93% ]
- SKXPort6 [ 9.58% ]
- SKXPort7 [ 3.15% ]
Data Dependencies: [ 39.53% ]
- Register Dependencies [ 39.53% ]
- Memory Dependencies [ 0.00% ]
Critical sequence based on the simulation:
Instruction Dependency Information
+----< 13. vaddpd zmm5, zmm7, zmm3
|
| < loop carried >
|
| 0. mov r11, rbx
+----> 1. vmovupd zmm4, zmmword ptr [rbx] ## RESOURCE interference: SKXPort5 [ probability: 25% ]
+----> 2. vmovupd zmm5, zmmword ptr [rbx - 8] ## RESOURCE interference: SKXPort1 [ probability: 91% ]
+----> 3. vaddpd zmm5, zmm5, zmmword ptr [rbx + 8] ## RESOURCE interference: SKXPort2 [ probability: 13% ]
+----> 4. vaddpd zmm5, zmm5, zmmword ptr [rbx + r9] ## REGISTER dependency: zmm5
+----> 5. vaddpd zmm5, zmm5, zmmword ptr [rbx + rax] ## REGISTER dependency: zmm5
+----> 6. vmovupd zmm6, zmmword ptr [rbx + rdi] ## RESOURCE interference: SKXPort5 [ probability: 1% ]
| 7. vfmadd231pd zmm5, zmm4, zmm0
+----> 8. vmovupd zmm7, zmmword ptr [rbx + 2*rdi] ## RESOURCE interference: SKXPort1 [ probability: 93% ]
| 9. vaddpd zmm8, zmm7, zmmword ptr [rbp]
| 10. vfmsub231pd zmm8, zmm4, zmm6
| 11. vfnmadd231pd zmm8, zmm4, zmm1
| 12. vfmsub231pd zmm8, zmm5, zmm2
+----> 13. vaddpd zmm5, zmm7, zmm3 ## REGISTER dependency: zmm7
| 14. vfmsub231pd zmm5, zmm4, zmm6
| 15. vfmsub231pd zmm5, zmm6, zmm1
| 16. vmovapd zmm9, zmm1
+----> 17. vfmadd213pd zmm9, zmm7, zmm3 ## RESOURCE interference: SKXPort5 [ probability: 51% ]
| 18. vfmadd231pd zmm9, zmm4, zmm6
| 19. vfmadd231pd zmm9, zmm1, zmm7
| 20. vmovupd zmmword ptr [r12], zmm8
| 21. vmovupd zmmword ptr [r12 + rdx], zmm5
| 22. vmovupd zmmword ptr [r12 + 2*rdx], zmm9
| 23. add rbx, 64
| 24. add rbp, 64
| 25. add r12, 64
| 26. cmp rsi, rbx
|
| < loop carried >
|
+----> 3. vaddpd zmm5, zmm5, zmmword ptr [rbx + 8] ## RESOURCE interference: SKXPort5 [ probability: 38% ]
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
1 1 0.25 mov r11, rbx
2 8 0.50 * vmovupd zmm4, zmmword ptr [rbx]
2 8 0.50 * vmovupd zmm5, zmmword ptr [rbx - 8]
2 11 0.50 * vaddpd zmm5, zmm5, zmmword ptr [rbx + 8]
2 11 0.50 * vaddpd zmm5, zmm5, zmmword ptr [rbx + r9]
2 11 0.50 * vaddpd zmm5, zmm5, zmmword ptr [rbx + rax]
2 8 0.50 * vmovupd zmm6, zmmword ptr [rbx + rdi]
1 4 0.50 vfmadd231pd zmm5, zmm4, zmm0
2 8 0.50 * vmovupd zmm7, zmmword ptr [rbx + 2*rdi]
2 11 0.50 * vaddpd zmm8, zmm7, zmmword ptr [rbp]
1 4 0.50 vfmsub231pd zmm8, zmm4, zmm6
1 4 0.50 vfnmadd231pd zmm8, zmm4, zmm1
1 4 0.50 vfmsub231pd zmm8, zmm5, zmm2
1 4 0.50 vaddpd zmm5, zmm7, zmm3
1 4 0.50 vfmsub231pd zmm5, zmm4, zmm6
1 4 0.50 vfmsub231pd zmm5, zmm6, zmm1
1 1 0.33 vmovapd zmm9, zmm1
1 4 0.50 vfmadd213pd zmm9, zmm7, zmm3
1 4 0.50 vfmadd231pd zmm9, zmm4, zmm6
1 4 0.50 vfmadd231pd zmm9, zmm1, zmm7
2 1 1.00 * vmovupd zmmword ptr [r12], zmm8
2 1 1.00 * vmovupd zmmword ptr [r12 + rdx], zmm5
2 1 1.00 * vmovupd zmmword ptr [r12 + 2*rdx], zmm9
1 1 0.25 add rbx, 64
1 1 0.25 add rbp, 64
1 1 0.25 add r12, 64
1 1 0.25 cmp rsi, rbx
Resources:
[0] - SKXDivider
[1] - SKXFPDivider
[2] - SKXPort0
[3] - SKXPort1
[4] - SKXPort2
[5] - SKXPort3
[6] - SKXPort4
[7] - SKXPort5
[8] - SKXPort6
[9] - SKXPort7
Resource pressure per iteration:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
- - 7.06 6.79 4.04 4.05 3.00 7.07 3.08 2.91
Resource pressure by instruction:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9] Instructions:
- - - 0.85 - - - 0.01 0.14 - mov r11, rbx
- - - 0.99 0.46 0.54 - 0.01 - - vmovupd zmm4, zmmword ptr [rbx]
- - 0.01 0.98 0.75 0.25 - 0.01 - - vmovupd zmm5, zmmword ptr [rbx - 8]
- - 0.19 - 0.67 0.33 - 0.81 - - vaddpd zmm5, zmm5, zmmword ptr [rbx + 8]
- - 0.19 - 0.53 0.47 - 0.81 - - vaddpd zmm5, zmm5, zmmword ptr [rbx + r9]
- - 0.53 - 0.66 0.34 - 0.47 - - vaddpd zmm5, zmm5, zmmword ptr [rbx + rax]
- - 0.01 0.97 0.31 0.69 - 0.02 - - vmovupd zmm6, zmmword ptr [rbx + rdi]
- - 0.94 - - - - 0.06 - - vfmadd231pd zmm5, zmm4, zmm0
- - - 0.99 0.29 0.71 - 0.01 - - vmovupd zmm7, zmmword ptr [rbx + 2*rdi]
- - 0.10 - 0.33 0.67 - 0.90 - - vaddpd zmm8, zmm7, zmmword ptr [rbp]
- - 0.73 - - - - 0.27 - - vfmsub231pd zmm8, zmm4, zmm6
- - 0.93 - - - - 0.07 - - vfnmadd231pd zmm8, zmm4, zmm1
- - 0.96 - - - - 0.04 - - vfmsub231pd zmm8, zmm5, zmm2
- - 0.09 - - - - 0.91 - - vaddpd zmm5, zmm7, zmm3
- - 0.55 - - - - 0.45 - - vfmsub231pd zmm5, zmm4, zmm6
- - 0.34 - - - - 0.66 - - vfmsub231pd zmm5, zmm6, zmm1
- - - 0.98 - - - 0.02 - - vmovapd zmm9, zmm1
- - 0.30 - - - - 0.70 - - vfmadd213pd zmm9, zmm7, zmm3
- - 0.55 - - - - 0.45 - - vfmadd231pd zmm9, zmm4, zmm6
- - 0.62 - - - - 0.38 - - vfmadd231pd zmm9, zmm1, zmm7
- - - - - - 1.00 - - 1.00 vmovupd zmmword ptr [r12], zmm8
- - - - - 0.03 1.00 - - 0.97 vmovupd zmmword ptr [r12 + rdx], zmm5
- - - - 0.04 0.02 1.00 - - 0.94 vmovupd zmmword ptr [r12 + 2*rdx], zmm9
- - - - - - - - 1.00 - add rbx, 64
- - - 0.04 - - - 0.01 0.95 - add rbp, 64
- - 0.01 0.87 - - - - 0.12 - add r12, 64
- - 0.01 0.12 - - - - 0.87 - cmp rsi, rbx
This code looks better to me, and LLVM says it is faster. But it's slower in benchmarks (on my computers) anyway, and I'm a bit at a loss.
Maybe I could add architecture-specific hacks to do bad things sometimes, but in general worse probably isn't better.
Would be interesting to see more benchmarks.
Code:
using OrdinaryDiffEq, LinearAlgebra, SparseArrays, BenchmarkHistograms, LoopVectorization
const α₂ = 1.0
const α₃ = 1.0
const β₁ = 1.0
const β₂ = 1.0
const β₃ = 1.0
const r₁ = 1.0
const r₂ = 1.0
const D = 100.0
const γ₁ = 0.1
const γ₂ = 0.1
const γ₃ = 0.1
const N = 512
const X = reshape([i for i in 1:N for j in 1:N],N,N);
const Y = reshape([j for i in 1:N for j in 1:N],N,N);
const α₁ = 1.0.*(X.>=4*N/5);
const Mx = Tridiagonal([1.0 for i in 1:N-1],[-2.0 for i in 1:N],[1.0 for i in 1:N-1]);
const My = copy(Mx);
Mx[2,1] = 2.0;
Mx[end-1,end] = 2.0;
My[1,2] = 2.0;
My[end,end-1] = 2.0;
# Define the discretized PDE as an ODE function
const MyA = zeros(N,N);
const AMx = zeros(N,N);
const DA = zeros(N,N);
function f!(du,u,p,t)
A = @view u[:,:,1]
B = @view u[:,:,2]
C = @view u[:,:,3]
dA = @view du[:,:,1]
dB = @view du[:,:,2]
dC = @view du[:,:,3]
    @turbo for j in 2:N-1, i in 2:N-1
        dA[i,j] = D*(A[i-1,j] + A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    end
    @inbounds for j in 2:N-1
        i = 1
        dA[i,j] = D*(2A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    end
    @inbounds for j in 2:N-1
        i = N
        dA[i,j] = D*(2A[i-1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    end
    @inbounds for i in 2:N-1
        j = 1
        dA[i,j] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    end
    @inbounds for i in 2:N-1
        j = N
        dA[i,j] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    end
    @inbounds begin
        i = 1; j = 1
        dA[i,j] = D*(2A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
        i = 1; j = N
        dA[i,j] = D*(2A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
        i = N; j = 1
        dA[i,j] = D*(2A[i-1,j] + 2A[i,j+1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
        i = N; j = N
        dA[i,j] = D*(2A[i-1,j] + 2A[i,j-1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    end
end
function fthreads!(du,u,p,t)
A = @view u[:,:,1]
B = @view u[:,:,2]
C = @view u[:,:,3]
dA = @view du[:,:,1]
dB = @view du[:,:,2]
dC = @view du[:,:,3]
    @tturbo for j in 2:N-1, i in 2:N-1
        dA[i,j] = D*(A[i-1,j] + A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    end
    @inbounds for j in 2:N-1
        i = 1
        dA[i,j] = D*(2A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    end
    @inbounds for j in 2:N-1
        i = N
        dA[i,j] = D*(2A[i-1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    end
    @inbounds for i in 2:N-1
        j = 1
        dA[i,j] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    end
    @inbounds for i in 2:N-1
        j = N
        dA[i,j] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    end
    @inbounds begin
        i = 1; j = 1
        dA[i,j] = D*(2A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
        i = 1; j = N
        dA[i,j] = D*(2A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
        i = N; j = 1
        dA[i,j] = D*(2A[i-1,j] + 2A[i,j+1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
        i = N; j = N
        dA[i,j] = D*(2A[i-1,j] + 2A[i,j-1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    end
end
function fsimdivdep!(du,u,p,t)
A = @view u[:,:,1]
B = @view u[:,:,2]
C = @view u[:,:,3]
dA = @view du[:,:,1]
dB = @view du[:,:,2]
dC = @view du[:,:,3]
    @inbounds @fastmath for j in 2:N-1; @simd ivdep for i in 2:N-1
        dA[i,j] = D*(A[i-1,j] + A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    end; end
    @inbounds for j in 2:N-1
        i = 1
        dA[i,j] = D*(2A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    end
    @inbounds for j in 2:N-1
        i = N
        dA[i,j] = D*(2A[i-1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    end
    @inbounds for i in 2:N-1
        j = 1
        dA[i,j] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    end
    @inbounds for i in 2:N-1
        j = N
        dA[i,j] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    end
    @inbounds begin
        i = 1; j = 1
        dA[i,j] = D*(2A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
        i = 1; j = N
        dA[i,j] = D*(2A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
        i = N; j = 1
        dA[i,j] = D*(2A[i-1,j] + 2A[i,j+1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
        i = N; j = N
        dA[i,j] = D*(2A[i-1,j] + 2A[i,j-1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    end
end
function fnoviews!(du,u,p,t) # direct indexing in the @turbo loop, views only for the boundaries
A = @view u[:,:,1]
B = @view u[:,:,2]
C = @view u[:,:,3]
dA = @view du[:,:,1]
dB = @view du[:,:,2]
dC = @view du[:,:,3]
    @turbo for j in 2:N-1, i in 2:N-1
        du[i,j,1] = D*(u[i-1,j,1] + u[i+1,j,1] + u[i,j+1,1] + u[i,j-1,1] - 4u[i,j,1]) +
                    α₁[i,j] - β₁*u[i,j,1] - r₁*u[i,j,1]*u[i,j,2] + r₂*u[i,j,3]
        du[i,j,2] = α₂ - β₂*u[i,j,2] - r₁*u[i,j,1]*u[i,j,2] + r₂*u[i,j,3]
        du[i,j,3] = α₃ - β₃*u[i,j,3] + r₁*u[i,j,1]*u[i,j,2] - r₂*u[i,j,3]
    end
    @inbounds for j in 2:N-1
        i = 1
        dA[i,j] = D*(2A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    end
    @inbounds for j in 2:N-1
        i = N
        dA[i,j] = D*(2A[i-1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    end
    @inbounds for i in 2:N-1
        j = 1
        dA[i,j] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    end
    @inbounds for i in 2:N-1
        j = N
        dA[i,j] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    end
    @inbounds begin
        i = 1; j = 1
        dA[i,j] = D*(2A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
        i = 1; j = N
        dA[i,j] = D*(2A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
        i = N; j = 1
        dA[i,j] = D*(2A[i-1,j] + 2A[i,j+1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
        i = N; j = N
        dA[i,j] = D*(2A[i-1,j] + 2A[i,j-1] - 4A[i,j]) +
                  α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
        dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    end
end
u0 = zeros(N,N,3); du = similar(u0);
prob = ODEProblem(f!,u0,(0.0,10.0));
prob_threads = ODEProblem(fthreads!,u0,(0.0,10.0));
prob_simdivdep = ODEProblem(fsimdivdep!,u0,(0.0,10.0));
@time solve(prob, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
@time solve(prob_threads, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
@time solve(prob_simdivdep, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
@time solve(prob, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
@time solve(prob_threads, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
@time solve(prob_simdivdep, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
@time solve(prob, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
@time solve(prob_threads, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
@time solve(prob_simdivdep, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
Oh, and benchmarks on the M1 (running on a native Julia 1.8 build):
julia> @time solve(prob, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
21.625047 seconds (14.09 k allocations: 35.655 GiB, 0.71% gc time)
julia> @time solve(prob_threads, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
91.950376 seconds (14.24 k allocations: 35.655 GiB, 0.24% gc time)
julia> @time solve(prob_simdivdep, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
23.505749 seconds (14.01 k allocations: 35.421 GiB, 0.23% gc time)
julia> @time solve(prob, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
19.860662 seconds (267 allocations: 708.018 MiB)
julia> @time solve(prob_threads, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
22.793870 seconds (270 allocations: 708.018 MiB, 0.06% gc time)
julia> @time solve(prob_simdivdep, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
21.074039 seconds (270 allocations: 708.018 MiB, 0.53% gc time)
julia> @time solve(prob, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
20.464703 seconds (264 allocations: 690.017 MiB, 0.34% gc time)
julia> @time solve(prob_threads, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
21.983172 seconds (267 allocations: 690.017 MiB, 0.07% gc time)
julia> @time solve(prob_simdivdep, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
21.848427 seconds (267 allocations: 690.017 MiB, 0.39% gc time)
Not sure what's going on with ROCK4 and multithreading, but otherwise: ouch, Intel.
The benchmarks from my previous post were on a 10980XE, a much higher end chip than the M1.
Would be interesting to see how a Zen3 chip compares.
Benchmarks of the individual functions:
julia> @benchmark f!($du, $u0, nothing, 0.0) # @turbo
samples: 10000; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (409000.0 - 411200.0]     3
 (411200.0 - 413500.0]    23
 (413500.0 - 415700.0]   191
 (415700.0 - 418000.0]   894
 (418000.0 - 420300.0]  1974
 (420300.0 - 422500.0]  2614
 (422500.0 - 424800.0]  2228
 (424800.0 - 427100.0]  1220
 (427100.0 - 429300.0]   506
 (429300.0 - 431600.0]   180
 (431600.0 - 433800.0]    76
 (433800.0 - 436100.0]    46
 (436100.0 - 438400.0]    25
 (438400.0 - 440600.0]    10
 (440600.0 - 534200.0]    10
Counts
min: 408.958 μs (0.00% GC); mean: 422.196 μs (0.00% GC); median: 421.917 μs (0.00% GC); max: 534.208 μs (0.00% GC).
julia> @benchmark fthreads!($du, $u0, nothing, 0.0) # @tturbo
samples: 10000; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (218400.0 - 219900.0]    16
 (219900.0 - 221500.0]    73
 (221500.0 - 223000.0]   473
 (223000.0 - 224500.0]  1485
 (224500.0 - 226100.0]  2415
 (226100.0 - 227600.0]  2433
 (227600.0 - 229100.0]  1865
 (229100.0 - 230700.0]   871
 (230700.0 - 232200.0]   244
 (232200.0 - 233800.0]    55
 (233800.0 - 235300.0]    16
 (235300.0 - 236800.0]    22
 (236800.0 - 238400.0]    16
 (238400.0 - 239900.0]     6
 (239900.0 - 411600.0]    10
Counts
min: 218.375 μs (0.00% GC); mean: 226.536 μs (0.00% GC); median: 226.375 μs (0.00% GC); max: 411.584 μs (0.00% GC).
julia> @benchmark fsimdivdep!($du, $u0, nothing, 0.0) # @simd ivdep
samples: 8948; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (548500.0 - 550300.0]    27
 (550300.0 - 552100.0]   290
 (552100.0 - 553900.0]  1252
 (553900.0 - 555700.0]  2424
 (555700.0 - 557500.0]  2315
 (557500.0 - 559300.0]  1360
 (559300.0 - 561100.0]   582
 (561100.0 - 562900.0]   227
 (562900.0 - 564700.0]   144
 (564700.0 - 566400.0]   111
 (566400.0 - 568200.0]    88
 (568200.0 - 570000.0]    76
 (570000.0 - 571800.0]    27
 (571800.0 - 573600.0]    16
 (573600.0 - 604700.0]     9
Counts
min: 548.500 μs (0.00% GC); mean: 556.582 μs (0.00% GC); median: 556.000 μs (0.00% GC); max: 604.667 μs (0.00% GC).
Julia crashes a lot on the M1, though.
Some more benchmarks (also on M1) for linear diffusion that might be of interest:
using DifferentialEquations
using LinearAlgebra, SparseArrays
# Naive implementation
n = 100
A = spdiagm(-1 => ones(n-1), 0 => reduce(vcat, [-1,-2ones(n-2),-1]), 1=> ones(n-1))
function pde!(dy, y, p, t)
dy .= A * y
nothing
end
u0 = sin.(range(0,stop=pi,length=n))
dy = similar(u0)
tspan = (0.0,1.0)
using BenchmarkTools
@btime pde!(dy,u0,0,0) # 728.531 ns (3 allocations: 928 bytes)
prob = ODEProblem(pde!, u0, tspan)
# Benchmarks
using Sundials
@btime solve(prob, KenCarp47(autodiff=false), reltol=1e-6, abstol=1e-6); # 1.848 ms (2358 allocations: 806.56 KiB)
@btime solve(prob, CVODE_BDF(), reltol=1e-6, abstol=1e-6); # 367.167 μs (899 allocations: 190.02 KiB)
@btime solve(prob, CVODE_BDF(linear_solver=:Band,jac_upper=1,jac_lower=1), reltol=1e-6, abstol=1e-6);# 118.416 μs (412 allocations: 86.91 KiB)
# Non-allocating implementation
pde_noalloc! = let A = spdiagm(-1 => ones(n-1), 0 => reduce(vcat, [-1,-2ones(n-2),-1]), 1=> ones(n-1))
function pde!(dy, y, p, t)
mul!(dy, A, y)
nothing
end
end
@btime pde_noalloc!(dy,u0,0,0) # 371.079 ns (0 allocations: 0 bytes)
prob_noalloc = ODEProblem(pde_noalloc!, u0, tspan)
@btime solve(prob_noalloc, KenCarp47(autodiff=false), reltol=1e-6, abstol=1e-6); # 1.283 ms (388 allocations: 211.94 KiB)
@btime solve(prob_noalloc, CVODE_BDF(), reltol=1e-6, abstol=1e-6); # 305.750 μs (522 allocations: 77.14 KiB)
@btime solve(prob_noalloc, CVODE_BDF(linear_solver=:Band,jac_upper=1,jac_lower=1), reltol=1e-6, abstol=1e-6); # 105.125 μs (328 allocations: 61.98 KiB)
# Analytical Jacobian
de = modelingtoolkitize(prob)
jac = eval(ModelingToolkit.generate_jacobian(de)[2])
f = ODEFunction(pde!, jac=jac)
prob_jac = ODEProblem(f,u0,tspan)
@btime solve(prob_jac, KenCarp47(autodiff=false), reltol=1e-6, abstol=1e-6); # 1.403 ms (499 allocations: 256.42 KiB)
@btime solve(prob_jac, CVODE_BDF(), reltol=1e-6, abstol=1e-6); # 298.291 μs (403 allocations: 83.95 KiB)
f = ODEFunction(pde_noalloc!, jac=jac)
prob_noalloc_jac = ODEProblem(f,u0,tspan)
@btime solve(prob_noalloc_jac, KenCarp47(autodiff=false), reltol=1e-6, abstol=1e-6); # 1.351 ms (347 allocations: 210.91 KiB)
@btime solve(prob_noalloc_jac, CVODE_BDF(), reltol=1e-6, abstol=1e-6); # 278.792 μs (328 allocations: 61.75 KiB)
prob_mtk = ODEProblem(de, Pair[], tspan)
@btime solve(prob_mtk, KenCarp47(autodiff=false), reltol=1e-6, abstol=1e-6); # 1.028 ms (388 allocations: 214.03 KiB)
@btime solve(prob_mtk, CVODE_BDF(), reltol=1e-6, abstol=1e-6); # 266.958 μs (522 allocations: 78.38 KiB)
# Sparse jacobian
using SparsityDetection
input = rand(length(u0))
output = similar(input)
sparsity_pattern = jacobian_sparsity(pde!,output,input,0.0,0.0)
jac_sparsity = Float64.(sparse(sparsity_pattern))
using Plots
spy(jac_sparsity)
f = ODEFunction(pde!;jac_prototype=jac_sparsity)
prob_sparse = ODEProblem(f,u0,tspan)
@btime solve(prob_sparse, KenCarp47(), reltol=1e-6, abstol=1e-6); # 785.375 μs (2368 allocations: 1.51 MiB)
@btime solve(prob_sparse, CVODE_BDF(), reltol=1e-6, abstol=1e-6); # 375.833 μs (934 allocations: 199.62 KiB)
f = ODEFunction(pde_noalloc!;jac_prototype=jac_sparsity)
prob_noalloc_sparse = ODEProblem(f,u0,tspan)
@btime solve(prob_noalloc_sparse, KenCarp47(), reltol=1e-6, abstol=1e-6); # 751.666 μs (2200 allocations: 1.45 MiB)
@btime solve(prob_noalloc_sparse, CVODE_BDF(), reltol=1e-6, abstol=1e-6); # 314.375 μs (561 allocations: 87.88 KiB)
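As a stdlib-only sanity check on the thread's premise that the sparse matvec here is just a fused stencil in disguise, the snippet below verifies the banded operator above agrees with a hand-written stencil loop (`stencil!` is a name invented for this sketch, and the endpoint rows mirror the -1 diagonal entries used above):

```julia
using LinearAlgebra, SparseArrays

n = 100
# Same banded operator as in the naive implementation above.
A = spdiagm(-1 => ones(n-1), 0 => reduce(vcat, [-1.0, -2ones(n-2), -1.0]), 1 => ones(n-1))

# The fused-loop form of the same operator: one pass, no sparse structure.
function stencil!(dy, y)
    m = length(y)
    @inbounds begin
        dy[1] = -y[1] + y[2]
        @simd for i in 2:m-1
            dy[i] = y[i-1] - 2y[i] + y[i+1]
        end
        dy[m] = y[m-1] - y[m]
    end
    dy
end

y  = sin.(range(0, stop=pi, length=n))
d1 = A * y
d2 = stencil!(similar(y), y)
maximum(abs, d1 .- d2)  # ≈ 0, agreement up to roundoff
```

This is the transformation the thread argues MOL's lowering should do automatically: the stencil form touches each entry once in a contiguous sweep instead of walking sparse structure.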
@ChrisRackauckas is this still relevant?
I don't think so. It's the justification for going with the stencil composition form.