Thanks, I fixed the problem, added it to the tests, and created a new release. It should work starting with 0.2.3.

For what it's worth, what you're doing there is similar to what `@avx` does when applied to those three loops, but `@avx` should be faster.
```julia
using LoopVectorization, Test, BenchmarkTools
M = K = N = 100
T = Float64
C1 = Matrix{T}(undef, M, N); C2 = similar(C1); C3 = similar(C1);
A = rand(M, K); At = copy(A');
B = rand(K, N);

function mulCAtB_2x2block!(C, A, B)
    M, N = size(C); K = size(B, 1)
    @assert size(C, 1) == size(A, 2)
    @assert size(C, 2) == size(B, 2)
    @assert size(A, 1) == size(B, 1)
    T = eltype(C)
    if mod(M, 2) == 0 && mod(N, 2) == 0
        for m ∈ 1:2:M
            m1 = m + 1
            for n ∈ 1:2:N
                n1 = n + 1
                C11, C21, C12, C22 = zero(T), zero(T), zero(T), zero(T)
                @avx for k ∈ 1:K
                    C11 += A[k,m]  * B[k,n]
                    C21 += A[k,m1] * B[k,n]
                    C12 += A[k,m]  * B[k,n1]
                    C22 += A[k,m1] * B[k,n1]
                end
                C[m,n]   = C11
                C[m1,n]  = C21
                C[m,n1]  = C12
                C[m1,n1] = C22
            end
        end
    else # fallback for odd dimensions
        @inbounds for n ∈ 1:N, m ∈ 1:M
            Cmn = zero(T)
            for k ∈ 1:K
                Cmn += A[k,m] * B[k,n]
            end
            C[m,n] = Cmn
        end
    end
    return C
end

function AtmulBavx!(C, A, B)
    @avx for n ∈ 1:size(C,2), m ∈ 1:size(C,1)
        Cᵢⱼ = zero(eltype(C))
        for k ∈ 1:size(A,1)
            Cᵢⱼ += A[k,m] * B[k,n]
        end
        C[m,n] = Cᵢⱼ
    end
end

function AmulBavx!(C, A, B)
    @avx for m ∈ 1:size(A,1), n ∈ 1:size(B,2)
        Cᵢⱼ = zero(eltype(C))
        for k ∈ 1:size(A,2)
            Cᵢⱼ += A[m,k] * B[k,n]
        end
        C[m,n] = Cᵢⱼ
    end
end

mulCAtB_2x2block!(C1, At, B);
AtmulBavx!(C2, At, B);
AmulBavx!(C3, A, B);
C4 = At' * B;
@test C1 ≈ C4
@test C2 ≈ C4
@test C3 ≈ C4
```
I'm running these benchmarks on the latest (not yet released) master, where I did some ad hoc fiddling with how it chooses block sizes to get `AtmulBavx!` to be faster than `mulCAtB_2x2block!`.

Please let me know how the two perform on your computer. I may need to adjust it for AVX (instruction set) computers as well.

Performance is still bad compared to MKL and the `A`-not-transposed case, as you can see below:
```julia
julia> using LinearAlgebra; BLAS.set_num_threads(1)

julia> @benchmark mulCAtB_2x2block!($C1, $At, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     63.729 μs (0.00% GC)
  median time:      65.949 μs (0.00% GC)
  mean time:        67.540 μs (0.00% GC)
  maximum time:     110.972 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark AtmulBavx!($C2, $At, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     40.664 μs (0.00% GC)
  median time:      42.035 μs (0.00% GC)
  mean time:        42.846 μs (0.00% GC)
  maximum time:     74.588 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark AmulBavx!($C2, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     25.863 μs (0.00% GC)
  median time:      26.176 μs (0.00% GC)
  mean time:        26.578 μs (0.00% GC)
  maximum time:     66.092 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($C3, $At', $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     21.489 μs (0.00% GC)
  median time:      22.154 μs (0.00% GC)
  mean time:        22.415 μs (0.00% GC)
  maximum time:     70.543 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($C3, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.007 μs (0.00% GC)
  median time:      20.642 μs (0.00% GC)
  mean time:        20.932 μs (0.00% GC)
  maximum time:     51.647 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
```
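For context, the minimum times above translate into rough throughput figures (a 100×100×100 matmul does 2·M·K·N floating-point operations; the timings plugged in below are the ones from my machine, so your numbers will differ):

```julia
# Rough GFLOPS from the minimum times above.
# A 100×100×100 matmul performs 2*M*K*N floating-point operations.
M = K = N = 100
flops = 2 * M * K * N  # 2_000_000
for (name, μs) in (("mulCAtB_2x2block!", 63.729),
                   ("AtmulBavx!",        40.664),
                   ("AmulBavx!",         25.863),
                   ("mul! (At')",        21.489),
                   ("mul! (A)",          20.007))
    # flops per nanosecond == GFLOPS
    println(rpad(name, 20), round(flops / (μs * 1e3); digits = 1), " GFLOPS")
end
```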
It's twice as slow as MKL, and 60% slower than `@avx` when `A` isn't transposed.

To figure out what `@avx` is doing (aside from the obvious `macroexpand`), it can also help to:
```julia
julia> AtmulBq = :(for n ∈ 1:size(C,2), m ∈ 1:size(C,1)
           Cᵢⱼ = zero(eltype(C))
           for k ∈ 1:size(A,1)
               Cᵢⱼ += A[k,m] * B[k,n]
           end
           C[m,n] = Cᵢⱼ
       end);

julia> lsAtmulB = LoopVectorization.LoopSet(AtmulBq);

julia> LoopVectorization.choose_order(lsAtmulB)
([:m, :n, :k], :k, 4, 4)
```
This essentially means that the loop order is:

```julia
for mouter in 1:4:M          # the 4 here is the second 4 in the tuple
    for nouter in 1:4:N      # this 4 is the first 4
        # unrolled: for m in mouter:mouter+3, for n in nouter:nouter+3
        for k in 1:K
            # unrolled: for m in mouter:mouter+3, for n in nouter:nouter+3
        end
        # unrolled: for m in mouter:mouter+3, for n in nouter:nouter+3
    end
end
```
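As a plain-Julia sketch (my own illustration, not the code `@avx` actually emits — the real expansion keeps the 16 accumulators in registers and vectorizes the `k` loop, and it handles the remainder when `M` or `N` isn't a multiple of 4), the 4x4-tiled order amounts to:

```julia
# Hand-written sketch of a 4×4 tile over (m, n); assumes M and N are
# multiples of 4. A is the transposed operand, so it is K×M.
# The @avx expansion keeps `acc` in registers rather than allocating it.
function AtmulB_4x4_sketch!(C, A, B)
    M, N = size(C); K = size(A, 1)
    for m in 1:4:M, n in 1:4:N
        acc = zeros(eltype(C), 4, 4)         # 16 accumulators for the tile
        for k in 1:K                          # the vectorized loop in @avx
            for j in 0:3, i in 0:3
                acc[i+1, j+1] += A[k, m+i] * B[k, n+j]
            end
        end
        C[m:m+3, n:n+3] .= acc
    end
    return C
end
```

For multiple-of-4 sizes this agrees with `At' * B`, which is a handy sanity check when experimenting with tilings.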
So the difference between your code in this issue and `@avx` on the outer loop is that the former uses a 2x2 tile, while `@avx` chose to use a 4x4 tile. The `:k` outside the vector (the second element of the returned tuple) means that the `k` loop is the one that gets vectorized. On your computer, `@avx` will probably choose a 3x4 tile instead.
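The same inspection works for the non-transposed `AmulBavx!` loop nest; since the chosen order and tile sizes depend on your CPU, I won't guess at the returned tuple here:

```julia
using LoopVectorization

# Build a LoopSet from the non-transposed A*B loop nest and ask which
# order/tiling LoopVectorization would choose. The result is
# hardware-dependent, so no expected output is shown.
AmulBq = :(for m ∈ 1:size(A,1), n ∈ 1:size(B,2)
    Cᵢⱼ = zero(eltype(C))
    for k ∈ 1:size(A,2)
        Cᵢⱼ += A[m,k] * B[k,n]
    end
    C[m,n] = Cᵢⱼ
end)

lsAmulB = LoopVectorization.LoopSet(AmulBq)
LoopVectorization.choose_order(lsAmulB)
```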
Feel free to try out different tile sizes with `AtmulBavx!` by manually redefining:

```julia
LoopVectorization.choose_order(::LoopSet) = ([:m, :n, :k], :k, 2, 4)
```

and then redefining `AtmulBavx!` (without redefining `AtmulBavx!` it won't update, like we're back in Julia 0.5). Note that you'd have to restart your Julia session (or manually revert to the old definition) if you're using any function with `@avx` that doesn't have three loops indexed by `:m`, `:n`, and `:k` after replacing `choose_order` with the above.
from loopvectorization.jl.