
chriselrod commented on July 16, 2024

Thanks, I fixed the problem, added it to the tests, and created a new release.
It should work starting on 0.2.3.

For what it's worth, what you're doing there is similar to what @avx does when applied to those three loops, but @avx should be faster.

using LoopVectorization, Test, BenchmarkTools
M = K = N = 100
T = Float64
C1 = Matrix{T}(undef, M, N); C2 = similar(C1); C3 = similar(C1);
A = rand(M, K); At = copy(A');
B = rand(K, N);

function mulCAtB_2x2block!(C, A, B)
    M, N = size(C); K = size(B,1)
    @assert size(C, 1) == size(A, 2)
    @assert size(C, 2) == size(B, 2)
    @assert size(A, 1) == size(B, 1)
    T = eltype(C)
    if mod(M, 2) == 0 && mod(N, 2) == 0
        for m ∈ 1:2:M
            m1 = m + 1
            for n ∈ 1:2:N
                n1 = n + 1
                C11, C21, C12, C22 = zero(T), zero(T), zero(T), zero(T)
                @avx for k ∈ 1:K
                    C11 += A[k,m] * B[k,n] 
                    C21 += A[k,m1] * B[k,n] 
                    C12 += A[k,m] * B[k,n1] 
                    C22 += A[k,m1] * B[k,n1]
                end
                C[m,n] = C11
                C[m1,n] = C21
                C[m,n1] = C12
                C[m1,n1] = C22
            end
        end
    else
        @inbounds for n ∈ 1:N, m ∈ 1:M
            Cmn = zero(T)
            for k ∈ 1:K
                Cmn += A[k,m] * B[k,n]
            end
            C[m,n] = Cmn
        end
    end
    return C
end
function AtmulBavx!(C, A, B)
    @avx for n ∈ 1:size(C,2), m ∈ 1:size(C,1)
        Cᵢⱼ = zero(eltype(C))
        for k ∈ 1:size(A,1)
            Cᵢⱼ += A[k,m] * B[k,n]
        end
        C[m,n] = Cᵢⱼ
    end
    return C
end
function AmulBavx!(C, A, B)
    @avx for m ∈ 1:size(A,1), n ∈ 1:size(B,2)
        Cᵢⱼ = zero(eltype(C))
        for k ∈ 1:size(A,2)
            Cᵢⱼ += A[m,k] * B[k,n]
        end
        C[m,n] = Cᵢⱼ
    end
    return C
end

mulCAtB_2x2block!(C1, At, B);
AtmulBavx!(C2, At, B);
AmulBavx!(C3, A, B);
C4 = At' * B;
@test C1 ≈ C4
@test C2 ≈ C4
@test C3 ≈ C4

I'm running these benchmarks on the latest (not yet released) master, where I did some ad hoc fiddling with how it chooses block sizes to get AtmulBavx! to be faster than mulCAtB_2x2block!.

Please let me know how the two perform on your computer. I may need to adjust it for AVX (instruction set) computers as well.
Performance is still poor compared to MKL and the A-not-transposed case, as you can see below:

julia> using LinearAlgebra; BLAS.set_num_threads(1)

julia> @benchmark mulCAtB_2x2block!($C1, $At, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     63.729 μs (0.00% GC)
  median time:      65.949 μs (0.00% GC)
  mean time:        67.540 μs (0.00% GC)
  maximum time:     110.972 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark AtmulBavx!($C2, $At, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     40.664 μs (0.00% GC)
  median time:      42.035 μs (0.00% GC)
  mean time:        42.846 μs (0.00% GC)
  maximum time:     74.588 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark AmulBavx!($C2, $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     25.863 μs (0.00% GC)
  median time:      26.176 μs (0.00% GC)
  mean time:        26.578 μs (0.00% GC)
  maximum time:     66.092 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($C3, $At', $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     21.489 μs (0.00% GC)
  median time:      22.154 μs (0.00% GC)
  mean time:        22.415 μs (0.00% GC)
  maximum time:     70.543 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($C3, $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.007 μs (0.00% GC)
  median time:      20.642 μs (0.00% GC)
  mean time:        20.932 μs (0.00% GC)
  maximum time:     51.647 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

It's roughly twice as slow as MKL (≈40.7 μs vs ≈20.0 μs minimum), and about 60% slower than @avx when A isn't transposed (≈25.9 μs).
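
For scale, a 100×100×100 multiply is 2·MKN = 2·10⁶ flops, so the minimum times above imply roughly the following throughputs (my own back-of-the-envelope arithmetic, not part of the benchmark output):

# Throughput implied by the minimum times above: 2e6 flops / t, i.e. GFLOPS = 2000 / t_us.
for (name, t_us) in (("mulCAtB_2x2block!", 63.729),
                     ("AtmulBavx!",        40.664),
                     ("AmulBavx!",         25.863),
                     ("mul! At'*B (MKL)",  21.489),
                     ("mul! A*B (MKL)",    20.007))
    println(rpad(name, 20), round(2000 / t_us, digits = 1), " GFLOPS")
end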

To figure out what @avx is doing (aside from the obvious macroexpand), it can also help to:

julia> AtmulBq = :(for n ∈ 1:size(C,2), m ∈ 1:size(C,1)
           Cᵢⱼ = zero(eltype(C))
           for k ∈ 1:size(A,1)
               Cᵢⱼ += A[k,m] * B[k,n]
           end
           C[m,n] = Cᵢⱼ
       end);

julia> lsAtmulB = LoopVectorization.LoopSet(AtmulBq);

julia> LoopVectorization.choose_order(lsAtmulB)
([:m, :n, :k], :k, 4, 4)

This essentially means that the loop order is:

for mouter in 1:4:M # the 4 here is the second 4
    for nouter in 1:4:N # this 4 is the first 4
        # unrolled, for m in mouter:mouter+3, for n in nouter:nouter+3
        for k in 1:K
            # unrolled, for m in mouter:mouter+3, for n in nouter:nouter+3
        end
        # unrolled, for m in mouter:mouter+3, for n in nouter:nouter+3
    end
end

So the difference between your code in this issue and @avx is that the former uses a 2x2 tile over the outer loops, while @avx chose a 4x4 tile.
The :k returned on its own (outside the vector) indicates the vectorized loop.
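
Concretely, the 4x4 tile follows the same accumulator pattern as the 2x2 kernel above, just with sixteen accumulators per tile. A scalar schematic (my sketch; @avx additionally vectorizes along k, keeps the accumulators in registers, and handles remainders, none of which is shown here):

# Scalar schematic of a 4x4-tiled kernel for C = A'B, where A is the K x M
# (already transposed) matrix, i.e. At above; assumes 4 divides M and N.
for m ∈ 1:4:M, n ∈ 1:4:N
    Cacc = zeros(4, 4)                  # 16 accumulators per tile
    for k ∈ 1:K, j ∈ 0:3, i ∈ 0:3
        Cacc[1+i, 1+j] += A[k, m+i] * B[k, n+j]
    end
    C[m:m+3, n:n+3] .= Cacc
end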

On your computer, @avx will probably choose to use a 3x4 tile.
Feel free to try out different tile sizes with AtmulBavx! by manually redefining:

LoopVectorization.choose_order(::LoopSet) = ([:m,:n,:k], :k, 2, 4)

and then redefining AtmulBavx! (without redefining AtmulBavx!, it won't update, like we're back in Julia 0.5).

Note that after replacing choose_order with the above, you'd have to restart your Julia session (or manually revert to the old definition) before using @avx in any function that doesn't have exactly three loops indexed by :m, :n, and :k.
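
Putting that together, a tile-size experiment might look like the following sketch. Note that choose_order is an internal, unexported function, so this signature is only what the 0.2.x-era code above suggests and may change between versions:

julia> LoopVectorization.choose_order(::LoopVectorization.LoopSet) = ([:m,:n,:k], :k, 2, 4)

julia> function AtmulBavx!(C, A, B) # re-evaluate so @avx re-expands under the new order
           @avx for n ∈ 1:size(C,2), m ∈ 1:size(C,1)
               Cᵢⱼ = zero(eltype(C))
               for k ∈ 1:size(A,1)
                   Cᵢⱼ += A[k,m] * B[k,n]
               end
               C[m,n] = Cᵢⱼ
           end
           return C
       end;

julia> @benchmark AtmulBavx!($C2, $At, $B)   # compare against the 4x4 timings above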

