Comments (6)

PetrKryslUCSD commented on August 15, 2024

For 1.5 nightly.
Rewriting the loop as

@avx for n in 1:size(C,2), m in 1:size(C,1)
    Cmn = zero(eltype(C))
    for k in 1:size(A,1)
        Cmn += A[k,m] * B[k,n]
    end
    C[m,n] = Cmn
end

works fine.

chriselrod commented on August 15, 2024

I'm surprised I wasn't already testing your first example.

I'm currently working on rewriting (and greatly simplifying) how I'm handling constant values.
It is spaghetti at the moment; the rewrite should make it a lot more robust.

PetrKryslUCSD commented on August 15, 2024

Great! Thanks.

chriselrod commented on August 15, 2024

I added tests for this. Thanks for the bug report!

I haven't made an announcement yet, but the new feature in 0.4 is:

using LoopVectorization, BenchmarkTools
@inline function mulCAB!(C, A, B)
    M, N = size(C); K = size(B,1)
    @assert size(C, 1) == size(A, 2)
    @assert size(C, 2) == size(B, 2)
    @assert size(A, 1) == size(B, 1)
    # When the @avx macro is available, this code is faster:
    z = zero(eltype(C))
    @avx for n in 1:size(C,2), m in 1:size(C,1)
        Cmn = z
        for k in 1:size(A,1)
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
    return C
end
@inline function mulCAtB!(C, A, B)
    M, N = size(C); K = size(B,1)
    @assert size(C, 1) == size(A, 2)
    @assert size(C, 2) == size(B, 2)
    @assert size(A, 1) == size(B, 1)
    # When the @avx macro is available, this code is faster:
    z = zero(eltype(C))
    @avx for n in 1:size(C,2), m in 1:size(C,1)
        Cmn = z
        for k in 1:size(A,1)
            Cmn += A[k,m] * B[k,n]
        end
        C[m,n] = Cmn
    end
    return C
end
N = 48;
A = rand(N, N); B = rand(N,N);
C1 = similar(A); C2 = similar(A);
C3 = similar(A); C4 = similar(A);
At = copy(A');

Then:

julia> @benchmark mulCAB!($C1, $At', $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.296 μs (0.00% GC)
  median time:      8.324 μs (0.00% GC)
  mean time:        8.368 μs (0.00% GC)
  maximum time:     112.063 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     3

julia> @benchmark mulCAtB!($C2, $At, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.371 μs (0.00% GC)
  median time:      8.406 μs (0.00% GC)
  mean time:        8.459 μs (0.00% GC)
  maximum time:     112.286 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     3

julia> C1 ≈ C2
true

julia> @benchmark mulCAB!($C3, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.996 μs (0.00% GC)
  median time:      10.098 μs (0.00% GC)
  mean time:        10.153 μs (0.00% GC)
  maximum time:     117.462 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     3

julia> @benchmark mulCAtB!($C4, $A', $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.538 μs (0.00% GC)
  median time:      9.629 μs (0.00% GC)
  mean time:        9.658 μs (0.00% GC)
  maximum time:     31.392 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> C1 ≈ C3
true

julia> C1 ≈ C4
true

But I think something may be wrong with the performance at the moment.
I'll look into that.

EDIT: It is definitely looking like there is a 50% performance regression for the untransposed case.

chriselrod commented on August 15, 2024

The performance regression was related to a change I made to SIMDPirates.
Get on SIMDPirates 0.2.2+ to avoid it.
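
A minimal way to pick up the fix, assuming SIMDPirates is already in your active environment:

using Pkg
Pkg.update("SIMDPirates")   # resolve to the latest compatible release
Pkg.status()                # confirm SIMDPirates is at v0.2.2 or later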

The performance regression was minor for AVX-512 (e.g., 3.4 vs. 3.3 microseconds) but rather extreme for AVX2; otherwise I would have noticed sooner.

The fault behind the regression lies with LLVM.
To get additions and multiplications to fuse, you can:

  1. Pass a flag to LLVM so that it can fuse them (i.e., fastmath).
  2. Parse the expression and manually substitute multiply-add patterns with muladd calls (see the sketch below).
  3. Use an expression graph, like the one LoopVectorization builds, to fuse operations.

"3." takes the most work, so I haven't implemented that yet.
"2." can miss a lot of cases.
"1." Seems ideal...

...except that the fastmath flag makes LLVM do a lot of stupid things, like shuffling vectors around in registers and repeatedly rereading from memory when it doesn't have to. Sometimes it does this badly enough to cause a severe performance regression, like the one we just saw above for A * B.

So in SIMDPirates 0.2.2 I removed the fastmath flag and am relying on strategy "2." to fuse operations.

I do need at least one of "2." or "3." for LoopVectorization to be aware that (on most architectures) a * b + c is just one operation.
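
To make strategy "2." concrete, here is a minimal sketch of the idea (a hypothetical fuse_muladd helper, not LoopVectorization's actual implementation): recursively walk an expression and rewrite x * y + z patterns into muladd(x, y, z) calls.

fuse_muladd(x) = x  # symbols and literals pass through unchanged
function fuse_muladd(ex::Expr)
    ex = Expr(ex.head, map(fuse_muladd, ex.args)...)  # rewrite children first
    if ex.head == :call && ex.args[1] == :+ && length(ex.args) == 3
        for (i, j) in ((2, 3), (3, 2))  # the `*` may sit on either side of the `+`
            m = ex.args[i]
            if m isa Expr && m.head == :call && m.args[1] == :* && length(m.args) == 3
                return :(muladd($(m.args[2]), $(m.args[3]), $(ex.args[j])))
            end
        end
    end
    return ex
end

fuse_muladd(:(Cmn + A[m, k] * B[k, n]))  # returns :(muladd(A[m, k], B[k, n], Cmn))

A textual rewrite like this is exactly what can "miss a lot of cases" (for example, when the product is first bound to an intermediate variable), which is why the expression-graph approach in "3." is ultimately the more reliable one.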

These benchmarks were run on a different CPU with AVX2 (the previous one was shut down for maintenance). With fastmath (SIMDPirates 0.2.1):

julia> using LoopVectorization, BenchmarkTools

julia> @inline function mulCAB!(C, A, B)
           M, N = size(C); K = size(B,1)
           @assert size(C, 1) == size(A, 2)
           @assert size(C, 2) == size(B, 2)
           @assert size(A, 1) == size(B, 1)
           # When the @avx macro is available, this code is faster:
           z = zero(eltype(C))
           @avx for n in 1:size(C,2), m in 1:size(C,1)
               Cmn = z
               for k in 1:size(A,1)
                   Cmn += A[m,k] * B[k,n]
               end
               C[m,n] = Cmn
           end
           return C
       end
mulCAB! (generic function with 1 method)

julia> @inline function mulCAtB!(C, A, B)
           M, N = size(C); K = size(B,1)
           @assert size(C, 1) == size(A, 2)
           @assert size(C, 2) == size(B, 2)
           @assert size(A, 1) == size(B, 1)
           # When the @avx macro is available, this code is faster:
           z = zero(eltype(C))
           @avx for n in 1:size(C,2), m in 1:size(C,1)
               Cmn = z
               for k in 1:size(A,1)
                   Cmn += A[k,m] * B[k,n]
               end
               C[m,n] = Cmn
           end
           return C
       end
mulCAtB! (generic function with 1 method)

julia> N = 48;

julia> A = rand(N, N); B = rand(N,N);

julia> C1 = similar(A); C2 = similar(A);

julia> C3 = similar(A); C4 = similar(A);

julia> At = copy(A');

julia> @benchmark mulCAB!($C1, $At', $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.920 μs (0.00% GC)
  median time:      6.940 μs (0.00% GC)
  mean time:        6.993 μs (0.00% GC)
  maximum time:     17.780 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> @benchmark mulCAtB!($C2, $At, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.800 μs (0.00% GC)
  median time:      6.820 μs (0.00% GC)
  mean time:        6.956 μs (0.00% GC)
  maximum time:     46.920 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> C1 ≈ C2
true

julia> @benchmark mulCAB!($C3, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.220 μs (0.00% GC)
  median time:      6.340 μs (0.00% GC)
  mean time:        6.361 μs (0.00% GC)
  maximum time:     20.560 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> @benchmark mulCAtB!($C4, $A', $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.180 μs (0.00% GC)
  median time:      6.200 μs (0.00% GC)
  mean time:        6.279 μs (0.00% GC)
  maximum time:     43.540 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> C1 ≈ C3
true

julia> C1 ≈ C4
true

Without fastmath (SIMDPirates 0.2.2):

julia> using LoopVectorization, BenchmarkTools

julia> @inline function mulCAB!(C, A, B)
           M, N = size(C); K = size(B,1)
           @assert size(C, 1) == size(A, 2)
           @assert size(C, 2) == size(B, 2)
           @assert size(A, 1) == size(B, 1)
           # When the @avx macro is available, this code is faster:
           z = zero(eltype(C))
           @avx for n in 1:size(C,2), m in 1:size(C,1)
               Cmn = z
               for k in 1:size(A,1)
                   Cmn += A[m,k] * B[k,n]
               end
               C[m,n] = Cmn
           end
           return C
       end
mulCAB! (generic function with 1 method)

julia> @inline function mulCAtB!(C, A, B)
           M, N = size(C); K = size(B,1)
           @assert size(C, 1) == size(A, 2)
           @assert size(C, 2) == size(B, 2)
           @assert size(A, 1) == size(B, 1)
           # When the @avx macro is available, this code is faster:
           z = zero(eltype(C))
           @avx for n in 1:size(C,2), m in 1:size(C,1)
               Cmn = z
               for k in 1:size(A,1)
                   Cmn += A[k,m] * B[k,n]
               end
               C[m,n] = Cmn
           end
           return C
       end
mulCAtB! (generic function with 1 method)

julia> N = 48;

julia> A = rand(N, N); B = rand(N,N);

julia> C1 = similar(A); C2 = similar(A);

julia> C3 = similar(A); C4 = similar(A);

julia> At = copy(A');

julia> @benchmark mulCAB!($C1, $At', $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.620 μs (0.00% GC)
  median time:      6.640 μs (0.00% GC)
  mean time:        6.724 μs (0.00% GC)
  maximum time:     23.380 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> @benchmark mulCAtB!($C2, $At, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.620 μs (0.00% GC)
  median time:      6.840 μs (0.00% GC)
  mean time:        6.865 μs (0.00% GC)
  maximum time:     15.740 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> C1 ≈ C2
true

julia> @benchmark mulCAB!($C3, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.943 μs (0.00% GC)
  median time:      5.300 μs (0.00% GC)
  mean time:        5.458 μs (0.00% GC)
  maximum time:     20.986 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

julia> @benchmark mulCAtB!($C4, $A', $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.828 μs (0.00% GC)
  median time:      5.000 μs (0.00% GC)
  mean time:        5.021 μs (0.00% GC)
  maximum time:     19.286 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

julia> C1 ≈ C3
true

julia> C1 ≈ C4
true

julia> using LinearAlgebra

julia> BLAS.set_num_threads(1)

julia> @benchmark mul!($C1, $At', $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.100 μs (0.00% GC)
  median time:      7.150 μs (0.00% GC)
  mean time:        7.231 μs (0.00% GC)
  maximum time:     21.275 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     4

julia> @benchmark mul!($C1, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.175 μs (0.00% GC)
  median time:      7.250 μs (0.00% GC)
  mean time:        7.589 μs (0.00% GC)
  maximum time:     64.250 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     4

julia> BLAS.vendor()
:openblas64

Anyway, this last example is what I wanted to show off about LoopVectorization 0.4:
writing mulCAB! and passing it A' as an argument will now automatically perform like mulCAtB!, and writing mulCAtB! and passing it At' will automatically perform like mulCAB!. =)
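
A minimal recap of that point, reusing the functions and arrays from the session above: both calls below receive an Adjoint-wrapped argument, compute the same product, and (per the timings above) land on a well-performing kernel.

mulCAB!(C1, At', B)   # the "A*B"-style loop, handed an Adjoint-wrapped array
mulCAtB!(C2, A', B)   # the "A'*B"-style loop, handed an Adjoint-wrapped array
C1 ≈ C2               # true: both end up computing A * B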

PetrKryslUCSD commented on August 15, 2024

Thanks for the thorough writeup. Good to know!
