Comments (6)

PetrKryslUCSD commented on August 15, 2024

For 1.5 nightly.
Rewriting the loop as

@avx for n in 1:size(C,2), m in 1:size(C,1)
    Cmn = zero(eltype(C))
    for k in 1:size(A,1)
        Cmn += A[k,m] * B[k,n]
    end
    C[m,n] = Cmn
end

works fine.

chriselrod commented on August 15, 2024

I'm surprised I wasn't already testing your first example.

I'm currently working on rewriting (and greatly simplifying) how I'm handling constant values.
It is spaghetti at the moment; the rewrite should make it a lot more robust.

PetrKryslUCSD commented on August 15, 2024

Great! Thanks.

chriselrod commented on August 15, 2024

I added tests for this. Thanks for the bug report!

I haven't made an announcement yet, but the new feature in 0.4 is:

using LoopVectorization, BenchmarkTools
@inline function mulCAB!(C, A, B)
    M, N = size(C); K = size(B,1)
    @assert size(C, 1) == size(A, 2)
    @assert size(C, 2) == size(B, 2)
    @assert size(A, 1) == size(B, 1)
    # When the @avx macro is available, this code is faster:
    z = zero(eltype(C))
    @avx for n in 1:size(C,2), m in 1:size(C,1)
        Cmn = z
        for k in 1:size(A,1)
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
    return C
end
@inline function mulCAtB!(C, A, B)
    M, N = size(C); K = size(B,1)
    @assert size(C, 1) == size(A, 2)
    @assert size(C, 2) == size(B, 2)
    @assert size(A, 1) == size(B, 1)
    # When the @avx macro is available, this code is faster:
    z = zero(eltype(C))
    @avx for n in 1:size(C,2), m in 1:size(C,1)
        Cmn = z
        for k in 1:size(A,1)
            Cmn += A[k,m] * B[k,n]
        end
        C[m,n] = Cmn
    end
    return C
end
N = 48;
A = rand(N, N); B = rand(N,N);
C1 = similar(A); C2 = similar(A);
C3 = similar(A); C4 = similar(A);
At = copy(A');

Then:

julia> @benchmark mulCAB!($C1, $At', $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.296 μs (0.00% GC)
  median time:      8.324 μs (0.00% GC)
  mean time:        8.368 μs (0.00% GC)
  maximum time:     112.063 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     3

julia> @benchmark mulCAtB!($C2, $At, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.371 μs (0.00% GC)
  median time:      8.406 μs (0.00% GC)
  mean time:        8.459 μs (0.00% GC)
  maximum time:     112.286 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     3

julia> C1 ≈ C2
true

julia> @benchmark mulCAB!($C3, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.996 μs (0.00% GC)
  median time:      10.098 μs (0.00% GC)
  mean time:        10.153 μs (0.00% GC)
  maximum time:     117.462 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     3

julia> @benchmark mulCAtB!($C4, $A', $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.538 μs (0.00% GC)
  median time:      9.629 μs (0.00% GC)
  mean time:        9.658 μs (0.00% GC)
  maximum time:     31.392 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> C1 ≈ C3
true

julia> C1 ≈ C4
true

But I think something may be wrong with the performance at the moment.
I'll look into that.

EDIT: It is definitely looking like there is a 50% performance regression for the untransposed case.

chriselrod commented on August 15, 2024

The performance regression was related to a change I made to SIMDPirates.
Get on SIMDPirates 0.2.2+ to avoid it.
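
A minimal way to pick up the fix, assuming SIMDPirates is already in your active environment:

using Pkg
Pkg.update("SIMDPirates")   # resolve to the latest compatible release
Pkg.status()                # confirm SIMDPirates is at v0.2.2 or later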

The performance regression was minor for AVX-512 (e.g., 3.4 vs. 3.3 microseconds) but rather extreme for AVX2; otherwise I would have noticed sooner.

The fault behind the regression lies with LLVM.
To get additions and multiplications to fuse, you can:

  1. Pass a flag to LLVM so that it can fuse them (i.e., fastmath).
  2. Parse the expression and manually substitute multiply-add patterns with muladd calls (see the sketch below).
  3. Use an expression graph, like the one LoopVectorization builds, to fuse operations.

"3." takes the most work, so I haven't implemented that yet.
"2." can miss a lot of cases.
"1." Seems ideal...

...except that the fastmath flag makes LLVM do a lot of stupid things, like shuffling vectors around in registers and repeatedly rereading from memory when it doesn't have to. Sometimes it does this badly enough to cause a severe performance regression, like the one we just saw above for A * B.

So in SIMDPirates 0.2.2 I removed the fastmath flag and am relying on strategy "2." to fuse operations.

I do need at least one of "2." or "3." for LoopVectorization to be aware that (on most architectures) a * b + c is just one operation.
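
To make strategy "2." concrete, here is a minimal sketch of the idea (a hypothetical fuse_muladd helper, not LoopVectorization's actual implementation): recursively walk an expression and rewrite x * y + z patterns into muladd(x, y, z) calls.

fuse_muladd(x) = x  # symbols and literals pass through unchanged
function fuse_muladd(ex::Expr)
    ex = Expr(ex.head, map(fuse_muladd, ex.args)...)  # rewrite children first
    if ex.head == :call && ex.args[1] == :+ && length(ex.args) == 3
        for (i, j) in ((2, 3), (3, 2))  # the `*` may sit on either side of the `+`
            m = ex.args[i]
            if m isa Expr && m.head == :call && m.args[1] == :* && length(m.args) == 3
                return :(muladd($(m.args[2]), $(m.args[3]), $(ex.args[j])))
            end
        end
    end
    return ex
end

fuse_muladd(:(Cmn + A[m, k] * B[k, n]))  # returns :(muladd(A[m, k], B[k, n], Cmn))

A textual rewrite like this is exactly what can "miss a lot of cases" (for example, when the product is first bound to an intermediate variable), which is why the expression-graph approach in "3." is ultimately the more reliable one.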

These benchmarks were run on a different CPU with AVX2 (the previous one was shut down for maintenance). With fastmath (SIMDPirates 0.2.1):

julia> using LoopVectorization, BenchmarkTools

julia> @inline function mulCAB!(C, A, B)
           M, N = size(C); K = size(B,1)
           @assert size(C, 1) == size(A, 2)
           @assert size(C, 2) == size(B, 2)
           @assert size(A, 1) == size(B, 1)
           # When the @avx macro is available, this code is faster:
           z = zero(eltype(C))
           @avx for n in 1:size(C,2), m in 1:size(C,1)
               Cmn = z
               for k in 1:size(A,1)
                   Cmn += A[m,k] * B[k,n]
               end
               C[m,n] = Cmn
           end
           return C
       end
mulCAB! (generic function with 1 method)

julia> @inline function mulCAtB!(C, A, B)
           M, N = size(C); K = size(B,1)
           @assert size(C, 1) == size(A, 2)
           @assert size(C, 2) == size(B, 2)
           @assert size(A, 1) == size(B, 1)
           # When the @avx macro is available, this code is faster:
           z = zero(eltype(C))
           @avx for n in 1:size(C,2), m in 1:size(C,1)
               Cmn = z
               for k in 1:size(A,1)
                   Cmn += A[k,m] * B[k,n]
               end
               C[m,n] = Cmn
           end
           return C
       end
mulCAtB! (generic function with 1 method)

julia> N = 48;

julia> A = rand(N, N); B = rand(N,N);

julia> C1 = similar(A); C2 = similar(A);

julia> C3 = similar(A); C4 = similar(A);

julia> At = copy(A');

julia> @benchmark mulCAB!($C1, $At', $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.920 μs (0.00% GC)
  median time:      6.940 μs (0.00% GC)
  mean time:        6.993 μs (0.00% GC)
  maximum time:     17.780 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> @benchmark mulCAtB!($C2, $At, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.800 μs (0.00% GC)
  median time:      6.820 μs (0.00% GC)
  mean time:        6.956 μs (0.00% GC)
  maximum time:     46.920 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> C1 ≈ C2
true

julia> @benchmark mulCAB!($C3, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.220 μs (0.00% GC)
  median time:      6.340 μs (0.00% GC)
  mean time:        6.361 μs (0.00% GC)
  maximum time:     20.560 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> @benchmark mulCAtB!($C4, $A', $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.180 μs (0.00% GC)
  median time:      6.200 μs (0.00% GC)
  mean time:        6.279 μs (0.00% GC)
  maximum time:     43.540 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> C1 ≈ C3
true

julia> C1 ≈ C4
true

Without fastmath (SIMDPirates 0.2.2):

julia> using LoopVectorization, BenchmarkTools

julia> @inline function mulCAB!(C, A, B)
           M, N = size(C); K = size(B,1)
           @assert size(C, 1) == size(A, 2)
           @assert size(C, 2) == size(B, 2)
           @assert size(A, 1) == size(B, 1)
           # When the @avx macro is available, this code is faster:
           z = zero(eltype(C))
           @avx for n in 1:size(C,2), m in 1:size(C,1)
               Cmn = z
               for k in 1:size(A,1)
                   Cmn += A[m,k] * B[k,n]
               end
               C[m,n] = Cmn
           end
           return C
       end
mulCAB! (generic function with 1 method)

julia> @inline function mulCAtB!(C, A, B)
           M, N = size(C); K = size(B,1)
           @assert size(C, 1) == size(A, 2)
           @assert size(C, 2) == size(B, 2)
           @assert size(A, 1) == size(B, 1)
           # When the @avx macro is available, this code is faster:
           z = zero(eltype(C))
           @avx for n in 1:size(C,2), m in 1:size(C,1)
               Cmn = z
               for k in 1:size(A,1)
                   Cmn += A[k,m] * B[k,n]
               end
               C[m,n] = Cmn
           end
           return C
       end
mulCAtB! (generic function with 1 method)

julia> N = 48;

julia> A = rand(N, N); B = rand(N,N);

julia> C1 = similar(A); C2 = similar(A);

julia> C3 = similar(A); C4 = similar(A);

julia> At = copy(A');

julia> @benchmark mulCAB!($C1, $At', $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.620 μs (0.00% GC)
  median time:      6.640 μs (0.00% GC)
  mean time:        6.724 μs (0.00% GC)
  maximum time:     23.380 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> @benchmark mulCAtB!($C2, $At, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.620 μs (0.00% GC)
  median time:      6.840 μs (0.00% GC)
  mean time:        6.865 μs (0.00% GC)
  maximum time:     15.740 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> C1 ≈ C2
true

julia> @benchmark mulCAB!($C3, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.943 μs (0.00% GC)
  median time:      5.300 μs (0.00% GC)
  mean time:        5.458 μs (0.00% GC)
  maximum time:     20.986 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

julia> @benchmark mulCAtB!($C4, $A', $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.828 μs (0.00% GC)
  median time:      5.000 μs (0.00% GC)
  mean time:        5.021 μs (0.00% GC)
  maximum time:     19.286 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

julia> C1 ≈ C3
true

julia> C1 ≈ C4
true

julia> using LinearAlgebra

julia> BLAS.set_num_threads(1)

julia> @benchmark mul!($C1, $At', $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.100 μs (0.00% GC)
  median time:      7.150 μs (0.00% GC)
  mean time:        7.231 μs (0.00% GC)
  maximum time:     21.275 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     4

julia> @benchmark mul!($C1, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.175 μs (0.00% GC)
  median time:      7.250 μs (0.00% GC)
  mean time:        7.589 μs (0.00% GC)
  maximum time:     64.250 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     4

julia> BLAS.vendor()
:openblas64

Anyway, this last example is what I wanted to show off about LoopVectorization 0.4:
writing mulCAB! and passing it A' as an argument will now automatically perform like mulCAtB!, and writing mulCAtB! and passing it At' will automatically perform like mulCAB!. =)
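
A minimal recap of that point, reusing the functions and arrays from the session above: both calls below receive an Adjoint-wrapped argument, compute the same product, and (per the timings above) land on a well-performing kernel.

mulCAB!(C1, At', B)   # the "A*B"-style loop, handed an Adjoint-wrapped array
mulCAtB!(C2, A', B)   # the "A'*B"-style loop, handed an Adjoint-wrapped array
C1 ≈ C2               # true: both end up computing A * B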

PetrKryslUCSD commented on August 15, 2024

Thanks for the thorough writeup. Good to know!
