ERROR: UndefVarError: ####temporary#425_ not defined about loopvectorization.jl HOT 5 CLOSED

chrisvwx commented on July 16, 2024

ERROR: UndefVarError: ####temporary#425_ not defined

from loopvectorization.jl.

Comments (5)

chriselrod commented on July 16, 2024

Thanks again for these reports.
Out of curiosity, roughly what percentage of the time do things work when you play around with the library?

This commit fixed your toy3!.

toy2! will be more difficult.
For one thing, I realized the library had been assuming that loops start at the first index of an array, whatever that index is (so offset arrays should have been supported). For now, however, it will throw an error if a loop starts anywhere else.
It shouldn't require too much change to fix this.

But to get the example to work, I'll also have to decide a strategy to deal with peeling.
One idea would be to basically transform it into...

s = B[1,d1]*B[1,κ]
v[1:U*W] .= 0 # U is the number of SIMD vectors, W their width
for d2=2:U*W:d # handle remainder appropriately, as written it will often go out of bounds
    @. v[1:U*W] += B[d2:d2+U*W-1,d1]*B[d2:d2+U*W-1,κ]
end
G[d1,κ] = s + sum(v)

which faithfully follows what the user wrote.
This will also generally be slower than not peeling, except perhaps when d % W == 1.
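
The `d % W == 1` remark comes down to a little arithmetic: the peeled loop runs over `2:d`, which is `d - 1` iterations, so it divides evenly into SIMD vectors only when `d - 1` is a multiple of `W`. A minimal sketch of that count, where `W = 4` and the helper name `leftover` are assumptions for illustration and not part of the library:

```julia
# Scalar iterations left over after processing full vectors of width W.
# `leftover` is a hypothetical helper, used here for illustration only.
leftover(iterations, W) = iterations % W

W = 4  # assumed vector width (e.g. 4 Float64 values in a 256-bit register)

for d in (8, 9)
    unpeeled = leftover(d, W)      # loop over 1:d
    peeled   = leftover(d - 1, W)  # loop over 2:d, first iteration peeled off
    println("d = $d: unpeeled remainder = $unpeeled, peeled remainder = $peeled")
end
```

With these values, `d = 8` leaves no remainder when unpeeled but 3 when peeled, while `d = 9` (where `d % W == 1`) leaves 1 when unpeeled and none when peeled.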

I think a much better transformation, matching the intention, would be more along the lines of:

@. v[1:U*W] = B[1:U*W,d1]*B[1:U*W,κ] # the first U*W iterations seed the accumulators
for d2=1+U*W:U*W:d # handle remainder appropriately, as written it will often go out of bounds
    @. v[1:U*W] += B[d2:d2+U*W-1,d1]*B[d2:d2+U*W-1,κ]
end
G[d1,κ] = sum(v)

This is more difficult: it'll have to recognize the pattern of N-loop iterations getting peeled off. But it is what it ought to do, and therefore what I'll go for.
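
For a concrete picture of that second form, here is a minimal sketch in plain Julia that also handles the remainder. It does not use LoopVectorization and is not what `@avx` generates; the function names `dot_peeled` and `dot_chunked` and the defaults `U = 2`, `W = 4` are assumptions for illustration.

```julia
# Reference: the peeled reduction as written in toy2!, for a single (d1, κ) pair.
function dot_peeled(B, d1, κ)
    d = size(B, 1)
    s = B[1, d1] * B[1, κ]
    for d2 in 2:d
        s += B[d2, d1] * B[d2, κ]
    end
    return s
end

# Sketch of the "intended" form: a block of L = U*W accumulators is seeded from
# the first L products, then updated one full block at a time. Leftover
# iterations go through a scalar loop. U and W are assumed values, not queried
# from the hardware.
function dot_chunked(B, d1, κ; U = 2, W = 4)
    d = size(B, 1)
    L = U * W
    d < L && return dot_peeled(B, d1, κ)  # too short for even one full block
    v = B[1:L, d1] .* B[1:L, κ]           # peeled iterations seed the accumulators
    d2 = L + 1
    while d2 + L - 1 <= d                 # full blocks only, never out of bounds
        @views v .+= B[d2:d2+L-1, d1] .* B[d2:d2+L-1, κ]
        d2 += L
    end
    s = sum(v)
    for i in d2:d                         # scalar remainder
        s += B[i, d1] * B[i, κ]
    end
    return s
end

B = randn(19, 19)
@assert dot_chunked(B, 3, 1) ≈ dot_peeled(B, 3, 1)
```

The two functions sum the same products in a different order, so they agree only up to floating-point rounding, which is why the check uses `≈` rather than `==`.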

I'll leave this issue open until I/someone else implements that.

An easier means of getting that behavior would be an @peel macro that lets you write a loop like in toy3!, which will have the @avx macro perform the transformation to toy2!.
That would save it from having to look for and identify the pattern.

chrisvwx commented on July 16, 2024

Thanks! You asked how often things work when I use LoopVectorization; this morning was the first time that I was able to use the @avx macro in a function that I am trying to optimize without a compilation error. With your latest fix I get a consistent 5% speed improvement in a specific non-trivial call to the function. I'll look at the function again later today in more detail.

Thanks for your comments on toy2! and loops that start at an offset. In the code I am trying to optimize, I can often rewrite to not require the offset, yet it does seem like a useful thing to have.

Thanks for looking at this!

chriselrod commented on July 16, 2024

This morning was the first time without a compilation error?
Sounds like it has a long way to go before it provides a good user experience.

chriselrod commented on July 16, 2024

On LoopVectorization 0.3.6:

julia> using LoopVectorization

julia> using BenchmarkTools

julia> function toy1!(G, B,κ)
           d = size(G,1)
           @inbounds for d1=1:d
               G[d1,κ] = B[1,d1]*B[1,κ]
               for d2=2:d
                   G[d1,κ] += B[d2,d1]*B[d2,κ]
               end
           end
       end
toy1! (generic function with 1 method)

julia> function toy2!(G, B,κ)
           d = size(G,1)
           @avx for d1=1:d
               G[d1,κ] = B[1,d1]*B[1,κ]
               for d2=2:d
                   G[d1,κ] += B[d2,d1]*B[d2,κ]
               end
           end
       end
toy2! (generic function with 1 method)

julia> function toy3!(G, B,κ)
           d = size(G,1)
           z = zero(eltype(G))
           @avx for d1=1:d
               G[d1,κ] = z
               for d2=1:d
                   G[d1,κ] += B[d2,d1]*B[d2,κ]
               end
           end
       end
toy3! (generic function with 1 method)

julia> N = 8
8

julia> B = randn(N, N);

julia> G1 = zeros(N, N).*NaN;

julia> G2 = similar(G1);

julia> G3 = similar(G1);

julia> toy1!(G1,B,1)

julia> toy2!(G2,B,1)

julia> toy3!(G3,B,1)

julia> @assert @views G1[:,1] ≈ G2[:,1]

julia> @assert @views G1[:,1] ≈ G3[:,1]

julia> @benchmark toy1!($G1,$B,1)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     33.883 ns (0.00% GC)
  median time:      34.104 ns (0.00% GC)
  mean time:        34.862 ns (0.00% GC)
  maximum time:     69.998 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     993

julia> @benchmark toy2!($G1,$B,1)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     25.958 ns (0.00% GC)
  median time:      26.100 ns (0.00% GC)
  mean time:        26.140 ns (0.00% GC)
  maximum time:     82.856 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     996

julia> @benchmark toy3!($G1,$B,1)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     19.281 ns (0.00% GC)
  median time:      19.478 ns (0.00% GC)
  mean time:        19.480 ns (0.00% GC)
  maximum time:     21.995 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

I fixed the remaining problem by (a) allowing loops that start at values other than 1, and (b) making the handling of reductions more robust.

For now, I will close the issue because the examples work.
The peeled code is of course slower. If you'd like the library to recognize peeling and optimize it specially, so that it runs as fast as the unpeeled version, feel free to file a new issue for that.

I'm hoping more loops will work without errors!

chrisvwx commented on July 16, 2024

Thanks!

This helps quite a bit. For some reason the code I'm trying to optimize requires something closer to toy2!; @avx is now giving around a 10% time gain. I'd like to use it in even more places; I submitted another issue.

Thanks for your work on this.
