
chriselrod commented on July 16, 2024

Thanks again for these reports.
Out of curiosity, roughly what percentage of the time do things work when you play around with the library?

This commit fixed your toy3!.

toy2! will be more difficult.
For one thing, I realized the library had been assuming that loops start at the first index of an array, whatever that index is (so offset arrays should have been supported). For now, however, it will throw an error if a loop does not start there.
It shouldn't require too much change to fix this.

But to get the example to work, I'll also have to decide a strategy to deal with peeling.
One idea would be to basically transform it into...

s = B[1,d1]*B[1,κ]
v[1:U*W] .= 0 # U is the number of SIMD vectors, W their width
for d2=2:U*W:d # handle remainder appropriately, as written it will often go out of bounds
    @. v[1:U*W] += B[d2:d2+U*W-1,d1]*B[d2:d2+U*W-1,κ]
end
G[d1,κ] = s + sum(v)

which faithfully follows what the user wrote.
This will also generally be slower than not peeling, except perhaps when d % W == 1.
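
To make the literal, peeled transformation concrete, here is a minimal plain-Julia sketch with the remainder handled explicitly. It is only a model of the idea, not what the library generates: `peeled_dot` is a hypothetical helper name, and the single parameter `UW` stands in for `U*W`.

```julia
# Hypothetical helper (not a LoopVectorization API): a plain-Julia model of the
# literal, peeled transformation. UW plays the role of U*W.
function peeled_dot(B::AbstractMatrix, d1::Integer, κ::Integer, UW::Integer = 8)
    d = size(B, 1)
    s = B[1, d1] * B[1, κ]            # peeled first iteration, kept scalar
    v = zeros(eltype(B), UW)          # stands in for U SIMD vectors of width W
    d2 = 2
    while d2 + UW - 1 <= d            # full chunks only, so no out-of-bounds reads
        @views v .+= B[d2:d2+UW-1, d1] .* B[d2:d2+UW-1, κ]
        d2 += UW
    end
    s += sum(v)
    while d2 <= d                     # scalar remainder loop
        s += B[d2, d1] * B[d2, κ]
        d2 += 1
    end
    return s
end
```

Because the chunked loop starts at index 2, its chunks are shifted by one element relative to the start of the column, and the first element is processed on its own.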

I think a much better transformation, matching the intention, would be more along the lines of:

v[1:U*W] .= B[1:U*W,d1] .* B[1:U*W,κ]
for d2=1+U*W:U*W:d # handle remainder appropriately, as written it will often go out of bounds
    @. v[1:U*W] += B[d2:d2+U*W-1,d1]*B[d2:d2+U*W-1,κ]
end
G[d1,κ] = sum(v)

This is more difficult: it'll have to recognize the pattern of N-loop iterations getting peeled off. But it is what it ought to do, and therefore what I'll go for.
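
As a rough illustration of what the unpeeled form computes once the remainder is handled, here is a minimal plain-Julia sketch. It does not attempt the pattern recognition described above: `chunked_dot` is a hypothetical helper name, and `UW` stands in for `U*W`.

```julia
# Hypothetical helper (not a LoopVectorization API): a plain-Julia model of the
# unpeeled transformation, where every full chunk goes through the accumulator.
function chunked_dot(B::AbstractMatrix, d1::Integer, κ::Integer, UW::Integer = 8)
    d = size(B, 1)
    v = zeros(eltype(B), UW)          # stands in for U SIMD vectors of width W
    nfull = d - d % UW                # last index covered by full chunks
    for d2 in 1:UW:nfull
        @views v .+= B[d2:d2+UW-1, d1] .* B[d2:d2+UW-1, κ]
    end
    s = sum(v)
    for d2 in nfull+1:d               # scalar remainder loop
        s += B[d2, d1] * B[d2, κ]
    end
    return s
end
```

Here the first element is folded into the first chunk instead of being computed separately, so no scalar work is needed before the chunked loop.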

I'll leave this issue open until I/someone else implements that.

An easier means of getting that behavior would be an @peel macro that lets you write a loop like in toy3!, which will have the @avx macro perform the transformation to toy2!.
That would save it from having to look for and identify the pattern.

from loopvectorization.jl.

chrisvwx commented on July 16, 2024

Thanks! You asked how often things work when I use LoopVectorization; this morning was the first time that I was able to use the @avx macro in a function that I am trying to optimize without a compilation error. With your latest fix I get a consistent 5% speed improvement in a specific non-trivial call to the function. I'll look at the function again later today in more detail.

Thanks for your comments on toy2! and loops that start at an offset. In the code I am trying to optimize, I can often rewrite to not require the offset, yet it does seem like a useful thing to have.

Thanks for looking at this!

chriselrod commented on July 16, 2024

This morning was the first time without a compilation error?
Sounds like it has a long way to go before it provides a good user experience.

chriselrod commented on July 16, 2024

On LoopVectorization 0.3.6:

julia> using LoopVectorization

julia> using BenchmarkTools

julia> function toy1!(G, B,κ)
           d = size(G,1)
           @inbounds for d1=1:d
               G[d1,κ] = B[1,d1]*B[1,κ]
               for d2=2:d
                   G[d1,κ] += B[d2,d1]*B[d2,κ]
               end
           end
       end
toy1! (generic function with 1 method)

julia> function toy2!(G, B,κ)
           d = size(G,1)
           @avx for d1=1:d
               G[d1,κ] = B[1,d1]*B[1,κ]
               for d2=2:d
                   G[d1,κ] += B[d2,d1]*B[d2,κ]
               end
           end
       end
toy2! (generic function with 1 method)

julia> function toy3!(G, B,κ)
           d = size(G,1)
           z = zero(eltype(G))
           @avx for d1=1:d
               G[d1,κ] = z
               for d2=1:d
                   G[d1,κ] += B[d2,d1]*B[d2,κ]
               end
           end
       end
toy3! (generic function with 1 method)

julia> N = 8
8

julia> B = randn(N, N);

julia> G1 = zeros(N, N).*NaN;

julia> G2 = similar(G1);

julia> G3 = similar(G1);

julia> toy1!(G1,B,1)

julia> toy2!(G2,B,1)

julia> toy3!(G3,B,1)

julia> @assert @views G1[:,1] ≈ G2[:,1]

julia> @assert @views G1[:,1] ≈ G3[:,1]

julia> @benchmark toy1!($G1,$B,1)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     33.883 ns (0.00% GC)
  median time:      34.104 ns (0.00% GC)
  mean time:        34.862 ns (0.00% GC)
  maximum time:     69.998 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     993

julia> @benchmark toy2!($G1,$B,1)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     25.958 ns (0.00% GC)
  median time:      26.100 ns (0.00% GC)
  mean time:        26.140 ns (0.00% GC)
  maximum time:     82.856 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     996

julia> @benchmark toy3!($G1,$B,1)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     19.281 ns (0.00% GC)
  median time:      19.478 ns (0.00% GC)
  mean time:        19.480 ns (0.00% GC)
  maximum time:     21.995 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

I fixed the remaining problem by (a) allowing loops that start at values other than 1, and (b) making the handling of reductions more robust.

For now, I will close the issue because the examples work.
The peeled code is of course slower. If you'd like the library to recognize peeling and optimize it specially, so that it is as fast as the unpeeled version, feel free to file a new issue for that.

I'm hoping more loops will work without errors!

chrisvwx commented on July 16, 2024

Thanks!

This helps quite a bit. For some reason the code I'm trying to optimize requires something closer to toy2!; @avx is now giving around a 10% time gain. I'd like to use it in even more places; I submitted another issue.

Thanks for your work on this
