Comments (5)
Thanks again for these reports.
Out of curiosity, roughly what percentage of the time do things work when you play around with the library?
This commit fixed your toy3! example. The toy2! example will be more difficult.
For one thing, I realized the library had been assuming loops start at the beginning of an array, regardless of the actual starting index (so offset arrays should have been supported). For now, however, it will throw an error if a loop doesn't start there.
It shouldn't require too much change to fix this.
But to get the example to work, I'll also have to decide on a strategy for dealing with peeling.
One idea would be to transform it into something like:
s = B[1,d1]*B[1,κ]
v[1:U*W] .= 0 # U is the number of SIMD vectors, W their width
for d2=2:U*W:d # handle remainder appropriately, as written it will often go out of bounds
@. v[1:U*W] += B[d2:d2+U*W-1,d1]*B[d2:d2+U*W-1,κ]
end
G[d1,κ] = s + sum(v)
which faithfully follows what the user wrote.
This will also generally be slower than not peeling, except perhaps when d % W == 1.
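The faithful-but-slow transformation can be sketched in plain Julia (without @avx), with the remainder clamped to the array bounds rather than going out of bounds. W and U here are illustrative stand-in constants, not values queried from the hardware:

```julia
# Plain-Julia sketch of the "faithful" peeled reduction: the first iteration
# is peeled into a scalar, the rest accumulate into a vector of partial sums.
# W (SIMD width) and U (unroll factor) are illustrative constants.
function peeled_dot(B, d1, κ; W = 4, U = 2)
    d = size(B, 1)
    s = B[1, d1] * B[1, κ]          # peeled first iteration
    v = zeros(eltype(B), U * W)     # vector of partial sums
    d2 = 2
    while d2 <= d
        n = min(U * W, d - d2 + 1)  # clamp the block instead of running out of bounds
        @views v[1:n] .+= B[d2:d2+n-1, d1] .* B[d2:d2+n-1, κ]
        d2 += U * W
    end
    return s + sum(v)
end

B = [Float64(i + 2j) for i in 1:9, j in 1:3]
naive = sum(B[k, 1] * B[k, 2] for k in 1:9)
peeled_dot(B, 1, 2) ≈ naive
```

The clamp `min(U*W, d - d2 + 1)` is one way to "handle the remainder appropriately" from the comment in the snippet above; a real implementation would more likely use masked loads.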
I think a much better transformation, matching the intention, would be more along the lines of:
v[1:U*W] .= B[1:U*W,d1] .* B[1:U*W,κ]
for d2=1+U*W:U*W:d # handle remainder appropriately, as written it will often go out of bounds
@. v[1:U*W] += B[d2:d2+U*W-1,d1]*B[d2:d2+U*W-1,κ]
end
G[d1,κ] = sum(v)
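The intention-matching version can likewise be sketched in plain Julia: the vector of partial sums is filled from the first U*W elements, the loop steps by U*W, and a single horizontal reduction happens at the end. W and U are again illustrative constants, and the remainder is clamped:

```julia
# Sketch of the intention-matching transformation: accumulate into U*W
# running sums from the very first block, reduce horizontally once at the end.
function unpeeled_dot(B, d1, κ; W = 4, U = 2)
    d = size(B, 1)
    v = zeros(eltype(B), U * W)
    d2 = 1
    while d2 <= d
        n = min(U * W, d - d2 + 1)  # clamp the final block to the array bounds
        @views v[1:n] .+= B[d2:d2+n-1, d1] .* B[d2:d2+n-1, κ]
        d2 += U * W
    end
    return sum(v)                   # one horizontal reduction at the end
end

B = [Float64(i * j) for i in 1:10, j in 1:3]
unpeeled_dot(B, 1, 2) ≈ sum(B[k, 1] * B[k, 2] for k in 1:10)
```

Compared with the peeled sketch, there is no scalar special case for the first element, which is why this form is generally faster.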
This is more difficult: the macro will have to recognize the pattern of N loop iterations being peeled off. But it is what it ought to do, and therefore what I'll go for.
I'll leave this issue open until I/someone else implements that.
An easier means of getting that behavior would be an @peel macro that lets you write a loop like in toy3!, with the @avx macro then performing the transformation to toy2!. That would save it from having to look for and identify the pattern.
from loopvectorization.jl.
Thanks! You asked how often things work when I use LoopVectorization; this morning was the first time that I was able to use the @avx macro in a function that I am trying to optimize without a compilation error. With your latest fix I get a consistent 5% speed improvement in a specific non-trivial call to the function. I'll look at the function again later today in more detail.
Thanks for your comments on toy2! and loops that start at an offset. In the code I am trying to optimize, I can often rewrite it so the offset isn't required, yet offset support does seem like a useful thing to have.
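The rewrite to avoid the offset can go two ways: shift the index so the loop itself starts at 1, or loop over a view that starts at the offset. A small sketch, where the offset 3 and the bounds are just for illustration:

```julia
# Two ways to rewrite a loop that starts at an offset so the loop starts at 1.
function offset_sum(x)    # original form: loop starts at 3
    s = 0.0
    for i in 3:10
        s += x[i]
    end
    return s
end

function shifted_sum(x)   # (a) shift the index inside the body
    s = 0.0
    for i in 1:8
        s += x[i + 2]
    end
    return s
end

function view_sum(x)      # (b) loop over a view starting at the offset
    xv = view(x, 3:10)
    s = 0.0
    for i in eachindex(xv)
        s += xv[i]
    end
    return s
end

x = collect(1.0:10.0)
offset_sum(x) == shifted_sum(x) == view_sum(x)  # all sum the same elements
```

The view variant keeps the loop body unchanged, which is usually the less error-prone rewrite.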
Thanks for looking at this!
This morning was the first time without a compilation error?
Sounds like it has a long way to go before it provides a good user experience.
On LoopVectorization 0.3.6:
julia> using LoopVectorization
julia> using BenchmarkTools
julia> function toy1!(G, B, κ)
           d = size(G,1)
           @inbounds for d1=1:d
               G[d1,κ] = B[1,d1]*B[1,κ]
               for d2=2:d
                   G[d1,κ] += B[d2,d1]*B[d2,κ]
               end
           end
       end
toy1! (generic function with 1 method)
julia> function toy2!(G, B, κ)
           d = size(G,1)
           @avx for d1=1:d
               G[d1,κ] = B[1,d1]*B[1,κ]
               for d2=2:d
                   G[d1,κ] += B[d2,d1]*B[d2,κ]
               end
           end
       end
toy2! (generic function with 1 method)
julia> function toy3!(G, B, κ)
           d = size(G,1)
           z = zero(eltype(G))
           @avx for d1=1:d
               G[d1,κ] = z
               for d2=1:d
                   G[d1,κ] += B[d2,d1]*B[d2,κ]
               end
           end
       end
toy3! (generic function with 1 method)
julia> N = 8
8
julia> B = randn(N, N);
julia> G1 = zeros(N, N).*NaN;
julia> G2 = similar(G1);
julia> G3 = similar(G1);
julia> toy1!(G1,B,1)
julia> toy2!(G2,B,1)
julia> toy3!(G3,B,1)
julia> @assert @views G1[:,1] ≈ G2[:,1]
julia> @assert @views G1[:,1] ≈ G3[:,1]
julia> @benchmark toy1!($G1,$B,1)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 33.883 ns (0.00% GC)
median time: 34.104 ns (0.00% GC)
mean time: 34.862 ns (0.00% GC)
maximum time: 69.998 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 993
julia> @benchmark toy2!($G1,$B,1)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 25.958 ns (0.00% GC)
median time: 26.100 ns (0.00% GC)
mean time: 26.140 ns (0.00% GC)
maximum time: 82.856 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 996
julia> @benchmark toy3!($G1,$B,1)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 19.281 ns (0.00% GC)
median time: 19.478 ns (0.00% GC)
mean time: 19.480 ns (0.00% GC)
maximum time: 21.995 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 997
I fixed the remaining problem by (a) allowing loops that start at values other than 1, and (b) making the handling of reductions more robust.
For now, I will close the issue because the examples work.
The peeled code is of course slower. If you'd like for the library to recognize peeling to optimize it specially and attempt to make it as fast as not peeling, feel free to file a new issue for that.
I'm hoping more loops will work without errors!
Thanks! This helps quite a bit. For some reason, the code I'm trying to optimize requires something closer to toy2!; @avx is now giving around a 10% time gain. I'd like to use it in even more places; I submitted another issue.
Thanks for your work on this!