Okay, so I played around a bit more and implemented a very stupid early version of `@avx` with a `@generated` function:
```julia
using MacroTools: prewalk
using LoopVectorization: LoopVectorization, LoopSet, lower
#----------------------------------------------------------------------------------------------------
struct Ex{T, Tup} end
function to_type(ex::Expr)
    Ex{ex.head, Tuple{to_type.(ex.args)...}}
end
to_type(x) = x
to_type(::LineNumberNode) = nothing
#----------------------------------------------------------------------------------------------------
to_expr(ex::Type{Ex{Head, Tup}}) where {Head, Tup} = Expr(Head, (to_expr(x) for x in Tup.parameters)...)
to_expr(x) = x
#----------------------------------------------------------------------------------------------------
function find_vars_and_gensym!(ex::Expr, vars::Dict{Symbol, Symbol}, ivars::Vector{Symbol})
    if ex.head == :(=) && ex.args[1] isa Symbol
        push!(ivars, ex.args[1])
    elseif ex.head == :call
        push!(ivars, ex.args[1])
    end
    ex
end
function find_vars_and_gensym!(x::Symbol, vars::Dict{Symbol, Symbol}, ivars::Vector{Symbol})
    if (x ∉ keys(vars)) && (x ∉ ivars)
        gx = gensym(x)
        push!(vars, x => gx)
        gx
    else
        x
    end
end
find_vars_and_gensym!(x, vars::Dict{Symbol, Symbol}, ivars::Vector{Symbol}) = x
#----------------------------------------------------------------------------------------------------
nt(keys, vals) = NamedTuple{keys, typeof(vals)}(vals)
macro _avx(ex)
    D = Dict{Symbol, Symbol}()
    ivars = Symbol[]
    gex = prewalk(x -> find_vars_and_gensym!(x, D, ivars), ex)
    type_ex = to_type(gex)
    tvars = Tuple(keys(D))
    tgvars = Tuple(values(D))
    quote
        kwargs = nt($(QuoteNode(tgvars)), $(Expr(:tuple, tvars...)))
        $(Expr(:tuple, tvars...)) = _avx($(QuoteNode(type_ex)), kwargs)
    end |> esc
end
@generated function _avx(::Type{ex_t}, var_nt::NamedTuple{keys, var_types}) where {ex_t <: Ex, keys, var_types}
    ex = to_expr(ex_t)
    var_defs = Expr(:block)
    for k in keys
        push!(var_defs.args, :($k = var_nt[$(QuoteNode(k))]))
    end
    quote
        $var_defs
        $(lower(LoopSet(ex)))
        $(Expr(:tuple, keys...))
    end
end
```
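Since the expression-to-type trick is the heart of this design, here is a minimal standalone round-trip sketch of the same idea (identifiers renamed so it doesn't clash with the definitions above):

```julia
# Minimal round-trip: encode an Expr as a type, then decode it back.
struct ExT{Head, Tup} end   # standalone stand-in for the `Ex` above

ex2type(ex::Expr) = ExT{ex.head, Tuple{map(ex2type, ex.args)...}}
ex2type(x) = x              # symbols and literals embed directly as parameters

type2ex(::Type{ExT{Head, Tup}}) where {Head, Tup} =
    Expr(Head, (type2ex(x) for x in Tup.parameters)...)
type2ex(x) = x

t = ex2type(:(a[i] * b[i]))    # a plain type carrying the whole expression
type2ex(t) == :(a[i] * b[i])   # true: the expression survives the round trip
```

Because the whole expression lives in a type parameter, a `@generated` function can recover it at compile time with zero runtime cost.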
It technically does work, though something is causing a big performance bug:
```julia
using BenchmarkTools  # for @btime

function mydot(a, b)
    s = 0.0
    for i ∈ eachindex(a,b)
        s += a[i]*b[i]
    end
    s
end
function mydotavx(a, b)
    s = 0.0
    @_avx for i ∈ eachindex(a,b)
        s += a[i]*b[i]
    end
    s
end
a = rand(10); b = rand(10);
@btime mydot($a, $b)     # 8.521 ns (0 allocations: 0 bytes)
@btime mydotavx($a, $b)  # 2.321 μs (30 allocations: 1.03 KiB)
@show mydotavx(a, b) ≈ mydot(a, b) # true
```
I think the problem is that I'm causing arrays to be reallocated a bunch, since the performance cost is scaling linearly with the size of `a` and `b`.
The reason I need the weird, convoluted design above is that, unlike a macro, variables defined outside a `@generated` function won't be visible inside the generated function and vice versa. So I end up turning the expression

```julia
@_avx for i ∈ eachindex(a,b)
    s += a[i]*b[i]
end
```

into

```julia
a, b, s = _avx(#= a type encoding the expression =#, (_a=a, _b=b, _s=s))
```

and the generated function `_avx` does the normal `@avx` stuff and then returns `(a, b, s)`. This way, any overwriting of variables such as `s += a[i]*b[i]` gets communicated back to the enclosing scope.
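The value-threading pattern itself can be shown in isolation; `passthrough` below is a made-up name for illustration, not part of LoopVectorization:

```julia
# A generated function cannot read or write the caller's locals, so the
# macro passes them in as a NamedTuple and reassigns them from the result.
@generated function passthrough(nt::NamedTuple{K}) where {K}
    quote
        $(Expr(:tuple, K...)) = values(nt)   # unpack into local bindings
        s += 1                               # stand-in for the loop body updating `s`
        $(Expr(:tuple, K...))                # hand the (possibly updated) values back
    end
end

a, s = passthrough((a = 2, s = 10))   # now s == 11 in the caller
```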
Yikes. I'm really feeling like I'm shoving a square peg in a round hole. After being quite gung-ho about the generated function approach, I'm starting to have concerns.
from loopvectorization.jl.
We will need to ensure that, whatever approach we take, the compiler can compile it all away so that it does not introduce any runtime overhead.
If you could point out places where the syntax transformations occur, that would accelerate the process.
There are 3 basic steps:
1. Create a `LoopSet` object.
2. Determine the strategy for converting the `LoopSet` back to an expression.
3. Convert it back into an expression.
Once you have the `LoopSet`, calling `lower` performs steps 2 and 3. All we need for a new "frontend" is step 1.
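Put together, for the dot-product loop the three steps look like this (assuming a LoopVectorization version that exposes `LoopSet` and `lower` as used earlier in this thread):

```julia
using LoopVectorization: LoopSet, lower

dotq = :(for i ∈ eachindex(a, b)
    s += a[i] * b[i]
end)

ls = LoopSet(dotq)   # step 1: parse the loop nest into a LoopSet
body = lower(ls)     # steps 2 and 3: pick a strategy and emit an Expr
```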
When taking the generated function approach (like in broadcasting), we can subdivide it into two parts:
a) A macro that transforms the expression into a call to a generated function.
b) The generated function, which needs to be able to construct the `LoopSet` using only type information; the transform and call in (a) must provide all of this.
You can see an example of (b) in the broadcast code. The type signature of a broadcasted object (before creating axes, which the macro skips) is `Broadcasted{S,Nothing,F,A}`: `S` is the style, `F` the function, and `A` a tuple of all the arguments to the function. `vmaterialize!` recursively calls `add_broadcast!` on the type parameters of `A`, until `A` is something other than a broadcast, like a scalar or an array. Each called function generates code to extract objects from the broadcasted struct, and adds operations as necessary, such as loads from arrays or computation. The functions they call to add operations to the `LoopSet` are in the graphs file.
Regarding (a), a straightforward approach may be to walk the loop expression, transforming it into nested tuples and wrapping each symbol in a `Val`.
```julia
julia> dotq = :(for i ∈ eachindex(a,b)
           s += a[i]*b[i]
       end)
:(for i = eachindex(a, b)
      #= REPL[46]:2 =#
      s += a[i] * b[i]
  end)

julia> dump(dotq)
Expr
  head: Symbol for
  args: Array{Any}((2,))
    1: Expr
      head: Symbol =
      args: Array{Any}((2,))
        1: Symbol i
        2: Expr
          head: Symbol call
          args: Array{Any}((3,))
            1: Symbol eachindex
            2: Symbol a
            3: Symbol b
    2: Expr
      head: Symbol block
      args: Array{Any}((2,))
        1: LineNumberNode
          line: Int64 2
          file: Symbol REPL[46]
        2: Expr
          head: Symbol +=
          args: Array{Any}((2,))
            1: Symbol s
            2: Expr
              head: Symbol call
              args: Array{Any}((3,))
                1: Symbol *
                2: Expr
                3: Expr
```
Perhaps something like:
```julia
struct For{T<:Tuple} end
struct Block{T<:Tuple} end
struct Equal{T<:Tuple} end
struct Call{T<:Tuple} end
struct ArrayRef{T<:Tuple} end
# implement macro that transforms `dotq` into:
looptype = For{Tuple{Equal{Tuple{:i,Call{Tuple{:eachindex,:a,:b}}()}}(),Block{Tuple{Equal{Tuple{:s,Call{Tuple{:muladd,ArrayRef{Tuple{:a,:i}}(),ArrayRef{Tuple{:b,:i}}(),:s}}()}}()}}()}}()
# implement a generated function that can transform the above into
lsdot = LoopVectorization.LoopSet(dotq);
```
The generated function would be called on `looptype`, and would also receive all variables referenced (in this case, just `a`, `b`, and `s` -- although we may want to handle reduction variables like `s` specially). We would define a canonical ordering of these variables (perhaps `sort([:s,:a,:b])`) so that the generated function can map each to the corresponding symbol in the looptype.
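The canonical-ordering idea in miniature (purely illustrative):

```julia
# Sorting gives the macro and the generated function the same deterministic
# variable order, so positional arguments can be mapped back to the symbols
# embedded in the loop type.
syms = sort([:s, :a, :b])                         # [:a, :b, :s]
position = Dict(s => i for (i, s) in enumerate(syms))
position[:s]                                      # 3: `s` arrives as the third argument
```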
Would you like to give it a go, or wait until I get around to it?
A couple parting notes:

- I've been using `contract_pass` to fuse multiply-adds before converting expressions into `LoopSet`s, and to eliminate `+=` and friends. There are two chief reasons for this: (1) because I haven't implemented fusion in the cost models of LoopVectorization yet, it currently considers `a * b + c` to take twice as long as `muladd(a,b,c)`, and (2) while the compiler can automatically fuse the operations, doing so also adds a bunch of `vmovapd` instructions to gemm kernels that it does not add with `muladd` calls that forbid compiler reordering. It's been a few months since I last checked for that LLVM bug.
- No new code is allowed to use `MacroTools`, and `MacroTools` should be removed from any old code getting touched, if possible. I'm getting serious about compile-time performance. Some old code of mine took 40+ seconds to compile; `MacroTools` is one of the culprits. I'm not willing to sacrifice runtime performance, but I am willing to sacrifice a little bit of convenience. I'm open to any advice on reducing compile times. I think using lots of small type-stable functions is key, but I'm no expert on that.
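As a toy illustration of the first note, a minimal `+=`-to-`muladd` rewrite might look like the following; the real `contract_pass` is more general, and `fuse_muladd` is a made-up name:

```julia
# Rewrite `s += a*b` into `s = muladd(a, b, s)` so the cost model sees one
# fused operation instead of a multiply followed by an add; otherwise just
# eliminate the `+=`.
function fuse_muladd(ex)
    if ex isa Expr && ex.head == :(+=)
        lhs, rhs = ex.args
        if rhs isa Expr && rhs.head == :call && rhs.args[1] == :* && length(rhs.args) == 3
            return :($lhs = muladd($(rhs.args[2]), $(rhs.args[3]), $lhs))
        end
        return :($lhs = $lhs + $rhs)   # no fusion applies; still drop the `+=`
    end
    return ex
end

fuse_muladd(:(s += a[i] * b[i]))   # :(s = muladd(a[i], b[i], s))
```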
Okay, so I played around with this today and here's what I've got so far:
```julia
struct Ex{T, Tup} end
function to_type!(ex::Expr, vars::Vector{Symbol}, ivars::Vector{Symbol})
    if ex.head == :(=) && ex.args[1] isa Symbol
        push!(ivars, ex.args[1])
    end
    Ex{ex.head, Tuple{to_type!.(ex.args, Ref(vars), Ref(ivars))...}}
end
function to_type!(x, vars::Vector{Symbol}, ivars::Vector{Symbol})
    x
end
to_type!(::LineNumberNode, vars::Vector{Symbol}, ivars::Vector{Symbol}) = nothing
function to_type!(x::Symbol, vars::Vector{Symbol}, ivars::Vector{Symbol})
    x ∉ vars && x ∉ ivars && push!(vars, x)
    x
end
function to_expr(ex::Type{Ex{Head, Tup}}) where {Head, Tup}
    Expr(Head, (to_expr(x) for x in Tup.parameters)...)
end
to_expr(x) = x
nt(keys, vals) = NamedTuple{keys, typeof(vals)}(vals)
macro avx_parse(ex)
    vars = Symbol[]
    ivars = Symbol[]
    type_ex = to_type!(ex, vars, ivars)
    tvars = Tuple(vars)
    qtivars = QuoteNode(Tuple(ivars))
    quote
        $(QuoteNode(type_ex)), nt($(QuoteNode(tvars)), (tuple($(vars...)))), $qtivars
    end |> esc
end
```
```julia
julia> a = rand(5); b = randn(5);

julia> ex_t, vars, ivars = @avx_parse begin
           s = 0.0
           for i ∈ eachindex(a,b)
               s += a[i]*b[i]
           end
       end;

julia> ex_t
Ex{:block,Tuple{nothing,Ex{:(=),Tuple{:s,0.0}},nothing,Ex{:for,Tuple{Ex{:(=),Tuple{:i,Ex{:call,Tuple{:eachindex,:a,:b}}}},Ex{:block,Tuple{nothing,Ex{:+=,Tuple{:s,Ex{:call,Tuple{:*,Ex{:ref,Tuple{:a,:i}},Ex{:ref,Tuple{:b,:i}}}}}}}}}}}}

julia> vars
(eachindex = eachindex, a = [0.864891273566464, 0.016615952609847495, 0.24528618468148156, 0.9045614698725608, 0.24968365320302022], b = [0.10359624603037468, 0.7595643433938144, -0.8973765174693641, 0.20134830886960015, 0.49462379671835727], * = *)

julia> ivars
(:s, :i)

julia> to_expr(ex_t)
quote
    nothing
    s = 0.0
    nothing
    for i = eachindex(a, b)
        nothing
        s += a[i] * b[i]
    end
end
```
Currently, I'm just converting `LineNumberNode`s to `nothing`, but I could strip them out altogether or lift them to the type domain, depending on your preferences. I didn't try stripping them out yet because my go-to solution for that is `MacroTools.striplines`.
As a high-level summary, the above code can take (I believe) a completely arbitrary expression, convert it into a type, and then output a named tuple matching variables from the expression, excluding the variables which are assigned within the expression (such as `i` and `s`), and putting those internal variables in their own tuple without values.
My plan is that this macro would then pass its three outputs to a generated function, which would call `to_expr` on the `Ex` type to turn it back into an expression, and use the type of the named tuple (e.g. `NamedTuple{(:eachindex, :a, :b, :*), Tuple{typeof(eachindex),Array{Float64,1},Array{Float64,1},typeof(*)}}`) to make choices depending on the eltype of the input data.
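As a sketch of that last step (with an illustrative, made-up function name): inside a generated function, the NamedTuple's field types are compile-time information, so eltype-based choices cost nothing at runtime:

```julia
# `T` (the NamedTuple's value types) is available while the body is being
# generated, so the eltypes are computed once at compile time and spliced
# in as a constant.
@generated function arg_eltypes(nt::NamedTuple{K, T}) where {K, T}
    ts = (T.parameters...,)     # concrete field types, known at generation time
    :( $(map(eltype, ts)) )     # splice the answer in as a literal tuple
end

arg_eltypes((a = [1.0, 2.0], n = 3))   # (Float64, Int64)
```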
Functions such as `eachindex`, `*`, etc. can be filtered out, but it could be nice to have them there just so we can throw an error as soon as we see an unsupported function being used in an `@avx` block.
What are your thoughts on this approach so far? Any questions or concerns?
Thanks for this. I'm preoccupied with unfortunate personal affairs until Sunday, so it will be a little while until I can give this a serious look (although I'm still trying to fix the other recent issue as well at the moment).
I assume the `@code_warntype` of the generated function looks fine?
I would only return the `s`, and not the arrays. I'd get these from the `LoopSet` object directly. `ls.outer_reductions` gives their ids, so something like `[ls.operations[i].var for i in ls.outer_reductions]` should give you the set of all the bindings that changed.
30 is a lot of allocations, though. More than I'd expect from just constructing a tuple with two arrays in it. What happens if you inline `_avx`? And try the canonical ordering approach (e.g., sorting the symbols) instead of kwargs?
> Thanks for this. I'm preoccupied with unfortunate personal affairs until Sunday, so it will be a little while until I can give this a serious look

No problem, take your time.

> I assume the `@code_warntype` of the generated function looks fine?

Actually, it doesn't. Strangely, `s` of all variables is inferred as `Any`.
```julia
julia> @code_warntype mydotavx(a, b)
Variables
  #self#::Core.Compiler.Const(mydotavx, false)
  a@_2::Array{Float64,1}
  b@_3::Array{Float64,1}
  kwargs::NamedTuple{(Symbol("##a#402"), Symbol("##b#403"), Symbol("##s#404")),Tuple{Array{Float64,1},Array{Float64,1},Float64}}
  @_5::Int64
  s::Any
  a@_7::Array{Float64,1}
  b@_8::Array{Float64,1}

Body::Any
1 ─ (b@_8 = b@_3)
│   (a@_7 = a@_2)
│   (s = Main.zero(Main.Float64))
│   %4 = Core.tuple(a@_7, b@_8, s::Core.Compiler.Const(0.0, false))::Core.Compiler.PartialStruct(Tuple{Array{Float64,1},Array{Float64,1},Float64}, Any[Array{Float64,1}, Array{Float64,1}, Core.Compiler.Const(0.0, false)])
│   (kwargs = Main.nt($(QuoteNode((Symbol("##a#402"), Symbol("##b#403"), Symbol("##s#404")))), %4))
│   %6 = Main._avx($(QuoteNode(Ex{:for,Tuple{Ex{:(=),Tuple{:i,Ex{:call,Tuple{:eachindex,Symbol("##a#402"),Symbol("##b#403")}}}},Ex{:block,Tuple{nothing,Ex{:+=,Tuple{Symbol("##s#404"),Ex{:call,Tuple{:*,Ex{:ref,Tuple{:a,:i}},Ex{:ref,Tuple{:b,:i}}}}}}}}}})), kwargs)::Tuple{Array{Float64,1},Array{Float64,1},Any}
│   %7 = Base.indexed_iterate(%6, 1)::Core.Compiler.PartialStruct(Tuple{Array{Float64,1},Int64}, Any[Array{Float64,1}, Core.Compiler.Const(2, false)])
│   (a@_7 = Core.getfield(%7, 1))
│   (@_5 = Core.getfield(%7, 2))
│   %10 = Base.indexed_iterate(%6, 2, @_5::Core.Compiler.Const(2, false))::Core.Compiler.PartialStruct(Tuple{Array{Float64,1},Int64}, Any[Array{Float64,1}, Core.Compiler.Const(3, false)])
│   (b@_8 = Core.getfield(%10, 1))
│   (@_5 = Core.getfield(%10, 2))
│   %13 = Base.indexed_iterate(%6, 3, @_5::Core.Compiler.Const(3, false))::Core.Compiler.PartialStruct(Tuple{Any,Int64}, Any[Any, Core.Compiler.Const(4, false)])
│   (s = Core.getfield(%13, 1))
└── return s
```
If I make it put type assertions on all the variables before they're returned, the `@code_warntype` for `mydotavx` is fine, but performance doesn't improve. I think that's because there's still a type instability inside `_avx`. Bizarre.
> I would only return the `s`, and not the arrays. I'd get these from the `LoopSet` object directly. `ls.outer_reductions` gives their ids, so something like `[ls.operations[i].var for i in ls.outer_reductions]` should give you the set of all the bindings that changed.
Hm, so that will require some analysis of the loop prior to the generated function, so that we know which variables need to be splatted back into. Determining which bindings change doesn't require type info, so I guess it'd be okay to hoist that analysis to the macro.
I think the discussion here is starting to go beyond an issue. I'll open a draft PR.
I rewrote everything, but I don't have the impression that it actually helped compile times at all. I'd probably have to go through and benchmark/`@code_warntype` the pieces. The goal is to both minimize the type-unstable regions and move as quickly as possible to concretely typed, dynamically sized arrays.
`@avx` is now the generated function, while `@_avx` rewrites the raw for loops. This is to reflect the general preference for using `@avx` over using `@_avx`.
I have added support for specialization on adjoint/transpose, as well as for using size information. The API for size info is to overload `maybestaticsize` and/or `maybestaticlength`:
```julia
using StaticArrays, LoopVectorization
LoopVectorization.maybestaticsize(::MArray{<:Tuple{M,Vararg}}, ::Val{1}) where {M} = LoopVectorization.Static{M}()
LoopVectorization.maybestaticsize(::MArray{<:Tuple{M,N,Vararg}}, ::Val{2}) where {M,N} = LoopVectorization.Static{N}()
```
If you define loop bounds as `1:size(A,2)`, it will substitute the `size(A,2)` with `maybestaticsize(A, Val(2))` (which has the fallback definition `size(A,2)`).
The static size parameters must be placed on loop bounds. If you define variables that the macro can't see, it naturally won't be able to pick them up.
It otherwise also cannot infer loop bounds from the types of the arrays, because while loop bounds will normally match the size of the array axis, that isn't true in general.
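The overload pattern can be sketched generically; `FixedMatrix` below is a hypothetical statically sized array type, and this `Static` merely mirrors (but is not) `LoopVectorization.Static`:

```julia
# Trait-style size query: the generic fallback returns the runtime size,
# while array types that carry sizes in their type parameters overload it
# to return a compile-time token instead.
struct Static{N} end

maybestaticsize(A, ::Val{D}) where {D} = size(A, D)   # fallback: plain runtime size

# Hypothetical statically sized wrapper, to show the overload:
struct FixedMatrix{M, N, T}
    data::Matrix{T}
end
maybestaticsize(::FixedMatrix{M, N}, ::Val{1}) where {M, N} = Static{M}()
maybestaticsize(::FixedMatrix{M, N}, ::Val{2}) where {M, N} = Static{N}()

maybestaticsize(rand(3, 4), Val(2))                              # 4 (plain Int)
maybestaticsize(FixedMatrix{3, 4, Float64}(rand(3, 4)), Val(2))  # Static{4}()
```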
I will close this, so that future issues on this theme can focus more on specific features / ways to take advantage of this.