Comments (2)
I got
julia> function test()
           A = randn(16, 64)
           C = randn(64, 64)
           for a = 1:1000
               At_mul_A!(C, A)
           end
       end
test (generic function with 1 method)
julia> @time test()
0.004525 seconds (3 allocations: 40.172 KiB)
julia> @time test()
0.004688 seconds (3 allocations: 40.172 KiB)
julia> @time test()
0.004536 seconds (3 allocations: 40.172 KiB)
julia> @time test()
0.004673 seconds (3 allocations: 40.172 KiB)
julia> @time foreach(_->test(),1:1000)
4.794809 seconds (5.07 k allocations: 39.368 MiB, 0.20% gc time, 0.09% compilation time)
julia> function test2()
           A = randn(64, 16)' # 16x64
           C = randn(64, 64)
           for a = 1:1000
               At_mul_A!(C, A)
           end
       end
test2 (generic function with 1 method)
julia> @time test2()
0.001879 seconds (3 allocations: 40.172 KiB)
julia> malloc(): invalid size (unsorted)
[7231] signal (6.-6): Aborted
in expression starting at none:0
If possible, it would be faster to use a different memory layout, as the timings show: test2 is over twice as fast as test.
That is, A_mul_At is faster than At_mul_A due to the better memory layout.
This surprises most people who think of a matrix multiply only as a bunch of dot products; that is, some people naively suspect that the best layout is the one that makes a dot product of two views as fast as possible. In fact, column-major A' * B is the worst of the four combinations; even A' * B' is better.
A' * A requires traversing memory much more quickly, which increases bandwidth requirements, and it requires reductions at the end of the k loop, which cost extra FLOPs. None of the other 3 orientations need this.
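To make the layout argument concrete, here is a minimal sketch in plain Julia (no @turbo) of the two loop orders being compared. The names At_mul_A_ref! and A_mul_At_ref! are hypothetical and are not part of LoopVectorization; they only illustrate the access patterns and are not tuned kernels.

```julia
using LinearAlgebra

# Dot-product form, C = A' * A. For a column-major K×N array, the inner
# loop over k ends in a scalar reduction for every C[m, n].
function At_mul_A_ref!(C, A)
    for n in axes(C, 2), m in axes(C, 1)
        Cmn = zero(eltype(C))
        for k in axes(A, 1)
            Cmn += A[k, m] * A[k, n]
        end
        C[m, n] = Cmn
    end
    return C
end

# Outer-product form, C = B * B', for a column-major N×K array B.
# The innermost loop walks contiguously down a column of C and of B,
# so no reduction is carried across it.
function A_mul_At_ref!(C, B)
    fill!(C, zero(eltype(C)))
    for k in axes(B, 2), n in axes(C, 2)
        Bnk = B[n, k]
        for m in axes(C, 1)
            C[m, n] += B[m, k] * Bnk
        end
    end
    return C
end

B = randn(64, 16)
A = B'   # lazy 16×64 adjoint that shares B's memory, as in test2

C1 = At_mul_A_ref!(zeros(64, 64), A)  # A' * A, which equals B * B'
C2 = A_mul_At_ref!(zeros(64, 64), B)  # B * B' written out directly
```

Both calls produce the same 64×64 product. Passing the adjoint A to the first kernel, as test2 does, means that A[k, m] reads B[m, k], so the same source loop ends up running over the friendlier layout.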
Anyway, I got a segfault, too.
I tried replacing indices with axes, and didn't get a segfault.
It might be a bug in the optimizations it does for indices.
julia> using LoopVectorization
julia> function At_mul_A!(C, A)
           @turbo for n in axes(C, 2), m in axes(C, 1)
               Cmn = zero(eltype(C))
               for k in axes(A, 1)
                   Cmn += A[k,m] * A[k,n]
               end
               C[m,n] = Cmn
           end
       end
At_mul_A! (generic function with 1 method)
julia> function test()
           A = randn(16, 64)
           C = randn(64, 64)
           for a = 1:1000
               At_mul_A!(C, A)
           end
       end
test (generic function with 1 method)
julia> @time test()
0.004558 seconds (3 allocations: 40.172 KiB)
julia> @time test()
0.004533 seconds (3 allocations: 40.172 KiB)
julia> @time test()
0.004505 seconds (3 allocations: 40.172 KiB)
julia> @time test()
0.004553 seconds (3 allocations: 40.172 KiB)
julia> @time foreach(_->test(),1:1000)
4.710555 seconds (12.46 k allocations: 39.802 MiB, 0.24% gc time, 0.08% compilation time)
julia> function test2()
           A = randn(64, 16)' # 16x64
           C = randn(64, 64)
           for a = 1:1000
               At_mul_A!(C, A)
           end
       end
test2 (generic function with 1 method)
julia> @time test2()
0.002044 seconds (3 allocations: 40.172 KiB)
julia> @time test2()
0.001877 seconds (3 allocations: 40.172 KiB)
julia> @time test2()
0.001880 seconds (3 allocations: 40.172 KiB)
julia> @time test2()
0.001868 seconds (3 allocations: 40.172 KiB)
julia> @time foreach(_->test2(),1:1000)
2.011197 seconds (5.07 k allocations: 39.368 MiB, 0.45% gc time, 0.20% compilation time)
from loopvectorization.jl.
Thanks, @chriselrod, especially for the tips on faster layouts. However, I also get the memory corruption when using axes instead of indices.