
Comments (5)

chriselrod commented on July 16, 2024
julia> function foo_debug(y, x, b)
           return LoopVectorization.@turbo_debug for i in axes(y, 1), c in axes(y, 2), n in axes(y, 3)
               y[i, c, n] = x[i, c, n] + b[c]
           end
           return nothing
       end
foo_debug (generic function with 1 method)

julia> ls = foo_debug(y,x,b);

julia> LoopVectorization.choose_order(ls)
([:c, :n, :i], :n, Symbol("##undefined##"), :i, 4, -1)

It was ordering the loops c, n, i, probably to try to minimize the number of times you reload from b.
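The original foo! and foo2! definitions aren't quoted in this thread, but the two traversal orders under discussion look roughly like this (hypothetical loop nests written as plain loops just to illustrate the access patterns; not the actual functions from the issue):

# c outermost (the order @turbo chose): b[c] is loaded only once per c, but for
# each c the loops sweep across all of y and x while touching only a small
# contiguous chunk (the i-extent) out of every block they pass over.
function walk_c_outer!(y, x, b)
    for c in axes(y, 2), n in axes(y, 3), i in axes(y, 1)
        y[i, c, n] = x[i, c, n] + b[c]
    end
    return nothing
end

# n outermost (the "nicer" order): y and x are traversed in memory order
# (i fastest, then c, then n), at the cost of reloading b[c] for every (n, c) pair.
function walk_n_outer!(y, x, b)
    for n in axes(y, 3), c in axes(y, 2), i in axes(y, 1)
        y[i, c, n] = x[i, c, n] + b[c]
    end
    return nothing
end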
So foo! does use fewer CPU instructions than foo2!:

julia> using LinuxPerf

julia> foreachf(f::F, N::Int, arg1::A, args::Vararg{Any,K}) where {F,A,K} = foreach(_ -> Base.donotdelete(f(arg1, args...)), 1:N)
foreachf (generic function with 1 method)

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(foo!, 10_000, y, x, b)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               6.11e+09   33.3%  #  4.0 cycles per ns
└ task-clock               1.54e+09   33.3%  #  1.5 s
┌ instructions             7.56e+08   66.7%  #  0.1 insns per cycle
│ branch-instructions      2.38e+07   66.7%  #  3.2% of insns
└ branch-misses            2.47e+04   66.7%  #  0.1% of branch insns
┌ L1-dcache-load-misses    6.69e+08   33.3%  # 200.9% of dcache loads
│ L1-dcache-loads          3.33e+08   33.3%
│ cache-misses             1.26e+05   33.3%  #  0.0% of cache refs
└ cache-references         6.51e+08   33.3%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(foo2!, 10_000, y, x, b)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               6.11e+09   33.3%  #  4.0 cycles per ns
└ task-clock               1.53e+09   33.3%  #  1.5 s
┌ instructions             8.91e+08   66.7%  #  0.1 insns per cycle
│ branch-instructions      2.60e+07   66.7%  #  2.9% of insns
└ branch-misses            3.13e+04   66.7%  #  0.1% of branch insns
┌ L1-dcache-load-misses    6.57e+08   33.3%  # 157.4% of dcache loads
│ L1-dcache-loads          4.17e+08   33.3%
│ cache-misses             7.52e+04   33.3%  #  0.0% of cache refs
└ cache-references         6.55e+08   33.3%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

I'm running them 10_000 times to get a decent sample.
Note that foo2! took 8.91e4 instructions per call, while foo! required only 7.56e4.
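(Those per-call figures are just the totals above divided by the 10_000 calls:)

julia> 7.56e8 / 10_000   # foo! instructions per call
75600.0

julia> 8.91e8 / 10_000   # foo2! instructions per call
89100.0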

@turbo doesn't seem to be properly accounting for the fact that traversing the memory of y[i, c, n] and x[i, c, n] in c, n, i order is potentially more costly than in n, c, i order.
"Potentially" because, FWIW:

julia> @btime foo!($y, $x, $b)
  149.690 μs (0 allocations: 0 bytes)

julia> @btime foo2!($y, $x, $b)
  148.181 μs (0 allocations: 0 bytes)

julia> @btime foo3!($y, $x, $b)
  148.989 μs (0 allocations: 0 bytes)

julia> @btime foo4!($y, $x, $b) # foo3! but with `@inbounds`
  149.214 μs (0 allocations: 0 bytes)

Notice the pitiful 0.1 insns per cycle in my @pstats reports above. That means the CPU is only executing one instruction every 10 clock cycles. This is really bad: this CPU is theoretically capable of up to 4 instructions/cycle, 40x higher.

Why so few instructions/cycle? Because in all 4 versions of the code, the CPU spends all of its time waiting on memory.
Memory bandwidth is the bottleneck, so what the CPU is doing/how much work it has to do is totally irrelevant.
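As a rough sanity check (assuming the arrays are the 64×32×256 Float32 StrideArrays implied by the 2.00 MiB allocations further down in this thread), the traffic per call works out to roughly what a single core can typically sustain from DRAM:

# Back-of-envelope: each call reads all of x and writes all of y
# (ignoring b and any write-allocate traffic for y).
bytes_per_call = 2 * 64 * 32 * 256 * sizeof(Float32)   # ≈ 4 MiB
bytes_per_call / 149.69e-6 / 1e9                       # ≈ 28 GB/s at ~150 μs per call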

Still, it's the same amount of memory being loaded, and I don't see that foo! (doing an order meant to minimize the number of instructions needed) is any worse than the others (which traverse in a theoretically nicer order).

Unfortunately, LinuxPerf is Linux-only.

But I can reproduce the timings you reported:

julia> @btime foo!($y, $x, $b)
  133.917 μs (0 allocations: 0 bytes)

julia> @btime foo2!($y, $x, $b)
  51.708 μs (0 allocations: 0 bytes)

julia> @btime foo3!($y, $x, $b)
  55.083 μs (0 allocations: 0 bytes)

julia> @btime foo4!($y, $x, $b)
  55.083 μs (0 allocations: 0 bytes)

julia> function foo_debug(y, x, b)
           return LoopVectorization.@turbo_debug for i in axes(y, 1), c in axes(y, 2), n in axes(y, 3)
               y[i, c, n] = x[i, c, n] + b[c]
           end
           return nothing
       end
foo_debug (generic function with 1 method)

julia> ls = foo_debug(y,x,b);

julia> LoopVectorization.choose_order(ls)
([:c, :n, :i], :n, Symbol("##undefined##"), :i, 4, -1)

The M1 has much higher memory bandwidth than most x64 CPUs, including mine.
Hence it's able to do much better here, at least given the nicer order.
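Plugging the M1 timings into the same rough traffic estimate (still assuming ≈ 4 MiB of traffic per call) makes that concrete:

bytes_per_call = 2 * 64 * 32 * 256 * sizeof(Float32)  # ≈ 4 MiB per call, as before
bytes_per_call / 51.708e-6 / 1e9                      # ≈ 81 GB/s for foo2! with the nicer order
bytes_per_call / 133.917e-6 / 1e9                     # ≈ 31 GB/s for foo! with the c-outer order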

If I shrink the size of the arrays on my x64 computer (the first one I benchmarked above), making them 8x smaller, the code runs about 30x faster, as memory becomes less of an issue:

julia> y32 = randn_stridearray(64, 32, 32);

julia> x32 = randn_stridearray(64, 32, 32);

julia> @btime foo!($y32, $x32, $b)
  5.777 μs (0 allocations: 0 bytes)

julia> @btime foo2!($y32, $x32, $b)
  4.614 μs (0 allocations: 0 bytes)

julia> @btime foo3!($y32, $x32, $b)
  5.007 μs (0 allocations: 0 bytes)

julia> @btime foo4!($y32, $x32, $b)
  5.084 μs (0 allocations: 0 bytes)

and now I can also see on this computer that foo! seems to be worse than foo2!.
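For reference, randn_stridearray isn't defined anywhere in this excerpt; judging from the vrandn_stridearray definition and the 2.00 MiB memory estimates further down, it is presumably a Float32 StrideArray constructor along these lines (a guess, assuming StrideArrays.jl and Static.jl):

using StrideArrays, Static

# Guessed reconstruction of the helper from the original issue: a Float32
# StrideArray with statically known size, filled using Base's default randn.
randn_stridearray(size...) = StrideArray(randn(Float32, size...), static.(size))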

If you want to try fiddling with it:

10.0sum(maximum, values(stridepenaltydict)) *

See the 10.0 there?
Try making it larger.
That's a coefficient on how heavily to penalize iterating over loops in a bad order. Increasing it to 50.0 seems to fix it:

julia> LoopVectorization.choose_order(ls)
([:n, :c, :i], :c, Symbol("##undefined##"), :i, 4, -1)

Now n would be the outermost loop. You'd have to actually force recompilation after making a change like that (e.g. by restarting Julia, or by editing _turbo_!, such as commenting a useless line like 1 + 1 in or out).

With that, I now get:

julia> @btime foo!($y32, $x32, $b)
  4.620 μs (0 allocations: 0 bytes)

for the size(..., 3) == 32 example.

I can make a PR, or you're welcome to play with it to try and find a heuristic that works well.


chriselrod commented on July 16, 2024

BTW, have you seen VectorizedRNG.jl?
On the M1:

julia> vrandn_stridearray(size...) = StrideArray(randn(local_rng(), Float32, size...), static.(size))
vrandn_stridearray (generic function with 1 method)

julia> @benchmark randn_stridearray(64, 32, 256)
BenchmarkTools.Trial: 2344 samples with 1 evaluation.
Range (min … max):  1.969 ms …   3.901 ms  ┊ GC (min … max): 0.00% … 44.07%
 Time  (median):     2.078 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.132 ms ± 267.936 μs  ┊ GC (mean ± σ):  2.60% ±  7.65%

  ▃  ▇█▂
  █▇▆████▇▆▅▅▃▆▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▅▅▃▄▆▅▄▆▅▃▃▁▁▁▁▃▄▇▇▆▅▅▄▆▆ █
  1.97 ms      Histogram: log(frequency) by time      3.54 ms <

 Memory estimate: 2.00 MiB, allocs estimate: 6.

julia> @benchmark vrandn_stridearray(64, 32, 256)
BenchmarkTools.Trial: 6752 samples with 1 evaluation.
Range (min … max):  622.333 μs …   3.232 ms  ┊ GC (min … max): 0.00% … 77.36%
 Time  (median):     708.791 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   739.788 μs ± 165.578 μs  ┊ GC (mean ± σ):  4.59% ± 10.53%

  ▄   ▃█▆▂                                                    ▁ ▁
  █▅▃▃████▇▄▅▄▄▄▄▃▅▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▃▇▇▅▄▅█▆█▆▅▁▁▁▁▃▃▃▃██ █
  622 μs        Histogram: log(frequency) by time       1.55 ms <

 Memory estimate: 2.00 MiB, allocs estimate: 10.

On my Linux x64 machine with AVX512:

julia> vrandn_stridearray(size...) = StrideArray(randn(local_rng(), Float32, size...), static.(size))
vrandn_stridearray (generic function with 1 method)

julia> @benchmark randn_stridearray(64, 32, 256)
BenchmarkTools.Trial: 2349 samples with 1 evaluation.
Range (min … max):  2.053 ms …   4.074 ms  ┊ GC (min … max): 0.00% … 37.17%
 Time  (median):     2.092 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.126 ms ± 196.034 μs  ┊ GC (mean ± σ):  1.46% ±  5.89%

   █▅
  ▃███▇▄▄▁▄▁▃▃▃▄▆▅▁▁▁▁▁▃▄▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▃▆▅▅▅▅▅▅▄▃ █
  2.05 ms      Histogram: log(frequency) by time      3.29 ms <

 Memory estimate: 2.00 MiB, allocs estimate: 6.

julia> @benchmark vrandn_stridearray(64, 32, 256)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):  195.751 μs …   2.561 ms  ┊ GC (min … max):  0.00% … 68.61%
 Time  (median):     202.900 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   235.975 μs ± 202.390 μs  ┊ GC (mean ± σ):  12.61% ± 12.55%

  █                                                             ▁
  ██▆▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▆▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▇██ █
  196 μs        Histogram: log(frequency) by time       1.48 ms <

 Memory estimate: 2.00 MiB, allocs estimate: 2.


cafaxo commented on July 16, 2024

Thanks for the elaborate answer!
I'll try playing with that constant.
But yeah, I am trying to train a convolutional neural network for chess on the CPU and memory seems to be quite the bottleneck. I'll have to check with perf on my linux machine.

I haven't tried VectorizedRNG, but it could be extremely useful for a different project I am working on (sampling a distribution using random-walk Metropolis).


chriselrod commented on July 16, 2024

> But yeah, I am trying to train a convolutional neural network for chess on the CPU and memory seems to be quite the bottleneck. I'll have to check with perf on my linux machine.

Using batch sizes small enough to fit in cache should help, if possible.
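A rough way to think about the batch size (hypothetical numbers, assuming the 64×32 Float32 per-sample slices from the arrays above): keep a layer's input plus output for the whole batch under the cache size:

bytes_per_sample = 64 * 32 * sizeof(Float32)        # one 64×32 Float32 slice, 8 KiB
working_set(batch) = 2 * batch * bytes_per_sample   # input plus output for one layer
working_set(256) / 2^20    # ≈ 4.0 MiB: likely spills L2 and competes for L3
working_set(32)  / 2^20    # ≈ 0.5 MiB: comfortably cache-resident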
Have you looked at SimpleChains.jl at all?

It could use more documentation, and the API needs work (especially around memory management), but it does have a convolutional network example that performs much better than Flux or PyTorch on the CPU, at least for that (small) problem size.


cafaxo commented on July 16, 2024

I just tested it: smaller batch sizes are indeed faster! (That's exactly the opposite of the behavior I see with pytorch+GPU.)
I first looked at SimpleChains.jl, and its performance convinced me that I could reach better training speed for my use case on the M1's CPU than with pytorch on the GPU backend.
However, I needed features such as zero-padding and residual connections which SimpleChains does not support.
Some parts of SimpleChains seemed somewhat convoluted (pun intended). You seem to be doing some nontrivial stuff for memory management, right? In any case, to get a better intuition for what the backward passes look like for my network, I decided to roll my own forward+backward passes on top of StrideArrays+LoopVec.

With SimpleChains, I also ran into this error:
https://julialang.slack.com/archives/C690QRAA3/p1668460255442179
Should I open an issue for that?
Also, by the way: the way the transpose for the 2d convolution is implemented in SimpleChains (manually checking bounds in the inner for loop) is super slow on my machine. I get much better performance by first copying into a preallocated zero-padded array and then doing the convolution without the bounds check.
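A minimal sketch of that padding trick (a hypothetical helper, not SimpleChains code): copy the input into a preallocated zero-padded buffer once, so the convolution's inner loops never need a bounds check:

# `padded` is a preallocated (H + 2pad) × (W + 2pad) × C buffer.
function pad_input!(padded, x, pad)
    fill!(padded, zero(eltype(padded)))
    @inbounds for c in axes(x, 3), j in axes(x, 2), i in axes(x, 1)
        padded[i + pad, j + pad, c] = x[i, j, c]
    end
    return padded
end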

