
Comments (12)

gaurav-arya commented on June 11, 2024

Next I tried replacing the struct with the built-in QuantityArray type, which was much slower still, apparently due to allocations.

In my reading, you didn't try out the array of array approach with Unitful, right? That would be interesting to see, but right now it seems like the direct point of comparison between Unitful and DynamicQuantities here is just the factor of 10 between the first two cases.
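For concreteness, a Unitful benchmark over a plain `Vector` (rather than StaticArrays) might look like the following sketch. This is not from the original benchmark suite; the function name is illustrative:

```julia
using Unitful
using BenchmarkTools

# Sum an N-length plain Vector of Unitful quantities. The units live
# in the concrete element type, so `sum` needs no per-element checks.
function test_unitful_vector(N)
    arr = [rand() * Unitful.@u_str("m") for _ in 1:N]
    sum(arr)
end

bench = @benchmark test_unitful_vector(1000)
```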

Edit: you did indeed do this with StaticArrays, I missed that, but I don't think there's a benchmark for Unitful with regular arrays.

from dynamicquantities.jl.

MilesCranmer commented on June 11, 2024

@mikeingold rather than an array of QuantityArray, could you try a QuantityArray of arrays, or better yet a 2D QuantityArray?

One option is for us to write a custom sum that simply performs sum on the base array and wraps it with the units again. But I feel like that’s just hiding a potential issue here.
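A hypothetical sketch of such a custom `sum` (the name `fast_sum` is illustrative, and this assumes `ustrip` and `dimension` work on a `QuantityArray` as they do on a scalar quantity):

```julia
using DynamicQuantities

# Reduce over the raw data array and reattach the shared dimensions
# afterwards, skipping the per-element dimension checks that a
# generic `sum` over quantities would perform.
fast_sum(qa::QuantityArray) = Quantity(sum(ustrip(qa)), dimension(qa))
```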


MilesCranmer commented on June 11, 2024

I think the issue is that you are calling the repeat outside of the benchmark. So the compiler doesn’t know it can inline the unit propagation and skip the dimension checks, because it literally doesn’t know that the quantities all have the same units within the sum. Unitful gets to skip that part here because that info is indicated in the type which was already computed prior to the sum (via a promotion).

I think a fair comparison would be to include the repeat within the benchmarked function, or use a 2D QuantityArray, or force the vector to be ::Any for element types in Unitful. Does that make sense?


mikeingold commented on June 11, 2024

The original intent here was just to test the performance of some basic operations, e.g. what if I have a collection of vectors that I need to sum? In that sense it didn’t feel right to time the repeat since it was “just” being used to set up the test, but I do suppose it’s a fair counterpoint that maybe the more fundamental performance traits of DynamicQuantities live in that space, rather than in the basic operations themselves. I’ll try to take another look at these tests tonight to see if there’s a more direct apples-to-apples test to be run.

I still do think it's interesting, though, that the performance delta of sum was an order of magnitude. I'm not sure how much of that is accounted for by some of the heavy lifting being performed by the compiler itself, or whether it's just an artifact of how much of the vector can fit into cache.


mikeingold commented on June 11, 2024

I've done some more testing and am still getting similar results. Constructing vectors of structs inside the benchmark closed the gap somewhat, but there's still about a 3.5x difference between Unitful and DynamicQuantities. At this point, I suspect the reason is that the Unitful types in this problem are already sufficiently constrained that type inference and allocations aren't the big tent-pole to begin with. If that's the case, I definitely hadn't expected such a significant delta in what seems to be the performance floor between the two packages, but I suppose it makes sense (units being handled at compile-time vs compute-time).

using DynamicQuantities
using Unitful
using BenchmarkTools

# Cartesian Coordinate with Quantity types
struct CoordinateCartesian{L}
    x::L
    y::L
    z::L
end

Base.:+(u::CoordinateCartesian, v::CoordinateCartesian) = CoordinateCartesian(u.x+v.x, u.y+v.y, u.z+v.z)

# Test: Sum an N-length vector of CoordinateCartesian using default DynamicQuantities
function test_DynamicQuantities(N)
    arr = [CoordinateCartesian(DynamicQuantities.Quantity(rand(), length=1),
                               DynamicQuantities.Quantity(rand(), length=1),
                               DynamicQuantities.Quantity(rand(), length=1) ) for i in 1:N]
    sum(arr)
end

# Test: Sum an N-length vector of CoordinateCartesian using compact DynamicQuantities
function test_DynamicQuantities_R8(N)
    R8 = Dimensions{DynamicQuantities.FixedRational{Int8,6}}
    arr = [CoordinateCartesian(DynamicQuantities.Quantity(rand(), R8, length=1),
                               DynamicQuantities.Quantity(rand(), R8, length=1),
                               DynamicQuantities.Quantity(rand(), R8, length=1) ) for i in 1:N]
    sum(arr)
end

# Test: Sum an N-length vector of CoordinateCartesian using Unitful
function test_Unitful(N)
    arr = [CoordinateCartesian(Unitful.Quantity(rand(), Unitful.@u_str("m")),
                               Unitful.Quantity(rand(), Unitful.@u_str("m")),
                               Unitful.Quantity(rand(), Unitful.@u_str("m")) ) for i in 1:N]
    sum(arr)
end

bench_dq    = @benchmark test_DynamicQuantities($1000) evals=100
bench_dq_r8 = @benchmark test_DynamicQuantities_R8($1000) evals=100
bench_uf    = @benchmark test_Unitful($1000) evals=100

Results

julia> bench_dq
BenchmarkTools.Trial: 969 samples with 100 evaluations.
 Range (min … max):  24.643 μs … 81.427 μs  ┊ GC (min … max):  0.00% … 35.44%
 Time  (median):     48.954 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   51.681 μs ± 10.649 μs  ┊ GC (mean ± σ):  12.43% ± 16.48%

                     ▄   ▂ ▁▂  ▁█
  ▂▁▁▂▃▃▂▂▁▂▃▁▁▁▂▂▃▃▆█▇▄▇█████▆███▄▃▃▂▃▂▂▁▂▂▃▃▃▃▄▃▄▄▅▄▃▃▅▄▃▃▄ ▃
  24.6 μs         Histogram: frequency by time        76.2 μs <

 Memory estimate: 117.23 KiB, allocs estimate: 2.

julia> bench_dq_r8
BenchmarkTools.Trial: 146 samples with 100 evaluations.
 Range (min … max):  329.198 μs … 382.411 μs  ┊ GC (min … max): 0.00% … 4.22%
 Time  (median):     345.723 μs               ┊ GC (median):    4.60%
 Time  (mean ± σ):   345.099 μs ±   8.225 μs  ┊ GC (mean ± σ):  2.77% ± 2.48%

             ▂ ▅▅▆           ▃█▂▅▅▃▂ ▅     ▃
  ▅▅▁▄▇▁▁█▄▁▁█▅███▇▅▅▄▄▁▁▇▁▅▄███████▁█▇▅▁█▁█▄█▁▇▇▄▅▄▄▄▁▄▁▁▄▄▁▁▄ ▄
  329 μs           Histogram: frequency by time          364 μs <

 Memory estimate: 250.16 KiB, allocs estimate: 7005.

julia> bench_uf
BenchmarkTools.Trial: 3537 samples with 100 evaluations.
 Range (min … max):   7.580 μs … 53.050 μs  ┊ GC (min … max):  0.00% … 75.62%
 Time  (median):     13.517 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   14.124 μs ±  6.536 μs  ┊ GC (mean ± σ):  10.01% ± 15.33%

  ▇▂  ▁ ▃▇▇██▅▂▂▁▂▂                                           ▂
  ██▇██████████████▆▁▁▁▁▁▁▁▁▁▁▃▁▅▆▅▆▆▇▇▆▄▅▆▆▄▃▄▁▄▄▃▄▇▆▇▆▃▆▇█▇ █
  7.58 μs      Histogram: log(frequency) by time      45.8 μs <

 Memory estimate: 23.48 KiB, allocs estimate: 2.

Summary: The Unitful version of this test ran about 3.5x faster than the vanilla DynamicQuantities version. Running the DynamicQuantities version with a more compact type was apparently an order-of-magnitude slower.


MilesCranmer commented on June 11, 2024

Writing it like DynamicQuantities.Quantity(rand(), length=1) is a bit slower than doing the @u_str stuff like you did for Unitful:

function test_dq_naive(N)
    arr = [
        CoordinateCartesian(
            rand() * DQ.@u_str("m"),
            rand() * DQ.@u_str("m"),
            rand() * DQ.@u_str("m")
        )
        for _ in 1:N
    ]
    sum(arr)
end

bench_dq_naive = @benchmark test_dq_naive(1000) evals=100

For me this gives a min time of 17.264 μs compared to Unitful's 7.581 μs, so a 2.3x difference.

(This is just because the macro => computes the Dimensions object at compile time)

This is obviously still not great. However, this is really tricky because

struct CoordinateCartesian{L}
    x::L
    y::L
    z::L
end

is optimal for Unitful.jl (the L stores the unit information), but not really for DynamicQuantities.jl. So again, with Unitful, the compiler knows that all the units are the same => sum is fast. But with DynamicQuantities.jl, the compiler has no idea what the units are => sum is slow.
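The distinction can be seen directly by inspecting the element types (a small illustrative snippet, not from the thread):

```julia
using Unitful
using DynamicQuantities

# Unitful: the units are encoded in the element *type*, so an array
# of same-unit quantities has a uniform, unit-carrying eltype, and
# the compiler can prove all elements share dimensions.
uf = [rand() * Unitful.@u_str("m") for _ in 1:3]
eltype(uf)  # a Quantity type with the length dimension baked in

# DynamicQuantities: the eltype is also concrete, but the dimensions
# are runtime *field values* on each element, so a generic `sum`
# must compare them element by element.
dq = [rand() * DynamicQuantities.@u_str("m") for _ in 1:3]
eltype(dq)  # Quantity{Float64, Dimensions{...}} for any units
```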

To make things faster for DynamicQuantities, you need to wrap stuff in QuantityArray. That way you can tell Julia: "these quantities all share the same units", and it can avoid the dimension checks it has to do (right now it has to compare the units of every single element in the sum, whereas Unitful can skip them).


mikeingold commented on June 11, 2024

Writing it like DynamicQuantities.Quantity(rand(), length=1) is a bit slower than doing the @u_str stuff like you did for Unitful ... (This is just because the macro => computes the Dimensions object at compile time)

Interesting. I'd actually guessed the opposite, that calling the constructor function directly would've been less complicated than a macro, but that makes sense.

To make things faster for DynamicQuantities, you need to wrap stuff in QuantityArray. That way you can tell Julia: "these quantities all share the same units", and it can avoid the dimension checks it has to do (right now it has to compare the units of every single element in the sum, whereas Unitful can skip them).

Is the following a fair implementation of this advice, adapting from the prior pattern? These results are roughly in the neighborhood of what I'm seeing for Vector{Quantity} and similar implementations, definitely still much slower than structs or StaticVectors.

# Test: Sum an N-length vector of QuantityArray's
function test_QuantityArray(N)
    arr = [DynamicQuantities.QuantityArray([rand(),rand(),rand()], DynamicQuantities.@u_str("m")) for _ in 1:N]
    sum(arr)
end

"""
BenchmarkTools.Trial: 219 samples with 100 evaluations.
 Range (min … max):  211.921 μs … 279.226 μs  ┊ GC (min … max): 0.00% … 5.85%
 Time  (median):     231.468 μs               ┊ GC (median):    6.02%
 Time  (mean ± σ):   228.822 μs ±   9.042 μs  ┊ GC (mean ± σ):  3.93% ± 3.10%

              ▂▁ ▂                         ▂      █ ▇ ▃ ▄
  ▅▄▁▁▄▁▅▁▄▃▄▄██▇█▇▆▆▇▃▆▅▆▁▁▃▁▃▃▁▁▁▁▁▁▆▆▅▃▆█▁▇▁█▄██▇█▇█▅█▄▆▆▃▃▄ ▄
  212 μs           Histogram: frequency by time          240 μs <

 Memory estimate: 273.41 KiB, allocs estimate: 3001.
"""

The more I'm looking at this, the more it seems like I got lucky and intuited my existing Unitful code into a relatively optimal state where the types are constrained/defined enough to avoid a lot of the big performance pitfalls.

The only real remaining "Issue" in my mind is just how surprising it was that there seems to be a delta in what you might call the performance floor of each package, i.e. that it is possible for Unitful to be faster in very specific situations. I'm planning to do some more testing with more complicated expressions to see if the type constraints continue to hold the line, or at what point they break down. Are you on the Julia Discourse @MilesCranmer? Maybe it would be better to migrate this topic there vs it being tracked as a concrete "Issue" here?


MilesCranmer commented on June 11, 2024

Is the following a fair implementation of this advice, adapting from the prior pattern?

Almost but not quite. Since you are summing coordinates along the sample axis, the QuantityArray needs to also wrap the sample axis (in order for Julia to remove the dimensional analysis). So e.g., all the x coordinates should be stored in a QuantityArray, and all the y in another QuantityArray. Right now you are storing each x, y, z in a single QuantityArray which I don’t think will help since you never sum x + y + z.
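One way to read this advice (a hypothetical sketch, not code from the thread; names are illustrative): store each coordinate axis in its own `QuantityArray`, so every reduction runs over a plain `Float64` array with a single shared `Dimensions` value:

```julia
using DynamicQuantities
using BenchmarkTools

# Struct-of-arrays layout: one QuantityArray per axis. Each sum then
# reduces a raw Float64 array and carries one shared unit, rather
# than checking dimensions on every element.
function test_soa(N)
    xs = QuantityArray(rand(N), u"m")
    ys = QuantityArray(rand(N), u"m")
    zs = QuantityArray(rand(N), u"m")
    (sum(xs), sum(ys), sum(zs))
end

bench_soa = @benchmark test_soa(1000)
```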

Unitful gets around this via a sophisticated set of promotion rules. When you create an array of a type parametrized by L, the units, it sees at compile time that all the elements have that type, so the entire array is Array{...{L}}, and the sum doesn't need to do dimensional analysis.

So to get closer in performance you need to also tell Julia that all the units are the same, which you can do with a QuantityArray.

Julia discourse

Sounds good to me, it should be a nice way to get additional performance tips.


MilesCranmer commented on June 11, 2024

@mikeingold I wrote a little example in #49 for how I think you should actually do this (once that PR merges):

First, make the coords:

struct Coords
    x::Float64
    y::Float64
end

# Define arithmetic operations on Coords
Base.:+(a::Coords, b::Coords) = Coords(a.x + b.x, a.y + b.y)
Base.:-(a::Coords, b::Coords) = Coords(a.x - b.x, a.y - b.y)
Base.:*(a::Coords, b::Number) = Coords(a.x * b, a.y * b)
Base.:*(a::Number, b::Coords) = Coords(a * b.x, a * b.y)
Base.:/(a::Coords, b::Number) = Coords(a.x / b, a.y / b)

We can then build a GenericQuantity out of this:

coord1 = GenericQuantity(Coords(0.3, 0.9), length=1)
coord2 = GenericQuantity(Coords(0.2, -0.1), length=1)

and perform operations on these:

coord1 + coord2 |> uconvert(us"cm")
# (Coords(50.0, 80.0)) cm

The nice part about this is it only stores a single Dimensions (or SymbolicDimensions) for the entire struct!

Then, we can build an array like so:

function test_QuantityArray(N)
    coord_array = QuantityArray([GenericQuantity(Coords(rand(), rand()), length=1) for i=1:N])
    sum(coord_array)
end

This QuantityArray already indicates that all the dimensions are the same, and thus the summation should be faster! The only remaining reason why it wouldn't be as fast is due to the compiler not inlining enough.


MilesCranmer commented on June 11, 2024

I'm getting near-identical performance to a regular array now!!

julia> @btime sum(coord_array) setup=(N=1000; coord_array=QuantityArray([GenericQuantity(Coords(rand(), rand()), length=1) for i=1:N]))
  1.113 μs (0 allocations: 0 bytes)
(Coords(501.7717111461543, 494.36328730797095)) m

julia> @btime sum(array) setup=(N=1000; array=[Coords(rand(), rand()) for i=1:N])
  1.087 μs (0 allocations: 0 bytes)
Coords(505.4496129866645, 507.2903371535713)


mikeingold commented on June 11, 2024

Sorry for the delay, haven't had as much time lately to work on this. I'm excited to try out the new update when it's available on General! At this point I'd propose using PR #49 as justification to close this Issue.


MilesCranmer commented on June 11, 2024

Cool! Closing with v0.8.0.


