Comments (7)
Hmm. I need to add support for multiple return values (or, equivalently, returning a tuple and unpacking it).
However, in case you're curious and don't want to wait for me to fix this issue:
LoopVectorization uses SLEEFPirates for evaluating special functions.
julia> using SLEEFPirates, SIMDPirates
julia> x = SVec(ntuple(Val(4)) do i Core.VecElement(rand()) end)
SVec{4,Float64}<0.25773116159740384, 0.6887968844827816, 0.3906818208655902, 0.5101711516658685>
julia> sin(x)
SVec{4,Float64}<0.25488730940177806, 0.6356088237080095, 0.38081894899915164, 0.4883266114084775>
julia> cos(x)
SVec{4,Float64}<0.966970764555952, 0.7720112843893673, 0.9246496244973993, 0.8726609425145104>
julia> sincos(x)
(SVec{4,Float64}<0.25488730940177806, 0.6356088237080095, 0.38081894899915164, 0.4883266114084775>, SVec{4,Float64}<0.966970764555952, 0.7720112843893673, 0.9246496244973993, 0.8726609425145105>)
julia> using BenchmarkTools
julia> @benchmark sincos($x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 21.699 ns (0.00% GC)
median time: 21.903 ns (0.00% GC)
mean time: 22.186 ns (0.00% GC)
maximum time: 333.447 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 997
julia> @benchmark sin($x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 20.016 ns (0.00% GC)
median time: 20.105 ns (0.00% GC)
mean time: 20.424 ns (0.00% GC)
maximum time: 326.284 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 997
julia> @benchmark cos($x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 20.389 ns (0.00% GC)
median time: 20.481 ns (0.00% GC)
mean time: 20.926 ns (0.00% GC)
maximum time: 226.778 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 997
julia> versioninfo()
Julia Version 1.5.0-DEV.168
Commit c4c4a13* (2020-01-28 16:23 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.0 (ORCJIT, haswell)
To compare with the links, the CPU I tested on is closest to the Intel® Xeon® CPU E5-2699 v4.
At 2.5 GHz, 20 ns corresponds to 20 ns * 2.5 GHz = 50 clock cycles, or 50 / 4 elements = 12.5 clocks per element.
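That arithmetic can be written as a small helper. The 2.5 GHz clock and 4-element width are assumptions taken from the versioninfo and SVec width above; clocks_per_element is a hypothetical name for illustration.

```julia
# Convert a per-call time into clock cycles per SIMD element.
# Assumes a fixed 2.5 GHz clock (no turbo) and a 4-element vector,
# matching the benchmark above.
clocks_per_element(t_ns, clock_ghz, width) = t_ns * clock_ghz / width

clocks_per_element(20.0, 2.5, 4)  # 12.5
```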
I assume "HA" and "LA" mean high and low accuracy, respectively? This would make Intel's high-accuracy sin roughly twice as fast as the sin from SLEEFPirates, while their sincos is only a little faster.
SLEEFPirates does far better than base Julia, but is still definitely suboptimal. It would be great to improve it or switch to a better library in the future.
This is true for most functions. I included an "exp" benchmark in the discourse thread, where you'll see LoopVectorization was much faster than base Julia and Clang, but lagged behind gfortran and the Intel compilers.
Note that the C-library "SLEEF" is now on major version 3, while SLEEFPirates.jl was forked from SLEEF.jl, which was based on C-SLEEF major version 2.
from loopvectorization.jl.
Yes, I am aware of that. I also use it for the look-up-table comparison in https://nextjournal.com/zorn/fast-fixed-point-sine-and-cosine-approximation-with-julia and in the function get_angle(x), which I use to calculate sin(x) and cos(x) based on A(x) and B(x).
I thought that you were doing it in floating-point arithmetic. But it is interesting to see that LLVM does that automatically!
from loopvectorization.jl.
Adding to this conversation and since you were interested in fixed-point sine and cosine approximation, I have written an article at nextjournal about my findings: https://nextjournal.com/zorn/fast-fixed-point-sine-and-cosine-approximation-with-julia
There is indeed a better way than simply doing cos(x) = sin(x + pi / 2).
As you will see in the article, the Int16 and Int32 variants vectorize quite well using standard Julia. The Int64 variant is a little slower. I hope that once LoopVectorization supports multiple return values, I can make this faster.
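For reference, the naive approach the article improves on can be checked directly; naive_cos is a hypothetical name for illustration.

```julia
# Derive cos from sin by a quarter-period phase shift: cos(x) = sin(x + pi/2).
naive_cos(x) = sin(x + pi / 2)

naive_cos(0.7) ≈ cos(0.7)  # true
```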
from loopvectorization.jl.
@zsoerenm You may be interested in VectorizedRNG.jl, which I discussed before.
I'm still using a floating point approximation of sin and cosine instead of fixed point, but I am using a very simple approximation I wrote. Floating point has the advantage of not needing to keep track of the exponents and lets you avoid all those shifts (although shifts are very fast).
Because I need a random sin and cos on the interval (0, 2pi), I approximate a single quadrant of sin on (0, pi/2), shift cos appropriately, and copy random sign bits to get the full (0, 2pi) range.
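A minimal scalar sketch of that sign-bit trick, assuming the accurate sin/cos as a stand-in for the quarter-wave polynomial; quadrant_sincos is a hypothetical name, and VectorizedRNG's real implementation is vectorized.

```julia
# Copy two random bits into the signs of a quarter-wave sin/cos pair.
# The four sign combinations correspond to the angles x, pi - x,
# pi + x, and 2pi - x, i.e. one point in each quadrant.
function quadrant_sincos(u::UInt64, x)   # x in (0, pi/2)
    s, c = sin(x), cos(x)                # stand-in for the polynomial approximation
    s = ifelse(u & 0x01 == 0, s, -s)     # first random bit picks the sign of sin
    c = ifelse(u & 0x02 == 0, c, -c)     # second random bit picks the sign of cos
    (s, c)
end
```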
I used Remez.jl for my polynomial coefficients.
My random log(U) where U ~ Uniform(0,1) isn't quite as big an improvement. That one takes a random integer. I pass in a random series of bits, count the number of leading zeros as a random 2^lz exponent, and then use the remaining bits, up to a maximum of 52, for a uniform random floating point value in (1,2). Then I approximate log2(x) over the interval (1,2), and add on the leading zeros. Multiplying by 1/log2(e) gives a random base-e log. Math:
log(U) = log2(U)/log2(e)
log2(U) = log2(U * 2^lz / 2^lz)
        = log2(U * 2^lz) - log2(2^lz)
        = log2(U * 2^lz) - lz
and U * 2^lz is in the interval (1,2). This works for now, but I'm sure there are better approximations, so I'll look into this more later.
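The whole construction can be sketched in scalar Julia, assuming U is taken as u / 2^64 and letting the accurate log2 stand in for the minimax polynomial over (1,2); randlog is a hypothetical name for illustration.

```julia
# Turn one random UInt64 into a sample of log(U), U ~ Uniform(0,1).
function randlog(u::UInt64)
    lz = leading_zeros(u)              # random exponent: P(lz = k) = 2^-(k+1)
    # Drop the leading one bit and keep the next 52 bits as the mantissa
    # of a Float64 in [1, 2).
    frac = (u << (lz + 1)) >> 12
    r = reinterpret(Float64, 0x3ff0000000000000 | frac)
    # With U = u / 2^64, log2(U) = log2(r) - (lz + 1); multiply by
    # 1/log2(e) = log(2) to get the base-e log.
    (log2(r) - lz - 1) * log(2)
end

randlog(UInt64(1) << 63)  # ≈ log(0.5)
```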
Unfortunately, counting the number of leading zeros requires AVX512CD to vectorize, but this is still reasonably fast on AVX2.
Putting these together, I get about a 2x improvement on Random.randn! with AVX2, and a more than 4x improvement with AVX512.
from loopvectorization.jl.
Very interesting!
Floating point has the advantage of not needing to keep track of the exponents and lets you avoid all those shifts (although shifts are very fast).
Sure; however, my requirement for accuracy is very low: 2 or 3 bits would already be enough. Unfortunately, there is no built-in Float16 that I can use to make things faster, so Int16 was the best bet.
I used Remez.jl for my polynomial coefficients
As far as I understand, it does a sort of least-squares fit, doesn't it? In that case you'll have to pay attention to the boundaries of the quarter sine/cosine approximation; otherwise the function jumps every quarter circle. I have chosen the polynomials in such a way that the boundaries have zero error.
Here you'll find a solution for a sixth-order polynomial that has zero error at the boundaries and lets you calculate sin and cos simultaneously: http://www.olliw.eu/2014/fast-functions/
and copy random sign bits to get the full (0, 2pi) range
How do you restrict the phase to be between 0 and 2pi? I found that the mod function is quite expensive.
from loopvectorization.jl.
Sure; however, my requirement for accuracy is very low: 2 or 3 bits would already be enough. Unfortunately, there is no built-in Float16 that I can use to make things faster, so Int16 was the best bet.
Ah, if you want to minimize the number of bits per number, that makes sense.
As far as I understand, it does a sort of least square fit, doesn't it?
It does minimax fitting. That is, it minimizes the maximum error over the given range, which in my case was only the first of four quadrants.
Also, because I'm only approximating a quarter of the circle, it can't exactly jump every quarter circle ;).
How do you restrict the phase to be between 0 and 2pi? I found that the mod function is quite expensive.
The inputs are random 64-bit integers. I use the first two bits to ultimately pick the signs of sin and cos, respectively, after evaluating the quarter-circle approximation. This is equivalent to randomly picking the quadrant. I ignore the next 10 bits.
The last 52 bits make the fractional component of a 64-bit Float in [1,2) (after appropriately setting the first 12 bits):
julia> VectorizedRNG.mask(typemin(UInt), Float64)
1.0
julia> VectorizedRNG.mask(typemax(UInt), Float64)
1.9999999999999998
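The same trick can be written in plain Julia; mask52 is a hypothetical name for illustration, not VectorizedRNG's actual internal.

```julia
# Keep 52 random bits as the mantissa and force the exponent field to
# that of 1.0 (0x3ff), giving a Float64 uniform in [1, 2).
mask52(u::UInt64) = reinterpret(Float64, 0x3ff0000000000000 | (u >> 12))

mask52(typemin(UInt64))  # 1.0
mask52(typemax(UInt64))  # 1.9999999999999998
```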
From there, I adjust into the open-open (0,1) interval, where sin and cos are mirrored across 0.5.
Roughly, if r is the random float in [1,2), then I evaluate a scaled sin approximation at roughly both r - 1 and 2 - r to get the sine and cosine, respectively. The sin approximation is scaled so that numbers in the range (0,1) correspond to (0,pi/2); of course, it is only valid over this range. "Roughly", because I translate [1,2) to (0,1) instead of [0,1) and (0,1], as would happen if I used r-1 and 2-r.
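The mirroring step can be sketched as follows; scaled_sin is a stand-in for the scaled polynomial approximation, and the names are hypothetical.

```julia
# Map (0,1) onto (0, pi/2); the real code uses a scaled minimax polynomial.
scaled_sin(t) = sin(t * pi / 2)

r = 1.25                                     # a random float in [1, 2)
s, c = scaled_sin(r - 1), scaled_sin(2 - r)  # sine and cosine of the same angle
s^2 + c^2  # ≈ 1.0 when the approximation is exact
```

The mirror works because (2 - r) * pi/2 = pi/2 - (r - 1) * pi/2, so the second evaluation yields the cosine of the first angle.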
So there is no need for mod; the phase is restricted by design.
Additionally, because the purpose is as part of a random number generator, and because the only inputs are themselves these random unsigned integers, continuity isn't important.
What is important is that the outputs are equivalent to what I'd get if I randomly sampled sincos(2pi*rand()).
The boundary values are:
julia> using VectorizedRNG
julia> u = Core.VecElement.((0x0000000000000000, 0x0fffffffffffffff))
(VecElement{UInt64}(0x0000000000000000), VecElement{UInt64}(0x0fffffffffffffff))
julia> VectorizedRNG.randsincos(u, Float64)
((VecElement{Float64}(3.9734697296186967e-7), VecElement{Float64}(1.000000397346973)), (VecElement{Float64}(1.000000397346973), VecElement{Float64}(3.973469727874718e-7)))
So it doesn't match exactly at 0 or 1. In fact, at 1, it is outside of the valid range for sin and cos (slightly greater than 1).
But what matters is that it is relatively close. The final numbers are supposed to be normally distributed. randsincos is an internal function, not intended to be used by others. Before its outputs are returned by someone using the random normal generator, they're multiplied by another random number. That number is exponentially distributed, meaning it can be greater than or less than 1, effectively hiding these small errors and the fact that this result is greater than 1.
What I do want is to use a lot of bits to generate the random number, so that a long stream of them would be unique, and so that it passes RNG tests like those in RNGTest.jl. I should run it against BigCrush when I get home tonight, but SmallCrush isn't a problem:
julia> using RNGTest, Distributions, Random, VectorizedRNG
julia> struct RandNormal01{T<:VectorizedRNG.AbstractPCG} <: Random.AbstractRNG
pcg::T
end
julia> function Random.rand!(r::RandNormal01, x::AbstractArray)
randn!(r.pcg, x)
x .= cdf.(Normal(0,1), x) # small crush tests uniform numbers, so transform normal to uniform
end
julia> rngnorm = RNGTest.wrap(RandNormal01(local_pcg()), Float64);
julia> RNGTest.smallcrushTestU01(rngnorm)
========= Summary results of SmallCrush =========
Version: TestU01 1.2.3
Generator:
Number of statistics: 15
Total CPU time: 00:00:15.92
All tests were passed
EDIT:
Unfortunately, BigCrush has been showing me that I need a lot more accuracy.
from loopvectorization.jl.
@zsoerenm
BTW, integer mod with respect to powers of 2 is very fast. The trick is to replace a % m with a & (m-1); e.g., if you want the mod with respect to 2^30:
julia> const MODMASK30 = (one(UInt32) << 30) - 0x01
julia> mod30(u) = u & MODMASK30
julia> u = rand(UInt32)
0xea0e3557
julia> mod30(u) === u % (UInt32(2)^30)
true
This works for powers of 2 because
julia> bitstring(one(UInt32) << 30)
"01000000000000000000000000000000"
julia> bitstring(one(UInt32) << 30 - one(UInt32))
"00111111111111111111111111111111"
You're just setting all the higher bits to 0, which is the same as subtracting off all the higher powers of 2 to leave only the remainder.
Although, if you use % with a constant power-of-two value, LLVM should have figured this out for you:
julia> rem30(u) = u % 0x40000000
rem30 (generic function with 1 method)
julia> @code_llvm debuginfo=:none rem30(u)
define i32 @julia_rem30_19002(i32) {
top:
%1 = and i32 %0, 1073741823
ret i32 %1
}
julia> reinterpret(Int32, MODMASK30)
1073741823
It's using an and instruction with 1073741823 instead of calculating the mod, so LLVM obviously noticed that 0x40000000 is a power of two and used this trick itself.
from loopvectorization.jl.