MultiFloats.jl

Copyright © 2019-2024 by David K. Zhang. Released under the MIT License.

MultiFloats.jl is a Julia package for extended-precision arithmetic using 100–400 bits (≈30–120 decimal digits). In this range, it is the fastest extended-precision library that I am aware of. At 100-bit precision, MultiFloats.jl is roughly 40× faster than BigFloat, 5× faster than Quadmath.jl, and 1.5× faster than DoubleFloats.jl.

MultiFloats.jl is fast because it uses native Float64 operations on static data structures that do not dynamically allocate memory. In contrast, BigFloat allocates memory for every single arithmetic operation, requiring frequent pauses for garbage collection. In addition, MultiFloats.jl uses branch-free algorithms that can be vectorized for even faster execution on SIMD processors.

MultiFloats.jl provides pure-Julia implementations of the basic arithmetic operations (+, -, *, /, sqrt), comparison operators (==, !=, <, >, <=, >=), and floating-point introspection methods (isfinite, eps, minfloat, etc.). Transcendental functions (exp, log, sin, cos, etc.) are supported through MPFR.

MultiFloats.jl stores extended-precision numbers in a multi-limb representation that generalizes the idea of double-double arithmetic to an arbitrary number of components. This idea takes inspiration from Jonathan Shewchuk's work on adaptive-precision floating-point arithmetic and Yozo Hida, Xiaoye Li, and David Bailey's algorithms for quad-double arithmetic, combined in a novel fashion with Julia's unique JIT architecture and metaprogramming capabilities.

New Features in v2.0

MultiFloats.jl v2.0 now supports explicit SIMD vector programming using SIMD.jl. In addition to the basic scalar types Float64x2, Float64x3, ..., Float64x8, MultiFloats.jl v2.0 also provides the vector types v2Float64x2, v4Float64x2, v8Float64x2, ..., v2Float64x8, v4Float64x8, v8Float64x8, allowing users to operate on two, four, or eight extended-precision values at a time. These are all instances of the generic type MultiFloatVec{M,T,N}, which represents a vector of M values, each represented by N limbs of type T.

MultiFloats.jl v2.0 also provides the functions mfvgather(array, indices) and mfvscatter(vector, array, indices) to simultaneously load/store multiple values from/to a dense array of type Array{MultiFloat{T,N},D}.
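
For illustration, here is a minimal sketch of how the vector types and gather/scatter functions described above might be used. The exact type expected for the index argument is an assumption on my part (a SIMD.Vec of Ints); consult the package documentation for the authoritative signatures.

using MultiFloats, SIMD

# Gather four Float64x2 values from a dense array, operate on all four
# lanes at once, and scatter the results back.
data = rand(Float64x2, 16)
idx = Vec{4,Int}((1, 5, 9, 13))   # assumed index type: a SIMD.Vec of Ints
v = mfvgather(data, idx)          # load four extended-precision values
w = v * v + v                     # arithmetic applies lane-wise
mfvscatter(w, data, idx)          # store the four results back into data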

MultiFloats.jl v2.0 uses Julia's @generated function mechanism to automatically generate code on-demand for arithmetic and comparison operations on MultiFloat and MultiFloatVec values. It is no longer necessary to call MultiFloats.use_standard_multifloat_arithmetic(9) in order to compute with Float64x{9}; thanks to the magic of metaprogramming, it will just work!
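
For example, the following works out of the box (a minimal sketch; the exact residual value will vary):

using MultiFloats

x = Float64x{9}(2.0)   # arithmetic code for 9 limbs is generated on first use
y = sqrt(x)
y * y - x              # a tiny residual on the order of eps(Float64x{9})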

Breaking Changes from v1.0

MultiFloats.jl v2.0 no longer exports the renormalize function by default. Internal renormalization is now performed more frequently, which should make it unnecessary for users to call renormalize explicitly in the vast majority of cases. It is still available, unexported, as MultiFloats.renormalize.

MultiFloats.jl v2.0 no longer provides the user-selectable precision modes that were available in v1.0. Consequently, the following functions no longer exist:

MultiFloats.use_clean_multifloat_arithmetic()
MultiFloats.use_standard_multifloat_arithmetic()
MultiFloats.use_sloppy_multifloat_arithmetic()

My experience has shown that sloppy mode causes serious problems in every nontrivial program, while clean mode has poor performance tradeoffs. Instead of using, say, Float64x3 in clean mode, it almost always makes more sense to use Float64x4 in standard mode. Therefore, I made the decision to streamline development by focusing only on standard mode. If you have an application where sloppy or clean mode is demonstrably useful, please open an issue for discussion!

MultiFloats.jl v2.0 no longer provides a pure-MultiFloat implementation of exp and log. The implementation provided in v1.0 was flawed and performed only marginally better than MPFR. A new implementation, based on economized rational approximations to exp2 and log2, is being developed for a future v2.x release.

Installation

MultiFloats.jl is a registered Julia package, so all you need to do is run the following line in your Julia REPL:

]add MultiFloats

Usage

MultiFloats.jl provides the types Float64x2, Float64x3, ..., Float64x8, which represent extended-precision numbers with 2×, 3×, ..., 8× the precision of Float64. These are all subtypes of the parametric type MultiFloat{T,N}, where T = Float64 and N = 2, 3, ..., 8.

Instances of Float64x2, Float64x3, ..., Float64x8 are convertible to and from Float64 and BigFloat, as shown in the following example.

julia> using MultiFloats

julia> x = Float64x4(2.0)
2.0

julia> y = sqrt(x)
1.41421356237309504880168872420969807856967187537694807317667973799

julia> y * y - x
-1.1566582006914837e-66

A comparison with sqrt(BigFloat(2)) reveals that all displayed digits are correct in this example.

Note: MultiFloats.jl also provides a Float64x1 type that has the same precision as Float64, but behaves like Float64x2 through Float64x8 in terms of supported operations. This is occasionally useful for testing, since any code that works for Float64x1 should also work for Float64x2 through Float64x8 and vice versa.

Features and Benchmarks

We use two linear algebra tasks to compare the performance of extended-precision floating-point libraries:

  • QR factorization of a random 400×400 matrix
  • Pseudoinverse of a random 400×250 matrix using GenericLinearAlgebra.jl

The timings reported below are averages of 10 single-threaded runs performed on an Intel Core i9-11900KF processor using Julia 1.10.0.

| | MultiFloats Float64x2 | MPFR BigFloat | Arb ArbFloat | Intel Dec128 | Julia Double64 | libquadmath Float128 |
|---|---|---|---|---|---|---|
| 400×400 qr time | 0.276 sec | 7.311 sec (27× slower) | 13.259 sec (48× slower) | 11.963 sec (43× slower) | 0.384 sec (1.4× slower) | 1.399 sec (5× slower) |
| correct digits | 26.2 | 25.9 | 25.9 | 27.7 | 26.1 | 27.9 |
| 400×250 pinv time | 1.236 sec | 49.581 sec (40× slower) | ❌ Error | ❌ Error | 1.899 sec (1.5× slower) | 7.551 sec (6× slower) |
| correct digits | 26.0 | 25.8 | ❌ Error | ❌ Error | 25.9 | 27.9 |
| selectable precision | ✔️ | ✔️ | ✔️ | | | |
| avoids dynamic memory allocation | ✔️ | | | ✔️ | ✔️ | ✔️ |
| basic arithmetic (+, -, *, /, sqrt) | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| transcendentals (sin, exp, log) | ⚠️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| compatible with GenericLinearAlgebra.jl | ✔️ | ✔️ | ✔️ | | ✔️ | ✔️ |
| float introspection (minfloat, eps) | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |

Precision and Performance

The following tables compare the precision (in bits) and performance (in FLOPs) of the arithmetic algorithms provided by MultiFloats.jl.

| Number of Accurate Bits | + | - | * | / |
|---|---|---|---|---|
| Float64x2 | 107 | 107 | 103 | 103 |
| Float64x3 | 161 | 161 | 156 | 155 |
| Float64x4 | 215 | 215 | 209 | 207 |
| Float64x5 | 269 | 269 | 262 | 259 |
| Float64x6 | 323 | 323 | 314 | 311 |
| Float64x7 | 377 | 377 | 367 | 364 |
| Float64x8 | 431 | 431 | 420 | 416 |

Worst-case precision observed in ten million random trials using random numbers with uniformly distributed exponents in the range 1.0e-100 to 1.0e+100. Here, + refers to addition of numbers with the same sign, while - refers to addition of numbers with opposite signs. The number of accurate bits was computed by comparison to exact rational arithmetic.
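
As a quick (and much smaller) sanity check of these figures, one can compare a single MultiFloats product against BigFloat. The 512-bit working precision and single-sample measurement below are illustrative assumptions, not the methodology of the original benchmark:

using MultiFloats

setprecision(BigFloat, 512)
a, b = rand(Float64x4), rand(Float64x4)
exact = big(a) * big(b)
err = abs(big(a * b) - exact)
# Number of accurate bits in this one product (Inf if the result is exact).
accurate_bits = iszero(err) ? Inf : -log2(Float64(err / abs(exact)))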

| Operation | FLOP Count |
|---|---|
| + | 3N² + 10N - 6 |
| - | 3N² + 10N - 6 |
| * | 2N³ - 4N² + 9N - 9 |
| / | 6N³ + 4N² - 14N + 2 |

Caveats

MultiFloats.jl requires an underlying implementation of Float64 with IEEE round-to-nearest semantics. It works out-of-the-box on x86 and ARM but may fail on more exotic architectures.

MultiFloats.jl does not attempt to propagate IEEE Inf and NaN values through arithmetic operations, as this could cause significant performance losses. You can pass these values through the Float64x{N} container types, and introspection functions (isinf, isnan, etc.) will work, but arithmetic operations will typically produce NaN on all non-finite inputs.
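
For example (a behavior also reported in the issues below):

julia> Float64x2(Inf) + 1   # arithmetic does not propagate Inf
NaN

julia> isinf(Float64x2(Inf))   # but introspection still works
true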

multifloats.jl's People

Contributors

dzhang314, haampie, jeffreysarnoff, juliatagbot, lrnv, musm, nsajko, samuelbadr, stevengj

multifloats.jl's Issues

Why warning about DoubleFloats allocation?

The table at the bottom of this package's README has a warning sign in the DoubleFloats column for "avoids dynamic memory allocation". Why is this? I've never noticed such allocations.

Wrong or outdated claim, and suggestion

Hi,

Is your claim about ArbFloat here:
@JeffreySarnoff
https://github.com/JeffreySarnoff/ArbNumerics.jl

slower than BigFloat wrong (and also slower than your type)? Is it only about QR? I understood ArbFloat to be faster usually, even at the same precision, but not always.

Of course it is faster than BigFloat on its default, but that's sort of unfair. I guess your 106 bit case is best case, and at higher ArbFloat wins. Many default to BigFloat after Float64, since they don't know better, why I intend to get that package documented, and now likely your package too.

Since you define all the basic operators, you could define all, just not yet done? That's a pro of ArbNumerics, also interval option.

I'm thinking should it add your package as a dependency? Could it use your type unless you ask for some inconvenient/large precision, then fall back to its implementation? And then your type could get the transcendentals that way?

Loss of accuracy in `*` / `+`

After running the QR algorithm for a bunch of iterations, in 99% of the cases I'm hitting numbers like these, which lose precision when multiplied:

julia> x = Float64x4((-1.4426767353575964e-39, -4.620737599002311e-57, 1.1532128077549776e-66, -9.883310571778616e-83))

julia> y = big(x)

julia> x * x - y * y
-1.329884630995506956660201769041660088331658728714499508848494315565488248944506e-132

julia> abs(x) * abs(x) - y * y
-5.683388719909582042824734830998315327342882525094367086266805748619453533154653e-144

Relative errors:

julia> (x * x - y * y) / (y * y)
-6.389632939012188407781360510046084336131978165576939912207721601547110752468559e-55

julia> (abs(x) * abs(x) - y * y) / (y * y)
-2.730670535139620858770944035756030902664721599555555738618279743956037242156906e-66

I think they're not normalized.


Example. Input matrix:

julia> H1
6×6 Matrix{MultiFloat{Float64, 4}}:
  1.98795  -0.354736   0.0797913   0.148136     -0.164454   0.369633
 -0.19463   0.598573   0.429203   -0.486762      0.239162   0.00142155
  0.0       0.0290679  0.556247   -0.562008     -0.385439  -0.214523
  0.0       0.0        0.511856    1.12336      -0.602872  -0.206262
  0.0       0.0        0.0        -1.49082e-25   1.38356   -0.937583
  0.0       0.0        0.0         0.0           0.022335   1.20556

After applying a double shift of the QR algorithm and moving the "bulge" to the last two rows, it ends up looking like this:

julia> H2
6×6 Matrix{MultiFloat{Float64, 4}}:
 1.9865    0.358077    0.115383   0.13357      -0.182391    0.361471
 0.200765  0.554469    0.581365  -0.204676     -0.249824   -0.0259713
 0.0       0.0246946   1.0663     0.734426     -0.377182   -0.11517
 0.0       0.0        -0.359419   0.658852     -0.582584   -0.315445
 0.0       0.0         0.0       -1.83218e-50   1.43778    -0.923614
 0.0       0.0         0.0       -3.00836e-51   0.0363045   1.15134

Here accuracy is still fine:

julia> H * Q - Q * H2 |> norm
5.414815616357242382449125468599721608592507512793183811901540346944e-64

julia> Q'Q-I |> norm
4.416391581658617777192079003247210327007965703744734256761335112015e-64

But the last Givens rotation, which zeros out the -3.0e-51 value, is completely inaccurate:

julia> H3
6×6 Matrix{MultiFloat{Float64, 4}}:
 1.9865    0.358077    0.115383   0.13357      -0.121414    0.386247
 0.200765  0.554469    0.581365  -0.204676     -0.250731    0.0148499
 0.0       0.0246946   1.0663     0.734426     -0.390859   -0.052535
 0.0       0.0        -0.359419   0.658852     -0.625997   -0.216883
 0.0       0.0         0.0       -1.85672e-50   1.28839    -0.946116
 0.0       0.0         0.0        0.0           0.0138023   1.30073

julia> Q'Q-I |> norm
1.1366313071582258344782136697615383422393886966249503201836254869903e-47

julia> H * Q - Q * H3 |> norm
1.656359278104309870083428882117651148392957207448309578999648077939e-47

That's probably because the numbers from which the rotation is computed are not normalized:

julia> H2 .+ 0.0 .=== H2
6×6 BitMatrix:
 1  1  1  1  1  1
 1  1  1  1  1  1
 1  1  1  1  1  1
 1  1  1  1  1  1
 1  1  1  0  1  1
 1  1  1  0  1  1

If I normalize them "by hand", it looks like accuracy is restored:

julia> c, s, = LinearAlgebra.givensAlgorithm(a, b)
(0.9867864895853341289209073557253047327730824456644072969143861256782, 0.16202599782705631811014484204531232306332276631781213554170381102448, -1.85671644730184222985716651733286367867540289938089541635541279028e-50)

julia> c*c+s*s-1
8.037197050005110683132676344698446e-48

julia> c, s, = LinearAlgebra.givensAlgorithm(a + 0.0, b + 0.0)
(0.9867864895853341289209073557253047327730824456604417981828460524574, 0.16202599782705631811014484204531232306332276631716101810582393526185, -1.856716447301842229857166517332863678675402899388356814331887958311e-50)

julia> c*c+s*s-1
-1.1543976842476927e-64

But... I can't write my algorithm like that.

How can I avoid loss of accuracy?

sincos leads to stackoverflow

Because of this bit of code

_sincos(x::AbstractFloat) = sincos(x)
_sincos(x) = (sin(x), cos(x))

sincos(x) = _sincos(float(x))

in
julia\base\special\trig.jl

Calling sincos with a MultiFloat causes a stackoverflow.

Minimum working example

using MultiFloats
sincos(Float64x2(4))
ERROR: StackOverflowError:
Stacktrace:
     [1] _sincos(x::MultiFloat{Float64, 2})
       @ Base.Math .\special\trig.jl:203
     [2] sincos(x::MultiFloat{Float64, 2})
       @ Base.Math .\special\trig.jl:206
--- the last 2 lines are repeated 39990 more times ---
 [79983] _sincos(x::MultiFloat{Float64, 2})
       @ Base.Math .\special\trig.jl:203

I'm trying to use conv from DSP.jl and GenericFFT.jl; is there a workaround for this? I don't have a firm grasp on how to change the method used by conv in a scenario like this.

P.S. Why would one write code like the above? It seems designed to cause a stack overflow whenever calling float is a no-op.
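
One possible workaround (my own suggestion, not an official fix) is to give MultiFloat a direct sincos method so that Base's recursive fallback is never reached:

using MultiFloats

# Break the sincos/_sincos recursion by dispatching straight to sin and cos,
# which MultiFloats supports through MPFR.
Base.sincos(x::MultiFloat) = (sin(x), cos(x))

sincos(Float64x2(4))   # returns a (sin, cos) tuple instead of overflowing the stack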

^ not defined.

I have the following stacktrace:

ERROR: ^ not defined for MultiFloat{Float64, 4}
Stacktrace:
  [1] error(::String, ::String, ::Type)
    @ Base .\error.jl:42
  [2] no_op_err(name::String, T::Type)
    @ Base .\promotion.jl:395
  [3] ^(x::MultiFloat{Float64, 4}, y::MultiFloat{Float64, 4})

Missing methods in conversion to Integer

julia> Int(Float64x4(3.0))
ERROR: MethodError: no method matching Int64(::MultiFloat{Float64, 4})
Closest candidates are:
  (::Type{T})(::T) where T<:Number at boot.jl:760
  (::Type{T})(::AbstractChar) where T<:Union{AbstractChar, Number} at char.jl:50
  (::Type{T})(::BigInt) where T<:Union{Int128, Int16, Int32, Int64, Int8} at gmp.jl:356
  ...
Stacktrace:
 [1] top-level scope
   @ REPL[37]:1

julia> Integer(Float64x4(3.0))
ERROR: MethodError: no method matching Integer(::MultiFloat{Float64, 4})
Closest candidates are:
  (::Type{T})(::T) where T<:Number at boot.jl:760
  (::Type{T})(::AbstractChar) where T<:Union{AbstractChar, Number} at char.jl:50
  (::Type{T})(::Base.TwicePrecision) where T<:Number at twiceprecision.jl:243
  ...
Stacktrace:
 [1] top-level scope
   @ REPL[38]:1

It would be nice for these to work, to make it easy to switch between Float64 and MultiFloats.
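
Until such methods exist, one hedged stop-gap (a hypothetical definition, not part of the package) is to route the conversion through BigFloat:

using MultiFloats

# Stop-gap integer conversion via BigFloat; throws InexactError for
# non-integral values, just like Int(3.5) does.
Base.Int64(x::MultiFloat{Float64,N}) where {N} = Int64(BigFloat(x))

Int(Float64x4(3.0))   # 3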

Suboptimal performance of multiplication by an integer

Just a little note on performance:

f(v::Vector{<:Real}, binop::F, f::Real) where {F <: Function} =
  map!(
    let f = f
      e -> binop(e, f)
    end,
    v,
    v)

@benchmark f(v, *, Float64x4(3)) setup=(v = rand(Float64x4, 2^14))  # median time: 260.556 μs

@benchmark f(v, *, 3.0) setup=(v = rand(Float64x4, 2^14))  # median time: 169.787 μs

@benchmark f(v, *, 0x3) setup=(v = rand(Float64x4, 2^14))  # median time: 250.470 μs

@benchmark f(v, *, Float32(3)) setup=(v = rand(Float64x4, 2^14))  # median time: 249.599 μs

Comparing the first and second benchmarks, it seems that multiplication with Float64 is special-cased, leading to very good performance.

The third and fourth benchmarks are disappointing, though; I guess the UInt8 and Float32 are first converted to Float64x4 before the multiplication?

Perhaps converting to Float64 instead of to Float64x4 would lead to better performance? Or maybe generating special cases for multiplication with small integers or with Float32?
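
As a rough check of that hypothesis, converting the scalar to Float64 up front (a sketch of a user-side workaround, not a library change) should recover the fast path:

using MultiFloats

v = rand(Float64x4, 2^14)
v .*= Float64(0x3)   # convert the small integer once, then hit the Float64 fast path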

bug: "clean" mode has worse accuracy than regular Float64

Here's a case where MultiFloats "clean" mode is worse than "standard" mode. To make matters worse, "clean" mode is worse than Float64 in this case:

julia> using MultiFloats

julia> MultiFloats.use_clean_multifloat_arithmetic(20)

julia> const F = MultiFloat{Float64, n} where {n}
Float64x (alias for MultiFloat{Float64})

julia> f(x) = isone(x*inv(x))
f (generic function with 1 method)

julia> f(3.0^100)
true

julia> f(F{13}(3.0^100))
false

julia> MultiFloats.use_standard_multifloat_arithmetic(20)

julia> f(F{13}(3.0^100))
true

julia> MultiFloats.use_clean_multifloat_arithmetic(20)

julia> f(F{13}(3.0^100))
false

So there seems to be a bug somewhere in "clean" mode.

`significand`, `frexp` not defined

julia> significand(Float64x2(123.4))
ERROR: MethodError: no method matching significand(::MultiFloat{Float64, 2})
Closest candidates are:
  significand(::T) where T<:Union{Float16, Float32, Float64} at math.jl:896
  significand(::DoubleFloat{T}) where T<:Union{Float16, Float32, Float64} at ~/.julia/packages/DoubleFloats/h3HrU/src/math/prearith/prearith.jl:68
  significand(::BigFloat) at mpfr.jl:873
...

julia> significand(123.4)
1.928125

Xref JuliaMath/DoubleFloats.jl#148

Exp/log implementation burden ?

Hey,

In my application, I only require exponentials and logarithms on top of what is already implemented.

How hard would it be to add those two functions? Do you know how it should be done? I would be happy to contribute.

Edit: Could a temporary fix be introduced by converting to BigFloat and back? I know this is highly sub-optimal, but it would at least allow the code to run. Since exp/log are not the main part of my workflow, it might still provide a huge performance gain over BigFloat for me.
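
A sketch of that BigFloat round-trip, written as a standalone helper so it does not clash with any exp method the package itself may define (the 64*N working precision is my assumption):

using MultiFloats

# Hypothetical helper: evaluate exp in BigFloat at a generous precision,
# then convert the result back to a MultiFloat.
function exp_via_bigfloat(x::MultiFloat{Float64,N}) where {N}
    setprecision(BigFloat, 64 * N) do
        MultiFloat{Float64,N}(exp(BigFloat(x)))
    end
end

exp_via_bigfloat(Float64x4(1.0))   # ≈ 2.718281828...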

Warning sign in Feature Comparison table

In the Feature Comparison table, there is a warning sign for DoubleFloats on dynamic memory allocation. Could you please elaborate a little bit? Are you referring to the conversion from a Double64 to a (Float64, Float64) pair? Does this explain why MultiFloats is faster than DoubleFloats (assuming both use the same number of operations)?

BTW, thanks for making such a performant library with precise code!

No method matching round(MF)

Doing some optimisations, I encountered the lack of a rounding method for MultiFloats:

ERROR: MethodError: no method matching round(::MultiFloat{Float64, 4}, ::RoundingMode{:Up})
Closest candidates are:
  round(::Real, ::RoundingMode; digits, sigdigits, base) at floatfuncs.jl:129
  round(::Complex, ::RoundingMode) at complex.jl:1038
  round(::Complex, ::RoundingMode, ::RoundingMode; kwargs...) at complex.jl:1038

The same thing happened for trunc. Defining:

import Base.round, Base.trunc
Base.round(x::MultiFloat{Float64, 4}, y::RoundingMode{:Up}) = MultiFloat{Float64, 4}(Base.round(Float64(x),y))
Base.trunc(x::Type{Int64}, y::MultiFloat{Float64, 4}) = Base.trunc(x::Type{Int64}, Float64(y))

fixed the problem for my script, but I'm not sure it's the right thing to do.
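
A possibly more robust variant (still unofficial, my own sketch) is to round through BigFloat, which covers every RoundingMode and does not discard information from values too large or too precise for a single Float64:

using MultiFloats

# Hypothetical methods: round/trunc via BigFloat instead of Float64.
Base.round(x::MultiFloat{Float64,N}, r::RoundingMode) where {N} =
    MultiFloat{Float64,N}(round(BigFloat(x), r))
Base.trunc(::Type{Int64}, x::MultiFloat{Float64,N}) where {N} =
    trunc(Int64, BigFloat(x))

round(Float64x4(2.3), RoundUp)   # 3.0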

Dividing x ≠ 0 by zero returns NaN instead of ±Inf

The convention for dividing by zero is not uniform across all levels of precision:

one(Float64) / zero(Float64)      # Inf
one(Float64x1) / zero(Float64x1)  # Inf
one(Float64x2) / zero(Float64x2)  # NaN

Is it possible to have Float64x2, etc follow the same convention as Float64?
I have parts of my code in Tulip.jl that rely on this convention.

I think this should cover all cases:

| x | y | x/y |
|----|----|------|
| >0 | +0 | +Inf |
| >0 | -0 | -Inf |
| ±0 | ±0 | NaN |
| <0 | +0 | -Inf |
| <0 | -0 | +Inf |

Correct implementation of hypot(x, y)

MultiFloats.jl currently has a stop-gap implementation of Base.hypot(x, y) to make MultiFloat usable with the eigenvalue algorithms in GenericLinearAlgebra.jl. This implementation does not correctly avoid undue overflow/underflow, but it should be possible to write an implementation that does.

Catastrophic loss of precision in `hypot`

hypot is just not correctly implemented. It loses all digits of precision for small as well as large values.

using MultiFloats
small = 1e-300
@assert hypot(small, small) / sqrt(2) ≈ 1e-300
xsmall = Float64x2(small)
@assert hypot(xsmall, xsmall) / sqrt(2) ≈ 1e-300

One should probably have some really robust precision audit as part of the unit tests.
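
For reference, the usual scaling trick looks roughly like the following (a sketch that omits Inf/NaN handling; this is not the package's implementation):

using MultiFloats

# Scale by the larger magnitude so that squaring can neither overflow nor
# underflow, then undo the scaling at the end.
function scaled_hypot(x::T, y::T) where {T<:Real}
    lo, hi = minmax(abs(x), abs(y))
    iszero(hi) && return hi
    r = lo / hi
    return hi * sqrt(one(T) + r * r)
end

scaled_hypot(Float64x2(1e-300), Float64x2(1e-300)) / sqrt(Float64x2(2))   # ≈ 1e-300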

Conversion error

I ran into a hole in the mesh of methods.

Conversion from Complex is ambiguous, since there is a constructor that creates a MultiFloat from limbs. Defining:

MultiFloat{T, N}(z::Complex) where {T, N} = MultiFloat{T, N}(BigFloat(z))

filled the hole.

`repr`

repr(x) should output

Float64xN((x1, x2, ..., xN))

That way you can actually copy and paste them.
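
One way to get that output (a hypothetical method relying on the internal _limbs field, so it could break across versions):

using MultiFloats

# Print the full limb tuple so that the value can be copy-pasted and
# reconstructed exactly.
Base.show(io::IO, x::MultiFloat{Float64,N}) where {N} =
    print(io, "Float64x", N, "(", x._limbs, ")")

repr(Float64x2(2.0))   # "Float64x2((2.0, 0.0))"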

`exp`, `log` and friends slow.

JuliaMath/DoubleFloats.jl#136 gives DoubleFloats an exp (and exp2, exp10) implementation that is around 4x faster than what MultiFloats is using. If you wanted this version to work for all levels of precision, it would take some extra engineering, but IMO, it would be worth it.

Some rounding error issue between CPU & GPU

I've tried MultiFloats.jl on the GPU, but I'm getting loss of precision compared to the CPU:

using CUDA, MultiFloats
A = rand(Float64x8, 100, 100)
B = rand(Float64x8, 100, 100)
A * B - Array(CuArray(A) * CuArray(B))

Gives me

100×100 Matrix{MultiFloat{Float64, 8}}:
 -1.23827e-98    9.35263e-99  -8.83181e-99  …  -4.70324e-99   -1.3348e-98
 -1.98421e-99    8.20389e-99   1.67043e-98      1.45499e-98    2.32225e-98
 -2.77264e-99   -3.30951e-99   1.32426e-98     -1.09181e-98    7.84157e-100
  1.92544e-98    6.35776e-99  -8.85547e-99      1.29435e-98   -4.89252e-99
 -5.52038e-99    5.35901e-99  -3.705e-98        1.53947e-99    7.38954e-99
 -2.16904e-98    1.64505e-98  -1.16536e-98  …  -3.19036e-98    7.5397e-99
  6.72487e-98    6.07349e-99  -2.87359e-98      ...

but eps(Float64x8) is 5.9091063153828709e-126.

What explains this? The order of iteration?

Support a^b

If a and b are MultiFloats, then a^b should return a proper output. For the moment I fall back to exp(b*log(a)).
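
That fallback, written out as a method (a sketch valid only for positive bases; a full implementation would also need to handle negative bases and integer exponents):

using MultiFloats

# Hypothetical method: a^b = exp(b * log(a)), correct for a > 0.
Base.:^(a::MultiFloat{Float64,N}, b::MultiFloat{Float64,N}) where {N} =
    exp(b * log(a))

Float64x4(2) ^ Float64x4(0.5)   # ≈ 1.41421356... (sqrt(2))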

bug: weird issues with big MultiFloats

julia> using MultiFloats

julia> MultiFloats.use_standard_multifloat_arithmetic(20)

julia> g(n::Integer) = rand(MultiFloat{Float64, n})
g (generic function with 1 method)

julia> x = g(20)
-2.2914735159347884e+302

julia> 1 < x
false

julia> x < -1
true

julia> Float64(x)
0.6714350546186427

So it seems there are multiple issues here:

  1. rand produces a number of huge magnitude
  2. rand produces negative number
  3. the conversion to Float64 (producing a positive number) is not consistent with the < operator

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

exp2, exp10, log2, log10 implementation

It is probably a good idea to use exp2 as the fundamental exponential function (it has fewer roundings) and then just implement exp and exp10 with an extra multiplication by log2(e) and log2(10), respectively. A similar strategy can be used for logarithms.
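
In formula form, the strategy amounts to the following (a sketch; in a real MultiFloat implementation the constants log2(e) and log2(10) would need to be stored to full extended precision rather than computed as Float64):

# e^x = 2^(x * log2(e)),  10^x = 2^(x * log2(10)),  ln(x) = log2(x) / log2(e)
exp_from_exp2(x)   = exp2(x * log2(ℯ))
exp10_from_exp2(x) = exp2(x * log2(10))
log_from_log2(x)   = log2(x) / log2(ℯ)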

Promotion of Irrationals

julia> pi * Float64x4(2)
ERROR: MethodError: Cannot `convert` an object of type Irrational{} to an object of type NTuple{4, Float64}

Perhaps there might be a way to convert these to BigFloat and then to MultiFloats

julia> 2Float64x4(big(pi))
6.283185307179586476925286766559005768394338798750211641949889184604
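
A hedged sketch of that suggestion as a constructor method (hypothetical; the 64*N working precision is my assumption):

using MultiFloats

# Convert irrational constants through a sufficiently precise BigFloat.
MultiFloat{Float64,N}(x::AbstractIrrational) where {N} =
    setprecision(BigFloat, 64 * N) do
        MultiFloat{Float64,N}(BigFloat(x))
    end

2 * Float64x4(pi)   # ≈ 6.283185307179586...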

Inf + 1 returns NaN

There seems to be something wrong with how Inf is treated when higher precision is used:

julia> Float64x1(-Inf) - 1
-Inf
julia> Float64x2(-Inf) - 1  # Same happens for Float64x3, etc
NaN

and

julia> Float64x1(Inf) + 1
Inf
julia> Float64x2(Inf) + 1  # Same happens for Float64x3, etc
NaN

Vectorization problems (and how to fix them)

Current codegen is unfortunately not great. For instance, a trivial loop like this:

julia> function trivial_sum(xs::Vector{Float64x4})
         t = zero(Float64x4)
         @simd for i in 1:length(xs)
           t += @inbounds xs[i]
         end
         t
       end

generates the following inner loop:

L48:
	vmovsd	-24(%rdx), %xmm5                # xmm5 = mem[0],zero
	vmovsd	-16(%rdx), %xmm6                # xmm6 = mem[0],zero
	vmovsd	-8(%rdx), %xmm7                 # xmm7 = mem[0],zero
	vaddsd	%xmm5, %xmm4, %xmm9
	vsubsd	%xmm4, %xmm9, %xmm0
	vsubsd	%xmm0, %xmm9, %xmm3
	vsubsd	%xmm3, %xmm4, %xmm3
	vsubsd	%xmm0, %xmm5, %xmm0
	vaddsd	%xmm3, %xmm0, %xmm0
	vaddsd	%xmm6, %xmm2, %xmm3
	vsubsd	%xmm2, %xmm3, %xmm4
	vsubsd	%xmm4, %xmm3, %xmm5
	vsubsd	%xmm5, %xmm2, %xmm2
	vsubsd	%xmm4, %xmm6, %xmm4
	vaddsd	%xmm2, %xmm4, %xmm2
	vaddsd	%xmm7, %xmm1, %xmm4
	vsubsd	%xmm1, %xmm4, %xmm5
	vsubsd	%xmm5, %xmm4, %xmm6
	vsubsd	%xmm6, %xmm1, %xmm1
	vsubsd	%xmm5, %xmm7, %xmm5
	vaddsd	%xmm1, %xmm5, %xmm1
	vaddsd	(%rdx), %xmm8, %xmm5
	vaddsd	%xmm1, %xmm5, %xmm1
	vaddsd	%xmm0, %xmm3, %xmm5
	vsubsd	%xmm3, %xmm5, %xmm6
	vsubsd	%xmm6, %xmm5, %xmm7
	vsubsd	%xmm7, %xmm3, %xmm3
	vsubsd	%xmm6, %xmm0, %xmm0
	vaddsd	%xmm3, %xmm0, %xmm0
	vaddsd	%xmm2, %xmm4, %xmm3
	vsubsd	%xmm4, %xmm3, %xmm6
	vsubsd	%xmm6, %xmm3, %xmm7
	vsubsd	%xmm7, %xmm4, %xmm4
	vsubsd	%xmm6, %xmm2, %xmm2
	vaddsd	%xmm4, %xmm2, %xmm2
	vaddsd	%xmm0, %xmm3, %xmm4
	vsubsd	%xmm0, %xmm4, %xmm6
	vsubsd	%xmm6, %xmm4, %xmm7
	vsubsd	%xmm7, %xmm0, %xmm0
	vsubsd	%xmm6, %xmm3, %xmm3
	vaddsd	%xmm0, %xmm3, %xmm0
	vaddsd	%xmm0, %xmm2, %xmm0
	vaddsd	%xmm0, %xmm1, %xmm0
	vaddsd	%xmm0, %xmm4, %xmm1
	vsubsd	%xmm4, %xmm1, %xmm2
	vsubsd	%xmm2, %xmm0, %xmm0
	vaddsd	%xmm1, %xmm5, %xmm2
	vsubsd	%xmm5, %xmm2, %xmm3
	vsubsd	%xmm3, %xmm1, %xmm1
	vaddsd	%xmm2, %xmm9, %xmm4
	vsubsd	%xmm9, %xmm4, %xmm3
	vsubsd	%xmm3, %xmm2, %xmm3
	vaddsd	%xmm3, %xmm1, %xmm2
	vsubsd	%xmm3, %xmm2, %xmm3
	vsubsd	%xmm3, %xmm1, %xmm3
	vaddsd	%xmm3, %xmm0, %xmm1
	vsubsd	%xmm3, %xmm1, %xmm3
	vsubsd	%xmm3, %xmm0, %xmm8
	addq	$32, %rdx
	decq	%rcx
	jne	L48

By relaxing the type restrictions in MultiFloats.jl a bit, and patching VectorizationBase.jl to add another Vec type that doesn't allow for certain fast-math ops, it should be possible to get a 4x speedup on AVX2 and potentially 8x on AVX512 (well, that's the upper bound :p).

A minimal set of changes to allow to use VectorizationBase is here: haampie@0e83c05

The idea is to work with a type MultiFloat{Vec{M,T},N} and run the basic *, /, +, and - operations on 2, 4, or 8 numbers simultaneously.

Support ranges (if possible)

These all fail with MethodError: no method matching round(::MultiFloat{Float64, 2}, ::RoundingMode{:Down}):

map(Float64x2, 0:5)
Float64x2(0):Float64x2(5)
Float64x2(0):Float64x2(1):Float64x2(5)

The workaround is to use generators:

julia> (Float64x2(i) for i in 0:3)
Base.Generator{UnitRange{Int64}, var"#3#4"}(var"#3#4"(), 0:3)

conversion to/from other extended precision types

This is really a question rather than an issue. I hope that's OK.

I would like to convert between various extended-precision types.
Are these conversions for MultiFloats correct and idiomatic?

using DoubleFloats
using MultiFloats
using Quadmath

@inline MultiFloat{Float64,2}(x::DoubleFloats.Double64) = Float64x2(x.hi) + x.lo

@inline function MultiFloat{Float64,2}(x::Quadmath.Float128)
    hi = Float64(x)
    lo = Float64(x - hi)
    return Float64x2(hi) + lo
end

@inline Quadmath.Float128(x::Float64x2) = Quadmath.Float128(x._limbs[1]) + x._limbs[2]
function test(x)
    d = Double64(x)
    m = Float64x2(x)
    q = Float128(x)
    b = BigFloat(x)
    return d, m, q, b
end

test(Double64(1.0))
test(Float64x2(2.0))
test(Float128(3.0))
test(BigFloat(4.0))

Implement String -> MultiFloat conversion

This is hard to do in a fashion that guarantees the MultiFloat -> String -> MultiFloat round-trip conversion exactly reconstructs the original value, due to the variable-sized gaps between the components of a MultiFloat. We may ultimately want two conversion pathways: one that guarantees round-trip reconstructibility, and one that produces a nice readable result.
