ulfjack / ryu Goto Github PK

View Code? Open in Web Editor NEW

1.2K 1.2K 98.0 8.8 MB

Converts floating point numbers to decimal strings

License: Apache License 2.0

C++ 76.28% C 18.88% Java 4.34% Batchfile 0.09% Shell 0.02% Makefile 0.04% Starlark 0.30% CMake 0.06%

ryu's People

Contributors

Stargazers

Watchers

Forkers

stephantlavavej remotion abolz watmough perryao mitchblank emsr wenshao baojiawei1230 jorgbrown fqntxmqee anarazel nanawe joeced jihchi biodataanalysis alex-ilin cespare vgrudenic gcaliene nukepayload2 sahwar marcusjohnson91 robhz786 erthink joeerickson marler8997 trading-for-fun maelh dowenliu-xyz finii longjohncoder operativef vinniefalco gritzko mohabouje michaeljclark alcaro algunenano clickhouse javamonkey zhangwenquan sharma-hrishabh arn4 luminartech forksnd plokhotnyuk rlespinet crackercat paulmarome kparc forworldm stensalweb artoria2e5 biojppm dtolnay-contrib 9il icodein ironmeld exodusdb maisonobe zhuomingliang moarvm lns dianaolympos x5500 chirsz-ever ezhangle sondro clayne athrun-du monflabs-forks nightowl888 silky ditdb dunhamsteve mul1sh hl0071 alugowski schwehr mazino3 psyirius dosworld 0382 thechampagne navegador5 cruiserx asa res-life jamestiotio bebesparkelsparkel snowbellows dandancool cassioneri hmdyuck

ryu's Issues

Reduce 64-bit lookup table size

It should be possible (and cheap) to only store every other value, and take the next smaller available entry, but it may increase the number of bits required for the intermediate calculations.

Compability with ES6 / V8

I'm working with something that possibly will become an IETF standard:
https://tools.ietf.org/html/draft-rundgren-json-canonicalization-scheme-02
It depends on number formatting compatible with ES6/V8.

A short test revealed that the Java version of Ryu is not entirely compatible with this scheme.
x0000000000000001 returns 5e-324 in ES6/V8 while 4.9e-324 in Ryu

This is not really a bug but could be worth knowing.

https://github.com/cyberphone/json-canonicalization/tree/master/testdata#es6-numbers

Fallback code for compilers that don't support 128 bit operations

d2s uses non-standard 128-bit operators. While this is simpler and (probably) faster, it would probably be good to add fallback code for compilers that support neither the __umul128 intrinsic nor __uint128_t type.

Java: in correct rounding to two digits

The Java implementation sometimes doesn't output the closest two-digit representation of a number, but instead rounds to one digit (correctly), and then appends a '0'.

Known examples:
1.0E-323 should be 9.9E-324
1.0E-322 should be 9.9E-323

Investigate LKK remainder and divisibility algorithms

See Faster remainders when the divisor is a constant: beating compilers and libdivide and the paper Faster Remainder by Direct Computation: Applications to Compilers and Software Libraries by Daniel Lemire, Owen Kaser, and Nathan Kurz. This has the potential to improve f2s and possibly d2s.

Ideally, compilers would implement this, so we could continue using % 5 and so forth in the source code.

Java: got "1.943037616030839E16" for 1.9430376160308388E16 input (even rounding mode)

IN=100001101010001010000011111010010111111001110001100101100101001
   S=+ E=2 M=4857594040077097
E =16
d+=19430376160308390
d =19430376160308388
d-=19430376160308386
e2=0
19430376160308388 * 2^0
V+=19430376160308390
V =19430376160308388
V-=19430376160308386
19430376160308388 2
d+=19430376160308390
d =19430376160308388
d-=19430376160308386
e10=0
d-10=false
LAST_REMOVED_DIGIT=8
VP=1943037616030839
VR=1943037616030838
VM=1943037616030838
O=1943037616030839
OLEN=16
EXP=16
1.943037616030839E16 1.9430376160308388E16

Support round-even for the Java impl

See branch java-round-even, which has started this.

As of ed05761, the C code always rounds to even.

Losing the battle to convert a string to double

Going from double to string is amazing using this lib, but I am having trouble efficiently and correctly parsing a string into a double. This is what I have, any ideas? Is there some ulfjack magic that could be worked here?

https://github.com/vinniefalco/json/blob/develop/include/boost/json/detail/number.ipp

Integrate Manual Modulo from Puff Algorithm.

I am the author of the Fastest Method to Printing Integers and Floating-point numbers, which contains the manual modulo optimization for Ryu. When you do mod 100 div 100, you just did 2 division instructions when you only needed to do one. You're welcome. You are now apart of the Uniprinter (Universal Printer), which I'm aiming to be the world's fastest text processing library that grows from stack to heap with optional dynamic memory, compiles in about 3 seconds, and actually prints Unicode in C++ right.

I'm still looking through the algorithm for optimizations, I just cloned.

I just ran the 128-bit bit shifting tables from Puff and yes, the pattern generalizes. The uint32_t decimalLength(const uint128_t v) function is very slow.

#Bits	Max	#Bits	Max	#Bits	#Bits	Max
1	1.00E+00	33	8.59E+09	65	3.69E+19	97
2	3.00E+00	34	1.72E+10	66	7.38E+19	98
3	7.00E+00	35	3.44E+10	67	1.48E+20	99
4	1.50E+01	36	6.87E+10	68	2.95E+20	100
5	3.10E+01	37	1.37E+11	69	5.90E+20	101
6	6.30E+01	38	2.75E+11	70	1.18E+21	102
7	1.27E+02	39	5.50E+11	71	2.36E+21	103
8	2.55E+02	40	1.10E+12	72	4.72E+21	104
9	5.11E+02	41	2.20E+12	73	9.44E+21	105
10	1.02E+03	42	4.40E+12	74	1.89E+22	106
11	2.05E+03	43	8.80E+12	75	3.78E+22	107
12	4.10E+03	44	1.76E+13	76	7.56E+22	108
13	8.19E+03	45	3.52E+13	77	1.51E+23	109
14	1.64E+04	46	7.04E+13	78	3.02E+23	110
15	3.28E+04	47	1.41E+14	79	6.04E+23	111
16	6.55E+04	48	2.81E+14	80	1.21E+24	112
17	1.31E+05	49	5.63E+14	81	2.42E+24	113
18	2.62E+05	50	1.13E+15	82	4.84E+24	114
19	5.24E+05	51	2.25E+15	83	9.67E+24	115
20	1.05E+06	52	4.50E+15	84	1.93E+25	116
21	2.10E+06	53	9.01E+15	85	3.87E+25	117
22	4.19E+06	54	1.80E+16	86	7.74E+25	118
23	8.39E+06	55	3.60E+16	87	1.55E+26	119
24	1.68E+07	56	7.21E+16	88	3.09E+26	120
25	3.36E+07	57	1.44E+17	89	6.19E+26	121
26	6.71E+07	58	2.88E+17	90	1.24E+27	122
27	1.34E+08	59	5.76E+17	91	2.48E+27	123
28	2.68E+08	60	1.15E+18	92	4.95E+27	124
29	5.37E+08	61	2.31E+18	93	9.90E+27	125
30	1.07E+09	62	4.61E+18	94	1.98E+28	126
31	2.15E+09	63	9.22E+18	95	3.96E+28	127
32	4.29E+09	64	1.84E+19	96	7.92E+28	128

Mod 10 is very slow due to the number of decimals in the result (i.e. the less you have to divide by the faster their newton's method they use for the hardware division) I've benchmarked on 64-bit systems to be faster than mod 10 what is called the MSD Look-Up-Down (MSD-LUD), but I gave up on it when I discovered Puff. The only version I saved prints both ends of the number simultaneously, it is here, but that version of the algorithm has issues and Script2 is down for surgery for at least one more month to update the v1 ASCII Data Spec.

Public interface should provide safe buffer lengths

There should be #define macros in rhu.h which specifies safe buffer lengths for the buffered functions.

Does this work on all platforms?

What happens if std::numeric_limits<double>::is_iec559() returns false?

d2s/f2s RYU_DEBUG inconsistency

d2s.c line 253:

#ifdef RYU_DEBUG
  printf("E=%d M=%" PRIu64 "\n", e2 + 2, m2);
#endif

f2s.c line 168:

#ifdef RYU_DEBUG
  printf("E=%d M=%u\n", e2, m2);
#endif

e2 and m2 are computed in a structurally identical manner, so it seems inconsistent that d2s prints e2 + 2 while f2s prints e2, but I don't know which one is desired.

Big endian architectures?

The code that extracts the bits from a float, and the code that converts bits to a float probably don't work on big endian machines.

Why is the exponent 32 bits?

I see in the floating_decimal structure the exponent is stored using a 32-bit signed integer. But exponents for doubles only have a range of ~2,000 distinct integers, why isn't this a short instead?

Thank you for Boost License!

Thank you so much for using this under-appreciated license!!!

Double-conversion compatible output?

Hi, after seeing the benchmarks we are looking into using Ryu instead of the double-conversion library for double-to-string conversions in our JSON library.

For that we don't need flexible output formats like in #27, however we would prefer the output to be formatted like in the double-conversion library, i.e. use scientific notation only for large numbers.

After changing Ryu to use 0.0 instead of 0E0, and e instead of E, here are some examples of how Ryu compares to double-conversion:

= 0.0
= -0.0
! 1.0 1e0
! -1.0 -1e0
! 1.5 1.5e0
! -1.5 -1.5e0
! 0.0001 1e-4
! -0.0001 -1e-4
! 3.1416 3.1416e0
! 42.0 4.2e1
! -42.0 -4.2e1
! 0.1 1e-1
! 10000000000.0 1e10
! -10000000000.0 -1e10
! -10000000000.0 -1e10
= -1e-10
! 12340000000.0 1.234e10
= 1.234e-10
= 1.79769e308
= 2.22507e-308
= -1.79769e308
= -2.22507e-308
= 5e-324
= 2.225073858507201e-308
= 2.2250738585072014e-308
= 1.7976931348623157e308
! 18446744073709552000.0 1.8446744073709552e19
! -9223372036854776000.0 -9.223372036854776e18
! 0.9868011474609375 9.868011474609375e-1
= 1.23e36
! 45913141877270640000.0 4.591314187727064e19
= 2.225073858507201e-308
= 1.7976931348623157e308
= 2.2250738585072014e-308
= 2.225073858507201e-308
! 0.9999999999999999 9.999999999999999e-1
! 1.0000000000000002 1.0000000000000002e0
! 72057594037927930.0 7.205759403792793e16
! 72057594037927940.0 7.205759403792794e16
! 9223372036854775000.0 9.223372036854775e18
! 9223372036854776000.0 9.223372036854776e18
= 1.0141204801825834e31
= 1.0141204801825835e31
= 5.708990770823839e45
= 5.70899077082384e45
! 300.0 3e2

A leading = means that the two coincide, a leading ! means that they don't, with the first number being from double-conversion, and the second from Ryu.

Any thoughts on this subject, or hints on how to go about it?

Thank you!

Investigate if there are any cases for which the gap approach improves performance

https://github.com/ulfjack/ryu/tree/c-check-gap

Maybe add 80-bit float support

Tim asked for that.

Maybe add half (16-bit float) support

TBD

Missing loop in two-digit optimization?

Hey @ulfjack, thanks for the great algorithm!

Quick question: should this be a loop rather than an if, or was that intentional?

ryu/ryu/d2s.c

Line 414 in b2ae7a7

 if (vpDiv100 > vmDiv100) { // Optimization: remove two digits at a time (~86.2%). 

I haven't tried building your benchmarks yet, but in my own implementation based on your C code changing this to a loop gives a noticeable speedup for formatting numbers like 0.3.

Hi, I'm developing a C++ formatting library that uses Ryu

I adapted your code to C++ and I'm using it now in the strf formatting library.

You saved me a ton of work. So I want to thank you.

Improvement: Compare with more algorithms in benchmark

see Milo Yips / C++ double-to-string conversion benchmark
https://github.com/miloyip/dtoa-benchmark

compute_shortest question!

In the paper, in compute_shortest algorithm, can we send to input a, b, c, such that length(b) < length(c)? If yes, can we get bi(output final value) equal to zero, or zero + 1(by rounding)?

=====================================
For example in vacuum: a = 6, b = 8, c = 11, and as output we can get 1e1, which is not closest to input: right answer is just 8e0 ?

Where am i wrong?

The tests leak

Memory allocated by routines such as f2s is never freed. When the tests are run under a sanitizer, leaks are reported.

Maybe provide an API to select rounding modes

The current code hard-codes round-even; we may want to support other rounding modes in the future, e.g., to be compatible with https://github.com/google/double-conversion.

Missing carries in non-intrinsic umul128()?

ryu/ryu/d2s_intrinsics.h

Lines 48 to 76 in 2ec9ead

 static inline uint64_t umul128(const uint64_t a, const uint64_t b, uint64_t* const productHi) { 

 // The casts here help MSVC to avoid calls to the __allmul library function. 

 const uint32_t aLo = (uint32_t)a; 

 const uint32_t aHi = (uint32_t)(a >> 32); 

 const uint32_t bLo = (uint32_t)b; 

 const uint32_t bHi = (uint32_t)(b >> 32); 

 const uint64_t b00 = (uint64_t)aLo * bLo; 

 const uint64_t b01 = (uint64_t)aLo * bHi; 

 const uint64_t b10 = (uint64_t)aHi * bLo; 

 const uint64_t b11 = (uint64_t)aHi * bHi; 

 const uint32_t b00Lo = (uint32_t)b00; 

 const uint32_t b00Hi = (uint32_t)(b00 >> 32); 

 const uint64_t mid1 = b10 + b00Hi; 

 const uint32_t mid1Lo = (uint32_t)(mid1); 

 const uint32_t mid1Hi = (uint32_t)(mid1 >> 32); 

 const uint64_t mid2 = b01 + mid1Lo; 

 const uint32_t mid2Lo = (uint32_t)(mid2); 

 const uint32_t mid2Hi = (uint32_t)(mid2 >> 32); 

 const uint64_t pHi = b11 + mid1Hi + mid2Hi; 

 const uint64_t pLo = ((uint64_t)mid2Lo << 32) + b00Lo; 

 *productHi = pHi; 

 return pLo; 

 }

implements a 64 * 64 = 128 wide multiplication. I realized that it implements the grade-school algorithm for multiplying two 2-digit numbers (where each "digit" is uint32_t), except that it's missing carries! This seems deeply concerning for correctness, but I haven't found any failures due to the missing carries yet.

Is there something about how Ryu calls umul128() that prevents this from being a problem? If so, should it be commented and/or asserted? My attempts to add carrying code have degraded x86 perf by at least 10%.

(I noticed that I replicated this behavior in umul256() so it has greater impact.)

Can POW10_SPLIT_2 be smaller?

POW10_SPLIT_2 is large, 8 * 4445 * 3 = 106,680 bytes:

ryu/ryu/d2fixed_full_table.h

Line 1286 in 9976c57

static const uint64_t POW10_SPLIT_2[4445][3] = {

Interestingly, it contains 1177 rows that are entirely zero. For example, lines 1583 and 1591 have nonzero elements, but lines 1584-1590 are entirely zero:

ryu/ryu/d2fixed_full_table.h

Lines 1583 to 1591 in 9976c57

 { 0u, 0u, 64000000000u }, 

 { 0u, 0u, 0u }, 

 { 0u, 0u, 0u }, 

 { 0u, 0u, 0u }, 

 { 0u, 0u, 0u }, 

 { 0u, 0u, 0u }, 

 { 0u, 0u, 0u }, 

 { 0u, 0u, 0u }, 

 { 175162308u, 0u, 0u },

These entirely zero rows are consuming 26.5% of the table, or 8 * 1177 * 3 = 28,248 bytes. Looking at the usage:

ryu/ryu/d2fixed.c

Lines 448 to 459 in 9976c57

 const uint32_t p = POW10_OFFSET_2[idx] + i; 

 if (p >= POW10_OFFSET_2[idx + 1]) { 

 // If the remaining digits are all 0, then we might as well use memset. 

 // No rounding required in this case. 

 const int32_t fill = precision - 9 * i; 

 memset(result + index, '0', fill); 

 index += fill; 

 break; 

 } 

 // Temporary: j is usually around 128, and by shifting a bit, we push it to 128 or above, which is 

 // a slightly faster code path in mulShift_mod1e9. Instead, we can just increase the multipliers. 

 uint32_t digits = mulShift_mod1e9(m2 << 8, POW10_SPLIT_2[p], j + 8);

ryu/ryu/d2fixed.c

Lines 650 to 653 in 9976c57

 const uint32_t p = POW10_OFFSET_2[idx] + i; 

 // Temporary: j is usually around 128, and by shifting a bit, we push it to 128 or above, which is 

 // a slightly faster code path in mulShift_mod1e9. Instead, we can just increase the multipliers. 

 digits = (p >= POW10_OFFSET_2[idx + 1]) ? 0 : mulShift_mod1e9(m2 << 8, POW10_SPLIT_2[p], j + 8);

makes me think that reorganizing POW10_OFFSET_2 and POW10_SPLIT_2 might allow these entirely zero rows to be eliminated for free (i.e. no additional lookups/branches), possibly even improving performance if this results in skipping mulShift_mod1e9 calls, but this is currently at the outer limit of my understanding of the algorithm.

Remove LEGACY_MODE

This was only added to match the results from the initial paper submission to support the PLDI artifact review. Since nobody can see the initial paper anymore, the code can be dropped now.

How do we get this into Boost?

We were discussing how to get this into Boost and an obvious solution came up, ask @StephanTLavavej ! We could use something header-only with the option of separate compilation. I was working on a port but I stopped because the changes to support this use case would be extensive and this project looks like it is still in development and might see considerably more commits that I would have to port to my C++-mangled header-only solution.

The routines which I ultimately need are

// this is user-facing
struct number
{
    unsigned long long mantissa;
    short exponent;
    bool sign;
};

number double_to_number(double);
char* number_to_chars(char* buf, number);

And of course an implementation of C++17's to_chars (so that users of older C++ versions have access to it).

There are a number of Boost libraries that would benefit from this code such as lexical_cast and a few others.

Release to Maven Central / Bintray

Hi, is it possible to release the Java portion to Maven Central and/or Bintray? Would make it easier to adopt. Thanks.

Feature request: Returning number of characters processed in ryu_parse

ryu_parse does not return the number of characters processed, so we are forced to tokenize the input before feeding it to s2d. s2d already knows when to stop consuming the input at the first non-number character, so there ends up being a duplication of effort and parsing a list of numbers is slower than it should be.

Both std::strtod and std::from_chars return a pointer to the first character not matching the number pattern. ryu_parse should do the same.

With this proposal, the INPUT_TOO_LONG status code would become obsolete.

Public identifiers should have a prefix to avoid name clashes

Public identifiers (extern functions and public enumerators) should have a prefix to avoid name collisions with user code and other libraries.

Extern functions should have a prefix such as ryu_.

The ryu_parse Status enumerators should have a prefix such as RYU_ or RYU_STATUS_ to avoid clashing with macros as well as identifiers in the global namespace. The SUCCESS enumerator is particularly susceptible to conflict.

MSVC - more testing required

While the code now compiles and passes unit tests, more testing on Windows would be good.

RyuDouble.java tests against Float.POSITIVE_INFINITY, not Double.POSITIVE_INFINITY

Forgive me if there is something subtle going on that I have missed.
RyuDouble.java tests against Float.POSITIVE_INFINITY and its opposite.
I can understand RyuFloat.java doing that, but it seems mildly astonishing
in RyuDouble.java. I would expect to see the test against Double.POSITIVE_INFINITY
and its negation.

 public static String doubleToString(double value, RoundingMode roundingMode) {
    // Step 1: Decode the floating point number, and unify normalized and subnormal cases.
    // First, handle all the trivial cases.
    if (Double.isNaN(value)) return "NaN";
    if (value == Float.POSITIVE_INFINITY) return "Infinity";
    if (value == Float.NEGATIVE_INFINITY) return "-Infinity";

Incorrect vrIsTrailingZeros?

In d2s.c, should this call

ryu/ryu/d2s.c

Line 315 in 00ecce8

vrIsTrailingZeros = multipleOfPowerOf2(mv, q - 1);

use q instead of q-1?

As I understand it, the code uses q' = max(0, q-1) to make sure that the loop below is executed at least once, which is required to get a correct value for lastRemovedDigit which in turn is required to correctly round the result. Later, the values of vrIsTrailingZeros and lastRemovedDigit are used to determine if the exact number ends in ...5000...0 here:

ryu/ryu/d2s.c

Line 378 in 00ecce8

if (vrIsTrailingZeros && lastRemovedDigit == 5 && vr % 2 == 0) {

Now, might it be possible that vrIsTrailingZeros is missing one digit, i.e.

                missed digit?
                v
            ...5?000...0
               ^ ~~~~~~^
lastRemovedDigit       vrIsTrailingZeros

I haven't found a counter-example... so it's probably me, who is missing something...

NB: the e2 >= 0 branch is using multipleOfPowerOf5(mv, q).

Related: In f2s.c,

ryu/ryu/f2s.c

Line 195 in 00ecce8

vrIsTrailingZeros = multipleOfPowerOf5(mv, q);

Shouldn't this use q-1 instead of q? In this case vrIsTrailingZeros includes the value of the last removed digit, but shouldn't it only include the digits after the last removed digit?

lastRemovedDigit
               v
            ...5000...0
               ~~~~~~~^
                      vrIsTrailingZeros

I have replaced the assignment above with

vrIsTrailingZeros = (q == 0) || multipleOfPowerOf5(mv, q - 1);

and tested all single-precision numbers: It doesn't affect the output. So, again, I'm probably missing something here...

NB: the e2 < 0 branch is using multipleOfPowerOf2(mv, q - 1) here.

Special case integers

There's no need to do all that work if the floating point number is an exact integer. @cespare mentioned in #86 that this is a significant win for Go, so we should do that too.

Remove trailing zeros from output functions

Hi, first of all thanks a lot for the library, it's performance is really awesome.

I've been working in integrating it in Postgis which decides the format and the amount of decimal digits to print depending on multiple parameters (user input and space left in the buffer), so I can't use d2s_buffered_n.

My main issue is that both d2fixed_buffered_n and d2exp_buffered_n include trailing zeros, like 15.0000 or 2.010e+04. Although it is possible to remove them manually by looking for . or e, it isn't very efficient; since ryu already knows about what's trailing and what not, I'd modified it to remove them if required.

I was wondering if it makes sense to contribute this here. If so, I would create a PR and make any necessary changes to have this included.

In case you want to have a look, the current patch looks like this:

diff --git a/ryu/d2fixed.c b/ryu/d2fixed.c
index b008dab..3bcc22f 100644
--- a/ryu/d2fixed.c
+++ b/ryu/d2fixed.c
@@ -420,7 +420,8 @@ int d2fixed_buffered_n(double d, uint32_t precision, char* result) {
   if (!nonzero) {
     result[index++] = '0';
   }
-  if (precision > 0) {
+  const bool printDecimalPoint = precision > 0;
+  if (printDecimalPoint) {
     result[index++] = '.';
   }
 #ifdef RYU_DEBUG
@@ -532,6 +533,18 @@ int d2fixed_buffered_n(double d, uint32_t precision, char* result) {
     memset(result + index, '0', precision);
     index += precision;
   }
+
+#if RYU_NO_TRAILING_ZEROS
+  if (printDecimalPoint) {
+    while (result[index - 1] == '0') {
+      index--;
+    }
+    if (result[index - 1] == '.') {
+      index--;
+    }
+  }
+#endif
+
   return index;
 }
 
@@ -771,6 +784,18 @@ int d2exp_buffered_n(double d, uint32_t precision, char* result) {
       }
     }
   }
+
+#if RYU_NO_TRAILING_ZEROS
+  if (printDecimalPoint) {
+    while (result[index - 1] == '0') {
+      index--;
+    }
+    if (result[index - 1] == '.') {
+      index--;
+    }
+  }
+#endif
+
   result[index++] = 'e';
   if (exp < 0) {
     result[index++] = '-';

It only does the extra checks when RYU_NO_TRAILING_ZEROS is defined, so the performance for the current case shouldn't be affected.

Some of the results of RyuFloat are different from that of JDK

        Random random = new Random();

        for (int i = 0; i < 1000 * 1000 * 10; ++i) {
            float value = random.nextFloat();

            String str1 = Float.toString(value);
            String str2 = RyuFloat.floatToString(value);

            if (!str1.equals(str2)) {
                System.out.println(str1 + " -> " + str2);
                assertTrue(Float.parseFloat(str1) == Float.parseFloat(str2));
            }
        }

different results:

0.33007812 -> 0.33007813
0.17382812 -> 0.17382813
0.0063476562 -> 0.0063476563
0.43945312 -> 0.43945313
0.33789062 -> 0.33789063
0.0043945312 -> 0.0043945313
0.036132812 -> 0.036132813
0.47851562 -> 0.47851563
0.17382812 -> 0.17382813
0.47851562 -> 0.47851563
0.26757812 -> 0.26757813
0.40039062 -> 0.40039063
0.36914062 -> 0.36914063
0.43945312 -> 0.43945313
0.18945312 -> 0.18945313
0.043945312 -> 0.043945313
0.40039062 -> 0.40039063
0.020507812 -> 0.020507813
0.040039062 -> 0.040039063
0.36132812 -> 0.36132813
0.37695312 -> 0.37695313
0.29882812 -> 0.29882813
0.36132812 -> 0.36132813
0.29882812 -> 0.29882813
0.36914062 -> 0.36914063
0.41601562 -> 0.41601563

Add f2fixed() and f2exp()?

ryu/ryu/d2fixed.c

Line 395 in a73883c

int d2fixed_buffered_n(double d, uint32_t precision, char* result) {

and

ryu/ryu/d2fixed.c

Line 597 in a73883c

int d2exp_buffered_n(double d, uint32_t precision, char* result) {

take 64-bit double but work with 32-bit float (because float can be losslessly widened to double; note that this is untrue for string-to-floating conversions, where rounding a string to a double and rounding again to a float is a subtle mistake).

Would dedicated 32-bit implementations be even faster, possibly by using narrower multiplications? Like how

ryu/ryu/f2s.c

Line 154 in a73883c

 static inline floating_decimal_32 f2d(const uint32_t ieeeMantissa, const uint32_t ieeeExponent) { 

is faster than

ryu/ryu/d2s.c

Line 238 in a73883c

 static inline floating_decimal_64 d2d(const uint64_t ieeeMantissa, const uint32_t ieeeExponent) { 

The library is slower than double-conversion on small integers.

About two times slower on numbers in range 0..9, about 1.5 times slower on numbers in range 0..999.

This is important because your library is pretending to be fastest:

At the time of this writing, these are the fastest known float-to-string conversion algorithms.

But to make the statement that something is fast, it's better to test on real applications and on real datasets.

In what applications there is a need to convert massive amounts of floating point numbers to strings? Some of this applications are using double because there is no other choice or because it is the reasonable default choice.

Example: JavaScript, Lua; JSON serialization, CSV and TSV.
In these applications, most of the doubles will be in fact, integers.

Another examples are time series databases. These databases store sensor values for monitoring. Various metrics can be stored in a single table. For example, server CPU usage is usually fractional, but the number of running processes in a server is integer. It's likely that the most of sensor values in some databases are, in fact, integers.

It means, that it's important to keep in mind the usage scenario when most of formatted floats are integers.

Look for more details here: ClickHouse/ClickHouse#8542
I have plugged ryu in ClickHouse and it shows very promising results except for one case.

Good news is that it's easy to fix with small impact on performance of formatting fractional floats!
Example is here: https://github.com/ClickHouse/ClickHouse/pull/8542/files#diff-5b2c0d6b450ea8c072f39c9625094f66R121
Just check that number is integer and do shortcut.

Some numbers on synthetic benchmark
million rows per second: Ryu, double-conversion, Ryu with shortcut for integers:

SELECT count() FROM system.numbers WHERE NOT ignore(toString(toFloat64(number % 2)))
18.81 26.85 123.33
SELECT count() FROM system.numbers WHERE NOT ignore(toString(toFloat64(number % 10)))
11.59 18.86 102.85
SELECT count() FROM system.numbers WHERE NOT ignore(toString(toFloat64(number % 100)))
10.25 16.08 93.77
SELECT count() FROM system.numbers WHERE NOT ignore(toString(toFloat64(number % 1000)))
10.87 14.92 86.98
SELECT count() FROM system.numbers WHERE NOT ignore(toString(toFloat64(number % 10000)))
10.95 13.57 82.56

Is there a Bug in Master?

In commit 388280c the comment is "add comprehensive tests, finding a core algorithm bug." That commit is only included in the generate-full-output branch.

Does it mean that there was a bug discovered in the Ryu algorithm, which is currently not fixed in master?

I'm asking, because I'm porting Ryu into a different language, and would like to understand the current state.

Thank you!

Suggest splitting into C and Java projects

Title says it all, I'd like to suggest splitting into separate C and Java projects, so that things like issues and commits aren't all mixed together...

RyuDouble.java: Redundant update of dvIsTrailingZeros flag

Following statement is redundant because the flag is not used in subsequent statements:

ryu/src/main/java/info/adams/ryu/RyuDouble.java

Line 300 in 199a388

dvIsTrailingZeros &= lastRemovedDigit == 0;

Improve control over the output format

Would it be possible to control whether the number is printed in exponent or decimal notation? Or, if more convenient, return the chosen location for the decimal point, if present, the number of digits as well as the exponent. And make it possible to get the chosen exponent as integer.

I'm trying to integrate this into a formatting library and as such would like to transform the output into all of the forms available for printf(). In particular, the decimal formatting with user provided precision makes it necessary to reformat the number heavily. While some are trivial to implement, reparsing the exponent to decide on the automatic formatting and moving the floating point separator feels suboptimal.

The double-conversion interfaces could be a guideline here but the defaults and best options are quite specialized to ECMAScript.

Add ability to specify infinity and NaN strings

It would be nice if a user could specify the string representation of infinity and NaN values.

For example:

"nan" instead of "NaN"
"inf" instead of "Infinity"

Java: float/double to byte array (or bytebuffer)

Ulf, thanks for your great work!

Do you have a plan to add more methods to store text representation of numbers except for Java strings?

Lot of serialization libraries and frameworks would appreciate the ability to store immediately into the pre-allocated array or a byte buffer.

Here is my attempt to adopt initial versions of RyuFloat.java and RyuDouble.java in the JSON serializer for Scala: plokhotnyuk/jsoniter-scala@1224719

For the float writing, I have able to avoid / and %operations at all. Also, access to tables was revised to minimize dependent pointer reads, etc. Would you like to try these tricks in your Java version?

Would a smarter decimalLength() help?

The current decimalLength() code is just a series of if() statements. Since the value is usually near the top of the range this isn't so bad (i.e. it's not common to get far down list of if() statements) However, it's going to compile down to a pretty large number of instructions and at least the first branch is going to predict pretty badly.

On platforms that have a "number-of-leading-zeros" instruction (including x86_64) there is simple branch-free method of computing an integerlog10() (with the caveat that an input of 0 needs to be treated as a special case when using that to computing "number of digits") This isn't my gist, but here's the first one that popped up when I googled: https://gist.github.com/CAFxX/ad150f2403a0604e14cc There are also variants that do a couple more ALU instructions to shrink the thr[] table into something that will fit in fewer cachelines -- may or may not be a win..

The above uses the gcc/clang a = __builtin_clzll(x) intrinsic. The MSVC equivalent of that is:

    unsigned long tmp;
    (void) _BitScanReverse64(&tmp, x);
    clz = 63 - tmp;

... the compiler will reduce that and eliminate tmp at the end, so don't let the &tmp scare you.

Anyway, maybe you've already tried something like this and didn't matter (I just watched the video of your talk, haven't read the paper yet) but maybe a couple nanoseconds could be shaved here.

As is usual this sort of code, the seminal reference material is Henry Warren's "Hacker's Delight" If you have a copy of the Second Edition at hand (everyone should...) he discusses a number of log10 options in the last 6 pages of chapter 11.

Differences from paper

I noticed there are some differences in the constants from what the paper would suggest. For example,

ryu/ryu/d2s.h

Lines 40 to 41 in 5c36146

 #define DOUBLE_POW5_INV_BITCOUNT 122 

 #define DOUBLE_POW5_BITCOUNT 121

whereas the paper suggests these should both be 124 (Figure 4). Why the difference?

Hi, I implemented a Ryu algorithm using C#.

Your algorithm is so good!

https://github.com/Dogwei/RyuCsharp

This project is translated from c implementation.

I just change the syntax and i don't change the logic.

Trying to just extract the mantissa, exponent, and sign

I'm trying to extract

unsigned long long mantissa;
short exponent;
bool negative;

from a double. I already wrote some code before I heard about the Ryū so if anyone reading this wants a good laugh here it is:

    if(f < 0)
    {
        f = std::abs(f);
        neg_ = true;
    }
    else
    {
        neg_ = false;
    }
    static exponent_type const bias =
        static_cast<exponent_type>(std::floor(
            std::log10((std::numeric_limits<
                mantissa_type>::max)()))) - 2;
    exp_ = static_cast<exponent_type>(
        std::floor(std::log10(f)));
    auto mant = f / std::pow(10, exp_);
    mant_ = static_cast<mantissa_type>(
        mant * std::pow(10, bias));
    exp_ -= bias;

It is considerably simpler than Ryū and correspondingly, less correct.

I'm trying to adapt the source code in this project to achieve my ends, but it seems to be very oriented towards printing rather than simple extraction of the decimal mantissa and exponent, am I missing something? What's the shortest path to getting these numbers?

	static inline uint64_t umul128(const uint64_t a, const uint64_t b, uint64_t* const productHi) {
	// The casts here help MSVC to avoid calls to the __allmul library function.
	const uint32_t aLo = (uint32_t)a;
	const uint32_t aHi = (uint32_t)(a >> 32);
	const uint32_t bLo = (uint32_t)b;
	const uint32_t bHi = (uint32_t)(b >> 32);

	const uint64_t b00 = (uint64_t)aLo * bLo;
	const uint64_t b01 = (uint64_t)aLo * bHi;
	const uint64_t b10 = (uint64_t)aHi * bLo;
	const uint64_t b11 = (uint64_t)aHi * bHi;

	const uint32_t b00Lo = (uint32_t)b00;
	const uint32_t b00Hi = (uint32_t)(b00 >> 32);

	const uint64_t mid1 = b10 + b00Hi;
	const uint32_t mid1Lo = (uint32_t)(mid1);
	const uint32_t mid1Hi = (uint32_t)(mid1 >> 32);

	const uint64_t mid2 = b01 + mid1Lo;
	const uint32_t mid2Lo = (uint32_t)(mid2);
	const uint32_t mid2Hi = (uint32_t)(mid2 >> 32);

	const uint64_t pHi = b11 + mid1Hi + mid2Hi;
	const uint64_t pLo = ((uint64_t)mid2Lo << 32) + b00Lo;

	*productHi = pHi;
	return pLo;
	}

	{ 0u, 0u, 64000000000u },
	{ 0u, 0u, 0u },
	{ 0u, 0u, 0u },
	{ 0u, 0u, 0u },
	{ 0u, 0u, 0u },
	{ 0u, 0u, 0u },
	{ 0u, 0u, 0u },
	{ 0u, 0u, 0u },
	{ 175162308u, 0u, 0u },

	const uint32_t p = POW10_OFFSET_2[idx] + i;
	if (p >= POW10_OFFSET_2[idx + 1]) {
	// If the remaining digits are all 0, then we might as well use memset.
	// No rounding required in this case.
	const int32_t fill = precision - 9 * i;
	memset(result + index, '0', fill);
	index += fill;
	break;
	}
	// Temporary: j is usually around 128, and by shifting a bit, we push it to 128 or above, which is
	// a slightly faster code path in mulShift_mod1e9. Instead, we can just increase the multipliers.
	uint32_t digits = mulShift_mod1e9(m2 << 8, POW10_SPLIT_2[p], j + 8);

	#define DOUBLE_POW5_INV_BITCOUNT 122
	#define DOUBLE_POW5_BITCOUNT 121

ulfjack / ryu Goto Github PK

ryu's People

Contributors

Stargazers

Watchers

Forkers

ryu's Issues

Recommend Projects

Recommend Topics

Recommend Org