ridiculousfish / libdivide Goto Github PK

View Code? Open in Web Editor NEW

1.1K 1.1K 74.0 1.97 MB

Official git repository for libdivide: optimized integer division

Home Page: http://libdivide.com

License: Other

C++ 65.42% CMake 0.07% Python 0.02% C 34.49%

libdivide's People

Contributors

Stargazers

Watchers

libdivide's Issues

Modulo support

Hi,

Is there a chance to get support for the modulo operation in libdivide, without having to do x - (x / d) * d manually? Ideally, of course, in the same operation as div if it's possible to get it cheap.

On a quick test, it looks like GCC can generate code with only one mul for div+mod of the same divisor, so it should. be possible. (I tested with the rather arbitrary value of 17; div-only was six instructions including a mul, and div+mod was twelve in total, with no additional mul.)

Status update on +-1 branchfree dividers

Currently the branchfree divider cannot be +1 and -1 (our current implementation calls exit(-1) if the user tries to generate a branchfree divider for +-1). When the branchfree divider was first implemented 3 years ago we did not find a way to support +-1 dividers and our branchfree algorithm currently returns 0 if the divider is 1.

But the recent pull request #49 from @yb303 contains proof of concept code that shows that it is possible to support +-1 branchfree dividers. The basic idea is to add the numerator to the result if the divider is 1 (in this case the result is 0).

In order to efficiently and cleanly implement +-1 branchfree support the best solution seems to be to add a new int8_t one bitmask to the branchfree divider structs that is set to -1 if the divider is 1 and to 0 otherwise. Using this new one bitmask it is possible to conditionally add the numerator to the result using just 3 additional assembly instructions on x86.

// Does not support 1 divider
uint64_t libdivide_u64_branchfree_do(uint64_t numer, 
                                     const struct libdivide_u64_branchfree_t *denom) {
    uint64_t q = libdivide_mullhi_u64(denom->magic, numer);
    uint64_t t = ((numer - q) >> 1) + q;
    return t >> denom->more;
}

// Supports 1 divider
uint64_t libdivide_u64_branchfree_do(uint64_t numer, 
                                     const struct libdivide_u64_branchfree_t *denom) {
    uint64_t q = libdivide_mullhi_u64(denom->magic, numer);
    uint64_t t = ((numer - q) >> 1) + q + (numer & denom->one);
    return t >> denom->more;
}

If the algorithm is implemented as shown above there is a performance slowdown of about 20%. If the algorithm is implemented without adding a new variable to the branchfree structs there is a performance slowdown of 35 to 40% (and the implementation must use hacks because there are no bits left in the more variable). Adding a new field to the branchfree structs is not ideal from a clean code perspective since we cast branchfree dividers to ordinary dividers in libdivide.h.

Personally I am using the u64 branchfree divider in two of my projects and I don't need +-1 support. Hence for my personal use case not supporting +-1 is actually a feature since my code will run faster without it. Hence for the time being I will not implement +-1 branchfree divider support.

If you would like to use the branchfree divider in your project and you need +-1 support, please let us know and leave a comment below. You can also leave a comment if you don't care about +-1 branchfree divider support and prefer better performance.

Question: Adding this to Python, Go, Nim, Crystal etc.

Would adding this to Python be a good idea? That way integer divmod would be faster (unless they already did it with Python 3)
Also, adding LibDivide to Go, Nim, Crystal and other languages would make the code base better.
Considering Go is supposed to be the fast code replacement of java, Nim can compile to C/C++ and JS, and Crystal can compile to C.

benchmark between this and GMP

It would be good to do a comparison between this and GCC/GMP's division and modulo function.
Best way to run a test would be to pick a list of 128, 256, 512, 1024, 2048, 4096 and 8192 bit integers divided by 32, 64, 128 and 256 bit integers, pick 16~256 pairs for each set.

Libdivide is using exit instead of abort on errors.

#define LIBDIVIDE_ERROR(msg) \
    do { \
        fprintf(stderr, "libdivide.h:%d: %s(): Error: %s\n", \
            __LINE__, LIBDIVIDE_FUNCTION, msg); \
        exit(-1); \
    } while (0)

The usage of exit is wrong for two reasons:

It does not quickly terminate the program because it calls global destructors (in contrast to _exit).
Core won't be dumped.

The issue has been found here: ClickHouse/ClickHouse#12119

On my mac, system divide gives the fastest result

I compiled and ran the benchmark with AVX2 on my mac and got a strange result.

                 === libdivide s32 benchmark ===

     #   system  scalar  scl_bf  vector  vec_bf   gener   algo
     1   3.241  14.439  -1.000   4.723   0.000   30.836   0
    -1   3.241  14.438  18.811   4.721   7.964   31.028   0
     2   3.241  14.449  18.800   4.723   7.966   30.836   0
    -2   3.243  14.523  18.819   4.721   7.968   31.028   0
     3   3.243  17.937  18.830   6.823   7.970   46.032   2
    -3   3.243  17.797  18.851   6.823   7.968   48.355   2
     4   3.241  14.438  18.733   4.721   7.966   30.821   0
    -4   3.243  14.445  18.733   4.721   7.966   31.028   0
     5   3.243  16.423  18.817   5.557   7.966   43.945   1
    -5   3.243  16.341  18.771   5.557   7.966   46.194   1
     6   3.243  17.783  18.735   6.823   7.968   46.017   2
    -6   3.243  17.867  18.735   6.823   7.966   48.355   2
     7   3.243  17.802  18.817   6.823   7.972   46.032   2
    -7   3.243  17.882  18.733   6.821   7.966   48.369   2
     8   3.243  14.483  18.815   4.721   7.966   30.836   0
    -8   3.243  14.445  18.828   4.723   7.968   31.043   0
     9   3.243  16.347  18.794   5.555   7.966   44.004   1
    -9   3.243  16.427  18.737   5.557   7.970   46.194   1
    10   3.243  16.345  18.798   5.555   7.966   43.931   1
   -10   3.243  16.349  18.733   5.557   7.966   46.180   1

It turns out that the hardware divide is the fastest one, and libdivide is much slower than the hardware.

I don't quite get it, maybe I did something wrong?

Here is my computer info:

$ uname -a
Darwin CJs-MacBook-Pro.local 18.7.0 Darwin Kernel Version 18.7.0: Tue Aug 20 16:57:14 PDT 2019; root:xnu-4903.271.2~2/RELEASE_X86_64 x86_64 i386 MacBookPro11,4 Darwin
$ sw_vers
ProductName:	Mac OS X
ProductVersion:	10.14.6
BuildVersion:	18G95
$ sysctl -a | rg machdep.cpu
machdep.cpu.max_basic: 13
machdep.cpu.max_ext: 2147483656
machdep.cpu.vendor: GenuineIntel
machdep.cpu.brand_string: Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
machdep.cpu.family: 6
machdep.cpu.model: 70
machdep.cpu.extmodel: 4
machdep.cpu.extfamily: 0
machdep.cpu.stepping: 1
machdep.cpu.feature_bits: 9221959987971750911
machdep.cpu.leaf7_feature_bits: 10155 0
machdep.cpu.leaf7_feature_bits_edx: 2617246720
machdep.cpu.extfeature_bits: 142473169152
machdep.cpu.signature: 263777
machdep.cpu.brand: 0
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 SMEP BMI2 ERMS INVPCID FPU_CSDS MDCLEAR IBRS STIBP L1DF SSBD
machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT RDTSCP TSCI
machdep.cpu.logical_per_package: 16
machdep.cpu.cores_per_package: 8
machdep.cpu.microcode_version: 27
machdep.cpu.processor_flag: 5
machdep.cpu.mwait.linesize_min: 64
machdep.cpu.mwait.linesize_max: 64
machdep.cpu.mwait.extensions: 3
machdep.cpu.mwait.sub_Cstates: 270624
machdep.cpu.thermal.sensor: 1
machdep.cpu.thermal.dynamic_acceleration: 1
machdep.cpu.thermal.invariant_APIC_timer: 1
machdep.cpu.thermal.thresholds: 2
machdep.cpu.thermal.ACNT_MCNT: 1
machdep.cpu.thermal.core_power_limits: 1
machdep.cpu.thermal.fine_grain_clock_mod: 1
machdep.cpu.thermal.package_thermal_intr: 1
machdep.cpu.thermal.hardware_feedback: 0
machdep.cpu.thermal.energy_policy: 1
machdep.cpu.xsave.extended_state: 7 832 832 0
machdep.cpu.xsave.extended_state1: 1 0 0 0
machdep.cpu.arch_perf.version: 3
machdep.cpu.arch_perf.number: 4
machdep.cpu.arch_perf.width: 48
machdep.cpu.arch_perf.events_number: 7
machdep.cpu.arch_perf.events: 0
machdep.cpu.arch_perf.fixed_number: 3
machdep.cpu.arch_perf.fixed_width: 48
machdep.cpu.cache.linesize: 64
machdep.cpu.cache.L2_associativity: 8
machdep.cpu.cache.size: 256
machdep.cpu.tlb.inst.large: 8
machdep.cpu.tlb.data.small: 64
machdep.cpu.tlb.data.small_level1: 64
machdep.cpu.tlb.shared: 1024
machdep.cpu.address_bits.physical: 39
machdep.cpu.address_bits.virtual: 48
machdep.cpu.core_count: 4
machdep.cpu.thread_count: 8
machdep.cpu.tsc_ccc.numerator: 0
machdep.cpu.tsc_ccc.denominator: 0

problems with gcc version (Ubuntu/Linaro 4.4.4-14ubuntu5.1) 4.4.5

i'm having problems with libdivide using the default gcc installed with ubuntu 10.10. everything is fine on gcc 4.7.1. maybe put a note in the readme that a certain version of gcc is needed?

when i run the tester everything passes but when i run the benchmark it prints this a bunch of times:

    50   4.210   0.422   0.422   0.000   0.000   4.974     1
Failure on line 653
Failure on line 654
Failure on line 653
Failure on line 654
Failure on line 653
Failure on line 654

when compiling the benchmark in this version of gcc there are a lot of warnings:

jplaisance@jplaisance:~/source/libdivide$ make benchmark
cc -fstrict-aliasing -W -Wall -g -O3 -msse2 -DLIBDIVIDE_USE_SSE2=1  -lpthread   -o benchmark libdivide_benchmark.c
libdivide_benchmark.c: In function ‘test_many_u64’:
libdivide_benchmark.c:765: warning: format ‘%llu’ expects type ‘long long unsigned int’, but argument 3 has type ‘uint64_t’
libdivide_benchmark.c: In function ‘test_many_s64’:
libdivide_benchmark.c:776: warning: format ‘%lld’ expects type ‘long long int’, but argument 3 has type ‘int64_t’
libdivide_benchmark.c: In function ‘random_data’:
libdivide_benchmark.c:790: warning: ignoring return value of ‘posix_memalign’, declared with attribute warn_unused_result
libdivide_benchmark.c: In function ‘mine_s32_vector’:
libdivide_benchmark.c:239: warning: dereferencing pointer ‘comps’ does break strict-aliasing rules
libdivide_benchmark.c:238: note: initialized from here
libdivide_benchmark.c:239: warning: dereferencing pointer ‘({anonymous})’ does break strict-aliasing rules
libdivide_benchmark.c:239: note: initialized from here
libdivide_benchmark.c:239: warning: dereferencing pointer ‘({anonymous})’ does break strict-aliasing rules
libdivide_benchmark.c:239: note: initialized from here
libdivide_benchmark.c:239: warning: dereferencing pointer ‘({anonymous})’ does break strict-aliasing rules
libdivide_benchmark.c:239: note: initialized from here
libdivide_benchmark.c: In function ‘mine_s64_vector’:
libdivide_benchmark.c:489: warning: dereferencing pointer ‘comps’ does break strict-aliasing rules
libdivide_benchmark.c:488: note: initialized from here
libdivide_benchmark.c:489: warning: dereferencing pointer ‘({anonymous})’ does break strict-aliasing rules
libdivide_benchmark.c:489: note: initialized from here
libdivide_benchmark.c: In function ‘mine_u32_vector_unswitched’:
libdivide_benchmark.c:155: warning: dereferencing pointer ‘comps’ does break strict-aliasing rules
libdivide_benchmark.c:154: note: initialized from here
libdivide_benchmark.c:155: warning: dereferencing pointer ‘({anonymous})’ does break strict-aliasing rules
libdivide_benchmark.c:155: note: initialized from here
libdivide_benchmark.c:155: warning: dereferencing pointer ‘({anonymous})’ does break strict-aliasing rules
libdivide_benchmark.c:155: note: initialized from here
libdivide_benchmark.c:155: warning: dereferencing pointer ‘({anonymous})’ does break strict-aliasing rules
libdivide_benchmark.c:155: note: initialized from here
libdivide_benchmark.c: In function ‘mine_u32_vector’:
libdivide_benchmark.c:127: warning: dereferencing pointer ‘comps’ does break strict-aliasing rules
libdivide_benchmark.c:126: note: initialized from here
libdivide_benchmark.c:127: warning: dereferencing pointer ‘({anonymous})’ does break strict-aliasing rules
libdivide_benchmark.c:127: note: initialized from here
libdivide_benchmark.c:127: warning: dereferencing pointer ‘({anonymous})’ does break strict-aliasing rules
libdivide_benchmark.c:127: note: initialized from here
libdivide_benchmark.c:127: warning: dereferencing pointer ‘({anonymous})’ does break strict-aliasing rules
libdivide_benchmark.c:127: note: initialized from here
libdivide_benchmark.c: In function ‘mine_s32_vector_unswitched’:
libdivide_benchmark.c:281: warning: dereferencing pointer ‘comps’ does break strict-aliasing rules
libdivide_benchmark.c:280: note: initialized from here
libdivide_benchmark.c:281: warning: dereferencing pointer ‘({anonymous})’ does break strict-aliasing rules
libdivide_benchmark.c:281: note: initialized from here
libdivide_benchmark.c:281: warning: dereferencing pointer ‘({anonymous})’ does break strict-aliasing rules
libdivide_benchmark.c:281: note: initialized from here
libdivide_benchmark.c:281: warning: dereferencing pointer ‘({anonymous})’ does break strict-aliasing rules
libdivide_benchmark.c:281: note: initialized from here
libdivide_benchmark.c: In function ‘mine_u64_vector’:
libdivide_benchmark.c:435: warning: dereferencing pointer ‘comps’ does break strict-aliasing rules
libdivide_benchmark.c:434: note: initialized from here
libdivide_benchmark.c:435: warning: dereferencing pointer ‘({anonymous})’ does break strict-aliasing rules
libdivide_benchmark.c:435: note: initialized from here
libdivide_benchmark.c: In function ‘mine_u64_vector_unswitched’:
libdivide_benchmark.c:422: warning: dereferencing pointer ‘comps’ does break strict-aliasing rules
libdivide_benchmark.c:421: note: initialized from here
libdivide_benchmark.c:422: warning: dereferencing pointer ‘({anonymous})’ does break strict-aliasing rules
libdivide_benchmark.c:422: note: initialized from here
libdivide_benchmark.c: In function ‘mine_s64_vector_unswitched’:
libdivide_benchmark.c:532: warning: dereferencing pointer ‘comps’ does break strict-aliasing rules
libdivide_benchmark.c:531: note: initialized from here
libdivide_benchmark.c:532: warning: dereferencing pointer ‘({anonymous})’ does break strict-aliasing rules
libdivide_benchmark.c:532: note: initialized from here

Improvement suggestion: packed divider structs

My primecount project uses a huge std::vector of branchfree u64 dividers.

struct libdivide_u64_branchfree_t {
    int64_t magic;
    uint8_t more;
};

Unfortunately GCC (macOS, GCC 6) padds the libdivide_u64_branchfree_t struct with 7 additional bytes to a total of 16 bytes. This increases the memory usage of my algorithm by a very significant amount.

Using attribute__((packed)) prevents GCC from padding the libdivide_u64_branchfree_t struct and reduces the memory usage of my algorithm by 30 percent and improves performance by 10% on x86-64 because the inner loop is more memory/cache efficient.

I will use attribute__((packed)) in primecount's libdivide.h and I wonder if we should use it the official ridiculousfish/libdivide.h as well?

Potential issue:

All modern CPU architectures (x86, x64, PPC64, AArch64) support fast unaligned memory access but some old CPU architectures e.g. IA-64 & sparc64 will run slower using attribute__((packed)) if an array or vector of libdivide divisors is used.

Here is an article from 2006 comparing the assembly code generation for packed and unpacked structs.

Personally, I would enable packed divider structs for all CPU architectures as I guess the number of users that will use an array or vector of libdivide divisors on ancient CPU architectures is very small.

Implementation considerations

attribute__((packed)) is supported by GCC & Clang but it is non-standard C/C++. MSVC supports #pragma pack which is also supported by GCC (since version 4.0 at least) and most likely Clang as well.

So I am in favour of using #pragma pack.

What's your opinion on packed divider structs fish?
Can I go ahead and implement it (and then create a pull request)?

Save 1 instruction in libdivide_u64_branchfree_do()

I guess you could decrease the instruction count by 1 in libdivide_u64_branchfree_do() by using:

uint64_t libdivide_u64_branchfree_do(uint64_t numer, const struct libdivide_u64_branchfree_t *denom) {
    // same as alg 2
    uint64_t q = libdivide__mullhi_u64(denom->magic, numer);
    uint64_t t = ((numer - q) >> 1) + q;
    return t >> denom->more;
}

instead of:

uint64_t libdivide_u64_branchfree_do(uint64_t numer, const struct libdivide_u64_branchfree_t *denom) {
    // same as alg 2
    uint64_t q = libdivide__mullhi_u64(denom->magic, numer);
    uint64_t t = ((numer - q) >> 1) + q;
    return t >> (denom->more & LIBDIVIDE_64_SHIFT_MASK);
}

This would require a different initialization algorithm for libdivide_u64_branchfree_do(). I am not sure if it is worth it.

upgrade int types

Currently gcc is not happy with dividing (some type) by divider.
operator/ and co need more overloads or something more concise.

Use umulh()/mulh() on MSVC

The functions libdivide__mullhi_u64/libdivide__mullhi_s64 can be implemented much faster on MSVC with __umulh()/__mulh() intrinsics. They have the exact same signature so it's a very easy modification and should be supported on all MSVC versions.

Question: why not support SSE2, AVX, and AVX512 all in the same time? (or am I missing something?)

Even with AVX512 support users may still want to use int128.
I think in the current state older code moving to a newer box would force one to widen their ints

Incorrect NEON function signature

static LIBDIVIDE_INLINE uint16x8_t libdivide_s16_branchfree_do_vec128( uint16x8_t numers, const struct libdivide_s16_branchfree_t *denom);

Should refer to int16x8_t

Faster divlu

Hello!

I've tried to run the benchmark from Narrowing Division Insight and Benchmarks and these are the numbers I got for "libdiv mul":

M2 7.1
i5-8500 21.4

When I changed this line from

numhi |= (numlo >> (-shift & 63)) & (-(int64_t)shift >> 63);

if (shift != 0) numhi |= numlo >> (64 - shift);

I got the following numbers:

M2 7.0
i5-8500 20.4

I thought you might find it interesting and perhaps update https://github.com/ridiculousfish/libdivide/blob/master/doc/divlu.c, if you see similar speed-up.

PS: And thanks for writing libdivide!

Question: Why is this not part of GCC/Clang/...

This is a ridiculously dumb question, but why are the optimizations employed in libdivide not used by GCC/Clang/... by default (e.g. for -O2)?

An additional question: Did you try to ask GCC/Clang/... upstreams to incorporate the optimizations you're using? If yes, what were the reactions and what are the outcomes?

libdivide.h:1691:59: error: ‘numers’ may be used uninitialized in this function

Hello,
when I try to build libdivide with gcc 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04) under Ubuntu 20.04, I get the following error:

...
[ 30%] Building CXX object CMakeFiles/benchmark.dir/test/benchmark.cpp.o
In file included from /home/hb/proj/libdivide/test/benchmark.h:24,
                 from /home/hb/proj/libdivide/test/benchmark.cpp:16:
/home/hb/proj/libdivide/libdivide.h: In function ‘uint64_t sum_quotients_vec(const random_numerators<IntT>&, const Divisor&) [with IntT = short unsigned int; Divisor = libdivide::divider<short unsigned int, libdivide::BRANCHFULL>]’:
/home/hb/proj/libdivide/libdivide.h:1691:59: error: ‘numers’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
 1691 |         pTarget[loop] = libdivide_##Algo##_do(pSource[loop], denom); \
      |                                                           ^
...

I just run cmake ., then make, and then get the error. (Running make -j instead of just make gives the same error). Am I doing anything wrong? What's the easiest way to fix or work around? I'm just trying to get a working benchmark binary.
I tried building current master branch...

commit b785c76ea6cf3b85108324f12cd4d16f0e75563e (HEAD -> master, origin/master, origin/HEAD)
Author: Kim Walisch <[email protected]>

    Delete CHANGELOG.md

... as well as release 5.0:

commit b322221677351ebb11f0a42fe9a9a2794da5bfe5 (tag: 5.0)
Author: ridiculousfish <[email protected]>

    Stop passing multiple targets to cmake build

with the same error

Regression: ptrdiff_t/size_t on macOS don't work anymore

With commit 623f9be support was added to make size_t work on some platforms, notably macOS.

This broke again with this commit: 10ba982

This is a regression in a codebase I'm working on when compiling on macOS, because it causes a compiler error in the following code:

libdivide::divider<std::ptrdiff_t> v{42};

This is because int64_t is aliased to long long on macOS, while ptrdiff_t is aliased to long (and size_t to unsigned long, while uint64_t is aliased to unsigned long long). And while both long and long long behave identically in that case, they are not the same type in the C++ type system, so there is no correct template that can be selected by the compiler for the libdivide::divider class.

On 64bit Linux this specific line of code isn't an issue since both ptrdiff_t and int64_t are aliased to long -- but there it is impossible to create a libdivide::divider<long long> object instead for the very same reasons.

Similarly, on Windows x64 both long and int are 32bit types, but long long is the corresponding 64bit type -- so either libdivide::divider<long> or libdivide::divider<int> will not work there, depending on which one is aliased to int32_t.

Is there a reason why the size-based dispatching introduced in the initial commit didn't work properly on AVR so that it had to be dropped for that?

error: a function declaration without a prototype is deprecated in all versions of C

This error is produced with recent clang (>= 15).
Actually this is just a warning, but since -Werror,-Wstrict-prototype is set, it becomes an error:

FAILED: CMakeFiles/test_c99.dir/test/test_c99.c.o 
/usr/local/libexec/ccache/cc -DLIBDIVIDE_SSE2 -I/wrkdirs/usr/ports/math/libdivide/work/libdivide-5.0 -O2 -pipe  -fstack-protector-strong -fno-strict-aliasing -O2 -pipe  -fstack-protector-strong -fno-strict-aliasing  -DNDEBUG -Wall -Wextra -pedantic -Werror -fno-lax-vector-conversions -march=native -fno-vectorize -std=gnu99 -MD -MT CMakeFiles/test_c99.dir/test/test_c99.c.o -MF CMakeFiles/test_c99.dir/test/test_c99.c.o.d -o CMakeFiles/test_c99.dir/test/test_c99.c.o -c /wrkdirs/usr/ports/math/libdivide/work/libdivide-5.0/test/test_c99.c
/wrkdirs/usr/ports/math/libdivide/work/libdivide-5.0/test/test_c99.c:51:14: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
void test_u16() {
             ^
              void
/wrkdirs/usr/ports/math/libdivide/work/libdivide-5.0/test/test_c99.c:62:14: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
void test_s16() {
             ^
              void
/wrkdirs/usr/ports/math/libdivide/work/libdivide-5.0/test/test_c99.c:77:14: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
void test_u32() {
             ^
              void
/wrkdirs/usr/ports/math/libdivide/work/libdivide-5.0/test/test_c99.c:84:14: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
void test_s32() {
             ^
              void
/wrkdirs/usr/ports/math/libdivide/work/libdivide-5.0/test/test_c99.c:91:14: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
void test_u64() {
             ^
              void
/wrkdirs/usr/ports/math/libdivide/work/libdivide-5.0/test/test_c99.c:98:14: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
void test_s64() {
             ^
              void
6 errors generated.

Unnecessary static linkage

Some functions in libdivide.h are declared static but not inline. On the AVR platform, these are not optimized out by the linker even if unused.

Greater magic/shift value than with gcc

Probably doesn't make much of a difference for most uses - and I don't know if it's a libdivide goal to minimize returned magic/shift values.

I noticed that sometimes libdivide returns greater values than what gcc comes up with (when using a static divisor).

Example:

divisor: 659 (64 bit unsigned)

source        magic                     shift
---------------------------------------------
libdivide     14331916488223505960          9
gcc            1791489561027938245          6

Undocumented benchmark column "gener"

There's a table near the end of README that explains meaning of numbers in the benchmark output.
The table fails to explain what the gener column means.

Division by 0 causes non-AVR mcus hang

Division by 0 on non-AVR mcus causes libdivide call abort() which ends up in a noreturn

noisymime/speeduino#1162

Idea: libdivide::divider_branchfree<T>

The current API for using the branchfree divider in C++ is a little too cumbersome in my opinion:

libdivide::divider<uint64_t, libdivide::BRANCHFREE> my_divider;

I think that:

libdivide::divider_branchfree<uint64_t> my_divider;

would be nicer. Using C++11 this could be implemented using:

#if __cplusplus >= 201103L
    template <typename T>
    using divider_branchfree = divider<T, BRANCHFREE>;
#endif

It would be nice to find a solution for C++98 as well.

Bug in libdivide_128_div_64_to_64()

Bug report by Stefan Kanthak:

The bug needs a divisor 'v' with the most significant bit set, i.e. a value from
0x8000000000000000ULL to 0xFFFFFFFFFFFFFFFFULL.
It should be reproducible with a dividend pair 'u1' and 'u0' of 0ULL and
~0ULL: this combination lets 'un64' be ~0ULL instead of 0ULL

Here is a link to the full blog post:
https://skanthak.homepage.t-online.de/division.html

I have written a standalone test program for libdivide's libdivide_128_div_64_to_64() function that can be compiled using GCC or Clang on a 64-bit CPU architecture. The test program calculates many combinations of x1/x2 where x1 is of type __uint128_t and x2 is of type uint64_t. The test program always calculates x1/x2 twice, first using the builtin integer division and second using libdivide's libdivide_128_div_64_to_64() function and then compares both results for equality.

The test program is able to reproduce the reported bug and I can also confirm that the proposed fix below fixes the bug:

// Error
    } else {
        // Avoid undefined behavior
        un64 = u1 | u0;
        un10 = u0;
    }

// Fix
    } else {
        // Avoid undefined behavior
        un64 = u1;
        un10 = u0;
    }

And here is my test program:

// This is a test program for the function libdivide_128_div_64_to_64
// which devides a 128-bit number by a 64-bit number where
// the result must fit into a 64-bit word.
// The test program tries to cover a wide variety of test cases
// e.g. small dividend / small divisor,
//      large dividend / small divisor,
//      small dividend / large divisor,
//      large dividend / large divisor,
//      ...
// 

#include <iostream>
#include <random>
#include <limits>
#include <ostream>
#include <string>

using namespace std;

// Code taken from Hacker's Delight:
// http://www.hackersdelight.org/HDcode/divlu.c.
// License permits inclusion here per:
// http://www.hackersdelight.org/permissions.htm

static uint64_t libdivide_128_div_64_to_64(uint64_t u1, uint64_t u0, uint64_t v, uint64_t *r) {
    const uint64_t b = (1ULL << 32); // Number base (16 bits)
    uint64_t un1, un0; // Norm. dividend LSD's
    uint64_t vn1, vn0; // Norm. divisor digits
    uint64_t q1, q0; // Quotient digits
    uint64_t un64, un21, un10; // Dividend digit pairs
    uint64_t rhat; // A remainder
    int32_t s; // Shift amount for norm

    // If overflow, set rem. to an impossible value,
    // and return the largest possible quotient
    if (u1 >= v) {
        if (r != NULL)
            *r = (uint64_t) -1;
        return (uint64_t) -1;
    }

    // count leading zeros
    s = __builtin_clzll(v);
    if (s > 0) {
        // Normalize divisor
        v = v << s;
        un64 = (u1 << s) | ((u0 >> (64 - s)) & (-s >> 31));
        un10 = u0 << s; // Shift dividend left
    } else {
        // Avoid undefined behavior
        un64 = u1 | u0;
        un10 = u0;
    }

    // Break divisor up into two 32-bit digits
    vn1 = v >> 32;
    vn0 = v & 0xFFFFFFFF;

    // Break right half of dividend into two digits
    un1 = un10 >> 32;
    un0 = un10 & 0xFFFFFFFF;

    // Compute the first quotient digit, q1
    q1 = un64 / vn1;
    rhat = un64 - q1 * vn1;

    while (q1 >= b || q1 * vn0 > b * rhat + un1) {
        q1 = q1 - 1;
        rhat = rhat + vn1;
        if (rhat >= b)
            break;
    }

     // Multiply and subtract
    un21 = un64 * b + un1 - q1 * v;

    // Compute the second quotient digit
    q0 = un21 / vn1;
    rhat = un21 - q0 * vn1;

    while (q0 >= b || q0 * vn0 > b * rhat + un0) {
        q0 = q0 - 1;
        rhat = rhat + vn1;
        if (rhat >= b)
            break;
    }

    // If remainder is wanted, return it
    if (r != NULL)
        *r = (un21 * b + un0 - q0 * v) >> s;

    return q1 * b + q0;
}

inline std::ostream& operator<<(std::ostream& stream, __uint128_t n)
{
  std::string str;

  while (n > 0)
  {
    str += '0' + n % 10;
    n /= 10;
  }

  if (str.empty())
    str = "0";

  stream << std::string(str.rbegin(), str.rend());
  return stream;
}

int main(int argc, char** argv)
{
    random_device rd16;
    mt19937 gen16(rd16());
    uniform_int_distribution<uint16_t> dist16(1, std::numeric_limits<uint16_t>::max());

    int iters = 100000;

    if (argc > 1)
        iters = atoi(argv[1]);

    // Test: random dividends < 2^64 / small divisors    
    for (int i = 0; i < iters; i++)
    {
        uint64_t x1 = 1;

        for (int i = 0; i < 4; i++)
        {
            uint16_t a = dist16(gen16);
            uint64_t x2 = 1;
            x1 *= a;
            x1 += 1;

            for (int j = 1; j < 17; j++)
            {
                x2 = j;

                auto res1 = x1 / x2;
                auto res2 = libdivide_128_div_64_to_64(0, x1, x2, nullptr);

                if ((uint64_t) res1 != (uint64_t) res2)
                {
                    std::cerr << x1 << " / " << x2 << " = " << res1 << " correct" << std::endl;
                    std::cerr << x1 << " / " << x2 << " = " << res2 << " error" << std::endl;
                    return 1;
                }
            }
        }
    }

    // Test: 2^64-1 / random divisors < 2^64    
    for (int i = 0; i < iters; i++)
    {
        uint64_t x1 = ~0ull;
        uint64_t x2 = 1;

        for (int i = 0; i < 4; i++)
        {
            uint16_t a = dist16(gen16);
            x2 *= a;
            x2 += 1;

            auto res1 = x1 / x2;
            auto res2 = libdivide_128_div_64_to_64(0, x1, x2, nullptr);

            if ((uint64_t) res1 != (uint64_t) res2)
            {
                std::cerr << x1 << " / " << x2 << " = " << res1 << " correct" << std::endl;
                std::cerr << x1 << " / " << x2 << " = " << res2 << " error" << std::endl;
                return 1;
            }
        }
    }

    // Test: 2^90 / random divisors > 2^46 && < 2^64    
    for (int i = 0; i < iters; i++)
    {
        __uint128_t x1 = ((__uint128_t) 1) << 90;
        uint64_t x2 = 1;

        while (x2 <= (1ull << 46))
        {
            uint16_t a = dist16(gen16);
            x2 *= a;
            x2 += 1;
        }

        uint64_t res1 = x1 / x2;
        uint64_t res2 = libdivide_128_div_64_to_64((uint64_t)(x1 >> 64), (uint64_t)(x1 & ~0ull), x2, nullptr);

        if ((uint64_t) res1 != (uint64_t) res2)
        {
            std::cerr << x1 << " / " << x2 << " = " << res1 << " correct" << std::endl;
            std::cerr << x1 << " / " << x2 << " = " << res2 << " error" << std::endl;
            return 1;
        }
    }

    // Test: random dividends > 2^100 / 2^64-1    
    for (int i = 0; i < iters; i++)
    {
        __uint128_t x1 = 1;
        uint64_t x2 = ~0ull;

        while (x1 <= (((__uint128_t) 1) << 100))
        {
            uint16_t a = dist16(gen16);
            x1 *= a;
            x1 += 1;
        }

        uint64_t res1 = x1 / x2;
        uint64_t res2 = libdivide_128_div_64_to_64((uint64_t)(x1 >> 64), (uint64_t)(x1 & ~0ull), x2, nullptr);

        if ((uint64_t) res1 != (uint64_t) res2)
        {
            std::cerr << x1 << " / " << x2 << " = " << res1 << " correct" << std::endl;
            std::cerr << x1 << " / " << x2 << " = " << res2 << " error" << std::endl;
            return 1;
        }
    }

    // Test: 2^(0..127) / 2^64-1  
    for (int i = 0; i < 127; i++)
    {
        __uint128_t x1 = ((__uint128_t) 1) << i;
        uint64_t x2 = ~0ull;

        uint64_t res1 = x1 / x2;
        uint64_t res2 = libdivide_128_div_64_to_64((uint64_t)(x1 >> 64), (uint64_t)(x1 & ~0ull), x2, nullptr);

        if ((uint64_t) res1 != (uint64_t) res2)
        {
            std::cerr << x1 << " / " << x2 << " = " << res1 << " correct" << std::endl;
            std::cerr << x1 << " / " << x2 << " = " << res2 << " error" << std::endl;
            return 1;
        }
    }

    // Test: 2^(0..127) / 2^63 
    for (int i = 0; i < 127; i++)
    {
        __uint128_t x1 = ((__uint128_t) 1) << i;
        uint64_t x2 = 1ull << 63;

        uint64_t res1 = x1 / x2;
        uint64_t res2 = libdivide_128_div_64_to_64((uint64_t)(x1 >> 64), (uint64_t)(x1 & ~0ull), x2, nullptr);

        if ((uint64_t) res1 != (uint64_t) res2)
        {
            std::cerr << x1 << " / " << x2 << " = " << res1 << " correct" << std::endl;
            std::cerr << x1 << " / " << x2 << " = " << res2 << " error" << std::endl;
            return 1;
        }
    }

    {
        // Test: 2^64-1 / 2^64-1
        uint64_t x1 = ~0ull;
        uint64_t x2 = ~0ull;

        uint64_t res1 = x1 / x2;
        uint64_t res2 = libdivide_128_div_64_to_64(0, x1, x2, nullptr);

        if ((uint64_t) res1 != (uint64_t) res2)
        {
            std::cerr << x1 << " / " << x2 << " = " << res1 << " correct" << std::endl;
            std::cerr << x1 << " / " << x2 << " = " << res2 << " error" << std::endl;
            return 1;
        }    
    }

    {
        // Test: 2^64-1 / 2^63
        uint64_t x1 = ~0ull;
        uint64_t x2 = 0x8000000000000000ULL;

        uint64_t res1 = x1 / x2;
        uint64_t res2 = libdivide_128_div_64_to_64(0, x1, x2, nullptr);

        if ((uint64_t) res1 != (uint64_t) res2)
        {
            std::cerr << x1 << " / " << x2 << " = " << res1 << " correct" << std::endl;
            std::cerr << x1 << " / " << x2 << " = " << res2 << " error" << std::endl;
            return 1;
        }
    }

    return 0;
}

The above test program can be compiled and run using:

g++ -O2 test.cpp -o test
./test
18446744073709551615 / 10012018566958790731 = 1 correct
18446744073709551615 / 10012018566958790731 = 15540644647674054952 error

primecount fails its test suite (MSVC, x86)

primecount with libdivide (branchfree) fails its test suite when compiled using MSVC x86 (tested with MSVC 2010, MSVC 2015).

The test suite seems to get stuck in an infinite loop:

> primecount.exe --test
Testing phi(x, a) 100%
Testing pi_legendre(x) 100%
Testing pi_meissel(x) 100%
Testing pi_lehmer(x) 100%
Testing pi_lmo1(x) 100%
Testing pi_lmo2(x) 100%
Testing pi_lmo3(x) 100%
Testing pi_lmo4(x) 100%
Testing pi_lmo5(x) 100%
Testing pi_lmo_parallel1(x) 100%
Testing pi_lmo_parallel2(x) 100%
Testing pi_lmo_parallel3(x) 100%
Testing pi_deleglise_rivat1(x)^C

Observations:

The bug comes from libdivide as the test suite passes fine without libdivide.
The bug does not come from the `Use __umulh(), __mulh() MSVC intrinsics on WIN64``pull request as the bug is present even using the old code.
The bug is not present when compiling using MSVC 2010 x64.

Conclusion:

There is a bug in the u64 branchfree algorithm when compiling using MSVC 32-bit (x86).

Consider automatically defining LIBDIVIDE_SSE2 et al

For Visual Studio:

/arch compiler option sets target processor and enables compiler to use corresponding instructions
/arch:IA32, /arch:SSE, /arch:SSE2, /arch:AVX, /arch:AVX2, /arch:AVX512 options available
x86 is by default /arch:SSE2
x64 is always at least /arch:SSE2, it cannot be /arch:SSE or /arch:IA32
The value passed to arch is exposed as _M_IX86_FP predefined macro (distinguish SSE2), __AVX__, __AVX2__, and others. See Predefined macros

So libdivide could have something like:

#if defined(_MSC_VER) && defined(__AVX2__) 
#define LIBDIVIDE_AVX2

Maybe other compilers also expose compilation against specific instruction set via preprocessor

clang-cl: error LNK2019: unresolved external symbol __udivti3

The clang-cl compiler (Windows version of Clang) supports 128-bit integers but not yet 128-bit division. This causes a linker error when compiling libdivide using clang-cl:

> clang-cl /arch:AVX2 /I. /EHsc /W3 /O2 /DLIBDIVIDE_AVX2 test\tester.cpp
tester-d6bae9.obj : error LNK2019: unresolved external symbol __udivti3 referenced in function "private: void __cdecl DivideTest<unsigned __int64>::test_many<0>(unsigned __int64)" (??$test_many@$0A@@?$DivideTest@_K@@AEAAX_K@Z)
tester.exe : fatal error LNK1120: 1 unresolved externals
clang-cl: error: linker command failed with exit code 1120 (use -v to see invocation)

Appveyor build inconsistent with cmake

The appveyor build adds compiler flags to the cmake build. Can these go into the cmake file, so the local & CI builds produce the same result?

Ubuntu (clang):

      CFLAGS: "-Wall -Wextra -pedantic -Werror -O1 -g -fsanitize=address -fsanitize=undefined -fno-sanitize-recover=all -fno-omit-frame-pointer"
      CXXFLAGS: "-Wall -Wextra -pedantic -Werror -O1 -g -fsanitize=address -fsanitize=undefined -fno-sanitize-recover=all -fno-omit-frame-pointer"

VS 2017 & 2019:

      CFLAGS: "/W3 /WX /DLIBDIVIDE_ASSERTIONS_ON"
      CXXFLAGS: "/W3 /WX /DLIBDIVIDE_ASSERTIONS_ON"

Incorrect result for unsigned divide with branchfree mode when divisor == 1

Self-contained example:

#include "libdivide.h"
#include <iostream>

int main()
{
    {
        libdivide::divider<uint32_t, libdivide::BRANCHFREE> du(1);
        std::cout << "u32 result branchfree: " << (1U / du) << std::endl;

        libdivide::divider<uint32_t> du2(1);
        std::cout << "u32 result branchful: " << (1U / du2) << std::endl;
    }
    {
        libdivide::divider<uint64_t, libdivide::BRANCHFREE> du(1);
        std::cout << "u64 result branchfree: " << (1UL / du) << std::endl;

        libdivide::divider<uint64_t> du2(1);
        std::cout << "u64 result branchful: " << (1UL / du2) << std::endl;
    }

    {
        libdivide::divider<int32_t, libdivide::BRANCHFREE> du(1);
        std::cout << "s32 result branchfree: " << (1 / du) << std::endl;

        libdivide::divider<int32_t> du2(1);
        std::cout << "s32 result branchful: " << (1 / du2) << std::endl;
    }
    {
        libdivide::divider<int64_t, libdivide::BRANCHFREE> du(1);
        std::cout << "u64 result branchfree: " << (1L / du) << std::endl;

        libdivide::divider<int64_t> du2(1);
        std::cout << "u64 result branchful: " << (1L / du2) << std::endl;
    }
}

This should calculate 1 / 1, which should be 1. However, this program prints:

u32 result branchfree: 0
u32 result branchful: 1
u64 result branchfree: 0
u64 result branchful: 1
s32 result branchfree: 1
s32 result branchful: 1
u64 result branchfree: 1
u64 result branchful: 1

This was compiled on Ubuntu 16.04 with:

g++ (Ubuntu 5.4.0-6ubuntu1~16.04.1) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I see the same behavior with -O0 or -O3.

Special support for 63-bit division (unsigned)?

Hi,

I was wondering whether it would make sense to provide special support for 63-bit division. The motivation is that such divisions can, I think, be done a bit faster than 64-bit divisions, and in all cases in which I personally needed fast division, 63 bits were enough.

For example, the current branch-free 64-bit division code is

uint64_t libdivide_u64_branchfree_do(
    uint64_t numer, const struct libdivide_u64_branchfree_t *denom) {
    uint64_t q = libdivide_mullhi_u64(denom->magic, numer);
    uint64_t t = ((numer - q) >> 1) + q;
    return t >> denom->more;
}

On x86-64 the last two lines of that function compile to something like this:

sub    rax,rdi           ; numer - q
shr    rax,1             ; >> 1
add    rax,rdi           ; + q
shrx   rax,rax,r10       ; >> denom->more

These are four 1-cycle instructions that have sequential data dependencies (via rax), so they have a combined latency of 4 cycles.

If both the numerator and denominator were only 63-bits, I think this can be improved to

uint64_t libdivide_u63_branchfree_do(
    uint64_t numer, const struct libdivide_u63_branchfree_t *denom) {
    uint64_t q = libdivide_mullhi_u64(denom->magic, numer);
    return (numer + q) >> denom->more_plus_1;   // more_plus_1 == more + 1
}

Here the last line should compile to something like this, which should take 2 cycles rather than 4.

add    rax,rdi           ; numer + q
shrx   rax,rax,r10       ; >> denom->more_plus_one

As a bonus, division by 1 would work, by setting magic to 0 or 1, and more_plus_one to 0.

On the other hand, there are code maintenance and API clarity drawbacks, which should be weighed against this proposal.

Unnecessary SSE2 check in Cmake

you can use -march=native and following predefined macros inside libdivide.h:

gcc,clang and most compilers:
https://stackoverflow.com/questions/28939652/how-to-detect-sse-sse2-avx-avx2-avx-512-avx-128-fma-kcvi-availability-at-compile
__SSE2_MATH__==1 __SSE2__ ==1
MSVC:
https://msdn.microsoft.com/en-us/library/b0084kay.aspx

_M_IX86_FP ==2

I've added a pull request to demonstrate the idea:
#40

#pragma warning( suppress : 4146 )

I saw your code:

        uint32_t absD = 1U << shift;
        if (more & LIBDIVIDE_NEGATIVE_DIVISOR) {
#if LIBDIVIDE_VC
#pragma warning( suppress : 4146 )
            absD = -absD;
#else
            absD = -absD;
#endif      
        }

I came across the same issue a few years ago and found a shorter solution (which I use here):

uint32_t absD = 1U << shift;
if (more & LIBDIVIDE_NEGATIVE_DIVISOR) {
    absD = absD * (uint32_t)-1;
}

I guess the compiler will generate the same code but you avoid the ugly pragma.

Let's deprecate libdivide's unswitch functionality

I have been running libdivide's benchmark program a lot over the past few days and I realized that libdivide's unswitch divider does not provide any performance speedup compared to the branchfull divider for both GCC and Clang (on x64). This is because GCC & Clang are smart enough to move the branches outside the body of the loop when using the default branchfull divider. The unswitch divider only improves performance by about 20% when using the MSVC compiler.

There are many good reasons for deprecating unswitch:

I don't think anybody uses it (I found no usages using Google).
It has a crazy API (see here).
Unswitch requires crazy hacks (e.g. crash divider)
Unswitch uses a very large amount of code that slows down compile time (e.g. 66% of the C vector API are functions related to unswitch).
By removing unswitch we make it easier to port libdivide to other languages.
We have added the branchfree divider which is similar to unswitch.

Please let me know if you agree or have if you have any objections.

SSE2 is an (unnecessary?) build requirement

I was testing libdivide on ARM where no SSE is available. Nevertheless I was still able to achieve a measurable speedup by using libdivide without any SIMD in my specific case.

I had to do the following changes to make the code compile:

Remove -msse2 -DLIBDIVIDE_USE_SSE2=1 in Makefile
Remove test_four() function in libdivide_test.cpp

All tests run fine afterwards.

My question is now if this SSE dependency is actually uneccessary and if there is any interest in a PR that enables libdivide to build on ARM devices (without any SIMD support)?

CMake cross compilation error

The following bug showed up when trying to package libdivide for Microsoft's vpkg package manager: microsoft/vcpkg#8320.

CMake Error: TRY_RUN() invoked in cross-compiling mode, please set the following cache variables appropriately:
   cpu_x86_EXITCODE (advanced)
   cpu_x86_EXITCODE__TRYRUN_OUTPUT (advanced)
For details see C:/vsts/_work/2/s/buildtrees/libdivide/x64-uwp-rel/TryRunResults.cmake
-- Performing Test cpu_x86 - Failed
-- Performing Test march_native
-- Performing Test march_native - Failed
-- Performing Test avx512
CMake Error: TRY_RUN() invoked in cross-compiling mode, please set the following cache variables appropriately:
   avx512_EXITCODE (advanced)
   avx512_EXITCODE__TRYRUN_OUTPUT (advanced)

We always use try_run even for cross compilation. However on cross compilation try_run must not be used. I came across the same issue in primesieve and fixed the issue there by only running try_run when we are not cross compiling:

if(CMAKE_SYSTEM_NAME AND
   CMAKE_HOST_SYSTEM_NAME AND
   NOT "${CMAKE_SYSTEM_NAME}" STREQUAL "${CMAKE_HOST_SYSTEM_NAME}")
    set(IS_CROSS_COMPILIATION ON)
endif()

if(CMAKE_SYSTEM_PROCESSOR AND
   CMAKE_HOST_SYSTEM_PROCESSOR AND
   NOT "${CMAKE_SYSTEM_PROCESSOR}" STREQUAL "${CMAKE_HOST_SYSTEM_PROCESSOR}")
    set(IS_CROSS_COMPILIATION ON)
endif()

if(NOT IS_CROSS_COMPILIATION)
    # ...
endif()

operator/= to return reference

NEON is not for AArch32

Your CMake scripts check for NEON support but also enable NEON on AArch32 (i.e. 32 bit ARM). Unfortunately you use a bunch of instrinsic functions only available on AArch64, leading to a build failure in case of a 32 bit build.

Please either fix your NEON code so it builds on 32 bit ARM or change the detection scripts so they only detect NEON on AArch64.

As a temporary workaround, I configure with -DLIBDIVIDE_NEON=OFF.

Interesting optimization and other observations

Hello,
I'm preparing to partially port libdivide to D and have some observations about the current implementation:

You use a custom bit field instead of a standard C bit field, there's no benefit, it just makes your code more verbose and error prone.
e.g.

struct libdivide_s32_t {
    uint32_t magic;
    uint8_t shift : 5;
    bool shiftpath : 1;
    bool add : 1;
    int8_t sign : 1;
};

Note that not only is it self documenting, but using a signed type removes the need for casting to make sure you get an arithmetic shift followed by a sign extension!

For the implementation of unsigned division, it is usually better to compute both the shift path and the multiply path and masking out the unneeded result (for pipelined CPU.) If addition is required only then you use a branch. Also, because x86 can quickly sign extend a byte value from high/low byte registers, we store the mask as a -1 int8_t. Using extra bytes is free since the compiler pads the structure to a multiple of the magic type size anyway.

I've implemented a proof of concept in my fork [1], it is not complete though (I'd rather spend time on my port.)
It gives roughly a 2x speed up.
Unswitching now only gives roughly 17% improvement over non-unswitched, so its a debatable optimization (91% faster versus 89% faster than system division.)
I hope I didn't create a bug in the process, since I only checked the benchmark because modifying the tester would require too many changes.
I also added branch hints using __builtin_expect

[1] https://github.com/Safety0ff/libdivide

P.S:
Pass by value would probably have been better than pass-by-pointer. GCC's optimizer seems good enough that it does not have any impact on performance.

missing some headers

$ git diff
diff --git a/libdivide_test.cpp b/libdivide_test.cpp
index 55f7333..bc70da2 100644
--- a/libdivide_test.cpp
+++ b/libdivide_test.cpp
@@ -5,7 +5,8 @@
 #include <stdlib.h>
 #include <time.h>
 #include <iostream>
-
+#include <typeinfo>
+#include <string.h>
 #ifdef LIBDIVIDE_USE_SSE2
 #include <emmintrin.h>
 #endif

Easy fix.

Branchfree divider cannot be 1

libdivide/libdivide.h

Line 742 in ee2a872

LIBDIVIDE_ERROR("branchfree divider must be != 1");

But why? I guess there's no set of shifts and multiples that is equal to dividing by 1? And of course you cannot check if the input is 1 at runtime, as that adds a branch. But then... you're already branching to throw this error?

modular multiplication

Is there a simple way to implement modular multiplication of two 64-bit integers modulus other 64-bit integer with preconditioning? I mean the method like:

uint64_t mulmod_u64(uint64_t a, uint64_t b, const struct libdivide_u64_t *denom) { ... }

which returns (a * b) % denom .

size_t support

Hi,

This might be a naïve question, but I had trouble compiling some C++ code using the admittedly variable width size_t. It compiled on Ubuntu using g++ but failed on Mac OS using clang.

././libdivide.h:1951:63: error: no type named 'divider' in '(anonymous namespace)::libdivide::libdivide_internal::dispatcher<unsigned long, -1>'
    typedef typename libdivide_internal::dispatcher<T, ALGO>::divider div_t;

I changed my code to use uint64_t, which is fine for my use case. Is it possible to support size_t? No urgency here, just curious.

Thanks!
Will

ridiculousfish / libdivide Goto Github PK

libdivide's People

Contributors

Stargazers

Watchers

Forkers

libdivide's Issues

Potential issue:

Implementation considerations

Recommend Projects

Recommend Topics

Recommend Org