
Comments (20)

shibatch avatar shibatch commented on May 5, 2024

For vadd_vi_vi_vi, we can just use the current implementation, since integer operations are fast enough. I suspect that transferring data between a vector register and a general-purpose register would take more time.

typedef __m128i vint;
vint vadd_vi_vi_vi(vint x, vint y) { return _mm_add_epi32(x, y); }

For aarch64, the usual scalar instructions should be used, since they operate on the same vector registers. Since there are no intrinsics for scalar FP or integer instructions, we have to use plain C expressions.

typedef int32_t vint;
vint vadd_vi_vi_vi(vint x, vint y) { return x + y; }

from sleef.

fpetrogalli avatar fpetrogalli commented on May 5, 2024

Sounds like a good idea to me. Are you saying that you are going to use only sleefsimd*.c as source files, and add a helper file where all the generic types and intrinsics are mapped to scalar functions?

typedef double vdouble;
typedef int vint;
// ...
vdouble vadd_vd_vd_vd(vdouble vx, vdouble vy) {return vx+vy;}


shibatch avatar shibatch commented on May 5, 2024

That is okay. Another plan I am considering is as follows.

typedef __m128d vdouble;
vdouble vadd_vd_vd_vd(vdouble vx, vdouble vy) { return _mm_add_sd(vx, vy); }


fpetrogalli avatar fpetrogalli commented on May 5, 2024

I would recommend not using vector registers to compute scalar values, as it could degrade performance.

I think the approach of typedef double vdouble; with sleefsimd*.c is better.
Also, I think we should hold off on removing the scalar versions in sleefdp.c and sleefsp.c, and simply add new symbols to the library built out of sleefsimd*.c.
For those, I propose to use the current Sleef* naming scheme, with the number of lanes set to 1 and the "vector" extension set to "scalar".

By doing so, we will:

  1. make sure that we have no regression in the tests, especially in our downstream version
  2. avoid conflicts when merging this new feature into the cmake-transition branch

When we are happy about the result, we can safely remove what we think needs to be removed.

@shibatch , does that make sense to you?


shibatch avatar shibatch commented on May 5, 2024

I don't understand why using vector registers to compute scalar values would degrade performance. Look at the assembly output from the compiler: it basically uses vector registers for scalar computation.

One advantage of using vector registers and SIMD intrinsics explicitly is that we can guarantee exactly the same operations are performed during computation. If producing the same results from the vector and scalar implementations is the goal, we probably need to do this.

The function prototype will be like

double Sleef_sind1_u10avx2(double);
float Sleef_sinf_u10sse4(float);

Maybe it is better for us to ask David what his requirement actually is.


fpetrogalli avatar fpetrogalli commented on May 5, 2024

Let me see if I can find a better example. @shibatch I need your help here. How would you implement vint vadd_vi_vi_vi(vint x, vint y); for SSE2 and AArch64 respectively to operate on scalars?


d-parks avatar d-parks commented on May 5, 2024

Hi,

With regard to the inquiry about the requirement in the comment above: what we strive for in the numerical intrinsic libraries is to have the scalar and vector versions produce identical results for any given argument. The requirement originated from looking at the SLEEF implementations of the scalar and vector versions of the cosine function, where there was one subtle difference: while both versions were coded with mlaf(x,y,z) [return x*y+z], the scalar version was built with FMA operations disabled. I also observed that one term in the scalar version used mlaf(), while the vector version used add(mul(x,y),z); I agree that with FMA operations disabled for scalar builds, this one term effectively computes the same result.


fpetrogalli avatar fpetrogalli commented on May 5, 2024

@shibatch - I like the sse2 and advsimd examples you proposed. I wanted to understand if you wanted to use vector registers for AArch64 - which is not the case.

So, let me summarise how I see this happening:

  1. create a helperx86scalar.h to be used in sleefsimd*.c to generate the scalar versions for Intel. I am happy for you to use the vector instructions of your sse2 examples.
  2. create a helperaarch64scalar.h to be used in sleefsimd*.c to generate the scalar version for AArch64, using regular scalar types and operations (double, float, +).

Each of 1. and 2. above will also:
a. add the corresponding testing from iutsimd.c, setting VECTLENDP=VECTLENSP=1.
b. turn on the testing on the travis-ci machines.

Regarding FMA and non-FMA targets: the sleefsimd*.c sources already have sections guarded by macros like ENABLE_FMA_DP. I would rather use those macros to produce versions of the libraries that do or do not use FMA, consistently across the scalar and vector helper files.

This way we could produce a fast (FMA) library and a slower (non-FMA) library, with scalar and vector code agreeing on all the values.
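
A sketch of how a scalar helper might honour the same macro (vmla_vd_vd_vd_vd follows the SLEEF helper naming style, but the guard shown here is an assumption about how the pieces would fit together):

```c
#include <math.h>

/* Guarded multiply-add for a scalar helper file: the FMA build of
 * the library fuses (single rounding), the non-FMA build rounds twice. */
#ifdef ENABLE_FMA_DP
static double vmla_vd_vd_vd_vd(double x, double y, double z) {
    return fma(x, y, z);    /* single rounding */
}
#else
static double vmla_vd_vd_vd_vd(double x, double y, double z) {
    return x * y + z;       /* two roundings, assuming contraction is off */
}
#endif
```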

Once we are happy with these changes, we can think about removing sleefsp.c and sleefdp.c and all related testing.

@d-parks , does that make sense for you?

@shibatch , when/if you start working on this, please split the x86 and aarch64 work in two sequential pull requests. Start for example with x86, and when we have merged it on master, work on aarch64, so that we limit the amount of rework that the first review might require.


d-parks avatar d-parks commented on May 5, 2024

The approach outlined by @fpetrogalli-arm seems reasonable.

Though, I have a few questions: what is the conceptual difference between vfma() and vmla()? Is vfma() only to be used if the hardware has an FMAC? And may vmla() use an FMAC or not?

I see that ENABLE_FMA_DP is only used in a single routine, xexp, where there are vmla() and vfma() variants of the intermediate terms.


shibatch avatar shibatch commented on May 5, 2024

vmla is multiplication + addition, but contraction into an FMA is permitted.
vfma is FMA, and it is only used if FMA is available.
FMA is extensively used in dd.h and df.h.


shibatch avatar shibatch commented on May 5, 2024

Regarding this, I am planning to change the names of the macros used for enabling helper files.
My plan is to name the macros as follows: ENABLE_(extension name)_(vector width in bits). For scalar implementations, the vector width part will be SCALAR.

For example,
The current ENABLE_AVX2 will become ENABLE_AVX2_256.
ENABLE_AVX2128 will become ENABLE_AVX2_128.
The macro for the scalar implementation utilizing AVX2 instructions will be ENABLE_AVX2_SCALAR.



shibatch avatar shibatch commented on May 5, 2024

In addition to that, the names of the AVX2128 functions will all be changed to names ending in "avx2".
For example, Sleef_sind2_u10avx2128 will become Sleef_sind2_u10avx2.

The vector width for a function can be inferred from the type. For example,
Sleef_sind2_u10avx2 will invoke the function with helperavx2_128.h.
Sleef_sind4_u10avx2 will invoke the function with helperavx2.h.

The current scalar functions will be renamed to names ending in "purec".
The scalar functions without a vector extension name will invoke a dispatcher, which selects the best scalar implementation utilizing the available extensions.

The AVX512F functions will also be exposed under function names without the extension name.
For example, I will make Sleef_sind8_u10 an alias for Sleef_sind8_u10avx512f.

I will not change the vector functions with gnuabi. @fpetrogalli-arm Do you care about the change in name of the scalar functions?

I have not decided whether I will start this work before or after the cmake transition completes.


shibatch avatar shibatch commented on May 5, 2024

@d-parks Thank you. I now think that this is going to be a very important feature of SLEEF.
I am also vaguely considering adding GPGPU support. Please let me know if you have any ideas.


shibatch avatar shibatch commented on May 5, 2024

I also noticed that I need to introduce type-casting functions for the arguments and return values of the exported functions.

For example, vdouble should remain __m128d in helperavx2_scalar.h.
However, the types for arguments and return values for the exported functions have to be double.

For example, vadd_vd_vd_vd should be like the following.

__m128d vadd_vd_vd_vd(__m128d x, __m128d y) { return _mm_add_sd(x, y); }

Note that the addition uses _mm_add_sd instead of _mm_add_pd.

Here, we should not make vdouble a double, since the compiler does not know that the upper 64 bits of a 128-bit register are zero. If we converted a __m128d value back and forth to a double value, the converting instructions would not be optimized away by the compiler.

So, I am going to add a definition of a vdoublearg data type, which is used for passing arguments and return values. Type-casting functions like the following will be added to every exported function.

In helperavx2_scalar.h,

typedef __m128d vdouble;
typedef double vdoublearg;

vdouble vcast_vd_a(vdoublearg d) { return _mm_set_sd(d); }
vdoublearg vcast_a_vd(vdouble v) { return _mm_cvtsd_f64(v); }

In sleefsimddp.c,

EXPORT CONST vdoublearg xsin(vdoublearg da) {
  vdouble d = vcast_vd_a(da);
... (the current implementation) ...
  return vcast_a_vd(u);
}


shibatch avatar shibatch commented on May 5, 2024

In order to realize this feature, all conditional branches have to be eliminated. In the current implementation of SLEEF, conditional branches are used for argument reduction in the trig functions, where a faster algorithm is used if all the elements in the argument vector are small enough. We need to provide, for each trig function, a slower version that is bit-identical across all vector lengths, and a faster version that is sometimes not bit-identical.


shibatch avatar shibatch commented on May 5, 2024

I also noticed that the bit patterns of NaN values are sometimes changed by optimization with GCC. Clang does not seem to have this problem.


shibatch avatar shibatch commented on May 5, 2024

I'm going to introduce tester3 for this feature (actually, I have already written it). This tester checks whether the values returned from two implementations are bit-identical. The test is quick since it does not need libmpfr.
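
The heart of such a tester is a bitwise comparison rather than ==, since == treats 0.0 and -0.0 as equal and any NaN as unequal to itself. A minimal sketch (not the actual tester3 code):

```c
#include <stdint.h>
#include <string.h>

/* Compare two doubles bit for bit rather than numerically. */
static int bit_identical(double a, double b) {
    uint64_t ua, ub;
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);
    return ua == ub;
}
```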


carlkl avatar carlkl commented on May 5, 2024

@shibatch, what do you mean by the NaN bit pattern changes caused by GCC? Is there a test case?


shibatch avatar shibatch commented on May 5, 2024

As you know, NaN is not defined by a single bit pattern; a NaN can carry information in its payload bits. That payload seems to be changed during optimization.
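
A quick way to observe this is to construct a quiet NaN with a known payload and read the payload back after it has passed through some code. The helpers below are hypothetical illustrations of the IEEE 754 double layout (sign bit, all-ones exponent, quiet bit 0x0008..., then 51 payload bits):

```c
#include <stdint.h>
#include <string.h>

/* Build a quiet NaN carrying `payload` in its low mantissa bits. */
static double nan_with_payload(uint64_t payload) {
    uint64_t bits = 0x7ff8000000000000ULL | (payload & 0x0007ffffffffffffULL);
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Read the payload back; if the compiler or an instruction has
 * canonicalised the NaN, this no longer matches what was stored. */
static uint64_t nan_payload(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits & 0x0007ffffffffffffULL;
}
```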

