I noticed a situation in which version 2 of the library gives me significantly degraded performance compared to version 1. Please find attached a minimal working example consisting of two functions, `minimal` and `minimal_vcl16`, which take an interleaved array and 'demux' it: first all elements with an even index are extracted, then all elements with an odd index.
#include <chrono>
#include <cstddef>  // size_t
#include <cstdlib>  // srand, rand, abort
#include <cstring>  // memset
#include <iostream>

#include <vectorclass.h>

// see https://github.com/martinus/nanobench/raw/master/src/include/nanobench.h
#define ANKERL_NANOBENCH_IMPLEMENT
#include "nanobench.h"
typedef unsigned char byte;

/// Allocate an 'alignment'-byte aligned buffer of SRC_SIZE bytes filled with
/// deterministic pseudo-random data (srand(0) reseeds on every call, so
/// repeated calls produce identical contents). Returns nullptr on allocation
/// failure instead of writing through a null pointer (UB in the original).
/// The caller owns the buffer and must release it with _mm_free().
byte *generate(size_t SRC_SIZE, int alignment = 64) {
  byte *buf = (byte *)_mm_malloc(SRC_SIZE, alignment);
  if (buf == nullptr)
    return nullptr;
  srand(0);
  for (size_t i = 0; i < SRC_SIZE; i++)
    buf[i] = (byte)(rand() % 256);
  return buf;
}
/// Allocate a zero-initialised, 'alignment'-byte aligned buffer of DST_SIZE
/// bytes. Returns nullptr on allocation failure; the original passed a
/// possibly-null pointer straight to memset, which is undefined behaviour.
/// The caller owns the buffer and must release it with _mm_free().
byte *allocate_dst(size_t DST_SIZE, int alignment = 64) {
  byte *result = (byte *)_mm_malloc(DST_SIZE, alignment);
  if (result != nullptr)
    memset(result, 0, DST_SIZE);
  return result;
}
/// Scalar reference demux: splits the interleaved 'src' buffer of
/// step * 2 bytes into two contiguous halves of 'dst' — even-indexed source
/// bytes go to dst[0..step), odd-indexed bytes to dst[step..2*step).
///
/// Fix: the loop counter was 'int' while 'size' is size_t; besides the
/// signed/unsigned comparison, a signed counter overflows (UB) once
/// step * 2 exceeds INT_MAX.
void minimal(unsigned char *src, unsigned char *dst, size_t step) {
  size_t size = step * 2;
  auto x = dst;        // destination cursor for even-indexed elements
  auto y = dst + step; // destination cursor for odd-indexed elements
  for (size_t i = 0; i < size; i += 2) {
    x[0] = src[i];
    y[0] = src[i + 1];
    ++x;
    ++y;
  }
}
/// Vectorised demux using VCL: consumes 32 interleaved source bytes per
/// iteration and de-interleaves them with two blend16 shuffles into the
/// even half (x) and odd half (y) of the destination, mirroring minimal().
///
/// Requires aligned loads/stores (load_a / store_a), so src, dst and
/// dst + step must be 16-byte aligned; step must be a multiple of 16 so
/// size is a multiple of the 32-byte stride. This is the routine whose
/// throughput regresses with VCL v2 compared to v1 (see tables below).
///
/// Fix: the loop counter was 'int' while 'size' is size_t; a signed
/// counter overflows (UB) once step * 2 exceeds INT_MAX.
void minimal_vcl16(unsigned char *src, unsigned char *dst, size_t step) {
  size_t size = step * 2;
  auto x = dst;        // even-indexed output
  auto y = dst + step; // odd-indexed output
  for (size_t pos = 0; pos < size; pos += 2 * 16) {
    Vec16uc a, b;
    a.load_a(&src[pos]);
    b.load_a(&src[pos + 16]);
    // Even source lanes 0,2,...,30 of the concatenated (a, b) pair.
    blend16<0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30>(a, b)
        .store_a(x);
    // Odd source lanes 1,3,...,31.
    blend16<1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31>(a, b)
        .store_a(y);
    x += 16;
    y += 16;
  }
}
/// Verify that the scalar and the VCL demux agree, then benchmark both with
/// nanobench on fresh buffers.
int main() {
  auto config = ankerl::nanobench::Config();
  config.minEpochTime(std::chrono::milliseconds{200});
  size_t step = 800 * 800;
  size_t SIZE = step * 2;
  auto src = generate(SIZE);
  auto dst = allocate_dst(SIZE);
  auto dst_vcl = allocate_dst(SIZE);
  minimal(src, dst, step);
  minimal_vcl16(src, dst_vcl, step);
  // Check that minimal and minimal_vcl16 produce identical outputs.
  // size_t index: SIZE is size_t, so an int counter would mix signedness.
  for (size_t i = 0; i < SIZE; ++i) {
    // std::cout << i << ">>> " << (unsigned int) src[i] << ": " << (unsigned
    // int)dst[i] << " " << (unsigned int) dst_vcl[i] << std::endl;
    if (dst[i] != dst_vcl[i]) {
      abort();
    }
  }
  std::cout << "OK\n";
  _mm_free(src);
  _mm_free(dst);
  _mm_free(dst_vcl);
  // Run the benchmark on freshly allocated buffers.
  src = generate(SIZE);
  dst = allocate_dst(SIZE);
  dst_vcl = allocate_dst(SIZE);
  config.run("Minimal", [&] { minimal(src, dst, step); })
      .doNotOptimizeAway(dst);
  // Bug fix: this previously pinned 'dst' (the scalar output) instead of
  // 'dst_vcl', leaving the VCL result unprotected from dead-code
  // elimination by the optimizer.
  config.run("MinimalVCL", [&] { minimal_vcl16(src, dst_vcl, step); })
      .doNotOptimizeAway(dst_vcl);
  _mm_free(dst);
  _mm_free(dst_vcl);
  _mm_free(src);
}
| ns/op | op/s | MdAPE | ins/op | cyc/op | IPC | branches/op | missed% | benchmark
|--------------------:|--------------------:|--------:|---------------:|---------------:|-------:|---------------:|--------:|:----------------------------------------------
| 80,675.93 | 12,395.27 | 0.2% | 380,004.02 | 234,070.96 | 1.623 | 20,001.02 | 0.0% | `Minimal`
| 84,662.87 | 11,811.55 | 0.3% | 520,002.02 | 245,669.68 | 2.117 | 40,000.02 | 0.0% | `MinimalVCL`
| ns/op | op/s | MdAPE | ins/op | cyc/op | IPC | branches/op | missed% | benchmark
|--------------------:|--------------------:|--------:|---------------:|---------------:|-------:|---------------:|--------:|:----------------------------------------------
| 79,375.69 | 12,598.32 | 0.3% | 380,004.02 | 230,320.17 | 1.650 | 20,001.02 | 0.0% | `Minimal`
| 554,310.82 | 1,804.04 | 0.0% | 760,018.15 | 1,608,538.10 | 0.472 | 40,003.14 | 0.0% | `MinimalVCL`
This picture is consistent across compilers (I tested gcc 8, gcc 9.2, and clang 9.0) and on two different machines. The same regression also shows up with an analogous version using `Vec32uc` instead of `Vec16uc`. When I turn on AVX512 on my workstation, the performance improves to the expected levels.