
libsimdpp's Introduction

libsimdpp

Join the chat at https://gitter.im/libsimdpp/Lobby

libsimdpp is a portable, header-only, zero-overhead, low-level C++ SIMD library. The library presents a single interface over the SIMD instruction sets present in x86, ARM, PowerPC and MIPS architectures. On architectures that support several SIMD instruction sets, the library allows the same source files to be compiled for each instruction set and then hooked into an internal or third-party dynamic dispatch mechanism. This allows the capabilities of the processor to be queried at runtime and the most efficient implementation to be selected.
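
As a rough illustration of the dispatch idea described above, the following sketch selects one of two implementations at runtime. It does not use libsimdpp's own dispatcher: sum_scalar, sum_unrolled, cpu_has_avx2 and select_sum are hypothetical names, sum_unrolled merely stands in for a kernel compiled for a wider instruction set, and the CPU query assumes the GCC/Clang builtin __builtin_cpu_supports on x86.

```cpp
#include <cstddef>

// Signature shared by every implementation of the (hypothetical) kernel.
using SumFn = float (*)(const float*, std::size_t);

// Baseline implementation that runs on any CPU.
inline float sum_scalar(const float* data, std::size_t n) {
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i) total += data[i];
    return total;
}

// Stand-in for a version built for a wider instruction set. In a real
// project this would be the same source compiled with different flags.
inline float sum_unrolled(const float* data, std::size_t n) {
    float acc0 = 0.0f, acc1 = 0.0f;
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += data[i];
        acc1 += data[i + 1];
    }
    if (i < n) acc0 += data[i];  // odd element left over
    return acc0 + acc1;
}

// Queries the processor at runtime. Assumption: GCC or Clang on x86;
// elsewhere this sketch simply reports "not supported".
inline bool cpu_has_avx2() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false;
#endif
}

// Picks the implementation once and caches the choice.
inline SumFn select_sum() {
    static const SumFn chosen = cpu_has_avx2() ? sum_unrolled : sum_scalar;
    return chosen;
}
```

Callers would then write select_sum()(data, n) and never name a specific implementation.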

The library sits somewhere in the middle between programming directly in SIMD intrinsics and even higher-level SIMD libraries. As much control as possible is given to the developer, so that it's possible to exactly predict what code the compiler will generate.

No API-breaking changes are planned for the foreseeable future.

Documentation

Online documentation is provided here.

Compiler and instruction set support

  • This describes the current branch only, which may be unstable or otherwise unfit for use. For available releases, please see the libsimdpp wiki.

The library supports the following architectures and instruction sets:

  • x86, x86-64: SSE2, SSE3, SSSE3, SSE4.1, AVX, AVX2, FMA3, FMA4, AVX512F, AVX512BW, AVX512DQ, AVX512VL, XOP, popcnt
  • ARM 32-bit: NEON, NEONv2
  • ARM 64-bit: NEON, NEONv2
  • PowerPC 32-bit big-endian: Altivec, VSX v2.06, VSX v2.07
  • PowerPC 64-bit little-endian: Altivec, VSX v2.06, VSX v2.07
  • MIPS 32-bit little-endian: MSA
  • MIPS 64-bit little-endian: MSA

The primary development of the library happens in C++11. A C++98-compatible version of the library is provided on the cxx98 branch.

Supported compilers:

  • C++11 version:

    • GCC: 4.8-7.x
    • Clang: 3.3-4.0
    • Xcode 7.0-9.x
    • MSVC: 2013, 2015, 2017
    • ICC (on both Linux and Windows): 2013, 2015, 2016, 2017
  • C++98 version

    • GCC: 4.4-7.x
    • Clang: 3.3-4.0
    • Xcode 7.0-9.x
    • MSVC: 2013, 2015, 2017
    • ICC (on both Linux and Windows): 2013, 2015, 2016, 2017

Newer versions of the aforementioned compilers will generally work with either the C++11 or the C++98 version of the library. Older versions of these compilers will generally work with the C++98 version of the library.

Various compiler versions are not supported on various instruction sets due to compiler bugs or incompletely implemented instruction sets. See simdpp/detail/workarounds.h for more details.

  • MSVC and ICC are only supported on x86 and x86-64.

  • AVX is not supported on Clang 3.6 or GCC 4.4.

  • AVX2 is not supported on Clang 3.6.

  • AVX512F is not supported on:

    • GCC 5.x and older
    • Clang 5.0 and older
    • MSVC
  • NEON armv7 is not supported on Clang 3.3 and older.

  • NEON aarch64 is not supported on GCC 4.8 and older.

  • Altivec on little-endian PPC is not supported on GCC 5.x and older.

  • VSX on big-endian PPC is not supported on GCC 5.x and older.

  • MSA is not supported on GCC 6.x and older.

Contributing

Contributions are welcome. Please see CONTRIBUTING.md for more information.

License

The library may be freely used in commercial and non-commercial software. The code is distributed under the Boost Software License, Version 1.0. Some internal development scripts are licensed under different licenses -- see comments in these files. The documentation is licensed under CC-BY-SA.

Boost Software License - Version 1.0 - August 17th, 2003

Permission is hereby granted, free of charge, to any person or organization obtaining a copy of the software and accompanying documentation covered by this license (the "Software") to use, reproduce, display, distribute, execute, and transmit the Software, and to prepare derivative works of the Software, and to permit third-parties to whom the Software is furnished to do so, all subject to the following:

The copyright notices in the Software and this entire statement, including the above license grant, this restriction and the following disclaimer, must be included in all copies of the Software, in whole or in part, and all derivative works of the Software, unless such copies or derivative works are solely in the form of machine-executable object code generated by a source language processor.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

libsimdpp's Issues

Possible incorrect inclusion of AVX2 instructions when not enabled

Hi there,

AFAICT, libsimdpp is erroneously including AVX2 instructions when I neither enable them with a compiler switch nor enable them with the instruction set selection macro. I have a file called fail.cpp:

#define SIMDPP_ARCH_X86_SSE2

#include <simdpp/simd.h>
#include <inttypes.h>

using namespace simdpp;

int main(int argc, char ** argv) {
  return 0;
}

uint64<2> bad(uint64<2> x, uint64<2> y) {
  return bit_andnot(x, y);
}

which I compile with this invocation:

g++ -march=native -std=c++11 -Ilibsimdpp-2.0-rc2 -Wall -Werror fail.cpp

and then I take a look at a.out:

[ec2-user@ip-172-31-54-96 c]$ objdump -M intel -d a.out
...
00000000004005f2 <_Z3badN6simdpp9arch_sse26uint64ILj2EvEES2_>:
  4005f2:       55                      push   rbp
  4005f3:       48 89 e5                mov    rbp,rsp
  4005f6:       48 81 ec 20 02 00 00    sub    rsp,0x220
...
  40085a:       c5 f9 df 85 30 ff ff    vpandn xmm0,xmm0,XMMWORD PTR [rbp-0xd0]

and it includes vpandn which, AFAIK, is an AVX2 instruction. Moreover, this triggers a SIGILL on my machine, so at the very least it's not compatible with my architecture.

Have I done something wrong? Perhaps a bad flag somewhere?

When I compile libsimdpp with cmake ., it does correctly conclude that I lack AVX2:

...
-- Performing Test CAN_RUN_X86_AVX
-- Performing Test CAN_RUN_X86_AVX - Success
-- Performing Test CAN_RUN_X86_AVX2
-- Performing Test CAN_RUN_X86_AVX2 - Failed
...
[ec2-user@ip-172-31-54-96 c]$ gcc --version
gcc (GCC) 4.8.3 20140911 (Red Hat 4.8.3-9)
[ec2-user@ip-172-31-54-96 c]$ cat /proc/cpuinfo | grep flags
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt
[ec2-user@ip-172-31-54-96 c]$ gcc -march=native -Q --help=target | grep avx
  -march=                               core-avx-i
  -mavx                                 [enabled]
  -mavx2                                [disabled]
  -mavx256-split-unaligned-load         [disabled]
  -mavx256-split-unaligned-store        [disabled]
  -mprefer-avx128                       [disabled]
  -msse2avx                             [disabled]
  -mtune=                               core-avx-i

type mistakenly deduced

Hi, this snippet does not compile in VS2017 because out_vec is mistakenly deduced as uint32x8:

void prelu_simdpp(const T* const in_data, const int len, const float coeff, T* const out_data)
{
    const int len_aligned = len & (-8);

    for (int i = 0; i < len_aligned; i += 8)
    {
        auto in_vec = simdpp::load_u<simdpp::int32x8>(in_data + i);
        auto mask_vec = simdpp::cmp_gt(in_vec, 0);
        auto out_vec = simdpp::blend(in_vec, in_vec * coeff, mask_vec);
        auto out_vec2 = to_float32(out_vec);
    }
}
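
For comparison, here is a plain scalar sketch of what the loop above appears intended to compute (a PReLU: positive inputs pass through, the rest are scaled by coeff). It does not use libsimdpp and says nothing about how the library deduces types; prelu_scalar is a hypothetical name and float is assumed as the element type.

```cpp
#include <cstddef>

// Scalar reference: out[i] = in[i] if in[i] > 0, otherwise in[i] * coeff.
inline void prelu_scalar(const float* in_data, std::size_t len, float coeff,
                         float* out_data) {
    for (std::size_t i = 0; i < len; ++i) {
        const float x = in_data[i];
        out_data[i] = (x > 0.0f) ? x : x * coeff;
    }
}
```

A vectorized version can be checked element by element against this reference, including the tail that the len & (-8) rounding leaves unprocessed.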

compiler error with sse2 on 32bit system

First of all, thanks a lot for this great library!

I have a problem to compile this code on my system:

#include <emmintrin.h>
#include <simdpp/sse2.h>
using namespace simdpp::SIMDPP_ARCH_NAMESPACE;
int main(int argc, char** argv) {
    uint32x4 a = uint32x4::make_const(0x11111111, 0x22222222, 0x33333333, 0x44444444);
    return 0;
}

...with this command:

g++ -std=c++11 -msse2 -I.. main.cpp

...I get these errors:

In file included from ../simdpp/simd/math_shift.h:17:0,
                 from ../simdpp/simd.h:47,
                 from ../simdpp/sse2.h:23,
                 from main.cpp:2:
../simdpp/simd/extract.h: In function ‘uint64_t simdpp::simdpp_arch_sse2::extract(simdpp::simdpp_arch_sse2::basic_int64x2)’:
../simdpp/simd/extract.h:124:31: error: there are no arguments to ‘_mm_cvtsi128_si64’ that depend on a template parameter, so a declaration of ‘_mm_cvtsi128_si64’ must be available [-fpermissive]
     return _mm_cvtsi128_si64(t);
                               ^
../simdpp/simd/extract.h:124:31: note: (if you use ‘-fpermissive’, G++ will accept your code, but allowing the use of an undeclared name is deprecated)
In file included from ../simdpp/simd.h:62:0,
                 from ../simdpp/sse2.h:23,
                 from main.cpp:2:
../simdpp/simd/insert.h: In function ‘simdpp::simdpp_arch_sse2::int128 simdpp::simdpp_arch_sse2::insert(simdpp::simdpp_arch_sse2::basic_int64x2, uint64_t)’:
../simdpp/simd/insert.h:136:37: error: there are no arguments to ‘_mm_cvtsi64_si128’ that depend on a template parameter, so a declaration of ‘_mm_cvtsi64_si128’ must be available [-fpermissive]
     int64x2 vx = _mm_cvtsi64_si128(x);
                                     ^
In file included from ../simdpp/simd.h:69:0,
                 from ../simdpp/sse2.h:23,
                 from main.cpp:2:
../simdpp/simd/int64x2.inl: In static member function ‘static simdpp::simdpp_arch_sse2::uint64x2 simdpp::simdpp_arch_sse2::uint64x2::set_broadcast(uint64_t)’:
../simdpp/simd/int64x2.inl:82:30: error: ‘_mm_cvtsi64_si128’ was not declared in this scope
     r0 = _mm_cvtsi64_si128(v0);
                              ^
In file included from ../simdpp/simd.h:71:0,
                 from ../simdpp/sse2.h:23,
                 from main.cpp:2:
../simdpp/simd/float64x2.inl: In static member function ‘static simdpp::simdpp_arch_sse2::float64x2 simdpp::simdpp_arch_sse2::float64x2::set_broadcast(double)’:
../simdpp/simd/float64x2.inl:52:49: error: ‘_mm_cvtsi64_si128’ was not declared in this scope
     r0 = _mm_cvtsi64_si128(bit_cast<int64_t>(v0));

Those undefined functions (_mm_cvtsi128_si64 and _mm_cvtsi64_si128) are defined in emmintrin.h, but only for 64bit systems. Right now, I have just commented out the code around:

in simdpp/simd/float64x2.inl

inline float64x2 float64x2::set_broadcast(double v0)
{
#if SIMDPP_USE_NULL || SIMDPP_USE_NEON_VFP_DP
    return null::make_vec<float64x2>(v0);
#elif SIMDPP_USE_SSE2
    return zero();
//  int64x2 r0;
//  r0 = _mm_cvtsi64_si128(bit_cast<int64_t>(v0));
//  return permute<0,0>(float64x2(r0));
#else
    return SIMDPP_NOT_IMPLEMENTED1(v0);
#endif
}

in simdpp/simd/int64x2.inl

inline uint64x2 uint64x2::set_broadcast(uint64_t v0)
{
#if SIMDPP_USE_NULL
    return null::make_vec<uint64x2>(v0);
#elif SIMDPP_USE_SSE2
    return zero();
//  uint64x2 r0;
//  r0 = _mm_cvtsi64_si128(v0);
//  r0 = permute<0,0>(r0);
//  return uint64x2(r0);
#elif SIMDPP_USE_NEON
    uint64x1_t r0 = vcreate_u64(v0);
    return vcombine_u64(r0, r0);
#endif
}

in simdpp/simd/extract.h

template<unsigned id>
inline uint64_t extract(basic_int64x2 a)
{
    static_assert(id < 2, "index out of bounds");
#if SIMDPP_USE_NULL
    return a[id];
#elif SIMDPP_USE_SSE4_1
    return _mm_extract_epi64(a, id);
#elif SIMDPP_USE_SSE2
    return 0;
//  uint64x2 t = a;
//  if (id != 0) {
//      t = move_l<id>(t);
//  }
//  return _mm_cvtsi128_si64(t);
#elif SIMDPP_USE_NEON
    return vgetq_lane_u64(a, id);
#endif
}

in simdpp/simd/insert.h

template<unsigned id>
int128 insert(basic_int64x2 a, uint64_t x)
{
#if SIMDPP_USE_NULL
    a[id] = x;
    return a;
#elif SIMDPP_USE_SSE4_1
    return _mm_insert_epi64(a, x, id);
#elif SIMDPP_USE_SSE2
    return 0;
//  int64x2 vx = _mm_cvtsi64_si128(x);
//  if (id == 0) {
//      a = shuffle1<0,1>(vx, a);
//  } else {
//      a = shuffle1<0,0>(a, vx);
//  }
//  return a;
#elif SIMDPP_USE_NEON
    return vsetq_lane_u64(x, a, id);
#endif
}

Is there a correct way to overcome those compiler errors?
Thanks a lot!
Michal
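
One possible workaround, sketched below without touching libsimdpp itself, is to move the 64-bit value through memory with _mm_loadl_epi64 and _mm_storel_epi64. These are plain SSE2 intrinsics and, to my knowledge, are also declared for 32-bit targets. The helper names u64_to_m128i and m128i_to_u64 are hypothetical, and the code assumes an x86 compiler with SSE2 enabled.

```cpp
#include <cstdint>
#include <emmintrin.h>

// Places x in the low 64 bits of the vector and zeroes the upper 64 bits,
// which is the effect _mm_cvtsi64_si128 has on 64-bit targets.
inline __m128i u64_to_m128i(std::uint64_t x) {
    return _mm_loadl_epi64(reinterpret_cast<const __m128i*>(&x));
}

// Returns the low 64 bits of the vector, like _mm_cvtsi128_si64.
inline std::uint64_t m128i_to_u64(__m128i v) {
    std::uint64_t r;
    _mm_storel_epi64(reinterpret_cast<__m128i*>(&r), v);
    return r;
}
```

Unlike returning zero(), this keeps the SSE2 code paths functionally correct, at the cost of a round trip through the stack that the compiler may or may not optimize away.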

clang generates dubious code for int32x4 >=

This code doesn't produce an error on Clang, but the resulting disassembly looks dubious:

simdpp::mask_int32x4 foo1(simdpp::int32x4 a, simdpp::int32x4 b)
{
    return a >= b;
}

Inspiration::foo1(simdpp::arch_ssse3::int32<4u, void>, simdpp::arch_ssse3::int32<4u, void>):
0000000000001280 pushq %rbp
0000000000001281 movq %rsp, %rbp
0000000000001284 movdqa 0x134(%rip), %xmm2
000000000000128c pxor %xmm2, %xmm0
0000000000001290 pxor %xmm2, %xmm1
0000000000001294 movdqa %xmm1, %xmm2
0000000000001298 pcmpgtd %xmm0, %xmm2
000000000000129c pshufd $0xa0, %xmm2, %xmm3
00000000000012a1 pcmpeqd %xmm0, %xmm1
00000000000012a5 pshufd $0xf5, %xmm1, %xmm0
00000000000012aa pand %xmm3, %xmm0
00000000000012ae pshufd $0xf5, %xmm2, %xmm1
00000000000012b3 por %xmm0, %xmm1
00000000000012b7 pcmpeqd %xmm0, %xmm0
00000000000012bb pxor %xmm1, %xmm0
00000000000012bf popq %rbp
00000000000012c0 retq

The same code on GCC produces a compilation error.

simdpp::blend() with mask_int32<8> gets compiled into PBLENDVB (not VPBLENDD)

I believe the instruction VPBLENDD is faster than the instruction PBLENDVB. The first instruction only handles dwords but the second instruction handles bytes.

For that reason I thought that a simdpp::blend() that makes use of a mask_int32<8>
would be compiled into a VPBLENDD instead of a PBLENDVB.

My test program gets compiled into PBLENDVB:

    #include <iostream>
    #include <simdpp/simd.h>

    int main() {
      simdpp::uint32<8> v1 = simdpp::make_uint(std::numeric_limits< uint32_t >::max(), 0);
      simdpp::uint32<8> v2 = simdpp::make_uint(std::numeric_limits< uint32_t >::max());
      const auto mask = simdpp::cmp_eq(v1, v2);
      v1 = simdpp::blend(v1, v2, mask);
      // Just output something so that the compiler does not optimize away everything
      std::cout << simdpp::reduce_max(v1) << "\n";
    }
    $ g++-7.1 -I/home/user/libsimdpp/inst/include/libsimdpp-2.0 -I. -std=c++14  -msse4.1  -mavx2 -O3  -D SIMDPP_ARCH_X86_AVX2 -save-temps /home/user/test.cc
    $ grep blend test.s
    	vpblendvb	%ymm1, %ymm1, %ymm0, %ymm1

Do you know why PBLENDVB is being used and not VPBLENDD?
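
A likely explanation, offered with some caution: as far as I know, VPBLENDD takes its selection mask as an 8-bit immediate that must be known at compile time, whereas (V)PBLENDVB reads the mask from a register. A mask computed by cmp_eq at runtime therefore cannot be fed to VPBLENDD. The scalar model below is not libsimdpp code; blend_imm_model and blend_var_model are hypothetical names that mimic my understanding of the two instructions on 8 x 32-bit lanes.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Lanes = std::array<std::uint32_t, 8>;

// Model of an immediate blend (VPBLENDD-like): bit i of the compile-time
// constant Imm selects lane i from b, otherwise from a.
template <std::uint8_t Imm>
Lanes blend_imm_model(const Lanes& a, const Lanes& b) {
    Lanes r{};
    for (std::size_t i = 0; i < r.size(); ++i) {
        r[i] = ((Imm >> i) & 1u) ? b[i] : a[i];
    }
    return r;
}

// Model of a variable blend (VPBLENDVB-like): the top bit of every mask
// byte, known only at runtime, selects that byte from b, otherwise from a.
inline Lanes blend_var_model(const Lanes& a, const Lanes& b, const Lanes& mask) {
    Lanes r{};
    for (std::size_t i = 0; i < r.size(); ++i) {
        for (unsigned byte = 0; byte < 4; ++byte) {
            const unsigned shift = 8u * byte;
            const std::uint32_t byte_bits = std::uint32_t{0xFF} << shift;
            const bool take_b = ((mask[i] >> shift) & 0x80u) != 0;
            r[i] |= (take_b ? b[i] : a[i]) & byte_bits;
        }
    }
    return r;
}
```

The difference shows in the signatures: the immediate variant needs its mask as a template argument, while the variable variant accepts an ordinary runtime value.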

Conflicts with CRT macros in VS2015

I tried to use libsimdpp in VS2015. However, it resulted in errors because ucrt/stdlib.h defines min and max as macros, so any occurrence of min/max in libsimdpp was getting replaced.

I'm not sure what the best fix is, but for now I have added #undef for both of them in null/math.h.
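
A common way to sidestep such macros without #undef is to parenthesise the function name: a function-like macro is only expanded when its name is directly followed by an opening parenthesis, so (std::max)(a, b) is left alone. The sketch below simulates the offending macros itself so that it builds on any compiler; largest_of and smallest_of are hypothetical names.

```cpp
#include <algorithm>

// Simulate the macros that the platform headers leak into user code.
#define max(a, b) (((a) > (b)) ? (a) : (b))
#define min(a, b) (((a) < (b)) ? (a) : (b))

// The parentheses around the qualified names stop the preprocessor from
// treating them as function-like macro invocations.
inline int largest_of(int x, int y) { return (std::max)(x, y); }
inline int smallest_of(int x, int y) { return (std::min)(x, y); }

// Clean up so the simulated macros do not affect code that follows.
#undef max
#undef min
```

The same spelling works for member functions and for names inside a library namespace, which is why it is often used in headers that must coexist with such macros.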

workaround: _mm_set_epi64x identifier not found

I am building a Python extension that uses libsimdpp. As I want to provide compatibility with Python 2.7 (yep, it's still pretty popular), I need to compile against VS 2008 (using the cxx98 branch). There, I am getting the following error when compiling with SSE2 options enabled:

error C3861: '_mm_set_epi64x': identifier not found

According to https://msdn.microsoft.com/en-us/library/dk2sdw0h(v=vs.90).aspx the correct header file is intrin.h and just adding it makes it indeed work.

Newer VS versions don't have that problem. Have you heard about this before? Why does libsimdpp not include intrin.h? Does my workaround look ok?

Full logs: https://ci.spacy.io/builders/sense2vec-win64-py27-64-install/builds/47/steps/shell_2/logs/stdio
Workaround: explosion/sense2vec@1d94617

scalar arguments in expressions are broken

Hi Povilas,

with the current git version, this code does not compile:

float32x4 b = make_float(1.0f);
float32x4 r = add(add(b, b), 2.0f);

with the error message:

error: could not convert ‘simdpp::arch_sse2::add<4u, simdpp::arch_sse2::expr_add<simdpp::arch_sse2::float32<4u>, simdpp::arch_sse2::float32<4u> >, simdpp::arch_sse2::expr_scalar<float> >((* & a), (* & simdpp::arch_sse2::detail::cast_expr<simdpp::arch_sse2::float32<4u, simdpp::arch_sse2::expr_scalar<float> >, float>((* & b))))’ from ‘simdpp::arch_sse2::float32<4u, simdpp::arch_sse2::expr_add<simdpp::arch_sse2::float32<4u, simdpp::arch_sse2::expr_add<simdpp::arch_sse2::float32<4u>, simdpp::arch_sse2::float32<4u> > >, simdpp::arch_sse2::float32<4u, simdpp::arch_sse2::expr_scalar<float> > > >’ to ‘simdpp::arch_sse2::float32<4u, simdpp::arch_sse2::expr_add<simdpp::arch_sse2::float32<4u>, simdpp::arch_sse2::float32<4u, simdpp::arch_sse2::expr_scalar<float> > > >’
 template<unsigned N, class V> SIMDPP_INL RET_VEC<N, EXPR<VEC<N>, VEC<N,expr_scalar<   float>>>> FUNC(const VEC<N,V>& a, const float& b)    { return FUNC(a, detail::cast_expr<VEC<N,expr_scalar<   float>>>(b)); } \
                                                                                                                                                                                                               ^
/home/miso/install/libsimdpp/simdpp/core/f_add.h:43:1: note: in expansion of macro ‘SIMDPP_SCALAR_ARG_IMPL_EXPR’
 SIMDPP_SCALAR_ARG_IMPL_EXPR(add, expr_add, float32, float32)
 ^

The problem seems to be in the scalar argument, because this code compiles correctly:

float32x4 b = make_float(1.0f);
float32x4 r = add(add(b, b), b);

Thanks for any hints,
Miso

how to build the dynamic_dispatch example ?

Hello,
Probably a silly question, but since there is absolutely no install or basic usage documentation...

So I got the git repo, then:

cd examples/dynamic_dispatch
make test

and I get:

In file included from test.cc:4:0:
../../simdpp/dispatch/get_arch_gcc_builtin_cpu_supports.h: In function ‘simdpp::Arch simdpp::get_arch_gcc_builtin_cpu_supports()’:
../../simdpp/dispatch/get_arch_gcc_builtin_cpu_supports.h:24:41: error: Parameter to builtin not valid: avx512f
     if (__builtin_cpu_supports("avx512f")) {


gcc (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4

Possible performance problem?

I am trying to multiply two float32<8> numbers with SIMDPP_ARCH_X86_AVX setting.

The code is something like:

float32<8> bigi = load(i);
float32<8> bigm = load(modifiers);
bigi = mul(bigi, bigm);

It works OK, but when I trace the code step by step I see that after the multiplication execution goes to the following piece of code:

template<class R, class T> SIMDPP_INL
R cast_memcpy(const T& t)
{
    static_assert(sizeof(R) == sizeof(T), "Size mismatch");
    R r;
    ::memcpy(&r, &t, sizeof(R));
    return r;
}

I don't understand why we need to do memcpy after each operation. It looks like a big performance hit.
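
For context, copying through memcpy is the standard-conforming way to reinterpret the bits of one type as another in C++11. Optimizing compilers generally replace a fixed-size memcpy like this with a register move, so the call tends to be visible only when stepping through an unoptimized build. The sketch below shows the same pattern on scalar types; bit_cast_memcpy is a hypothetical name, not the library's function, and the example value assumes IEEE 754 floats.

```cpp
#include <cstdint>
#include <cstring>

// Reinterprets the object representation of `from` as type To.
template <class To, class From>
To bit_cast_memcpy(const From& from) {
    static_assert(sizeof(To) == sizeof(From), "Size mismatch");
    To to;
    std::memcpy(&to, &from, sizeof(To));
    return to;
}
```

Comparing the disassembly of an -O2 build with the debug build is probably the quickest way to confirm whether any copy survives in the real code.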

failed to load

Hi. I tried to write a simple example for libsimdpp.
I thought the following code should run, but it fails with a runtime error at load(a).

Do you have any idea?

#define  SIMDPP_ARCH_X86_AVX2
#include<simdpp/simd.h>
int main()
{
    const int N = 8;

    // should be aligned for __m256
    float SIMDPP_ALIGN(32) a[N];

    for (int i = 0; i < N; ++i)
    {
        a[i] = i;
    }
        // this works.
        //simdpp::float32<4, void> a_avx = simdpp::load(&a[0]);
    simdpp::float32<8, void> a_avx = simdpp::load(&a[0]);

    return 0;
}
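
Two things may be worth ruling out here, though neither is confirmed by the report: that the CPU actually supports AVX2, and that the array really is 32-byte aligned. The sketch below covers only the alignment check; is_aligned is a hypothetical helper and is independent of libsimdpp.

```cpp
#include <cstddef>
#include <cstdint>

// Returns true when the address held by `ptr` is a multiple of
// `alignment` bytes.
inline bool is_aligned(const void* ptr, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}
```

Calling is_aligned(&a[0], 32) just before the load would show whether the alignment request on the array took effect.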

CMake tip - if dispatcher generated cpp files are not rebuilding after changing a header

(OSX 10.10, Apple Clang 7.0.2, CMake 3.5.0)

This is probably a bug in CMake (still looking into it) but I found it while working with libsimdpp, so thought other users might find this helpful.

File setup

  • src/main.cpp //main program entry point
  • src/code.cpp //the code that dispatcher will copy
  • src/common.h //some common header, included in code.cpp and main.cpp
  • include/simdpp/... //simdpp include dir

CMakeLists.txt:

[...]
simdpp_get_runnable_archs(RUNNABLE_ARCHS)
simdpp_multiarch(GEN_ARCH_FILES src/code.cpp ${RUNNABLE_ARCHS})
add_executable(simd-test src/main.cpp ${GEN_ARCH_FILES})
target_include_directories(simd-test PRIVATE ${CMAKE_SOURCE_DIR}/include/)

Background

The simdpp_multiarch() CMake function (from SimdppMultiarch.cmake) will use configure_file() to copy ${CMAKE_SOURCE_DIR}/src/code.cpp into the build dir (e.g. ${CMAKE_BINARY_DIR}/src/code_simdpp_-x86_avx.cpp etc). It will also manually add the include dir pointing back to the original location:

SimdppMultiarch.cmake line 434:
set(CXX_FLAGS "-I\"${CMAKE_CURRENT_SOURCE_DIR}/${SRC_PATH}\" ${CXX_FLAGS}")

This ensures that local includes, such as #include "common.h" in code.cpp will still work at compile time.

Problem

The problem is that when CMake generates the file dependencies, it seems to ignore the file-specific include search path set on the generated files. This means that ${CMAKE_BINARY_DIR}/CMakeFiles/simd-test.dir/depend.make will not include src/common.h and when you change common.h without changing code.cpp, none of the generated files are recompiled! This results in linking with stale object files (which include the old version of common.h) and programs that could crash or be incorrect in subtle ways.

Workaround

Add the local directory of code.cpp to the target include dir with a command like this:

target_include_directories(simd-test PRIVATE ${PROJECT_SOURCE_DIR}/src)

(You may need to update the path for your project, or add several of these lines if you pass files from multiple directories to simdpp_multiarch().)

There may be a way to update simdpp_multiarch() to handle this automatically but a simple solution eludes me at the moment.

aarch64 splat for float32x4 looks really inefficient

float32x4 foo(float a)
{
    return splat(a);
}

The generated code looks really terrible, at least in comparison to SSE. I guess ARMv7 can't do any better than that? ARMv8, however, can do better using vdupq_laneq_f32(). The NEON implementation of i_splat4() for float32x4 should probably have a SIMDPP_64_BITS variant.

Resurrect deleted old API where possible

Some old API has been deleted for the convenience of libsimdpp development. In many cases it's worth reconsidering the deletion and at least putting the old API within a SIMDPP_ENABLE_DEPRECATED or similar ifdef block.

avx compile errors

I wrote a simple test to familiarize myself with the library.

//#define SIMDPP_ARCH_X86_SSE4_1
#define SIMDPP_ARCH_X86_AVX
#include <simdpp/simd.h>

using namespace simdpp;

int main(int argc, char *argv[]) {

  float32<8> test = make_float(1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f);
  float32<8> test2 = make_float(2.0f);
  float32<8> sum = add(test, test2);

  const float *lp = reinterpret_cast<const float *>(&sum);

  return lp[0] > lp[1];
}

This compiles with the SSE flag, but fails to compile with AVX. The following error is from clang 3.4


libsimdpp/simdpp/types/float32x8.h:82:26: error: implicit instantiation of undefined template 'simdpp::arch_sse2_sse3_ssse3_sse4p1_avx::uint32<8, void>'
    float32<8>(uint32<8> d)              { *this = bit_cast<float32<8>>(d); }

libsimdpp/simdpp/types/fwd.h:89:44: note: template is declared here
template<unsigned N, class E = void> class uint32;

Documentation and tutorial

Hi,
I'd like to try libsimdpp, but I don't know whether it needs to be installed or how to interface it with a piece of code. I tried to find a tutorial, but it seems that one is currently missing.

Tests fail

On FreeBSD 11.1 I am getting these errors:

===>  Testing for libsimdpp-2.0
[0/1] cd /usr/ports/devel/libsimdpp/work/libsimdpp-2.0 && /usr/local/bin/ctest --force-new-ctest-process
Test project /usr/ports/devel/libsimdpp/work/libsimdpp-2.0
    Start 1: s_test1
Could not find executable test1
Looked in the following places:
test1
test1
Release/test1
Release/test1
Debug/test1
Debug/test1
MinSizeRel/test1
MinSizeRel/test1
RelWithDebInfo/test1
RelWithDebInfo/test1
Deployment/test1
Deployment/test1
Development/test1
Development/test1
Unable to find executable: test1
1/9 Test #1: s_test1 ..........................***Not Run   0.00 sec
    Start 2: s_test_dispatcher1
Could not find executable test_dispatcher
Looked in the following places:
test_dispatcher
test_dispatcher
Release/test_dispatcher
Release/test_dispatcher
Debug/test_dispatcher
Debug/test_dispatcher
MinSizeRel/test_dispatcher
MinSizeRel/test_dispatcher
RelWithDebInfo/test_dispatcher
RelWithDebInfo/test_dispatcher
Deployment/test_dispatcher
Deployment/test_dispatcher
Development/test_dispatcher
Development/test_dispatcher
Unable to find executable: test_dispatcher
2/9 Test #2: s_test_dispatcher1 ...............***Not Run   0.00 sec
    Start 3: s_test_dispatcher2
Could not find executable test_dispatcher
Looked in the following places:
test_dispatcher
test_dispatcher
Release/test_dispatcher
Release/test_dispatcher
Debug/test_dispatcher
Debug/test_dispatcher
MinSizeRel/test_dispatcher
MinSizeRel/test_dispatcher
RelWithDebInfo/test_dispatcher
RelWithDebInfo/test_dispatcher
Deployment/test_dispatcher
Deployment/test_dispatcher
Development/test_dispatcher
Development/test_dispatcher
Unable to find executable: test_dispatcher
3/9 Test #3: s_test_dispatcher2 ...............***Not Run   0.00 sec
    Start 4: s_test_dispatcher3
Could not find executable test_dispatcher
Looked in the following places:
test_dispatcher
test_dispatcher
Release/test_dispatcher
Release/test_dispatcher
Debug/test_dispatcher
Debug/test_dispatcher
MinSizeRel/test_dispatcher
MinSizeRel/test_dispatcher
RelWithDebInfo/test_dispatcher
RelWithDebInfo/test_dispatcher
Deployment/test_dispatcher
Deployment/test_dispatcher
Development/test_dispatcher
Development/test_dispatcher
Unable to find executable: test_dispatcher
4/9 Test #4: s_test_dispatcher3 ...............***Not Run   0.00 sec
    Start 5: s_test_dispatcher4
Could not find executable test_dispatcher
Looked in the following places:
test_dispatcher
test_dispatcher
Release/test_dispatcher
Release/test_dispatcher
Debug/test_dispatcher
Debug/test_dispatcher
MinSizeRel/test_dispatcher
MinSizeRel/test_dispatcher
RelWithDebInfo/test_dispatcher
RelWithDebInfo/test_dispatcher
Deployment/test_dispatcher
Deployment/test_dispatcher
Development/test_dispatcher
Development/test_dispatcher
Unable to find executable: test_dispatcher
5/9 Test #5: s_test_dispatcher4 ...............***Not Run   0.00 sec
    Start 6: s_test_dispatcher5
Could not find executable test_dispatcher
Looked in the following places:
test_dispatcher
test_dispatcher
Release/test_dispatcher
Release/test_dispatcher
Debug/test_dispatcher
Debug/test_dispatcher
MinSizeRel/test_dispatcher
MinSizeRel/test_dispatcher
RelWithDebInfo/test_dispatcher
RelWithDebInfo/test_dispatcher
Deployment/test_dispatcher
Deployment/test_dispatcher
Development/test_dispatcher
Development/test_dispatcher
Unable to find executable: test_dispatcher
6/9 Test #6: s_test_dispatcher5 ...............***Not Run   0.00 sec
    Start 7: s_test_dispatcher6
Could not find executable test_dispatcher
Looked in the following places:
test_dispatcher
test_dispatcher
Release/test_dispatcher
Release/test_dispatcher
Debug/test_dispatcher
Debug/test_dispatcher
MinSizeRel/test_dispatcher
MinSizeRel/test_dispatcher
RelWithDebInfo/test_dispatcher
RelWithDebInfo/test_dispatcher
Deployment/test_dispatcher
Deployment/test_dispatcher
Development/test_dispatcher
Development/test_dispatcher
Unable to find executable: test_dispatcher
7/9 Test #7: s_test_dispatcher6 ...............***Not Run   0.00 sec
    Start 8: s_test_dispatcher7
Could not find executable test_dispatcher
Looked in the following places:
test_dispatcher
test_dispatcher
Release/test_dispatcher
Release/test_dispatcher
Debug/test_dispatcher
Debug/test_dispatcher
MinSizeRel/test_dispatcher
MinSizeRel/test_dispatcher
RelWithDebInfo/test_dispatcher
RelWithDebInfo/test_dispatcher
Deployment/test_dispatcher
Deployment/test_dispatcher
Development/test_dispatcher
Development/test_dispatcher
Unable to find executable: test_dispatcher
8/9 Test #8: s_test_dispatcher7 ...............***Not Run   0.00 sec
    Start 9: s_test_expr1
Could not find executable test_expr
Looked in the following places:
test_expr
test_expr
Release/test_expr
Release/test_expr
Debug/test_expr
Debug/test_expr
MinSizeRel/test_expr
MinSizeRel/test_expr
RelWithDebInfo/test_expr
RelWithDebInfo/test_expr
Deployment/test_expr
Deployment/test_expr
Development/test_expr
Development/test_expr
Unable to find executable: test_expr
9/9 Test #9: s_test_expr1 .....................***Not Run   0.00 sec

0% tests passed, 9 tests failed out of 9

Total Test time (real) =   0.02 sec

The following tests FAILED:
	  1 - s_test1 (Not Run)
	  2 - s_test_dispatcher1 (Not Run)
	  3 - s_test_dispatcher2 (Not Run)
	  4 - s_test_dispatcher3 (Not Run)
	  5 - s_test_dispatcher4 (Not Run)
	  6 - s_test_dispatcher5 (Not Run)
	  7 - s_test_dispatcher6 (Not Run)
	  8 - s_test_dispatcher7 (Not Run)
	  9 - s_test_expr1 (Not Run)
Errors while running CTest

NEON 128-bit test_bits_any() more efficient using uint64x2?

would it be better to use uint64x2 as the 128-bit optimized implementation of i_test_bits_any() for NEON?

this generates fewer instructions:

SIMDPP_INL bool i_test_bits_any(const uint64<2>& a)
{
    uint64x2 r = bit_or(a, move2_l<1>(a));
    return extract<0>(r) != 0;
}

as compared to this:

SIMDPP_INL bool i_test_bits_any(const uint32<4>& a)
{
    uint32x4 r = bit_or(a, move4_l<2>(a));
    r = bit_or(r, move4_l<1>(r));
    return extract<0>(r) != 0;
}
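
The two variants compute the same predicate; the 64-bit one simply needs one OR step instead of two. Below is a scalar model of both reductions, with hypothetical names any_bits_u64x2 and any_bits_u32x4 and no NEON or libsimdpp code involved.

```cpp
#include <array>
#include <cstdint>

// Two 64-bit lanes: a single OR folds the whole vector into one lane.
inline bool any_bits_u64x2(const std::array<std::uint64_t, 2>& a) {
    return (a[0] | a[1]) != 0;
}

// Four 32-bit lanes: two OR steps are needed before one lane covers
// every bit of the vector.
inline bool any_bits_u32x4(const std::array<std::uint32_t, 4>& a) {
    const std::uint32_t r0 = a[0] | a[2];  // first step: lanes i and i+2
    const std::uint32_t r1 = a[1] | a[3];
    return (r0 | r1) != 0;                 // second step: lanes 0 and 1
}
```

Whether the shorter reduction is actually faster on a given NEON core would still need measuring.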

test_zero() and test_ones() for longer than 128bit vectors

Hi Povilas,

sorry for my recent splash of messages :o) I started to work more intensely on vectorizing some scalar code.

When I have a mask vector, I sometimes need to know whether all values in the vector are true or false. I have to use a bit_cast to convert the mask to uint vector like this:

mask_int32x8 mask = ...
bool all_true = simdpp::sse::test_ones(simdpp::bit_cast<uint32x8>(mask));

which is not a problem, but the test_zero() and test_ones() in simdpp/sse/compare.h are implemented only for 128bit vectors. I don't know if it would be ok to add the support for 256bit vectors to the same header file, since such long vectors are supported by avx, not sse.

namespace simdpp {
namespace SIMDPP_ARCH_NAMESPACE {
namespace sse {

template<class = void> SIMDPP_INL
bool test_ones(const uint32x8& a)
{
    uint32x4 v1, v2;
    simdpp::split(a, v1, v2);
    return
        test_ones(uint8x16(v1)) &&
        test_ones(uint8x16(v2));
}

template<class = void> SIMDPP_INL
bool test_zero(const uint32x8& a)
{
    uint32x4 v1, v2;
    simdpp::split(a, v1, v2);
    return
        test_zero(uint8x16(v1)) &&
        test_zero(uint8x16(v2));
}

// variants for uint16x16, uint8x32, uint64x8 should follow

}}}

Cheers,
Michal

Linux ICC compilation flag

Hi

I am new to this library, so maybe I am wrong. But I noticed that in here, the flag for ICC is -mavx512f.

Shouldn't it be -xCOMMON-AVX512 (or -xMIC-AVX512 for Xeon Phi x200 and -xCORE-AVX512 for other Xeon Phi) according to the Intel specification?

On my KNL, the ICC can compile the code with warnings that

icpc: command line warning #10159: invalid argument for option '-m'

Will the program still run SIMD instructions correctly?

Thanks

Qi

libsimdpp shouldn't memorize the architecture of the machine where it is configured

It should detect which SIMD instructions are available when the project is built. The libsimdpp package can be created on a machine with a narrow SIMD set and then used on a machine with a wide SIMD set, and vice versa. One shouldn't affect the other.

libsimdpp should detect SIMD availability purely at runtime, when it is used. You shouldn't even have a 'configure' step.

https://stackoverflow.com/questions/28939652/how-to-detect-sse-avx-avx2-avx-512-availability-at-compile-time
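
For reference, a minimal sketch of a pure runtime query, assuming GCC or Clang on x86, where the `__builtin_cpu_supports` builtin is available. The `detected_features` helper name is hypothetical, and this is not how libsimdpp itself performs detection; on other compilers or architectures the function simply returns an empty string.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: lists a few x86 SIMD features that the CPU running
// this code reports at run time. Returns "" on unsupported toolchains.
inline std::string detected_features()
{
    std::string out;
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  // initializes the feature data; safe to call repeatedly
    if (__builtin_cpu_supports("sse2"))    out += " sse2";
    if (__builtin_cpu_supports("sse4.1"))  out += " sse4.1";
    if (__builtin_cpu_supports("avx"))     out += " avx";
    if (__builtin_cpu_supports("avx2"))    out += " avx2";
    if (__builtin_cpu_supports("avx512f")) out += " avx512f";
#endif
    return out;
}
```

Because the answer depends only on the machine executing the binary, nothing about the build machine is recorded.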

Dispatcher macros don't work with template functions

I'm trying to figure out how to generate the macros when using the dispatcher.
First, one needs to add -DEMIT_DISPATCHER and the list of supported platforms, which is not really great for cross-platform automated builds.
Then, my functions are templated factory builders, and the dispatcher macros don't work with them.

I was wondering whether the get compilable macro with CMake could generate a comma-separated list (easy to do with CMake) that could then be used with Boost preprocessor macros to generate the proper code in a template-acceptable way.
Any thoughts on this? Without these two features, the SIMD filters I'm trying to build for my library are just unusable :/

failed to load a vector of double.

I tried to load a vector of double into float64<2,void>, but it crashed with a segmentation fault.

OS: Windows 10 64bit
platform: Visual Studio 2015 with 32 bits debug mode

#define  SIMDPP_ARCH_X86_AVX2

//float vec[2]; // this works
double vec[2]; // this results in seg. error.

vec[0] = 0;
vec[1] = 1;

float64<2, void> vec64_4 = load(&vec[0]);
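
One possible cause, stated as an assumption rather than a confirmed diagnosis: load() expects an address aligned to the vector size, and a plain double[2] on the stack of a 32-bit build may only be 8-byte aligned. The sketch below uses standard C++11 `alignas` to request 16-byte alignment; `is_aligned`, `AlignedPair` and `make_pair01` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Two doubles that are required to start on a 16-byte boundary.
struct AlignedPair {
    alignas(16) double v[2];
};

inline AlignedPair make_pair01()
{
    AlignedPair p;
    p.v[0] = 0;
    p.v[1] = 1;
    assert(is_aligned(p.v, 16));  // safe to hand to an aligned 128-bit load
    return p;
}
```

If the buffer cannot be aligned, the library's load_u() is the variant intended for unaligned addresses.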

comparison functions

Hi Povilas,

I have a few issues with comparison functions:

int32x4 i = make_int(13);
int32x4 j = make_int(10);
mask_int32x4 m;
m = cmp_le(i, j); // compilation FAILS
m = cmp_le(10, i); // compilation FAILS
m = cmp_le(i, 10); // compilation FAILS

float32x4 i = make_float(10);
float32x4 j = make_float(13);
mask_float32x4 m;
m = cmp_le(i, j); // OK
m = cmp_le(10, j); // OK
m = cmp_le(j, 10); // OK
m = cmp_le(make_float(13.0f), make_float(10.0f)); // compilation FAILS

In simdpp/core/cmp_le.h it seems that cmp_le takes only float arguments. What, then, is the preferred way to compare integers? This issue is probably present in all comparison functions.

Thanks for any hints,
Michal

P.S.: the division operator is missing in simdpp/simd.h. I just added

#include <simdpp/operators/f_div.h>

in my local copy to be able to use it.

constant parameters to expressions

This syntax doesn't seem to be supported now, but I understand the API is in a transition period.

float32<4> a = make_float(1.0f);  
a = add(a, 2.0f);

Licensing

How is this library licensed? According to README.md it's BSD but the COPYING file says GPL3.

ICC generates lots of unnecessary casts

When compiling with ICC 16.0, some bad overloads are selected for assignment (not sure if compiler bug?). This test code:

#define SIMDPP_ARCH_X86_SSE2
#include <simdpp/simd.h>
#include <cstdio>

int main()
{
	using namespace simdpp;

	uint32x4 v1 = make_ones<uint32x4>();
	v1 = v1 << 3;

	std::printf("%u\n", reduce_add(v1));

	return 0;
}

compiles to

--- C:\Users\Mak\Documents\bitpacker\lib\libsimdpp\simdpp\detail\insn\i_shift_l.h 
00FE1451  pcmpeqd     xmm1,xmm1  
00FE1455  pslld       xmm1,3  
--- C:\Users\Mak\Documents\bitpacker\test\simdpp_test1.cpp ---------------------
00FE145A  or          dword ptr [esp+80h],8000h  
--- C:\Users\Mak\Documents\bitpacker\lib\libsimdpp\simdpp\types\int32x4.h ------
00FE1465  movdqa      xmmword ptr [esp+90h],xmm1  
--- C:\Users\Mak\Documents\bitpacker\test\simdpp_test1.cpp ---------------------
00FE146E  ldmxcsr     dword ptr [esp+80h]  
--- C:\Users\Mak\Documents\bitpacker\lib\libsimdpp\simdpp\detail\cast.inl ------
00FE1476  movaps      xmm2,xmmword ptr [esp+90h]  
00FE147E  movaps      xmmword ptr [esp+0A0h],xmm2  
--- C:\Users\Mak\Documents\bitpacker\lib\libsimdpp\simdpp\types\int32x4.h ------
00FE1486  movdqa      xmm0,xmmword ptr [esp+0A0h]  
--- C:\Users\Mak\Documents\bitpacker\lib\libsimdpp\simdpp\detail\insn\i_reduce_add.h 
00FE148F  movdqa      xmmword ptr [esp+90h],xmm0  
--- C:\Users\Mak\Documents\bitpacker\lib\libsimdpp\simdpp\types\int8x16.h ------
00FE1498  movdqa      xmmword ptr [esp+0B0h],xmm0  
--- C:\Users\Mak\Documents\bitpacker\lib\libsimdpp\simdpp\detail\cast.inl ------
00FE14A1  movaps      xmm1,xmmword ptr [esp+0B0h]  
00FE14A9  movaps      xmmword ptr [esp+80h],xmm1  
--- C:\Users\Mak\Documents\bitpacker\lib\libsimdpp\simdpp\detail\insn\move_l.h -
00FE14B1  movdqa      xmm0,xmmword ptr [esp+80h]  
00FE14BA  psrldq      xmm0,8  
--- C:\Users\Mak\Documents\bitpacker\lib\libsimdpp\simdpp\types\int32x4.h ------
00FE14BF  movdqa      xmmword ptr [esp+0A0h],xmm0  
--- C:\Users\Mak\Documents\bitpacker\lib\libsimdpp\simdpp\detail\cast.inl ------
00FE14C8  movaps      xmm1,xmmword ptr [esp+0A0h]  
00FE14D0  movaps      xmmword ptr [esp+0B0h],xmm1  
--- C:\Users\Mak\Documents\bitpacker\lib\libsimdpp\simdpp\types\empty_expr.h ---
00FE14D8  movaps      xmm0,xmmword ptr [esp+0B0h]  
--- C:\Users\Mak\Documents\bitpacker\lib\libsimdpp\simdpp\detail\insn\i_reduce_add.h 
00FE14E0  movdqa      xmm1,xmmword ptr [esp+90h]  
--- C:\Users\Mak\Documents\bitpacker\lib\libsimdpp\simdpp\detail\expr\i_add.h --
00FE14E9  paddd       xmm1,xmm0  
--- C:\Users\Mak\Documents\bitpacker\lib\libsimdpp\simdpp\types\int32x4.h ------
00FE14ED  movdqa      xmmword ptr [esp+80h],xmm1  
--- C:\Users\Mak\Documents\bitpacker\lib\libsimdpp\simdpp\detail\cast.inl ------
00FE14F6  movaps      xmm2,xmmword ptr [esp+80h]  
00FE14FE  movaps      xmmword ptr [esp+0A0h],xmm2  
--- C:\Users\Mak\Documents\bitpacker\lib\libsimdpp\simdpp\types\int32x4.h ------
00FE1506  movdqa      xmm0,xmmword ptr [esp+0A0h]  
--- C:\Users\Mak\Documents\bitpacker\lib\libsimdpp\simdpp\detail\insn\i_reduce_add.h 
00FE150F  movdqa      xmmword ptr [esp+90h],xmm0  
--- C:\Users\Mak\Documents\bitpacker\lib\libsimdpp\simdpp\types\int8x16.h ------
00FE1518  movdqa      xmmword ptr [esp+0B0h],xmm0  
--- C:\Users\Mak\Documents\bitpacker\lib\libsimdpp\simdpp\detail\cast.inl ------
00FE1521  movaps      xmm1,xmmword ptr [esp+0B0h]  
00FE1529  movaps      xmmword ptr [esp+80h],xmm1  
--- C:\Users\Mak\Documents\bitpacker\lib\libsimdpp\simdpp\detail\insn\move_l.h -
00FE1531  movdqa      xmm0,xmmword ptr [esp+80h]  
00FE153A  psrldq      xmm0,4  
--- C:\Users\Mak\Documents\bitpacker\lib\libsimdpp\simdpp\types\int32x4.h ------
00FE153F  movdqa      xmmword ptr [esp+0B0h],xmm0  
--- C:\Users\Mak\Documents\bitpacker\lib\libsimdpp\simdpp\detail\cast.inl ------
00FE1548  movaps      xmm1,xmmword ptr [esp+0B0h]  
00FE1550  movaps      xmmword ptr [esp+0A0h],xmm1  
--- C:\Users\Mak\Documents\bitpacker\lib\libsimdpp\simdpp\types\empty_expr.h ---
00FE1558  movaps      xmm0,xmmword ptr [esp+0A0h]  
--- C:\Users\Mak\Documents\bitpacker\lib\libsimdpp\simdpp\detail\insn\i_reduce_add.h 
00FE1560  movdqa      xmm1,xmmword ptr [esp+90h]  
--- C:\Users\Mak\Documents\bitpacker\lib\libsimdpp\simdpp\detail\expr\i_add.h --
00FE1569  paddd       xmm1,xmm0  
--- C:\Users\Mak\Documents\bitpacker\lib\libsimdpp\simdpp\types\int32x4.h ------
00FE156D  movdqa      xmmword ptr [esp+80h],xmm1  
--- C:\Users\Mak\Documents\bitpacker\lib\libsimdpp\simdpp\detail\cast.inl ------
00FE1576  movaps      xmm2,xmmword ptr [esp+80h]  
00FE157E  movaps      xmmword ptr [esp+0B0h],xmm2  
--- C:\Users\Mak\Documents\bitpacker\lib\libsimdpp\simdpp\types\int32x4.h ------
00FE1586  movdqa      xmm0,xmmword ptr [esp+0B0h]  
--- C:\Users\Mak\Documents\bitpacker\lib\libsimdpp\simdpp\detail\insn\i_reduce_add.h 
00FE158F  movdqa      xmmword ptr [esp+90h],xmm0

Feature Request: foreach on vectors

Hi!

I'm not very experienced with C++ and especially not with this library, but I've found that some of my core uses of this library require patterns like:

uint64_t count = _mm_popcnt_u64(extract<0>(x));
#if UINT64_VECTOR_SIZE >= 2
count += _mm_popcnt_u64(extract<1>(x));
#if UINT64_VECTOR_SIZE >= 4
count += _mm_popcnt_u64(extract<2>(x));
count += _mm_popcnt_u64(extract<3>(x));
#if UINT64_VECTOR_SIZE >= 8
count += _mm_popcnt_u64(extract<4>(x));
count += _mm_popcnt_u64(extract<5>(x));
count += _mm_popcnt_u64(extract<6>(x));
count += _mm_popcnt_u64(extract<7>(x));
#if UINT64_VECTOR_SIZE > 8
#error "we do not support vectors longer than 8, please file an issue"
#endif
#endif // UINT64_VECTOR_SIZE >= 8
#endif // UINT64_VECTOR_SIZE >= 4
#endif // UINT64_VECTOR_SIZE >= 2

It would be awesome if there was some syntax like:

uint64_t count = 0;
x.foreach<64>([&](uint64_t e) {
  count += _mm_popcnt_u64(e);
});

I'm happy to hack this up, but I'd need some guidance/scaffolding about how to approach the problem in the framework of libsimdpp.
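
One way such a helper could be approached is a compile-time recursion over the lane index, so that each step still calls an extract<I>-style accessor with a constant index. The sketch below is self-contained and does not use libsimdpp: `Lanes`, `extract_lane`, `ForEachLane`, `popcount64` and `count_bits` are hypothetical stand-ins, with `extract_lane<I>` playing the role of extract<I>.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a SIMD vector of N 64-bit lanes.
template <std::size_t N>
struct Lanes { std::array<std::uint64_t, N> v; };

// Hypothetical stand-in for extract<I>(x): the index is a template argument.
template <std::size_t I, std::size_t N>
std::uint64_t extract_lane(const Lanes<N>& x) { return x.v[I]; }

// Calls f on lane I, then recurses to lane I + 1.
template <std::size_t I, std::size_t N>
struct ForEachLane {
    template <class F>
    static void run(const Lanes<N>& x, F& f) {
        f(extract_lane<I>(x));
        ForEachLane<I + 1, N>::run(x, f);
    }
};

// Recursion ends once I reaches N.
template <std::size_t N>
struct ForEachLane<N, N> {
    template <class F>
    static void run(const Lanes<N>&, F&) {}
};

// Portable popcount, used here instead of _mm_popcnt_u64.
inline unsigned popcount64(std::uint64_t x) {
    unsigned c = 0;
    while (x) { x &= x - 1; ++c; }  // clears the lowest set bit each pass
    return c;
}

template <std::size_t N>
std::uint64_t count_bits(const Lanes<N>& x) {
    std::uint64_t count = 0;
    auto f = [&count](std::uint64_t e) { count += popcount64(e); };
    ForEachLane<0, N>::run(x, f);
    return count;
}
```

The vector length is taken from the type, so the nested #if ladder is no longer needed.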

SSE3 optimization of float32 reduce_add(), reduce_max(), and others not working

Given that enabling SSE3 implies SSE2 is also enabled, the sequence of #elif sections in the code:

#elif SIMDPP_USE_SSE2
float32x4 sum2 = _mm_movehl_ps(a, a);
float32x4 sum = add(a, sum2);
sum = add(sum, permute2<1,0>(sum));
return _mm_cvtss_f32(sum);
#elif SIMDPP_USE_SSE3
float32x4 b = a;
b = _mm_hadd_ps(b, b);
b = _mm_hadd_ps(b, b);
return _mm_cvtss_f32(b);

causes SSE2 code to be used even when SSE3 is available.
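
The effect can be reproduced without any SIMD code. In this sketch the `DEMO_USE_SSE2` and `DEMO_USE_SSE3` macros are hypothetical stand-ins for SIMDPP_USE_SSE2 and SIMDPP_USE_SSE3 on a target where both are set; only the order of the tests differs between the two functions.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins: SSE3 is enabled, which implies SSE2 as well.
#define DEMO_USE_SSE2 1
#define DEMO_USE_SSE3 1

// Order from the report: the SSE2 test matches first, so the SSE3
// branch can never be selected.
inline std::string pick_sse2_first()
{
#if DEMO_USE_SSE2
    return "sse2";
#elif DEMO_USE_SSE3
    return "sse3";
#else
    return "scalar";
#endif
}

// Most specific instruction set first: SSE3 is chosen when available,
// and SSE2 remains the fallback.
inline std::string pick_sse3_first()
{
#if DEMO_USE_SSE3
    return "sse3";
#elif DEMO_USE_SSE2
    return "sse2";
#else
    return "scalar";
#endif
}
```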

Does not compile without warnings MSVC2013, AVX2

Simply including simd.h causes a bunch of warnings; we compile with warnings as errors, so we can't use this library:

C:\playground\test\libsimdpp\simdpp/detail/expr/scalar.h(48): warning C4244: '=' : conversion from 'const double' to 'simdpp::arch_avx::any_float32<4,simdpp::arch_avx::float32<4,void>>::element

C:\playground\test\libsimdpp\simdpp/detail/insn/shuffle2x2.h(318): warning C4556: value of intrinsic immediate argument '334' is out of range '0 - 255'

And on and on. Any plans to make this library compile without warnings? Some of the overflow values are worrisome.

Repro:

arch:avx2, MSVC2013

#include "stdafx.h"

#define SIMDPP_ARCH_X86_AVX
#include "simdpp/simd.h"

int main(int argc, _TCHAR* argv[])
{
return 0;
}

no extract_bits_any() for uint8x32

This call worked on SSE4.1 but broke after switching to AVX512F. A potential overload:

SIMDPP_INL uint32_t extract_bits_any(const uint8x32& ca)
{
    uint8<32> a = ca;
#if SIMDPP_USE_NULL
    uint32_t r = 0;
    for (unsigned i = 0; i < a.length; i++) {
        uint8_t x = ca.el(i);
        x = x & 1;
        r = (r >> 1) | (uint32_t(x) << 31);
    }
    return r;
#elif SIMDPP_USE_AVX2
    // AVX2 implies SSE2, so it has to be tested first
    return _mm256_movemask_epi8(a);
#elif SIMDPP_USE_SSE2
    uint8<16> A, B;
    split(a, A, B);
    // A is the low half, so its bits belong in the low 16 bits
    return uint32_t(extract_bits_any(A)) | (uint32_t(extract_bits_any(B)) << 16);
#endif
}
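
As a cross-check for how two 16-lane results combine into one 32-bit mask, here is a scalar model. It assumes that element i maps to bit i, which is what the scalar branch of the proposal produces; `bits16` and `bits32` are hypothetical names and not part of the library.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of a 16-lane extraction: the low bit of lane i becomes bit i.
inline std::uint16_t bits16(const std::uint8_t* p)
{
    std::uint16_t r = 0;
    for (unsigned i = 0; i < 16; ++i)
        r = static_cast<std::uint16_t>(r | ((p[i] & 1u) << i));
    return r;
}

// 32 lanes built from two 16-lane halves. The low half keeps bits 0..15
// and the high half is shifted up into bits 16..31.
inline std::uint32_t bits32(const std::uint8_t* p)
{
    std::uint32_t lo = bits16(p);
    std::uint32_t hi = bits16(p + 16);
    return lo | (hi << 16);
}
```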

math ops fail to do int/float conversion or generate an error on SSE2

float32x4 foo(float32x4 a, int32x4 b)
{
    return a+b;
}

results in a single addps instruction for SSE2, as if b were a float32.

float32x4 foo(float32x4 a, int32x4 b)
{
    return add(a,b);
}

results in a compilation error, which is resolved by explicitly converting b with to_float32(). Presumably use of + should fail in the same way as use of add() (or even better would be to automatically convert between float/int, just as scalar operations would).

uninitialized var warning in NEON variant of splat()

simdpp::float32x4 foo1(float a)
{
return simdpp::splat(a);
}

.../submodule/libsimdpp/simdpp/detail/insn/set_splat.h: In function ‘simdpp::arch_neonfltsp::float32x4 foo1(float)’:
.../submodule/libsimdpp/simdpp/detail/insn/set_splat.h:302:43: warning: ‘r.simdpp::arch_neonfltsp::float32<4u>::d_’ is used uninitialized in this function [-Wuninitialized]
typename detail::remove_sign::type r;

aarch64-poky-linux-g++ (GCC) 6.3.0
#define SIMDPP_ARCH_ARM_NEON_FLT_SP

Is this still being developed/maintained?

I was looking into using this, but the lack of activity is unsettling. I'd like to know that if I find bugs they will be resolved, and that the library will keep improving with new instruction sets.

Traits don't follow the convention of the standard library

In the standard library, unary type traits that have a boolean value all derive from std::true_type or std::false_type (and generally all unary type traits derive from a specialization of std::integral_constant) instead of defining their own custom value member. This makes them easy to use for, e.g., tag dispatching, like so:

template <typename T>
void f_impl(std::true_type, T whatever) {
    // implement for integral types
}
template <typename T>
void f_impl(std::false_type, T whatever) {
    // implement for non-integral types
}
template <typename T>
void f(T whatever) { f_impl(std::is_integral<T>(), whatever); }

std::integral_constant also provides a few other common niceties, like an implicit conversion to the constant's type (i.e. any object of type std::true_type implicitly converts to a bool with value true) and a type member that can be useful in other meta-programming contexts.

simdpp::is_vector and simdpp::is_mask don't follow this convention, but they should do so, as good C++11 citizens.
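
A sketch of what a conforming trait could look like. The names `demo_vector`, `demo_scalar`, `is_demo_vector` and `lane_count` are hypothetical and are not the library's actual definitions; the point is only that deriving from std::true_type or std::false_type makes the trait usable for tag dispatch with no extra glue.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-ins for a library vector type and an unrelated type.
template <unsigned N> struct demo_vector {};
struct demo_scalar {};

// Primary template: anything that is not a vector.
template <class T>
struct is_demo_vector : std::false_type {};

// Specialization: every demo_vector<N> is a vector.
template <unsigned N>
struct is_demo_vector<demo_vector<N>> : std::true_type {};

// Tag dispatch on the trait object itself.
template <unsigned N>
unsigned lane_count_impl(std::true_type, const demo_vector<N>&) { return N; }

template <class T>
unsigned lane_count_impl(std::false_type, const T&) { return 1; }

template <class T>
unsigned lane_count(const T& x)
{
    return lane_count_impl(is_demo_vector<T>(), x);
}

static_assert(is_demo_vector<demo_vector<4>>::value, "vectors map to true_type");
static_assert(!is_demo_vector<demo_scalar>::value, "others map to false_type");
```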

Error C2719 in Visual Studio 2012/3

Is Visual Studio supported?

Compiling a simple example, I get hundreds of errors of the form:

1>c:\users\jchown\work\simdpp\simd\int8x16.h(122): error C2719: 'd': formal parameter with __declspec(align('16')) won't be aligned

Feature request: Add support for AVX512BW

As mentioned in the Intel® Xeon® Processor Scalable Family Technical Overview, the Purley platform will come with support for AVX512BW. I believe the CPUs will be released in autumn 2017 (or maybe later).

By looking at
http://p12tic.github.io/libsimdpp/v2.0~rc2/libsimdpp/arch/selection.html
it seems AVX512BW is not supported by libsimdpp right now.

AVX512BW will provide 8-bit and 16-bit integer operations that could speed things up.

As a feature request:
It would be nice if libsimdpp could support AVX512BW.

General question about simdpp vector types

Hi,
From what I understand, in order to use the simdpp optimized functions, you must use the libsimdpp vector types.
So suppose I already have two float arrays: if I want to add them using libsimdpp, I have to create two vectors and copy the arrays into the vectors, is that right?
Meaning that you cannot directly use the functions on STL vectors, for instance?

Thanks

loading integers

Hi Povilas,

This is just a minor issue. The load() function seems to work only with vectors of unsigned ints and floats. Signed ints don't seem to be supported:

int vi[16];
int32x4 i;
i = load(&vi[0]); // OK
i = load<uint32x4>(&vi[0]); // OK
i = load<int32x4>(&vi[0]); // compilation FAILS

i = 10 + load<int32x4>(&vi[0]) * 2; // compilation FAILS - real use case

...the last line shows why I would like to use the explicit template argument for the load function. Although it works with uint32x4, it would be nice to be able to use the same type as for the target "i" variable. But as I said, this is a very low-priority thing :o)

Cheers,
Michal

Incorrect using __cpuidex intrinsic under MSVC

Hello.
In the function simdpp::detail::get_cpuid I found one small error with big effects.

Original code:

...
#elif _MSC_VER
    uint32_t regs[4];
    __cpuidex((int*) regs, subleaf, level);
    *eax = regs[0];
    *ebx = regs[1];
    *ecx = regs[2];
    *edx = regs[3];
#else
...

But if you check MSDN (https://msdn.microsoft.com/ru-ru/library/hskdteyh.aspx), you can see that the subleaf and level arguments are passed in the wrong order. If you swap them, everything works as intended.
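
A minimal sketch of the corrected argument order, assuming that `level` denotes the CPUID leaf and `subleaf` the sub-function. `CpuidRegs` and `query_cpuid` are hypothetical names; the MSVC branch uses `__cpuidex(cpuInfo, function_id, subfunction_id)`, the GCC/Clang branch uses the `__cpuid_count` macro from `<cpuid.h>`, and other targets return zeros.

```cpp
#include <cassert>
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#endif

struct CpuidRegs { std::uint32_t eax, ebx, ecx, edx; };

// Hypothetical wrapper around the compiler-specific CPUID access.
inline CpuidRegs query_cpuid(std::uint32_t leaf, std::uint32_t subleaf)
{
    CpuidRegs r = {0, 0, 0, 0};
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int regs[4];
    // Leaf (function id) comes first, subleaf (sub-function id) second.
    __cpuidex(regs, static_cast<int>(leaf), static_cast<int>(subleaf));
    r.eax = static_cast<std::uint32_t>(regs[0]);
    r.ebx = static_cast<std::uint32_t>(regs[1]);
    r.ecx = static_cast<std::uint32_t>(regs[2]);
    r.edx = static_cast<std::uint32_t>(regs[3]);
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__x86_64__) || defined(__i386__))
    __cpuid_count(leaf, subleaf, r.eax, r.ebx, r.ecx, r.edx);
#else
    (void)leaf;     // not an x86 target: report nothing
    (void)subleaf;
#endif
    return r;
}
```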

You can use a simple test:

#include <iostream>
#include <string>
#include <simdpp/simd.h>
#include <simdpp/dispatch/get_arch_raw_cpuid.h>

#define SIMDPP_USER_ARCH_INFO ::simdpp::get_arch_raw_cpuid()

namespace SIMDPP_ARCH_NAMESPACE {
std::string archToString(simdpp::Arch arch)
{
    std::string ret = "none";

    if ((arch & simdpp::Arch::X86_SSE2) == simdpp::Arch::X86_SSE2)
        ret += " sse2";
    if ((arch & simdpp::Arch::X86_SSE3) == simdpp::Arch::X86_SSE3)
        ret += " sse3";
    if ((arch & simdpp::Arch::X86_SSSE3) == simdpp::Arch::X86_SSSE3)
        ret += " ssse3";
    if ((arch & simdpp::Arch::X86_SSE4_1) == simdpp::Arch::X86_SSE4_1)
        ret += " sse4.1";
    if ((arch & simdpp::Arch::X86_FMA3) == simdpp::Arch::X86_FMA3)
        ret += " fma3";
    if ((arch & simdpp::Arch::X86_FMA4) == simdpp::Arch::X86_FMA4)
        ret += " fma4";
    if ((arch & simdpp::Arch::X86_XOP) == simdpp::Arch::X86_XOP)
        ret += " xop";
    if ((arch & simdpp::Arch::X86_AVX) == simdpp::Arch::X86_AVX)
        ret += " avx";
    if ((arch & simdpp::Arch::X86_AVX2) == simdpp::Arch::X86_AVX2)
        ret += " avx2";
    if ((arch & simdpp::Arch::X86_AVX512F) == simdpp::Arch::X86_AVX512F)
        ret += " avx512f";

    return ret;
}

void printArch()
{
    std::cout << "cpu arch: " << archToString(SIMDPP_USER_ARCH_INFO).c_str();
    std::cout << std::endl;

    std::cout << "compile arch: " << archToString(simdpp::this_compile_arch()).c_str();
    std::cout << std::endl;
}

} // namespace SIMDPP_ARCH_NAMESPACE

SIMDPP_MAKE_DISPATCHER_VOID0(printArch)

I tested it on Microsoft C++ Build Tools (based on MSVC 2015 SP3) with my i5-4460 CPU.

Output before fix:

cpu arch: none fma3
compile arch: none

Output after fix:

cpu arch: none sse2 sse3 ssse3 sse4.1 fma3 avx avx2
compile arch: none sse2 sse3 ssse3 sse4.1 fma3 avx

Sorry, but I can't create a pull request at this time :(

test_bits_any() gets compilation errors when called with an expression

Tested with Xcode 9.0, SSE4_1 target

bool foo1a(uint32x4 a, uint32x4 b)
{
	return test_bits_any(bit_and(a,b));
}

.../submodule/libsimdpp/simdpp/core/test_bits.h:28:70: No matching constructor for initialization of 'typename detail::get_expr_nosign<uint32<4, expr_bit_and<uint32<4, uint32<4, void> >, uint32<4, uint32<4, void> > > >, typename uint32<4, expr_bit_and<uint32<4, uint32<4, void> >, uint32<4, uint32<4, void> > > >::expr_type>::type' (aka 'uint32<16U / 4, simdpp::arch_sse4p1::expr_bit_and<simdpp::arch_sse4p1::uint32<4, simdpp::arch_sse4p1::uint32<4, void> >, simdpp::arch_sse4p1::uint32<4, simdpp::arch_sse4p1::uint32<4, void> > > >')

bool foo1c(uint32x4 a, uint32x4 b)
{
	return test_bits_any(bit_and(a,b).eval());
}

This compiles successfully. The code uses an unnecessary pand instruction, presumably because of the use of eval():

Inspiration::foo1c(simdpp::arch_sse4p1::uint32<4u, void>, simdpp::arch_sse4p1::uint32<4u, void>):
0000000000000230	pushq	%rbp
0000000000000231	movq	%rsp, %rbp
0000000000000234	pand	%xmm1, %xmm0
0000000000000238	ptest	%xmm0, %xmm0
000000000000023d	setne	%al
0000000000000240	popq	%rbp
0000000000000241	retq

(pand could have been skipped in favor of the AND operation done by ptest)

I noticed this function doesn't appear in the docs. Is it not meant to be a public interface?
