dltcollab / sse2neon Goto Github PK

A translator from Intel SSE intrinsics to Arm/Aarch64 NEON implementation

License: MIT License

Makefile 0.26% C++ 51.47% Shell 0.23% C 48.04%

arm aarch64 neon sse x86 simd biilabs armv7l armv8-a intel-sse-intrinsics arm64 neon-intrinsics sse-intrinsics sse2neon armv8 intel-intrinsics apple-silicon

sse2neon's Introduction

sse2neon

A C/C++ header file that converts Intel SSE intrinsics to Arm/Aarch64 NEON intrinsics.

Introduction

sse2neon is a translator of Intel SSE (Streaming SIMD Extensions) intrinsics to Arm NEON, shortening the time needed to get an Arm working program that then can be used to extract profiles and to identify hot paths in the code. The header file sse2neon.h contains several of the functions provided by Intel intrinsic headers such as <xmmintrin.h>, only implemented with NEON-based counterparts to produce the exact semantics of the intrinsics.

Mapping and Coverage

Header file	Extension
`<mmintrin.h>`	MMX
`<xmmintrin.h>`	SSE
`<emmintrin.h>`	SSE2
`<pmmintrin.h>`	SSE3
`<tmmintrin.h>`	SSSE3
`<smmintrin.h>`	SSE4.1
`<nmmintrin.h>`	SSE4.2
`<wmmintrin.h>`	AES

sse2neon aims to support SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2 and AES extension.

In order to deliver NEON-equivalent intrinsics for all SSE intrinsics used widely, please be aware that some SSE intrinsics exist a direct mapping with a concrete NEON-equivalent intrinsic. Others, unfortunately, lack a 1:1 mapping, meaning that their equivalents are built utilizing a number of NEON intrinsics.

For example, SSE intrinsic _mm_loadu_si128 has a direct NEON mapping (vld1q_s32), but SSE intrinsic _mm_maddubs_epi16 has to be implemented with 13+ NEON instructions.

Floating-point compatibility

Some conversions require several NEON intrinsics, which may produce inconsistent results compared to their SSE counterparts due to differences in the arithmetic rules of IEEE-754.

Taking a possible conversion of _mm_rsqrt_ps as example:

__m128 _mm_rsqrt_ps(__m128 in)
{
    float32x4_t out = vrsqrteq_f32(vreinterpretq_f32_m128(in));

    out = vmulq_f32(
        out, vrsqrtsq_f32(vmulq_f32(vreinterpretq_f32_m128(in), out), out));

    return vreinterpretq_m128_f32(out);
}

The _mm_rsqrt_ps conversion will produce NaN if a source value is 0.0 (first INF for the reciprocal square root of 0.0, then INF * 0.0 using vmulq_f32). In contrast, the SSE counterpart produces INF if a source value is 0.0. As a result, additional treatments should be applied to ensure consistency between the conversion and its SSE counterpart.

Requirement

Developers are advised to utilize sse2neon.h with GCC version 10 or higher, or Clang version 11 or higher. While sse2neon.h might be compatible with earlier versions, certain vector operation errors have been identified in those versions. For further details, refer to the discussion in issue #622.

Usage

Put the file sse2neon.h in to your source code directory.
Locate the following SSE header files included in the code:

#include <xmmintrin.h>
#include <emmintrin.h>

{p,t,s,n,w}mmintrin.h could be replaceable as well.

Replace them with:

#include "sse2neon.h"

Explicitly specify platform-specific options to gcc/clang compilers.
- On ARMv8-A 64-bit targets, you should specify the following compiler option: (Remove crypto and/or crc if your architecture does not support cryptographic and/or CRC32 extensions)
```
-march=armv8-a+fp+simd+crypto+crc
```
- On ARMv8-A 32-bit targets, you should specify the following compiler option:
```
-mfpu=neon-fp-armv8
```
- On ARMv7-A targets, you need to append the following compiler option:
```
-mfpu=neon
```

Compile-time Configurations

Though floating-point operations in NEON use the IEEE single-precision format, NEON does not fully comply to the IEEE standard when inputs or results are denormal or NaN values for minimizing power consumption as well as maximizing performance. Considering the balance between correctness and performance, sse2neon recognizes the following compile-time configurations:

SSE2NEON_PRECISE_MINMAX: Enable precise implementation of _mm_min_{ps,pd} and _mm_max_{ps,pd}. If you need consistent results such as handling with NaN values, enable it.
SSE2NEON_PRECISE_DIV: Enable precise implementation of _mm_rcp_ps and _mm_div_ps by additional Netwon-Raphson iteration for accuracy.
SSE2NEON_PRECISE_SQRT: Enable precise implementation of _mm_sqrt_ps and _mm_rsqrt_ps by additional Netwon-Raphson iteration for accuracy.
SSE2NEON_PRECISE_DP: Enable precise implementation of _mm_dp_pd. When the conditional bit is not set, the corresponding multiplication would not be executed.

The above are turned off by default, and you should define the corresponding macro(s) as 1 before including sse2neon.h if you need the precise implementations.

Run Built-in Test Suite

sse2neon provides a unified interface for developing test cases. These test cases are located in tests directory, and the input data is specified at runtime. Use the following commands to perform test cases:

$ make check

For running check with enabling features, you can use assign the features with FEATURE command. If none is assigned, then the command will be the same as simply calling make check. The following command enable crypto and crc features in the tests.

$ make FEATURE=crypto+crc check

For running check on certain CPU, setting the mode of FPU, etc., you can also assign the desired options with ARCH_CFLAGS command. If none is assigned, the command acts as same as calling make check. For instance, to run tests on Cortex-A53 with enabling ARM VFPv4 extension and NEON:

$ make ARCH_CFLAGS="-mcpu=cortex-a53 -mfpu=neon-vfpv4" check

Running tests on hosts other than ARM platform

For running tests on hosts other than ARM platform, you can specify GNU toolchain for cross compilation with CROSS_COMPILE command. QEMU should be installed in advance.

For ARMv8-A running in 64-bit mode type:

$ make CROSS_COMPILE=aarch64-linux-gnu- check # ARMv8-A

For ARMv7-A type:

$ make CROSS_COMPILE=arm-linux-gnueabihf- check # ARMv7-A

For ARMv8-A running in 32-bit mode (A32 instruction set) type:

$ make \
  CROSS_COMPILE=arm-linux-gnueabihf- \
  ARCH_CFLAGS="-mcpu=cortex-a32 -mfpu=neon-fp-armv8" \
  check

Check the details via Test Suite for SSE2NEON.

Adoptions

Here is a partial list of open source projects that have adopted sse2neon for Arm/Aarch64 support.

Aaru Data Preservation Suite is a fully-featured software package to preserve all storage media from the very old to the cutting edge, as well as to give detailed information about any supported image file (whether from Aaru or not) and to extract the files from those images.
aether-game-utils is a collection of cross platform utilities for quickly creating small game prototypes in C++.
ALE, aka Assembly Likelihood Evaluation, is a tool for evaluating accuracy of assemblies without the need of a reference genome.
AnchorWave, Anchored Wavefront Alignment, identifies collinear regions via conserved anchors (full-length CDS and full-length exon have been implemented currently) and breaks collinear regions into shorter fragments, i.e., anchor and inter-anchor intervals.
ATAK-CIV, Android Tactical Assault Kit for Civilian Use, is the official geospatial-temporal and situational awareness tool used by the US Government.
Apache Doris is a Massively Parallel Processing (MPP) based interactive SQL data warehousing for reporting and analysis.
Apache Impala is a lightning-fast, distributed SQL queries for petabytes of data stored in Apache Hadoop clusters.
Apache Kudu completes Hadoop's storage layer to enable fast analytics on fast data.
apollo is a high performance, flexible architecture which accelerates the development of Autonomous Vehicles.
ares is a cross-platform, open source, multi-system emulator, focusing on accuracy and preservation.
ART is an implementation in OCaml of Adaptive Radix Tree (ART).
Async is a set of c++ primitives that allows efficient and rapid development in C++17 on GNU/Linux systems.
avec is a little library for using SIMD instructions on both x86 and Arm.
BEAGLE is a high-performance library that can perform the core calculations at the heart of most Bayesian and Maximum Likelihood phylogenetics packages.
BitMagic implements compressed bit-vectors and containers (vectors) based on ideas of bit-slicing transform and Rank-Select compression, offering sets of method to architect your applications to use HPC techniques to save memory (thus be able to fit more data in one compute unit) and improve storage and traffic patterns when storing data vectors and models in files or object stores.
bipartite_motif_finder as known as BMF (Bipartite Motif Finder) is an open source tool for finding co-occurences of sequence motifs in genomic sequences.
Blender is the free and open source 3D creation suite, supporting the entirety of the 3D pipeline.
Boo is a cross-platform windowing and event manager similar to SDL or SFML, with additional 3D rendering functionality.
Brickworks is a music DSP toolkit that supplies with the fundamental building blocks for creating and enhancing audio engines on any platform.
CARTA is a new visualization tool designed for viewing radio astronomy images in CASA, FITS, MIRIAD, and HDF5 formats (using the IDIA custom schema for HDF5).
Catcoon is a feedforward neural network implementation in C.
compute-runtime, the Intel Graphics Compute Runtime for oneAPI Level Zero and OpenCL Driver, provides compute API support (Level Zero, OpenCL) for Intel graphics hardware architectures (HD Graphics, Xe).
contour is a modern and actually fast virtual terminal emulator.
Cog is a free and open source audio player for macOS.
dab-cmdline provides entries for the functionality to handle Digital audio broadcasting (DAB)/DAB+ through some simple calls.
DISTRHO is an open-source project for Cross-Platform Audio Plugins.
Dragonfly is a modern in-memory datastore, fully compatible with Redis and Memcached APIs.
EDGE is an advanced OpenGL source port spawned from the DOOM engine, with focus on easy development and expansion for modders and end-users.
Embree is a collection of high-performance ray tracing kernels. Its target users are graphics application engineers who want to improve the performance of their photo-realistic rendering application by leveraging Embree's performance-optimized ray tracing kernels.
emp-tool aims to provide a benchmark for secure computation and allowing other researchers to experiment and extend.
Exudyn is a C++ based Python library for efficient simulation of flexible multibody dynamics systems.
FoundationDB is a distributed database designed to handle large volumes of structured data across clusters of commodity servers.
fsrc is capable of searching large codebases for text snippets.
gmmlib is the Intel Graphics Memory Management Library that provides device specific and buffer management for the Intel Graphics Compute Runtime for OpenCL and the Intel Media Driver for VAAPI.
HISE is a cross-platform open source audio application for building virtual instruments, emphasizing on sampling, but includes some basic synthesis features for making hybrid instruments as well as audio effects.
iqtree2 is an efficient and versatile stochastic implementation to infer phylogenetic trees by maximum likelihood.
indelPost is a Python library for indel processing via realignment and read-based phasing to resolve alignment ambiguities.
IResearch is a cross-platform, high-performance document oriented search engine library written entirely in C++ with the focus on a pluggability of different ranking/similarity models.
Kraken is a 3D animation platform redefining animation composition, collaborative workflows, simulation engines, skeletal rigging systems, and look development from storyboard to final render.
kram is a wrapper to several popular encoders to and from PNG/KTX files with LDR/HDR and BC/ASTC/ETC2.
Krita is a cross-platform application that offers an end-to-end solution for creating digital art files from scratch built on the KDE and Qt frameworks.
libCML is a SLAM library and scientific tool, which include a novel fast thread-safe graph map implementation.
libhdfs3 is implemented based on native Hadoop RPC protocol and Hadoop Distributed File System (HDFS), a highly fault-tolerant distributed fs, data transfer protocol.
libpostal is a C library for parsing/normalizing street addresses around the world using statistical NLP and open data.
libscapi stands for the "Secure Computation API", providing reliable, efficient, and highly flexible cryptographic infrastructure.
libstreamvbyte is a C++ implementation of StreamVByte.
libmatoya is a cross-platform application development library, providing various features such as common cryptography tasks.
Loosejaw provides deep hybrid CPU/GPU digital signal processing.
Madronalib enables efficient audio DSP on SIMD processors with readable and brief C++ code.
minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database.
mixed-fem is an open source reference implementation of Mixed Variational Finite Elements for Implicit Simulation of Deformables.
MMseqs2 (Many-against-Many sequence searching) is a software suite to search and cluster huge protein and nucleotide sequence sets.
MRIcroGL is a cross-platform tool for viewing NIfTI, DICOM, MGH, MHD, NRRD, AFNI format medical images.
N2 is an approximate nearest neighborhoods algorithm library written in C++, providing a much faster search speed than other implementations when modeling large dataset.
nanors is a tiny, performant implementation of Reed-Solomon codes, capable of reaching multi-gigabit speeds on a single core.
niimath is a general image calculator with superior performance.
NVIDIA GameWorks has been already used in a lot of games. These repositories are public on GitHub.
Nx Meta Platform Open Source Components are used to build all Powered-by-Nx products including Nx Witness Video Management System (VMS).
ofxNDI is an openFrameworks addon to allow sending and receiving images over a network using the NewTek Network Device Protocol.
OGRE is a scene-oriented, flexible 3D engine written in C++ designed to make it easier and more intuitive for developers to produce games and demos utilising 3D hardware.
Olive is a free non-linear video editor for Windows, macOS, and Linux.
OpenColorIO a complete color management solution geared towards motion picture production with an emphasis on visual effects and computer animation.
OpenXRay is an improved version of the X-Ray engine, used in world famous S.T.A.L.K.E.R. game series by GSC Game World.
parallel-n64 is an optimized/rewritten Nintendo 64 emulator made specifically for Libretro.
Pathfinder C++ is a fast, practical, GPU-based rasterizer for fonts and vector graphics using Vulkan and C++.
PFFFT does 1D Fast Fourier Transforms, of single precision real and complex vectors.
pixaccess provides the abstractions for integer and float bitmaps, pixels, and aliased (nearest neighbor) and anti-aliased (bi-linearly interpolated) pixel access.
PlutoSDR Firmware is the customized firmware for the PlutoSDR that can be used to introduce fundamentals of Software Defined Radio (SDR) or Radio Frequency (RF) or Communications as advanced topics in electrical engineering in a self or instructor lead setting.
PowerToys is a set of utilities for power users to tune and streamline their Windows experience for greater productivity.
Pygame is cross-platform and designed to make it easy to write multimedia software, such as games, in Python.
R:RandomFieldsUtils provides various utilities might be used in spatial statistics and elsewhere. (CRAN)
RAxML is tool for Phylogenetic Analysis and Post-Analysis of Large Phylogenies.
ReHLDS is fully compatible with latest Half-Life Dedicated Server (HLDS) with a lot of defects and (potential) bugs fixed.
rkcommon represents a common set of C++ infrastructure and CMake utilities used by various components of Intel oneAPI Rendering Toolkit.
RPCS3 is the world's first free and open-source PlayStation 3 emulator/debugger, written in C++.
simd_utils is a header-only library implementing common mathematical functions using SIMD intrinsics.
Sire is a molecular modelling framework that provides extensive functionality to manipulate representations of biomolecular systems.
SMhasher provides comprehensive Hash function quality and speed tests.
SNN++ implements a single layer non linear Spiking Neural Network for images classification and generation.
Spack is a multi-platform package manager that builds and installs multiple versions and configurations of software.
SRA is a collection of tools and libraries for using data in the INSDC Sequence Read Archives.
srsLTE is an open source SDR LTE software suite.
SSW is a fast implementation of the Smith-Waterman algorithm, which uses the SIMD instructions to parallelize the algorithm at the instruction level.
Surge is an open source digital synthesizer.
The Forge is a cross-platform rendering framework, providing building blocks to write your own game engine.
Typesense is a fast, typo-tolerant search engine for building delightful search experiences.
Vcpkg is a C++ Library Manager for Windows, Linux, and macOS.
VelocyPack is a fast and compact format for serialization and storage.
VOLK, Vector-Optimized Library of Kernel, is a sub-project of GNU Radio.
Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.
Winter is the top rated chess engine from Switzerland and has competed at top invite only computer chess events.
XEVE (eXtra-fast Essential Video Encoder) is an open sourced and fast MPEG-5 EVC encoder.
XMRig is an open source CPU miner for Monero cryptocurrency.
xsimd provides a unified means for using SIMD intrinsics and parallelized, optimized mathematical functions.
YACL is a C++ library contains modules and utilities which SecretFlow code depends on.

Related Projects

SIMDe: fast and portable implementations of SIMD intrinsics on hardware which doesn't natively support them, such as calling SSE functions on ARM.
CatBoost's sse2neon
ARM_NEON_2_x86_SSE
AvxToNeon
sse2rvv: C header file that converts Intel SSE intrinsics to RISC-V Vector intrinsic.
sse2msa: A C/C++ header file that converts Intel SSE intrinsics to MIPS/MIPS64 MSA intrinsics.
sse2zig: Intel SSE intrinsics mapped to Zig vector extensions.
POWER/PowerPC support for GCC contains a series of headers simplifying porting x86_64 code that makes explicit use of Intel intrinsics to powerpc64le (pure little-endian mode that has been introduced with the POWER8).
- implementation: xmmintrin.h, emmintrin.h, pmmintrin.h, tmmintrin.h, smmintrin.h

Reference

Licensing

sse2neon is freely redistributable under the MIT License.

sse2neon's People

Contributors

Stargazers

Watchers

Forkers

ajblane yuehchuan m4xw xiaosawagziluqi zzqcn crazy-william mjmacleod alonegiveup liusheng easyaspi314 boogermann ops-l hurmean brandy2 sebpop chapayghub bob-hu kiwik zloop1982 howjmay omar-fouad seanxcwang jishinmaster yes-jumby afcidk wuyouqian96169 realjishi jackey-huo pjx3 tinyretrowarehouse wangyangcs benartes mfabbri77 110allan seannachen michoumichmich nikhilareddyputta chenghlee hayguen kkxwz iabidi tendyang bryanhuang0927 erykoff wangxiao1254 techsmith hasindu2008 remiarnaud wow123yf fermay myremylar liao00 vell001 scotbren joshlvmh jnulzl barabashkad twbabyduck agsaidi xhy3054 dmitry1945 helifu 4nlp devnexen nillerusr fkfobos kamejoko80 atdt jorive rouault zoro0312 lukagithub betajippity lurkrazy l-net-1992 jmather-sesi longweibing nlacey jdpailleux yazdandeli wizadr madmanchan strogo toxieainc jackyfriend plumewind kmkolasinski anhuipl2010 sleepybishop zhangwangyang luzpaz schlather 00mjk efbicief andrewevstyukhin iteamvep wabbrik jeffhammond ifquant baofangyan1

sse2neon's Issues

Implement SSE4.2 text processing intrinsics

Apache Impala adopts SSE2NEON along with partial SSE4.2 text processing intrinsics support in file be/src/util/sse-util.h:

cmpestrm
cmpestri

However, the above were neither optimized nor generic. Instead, we can provide the full-functioned instrinsics in SSE2NEON.
Reference:

2 Tests fail on M1 chip

Two of the tests fail on apple M1 chip. mm_hadds_epi16 and mm_hsubs_epi16. Seems to be fine on CI. I'm happy to provide more debugging information if needed.
make_check.txt

Allow specifying rounding mode

While developing iqtree_arm_neon, @joshlvmh found:

denormalized numbers are flushed to zero
only default NaNs are supported
the Round to Nearest rounding mode is selected
un-trapped exception handling selected for all floating-point exceptions.

_mm_setcsr should be supported to ensure consistent behavior.

Strategy on non-SSE intrinsics

sse2neon aims to support SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2 and AES extension, and AVX intrinsics would be excluded.

@danlark1 pointed out:

Technically speaking, _mm_fmadd_ps is not an SSE extension, this was introduced with fma extension which took place even after AVX.

We do need to think of the strategy on non-SSE intrinsics to ease the platform transition efforts.

MSVC support

MSVC support will be nice.

Also, is it worth the effort, if I make a pull request with some changes that will improve the compatibility with MSVC? (I know almost nothing about ARM, but I can try to fix some macros, etc, maybe small things, if no one mind about the idea)

Correct _mm_min_ps, _mm_max_ps implementation

_mm_min_ps and _mm_max_ps cannot be accurately emulated with single vminq_f32/vmaxq_f32(and also vminnmq_f32 and vmaxnmq_f32) instruction.

We need special handling when both inputs are zeros and either input is NaN.

https://tavianator.com/fast-branchless-raybounding-box-intersections-part-2-nans/
https://www.felixcloutier.com/x86/minps

Here is an implementation of vmin/vmax which emulates _mm_min_ps/_mm_max_ps exactly(as far as I've tested)

lighttransport/embree-aarch64@e4a2f68

Implement _mm_load_sd

_mm_load_ss has beed implemented, what about the _mm_load_sd?

Synopsis

#include <emmintrin.h>
__m128d _mm_load_sd (double const* mem_addr)

Instruction: movsd xmm, m64
CPUID Flags: SSE2
Description

Load a double-precision (64-bit) floating-point element from memory into the lower of dst, and zero the upper element. mem_addr does not need to be aligned on any particular boundary.

Operation

dst[63:0] := MEM[mem_addr+63:mem_addr]
dst[127:64] := 0

Fix _mm_cmpeq_epi64

Quote from Jukka Liimatta

Looking at _mm_cmpeq_epi64 compares 32 bits at a time, which means lo and hi might have different mask. 64 bit compare should always have all bits 0 or 1. If this mask is used for select the results will be interesting.

Source: https://twitter.com/JukkaLiimatta/status/1276539769619714051

Tidy reference hyperlinks

The project has several reference hyperlinks to Microsoft website. Due to Microsoft Product Lifecycle, however, the content of MSDN is no longer updated regularly. We shall migrate to Intel Intrinsics Guide. For example, change the URL from https://msdn.microsoft.com/en-us/library/84szxsww(v=vs.100).aspx to https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_prefetch.

Implement _mm_popcnt_u32/_mm_popcnt_u64

_mm_popcnt_u64 was introduced in Intel SSE 4.2. GCC extension __builtin_popcountll is portable to non-SSE-4.2 CPUs. We can implement the two instructions as following:

/* Calculate a number of bits set to 1. */
FORCE_INLINE int _mm_popcnt_u32 (uint32_t x) { return __builtin_popcount (x); }
FORCE_INLINE uint64_t _mm_popcnt_u64 (uint64_t x) { return __builtin_popcountll (x); }

Alternative implementation:

FORCE_INLINE int _mm_popcnt_u32(uint32_t a)
{
    uint32_t count = 0;
    uint8x8_t input_val, count8x8_val;
    uint16x4_t count16x4_val;
    uint32x2_t count32x2_val;

    input_val = vld1_u8((uint8_t *) &a);
    count8x8_val = vcnt_u8(input_val);
    count16x4_val = vpaddl_u8(count8x8_val);
    count32x2_val = vpaddl_u16(count16x4_val);

    vst1_u32(&count, count32x2_val);
    return count;
}

FORCE_INLINE uint64_t _mm_popcnt_u64(uint64_t x)
{
    uint64_t count = 0;
    uint8x8_t input_val, count8x8_val;
    uint16x4_t count16x4_val;
    uint32x2_t count32x2_val;
    uint64x1_t count64x1_val;

    input_val = vld1_u8((uint8_t *) &x);
    count8x8_val = vcnt_u8(input_val);
    count16x4_val = vpaddl_u8(count8x8_val);
    count32x2_val = vpaddl_u16(count16x4_val);
    count64x1_val = vpaddl_u32(count32x2_val);
    vst1_u64(&count, count64x1_val);
    return count;
}

Missing intrinsics required for building Cycles

Cycles is Blender's physically-based path tracer for production rendering. Recently, Apple Inc. sent Aarch64 patch to Blender developer as shown on D8237: Cycles: support neon instructions for arm64 processors. However, file intern/cycles/util/util_sse_to_neon.h provided by D8237 was incomplete, and Blender developers were discussing the integration of sse2neon.

Missing intrinsics in SSE2NEON required for building Cycles:

D8237 also defines _mm_cmple_epi32 and _mm_cmpge_epi32, which are not part of SSE intrinsics.

Fortunately, the simplified implementations are available in D8237.

A64 tweak and test case for _mm_sign_epi8

Intrinsic _mm_sign_epi8 can be tweaked for A64. Proposed change:

@@ -2229,12 +2229,16 @@ FORCE_INLINE __m128i _mm_sign_epi8(__m128i _a, __m128i _b)
     int8x16_t a = vreinterpretq_s8_m128i(_a);
     int8x16_t b = vreinterpretq_s8_m128i(_b);
 
-    int8x16_t zero = vdupq_n_s8(0);
     // signed shift right: faster than vclt
     // (b < 0) ? 0xFF : 0
     uint8x16_t ltMask = vreinterpretq_u8_s8(vshrq_n_s8(b, 7));
     // (b == 0) ? 0xFF : 0
+#if defined(__aarch64__)
+    int8x16_t zeroMask = vreinterpretq_s8_u8(vceqzq_s8(b));
+#else
+    int8x16_t zero = vdupq_n_s8(0);
     int8x16_t zeroMask = vreinterpretq_s8_u8(vceqq_s8(b, zero));
+#endif
     // -a
     int8x16_t neg = vnegq_s8(a);
     // bitwise select either a or neg based on ltMask

In the meantime, we lack of _mm_sign_epi8 test case.

Implement _mm_minpos_epu16

Is it possible to implement _mm_minpos_epu16 intrinisic ?

It was introduced in SSE4.1. Detailed explanation In the link below.
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_minpos_epu16&expand=3789

Improve _mm_popcnt_*

Quote from Jukka Liimatta

_mm_popcnt_* uses store, when vget_lane_* would probably be a better fit.. the compiler will optimize the store into lane extract more likely but now it can go either way. The 32 bit load reads 64 bits from 32 bit variable.. that should be fixed. vcreate_u8 would be safer anyway.
the vrev64q_u32 handled the lo/hi case. The load/store in _mm_popcnt might warrant a second look.

Missing _mm_movehdup_ps

Missing A32 implementations for _mm_store_pd, _mm_storeu_pd, _mm_cvtpd_ps, _mm_cvtps_pd, _mm_castpd_si128

The following intrinsics have Aarch64 implementations without A32:

_mm_store_pd
_mm_storeu_pd
_mm_cvtpd_ps
_mm_cvtps_pd
_mm_castpd_si128

We expect to provide corresponding A32 implementations as well.

Refactor max/min intrinsics with IEEE-754 compatible intrinsics

ARM provides IEEE-754 compatible intrinsics (e.g. vmaxnmq_f64). These intrinsics obey IEEE-754 on handling NaN special cases. Should we use them to replace the current implementation?

Missing _mm_movemask_pd

_mm_movemask_pd would set each bit of mask dst based on the most significant bit of the corresponding packed double-precision (64-bit) floating-point element in a.

Reference A64 implementation:

    static const int64x2_t shift = {0, 1};
    uint64x2_t tmp = vshrq_n_u64(a, 63);
    return vaddvq_u64(vshlq_u64(tmp, shift));

unify _mm_hsub_* functions

The implementation for _mm_hsub_* functions varies. Maybe we should unify to the faster one

Add performance measurement of each intrinsic function

The conversion of intrinsic function may be rewritten sometimes.
A performance measurement result is good for checking whether the rewritten conversion behaves better or not.

Introduce macro 'likely' and 'unlikely' for branch prediction instructions

As mentioned in #130 (comment)
We can introduce macro likely and unlikely for adding instruction that make branch prediction prefer for the if condition with likely macro.

reference

https://stackoverflow.com/questions/109710/how-do-the-likely-unlikely-macros-in-the-linux-kernel-work-and-what-is-their-ben

Auto-generated integrated test suite

Once #69 is fully integrated, we can think of the generation of test suite. The rough flow:

Developers provide a comprehensive list of intrinsics;
Scripts generate the stub/skeleton for test cases;
Developers provide the real test entries such as test_mm_set1_ps and parameter list (e.g. test_mm_set1_ps(mTestFloats[i]));
Automated test system can scan and analyze the status of each test item;

Unify validation functions in test suite

The parameter order of the validation functions is not unified.
Some of them are using the reverse order.

Besides, the function naming should be consistent as well.

Missing _mm_movpi64_epi64 and _mm_movepi64_pi64

Both are part of SSE2.

_mm_movpi64_epi64 would copy the 64-bit integer a to the lower element of dst, and zero the upper element.

dst[63:0] := a[63:0]
dst[127:64] := 0

Reference NEON implementation:

return vcombine_s64(a, vdup_n_s64(0));

_mm_movepi64_pi64 would copy the lower 64-bit integer in a to dst.

dst[63:0] := a[63:0]

Reference NEON implementation:

return vget_low_s64(a);

CI:

Missing _mm_mulhi_pu16

_mm_mulhi_pu16 would multiply the packed unsigned 16-bit integers in a and b, producing intermediate 32-bit integers, and store the high 16 bits of the intermediate integers in dst.

Reference NEON imeplementation:

__m64 _mm_mulhi_pu16 (__m64 a, __m64 b)
{
     return vmovn_u32(vshrq_n_u32(vmull_u16(a, b), 16));
}

Missing _mm_{load,loadu,store,storeu}_pd

The suffix pd means the vectors contains doubles (pd stands for packed double-precision). The following intrinsics are missing:

_mm_load_pd
_mm_loadu_pd
_mm_store_pd
_mm_store_pd

For A64, the _mm_load_pd and _mm_storeu_pd can be implemented as following:

// Load 128-bits (composed of 2 packed double-precision (64-bit) floating-
// point elements) from memory into dst.
// https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_load_pd                                                                             
FORCE_INLINE __m128d _mm_load_pd(const double *p)
{
#if defined(__aarch64__)
    return (__m128d)(vld1q_f64(p));
#else
    /* A32 only */
#endif
}

// Stores four single-precision, floating-point values.
// https://msdn.microsoft.com/en-us/library/44e30x22(v=vs.100).aspx
FORCE_INLINE void _mm_storeu_pd(double *p, __m128d a)
{
#if defined(__aarch64__)
    vst1q_f64(p, (__m128d)(a));
#else
    /* A32 only */
#endif
}

Fix the unit test of lower single-precision floating-point comparison command

There are four instructions which cause the unit test fail.

_mm_comilt_ss
_mm_comile_ss
_mm_comieq_ss
_mm_comineq_ss

Base on the comment, I doubt that the failure happens when one of the operand is NaN.
Comment:

// https://msdn.microsoft.com/en-us/library/2kwe606b(v=vs.90).aspx Important
// note!! The documentation on MSDN is incorrect!  If either of the values is a
// NAN the docs say you will get a one, but in fact, it will return a zero!!

Comment:

// **NOTE** The documentation on MSDN is in error!  The actual
// hardware returns a 0, not a 1 if either of the values is a
// NAN!

However, I do not find the document other than MSDN which describes the returned value of the command when one of the operand is NaN.

_MM_TRANSPOSE4_PS

I'm trying to compile obs-studio on AARCH64 and have used sse2neon to handle all the MMX, SSE + SSE2 stuff

xmmintrin.h has a macro for _MM_TRANSPOSE4_PS right at the end of the file

I'm wondering if it's a recent addition to xmmintrin.h as it's not in sse2neon and is defined in a C source file in obs-studio (in libobs/obs-audio-controls.c)

Removing the call causes obs-studio to compile without error and as the application only uses it in one function this is all that appears to stand between me + obs-studio for ARM

Any help would be most appreciated

Here;s the code for reference...

/* Transpose the 4x4 matrix composed of row[0-3].  */
#define _MM_TRANSPOSE4_PS(row0, row1, row2, row3)           \
do {                                    \
  __v4sf __r0 = (row0), __r1 = (row1), __r2 = (row2), __r3 = (row3);    \
  __v4sf __t0 = __builtin_ia32_unpcklps (__r0, __r1);           \
  __v4sf __t1 = __builtin_ia32_unpcklps (__r2, __r3);           \
  __v4sf __t2 = __builtin_ia32_unpckhps (__r0, __r1);           \
  __v4sf __t3 = __builtin_ia32_unpckhps (__r2, __r3);           \
  (row0) = __builtin_ia32_movlhps (__t0, __t1);             \
  (row1) = __builtin_ia32_movhlps (__t1, __t0);             \
  (row2) = __builtin_ia32_movlhps (__t2, __t3);             \
  (row3) = __builtin_ia32_movhlps (__t3, __t2);             \
} while (0)

Missing _MM_GET_ROUNDING_MODE _MM_ROUND_TOWARD_ZERO

Missing _mm_cvtss_sd, _mm_shuffle_pd, _mm_cmplt_pd, _mm_unpacklo_pd

_mm_cvtss_sd
_mm_shuffle_pd
_mm_cmplt_pd
_mm_unpacklo_pd
Please~

Pass NaN value for accurate testing of _mm_min_ps and _mm_max_ps

Discussion: #65 (comment)

The current test suite does not include the testing of NaN value for intrinsics _mm_min_ps and _mm_max_ps.
The test suite should be modified to fit the request.

The corresponding behaviour of NaN value can be referenced from #57.

Implement the rest of the shift intrinsic functions

The following shift intrinsic functions are not implemented yet.

_mm_sll_epi*
_mm_srl_epi*
_mm_sra_epi*
_mm_sllv_epi*
_mm_srlv_epi*
_mm_srav_epi*

However, the related shift intrinsic functions of x86 and ARM have different behaviour when the shift value is too large .
Please check the discussion in #17 (comment).

Refactor macro to inline function

Some implementations are macros instead of inline functions (e.g. _mm_slli_epi32). Should we refactor them into inline functions for consistency?

Implement _mm_aesenclast_si128

Hello,

Is there an implementation for _mm_aesenclast_si128 instruction?
When I'm trying to compile my program, I'm getting this error:
_mm_aesenclast_si128 was not declared in this scope

Lior

Missing "_mm_cmple_epi32" and "_mm_cmpge_epi32"

Intrinsics "_mm_cmple_epi32" and "_mm_cmpge_epi32" are missing at the moment.

sse2neon.h:2655:22: error: argument to '__builtin_neon_vextq_v' must be a constant integer

I build apache/impala on aarch64, I download the file https://github.com/DLTcollab/sse2neon/blob/master/sse2neon.h under impala directory, and include it in some other files, an error raised:

/home/jenkins/workspace/impala/be/src/util/sse2neon.h:2655:22: error: argument to
'__builtin_neon_vextq_v' must be a constant integer
return (__m128i) vextq_s8((int8x16_t) a, (int8x16_t) b, c);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/jenkins/workspace/native-toolchain/build/llvm-5.0.1-p2/lib/clang/5.0.1/include/arm_neon.h:5825:23: note:
expanded from macro 'vextq_s8'
__ret = (int8x16_t) __builtin_neon_vextq_v((int8x16_t)__s0, (int8x16_t)__s1, __p2, 32); \

I don't know how to fix this, the third argument c is const integer type, the value of c only can be known at run-time, right? But according to compile error, it seems that the value of c should be known at compile-time. So I have no idea is it an error of the sse2neon.h? Or how to fix this?

Support running tests on ARM-based machine

Should we add another cross-tool.sh which helps ARM-based machine run x86 tests and A32/A64 tests?

Machine like M1 needs this favor.

A64 fastpath and test case for _mm_hadds_epi16

Intrinsic _mm_hadds_epi16 can be implemented via A64 instructions. The proposed change:

@@ -3172,6 +3172,12 @@ FORCE_INLINE __m128i _mm_hsub_epi16(__m128i _a, __m128i _b)
 // integer values a and b.
 FORCE_INLINE __m128i _mm_hadds_epi16(__m128i _a, __m128i _b)
 {
+#if defined(__aarch64__)
+    int16x8_t a = vreinterpretq_s16_m128i(_a);
+    int16x8_t b = vreinterpretq_s16_m128i(_b);
+    return vreinterpretq_s64_s16(
+        vqaddq_s16(vuzp1q_s16(a, b), vuzp2q_s16(a, b)));
+#else
     int32x4_t a = vreinterpretq_s32_m128i(_a);
     int32x4_t b = vreinterpretq_s32_m128i(_b);
     // Interleave using vshrn/vmovn
@@ -3181,6 +3187,7 @@ FORCE_INLINE __m128i _mm_hadds_epi16(__m128i _a, __m128i _b)
     int16x8_t ab1357 = vcombine_s16(vshrn_n_s32(a, 16), vshrn_n_s32(b, 16));
     // Saturated add
     return vreinterpretq_m128i_s16(vqaddq_s16(ab0246, ab1357));
+#endif
 }
 
 // Computes saturated pairwise difference of each argument as a 16-bit signed

Meanwhile, we lack of _mm_hadds_epi16 test case.

Missing _mm_movemask_pi8

_mm_movemask_pi8 would create mask from the most significant bit of each 8-bit element in a, and store the result in dst.
Reference A64 implementation:

      uint8x8_t input = vreinterpretq_u8_m128i(a);
      const int8_t ALIGN_STRUCT(16) xr[8] = {-7, -6, -5, -4, -3, -2, -1, 0};
      const uint8x8_t mask_and = vdup_n_u8(0x80);
      const int8x8_t mask_shift = vld1_s8(xr);
      const uint8x8_t mask_result = vshl_u8(vand_u8(input, mask_and), mask_shift);
      uint8x8_t lo = mask_result;
      return vaddv_u8(lo);

CI: Migrate from Travis CI to Github Actions

Travis CI introduced the new pricing structure that each running test would consume the credits until it is empty.
Since we have run out of the credits, we decide to move to another CI system, which is Github Actions.

The CI should include running the test of different platforms and compilers
Platform:

Compiler:

GCC
Clang

And it should check the coding convention as well.

Shorten CI pipeline

The current build matrix consists of two architectures (Arm64/AMD64) and two compilers (clang/gcc). However, it takes much longer time to set up rather than doing the real verification. The brief idea to shorten CI pipeline would be:

AMD64+clang: Install clang packages from bionic-updates
AMD64+gcc: Install gcc packages from ppa:ubuntu-toolchain-r/test
Arm64+gcc: Install gcc packages from ppa:ubuntu-toolchain-r/test
Arm64+clang: Install clang packages from bionic. NOTE: bionic-updates provides AMD64 packages only.
cross: Run once with AMD64. Maybe move to jobs.

Implement missing intrinsics required by IQ-TREE

While transiting from SSE intrinsics used by IQ-TREE to Arm NEON, @joshlvmh extended SSE2NEON with additional intrinsics. See https://github.com/joshlvmh/iqtree_arm_neon/blob/7bc67d3449428b0b683fb7359c8945da218f5775/IQ-TREE/sse2neon.h#L4136

It would be great if we can rework these changes back to SSE2NEON.

Generate intrinsics coverage report

As mentioned in #263 (comment), we are going to generate intrinsics coverage reports with the result of CI.

The report can be appended here

improve _mm_set_epi64x

Thanks for providing such a great tool!

The current implementation of _mm_set_epi64x uses an array. It might be better to use vcombine_u64 when it is available?

__m128d for A32 might not behave as same as A64

I was implementing _mm_dp_pd as following:

__m128d _mm_dp_pd(__m128d a, __m128d b, const int imm)
{
    double v = 0;
    v += (imm & 0x10) ? a[0] * b[0] : 0;                                                                                                                      
    v += (imm & 0x20) ? a[1] * b[1] : 0;
    double ret[2] = {(imm & 0x1) ? v : 0, ((imm >> 1) & 0x1) ? v : 0};
    return _mm_load_pd(ret);
}

The code works for A64 while it misbehaves on A32 since __m128d is defined as float32x4_t. The array subscripting would result in different data width.

We should have consistent type definition for __m128d.

Missing _mm_cvt_ss2si

Hi the following instruction is missing: _mm_cvt_ss2si

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=4979,3330,3863,4979,5197,3853,3330,3863,5197,3407,3863,5610,1382&text=_mm_cvt_ss2si

Thanks,
Alex

Missing _mm_sad_pu8

_mm_sad_pu8 would compute the absolute differences of packed unsigned 8-bit integers in a and b, then horizontally sum each consecutive 8 differences to produce four unsigned 16-bit integers, and pack these unsigned 16-bit integers in the low 16 bits of dst.

Reference NEON implementation:

__m64 _mm_sad_pu8 (__m64 a, __m64 b)
{
    uint16x8_t t = vpaddl_u8(vabd_u8((uint8x16_t) a, (uint8x16_t) b));
    uint16_t r0 = t[0] + t[1] + t[2] + t[3];
    return vset_lane_u16(r0, vdup_n_u16(0), 0);
}

Input data in `test_mm_srli_epi32` seldom hit the values used in real world usage

test_mm_srli_epi32 and other tests implemented in the similar way are testing with rare input data. Take test_mm_srli_epi32 for example, the input value of imm8 should last 32 in the most cases in real world. However, in the current implementation, the imm8 we generated are always out of this scope. We should refactor it to another way which can test the functions properly

_mm_movemask_epi8 regression

The aarch64 code path for _mm_movemask_epi8 introduced in #50 looks to be a regression when you actually compile it. The default behavior compiles to 7 instructions with no constants, while the "fast path" is 14 instructions plus a constant. Should it be reverted?

fast path: https://godbolt.org/z/41s54d
default: https://godbolt.org/z/xsYfz8

dltcollab / sse2neon Goto Github PK

sse2neon's Introduction

sse2neon

Introduction

Mapping and Coverage

Floating-point compatibility

Requirement

Usage

Compile-time Configurations

Run Built-in Test Suite

Running tests on hosts other than ARM platform

Adoptions

Related Projects

Reference

Licensing

sse2neon's People

Contributors

Stargazers

Watchers

Forkers

sse2neon's Issues

reference

Recommend Projects

Recommend Topics

Recommend Org