
rv's Introduction

The Region Vectorizer for SX-Aurora

NEC Deutschland

The Region Vectorizer (RV) is a general-purpose vectorization framework for LLVM. RV provides a unified interface to vectorize code regions, such as inner and outer loops, up to whole functions.

Features

  • Support for tail-predicated outer-loop vectorization (OLV) through LLVM-VP.
  • Support for OpenMP 4.5 #pragma omp simd and #pragma omp declare simd (pass -fopenmp -fplugin=libRV.so -mllvm -rv to Clang and you are set).
  • Automatic outer-loop vectorization (preview feature) (pass -mllvm -rv-autovec to enable).
  • Support for inter-procedural/recursive vectorization.
  • Implements Partial Control-Flow Linearization, S. Moll and S. Hack (PLDI '18).
  • Automatically uses SLEEF vector math functions.
  • Whole-Function vectorizer (min -> min_avx2).
  • Outer-loop vectorizer.
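
As a rough illustration of the two OpenMP pragmas listed above, the following sketch marks a scalar function for whole-function vectorization and a loop for SIMD execution. The function names scalarMin and elementwiseMin are made up for this example and are not part of RV. Without -fopenmp the pragmas are simply ignored and the code runs as scalar code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Whole-function vectorization: `declare simd` requests a vector variant of
// this scalar function that SIMD loops can call.
// (scalarMin is a hypothetical name chosen for this sketch.)
#pragma omp declare simd
float scalarMin(float a, float b) {
  return a < b ? a : b;
}

// Loop vectorization: `omp simd` asserts that the iterations are independent.
void elementwiseMin(const std::vector<float> &x, const std::vector<float> &y,
                    std::vector<float> &out) {
  assert(x.size() == y.size() && out.size() == x.size());
  const std::size_t n = x.size();
#pragma omp simd
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = scalarMin(x[i], y[i]);
  }
}
```

With the flags from the feature list (-fopenmp -fplugin=libRV.so -mllvm -rv), RV would be the component that picks up these annotations.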

Building libRV

RV is an LLVM project and integrates into the LLVM build system. Clone this repository into llvm-project/rv, where llvm-project is your LLVM source directory. To build RV along with LLVM, tell cmake where to find RV by passing -DLLVM_EXTERNAL_PROJECTS="rv" -DLLVM_EXTERNAL_RV_SOURCE_DIR=llvm-project/rv. Run git submodule update --init to pull the SLEEF submodule. To (optionally) enable vectorized complex arithmetic through compiler-rt, check out compiler-rt in llvm/runtimes and configure cmake with -DRV_ENABLE_CRT=on.

Build prerequisites

  • LLVM trunk (as of latest commit on this branch)
  • Clang (for the vector math libraries)
  • compiler-rt [optional] (for complex arithmetic functions)

Testing

Install LLVM+RV, go to rv/test/ and run ./test_rv.py.

RV's Outer-Loop Vectorizer

RV ships with frontend passes for Outer-Loop and Whole-Function Vectorization. The passes pick up on SIMD pragmas in your code to vectorize the region (loop or function) in question. RV is designed to deal with any control flow inside those regions. However, in the case of loop vectorization, the annotated loops themselves need to be parallel counting loops. RV supports a range of value reductions and recurrences, including conditional ones (e.g. if (i % 3 == 0) a += A[i]; ). Be aware that RV will do exactly as you annotated. Specifically, RV does not perform exhaustive legality checks, nor is there cost modelling of any kind. You'll get what you ordered.

Usage

  1. Annotate vectorizable loops with #pragma clang loop vectorize(assume_safety) vectorize_width(W) where W is the desired vectorization width.
  2. Invoke clang with -fplugin=libRV.so -mllvm -rv-loopvec. We also recommend disabling loop unrolling with -fno-unroll-loops.
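
The two steps above can be sketched on a small loop. This example annotates a parallel counting loop that contains a conditional reduction of the kind mentioned earlier (if (i % 3 == 0) a += A[i];). The function name sumEveryThird and the width of 8 are arbitrary choices for illustration. Compilers that do not know the pragma ignore it.

```cpp
#include <cassert>
#include <cstddef>

// Sums the elements of A whose index is a multiple of three.
// The loop is a counting loop; `sum` is a conditional reduction.
// (sumEveryThird is a hypothetical name; width 8 is an arbitrary choice.)
long sumEveryThird(const int *A, std::size_t n) {
  assert(A != nullptr || n == 0);
  long sum = 0;
#pragma clang loop vectorize(assume_safety) vectorize_width(8)
  for (std::size_t i = 0; i < n; ++i) {
    if (i % 3 == 0)
      sum += A[i];
  }
  return sum;
}
```

Because assume_safety skips dependence checking, the annotation is only appropriate here because no iteration reads or writes memory touched by another iteration.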

Getting started on the code

Users of RV should include its main header file include/rv/rv.h and supporting headers in include/rv. The command line tester (tool/rvTool.cpp) is a good starting point to learn how to use RV's API.

Source structure

  • include/ - header files
  • src/ - source files
  • vecmath/ - SIMD library sources
  • test/ - tests
  • tool/ - sources of rvTool

Advanced options

Environment variables

RV's diagnostic output can be configured through a couple of environment variables. These will be read by the Outer-Loop Vectorizer and rvTool. To get a short diagnostic report from every transformation in RV, set the environment variable RV_REPORT to any value but 0. To also get a report from RV's Outer-Loop Vectorizer, set the environment variable LV_DIAG to a non-0 value.
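
The "any value but 0" convention can be pictured with a small helper. This is a minimal sketch of one plausible reading, not RV's actual implementation: the helper name envFlagEnabled is invented here, and treating an empty value as enabled is an assumption.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical helper (not taken from RV): a flag counts as enabled when the
// variable is set to anything other than the string "0".
// Assumption: an empty value also counts as enabled.
bool envFlagEnabled(const char *name) {
  assert(name != nullptr);
  const char *value = std::getenv(name);
  return value != nullptr && std::strcmp(value, "0") != 0;
}
```

Under this reading, running a build with RV_REPORT=1 in the environment would enable the short report, while RV_REPORT=0 or an unset variable would leave it off.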

Optional cmake flags

  • RV_ENABLE_CRT:BOOL Whether RV should inline and vectorize complex math functions. This makes use of the complex arithmetic implementations in compiler-rt. Requires compiler-rt to live in llvm/projects. Defaults to OFF.
  • RV_TARGETS_TO_BUILD:ListOfTargets List of LLVM targets, for which the SLEEF vector math library should be built. Same format as LLVM_TARGETS_TO_BUILD. RV uses SLEEF to vectorize math functions. Clang has to be able to (cross-)compile for all of these targets or the build will fail. Defaults to "Native", the host target.
  • RV_DEBUG:BOOL If enabled, RV will produce (very) verbose debug output and run additional consistency checks. Make sure you compile with assertions. Recommended for debugging only. Defaults to OFF.
  • LLVM_RVPLUG_LINK_INTO_TOOLS:BOOL Enables the LLVM pass plugin mechanism to link RV into all LLVM tools (opt, clang, ..). Obviates the need to load libRV manually as a plugin on the command line.

The Region Vectorizer is distributed under the University of Illinois Open Source License. See LICENSE.TXT for details.

rv's People

Contributors

baziotis, commaster, gargaroff, jdoerfert, leissa, m-kurtenacker, madmann91, richardmembarth, simoll, stlemme, tkloessner, xoofx

rv's Issues

IsSupportedReduction error output

The Chapel compiler is generating LLVM IR that, when optimized with RV, runs into the case here outputting to errs().

IsSupportedReduction(Loop & L, Reduction & red) {
  // check that all users of the reduction are either (a) part of it or (b) outside the loop
  for (auto * inst : red.elements) {
    for (auto itUser : inst->users()) {
      auto * userInst = dyn_cast<Instruction>(itUser);
      if (!userInst) return false; // unsupported
      if (L.contains(userInst->getParent()) &&
          !red.elements.count(userInst)) {
        errs() << "Unsupported user of reduction: "; Dump(*userInst);
        return false;
      }
    }
  }

Does this mean that there is something wrong with the LLVM IR being generated by the frontend (i.e. a correctness issue when optimized with RV)? Or is it a situation where RV figures out it cannot optimize this case and where extra debug output is present?
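
To make the condition in that check concrete, here is a hedged source-level illustration. It is an assumption about the kind of loop that could produce such IR, not the actual Chapel-generated code, and both function names are invented. In the first loop the accumulated value is only used after the loop; in the second, each partial sum is also stored inside the loop, which is a user that is neither part of the reduction nor outside the loop.

```cpp
#include <cassert>
#include <cstddef>

// The reduction value `sum` is only used by the reduction itself and by the
// return after the loop, matching cases (a) and (b) of the check.
int plainSum(const int *A, std::size_t n) {
  int sum = 0;
  for (std::size_t i = 0; i < n; ++i)
    sum += A[i];
  return sum;
}

// The partial value of `sum` is stored to B[i] in every iteration. That store
// sits inside the loop but is not part of the reduction chain.
void runningSum(const int *A, int *B, std::size_t n) {
  assert((A != nullptr && B != nullptr) || n == 0);
  int sum = 0;
  for (std::size_t i = 0; i < n; ++i) {
    sum += A[i];
    B[i] = sum;
  }
}
```

Both functions are valid scalar code; the difference only matters to a vectorizer that wants to treat `sum` as a plain reduction.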

Use portable modifiers/attributes from llvm/Support/Compiler.h

RV's implementation uses GCC/Linux-specific features that will not compile on other systems (e.g. __attribute__((noreturn)) with MSVC, see #13).

The LLVM header llvm/Support/Compiler.h provides portable modifiers/attributes that work on all compilers that LLVM supports. We should use those macros where possible.
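
The general pattern behind such portability macros can be sketched without LLVM. The macro name RV_NORETURN and the functions below are hypothetical and only illustrate the compiler dispatch; in RV itself the macros from llvm/Support/Compiler.h would take this role.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical macro for illustration only (not defined by RV or LLVM):
// pick the spelling of "noreturn" that the current compiler understands.
#if defined(_MSC_VER)
#define RV_NORETURN __declspec(noreturn)
#elif defined(__GNUC__)
#define RV_NORETURN __attribute__((noreturn))
#else
#define RV_NORETURN
#endif

// Never returns normally: always throws.
RV_NORETURN void failWith(const std::string &msg) {
  throw std::runtime_error(msg);
}

// Uses failWith on the error path; the attribute tells the compiler that
// control does not continue past the call.
int checkedDivide(int a, int b) {
  if (b == 0)
    failWith("division by zero");
  return a / b;
}
```

Code written against the macro compiles unchanged with GCC, Clang, and MSVC, which is the property the issue asks for.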

Compiler compatibility issue

The change in fabb261 (from fabb261#diff-3330861d3ce98513e72cb245f8205bd3L140 to fabb261#diff-2937886c4a61386b2fe398b56fc97a7dR86) made RV incompatible with gcc 5 (gcc 7 works fine):

In file included from /usr/include/x86_64-linux-gnu/c++/5/bits/c++allocator.h:33:0,
                 from /usr/include/c++/5/bits/allocator.h:46,
                 from /usr/include/c++/5/string:41,
                 from /usr/include/c++/5/random:40,
                 from /usr/include/c++/5/bits/stl_algo.h:66,
                 from /usr/include/c++/5/algorithm:62,
                 from %%/llvm/include/llvm/ADT/Optional.h:23,
                 from %%/llvm/include/llvm/ADT/STLExtras.h:20,
                 from %%/llvm/include/llvm/ADT/StringRef.h:13,
                 from %%/llvm/tools/rv/include/rv/resolver/resolver.h:5,
                 from %%/llvm/tools/rv/include/rv/resolver/listResolver.h:4,
                 from %%/llvm/tools/rv/src/resolver/listResolver.cpp:1:
/usr/include/c++/5/ext/new_allocator.h: In instantiation of ‘void __gnu_cxx::new_allocator< <template-parameter-1-1> >::construct(_Up*, _Args&& ...) [with _Up = std::pair<const llvm::Function* const, std::unique_ptr<llvm::SmallVector<rv::VectorMapping, 4u> > >; _Args = {llvm::Function*&, llvm::SmallVector<rv::VectorMapping, 4u>*}; _Tp = std::_Rb_tree_node<std::pair<const llvm::Function* const, std::unique_ptr<llvm::SmallVector<rv::VectorMapping, 4u> > > >]’:
/usr/include/c++/5/bits/alloc_traits.h:530:4:   required from ‘static void std::allocator_traits<std::allocator<_Tp> >::construct(std::allocator_traits<std::allocator<_Tp> >::allocator_type&, _Up*, _Args&& ...) [with _Up = std::pair<const llvm::Function* const, std::unique_ptr<llvm::SmallVector<rv::VectorMapping, 4u> > >; _Args = {llvm::Function*&, llvm::SmallVector<rv::VectorMapping, 4u>*}; _Tp = std::_Rb_tree_node<std::pair<const llvm::Function* const, std::unique_ptr<llvm::SmallVector<rv::VectorMapping, 4u> > > >; std::allocator_traits<std::allocator<_Tp> >::allocator_type = std::allocator<std::_Rb_tree_node<std::pair<const llvm::Function* const, std::unique_ptr<llvm::SmallVector<rv::VectorMapping, 4u> > > > >]’
/usr/include/c++/5/bits/stl_tree.h:529:32:   required from ‘void std::_Rb_tree<_Key, _Val, _KeyOfValue, _Compare, _Alloc>::_M_construct_node(std::_Rb_tree<_Key, _Val, _KeyOfValue, _Compare, _Alloc>::_Link_type, _Args&& ...) [with _Args = {llvm::Function*&, llvm::SmallVector<rv::VectorMapping, 4u>*}; _Key = const llvm::Function*; _Val = std::pair<const llvm::Function* const, std::unique_ptr<llvm::SmallVector<rv::VectorMapping, 4u> > >; _KeyOfValue = std::_Select1st<std::pair<const llvm::Function* const, std::unique_ptr<llvm::SmallVector<rv::VectorMapping, 4u> > > >; _Compare = std::less<const llvm::Function*>; _Alloc = std::allocator<std::pair<const llvm::Function* const, std::unique_ptr<llvm::SmallVector<rv::VectorMapping, 4u> > > >; std::_Rb_tree<_Key, _Val, _KeyOfValue, _Compare, _Alloc>::_Link_type = std::_Rb_tree_node<std::pair<const llvm::Function* const, std::unique_ptr<llvm::SmallVector<rv::VectorMapping, 4u> > > >*]’
/usr/include/c++/5/bits/stl_tree.h:546:21:   required from ‘std::_Rb_tree_node<_Val>* std::_Rb_tree<_Key, _Val, _KeyOfValue, _Compare, _Alloc>::_M_create_node(_Args&& ...) [with _Args = {llvm::Function*&, llvm::SmallVector<rv::VectorMapping, 4u>*}; _Key = const llvm::Function*; _Val = std::pair<const llvm::Function* const, std::unique_ptr<llvm::SmallVector<rv::VectorMapping, 4u> > >; _KeyOfValue = std::_Select1st<std::pair<const llvm::Function* const, std::unique_ptr<llvm::SmallVector<rv::VectorMapping, 4u> > > >; _Compare = std::less<const llvm::Function*>; _Alloc = std::allocator<std::pair<const llvm::Function* const, std::unique_ptr<llvm::SmallVector<rv::VectorMapping, 4u> > > >; std::_Rb_tree<_Key, _Val, _KeyOfValue, _Compare, _Alloc>::_Link_type = std::_Rb_tree_node<std::pair<const llvm::Function* const, std::unique_ptr<llvm::SmallVector<rv::VectorMapping, 4u> > > >*]’
/usr/include/c++/5/bits/stl_tree.h:2123:33:   required from ‘std::pair<std::_Rb_tree_iterator<_Val>, bool> std::_Rb_tree<_Key, _Val, _KeyOfValue, _Compare, _Alloc>::_M_emplace_unique(_Args&& ...) [with _Args = {llvm::Function*&, llvm::SmallVector<rv::VectorMapping, 4u>*}; _Key = const llvm::Function*; _Val = std::pair<const llvm::Function* const, std::unique_ptr<llvm::SmallVector<rv::VectorMapping, 4u> > >; _KeyOfValue = std::_Select1st<std::pair<const llvm::Function* const, std::unique_ptr<llvm::SmallVector<rv::VectorMapping, 4u> > > >; _Compare = std::less<const llvm::Function*>; _Alloc = std::allocator<std::pair<const llvm::Function* const, std::unique_ptr<llvm::SmallVector<rv::VectorMapping, 4u> > > >]’
/usr/include/c++/5/bits/stl_map.h:559:64:   required from ‘std::pair<typename std::_Rb_tree<_Key, std::pair<const _Key, _Tp>, std::_Select1st<std::pair<const _Key, _Tp> >, _Compare, typename __gnu_cxx::__alloc_traits<_Allocator>::rebind<std::pair<const _Key, _Tp> >::other>::iterator, bool> std::map<_Key, _Tp, _Compare, _Alloc>::emplace(_Args&& ...) [with _Args = {llvm::Function*&, llvm::SmallVector<rv::VectorMapping, 4u>*}; _Key = const llvm::Function*; _Tp = std::unique_ptr<llvm::SmallVector<rv::VectorMapping, 4u> >; _Compare = std::less<const llvm::Function*>; _Alloc = std::allocator<std::pair<const llvm::Function* const, std::unique_ptr<llvm::SmallVector<rv::VectorMapping, 4u> > > >; typename std::_Rb_tree<_Key, std::pair<const _Key, _Tp>, std::_Select1st<std::pair<const _Key, _Tp> >, _Compare, typename __gnu_cxx::__alloc_traits<_Allocator>::rebind<std::pair<const _Key, _Tp> >::other>::iterator = std::_Rb_tree_iterator<std::pair<const llvm::Function* const, std::unique_ptr<llvm::SmallVector<rv::VectorMapping, 4u> > > >]’
%%/llvm/tools/rv/src/resolver/listResolver.cpp:86:84:   required from here
/usr/include/c++/5/ext/new_allocator.h:120:4: error: no matching function for call to ‘std::pair<const llvm::Function* const, std::unique_ptr<llvm::SmallVector<rv::VectorMapping, 4u> > >::pair(llvm::Function*&, llvm::SmallVector<rv::VectorMapping, 4u>*)’
  { ::new((void *)__p) _Up(std::forward<_Args>(__args)...); }
    ^
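
The log suggests that a raw SmallVector pointer is passed to std::map::emplace for a map whose mapped type is a std::unique_ptr, and that the standard library shipped with gcc 5 refuses to build the pair from it. A workaround that should be accepted by both compiler versions is to construct the unique_ptr explicitly. The sketch below uses stand-in types (int keys, std::vector<int> values) instead of the LLVM and RV types, and getOrCreate is a made-up name.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <vector>

// Stand-ins for the types in the error message (assumptions for this sketch):
// Mappings plays the role of llvm::SmallVector<rv::VectorMapping, 4>,
// and `const int *` plays the role of `const llvm::Function *`.
using Mappings = std::vector<int>;
using MappingTable = std::map<const int *, std::unique_ptr<Mappings>>;

// Returns the entry for `key`, creating an empty one on first use.
Mappings &getOrCreate(MappingTable &table, const int *key) {
  assert(key != nullptr);
  auto it = table.find(key);
  if (it == table.end()) {
    // Wrap the allocation in a unique_ptr before emplacing, instead of
    // handing emplace a raw pointer.
    it = table.emplace(key, std::unique_ptr<Mappings>(new Mappings())).first;
  }
  return *it->second;
}
```

The explicit std::unique_ptr construction uses only C++11 features, so it does not depend on later changes to the std::pair constructors.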

[pragma annotated RV loop vectorizer] Not vectorized with obscure diag when annotated

On the README page, I am advised that

Be aware that RV will exactly do as you annotated...You'll get what you ordered

But I get the diagnostic message "Unsupported user of reduction" (M1), or the more obscure "warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]" (M2), when I use RV as the loop vectorizer on the SPECCPU2017 benchmarks.

Taking the 644.nab_s benchmark as an example, annotating a loop in eff.c (line 1760) gives M2, and annotating a loop in eff.c (line 2785) gives M1. I annotate with #pragma clang loop vectorize(assume_safety) vectorize_width(4) and compile with -O3 -fno-vectorize -mavx -fplugin=libRV.so -mllvm -rv-loopvec.

I believe these loops could be vectorized by RV. I wonder what I have done wrong and how I can vectorize these loops successfully. Thanks in advance.

Legality checks vs vectorizer features

Hey,
This is more of a general question about the separation often cited between legality checks and the features of a vectorizer (seen here, but also in Intel VPlan slides).

It is said that "RV does not perform exhaustive legality checks", but I fail to see how a legality check can be independent of the vectorizer. For example, one vectorizer may be able to vectorize a conditional while another may not, so what would a legality check output without knowing how the vectorizer works?
At some point, the legality check has to generate the same kind of "execution mask" for the code in order to check that there are no overlapping writes, no reads depending on previous writes, etc. So how can this separation work in practice?

[MCE] common type too low

-- memCopy elision log --
Found divergent memcpy:   call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull %11, i8* %12, i64 16, i32 4, i1 false), !tbaa.struct !12
derive   %arrayidx7.prol.i.i = getelementptr inbounds %struct.FOUR_VECTOR, %struct.FOUR_VECTOR* %callable.coerce3, i64 %10 size 16
        hit: pointer aligns with type!
derive   %arrayidx9.prol.i.i = getelementptr inbounds [100 x %struct.FOUR_VECTOR], [100 x %struct.FOUR_VECTOR]* @_ZZ6kernelIN5pacxx2v25rangeEEvRT_7par_str7dim_strP7box_strP11FOUR_VECTORPfSA_E9rA_shared, i64 0, i64 %indvars.iv245.prol.i.i size 16
        hit: pointer aligns with type!
                 derive common type for %struct.FOUR_VECTOR = type { float, float, float, float } and %struct.FOUR_VECTOR = type { float, float, float, float }
                 derive common type for float and %struct.FOUR_VECTOR = type { float, float, float, float }
                 derive common type for float and float
        skip: common base type: float
[llvm] CreateGEP called
[llvm] General case
[llvm] Values: 3
[llvm] PointeeType is nullptr, replacing
[llvm] Base type: %struct.FOUR_VECTOR*, ScalarPointerType: %struct.FOUR_VECTOR*, ElementType: %struct.FOUR_VECTOR = type { float, float, float, float }
[llvm] Resulting type: %struct.FOUR_VECTOR = type { float, float, float, float }
[llvm] Getting GEP Return Type from Type: %struct.FOUR_VECTOR = type { float, float, float, float }
[llvm] Getting IndexedType for %struct.FOUR_VECTOR = type { float, float, float, float }
[llvm] is sized
[llvm] is composite
[llvm] Verifying index
[llvm] Refreshing Type
[llvm] Type refreshed to float
[llvm] Idx 2 size 2
[llvm] not nullptr
[llvm] Getting IndexedType for %struct.FOUR_VECTOR = type { float, float, float, float }
[llvm] is sized
[llvm] is composite
[llvm] Verifying index
[llvm] Refreshing Type
[llvm] Type refreshed to float
[llvm] Idx 2 size 2
[llvm] CreateGEP called
[llvm] General case
[llvm] Values: 3
[llvm] PointeeType is nullptr, replacing
[llvm] Base type: %struct.FOUR_VECTOR*, ScalarPointerType: %struct.FOUR_VECTOR*, ElementType: %struct.FOUR_VECTOR = type { float, float, float, float }
[llvm] Resulting type: %struct.FOUR_VECTOR = type { float, float, float, float }
[llvm] Getting GEP Return Type from Type: %struct.FOUR_VECTOR = type { float, float, float, float }
[llvm] Getting IndexedType for %struct.FOUR_VECTOR = type { float, float, float, float }
[llvm] is sized
[llvm] is composite
[llvm] Verifying index
[llvm] Refreshing Type
[llvm] Type refreshed to float
[llvm] Idx 2 size 2
[llvm] not nullptr
[llvm] Getting IndexedType for %struct.FOUR_VECTOR = type { float, float, float, float }
[llvm] is sized
[llvm] is composite
[llvm] Verifying index
[llvm] Refreshing Type
[llvm] Type refreshed to float
[llvm] Idx 2 size 2
OK base gep src:   %basegep = getelementptr %struct.FOUR_VECTOR, %struct.FOUR_VECTOR* %arrayidx7.prol.i.i, i32 0, i32 0   base gep dest:   %basegep27 = getelementptr %struct.FOUR_VECTOR, %struct.FOUR_VECTOR* %arrayidx9.prol.i.i, i32 0, i32 0
Lowering
        to        %basegep27 = getelementptr %struct.FOUR_VECTOR, %struct.FOUR_VECTOR* %arrayidx9.prol.i.i, i32 0, i32 0
        from      %basegep = getelementptr %struct.FOUR_VECTOR, %struct.FOUR_VECTOR* %arrayidx7.prol.i.i, i32 0, i32 0
        based on common float of size 16
[llvm] CreateGEP called
[llvm] General case
[llvm] Values: 3
[llvm] PointeeType is nullptr, replacing
[llvm] Base type: float*, ScalarPointerType: float*, ElementType: float
[llvm] Resulting type: float
[llvm] Getting GEP Return Type from Type: float
[llvm] Getting IndexedType for float
[llvm] is sized
[llvm] nullptr
lavaMD: %%/llvm/include/llvm/IR/Instructions.h: llvm::Type* llvm::checkGEPType(llvm::Type*): Assertion `Ty && "Invalid GetElementPtrInst indices for type!"' failed.

I tried setting https://github.com/cdl-saarland/rv/blob/develop/src/transform/memCopyElision.cpp#L61 as the first check instead of last, which let the pass finish.

I also checked other calls to CreateGEP. Calls with basic types (i8, float, etc..) have only 1 index, calls for complex types (Arrays, structures) have 2 or more. I suspect the issue here is that the common type gets decayed to a basic type, but still supplied with two indexes (https://github.com/cdl-saarland/rv/blob/develop/src/transform/memCopyElision.cpp#L107)

Loading plugin with opt gives an error

I am building the latest LLVM 12 master with the latest RV master. LLVM is built with each component as a shared library, not as a single shared library (BUILD_SHARED_LIBS=ON), and then installed to a local directory. Loading the plugin with clang seems to work fine, but loading it with opt via ~/local/llvm/rv/bin/opt -load=/home/kazooie/local/llvm/rv/lib/libRV.so gives:

opt: CommandLine Error: Option 'rv-cns' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.	Program arguments: /home/kazooie/local/llvm/rv/bin/opt -load=/home/kazooie/local/llvm/rv/lib/libRV.so 
 #0 0x00007f3aeebc75ba llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /home/kazooie/extra/programming/llvm-project/llvm/lib/Support/Unix/Signals.inc:563:22
 #1 0x00007f3aeebc7671 PrintStackTraceSignalHandler(void*) /home/kazooie/extra/programming/llvm-project/llvm/lib/Support/Unix/Signals.inc:630:1
 #2 0x00007f3aeebc5381 llvm::sys::RunSignalHandlers() /home/kazooie/extra/programming/llvm-project/llvm/lib/Support/Signals.cpp:71:20
 #3 0x00007f3aeebc6f4a SignalHandler(int) /home/kazooie/extra/programming/llvm-project/llvm/lib/Support/Unix/Signals.inc:405:1
 #4 0x00007f3af37640f0 __restore_rt (/usr/lib/libpthread.so.0+0x140f0)
 #5 0x00007f3aee472615 raise (/usr/lib/libc.so.6+0x3d615)
 #6 0x00007f3aee45b862 abort (/usr/lib/libc.so.6+0x26862)
 #7 0x00007f3aeea761f2 llvm::install_bad_alloc_error_handler(void (*)(void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool), void*) /home/kazooie/extra/programming/llvm-project/llvm/lib/Support/ErrorHandling.cpp:130:61
 #8 0x00007f3aeea75fad llvm::report_fatal_error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) /home/kazooie/extra/programming/llvm-project/llvm/lib/Support/ErrorHandling.cpp:86:77
 #9 0x00007f3aeea2d151 (anonymous namespace)::CommandLineParser::addOption(llvm::cl::Option*, llvm::cl::SubCommand*) /home/kazooie/extra/programming/llvm-project/llvm/lib/Support/CommandLine.cpp:245:17
#10 0x00007f3aeea2d2e0 (anonymous namespace)::CommandLineParser::addOption(llvm::cl::Option*, bool) /home/kazooie/extra/programming/llvm-project/llvm/lib/Support/CommandLine.cpp:261:16
#11 0x00007f3aeea2e439 llvm::cl::Option::addArgument() /home/kazooie/extra/programming/llvm-project/llvm/lib/Support/CommandLine.cpp:446:20
#12 0x00007f3aeea41a98 llvm::cl::opt<bool, false, llvm::cl::parser<bool> >::done() /home/kazooie/extra/programming/llvm-project/llvm/include/llvm/Support/CommandLine.h:1463:22
#13 0x00007f3aec74b006 llvm::cl::opt<bool, false, llvm::cl::parser<bool> >::opt<char [7], llvm::cl::desc, llvm::cl::initializer<bool>, llvm::cl::NumOccurrencesFlag, llvm::cl::cat>(char const (&) [7], llvm::cl::desc const&, llvm::cl::initializer<bool> const&, llvm::cl::NumOccurrencesFlag const&, llvm::cl::cat const&) /home/kazooie/extra/programming/llvm-project/llvm/include/llvm/Support/CommandLine.h:1487:3
#14 0x00007f3aec74a570 __static_initialization_and_destruction_0(int, int) /home/kazooie/extra/programming/llvm-project/rv/src/registerPasses.cpp:45:5
#15 0x00007f3aec74a656 _GLOBAL__sub_I_registerPasses.cpp /home/kazooie/extra/programming/llvm-project/rv/src/registerPasses.cpp:99:42
#16 0x00007f3af497b2de call_init.part.0 (/lib64/ld-linux-x86-64.so.2+0x112de)
#17 0x00007f3af497b3c8 _dl_init (/lib64/ld-linux-x86-64.so.2+0x113c8)
#18 0x00007f3aee5700e5 _dl_catch_exception (/usr/lib/libc.so.6+0x13b0e5)
#19 0x00007f3af497f705 dl_open_worker (/lib64/ld-linux-x86-64.so.2+0x15705)
#20 0x00007f3aee570088 _dl_catch_exception (/usr/lib/libc.so.6+0x13b088)
#21 0x00007f3af497ef3e _dl_open (/lib64/ld-linux-x86-64.so.2+0x14f3e)
#22 0x00007f3aecf1434c (/usr/lib/libdl.so.2+0x134c)
#23 0x00007f3aee570088 _dl_catch_exception (/usr/lib/libc.so.6+0x13b088)
#24 0x00007f3aee570153 _dl_catch_error (/usr/lib/libc.so.6+0x13b153)
#25 0x00007f3aecf14b89 (/usr/lib/libdl.so.2+0x1b89)
#26 0x00007f3aecf143d8 dlopen (/usr/lib/libdl.so.2+0x13d8)
#27 0x00007f3aeeba7ee1 llvm::sys::DynamicLibrary::HandleSet::DLOpen(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) /home/kazooie/extra/programming/llvm-project/llvm/lib/Support/Unix/DynamicLibrary.inc:28:26
#28 0x00007f3aeeba80c8 llvm::sys::DynamicLibrary::getPermanentLibrary(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) /home/kazooie/extra/programming/llvm-project/llvm/lib/Support/DynamicLibrary.cpp:154:35
#29 0x00007f3aeeb003ba llvm::sys::DynamicLibrary::LoadLibraryPermanently(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) /home/kazooie/extra/programming/llvm-project/llvm/include/llvm/Support/DynamicLibrary.h:87:51
#30 0x00007f3aeeb00169 llvm::PluginLoader::operator=(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /home/kazooie/extra/programming/llvm-project/llvm/lib/Support/PluginLoader.cpp:28:3
#31 0x000056075c3ba75c void llvm::cl::opt_storage<llvm::PluginLoader, false, true>::setValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) /home/kazooie/extra/programming/llvm-project/llvm/include/llvm/Support/CommandLine.h:1362:5
#32 0x000056075c3b9917 llvm::cl::opt<llvm::PluginLoader, false, llvm::cl::parser<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::handleOccurrence(unsigned int, llvm::StringRef, llvm::StringRef) /home/kazooie/extra/programming/llvm-project/llvm/include/llvm/Support/CommandLine.h:1418:22
#33 0x00007f3aeea34889 llvm::cl::Option::addOccurrence(unsigned int, llvm::StringRef, llvm::StringRef, bool) /home/kazooie/extra/programming/llvm-project/llvm/lib/Support/CommandLine.cpp:1700:46
#34 0x00007f3aeea2f18a CommaSeparateAndAddOccurrence(llvm::cl::Option*, unsigned int, llvm::StringRef, llvm::StringRef, bool) /home/kazooie/extra/programming/llvm-project/llvm/lib/Support/CommandLine.cpp:647:32
#35 0x00007f3aeea2f544 ProvideOption(llvm::cl::Option*, llvm::StringRef, llvm::StringRef, int, char const* const*, int&) /home/kazooie/extra/programming/llvm-project/llvm/lib/Support/CommandLine.cpp:687:41
#36 0x00007f3aeea33c9b (anonymous namespace)::CommandLineParser::ParseCommandLineOptions(int, char const* const*, llvm::StringRef, llvm::raw_ostream*, bool) /home/kazooie/extra/programming/llvm-project/llvm/lib/Support/CommandLine.cpp:1545:36
#37 0x00007f3aeea3290b llvm::cl::ParseCommandLineOptions(int, char const* const*, llvm::StringRef, llvm::raw_ostream*, char const*, bool) /home/kazooie/extra/programming/llvm-project/llvm/lib/Support/CommandLine.cpp:1312:47
#38 0x000056075c3a0e41 main /home/kazooie/extra/programming/llvm-project/llvm/tools/opt/opt.cpp:590:30
#39 0x00007f3aee45d152 __libc_start_main (/usr/lib/libc.so.6+0x28152)
#40 0x000056075c36761e _start (/home/kazooie/local/llvm/rv/bin/opt+0x2261e)
[1]    79773 abort (core dumped)  ~/local/llvm/rv/bin/opt -load=/home/kazooie/local/llvm/rv/lib/libRV.so

Is this because LLVM is configured to be built as multiple shared libraries rather than a single shared library? I found a similar discussion in apache/tvm#1461.

Attributes.inc: No such file or directory

I tried to build llvm-project with rv inside by adding the options to the cmake command (-DLLVM_EXTERNAL_PROJECTS="rv" -DLLVM_EXTERNAL_RV_SOURCE_DIR=llvm-project/rv). However, I got the following error multiple times:

llvm-project/llvm/include/llvm/IR/Attributes.h:88:14 fatal error: llvm/IR/Attributes.inc: No such file or directory

I checked and it seems that Attributes.inc hasn't been generated yet.
Is there a way to fix this?

Keep address spaces in pointer casts

The RV backend currently resets the address spaces to zero when casting pointers (see below). This issue shows up in Chapel generated IR (thanks to @mppf for reporting).

  %179 = getelementptr inbounds i64, i64 addrspace(100)* %53, i128 %178
  %180 = bitcast i64 addrspace(100)* %179 to <2 x i64> addrspace(100)*
  %vec_cast = addrspacecast <2 x i64> addrspace(100)* %180 to <2 x i64>*
  %cont_load = load <2 x i64>, <2 x i64>* %vec_cast, align 8
  %181 = getelementptr inbounds i64, i64* %51, i64 %i.0C361
  %vec_cast362 = bitcast i64* %181 to <2 x i64>*
  store <2 x i64> %cont_load, <2 x i64>* %vec_cast362, align 8

Make RV's lit tests portable

Add proper loopvectorizer tests to RV with the lit infrastructure.

The feature/port-cdl-fixes merge adds LIT testing infra for RV from the NEC RV version. There is also a check-rv target now running both functional tests as well as the lit tests. This issue is about making the lit tests more portable, that is:

  • Remove the dependence on the exact build configuration of the llvm-ve-rv stack.
  • Port/add tests for more likely host targets (such as x86, arm, ..). Check for compiler feature flags to turn these tests on/off.
  • Add a reliable way to call the RV loopvectorizer from lit tests (RVPLUG may not be linked in, etc.).

bosccTransform performance remedy problem

I turned on config.enableHeuristicBOSCC in PACXX and tested it using aobench and rodinia/cfd. The transformation does not bypass the desired path; with hard-coded optimization, however, I could get more than 2x over the original RV version. I found several problems in bosccTransform.

  1. In the computeDispersion function of bosccTransform.cpp (line 528), GetEdgeProb(*start, *end) should not be used. GetEdgeProb(*start, *end) takes all paths from *start to *end into account, while, as far as I understand, we only need the direct path from *start to *end. GetEdgeProb(*start, Index) is more suitable here.
  2. In the bosccHeuristic function, I wonder why maxRatio and minScore are assigned specific numbers (0.14 and 17). The two applications mentioned above fail to meet the < 0.14 requirement.

Fix core dump in CloneLoop

In trying to integrate RV with the Chapel compiler I ran into a core dump inside RV.

If I apply the following patch, I get an assertion " can only clone single exit loops" instead.

diff --git a/src/transform/loopCloner.cpp b/src/transform/loopCloner.cpp
index 27e191c..b2b3fa6 100644
--- a/src/transform/loopCloner.cpp
+++ b/src/transform/loopCloner.cpp
@@ -28,11 +28,11 @@ struct LoopCloner {
   LoopCloneInfo
   CloneLoop(Loop & L, ValueToValueMapTy & valueMap) {
     auto * loopPreHead = L.getLoopPreheader();
-    auto * preTerm = loopPreHead->getTerminator();
     auto & loopHead = *L.getHeader();
     auto * loopExiting = L.getExitingBlock();
     assert(loopPreHead && loopExiting && " can only clone single exit loops");
 
+    auto * preTerm = loopPreHead->getTerminator();
     auto * splitBranch = BranchInst::Create(&loopHead, &loopHead, ConstantInt::getTrue(loopHead.getContext()), loopPreHead);
 
     // clone all basic blocks

The root cause for the Chapel integration issue was that the frontend needed to call rv::addPreparatoryPasses in addition to rv::addOuterLoopVectorizer. I've fixed that now, but it would be nice if there were a clearer error here (say, one that mentioned the need to call addPreparatoryPasses). Additionally it would be nice if https://github.com/cdl-saarland/rv/wiki/How-to-use-RV%27s-outer-loop-vectorizer-in-your-LLVM-frontend included the need for rv::addPreparatoryPasses (and probably rv::addWholeFunctionVectorizer, rv::addOuterLoopVectorizer, rv::addCleanupPasses(PM); ).

It would also work just fine for me if RV provided registerRVPasses (or something like it) as an exported function. I'm just as happy to call addPreparatoryPasses etc.

build warnings to do with compiler-rt

I'm seeing some warnings that look like they're indicating programs that wouldn't run correctly:

In file included from llvm/tools/rv/vecmath/crt.c:27:
llvm/tools/rv/../../projects/compiler-rt/lib/builtins/divdc3.c:26:22: warning: 
      implicit declaration of function '__compiler_rt_logb' is invalid in C99
      [-Wimplicit-function-declaration]
    double __logbw = __compiler_rt_logb(crt_fmax(crt_fabs(__c), crt_fabs(__d)));
                     ^
In file included from llvm/tools/rv/vecmath/crt.c:28:
llvm/tools/rv/../../projects/compiler-rt/lib/builtins/divdf3.c:151:34: warning: 
      shift count >= width of type [-Wshift-count-overflow]
        residual = (aSignificand << 53) - quotient * bSignificand;
                                 ^  ~~
llvm/tools/rv/../../projects/compiler-rt/lib/builtins/divdf3.c:155:34: warning: 
      shift count >= width of type [-Wshift-count-overflow]
        residual = (aSignificand << 52) - quotient * bSignificand;
                                 ^  ~~
In file included from llvm/tools/rv/vecmath/crt.c:35:
llvm/tools/rv/../../projects/compiler-rt/lib/builtins/divtc3.c:27:9: warning: 
      implicit declaration of function '__compiler_rt_logbl' is invalid in C99
      [-Wimplicit-function-declaration]
        __compiler_rt_logbl(crt_fmaxl(crt_fabsl(__c), crt_fabsl(__d)));
        ^
In file included from llvm/tools/rv/vecmath/crt.c:113:
llvm/tools/rv/../../projects/compiler-rt/lib/builtins/subdf3.c:21:12: warning: 
      implicit declaration of function '__adddf3' is invalid in C99
      [-Wimplicit-function-declaration]
    return __adddf3(a, fromRep(toRep(b) ^ signBit));
           ^

vectorize math on-the-fly

When RV encounters a call to a declared math function (available in the SLEEF submodule) it should automatically vectorize the scalar implementation of that function (for the current vector width/arch; as long as there is no SIMD mapping in PlatformInfo).

cmake reports an error

We used LLVM 14.0.5, and an error occurred when the cmake invocation included the -DLLVM_RVPLUG_LINK_INTO_TOOLS=ON option: undefined reference to `rv::addRVPasses(llvm::PassBuilder&)'. What can we do about it?

cmake /home/liuwb/llvm-project-14.0.5.src -DCMAKE_BUILD_TYPE="Release" -DCMAKE_INSTALL_PREFIX="/home/liuwb/llvm14.05-rv-install" ../llvm  -DLLVM_EXTERNAL_PROJECTS="rv" -DLLVM_EXTERNAL_RV_SOURCE_DIR=/home/liuwb/llvm-project-14.0.5.src/llvm/projects/rv  -DRV_ENABLE_CRT=ON -DRV_DEBUG=ON -DLLVM_RVPLUG_LINK_INTO_TOOLS=ON -DLLVM_ENABLE_PROJECTS=clang

make
make install

Auto-repair attributes of RV intrinsics

This is a reminder to...

a) Implement a pass that sets the right LLVM function attributes for RV builtins (RVIntrinsics)

  • rv_load should be readonly (SIMD codegen crashes for at least one Rodent traversal variant (not all variants are enabled by default) because the DA incorrectly believes the rv_load result shape is varying).
  • It might help if rv_any is annotated convergent (this is a potential workaround to stop LLVM from "hoisting ifs" involving rv_any, which may render the BOSCC/cif gadget ineffective before it hits RV).

b) Add this pass at the earliest insertion point when used with Clang.

c) Make sure RV itself only generates correct/complete RV intrinsic declarations.

[VA] [VEC] [LIN] function pointers

I tried converting one of our samples (https://zivgitlab.uni-muenster.de/HPC2SE-Project/pacxx-samples/tree/master/pointer_function) to your test suite structure.

verify___.cpp:

#include <stdlib.h>
#include <stdio.h>
#include <iostream>

#include <cassert>
#include <random>

#include "launcherTools.h"

extern "C" void foo(int * threadId, int * b, int n);
extern "C" void foo_SIMD(int * threadId, int * b, int n);

int main(int argc, char ** argv) {
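	const unsigned numInputs = 8;

	int threadId[numInputs], expected_b[numInputs], b[numInputs];

	for (unsigned i = 0; i < numInputs; ++i) {
		threadId[i] = i;
		expected_b[i] = 0;
		b[i] = 0;
	}

	// Run the scalar reference and the vectorized version on the same input.
	foo(threadId, expected_b, numInputs);
	foo_SIMD(threadId, b, numInputs);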

	size_t hash = hashArray(expected_b, numInputs, 0);
	hash = hashArray(b, numInputs, hash);

	std::cerr << hash << "\n";
	return 0;
}

test___.cpp:

// Shapes: ?_?_?, LaunchCode: functionpointers

int mult(int x)
{
	return 7*x+1;
}

int dual(int x)
{
	return 2*x+1;
}

int beep(int x)
{
	return -x;
}

int sum(int x)
{
	return x+5;
}

extern "C" void
foo(int * threadId, int * b, int n)
{
	for (int i = 0; i < n; i++) {
		int (*funptr)(int);
		switch (threadId[i]%8)
		{
			case 0: funptr = &mult;
				break;
			case 1: funptr = &dual;
				break;
			case 2: funptr = &beep;
				break;
			case 3: funptr = &sum;
				break;
			case 4: funptr = &dual;
				break;
			case 5: funptr = &mult;
				break;
			case 6: funptr = &sum;
				break;
			case 7: funptr = &beep;
				break;
		}
		b[i] = funptr(threadId[i]);
	}
}

Both test_rv (I tried several shape values; output below)

-- End of Recurrence Analysis --
Extracting a scalar value from a vector:
Original Value:   %0 = load i32, i32* %arrayidx, align 4, !tbaa !2Vector Value:   %cont_load = load <8 x i32>, <8 x i32>* %vec_cast, align 4Extracting a scalar value from a vector:
Original Value:   %0 = load i32, i32* %arrayidx, align 4, !tbaa !2Vector Value:   %cont_load = load <8 x i32>, <8 x i32>* %vec_cast, align 4Extracting a scalar value from a vector:
Original Value:   %0 = load i32, i32* %arrayidx, align 4, !tbaa !2Vector Value:   %cont_load = load <8 x i32>, <8 x i32>* %vec_cast, align 4Extracting a scalar value from a vector:
Original Value:   %0 = load i32, i32* %arrayidx, align 4, !tbaa !2Vector Value:   %cont_load = load <8 x i32>, <8 x i32>* %vec_cast, align 4Extracting a scalar value from a vector:
Original Value:   %0 = load i32, i32* %arrayidx, align 4, !tbaa !2Vector Value:   %cont_load = load <8 x i32>, <8 x i32>* %vec_cast, align 4Extracting a scalar value from a vector:
Original Value:   %0 = load i32, i32* %arrayidx, align 4, !tbaa !2Vector Value:   %cont_load = load <8 x i32>, <8 x i32>* %vec_cast, align 4Extracting a scalar value from a vector:
Original Value:   %0 = load i32, i32* %arrayidx, align 4, !tbaa !2Vector Value:   %cont_load = load <8 x i32>, <8 x i32>* %vec_cast, align 4Extracting a scalar value from a vector:
Original Value:   %0 = load i32, i32* %arrayidx, align 4, !tbaa !2Vector Value:   %cont_load = load <8 x i32>, <8 x i32>* %vec_cast, align 4Extracting a scalar value from a vector:
Original Value:   %call = tail call i32 %switch.load.R.b(i32 %0)Vector Value:   %scalarized21 = insertelement <8 x i32> %scalarized18, i32 %call20, i64 7loopHead: 0: shape unired: 
loopHead: 0: shape varyingred: 
rvTool: %%%/llvm/include/llvm/IR/DataLayout.h:531: uint64_t llvm::DataLayout::getTypeSizeInBits(llvm::Type*) const: Assertion `Ty->isSized() && "Cannot getTypeInfo() on a type that is unsized!"' failed.

Program received signal SIGABRT, Aborted.
0x00007ffff5939428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007ffff5939428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007ffff593b02a in __GI_abort () at abort.c:89
#2  0x00007ffff5931bd7 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x7ffff79d54d0 "Ty->isSized() && \"Cannot getTypeInfo() on a type that is unsized!\"", 
    file=file@entry=0x7ffff79d5490 "%%%/llvm/include/llvm/IR/DataLayout.h", line=line@entry=531, 
    function=function@entry=0x7ffff79d6ee0 <llvm::DataLayout::getTypeSizeInBits(llvm::Type*) const::__PRETTY_FUNCTION__> "uint64_t llvm::DataLayout::getTypeSizeInBits(llvm::Type*) const") at assert.c:92
#3  0x00007ffff5931c82 in __GI___assert_fail (assertion=0x7ffff79d54d0 "Ty->isSized() && \"Cannot getTypeInfo() on a type that is unsized!\"", 
    file=0x7ffff79d5490 "%%%/llvm/include/llvm/IR/DataLayout.h", line=531, 
    function=0x7ffff79d6ee0 <llvm::DataLayout::getTypeSizeInBits(llvm::Type*) const::__PRETTY_FUNCTION__> "uint64_t llvm::DataLayout::getTypeSizeInBits(llvm::Type*) const") at assert.c:101
#4  0x00007ffff7913c41 in llvm::DataLayout::getTypeSizeInBits(llvm::Type*) const () from %%%/lib/libRV.so
#5  0x00007ffff792a989 in rv::NatBuilder::widenScalar(llvm::Value&, rv::VectorShape) () from %%%/lib/libRV.so
#6  0x00007ffff7934e22 in rv::NatBuilder::requestVectorValue(llvm::Value*) () from %%%/lib/libRV.so
#7  0x00007ffff793c5b8 in rv::NatBuilder::addValuesToPHINodes() () from %%%/lib/libRV.so
#8  0x00007ffff793deb3 in rv::NatBuilder::vectorize(bool, llvm::ValueMap<llvm::Value const*, llvm::WeakTrackingVH, llvm::ValueMapConfig<llvm::Value const*, llvm::sys::SmartMutex<false> > >*) ()
   from %%%/lib/libRV.so

and our own pipeline (output below) crash in vectorize.

-- End of Recurrence Analysis --
functionpointers: %%%/llvm/include/llvm/IR/DataLayout.h:531: uint64_t llvm::DataLayout::getTypeSizeInBits(llvm::Type*) const: Assertion `Ty->isSized() && "Cannot getTypeInfo() on a type that is unsized!"' failed.

Program received signal SIGABRT, Aborted.
0x00007fffe0af6428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007fffe0af6428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007fffe0af802a in __GI_abort () at abort.c:89
#2  0x00007fffe0aeebd7 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x7ffff74894d0 "Ty->isSized() && \"Cannot getTypeInfo() on a type that is unsized!\"", 
    file=file@entry=0x7ffff7489490 "%%%/llvm/include/llvm/IR/DataLayout.h", line=line@entry=531, 
    function=function@entry=0x7ffff748aee0 <llvm::DataLayout::getTypeSizeInBits(llvm::Type*) const::__PRETTY_FUNCTION__> "uint64_t llvm::DataLayout::getTypeSizeInBits(llvm::Type*) const") at assert.c:92
#3  0x00007fffe0aeec82 in __GI___assert_fail (assertion=0x7ffff74894d0 "Ty->isSized() && \"Cannot getTypeInfo() on a type that is unsized!\"", 
    file=0x7ffff7489490 "%%%/llvm/include/llvm/IR/DataLayout.h", line=531, 
    function=0x7ffff748aee0 <llvm::DataLayout::getTypeSizeInBits(llvm::Type*) const::__PRETTY_FUNCTION__> "uint64_t llvm::DataLayout::getTypeSizeInBits(llvm::Type*) const") at assert.c:101
#4  0x00007ffff73c7c41 in llvm::DataLayout::getTypeSizeInBits(llvm::Type*) const () from %%%/lib/libRV.so
#5  0x00007ffff73de989 in rv::NatBuilder::widenScalar(llvm::Value&, rv::VectorShape) () from %%%/lib/libRV.so
#6  0x00007ffff73e8e22 in rv::NatBuilder::requestVectorValue(llvm::Value*) () from %%%/lib/libRV.so
#7  0x00007ffff73ec564 in rv::NatBuilder::mapOperandsInto(llvm::Instruction*, llvm::Instruction*, bool, unsigned int) () from %%%/lib/libRV.so
#8  0x00007ffff73ec9d2 in rv::NatBuilder::vectorizeInstruction(llvm::Instruction*) () from %%%/lib/libRV.so
#9  0x00007ffff73f0e5b in rv::NatBuilder::vectorize(llvm::BasicBlock*, llvm::BasicBlock*) () from %%%/lib/libRV.so
#10 0x00007ffff73f27b1 in rv::NatBuilder::vectorize(bool, llvm::ValueMap<llvm::Value const*, llvm::WeakTrackingVH, llvm::ValueMapConfig<llvm::Value const*, llvm::sys::SmartMutex<false> > >*) ()
   from %%%/lib/libRV.so

Is it possible to add support for function pointers to RV?
Or at least a hint on where to look so I could work on a PR.

CMake error on Windows for utility target

Configuring AnyDSL on Windows with LLVM 16 and the RV 16.x branch fails because target_link_libraries cannot be applied to the utility target RVPLUG.

target_link_libraries(RVPLUG PRIVATE RV)

Commenting out the line lets configuration proceed as usual.
The framework even works without it. However, this was only tested with Ignis (Artic and JIT).

Make RV work with the new PM

Clang is poised to switch to the new (non-legacy) PM for LLVM 10 (http://lists.llvm.org/pipermail/llvm-dev/2019-August/134326.html).
Make sure RV works fine with the new PM.

Deprecation of the develop branch

Please switch to RV master if you have been following the develop branch.
I will remove the develop branch on Friday.

All development of RV based on LLVM trunk will occur on master. We will cherry-pick / merge these patches into the release_X branches for release versions X of LLVM as required.

DFGBase specializations in different namespaces

I recently updated RV to the latest commit and the following errors occurred:

llvm/tools/rv/src/analysis/DFG.cpp:11:53: error: specialization of ‘template<bool backward> bool llvm::DFGBaseWrapper<backward>::runOnFunction(llvm::Function&)’ in different namespace [-fpermissive]
 bool DFGBaseWrapper<true>::runOnFunction(Function& F)

...

I reverted our custom port to LLVM 6.0 since I got a linker error and the original code is now compatible with the LLVM 6.0 API.

The compile error can be fixed in DFG.cpp by pulling everything into the llvm namespace. However, I was wondering whether DFGBase should really live in the llvm namespace.
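
For reference, a minimal sketch of the namespace fix described above. The names lib, Wrapper, direction, and step are hypothetical stand-ins, not RV's actual DFG code. The point is only that the explicit specializations of a member function are defined inside the namespace that owns the template, instead of at global scope behind a using-directive.

```cpp
#include <cassert>

namespace lib {
// Primary template: the member function is declared here and
// defined separately for each specialization.
template <bool Backward>
struct Wrapper {
  int direction() const;
};
} // namespace lib

// Reopen the template's own namespace for the explicit specializations.
// Defining them at global scope with only `using namespace lib;` in effect
// is what triggers "specialization ... in different namespace" on GCC.
namespace lib {
template <>
int Wrapper<true>::direction() const { return -1; }

template <>
int Wrapper<false>::direction() const { return 1; }
} // namespace lib

// Small helper that exercises both specializations.
inline int step(int pos, bool backward) {
  const int d = backward ? lib::Wrapper<true>().direction()
                         : lib::Wrapper<false>().direction();
  assert(d == 1 || d == -1);
  return pos + d;
}
```

The same layout would work with any other namespace, so this fix does not settle whether DFGBase belongs in llvm.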

Check for compiler-rt sources when RV_ENABLE_CRT=on

RV's cmake files should verify that compiler-rt source files are present in llvm/projects/compiler-rt when the cmake build is configured with the option -DRV_ENABLE_CRT=on.

Otherwise, the build will crash while trying to compile non-existent compiler-rt source files into bitcode.

Implement atomicrmw CodeGen

RV does not support LLVM atomicrmw yet (https://llvm.org/docs/LangRef.html#atomicrmw-instruction). Currently, RV's lane threads aren't considered concurrent threads in terms of the LLVM execution model and so atomicrmw remains scalar.

What needs to change: The result of atomicrmw is always a varying value. Otherwise, this is mostly an RV codegen issue (NatBuilder.cpp).

When the backend vectorizes an atomic instruction, it should apply the operator of the atomic (add, umin, umax, xor, ..) to reduce the value vector into a scalar value and emit just one atomicrmw with the reduced value.

Two things are tricky about atomicrmw:

  • Fairness - who "wins" in a vector xchg? RV does not give any (lane)thread fairness or even liveness guarantees.
  • The result (vector) value - what will be the return vector value? The backend will need to emit a prefix-sum like operation over the reduced vector to simulate the incrementally updated value for each lane.
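
The reduce-then-prefix-sum scheme can be emulated in plain C++ for the add case. This is a sketch under assumptions, not RV's codegen: vectorAtomicAdd is a hypothetical name, std::atomic stands in for the memory location, all lanes are assumed active, and the lanes are ordered by lane index, which is one arbitrary answer to the fairness question above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates a vectorized `atomicrmw add` with a single scalar atomic.
// Returns, per lane, the value that lane would have observed if the lanes
// had executed their atomics one after another in lane order (assumption).
std::vector<int> vectorAtomicAdd(std::atomic<int> &target,
                                 const std::vector<int> &values) {
  assert(!values.empty() && "need at least one lane");

  // Exclusive prefix sum: prefix[lane] is the sum of all earlier lanes.
  std::vector<int> prefix(values.size(), 0);
  int total = 0;
  for (std::size_t lane = 0; lane < values.size(); ++lane) {
    prefix[lane] = total;
    total += values[lane];
  }

  // The only atomic operation: add the reduced value once.
  const int old = target.fetch_add(total);

  // Reconstruct the incrementally updated value seen by each lane.
  std::vector<int> result(values.size());
  for (std::size_t lane = 0; lane < values.size(); ++lane)
    result[lane] = old + prefix[lane];
  return result;
}
```

Operators such as umin or xor would need their own reduction and scan, and xchg has no meaningful reduction at all, which is where the fairness issue becomes visible.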

Generation of erroneous LLVM IR code from Impala

The compilation of the following Impala code (https://github.com/AnyDSL/impala) using the RV frontend implemented in https://github.com/AnyDSL/runtime leads to invalid LLVM IR code.
As a result, clang aborts the compilation with the following error message:

SplitVectorResult #0: t72: v4f64 = llvm.x86.avx.cmp.pd.256 TargetConstant:i64<4434>, t64, t70, Constant:i8<1>
fatal error: error in backend: Do not know how to split the result of this operator!
clang-4.0: error: clang frontend command failed with exit code 70 (use -v to see invocation)
clang version 4.0.1 (tags/RELEASE_401/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix

The compilation was performed using the following software:
RV - branch: release_40, commit: cca9052
Impala - branch: master, commit: 16c3ff7d21b3e5fd32aeebbe44d9abf0e638e105
AnyDSL/runtime - branch: master, commit: 652be09a92e565d68b4079e83d8fb9c4596dd8b2
LLVM/Clang - version 4.0.1
AnyDSL has been built in debug mode. In release mode, the same error occurs.
Operating System: Ubuntu 16.04.3 LTS
Hardware: Intel(R) Xeon(R) CPU E3-1275 v5 @ 3.60GHz

Note: The vectorize function is called within the function "compute_forces".

type real_t = f64;
static EPSILON = 1e-9 as real_t;

static mut grid_ : Grid;

struct Vector {
    x: real_t,
    y: real_t,
    z: real_t
}

struct Grid {
    aabb: AABB,
    nx: i32,
    ny: i32,
    spacing: real_t,
    cells: Buffer,
    nparticles: i32
}

struct Cell {
    masses: Buffer,
    positions: Buffer,
    velocities: Buffer,
    forces: Buffer,
    size: i32,
    padding: i32,
    capacity: i32,
    nclusters: i32,
    clusters: Buffer,
    cluster_size: i32
}

struct AABB {
    min: [real_t * 3],
    max: [real_t * 3]
}

struct Cluster {
    neighbor_list: NeighborList,
    aabb: AABB
}

struct NeighborList {
    size: i32,
    capacity: i32,
    cells: Buffer,
    indices: Buffer
}

fn @get_vector_length() -> i32 { 4 }
fn @get_alignment() -> i32 { 32 }
fn @get_thread_count() -> i32 { 4 }

fn @outer_loop(lower: i32, upper: i32, body: fn(i32) -> ()) -> () {
    for i in parallel(get_thread_count(), lower, upper) {
        @@body(i);
    }
}

fn @inner_loop(lower: i32, upper: i32, body: fn(i32) -> ()) -> () {
    range(lower, upper, body)
}

fn @get_vector(i: i32, buf: Buffer) -> Vector {
    bitcast[&[Vector]](buf.data)(i)
}

fn @add_to_vector(i: i32, buf: Buffer, x: real_t, y: real_t, z: real_t) -> () { 
    bitcast[&mut[Vector]](buf.data)(i).x += x;
    bitcast[&mut[Vector]](buf.data)(i).y += y;
    bitcast[&mut[Vector]](buf.data)(i).z += z;
}

fn @sub_from_vector(i: i32, buf: Buffer, x: real_t, y: real_t, z: real_t) -> () { 
    bitcast[&mut[Vector]](buf.data)(i).x -= x;
    bitcast[&mut[Vector]](buf.data)(i).y -= y;
    bitcast[&mut[Vector]](buf.data)(i).z -= z;
}

fn @get_i32(i: i32, buf: Buffer) -> i32 {
    bitcast[&[i32]](buf.data)(i)
}

fn @get_cell_pointer(i: i32, buf: Buffer) -> &Cell {
    bitcast[&[&Cell]](buf.data)(i)
}

fn @min_i32(a: i32, b: i32) -> i32 {
    if(a < b) {a} else {b}
}

fn @get_array_of_cells(buf: Buffer) -> &mut[Cell] {
    bitcast[&mut[Cell]](buf.data)
}

fn @get_array_of_clusters(buf: Buffer) -> &mut[Cluster] {
    bitcast[&mut[Cluster]](buf.data)
}


fn atomic_op_f64(a: &mut f64, b: f64, op: fn(f64, f64) -> f64) -> f64 { 
    let addr_as_ui  = a as &mut u64;
    let mut done = false;
    let mut value : u64;
    while(!done) {
        value = *addr_as_ui;
        done = cmpxchg(addr_as_ui, value, bitcast[u64](op(bitcast[f64](value), b)))(1);
    }
    bitcast[f64](op(bitcast[f64](value),b))
}

fn atomic_add_to_vector(i: i32, buf: Buffer, x: real_t, y: real_t, z: real_t) -> () {
    atomic_op_f64(&mut bitcast[&mut[Vector]](buf.data)(i).x, x, |a,b|{a+b}); 
    atomic_op_f64(&mut bitcast[&mut[Vector]](buf.data)(i).y, y, |a,b|{a+b}); 
    atomic_op_f64(&mut bitcast[&mut[Vector]](buf.data)(i).z, z, |a,b|{a+b}); 
}

fn atomic_sub_from_vector(i: i32, buf: Buffer, x: real_t, y: real_t, z: real_t) -> () {
    atomic_op_f64(&mut bitcast[&mut[Vector]](buf.data)(i).x, x, |a,b|{a-b}); 
    atomic_op_f64(&mut bitcast[&mut[Vector]](buf.data)(i).y, y, |a,b|{a-b}); 
    atomic_op_f64(&mut bitcast[&mut[Vector]](buf.data)(i).z, z, |a,b|{a-b}); 
}

fn create_potential(sigma: real_t, epsilon: real_t) -> fn(real_t) -> real_t {
    let square = |x : real_t| {x*x};
    let cube = |x: real_t| {x*x*x};
    let sigma6 = square(cube(sigma));
    let tmp1 = 24.0 as real_t * epsilon * sigma6;
    let tmp2 = 2.0 as real_t * sigma6;
    | squared_distance : real_t | {
        let distance_8_inv = 1.0 as real_t / square(square(squared_distance));
        tmp1 * distance_8_inv * (1.0 as real_t - squared_distance * distance_8_inv * tmp2)
    }
}

fn compute_forces(cluster: &Cluster, cluster_index: i32, cell: &Cell, grid: &Grid, squared_cutoff_distance: real_t, potential: fn(real_t) -> real_t) -> () {
    let neighbor_list = cluster.neighbor_list;
    let begin = cluster_index * cell.cluster_size;
    let end = min_i32(begin + cell.cluster_size, cell.size);
    for i in vectorize(4, 32, begin, end) {
        for j in unroll(i + 1, end) {
            compute_pairwise_forces(i, j, cell, cell, squared_cutoff_distance, potential);
        }
    }
    for i in range(0, neighbor_list.size) {
        let neighboring_cell = get_cell_pointer(i, neighbor_list.cells);
        let neighboring_cluster_index = get_i32(i, neighbor_list.indices);
        let begin_neighbor = neighboring_cluster_index * neighboring_cell.cluster_size;
        let end_neighbor = min_i32(begin_neighbor + neighboring_cell.cluster_size, neighboring_cell.size);
        for i in vectorize(4, 32, begin, end) {
            for j in unroll(begin_neighbor, end_neighbor) {
                compute_pairwise_forces(i, j, cell, neighboring_cell, squared_cutoff_distance, potential);
            }
        }
    }
}

fn @compute_pairwise_forces(i: i32, j: i32, cell: &Cell, neighboring_cell: &Cell, squared_cutoff_distance: real_t, potential: fn(real_t) -> real_t) -> () {
    let position = get_vector(i, cell.positions);
    let neighbor_position = get_vector(j, neighboring_cell.positions);
    let dx = neighbor_position.x - position.x;
    let dy = neighbor_position.y - position.y;
    let dz = neighbor_position.z - position.z;
    let squared_distance = dx * dx + dy * dy + dz * dz;
    if(squared_distance < squared_cutoff_distance) {
        let f = potential(squared_distance);
        let dF_x = f * dx;
        let dF_y = f * dy;
        let dF_z = f * dz;
        atomic_add_to_vector(i, cell.forces, dF_x, dF_y, dF_z);
        atomic_sub_from_vector(j, neighboring_cell.forces, dF_x, dF_y, dF_z);
    }
}

fn map_over_grid(grid: &Grid, iterate_outer: fn(i32, i32, fn(i32) -> ()) -> (), iterate_inner: fn(i32, i32, fn(i32) -> ()) -> (), f: fn(&mut Cell, [i32 * 2]) -> ()) -> () {
    let cells = get_array_of_cells((*grid).cells);
    for i in iterate_outer(0, grid.nx) {
        for j in iterate_inner(0, grid.ny) {
            let cell_index = [i,j];
            f(&mut cells(flatten_index(cell_index, grid)), cell_index, continue)
        }
    }
}

fn @flatten_index(cell_index: [i32 * 2], grid: &Grid) -> i32 {
    cell_index(0) * grid.ny + cell_index(1)
}

fn cell_compute_forces(cell: &Cell, grid: &Grid, squared_cutoff_distance: real_t, potential: fn(real_t) -> real_t) -> () {
    let clusters = get_array_of_clusters(cell.clusters);
    for i in range(0, cell.nclusters) {
        compute_forces(clusters(i), i, cell, grid, squared_cutoff_distance, potential);
    }
}
 
fn grid_compute_forces(grid: &Grid, squared_cutoff_distance: real_t, potential: fn(real_t) -> real_t, outer_loop: fn(i32, i32, fn(i32) -> ()) -> (), inner_loop: fn(i32, i32, fn(i32) -> ()) -> ()) -> () {
    for cell, cell_index in map_over_grid(grid, outer_loop, inner_loop) {
        cell_compute_forces(cell, grid, squared_cutoff_distance, potential);
    }
}


extern
fn cpu_compute_forces(cutoff_distance: real_t, epsilon: real_t, sigma: real_t) -> () {
    let potential = create_potential(sigma, epsilon);
    grid_compute_forces(grid_, cutoff_distance*cutoff_distance, potential, outer_loop, inner_loop);
}

The following erroneous LLVM IR code is generated:

; ModuleID = 'force_computation'
source_filename = "force_computation"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

%0 = type { %1, i32, i32, double, %2, i32 }
%1 = type { [3 x double], [3 x double] }
%2 = type { i32, [0 x i8]* }
%3 = type { %2, %2, %2, %2, i32, i32, i32, i32, %2, i32 }
%4 = type { %5, %1 }
%5 = type { i32, i32, %2, %2 }
%6 = type { double, double, double }

@grid_ = local_unnamed_addr global %0 zeroinitializer

define void @cpu_compute_forces(double %cutoff_distance_32561, double %epsilon_32562, double %sigma_32563) local_unnamed_addr {
cpu_compute_forces_start:
  %parallel_closure = alloca { %2, double, double, double }, align 8
  %0 = load %2, %2* getelementptr inbounds (%0, %0* @grid_, i64 0, i32 4), align 16
  %1 = fmul double %sigma_32563, %sigma_32563
  %2 = fmul double %1, %sigma_32563
  %3 = fmul double %2, %2
  %4 = load i32, i32* getelementptr inbounds (%0, %0* @grid_, i64 0, i32 1), align 16
  %.fca.0.extract = extractvalue %2 %0, 0
  %.fca.0.gep = getelementptr inbounds { %2, double, double, double }, { %2, double, double, double }* %parallel_closure, i64 0, i32 0, i32 0
  store i32 %.fca.0.extract, i32* %.fca.0.gep, align 8
  %.fca.1.extract = extractvalue %2 %0, 1
  %.fca.1.gep = getelementptr inbounds { %2, double, double, double }, { %2, double, double, double }* %parallel_closure, i64 0, i32 0, i32 1
  store [0 x i8]* %.fca.1.extract, [0 x i8]** %.fca.1.gep, align 8
  %parallel_closure.repack1 = getelementptr inbounds { %2, double, double, double }, { %2, double, double, double }* %parallel_closure, i64 0, i32 1
  store double %epsilon_32562, double* %parallel_closure.repack1, align 8
  %parallel_closure.repack3 = getelementptr inbounds { %2, double, double, double }, { %2, double, double, double }* %parallel_closure, i64 0, i32 2
  store double %cutoff_distance_32561, double* %parallel_closure.repack3, align 8
  %parallel_closure.repack5 = getelementptr inbounds { %2, double, double, double }, { %2, double, double, double }* %parallel_closure, i64 0, i32 3
  store double %3, double* %parallel_closure.repack5, align 8
  %5 = bitcast { %2, double, double, double }* %parallel_closure to i8*
  call void @anydsl_parallel_for(i32 4, i32 0, i32 %4, i8* nonnull %5, i8* bitcast (void (i8*, i32, i32)* @lambda_32590_parallel_for to i8*))
  ret void
}

; Function Attrs: nounwind
define void @lambda_32590_parallel_for(i8* nocapture readonly, i32, i32) #0 {
lambda_32590_parallel_for:
  %3 = getelementptr inbounds i8, i8* %0, i64 8
  %4 = bitcast i8* %3 to [0 x %3]**
  %5 = load [0 x %3]*, [0 x %3]** %4, align 8
  %.elt3 = getelementptr inbounds i8, i8* %0, i64 24
  %6 = bitcast i8* %.elt3 to double*
  %.unpack4 = load double, double* %6, align 8
  %.elt5 = getelementptr inbounds i8, i8* %0, i64 32
  %7 = bitcast i8* %.elt5 to double*
  %.unpack6 = load double, double* %7, align 8
  %8 = icmp slt i32 %1, %2
  br i1 %8, label %body.lr.ph, label %exit

body.lr.ph:                                       ; preds = %lambda_32590_parallel_for
  %.elt1 = getelementptr inbounds i8, i8* %0, i64 16
  %9 = bitcast i8* %.elt1 to double*
  %.unpack2 = load double, double* %9, align 8
  %10 = fmul double %.unpack2, 2.400000e+01
  %11 = fmul double %.unpack4, %.unpack4
  %tmp2.i24.i = fmul double %.unpack6, 2.000000e+00
  %tmp1.i25.i = fmul double %10, %.unpack6
  %.splatinsert7.i.i = insertelement <4 x double> undef, double %11, i32 0
  %.splat8.i.i = shufflevector <4 x double> %.splatinsert7.i.i, <4 x double> undef, <4 x i32> zeroinitializer
  %.splatinsert9.i.i = insertelement <4 x double> undef, double %tmp1.i25.i, i32 0
  %.splat10.i.i = shufflevector <4 x double> %.splatinsert9.i.i, <4 x double> undef, <4 x i32> zeroinitializer
  %.splatinsert11.i.i = insertelement <4 x double> undef, double %tmp2.i24.i, i32 0
  %.splat12.i.i = shufflevector <4 x double> %.splatinsert11.i.i, <4 x double> undef, <4 x i32> zeroinitializer
  br label %body

body:                                             ; preds = %lambda_32590.exit, %body.lr.ph
  %parallel_loop_phi23 = phi i32 [ %1, %body.lr.ph ], [ %630, %lambda_32590.exit ]
  %12 = load i32, i32* getelementptr inbounds (%0, %0* @grid_, i64 0, i32 2), align 4
  %13 = icmp sgt i32 %12, 0
  br i1 %13, label %if_then.i.preheader, label %lambda_32590.exit

if_then.i.preheader:                              ; preds = %body
  br label %if_then.i

if_then.i:                                        ; preds = %if_then.i.preheader, %if_else4.i.if_then.i_crit_edge
  %14 = phi i32 [ %.pre, %if_else4.i.if_then.i_crit_edge ], [ %12, %if_then.i.preheader ]
  %lower.i22 = phi i32 [ %30, %if_else4.i.if_then.i_crit_edge ], [ 0, %if_then.i.preheader ]
  %15 = mul nsw i32 %14, %parallel_loop_phi23
  %16 = add nsw i32 %15, %lower.i22
  %17 = sext i32 %16 to i64
  %18 = getelementptr inbounds [0 x %3], [0 x %3]* %5, i64 0, i64 %17, i32 7
  %19 = getelementptr inbounds [0 x %3], [0 x %3]* %5, i64 0, i64 %17, i32 4
  %20 = getelementptr inbounds [0 x %3], [0 x %3]* %5, i64 0, i64 %17, i32 9
  %21 = getelementptr inbounds [0 x %3], [0 x %3]* %5, i64 0, i64 %17, i32 8, i32 1
  %22 = bitcast [0 x i8]** %21 to [0 x %4]**
  %23 = load [0 x %4]*, [0 x %4]** %22, align 8
  %24 = load i32, i32* %18, align 4
  %25 = icmp sgt i32 %24, 0
  br i1 %25, label %if_then5.i.lr.ph, label %if_else4.i

if_then5.i.lr.ph:                                 ; preds = %if_then.i
  %26 = getelementptr inbounds [0 x %3], [0 x %3]* %5, i64 0, i64 %17, i32 1, i32 1
  %27 = bitcast [0 x i8]** %26 to [0 x %6]**
  %28 = getelementptr inbounds [0 x %3], [0 x %3]* %5, i64 0, i64 %17, i32 3, i32 1
  %29 = bitcast [0 x i8]** %28 to [0 x %6]**
  %wide.trip.count44 = zext i32 %24 to i64
  br label %if_then5.i

if_else4.i.loopexit:                              ; preds = %if_else11.i
  br label %if_else4.i

if_else4.i:                                       ; preds = %if_else4.i.loopexit, %if_then.i
  %30 = add nuw nsw i32 %lower.i22, 1
  %exitcond46 = icmp eq i32 %30, %12
  br i1 %exitcond46, label %lambda_32590.exit.loopexit, label %if_else4.i.if_then.i_crit_edge

if_else4.i.if_then.i_crit_edge:                   ; preds = %if_else4.i
  %.pre = load i32, i32* getelementptr inbounds (%0, %0* @grid_, i64 0, i32 2), align 4
  br label %if_then.i

if_then5.i:                                       ; preds = %if_else11.i, %if_then5.i.lr.ph
  %indvars.iv42 = phi i64 [ 0, %if_then5.i.lr.ph ], [ %indvars.iv.next43, %if_else11.i ]
  %indvars.iv = phi i32 [ 0, %if_then5.i.lr.ph ], [ %indvars.iv.next, %if_else11.i ]
  %31 = load i32, i32* %20, align 4
  %32 = trunc i64 %indvars.iv42 to i32
  %begin.i = mul nsw i32 %31, %32
  %33 = load i32, i32* %19, align 4
  %34 = add nsw i32 %begin.i, %31
  %35 = icmp slt i32 %34, %33
  %. = select i1 %35, i32 %34, i32 %33
  %36 = icmp slt i32 %begin.i, %.
  br i1 %36, label %body.i.lr.ph, label %exit.i

body.i.lr.ph:                                     ; preds = %if_then5.i
  %.splatinsert1.i.i = insertelement <4 x i32> undef, i32 %., i32 0
  %.splat2.i.i = shufflevector <4 x i32> %.splatinsert1.i.i, <4 x i32> undef, <4 x i32> zeroinitializer
  %37 = mul i32 %31, %indvars.iv
  br label %body.i

if_else11.i.loopexit:                             ; preds = %exit19.i
  br label %if_else11.i

if_else11.i:                                      ; preds = %if_else11.i.loopexit, %exit.i
  %indvars.iv.next43 = add nuw nsw i64 %indvars.iv42, 1
  %indvars.iv.next = add nuw nsw i32 %indvars.iv, 1
  %exitcond45 = icmp eq i64 %indvars.iv.next43, %wide.trip.count44
  br i1 %exitcond45, label %if_else4.i.loopexit, label %if_then5.i

if_then12.i:                                      ; preds = %exit19.i, %if_then12.i.lr.ph
  %indvars.iv40 = phi i64 [ 0, %if_then12.i.lr.ph ], [ %indvars.iv.next41, %exit19.i ]
  %38 = load [0 x %3*]*, [0 x %3*]** %368, align 8
  %39 = getelementptr inbounds [0 x %3*], [0 x %3*]* %38, i64 0, i64 %indvars.iv40
  %40 = load %3*, %3** %39, align 8
  %41 = getelementptr inbounds %3, %3* %40, i64 0, i32 4
  %42 = getelementptr inbounds %3, %3* %40, i64 0, i32 9
  %43 = load [0 x i32]*, [0 x i32]** %370, align 8
  %44 = getelementptr inbounds [0 x i32], [0 x i32]* %43, i64 0, i64 %indvars.iv40
  %45 = load i32, i32* %44, align 4
  %46 = load i32, i32* %42, align 4
  %begin_neighbor.i = mul nsw i32 %46, %45
  %47 = load i32, i32* %41, align 4
  %48 = add nsw i32 %begin_neighbor.i, %46
  %49 = icmp slt i32 %48, %47
  %.7 = select i1 %49, i32 %48, i32 %47
  br i1 %36, label %body18.i.lr.ph, label %exit19.i

body18.i.lr.ph:                                   ; preds = %if_then12.i
  %50 = icmp slt i32 %begin_neighbor.i, %.7
  %51 = getelementptr inbounds %3, %3* %40, i64 0, i32 1, i32 1
  %52 = bitcast [0 x i8]** %51 to [0 x %6]**
  %53 = getelementptr inbounds %3, %3* %40, i64 0, i32 3, i32 1
  %54 = bitcast [0 x i8]** %53 to [0 x %6]**
  %55 = sext i32 %begin_neighbor.i to i64
  %56 = sext i32 %.7 to i64
  br label %body18.i

body.i:                                           ; preds = %body.i.lr.ph, %lambda_32651_vectorize.exit.i
  %indvars.iv34 = phi i32 [ %37, %body.i.lr.ph ], [ %indvars.iv.next35, %lambda_32651_vectorize.exit.i ]
  %57 = sext i32 %indvars.iv34 to i64
  %.splatinsert3.i.i = insertelement <4 x i32> undef, i32 %indvars.iv34, i32 0
  %.splat4.i.i = shufflevector <4 x i32> %.splatinsert3.i.i, <4 x i32> undef, <4 x i32> zeroinitializer
  %contiguous_add5.i.i = add <4 x i32> %.splat4.i.i, <i32 0, i32 1, i32 2, i32 3>
  %58 = sext <4 x i32> %contiguous_add5.i.i to <4 x i64>
  %i_32653_lane1.i.i = add i32 %indvars.iv34, 1
  %59 = sext i32 %i_32653_lane1.i.i to i64
  %i_32653_lane2.i.i = add i32 %indvars.iv34, 2
  %60 = sext i32 %i_32653_lane2.i.i to i64
  %i_32653_lane3.i.i = add i32 %indvars.iv34, 3
  %61 = sext i32 %i_32653_lane3.i.i to i64
  br label %may_unroll_step.rv.i.i

may_unroll_step.rv.i.i:                           ; preds = %while_head29.divexit.rv.i.i, %body.i
  %indvars.iv36 = phi i64 [ %indvars.iv.next37, %while_head29.divexit.rv.i.i ], [ %57, %body.i ]
  %62 = phi <4 x i64> [ %66, %while_head29.divexit.rv.i.i ], [ <i64 -1, i64 -1, i64 -1, i64 -1>, %body.i ]
  %63 = phi <4 x i32> [ %67, %while_head29.divexit.rv.i.i ], [ <i32 -1, i32 -1, i32 -1, i32 -1>, %body.i ]
  %indvars.iv.next37 = add i64 %indvars.iv36, 1
  %64 = trunc i64 %indvars.iv.next37 to i32
  %.splatinsert.i.i = insertelement <4 x i32> undef, i32 %64, i32 0
  %.splat.i.i = shufflevector <4 x i32> %.splatinsert.i.i, <4 x i32> undef, <4 x i32> zeroinitializer
  %contiguous_add.i.i = add <4 x i32> %.splat.i.i, <i32 0, i32 1, i32 2, i32 3>
  %65 = icmp slt <4 x i32> %contiguous_add.i.i, %.splat2.i.i
  %66 = select <4 x i1> %65, <4 x i64> %62, <4 x i64> zeroinitializer
  %67 = select <4 x i1> %65, <4 x i32> %63, <4 x i32> zeroinitializer
  %68 = load [0 x %6]*, [0 x %6]** %27, align 8
  %srov_gep.i.i = getelementptr [0 x %6], [0 x %6]* %68, <4 x i64> zeroinitializer, <4 x i64> %58, i32 0
  %69 = icmp ne <4 x i32> %67, zeroinitializer
  %70 = tail call <4 x double> @llvm.masked.gather.v4f64(<4 x double*> %srov_gep.i.i, i32 1, <4 x i1> %69, <4 x double> undef)
  %srov_gep70.i.i = getelementptr [0 x %6], [0 x %6]* %68, <4 x i64> zeroinitializer, <4 x i64> %58, i32 1
  %71 = tail call <4 x double> @llvm.masked.gather.v4f64(<4 x double*> %srov_gep70.i.i, i32 1, <4 x i1> %69, <4 x double> undef)
  %srov_gep71.i.i = getelementptr [0 x %6], [0 x %6]* %68, <4 x i64> zeroinitializer, <4 x i64> %58, i32 2
  %72 = tail call <4 x double> @llvm.masked.gather.v4f64(<4 x double*> %srov_gep71.i.i, i32 1, <4 x i1> %69, <4 x double> undef)
  %73 = sext <4 x i32> %contiguous_add.i.i to <4 x i64>
  %srov_gep72.i.i = getelementptr [0 x %6], [0 x %6]* %68, <4 x i64> zeroinitializer, <4 x i64> %73, i32 0
  %74 = tail call <4 x double> @llvm.masked.gather.v4f64(<4 x double*> %srov_gep72.i.i, i32 1, <4 x i1> %69, <4 x double> undef)
  %srov_gep73.i.i = getelementptr [0 x %6], [0 x %6]* %68, <4 x i64> zeroinitializer, <4 x i64> %73, i32 1
  %75 = tail call <4 x double> @llvm.masked.gather.v4f64(<4 x double*> %srov_gep73.i.i, i32 1, <4 x i1> %69, <4 x double> undef)
  %srov_gep74.i.i = getelementptr [0 x %6], [0 x %6]* %68, <4 x i64> zeroinitializer, <4 x i64> %73, i32 2
  %76 = tail call <4 x double> @llvm.masked.gather.v4f64(<4 x double*> %srov_gep74.i.i, i32 1, <4 x i1> %69, <4 x double> undef)
  %dz_SIMD.i.i = fsub <4 x double> %76, %72
  %dx_SIMD.i.i = fsub <4 x double> %74, %70
  %dy_SIMD.i.i = fsub <4 x double> %75, %71
  %77 = fmul <4 x double> %dz_SIMD.i.i, %dz_SIMD.i.i
  %78 = fmul <4 x double> %dx_SIMD.i.i, %dx_SIMD.i.i
  %79 = fmul <4 x double> %dy_SIMD.i.i, %dy_SIMD.i.i
  %80 = fadd <4 x double> %78, %79
  %squared_distance_SIMD.i.i = fadd <4 x double> %80, %77
  ; Immediate 1 selects the ordered less-than predicate, so the compare below
  ; yields an all-ones lane wherever the squared distance is less than the
  ; corresponding lane of %.splat8.i.i.
  %81 = tail call <4 x double> @llvm.x86.avx.cmp.pd.256(<4 x double> %squared_distance_SIMD.i.i, <4 x double> %.splat8.i.i, i8 1)
  %82 = bitcast <4 x double> %81 to <4 x i64>
  %83 = and <4 x i64> %82, %66
  %84 = fmul <4 x double> %squared_distance_SIMD.i.i, %squared_distance_SIMD.i.i
  %85 = load [0 x %6]*, [0 x %6]** %29, align 8
  %86 = fmul <4 x double> %84, %84
  %.op.i.i = fdiv <4 x double> <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>, %86
  %87 = bitcast <4 x i64> %83 to <4 x double>
  %88 = tail call <4 x double> @llvm.x86.avx.blendv.pd.256(<4 x double> <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>, <4 x double> %.op.i.i, <4 x double> %87)
  %89 = fmul <4 x double> %.splat10.i.i, %88
  %90 = fmul <4 x double> %squared_distance_SIMD.i.i, %88
  %91 = fmul <4 x double> %.splat12.i.i, %90
  %92 = fsub <4 x double> <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>, %91
  %93 = fmul <4 x double> %89, %92
  %dF_x_SIMD.i.i = fmul <4 x double> %dx_SIMD.i.i, %93
  %94 = getelementptr inbounds [0 x %6], [0 x %6]* %85, <4 x i64> zeroinitializer, <4 x i64> %58, i32 0
  %95 = bitcast <4 x double*> %94 to <4 x i64*>
  %96 = getelementptr inbounds [0 x %6], [0 x %6]* %85, i64 0, i64 %57
  %97 = bitcast %6* %96 to i64*
  %98 = getelementptr inbounds [0 x %6], [0 x %6]* %85, i64 0, i64 %59
  %99 = bitcast %6* %98 to i64*
  %100 = getelementptr inbounds [0 x %6], [0 x %6]* %85, i64 0, i64 %60
  %101 = bitcast %6* %100 to i64*
  %102 = getelementptr inbounds [0 x %6], [0 x %6]* %85, i64 0, i64 %61
  %103 = bitcast %6* %102 to i64*
  br label %while_head.rv.i.i

while_head.rv.i.i:                                ; preds = %while_head.rv.i.i, %may_unroll_step.rv.i.i
  ; Compare-and-swap retry loop for element 0 of each record: gather the
  ; current values for the pending lanes, add %dF_x_SIMD.i.i, then attempt one
  ; scalar cmpxchg per lane. The loop repeats until the mask %134 is all zero.
  %104 = phi <4 x i64> [ %133, %while_head.rv.i.i ], [ zeroinitializer, %may_unroll_step.rv.i.i ]
  %105 = phi <4 x i64> [ %134, %while_head.rv.i.i ], [ %83, %may_unroll_step.rv.i.i ]
  %106 = phi <4 x i64> [ %135, %while_head.rv.i.i ], [ zeroinitializer, %may_unroll_step.rv.i.i ]
  %107 = and <4 x i64> %105, %104
  %108 = xor <4 x i64> %104, <i64 -1, i64 -1, i64 -1, i64 -1>
  %109 = and <4 x i64> %105, %108
  %110 = icmp ne <4 x i64> %109, zeroinitializer
  %111 = tail call <4 x i64> @llvm.masked.gather.v4i64(<4 x i64*> %95, i32 1, <4 x i1> %110, <4 x i64> undef)
  %extract17.i.i = extractelement <4 x i64> %111, i32 3
  %extract15.i.i = extractelement <4 x i64> %111, i32 2
  %extract13.i.i = extractelement <4 x i64> %111, i32 1
  %extract.i.i = extractelement <4 x i64> %111, i32 0
  %112 = bitcast <4 x i64> %111 to <4 x double>
  %113 = fadd <4 x double> %dF_x_SIMD.i.i, %112
  %bc.i.i = bitcast <4 x double> %113 to <4 x i64>
  %114 = extractelement <4 x i64> %bc.i.i, i32 0
  %115 = cmpxchg i64* %97, i64 %extract.i.i, i64 %114 seq_cst seq_cst
  %116 = extractelement <4 x i64> %bc.i.i, i32 1
  %117 = cmpxchg i64* %99, i64 %extract13.i.i, i64 %116 seq_cst seq_cst
  %118 = extractelement <4 x i64> %bc.i.i, i32 2
  %119 = cmpxchg i64* %101, i64 %extract15.i.i, i64 %118 seq_cst seq_cst
  %120 = extractelement <4 x i64> %bc.i.i, i32 3
  %121 = cmpxchg i64* %103, i64 %extract17.i.i, i64 %120 seq_cst seq_cst
  %122 = extractvalue { i64, i1 } %115, 1
  %123 = sext i1 %122 to i64
  %124 = insertelement <4 x i64> undef, i64 %123, i64 0
  %125 = extractvalue { i64, i1 } %117, 1
  %126 = sext i1 %125 to i64
  %127 = insertelement <4 x i64> %124, i64 %126, i64 1
  %128 = extractvalue { i64, i1 } %119, 1
  %129 = sext i1 %128 to i64
  %130 = insertelement <4 x i64> %127, i64 %129, i64 2
  %131 = extractvalue { i64, i1 } %121, 1
  %132 = sext i1 %131 to i64
  %133 = insertelement <4 x i64> %130, i64 %132, i64 3
  %134 = xor <4 x i64> %105, %107
  %135 = or <4 x i64> %107, %106
  %136 = tail call i32 @llvm.x86.avx.ptestz.256(<4 x i64> %134, <4 x i64> %134)
  %137 = icmp eq i32 %136, 0
  br i1 %137, label %while_head.rv.i.i, label %while_head.divexit.rv.i.i

while_head.divexit.rv.i.i:                        ; preds = %while_head.rv.i.i
  %dF_y_SIMD.i.i = fmul <4 x double> %dy_SIMD.i.i, %93
  %138 = getelementptr inbounds [0 x %6], [0 x %6]* %85, <4 x i64> zeroinitializer, <4 x i64> %58, i32 1
  %139 = bitcast <4 x double*> %138 to <4 x i64*>
  %140 = getelementptr inbounds [0 x %6], [0 x %6]* %85, i64 0, i64 %57, i32 1
  %141 = bitcast double* %140 to i64*
  %142 = getelementptr inbounds [0 x %6], [0 x %6]* %85, i64 0, i64 %59, i32 1
  %143 = bitcast double* %142 to i64*
  %144 = getelementptr inbounds [0 x %6], [0 x %6]* %85, i64 0, i64 %60, i32 1
  %145 = bitcast double* %144 to i64*
  %146 = getelementptr inbounds [0 x %6], [0 x %6]* %85, i64 0, i64 %61, i32 1
  %147 = bitcast double* %146 to i64*
  br label %while_head5.rv.i.i

while_head5.rv.i.i:                               ; preds = %while_head5.rv.i.i, %while_head.divexit.rv.i.i
  %148 = phi <4 x i64> [ %177, %while_head5.rv.i.i ], [ zeroinitializer, %while_head.divexit.rv.i.i ]
  %149 = phi <4 x i64> [ %178, %while_head5.rv.i.i ], [ %135, %while_head.divexit.rv.i.i ]
  %150 = phi <4 x i64> [ %179, %while_head5.rv.i.i ], [ zeroinitializer, %while_head.divexit.rv.i.i ]
  %151 = and <4 x i64> %149, %148
  %152 = xor <4 x i64> %148, <i64 -1, i64 -1, i64 -1, i64 -1>
  %153 = and <4 x i64> %149, %152
  %154 = icmp ne <4 x i64> %153, zeroinitializer
  %155 = tail call <4 x i64> @llvm.masked.gather.v4i64(<4 x i64*> %139, i32 1, <4 x i1> %154, <4 x i64> undef)
  %extract29.i.i = extractelement <4 x i64> %155, i32 3
  %extract27.i.i = extractelement <4 x i64> %155, i32 2
  %extract25.i.i = extractelement <4 x i64> %155, i32 1
  %extract23.i.i = extractelement <4 x i64> %155, i32 0
  %156 = bitcast <4 x i64> %155 to <4 x double>
  %157 = fadd <4 x double> %dF_y_SIMD.i.i, %156
  %bc96.i.i = bitcast <4 x double> %157 to <4 x i64>
  %158 = extractelement <4 x i64> %bc96.i.i, i32 0
  %159 = cmpxchg i64* %141, i64 %extract23.i.i, i64 %158 seq_cst seq_cst
  %160 = extractelement <4 x i64> %bc96.i.i, i32 1
  %161 = cmpxchg i64* %143, i64 %extract25.i.i, i64 %160 seq_cst seq_cst
  %162 = extractelement <4 x i64> %bc96.i.i, i32 2
  %163 = cmpxchg i64* %145, i64 %extract27.i.i, i64 %162 seq_cst seq_cst
  %164 = extractelement <4 x i64> %bc96.i.i, i32 3
  %165 = cmpxchg i64* %147, i64 %extract29.i.i, i64 %164 seq_cst seq_cst
  %166 = extractvalue { i64, i1 } %159, 1
  %167 = sext i1 %166 to i64
  %168 = insertelement <4 x i64> undef, i64 %167, i64 0
  %169 = extractvalue { i64, i1 } %161, 1
  %170 = sext i1 %169 to i64
  %171 = insertelement <4 x i64> %168, i64 %170, i64 1
  %172 = extractvalue { i64, i1 } %163, 1
  %173 = sext i1 %172 to i64
  %174 = insertelement <4 x i64> %171, i64 %173, i64 2
  %175 = extractvalue { i64, i1 } %165, 1
  %176 = sext i1 %175 to i64
  %177 = insertelement <4 x i64> %174, i64 %176, i64 3
  %178 = xor <4 x i64> %149, %151
  %179 = or <4 x i64> %151, %150
  %180 = tail call i32 @llvm.x86.avx.ptestz.256(<4 x i64> %178, <4 x i64> %178)
  %181 = icmp eq i32 %180, 0
  br i1 %181, label %while_head5.rv.i.i, label %while_head5.divexit.rv.i.i

while_head5.divexit.rv.i.i:                       ; preds = %while_head5.rv.i.i
  %dF_z_SIMD.i.i = fmul <4 x double> %dz_SIMD.i.i, %93
  %182 = getelementptr inbounds [0 x %6], [0 x %6]* %85, <4 x i64> zeroinitializer, <4 x i64> %58, i32 2
  %183 = bitcast <4 x double*> %182 to <4 x i64*>
  %184 = getelementptr inbounds [0 x %6], [0 x %6]* %85, i64 0, i64 %57, i32 2
  %185 = bitcast double* %184 to i64*
  %186 = getelementptr inbounds [0 x %6], [0 x %6]* %85, i64 0, i64 %59, i32 2
  %187 = bitcast double* %186 to i64*
  %188 = getelementptr inbounds [0 x %6], [0 x %6]* %85, i64 0, i64 %60, i32 2
  %189 = bitcast double* %188 to i64*
  %190 = getelementptr inbounds [0 x %6], [0 x %6]* %85, i64 0, i64 %61, i32 2
  %191 = bitcast double* %190 to i64*
  br label %while_head11.rv.i.i

while_head11.rv.i.i:                              ; preds = %while_head11.rv.i.i, %while_head5.divexit.rv.i.i
  %192 = phi <4 x i64> [ %221, %while_head11.rv.i.i ], [ zeroinitializer, %while_head5.divexit.rv.i.i ]
  %193 = phi <4 x i64> [ %222, %while_head11.rv.i.i ], [ %179, %while_head5.divexit.rv.i.i ]
  %194 = phi <4 x i64> [ %223, %while_head11.rv.i.i ], [ zeroinitializer, %while_head5.divexit.rv.i.i ]
  %195 = and <4 x i64> %193, %192
  %196 = xor <4 x i64> %192, <i64 -1, i64 -1, i64 -1, i64 -1>
  %197 = and <4 x i64> %193, %196
  %198 = icmp ne <4 x i64> %197, zeroinitializer
  %199 = tail call <4 x i64> @llvm.masked.gather.v4i64(<4 x i64*> %183, i32 1, <4 x i1> %198, <4 x i64> undef)
  %extract43.i.i = extractelement <4 x i64> %199, i32 3
  %extract41.i.i = extractelement <4 x i64> %199, i32 2
  %extract39.i.i = extractelement <4 x i64> %199, i32 1
  %extract37.i.i = extractelement <4 x i64> %199, i32 0
  %200 = bitcast <4 x i64> %199 to <4 x double>
  %201 = fadd <4 x double> %dF_z_SIMD.i.i, %200
  %bc100.i.i = bitcast <4 x double> %201 to <4 x i64>
  %202 = extractelement <4 x i64> %bc100.i.i, i32 0
  %203 = cmpxchg i64* %185, i64 %extract37.i.i, i64 %202 seq_cst seq_cst
  %204 = extractelement <4 x i64> %bc100.i.i, i32 1
  %205 = cmpxchg i64* %187, i64 %extract39.i.i, i64 %204 seq_cst seq_cst
  %206 = extractelement <4 x i64> %bc100.i.i, i32 2
  %207 = cmpxchg i64* %189, i64 %extract41.i.i, i64 %206 seq_cst seq_cst
  %208 = extractelement <4 x i64> %bc100.i.i, i32 3
  %209 = cmpxchg i64* %191, i64 %extract43.i.i, i64 %208 seq_cst seq_cst
  %210 = extractvalue { i64, i1 } %203, 1
  %211 = sext i1 %210 to i64
  %212 = insertelement <4 x i64> undef, i64 %211, i64 0
  %213 = extractvalue { i64, i1 } %205, 1
  %214 = sext i1 %213 to i64
  %215 = insertelement <4 x i64> %212, i64 %214, i64 1
  %216 = extractvalue { i64, i1 } %207, 1
  %217 = sext i1 %216 to i64
  %218 = insertelement <4 x i64> %215, i64 %217, i64 2
  %219 = extractvalue { i64, i1 } %209, 1
  %220 = sext i1 %219 to i64
  %221 = insertelement <4 x i64> %218, i64 %220, i64 3
  %222 = xor <4 x i64> %193, %195
  %223 = or <4 x i64> %195, %194
  %224 = tail call i32 @llvm.x86.avx.ptestz.256(<4 x i64> %222, <4 x i64> %222)
  %225 = icmp eq i32 %224, 0
  br i1 %225, label %while_head11.rv.i.i, label %while_head11.divexit.rv.i.i

while_head11.divexit.rv.i.i:                      ; preds = %while_head11.rv.i.i
  %226 = load [0 x %6]*, [0 x %6]** %29, align 8
  %227 = getelementptr inbounds [0 x %6], [0 x %6]* %226, <4 x i64> zeroinitializer, <4 x i64> %73, i32 0
  %228 = bitcast <4 x double*> %227 to <4 x i64*>
  %229 = getelementptr inbounds [0 x %6], [0 x %6]* %226, i64 0, i64 %indvars.iv.next37
  %230 = bitcast %6* %229 to i64*
  ; The shl/add/ashr sequences below compute sext(trunc(%indvars.iv36) + k) for
  ; k = 2, 3, 4 (the added constants are k * 2^32), which matches lanes 1..3 of
  ; %73.
  %lower_lane1.i.i = shl i64 %indvars.iv36, 32
  %sext = add i64 %lower_lane1.i.i, 8589934592
  %231 = ashr exact i64 %sext, 32
  %232 = getelementptr inbounds [0 x %6], [0 x %6]* %226, i64 0, i64 %231
  %233 = bitcast %6* %232 to i64*
  %lower_lane2.i.i = shl i64 %indvars.iv36, 32
  %sext48 = add i64 %lower_lane2.i.i, 12884901888
  %234 = ashr exact i64 %sext48, 32
  %235 = getelementptr inbounds [0 x %6], [0 x %6]* %226, i64 0, i64 %234
  %236 = bitcast %6* %235 to i64*
  %lower_lane3.i.i = shl i64 %indvars.iv36, 32
  %sext49 = add i64 %lower_lane3.i.i, 17179869184
  %237 = ashr exact i64 %sext49, 32
  %238 = getelementptr inbounds [0 x %6], [0 x %6]* %226, i64 0, i64 %237
  %239 = bitcast %6* %238 to i64*
  br label %while_head17.rv.i.i

while_head17.rv.i.i:                              ; preds = %while_head17.rv.i.i, %while_head11.divexit.rv.i.i
  %240 = phi <4 x i64> [ %269, %while_head17.rv.i.i ], [ zeroinitializer, %while_head11.divexit.rv.i.i ]
  %241 = phi <4 x i64> [ %270, %while_head17.rv.i.i ], [ %223, %while_head11.divexit.rv.i.i ]
  %242 = phi <4 x i64> [ %271, %while_head17.rv.i.i ], [ zeroinitializer, %while_head11.divexit.rv.i.i ]
  %243 = and <4 x i64> %241, %240
  %244 = xor <4 x i64> %240, <i64 -1, i64 -1, i64 -1, i64 -1>
  %245 = and <4 x i64> %241, %244
  %246 = icmp ne <4 x i64> %245, zeroinitializer
  %247 = tail call <4 x i64> @llvm.masked.gather.v4i64(<4 x i64*> %228, i32 1, <4 x i1> %246, <4 x i64> undef)
  %extract57.i.i = extractelement <4 x i64> %247, i32 3
  %extract55.i.i = extractelement <4 x i64> %247, i32 2
  %extract53.i.i = extractelement <4 x i64> %247, i32 1
  %extract51.i.i = extractelement <4 x i64> %247, i32 0
  %248 = bitcast <4 x i64> %247 to <4 x double>
  %249 = fsub <4 x double> %248, %dF_x_SIMD.i.i
  %bc104.i.i = bitcast <4 x double> %249 to <4 x i64>
  %250 = extractelement <4 x i64> %bc104.i.i, i32 0
  %251 = cmpxchg i64* %230, i64 %extract51.i.i, i64 %250 seq_cst seq_cst
  %252 = extractelement <4 x i64> %bc104.i.i, i32 1
  %253 = cmpxchg i64* %233, i64 %extract53.i.i, i64 %252 seq_cst seq_cst
  %254 = extractelement <4 x i64> %bc104.i.i, i32 2
  %255 = cmpxchg i64* %236, i64 %extract55.i.i, i64 %254 seq_cst seq_cst
  %256 = extractelement <4 x i64> %bc104.i.i, i32 3
  %257 = cmpxchg i64* %239, i64 %extract57.i.i, i64 %256 seq_cst seq_cst
  %258 = extractvalue { i64, i1 } %251, 1
  %259 = sext i1 %258 to i64
  %260 = insertelement <4 x i64> undef, i64 %259, i64 0
  %261 = extractvalue { i64, i1 } %253, 1
  %262 = sext i1 %261 to i64
  %263 = insertelement <4 x i64> %260, i64 %262, i64 1
  %264 = extractvalue { i64, i1 } %255, 1
  %265 = sext i1 %264 to i64
  %266 = insertelement <4 x i64> %263, i64 %265, i64 2
  %267 = extractvalue { i64, i1 } %257, 1
  %268 = sext i1 %267 to i64
  %269 = insertelement <4 x i64> %266, i64 %268, i64 3
  %270 = xor <4 x i64> %241, %243
  %271 = or <4 x i64> %243, %242
  %272 = tail call i32 @llvm.x86.avx.ptestz.256(<4 x i64> %270, <4 x i64> %270)
  %273 = icmp eq i32 %272, 0
  br i1 %273, label %while_head17.rv.i.i, label %while_head17.divexit.rv.i.i

while_head17.divexit.rv.i.i:                      ; preds = %while_head17.rv.i.i
  %274 = getelementptr inbounds [0 x %6], [0 x %6]* %226, <4 x i64> zeroinitializer, <4 x i64> %73, i32 1
  %275 = bitcast <4 x double*> %274 to <4 x i64*>
  %276 = getelementptr inbounds [0 x %6], [0 x %6]* %226, i64 0, i64 %indvars.iv.next37, i32 1
  %277 = bitcast double* %276 to i64*
  %278 = getelementptr inbounds [0 x %6], [0 x %6]* %226, i64 0, i64 %231, i32 1
  %279 = bitcast double* %278 to i64*
  %280 = getelementptr inbounds [0 x %6], [0 x %6]* %226, i64 0, i64 %234, i32 1
  %281 = bitcast double* %280 to i64*
  %282 = getelementptr inbounds [0 x %6], [0 x %6]* %226, i64 0, i64 %237, i32 1
  %283 = bitcast double* %282 to i64*
  br label %while_head23.rv.i.i

while_head23.rv.i.i:                              ; preds = %while_head23.rv.i.i, %while_head17.divexit.rv.i.i
  %284 = phi <4 x i64> [ %313, %while_head23.rv.i.i ], [ zeroinitializer, %while_head17.divexit.rv.i.i ]
  %285 = phi <4 x i64> [ %314, %while_head23.rv.i.i ], [ %271, %while_head17.divexit.rv.i.i ]
  %286 = phi <4 x i64> [ %315, %while_head23.rv.i.i ], [ zeroinitializer, %while_head17.divexit.rv.i.i ]
  %287 = and <4 x i64> %285, %284
  %288 = xor <4 x i64> %284, <i64 -1, i64 -1, i64 -1, i64 -1>
  %289 = and <4 x i64> %285, %288
  %290 = icmp ne <4 x i64> %289, zeroinitializer
  %291 = tail call <4 x i64> @llvm.masked.gather.v4i64(<4 x i64*> %275, i32 1, <4 x i1> %290, <4 x i64> undef)
  %extract71.i.i = extractelement <4 x i64> %291, i32 3
  %extract69.i.i = extractelement <4 x i64> %291, i32 2
  %extract67.i.i = extractelement <4 x i64> %291, i32 1
  %extract65.i.i = extractelement <4 x i64> %291, i32 0
  %292 = bitcast <4 x i64> %291 to <4 x double>
  %293 = fsub <4 x double> %292, %dF_y_SIMD.i.i
  %bc108.i.i = bitcast <4 x double> %293 to <4 x i64>
  %294 = extractelement <4 x i64> %bc108.i.i, i32 0
  %295 = cmpxchg i64* %277, i64 %extract65.i.i, i64 %294 seq_cst seq_cst
  %296 = extractelement <4 x i64> %bc108.i.i, i32 1
  %297 = cmpxchg i64* %279, i64 %extract67.i.i, i64 %296 seq_cst seq_cst
  %298 = extractelement <4 x i64> %bc108.i.i, i32 2
  %299 = cmpxchg i64* %281, i64 %extract69.i.i, i64 %298 seq_cst seq_cst
  %300 = extractelement <4 x i64> %bc108.i.i, i32 3
  %301 = cmpxchg i64* %283, i64 %extract71.i.i, i64 %300 seq_cst seq_cst
  %302 = extractvalue { i64, i1 } %295, 1
  %303 = sext i1 %302 to i64
  %304 = insertelement <4 x i64> undef, i64 %303, i64 0
  %305 = extractvalue { i64, i1 } %297, 1
  %306 = sext i1 %305 to i64
  %307 = insertelement <4 x i64> %304, i64 %306, i64 1
  %308 = extractvalue { i64, i1 } %299, 1
  %309 = sext i1 %308 to i64
  %310 = insertelement <4 x i64> %307, i64 %309, i64 2
  %311 = extractvalue { i64, i1 } %301, 1
  %312 = sext i1 %311 to i64
  %313 = insertelement <4 x i64> %310, i64 %312, i64 3
  %314 = xor <4 x i64> %285, %287
  %315 = or <4 x i64> %287, %286
  %316 = tail call i32 @llvm.x86.avx.ptestz.256(<4 x i64> %314, <4 x i64> %314)
  %317 = icmp eq i32 %316, 0
  br i1 %317, label %while_head23.rv.i.i, label %while_head23.divexit.rv.i.i

while_head23.divexit.rv.i.i:                      ; preds = %while_head23.rv.i.i
  %318 = getelementptr inbounds [0 x %6], [0 x %6]* %226, <4 x i64> zeroinitializer, <4 x i64> %73, i32 2
  %319 = bitcast <4 x double*> %318 to <4 x i64*>
  %320 = getelementptr inbounds [0 x %6], [0 x %6]* %226, i64 0, i64 %indvars.iv.next37, i32 2
  %321 = bitcast double* %320 to i64*
  %322 = getelementptr inbounds [0 x %6], [0 x %6]* %226, i64 0, i64 %231, i32 2
  %323 = bitcast double* %322 to i64*
  %324 = getelementptr inbounds [0 x %6], [0 x %6]* %226, i64 0, i64 %234, i32 2
  %325 = bitcast double* %324 to i64*
  %326 = getelementptr inbounds [0 x %6], [0 x %6]* %226, i64 0, i64 %237, i32 2
  %327 = bitcast double* %326 to i64*
  br label %while_head29.rv.i.i

while_head29.rv.i.i:                              ; preds = %while_head29.rv.i.i, %while_head23.divexit.rv.i.i
  %328 = phi <4 x i64> [ %355, %while_head29.rv.i.i ], [ zeroinitializer, %while_head23.divexit.rv.i.i ]
  %329 = phi <4 x i64> [ %331, %while_head29.rv.i.i ], [ %315, %while_head23.divexit.rv.i.i ]
  %330 = xor <4 x i64> %328, <i64 -1, i64 -1, i64 -1, i64 -1>
  %331 = and <4 x i64> %329, %330
  %332 = icmp ne <4 x i64> %331, zeroinitializer
  %333 = tail call <4 x i64> @llvm.masked.gather.v4i64(<4 x i64*> %319, i32 1, <4 x i1> %332, <4 x i64> undef)
  %extract85.i.i = extractelement <4 x i64> %333, i32 3
  %extract83.i.i = extractelement <4 x i64> %333, i32 2
  %extract81.i.i = extractelement <4 x i64> %333, i32 1
  %extract79.i.i = extractelement <4 x i64> %333, i32 0
  %334 = bitcast <4 x i64> %333 to <4 x double>
  %335 = fsub <4 x double> %334, %dF_z_SIMD.i.i
  %bc112.i.i = bitcast <4 x double> %335 to <4 x i64>
  %336 = extractelement <4 x i64> %bc112.i.i, i32 0
  %337 = cmpxchg i64* %321, i64 %extract79.i.i, i64 %336 seq_cst seq_cst
  %338 = extractelement <4 x i64> %bc112.i.i, i32 1
  %339 = cmpxchg i64* %323, i64 %extract81.i.i, i64 %338 seq_cst seq_cst
  %340 = extractelement <4 x i64> %bc112.i.i, i32 2
  %341 = cmpxchg i64* %325, i64 %extract83.i.i, i64 %340 seq_cst seq_cst
  %342 = extractelement <4 x i64> %bc112.i.i, i32 3
  %343 = cmpxchg i64* %327, i64 %extract85.i.i, i64 %342 seq_cst seq_cst
  %344 = extractvalue { i64, i1 } %337, 1
  %345 = sext i1 %344 to i64
  %346 = insertelement <4 x i64> undef, i64 %345, i64 0
  %347 = extractvalue { i64, i1 } %339, 1
  %348 = sext i1 %347 to i64
  %349 = insertelement <4 x i64> %346, i64 %348, i64 1
  %350 = extractvalue { i64, i1 } %341, 1
  %351 = sext i1 %350 to i64
  %352 = insertelement <4 x i64> %349, i64 %351, i64 2
  %353 = extractvalue { i64, i1 } %343, 1
  %354 = sext i1 %353 to i64
  %355 = insertelement <4 x i64> %352, i64 %354, i64 3
  ; ptestc returns 1 when (~%328 & %329) is all zero, that is, when every lane
  ; set in %329 is also set in %328; the loop exits in that case.
  %356 = tail call i32 @llvm.x86.avx.ptestc.256(<4 x i64> %328, <4 x i64> %329)
  %357 = icmp eq i32 %356, 0
  br i1 %357, label %while_head29.rv.i.i, label %while_head29.divexit.rv.i.i

while_head29.divexit.rv.i.i:                      ; preds = %while_head29.rv.i.i
  %358 = sext <4 x i1> %65 to <4 x i32>
  %359 = bitcast <4 x i32> %358 to <2 x i64>
  %360 = bitcast <4 x i32> %63 to <2 x i64>
  %361 = tail call i32 @llvm.x86.sse41.ptestz(<2 x i64> %359, <2 x i64> %360)
  %362 = icmp eq i32 %361, 0
  br i1 %362, label %may_unroll_step.rv.i.i, label %lambda_32651_vectorize.exit.i

lambda_32651_vectorize.exit.i:                    ; preds = %while_head29.divexit.rv.i.i
  %indvars.iv.next35 = add i32 %indvars.iv34, 4
  %363 = icmp slt i32 %indvars.iv.next35, %.
  br i1 %363, label %body.i, label %exit.i.loopexit

exit.i.loopexit:                                  ; preds = %lambda_32651_vectorize.exit.i
  br label %exit.i

exit.i:                                           ; preds = %exit.i.loopexit, %if_then5.i
  %364 = getelementptr inbounds [0 x %4], [0 x %4]* %23, i64 0, i64 %indvars.iv42, i32 0, i32 0
  %365 = load i32, i32* %364, align 4
  %366 = icmp sgt i32 %365, 0
  br i1 %366, label %if_then12.i.lr.ph, label %if_else11.i

if_then12.i.lr.ph:                                ; preds = %exit.i
  %367 = getelementptr inbounds [0 x %4], [0 x %4]* %23, i64 0, i64 %indvars.iv42, i32 0, i32 2, i32 1
  %368 = bitcast [0 x i8]** %367 to [0 x %3*]**
  %369 = getelementptr inbounds [0 x %4], [0 x %4]* %23, i64 0, i64 %indvars.iv42, i32 0, i32 3, i32 1
  %370 = bitcast [0 x i8]** %369 to [0 x i32]**
  %wide.trip.count = zext i32 %365 to i64
  br label %if_then12.i

body18.i:                                         ; preds = %body18.i.lr.ph, %lambda_32928_vectorize.exit.i
  %parallel_loop_phi20.i19 = phi i32 [ %begin.i, %body18.i.lr.ph ], [ %628, %lambda_32928_vectorize.exit.i ]
  br i1 %50, label %if_then.rv.i.i.lr.ph, label %lambda_32928_vectorize.exit.i

if_then.rv.i.i.lr.ph:                             ; preds = %body18.i
  %.splatinsert.i21.i = insertelement <4 x i32> undef, i32 %parallel_loop_phi20.i19, i32 0
  %.splat.i22.i = shufflevector <4 x i32> %.splatinsert.i21.i, <4 x i32> undef, <4 x i32> zeroinitializer
  %contiguous_add.i23.i = add <4 x i32> %.splat.i22.i, <i32 0, i32 1, i32 2, i32 3>
  %371 = sext <4 x i32> %contiguous_add.i23.i to <4 x i64>
  %372 = sext i32 %parallel_loop_phi20.i19 to i64
  %i_32930_lane1.i.i = add i32 %parallel_loop_phi20.i19, 1
  %373 = sext i32 %i_32930_lane1.i.i to i64
  %i_32930_lane2.i.i = add i32 %parallel_loop_phi20.i19, 2
  %374 = sext i32 %i_32930_lane2.i.i to i64
  %i_32930_lane3.i.i = add i32 %parallel_loop_phi20.i19, 3
  %375 = sext i32 %i_32930_lane3.i.i to i64
  br label %if_then.rv.i.i

if_then.rv.i.i:                                   ; preds = %if_then.rv.i.i.lr.ph, %while_head29.divexit.rv.i65.i
  %indvars.iv38 = phi i64 [ %55, %if_then.rv.i.i.lr.ph ], [ %indvars.iv.next39, %while_head29.divexit.rv.i65.i ]
  %376 = load [0 x %6]*, [0 x %6]** %27, align 8
  %srov_gep.i28.i = getelementptr [0 x %6], [0 x %6]* %376, <4 x i64> zeroinitializer, <4 x i64> %371, i32 0
  %377 = tail call <4 x double> @llvm.masked.gather.v4f64(<4 x double*> %srov_gep.i28.i, i32 1, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x double> undef)
  %srov_gep66.i.i = getelementptr [0 x %6], [0 x %6]* %376, <4 x i64> zeroinitializer, <4 x i64> %371, i32 1
  %378 = tail call <4 x double> @llvm.masked.gather.v4f64(<4 x double*> %srov_gep66.i.i, i32 1, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x double> undef)
  %srov_gep67.i.i = getelementptr [0 x %6], [0 x %6]* %376, <4 x i64> zeroinitializer, <4 x i64> %371, i32 2
  %379 = tail call <4 x double> @llvm.masked.gather.v4f64(<4 x double*> %srov_gep67.i.i, i32 1, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x double> undef)
  %380 = load [0 x %6]*, [0 x %6]** %52, align 8
  %.elt.i.i = getelementptr inbounds [0 x %6], [0 x %6]* %380, i64 0, i64 %indvars.iv38, i32 0
  %.unpack.i.i = load double, double* %.elt.i.i, align 8
  %.elt98.i.i = getelementptr inbounds [0 x %6], [0 x %6]* %380, i64 0, i64 %indvars.iv38, i32 1
  %.unpack99.i.i = load double, double* %.elt98.i.i, align 8
  %.elt100.i.i = getelementptr inbounds [0 x %6], [0 x %6]* %380, i64 0, i64 %indvars.iv38, i32 2
  %.unpack101.i.i = load double, double* %.elt100.i.i, align 8
  %.splatinsert1.i29.i = insertelement <4 x double> undef, double %.unpack101.i.i, i32 0
  %.splat2.i30.i = shufflevector <4 x double> %.splatinsert1.i29.i, <4 x double> undef, <4 x i32> zeroinitializer
  %dz_SIMD.i31.i = fsub <4 x double> %.splat2.i30.i, %379
  %.splatinsert3.i32.i = insertelement <4 x double> undef, double %.unpack.i.i, i32 0
  %.splat4.i33.i = shufflevector <4 x double> %.splatinsert3.i32.i, <4 x double> undef, <4 x i32> zeroinitializer
  %dx_SIMD.i34.i = fsub <4 x double> %.splat4.i33.i, %377
  %.splatinsert5.i.i = insertelement <4 x double> undef, double %.unpack99.i.i, i32 0
  %.splat6.i.i = shufflevector <4 x double> %.splatinsert5.i.i, <4 x double> undef, <4 x i32> zeroinitializer
  %dy_SIMD.i35.i = fsub <4 x double> %.splat6.i.i, %378
  %381 = fmul <4 x double> %dz_SIMD.i31.i, %dz_SIMD.i31.i
  %382 = fmul <4 x double> %dx_SIMD.i34.i, %dx_SIMD.i34.i
  %383 = fmul <4 x double> %dy_SIMD.i35.i, %dy_SIMD.i35.i
  %384 = fadd <4 x double> %382, %383
  %squared_distance_SIMD.i36.i = fadd <4 x double> %384, %381
  %385 = tail call <4 x double> @llvm.x86.avx.cmp.pd.256(<4 x double> %squared_distance_SIMD.i36.i, <4 x double> %.splat8.i.i, i8 1)
  %386 = bitcast <4 x double> %385 to <4 x i64>
  %387 = load [0 x %6]*, [0 x %6]** %29, align 8
  %388 = fmul <4 x double> %squared_distance_SIMD.i36.i, %squared_distance_SIMD.i36.i
  %389 = fmul <4 x double> %388, %388
  %.op.i37.i = fdiv <4 x double> <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>, %389
  %390 = tail call <4 x double> @llvm.x86.avx.blendv.pd.256(<4 x double> <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>, <4 x double> %.op.i37.i, <4 x double> %385)
  %391 = fmul <4 x double> %.splat10.i.i, %390
  %392 = fmul <4 x double> %390, %squared_distance_SIMD.i36.i
  %393 = fmul <4 x double> %.splat12.i.i, %392
  %394 = fsub <4 x double> <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>, %393
  %395 = fmul <4 x double> %391, %394
  %dF_x_SIMD.i38.i = fmul <4 x double> %dx_SIMD.i34.i, %395
  %396 = getelementptr inbounds [0 x %6], [0 x %6]* %387, <4 x i64> zeroinitializer, <4 x i64> %371, i32 0
  %397 = bitcast <4 x double*> %396 to <4 x i64*>
  %398 = getelementptr inbounds [0 x %6], [0 x %6]* %387, i64 0, i64 %372
  %399 = bitcast %6* %398 to i64*
  %400 = getelementptr inbounds [0 x %6], [0 x %6]* %387, i64 0, i64 %373
  %401 = bitcast %6* %400 to i64*
  %402 = getelementptr inbounds [0 x %6], [0 x %6]* %387, i64 0, i64 %374
  %403 = bitcast %6* %402 to i64*
  %404 = getelementptr inbounds [0 x %6], [0 x %6]* %387, i64 0, i64 %375
  %405 = bitcast %6* %404 to i64*
  br label %while_head.rv.i39.i

while_head.rv.i39.i:                              ; preds = %while_head.rv.i39.i, %if_then.rv.i.i
  %406 = phi <4 x i64> [ %435, %while_head.rv.i39.i ], [ zeroinitializer, %if_then.rv.i.i ]
  %407 = phi <4 x i64> [ %436, %while_head.rv.i39.i ], [ %386, %if_then.rv.i.i ]
  %408 = phi <4 x i64> [ %437, %while_head.rv.i39.i ], [ zeroinitializer, %if_then.rv.i.i ]
  %409 = and <4 x i64> %407, %406
  %410 = xor <4 x i64> %406, <i64 -1, i64 -1, i64 -1, i64 -1>
  %411 = and <4 x i64> %407, %410
  %412 = icmp ne <4 x i64> %411, zeroinitializer
  %413 = tail call <4 x i64> @llvm.masked.gather.v4i64(<4 x i64*> %397, i32 1, <4 x i1> %412, <4 x i64> undef)
  %extract18.i.i = extractelement <4 x i64> %413, i32 3
  %extract16.i.i = extractelement <4 x i64> %413, i32 2
  %extract14.i.i = extractelement <4 x i64> %413, i32 1
  %extract.i40.i = extractelement <4 x i64> %413, i32 0
  %414 = bitcast <4 x i64> %413 to <4 x double>
  %415 = fadd <4 x double> %dF_x_SIMD.i38.i, %414
  %bc.i41.i = bitcast <4 x double> %415 to <4 x i64>
  %416 = extractelement <4 x i64> %bc.i41.i, i32 0
  %417 = cmpxchg i64* %399, i64 %extract.i40.i, i64 %416 seq_cst seq_cst
  %418 = extractelement <4 x i64> %bc.i41.i, i32 1
  %419 = cmpxchg i64* %401, i64 %extract14.i.i, i64 %418 seq_cst seq_cst
  %420 = extractelement <4 x i64> %bc.i41.i, i32 2
  %421 = cmpxchg i64* %403, i64 %extract16.i.i, i64 %420 seq_cst seq_cst
  %422 = extractelement <4 x i64> %bc.i41.i, i32 3
  %423 = cmpxchg i64* %405, i64 %extract18.i.i, i64 %422 seq_cst seq_cst
  %424 = extractvalue { i64, i1 } %417, 1
  %425 = sext i1 %424 to i64
  %426 = insertelement <4 x i64> undef, i64 %425, i64 0
  %427 = extractvalue { i64, i1 } %419, 1
  %428 = sext i1 %427 to i64
  %429 = insertelement <4 x i64> %426, i64 %428, i64 1
  %430 = extractvalue { i64, i1 } %421, 1
  %431 = sext i1 %430 to i64
  %432 = insertelement <4 x i64> %429, i64 %431, i64 2
  %433 = extractvalue { i64, i1 } %423, 1
  %434 = sext i1 %433 to i64
  %435 = insertelement <4 x i64> %432, i64 %434, i64 3
  %436 = xor <4 x i64> %407, %409
  %437 = or <4 x i64> %409, %408
  %438 = tail call i32 @llvm.x86.avx.ptestz.256(<4 x i64> %436, <4 x i64> %436)
  %439 = icmp eq i32 %438, 0
  br i1 %439, label %while_head.rv.i39.i, label %while_head.divexit.rv.i45.i

while_head.divexit.rv.i45.i:                      ; preds = %while_head.rv.i39.i
  %dF_y_SIMD.i46.i = fmul <4 x double> %dy_SIMD.i35.i, %395
  %440 = getelementptr inbounds [0 x %6], [0 x %6]* %387, <4 x i64> zeroinitializer, <4 x i64> %371, i32 1
  %441 = bitcast <4 x double*> %440 to <4 x i64*>
  %442 = getelementptr inbounds [0 x %6], [0 x %6]* %387, i64 0, i64 %372, i32 1
  %443 = bitcast double* %442 to i64*
  %444 = getelementptr inbounds [0 x %6], [0 x %6]* %387, i64 0, i64 %373, i32 1
  %445 = bitcast double* %444 to i64*
  %446 = getelementptr inbounds [0 x %6], [0 x %6]* %387, i64 0, i64 %374, i32 1
  %447 = bitcast double* %446 to i64*
  %448 = getelementptr inbounds [0 x %6], [0 x %6]* %387, i64 0, i64 %375, i32 1
  %449 = bitcast double* %448 to i64*
  br label %while_head5.rv.i47.i

while_head5.rv.i47.i:                             ; preds = %while_head5.rv.i47.i, %while_head.divexit.rv.i45.i
  %450 = phi <4 x i64> [ %479, %while_head5.rv.i47.i ], [ zeroinitializer, %while_head.divexit.rv.i45.i ]
  %451 = phi <4 x i64> [ %480, %while_head5.rv.i47.i ], [ %437, %while_head.divexit.rv.i45.i ]
  %452 = phi <4 x i64> [ %481, %while_head5.rv.i47.i ], [ zeroinitializer, %while_head.divexit.rv.i45.i ]
  %453 = and <4 x i64> %451, %450
  %454 = xor <4 x i64> %450, <i64 -1, i64 -1, i64 -1, i64 -1>
  %455 = and <4 x i64> %451, %454
  %456 = icmp ne <4 x i64> %455, zeroinitializer
  %457 = tail call <4 x i64> @llvm.masked.gather.v4i64(<4 x i64*> %441, i32 1, <4 x i1> %456, <4 x i64> undef)
  %extract30.i.i = extractelement <4 x i64> %457, i32 3
  %extract28.i.i = extractelement <4 x i64> %457, i32 2
  %extract26.i.i = extractelement <4 x i64> %457, i32 1
  %extract24.i.i = extractelement <4 x i64> %457, i32 0
  %458 = bitcast <4 x i64> %457 to <4 x double>
  %459 = fadd <4 x double> %dF_y_SIMD.i46.i, %458
  %bc105.i48.i = bitcast <4 x double> %459 to <4 x i64>
  %460 = extractelement <4 x i64> %bc105.i48.i, i32 0
  %461 = cmpxchg i64* %443, i64 %extract24.i.i, i64 %460 seq_cst seq_cst
  %462 = extractelement <4 x i64> %bc105.i48.i, i32 1
  %463 = cmpxchg i64* %445, i64 %extract26.i.i, i64 %462 seq_cst seq_cst
  %464 = extractelement <4 x i64> %bc105.i48.i, i32 2
  %465 = cmpxchg i64* %447, i64 %extract28.i.i, i64 %464 seq_cst seq_cst
  %466 = extractelement <4 x i64> %bc105.i48.i, i32 3
  %467 = cmpxchg i64* %449, i64 %extract30.i.i, i64 %466 seq_cst seq_cst
  %468 = extractvalue { i64, i1 } %461, 1
  %469 = sext i1 %468 to i64
  %470 = insertelement <4 x i64> undef, i64 %469, i64 0
  %471 = extractvalue { i64, i1 } %463, 1
  %472 = sext i1 %471 to i64
  %473 = insertelement <4 x i64> %470, i64 %472, i64 1
  %474 = extractvalue { i64, i1 } %465, 1
  %475 = sext i1 %474 to i64
  %476 = insertelement <4 x i64> %473, i64 %475, i64 2
  %477 = extractvalue { i64, i1 } %467, 1
  %478 = sext i1 %477 to i64
  %479 = insertelement <4 x i64> %476, i64 %478, i64 3
  %480 = xor <4 x i64> %451, %453
  %481 = or <4 x i64> %453, %452
  %482 = tail call i32 @llvm.x86.avx.ptestz.256(<4 x i64> %480, <4 x i64> %480)
  %483 = icmp eq i32 %482, 0
  br i1 %483, label %while_head5.rv.i47.i, label %while_head5.divexit.rv.i52.i

while_head5.divexit.rv.i52.i:                     ; preds = %while_head5.rv.i47.i
  %dF_z_SIMD.i53.i = fmul <4 x double> %dz_SIMD.i31.i, %395
  %484 = getelementptr inbounds [0 x %6], [0 x %6]* %387, <4 x i64> zeroinitializer, <4 x i64> %371, i32 2
  %485 = bitcast <4 x double*> %484 to <4 x i64*>
  %486 = getelementptr inbounds [0 x %6], [0 x %6]* %387, i64 0, i64 %372, i32 2
  %487 = bitcast double* %486 to i64*
  %488 = getelementptr inbounds [0 x %6], [0 x %6]* %387, i64 0, i64 %373, i32 2
  %489 = bitcast double* %488 to i64*
  %490 = getelementptr inbounds [0 x %6], [0 x %6]* %387, i64 0, i64 %374, i32 2
  %491 = bitcast double* %490 to i64*
  %492 = getelementptr inbounds [0 x %6], [0 x %6]* %387, i64 0, i64 %375, i32 2
  %493 = bitcast double* %492 to i64*
  br label %while_head11.rv.i54.i

while_head11.rv.i54.i:                            ; preds = %while_head11.rv.i54.i, %while_head5.divexit.rv.i52.i
  %494 = phi <4 x i64> [ %523, %while_head11.rv.i54.i ], [ zeroinitializer, %while_head5.divexit.rv.i52.i ]
  %495 = phi <4 x i64> [ %524, %while_head11.rv.i54.i ], [ %481, %while_head5.divexit.rv.i52.i ]
  %496 = phi <4 x i64> [ %525, %while_head11.rv.i54.i ], [ zeroinitializer, %while_head5.divexit.rv.i52.i ]
  %497 = and <4 x i64> %495, %494
  %498 = xor <4 x i64> %494, <i64 -1, i64 -1, i64 -1, i64 -1>
  %499 = and <4 x i64> %495, %498
  %500 = icmp ne <4 x i64> %499, zeroinitializer
  %501 = tail call <4 x i64> @llvm.masked.gather.v4i64(<4 x i64*> %485, i32 1, <4 x i1> %500, <4 x i64> undef)
  %extract44.i.i = extractelement <4 x i64> %501, i32 3
  %extract42.i.i = extractelement <4 x i64> %501, i32 2
  %extract40.i.i = extractelement <4 x i64> %501, i32 1
  %extract38.i.i = extractelement <4 x i64> %501, i32 0
  %502 = bitcast <4 x i64> %501 to <4 x double>
  %503 = fadd <4 x double> %dF_z_SIMD.i53.i, %502
  %bc109.i55.i = bitcast <4 x double> %503 to <4 x i64>
  %504 = extractelement <4 x i64> %bc109.i55.i, i32 0
  %505 = cmpxchg i64* %487, i64 %extract38.i.i, i64 %504 seq_cst seq_cst
  %506 = extractelement <4 x i64> %bc109.i55.i, i32 1
  %507 = cmpxchg i64* %489, i64 %extract40.i.i, i64 %506 seq_cst seq_cst
  %508 = extractelement <4 x i64> %bc109.i55.i, i32 2
  %509 = cmpxchg i64* %491, i64 %extract42.i.i, i64 %508 seq_cst seq_cst
  %510 = extractelement <4 x i64> %bc109.i55.i, i32 3
  %511 = cmpxchg i64* %493, i64 %extract44.i.i, i64 %510 seq_cst seq_cst
  %512 = extractvalue { i64, i1 } %505, 1
  %513 = sext i1 %512 to i64
  %514 = insertelement <4 x i64> undef, i64 %513, i64 0
  %515 = extractvalue { i64, i1 } %507, 1
  %516 = sext i1 %515 to i64
  %517 = insertelement <4 x i64> %514, i64 %516, i64 1
  %518 = extractvalue { i64, i1 } %509, 1
  %519 = sext i1 %518 to i64
  %520 = insertelement <4 x i64> %517, i64 %519, i64 2
  %521 = extractvalue { i64, i1 } %511, 1
  %522 = sext i1 %521 to i64
  %523 = insertelement <4 x i64> %520, i64 %522, i64 3
  %524 = xor <4 x i64> %495, %497
  %525 = or <4 x i64> %497, %496
  %526 = tail call i32 @llvm.x86.avx.ptestz.256(<4 x i64> %524, <4 x i64> %524)
  %527 = icmp eq i32 %526, 0
  br i1 %527, label %while_head11.rv.i54.i, label %while_head11.divexit.rv.i59.i

while_head11.divexit.rv.i59.i:                    ; preds = %while_head11.rv.i54.i
  %528 = load [0 x %6]*, [0 x %6]** %54, align 8
  %529 = getelementptr inbounds [0 x %6], [0 x %6]* %528, i64 0, i64 %indvars.iv38
  %530 = bitcast %6* %529 to i64*
  br label %while_head17.rv.i60.i

while_head17.rv.i60.i:                            ; preds = %cont_block.i.i, %while_head11.divexit.rv.i59.i
  %531 = phi <4 x i64> [ %573, %cont_block.i.i ], [ zeroinitializer, %while_head11.divexit.rv.i59.i ]
  %532 = phi <4 x i64> [ %574, %cont_block.i.i ], [ %525, %while_head11.divexit.rv.i59.i ]
  %533 = phi <4 x i64> [ %575, %cont_block.i.i ], [ zeroinitializer, %while_head11.divexit.rv.i59.i ]
  %534 = and <4 x i64> %532, %531
  %535 = tail call i32 @llvm.x86.avx.ptestc.256(<4 x i64> %531, <4 x i64> %532)
  %536 = icmp eq i32 %535, 0
  br i1 %536, label %mem_block.i.i, label %cont_block.i.i

while_head17.divexit.rv.i61.i:                    ; preds = %cont_block.i.i
  %537 = getelementptr inbounds [0 x %6], [0 x %6]* %528, i64 0, i64 %indvars.iv38, i32 1
  %538 = bitcast double* %537 to i64*
  br label %while_head23.rv.i62.i

while_head23.rv.i62.i:                            ; preds = %cont_block67.i.i, %while_head17.divexit.rv.i61.i
  %539 = phi <4 x i64> [ %599, %cont_block67.i.i ], [ zeroinitializer, %while_head17.divexit.rv.i61.i ]
  %540 = phi <4 x i64> [ %600, %cont_block67.i.i ], [ %575, %while_head17.divexit.rv.i61.i ]
  %541 = phi <4 x i64> [ %601, %cont_block67.i.i ], [ zeroinitializer, %while_head17.divexit.rv.i61.i ]
  %542 = and <4 x i64> %540, %539
  %543 = tail call i32 @llvm.x86.avx.ptestc.256(<4 x i64> %539, <4 x i64> %540)
  %544 = icmp eq i32 %543, 0
  br i1 %544, label %mem_block66.i.i, label %cont_block67.i.i

while_head23.divexit.rv.i63.i:                    ; preds = %cont_block67.i.i
  %545 = getelementptr inbounds [0 x %6], [0 x %6]* %528, i64 0, i64 %indvars.iv38, i32 2
  %546 = bitcast double* %545 to i64*
  br label %while_head29.rv.i64.i

while_head29.rv.i64.i:                            ; preds = %cont_block84.i.i, %while_head23.divexit.rv.i63.i
  %547 = phi <4 x i64> [ %625, %cont_block84.i.i ], [ zeroinitializer, %while_head23.divexit.rv.i63.i ]
  %548 = phi <4 x i64> [ %627, %cont_block84.i.i ], [ %601, %while_head23.divexit.rv.i63.i ]
  %549 = tail call i32 @llvm.x86.avx.ptestc.256(<4 x i64> %547, <4 x i64> %548)
  %550 = icmp eq i32 %549, 0
  br i1 %550, label %mem_block83.i.i, label %cont_block84.i.i

while_head29.divexit.rv.i65.i:                    ; preds = %cont_block84.i.i
  %indvars.iv.next39 = add nsw i64 %indvars.iv38, 1
  %551 = icmp slt i64 %indvars.iv.next39, %56
  br i1 %551, label %if_then.rv.i.i, label %lambda_32928_vectorize.exit.i.loopexit

mem_block.i.i:                                    ; preds = %while_head17.rv.i60.i
  %scal_mask_mem.i.i = load i64, i64* %530, align 1
  br label %cont_block.i.i

cont_block.i.i:                                   ; preds = %mem_block.i.i, %while_head17.rv.i60.i
  %scal_mask_mem_phi.i.i = phi i64 [ %scal_mask_mem.i.i, %mem_block.i.i ], [ undef, %while_head17.rv.i60.i ]
  %.splatinsert53.i.i = insertelement <4 x i64> undef, i64 %scal_mask_mem_phi.i.i, i32 0
  %.splat54.i.i = shufflevector <4 x i64> %.splatinsert53.i.i, <4 x i64> undef, <4 x i32> zeroinitializer
  %552 = bitcast <4 x i64> %.splat54.i.i to <4 x double>
  %553 = fsub <4 x double> %552, %dF_x_SIMD.i38.i
  %bc113.i66.i = bitcast <4 x double> %553 to <4 x i64>
  %554 = extractelement <4 x i64> %bc113.i66.i, i32 0
  %555 = cmpxchg i64* %530, i64 %scal_mask_mem_phi.i.i, i64 %554 seq_cst seq_cst
  %556 = extractelement <4 x i64> %bc113.i66.i, i32 1
  %557 = cmpxchg i64* %530, i64 %scal_mask_mem_phi.i.i, i64 %556 seq_cst seq_cst
  %558 = extractelement <4 x i64> %bc113.i66.i, i32 2
  %559 = cmpxchg i64* %530, i64 %scal_mask_mem_phi.i.i, i64 %558 seq_cst seq_cst
  %560 = extractelement <4 x i64> %bc113.i66.i, i32 3
  %561 = cmpxchg i64* %530, i64 %scal_mask_mem_phi.i.i, i64 %560 seq_cst seq_cst
  %562 = extractvalue { i64, i1 } %555, 1
  %563 = sext i1 %562 to i64
  %564 = insertelement <4 x i64> undef, i64 %563, i64 0
  %565 = extractvalue { i64, i1 } %557, 1
  %566 = sext i1 %565 to i64
  %567 = insertelement <4 x i64> %564, i64 %566, i64 1
  %568 = extractvalue { i64, i1 } %559, 1
  %569 = sext i1 %568 to i64
  %570 = insertelement <4 x i64> %567, i64 %569, i64 2
  %571 = extractvalue { i64, i1 } %561, 1
  %572 = sext i1 %571 to i64
  %573 = insertelement <4 x i64> %570, i64 %572, i64 3
  %574 = xor <4 x i64> %532, %534
  %575 = or <4 x i64> %534, %533
  %576 = tail call i32 @llvm.x86.avx.ptestz.256(<4 x i64> %574, <4 x i64> %574)
  %577 = icmp eq i32 %576, 0
  br i1 %577, label %while_head17.rv.i60.i, label %while_head17.divexit.rv.i61.i

mem_block66.i.i:                                  ; preds = %while_head23.rv.i62.i
  %scal_mask_mem68.i.i = load i64, i64* %538, align 1
  br label %cont_block67.i.i

cont_block67.i.i:                                 ; preds = %mem_block66.i.i, %while_head23.rv.i62.i
  %scal_mask_mem_phi69.i.i = phi i64 [ %scal_mask_mem68.i.i, %mem_block66.i.i ], [ undef, %while_head23.rv.i62.i ]
  %.splatinsert70.i.i = insertelement <4 x i64> undef, i64 %scal_mask_mem_phi69.i.i, i32 0
  %.splat71.i.i = shufflevector <4 x i64> %.splatinsert70.i.i, <4 x i64> undef, <4 x i32> zeroinitializer
  %578 = bitcast <4 x i64> %.splat71.i.i to <4 x double>
  %579 = fsub <4 x double> %578, %dF_y_SIMD.i46.i
  %bc117.i.i = bitcast <4 x double> %579 to <4 x i64>
  %580 = extractelement <4 x i64> %bc117.i.i, i32 0
  %581 = cmpxchg i64* %538, i64 %scal_mask_mem_phi69.i.i, i64 %580 seq_cst seq_cst
  %582 = extractelement <4 x i64> %bc117.i.i, i32 1
  %583 = cmpxchg i64* %538, i64 %scal_mask_mem_phi69.i.i, i64 %582 seq_cst seq_cst
  %584 = extractelement <4 x i64> %bc117.i.i, i32 2
  %585 = cmpxchg i64* %538, i64 %scal_mask_mem_phi69.i.i, i64 %584 seq_cst seq_cst
  %586 = extractelement <4 x i64> %bc117.i.i, i32 3
  %587 = cmpxchg i64* %538, i64 %scal_mask_mem_phi69.i.i, i64 %586 seq_cst seq_cst
  %588 = extractvalue { i64, i1 } %581, 1
  %589 = sext i1 %588 to i64
  %590 = insertelement <4 x i64> undef, i64 %589, i64 0
  %591 = extractvalue { i64, i1 } %583, 1
  %592 = sext i1 %591 to i64
  %593 = insertelement <4 x i64> %590, i64 %592, i64 1
  %594 = extractvalue { i64, i1 } %585, 1
  %595 = sext i1 %594 to i64
  %596 = insertelement <4 x i64> %593, i64 %595, i64 2
  %597 = extractvalue { i64, i1 } %587, 1
  %598 = sext i1 %597 to i64
  %599 = insertelement <4 x i64> %596, i64 %598, i64 3
  %600 = xor <4 x i64> %540, %542
  %601 = or <4 x i64> %542, %541
  %602 = tail call i32 @llvm.x86.avx.ptestz.256(<4 x i64> %600, <4 x i64> %600)
  %603 = icmp eq i32 %602, 0
  br i1 %603, label %while_head23.rv.i62.i, label %while_head23.divexit.rv.i63.i

mem_block83.i.i:                                  ; preds = %while_head29.rv.i64.i
  %scal_mask_mem85.i.i = load i64, i64* %546, align 1
  br label %cont_block84.i.i

cont_block84.i.i:                                 ; preds = %mem_block83.i.i, %while_head29.rv.i64.i
  %scal_mask_mem_phi86.i.i = phi i64 [ %scal_mask_mem85.i.i, %mem_block83.i.i ], [ undef, %while_head29.rv.i64.i ]
  %.splatinsert87.i.i = insertelement <4 x i64> undef, i64 %scal_mask_mem_phi86.i.i, i32 0
  %.splat88.i.i = shufflevector <4 x i64> %.splatinsert87.i.i, <4 x i64> undef, <4 x i32> zeroinitializer
  %604 = bitcast <4 x i64> %.splat88.i.i to <4 x double>
  %605 = fsub <4 x double> %604, %dF_z_SIMD.i53.i
  %bc121.i.i = bitcast <4 x double> %605 to <4 x i64>
  %606 = extractelement <4 x i64> %bc121.i.i, i32 0
  %607 = cmpxchg i64* %546, i64 %scal_mask_mem_phi86.i.i, i64 %606 seq_cst seq_cst
  %608 = extractelement <4 x i64> %bc121.i.i, i32 1
  %609 = cmpxchg i64* %546, i64 %scal_mask_mem_phi86.i.i, i64 %608 seq_cst seq_cst
  %610 = extractelement <4 x i64> %bc121.i.i, i32 2
  %611 = cmpxchg i64* %546, i64 %scal_mask_mem_phi86.i.i, i64 %610 seq_cst seq_cst
  %612 = extractelement <4 x i64> %bc121.i.i, i32 3
  %613 = cmpxchg i64* %546, i64 %scal_mask_mem_phi86.i.i, i64 %612 seq_cst seq_cst
  %614 = extractvalue { i64, i1 } %607, 1
  %615 = sext i1 %614 to i64
  %616 = insertelement <4 x i64> undef, i64 %615, i64 0
  %617 = extractvalue { i64, i1 } %609, 1
  %618 = sext i1 %617 to i64
  %619 = insertelement <4 x i64> %616, i64 %618, i64 1
  %620 = extractvalue { i64, i1 } %611, 1
  %621 = sext i1 %620 to i64
  %622 = insertelement <4 x i64> %619, i64 %621, i64 2
  %623 = extractvalue { i64, i1 } %613, 1
  %624 = sext i1 %623 to i64
  %625 = insertelement <4 x i64> %622, i64 %624, i64 3
  %626 = xor <4 x i64> %547, <i64 -1, i64 -1, i64 -1, i64 -1>
  %627 = and <4 x i64> %548, %626
  br i1 %550, label %while_head29.rv.i64.i, label %while_head29.divexit.rv.i65.i

lambda_32928_vectorize.exit.i.loopexit:           ; preds = %while_head29.divexit.rv.i65.i
  br label %lambda_32928_vectorize.exit.i

lambda_32928_vectorize.exit.i:                    ; preds = %lambda_32928_vectorize.exit.i.loopexit, %body18.i
  %628 = add i32 %parallel_loop_phi20.i19, 4
  %629 = icmp slt i32 %628, %.
  br i1 %629, label %body18.i, label %exit19.i.loopexit

exit19.i.loopexit:                                ; preds = %lambda_32928_vectorize.exit.i
  br label %exit19.i

exit19.i:                                         ; preds = %exit19.i.loopexit, %if_then12.i
  %indvars.iv.next41 = add nuw nsw i64 %indvars.iv40, 1
  %exitcond = icmp eq i64 %indvars.iv.next41, %wide.trip.count
  br i1 %exitcond, label %if_else11.i.loopexit, label %if_then12.i

lambda_32590.exit.loopexit:                       ; preds = %if_else4.i
  br label %lambda_32590.exit

lambda_32590.exit:                                ; preds = %lambda_32590.exit.loopexit, %body
  %630 = add nsw i32 %parallel_loop_phi23, 1
  %exitcond47 = icmp eq i32 %630, %2
  br i1 %exitcond47, label %exit.loopexit, label %body

exit.loopexit:                                    ; preds = %lambda_32590.exit
  br label %exit

exit:                                             ; preds = %exit.loopexit, %lambda_32590_parallel_for
  ret void
}

declare void @anydsl_parallel_for(i32, i32, i32, i8*, i8*) local_unnamed_addr

; Function Attrs: nounwind readonly
declare <4 x double> @llvm.masked.gather.v4f64(<4 x double*>, i32, <4 x i1>, <4 x double>) #1

; Function Attrs: nounwind readonly
declare <4 x i64> @llvm.masked.gather.v4i64(<4 x i64*>, i32, <4 x i1>, <4 x i64>) #1

; Function Attrs: nounwind readnone
declare <4 x double> @llvm.x86.avx.cmp.pd.256(<4 x double>, <4 x double>, i8) #2

; Function Attrs: nounwind readnone
declare i32 @llvm.x86.sse41.ptestz(<2 x i64>, <2 x i64>) #2

; Function Attrs: nounwind readnone
declare <4 x double> @llvm.x86.avx.blendv.pd.256(<4 x double>, <4 x double>, <4 x double>) #2

; Function Attrs: nounwind readnone
declare i32 @llvm.x86.avx.ptestz.256(<4 x i64>, <4 x i64>) #2

; Function Attrs: nounwind readnone
declare i32 @llvm.x86.avx.ptestc.256(<4 x i64>, <4 x i64>) #2

attributes #0 = { nounwind }
attributes #1 = { nounwind readonly }
attributes #2 = { nounwind readnone }

Thanks and best regards

Migrating from develop to master

So, like you suggested in #28 I migrated from develop to master, added rv::CallPredicateMode to all my VectorMapping-s (https://zivgitlab.uni-muenster.de/HPC2SE-Project/pacxx-runtime/commit/e574e747ffc049f43726747e3f9c33e025b7194b) and ...

SLEEFResolverService: llvm.pacxx.barrier0 for width 8
        sleef: n/a
ListResolverService: llvm.pacxx.barrier0 for width 8
ListR: match VectorMapping {
        scalarFn = llvm.pacxx.barrier0
        vectorFn = llvm.pacxx.barrier0
        vectorW  = 0
        predMode = PredicateArg
        maskPos  = -1
        resultSh = uni
        paramShs: {
        }
}
scalCall = 'call void @llvm.pacxx.barrier0() #4'

#3  0x00007fffe08f7412 in __GI___assert_fail (assertion=0x7fffe20b29b8 "!getType()->isVoidTy() && \"Cannot assign a name to void values!\"", file=0x7fffe20b24c0 "%%/llvm/lib/IR/Value.cpp", line=247, 
    function=0x7fffe20b62e0 <llvm::Value::setNameImpl(llvm::Twine const&)::__PRETTY_FUNCTION__> "void llvm::Value::setNameImpl(const llvm::Twine&)") at assert.c:101
#4  0x00007fffe1f7e71e in llvm::Value::setNameImpl(llvm::Twine const&) () from %%/lib/libLLVMCore.so
#5  0x00007fffe1f7e749 in llvm::Value::setName(llvm::Twine const&) () from %%/lib/libLLVMCore.so
#6  0x00007ffff738b66c in std::_Function_handler<llvm::Value* (llvm::IRBuilder<llvm::ConstantFolder, llvm::IRBuilderDefaultInserter>&), rv::NatBuilder::vectorizeCallInstruction(llvm::CallInst*)::{lambda(llvm::IRBuilder<llvm::ConstantFolder, llvm::IRBuilderDefaultInserter>&)#1}>::_M_invoke(std::_Any_data const&, llvm::IRBuilder<llvm::ConstantFolder, llvm::IRBuilderDefaultInserter>&) () from %%/lib/libRV.so
#7  0x00007ffff739b062 in rv::NatBuilder::createAnyGuard(bool, llvm::BasicBlock&, llvm::Instruction&, bool, std::function<llvm::Value* (llvm::IRBuilder<llvm::ConstantFolder, llvm::IRBuilderDefaultInserter>&)>) ()
   from %%/lib/libRV.so
#8  0x00007ffff7399081 in rv::NatBuilder::vectorizeCallInstruction(llvm::CallInst*) () from %%/lib/libRV.so
#9  0x00007ffff73a159a in rv::NatBuilder::vectorize(llvm::BasicBlock*, llvm::BasicBlock*) () from %%/lib/libRV.so
#10 0x00007ffff73a2ba3 in rv::NatBuilder::vectorize(bool, llvm::ValueMap<llvm::Value const*, llvm::WeakTrackingVH, llvm::ValueMapConfig<llvm::Value const*, llvm::sys::SmartMutex<false> > >*) ()
   from %%/lib/libRV.so
#11 0x00007ffff73aeab2 in rv::VectorizerInterface::vectorize(rv::VectorizationInfo&, llvm::DominatorTree&, llvm::LoopInfo&, llvm::ScalarEvolution&, llvm::MemoryDependenceResults&, llvm::ValueMap<llvm::Value const*, llvm::WeakTrackingVH, llvm::ValueMapConfig<llvm::Value const*, llvm::sys::SmartMutex<false> > >*) () from %%l/lib/libRV.so

Any hints on further changes affecting this?

Build RV as a LLVM plugin

I was thinking about how best to deploy RV for my use case, and I am curious whether anyone has tried building RV as an LLVM/opt plugin.

We link against the LLVM dylib and being able to dlopen and then add the RV passes would greatly simplify the deployment for us.

Problems building the release/10.x branch

I'm trying to build a recent RV with a recent LLVM. I've tried checking out the release/10.x branch of both and putting rv into llvm-project/llvm/tools/rv & then supplying some cmake arguments to activate RV -DRV_ENABLE_CRT=on -DLLVM_ENABLE_CXX1Y=on -DLLVM_CXX_STD:STRING=c++14. Are these versions expected to work? It would be nice if the top-level README.md had some build instructions (or maybe a link to the wiki with those to make it easier to update?)

Anyway the cmake run prints out this error:

-- Registering Bye as a pass plugin (static build: OFF)
Traceback (most recent call last):
File "<string>", line 22, in <module>
IndexError: list index out of range

Of course that one might be harmless...

Then, later in the build, I am seeing these errors:

In file included from /home/mppf/w/llvm2/third-party/llvm/llvm-project/llvm/tools/rv/vecmath/crt.c:123:
/home/mppf/w/llvm2/third-party/llvm/llvm-project/llvm/../compiler-rt//lib/builtins/udivsi3.c:15:16: error: typedef redefinition with different types ('su_int' (aka 'unsigned int') vs 'du_int' (aka 'unsigned long'))
typedef su_int fixuint_t;
               ^
/home/mppf/w/llvm2/third-party/llvm/llvm-project/llvm/../compiler-rt//lib/builtins/udivdi3.c:15:16: note: previous definition is here
typedef du_int fixuint_t;
               ^
In file included from /home/mppf/w/llvm2/third-party/llvm/llvm-project/llvm/tools/rv/vecmath/crt.c:123:
/home/mppf/w/llvm2/third-party/llvm/llvm-project/llvm/../compiler-rt//lib/builtins/udivsi3.c:16:16: error: typedef redefinition with different types ('si_int' (aka 'int') vs 'di_int' (aka 'long'))
typedef si_int fixint_t;
               ^
/home/mppf/w/llvm2/third-party/llvm/llvm-project/llvm/../compiler-rt//lib/builtins/udivdi3.c:16:16: note: previous definition is here
typedef di_int fixint_t;
               ^
In file included from /home/mppf/w/llvm2/third-party/llvm/llvm-project/llvm/tools/rv/vecmath/crt.c:123:
In file included from /home/mppf/w/llvm2/third-party/llvm/llvm-project/llvm/../compiler-rt//lib/builtins/udivsi3.c:17:
/home/mppf/w/llvm2/third-party/llvm/llvm-project/llvm/../compiler-rt//lib/builtins/int_div_impl.inc:16:27: error: redefinition of '__udivXi3'
static __inline fixuint_t __udivXi3(fixuint_t n, fixuint_t d) {
                          ^
/home/mppf/w/llvm2/third-party/llvm/llvm-project/llvm/../compiler-rt//lib/builtins/int_div_impl.inc:16:27: note: previous definition is here
static __inline fixuint_t __udivXi3(fixuint_t n, fixuint_t d) {
                          ^
In file included from /home/mppf/w/llvm2/third-party/llvm/llvm-project/llvm/tools/rv/vecmath/crt.c:123:
In file included from /home/mppf/w/llvm2/third-party/llvm/llvm-project/llvm/../compiler-rt//lib/builtins/udivsi3.c:17:
/home/mppf/w/llvm2/third-party/llvm/llvm-project/llvm/../compiler-rt//lib/builtins/int_div_impl.inc:45:27: error: redefinition of '__umodXi3'
static __inline fixuint_t __umodXi3(fixuint_t n, fixuint_t d) {
                          ^
/home/mppf/w/llvm2/third-party/llvm/llvm-project/llvm/../compiler-rt//lib/builtins/int_div_impl.inc:45:27: note: previous definition is here
static __inline fixuint_t __umodXi3(fixuint_t n, fixuint_t d) {
                          ^
In file included from /home/mppf/w/llvm2/third-party/llvm/llvm-project/llvm/tools/rv/vecmath/crt.c:126:
/home/mppf/w/llvm2/third-party/llvm/llvm-project/llvm/../compiler-rt//lib/builtins/umodsi3.c:15:16: error: typedef redefinition with different types ('su_int' (aka 'unsigned int') vs 'du_int' (aka 'unsigned long'))
typedef su_int fixuint_t;
               ^
/home/mppf/w/llvm2/third-party/llvm/llvm-project/llvm/../compiler-rt//lib/builtins/umoddi3.c:15:16: note: previous definition is here
typedef du_int fixuint_t;
               ^
In file included from /home/mppf/w/llvm2/third-party/llvm/llvm-project/llvm/tools/rv/vecmath/crt.c:126:
/home/mppf/w/llvm2/third-party/llvm/llvm-project/llvm/../compiler-rt//lib/builtins/umodsi3.c:16:16: error: typedef redefinition with different types ('si_int' (aka 'int') vs 'di_int' (aka 'long'))
typedef si_int fixint_t;
               ^
/home/mppf/w/llvm2/third-party/llvm/llvm-project/llvm/../compiler-rt//lib/builtins/umoddi3.c:16:16: note: previous definition is here
typedef di_int fixint_t;

Is this supposed to work? Am I doing something wrong?

Thanks.

cost model debug printouts

a8499af changed src/analysis/costModel.cpp to turn off some debug printouts, but these seem to have reappeared in later revisions and are there in the current master branch.

Is it possible to add some sort of test to check RV is not released in this way? Or to make it an LLVM command line option?
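
Both suggestions could be combined by routing such printouts through a helper that stays silent unless a flag is set, plus a check that nothing is printed by default. The following is a minimal self-contained sketch, not RV's actual code: `DebugLog`, `report`, `runAnalysis`, and the message text are hypothetical names chosen for illustration. In-tree, LLVM's `cl::opt` and the `LLVM_DEBUG` macro would presumably take the place of the plain `bool`.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: forwards messages to a stream only when enabled.
class DebugLog {
public:
  DebugLog(std::ostream &out, bool enabled) : out_(out), enabled_(enabled) {}

  // Print the message only if debug output was requested.
  void report(const std::string &msg) {
    if (enabled_)
      out_ << msg << '\n';
  }

private:
  std::ostream &out_;
  bool enabled_;
};

// Stand-in for an analysis that would otherwise always dump its result.
// Returns everything that was written to the (captured) output stream.
inline std::string runAnalysis(bool debugEnabled) {
  std::ostringstream sink;
  DebugLog log(sink, debugEnabled);
  log.report("costModel: estimated cost = 42"); // illustrative message only
  return sink.str();
}

// Regression-style check in the spirit of the request above:
// no output unless debugging was explicitly enabled.
inline void checkSilentByDefault() {
  assert(runAnalysis(false).empty());
  assert(!runAnalysis(true).empty());
}
```

A test of this kind only catches printouts that go through the helper; direct calls such as `dump()` would still need to be found by capturing the tool's real output.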

Remove latch-exit restriction in rv::LoopVectorizer

At the moment, RV's LoopVectorizer pass only supports latch-exit loops, that is, loops whose only exit branch sits in the latch block. Otherwise, the remainder loop transformation (RemainderTransform) bails out and the loop remains scalar. This restriction does not apply to any other nested loops in the outer loop being vectorized.

This loop will be vectorized:

Header:
 ...
Latch:
 ...
  br i1 %exitcond, label %Exit, label %Header

This loop won't be vectorized as the exit is not in the latch:

Header:
 ...
  br i1 %exitcond, label %Exit, ....

Latch:
 ...
  br label %Header
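
The distinction between the two shapes above can be sketched on a toy CFG. This is a self-contained model under stated assumptions, not RV's implementation: `ToyLoop`, `isLatchExitLoop`, and the two factory functions are hypothetical. With LLVM's `Loop` class, a comparable test would presumably compare `getExitingBlock()` with `getLoopLatch()`; whether RemainderTransform checks it exactly that way is not stated in this issue.

```cpp
#include <cassert>
#include <vector>

// Toy CFG model (hypothetical, not LLVM's classes): blocks are indices,
// succs[b] lists the successors of block b.
struct ToyLoop {
  std::vector<std::vector<int>> succs; // successor lists for all blocks
  std::vector<bool> inLoop;            // inLoop[b]: block b belongs to the loop
  int header;                          // loop header block
  int latch;                           // block with the back edge to the header
};

// True if the loop has an exit edge and every exit edge leaves from the latch.
inline bool isLatchExitLoop(const ToyLoop &L) {
  assert(L.succs.size() == L.inLoop.size());
  bool sawExit = false;
  for (int b = 0; b < static_cast<int>(L.succs.size()); ++b) {
    if (!L.inLoop[b])
      continue;
    for (int s : L.succs[b]) {
      if (L.inLoop[s])
        continue; // edge stays inside the loop
      if (b != L.latch)
        return false; // exit from a non-latch block
      sawExit = true;
    }
  }
  return sawExit;
}

// Blocks in both examples: 0 = Header, 1 = Latch, 2 = Exit.
inline ToyLoop makeLatchExitLoop() {
  // Header -> Latch; Latch -> Exit or Header.
  return ToyLoop{{{1}, {2, 0}, {}}, {true, true, false}, 0, 1};
}

inline ToyLoop makeHeaderExitLoop() {
  // Header -> Exit or Latch; Latch -> Header.
  return ToyLoop{{{2, 1}, {0}, {}}, {true, true, false}, 0, 1};
}
```

Here `isLatchExitLoop(makeLatchExitLoop())` holds while `isLatchExitLoop(makeHeaderExitLoop())` does not, mirroring the first and second loop above.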

RV debug printouts

I'm seeing several debug prints that should be disabled by default.

Is it possible to add a test to the RV test suite to verify that there are no extraneous debug prints?

Thanks.

Below are a few that I've identified. These cause test failures in our frontend's testing system.

diff --git a/src/analysis/reductionAnalysis.cpp b/src/analysis/reductionAnalysis.cpp
index 5d54728..d06add9 100644
--- a/src/analysis/reductionAnalysis.cpp
+++ b/src/analysis/reductionAnalysis.cpp
@@ -429,7 +429,7 @@ ReductionAnalysis::analyze(Loop & hostLoop) {
       reductMap[inst] = red;
     }
 
-    red->dump();
+    //red->dump();
   }
 }
 
diff --git a/src/transform/remTransform.cpp b/src/transform/remTransform.cpp
index 9eaefdb..c23a31c 100644
--- a/src/transform/remTransform.cpp
+++ b/src/transform/remTransform.cpp
@@ -811,8 +811,8 @@ RemainderTransform::createVectorizableLoop(Loop & L, ValueSet & uniOverrides, in
   auto * branchCond = analyzeExitCondition(L, vectorWidth);
   if (!branchCond) {
     Report() << "remTrans: can not handle loop exit condition\n";
-    L.print(outs());
 #if 0
+    L.print(outs());
     for (auto * BB : L.blocks()) {
         outs() << "\n";
         outs() << *BB;

Lib call triggers assertion during vectorization

As a debugging feature, I thought it would be nice to have printf enabled in PACXX kernels. However, a simple call to puts("hello") in the function results in the following assertion failure:

/llvm/include/llvm/Support/Casting.h:255: typename llvm::cast_retty<X, Y*>::ret_type llvm::cast(Y*) [with X = llvm::GetElementPtrInst; Y = llvm::Value; typename llvm::cast_retty<X, Y*>::ret_type = llvm::GetElementPtrInst*]: Assertion `isa<X>(Val) && "cast<Ty>() argument of incompatible type!"' failed.

I set the vector shape of this call to varying and added a SIMD mapping so that puts would not be vectorized.

The assertion's call stack is as follows:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007fffdf46df5d in __GI_abort () at abort.c:90
#2  0x00007fffdf463f17 in __assert_fail_base (fmt=<optimized out>, 
    assertion=assertion@entry=0x7ffff75aecb0 "isa<X>(Val) && \"cast<Ty>() argument of incompatible type!\"", 
    file=file@entry=0x7ffff75aec48 "/llvm/include/llvm/Support/Casting.h", line=line@entry=255, 
    function=function@entry=0x7ffff75ba940 <llvm::cast_retty<llvm::GetElementPtrInst, llvm::Value*>::ret_type llvm::cast<llvm::GetElementPtrInst, llvm::Value>(llvm::Value*)::__PRETTY_FUNCTION__> "typename llvm::cast_retty<X, Y*>::ret_type llvm::cast(Y*) [with X = llvm::GetElementPtrInst; Y = llvm::Value; typename llvm::cast_retty<X, Y*>::ret_type = llvm::GetElementPtrInst*]") at assert.c:92
#3  0x00007fffdf463fc2 in __GI___assert_fail (
    assertion=0x7ffff75aecb0 "isa<X>(Val) && \"cast<Ty>() argument of incompatible type!\"", 
    file=0x7ffff75aec48 "/llvm/include/llvm/Support/Casting.h", line=255, 
    function=0x7ffff75ba940 <llvm::cast_retty<llvm::GetElementPtrInst, llvm::Value*>::ret_type llvm::cast<llvm::GetElementPtrInst, llvm::Value>(llvm::Value*)::__PRETTY_FUNCTION__> "typename llvm::cast_retty<X, Y*>::ret_type llvm::cast(Y*) [with X = llvm::GetElementPtrInst; Y = llvm::Value; typename llvm::cast_retty<X, Y*>::ret_type = llvm::GetElementPtrInst*]") at assert.c:101
#4  0x00007ffff752a0b1 in llvm::cast_retty<llvm::GetElementPtrInst, llvm::Value*>::ret_type llvm::cast<llvm::GetElementPtrInst, llvm::Value>(llvm::Value*) () from /local/lib/libRV.so
#5  0x00007ffff7537c47 in native::NatBuilder::buildGEP(llvm::GetElementPtrInst*, bool, unsigned int) ()
   from /local/lib/libRV.so
#6  0x00007ffff7537de5 in native::NatBuilder::requestScalarGEP(llvm::GetElementPtrInst*, unsigned int, bool) ()
   from /local/lib/libRV.so
#7  0x00007ffff7535163 in native::NatBuilder::vectorizeCallInstruction(llvm::CallInst*) () from /local/lib/libRV.so
#8  0x00007ffff753cc8f in native::NatBuilder::vectorize(llvm::BasicBlock*, llvm::BasicBlock*) ()
   from /local/lib/libRV.so
#9  0x00007ffff753ea03 in native::NatBuilder::vectorize(bool, llvm::ValueMap<llvm::Value const*, llvm::WeakTrackingVH, llvm::ValueMapConfig<llvm::Value const*, llvm::sys::SmartMutex<false> > >*) () from /local/lib/libRV.so

It seems that there is a problem with the GEP to the global value holding the string.

RV fails to build with LLVM 5 non-debug builds

I'm trying to build RV with LLVM 5 on Ubuntu 17.10, in a non-debug build. I get errors like this:

CMakeFiles/rvTool.dir/rvTool.cpp.o: In function `main':
rvTool.cpp:(.text.startup.main+0x9ed): undefined reference to `llvm::Module::dump() const'
../../../lib/libRV.a(LoopVectorizer.cpp.o): In function `rv::LoopVectorizer::vectorizeLoop(llvm::Loop&)':
LoopVectorizer.cpp:(.text._ZN2rv14LoopVectorizer13vectorizeLoopERN4llvm4LoopE+0x2508): undefined reference to `llvm::Value::dump() const'
LoopVectorizer.cpp:(.text._ZN2rv14LoopVectorizer13vectorizeLoopERN4llvm4LoopE+0x2659): undefined reference to `llvm::Value::dump() const'
collect2: error: ld returned 1 exit status
tools/rv/tools/CMakeFiles/rvTool.dir/build.make:134: recipe for target 'bin/rvTool' failed
make[6]: *** [bin/rvTool] Error 1
CMakeFiles/Makefile2:40841: recipe for target 'tools/rv/tools/CMakeFiles/rvTool.dir/all' failed

See also chapel-lang/chapel#7496 for what we had to do with the Chapel front-end for this... Basically it amounts to replacing myValue->dump() with myValue->print(dbgs(), true).

Re-run DA after SROV

SROV exposes opportunities for vector shape refinement. Re-run the (incremental) DA after SROV to refine vector shapes.

Maintain RVInfo after cloning a basic block before the linearization pass

I came across a std::out_of_range error in the Linearizer pass for a bitfield application. It crashes in verifyBlockIndex(). I assume that buildBlockIndex() builds the block index information and that verifyBlockIndex() verifies that it was built correctly.
Besides, I think I may be missing something when maintaining the RV-related vecInfo or updating the DomTree for the cloned basic block. I have cloned the basic block, remapped the instructions with a ValueMap, and updated the phi nodes in the successors. What may be missing?

VP Intrinsics

Are there plans to have RV generate VP intrinsics in the future?

SCEVCastExpr has been renamed to SCEVIntegralCastExpr

When compiling against the latest LLVM master, the following error message is printed:

/home/kazooie/extra/programming/llvm-project/rv/src/native/MemoryAccessGrouper.cpp:198:27: error: ‘SCEVCastExpr’ was not declared in this scope
  198 |       auto * aCast = cast<SCEVCastExpr>(A);
      |                           ^~~~~~~~~~~~
/home/kazooie/extra/programming/llvm-project/rv/src/native/MemoryAccessGrouper.cpp:198:42: error: no matching function for call to ‘cast<<expression error> >(const llvm::SCEV*&)’
  198 |       auto * aCast = cast<SCEVCastExpr>(A);
      |                                          ^

It seems to relate to this change: https://reviews.llvm.org/D89455. Is a PR welcome?
I also noticed that experimental intrinsics have been renamed and no longer include experimental_*: https://reviews.llvm.org/rG322d0afd875

Detect fast-math reductions

The reduction analysis/codegen in RV makes no distinction between ordered (strict) and unordered reductions, neither during reduction detection nor during codegen.
This should be improved to enable:

1.) in-order reductions (that is do not privatize the accumulation variable(s), reduce in every loop iteration using a strict reduction
2.) fast-math reduction (privatize the accumulator, reduce using an unordered, fast-math reduction).

RV currently employs an inconsistent mix of these two where reductions are privatized in SIMD code but the generated reduction code does not use the fast-math flags.
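
To illustrate why the two modes must not be mixed, here is a minimal, self-contained C++ sketch. It does not use RV or LLVM. The function names and the lane count W are made up for illustration, and the privatized variant only models what a vectorized loop with per-lane accumulators would compute.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// In-order (strict) reduction: a single accumulator, additions happen in
// source order. This is what the scalar program specifies without fast-math.
float reduce_in_order(const std::vector<float>& data) {
  float acc = 0.0f;
  for (float x : data) acc += x;
  return acc;
}

// Privatized reduction (hypothetical model of SIMD code with W lanes):
// every lane keeps its own partial sum and the lanes are combined after the
// loop. This reassociates the additions, which is only legal if the source
// allows it (e.g. fast-math / reassoc).
template <std::size_t W>
float reduce_privatized(const std::vector<float>& data) {
  static_assert(W > 0, "need at least one lane");
  std::array<float, W> lanes{};  // zero-initialized private accumulators
  std::size_t i = 0;
  for (; i + W <= data.size(); i += W)
    for (std::size_t l = 0; l < W; ++l) lanes[l] += data[i + l];

  float acc = 0.0f;
  for (float v : lanes) acc += v;             // horizontal reduction
  for (; i < data.size(); ++i) acc += data[i];  // scalar remainder
  return acc;
}
```

With IEEE-754 single precision and the input {1e8f, 1.0f, -1e8f, 1.0f}, the in-order sum is 1.0f (the first 1.0f is absorbed by 1e8f), while the two-lane privatized sum is 2.0f. Privatizing without fast-math permission therefore changes the program's result.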

Don't vectorize if the vectorization width is 1

This is a reminder for the vectorization width == 1 corner case.

RV will run the full vectorization pipeline if the vectorization width is 1. Since the divergence analysis implicitly assumes that there are at least two threads per vector, this leads to inefficient code and even breakage in LLVM's x86 backend if gathers on <1 x T*> are emitted.

Solution: make the divergence analysis report all instructions as uniform if the vectorization width is 1 and copy the scalar instructions in RV's vector code backend.
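
The first half of the proposed solution could look roughly like the sketch below. It is self-contained C++ and does not use RV's actual classes: Shape, Inst, and computeShapes are hypothetical stand-ins, and the "analysis" is reduced to a single flag per instruction.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a vector shape; not RV's real type.
enum class Shape { Uniform, Varying };

// Hypothetical instruction record: the flag models "the value may differ
// between lanes", which a real divergence analysis would compute.
struct Inst {
  bool dependsOnLaneId;
};

// Sketch of a shape query with an early exit for width == 1: with a single
// lane nothing can diverge, so every instruction is reported as uniform and
// a backend could simply copy the scalar instructions.
std::vector<Shape> computeShapes(const std::vector<Inst>& insts,
                                 unsigned width) {
  assert(width >= 1 && "vectorization width must be positive");
  std::vector<Shape> shapes(insts.size(), Shape::Uniform);
  if (width == 1) return shapes;  // corner case: skip the analysis entirely

  for (std::size_t i = 0; i < insts.size(); ++i)
    if (insts[i].dependsOnLaneId) shapes[i] = Shape::Varying;
  return shapes;
}
```

Handling the corner case in the shape query keeps the check in one place, so later stages do not each need their own width == 1 test.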

Empty error message

During build, RV emits an empty error message:

...
  store float %480, float* %483 : varying
  store float %479, float* %482 : varying
  ret void : uni

}
RV: error: 

^
Extracting a scalar value from a vector:
Original Value:   %366 = fmul float %340, %inv_det
Vector Value:   %183 = fmul <8 x float> %157, %inv_det_SIMD
...

Compilation works nevertheless, and LLVM IR files are correctly generated. The bug can be triggered by building rodent on the master branch.
