TL;DR
We propose removing the implicit VL API, making several API changes, and changing the high-level semantics of the vector intrinsics so that compilers can optimize more aggressively.
Abstract
The goal of this RFC is to improve the intrinsics for vector programming by choosing the explicit VL API as the only intrinsic API. The RFC consists of two parts: the first explains why we picked the explicit VL API as the final proposal, and the second removes the concept of low-level register state from the intrinsic programming model.
Background
Last year we announced the intrinsic spec for vector programming in C. We received a lot of useful feedback from several parties, including BSC, SiPearl, Andes, OpenCV China, PLCT Lab, and Alibaba/T-Head.
Today we have open-source implementations for both GCC and LLVM*, which take different approaches: implicit VL and explicit VL. We expected that a compiler could implement either API on top of the other with a simple C wrapper, e.g. implementing the explicit VL API as a C wrapper around the implicit VL API. However, this turned out to be a barrier to optimization.
The problem is that the concepts behind the two APIs are largely incompatible. After long discussion and exploration, we think it is time to pick one as the final proposal for the intrinsics spec, in order to reduce compiler maintenance effort and flatten the learning curve of the intrinsic functions.
Keeping only one intrinsic API also helps compiler optimization. We found several optimization opportunities that we cannot exploit today, because we must keep the semantics and behavior of both APIs correct.
So the question is which one is better. We have two API styles because we previously had no conclusion on which was better. This time is different: we have enough exploration, experience, and feedback to make the right decision.
* The LLVM parts are in the process of being upstreamed.
Explicit VL API As The Final Vector Intrinsic
After implementing the intrinsic API in both GCC and LLVM, we found several good reasons to use the explicit VL API from the compiler's point of view. The define-use relationship of an explicit VL is more natural to the compiler, which makes analysis and optimization easier.
We also received user feedback on both APIs. The implicit VL API is less verbose, but it is hard to track the state of the VL register, which also makes debugging harder. Having an explicit VL argument makes programming easier.
Based on both user feedback and compiler implementation considerations, we believe explicit VL is the right way to go, and based on that decision we propose the following changes to the vector intrinsic API.
Abstract Low-Level Register Modeling in High-Level Language Layer
During implementation, we found that the state of the VL register becomes an optimization barrier. We must maintain the correct order between vreadvl, vsetvli, and all other explicit VL APIs, because each explicit VL API has the semantics of writing the VL register.
For example, we can't reorder operations across vreadvl in the following program.
n = 10000;
avl = vsetvl_e32m4 (n); // Assume VLEN=128, so avl = 16
vl1 = vreadvl(); // 16
vint32m4_t const_one = vmv_v_x_i32m4_vl(1, 4);
vl2 = vreadvl(); // 4
vint32m4_t tmp = vadd_vx_i32m4_vl(const_one, 4, avl / 2);
vl3 = vreadvl(); // 8
vse32_v_i32m4_vl(a, tmp, avl);
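The ordering constraint can be reproduced with a small scalar model in plain C. This is only a sketch: `model_vl_reg`, `model_vsetvl`, `model_op_vl`, `model_readvl` and `model_trace` are hypothetical names, not part of any intrinsic API, and the model assumes the simple policy vl = min(avl, VLMAX) with VLMAX fixed to 16 (VLEN=128, SEW=32, LMUL=4).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar model, not the real intrinsic API. */
enum { MODEL_VLMAX = 16 };   /* assumed: VLEN=128, SEW=32, LMUL=4 -> 128/32*4 */

static size_t model_vl_reg;  /* stands in for the hardware VL register */

/* Models vsetvl, assuming the simple policy vl = min(avl, VLMAX). */
static size_t model_vsetvl(size_t avl) {
    model_vl_reg = avl < MODEL_VLMAX ? avl : MODEL_VLMAX;
    return model_vl_reg;
}

/* Models any "_vl" intrinsic under the old semantics: it writes VL. */
static void model_op_vl(size_t vl) {
    assert(vl <= MODEL_VLMAX);
    model_vl_reg = vl;
}

/* Models vreadvl: it observes whatever the last write left behind. */
static size_t model_readvl(void) {
    return model_vl_reg;
}

/* Replays the sequence from the example and records the three reads. */
static void model_trace(size_t out[3]) {
    size_t avl = model_vsetvl(10000);
    out[0] = model_readvl();
    model_op_vl(4);          /* the vmv step, called with vl = 4 */
    out[1] = model_readvl();
    model_op_vl(avl / 2);    /* the vadd step, called with half of the vsetvl result */
    out[2] = model_readvl();
}
```

In this model every read depends on the most recent write to the shared variable, so swapping any read with a neighbouring operation changes the recorded values. That is the dependency the compiler has to preserve.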
Furthermore, we can't move an explicit VL vector intrinsic across any other explicit VL vector intrinsic, because the define-use relationship is modeled coarsely: we only model that the intrinsic writes some global state, and it is hard to describe and track in detail which state is changed. This is an implementation limitation of current mainstream open-source compilers.
So abstracting the low-level register state away from the high-level language layer is a straightforward option. By abstracting the VL register and making the vector length just an argument, we are released from this implementation limitation. It also brings several optimization advantages: we can model almost all intrinsic functions, except loads/stores and a few special instructions, as pure functions with no side effects, which is a fantastic property in compiler optimization land.
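As an illustration of what "pure" buys the optimizer, GCC and Clang let a function be annotated with the `const` attribute, which states that its result depends only on its arguments. The sketch below applies it to a hypothetical scalar stand-in for vsetvl; `model_vsetvl`, `model_twice` and `MODEL_VLMAX` are names invented for this example, and the policy vl = min(avl, VLMAX) is an assumption.

```c
#include <stddef.h>

enum { MODEL_VLMAX = 16 };  /* assumed: VLEN=128, SEW=32, LMUL=4 */

/* Hypothetical model of vsetvl under the proposed semantics: no hidden
 * state is read or written, so the result depends on avl alone. */
__attribute__((const))
static size_t model_vsetvl(size_t avl) {
    return avl < MODEL_VLMAX ? avl : MODEL_VLMAX;
}

/* Calls the model twice with the same argument. Because the callee is
 * declared const, the compiler is allowed to evaluate it only once. */
static size_t model_twice(size_t n) {
    size_t first = model_vsetvl(n);
    size_t second = model_vsetvl(n);
    return first + second;
}
```

Whether a given compiler actually merges the two calls depends on the optimization level. The attribute only grants permission to do so.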
To demonstrate the power of treating most intrinsic functions as pure functions, here is a function with a loop that contains loop-invariant code.
void foo(int *a, int n) {
  size_t vl, vlmax;
  while (vl = vsetvl_e32m4 (n)) {
    vlmax = vsetvlmax_i32m4 ();
    vint32m4_t const_one = vmv_v_x_i32m4_vl(1, vlmax);
    vl = vsetvl_e32m4 (n);
    vint32m4_t tmp = vadd_vx_i32m4_vl(const_one, 4, vl);
    vse32_v_i32m4_vl(a, tmp, vl);
    n -= vl;
    a += vl;
  }
}
Since vsetvl and vsetvlmax are now pure functions, vlmax = vsetvlmax_i32m4 (); can be safely hoisted out of the loop:
void foo(int *a, int n) {
  size_t vl, vlmax;
  vlmax = vsetvlmax_i32m4 ();
  while (vl = vsetvl_e32m4 (n)) {
    vint32m4_t const_one = vmv_v_x_i32m4_vl(1, vlmax);
    vl = vsetvl_e32m4 (n);
    vint32m4_t tmp = vadd_vx_i32m4_vl(const_one, 4, vl);
    vse32_v_i32m4_vl(a, tmp, vl);
    n -= vl;
    a += vl;
  }
}
All arguments of vmv_v_x_i32m4_vl are then loop invariant, so we can hoist that call too.
void foo(int *a, int n) {
  size_t vl, vlmax;
  vlmax = vsetvlmax_i32m4 ();
  vint32m4_t const_one = vmv_v_x_i32m4_vl(1, vlmax);
  while (vl = vsetvl_e32m4 (n)) {
    vl = vsetvl_e32m4 (n);
    vint32m4_t tmp = vadd_vx_i32m4_vl(const_one, 4, vl);
    vse32_v_i32m4_vl(a, tmp, vl);
    n -= vl;
    a += vl;
  }
}
We can also see that vsetvl is called twice with the same input. Because it is a pure function, the CSE pass can easily remove the second call:
void foo(int *a, int n) {
  size_t vl, vlmax;
  vlmax = vsetvlmax_i32m4 ();
  vint32m4_t const_one = vmv_v_x_i32m4_vl(1, vlmax);
  while (vl = vsetvl_e32m4 (n)) {
    vint32m4_t tmp = vadd_vx_i32m4_vl(const_one, 4, vl);
    vse32_v_i32m4_vl(a, tmp, vl);
    n -= vl;
    a += vl;
  }
}
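The claim that these transformations preserve behavior can be checked with a scalar model in portable C. This is a sketch under stated assumptions: all `model_*` names and both `foo_*` variants are hypothetical stand-ins for the intrinsics, vsetvl is assumed to return min(avl, VLMAX), and VLMAX is fixed to 16.

```c
#include <assert.h>
#include <stddef.h>

enum { MODEL_VLMAX = 16 };  /* assumed: VLEN=128, SEW=32, LMUL=4 */

typedef struct { int lane[MODEL_VLMAX]; } model_vec;

/* Pure models: results depend on the arguments only. */
static size_t model_vsetvl(size_t avl) {
    return avl < MODEL_VLMAX ? avl : MODEL_VLMAX;
}
static size_t model_vsetvlmax(void) { return MODEL_VLMAX; }

static model_vec model_splat(int x, size_t vl) {
    assert(vl <= MODEL_VLMAX);
    model_vec v = {{0}};
    for (size_t i = 0; i < vl; i++) v.lane[i] = x;
    return v;
}
static model_vec model_add_scalar(model_vec a, int x, size_t vl) {
    assert(vl <= MODEL_VLMAX);
    for (size_t i = 0; i < vl; i++) a.lane[i] += x;
    return a;
}
/* The store is the only operation with a side effect. */
static void model_store(int *dst, model_vec v, size_t vl) {
    assert(vl <= MODEL_VLMAX);
    for (size_t i = 0; i < vl; i++) dst[i] = v.lane[i];
}

/* The loop as first written: invariant work repeated every iteration. */
static void foo_naive(int *a, size_t n) {
    size_t vl;
    while ((vl = model_vsetvl(n)) != 0) {
        size_t vlmax = model_vsetvlmax();
        model_vec const_one = model_splat(1, vlmax);
        vl = model_vsetvl(n);
        model_vec tmp = model_add_scalar(const_one, 4, vl);
        model_store(a, tmp, vl);
        n -= vl;
        a += vl;
    }
}

/* After hoisting the invariants and removing the duplicate vsetvl. */
static void foo_optimized(int *a, size_t n) {
    size_t vlmax = model_vsetvlmax();
    model_vec const_one = model_splat(1, vlmax);
    size_t vl;
    while ((vl = model_vsetvl(n)) != 0) {
        model_vec tmp = model_add_scalar(const_one, 4, vl);
        model_store(a, tmp, vl);
        n -= vl;
        a += vl;
    }
}

/* Returns 1 if both variants write 5 to the first n elements and
 * leave the rest of a 64-element buffer untouched. */
static int model_variants_agree(size_t n) {
    int x[64] = {0}, y[64] = {0};
    assert(n <= 64);
    foo_naive(x, n);
    foo_optimized(y, n);
    for (size_t i = 0; i < 64; i++) {
        int expect = i < n ? 5 : 0;
        if (x[i] != expect || y[i] != expect) return 0;
    }
    return 1;
}
```

The two variants agree only because the splat, add, and vsetvl models have no hidden state. If any of them wrote a shared VL variable, the hoisted version would no longer be a valid rewrite.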
As the demonstration above shows, the advantage of abstracting the VL register state is clear. The same optimizations are hard to do with the implicit VL API. Consider the following example:
void foo(int *a, int n) {
  size_t vl;
  while (vl = vsetvl_e32m4 (n)) {
    vsetvlmax_i32m4 ();
    vint32m4_t const_one = vmv_v_x_i32m4(1);
    vsetvl_e32m4 (n);
    vint32m4_t tmp = vadd_vx_i32m4(const_one, 4);
    vse32_v_i32m4 (a, tmp);
    n -= vl;
    a += vl;
  }
}
Since there is a hidden dependency between all vector intrinsic functions, the compiler cannot reorder any vector intrinsic call relative to another, so none of the optimizations above can be applied.
Additionally, we found that this could fix other potential issues with code generation for GNU vector extension types: the rest of the generic compiler infrastructure is not aware of the VL register state, which leads to poor performance.
So what is a GNU vector extension type? GCC and Clang/LLVM both provide a vector type extension that makes SIMD programs easier to write. For example, we can declare a type with the vector_size attribute and then operate on variables of that type with ordinary operators, just like normal scalar types.
typedef int int32x4_t __attribute__ ((vector_size (16)));
int32x4_t a, b, c;
a = b + c; // NO explicit VL reg use or def in middle-end
// but it will expand to vsetvli_e32m1(4) and vadd
We can generate vector instructions for this. However, this code generation path may need to change VL to fit the semantics of the operation: in the example above, VL must be set to 4 before the operation is performed, and that would be a problem if VL were modeled in the compiler middle-end.
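For readers who have not used the GNU vector extension, here is a small self-contained sketch that compiles with GCC or Clang on any target. The helper names `add4` and `lane` are made up for this example. Nothing in the source mentions a vector length, which is why the middle-end has no VL definition or use to track.

```c
#include <assert.h>
#include <stddef.h>

/* Four 32-bit ints packed into one 16-byte vector type. */
typedef int int32x4_t __attribute__ ((vector_size (16)));

/* Element-wise addition through the ordinary + operator. */
static int32x4_t add4(int32x4_t b, int32x4_t c) {
    return b + c;
}

/* Reads one element; vector types can be subscripted like arrays. */
static int lane(int32x4_t v, size_t i) {
    assert(i < 4);
    return v[i];
}
```

A call such as `add4(b, c)` with `b = {1, 2, 3, 4}` and `c = {10, 20, 30, 40}` yields `{11, 22, 33, 44}`. How the addition is lowered (SIMD instructions, RVV with VL set to 4, or scalar code) is entirely up to the backend.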
LLVM VPlan IR is in a similar situation in the compiler middle-end for GNU vector code generation.
For the above reasons, we believe that abstracting low-level register modeling in the high-level language layer is the right way to go.
API Changes
However, several APIs must change as a result of removing the concept of VL from the C language layer.
The first is the vreadvl API, which exposes the state of the VL register. We must remove it to prevent low-level information from leaking into the high-level programming language.
The second concerns instructions other than vsetvl[i] that change VL. In the RISC-V vector ISA these are the vle*ff.v instructions, which update the VL register if an exception occurs before VL elements have been loaded. We'll introduce an extra argument to return the content of the VL register:
vint16m1_t vle16ff_v_i16m1 (const int16_t *base); // Current API
vint16m1_t vle16ff_v_i16m1 (const int16_t *base, size_t *vl); // New API
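The effect of the extra pointer argument can be sketched with a scalar model. Everything here is hypothetical: `model_vle16ff`, `model_vec16` and the `readable` parameter (a stand-in for "how many elements can be read before a fault would occur") are inventions of this example. Treating `*vl` as both the requested length on input and the resulting length on output is also an assumption, since the text only says the argument receives the VL content.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { MODEL_VLMAX = 8 };  /* assumed: VLEN=128, SEW=16, LMUL=1 */

typedef struct { int16_t lane[MODEL_VLMAX]; } model_vec16;

/* Hypothetical model of a fault-only-first load.
 * On entry *vl is the requested element count; on return it holds the
 * number of elements actually loaded. `readable` models how many
 * elements starting at base are accessible without faulting. */
static model_vec16 model_vle16ff(const int16_t *base, size_t readable,
                                 size_t *vl) {
    assert(*vl <= MODEL_VLMAX);
    /* A fault on element 0 would trap as usual; that case is not modeled. */
    assert(readable > 0 || *vl == 0);
    model_vec16 v = {{0}};
    size_t n = *vl < readable ? *vl : readable;
    for (size_t i = 0; i < n; i++) v.lane[i] = base[i];
    *vl = n;  /* the trimmed length is returned through the pointer */
    return v;
}
```

With this shape the caller learns the shortened length from an ordinary output argument, so no separate read of the VL register is needed after the load.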
The last change is to the API names: the _vl suffix becomes redundant and verbose once the explicit VL API is the final vector intrinsic API, so we propose dropping it.
Conclusion
In this RFC we propose choosing the explicit VL API as the final vector intrinsic API to reduce the complexity of compiler implementation. Second, we propose abstracting low-level register modeling in the high-level language layer to enable further optimization of vector intrinsic programs. Last, we have to change part of the intrinsic API as a result of the above changes.