Comments (19)
Thanks for getting in touch! Yes, this is an important topic we could mention in the readme. Here's a proposal:
Let there be a `template<class D> void LoopBody(D d, ...)`, `T` the element type, `count` the number of elements to process, and `N` the native number of lanes, i.e. `Lanes(HWY_FULL(T))`. There are several ways to "strip-mine" a loop:
- Ensure all inputs/outputs are padded. Then the loop is simply
for (size_t i = 0; i < count; i += N) LoopBody(HWY_FULL(T)(), ...);
This is the preferred option, unless perhaps `N` is in the thousands and vector operations are pipelined with long latencies. That was the case for supercomputers in the 90s, but nowadays ALUs are cheap and most implementations split vectors into 1, 2 or 4 parts, so there is little cost to processing entire vectors even if we do not need all their lanes. Indeed, this avoids the (potentially large) cost of predication or partial loads/stores on older targets, and does not duplicate code.
- Process whole vectors, then switch to a scalar loop:
size_t i = 0;
for (; i + N <= count; i += N) LoopBody(HWY_FULL(T)(), ...);
for (; i < count; ++i) LoopBody(HWY_CAPPED(T, 1)(), ...);
This allows reusing the same source code, and is reasonable if `count` is large. Otherwise, multiple iterations may be slower than one `LoopBody` variant with masking, especially because the `HWY_SCALAR` target selected by `HWY_CAPPED(T, 1)` is slower for some operations due to workarounds for undefined behavior in C++.
- Process whole vectors, then run a single iteration of a modified `LoopBody` with masking:
size_t i = 0;
for (; i + N <= count; i += N) LoopBody(HWY_FULL(T)(), ...);
if (i < count) LoopBodyPartial(count - i);
This is more efficient on older targets than a single loop that always does masking. It may be possible to avoid source-code duplication by adding a boolean template parameter to `LoopBody` indicating whether to use full or partial loads/stores. Note that `LoadN`/`StoreN(n)` operations, which touch at most `n <= N` aligned elements, are planned but not yet implemented. If you need them soon, please let us know and we will take that into account.
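To make the strip-mining structure concrete, here is a minimal scalar sketch of the second variant, with a constant `kN` standing in for `Lanes(HWY_FULL(T))` and a trivial doubling body. All names here are illustrative, not actual Highway API:

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kN = 4;  // stands in for N = Lanes(HWY_FULL(T))

// Stands in for the SIMD LoopBody; processes `lanes` elements at p.
void LoopBody(double* p, std::size_t lanes) {
  for (std::size_t j = 0; j < lanes; ++j) p[j] *= 2.0;
}

void Process(std::vector<double>& x) {
  std::size_t i = 0;
  // Whole vectors, kN lanes at a time.
  for (; i + kN <= x.size(); i += kN) LoopBody(&x[i], kN);
  // Scalar remainder loop (HWY_CAPPED(T, 1) in the proposal above).
  for (; i < x.size(); ++i) LoopBody(&x[i], 1);
}
```

With 11 elements and `kN == 4`, the first loop handles 8 lanes in two iterations and the scalar loop handles the remaining 3 elements.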
What do you think? We'd welcome feedback and discussion.
from highway.
@castigli I forgot to add the notification :)
With SVE, N is a runtime constant (i.e. it is fixed on a given CPU, but may differ between CPUs running the same SVE code). We can use the following trick to accommodate both compile-time and run-time specialization of N:
template <int N_tpl = 0>
void f(int N_runtime) {
  int N = N_tpl ? N_tpl : N_runtime;
  ...
}
You may even use `N_tpl == 0` to check for a VLA instruction set, and use predicated operations only in that case.
But AFAIR, on Intel a masked load/store is a very slow operation, since it has to check permissions for each byte individually. So you may need to check the performance of masked load/store on VLA platforms.
Thank you for the reply @jan-wassenberg.
I agree with you that padding is the preferred option, but it can be cumbersome to retrofit into legacy code.
The second and third options can always be mapped to both VLA and non-VLA models (at the cost of fragmenting the loop unnecessarily for an ISA that always requires a mask/predicate, like SVE); I hadn't fully appreciated that.
Perhaps the third option would be the best compromise: it avoids duplicating code and still takes advantage of predicated/masked loads and stores. Additionally, if a specific ISA does not have an efficient masked load/store, perhaps the `LoopBodyPartial` could be mapped to a, possibly unrolled, scalar loop.
In summary, I like the idea of having `LoadN`/`StoreN(n)`; it seems like it could fit both VLA and non-VLA models well enough. It would be nice to have them early, in order to have a general implementation from the get-go, but as of now I am not at the point of needing them yet.
I will get back in touch if/when I reach that stage!
Thanks again for the reply!
@Bulat-Ziganshin that looks neat. To make sure I fully understand, how would you call f, or: what is the N_tpl argument?
In Hwy, `Lanes(d)` is already effectively a compile-time constant on non-VLA targets, otherwise runtime, so you can use that to initialize your `N`.
AFAIR, on Intel a masked load/store is a very slow operation
Right, _mm_maskmoveu_si128 is hundreds of cycles due to the non-temporal "hint". _mm256_maskstore_epi32 is more like 1 cycle/lane, but not available for 8/16-bit. If we instead emulate with an (aligned) load, blend, store, that's not atomic, but it seems risky to rely on that anyway.
Using scalar for large vectors is quite costly per lemaitre's measurements: WebAssembly/flexible-vectors#13
Any thoughts on whether we should only provide 32/64-bit masked stores, or rely on blend for 8/16 bits?
(Compress is similar, we only support 32/64 bit because that is what AVX2 can do. Might extend this to 16 bit later.)
Perhaps the third option would be the best compromise to not duplicate code and still take advantage of the predicate/mask loading/stores
Agreed, that's a good default when padding is not possible.
I like the idea of having LoadN/StoreN(n), it seems like it could fit well enough both VLA and non-VLA models. It would be nice to have them early,
Thanks for your feedback, we can do these fairly soon. Before we consider the API final, I'd like to test with the upcoming VL-parameter intrinsics which will hopefully be in the next RVV compiler release.
Out of curiosity, what's your application/use case?
how would you call f, or: what is the N_tpl argument?
f(1); // runtime N
f<1>(0); // compile-time N
I developed this trick to optimize code that should support any N, but runs faster when N is a compile-time constant. So I used a switch, with cases for a few popular N values and a default for the remaining ones:
switch(N)
{
case 1: f<1>(0); break;
case 2: f<2>(0); break;
default: f(N); break;
}
This trick allows us to combine speed of template-parameter N with flexibility of function-parameter N without writing too much extra code.
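To illustrate the dispatch pattern end to end, here is a self-contained sketch; `SumGroups` is a hypothetical kernel (not Highway API) that merely traverses data in groups of N:

```cpp
// Hypothetical kernel: sums `count` ints, traversed in groups of N lanes.
// N_tpl == 0 means N is only known at run time.
template <int N_tpl = 0>
long SumGroups(const int* data, int count, int N_runtime = 0) {
  const int N = N_tpl ? N_tpl : N_runtime;  // folded to a constant if N_tpl != 0
  long sum = 0;
  for (int i = 0; i < count; i += N)
    for (int j = i; j < i + N && j < count; ++j) sum += data[j];
  return sum;
}

// The dispatch switch: popular N values get a compile-time specialization.
long Sum(const int* data, int count, int N) {
  switch (N) {
    case 4: return SumGroups<4>(data, count);
    case 8: return SumGroups<8>(data, count);
    default: return SumGroups(data, count, N);  // generic run-time N
  }
}
```

Only the switch itself pays a branch; each specialized instantiation sees N as a constant expression.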
@Bulat-Ziganshin Nice and elegant, thanks for sharing. I agree it is helpful to specialize code while minimizing source code differences. Will update the proposed option 3 and add it to the readme.
it looks like the updated README contains a typo:
`LoopBody<true>(count - i)`
should be `...(d, count - i)`
Also, this one:
"Now the template and second function argument"
probably should be "Now the template parameter..."
Good catch, thanks, fixed :)
Thinking aloud about whether we can make the blend atomic: there is no atomic 128-bit x86 instruction other than cmpxchg16b, which uses 4 general-purpose registers. Moving between vector and GPR is probably slower than conditional scalar stores, especially for i64 on SSE4. To reduce branches, we can require n >= 1; anyone who wants to handle n == 0 can do so outside of LoadN.
Let's look at when atomicity might be required: two threads T1, T2; T1 does StoreN(1, p) and T2 does StoreN(1, p+1). If we're unlucky, T1 might overwrite T2's data with the vector it loaded. But it is anyway helpful to require aligned pointers; otherwise we could trigger page faults if p+2 is in an unmapped page.
So we still have two options:
- blend: can support all types (8/16/32/64-bit), no branches, but requires alignment
- scalar: can support unaligned, but branchy, and only supports 32/64-bit, because 8-bit scalar would be expensive on AVX2/3
I'm leaning heavily towards 1; does anyone see any issues or a better path?
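For reference, the blend option can be sketched in scalar C++ as follows (all names illustrative); the write-back of lanes beyond n is exactly what makes it non-atomic with respect to another thread writing those lanes:

```cpp
#include <cstddef>

constexpr std::size_t kN = 4;  // assumed lane count

// Blend approach: full aligned load, replace the first n lanes, full store.
// Lanes [n, kN) are re-written with the values just loaded, which is why
// this is not atomic w.r.t. another thread writing those lanes.
void BlendStoreN(const int (&v)[kN], int* aligned_p, std::size_t n) {
  int lanes[kN];
  for (std::size_t i = 0; i < kN; ++i) lanes[i] = aligned_p[i];   // load
  for (std::size_t i = 0; i < n && i < kN; ++i) lanes[i] = v[i];  // blend
  for (std::size_t i = 0; i < kN; ++i) aligned_p[i] = lanes[i];   // store
}
```

There are no branches on n beyond the loop bound, and any element type works, but the destination must be accessible (and, per the discussion, aligned) for all kN lanes.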
There is a trick you can use as long as you have at least one full "register" worth of values, so you only need to pad up to N, not to a multiple of N. Say you have 5 values but the register width is 4 values; you can do the following:
[0][1][2][3][4]
[0][1][2][3] <---- SIMD op starting at [0]
[1][2][3][4] <---- SIMD op ending at [4]
Even though we are redoing work for [1][2][3], you never have to specialize for the partial register case, so no masking is required and the code is never duplicated. If your data was originally aligned, then on the very last iteration, you do an unaligned load and an unaligned store, but that's it.
Another example, 11 values, width 4:
----- Regular Loop Body ------
[0][1][2][3][4][5][6][7][8][9][10]
[0][1][2][3]
[4][5][6][7]
------- Final Iteration------------
[7][8][9][10]
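The overlap technique above can be sketched in scalar C++ (names illustrative; assumes `count >= kN` and an out-of-place transform, so re-processing the overlap just recomputes the same results):

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kN = 4;  // "register" width in elements

// Stands in for one full-width vector op (out-of-place doubling).
void SimdOp(const int* src, int* dst) {
  for (std::size_t j = 0; j < kN; ++j) dst[j] = src[j] * 2;
}

// Requires count >= kN. The final call may overlap the previous block;
// elements in the overlap are simply processed twice.
void Transform(const std::vector<int>& src, std::vector<int>& dst) {
  const std::size_t count = src.size();
  std::size_t i = 0;
  for (; i + kN <= count; i += kN) SimdOp(&src[i], &dst[i]);
  if (i < count) SimdOp(&src[count - kN], &dst[count - kN]);
}
```

For 11 elements with width 4, the loop covers [0, 8) and the final call covers [7, 11), re-doing element 7 harmlessly.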
@jan-wassenberg I am looking at benchmarking some kernels in CoreNEURON that at the moment are autovectorized with ISPC.
In my limited experience with x86 and AVX2, I have not noticed much difference in performance between unaligned and aligned load/store. Because of that, I wonder if requiring alignment might be too strict (again, thinking about legacy code).
If you can force alignment easily, you are probably in a position to pad the data as well?
How about using @rhettstucki's suggestion for both a head and a tail loop (with the head used to align the regular loop body)?
I think that we should follow the principle of least surprise, i.e. `Load/StoreN` shouldn't have unexpected out-of-bounds memory accesses. Imagine two 17-element arrays lying one after the other: using StoreN for `a[16]` in one thread may disrupt an update of `b[0]` performed in another thread.
Similarly, library code doesn't know whether `a[count]` can be accessed at all without memory access errors, and in the cases when code does know that, it can just use the regular Load/Store procedures.
I propose to use your sketch of blended operations to provide `BlendN(n, a, b)`, which combines the first n elements of a with the remaining elements of b: a sort of `IfThenElse` with a specific mask. Or maybe just provide a Mask array, so that `BlendN(n, a, b) == IfThenElse(Mask[n], a, b)`. This allows one to write `Store(BlendN(Load...` explicitly, making it clear that memory beyond the bounds is accessed.
For `LoadN`/`StoreN` I propose to try plain memcpy. Compilers are incredibly smart at implementing it today, although the price may be code bloat. Just:
_mm_store_si128((__m128i*)lanes, v);
memcpy(to, lanes, n*4);
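A portable sketch of this idea without intrinsics (the local `lanes` buffer stands in for the `_mm_store_si128` spill; names are illustrative):

```cpp
#include <cstddef>
#include <cstring>

constexpr std::size_t kN = 4;  // assumed lane count

// Spill the "vector" to a stack buffer (standing in for _mm_store_si128),
// then memcpy only the first n lanes to the destination.
void StoreN(const int (&v)[kN], int* to, std::size_t n) {
  int lanes[kN];
  std::memcpy(lanes, v, sizeof(lanes));
  std::memcpy(to, lanes, n * sizeof(int));  // only n lanes touch memory
}
```

Unlike the blend approach, no bytes beyond the first n lanes are written, so neighboring data is never disturbed.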
The solution I proposed does not access memory out of bounds. In your example of two 17-element arrays processed 16 at a time, this is what happens:
[0][1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16]
[0][1][2][3][4][5][6][7][8][9][10][11][12][13][14][15]
[1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16]
There is nothing magic about memcpy with a run-time size: you always pay for a function call into generic code that has to determine the best way to copy n * 4 bytes. Compilers can only make memcpy fast when the number of bytes is known at compile time, in which case the memcpy turns into a store. If the work being done in the loop is significant, then you are right, the memcpy at the end doesn't really matter because it is a small part of the run time. If the work inside the loop is very little, though, the memcpy can take more time than the loop itself.
You are right on both points; I was replying to Jan's proposal of blended Load/Store.
@Bulat-Ziganshin Oh I see now! Thanks, I wasn't sure.
@rhettstucki
Nice, thanks for posting. I agree that's helpful when we're mainly writing to an array; unaligned stores can be expensive, but it's just a single iteration. Will add it to the readme as well; how's this?
size_t i = 0;
for (; i + N <= count; i += N) LoopBody<false>(d, i, 0);
LoopBody<false>(d, count - N, 0);
I am looking at benchmarking some kernels
Nice, would be interested in your results.
I have not noticed much difference in performance between unaligned and aligned load/store
Yes, it is a bit subtle. If the data is actually aligned (and it often is by the compiler), there should be almost no difference.
If not, there's a large spike on page boundaries and a smaller one on cache-line boundaries, but those get averaged into less-noticeable overhead in total. On AVX3, every unaligned access is a cache-line split, which is worse. Even on AVX2, unaligned loads halve your bandwidth from L1; aligned won't quite get us to 2 loads/cycle due to fills from L2, but it's much closer.
Imagine two 17-element arrays lying one after the other: using StoreN for a[16] in one thread may disrupt an update of b[0] performed in another thread.
Yes, this would be problematic if they were not aligned. In Hwy, Load actually requires alignment, whereas LoadU allows unaligned.
Hence we can expect StoreN (as opposed to StoreUN) to require alignment :)
If a[count] is unmapped as you say (which I think can only happen if the pointer is unaligned), how do we handle that?
Users could expect StoreN not to fault, so it seems we'd be requiring a scalar loop?
From a least-surprise perspective, I like your idea of calling it Blend[U]. Then we could even allow unaligned pointers, with the understanding that page faults are the user's responsibility to prevent or even handle? And on RVV, the implementation would be allowed to use a masked store instead of a blend?
For Load/StoreN I propose to try plain memcpy.
I agree with Rhett's comments here. Example: https://gcc.godbolt.org/z/8bo936
If count is huge, maybe it's still fine, but users can already do that without a new function?
@Bulat-Ziganshin @rhettstucki we have a user request for "masked stores" so I will be implementing something soon.
On further thought, I like Bulat's idea of only providing a mask for use by `IfThenElse`, with application code responsible for load/store/blend (only the app knows whether atomicity is required, and whether pointers are aligned or accessible). This approach at least makes those issues visible.
Specifically, we can add `FirstN(d, N)`, which is equivalent to `Iota(d) < Set(d, N)`. This will not take full advantage of SVE/RVV masked stores, but I suppose that is unavoidable given the requirement of making atomicity/fault issues visible instead of hiding them behind a StoreN.
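The FirstN semantics can be sketched lane-wise in plain C++, with arrays standing in for vectors (illustrative only, not the Highway implementation):

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kN = 8;  // assumed lane count

// FirstN(d, n): mask whose first n lanes are true, i.e. Iota(d) < Set(d, n).
std::array<bool, kN> FirstN(std::size_t n) {
  std::array<bool, kN> m{};
  for (std::size_t i = 0; i < kN; ++i) m[i] = (i < n);
  return m;
}

// IfThenElse as a per-lane select between a and b.
std::array<int, kN> IfThenElse(const std::array<bool, kN>& m,
                               const std::array<int, kN>& a,
                               const std::array<int, kN>& b) {
  std::array<int, kN> r{};
  for (std::size_t i = 0; i < kN; ++i) r[i] = m[i] ? a[i] : b[i];
  return r;
}
```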
Does that make sense and meet everyone's needs? Would also welcome any better suggestions for naming.
Closing, please feel free to open an issue if you'd like to discuss masked load/store further.
@Bulat-Ziganshin wanted to follow up and thank you for this proposal. We have FirstN, which can serve as the mask, plus there's now a `CompressBlendedStore`, which works like the code sequence you described and is helpful.