Comments (19)
Thanks for getting in touch! Yes, this is an important topic we could mention in the readme. Here's a proposal:
Let there be a `template<class D> void LoopBody(D d, ...)`, `T` the element type, `count` the number of elements to process, and `N` the native number of lanes, i.e. `Lanes(HWY_FULL(T))`. There are several ways to "strip-mine" a loop:
- Ensure all inputs/outputs are padded. Then the loop is simply
for (size_t i = 0; i < count; i += N) LoopBody(HWY_FULL(T)(), ...);
This is the preferred option, unless perhaps `N` is in the thousands and vector operations are pipelined with long latencies. That was the case for supercomputers in the 90s, but nowadays ALUs are cheap and most implementations split vectors into 1, 2 or 4 parts, so there is little cost to processing entire vectors even if we do not need all their lanes. Indeed, this avoids the (potentially large) cost of predication or partial loads/stores on older targets, and does not duplicate code.
- Process whole vectors, then switch to a scalar loop:
size_t i = 0;
for (; i + N <= count; i += N) LoopBody(HWY_FULL(T)(), ...);
for (; i < count; ++i) LoopBody(HWY_CAPPED(T, 1)(), ...);
This allows reusing the same source code, and is reasonable if `count` is large. Otherwise, multiple iterations may be slower than one `LoopBody` variant with masking, especially because the `HWY_SCALAR` target selected by `HWY_CAPPED(T, 1)` is slower for some operations due to workarounds for undefined behavior in C++.
- Process whole vectors, then run a single iteration of a modified `LoopBody` with masking:
size_t i = 0;
for (; i + N <= count; i += N) LoopBody(HWY_FULL(T)(), ...);
if (i < count) LoopBodyPartial(count - i);
This is more efficient on older targets than a single loop that always does masking. It may be possible to avoid source-code duplication by adding a boolean template parameter to `LoopBody` indicating whether to use full or partial loads/stores. Note that `LoadN`/`StoreN(n)` operations, which touch at most `n <= N` aligned elements, are planned but not yet implemented. If you need them soon, please let us know and we will take that into account.
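To make the strip-mining structure concrete, here is a minimal scalar sketch of the second variant, with a constant `kN` standing in for `Lanes(HWY_FULL(T))` and a trivial doubling body. All names here are illustrative, not actual Highway API:

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kN = 4;  // stands in for N = Lanes(HWY_FULL(T))

// Stands in for the SIMD LoopBody; processes `lanes` elements at p.
void LoopBody(double* p, std::size_t lanes) {
  for (std::size_t j = 0; j < lanes; ++j) p[j] *= 2.0;
}

void Process(std::vector<double>& x) {
  std::size_t i = 0;
  // Whole vectors, kN lanes at a time.
  for (; i + kN <= x.size(); i += kN) LoopBody(&x[i], kN);
  // Scalar remainder loop (HWY_CAPPED(T, 1) in the proposal above).
  for (; i < x.size(); ++i) LoopBody(&x[i], 1);
}
```

With 11 elements and `kN == 4`, the first loop handles 8 lanes in two iterations and the scalar loop handles the remaining 3 elements.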
What do you think? We'd welcome feedback and discussion.
from highway.
@castigli I forgot to add the notification :)
With SVE, N is a runtime constant (i.e. it is fixed on a given CPU, but may differ between CPUs running the same SVE code). We can use the following trick to accommodate both compile-time and run-time specialization of N:
template <int N_tpl = 0>
void f(int N_runtime) {
  int N = N_tpl ? N_tpl : N_runtime;
  ...
}
You may even use `N_tpl == 0` to check for a VLA instruction set, and use predicated operations only in that case.
But AFAIR, on Intel a masked load/store is a very slow operation, since it has to check permissions for each byte individually. So you may need to check the performance of masked load/store on VLA platforms.
Thank you for the reply @jan-wassenberg.
I agree with you that padding is the preferred option, but it can be cumbersome to retrofit into legacy code.
The second and third options can always be mapped to both VLA and non-VLA models (at the cost of fragmenting the loop unnecessarily for an ISA that always requires a mask/predicate, like SVE); I hadn't fully appreciated that.
Perhaps the third option would be the best compromise: it avoids duplicating code and still takes advantage of predicated/masked loads and stores. Additionally, if a specific ISA does not have an efficient masked load/store, perhaps the `LoopBodyPartial` could be mapped to a, possibly unrolled, scalar loop.
In summary, I like the idea of having `LoadN`/`StoreN(n)`; it seems like it could fit both VLA and non-VLA models well enough. It would be nice to have them early, in order to have a general implementation from the get-go, but as of now I am not at the point of needing them yet.
I will get back in touch if/when I reach that stage!
Thanks again for the reply!
@Bulat-Ziganshin that looks neat. To make sure I fully understand, how would you call f, or: what is the N_tpl argument?
In Hwy, `Lanes(d)` is already effectively a compile-time constant on non-VLA targets, otherwise runtime, so you can use that to initialize your `N`.
AFAIR, on Intel a masked load/store is a very slow operation
Right, _mm_maskmoveu_si128 is hundreds of cycles due to the non-temporal "hint". _mm256_maskstore_epi32 is more like 1 cycle/lane, but not available for 8/16-bit. If we instead emulate with an (aligned) load, blend, store, that's not atomic, but it seems risky to rely on that anyway.
Using scalar for large vectors is quite costly per lemaitre's measurements: WebAssembly/flexible-vectors#13
Any thoughts on whether we should only provide 32/64-bit masked stores, or rely on blend for 8/16 bits?
(Compress is similar, we only support 32/64 bit because that is what AVX2 can do. Might extend this to 16 bit later.)
Perhaps the third option would be the best compromise to not duplicate code and still take advantage of the predicate/mask loading/stores
Agreed, that's a good default when padding is not possible.
I like the idea of having LoadN/StoreN(n), it seems like it could fit well enough both VLA and non-VLA models. It would be nice to have them early,
Thanks for your feedback, we can do these fairly soon. Before we consider the API final, I'd like to test with the upcoming VL-parameter intrinsics which will hopefully be in the next RVV compiler release.
Out of curiosity, what's your application/use case?
how would you call f, or: what is the N_tpl argument?
f(1); // runtime N
f<1>(0); // compile-time N
I developed this trick to optimize code that should support any N, but runs faster when N is a compile-time constant. So I used a switch, with cases for a few popular N values and a default for the remaining ones:
switch(N)
{
case 1: f<1>(0); break;
case 2: f<2>(0); break;
default: f(N); break;
}
This trick allows us to combine speed of template-parameter N with flexibility of function-parameter N without writing too much extra code.
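To illustrate the dispatch pattern end to end, here is a self-contained sketch; `SumGroups` is a hypothetical kernel (not Highway API) that merely traverses data in groups of N:

```cpp
// Hypothetical kernel: sums `count` ints, traversed in groups of N lanes.
// N_tpl == 0 means N is only known at run time.
template <int N_tpl = 0>
long SumGroups(const int* data, int count, int N_runtime = 0) {
  const int N = N_tpl ? N_tpl : N_runtime;  // folded to a constant if N_tpl != 0
  long sum = 0;
  for (int i = 0; i < count; i += N)
    for (int j = i; j < i + N && j < count; ++j) sum += data[j];
  return sum;
}

// The dispatch switch: popular N values get a compile-time specialization.
long Sum(const int* data, int count, int N) {
  switch (N) {
    case 4: return SumGroups<4>(data, count);
    case 8: return SumGroups<8>(data, count);
    default: return SumGroups(data, count, N);  // generic run-time N
  }
}
```

Only the switch itself pays a branch; each specialized instantiation sees N as a constant expression.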
@Bulat-Ziganshin Nice and elegant, thanks for sharing. I agree it is helpful to specialize code while minimizing source code differences. Will update the proposed option 3 and add it to the readme.
it looks like the updated README contains a typo:
`LoopBody<true>(count - i)`
should be `...(d, count - i)`
Also, this one:
"Now the template and second function argument"
probably should be "Now the template parameter..."
Good catch, thanks, fixed :)
Thinking aloud about whether we can make the blend atomic: there is no atomic 128-bit x86 instruction other than cmpxchg16b, which uses 4 general-purpose registers. Moving between vector and GPR is probably slower than conditional scalar stores, especially for i64 on SSE4. To reduce branches, we can require n >= 1; anyone who wants to handle n == 0 can do so outside of LoadN.
Let's look at when atomicity might be required: two threads T1, T2; T1 does StoreN(1, p) and T2 does StoreN(1, p+1). If we're unlucky, T1 might overwrite T2's data with the vector it loaded. But it is anyway helpful to require aligned pointers; otherwise we could trigger page faults if p+2 is in an unmapped page.
So we still have two options:
- blend: can support all types (8/16/32/64-bit), no branches, but requires alignment
- scalar: can support unaligned, but branchy, and only supports 32/64-bit, because 8-bit scalar would be expensive on AVX2/3
I'm leaning heavily towards 1; does anyone see any issues or a better path?
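For reference, the blend option can be sketched in scalar C++ as follows (all names illustrative); the write-back of lanes beyond n is exactly what makes it non-atomic with respect to another thread writing those lanes:

```cpp
#include <cstddef>

constexpr std::size_t kN = 4;  // assumed lane count

// Blend approach: full aligned load, replace the first n lanes, full store.
// Lanes [n, kN) are re-written with the values just loaded, which is why
// this is not atomic w.r.t. another thread writing those lanes.
void BlendStoreN(const int (&v)[kN], int* aligned_p, std::size_t n) {
  int lanes[kN];
  for (std::size_t i = 0; i < kN; ++i) lanes[i] = aligned_p[i];   // load
  for (std::size_t i = 0; i < n && i < kN; ++i) lanes[i] = v[i];  // blend
  for (std::size_t i = 0; i < kN; ++i) aligned_p[i] = lanes[i];   // store
}
```

There are no branches on n beyond the loop bound, and any element type works, but the destination must be accessible (and, per the discussion, aligned) for all kN lanes.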
There is a trick you can use as long as you have at least one full "register" worth of values, so you only need to pad up to N, not to a multiple of N. Say you have 5 values but the register width is 4 values; you can do the following:
[0][1][2][3][4]
[0][1][2][3] <---- SIMD op starting at [0]
[1][2][3][4] <---- SIMD op ending at [4]
Even though we are redoing work for [1][2][3], you never have to specialize for the partial register case, so no masking is required and the code is never duplicated. If your data was originally aligned, then on the very last iteration, you do an unaligned load and an unaligned store, but that's it.
Another example, 11 values, width 4:
----- Regular Loop Body ------
[0][1][2][3][4][5][6][7][8][9][10]
[0][1][2][3]
[4][5][6][7]
------- Final Iteration------------
[7][8][9][10]
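The overlap technique above can be sketched in scalar C++ (names illustrative; assumes `count >= kN` and an out-of-place transform, so re-processing the overlap just recomputes the same results):

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kN = 4;  // "register" width in elements

// Stands in for one full-width vector op (out-of-place doubling).
void SimdOp(const int* src, int* dst) {
  for (std::size_t j = 0; j < kN; ++j) dst[j] = src[j] * 2;
}

// Requires count >= kN. The final call may overlap the previous block;
// elements in the overlap are simply processed twice.
void Transform(const std::vector<int>& src, std::vector<int>& dst) {
  const std::size_t count = src.size();
  std::size_t i = 0;
  for (; i + kN <= count; i += kN) SimdOp(&src[i], &dst[i]);
  if (i < count) SimdOp(&src[count - kN], &dst[count - kN]);
}
```

For 11 elements with width 4, the loop covers [0, 8) and the final call covers [7, 11), re-doing element 7 harmlessly.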
@jan-wassenberg I am looking at benchmarking some kernels in CoreNEURON that at the moment are autovectorized with ISPC.
In my limited experience with x86 and AVX2, I have not noticed much difference in performance between unaligned and aligned load/store. Because of that, I wonder if requiring alignment might be too strict (again, thinking about legacy code).
If you can force alignment easily, you are probably in a position to pad the data as well?
How about using @rhettstucki's suggestion for both a head and a tail loop (with the head used to align the regular loop body)?
I think that we should follow the principle of least surprise, i.e. `Load/StoreN` shouldn't have unexpected out-of-bounds memory accesses. Imagine two 17-element arrays lying one after the other: using StoreN for `a[16]` in one thread may disrupt an update of `b[0]` performed in another thread.
Similarly, library code doesn't know whether `a[count]` can be accessed at all without memory access errors, and in the cases when code does know that, it can just use the regular Load/Store procedures.
I propose to use your sketch of blended operations to provide `BlendN(n, a, b)`, which combines the first n elements of a with the remaining elements of b: a sort of `IfThenElse` with a specific mask. Or maybe just provide a Mask array, so that `BlendN(n, a, b) == IfThenElse(Mask[n], a, b)`. This allows one to write `Store(BlendN(Load...` explicitly, making it clear that memory beyond the bounds is accessed.
For `LoadN`/`StoreN` I propose to try plain memcpy. Compilers are incredibly smart at implementing it today, although the price may be code bloat. Just:
_mm_store_si128((__m128i*)lanes, v);
memcpy(to, lanes, n*4);
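A portable sketch of this idea without intrinsics (the local `lanes` buffer stands in for the `_mm_store_si128` spill; names are illustrative):

```cpp
#include <cstddef>
#include <cstring>

constexpr std::size_t kN = 4;  // assumed lane count

// Spill the "vector" to a stack buffer (standing in for _mm_store_si128),
// then memcpy only the first n lanes to the destination.
void StoreN(const int (&v)[kN], int* to, std::size_t n) {
  int lanes[kN];
  std::memcpy(lanes, v, sizeof(lanes));
  std::memcpy(to, lanes, n * sizeof(int));  // only n lanes touch memory
}
```

Unlike the blend approach, no bytes beyond the first n lanes are written, so neighboring data is never disturbed.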
The solution I proposed does not access memory out of bounds. In your example of two 17-element arrays processed 16 at a time, this is what happens:
[0][1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16]
[0][1][2][3][4][5][6][7][8][9][10][11][12][13][14][15]
[1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16]
There is nothing magic about memcpy with a run-time size: you always pay for a function call into generic code that has to determine the best way to copy n * 4 bytes. Compilers can only make memcpy fast when the number of bytes is known at compile time, in which case the memcpy turns into a store. If the work being done in the loop is significant, then you are right, the memcpy at the end doesn't really matter because it is a small part of the run time. If the work inside the loop is very little, though, the memcpy can take more time than the loop itself.
You are right on both points; I was replying to Jan's proposal of blended Load/Store.
@Bulat-Ziganshin Oh I see now! Thanks, I wasn't sure.
@rhettstucki
Nice, thanks for posting. I agree that's helpful when we're mainly writing to an array; unaligned stores can be expensive, but it's just a single iteration. Will add it to the readme as well; how's this?
size_t i = 0;
for (; i + N <= count; i += N) LoopBody<false>(d, i, 0);
LoopBody<false>(d, count - N, 0);
I am looking at benchmarking some kernels
Nice, would be interested in your results.
I have not noticed much difference in performance between unaligned and aligned load/store
Yes, it is a bit subtle. If the data is actually aligned (and it often is by the compiler), there should be almost no difference.
If not, there's a large spike on page boundaries and a smaller one on cache-line boundaries, but those get averaged into less-noticeable overhead in total. On AVX3, every unaligned access is a cache-line split, which is worse. Even on AVX2, unaligned loads halve your bandwidth from L1; aligned won't quite get us to 2 loads/cycle due to fills from L2, but it's much closer.
Imagine two 17-element arrays lying one after the other: using StoreN for a[16] in one thread may disrupt an update of b[0] performed in another thread.
Yes, this would be problematic if they were not aligned. In Hwy, Load actually requires alignment, whereas LoadU allows unaligned.
Hence we can expect StoreN (as opposed to StoreUN) to require alignment :)
If a[count] is unmapped as you say (which I think can only happen if the pointer is unaligned), how do we handle that?
Users could expect StoreN not to fault, so it seems we'd be requiring a scalar loop?
From a least-surprise perspective, I like your idea of calling it Blend[U]. Then we could even allow unaligned pointers, with the understanding that page faults are the user's responsibility to prevent or even handle? And on RVV, the implementation would be allowed to use a masked store instead of a blend?
For Load/StoreN I propose to try plain memcpy.
I agree with Rhett's comments here. Example: https://gcc.godbolt.org/z/8bo936
If count is huge, maybe it's still fine, but users can already do that without a new function?
@Bulat-Ziganshin @rhettstucki we have a user request for "masked stores" so I will be implementing something soon.
On further thought, I like Bulat's idea of only providing a mask for use by `IfThenElse`, with application code responsible for load/store/blend (only the app knows whether atomicity is required, and whether pointers are aligned or accessible). This approach at least makes those issues visible.
Specifically, we can add `FirstN(d, N)`, which is equivalent to `Iota(d) < Set(d, N)`. This will not take full advantage of SVE/RVV masked stores, but I suppose that is unavoidable given the requirement of making atomicity/fault issues visible instead of hiding them behind a StoreN.
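The FirstN semantics can be sketched lane-wise in plain C++, with arrays standing in for vectors (illustrative only, not the Highway implementation):

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kN = 8;  // assumed lane count

// FirstN(d, n): mask whose first n lanes are true, i.e. Iota(d) < Set(d, n).
std::array<bool, kN> FirstN(std::size_t n) {
  std::array<bool, kN> m{};
  for (std::size_t i = 0; i < kN; ++i) m[i] = (i < n);
  return m;
}

// IfThenElse as a per-lane select between a and b.
std::array<int, kN> IfThenElse(const std::array<bool, kN>& m,
                               const std::array<int, kN>& a,
                               const std::array<int, kN>& b) {
  std::array<int, kN> r{};
  for (std::size_t i = 0; i < kN; ++i) r[i] = m[i] ? a[i] : b[i];
  return r;
}
```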
Does that make sense and meet everyone's needs? Would also welcome any better suggestions for naming.
Closing, please feel free to open an issue if you'd like to discuss masked load/store further.
@Bulat-Ziganshin wanted to follow up and thank you for this proposal. We have FirstN, which can serve as the mask, plus there's now a `CompressBlendedStore`, which works like the code sequence you described and is helpful.