Currently the __next__ function is quite messy. Memch

Use a newline index for FastqIter about dnaio HOT 10 CLOSED

rhpvorderman commented on August 28, 2024

Use a newline index for FastqIter

from dnaio.

Comments (10)

rhpvorderman commented on August 28, 2024

This didn't help for performance. Also it made the code slightly more complex. The next call got easier, but that was offset by added complexity elsewhere. So this experiment can be considered failed.

from dnaio.

marcelm commented on August 28, 2024

Thanks for doing that experiment and documenting it here.

from dnaio.

rhpvorderman commented on August 28, 2024

In the meantime I did another one: what if I only count the newlines but do not index them.
The result is much simpler code in the next call, with only a very simple and a conceptually simple call count_newlines (people can infer functionality from the function name so it does not add to cognitive complexity). By counting newlines we can also ensure a FASTQ record is complete before parsing it. This makes the while loop and all the NULL checks obsolete. The disadvantage is that in essence that you are doing the same work twice (counting newlines also with memchr) and this adds more than 10% overhead. It can be reduced using a customized SIMD routine. But that sort of defeats the purpose of getting simpler code.

The resulting __next__ code is here: https://github.com/rhpvorderman/dnaio/blob/78174ac701be61f5691e6a0cf2b3c4b90d83f3d9/src/dnaio/_core.pyx#LL502C1-L599C23

It also resulted in a very fast SIMD enhanced newline counter. I am posting it here so it may be reused for some other purpose later. I initially thought of using the popcount instruction and the movemask instruction, but popcount is not a part of SSE2. I think the following solution is much faster anyway: https://github.com/rhpvorderman/dnaio/blob/78174ac701be61f5691e6a0cf2b3c4b90d83f3d9/src/dnaio/count_newlines_sse2.h

The reason I have time to do this is that I am wrapping up a lot of work today before going on holiday (to move house). So in between meetings and e-mails I decided to do some educational hacking.

from dnaio.

marcelm commented on August 28, 2024

Nice educational hacking you did there 😄! I’m just getting started again after my holidays.

I looked at the SIMD newline counting function just to learn a bit about SSE intrinsics. I am not quite sure, but is it possible that the count += accumulated_counts[i]; loop is quite inefficient the way GCC compiles it? See https://godbolt.org/z/EGax1G4fG, line 56-105 in the assembly output. SO says that _mm_sad_epu8 may help.

from dnaio.

rhpvorderman commented on August 28, 2024

I am not quite sure, but is it possible that the count += accumulated_counts[i]; loop is quite inefficient the way GCC compiles it?

It unrolls the loop. That is actually quite efficient as it eliminates branches from the code. It only happens every 2000-something bytes or so, so efficiency is not that big of a deal.

SO says that _mm_sad_epu8 may help.

Thanks! That is great! That looks so much more convenient. I will have a look at it.

from dnaio.

marcelm commented on August 28, 2024

It unrolls the loop. That is actually quite efficient as it eliminates branches from the code.

Sorry I didn’t make this clear: I know, my point was that it works on each byte individually instead of in a vectorized fashion.

It only happens every 2000-something bytes or so, so efficiency is not that big of a deal.

Thanks, that’s the point I didn’t get, so then it it shouldn’t be an issue.

from dnaio.

rhpvorderman commented on August 28, 2024

Thanks, that’s the point I didn’t get, so then it it shouldn’t be an issue

I blame that on unclear commenting on my part.

The following updated version adds some more comments, clear vector variable names and uses _mm_sad_epu8 as per your excellent find. I knew there are instructions with hadd (horizontal add) in the name, and that these were not in SSE2, so I figured it was not possible. But I was wrong.

https://github.com/rhpvorderman/dnaio/blob/cdeb92431f12121695b90cd5cc33312fae811a62/src/dnaio/count_newlines_sse2.h

This looks much better in the compiler explorer. Unfortunately I can't benchmark it properly on this laptop and I haven't assembled my home PC yet (lots of it is still in boxes).

I do wonder if due to modern processors having branch prediction etc. if it would be faster to include _mm_sad_epu8 in the inner loop and then directly add to the 64 byte integer. That would make the outer loop redundant and the code much simpler. It should be slower. But with modern CPUs, I don't know if counting latencies is good enough if the code is simpler and the branches can be predicted a lot better.

from dnaio.

marcelm commented on August 28, 2024

Thanks, and cool that it works with _mm_sad_epu8.

from dnaio.

rhpvorderman commented on August 28, 2024

I am back and can now benchmark on my own machine again. Unfortunately the _mm_sad_epu8 solution is slower. I did a count and found that if the compiler was very naive and assigned a new vector register for each differently named variable it would need 9 vector registers (out of 8 available). The _mm_storeu option is faster and easier to understand:
https://github.com/rhpvorderman/dnaio/blob/b688c50cdeef807e61a694af9bbc9a45ac0c313f/src/dnaio/count_newlines_sse2.h

from dnaio.

marcelm commented on August 28, 2024

Welcome back and thanks for testing! As you explained to me, since that part of the code isn’t called that often, any speedup wouldn’t have been that large anyway, so I guess it doesn’t matter. But good you added the comment.

from dnaio.

Use a newline index for FastqIter about dnaio HOT 10 CLOSED

Comments (10)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent