Git Product home page Git Product logo

Comments (10)

rhpvorderman avatar rhpvorderman commented on August 28, 2024

This didn't help for performance. Also it made the code slightly more complex. The next call got easier, but that was offset by added complexity elsewhere. So this experiment can be considered failed.

from dnaio.

marcelm avatar marcelm commented on August 28, 2024

Thanks for doing that experiment and documenting it here.

from dnaio.

rhpvorderman avatar rhpvorderman commented on August 28, 2024

In the meantime I did another one: what if I only count the newlines but do not index them.
The result is much simpler code in the next call, with only a very simple and a conceptually simple call count_newlines (people can infer functionality from the function name so it does not add to cognitive complexity). By counting newlines we can also ensure a FASTQ record is complete before parsing it. This makes the while loop and all the NULL checks obsolete. The disadvantage is that in essence that you are doing the same work twice (counting newlines also with memchr) and this adds more than 10% overhead. It can be reduced using a customized SIMD routine. But that sort of defeats the purpose of getting simpler code.

The resulting __next__ code is here: https://github.com/rhpvorderman/dnaio/blob/78174ac701be61f5691e6a0cf2b3c4b90d83f3d9/src/dnaio/_core.pyx#LL502C1-L599C23

It also resulted in a very fast SIMD enhanced newline counter. I am posting it here so it may be reused for some other purpose later. I initially thought of using the popcount instruction and the movemask instruction, but popcount is not a part of SSE2. I think the following solution is much faster anyway: https://github.com/rhpvorderman/dnaio/blob/78174ac701be61f5691e6a0cf2b3c4b90d83f3d9/src/dnaio/count_newlines_sse2.h

The reason I have time to do this is that I am wrapping up a lot of work today before going on holiday (to move house). So in between meetings and e-mails I decided to do some educational hacking.

from dnaio.

marcelm avatar marcelm commented on August 28, 2024

Nice educational hacking you did there 😄! I’m just getting started again after my holidays.

I looked at the SIMD newline counting function just to learn a bit about SSE intrinsics. I am not quite sure, but is it possible that the count += accumulated_counts[i]; loop is quite inefficient the way GCC compiles it? See https://godbolt.org/z/EGax1G4fG, line 56-105 in the assembly output. SO says that _mm_sad_epu8 may help.

from dnaio.

rhpvorderman avatar rhpvorderman commented on August 28, 2024

I am not quite sure, but is it possible that the count += accumulated_counts[i]; loop is quite inefficient the way GCC compiles it?

It unrolls the loop. That is actually quite efficient as it eliminates branches from the code. It only happens every 2000-something bytes or so, so efficiency is not that big of a deal.

SO says that _mm_sad_epu8 may help.

Thanks! That is great! That looks so much more convenient. I will have a look at it.

from dnaio.

marcelm avatar marcelm commented on August 28, 2024

It unrolls the loop. That is actually quite efficient as it eliminates branches from the code.

Sorry I didn’t make this clear: I know, my point was that it works on each byte individually instead of in a vectorized fashion.

It only happens every 2000-something bytes or so, so efficiency is not that big of a deal.

Thanks, that’s the point I didn’t get, so then it it shouldn’t be an issue.

from dnaio.

rhpvorderman avatar rhpvorderman commented on August 28, 2024

Thanks, that’s the point I didn’t get, so then it it shouldn’t be an issue

I blame that on unclear commenting on my part.

The following updated version adds some more comments, clear vector variable names and uses _mm_sad_epu8 as per your excellent find. I knew there are instructions with hadd (horizontal add) in the name, and that these were not in SSE2, so I figured it was not possible. But I was wrong.

https://github.com/rhpvorderman/dnaio/blob/cdeb92431f12121695b90cd5cc33312fae811a62/src/dnaio/count_newlines_sse2.h

This looks much better in the compiler explorer. Unfortunately I can't benchmark it properly on this laptop and I haven't assembled my home PC yet (lots of it is still in boxes).

I do wonder if due to modern processors having branch prediction etc. if it would be faster to include _mm_sad_epu8 in the inner loop and then directly add to the 64 byte integer. That would make the outer loop redundant and the code much simpler. It should be slower. But with modern CPUs, I don't know if counting latencies is good enough if the code is simpler and the branches can be predicted a lot better.

from dnaio.

marcelm avatar marcelm commented on August 28, 2024

Thanks, and cool that it works with _mm_sad_epu8.

from dnaio.

rhpvorderman avatar rhpvorderman commented on August 28, 2024

I am back and can now benchmark on my own machine again. Unfortunately the _mm_sad_epu8 solution is slower. I did a count and found that if the compiler was very naive and assigned a new vector register for each differently named variable it would need 9 vector registers (out of 8 available). The _mm_storeu option is faster and easier to understand:
https://github.com/rhpvorderman/dnaio/blob/b688c50cdeef807e61a694af9bbc9a45ac0c313f/src/dnaio/count_newlines_sse2.h

from dnaio.

marcelm avatar marcelm commented on August 28, 2024

Welcome back and thanks for testing! As you explained to me, since that part of the code isn’t called that often, any speedup wouldn’t have been that large anyway, so I guess it doesn’t matter. But good you added the comment.

from dnaio.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.