This is part of my Google Summer of Code project, Independency of D from the C Standard Library.
It is a public repository for the work on the memcpy()
replacement.
You can read more info about this project in this Dlang forum thread.
The following notes are based on the assumption that we compile with DMD.
For some reason, if you take asm directly from GCC and translate it to D, it will be (quite) slower. I don't specifically know why this happens (one reason probably is the note below). But certainly, if you are to measure what actual performance you get from the assembly code that I wrote, you'd have to put it in an assembler and link the object file. You can find respective versions in C and ASM here: opt_memcpy
A big bottle-neck on small sizes seem to be the function prologues and epilog. When using inline asm, DMD for some reason preserves and restores a bunch of registers that are not actually used. We could potentiall get away from this by setting the static on our own. Check TODO 1) below.
Ideally, we would like most of the code to not be big asm blocks but instead either be regular D or at least ASM
interleaved with regular D. Unfortunately, early on when I was implementing Dmemcpy
, it became obvious that this can't
happen. At least I didn't find a way to do it.
First of all, a regular D implementation could not compete in any way with hand-written ASM of libc memcpy
.
Second, interleaving regular D with ASM decreases performance significantly. The reason is mostly that register allocation
goes badly. In cases where you expect a variable to reside in a register, it gets load from memory resulting in big performance hit.
It gets worse when variables are used in loops.
However, I made a D implementation of a C memcpy I wrote, which you can find in Dopt_memcpy.d
. This was written in a rush
and so I did not have much time to explore its capabilities. For the most part it was not very positive.
Check TODO 2)
You can find respective versions in C and ASM here: opt_memcpy
Explore the cases of false data dependency. According to Agner Fog:
The CPU may falsely assume a partial overlap between the written destination and the following read source if source is unaligned and (src-dest) modulo 4096 is close to 4096".
Source: https://gist.github.com/thoughtpolice/26ea25f69715ffde96efa6364c19cf18#file-memcpy64-asm-L528
I don't know exactly what Agner does in this implementation (you might want to check where it branches though).
But what is common among memcpy
implementations is that if dest
>= src
, then they do the copy backwards (starting
from the end of buffers and moving to the start).
A general topic to be discussed is aligned and unaligned data.
First of all, what is a good alignment is not trivial. One could think that if we're using SSE it is 16, AVX 32 etc.
But, there's more to it. Because what is important is the cache line size, because processors work really
well when they work in multiples of that. Newer processors have a 64-byte cache-line.
In the info below, what I consider aligned is 64-byte aligned.
The bottle-neck seems to be in stores. In recent architectures (Haswell and then I think), there are 2 load ports and 1 store port.
With that setup, we can do 2 64-byte aligned loads in 1 cycle but only 1 64-byte store. For memcpy
, we will have to live with the bottle-neck
because for every load there is a store. But consider the case of adding two vectors and output to the third. Now,
for every 2 loads there's a store and we take maximum performance.
With unaligned data, we can only do one 64-byte load per cycle. The reason of that we have to load 2 cachelines. For example:
multiple of 64 multiple of 64
| |
---------------------------------------- ...
1st cache line 2nd cache line
Now, if we are to load on a multiple of 64, we will load one cache-line and be done. And since there are 2 load ports,
we can load 128 bytes per cycle.
But, if we are to load something here:
multiple of 64 multiple of 64
| | |
---------------------------------------- ...
1st cache line 2nd cache line
Then to get those 64 bytes we have to load the 2 adjacent cache-lines. That of course means we can do 2 32-byte loads in a cycle.
Recently, I read that rep movsb is faster than SSE (at least) on Intel architectures if we are on 64-byte boundary. And that seems to be the reason. Check TODO 4)
Another topic is what happens when src
and dst
are unaligned by a different amount. Or put it another way,
they are misaligned relative to each other, or src % alignment
!= dst % alignment
. To tackle this fully,
one needs to read notes 5) and 6). But also, some other proposed technique by Agner is to load an shift until we reach
same alignment on both. That is, imagine the following:
|
src: ---------------
dst: |
They are mis-aligned by 2. Then load from the source say 4 bytes, write 4 to dest, but now advance source only by 2.
|
src: ---------------
dst: |
And now src
and dst
are aligned relative to each other. Another thing to consider is that any reasonable allocator (or compiler
for stack data) will give pointer aligned to some power of 2, which makes this easier (meaning, faster). This should
not be a given though (mostly because the user may end up having changed the alignment).
Source: Agner Fog, Subroutine optimization manual: https://www.agner.org/optimize/optimizing_assembly.pdf
- Explore
naked
: https://dlang.org/spec/iasm.html#naked to avoid function prologue and epilog. - Explore a regular D implementation,
Dopt_memcpy.d
- Benchmark what happens when
dst
>src
. - For asm, we only care about > 64 bytes memcpy. Explore
rep movsb
on 64-byte aligned data. In that regard, we should also explore the "first align to 64-byte boundary, then rep movsb". - Explore the method in note 6) for when
dst
andsrc
are misaligned relative to each other.
- Add support for dynamic arrays. This requires some trivial assembly to handle <= 64 bytes data, as we can't
use
static if
because of course the size of copy is only known at runtime. - Related to the above, provide bounds checking and other useful wrappers. Jonathan Marler proposed some interesting ideas here: https://forum.dlang.org/thread/[email protected]
dmd bench.d Dmemcpy.d -O -inline && ./bench average
(DMD - Optimized and Inlined)
An important reason that I published it so early (about 5 days of work) is to get feedback on different machines,
from different people.
So, if you are interested in helping, it would be beneficial to do one or more of the following:
- Test in DMD.
- Test on Windows or Linux.
See above for compilation and run instructions. Then just send me the results and your CPU model.
E-mail: [email protected]
If you are involved in D, you can also ping me on Slack (Stefanos Baziotis), or post in the dlang forum thread above.