>I tend to the opposite view: those saying "do not do X" are in fact obligated to explain why X should be avoided. But perhaps this is just a difference in worldview.
For me it depends on the context. Here aligned access is the sensible default, so it's unaligned access that should be defended.
I hacked together a test, feel free to point out mistakes.
C part: http://pastebin.com/zMha8Fre
asm part (SSE and SSE2 for nt): http://pastebin.com/mxEFC8Cw
results:
aligned: 0 sec, 69049070 nsec
unaligned on aligned data: 0 sec, 69210069 nsec
unaligned on one byte unaligned data: 0 sec, 70278354 nsec
unaligned on three bytes unaligned data: 0 sec, 70315162 nsec
aligned nontemporal: 0 sec, 42549571 nsec
naive: 0 sec, 67741031 nsec
Repeated runs show that only the non-temporal version is a clear win. The difference between aligned and unaligned, 1-2% on average, is not much, I'll grant that. But it is measurable.
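For readers who don't want to open the pastebins, here is a rough sketch of the three kinds of copy being compared, written with SSE intrinsics instead of asm. It is not the pastebin code: the function names are made up, sizes are counted in floats, and the "unaligned" variant is only exercised at float granularity here rather than at a 1-byte offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_loadu_ps, _mm_stream_ps, ... */

/* Hypothetical helper: nonzero if p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Aligned loads and stores (movaps). n is a float count, multiple of 4. */
static void copy_aligned(float *dst, const float *src, size_t n)
{
    assert(is_aligned16(dst) && is_aligned16(src) && n % 4 == 0);
    for (size_t i = 0; i < n; i += 4)
        _mm_store_ps(dst + i, _mm_load_ps(src + i));
}

/* Unaligned loads and stores (movups). Works on any float-aligned pointer. */
static void copy_unaligned(float *dst, const float *src, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4)
        _mm_storeu_ps(dst + i, _mm_loadu_ps(src + i));
}

/* Aligned loads, non-temporal stores (movntps) that bypass the cache. */
static void copy_nontemporal(float *dst, const float *src, size_t n)
{
    assert(is_aligned16(dst) && is_aligned16(src) && n % 4 == 0);
    for (size_t i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, _mm_load_ps(src + i));
    _mm_sfence();  /* make the streamed stores visible before returning */
}
```

Non-temporal stores go around the cache, which is probably why they help on a buffer much larger than the cache and hurt on one that fits in it.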
But that is not all! Changing the copy size to something that fits in the cache (1MB) showed completely different results.
aligned: 0 sec, 160536 nsec
unaligned on aligned data: 0 sec, 179999 nsec
unaligned on one byte unaligned data: 0 sec, 375108 nsec
aligned nontemporal: 0 sec, 374811 nsec // usually a bit slower than one byte unaligned
And, out of interest, I made all the copies skip every second 16 bytes; the (relative) results are the same as in the original test, except that non-temporal is over 3x slower than anything else.
And this is on an AMD FX-8320, which has the misalignsse flag. On my former CPU (can't remember if it was the Celeron or the AMD 3800+) the results were very much in favor of aligned access.
So yeah, align things. It's not hard to just add " __attribute__ ((aligned (16))) " (for gcc, idk about anything else).
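A minimal sketch of what that looks like in practice. The first declaration uses the GCC/Clang attribute from the post; the second uses C11's `_Alignas`, which should work on any C11 compiler. The buffer names and the `is_aligned` helper are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* GCC/Clang extension: force the array onto a 16-byte boundary. */
static float buf_attr[1024] __attribute__ ((aligned (16)));

/* Standard C11 spelling of the same request. */
static _Alignas(16) float buf_c11[1024];

/* Hypothetical helper: nonzero if p is a multiple of alignment,
 * where alignment must be a power of two. */
static int is_aligned(const void *p, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

With either declaration, `is_aligned(buf_attr, 16)` and `is_aligned(buf_c11, 16)` hold, so aligned SSE loads and stores on the start of the buffer are safe.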
PS: It may seem like the naive way is good, but memcpy is a bit more complicated than that.
See what happens when you change HALF_OF_BUFFER_SIZE from 1M to 1M+64. Or 128 or 1024. I think what you observed is the result of loads and stores hitting the same cache set at the same time, all while misalignment additionally increases the number of cache banks involved in any given operation. But that's just hand-waving; I don't know the internals well enough to say with confidence what exactly is going on.
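To make the set-conflict idea concrete, here is a toy model. The geometry (64-byte lines, 4 KiB per way, so 64 sets) is an assumption chosen for illustration, not a claim about any particular CPU, and the model only covers sets and line crossings, not banks. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache geometry, for illustration only. */
enum { LINE_SIZE = 64, WAY_SIZE = 4096, NUM_SETS = WAY_SIZE / LINE_SIZE };

/* Set index an address maps to in this toy model. */
static unsigned cache_set(uintptr_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
}

/* Number of cache lines touched by an access of `width` bytes at addr. */
static unsigned lines_touched(uintptr_t addr, size_t width)
{
    assert(width > 0);
    return (unsigned)((addr + width - 1) / LINE_SIZE - addr / LINE_SIZE + 1);
}
```

In this model a source at address A and a destination at A + 1M always land in the same set, while A + 1M + 64 lands in the next one. A 16-byte access at offset 48 within a line touches one line, but the same access at offset 49 touches two.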
BTW, changing misalignment from 1 to 8 reduces this effect by half on my Thuban. Which is important, because nobody sane would misalign an array of doubles by 1 byte, while processing part of an array starting somewhere in the middle is a real thing.
Also, your assembly isn't really that great. In particular, LOOP is microcoded and sucks on AMD. I got better results with this:
>See what happens when you change HALF_OF_BUFFER_SIZE from 1M to 1M+64. Or 128 or 1024.
Tested. There's a greater difference between aligned and aligned_unaligned. But that made the test go over my cache size (2MB per core), so I tested with 512kB, with and without your +128. Results were (relatively) similar to the original 1MB test.
>Which is important, because nobody sane would misalign an array of doubles by 1 byte [...]
Adobe Flash would, for starters (idk if doubles, but it calls unaligned memcpy all the time). The code from the person above also does, because compilers sometimes do (an aligned mov sometimes segfaults if you don't tell the compiler to align an array, especially if it's in a struct).
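A small sketch of the struct case, assuming GCC or Clang. The struct names are made up. Without an alignment request the compiler only has to honor the natural alignment of `float`, so the array can start at an offset that is not a multiple of 16, and an aligned SSE load from it may fault.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* No alignment request: v only needs float alignment. */
struct plain_vec {
    char  tag;
    float v[4];
};

/* Alignment requested on the member: the compiler pads after tag. */
struct padded_vec {
    char  tag;
    float v[4] __attribute__ ((aligned (16)));
};

/* Checked at compile time: the padded member starts on a 16-byte offset. */
static_assert(offsetof(struct padded_vec, v) % 16 == 0,
              "aligned member must sit at a multiple of 16");

/* Hypothetical helper: how far p is past the previous 16-byte boundary. */
static size_t offset_mod16(const void *p)
{
    return (size_t)((uintptr_t)p & 15u);
}
```

The attribute also raises the alignment of the whole struct to 16, so static and automatic instances are placed correctly; memory from an allocator that guarantees less than 16-byte alignment would still need care.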
>Also, your assembly isn't really that great. In particular, LOOP is microcoded and sucks on AMD. I got better results with this:
Of course you did, you unrolled the loop. The whole point was to test memory access, not to write a fast copy function.
>c_is_faster_than_asm_a()
First of all, that is not in the C specification. It is a gcc/clang/idk_if_others extension to C. It compiles to something similar to what I would write if I had unrolled the loop. Actually worse; here's what it compiled to: http://pastebin.com/yL31spR2 . Note that this is still a lot slower than movntps when going over cache size.
edit: I didn't notice at first. Your code copies all eight 16-byte chunks onto the first one. You forgot to add +n to dst.
Crap, that was bad. Fixed. And removed the insane unrolling, now 2x is sufficient.
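For illustration, a portable sketch of a 2x-unrolled chunk copy in which the destination index advances together with the source. This is not the code under discussion; `chunk16` and `copy_chunks_unrolled2` are hypothetical names, and a plain struct assignment stands in for a 16-byte SSE move.

```c
#include <assert.h>
#include <stddef.h>

/* One 16-byte chunk, standing in for an XMM register's worth of data. */
typedef struct { unsigned char bytes[16]; } chunk16;

/* Copy n chunks, two per iteration. n must be even. */
static void copy_chunks_unrolled2(chunk16 *dst, const chunk16 *src, size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];  /* the bug would be writing dst[i] again here */
    }
}
```

A benchmark of a copy like this is only meaningful if the output is also checked, since a loop that keeps overwriting the same destination chunk touches far less memory and looks faster than it should.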
You are right, 128 is not enough on Piledriver. Still,
4k is the stride of L1, your code slows down 1.5x:
128k is the stride of L2, both codes slow down further: