points by gens 9 years ago

>I tend to the opposite view: those saying "do not do X" are in fact obligated to explain why X should be avoided. But perhaps this is just a difference in worldview.

For me it depends on the context. Here aligned access makes more sense so unaligned should be defended.

I hacked together a test, feel free to point out mistakes.

C part: http://pastebin.com/zMha8Fre

asm part (SSE, plus SSE2 for the non-temporal stores): http://pastebin.com/mxEFC8Cw
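
The pastebin contents are not reproduced in this thread. As a rough sketch of what a harness producing the "N sec, M nsec" lines below could look like, here is a minimal timer built on `clock_gettime`. The names `copy_fn`, `naive_copy`, `time_copy` and `report` are made up for illustration and are not taken from the linked code.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature shared by every copy routine under test. */
typedef void (*copy_fn)(void *dst, const void *src, size_t bytes);

/* Stand-in for a "naive" byte loop; the original test used hand-written asm. */
static void naive_copy(void *dst, const void *src, size_t bytes) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < bytes; i++)
        d[i] = s[i];
}

/* Time a single call of fn and return the elapsed nanoseconds. */
static long long time_copy(copy_fn fn, void *dst, const void *src, size_t bytes) {
    struct timespec t0, t1;
    assert(fn != NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn(dst, src, bytes);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1000000000LL + (t1.tv_nsec - t0.tv_nsec);
}

/* Print in the same "label: S sec, N nsec" shape as the results in this post. */
static void report(const char *label, long long ns) {
    printf("%s: %lld sec, %lld nsec\n", label, ns / 1000000000LL, ns % 1000000000LL);
}
```

A single timed call is noisy, so in practice one would repeat each measurement several times and compare the runs.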

results:

aligned: 0 sec, 69049070 nsec

unaligned on aligned data: 0 sec, 69210069 nsec

unaligned on one byte unaligned data: 0 sec, 70278354 nsec

unaligned on three bytes unaligned data: 0 sec, 70315162 nsec

aligned nontemporal: 0 sec, 42549571 nsec

naive: 0 sec, 67741031 nsec

Repeating the test shows only non-temporal to be a clear benefit. The aligned/unaligned difference of 1-2% on average is not much, I'll concede that, but it is measurable.
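
For readers without access to the asm, the three access patterns being compared can be sketched with SSE intrinsics. This is an approximation under assumptions: the original used hand-written assembly, and the function names here are hypothetical. `_mm_load_ps`/`_mm_store_ps` correspond to aligned moves (movaps), `_mm_loadu_ps`/`_mm_storeu_ps` to unaligned moves (movups), and `_mm_stream_ps` to the non-temporal store (movntps).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Aligned copy: both pointers must be 16-byte aligned, bytes a multiple of 16. */
static void copy_aligned(float *dst, const float *src, size_t bytes) {
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
    assert(bytes % 16 == 0);
    for (size_t i = 0; i < bytes / sizeof(float); i += 4)
        _mm_store_ps(dst + i, _mm_load_ps(src + i));
}

/* Unaligned copy: works for any address; copies whole 16-byte chunks only. */
static void copy_unaligned(void *dst, const void *src, size_t bytes) {
    char *d = dst;
    const char *s = src;
    for (size_t i = 0; i + 16 <= bytes; i += 16)
        _mm_storeu_ps((float *)(d + i), _mm_loadu_ps((const float *)(s + i)));
}

/* Non-temporal copy: aligned loads, stores that bypass the cache. */
static void copy_nontemporal(float *dst, const float *src, size_t bytes) {
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
    assert(bytes % 16 == 0);
    for (size_t i = 0; i < bytes / sizeof(float); i += 4)
        _mm_stream_ps(dst + i, _mm_load_ps(src + i));
    _mm_sfence();  /* make the streamed stores visible before returning */
}
```

Because the streaming stores skip the cache, they tend to help only when the destination would not have been reused from cache anyway, which is consistent with the large-buffer numbers above.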

But that is not all! Changing the copy size to something that fits in the cache (1MB) showed completely different results.

aligned: 0 sec, 160536 nsec

unaligned on aligned data: 0 sec, 179999 nsec

unaligned on one byte unaligned data: 0 sec, 375108 nsec

aligned nontemporal: 0 sec, 374811 nsec // usually a bit slower than one byte unaligned

And, out of interest, I made all the copies skip every second 16 bytes. The (relative) results are the same as in the original test, except that non-temporal is over 3x slower than anything else.

And this is on an AMD FX-8320, which has the misalignsse flag. On my former CPU (can't remember if it was the Celeron or the AMD 3800+) the results were very much in favor of aligned access.

So yea, align things. It's not hard to just add " __attribute__ ((aligned (16))) " (for gcc, idk anything else).
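
A small illustration of that attribute, plus the standard C11 spelling (`_Alignas`) for compilers without the GCC extension. The buffer names, the `packet` struct and the `is_aligned16` helper are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension: request 16-byte alignment for a buffer. */
static float buf_gnu[1024] __attribute__ ((aligned (16)));

/* C11 spelling of the same request. */
static _Alignas(16) float buf_c11[1024];

/* Inside a struct, a float array is otherwise only guaranteed float
   alignment (typically 4 bytes), so the member needs the attribute too. */
struct packet {
    char tag;
    float data[4] __attribute__ ((aligned (16)));
};

/* Returns 1 if p sits on a 16-byte boundary. */
static int is_aligned16(const void *p) {
    return ((uintptr_t)p % 16) == 0;
}
```

With the attribute on the member, the compiler pads `tag` out to 16 bytes and raises the alignment of the whole struct, so aligned moves on `data` are safe for any properly allocated `struct packet`.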

PS It may seem like the naive way is good, but memcpy is a bit more complicated than that.

qb45 9 years ago

See what happens when you change HALF_OF_BUFFER_SIZE from 1M to 1M+64. Or 128 or 1024. I think what you observed is the result of loads and stores hitting the same cache set at the same time, all while misalignment additionally increases the number of cache banks involved in any given operation. But that's just hand-waving, I don't know the internals enough to say with confidence what's going on exactly.
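
To make the "same cache set" idea concrete, here is a sketch of how a set index is derived from an address. The geometry (64-byte lines, 64 sets, which gives a 4k stride) is an assumption chosen for illustration, as is the idea that source and destination sit exactly HALF_OF_BUFFER_SIZE bytes apart; real caches may index on physical addresses and differ per level.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed geometry, for illustration only: 64-byte lines, 64 sets. */
enum { LINE_BYTES = 64, NUM_SETS = 64 };

/* Set index = line number modulo the number of sets. */
static unsigned cache_set(uintptr_t addr) {
    return (unsigned)((addr / LINE_BYTES) % NUM_SETS);
}

/* Returns 1 if a load at src and a store at src + offset land in the same set. */
static int same_set(uintptr_t src, size_t offset) {
    assert(offset > 0);
    return cache_set(src) == cache_set(src + offset);
}
```

Under these assumptions an offset of exactly 1M puts every load and its matching store in the same set, while 1M+64 shifts the store to the neighbouring set.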

BTW, changing misalignment from 1 to 8 reduces this effect by half on my Thuban. Which is important, because nobody sane would misalign an array of doubles by 1 byte, while processing part of an array starting somewhere in the middle is a real thing.

Also, your assembly isn't really that great. In particular, LOOP is microcoded and sucks on AMD. I got better results with this:

  typedef float sse_a __attribute__ ((vector_size(16), aligned(16)));
  typedef float sse_u __attribute__ ((vector_size(16), aligned(1)));
  
  void c_is_faster_than_asm_a(sse_a *dst, sse_a *src, int count) {
          for (int i = 0; i < count/sizeof(sse_a); i += 8) {
                  dst[i] = src[i+0];
                  dst[i] = src[i+1];
                  dst[i] = src[i+2];
                  dst[i] = src[i+3];
                  dst[i] = src[i+4];
                  dst[i] = src[i+5];
                  dst[i] = src[i+6];
                  dst[i] = src[i+7];
          }
  }
  void c_is_faster_than_asm_u(sse_u *dst, sse_u *src, int count) {
          // ditto
  }

  • gens 9 years ago

    >See what happens when you change HALF_OF_BUFFER_SIZE from 1M to 1M+64. Or 128 or 1024.

    Tested. There's a greater difference between aligned and unaligned on aligned data. But that made the test go over my cache size (2MB per core), so I tested with 512kB with and without your +128. Results were (relatively) similar to the original 1MB test.

    >Which is important, because nobody sane would misalign an array of doubles by 1 byte [...]

    Adobe Flash would, for starters (idk if doubles, but it calls unaligned memcpy all the time). The code from the person above also does, because compilers sometimes do (an aligned mov sometimes segfaults if you don't tell the compiler to align an array, especially if it's in a struct).

    >Also, your assembly isn't really that great. In particular, LOOP is microcoded and sucks on AMD. I got better results with this:

    Of course you did, you unrolled the loop. The whole point was to test memory access, not to write a fast copy function.

    >c_is_faster_than_asm_a()

    First of all, that is not in the C specification. It is a gcc/clang/idk_if_others extension to C. It compiles to something similar to what I would write if I had unrolled the loop. Actually worse, here's what it compiled to: http://pastebin.com/yL31spR2 . Note that this is still a lot slower than movntps when going over cache size.

    edit: I didn't notice at first. Your code copies eight 16-byte chunks onto the first one. You forgot to add +n to the dst index.

    • qb45 9 years ago

      Crap, that was bad. Fixed. And removed the insane unrolling, now 2x is sufficient.
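
The fixed code was not reposted in the thread. A guess at what the corrected, 2x-unrolled version might look like, reusing the two typedefs from the earlier comment; the function names `copy2x_a` and `copy2x_u` are hypothetical, and `count` is taken to be a byte count as in the original loop bound.

```c
#include <assert.h>
#include <stddef.h>

/* GCC/Clang vector extensions: 16-byte float vectors, aligned and unaligned. */
typedef float sse_a __attribute__ ((vector_size(16), aligned(16)));
typedef float sse_u __attribute__ ((vector_size(16), aligned(1)));

/* Aligned variant: dst index now advances together with src. */
void copy2x_a(sse_a *dst, const sse_a *src, size_t count) {
    size_t n = count / sizeof(sse_a);   /* number of 16-byte vectors */
    assert(n % 2 == 0);                 /* 2x unrolling needs an even count */
    for (size_t i = 0; i < n; i += 2) {
        dst[i + 0] = src[i + 0];
        dst[i + 1] = src[i + 1];
    }
}

/* Unaligned variant: same loop, but the type permits any address. */
void copy2x_u(sse_u *dst, const sse_u *src, size_t count) {
    size_t n = count / sizeof(sse_u);
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        dst[i + 0] = src[i + 0];
        dst[i + 1] = src[i + 1];
    }
}
```

The "+1", "+3", "+4" and "+8" rows in the results below would then correspond to calling the unaligned variant with both pointers shifted by that many bytes.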

      You are right, 128 is not enough on Piledriver. Still,

        ./test $(( 512*1024+1024*0 ))
        aligned: 0 sec, 134539 nsec
        unaligned on aligned data: 0 sec, 101471 nsec
        unaligned on one byte unaligned data: 0 sec, 190368 nsec
        unaligned on three bytes unaligned data: 0 sec, 181823 nsec
        aligned nontemporal: 0 sec, 359920 nsec
        naive: 0 sec, 214007 nsec
        c_is_faster_than_asm_a:   0 sec, 92437 nsec
        c_is_faster_than_asm_u:   0 sec, 92643 nsec
        c_is_faster_than_asm_u+1: 0 sec, 156574 nsec
        c_is_faster_than_asm_u+3: 0 sec, 156359 nsec
        c_is_faster_than_asm_u+4: 0 sec, 154932 nsec
        c_is_faster_than_asm_u+8: 0 sec, 155784 nsec
      
        ./test $(( 512*1024+1024*1 ))
        aligned: 0 sec, 107036 nsec
        unaligned on aligned data: 0 sec, 94861 nsec
        unaligned on one byte unaligned data: 0 sec, 114444 nsec
        unaligned on three bytes unaligned data: 0 sec, 115915 nsec
        aligned nontemporal: 0 sec, 407951 nsec
        naive: 0 sec, 219215 nsec
        c_is_faster_than_asm_a:   0 sec, 82474 nsec
        c_is_faster_than_asm_u:   0 sec, 82554 nsec
        c_is_faster_than_asm_u+1: 0 sec, 112544 nsec
        c_is_faster_than_asm_u+3: 0 sec, 115159 nsec
        c_is_faster_than_asm_u+4: 0 sec, 198434 nsec
        c_is_faster_than_asm_u+8: 0 sec, 118952 nsec
      

      4k is the stride of L1, your code slows down 1.5x:

        ./test $(( 512*1024+1024*4 ))
        aligned: 0 sec, 107576 nsec
        unaligned on aligned data: 0 sec, 94010 nsec
        unaligned on one byte unaligned data: 0 sec, 140534 nsec
        unaligned on three bytes unaligned data: 0 sec, 140517 nsec
        aligned nontemporal: 0 sec, 467981 nsec
        naive: 0 sec, 206891 nsec
        c_is_faster_than_asm_a:   0 sec, 85294 nsec
        c_is_faster_than_asm_u:   0 sec, 85174 nsec
        c_is_faster_than_asm_u+1: 0 sec, 118674 nsec
        c_is_faster_than_asm_u+3: 0 sec, 118902 nsec
        c_is_faster_than_asm_u+4: 0 sec, 118370 nsec
        c_is_faster_than_asm_u+8: 0 sec, 118638 nsec
        

      128k is the stride of L2, both codes slow down further:

        ./test $(( 512*1024+1024*128 ))
        aligned: 0 sec, 167906 nsec
        unaligned on aligned data: 0 sec, 140650 nsec
        unaligned on one byte unaligned data: 0 sec, 239271 nsec
        unaligned on three bytes unaligned data: 0 sec, 251342 nsec
        aligned nontemporal: 0 sec, 458850 nsec
        naive: 0 sec, 364731 nsec
        c_is_faster_than_asm_a:   0 sec, 125240 nsec
        c_is_faster_than_asm_u:   0 sec, 118917 nsec
        c_is_faster_than_asm_u+1: 0 sec, 197348 nsec
        c_is_faster_than_asm_u+3: 0 sec, 196755 nsec
        c_is_faster_than_asm_u+4: 0 sec, 199757 nsec
        c_is_faster_than_asm_u+8: 0 sec, 197842 nsec