| Svelte Hacker News

qb45 9 years ago

See what happens when you change HALF_OF_BUFFER_SIZE from 1M to 1M+64. Or 128 or 1024. I think what you observed is the result of loads and stores hitting the same cache set at the same time, all while misalignment additionally increases the number of cache banks involved in any given operation. But that's just hand-waving, I don't know the internals enough to say with confidence what's going on exactly.

BTW, changing misalignment from 1 to 8 reduces this effect by half on my Thuban. Which is important, because nobody sane would misalign an array of doubles by 1 byte, while processing part of an array starting somewhere in the middle is a real thing.

Also, your assembly isn't really that great. In particular, LOOP is microcoded and sucks on AMD. I got better results with this:

  typedef float sse_a __attribute__ ((vector_size(16), aligned(16)));
  typedef float sse_u __attribute__ ((vector_size(16), aligned(1)));
  
  void c_is_faster_than_asm_a(sse_a *dst, sse_a *src, int count) {
          for (int i = 0; i < count/sizeof(sse_a); i += 8) {
                  dst[i] = src[i+0];
                  dst[i] = src[i+1];
                  dst[i] = src[i+2];
                  dst[i] = src[i+3];
                  dst[i] = src[i+4];
                  dst[i] = src[i+5];
                  dst[i] = src[i+6];
                  dst[i] = src[i+7];
          }
  }
  void c_is_faster_than_asm_u(sse_u *dst, sse_u *src, int count) {
          // ditto

gens 9 years ago

>See what happens when you change HALF_OF_BUFFER_SIZE from 1M to 1M+64. Or 128 or 1024.

Tested. There's a greater difference between aligned and aligned_unaligned. But that made the test go over my cache size (2MB per core), so i tested with 512kB with and without your +128. Results were (relatively) similar to the original 1MB test.

>Which is important, because nobody sane would misalign an array of doubles by 1 byte [...]

Adobe flash would, for starters (idk if doubles but it calls unaligned memcpy all the time). The code from the person above also does because compilers sometimes do (aligned mov sometimes segfaults if you don't tell the compiler to aligned an array, especially if it's in a struct).

>Also, your assembly isn't really that great. In particular, LOOP is microcoded and sucks on AMD. I got better results with this:

Of course you did, you unrolled the loop. The whole point was to test memory access, not to write a fast copy function.

>c_is_faster_than_asm_a()

First of all, that is not in the C specification. It is a gcc/clang/idk_if_others extension to C. It compiles to similar what I would write if i had unrolled the loop. Actually worse, here's what it compiled to http://pastebin.com/yL31spR2 . Note that this is still a lot slower then movnpts when going over cache size.

edit: I didn't notice at first. Your code copies 8 16byte... chunks to the first. You forgot to add +n to dst.

qb45 9 years ago

Crap, that was bad. Fixed. And removed the insane unrolling, now 2x is sufficient.

You are right, 128 is not enough on Piledriver. Still,

  ./test $(( 512*1024+1024*0 ))
  aligned: 0 sec, 134539 nsec
  unaligned on aligned data: 0 sec, 101471 nsec
  unaligned on one byte unaligned data: 0 sec, 190368 nsec
  unaligned on three bytes unaligned data: 0 sec, 181823 nsec
  aligned nontemporal: 0 sec, 359920 nsec
  naive: 0 sec, 214007 nsec
  c_is_faster_than_asm_a:   0 sec, 92437 nsec
  c_is_faster_than_asm_u:   0 sec, 92643 nsec
  c_is_faster_than_asm_u+1: 0 sec, 156574 nsec
  c_is_faster_than_asm_u+3: 0 sec, 156359 nsec
  c_is_faster_than_asm_u+4: 0 sec, 154932 nsec
  c_is_faster_than_asm_u+8: 0 sec, 155784 nsec

  ./test $(( 512*1024+1024*1 ))
  aligned: 0 sec, 107036 nsec
  unaligned on aligned data: 0 sec, 94861 nsec
  unaligned on one byte unaligned data: 0 sec, 114444 nsec
  unaligned on three bytes unaligned data: 0 sec, 115915 nsec
  aligned nontemporal: 0 sec, 407951 nsec
  naive: 0 sec, 219215 nsec
  c_is_faster_than_asm_a:   0 sec, 82474 nsec
  c_is_faster_than_asm_u:   0 sec, 82554 nsec
  c_is_faster_than_asm_u+1: 0 sec, 112544 nsec
  c_is_faster_than_asm_u+3: 0 sec, 115159 nsec
  c_is_faster_than_asm_u+4: 0 sec, 198434 nsec
  c_is_faster_than_asm_u+8: 0 sec, 118952 nsec

4k is the stride of L1, your code slows down 1.5x:

  ./test $(( 512*1024+1024*4 ))
  aligned: 0 sec, 107576 nsec
  unaligned on aligned data: 0 sec, 94010 nsec
  unaligned on one byte unaligned data: 0 sec, 140534 nsec
  unaligned on three bytes unaligned data: 0 sec, 140517 nsec
  aligned nontemporal: 0 sec, 467981 nsec
  naive: 0 sec, 206891 nsec
  c_is_faster_than_asm_a:   0 sec, 85294 nsec
  c_is_faster_than_asm_u:   0 sec, 85174 nsec
  c_is_faster_than_asm_u+1: 0 sec, 118674 nsec
  c_is_faster_than_asm_u+3: 0 sec, 118902 nsec
  c_is_faster_than_asm_u+4: 0 sec, 118370 nsec
  c_is_faster_than_asm_u+8: 0 sec, 118638 nsec

128k is the stride of L2, both codes slow down further:

  ./test $(( 512*1024+1024*128 ))
  aligned: 0 sec, 167906 nsec
  unaligned on aligned data: 0 sec, 140650 nsec
  unaligned on one byte unaligned data: 0 sec, 239271 nsec
  unaligned on three bytes unaligned data: 0 sec, 251342 nsec
  aligned nontemporal: 0 sec, 458850 nsec
  naive: 0 sec, 364731 nsec
  c_is_faster_than_asm_a:   0 sec, 125240 nsec
  c_is_faster_than_asm_u:   0 sec, 118917 nsec
  c_is_faster_than_asm_u+1: 0 sec, 197348 nsec
  c_is_faster_than_asm_u+3: 0 sec, 196755 nsec
  c_is_faster_than_asm_u+4: 0 sec, 199757 nsec
  c_is_faster_than_asm_u+8: 0 sec, 197842 nsec