
pcwalton 9 years ago

> Here is an operating system containing many nontrivial applications, all written in 100% Asm:

Here's a small snippet I found from draw.asm:

    add   r9  , [copyxe]
    add   r9  , [copyxe]
    add   r9  , [copyxe]
    sub   r9  , [copyx]
    sub   r9  , [copyx]
    sub   r9  , [copyx]
    add   r10 , [sizex]
    add   r10 , [sizex]
    add   r10 , [sizex]
    add   r14 , 1

72 bytes.

Here's what GCC generates with -Os for the same code:

    imul rax,qword ptr [copyxe],3
    inc r14
    add r9,rax
    imul rax,qword ptr [sizex],3
    add r10,rax
    imul rax,qword ptr [copyx],-3
    add r9,rax

36 bytes.

As a bonus, GCC with -O2:

    mov rax,[copyxe]
    add r14,1
    lea rax,[rax+rax*2]
    add r9,rax
    mov rax,[sizex]
    lea rax,[rax+rax*2]
    add r10,rax
    mov rax,[copyx]
    mov rcx,rax
    neg rcx
    lea rax,[rax+rcx*4]   ; clever!
    add r9,rax

52 bytes.

So, not only does GCC hugely beat this snippet of handwritten Menuet assembly in code size, it can even do so handily while folding all the repeated additions into LEA and eliminating most of the loads.

This is an excellent example of why you shouldn't write in assembler.
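
For readers who want the arithmetic spelled out, here is a rough C sketch of what all three listings compute. The original source that was fed to GCC is not shown in this thread, so the struct, the function names and the unsigned 64-bit types are assumptions made for illustration only; unsigned types are used so that wraparound matches the hardware adds.

```c
#include <stdint.h>

/* Hypothetical stand-in for the three registers touched by the snippet. */
typedef struct {
    uint64_t r9, r10, r14;
} regs_t;

/* Mirrors the handwritten listing: one add/sub per instruction. */
regs_t update_naive(regs_t r, uint64_t copyxe, uint64_t copyx, uint64_t sizex)
{
    r.r9 += copyxe; r.r9 += copyxe; r.r9 += copyxe;
    r.r9 -= copyx;  r.r9 -= copyx;  r.r9 -= copyx;
    r.r10 += sizex; r.r10 += sizex; r.r10 += sizex;
    r.r14 += 1;
    return r;
}

/* Mirrors the folded form: each repeated add becomes a multiply by 3. */
regs_t update_folded(regs_t r, uint64_t copyxe, uint64_t copyx, uint64_t sizex)
{
    r.r9  += 3 * copyxe;
    r.r9  -= 3 * copyx;
    r.r10 += 3 * sizex;
    r.r14 += 1;
    return r;
}
```

Both functions return the same values for any inputs, which is the equivalence the compiler relies on when it folds the repeated additions.
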

slededit 9 years ago

An assembly programmer at full attention can beat a compiler at most operations. That said, the average level of attention a compiler can give over the whole program is much greater than that of an assembly programmer. When performance is the ultimate goal, it makes sense to drop into assembler only for your inner loop.
However, assembly programmers will also instinctively avoid things that result in massive amounts of code, because they have to write every line of it by hand. Compare that with C++, where templates are effectively a code-generation meta-language. C makes you work a lot harder to generate huge amounts of machine code, and its binaries tend to be smaller because of it.

userbinator 9 years ago

> 72 bytes.

I don't think that's correct; assuming RIP-relative for all the global variables, the first 9 instructions are 7 bytes each and the last one 4, giving a total of 67 bytes.
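
The count can be checked by adding up the encoding pieces. This is only a sketch under the same RIP-relative assumption: a 64-bit `add`/`sub` from memory is a REX.W prefix, one opcode byte, a ModRM byte and a 4-byte displacement, while `add r14, 1` is REX.W, opcode, ModRM and an 8-bit immediate. The function and constant names are made up for this example.

```c
#include <stddef.h>

/* Sizes in bytes of the encoding components (x86-64). */
enum { REX = 1, OPCODE = 1, MODRM = 1, DISP32 = 4, IMM8 = 1 };

/* add/sub r64, [rip+disp32] */
size_t reg_mem_riprel_len(void) { return REX + OPCODE + MODRM + DISP32; }

/* add r64, imm8 */
size_t reg_imm8_len(void) { return REX + OPCODE + MODRM + IMM8; }

/* Nine memory-operand instructions followed by one add-immediate. */
size_t snippet_len(void) { return 9 * reg_mem_riprel_len() + reg_imm8_len(); }
```

That is 9 * 7 + 4 = 67.
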
You can cherry-pick examples all you want, but I think this is a great example of why you should use Asm: It's not even trying to be optimised, and it still somehow manages to be smaller overall!
Try to optimise it, however, and you can definitely beat GCC...
```
    lea rcx, [copyxe]
    imul rax, [rcx], 3
    add r9, rax
    imul rax, [rcx+copyx-copyxe], -3
    add r9, rax
    imul rax, [rcx+sizex-copyxe], 3
    add r10, rax
    inc r14
```
...with 32 bytes. This human gives, for "-O2", the following 41-byte snippet:
```
    lea rcx, [copyxe]
    mov rax, [rcx]
    lea rax, [rax+rax*2]
    add r9, rax
    mov rax, [rcx+copyx-copyxe]
    lea rax, [rax+rax*2]
    sub r9, rax
    mov rax, [rcx+sizex-copyxe]
    lea rax, [rax+rax*2]
    add r10, rax
    inc r14
```
GCC may be clever enough to turn (-3) * x into x - 4 * x, but it is not clever enough to realise that turning 3 * x into x + 2 * x, as it did for the other two "statements", and then subtracting would be shorter and faster, because it means one less operation on the critical path.
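
To make the two routes concrete, here is a small C sketch with one statement per instruction. The function names are invented for this example, and an optimising compiler may well emit identical code for both; the point is only that the results match and that the second route has a shorter dependency chain after the load.

```c
#include <stdint.h>

/* GCC -O2 route from the listing: neg, lea with scale 4, add. */
uint64_t sub3x_gcc(uint64_t r9, uint64_t x)
{
    uint64_t rcx = (uint64_t)0 - x;  /* neg rcx                          */
    uint64_t rax = x + rcx * 4;      /* lea rax,[rax+rcx*4] : x - 4x     */
    return r9 + rax;                 /* add r9,rax                       */
}

/* Handwritten route: lea with scale 2, then sub. */
uint64_t sub3x_hand(uint64_t r9, uint64_t x)
{
    uint64_t rax = x + x * 2;        /* lea rax,[rax+rax*2] : 3x         */
    return r9 - rax;                 /* sub r9,rax                       */
}
```

After the load, the first version chains neg, lea and add; the second chains only lea and sub.
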

l8rlump 9 years ago

I love this! I've been staring at it (on and off) most of the day, trying to understand what's going on. I see that the compiler has changed the order of execution of "add r14, 1". Would that be to avoid stalling the pipeline due to several rax-intensive instructions bunched together ("data hazard")? Would you be able to elaborate more on what's going on with the code and why it's smart?
- userbinator 9 years ago
  
  r14 is not used at all in that snippet except for that one instruction; its increment can thus be put anywhere between the others, and where the compiler put it is as arbitrary as what a human would do in the absence of any further information. In any out-of-order (OoO) architecture (for Intel x86, everything except early Atoms and the tiny embedded cores) the CPU will reorder instructions anyway and can look ahead several dozen instructions:
  http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/
  The repeated use of rax is also no problem because of register renaming. Despite two dependency chains specifying rax, the CPU can detect that they're independent and assign different physical registers for rax, allowing them to execute in parallel.

TazeTSchnitzel 9 years ago

Menuet OS could be the same size if it was written in C or C++. It's small because it's minimal and uses few dependencies.