Whether this is a good thing or bad, the reality is that hundreds of millions of lines of assembly would be required to replicate complex modern programs like web browsers - projects of that scale will always require powerful HLLs to manage abstraction, something that assembly does not and cannot provide on its own.
On the other hand, when using Asm you can often leverage techniques not possible in HLLs to avoid complexity and abstraction. This seems not to be commonly known, even in CS courses involving Asm, as those tend to be more about inspecting compiler output, and the few "write in Asm" exercises are mimicking code that a compiler would generate, which IMHO misses the whole point of writing in Asm.
Here is an operating system containing many nontrivial applications, all written in 100% Asm:
Among other things, it contains a minimal yet functional web browser (around the same level of minimalism as Dillo or NetSurf, i.e. no JS or fancy CSS), and the whole package fits on a single floppy disk.
An entire OS with kernel, drivers, and applications can in total be smaller than a "Hello World" application in a "modern" programming language (which also depends on countless external libraries, including an OS of several GB or more). It really makes you think.
> Here is an operating system containing many nontrivial applications, all written in 100% Asm:
Here's a small snippet I found from draw.asm:
72 bytes.
Here's what GCC generates with -Os for the same code:
36 bytes.
As a bonus, GCC with -O2:
52 bytes.
So, not only does GCC hugely beat this snippet of handwritten Menuet assembly in code size, it can even do so handily while folding all the repeated additions into LEA and eliminating most of the loads.
This is an excellent example of why you shouldn't write in assembler.
An assembler programmer at full attention can beat a compiler at most operations. That said, the average level of attention a compiler can give over the whole program is much greater than that of an assembly programmer. When performance is the ultimate goal, it makes sense to drop into assembler only for your inner loop.
However, assembly programmers will also instinctively avoid things that result in massive amounts of code, because they have to manually write every line of it. Compare that to C++, where templates are effectively a code-generation meta-language. C makes you work a lot harder to generate huge amounts of machine code, and its binaries tend to be smaller because of it.
> 72 bytes.
I don't think that's correct; assuming RIP-relative for all the global variables, the first 9 instructions are 7 bytes each and the last one 4, giving a total of 67 bytes.
You can cherry-pick examples all you want, but I think this is a great example of why you should use Asm: It's not even trying to be optimised, and it still somehow manages to be smaller overall!
Try to optimise it, however, and you can definitely beat GCC...
...with 32 bytes. This human gives, for "-O2", the following 41-byte snippet:
GCC may be clever enough to turn (-3) * x into x - 4 * x, but it is not clever enough to realise that computing 3 * x as x + 2 * x (as it did for the other two "statements") and then subtracting would be shorter and faster, because it means one less operation on the critical path.
I love this! I've been staring at it (on and off) most of the day, trying to understand what's going on. I see that the compiler has changed the order of execution of "add r14, 1". Would that be to avoid stalling the pipeline due to several rax-intensive instructions bunched together ("data hazard")? Would you be able to elaborate more on what's going on with the code and why it's smart?
r14 is not used at all in that snippet except for that one instruction; its increment can thus be put anywhere between the others, and where the compiler put it is as arbitrary as what a human would do in the absence of any further information. In any OoO architecture (for Intel x86, everything except early Atoms and the tiny embedded cores) the CPU will reorder instructions anyway and can look ahead several dozen instructions:
http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/
The repeated use of rax is also no problem because of register renaming. Despite two dependency chains specifying rax, the CPU can detect that they're independent and assign different physical registers for rax, allowing them to execute in parallel.
Menuet OS could be the same size if it was written in C or C++. It's small because it's minimal and uses few dependencies.