Donald Ervin Knuth is the author of the book "The Art of Computer Programming" (in progress for a couple of decades; volume 4C is currently being written). It is quite advanced, and it will likely not cover compilers anymore: Addison-Wesley had commissioned a compiler book from Knuth when he was a doctoral candidate, but he is now retired and has stated that his goal for the series has changed.
I disagree with the author's point: the "Dragon book"'s ("Compilers: Principles, Techniques, and Tools" by Aho et al.) Chapter 2 is a self-sufficient introduction into compilers from end to end, and it can be read on its own, ignoring the rest of the excellent book.
Another fantastic intro to compiler writing is Niklaus Wirth's short little book "Compiler Construction", which explains, and contains the surprisingly short source code of, a complete compiler. The whole book is highly understandable - pristine clarity, really - and it all fits in under 100 pages (99, in fact).
(I learned enough from these two sources to write a compiler in high school.)
There is still hope for a compiler book. From Knuth's website:
> And after Volumes 1--5 are done, God willing, I plan to publish Volume 6 (the theory of context-free languages) and Volume 7 (Compiler techniques), but only if the things I want to say about those topics are still relevant and still haven't been said.
I hope that God is indeed willing, but the man is 88 years old and he’s not done with the third tome of volume four. It would require a minor miracle for him to finish volume 7 within this lifetime.
I don't think there is hope if you look at actuarial tables and Knuth's age. It's not clear to me if he'll be able to finish volume 4. The outline he has seems to have enough material to fill volumes 4C-4G to my eyes, and he isn't exactly cranking out the volumes.
Admittedly, volumes 5-7 wouldn't be as massive as volume 4 (it sort of turns out that almost all interesting algorithms end up being categorized under volume 4), so you probably wouldn't have a half-dozen subvolumes per topic, but it's still too many books down the line, especially if he plans to revise volumes 1-3 before working on anything else.
The dragon book almost convinced me never to try to write a compiler. I don't know why people recommend it. I guess you're a lot smarter than I am.
There are some excellent books out there. In its own way, the dragon book is excellent, but it is a terrible starting place.
Here are a bunch of references from the same vintage as OP. I recommend starting with a book that actually walks through the process of building a compiler and doesn't spend its time exclusively with theory.
I started with the dragon book, and I found it to be a good introductory text.
A lot of people say the dragon book is difficult, so I suppose there must be something there. But I don't see what it is, I thought it was quite accessible.
I'm curious, what parts/aspects of the dragon book make it difficult to start with?
It's been a few years since I worked with the dragon book, but I think the most common complaint was that it starts with something like 350 pages on parser theory: generating bottom-up and top-down parsers from context-free grammars, optimizing lexers for systems that don't have enough RAM to store an entire source file, etc., before ever getting to what most people who want to write a compiler care about (implementing type inference, optimizing intermediate representations, generating assembly code). Of course parsing is important, and very interesting to some. But there's a reason most modern resources skip over all of that and just make the reader write a recursive descent parser.
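For reference, the kind of recursive descent parser those modern resources have the reader write can be tiny. Here's an illustrative Python sketch (a hand-rolled arithmetic grammar, not taken from any of the books discussed): one function per grammar rule, with operator precedence encoded by which rule calls which.

```python
import re

def tokenize(src):
    # Numbers, the four operators, and parentheses; whitespace is skipped.
    return re.findall(r"\d+|[-+*/()]", src)

class Parser:
    """Recursive descent parser/evaluator for + - * / and parentheses."""
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):          # expr -> term (('+' | '-') term)*
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.next() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):          # term -> factor (('*' | '/') factor)*
        value = self.factor()
        while self.peek() in ("*", "/"):
            if self.next() == "*":
                value *= self.factor()
            else:
                value //= self.factor()
        return value

    def factor(self):        # factor -> NUMBER | '(' expr ')'
        tok = self.next()
        if tok == "(":
            value = self.expr()
            self.next()      # consume the closing ')'
            return value
        return int(tok)

def evaluate(src):
    return Parser(tokenize(src)).expr()
```

Because `term` sits below `expr` in the call chain, `*` and `/` automatically bind tighter than `+` and `-`, with no parser theory required.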
I guess "back in the day" you had to be able to write an efficient parser, as no parser generators existed. If you couldn't implement whatever you wanted due to memory shortage at the parser level, then obviously it's gonna be a huge topic.
Even now I believe it is good to know about this - if only to avoid pitfalls in your own grammar.
I repeatedly skip parts that are not important to me when reading books like this.
I grabbed a book about embedded design and skipped about half of it, which was bus protocols, as I knew I wouldn't need it.
There is no need to read the dragon book from front to back.
> But there's a reason most modern resources skip over all of that and just make the reader write a recursive descent parser.
Unless the reason is explicitly stated there is no way to verify it's any good.
There's a reason people use AI to do their homework - it just doesn't mean it's a good one.
I can think of plenty of arguments for why you wouldn't look into the pros and cons of different parsing strategies in an introduction to compilers; "everyone is (or isn't) doing it" is not one of them.
In the end, it has to be written down somewhere, and if no other book is doing it for whatever reason, then the dragon book it shall be. You can always recommend skipping that part if someone asks about what book to use.
I actually think the parsing part is more important for laymen. Like, there may be a total of 10K programmers who are interested in learning compiler theories, but maybe 100 of them are ever going to write the backend -- the rest of them are stuck with either toy languages, or use parsing to help with their job. Parsing is definitely more useful for most of us who are not smart enough :D
Yeah, I agree, that seems very true. Although the average person probably also benefits more from learning about recursive descent and Pratt parsing than about LL(k) parser generators, automata, and finding first and follow sets :)
The thing about parsing (and algorithms in general) is that it can be hair-raisingly complex for arbitrary grammars, but in practice, people have discovered that designing simple, unambiguous grammars and avoiding problems like context-dependent parsing makes the parsing problem trivial.
Accepting such constraints is quite practical and leads to little to no loss of power.
In fact, most modern languages are designed with little to no necessary backtracking and simple parsing, Go and Rust being noteworthy examples.
> In fact, most modern languages are designed with little to no necessary backtracking and simple parsing, Go and Rust being noteworthy examples.
But to understand how to generate grammars for languages that are easy to parse, you have in my opinion to dive quite deeply into parsing theory to understand which subtle aspects make parsing complicated.
My personal context, to understand where I'm coming from: I'm working on my own language, a curly-brace C-style language where I didn't try to stray too far from established norms, so the syntax is not that fancy (at least not in that regard). I also want my language to look familiar to most programmers, so I'm deliberately sticking close to those norms.
I'm thankfully past the parsing stage, and so far I haven't encountered many issues with ambiguity; when I did, I was able to fix them.
Also, in certain cases I'm quite liberal about allowing omission of parentheses and other such control tokens, which I know leads to some cases where either the code is ambiguous (as in there's no strictly defined way the compiler is supposed to interpret it) or valid code fails to parse.
So far I have not tackled this issue, as it can always be fixed by the programmer manually adding back those parens for example. I know this is not up to professional standards, but I like the cleanliness of the syntax and simplicity of the compiler, and the issue is always fixable for me later. So this is a firm TODO for me.
Additionally, I have some features planned that would crowd the syntax space in a way that would probably need some academic chops to fix, but I'm kind of holding off on those, as they are not central to the main gimmick™ and I want to release this thing in a reasonable timeframe.
I don't really have much of a formal education in this, other than reading a few tutorials and looking through a few implementations.
Btw, besides just parsing, there are other concerns in modern languages, such as IDE support (files should be parseable independently, etc.), error recovery, readable errors, and autocomplete hints, which I'm not sure are addressed in depth in the dragon book. These are features I do want.
My two cents is that for a simple modern language, you can get quite far with zero semantic model, while with stuff like C++ (with macros), my brain would boil at the thought of having to write a decent IDE backend.
The dragon book is how to write a production-grade thing, I guess. It elaborates on all the interesting concepts, which is great, but it quickly dives into things that can clutter a project if it's just for fun.
It's academic and comprehensive; that's the issue. It's not about writing a production-grade compiler, though, in my humble opinion. There are more things to learn for that, unfortunately... it's just a pretty big topic with lots of stuff to learn.
When I was professionally writing a compiler (see https://ciex-software.com/intro-to-compilers.html), the Dragon book was the second book that I read. I found it very helpful. That was the first Dragon book; I got the second one later. It would have been fine to start with the Dragon book - the Compiler Generator book was a harder study.
You're not the only one. In college I took a compilers course and we used the dragon book, to me it sucked the joy out of the magical concept of making a compiler.
Some years later I (re-)discovered Forth, and I thought "why not?" and built my own Forth in 32-bit Intel assembly. _That_ brought back the wonder and "magical" feeling of compilers again. All in less than 4KB.
I guess I wasn't the right audience for the dragon book.
Imho the problem is the fixation on parser generators and BNF. It's just a lot easier to write a recursive descent parser than to figure out the correct BNF for anything other than a toy language with horrible syntax.
> But then, pushing regular languages theory into the curriculum, just to rush over it so you can use them for parsing is way worse.
At least in the typical curriculum of German universities, the students already know the whole theory of regular languages from their Theoretical Computer Science lectures quite well, thus in a compiler lecture, the lecturer can indeed rush over this topic because it is just a repetition.
I would argue the opposite: Being describable in BNF is exactly the hallmark of sensible syntax in a language, and of a language easily amenable to recursive descent parsing. Wirth routinely published (E)BNF for the languages he designed.
Imo BNF (or some other formal notation) is quite useful for defining your syntax, my biggest gripe with BNF in particular is the way it handles operator precedence (through nested recursive expressions), which can get messy quite fast.
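For anyone who hasn't seen the nesting in question, it looks something like this illustrative EBNF sketch: one nonterminal per precedence level, chained together, so adding or re-ranking an operator means rethreading the whole chain.

```ebnf
expr       = equality ;
equality   = comparison { ("==" | "!=") comparison } ;
comparison = additive   { ("<"  | ">")  additive } ;
additive   = term       { ("+"  | "-")  term } ;
term       = unary      { ("*"  | "/")  unary } ;
unary      = [ "-" ] primary ;
primary    = NUMBER | "(" expr ")" ;
```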
Pratt parsers don't even use this recursion; they only have a concept of 'binding strength'. In layman's terms: if I'm parsing a binary expression and I've managed to parse a subexpression, and the next token I'm looking at is another binary operator, do I continue parsing, so that the subexpression becomes the RHS of the new expression, or do I finish my original expression, which then becomes the LHS of the new one?
It represents this through the concept of stickiness, with one simple rule: a subexpression always sticks to the operator that's more sticky.
This is both quite easy to imagine and easy to encode, as stickiness is just a number.
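To make "stickiness is just a number" concrete, here is a minimal Pratt-style parser sketch in Python (the token shapes and binding-power values are made up for illustration):

```python
import re

# Binding power ("stickiness") per binary operator; higher binds tighter.
BINDING_POWER = {"+": 10, "-": 10, "*": 20, "/": 20}

def tokenize(src):
    return re.findall(r"\d+|[-+*/()]", src)

def parse_expr(tokens, min_bp=0):
    """Pratt-style expression parser returning a nested-tuple AST."""
    tok = tokens.pop(0)
    if tok == "(":
        lhs = parse_expr(tokens, 0)
        tokens.pop(0)  # consume ')'
    else:
        lhs = int(tok)
    while tokens and tokens[0] in BINDING_POWER:
        op = tokens[0]
        bp = BINDING_POWER[op]
        if bp <= min_bp:
            break  # pending operator is less sticky: finish this subexpression
        tokens.pop(0)
        # Everything stickier than op goes to its right-hand side.
        rhs = parse_expr(tokens, bp)
        lhs = (op, lhs, rhs)
    return lhs
```

A single comparison against one number replaces the whole tower of nonterminals, and using `<=` rather than `<` for the break makes operators of equal stickiness left-associative.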
I think a simpler, more straightforward notation that incorporates precedence would be better.
Great thread. If you have 1 hour to get started, I recommend opening Engineering a Compiler and studying Static Single-Assignment (SSA) from ch 9.3.
The book is famous for its SSA treatment. Chapters 1-8 are not required to understand SSA. This allows you to walk away with a clear win. Refer to 9.2 if you're struggling with dominance + liveness.
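As a taste of the idea before opening the book: in SSA form every variable is assigned exactly once, so later passes never have to ask "which definition of x reaches here?". For straight-line code the conversion is just renaming; this illustrative Python sketch (with invented instruction tuples) deliberately omits the hard part the chapter actually covers, phi nodes at control-flow joins.

```python
def to_ssa(instrs):
    """Rename variables in straight-line three-address code so each name
    is assigned exactly once (SSA form, no control flow / phi nodes).

    Each instruction is a tuple (dest, op, lhs, rhs)."""
    version = {}  # current version number per source variable

    def use(name):
        # Operands that are variables refer to their latest version;
        # anything never assigned (parameters, constants) passes through.
        return f"{name}{version[name]}" if name in version else name

    out = []
    for dest, op, lhs, rhs in instrs:
        lhs, rhs = use(lhs), use(rhs)
        version[dest] = version.get(dest, -1) + 1
        out.append((f"{dest}{version[dest]}", op, lhs, rhs))
    return out
```

For example, `x = a + b; x = x * 2; y = x + 1` becomes `x0 = a + b; x1 = x0 * 2; y0 = x1 + 1`; once branches join, picking between `x0` and `x1` is exactly what phi nodes (and the dominance machinery of 9.2) exist for.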
I bought this book when I was working on a toy language and I think I was too stupid to understand most of it. The first few chapters were great, but it quickly surpassed my capacity to understand. Seeing it mentioned makes me want to revisit.
Another vote for Lisp in Small Pieces. Great high level compiler book that teaches you how to build a Lisp and doesn’t get bogged down in lexing and parsing.
The "Dragon Book" is big on parsing but I wouldn't recommend it if you want to make many optimisation passes or a back-end.
The first edition was my first CS textbook, back in the '90s and as a young programmer I learned a lot from it.
A couple years ago, I started on a modern compiler back-end however, and found that I needed to update my knowledge with quite a lot.
The 2nd ed covers data-flow analysis, which is very important.
However, modern compilers (GCC, LLVM, Cranelift, ...) are built around an intermediate representation in Static Single Assignment-form. The 2nd ed. has only a single page about SSA and you'd need to also learn a lot of theory about its properties to actually use it properly.
Parsing is the front end to a compiler. Can't get semantics without first recognizing syntax. I have a hard time thinking about programming languages without seeing them as a parsing exercise first, every time.
Syntax and semantics are never orthogonal, and you always need syntax, so it must be considered from the start. Any reasonable syntax will quickly become much more pleasant for generating an AST or IR than, say, manually building those objects in the host language of the compiler, which is what the semantics-first crowd seems to propose.
It is also only the case for some compilers that most of the work is in the backend, though of course this depends on how "backend" is defined. Is the backend just codegen, or is it all of the analysis between parsing and codegen? If you target a high-level language, which is very appropriate for one's first few compilers, the backend can be quite simple. At the simplest, no AST is even necessary and the compiler can just mechanically translate one syntax into another in a single pass.
I think his point is that "form follows function".
If you know what kind of semantics you're going to have, you can use that to construct a syntax that lends itself to using it properly.
> The recommended advice is to start with semantics first. Syntax will change, there is not much point fixing it down too early.
It's actually the reverse, in my opinion. Semantics can change much more easily than syntax. You can see this in that small changes in syntax can cause massive changes in a recursive-descent parser while the semantics can change from pass-by-reference to pass-by-value and make it barely budge.
There is a reason practically every modern language has adopted syntax sigils like (choosing Zig):
pub fn is_list(arg: arg_t, len: ui_t) bool {
This allows the identification of the various parts and types without referencing or compiling the universe. That's super important and something that must be baked in the syntax at the start or there is nothing you can do about it.
Getting an overview of parsing theory is mainly useful to avoid making ambiguous or otherwise hard to parse grammars. Usually one can't go too wrong with a hand-written recursive descent parser, and most general-purpose language are so complicated that parser generator can't really handle them. Anyway the really interesting parts of compiling happen in the backend.
Another alternative is basing the language on S-expressions, for which a parser is extremely simple to write.
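To show just how simple: a complete S-expression reader fits in a dozen or so lines of Python (an illustrative sketch, not any particular Lisp's reader):

```python
def parse_sexpr(src):
    """Parse one S-expression into nested Python lists of string atoms."""
    # Pad parentheses with spaces so split() does all the lexing.
    tokens = src.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        tok = tokens[pos]
        if tok == "(":
            items, pos = [], pos + 1
            while tokens[pos] != ")":
                item, pos = read(pos)
                items.append(item)
            return items, pos + 1  # skip the closing ')'
        return tok, pos + 1        # a bare atom

    tree, _ = read(0)
    return tree
```

With the front end this small, all the effort goes where this thread argues it belongs: the semantics and the backend.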
Compilers are perceived to be magical artifacts, carefully crafted by the wizards, and unfathomable by the mere mortals. Books on compilers are better described as wizard-talk: written by and for a clique of all-knowing practitioners. Real-life compilers are too complex to serve as an educational tool. And the gap between real-life compilers and the educational toy compilers is too wide. The novice compiler writer stands puzzled facing an impenetrable barrier, “better write an interpreter instead.”
The goal of this paper is to break that barrier. We show that building a compiler can be as easy as building an interpreter. The compiler we construct accepts a large subset of the Scheme programming language and produces assembly code for the Intel-x86 architecture, the dominant architecture of personal computing. The development of the compiler is broken into many small incremental steps. Every step yields a fully working compiler for a progressively expanding subset of Scheme. Every compiler step produces real assembly code that can be assembled then executed directly by the hardware. We assume that the reader is familiar with the basic computer architecture: its components and execution model. Detailed knowledge of the Intel-x86 architecture is not required.
The development of the compiler is described in detail in an extended tutorial. Supporting material for the tutorial such as an automated testing facility coupled with a comprehensive test suite are provided with the tutorial. It is our hope that current and future implementors of Scheme find in this paper the motivation for developing high-performance compilers and the means for achieving that goal.
Compiler writing has progressed a lot, notably with metacompilers [1] written in a few hundred lines of code, adaptive compilation [3], and just-in-time compilers.
Alan Kay's research group VPRi tackled the problems of complexity (in writing compilers) [4].
One nice piece of advice that I received is that books are like RAM: you do not have to go through them sequentially, but can randomly access the parts you need. With this in mind I find it doable to grab one of the thick books and only read the part that I need for my task.
But, to also be fair, the above random access method does not work when you don't know what you don't know. So I understand why having a light, but good introduction to the topic is important, and I believe that's what the author is pointing out.
I've seen people suggest that throughout the years, but it's never worked out for me. To get anything meaningful out of a printed book, I've had to read them cover to cover. There used to be worthwhile reference books, but those have moved on to the internet.
Most books have so many nonsense details that I can't help but skip most of them.
On the other hand technical books can be so overwhelmingly difficult that you need to go outside and do hours of learning to understand one tidbit of it
Awesome course! I finished it during my final CS year because I had to wait on a bunch of classes (and frankly had no one to talk to before classes). I haven't tried nanopass, but the other links work, so I'll give it a go.
Compilers are broad enough that when someone recommends a "compiler book", it's rarely exactly the slice you wanted.
So this made me do a runnable cheat sheet for Crafting Interpreters. I keep parsing demonstrative, and the AST is a little more Lisp-y than the book's.
Disclaimer: it's meant to convey the essence of what you'll learn, it is NOT by any means a replacement for the book. I'd also describe the book as more of an experience (including some things Nystrom clearly enjoyed, like the visitor pattern) than a compilers manual. If anyone's interested, I can do a separate visitor-pattern cheat sheet too, also in Python.
I turned it into a 'public-facing artifact' from private scripts with an AI agent.
This would be like asking for a book on designing grammar. It's just too disjoint a field to have any kind of reasonable baseline, and it's drop-dead easy to hack a basic one together. With those two things being equal, just like with grammar, the answer to this is any resource about implementing the language you're trying to ape.
It's drop-dead easy to hack a basic one together until you get to hairy stuff like overloading, lambdas, and generics.
The reasonable baseline would be something like Java 1. Scalars, arrays and classes. If I remember correctly, Lox even skips arrays as an exercise for the user.
The problem is that if overloading or lambdas count as "hairy", it follows that you're well outside the scope of a beginner. Generics are explicitly outside the scope of a beginner learning to implement a basic type system.
In the case that you're not a beginner, it's not true that no literature exists on type systems and their implementation. The mountain of literature is essentially inexhaustible. But if you're still wondering how to implement any type system at all, having never done it before, you really need to not worry so much about that. Kind of like how you don't even think about implementing a tracing JIT before you've written the world's shittiest tree-walking interpreter.
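To put "any type system at all" in perspective, a beginner-scope checker for a toy expression language is only a screenful. This is an illustrative Python sketch (node shapes and type names invented here) with none of the hairy stuff - no overloading, lambdas, or generics:

```python
def type_of(expr, env=None):
    """Type-check a toy expression language.
    Expressions: int literal, bool literal, a variable name (looked up
    in env), ("+", a, b), ("==", a, b), or ("if", cond, then, else)."""
    env = env or {}
    if isinstance(expr, bool):     # check bool before int: True is an int in Python
        return "bool"
    if isinstance(expr, int):
        return "int"
    if isinstance(expr, str):
        return env[expr]
    tag = expr[0]
    if tag == "+":
        if type_of(expr[1], env) == type_of(expr[2], env) == "int":
            return "int"
        raise TypeError("'+' expects two ints")
    if tag == "==":
        if type_of(expr[1], env) == type_of(expr[2], env):
            return "bool"
        raise TypeError("'==' expects operands of the same type")
    if tag == "if":
        if type_of(expr[1], env) != "bool":
            raise TypeError("'if' condition must be bool")
        t = type_of(expr[2], env)
        if t != type_of(expr[3], env):
            raise TypeError("'if' branches must have the same type")
        return t
    raise TypeError(f"unknown expression {expr!r}")
```

Everything past this point (inference, subtyping, generics) is where the inexhaustible literature begins.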
I have ignored all the stuff about parsing theory, parser generators, custom DSLs, formal grammars, etc., and instead have just been using the wonderful Megaparsec parser combinator library. I can easily follow the parsing logic; it's unambiguous (only one successful parse is possible, even if it might not be what you intended); it's easy to compose and re-use parser functions (this was particularly helpful for whitespace-sensitive parsing and line-fold handling); and it removes the tedious lexer/parser split you get with traditional parsing approaches.
I'll push back and say that the lexer/parser split is well worth it.
And the best thing about the parser combinator approach is that each is just a kind of parser, something like
type Lexer = ParsecT e ByteString m [Token]
type Parser = ParsecT e [Token] m Expr
All the usual helper functions like many or sepBy work equally well in the lexing and parsing phases.
It really beats getting to the parentheses-interacting-with-ordering-of-division-operations stage and still having to think "have I already trimmed off the whitespace here or not?"
I am writing a whitespace sensitive parser - trimming whitespace matters because whitespace consumption is used to implement the indentation rules/constraints.
For example, doing things like passing an indentation sensitive whitespace consumer to a parser inside `many` for consuming all of an indented child block. If I split lexing/parsing I think I'd have to do things like insert indentation tokens into the stream, and end up with the same indentation logic (but instead matching on those indentation tokens) in the parser regardless.
I have found that order-of-operations is somewhat trivially solved by `makeExprParser` from `Control.Monad.Combinators.Expr`.
It seems to me LL and LR parser generators are overrated, and hand-written recursive descent is best in practice. I understand why academics teach them, but not why some spend so long on different parsing techniques, nor why hobbyists who just want to compile their toy language are directed to them.
I work in PL, and from my first compiler to today, I have always found recursive descent easiest, most effective (fewer bugs, better error diagnostics, fast enough), and most intuitive. Many popular language compilers use recursive descent: I know of at least C# (Roslyn) and Rust, but I believe most do, except Haskell (GHC) and OCaml.
The LR algorithm was simple once I learned it, and yacc-like LR (and antlr-like LL) parser generators were straightforward once I learned how to resolve conflicts. But recursive descent (at least to me) is simpler and more straightforward.
LR being more expressive than LL has never mattered. A hand-written recursive descent parser is the most expressive: it has unlimited lookahead, and can modify parsed AST nodes (e.g. reordering for precedence, converting if into if-else).
The only solution that comes close is tree-sitter, because it implements GLR, provides helpful conflict messages, and provides basic IDE support (e.g. syntax highlighting) almost for free. But it’s a build dependency, while recursive descent parsers can be written in most languages with zero dependencies and minimal boilerplate.
Parser generators are great in Python (Lark for me) so you can iterate fast and get a runtime spec of your grammar.
A hand-written recursive descent parser is something you do later when you start to industrialize your code, to get better error messages, make the parser incremental, etc.
Bison/ANTLR are code generators; they do not fit well into that model.
> It seems to me LL and LR parser generators are overrated, and hand-written recursive descent is best in practice
I would now agree with that. My compiler experience was on a team that happened to have a LALR expert, Jeanne Musinski PhD, a student of Jeffrey Ullman. She invented a better error recovery for the language. Recursive descent would have been perfectly suited to the task.
> LR being more expressive than LL has never mattered.
Quite agree. One might guess that a language that needs that might be hard to program in.
What taught me how to write a compiler was the BYTE magazine 1978-08 .. 09 issues which had a listing for a Tiny Pascal compiler. Reading the listing was magical.
What taught me how to write an optimizer was a Stanford summer course taught by Ullman and Hennessy.
The code generator was my own concoction, and is apparently quite unlike any other one out there!
I have the Dragon Book, but have never actually read it. So sue me.
The article's framing around nanopass is undersold: the real insight isn't the number of passes but that each pass has an explicit input and output language, which forces you to think about what invariants hold at each stage. That discipline alone catches a surprising number of bugs before you even run the compiler. Crenshaw is fantastic, but this structural thinking is what separates toy compilers from ones you can actually extend later.
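A minimal illustration of that discipline, with node shapes invented for the example: two tiny passes, each with an explicitly stated input and output language, so the invariant "no 'neg' nodes survive pass 1" is checkable at a glance.

```python
# L0 (source):     ("num", n) | ("neg", e) | ("add", a, b)
# L1 (after pass 1): ("num", n) | ("sub", a, b) | ("add", a, b)
# L1 (after pass 2): same shapes, but constant subtrees are folded.

def desugar(expr):
    """Pass 1: L0 -> L1. Rewrites ('neg', e) as 0 - e; output has no 'neg'."""
    tag = expr[0]
    if tag == "num":
        return expr
    if tag == "neg":
        return ("sub", ("num", 0), desugar(expr[1]))
    return ("add", desugar(expr[1]), desugar(expr[2]))

def fold(expr):
    """Pass 2: L1 -> L1. Folds add/sub of two constants; never sees 'neg'."""
    tag = expr[0]
    if tag == "num":
        return expr
    a, b = fold(expr[1]), fold(expr[2])
    if a[0] == "num" and b[0] == "num":
        return ("num", a[1] + b[1] if tag == "add" else a[1] - b[1])
    return (tag, a, b)
```

Because `fold`'s input language provably excludes `neg`, it needs no case for it; in a real nanopass framework that contract is enforced mechanically rather than by comment.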
It's been about 4 years since I took a compilers course (in OMSCS, a graduate program) and I still shutter ... it was, hands down, the most difficult (yet rewarding) class I've taken.
Based on another reply I can’t tell if there’s a clever window-based pun that I’m missing. If not, I think you want “shudder” and not “shutter” here. I’m sorry if I just ruined the joke.
I think there are a million ways to make a compilers course.
The course I did was organized perfectly, with big parts of the compiler boilerplate already written; I only had to implement the parser/lexer rules and the translation of language structures into assembly instructions.
Also, it was a compiler for a language designed just for this course, with the intention that it be specifically easy to write a compiler for, not easy to program in.
Without this, I can imagine it being a painful experience.
I took that course too, and it ruined my life... by making me think writing a compiler could be fun. The course itself was worth the money I paid for the program.
I have fond memories of implementing an optimizing compiler for the CS241 compiler course offered back then by Prof. Michael Franz, who was a student of Niklaus Wirth; it was probably the most exhilarating course during my time at UC Irvine. This was in 2009, so my memory is vague, but I recall he provided a virtual machine for a simple architecture called DLX, and the compiler was to generate byte code for it.
See also, Andy Keep's dissertation [1] and his talk at Clojure/Conj 2013 [2].
I think that the nanopass architecture is especially well suited for compilers implemented by LLMs as they're excellent at performing small and well defined pieces of work. I'd love to see Anthropic try their C compiler experiment again but with a Nanopass framework to build on.
I've recently been looking in to adding Nanopass support to Langkit, which would allow for writing a Nanopass compiler in Ada, Java, Python, or a few other languages [3].
I learned from the Dragon Book, decades ago. I already knew a lot of programming at that point, but I think most people writing compilers do. I'm curious if there really is an audience of people whose first introduction to programming is writing a compiler... I would think not, actually.
I highly recommend nand2tetris to everyone. For me, nothing ever explained the whole domain from logic gates and inner workings of a CPU to compilers better than this course.
I think it's worth mentioning Gustavo Pezzi's lectures at pikuma.com. The one on "Digital Electronics" and the one on "Interpreters & Compilers" really helped me.
Maybe I'm missing the point of this article? Writing a simple compiler is not difficult. It's not something for beginners, but towards the end of a serious CS degree program it is absolutely doable. Parsing, transforming into some lower-level representation, even optimizations - it's all fun and really not that difficult. I still have my copy of the "Dragon Book", which is where I originally learned about this stuff.
In fact, inventing new programming languages and writing compilers for them used to be so much of a trend that people created YACC (Yet Another Compiler Compiler) to make it easier.
The biggest issue with technical books is that they spend the first 1-2 chapters vaguely describing some area and then follow up with "but that's for a later, more advanced discussion" or "we'll cover that in the last 1-2 chapters". Don't vaguely tell me about something you're not going to go into detail about, because now all I'm thinking about while reading the subsequent chapters is all the questions I have about that topic.
That's how all education works. It's the spiral model of teaching. In one grade you learn a bit of this and a bit of that, then the next year you retread, and flesh all those things out by adding more depth and complexity. Rinse and repeat every grade.
What would the alternative look like? Should a foreign language course spend three years on Nouns, just to make sure they're comprehensively covered, before you ever see your first Verb?
A similarly scoped book series is "AI Game Programming Wisdom", which contains a multitude of chapters focused on diverse, individual algorithms that can be practically used in games for a variety of use cases.
I might be in the minority, but I think the best way to learn how to write a compiler is to try writing one without books or tutorials. Keep it very small in scope at first, small enough that you can scrap the entire implementation and rewrite in an afternoon or less.
You'd more likely want to emit LLVM IR rather than try to match clang's internal AST. That's essentially what most new language projects do now (Rust, Swift, Zig all use LLVM as their backend). You get optimization passes and codegen for multiple architectures for free, and the IR is well-documented. The tradeoff is you skip learning about the backend, which is arguably the most interesting part.
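At its smallest, "emitting LLVM IR" can just mean walking an expression tree and printing textual IR. A hypothetical Python sketch with invented node shapes (real projects use the LLVM APIs or bindings such as llvmlite rather than string pasting):

```python
def emit_expr(expr, lines, counter):
    """Lower ("num", n) / ("add", a, b) nodes to textual LLVM IR,
    returning the SSA register (or constant) holding the result."""
    if expr[0] == "num":
        return str(expr[1])
    lhs = emit_expr(expr[1], lines, counter)
    rhs = emit_expr(expr[2], lines, counter)
    reg = f"%t{counter[0]}"        # fresh SSA register per instruction
    counter[0] += 1
    lines.append(f"  {reg} = add i32 {lhs}, {rhs}")
    return reg

def compile_to_ir(expr):
    """Wrap the lowered expression in a main-like function returning i32."""
    lines, counter = [], [0]
    result = emit_expr(expr, lines, counter)
    body = "\n".join(lines)
    return f"define i32 @main() {{\n{body}\n  ret i32 {result}\n}}\n"
```

The output can be fed to `clang` or `lli`; LLVM's own passes then handle the optimization and machine-code generation the comment describes getting for free.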
fanf2 on Dec 25, 2015:
I quite like "understanding and writing compilers" by Richard Bornat - written in the 1970s using BCPL as the implementation language, so rather old-fashioned, but it gives a friendly gentle overview of how to do it, without excessive quantities of parsing theory.
I've been having a look at the Crenshaw series, and it seems pretty good, but one thing that kind of annoys me is the baked-in line wrapping. Is there a way to unwrap the text so it's not all in a small area on the left of my screen?
https://t3x.org has literal books on that, from a simple C compiler to Scheme (you might have heard of s9) and T3X0 itself, which can run under Unix, Windows, DOS, CP/M, and whatnot.
PS: Klong's intro to statistics - even if the compiler looks like a joke, it isn't. It can be damn useful, and far easier than Excel. And it comes with a command to output a PS file with your chart embedded.
On S9: well, it has Unix, curses, sockets, and similar support with an easy API, so it's damn easy to write something if you know Scheme/ncurses, and to try stuff in seconds. You can complete the "Concrete Abstractions" book with it, and just adapt the graphics functions to create the (frame) one for SICP (and a few more).
And as we are doing compilers... with SICP you create from some simulator to some Scheme interpreter in itself.
These days there's an even easier way to learn to write a compiler. Just ask Claude to write a simple compiler. Here's a simple C compiler (under 1500 lines) written by Claude: https://github.com/Rajeev-K/c-compiler It can compile and run C programs for sorting and searching. The code is very readable and very easy to understand.
Because an actual compiler would be tens of thousands of lines and most of it is going to be perf optimization. If you want to get the big picture first, read a simple working compiler that has all the key parts, such as a lexer, abstract syntax tree, parser, code generator and so on.
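To make those "key parts" concrete, here is a hedged sketch (mine, not code from the linked repo) of the smallest possible pipeline in Python: a lexer, a recursive descent parser producing an AST, and a code generator targeting a made-up stack machine.

```python
# Hedged sketch (not the linked repo's code): the smallest pipeline with all
# the named parts: lexer -> parser -> AST -> code generator -> tiny stack VM.
import re

def lex(src):
    # Tokenize integers and the operators + * ( )
    return re.findall(r"\d+|[()+*]", src)

def parse(tokens):
    # Recursive descent; '*' binds tighter than '+'
    pos = [0]
    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None
    def eat():
        pos[0] += 1
        return tokens[pos[0] - 1]
    def atom():
        if peek() == "(":
            eat()
            node = expr()
            eat()  # closing ')'
            return node
        return ("num", int(eat()))
    def term():
        node = atom()
        while peek() == "*":
            eat()
            node = ("mul", node, atom())
        return node
    def expr():
        node = term()
        while peek() == "+":
            eat()
            node = ("add", node, term())
        return node
    return expr()

def codegen(ast):
    # Postfix instructions for a stack machine
    if ast[0] == "num":
        return [("push", ast[1])]
    return codegen(ast[1]) + codegen(ast[2]) + [(ast[0],)]

def run(code):
    stack = []
    for ins in code:
        if ins[0] == "push":
            stack.append(ins[1])
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if ins[0] == "add" else a * b)
    return stack[-1]

print(run(codegen(parse(lex("2+3*(4+1)")))))  # 17
```

Everything else in a real compiler (types, optimization, register allocation) hangs off this skeleton.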
For those of us who learn better by taking something and tinkering with it, this is definitely the better approach.
I've never been a good book learner, but I love taking things apart and tinkering with them to learn. A small toy compiler is way better than any book, and it's not like the LLM didn't absorb the books anyway during training.
Exactly! Writing a compiler is not rocket science if you know assembly language. You can pick up the gist in an hour or two by looking at a simple toy compiler.
I did not and will not run this on my computer but it looks like while loops are totally broken; note how poor the test coverage is. This is just my quick skimming of the code. Maybe it works perfectly and I am dumber than a computer.
Regardless, it is incredibly reckless to ask Claude to generate assembly if you don't understand assembly, and it's irresponsible to recommend this as advice for newbies. They will not be able to scan the source code for red flags like us pros. Nor will they think "this C compiler is totally untrustworthy, I should test it on a VM."
Are you concerned that the compiler might generate code that takes over your computer? If so the provided Dockerfile runs the generated code in a container.
Regarding test coverage, this is a toy compiler. Don't use it to compile production code! Regarding while loops and such, again, this is a simple compiler intended only to compile sort and search functions written in C.
No, the problem is much more basic than "taking over your computer," it looks like the compiler generates incorrect assembly. Upon visual inspection I found a huge class of infinite loops, but I am sure there are subtle bugs that can corrupt running user/OS processes... including Docker, potentially. Containerization does not protect you from sloppy native code.
> Don't use it to compile production code!
This is an understatement. A more useful warning would be "don't use it to compile any code with a while loop." Seriously, this compiler looks terrible. Worse than useless.
If you really want AI to make a toy compiler just to help you learn, use Python or Javascript as a compilation target, so that the LLM's dumb bugs are mostly contained, and much easier to understand. Learn assembly programming separately.
You have not provided any evidence that can be refuted, only vague assertions.
The compiler is indeed useless for any purpose other than learning how compilers work. It has all the key pieces such as a lexer, abstract syntax tree, parser, code generator, and it is easy to understand.
If the general approach taken by the compiler is wrong then I would agree it is useless even for learning. But you are not making that claim, only claiming to have found some bugs.
The thing that is obviously and indisputably wrong, terrible for learners, is the test cases. They are woefully insufficient, and will not find those infinite loops I discovered upon reading the code. The poor test coverage means you should assume I am correct about the LLM being wrong! It is rude and insulting to demand I provide evidence that some lazy vibe-coded junk is in fact bad software. You should be demanding evidence that the project's README is accurate. The repo provides none.
The code quality is of course unacceptably terrible but there is no point in reviewing 1500 lines of LLM output. A starting point: learners will get nothing out of this without better comments. I understand what's going on since this is all Compilers 101. But considering it's a) stingily commented and b) incorrect, this project is 100% useless for learners. It's indefensible AI slop.
Sorry I disagree. I have written compilers by hand and this compiler generated by Claude is pretty good for learning.
I am only asking you to back up your own assertions. If you can't, then I would have to assume that you are denigrating AI because you are threatened by it.
*Donald Knute -> Donald Ervin Knuth is the author of the book "The Art of Computer Programming" (in progress for a couple of decades; currently volume 4C is being written). It is quite advanced, and it will likely not cover compilers anymore (Addison-Wesley had commissioned a compiler book from Knuth when he was a doctoral candidate; now he is retired and has stated that his goal for the series has changed).
I disagree with the author's point: Chapter 2 of the "Dragon book" ("Compilers: Principles, Techniques, and Tools" by Aho et al.) is a self-contained, end-to-end introduction to compilers, and it can be read on its own, ignoring the rest of that excellent book.
Another fantastic intro to compiler writing is the short little book "Compilers" by Niklaus Wirth, which explains, and contains the surprisingly short source code of, a complete compiler (the whole book is highly understandable; pristine clarity, really), all in under 100 pages (99, in fact).
(I learned enough from these two sources to write a compiler in high school.)
> "Compilers" by Niklaus Wirth
This one? https://people.inf.ethz.ch/wirth/CompilerConstruction/Compil...
I found the same file, but that's only 44 pages long?
This ( https://github.com/tpn/pdfs/blob/master/Compiler%20Construct... ) seems to be a previous version (2005) and it's 131 pages long
You'll want part 2 as well for a total of 107 pages.
There is still hope for a compiler book. From Knuth's website:
> And after Volumes 1--5 are done, God willing, I plan to publish Volume 6 (the theory of context-free languages) and Volume 7 (Compiler techniques), but only if the things I want to say about those topics are still relevant and still haven't been said.
https://www-cs-faculty.stanford.edu/~knuth/taocp.html
I hope that God is indeed willing, but the man is 88 years old and he’s not done with the third tome of volume four. It would require a minor miracle for him to finish volume 7 within this lifetime.
I don't think there is hope if you look at actuarial tables and Knuth's age. It's not clear to me if he'll be able to finish volume 4. The outline he has seems to have enough material to fill volumes 4C-4G to my eyes, and he isn't exactly cranking out the volumes.
Admittedly, volumes 5-7 wouldn't be as massive as volume 4 (it sort of turns out that almost all interesting algorithms end up being categorized as belonging to volume 4), so you probably wouldn't have a half-dozen subvolumes per topic, but it's still too many books down the line, especially if he plans to revise volumes 1-3 before working on anything else.
Have no fear, we'll just train an LLM on TAOCP and have it automatically generate the remaining volumes‽
I really hope he ends up completing the whole series. I started volume one recently and it is excellent
The dragon book almost convinced me never to try to write a compiler. I don't know why people recommend it. I guess you're a lot smarter than I am.
There are some excellent books out there. In its own way, the dragon book is excellent, but it is a terrible starting place.
Here are a bunch of references from the same vintage as OP. I recommend starting with a book that actually walks through the process of building a compiler and doesn't spend its time exclusively with theory.
https://news.ycombinator.com/item?id=136875
I started with the dragon book, and I found it to be a good introductory text.
A lot of people say the dragon book is difficult, so I suppose there must be something there. But I don't see what it is, I thought it was quite accessible.
I'm curious, what parts/aspects of the dragon book make it difficult to start with?
It's been a few years since I worked with the dragon book, but I think the most common complaint was that it starts with like 350 pages on parser theory: generating bottom-up and top-down parsers from context free grammars, optimizing lexers for systems that don't have enough RAM to store an entire source file, etc... before ever getting to what most people who want to write a compiler care about (implementing type inference, optimizing intermediate representations, generating assembly code). Of course parsing is important, and very interesting to some. But there's a reason most modern resources skip over all of that and just make the reader write a recursive descent parser.
I guess "back in the day" you had to be able to write an efficient parser, as no parser generators existed. If you couldn't implement whatever you wanted due to memory shortage at the parser level, then obviously it's gonna be a huge topic. Even now I believe it is good to know about this - if only to avoid pitfalls in your own grammar.
I repeatedly skip parts that are not important to me when reading books like this. I grabbed a book about embedded design and skipped about half of it, which was bus protocols, as I knew I wouldn't need it. There is no need to read the dragon book from front to back.
Unless the reason is explicitly stated, there is no way to verify it's any good. There's a reason people use AI to do their homework; it just doesn't mean it's a good one. I can think of plenty of arguments for why you wouldn't look into the pros and cons of different parsing strategies in an introduction to compilers; "everyone is (or isn't) doing it" is not among them. In the end, it has to be written down somewhere, and if no other book is doing it for whatever reason, then the dragon book it shall be. You can always recommend skipping that part if someone asks which book to use.
I actually think the parsing part is more important for laymen. Like, there may be a total of 10K programmers who are interested in learning compiler theories, but maybe 100 of them are ever going to write the backend -- the rest of them are stuck with either toy languages, or use parsing to help with their job. Parsing is definitely more useful for most of us who are not smart enough :D
Yeah I agree, that seems very true. Although the average person probably also benefits more from learning about recursive descent and Pratt parsing than LL(k) parser generators, automata, and finding first and follow sets :)
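For the curious, "finding first sets" boils down to a small fixed-point computation. Here's a sketch in Python, simplified by assuming no nullable (epsilon) rules; the grammar below is invented for illustration.

```python
# Sketch of computing FIRST sets as a fixed point, assuming no nullable
# (epsilon) rules to keep it short. The grammar is invented for illustration.
grammar = {
    "E": [["T", "+", "E"], ["T"]],    # E -> T + E | T
    "T": [["int"], ["(", "E", ")"]],  # T -> int | ( E )
}
terminals = {"int", "+", "(", ")"}

def first_sets(grammar, terminals):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:  # iterate until no set grows
        changed = False
        for nt, productions in grammar.items():
            for prod in productions:
                head = prod[0]  # with no nullable rules, only the first symbol matters
                add = {head} if head in terminals else first[head]
                if not add <= first[nt]:
                    first[nt] |= add
                    changed = True
    return first

print(first_sets(grammar, terminals))  # both E and T start with 'int' or '('
```

With nullable rules and FOLLOW sets the bookkeeping grows, which is roughly where the textbook chapters balloon.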
The thing about parsing (and algorithms in general) is that it can be hair-raisingly complex for arbitrary grammars, but in practice people have discovered that designing simple, unambiguous grammars and avoiding problems like context-dependent parsing makes the parsing problem trivial.
Accepting such constraints is quite practical, and leads to little to no loss of power.
In fact, most modern languages are designed with little to no necessary backtracking and simple parsing, Go and Rust being noteworthy examples.
> In fact, most modern languages are designed with little to no necessary backtracking and simple parsing, Go and Rust being noteworthy examples.
But to understand how to generate grammars for languages that are easy to parse, you have in my opinion to dive quite deeply into parsing theory to understand which subtle aspects make parsing complicated.
My personal context, so you understand where I'm coming from: I'm working on my own language, a curly-brace C-style language where I didn't try to stray too far from established norms, and the syntax is not that fancy (at least not in that regard). I also want my language to look familiar to most programmers, so I'm deliberately sticking close to convention.
I'm thankfully past the parsing stage, and so far I haven't really encountered many issues with ambiguity; when I did, I was able to fix them.
Also, in certain cases I'm quite liberal in allowing the omission of parentheses and other such control tokens, which I know leads to some cases where either the code is ambiguous (as in, there's no strictly defined way the compiler is supposed to interpret it) or valid code fails to parse.
So far I have not tackled this issue, as it can always be fixed by the programmer manually adding back those parens, for example. I know this is not up to professional standards, but I like the cleanliness of the syntax and the simplicity of the compiler, and the issue is always fixable for me later. So this is a firm TODO for me.
Additionally, I have some features planned that would crowd the syntax space in a way that I think would probably need some academic chops to fix, but I'm kinda holding off on those, as they are not central to the main gimmick™ and I want to release this thing in a reasonable timeframe.
I don't really have much of a formal education in this, other than reading a few tutorials and looking through a few implementations.
Btw, besides just parsing, there are other concerns in modern languages, such as IDE support, files should be parseable independently etc., error recovery, readable errors, autocomplete hints, which I'm not sure are addressed in depth in the dragon book. These features I do want.
My two cents is that for a simple modern language, you can get quite far with zero semantic model, while with stuff like C++ (with macros), my brain would boil at the thought of having to write a decent IDE backend.
The dragon book is how to write a production-grade thing, I guess. It elaborates on all the interesting concepts, which is great, but it quickly dives into things that can clutter a project if it's just for fun.
It’s academic and comprehensive, that’s the issue. It’s not about writing a production-grade compiler, though, in my humble opinion. There are more things to learn for that, unfortunately… it’s just a pretty big topic with lots of stuff to learn.
the dragon book is all i have on the topic. it was a big investment for me.
it taught me to think very differently but i am sure i am still not ready to write a compiler :D
When I was writing a compiler professionally (see https://ciex-software.com/intro-to-compilers.html) the Dragon book was the second book that I read. I found it very helpful. That was the first Dragon book. I got the second one later. It would have been fine to start with the Dragon book--the Compiler Generator book was a harder study.
You're not the only one. In college I took a compilers course and we used the dragon book, to me it sucked the joy out of the magical concept of making a compiler.
Some years later I (re-) discovered Forth, and I thought "why not?" and built my own forth in 32-bit Intel assembly, _that_ brought back the wonder and "magical" feeling of compilers again. All in less than 4KB.
I guess I wasn't the right audience for the dragon book.
Imho the problem is the fixation on parser generators and BNF. It's just a lot easier to write a recursive descent parser than to figure out the correct BNF for anything other than a toy language with horrible syntax.
The problem with recursive descent parsers is that they don't restrict you to simple grammars.
But then, pushing regular languages theory into the curriculum, just to rush over it so you can use them for parsing is way worse.
> But then, pushing regular languages theory into the curriculum, just to rush over it so you can use them for parsing is way worse.
At least in the typical curriculum of German universities, the students already know the whole theory of regular languages from their Theoretical Computer Science lectures quite well, thus in a compiler lecture, the lecturer can indeed rush over this topic because it is just a repetition.
(Same for US universities at least 30ish years ago)
I would argue the opposite: Being describable in BNF is exactly the hallmark of sensible syntax in a language, and of a language easily amenable to recursive descent parsing. Wirth routinely published (E)BNF for the languages he designed.
Imo BNF (or some other formal notation) is quite useful for defining your syntax, my biggest gripe with BNF in particular is the way it handles operator precedence (through nested recursive expressions), which can get messy quite fast.
Pratt parsers don't even use this recursion; they only have a concept of 'binding strength'. In layman's terms: if I'm parsing the left side of, say, a '*' expression, and I've managed to parse a binary subexpression, and the next token I'm looking at is another binary operator, do I continue parsing that subexpression (which will be the RHS of the '*' expression), or do I finish my original expression (which will then be the LHS of the new one)?
It represents this through the concept of stickiness, with one simple rule: the subexpression always sticks to the operator that's more sticky.
This is both quite easy to imagine, and easy to encode, as stickiness is just a number.
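The stickiness idea really is just a number plus a loop. A hypothetical Python rendering (nobody's production parser, just the shape of the technique):

```python
# Hypothetical sketch of the "stickiness" rule: each operator gets a numeric
# binding power; a subexpression sticks to whichever neighbor is stickier.
BINDING = {"+": 10, "-": 10, "*": 20, "/": 20}

def parse_expr(tokens, pos=0, min_bp=0):
    # tokens: a flat list like ["1", "+", "2", "*", "3"]; returns (ast, next_pos)
    lhs, pos = tokens[pos], pos + 1
    while pos < len(tokens):
        op = tokens[pos]
        bp = BINDING.get(op, 0)
        if bp <= min_bp:
            break  # the next operator is less sticky: finish this expression
        rhs, pos = parse_expr(tokens, pos + 1, bp)
        lhs = (op, lhs, rhs)
    return lhs, pos

ast, _ = parse_expr("1 + 2 * 3".split())
print(ast)  # ('+', '1', ('*', '2', '3'))
```

The `bp <= min_bp` comparison is the whole precedence-and-associativity story: using `<` instead would make the operators right-associative.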
I think a simpler, more straightforward notation that incorporates precedence would be better.
> I think a simpler, more straightforward notation that incorporates precedence would be better.
For example YACC provides a way to specify how a shift/reduce conflict
> https://www.gnu.org/software/bison/manual/html_node/Shift_00...
and a reduce/reduce conflict
> https://www.gnu.org/software/bison/manual/html_node/Reduce_0...
should be resolved; see also
> https://www.gnu.org/software/bison/manual/html_node/Mysterio...
This is actually a way to specify operator precedence in a much simpler and more straightforward way than via nested recursive expressions.
Great thread. If you have 1 hour to get started, I recommend opening Engineering a Compiler and studying Static Single-Assignment (SSA) from ch 9.3.
The book is famous for its SSA treatment. Chapters 1-8 are not required to understand SSA. This allows you to walk away with a clear win. Refer to 9.2 if you're struggling with dominance + liveness.
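To see why SSA is pleasant to work with, here is a toy sketch (mine, not from the book) of renaming straight-line code into SSA form. Real compilers also need phi nodes at control-flow joins, which this deliberately skips.

```python
# Toy SSA renaming for straight-line code (no branches, hence no phi nodes).
# Each assignment gets a fresh version; each use refers to the latest version.
def to_ssa(instrs):
    # instrs: list of (dest, op, arg1, arg2) tuples over variable names
    version = {}
    def use(v):
        return f"{v}{version[v]}" if v in version else v
    out = []
    for dest, op, a, b in instrs:
        a2, b2 = use(a), use(b)          # rename uses before defining dest
        version[dest] = version.get(dest, 0) + 1
        out.append((f"{dest}{version[dest]}", op, a2, b2))
    return out

code = [("x", "add", "a", "b"),
        ("x", "mul", "x", "c"),   # reads the first x, defines a second
        ("y", "add", "x", "x")]
print(to_ssa(code))
```

After renaming, every variable has exactly one definition, which is what makes dataflow facts like liveness and dominance so much easier to compute.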
http://www.r-5.org/files/books/computers/compilers/writing/K...
I bought this book when I was working on a toy language and I think I was too stupid to understand most of it. The first few chapters were great, but it quickly surpassed my capacity to understand. Seeing it mentioned makes me want to revisit.
It was a product of its time, I guess. Much better ones from a similar vintage:
The Tiger book (with C, Standard ML, and Java variants)
https://www.cs.princeton.edu/~appel/modern/
Compiler Design in C (freely available nowadays, beware this is between K&R C and C89)
https://holub.com/compiler/
lcc, A Retargetable Compiler for ANSI C
https://drh.github.io/lcc/
Or if one wants to go with more clever stuff,
Compiling with Continuations
Lisp in Small Pieces
Another vote for Lisp in Small Pieces. Great high level compiler book that teaches you how to build a Lisp and doesn’t get bogged down in lexing and parsing.
Instead of Lisp in Small Pieces I'd recommend SICP instead. No continuation passing, but much better written.
That was the point. That's why it's not a cute beaver on the cover :)
The "Dragon Book" is big on parsing but I wouldn't recommend it if you want to make many optimisation passes or a back-end.
The first edition was my first CS textbook back in the '90s, and as a young programmer I learned a lot from it. A couple of years ago, however, I started on a modern compiler back-end and found that I needed to update my knowledge quite a lot.
The 2nd ed covers data-flow analysis, which is very important. However, modern compilers (GCC, LLVM, Cranelift, ...) are built around an intermediate representation in Static Single Assignment-form. The 2nd ed. has only a single page about SSA and you'd need to also learn a lot of theory about its properties to actually use it properly.
Parsing is the front end to a compiler. Can't get semantics without first recognizing syntax. I have a hard time thinking about programming languages without seeing them as a parsing exercise first, every time.
The recommended advice is to start with semantics first. Syntax will change, there is not much point fixing it down too early.
Most of the work is actually the backend, and people sort of delude themselves into "creating a language" just because they have an AST.
Syntax and semantics are never orthogonal, and you always need syntax, so it must be considered from the start. Any reasonable syntax will quickly become much more pleasant for generating an AST or IR than, say, manually building those objects in the host language of the compiler, which is what the semantics-first crowd seems to propose.
It is also only the case for some compilers that most of the work is in the backend, though of course all of this depends on how "backend" is defined. Is the backend just codegen, or is it all of the analysis between parsing and codegen? If you target a high-level language, which is very appropriate for one's first few compilers, the backend can be quite simple. At the simplest, no AST is even necessary and the compiler can just mechanically translate one syntax into another in a single pass.
I think his point is that "form follows function". If you know what kind of semantics you're going to have, you can use that to construct a syntax that lends itself to using it properly.
> The recommended advice is to start with semantics first. Syntax will change, there is not much point fixing it down too early.
It's actually the reverse, in my opinion. Semantics can change much more easily than syntax. You can see this in that small changes in syntax can cause massive changes in a recursive-descent parser while the semantics can change from pass-by-reference to pass-by-value and make it barely budge.
There is a reason practically every modern language has adopted syntax sigils (Zig, for example).
This allows the identification of the various parts and types without referencing or compiling the universe. That's super important, and something that must be baked into the syntax at the start, or there is nothing you can do about it.
Getting an overview of parsing theory is mainly useful to avoid making ambiguous or otherwise hard-to-parse grammars. Usually one can't go too wrong with a hand-written recursive descent parser, and most general-purpose languages are so complicated that parser generators can't really handle them. Anyway, the really interesting parts of compiling happen in the backend.
Another alternative is basing the language on S-expressions, for which a parser is extremely simple to write.
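As a rough illustration of how little work S-expression parsing takes, a hypothetical reader in Python: tokenize, then one recursive function handles all nesting.

```python
# A minimal S-expression reader (illustrative sketch). Padding the parens with
# spaces makes tokenizing a one-liner; one recursive function does the rest.
def parse_sexp(src):
    tokens = src.replace("(", " ( ").replace(")", " ) ").split()
    def read(pos):
        if tokens[pos] == "(":
            items, pos = [], pos + 1
            while tokens[pos] != ")":
                item, pos = read(pos)
                items.append(item)
            return items, pos + 1  # skip the closing ')'
        return tokens[pos], pos + 1
    sexp, _ = read(0)
    return sexp

print(parse_sexp("(define (square x) (* x x))"))
# ['define', ['square', 'x'], ['*', 'x', 'x']]
```

The output nested lists are already the AST, which is exactly why Lisp-family languages let you skip the whole parsing chapter.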
I'd never seen Knuth's middle name until your comment. I think it safely could be left out of an article.
An Incremental Approach to Compiler Construction
Abdulaziz Ghuloum
http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf
Abstract
Compilers are perceived to be magical artifacts, carefully crafted by the wizards, and unfathomable by the mere mortals. Books on compilers are better described as wizard-talk: written by and for a clique of all-knowing practitioners. Real-life compilers are too complex to serve as an educational tool. And the gap between real-life compilers and the educational toy compilers is too wide. The novice compiler writer stands puzzled facing an impenetrable barrier, “better write an interpreter instead.”
The goal of this paper is to break that barrier. We show that building a compiler can be as easy as building an interpreter. The compiler we construct accepts a large subset of the Scheme programming language and produces assembly code for the Intel-x86 architecture, the dominant architecture of personal computing. The development of the compiler is broken into many small incremental steps. Every step yields a fully working compiler for a progressively expanding subset of Scheme. Every compiler step produces real assembly code that can be assembled then executed directly by the hardware. We assume that the reader is familiar with the basic computer architecture: its components and execution model. Detailed knowledge of the Intel-x86 architecture is not required.
The development of the compiler is described in detail in an extended tutorial. Supporting material for the tutorial such as an automated testing facility coupled with a comprehensive test suite are provided with the tutorial. It is our hope that current and future implementors of Scheme find in this paper the motivation for developing high-performance compilers and the means for achieving that goal.
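The paper's first incremental step famously compiles nothing but an integer literal. A hedged approximation of what that step looks like (the label name and instruction choice here are assumptions, not Ghuloum's exact code):

```python
# Sketch of an "incremental step 1" compiler in the spirit of the paper:
# the entire input language is a single integer literal, and the output is
# x86 assembly that returns it. (Label and instruction choices are assumed.)
def compile_program(n):
    return "\n".join([
        "    .globl scheme_entry",
        "scheme_entry:",
        f"    movl ${n}, %eax",   # the return value goes in %eax
        "    ret",
    ])

print(compile_program(42))
```

Each later step (immediate constants, unary primitives, let, conditionals, ...) extends this function while keeping every intermediate compiler fully working, which is the whole pedagogical trick.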
Inspired by Ghuloum's book is this really nice book by Nora Sandler: Writing a C Compiler https://norasandler.com/book/
Another recent book inspired by Ghuloum is Essentials of Compilation: An Incremental Approach (which publishes Python and Racket versions) https://github.com/IUCompilerCourse/Essentials-of-Compilatio...
Nada Amin has a nice implementation of Aziz's approach, with tests:
https://github.com/namin/inc
One of the greatest papers in computer science. So dense in its 11 pages, yet very approachable.
Compiler writing has progressed a lot, notably metacompilers [1] written in a few hundred lines of code, adaptive compilation [3], and just-in-time compilers. Alan Kay's research group VPRI tackled the problems of complexity (in writing compilers) [4].
[1] Ometa https://tinlizzie.org/VPRIPapers/tr2007003_ometa.pdf
[2] Other ometa papers https://tinlizzie.org/IA/index.php/Papers_from_Viewpoints_Re...
[3] Adaptive compilation https://youtu.be/CfYnzVxdwZE?t=4575
the PhD thesis https://www.researchgate.net/publication/309254446_Adaptive_...
[4] Is it really "Complex"? Or did we just make it "Complicated"? Alan Kay https://youtu.be/ubaX1Smg6pY?t=3605
One nice piece of advice I received is that books are like RAM: you do not have to go through them sequentially; you can do random access to the parts you need. With this in mind, I find it doable to grab one of the thick books and only read the part that I need for my task.
But, to be fair, the random-access method does not work when you don't know what you don't know. So I understand why having a light but good introduction to the topic is important, and I believe that's what the author is pointing out.
I've seen people suggest that throughout the years, but it's never worked out for me. To get anything meaningful out of a printed book, I've had to read them cover to cover. There used to be worthwhile reference books, but those have moved on to the internet.
I like doing both. Skimming through the interesting parts first, them re-reading from start sequentially.
A significant fraction of my technical library is used just this way--as a reference, checking out the parts to answer a specific question.
Most books have so many pointless details that I can't help but skip most of them.
On the other hand, technical books can be so overwhelmingly difficult that you need to step away and do hours of learning to understand one tidbit.
Nowadays I’ve heard recommended Crafting Interpreters. (https://craftinginterpreters.com)
The Nanopass paper link doesn’t work.
Awesome course! I finished it during my final CS year because I had to wait on a bunch of classes (and frankly had no one to talk to before classes). I haven't tried nanopass, but the other links work, so I'll give it a go.
Incredible book for self guided learning!
Compilers are broad enough that when someone recommends a "compiler book", it's rarely exactly the slice you wanted.
So this made me make a runnable cheat sheet for Crafting Interpreters. I keep the parsing demonstrative, and the AST is a little more Lisp-y than the book's.
Disclaimer: it's meant to convey the essence of what you'll learn, it is NOT by any means a replacement for the book. I'd also describe the book as more of an experience (including some things Nystrom clearly enjoyed, like the visitor pattern) than a compilers manual. If anyone's interested, I can do a separate visitor-pattern cheat sheet too, also in Python.
I turned it into a 'public-facing artifact' from private scripts with an AI agent.
[0] https://ouatu.ro/blog/crafting-interpreters-cheat-sheet/
Crafting Interpreters is great, I wish it had a companion book that covered:
Then the two of them together would be sufficient for writing a compiler.
To your last point, "Linkers and Loaders" has no equal despite being a bit dated
> types and typing
This would be like asking for a book on designing grammar. It's just too disjoint a field to have any kind of reasonable baseline, and it's drop-dead easy to throw a basic one together. With those two things being equal, just like with grammar, the answer is any resource about implementing the language you're trying to ape.
It's drop-dead easy to throw a basic one together until you get to hairy stuff like overloading, lambdas and generics.
The reasonable baseline would be something like Java 1. Scalars, arrays and classes. If I remember correctly, Lox even skips arrays as an exercise for the user.
The problem is, in the case of overloading or lambdas being "hairy", it follows you're well outside the scope of a beginner. Generics are explicitly outside the scope of a beginner learning to implement a basic type system.
In the case you're not a beginner, it's not true that no literature exists on type systems and their implementation; the mountain of literature is essentially inexhaustible. But if you're still wondering how to implement any type system at all, having never done it before, you really need to not worry so much about that. Kind of like how you don't even think about implementing a tracing JIT before you've written the world's shittiest tree-walking interpreter.
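For a sense of scale, the "basic one" described above, a hypothetical checker for a toy expression AST with just ints and bools (no overloading, no generics), fits in a screenful:

```python
# Hypothetical minimal type checker for a toy expression AST: scalars only.
# AST nodes are tuples like ("int", 1), ("var", "x"), ("add", e1, e2),
# ("if", cond, then, else). env maps variable names to type names.
def typecheck(expr, env):
    kind = expr[0]
    if kind == "int":
        return "int"
    if kind == "bool":
        return "bool"
    if kind == "var":
        return env[expr[1]]
    if kind == "add":
        if typecheck(expr[1], env) == typecheck(expr[2], env) == "int":
            return "int"
        raise TypeError("operands of + must be int")
    if kind == "if":
        if typecheck(expr[1], env) != "bool":
            raise TypeError("condition must be bool")
        t = typecheck(expr[2], env)
        if t != typecheck(expr[3], env):
            raise TypeError("branches must agree")
        return t
    raise TypeError(f"unknown node {kind}")

print(typecheck(("add", ("var", "x"), ("int", 1)), {"x": "int"}))  # int
```

The hairy stuff (overloading, inference, generics) is exactly what does not fit in this shape, which is where the real literature starts.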
> types and typing
Types and Programming Languages, Benjamin C Pierce
> object files, executables, libraries and linking
Linkers and Loaders, John R Levine
I've read Pierce. It's not a bad book, but less grounded than CI, which has an explicit "workmanlike" approach.
It was saved: https://github.com/asalber/books/blob/master/A%20Nanopass%20...
Many languages like Pancake Stack are looking for efficient interpreters:
https://esolangs.org/wiki/Pancake_Stack
:)
Been working on a toy compiler for fun recently.
I have ignored all the stuff about parsing theory, parser generators, custom DSLs, formal grammars etc. and instead have just been using the wonderful Megaparsec parser combinator library. I can easily follow the parsing logic, it's unambiguous (only one successful parse is possible, even if it might not be what you intended), it's easy to compose and re-use parser functions (particularly helpful for whitespace-sensitive parsing/line-fold handling), and it removes the tedious lexer/parser split you get with traditional parsing approaches.
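For readers who don't know Haskell, the combinator idea translates to any language. A deliberately tiny Python analogue (nothing like Megaparsec's real API, just the shape of it): a parser is a function from input to a (value, rest-of-input) pair, or None on failure, and combinators build bigger parsers from smaller ones.

```python
# Toy parser-combinator core (illustrative only; Megaparsec's real API is
# Haskell and far richer). A parser: input -> (value, rest) or None.
def char(c):
    return lambda s: (c, s[1:]) if s[:1] == c else None

digit = lambda s: (s[0], s[1:]) if s[:1].isdigit() else None

def many(p):  # zero-or-more, analogous to Megaparsec's `many`
    def run(s):
        out = []
        while (r := p(s)) is not None:
            v, s = r
            out.append(v)
        return out, s
    return run

def seq(*ps):  # run parsers in order, collecting their results
    def run(s):
        out = []
        for p in ps:
            r = p(s)
            if r is None:
                return None
            v, s = r
            out.append(v)
        return out, s
    return run

number = many(digit)
print(number("42+x"))  # (['4', '2'], '+x')
```

The payoff is the one the comment describes: `many`, `seq` and friends compose freely, so lexing and parsing can share one vocabulary.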
I'll push back and say that the lexer/parser split is well worth it.
And the best thing about the parser combinator approach is that each is just a kind of parser, something like
All the usual helper functions like many or sepBy work equally well in the lexing and parsing phases.
It really beats getting to the parentheses-interacting-with-ordering-of-division-operations stage and still having to think "have I already trimmed off the whitespace here or not?"
I am writing a whitespace sensitive parser - trimming whitespace matters because whitespace consumption is used to implement the indentation rules/constraints.
For example, doing things like passing an indentation sensitive whitespace consumer to a parser inside `many` for consuming all of an indented child block. If I split lexing/parsing I think I'd have to do things like insert indentation tokens into the stream, and end up with the same indentation logic (but instead matching on those indentation tokens) in the parser regardless.
I have found that order-of-operations is somewhat trivially solved by `makeExprParser` from `Control.Monad.Combinators.Expr`.
It seems to me LL and LR parser generators are overrated, and hand-written recursive descent is best in practice. I understand why academics teach them, but not why some spend so long on different parsing techniques, nor why hobbyists who just want to compile their toy language are directed to them.
I work in PL, and from my first compiler to today, I have always found recursive descent easiest, most effective (fewer bugs, better error diagnostics, fast enough), and most intuitive. Many popular language compilers use recursive descent: I know at least C# (Roslyn) and Rust, and I believe most others do too, except Haskell (GHC) and OCaml.
The LR algorithm was simple once I learned it, and yacc-like LR (and antlr-like LL) parser generators were straightforward once I learned how to resolve conflicts. But recursive descent (at least to me) is simpler and more straightforward.
LR being more expressive than LL has never mattered. A hand-written recursive descent parser is most expressive: it has unlimited lookahead, and can modify parsed AST nodes (e.g. reordering for precedence, converting if into if-else).
The only solution that comes close is tree-sitter, because it implements GLR, provides helpful conflict messages, and provides basic IDE support (e.g. syntax highlighting) almost for free. But it’s a build dependency, while recursive descent parsers can be written in most languages with zero dependencies and minimal boilerplate.
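As an illustration of how little machinery hand-written recursive descent actually needs, here is a toy expression parser in Python (a sketch invented for this comment, not taken from any compiler mentioned above). Precedence and left-associativity fall straight out of which grammar function calls which, which is the "modify parsed AST nodes" flexibility in its simplest form:

```python
import re

def tokenize(src):
    """Trivial lexer: integers and single-character operators."""
    return re.findall(r"\d+|[+\-*/()]", src)

class Parser:
    def __init__(self, tokens):
        self.toks = tokens
        self.pos = 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    # expr := term (('+'|'-') term)*     -- lowest precedence
    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.next()
            node = (op, node, self.term())  # left-associative fold
        return node

    # term := atom (('*'|'/') atom)*     -- binds tighter than +/-
    def term(self):
        node = self.atom()
        while self.peek() in ("*", "/"):
            op = self.next()
            node = (op, node, self.atom())
        return node

    # atom := number | '(' expr ')'
    def atom(self):
        tok = self.next()
        if tok == "(":
            node = self.expr()
            assert self.next() == ")", "expected ')'"
            return node
        return int(tok)

print(Parser(tokenize("1+2*3")).expr())  # -> ('+', 1, ('*', 2, 3))
```

Zero dependencies beyond the standard library, and error reporting is wherever you choose to put it.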
Parser generators are great in Python (Lark for me) so you can iterate fast and get a runtime spec of your grammar.
A hand-written recursive descent parser is something you do later when you start to industrialize your code, to get better error messages, make the parser incremental, etc.
Bison/ANTLR are code generators, they do not fit well in that model.
> It seems to me LL and LR parser generators are overrated, and hand-written recursive descent is best in practice
I would now agree with that. My compiler experience was on a team that happened to have a LALR expert, Jeanne Musinski PhD, a student of Jeffrey Ullman. She invented a better error recovery for the language. Recursive descent would have been perfectly suited to the task.
> LR being more expressive than LL has never mattered.
Quite agree. One might guess that a language that needs that might be hard to program in.
The Nanopass paper link seems to be dead, but it can be found here at least: https://stanleymiracle.github.io/blogs/compiler/docs/extra/n...
Also:
https://www.cambridge.org/core/journals/journal-of-functiona...
Also you could go to Andy Keep's site: https://nanopass.org/
What taught me how to write a compiler was the BYTE magazine 1978-08 .. 09 issues which had a listing for a Tiny Pascal compiler. Reading the listing was magical.
What taught me how to write an optimizer was a Stanford summer course taught by Ullman and Hennessy.
The code generator was my own concoction, and is apparently quite unlike any other one out there!
I have the Dragon Book, but have never actually read it. So sue me.
You can find some of those magazines linked below:
https://archive.org/details/BYTE-MAGAZINE-COMPLETE
the article undersells nanopass: the real insight isn't the number of passes but that each pass has an explicit input and output language, which forces you to think about what invariants hold at each stage. that discipline alone catches a surprising number of bugs before you even run the compiler. crenshaw is fantastic but this structural thinking is what separates toy compilers from ones you can actually extend later.
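to make that concrete, here's the discipline sketched in python (toy IRs invented for this example, nothing from the article): each pass maps one named language to another, and the output language's invariant can be checked mechanically after the pass runs.

```python
# L1: nested expression trees, e.g. ('+', ('*', 2, 3), 4)
# L2: three-address code: ('assign', name, op, a, b) where a, b are atomic

def lower_to_tac(expr, out, counter=[0]):
    """Pass L1 -> L2: flatten nested expressions into temporaries."""
    if isinstance(expr, int):
        return expr
    op, a, b = expr
    a = lower_to_tac(a, out)          # recursively flatten operands
    b = lower_to_tac(b, out)
    counter[0] += 1
    tmp = f"t{counter[0]}"
    out.append(("assign", tmp, op, a, b))
    return tmp

def check_l2(instrs):
    """The output-language invariant: every operand is atomic."""
    for _, _, _, a, b in instrs:
        assert isinstance(a, (int, str)) and isinstance(b, (int, str))

instrs = []
result = lower_to_tac(('+', ('*', 2, 3), 4), instrs)
check_l2(instrs)  # fails loudly if the pass leaks nested operators
print(instrs, result)
```

because the output language is explicit, the "no nested operators remain" property is a one-function check instead of a latent bug.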
It's been about 4 years since I took a compilers course (from OMSCS, a graduate program) and I still shutter ... it was, hands down, the most difficult (yet rewarding) class I've taken.
It made me drink myself Venetian blind.
I would like to agree with this comment, but I certainly didn't find it rewarding. It was pure pain.
10 years ago I took a few Coursera courses to fill the gaps in my computer science education.
One of them was a compilers course done by karpathy. It was pure joy and a great learning experience.
Also in my experience the joy of doing a course was much stronger correlated with the teacher's qualities rather than the subject itself.
Do you have a link by any chance? A quick search didn't turn anything up.
No, I am checking my emails (it was in 2012), and all the links are broken.
However, I could dig out the references to it. Apparently it was a course by Prof. Alex Aiken, and karpathy was a TA.
This repository seems to be a later version of the same course. https://github.com/gboduljak/stanford-compilers-coursework
Edit: found the videos from the course on the archive https://archive.org/details/academictorrents_e31e54905c7b266...
Ah yes, I remember this one. Very challenging indeed. Thanks!
Based on another reply I can’t tell if there’s a clever window-based pun that I’m missing. If not, I think you want “shudder” and not “shutter” here. I’m sorry if I just ruined the joke.
What did you find more painful about compilers than other forms of programming?
I think there is a million ways to make a compilers course.
The course I did was organized perfectly, with big parts of the compiler boilerplate already written; I only had to implement the parser/lexer rules and the translation of language structures into assembly instructions. Also, it was a compiler for a language designed just for this course, with the intention of being specifically easy to write a compiler for, not easy to program in.
Without this I can imagine it being a painful experience.
I used to judge CS programs based on if they had compiler classes or not.
I loved that course so much but it was incredibly difficult to do while also working
I took that course too and it ruined my life... by making me think writing a compiler could be fun. The course itself was worth the money I paid for the program.
I have fond memories of implementing an optimizing compiler for the CS241 compiler course offered back then by Prof Michael Franz who was a student of Niklaus Wirth, probably the most exhilarating course during my time at UC Irvine. This was in 2009 so my memory is vague but I recall he provided a virtual machine for a simple architecture called DLX and the compiler was to generate byte code for it.
Google search points me to https://github.com/cesarghali/PL241-Compiler/blob/master/DLX... for a description of the architecture and possibly https://bernsteinbear.com/assets/img/linear-scan-ra-context-... for the register allocation algorithm
I liked this series of lectures on Youtube.
Compilers - Alex Aiken | Stanford
https://www.youtube.com/playlist?list=PLEAYkSg4uSQ3yc_zf_f1G...
Mentioned in another comment, but with a different link.
See also, Andy Keep's dissertation [1] and his talk at Clojure/Conj 2013 [2].
I think that the nanopass architecture is especially well suited for compilers implemented by LLMs as they're excellent at performing small and well defined pieces of work. I'd love to see Anthropic try their C compiler experiment again but with a Nanopass framework to build on.
I've recently been looking in to adding Nanopass support to Langkit, which would allow for writing a Nanopass compiler in Ada, Java, Python, or a few other languages [3].
[1]: https://andykeep.com/pubs/dissertation.pdf
[2]: https://www.youtube.com/watch?v=Os7FE3J-U5Q
[3]: https://github.com/AdaCore/langkit/issues/668
Andy gave a nice talk on the implementation of Chez Scheme, an optimizing compiler, at the Scheme Workshop in Berlin in 2019:
https://www.youtube.com/watch?v=N_-enNCZxaU
Past comments: http://news.ycombinator.com/item?id=2927784, https://news.ycombinator.com/item?id=10786842
Thanks! Macroexpanded:
Want to Write a Compiler? Read These Two Papers (2008) - https://news.ycombinator.com/item?id=10786842 - Dec 2015 (70 comments)
Want to Write a Compiler? Just Read These Two Papers. - https://news.ycombinator.com/item?id=2927784 - Aug 2011 (77 comments)
Want to Write a Compiler? Just Read These Two Papers - https://news.ycombinator.com/item?id=231758 - June 2008 (39 comments)
I learned from the Dragon Book, decades ago. I already knew a lot of programming at that point, but I think most people writing compilers do. I'm curious if there really is an audience of people whose first introduction to programming is writing a compiler... I would think not, actually.
I wouldn't say a lot of people use it as an introduction to programming, but I've personally seen it appear quite early in a programmer's journey.
I was first exposed to compilers as a learning subject as a mandatory 2nd year/1st semester course; with the Dragon Book as the main textbook...
I wonder if it makes sense to do the nand2tetris course for an absolute beginner since it too has compiler creation in it.
I highly recommend nand2tetris to everyone. For me, nothing ever explained the whole domain from logic gates and inner workings of a CPU to compilers better than this course.
On a side note, why is imrozim's comment dead? What in the world is wrong with it? It's perfectly fine IMO.
check comment history
https://news.ycombinator.com/item?id=47582720
I think it's worth mentioning Gustavo Pezzi's lectures at pikuma.com. The one on "Digital Electronics" and the one on "Interpreters & Compilers" really helped me.
nand2tetris only requires programming ability at the level of someone who's taken freshman level CS IIRC.
You could take Harvard's CS50 and then tackle it.
And Nystrom's book
Yeah, I really enjoyed Crafting Interpreters, wholeheartedly recommend!
Maybe I'm missing the point of this article? Writing a simple compiler is not difficult. It's not something for beginners, but towards the end of a serious CS degree program it is absolutely doable. Parsing, transforming into some lower-level representation, even optimizations: it's all fun and really not that difficult. I still have my copy of the "Dragon Book", which is where I originally learned about this stuff.
In fact, inventing new programming languages and writing compilers for them used to be so much of a trend that people created YACC (Yet Another Compiler Compiler) to make it easier.
Maybe you are a genius among mere mortals, but for the majority of people, even after a CS degree, writing a compiler is a bit difficult.
The biggest issue with technical books is that they spend the first 1-2 chapters vaguely describing some area and then follow up with "but that's for a later, more advanced discussion" or "we'll cover that in the last 1-2 chapters". Don't vaguely tell me about something you're not gonna go into detail about, because then all I'm thinking about while reading the subsequent chapters is the questions I have about that topic.
That's how all education works. It's the spiral model of teaching. In one grade you learn a bit of this and a bit of that, then the next year you retread, and flesh all those things out by adding more depth and complexity. Rinse and repeat every grade.
What would the alternative look like? Should a foreign language course spend three years on Nouns, just to make sure they're comprehensively covered, before you ever see your first Verb?
A similarly scoped book series is "AI Game Programming Wisdom", which contains a multitude of chapters that focus on diverse, individual algorithms that can be practically used in games for a variety of use cases.
I'm also writing a compiler and CS6120 from Cornell has helped me a lot: https://www.cs.cornell.edu/courses/cs6120/2025fa/self-guided...
archive of forth translation of Crenshaw’s howto
https://web.archive.org/web/20190712115536/http://home.iae.n...
I might be in the minority, but I think the best way to learn how to write a compiler is to try writing one without books or tutorials. Keep it very small in scope at first, small enough that you can scrap the entire implementation and rewrite in an afternoon or less.
If I had a euro for every time I started writing a compiler and got lost in the parser weeds, I'd have... at least a couple of euros.
All you really need is a copy of "The Unix Programming Environment", where you can implement `hoc` in a couple hours.
https://news.ycombinator.com/item?id=10786842
Would a practical approach be to parse the source into Clang's AST format, then let it make the actual executable?
You'd more likely want to emit LLVM IR rather than try to match clang's internal AST. That's essentially what most new language projects do now (Rust, Swift, Zig all use LLVM as their backend). You get optimization passes and codegen for multiple architectures for free, and the IR is well-documented. The tradeoff is you skip learning about the backend, which is arguably the most interesting part.
Right that's what I meant ... the .ll file. Thanks
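To make the .ll route concrete, here is a toy emitter sketch in Python (the `emit_add_function` helper is invented for this example) that produces textual LLVM IR for one hard-coded function. Real front ends usually go through llvmlite or the LLVM C/C++ APIs rather than pasting strings, but the output format is the same:

```python
def emit_add_function(name, params):
    """Emit textual LLVM IR for a function returning the sum of
    its two i32 parameters."""
    a, b = params
    return "\n".join([
        f"define i32 @{name}(i32 %{a}, i32 %{b}) {{",
        "entry:",
        f"  %sum = add i32 %{a}, %{b}",   # SSA: one assignment per name
        "  ret i32 %sum",
        "}",
    ])

print(emit_add_function("add", ["a", "b"]))
```

The emitted text, saved as `out.ll`, should compile with `clang out.ll -o out` or run directly under `lli`, which is where the "free" optimization and multi-target codegen come in.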
Any tips on reading material for code generation and scheduling for parallel engines?
fanf2 on Dec 25, 2015 [dead]:
I quite like "understanding and writing compilers" by Richard Bornat - written in the 1970s using BCPL as the implementation language, so rather old-fashioned, but it gives a friendly gentle overview of how to do it, without excessive quantities of parsing theory.
https://news.ycombinator.com/item?id=10791521
I like that book too, I bought it many decades ago and learned enough to write a transpiler for converting Fortran source code to Turbo Pascal.
Sadly, Richard died in January this year. This book is available to download from one of his academic pages:
https://www.eis.mdx.ac.uk/staffpages/r_bornat/#compilerbook
https://www.eis.mdx.ac.uk/staffpages/r_bornat/books/compilin...
https://en.wikipedia.org/wiki/Richard_Bornat
Whatever you do, don't waste your time on the "Dragon" book.
Google "recursive descent parsing" and it will tell you everything you need to know about the front-end of a compiler.
Google "My First Language Frontend with LLVM" and it will teach you the other half.
> The best source for breaking this myth is Jack Crenshaw's series, Let's Build a Compiler!
Right, I've heard of that...
> , which started in 1988.
... Oh. Huh.
(Staring at the red dragon book on my bookshelf, which was my course textbook in the early 00s.)
I've been having a look at the Crenshaw series, and it seems pretty good, but one thing that kinda annoys me is the baked-in line wrapping. Is there a way to unwrap the text so it's not all in a small area on the left of my screen?
https://xmonader.github.io/letsbuildacompiler-pretty/
Perfect!
Here's an archived version converted to x86: https://web.archive.org/web/20220603004249/http://www.pp4s.c...
https://t3x.org has literal books on that, from a simple C compiler to Scheme (you might have heard of S9) and T3X0 itself, which can run under Unix, Windows, DOS, CP/M and whatnot.
PS: Klong's intro to statistics: even if the language looks like a joke, it isn't. It can be damn useful, far easier than Excel. And it comes with a command to output a PS file with your chart embedded.
https://t3x.org/klong/
Intro to statistics with Klong
https://t3x.org/klong/klong-intro.txt.html
https://t3x.org/klong/klong-ref.txt.html
On S9: it has Unix, curses, sockets and so on supported with an easy API, so it's damn easy to write something if you know Scheme/ncurses, and to try stuff in seconds. You can work through the "Concrete Abstractions" book with it, and just adapt the graphics functions to create the (frame) one for SICP (and a few more).
And as we are doing compilers... with SICP you can build anything from a register-machine simulator to a Scheme interpreter in Scheme itself.
These days there's an even easier way to learn to write a compiler. Just ask Claude to write a simple compiler. Here's a simple C compiler (under 1500 lines) written by Claude: https://github.com/Rajeev-K/c-compiler It can compile and run C programs for sorting and searching. The code is very readable and very easy to understand.
why read that, vs an actually well-written compiler though?
Because an actual compiler would be tens of thousands of lines and most of it is going to be perf optimization. If you want to get the big picture first, read a simple working compiler that has all the key parts, such as a lexer, abstract syntax tree, parser, code generator and so on.
Is it less work than finding a human-authored toy compiler of good quality? How long did it take to generate?
For those of us who learn better by taking something apart and tinkering with it, this is definitely the better approach.
I've never been a good book learner, but I love taking something apart and tinkering with it to learn. A small toy compiler is way better than any book, and it's not like the LLM didn't absorb the books anyways during training.
Exactly! Writing a compiler is not rocket science if you know assembly language. You can pick up the gist in an hour or two by looking at a simple toy compiler.
I did not and will not run this on my computer but it looks like while loops are totally broken; note how poor the test coverage is. This is just my quick skimming of the code. Maybe it works perfectly and I am dumber than a computer.
Regardless, it is incredibly reckless to ask Claude to generate assembly if you don't understand assembly, and it's irresponsible to recommend this as advice for newbies. They will not be able to scan the source code for red flags like us pros. Nor will they think "this C compiler is totally untrustworthy, I should test it on a VM."
Are you concerned that the compiler might generate code that takes over your computer? If so the provided Dockerfile runs the generated code in a container.
Regarding test coverage, this is a toy compiler. Don't use it to compile production code! Regarding while loops and such, again, this is a simple compiler intended only to compile sort and search functions written in C.
No, the problem is much more basic than "taking over your computer," it looks like the compiler generates incorrect assembly. Upon visual inspection I found a huge class of infinite loops, but I am sure there are subtle bugs that can corrupt running user/OS processes... including Docker, potentially. Containerization does not protect you from sloppy native code.
> Don't use it to compile production code!
This is an understatement. A more useful warning would be "don't use it to compile any code with a while loop." Seriously, this compiler looks terrible. Worse than useless.
If you really want AI to make a toy compiler just to help you learn, use Python or Javascript as a compilation target, so that the LLM's dumb bugs are mostly contained, and much easier to understand. Learn assembly programming separately.
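To make the compile-to-Python suggestion concrete, here is a toy backend sketch (every name invented for this example) that turns a tiny expression AST into Python source and runs it. Any codegen bug shows up as readable Python instead of a corrupting native-code fault:

```python
def compile_expr(node):
    """Compile a tiny AST (int | (op, lhs, rhs)) to a Python expression."""
    if isinstance(node, int):
        return str(node)
    op, a, b = node
    return f"({compile_expr(a)} {op} {compile_expr(b)})"

ast = ('+', ('*', 2, 3), 4)
src = f"def compiled(): return {compile_expr(ast)}"

namespace = {}
exec(src, namespace)   # "linking": Python itself is the target machine
print(src)             # the generated "object code" is inspectable source
print(namespace["compiled"]())  # -> 10
```

The whole backend is a pretty-printer, which is exactly why it's a safer first target than assembly.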
You have not provided any evidence that can be refuted, only vague assertions.
The compiler is indeed useless for any purpose other than learning how compilers work. It has all the key pieces such as a lexer, abstract syntax tree, parser, code generator, and it is easy to understand.
If the general approach taken by the compiler is wrong then I would agree it is useless even for learning. But you are not making that claim, only claiming to have found some bugs.
The thing that is obviously and indisputably wrong, terrible for learners, is the test cases. They are woefully insufficient, and will not find those infinite loops I discovered upon reading the code. The poor test coverage means you should assume I am correct about the LLM being wrong! It is rude and insulting to demand I provide evidence that some lazy vibe-coded junk is in fact bad software. You should be demanding evidence that the project's README is accurate. The repo provides none.
The code quality is of course unacceptably terrible but there is no point in reviewing 1500 lines of LLM output. A starting point: learners will get nothing out of this without better comments. I understand what's going on since this is all Compilers 101. But considering it's a) stingily commented and b) incorrect, this project is 100% useless for learners. It's indefensible AI slop.
Sorry I disagree. I have written compilers by hand and this compiler generated by Claude is pretty good for learning.
I am only asking you to back up your own assertions. If you can't, then I would have to assume that you are denigrating AI because you are threatened by it.