abecedarius 11 years ago

On a first skim, this looks really nice; complaints that it's unreadable are unfounded. The background that makes it readable are Wirth's Compiler Construction http://www.ethoberon.ethz.ch/WirthPubl/CBEAll.pdf plus precedence climbing http://en.wikipedia.org/wiki/Operator-precedence_parser#Prec...

  • dabockster 11 years ago

    complaints that it's unreadable are unfounded

    Not exactly. You have to remember that language and compiler design require a LOT of work and experience to understand, and that many programmers will only see this as, frankly, spaghetti.

    I think it could have used some more block comments, but that's just me.

    • abecedarius 11 years ago

      I'm writing a chapter for AOSA on something like this: a self-hosting compiler of a subset of Python to Python bytecode. It'll present the full code (about the same length) and try to explain it well for people not into compilers yet, but in the meantime I recommend the Wirth book.

      What's your favorite shorter intro? I'd especially like to reference other educational compilers that are self-hosting.

      • metaobject 11 years ago

        So, there is another version of Architecture of Open Source Applications in the works? I love the first two editions - they are unique books that really are gems. Are you able to divulge what projects are covered in the new version?

    • anigbrowl 11 years ago

      I had the same first instinct, but given that a) it's very very tidy code and b) if you want to understand the inner workings of a compiler then you really do need to figure this out, I decided on review that it's basically self-documenting.

      Of course figuring out what it's doing is one thing - understanding why it is done in this particular way is another, and while I was able to find my way around fairly quickly I'd cry if I had to re-implement it. I do love how small it is though, that gives it great educational value.

      • blinks 11 years ago

        > understanding why it is done in this particular way is another

        Isn't that the reason for comments in the first place?

        • anigbrowl 11 years ago

          I look to comments to tell me what a block of code is doing rather than why, eg 'Performs a Discrete Cosine Transform on the contents of the buffer' or 'Bubble sort algorithm rearranges the records in at least as much time as required to enjoy a nice cup of tea.'

          The 'why' of a very low-level tools like this is the sort of thing that needs to be explored at length in a paper or (in this case) a book, otherwise they'll swamp the actual code. Sometimes as a learning exercise I'll take something like this and comment the hell out of everything, but the value there is more in writing the comments than trying to read them again later. Of course this is very much a matter of personal taste.

          • aeonsky 11 years ago

            I am a junior dev without a ton of experience so correct me if I'm wrong, but I strongly disagree. Comments should explain "why" something was written. Wouldn't the function name indicate what you are doing (and comments in the function)? This is especially true in business logic.

            reverseNaturalSortOrder(listOfItems); // case sensitive sort of items by reverse alphabetical order

            or

            reverseNaturalSortOrder(listOfItems); // sort this way because the upper layer expects items in reverse order since user requested it

            I think it is usually significantly easier to understand what something is doing rather than why it is doing that. To answer the former it usually requires a narrow scope of focus, but the latter requires a very broad scope.

            • anigbrowl 11 years ago

              Sure, I'm not arguing against ever saying why you'd do something :) I just like descriptive comments because it speeds up the business of figuring the high level structure of the code - you're right that understanding exactly what's going on can be the most difficult part, but for me that answer usually falls out as I build a mental model of what it's doing. On the other hand some algorithms need in-depth explanation that's beyond the scope of comment (hence my example of the transform).

              Mind, I mostly do hobby/experimental programming, I've just been doing it for a long time. So I'm not commending this as a work practice or anything - your point about business logic made good sense.

            • DSMan195276 11 years ago

              I agree with you completely - The code explains what you're doing, comments explain why you did it that way. Ideally, any comments that explain what you're doing would end-up being redundant when looking at the code.

              I think this code details a special case of the above though, in that it comments what the enums are instead of just naming the enums. I give that a pass strictly because this code needs to be able to compile itself, and I don't think it supports named enums, so the comment was necessary to make up for that.

              • MrTortoise 11 years ago

                Its not that simple though, error fixes and edge cases often obfuscate something that was understandable. A why comment is never bad, but a what comment is often as valuable as a test

                • DSMan195276 11 years ago

                  Like I noted with my special case, it's not always that simple, but I routinely find the best commented code to be code which was written with the comments explaining why and the code explaining what. There are definitely time where a what comment is warranted, but it's just not the general case.

                • moron4hire 11 years ago

                  That should almost never be the case. If you find this to be a frequent occurrence, then the code base in which you are working is not designed for the problem domain to which it is being applied.

            • tdsamardzhiev 11 years ago

              Agreed - code should explain what code does. Duh.

            • chipsy 11 years ago

              There are three levels to consider:

              1. Readability for modification

              2. Readability for the "what"

              3. Readability for the "why"

              All human-readable description in code is there to make the difference between having a piece of documentation pointing you in the right direction, and having to reverse engineer your understanding. It's an optimization problem with definite tradeoffs. Although CS professors and many tutorials will tend to encourage you towards heavy description, over-description creates space for inconsistent, misleading documentation, which is worse than "not knowing what it does."

              When you see code that is dense and full of short variables, it's written favorably towards modification. It is relying on a summary comment for "what" it does, and perhaps on namespaces, scope blocks, or equivalent guards to keep names from colliding. Such code lets you reach flow state quickly once you've grokked it and are ready to begin work, because more of it fits onscreen, and you're assured the smallest amount of typing and thus can quickly try and revise ideas. The summary often gets to stay the same even if your implementation is completely different. And if lines of code are small, you enjoy a high comments-to-code ratio, at least visually.

              Code that builds in the "what" through more descriptive variable names pays a price through being a little harder to actually work in, with big payoffs coming mostly where the idea cannot be captured and summarized so easily through a comment.

              In your example, one might instead rework the whole layout of the code so:

                  var urq; /* user request type */
              
                  ... (we build list "l") 
                  
                  /* adjustments for user request */
                  {
                    if (urq=="rns") { /* case sensitive sort by reverse alphabetical order */ ... }
                  }
              

              If you aren't reusing the algorithm, inline and summarize. And if you're writing comments about "what the upper layer expects", then you(or your team) spread and sliced up the meaning of the code and created that problem; that kind of comment isn't answering a "why," it's excusing something obtuse and unintuitive in the design, and is a code smell - the premise for that comment is hiding something far worse. If the sequence of events is intentionally meaningful and there's no danger of cut-and-paste modifications getting out of sync, it doesn't need to be split into tiny pieces. Big functions can be well-organized and easy to navigate just with scope blocks, comments, and a code-folding editor.

              "Why" is a complicated thing. It's not really explainable through any one comment, unless the comment is an excuse like the example you gave. The whole program has to be written towards the purpose of explaining itself(e.g. "literate programming"), yet most code is grown organically in a way that even the original programmers can only partially explain, starting with a simple problem and simple data and then gradually expanding in complexity. Experience(and unfamiliar new problem spaces) eventually turns most programmers towards the organic model, even if they're aware of top-down architecture strategies. Ultimately, a "why" has to ask Big Questions about the overall purpose of the software; software is a form of automation, and the rationale for automating things has to be continuously interrogated.

    • ezy 11 years ago

      Well, the goal is for it to be minimal, and C doesn't actually do a lot of work for you. On reading it, the algorithm it uses to parse was of less interest than the various tricks it uses to initialize and manage the state of the parser in a compact way.

    • qznc 11 years ago

      Compilers are also a very well understood topic. If you have seen (and understood) a recursive decent parser with precedence climbing, the code looks as expected. It is a pretty straightforward implementation.

    • sklogic 11 years ago

      Basically, what you're saying is that any toy compiler example should be accompanied with a copy of the Dragon Book.

      • WhitneyLand 11 years ago

        You seem to suggest that a background in compiler theory is somehow table stakes for commenting on HN. Since many here are not developers, and many developers don't have a CS degree, a few contextual comments seem appropriate.

        • sklogic 11 years ago

          You seem to suggest that a niche, minimalistic toy example should always be accessible to non-developers.

          • WhitneyLand 11 years ago

            It's false dichotomy that you either get it or you don't. People will access things according to their ability. Where to draw the line on being inclusive? If some has genuine curiosity and motivation to ask a question and learn, then providing a few lines of overview doesn't clutter the board much and can be a positive contribution.

            • sklogic 11 years ago

              Exactly, where to draw a line? Explaining a concept of abstract and virtual machines may take a few pages of a dense text, explaining how to parse expressions with precedence may require dozens of pages, explaining C types will add a few more.

              So, yes, it's either you're curious enough to dig into a code and find the relevant explanations somewhere else (the said Dragon Book and alike), or you won't get it, regardless of how comprehensive comments are.

        • ctdonath 11 years ago

          If you're commenting about a remarkably clever example of an obscure topic which requires prolonged study to understand, then yes I'd suggest that a background in _______ is somehow table stakes for commenting on a focused discussion of _______ on HN.

          "Toy examples" are often the result of long & deep study and practice of a subject, creating something profound which casual observers are not entitled to instantly understand. In this case, it's a very clever compiler: everybody understands this summary, and if you want "a few contextual comments" beyond the source code itself then you know where to get enough information to learn what you need to understand this.

          If you don't "get it", and don't want to "get it" on your own, it's not for you.

  • shangxiao 11 years ago

    It is unreadable. Lumping code together into big functions just so you can say "Look I've created a compiler in 4 functions" is pointless, unless your goal is to post it to HN to show everyone how clever you are.

    This is not how you code when you work in a team or when you know some other poor soul has to come along and maintain it.

    I suggest you take a look at [1] then go and read this excellent book by Martin Fowler: "Refactoring: Improving the Design of Existing Code" [2]

    [1] https://en.wikipedia.org/wiki/Code_smell

    [2] http://www.goodreads.com/book/show/44936.Refactoring

    • RubyPinch 11 years ago

      I have a feeling that the project is a mix of for-fun and to-see-if-its-possible.

      Not all programming is enterprise quality, some programming is intentionally not.

      Being so dismissive of that, seems a little silly to me.

      The demoscene doesn't exist because of a bunch of programmers trying to make the lives of every other programmer more difficult, it exists because people like the challenge.

      • shangxiao 11 years ago

        Of course this occurred to me, my comment was more of an emotional response to "complaints that it's unreadable are unfounded".

        If quite a few people are saying that it's unreadable, you don't just dismiss them as making "unfounded" statements.

        > The demoscene doesn't exist because of a bunch of programmers trying to make the lives of every other programmer more difficult, it exists because people like the challenge.

        I'm pretty sure that this doesn't need to be stated.

    • userbinator 11 years ago

      This also is not code that would need to be written by a team.

      The functions may be big, but they also don't have all that much duplicate code inside them. The choice of 4 functions isn't arbitrary either - they nicely divide the problem into:

      - next() - splits the source code into a series of tokens

      - expr() - parses expressions

      - stmt() - parses statements

      - main() - starts the processing of the source, and also contains the main interpreter VM's execution loop

      Code generation is integrated into the parsing, since it's generating code for a stack-based machine and that also very nicely follows the sequence of actions performed when parsing.

      In fact I'm of the opinion that the obsession with breaking up code into tiny pieces (usually accompanied by the overuse of OOP) is harmful to the understanding of the program as a whole since it encourages looking at each piece independently and misses "seeing the forest for the trees".

      In contrast, this is code that is designed to be easily read and understood by a single person, showing how very simple an entire compiler and interpreter/VM can be. It doesn't attempt to hide anything with thick layers upon layers of abstraction and deep chains of function calls, but instead is the "naked essence" of the solution to the problem.

      Someone used to e.g. enterprise Java may find this style of code quite jarring to their senses, but that's only because they've grown accustomed to an environment in which everything is highly-abstracted and indirect, hiding the true nature of the solution. Personally, I think the simplicity and "nakedness" of this code has a great beauty to it --- it's a functional work of art.

      • shangxiao 11 years ago

        Breaking up code is more than just removing redundancy - it's about exactly what you have written: "Encouraging looking at each piece independently". That actually has advantages, the main being that it is easy to understand how each piece operates independently of the other pieces - simply cohesion & coupling which is applicable to any language, not just "Java". I don't believe in my experience that I miss "seeing the forest for the trees" - I encourage you to try it, perhaps by learning some functional programming.

        There is one obsession that I am tired of is people posting "X awesome thing in 120 lines of JavaScript" or "Y in 4 functions". Just because the problem is reduced to small as possible metric, it doesn't make it good.

        PS: Also you mention negatively connotated terms like "OOP", "Java", "thick layers of abstraction" and "deep chains of function calls" as if you've ascertained that I'm some enterprise developer that doesn't have any C experience and wouldn't know simplicity if it bit me in the ass.

    • Confusion 11 years ago

      I suggest you consider how someone could know of, and understand, your suggestions very well, while still having the opinion he does. Don't underestimate other people.

  • pvidler 11 years ago

    Most of it was readable, but the printf on lines 57--59 made me retch. I see what it's doing, but it's not what I'd call easily maintainable:

      printf("%8.4s", &"LEA ,IMM ,JMP ,JSR ,BZ ,BNZ ,ENT ,ADJ ,LEV ,LI ,LC ,SI ,SC ,PSH ,"
             "OR ,XOR ,AND ,EQ ,NE ,LT ,GT ,LE ,GE ,SHL ,SHR ,ADD ,SUB ,MUL ,DIV ,MOD ,"
             "OPEN,READ,CLOS,PRTF,MALC,MSET,MCMP,EXIT,"[*++le * 5]);
    • manish_gill 11 years ago

      I'd like to know why this was downvoted, and if people who down voted it can explain what the code is doing please?

      • maffydub 11 years ago

        It's taking an integer (representing an operation) and printing out the name of that operation.

        First thing to say is that "* ++le" is the integer representing the operation to perform. This basically walks through the array of instructions returning each integer in turn.

        Starting at the beginning of the line, we have "printf" with a format string of "%8.4s". This means print out the first 4 characters of the string that I pass next (padded to 8 characters). There then follows a string containing all of the operation names, in numerical order, padded to 4 characters and separated by commas (so the start of each is 5 apart). Finally, we do a lookup into this string (treating it as an array) at offset "* ++le * 5", i.e. the integer representing the operation multipled by 5 (5 being the number of characters between the start of each operation name). Doing this lookup gives us a char, but actually we wanted the pointer to this char (as we want printf to print out this char and the following 3 chars), so we take the address of this char (the & at the beginning of the whole expression).

        It's concise, but not exactly self-documenting.

        Does that make sense?

        (I didn't downvote.)

        • peterfirefly 11 years ago

          How is that not self-documenting if one knows C?

          • maffydub 11 years ago

            I think you and I might disagree on the meaning of self-documenting. ;)

            • peterfirefly 11 years ago

              I don't think we really do.

              It is impenetrable black magick if one "knows" C -- but quite clear if one /actually/ knows C.

              • moron4hire 11 years ago

                Ah, the No True Scotsman finally arrives to the party.

                • peterfirefly 11 years ago

                  No. The printf() requires that one has read K&R. That's not a high barrier to clear. Pointers are chapter 5.

  • PhasmaFelis 11 years ago

    > On a first skim, this looks really nice; complaints that it's unreadable are unfounded.

    Man, I can't even tell what this is supposed to be. My confusion is entirely founded. My thought process with articles like this goes something like "C in four functions, huh? Sounds like it could be clever. I'll just click and read the explanation... Oh, there isn't an explanation. Well, maybe this file will explain things! ...Nope, it's 500 lines of mostly-uncommented if-else statements. Maybe it's a compiler? I dunno!"

    I'm sure there's a subset of the programming community for whom this is crystal clear on first sight, and that's great; but there's a lot more of us who could probably get the joke with a few hints, so it would be nice if you'd help out instead of declaring that if you understand it, it must be easy.

    • boomlinde 11 years ago

      In the sense that some people won't have an idea of what's going on, this community altogether isn't particularly inclusive at all. Personally, I really don't want the topics this site covers to cater to a lowest common denominator, and I'm sure that isn't what you had in mind either, but that's the effect of taking "more of us" to mean more than you personally.

      • chris_wot 11 years ago

        Actually, a lot of us probably don't understand what this is doing. I sure don't, but I really don't consider myself "lowest common denominator" either. I come here to learn, to be honest!

        • boomlinde 11 years ago

          My point isn't that not knowing what this does makes you a lowest common denominator. It's that most stuff that is shared on this site at least borders on being what I'd call esoteric, and if making each individual submission more approachable or catering to a larger general audience is a goal of this community, it isn't really going that way. If it was going that way, I doubt the community would be particularly interested in this site.

          This submission in particular has dubious practical use, and the description, "an exercise in minimalism", is telling of a sort of artistic intent. If you don't understand what it does, how it does it, or if you don't like it, it won't lower my opinion of you in any way, but its inclusion on this site is part of why I like to come here every now and then. I get to discuss subjects that relate to my work and hobbies, but I also get to look at weird alien code and think hard in unfamiliar terms. From what you are saying, I think you can relate.

          Personally, I could glance over it and get the idea that it is a C compiler, but if you were to show me some code in written with the latest JS MVC or FRP framework, don't hold your breath for me to tell you what it does. I can't say that I fully understand this, and that's why I enjoy the rich discussion the submission spawned here.

          • chris_wot 11 years ago

            Fair enough :-) thanks for clarifying.

      • WhitneyLand 11 years ago

        The logical leap from adding a "few hints" to everything becomes "lowest common denominator" is the size of the Grand Canyon.

        • boomlinde 11 years ago

          Personally I don't see anything wrong with suggesting to add a few hints, but the basis of that suggestion in this case was that "there's a lot more of us who could probably get the joke" with the hints. If making it approachable to more people is inherently a good thing, the logical conclusion is to make it approachable to everyone.

          With code like this, the readability obviously isn't a high priority consideration and sometimes the exact opposite of the goal, with the impenetrability sometimes being part of its charm. This is Hacker News after all, and if your reaction to of a piece of code that describes itself as "an exercise in minimalism" is to leave a snarky comment about the lack of documentation, you should probably check your news elsewhere.

          If you have any interest in the subject, the initial comment "just enough features to allow self-compilation and a bit more" should give the purpose of the code away. If not, it ought to have been a clear sign of dragons.

    • abecedarius 11 years ago

      Yes, I'm sorry: I meant to defend this code from the charge of being pointless code golf, and inadvertently disparaged people without the background to enjoy reading it. It really is hard to follow without that background, which lots of good programmers don't have.

      I hope someone will write an explanation. I'm still working on one for my own (quite different) little compiler.

  • tomp 11 years ago

    > complaints that it's unreadable are unfounded

        int *a, *b;
        int t, *d;
    

    I can't really see how anyone can say that code that uses one-letter variable names (with the exception of the "standard ones", whose meaning is defined at the top) is readable.

    • abecedarius 11 years ago

      OK, 'unfounded' was a little too strong. I'd change some things myself, including comments on those declarations; but if this looks like a code-golf game to you, it's not, it's a style you're not used to.

userbinator 11 years ago

From a cursory glance this appears to be a much-condensed recursive-descent parser with all of the usual parsing functions moved into one, so it's not all that difficult to understand. I think recursive-descent is one of the easiest to intuitively understand parsing algorithms, and it also makes for some concise code; it's also far easier to debug a recursive-descent parser than one of the traditional table-driven ones.

Edit: OTCC (http://www.bellard.org/otcc/ ) is another extremely tiny (to the point of being obfuscated) example of a compiler using a recursive-descent parser.

  • barrkel 11 years ago

    It's an operator precedence parser, which is actually quite a bit faster than recursive descent for expressions when operator precedence is defined using grammar productions that would otherwise get turned into methods.

    For example, a simplified expression operator precedence can be expressed like this (where {} denotes repetition, implemented with e.g. a while loop in a recursive descent parser (RDP)):

        expr ::= term { '+' term | '-' term } ;
        term ::= factor { '*' factor | '/' factor } ;
        factor ::= '-' factor | <number> | '(' expr ')' ;
    

    Typically this would be parsed in RDP using expr(), term() and factor() functions, where expr() calls term(), term() calls factor(), and factor() calls expr() if it sees an opening parenthesis.

    The trouble is that this means that every single expression parse needs to call through this deep code flow before it can see a single interesting bit of the expression being parsed. Parsing "2 + 3" will result in 5 function calls: expr, term, factor, then term and factor again.

    The operator precedence parser presented avoids this deeply nested set of calls before it starts parsing interesting things.

    This problem is far more acute in a language like C, which has lots of precedence levels, than it is in a language like Pascal, which only has a handful.

    • userbinator 11 years ago

      I was contrasting it against table-driven parsers (e.g. typical of the Yacc/Bison type) which I believe don't have any recursive calls and use a separate stack, while this one does make recursive calls. In some ways operator precedence/precedence climbing is like an optimisation of RDP, replacing a set of recursive functions with a level counter and a loop.

      • barrkel 11 years ago

        I can view using the CPU stack instead of an explicit stack, and the CPU instruction pointer instead of an index into a state table, as optimizations too (direct encoding). There also exist recursive ascent parsers, which are the RDP analogue of table-driven LALR parsers you're talking about.

        Point being, there are a lot of different parsing approaches, with pros and cons, and it can be useful to treat them separately rather than view them within only a couple of lenses.

petercooper 11 years ago

I just wanted to credit Reddit's /r/tinycode sub-Reddit for this link: http://www.reddit.com/r/tinycode - it's a pretty cool place to discover minimalistic implementations of things.

bjornsing 11 years ago

This shit doesn't scale:

  $ time ./c4 c4.c c4.c hello.c 
  hello, world
  exit(0) cycle = 9
  exit(0) cycle = 22614
  exit(0) cycle = 9273075

  real	0m0.067s
  user	0m0.067s
  sys	0m0.000s
  $ time ./c4 c4.c c4.c c4.c hello.c 
  hello, world
  exit(0) cycle = 9
  exit(0) cycle = 22614
  exit(0) cycle = 9273075
  exit(0) cycle = 933197195

  real	0m5.834s
  user	0m5.827s
  sys	0m0.000s
  $ time ./c4 c4.c c4.c c4.c c4.c hello.c 

Just kidding. :) Amazingly cool! Does anybody have a smaller self-hosing compiler & bytecode vm?

  • bjornsing 11 years ago

    For the record:

      $ time ./c4 c4.c c4.c c4.c c4.c hello.c 
      hello, world
      exit(0) cycle = 9
      exit(0) cycle = 22614
      exit(0) cycle = 9273075
      exit(0) cycle = 933197195
      exit(0) cycle = -1428163377
    
      real	9m23.409s
      user	9m22.673s
      sys	0m0.020s
    

    :)

0x0 11 years ago

Would you call this a compiler, an interpreter, a virtual machine, a scripting engine, or a combination of those?

  • marktangotango 11 years ago

    It compiles a c subset to byte code, then executes in a virtual machine. I think generally an intepreter can refer to either a byte code interpreter (ie virtual machine) or an AST walking interpreter. I didn't see a way to embed c4 into a host language, so maybe not a scripting language?

    IMO the real value of exhibits like this are boiling the problem (lexing, parsing, compiler, interpreting) down to their most basic parts. One could easily imagine this same language being implemented over 10's or 100's of class files in a more verbose language.

    • stcredzero 11 years ago

      I think generally an intepreter can refer to either a byte code interpreter (ie virtual machine) or an AST walking interpreter.

      This brings to mind how fuzzy "interpreter" is as terminology. What is a virtual machine that JIT compiles the byte code or the AST? Is it a JIT implementation of an interpreter? What of implementations that only JIT the most frequently used functions? Wouldn't those be half interpreter and half virtual machine?

      When it comes down to it, they're all really virtual machines. The real distinction is how we've come to think of different implementations and the representations sub-culturally. For some reason, it makes us feel better when we call certain things interpreters, because of some meaningless (and sometimes factually challenged) competitive instincts concerning implementation speed. (Also, we arbitrarily feel that byte code is somehow more "machine-y" than an AST.)

      So do I have a problem with "interpreter"? Only when people correct others, as if they're making a correction about something fundamental and factual. In reality, the distinction is between machines that are intended to have the same runtime semantics and really the distinction is only around what optimizations are present in their implementations. Furthermore, if you look at those optimizations in detail, the distinction gets even hazier.

      • TazeTSchnitzel 11 years ago

        Most interpreters (not all, but most) are actually a compiler and a VM, yes. The difference between a "compiler" and an "interpreter", in practise, seems to be that "compilers" lack a built-in interpreter.

        • stcredzero 11 years ago

          That's another very good point. One can just as well think of the VisualWorks Smalltalk VM as a compiler (which is actually implemented in Smalltalk) with an interpreter+JIT which also functions as a linker. This means, one can also think of Smalltalk as a compiled language with lots of late binding that makes for a more complicated linker. (JIT) Then the part that's a byte code interpreter can be thought of as an "optimization" on the compiler+linker combination. In fact, if you don't want to bother with flexible debugging, you could implement Smalltalk as a compiled language with "fancy linking" and no interpretation at all.

          Any interpreter vs. VM distinction is mostly a social construct. Looked at technically, it's a mishmash.

          • barrkel 11 years ago

            Also, a linker is very similar to a simple garbage collector: it follows code references, marking blocks as it goes (physically allocated in the final output), and adding the references in those blocks to its current list of references to resolve, until it runs out of references to resolve (done) or fails to resolve a reference (link error). All the references then need to be patched up according to their final address, very similar to a GC fixing up memory references.

        • barrkel 11 years ago

          Most static language compilers include an interpreter for constant expressions at a minimum, because otherwise statically allocating things like arrays is a bit tricky. Handily, this interpreter can be reused for constant folding.

          C++ compilers nowadays necessarily include a Turing complete interpreter.

          • TazeTSchnitzel 11 years ago

            C has one in the preprocessor, to do platform-independent conditionals.

astrodust 11 years ago

That is pretty stripped down. Could probably be ported to JavaScript without breaking a sweat.

  • Lerc 11 years ago

    That was my first thought, a micro-C to asm.js compiler wouldn't be hard.

TazeTSchnitzel 11 years ago

Pretty damn impressive.

Though I must wonder: how complete is it? What does it and does it not support? It's at least complete enough to be self-hosting, but beyond that? The code doesn't use that much of C.

  • cbhl 11 years ago

    Judging from the comments in c4.cpp, it probably only supports enough of a subset to compile itself.

    Granted, while building a parser that can parse (let alone compiling) the full C language is nontrivial, any undergrad should be able to build a parser and compiler for a sufficiently simple subset of it. (In my undergrad, we used this subset to build a "compiler" in second year: https://www.student.cs.uwaterloo.ca/~cs241/wlp4/WLP4.html)

    • JoeAltmaier 11 years ago

      You can build a C parser in an afternoon. It only has a few language constructs. Declarations are the hardest. Scanners are readily available for expressions and constants.

      • cbhl 11 years ago

        I would be worried about handling the not-context-free parts of the language: http://trevorjim.com/c-and-cplusplus-are-not-context-free/

        How would you handle these?

        • JoeAltmaier 11 years ago

          Scanner has to have access to the symbol table, emit 'ident' or 'type' as appropriate. So not a pure context-free parser.

          • sedachv 11 years ago

            Yes. This is the reason it is way easier to hand-build a parser for C than try to use any kind of parser generator.

            • JoeAltmaier 11 years ago

              Bison and Flex work fine - do they count as 'parser generators'?

              • sedachv 11 years ago

                Flex does not work fine, that's why you need the lexer hack: https://en.wikipedia.org/wiki/The_lexer_hack

                Non-working (no lexer hack) Flex and Bison specification for C99: https://gist.github.com/codebrainz/2933703 653 lines

                Hand-written recursive-descent C parser and pre-processor (something not handled by Lex/Yacc) that produces executable Common Lisp code: https://github.com/vsedach/Vacietis/blob/master/compiler/rea... 935 lines

                • JoeAltmaier 11 years ago

                  You can put whatever code you want into the Flex phrase rules. The 'lexer hack' can be implemented there. That's what I do whenever I have to parse something like C.

                  • sedachv 11 years ago

                    Exactly. You have to drop hooks into the lexer from the parser, and by the time you're done you end up with just as much code. Only it's slower than a recursive-descent parser would be, and a lot of the time parsing speed really matters because it shows up as user-visible latency.

                    • vidarh 11 years ago

                      While I have a strong dislike for lex/yacc and descendants, the "hooks" you need for C are trivial. As far as I remember the only thing you need is an ability for the lexer to check whether or not a given identifier is a variable or type.

                      Even that is only needed if you want to report more specific information up to the parser. E.g. Clang doesn't. In Clang it is instead the parser that looks up the information in order to figure out what type of identifier it has received.

                      • JoeAltmaier 11 years ago

                        Right but by then the rule matching has been done - now you're writing code to distinguish rules instead of code to handle rules.

                        The advantage of a parser tool is, it makes the language constructs explicit in the code - here be assignment, here be for() etc. That has some value.

                      • JoeAltmaier 11 years ago

                        And what are the alternatives? I've always wanted a parser object (in C++ anyway) that I can add constructions to, feed it scanner output and have it build a symbol table and semantic tree. Does such a thing exist?

      • rui314 11 years ago

        C is not a simple language as a CIL guy says [1]. I wrote my own C compiler [2] and I can say that writing a parser was harder than I thought. It would take more than half a day at least.

        [1] http://www.cs.berkeley.edu/~necula/cil/cil016.html [2] https://github.com/rui314/8cc

        • zik 11 years ago

          Yes, I also wrote my own C interpreter and the parser was a lot more than a day's work.

        • vidarh 11 years ago

          That first link is almost all about language semantics, not parsing issues.

          As for your example, I'll be solomonic and say that you and Joe are right, of sorts (though I do think it'd be more than an afternoon).

          It's certainly far more than a days work if you handwrite a lexer and parser that does the amount of additional work that yours do (AST construction; a lot of error reporting and sanity checking). But you can get very far with C very quickly if you use parser generation tools and have prior experience writing compilers and your goal is "just" to get something to parse it as quickly as possible - it's a tiny language.

          Of course, in practice most real compilers don't use these parser-generation tools exactly because things like proper error reporting etc. is far harder, and a simple recursive descent parser is so much easier to work with.

    • verytrivial 11 years ago

      Indeed. The lack of a case statement in the run loop is a bit of a giveaway. Excellent work nonetheless.

  • Lerc 11 years ago

    It's clear from the huge number of 'else if' statements that it doesn't support switch statements.

    This is also a bit of a clue

        p = "char else enum if int return while "
        "open read close printf malloc memset memcmp exit main";
    • TazeTSchnitzel 11 years ago

      Ah, the lack of struct explains why the compiler doesn't use any structs.

jdp 11 years ago

This is really similar to Marc Feeley's tinyc.c[0], which implements a C subset parser, code generator, and virtual machine in about ~300 lines of C.

[0]: http://www.iro.umontreal.ca/~felipe/IFT2030-Automne2002/Comp...

  • abecedarius 11 years ago

    That's a nice one. The big difference is that it doesn't compile itself -- not that there's anything magical about that, but it's a kind of threshold of seriousness.

    • userbinator 11 years ago

      One of the things I've wondered about is how small one could make an ISO C89-compliant compiler (that would be able to compile itself), and all these tiny compiler projects have inspired me to revisit that thought now... I've written pieces of compilers like expression parsers and tokenisers, and even then I felt like it wouldn't be so hard (if I had the time) to write a full compiler.

      These are all great for dispelling the notion that all compilers somehow necessarily have to be greatly complex and impenetrable to anyone but highly-trained professionals and theoreticians. (Look at the Dragon Book, for example.)

amelius 11 years ago

Next challenge: a fully compliant C++11 compiler written in 4 functions (1)

Then, I'd be impressed :)

(1) At maximum 500 lines per function, where each line is at most 200 characters long.

vesinisa 11 years ago

Neat, but this is not a general purpose C interpreter. It seems to lack the preprocessor and is only able to execute the example program and itself because it has wrapper implementations for a static set of standard library functions including printf(), open(), read() and malloc().[1] Use a standard library function it does not readily support, and you're out of luck.

[1]: https://github.com/rswier/c4/blob/master/c4.c#L492-L498

nichochar 11 years ago

For mac users: gcc -Wno-all -arch i386 -o c4 c4.c

methyl 11 years ago

Unfortunately, I cannot compile it on my OSX 10.10... here's what I get: http://pastebin.com/cVvaYFEH

EDIT: just make next() function void and it works.

EDIT2: still no fortune :(

  $ ./c4 hello.c
  [1]    33920 segmentation fault  ./c4 hello.c
  • cremno 11 years ago

    I think it doesn't work on x86-64, since the code assumes

        sizeof(int) == sizeof(void*)
    • jparishy 11 years ago

      Yep, you can compile it on 64 bit OS X with clang's -m32 option and it should work:

          ➜  c4 git:(master) ✗ clang -m32 c4.c
          ...
          ➜  c4 git:(master) ✗ ./a.out hello.c 
          hello, world
          exit(0) cycle = 9
  • kschults 11 years ago

    For now, you need to add both -Wno-return-type and -m32 when compiling.

marcofiset 11 years ago

I honestly think this is ridiculous. Sure, this is an incredible feat, and congrats. But serioulsy, I would be ashamed to publish such unreadable code under my name.

What about naming your variables with descriptive names?

What about extracting complex conditions into well named function to understand what is going on (thus defeating the purpose of the "4 functions") ?

This list could go on forever...

Writing software is not a contest for who can write the most amount of code in the most cryptic way.

  • tehwalrus 11 years ago

    The variable names thing really annoyed me too - a habit of code golf, and people who were originally trained in an old FORTRAN edition that had a 6 or 7 char limit on names.

    • _wmd 11 years ago

      Unless you're talking about a large piece of software composed entirely of single character functions and variable names, I pretty much disagree. Verbose variable names do not magically teach those reading a piece of code how it works, simultaneously they tend to make it impossible to write many kinds of expressions concisely, and consequently they regularly damage the readability of more complex pieces of code (e.g. arithmetic expressions involving 3 or more terms).

      The same goes for symbolic constants. Sometimes (but not exactly "often"), use of numeric literals can vastly improve the maintainability of some code, assuming the reader understands how to maintain it in the first instance.

      As for increasing reader comprehension, carefully thought out comments are a better mechanism by far.

      In this case, it is sufficient to know that the file is a compiler/interpreter for its entirety to make sense, assuming the reader has implemented (or at least understood the principles behind) a compiler/interpreter in their past. Expanding the function/variable names, splitting "complicated" expressions out into their own functions, etc., does not magically improve the uninitiated's chance of understanding what is going on

      • ajkjk 11 years ago

        I mostly strongly disagree.

        I see no value in naming a variable 'tk', 'pp', or 'bt'. It can only help to make the code more readable with less context. I do not need to understand compilers in detail to know what this program is doing, except, for the names being useless. And if I do understand them but have not spent half an hour or probably much more to digest the exact system by which it operates, I would be completely unable to identify bugs or talk about modifying or extending it in any useful way.

        Well written comments do not make the code below them more intelligible as text even if they do give you good hints for how to load it into your brain.

        And "Expanding the function/variable names, splitting "complicated" expressions out into their own functions, etc." all ABSOLUTELY improve the uninitiated's chance of understanding what is going on. Speaking as one of the uninitiated (though I can mostly make out how this works). Man, I wish I knew what 'tk' was. That whole section at line 204 would be so clear if I could follow the intent.

        I agree with you on symbolic constants and arithmetic though.

        • vajrabum 11 years ago

          As you wish. From the program, line 19:

              tk,       // current token
          • tehwalrus 11 years ago

            replace all: tk -> current_token

          • ajkjk 11 years ago

            yeah, my point is that should be swapped out everywhere. There's no reason not to.

      • tehwalrus 11 years ago

        Writing expressions concisely is an interesting problem, but in order to do, for example, tensor algebra in C you basically must use macros to define a DSL. It is not a goal which can always be achieved, but structures higher than variable names are what allow one to achieve it (usually.)

        I didn't say that "verbose" variable names were mandatory everywhere - i and j have their place - but names which are at least pronounceable words are essential, especially if they appear in more places than a five line function.

        This project is a special case, certainly, but toy compilers are nothing if not to learn from.

    • emmanueloga_ 11 years ago

      I used to rather long names too... but now I think short variables have their place, and sometimes they even improve readability.

      This code seems to follow a lot of conventions (if I see the var "i" I could bet a million dollars is an int that is being used as a counter, probably to go through the positions of an array). It uses plenty of enumerated constants, which is good too.

      I've been doing some functional programming, where you find that often types are more important than names. See the "Names are overrated" section of this article [1] ( although this point may not apply to this piece of software... C being the language that it is :p).

      1: http://techblog.realestate.com.au/the-abject-failure-of-weak...

      • tehwalrus 11 years ago

        I'm not against local variables for counting called i and j.

        I'm against a big list of forward declarations, with tens of different variables, each with a very short name, and a comment explaining what this is an abbreviation for. Just replace the names with the comments, with underscores for spaces, using find and replace. two minute job for the whole code base, much more readable code almost everywhere.

        I agree that functional languages may be a counter case; but codebases in C and python (in my experience), benefit greatly from well named variables.

  • privong 11 years ago

    > Writing software is not a contest for who can write the most amount of code in the most cryptic way.

    It can be: http://ioccc.org/

  • mturmon 11 years ago

    > This list could go on forever...

    Yes it could. While you are adding to your list of coding rules, the OP will have written another fun, tiny compiler.

    Who is having more fun?

    • marcofiset 11 years ago

      Those are not rules, they are mostly common sense to help the next programmer who has to maintain your code.

      • afandian 11 years ago

        And most common sense of all is that you choose your rules for the audience. This is obviously not production code.

        In any case if someone were competent enough to work on it then the style is actually quite readable. It's even full of comments.

  • cbhl 11 years ago

    I actually found the code surprisingly easy to read; "tk" stands for "token", "ty" for "type", and so forth.

    It's worth noting that compilers don't pop out of thin air -- you have to start with something simple in order to compile a more complicated compiler. Bootstrapping your own self-hosting compiler is a useful academic exercise, and you should try it sometime if you haven't already: http://en.wikipedia.org/wiki/Bootstrapping_(compilers)

    • marcofiset 11 years ago

      Well, if this particular compiler is defined in 4 functions, why couldn't it be made out of more functions, enhancing the readability and maintainability of the code?

      • afandian 11 years ago

        Who says it's designed to be maintainable? Not everyone writes code for the reason that everyone else does.

      • cbhl 11 years ago

        There is no reason why couldn't it be made out of more functions. Absolutely none. In fact, you can fork the repository and do the refactoring yourself, right now.

        If I had to guess, the functions are tied pretty closely to how the author is parsing the file; "next" (next token), "expr" (expression), "stmt" (statement), and "main".

        As for the project being called "C in 4 functions"; at best, I'd argue that's just a linkbait-y title since it's not actually C (it's a subset). I don't have a problem with the code _per se_; just the title.

  • fla 11 years ago

    The goal here is to write a compiler that can compile itself.

    • jessaustin 11 years ago

      Yeah I loved this part:

        ./c4 c4.c c4.c hello.c
      

      Granted, most compilers can do the equivalent, but it's rarely so simple to invoke.

  • more_original 11 years ago

    > I honestly think this is ridiculous. Sure, this is an incredible feat, and congrats.

    It's not that bad (or difficult), really. It's a hand-written parser for a subset of C that emits assembly code right away. This is how compilers like Turbo Pascal used to work (see http://www.pcengines.ch/tp3.htm for an explanation of what's happening).

    Sure, you could apply cosmetic changes like making "*++e = bla;" into "emit(bla);" and you could move the cases into independent methods (that are used once), but this isn't meant to be a state-of-the art compiler and it won't become one if one applies best practices to it.

    • sehugg 11 years ago

      I don't think it supports forward references either, which also makes it more like Pascal.

  • karlgrz 11 years ago

    I honestly think this comment is ridiculous. Sure, this is an incredible feat, and congrats. But seriously, I would be ashamed to publish such unreadable English under my name. What about spelling all the words correctly? What about not being an asshole and criticizing someone just because? This list could go on forever... Writing comments is not a contest for who can write the most amount of bullshit in the most non-constructive way.

  • PavlovsCat 11 years ago

    Writing software is not a contest, period :) It may be for some, it may be for you, but you don't get to sign up other people for it without their consent.

  • nikki93 11 years ago

    But, "it is only as an aesthetic phenomenon that existence and the world are eternally justified." This piece of code can be viewed aesthetically. Artificial guidelines take the fun out of coding.

  • fsloth 11 years ago

    Look, one aspect of programming is producing production code that is maintainable, etc.

    There are other aspects that are equally as important, and sometimes even more so. Exploring, learning and prototyping are among them, as are "back of the envelope" constructs.

    This is a really cool example that shows how far you can go with tiny amounts of code.

    Pithyness is actually one way to make code _more_ understandable, given that the reader is familiar with the subject area.

    Don't be a snob. The world is a better place because someone wrote this and not worse off.

    Since compiler construction is kinda a deep technical field, documenting this in pedagocially sane way would have been a huge task.

    I'm happy the author took the effort of writing and publishing this piece, it would be sad if he hadn't just because it's missing an embedded tutorial.

    He's not asking you to maintain it. There are references in this thread that explain what is going on...

    I write and maintain C++ production code as a day job and some of that stuff is an order of magnitude harder to grok than this (no, it's not documented in any _helpfull_ way either).

  • boomlinde 11 years ago

    > Writing software is not a contest for who can write the most amount of code in the most cryptic way.

    Except when it is.