"If it took that much thrashing to get it right, you’d expect it to do something pretty deep right? Maybe a low-level hardware interface or .. I’m talking, of course, about an automated code formatter."
There was once a programmer who was attached to the court of the warlord of Wu. The warlord asked the programmer: "Which is easier to design: an accounting package or an operating system?"
"An operating system," replied the programmer.
The warlord uttered an exclamation of disbelief. "Surely an accounting package is trivial next to the complexity of an operating system," he said.
"Not so," said the programmer, "When designing an accounting package, the programmer operates as a mediator between people having different ideas: how it must operate, how its reports must appear, and how it must conform to the tax laws. By contrast, an operating system is not limited by outside appearances. When designing an operating system, the programmer seeks the simplest harmony between machine and ideas. This is why an operating system is easier to design."
The source of that particular one is one of those classic pieces of hacker (not to be confused with cracker) culture that has been floating around for decades.
Reading one book (The Tao of Programming) would be sufficient (but not necessary) to be familiar with a lot of quotes similar to this one. Excerpts from that book are posted pretty often. I think that I probably ran into Tao of Programming around the time that I found The Jargon File, and some other things connected to classic hacker culture.
> where do people find quotes like this? Just reading a ton of books?
No lie, I've gotten more than one good book recommendation (including "Tao of Programming") from '/usr/games/fortune'. Also, if you find a book fascinating or highly informative, skim the bibliography for more like it.
I always disagree with the people that dismiss "CRUD-apps" or languages/tools made for "simply (ja!) crud code!".
I do CRUD for life. Is NOT simply. Ever.
Sometimes, for relax, I toy with language design. Maybe I must do some assembler or APL. Certainly easier that try to implement the crazy scheme of "discounts" of my customers!
I do quite a lot of integration with our Point of Sale, directly through it's SQL back-end. Occasionally I represent partial or all inventory in a separate DB which is easier to deal with, isn't subject to their odd schema, and so I can tie it in more directly with lots of other data we've acquired.
Occasionally the owner brings up that maybe we should just write our own POS, to which I very quickly back-pedal and let him know that's not a good idea. As much as the actual inventory tracking isn't that hard compared to what we've done, the accounting portion is not something I want to touch with a ten-foot pole. Let someone else make sure the accounting reports we do our taxes from don't have odd bugs, and spread the gains of bug-testing and fixing across thousands of customers. I've already got plenty on my plate before trying to take on a massive project with huge downsides like that.
> Certainly easier that try to implement the crazy scheme of "discounts" of my customers!
Any billing system that can correctly account for all the various ways businesses and sales departments want to display, link, sell and promote services over time quickly approaches a complexity which is hard to to explain, and without conscious effort also quickly becomes the domain of a few select people that have bothered to actually dive into the mess, leading to silo'd expertise.
Additionally, if you try to make it easy for non-programmers to use, you quickly end up with a DSL which while it may get the job done admirably, is necessarily powerful and expressive enough that you end up having pseudo programmers to write and read them anyway.
> you quickly end up with a DSL which while it may get the job done admirably, is necessarily powerful and expressive enough that you end up having pseudo programmers to write and read them anyway.
Yes! Now I truly converting that "toy with language design" to actually "maybe use it for my apps!".
P.D: And yes, I know is good to use any lang already made, like Lua or TCL or similar. But CRUD-apps are not very well served by most languages today. If wanna know what kind of language is almost ideal check it FoxPro (or the dbase family)
My start as a professional programmer (freelance is one would call it now) was with end user applications. I am talking mid-80s.
What I took away was the required level off fraud that was demanded in the sense that each business I coded for wanted a way to cook the books. Driving schools, bookshops you name it. As soon as the program was about ready and they were happy with the interface and the reports, they wanted 2 distinct outputs: one for the tax office and one for themselves.
So yes, coding (parts of) Operating Systems is really relaxing and satisfying.
Pretty-printing is tough. That said, please don't reinvent the wheel. There is research that has gone into this that should make most of this stuff pretty straightforward. I personally recommend Wadler's "A Prettier Printer" [0] (although credit goes to Hughes for laying a lot of the groundwork [1]). It too uses an IR and has several possible heuristics for rendering.
I've been using an implementation of it [2] with a lot of success for pretty printing Rust code [3].
It sounds like perhaps a case of being so focused on whether the program could be built that they didn't stop to ask whether it should be built.
Automatic code formatters don't need to be perfect or complete; they simply must format good code well. If bad code (as many of the difficult examples were) formats poorly, that's just another reason for people to write better code.
You're right that "good enough" code formatting is often good enough. On the other hand, a tool that does better formatting in more cases is more likely to be used. Also, sometimes code refactoring is done automatically, so nobody is closely looking at the code.
I wrote a pretty-printer for bash scripts using PHP. Colouring the keywords was fun, but I moved on to other personal projects instead of tackling line breaks.
More recently I've been trying to learn Chinese, and one of the features of Pingtype is to put spaces between words.
"Most existing word processors and typesetting software cannot handle either [personal names or compound words]."
My method works, but I don't know who to give it to. This is Hacker News and there's people from all different backgrounds here, so I'll just throw it out there - if anyone is interested, please contact me.
You are correct that no one is interested in yet another translation Web application, especially since it does not say how it's different or better than the rest. That's all anyone got to see so far.
If you want to be successful in communicating your improved method for word segmentation to the world at large, extract the code, publish it, and document the relevant details what makes it better and how it compares against existing solutions.
I am a hobby computational linguist. "Contact me" is an outmoded concept, most of us are doing our collaboration in the open.
Some style guides have a line limit (including Dart). If you want to submit code without hand-editing, the formatter needs to do it.
On the other hand, Go's style guide doesn't have a line limit, so its formatter doesn't have to solve tricky problems like this. Programmers can add line breaks by hand if they get too long. (A nice side effect is that auto-formatting your code doesn't change line numbers.)
One of my side project was an auto formatter for R. There are some limits in existing formatters:
- I think most of them doesn't recognize multi line string literals, which is difficult if you consider the case that you can have "" in comments, comment symbol in "" string literals and line breaks. The only way to deal with it is to scan linearly with context.
- It's tricky to wrap a long line:
+ some points in a long line are more suitable for breaking points in logic level
+ but sometimes you want less lines and not to break too often, even it's more clear in logic. The lines could be just some parameter list that will be both well represented in one column or multi columns.
+ with nested code the natural indent position could be at the far right, which make each line very short if you stick to 80 columns rule.
After quite some efforts my code can deal with all the comments, multi line strings, all the operators I known (I need to separate unary and binary operators to determine whether to insert space), but the script take several seconds to run, and I haven't start to deal with indent. I probably can save some time if I do more optimization, but I don't have time to finish it now.
This python formatter talked about its algorithm, worth a read.
Cool post. Some insight into why this problem is so hard: what this post is describing seems to be an integer linear programming problem [1]. I.e. optimizing a linear cost function constrained by (convex) linear bounds with integer-valued variables. The reason it's so difficult is that ILP is an NP-hard problem. Finding the right way to represent program source is also tricky, but, as the post says, doing so in a way that caters to the extremely performance-sensitive solver code is the really difficult bit. A better approach might be to produce an explicit ILP program and use an ILP solver to decide where the line breaks should go. As with many NP-hard problems (e.g. SAT, TSP), there are very good solvers these days that are much faster than anything you could ever hope to write yourself – and they produce fully optimal solutions.
> As with many NP-hard problems (e.g. SAT, TSP), there are very good solvers these days that are much faster than anything you could ever hope to write yourself – and they produce fully optimal solutions.
And if it's not fast enough, you can use one of a number of relaxations to a standard LP and call it a day.
> There are thirteen places where a line break is possible here according to our style rules. That’s 8,192 different combinations if we brute force them all
This is why the language should be designed with a formatter in mind from the beginning, as Go was designed. Just enough mustaches to make formatters accurate and fast. How many possibilities should there be? Exactly one.
"There are thirteen places where a line break is possible here according to our style rules. That’s 8,192 different combinations if we brute force them all. The search space we have to cover is exponentially large..."
Sounds like this might be a good candidate for some AI methods which are not intimidated by such large search spaces.
Would there be AI methods that would produce deterministic results for code formatting? Deterministic output seems to be the main draw of a code formatter.
The code formatter described in the article is not deterministic either. The author admits the search space is far too large for him to brute-force it, so his formatter does the best it can, which might not get the optimal results:
"If the line splitter tries, like, 5,000 solutions and still hasn't found a winner yet, it just picks the best it found so far and bails."
This sounds like it would give no more guarantees of a deterministic result than AI methods would. They also try their best, only search a subset of the search space, and bail when user-defined criteria such as time, iteration, or space limits are reached.
Another approach could be to use both the author's hand-crafted algorithm and an AI method. There are at least two ways to go about that. The first would be to run the hand-crafted algorithm and pass its evaluation of the best solutions to seed an AI method (for example, as the seed populations in an evolutionary algorithm). The second would be to do the opposite, and run the AI method first, and seed the hand-crafted algorithm with the best solutions the AI method discovered.
I doesn't look like that stopping time is user-defined, it's a compiled in constant.
Given that, I don't see any reason why the algorithm should be non-deterministic just because it gives up after a set number of iterations. As long every time it sees a given string it tries the possibilities in the same order, scores them the same way, and stops at the same place then it should be deterministic.
By "user-defined", I meant to refer to the user of the algorithm (AI or not), who is the developer.
You're right that if the algorithm described in the article will always get the same result given the same input it will be deterministic in a way that the kinds of AI methods I'm thinking of (the stochastic kind) will not. That sort of non-determinism will be a problem if you expect your code to be formatted the same way every time it goes through the formatter.
There are lots of "AI methods" that can be used to produce deterministic results. In general, this sounds like a problem where combinatorial optimization techniques can be useful. For example, constraint programming, mixed integer programming, and various local search methods. For all techniques, there are both deterministic and non-deterministic approaches.
He is using the classic AI method of heuristic-guided graph search with search-space pruning. That still doesn't make the worst case tractable, but it does improve a lot on brute force enumeration.
I was thinking along similar lines, like some kind of reinforcement learning model that penalized lines that seem ugly to humans... but probably hard to get enough humans looking at enough code.
An active learning approach would be cool, where the formatter process a large corpus of code and asks a human operator for advice when it is especially uncertain about the best place to split.
Even if training the splitting policy from scratch is too hard, learning would be a good way to establish cost weights on different kinds of rule violations.
> For most of the time, the formatter did use dynamic programming and memoization. I felt like a wizard when I first figured out how to do it. It worked fairly well, but was a nightmare to debug.
Yes, but in this case it would have to be applied to the Dart language syntax rather than the formatter. At this point it's too late for the formatter to be simple.
"If it took that much thrashing to get it right, you’d expect it to do something pretty deep right? Maybe a low-level hardware interface or .. I’m talking, of course, about an automated code formatter."
This reminds me of the Tao of Programming 3.3, at http://www.mit.edu/~xela/tao.html , the relevant part of which I will copy here:
There was once a programmer who was attached to the court of the warlord of Wu. The warlord asked the programmer: "Which is easier to design: an accounting package or an operating system?"
"An operating system," replied the programmer.
The warlord uttered an exclamation of disbelief. "Surely an accounting package is trivial next to the complexity of an operating system," he said.
"Not so," said the programmer, "When designing an accounting package, the programmer operates as a mediator between people having different ideas: how it must operate, how its reports must appear, and how it must conform to the tax laws. By contrast, an operating system is not limited by outside appearances. When designing an operating system, the programmer seeks the simplest harmony between machine and ideas. This is why an operating system is easier to design."
Agreed. Crossing the boundary between the real world and the sanitized, virtual world takes a ton of effort.
where do people find quotes like this? Just reading a ton of books?
The source of that particular one is one of those classic pieces of hacker (not to be confused with cracker) culture that has been floating around for decades.
If you have not seen them, these are enlightening:
http://www.catb.org/jargon/html/koans.html
I especially like the power-cycle one, due to spending plenty of time in a lab debugging circuits.
Reading one book (The Tao of Programming) would be sufficient (but not necessary) to be familiar with a lot of quotes similar to this one. Excerpts from that book are posted pretty often. I think that I probably ran into Tao of Programming around the time that I found The Jargon File, and some other things connected to classic hacker culture.
> where do people find quotes like this? Just reading a ton of books?
No lie, I've gotten more than one good book recommendation (including "Tao of Programming") from '/usr/games/fortune'. Also, if you find a book fascinating or highly informative, skim the bibliography for more like it.
what's "/usr/games/fortune"?
https://en.wikipedia.org/wiki/Fortune_(Unix)
Oh, this one is great!
I always disagree with the people that dismiss "CRUD-apps" or languages/tools made for "simply (ja!) crud code!".
I do CRUD for life. Is NOT simply. Ever.
Sometimes, for relax, I toy with language design. Maybe I must do some assembler or APL. Certainly easier that try to implement the crazy scheme of "discounts" of my customers!
I do quite a lot of integration with our Point of Sale, directly through it's SQL back-end. Occasionally I represent partial or all inventory in a separate DB which is easier to deal with, isn't subject to their odd schema, and so I can tie it in more directly with lots of other data we've acquired.
Occasionally the owner brings up that maybe we should just write our own POS, to which I very quickly back-pedal and let him know that's not a good idea. As much as the actual inventory tracking isn't that hard compared to what we've done, the accounting portion is not something I want to touch with a ten-foot pole. Let someone else make sure the accounting reports we do our taxes from don't have odd bugs, and spread the gains of bug-testing and fixing across thousands of customers. I've already got plenty on my plate before trying to take on a massive project with huge downsides like that.
> Certainly easier that try to implement the crazy scheme of "discounts" of my customers!
Any billing system that can correctly account for all the various ways businesses and sales departments want to display, link, sell and promote services over time quickly approaches a complexity which is hard to to explain, and without conscious effort also quickly becomes the domain of a few select people that have bothered to actually dive into the mess, leading to silo'd expertise.
Additionally, if you try to make it easy for non-programmers to use, you quickly end up with a DSL which while it may get the job done admirably, is necessarily powerful and expressive enough that you end up having pseudo programmers to write and read them anyway.
> you quickly end up with a DSL which while it may get the job done admirably, is necessarily powerful and expressive enough that you end up having pseudo programmers to write and read them anyway.
Yes! Now I truly converting that "toy with language design" to actually "maybe use it for my apps!".
P.D: And yes, I know is good to use any lang already made, like Lua or TCL or similar. But CRUD-apps are not very well served by most languages today. If wanna know what kind of language is almost ideal check it FoxPro (or the dbase family)
My start as a professional programmer (freelance is one would call it now) was with end user applications. I am talking mid-80s. What I took away was the required level off fraud that was demanded in the sense that each business I coded for wanted a way to cook the books. Driving schools, bookshops you name it. As soon as the program was about ready and they were happy with the interface and the reports, they wanted 2 distinct outputs: one for the tax office and one for themselves. So yes, coding (parts of) Operating Systems is really relaxing and satisfying.
Pretty-printing is tough. That said, please don't reinvent the wheel. There is research that has gone into this that should make most of this stuff pretty straightforward. I personally recommend Wadler's "A Prettier Printer" [0] (although credit goes to Hughes for laying a lot of the groundwork [1]). It too uses an IR and has several possible heuristics for rendering.
I've been using an implementation of it [2] with a lot of success for pretty printing Rust code [3].
It sounds like perhaps a case of being so focused on whether the program could be built that they didn't stop to ask whether it should be built.
Automatic code formatters don't need to be perfect or complete; they simply must format good code well. If bad code (as many of the difficult examples were) formats poorly, that's just another reason for people to write better code.
You're right that "good enough" code formatting is often good enough. On the other hand, a tool that does better formatting in more cases is more likely to be used. Also, sometimes code refactoring is done automatically, so nobody is closely looking at the code.
If you are using it to enforce a formating over a code base, ok a good enough formater will do.
But one of the biggest use cases of a code formater is for understanding bad code. If it can't format bad code, that use case is gone.
I wrote a pretty-printer for bash scripts using PHP. Colouring the keywords was fun, but I moved on to other personal projects instead of tackling line breaks.
More recently I've been trying to learn Chinese, and one of the features of Pingtype is to put spaces between words.
http://pingtype.github.io
To my surprise, this article linked to a Wikipedia page about line wrapping, which says that line wrapping in CJK is unsolved.
https://en.wikipedia.org/wiki/Line_wrap_and_word_wrap#Word_w...
"Most existing word processors and typesetting software cannot handle either [personal names or compound words]."
My method works, but I don't know who to give it to. This is Hacker News and there's people from all different backgrounds here, so I'll just throw it out there - if anyone is interested, please contact me.
> if anyone is interested, please contact me
This is not how things are done. Just put your code and documentation on the public Web, then submit the link to Hackernews or other relevant forums.
Did that. Nobody noticed.
https://news.ycombinator.com/item?id=14907618
You are correct that no one is interested in yet another translation Web application, especially since it does not say how it's different or better than the rest. That's all anyone got to see so far.
If you want to be successful in communicating your improved method for word segmentation to the world at large, extract the code, publish it, and document the relevant details what makes it better and how it compares against existing solutions.
I am a hobby computational linguist. "Contact me" is an outmoded concept, most of us are doing our collaboration in the open.
Why do you necessarily need a line limit?
I would simply indent all chained function calls and be done with it. Eg.
Or just have an allowed limit for 1 liners, then indent all of it the way you suggested if it is surpassed.
Some style guides have a line limit (including Dart). If you want to submit code without hand-editing, the formatter needs to do it.
On the other hand, Go's style guide doesn't have a line limit, so its formatter doesn't have to solve tricky problems like this. Programmers can add line breaks by hand if they get too long. (A nice side effect is that auto-formatting your code doesn't change line numbers.)
One of my side project was an auto formatter for R. There are some limits in existing formatters:
- I think most of them doesn't recognize multi line string literals, which is difficult if you consider the case that you can have "" in comments, comment symbol in "" string literals and line breaks. The only way to deal with it is to scan linearly with context.
- It's tricky to wrap a long line: + some points in a long line are more suitable for breaking points in logic level + but sometimes you want less lines and not to break too often, even it's more clear in logic. The lines could be just some parameter list that will be both well represented in one column or multi columns. + with nested code the natural indent position could be at the far right, which make each line very short if you stick to 80 columns rule.
After quite some efforts my code can deal with all the comments, multi line strings, all the operators I known (I need to separate unary and binary operators to determine whether to insert space), but the script take several seconds to run, and I haven't start to deal with indent. I probably can save some time if I do more optimization, but I don't have time to finish it now.
This python formatter talked about its algorithm, worth a read.
https://github.com/google/yapf
Cool post. Some insight into why this problem is so hard: what this post is describing seems to be an integer linear programming problem [1]. I.e. optimizing a linear cost function constrained by (convex) linear bounds with integer-valued variables. The reason it's so difficult is that ILP is an NP-hard problem. Finding the right way to represent program source is also tricky, but, as the post says, doing so in a way that caters to the extremely performance-sensitive solver code is the really difficult bit. A better approach might be to produce an explicit ILP program and use an ILP solver to decide where the line breaks should go. As with many NP-hard problems (e.g. SAT, TSP), there are very good solvers these days that are much faster than anything you could ever hope to write yourself – and they produce fully optimal solutions.
[1] https://en.wikipedia.org/wiki/Integer_programming
> As with many NP-hard problems (e.g. SAT, TSP), there are very good solvers these days that are much faster than anything you could ever hope to write yourself – and they produce fully optimal solutions.
And if it's not fast enough, you can use one of a number of relaxations to a standard LP and call it a day.
> There are thirteen places where a line break is possible here according to our style rules. That’s 8,192 different combinations if we brute force them all
This is why the language should be designed with a formatter in mind from the beginning, as Go was designed. Just enough mustaches to make formatters accurate and fast. How many possibilities should there be? Exactly one.
"There are thirteen places where a line break is possible here according to our style rules. That’s 8,192 different combinations if we brute force them all. The search space we have to cover is exponentially large..."
Sounds like this might be a good candidate for some AI methods which are not intimidated by such large search spaces.
Would there be AI methods that would produce deterministic results for code formatting? Deterministic output seems to be the main draw of a code formatter.
The code formatter described in the article is not deterministic either. The author admits the search space is far too large for him to brute-force it, so his formatter does the best it can, which might not get the optimal results:
"If the line splitter tries, like, 5,000 solutions and still hasn't found a winner yet, it just picks the best it found so far and bails."
This sounds like it would give no more guarantees of a deterministic result than AI methods would. They also try their best, only search a subset of the search space, and bail when user-defined criteria such as time, iteration, or space limits are reached.
Another approach could be to use both the author's hand-crafted algorithm and an AI method. There are at least two ways to go about that. The first would be to run the hand-crafted algorithm and pass its evaluation of the best solutions to seed an AI method (for example, as the seed populations in an evolutionary algorithm). The second would be to do the opposite, and run the AI method first, and seed the hand-crafted algorithm with the best solutions the AI method discovered.
The chosen quote doesn't indicate whether or not it's deterministic.
...But if, for the same input, it always tries the same candidates in the same order, then it is deterministic.
I doesn't look like that stopping time is user-defined, it's a compiled in constant.
Given that, I don't see any reason why the algorithm should be non-deterministic just because it gives up after a set number of iterations. As long every time it sees a given string it tries the possibilities in the same order, scores them the same way, and stops at the same place then it should be deterministic.
By "user-defined", I meant to refer to the user of the algorithm (AI or not), who is the developer.
You're right that if the algorithm described in the article will always get the same result given the same input it will be deterministic in a way that the kinds of AI methods I'm thinking of (the stochastic kind) will not. That sort of non-determinism will be a problem if you expect your code to be formatted the same way every time it goes through the formatter.
There are lots of "AI methods" that can be used to produce deterministic results. In general, this sounds like a problem where combinatorial optimization techniques can be useful. For example, constraint programming, mixed integer programming, and various local search methods. For all techniques, there are both deterministic and non-deterministic approaches.
Set the random seed?
He is using the classic AI method of heuristic-guided graph search with search-space pruning. That still doesn't make the worst case tractable, but it does improve a lot on brute force enumeration.
I was thinking along similar lines, like some kind of reinforcement learning model that penalized lines that seem ugly to humans... but probably hard to get enough humans looking at enough code.
I wonder if well-respected real-world code samples could be used as a training and testing set.
For example, one could use a subset of the Linux kernel source code as training, and another (unseen) subset as testing.
Highly rated projects on github might work too.
An active learning approach would be cool, where the formatter process a large corpus of code and asks a human operator for advice when it is especially uncertain about the best place to split.
Even if training the splitting policy from scratch is too hard, learning would be a good way to establish cost weights on different kinds of rule violations.
I remember this coming up before :) https://news.ycombinator.com/item?id=10195091
ninja edit: I mostly jump remembered the picture of Robert with his/a dog and the text, “Hi! I'm Bob Nystrom, the one on the left.”
> That means adding line breaks (or “splits” as the formatter calls them), and determining the best place to add those is famously hard.
Naive question here: What is so hard? It can be solved with dynamic programming, right? Doesn't he even link to solutions of the problem?
> For most of the time, the formatter did use dynamic programming and memoization. I felt like a wizard when I first figured out how to do it. It worked fairly well, but was a nightmare to debug.
http://suckless.org/philosophy
Yes, but in this case it would have to be applied to the Dart language syntax rather than the formatter. At this point it's too late for the formatter to be simple.
Please mark with (2015).
Sure. Thanks.