Snake oil. Good to read for sure. Seems all plausible too. But snake oil nevertheless.
Here's why: The slot machine can drop any hard requirement that you specify in your AGENTS.md, memory.md or your dozens of skill markdowns. Pretty much guaranteed.
These harness approaches pretend that LLMs are strict, perfect rule followers and that the only problem is our inability to specify enough rules clearly enough. That's a fundamental misunderstanding of how LLMs operate.
That leaves only one option, not reliable but more reliable nevertheless: human review and oversight. Possibly two rounds of it, one after the other.
Everything else is snake oil. But at that point you also realize that promised productivity gains are also snake oil, because reading code and building a mental model is way harder than having a mental model and writing it into code.
Snake oil may be a bit strong, because snake oil never works (except maybe as placebo?) whereas anything with an LLM, even though stochastic, has a pretty high chance of working.
> ... you also realize that promised productivity gains are also snake oil because reading code and building a mental model is way harder than having a mental model and writing it into code.
Not really, though it depends on the code; reading code is a skill that gets easier with practice, like any other. This is common any time you're in a situation where you're reading much more code than writing it (e.g. any time you have to work with a large, sprawling codebase that existed long before you touched it).
What makes it even easier, though, is if you're armed with an existing mental model of the code, either gleaned through documentation, or past experience with the code, or poking your colleagues.
And you can do this with agents too! I usually already have a good mental model of the code before I prompt the AI. It requires decomposing the tasks a bit carefully, but because I have a good idea of what the code should look like, reviewing the generated code is a breeze. It's like reading a book I've read before. Or, much more rarely, there's something wrong and it jumps out at me right away, so I catch most issues early. Either way, the speed-up is significant.
A “pretty high chance” isn’t the impression the end user is often given, or the intent they’re sold.
Indeed, and it is a complicated problem to solve. A GUI or CLI can hide footguns or make them less likely to be misused. But an AI agent is perfectly happy to use a wrecking ball to drive a nail, without any second thought or confirmation.
It’s a human articulation problem.
When it receives generic, vague input, it is free to interpret it according to how its corpus fires, like in any human interaction.
Articulating better is like writing a sentence that will stand the test of model updates.
Even then. I don’t have an example off the top of my head but even perfectly clear sentences can lead the agent to strange places. Even between humans, miscommunication is easy, but then anyone sensible would ask for confirmation if their interpretation is weird. But the LLM very rarely questions the user.
I don’t think it’s fair to blame the user here. The tool has to work when operated by normal users.
I'm trying to think of other types of tooling that normal users can all use equally well, or in the best ways possible.
> has a pretty high chance of working.
for MVPs, mock-ups, prototypes, or in the hands of an expert coder. You can't let them go unsupervised. The promise of automated intelligence falls far short of the reality.
Not only "has a high chance of working", but you can pay more to make it more reliable. It really is striking trying to run a harness openClaw thing on a smaller or quantised model, really makes you realise how much we take for granted from SOTA models that was totally impossible just a year ago, in terms of complex, generally reliable tool use.
I think the placebo effect might be a decent comparison. It works most of the time, and you don't worry about it as long as you fully believe in its efficacy. However, once the illusion is shattered, the positive effects are diminished, and you can never fully trust the solution again.
Humans also regularly drop hard requirements you specify, and similarly require review. Nevertheless, we manage to increase the reliability of human output through processes and reviews, and most of the methods we use for harnesses are taken from our experience reducing reliability issues in humans, who are notoriously difficult to get to deliver reliably.
The primary way to increase reliability is to automate: instead of humans producing some output manually, humans produce machines which produce that output.
I've seen a disturbing trend where a process that could've been a script or a requirement that could've been enforced deterministically is in fact "automated" through a set of instructions for an LLM.
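To make the distinction concrete: a rule like "never commit TODO markers" can live in AGENTS.md, where the model may or may not honor it, or in a hook, where it holds every time. A toy sketch of the deterministic version (the rule and all names are invented for illustration):

```python
#!/usr/bin/env python3
"""Toy pre-commit check: enforces a rule deterministically instead of
asking an LLM to remember it. Everything here is a made-up example."""
import subprocess
import sys

def staged_files() -> list[str]:
    # Ask git for the files staged in this commit.
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line]

def main() -> int:
    offenders = []
    for path in staged_files():
        try:
            text = open(path, encoding="utf-8", errors="ignore").read()
        except OSError:
            continue
        if "TODO" in text:
            offenders.append(path)
    if offenders:
        print("commit blocked, TODO markers in:", *offenders, sep="\n  ")
        return 1  # the rule holds 100% of the time; no prompt involved
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

The same rule written as a markdown instruction is followed only probabilistically.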
Sure, when that is possible. However, there are lots of processes we don't know how to automate in a deterministic way. Hence the vast amount of investment in building organisations of people, with mechanisms to make people's output more reliable through structure, reviews, and so on.
Large parts of human civilization rest on our ability to make something unreliable less unreliable through organisational structure and processes.
Underrated comment.
So many applications of LLMs start from a deterministic mindset while using a non-deterministic LLM, and then people wonder why it’s not working.
We resolve that through liability, penalties, trust, responsibility, review and oversight.
At the end of the day, if I am spending $X on automation, I want to be able to sleep at night knowing my factory will not build a WMD or delete itself.
If it's simply a tool that is a multiplier for experts, then do I really need it? How much does it actually make my processes more efficient, faster, or more capable of earning revenue?
There is a LOT that is forgiven when tech is new - but at some point the shiny newness falls off and it is compared to alternatives.
Liability, penalties, trust, and responsibility are means we use to try to influence the application of the processes that do. They do not directly affect reliability. They can be applied just as much to a team using AI as to one that does not.
Review and oversight do address reliability directly, which is why we make use of them to improve the reliability of mechanical processes as well, and why they are core elements of AI harnesses.
> If it's simply a tool that is a multiplier for experts, then do I really need it? How much does it actually make my processes more efficient, faster, or more capable of earning revenue?
You can ask the same thing about all the supporting staff around the experts in your team.
> There is a LOT that is forgiven when tech is new - but at some point the shiny newness falls off and it is compared to alternatives.
Only teams without mature processes are not doing that for AI today.
Most of the deployments of AI I work on are the outcome of comparing it to alternatives, and often are part of initiatives to increase reliability of human teams just as much as increasing raw productivity, because they are often one and the same.
> Liability, penalties, trust, and responsibility are means we use to try to influence the application of the processes that do. They do not directly affect reliability. They can be applied just as much to a team using AI as to one that does not.
Yes and no. See the next point.
> You can ask the same thing about all the supporting staff around the experts in your team.
I have a good idea of the shape of errors for a human-based process, its costing, and the type of QA/QC team that has to be formed for it.
We have decades, if not centuries of experience working with humans, which LLMs are promising to be the equivalents/superiors of.
I think you and I would both agree with the statement "use the right tool for the job".
However, the current hype cycle has created expectations of reliability from LLMs that drive 'Automated Intelligence' styled workflows.
On the other hand:
> part of initiatives to increase reliability of human teams
is a significantly more defensible use of LLMs.
For me, most deployments die on the altar of error rates. The only people who are using them to any effect are people who have an answer to "what happens when it blows up" and "what is the cost if something goes wrong".
(there is no singular thread behind my comment. I think we probably have more in agreement than not, and it's more a question of finding the precise words to declare the shapes we perceive.)
> (there is no singular thread behind my comment. I think we probably have more in agreement than not, and it's more a question of finding the precise words to declare the shapes we perceive.)
I moved this up top, because I agree, despite the length of the below:
> However, the current hype cycle has created expectations of reliability from LLMs that drive 'Automated Intelligence' styled workflows.
Because for a lot of things it works. Today. I have a setup doing mostly autonomous software development. I set direction. I don't even write specs. It's not foolproof yet by any means - that is on the edge of what is doable today. Dial it back just a little bit, and I have projects in production that are mostly AI written, that have passed through rigorous reviews from human developers.
The key thing is that you can't "vibecode" that. I'm sure we agree there.
There needs to be a rigorous process behind it, and I think we'll agree on that too.
Those processes are largely the same as the processes required for human developers. Only for human developers do we leave a lot of that process "squishy" and under-specified.
We trust our human developers to mostly do the right thing, even though many don't, and to not need written checklists and controls, even though many do.
What is coming out of this is the start of systems that codify processes that are very much feels-based with human teams. Partly because we still need to codify them for AI, but also because we can: most people wouldn't want to work in the kind of regimented environment we can enforce on AI.
Sure, there is a lot of hype from people who just want to throw random prompts at an LLM and get finished software out. That is idiocy. Even a super-intelligent future AI can't read minds.
But there are a lot of people building harnesses to wrap these LLMs in process and rigor to squeeze as much reliability as possible from them, and it turns out you can leverage human organisational knowledge to get surprisingly far in that respect.
> Because for a lot of things it works. Today. I have a setup
> There needs to be a rigorous process behind it, and I think we'll agree on that too.
I would simplify it to: “I have a setup” is the part that is doing the actual heavy lifting.
From my very unscientific survey / extensive pestering of network, the only people getting lift out of AI are people with both domain expertise/experience and familiarity with the tooling.
The types of automation I see people wanting, though, are fully automated customer support systems and fully automated document review: essentially white-collar dark factories. (Hey, that's a good term.) The need is for a process that is stable and behaves the same way every time.
It seems actual AI use cases are more like sketching: if you have enough skill, you can tell the rough sketch is unbalanced and won’t resolve into a good final piece. Non-experts spend far more time exploring dead ends because they don’t have the experience.
In my opinion, it’s a force multiplier for experts or stable processes, and it’s presented as Intelligence.
I feel your examples fit within these boundaries as well as the ones you have described.
I would agree with all of this. We could argue over whether/when there's sufficient intelligence for fully autonomous systems, but those systems will keep being tools for experts for the foreseeable future, and the question is just how small or large the autonomous components of that are, not whether or not you still need experts to wield them.
it's strange to see software engineers using skills, a.k.a. human descriptions of small scripts, instead of scripting things directly. often there have been clis / tools / libraries to do what a skill does for many years. maybe it's a culture issue: people who enjoy automation / devops / predictability will naturally help themselves, but other people just want to "delegate" and be done without trying.
When people do that, they are using skills wrong. The best way to use a skill is as a means to give targeted instructions on how to make use of clis / tools / libraries, with the skill just covering the "squishy bits" that aren't easily encoded into something deterministic.
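Concretely, a minimal, entirely hypothetical SKILL.md in that spirit might look like this; the deterministic work lives in an existing script, and the skill only covers the judgment calls (the script name and steps are invented for illustration):

```markdown
---
name: release-notes
description: Draft release notes from merged PRs since the last tag.
---

# Release notes

1. Run `scripts/changelog.sh <last-tag>` to get the raw list of merged
   PRs. This part is deterministic; do not reconstruct it by reading
   git log yourself.
2. Group the entries into Features / Fixes / Internal. When a PR title
   is ambiguous, read its description before deciding; if still unclear,
   put it under Internal and flag it for the human reviewer.
3. Never invent entries that are not in the script's output.
```

Step 1 is the scripted part; steps 2 and 3 are the "squishy bits" only a skill can express.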
Everything you say is all possible, and in theory I agree with you.
However, I have been using spec-kit (which is basically this style of AI usage) for the last few months and it has been AMAZING in practice. I am building really great things and have not run into any of the issues you are talking about as hypotheticals. Could they eventually happen? Sure, maybe. I am still cautious.
But at some point once you have personally used it in practice for long enough, I can't just dismiss it as snake oil. I have been a computer programmer for over 30 years, and I feel like I have a good read on what works and what doesn't in practice.
We can build all the scaffolding we want around them, but I assure you the fundamental problem here is that LLMs aren't perfect rule-following machines, and that will remain.
Give it a few more months and I'm sure you'll see some of what I see if not all.
I'm saying all of the above having tried and tested all sorts of systems with AI; that experience is what leads me to say what I said.
I have been doing this for 6 months or so now, and I am not sure that having a lot more experience than me would make your assessment more accurate, since that just means you have more experience with prior generations of the models. What I have experienced is that the AI has been getting better and better, and is making fewer and fewer mistakes.
Now, part of that is my advancements as well, as I learn how to specify my instructions to the AI and how to see in advance where the AI might have issues, but the advancements are also happening in the models themselves. They are just getting better, and rapidly.
The combination of getting better at steering the AI along with the AI itself getting better is leading me to the opposite conclusion from yours. I have production systems that I wrote using spec-kit, that have been running in production for months, and have been doing spectacularly. I have been able to consistently add the new features that I need to, without losing any cohesion or adherence to the principles I have defined. Now, are there mistakes? Of course, but nothing that can't be caught and fixed, and not at a higher rate than traditional programming.
> the fundamental problem here is that LLMs aren't perfect rule-following machines
I kind of get what you're saying, but let us not pretend that SW engineers are perfect rule followers either.
Having a framework to work within, whether you are an LLM or a human, can be helpful.
If someone regularly ignored critical instructions even though they were written down and had been told to follow them, that person would be fired.
People are excused all the time for things because they are elevated in other areas. It's about their value as a whole and that's where we are with LLMs. They aren't perfect but they do plenty we can't which means they are worth using.
i think it depends on your goals and also your preferences / expectations how your experience with LLMs goes. i don't mind if they hallucinate. even if i have a mental model of the code i won't write it perfectly myself either.
the only downside i see is getting out of practice, which is why i don't use it for my passion projects. work is just work, and pressing 1 or 2 and having 'good enough' can be a fine way to get through the day. (lucky me, i don't write production code ;D... goals...)
> Give it a few more months
By that time, they will have realized immense value before seeing some of what you see. Sounds like an endorsement of spec-kit.
> The slot machine can drop any hard requirement that you specify in your AGENTS.md, memory.md or your dozens of skill markdowns. Pretty much guaranteed.
Indeed. That said, I’ve had some success with agent skills, but I use them to make the LLM aware of things it can do using specific external tools. I think it is a really bad idea to use this mechanism to enforce safety rules. We need good sandboxing for this, and promises from a model prone to getting off the rails are not a good substitute.
But I have taught my coding agent to use some ad hoc tools to gather statistics from a directory containing experimental data, and things like that. Nobody is going to fine-tune an LLM specifically for my field (condensed matter physics), but using skills I can still make it do useful work. Like monitoring simulations where some runs can fail for various reasons, and each time we must choose whether to run another iteration or restart from a previous point, based on eyeballing the results ("the energy is very strange, we should restart properly and flag for review if it is still weird", that sort of thing). I don’t give too many rules to the agent; I just give it ways of solving specific problems that may arise.
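To give a flavour, here is roughly the shape of one of those ad hoc tools; the agent calls it and reasons over the summary instead of the raw files. All paths, file names, and the threshold are invented for this sketch; the real thing depends on our simulation output format:

```python
#!/usr/bin/env python3
"""Hypothetical sketch of a monitoring helper a skill can point at.
It summarises run directories so the agent decides "iterate, restart,
or flag for review" from numbers rather than from eyeballing files."""
import json
import sys
from pathlib import Path
from statistics import mean

def summarise(run_dir: Path) -> dict:
    # One energy value per line; the file name is invented for the example.
    energies = [
        float(line) for line in (run_dir / "energy.dat").read_text().split()
    ]
    tail = energies[-100:]
    drift = abs(tail[-1] - tail[0]) / (abs(mean(tail)) or 1.0)
    return {
        "run": run_dir.name,
        "steps": len(energies),
        "final_energy": energies[-1],
        # Crude made-up heuristic: >5% relative drift over the tail
        # counts as "the energy is very strange", i.e. flag for review.
        "suspicious": drift > 0.05,
    }

if __name__ == "__main__":
    for d in sorted(Path(sys.argv[1]).iterdir()):
        if d.is_dir() and (d / "energy.dat").exists():
            print(json.dumps(summarise(d)))
```

The skill then just says when to run it and what to do with each outcome.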
Do you have any information on skills you've found useful here?
Not really, unfortunately. I took some inspiration from existing skills, mostly in the official GitHub repo https://github.com/agentskills/ . But mostly I had to come up with them myself. I tried to use Claude to help, but it was not that useful.
I hope the only reason people are pretending these markdown suggestions are a "workflow" is fear that a more structured approach will be obsolete by the time it's polished. I can't imagine the pace of innovation with the underlying models will stay like this forever.
I hope to see harnesses that will demand instead of ask. Kill an agent that was asked to be in plan mode but did not play the prescribed planning game. Even if it's not perfect, it'd have to be better than the current regime when combined with a human in the loop.
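A sketch of the idea, with invented names; the point is that the gate is enforced by the harness mechanically, not requested in the prompt:

```python
# Hypothetical "demand instead of ask" gate: every tool call the agent
# emits passes through here, and a violation kills the session rather
# than nudging the model. All names are made up for this sketch.
from dataclasses import dataclass

READ_ONLY_TOOLS = {"read_file", "list_dir", "grep", "submit_plan"}

@dataclass
class ToolCall:
    name: str
    args: dict

class PlanModeViolation(Exception):
    pass

class PlanModeGate:
    def __init__(self) -> None:
        self.plan_submitted = False

    def check(self, call: ToolCall) -> ToolCall:
        if call.name == "submit_plan":
            self.plan_submitted = True
        if not self.plan_submitted and call.name not in READ_ONLY_TOOLS:
            # Don't ask the model to behave; terminate the run.
            raise PlanModeViolation(
                f"agent attempted '{call.name}' before submitting a plan"
            )
        return call
```

The runner catches the exception, discards the transcript, and restarts, so the human in the loop only ever reviews runs that actually played the planning game.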
Don't let the perfect be the enemy of the good. Of course we know the AGENTS.md and skills aren't 100% effective. But no, it doesn't mean that they're 0% effective.
It helps if you hand them both to the original agent as strong guidance and then to an adversarial agent as a quality reviewer. The adversarial agent is more likely to loop the work back if it fails the validation criteria.
I do find that just asking the same agent to do and check its own work is not particularly reliable.
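Roughly, the loop has this shape (the two callables stand in for two separate agent sessions; all names here are hypothetical):

```python
# Sketch of a worker / adversarial-reviewer loop. The reviewer sees only
# the criteria and the produced diff, never the worker's reasoning, and
# failing work is looped back with the reviewer's objections attached.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    passed: bool
    reasons: str

MAX_ROUNDS = 3

def produce(
    task: str,
    criteria: list[str],
    run_worker: Callable[[str, list[str], str | None], str],
    run_reviewer: Callable[[str, list[str]], Verdict],
) -> str:
    feedback: str | None = None
    for _ in range(MAX_ROUNDS):
        diff = run_worker(task, criteria, feedback)  # agent 1: does the work
        verdict = run_reviewer(diff, criteria)       # agent 2: adversarial check
        if verdict.passed:
            return diff
        feedback = verdict.reasons                   # loop the work back
    raise RuntimeError("still failing validation after retries; escalate")
```

Keeping the reviewer in a separate session is what makes it adversarial; the same agent grading its own homework tends to pass itself.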
This is like saying a +5 sword is useless because you still miss on a one. We’ve got to think about expected outcomes: if she’s merging five solid PRs to your three, loudly complaining about the one she saw was rubbish and threw away misses the point.
I can see why this would seem to be “snake oil” logically. However, this approach does work in reality. Your comment just shows that you seem inexperienced with using generative AI.
> That leaves only one option, not reliable but more reliable nevertheless: human review and oversight.
Couldn't non-manual oversight also help, e.g. sandboxes?
All these points apply to human devs as well. The test is not infallibility but magnitude.
A slot machine isn't snake oil.
Slot machines give you rewards when the stars align; snake oil never does :)
All this said, I quite like the mental model of documenting a simple process, and I suspect our future ai overlords will find it useful that I have a series of md files that outline my preferences and processes for certain tasks.
I am not however going to share any of this with work colleagues and make myself redundant.