Show HN: Libretto – Making AI browser automations deterministic
Libretto (https://libretto.sh) is a skill + CLI that makes it easy for your coding agent to generate deterministic browser automations and debug existing ones. The key shift is going from "give an agent a prompt at runtime and hope it figures things out" to "use coding agents to generate real scripts you can inspect, run, and debug".
Here’s a demo: https://www.youtube.com/watch?v=0cDpIntmHAM. Docs start at https://libretto.sh/docs/get-started/introduction.
We spent a year building and maintaining browser automations for EHR and payer portal integrations at our healthcare startup. Building these automations and debugging failed ones was incredibly time-consuming.
There are lots of tools that use runtime AI, like Browser Use and Stagehand, which we tried, but:

(1) They rely on custom DOM parsing that's unreliable on older and complicated websites (including all of healthcare). Using a website's internal network calls is faster and more reliable when possible.

(2) They can be expensive since they make lots of AI calls, and for workflows with complicated logic you can't always rely on caching actions to guarantee they'll work.

(3) They act at runtime, so what the agent is going to do isn't interpretable. You kind of hope you prompted it correctly, but legacy workflows are often unintuitive and inconsistent across sites, so you can't trust an agent to just figure them out at runtime.

(4) They don't really help you generate new automations or debug automation failures.
We wanted a way to reliably generate and maintain browser automations in messy, high-stakes environments, without relying on fragile runtime agents.
Libretto is different because instead of runtime agents it uses “development-time AI”: scripts are generated ahead of time as actual code you can read and control, not opaque agent behavior at runtime. Instead of a black box, you own the code and can inspect, modify, version, and debug everything.
Rather than relying on runtime DOM parsing, Libretto takes a hybrid approach combining Playwright UI automation with direct network/API requests within the browser session for better reliability and bot detection evasion.
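As a rough sketch of what that hybrid decision can look like, here is an illustrative heuristic (this is not Libretto's actual implementation; all names are hypothetical) for deciding, per recorded request, whether to replay it directly or fall back to driving the UI:

```javascript
// Illustrative heuristic only -- not Libretto's actual implementation.
// Given a request recorded from the browser session, decide whether it's
// safe to replay directly or whether the step should drive the UI instead.
function chooseStrategy(recorded) {
  const isRead = recorded.method === "GET";
  const isJson = (recorded.contentType || "").includes("application/json");

  // Idempotent JSON reads are the safest candidates for direct replay.
  if (isRead && isJson && recorded.status === 200 && !recorded.usesCsrfToken) {
    return "network";
  }
  // Writes, CSRF-protected calls, or opaque responses: let the site's own
  // JS handle the details by driving the UI with Playwright.
  return "ui";
}
```

The point of a rule like this is that write operations and token-protected endpoints carry the most risk when replayed outside the site's own JavaScript, so only plain reads get the fast path.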
It records manual user actions to help agents generate and update scripts, supports step-through debugging, has an optional read-only mode to prevent agents from accidentally submitting or modifying data, and generates code that follows the abstractions and conventions you already have in your repo.
Would love to hear how others are building and maintaining browser automations in practice, and any feedback on the approach we’ve taken here.
1. playwright-cli for exploration and ad-hoc scraping, in order to determine what works.
2. playwright code generation based on 1, which captures a repeatable workflow
3. agent skills - these can be playwright based, but in some cases if I can just rely on built-in tools like Web Search and Web Fetch, I will.
playwright is one of the unsung heroes of agentic workflows. I heavily rely on it. In addition to the obvious DOM inspection capabilities, the fact that the console and network can be inspected is a game changer for debugging. watching an agent get rapid feedback or do live TDD is one of the most satisfying things ever.
Browser automation, and being able to record the graphics buffer as video during a run, opens up many possibilities.
Same, playwright is phenomenal. You can also have the agent browse with MCP to figure out the workflow, then bang out a repeatable playwright script for it. It's a great combo.
You can also do Chrome MCP.
"Claude, reverse engineer the APIs of this website and build a client. Use Dev Tools."
I have succeeded on 8/8 websites with this.
Sites like Booking.com and Hotels.com try to identify real humans with their AWS solution and Cloudflare, but you can just solve the captcha yourself, log in, and the session is indistinguishable from a human's. Playwright is detected and often blocked.
Agreed! One thing that we felt was missing from the existing MCP tools was user recording. For old and shitty healthcare websites it's easier to just show the workflow than explain it
The playwright codegen tool exists, but the script it generates is super simple and it can't handle loops or data extraction.
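The kind of loop-plus-extraction logic codegen won't emit can be sketched generically like this (hypothetical: `fetchPage` stands in for however one page of results is obtained, whether a replayed network call or a DOM scrape):

```javascript
// Sketch of the pagination-loop-plus-extraction pattern that raw codegen
// output won't produce. `fetchPage` is a hypothetical stand-in for however
// one page of results is obtained (a replayed network call or a DOM scrape).
async function collectAllRows(fetchPage) {
  const rows = [];
  for (let page = 1; ; page++) {
    const batch = await fetchPage(page);
    if (batch.length === 0) break; // stop once a page comes back empty
    rows.push(...batch);
  }
  return rows;
}
```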
So for libretto we often use a mix of instructions + recording my actions for the agent. Makes the process faster than just relying on a description and waiting for the agent to figure out the whole flow
The interesting part to me is recovery after the first generated script goes stale. I’d be curious whether you measure success as 'initial generation works' or 'the same flow still passes after small DOM/layout changes a week later', since that seems like the boundary between a neat demo and something a team can rely on.
And even beyond a week. What happens when the site owner redesigns the interface and breaks things? What is the recovery process and how “deep” does the rework have to go? Would you expect to use the same prompt and just have the AI figure it out again and then ship the new code?
The 'deterministic' framing is the part I'd want to understand better. When a model generates a Playwright script, selector choice is often the fragile element: LLMs frequently generate CSS class selectors or XPath rather than Playwright's recommended getByRole/getByLabel/getByText approach, even when accessible-name selectors would work. The generated code can 'work' on first run but break on the first layout tweak.
@muchael: does Libretto constrain the model to prefer accessible-name-based selectors during generation, or does the determinism come primarily from the execution-verification loop (run → fail → self-correct)? The two approaches have meaningfully different failure modes—the first makes the initial code robust, the second only catches brittleness at runtime.
This is a great flag and something we want to spend more time experimenting with as we continue to build out the repo.
Right now we have kind of a mixture of the two approaches, but there's a lot of room for improvement.
- When libretto performs the code generation, it initially inspects the page and sends the network calls/playwright actions using the `snapshot` and `exec` tools to test them individually. After it's tested all of the individual selectors and thinks it's finished, it creates a script and then runs the script from scratch. Oftentimes the generated script will fail, which triggers libretto to identify the failure, update the code, and repeat this process until the script works. That iteration process makes the scripts much more reliable.
- The way our `snapshot` command works is that we send a screenshot + DOM (condensed, depending on size) to a separate LLM and ask it to figure out the relevant selectors. We do this so we don't pollute the main agent's context with the DOM + lots of screenshots. As part of that analyzer's prompt we tell it to prefer selectors using: data-testid, data-test, aria-label, name, id, role. This just lives in the analyzer prompt and is not deterministic, though. It'd be interesting to see if we can improve script quality by adding a hard constraint on the selectors or with different prompting.
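A hard constraint on selector choice could look something like this sketch (hypothetical, not how the snapshot analyzer currently works), where the preference order from the prompt becomes deterministic code:

```javascript
// Hypothetical sketch: enforce the selector preference order described
// above as a hard constraint in code rather than only in the prompt.
const PREFERENCE = ["data-testid", "data-test", "aria-label", "name", "id", "role"];

function pickSelector(attrs) {
  for (const name of PREFERENCE) {
    if (attrs[name]) {
      return name === "id" ? `#${attrs.id}` : `[${name}="${attrs[name]}"]`;
    }
  }
  return null; // no stable attribute found: fall back to the LLM's choice
}
```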
I'm also curious if you have any guidance for prompt improvements we can give the snapshot analyzer LLM to help it pick more robust selectors right off the bat.
I literally _just_ put up an announcement on our internal Slack of a tool I had spent a few weeks trying to get right. Strange to post the announcement and, literally the same day, see a better, publicly available toolkit that enables that very workflow!
I'm also using Playwright, to automate a platform that has a maze of iframes, referer links, etc. Hopefully I can replace the internals with a script I get from this project.
Haha that's wild, let me know if you run into any issues with it!
Looks awesome, but I wonder if its functionality could be exposed to existing CLIs such as Claude Code instead of having to run it through its own CLI, mainly because I don't want to spend on credits when I've already got a CC subscription.
EDIT: To clarify, I realize there are skill files that can be used with Claude directly, but the snapshot analysis model seems to require a key. Any way to route that effort through Claude Code itself, such as for example exporting the raw snapshot to a file and instructing Claude Code to use a built-in subagent instead?
Answered a similar question above in response to someone's MCP sampling comment. Spinning up a separate agent in the CLI was our initial approach, but we switched to snapshot via API because of speed and reliability.
We can update the config though to allow you to set up snapshot through the CLI instead of going through the API!
Did you consider MCP sampling to avoid requiring your own LLM access? (for the clients that support it of course, but I think it's important and will become standard anyway)
Not totally sure I understand, but if you're talking about the snapshot command which requires an API key we initially had it spinning up a tmux session to analyze the snapshot instead of using the API. But we switched it to use the API for 2 reasons:
1. Noticed that the API was a couple seconds faster than spinning up the coding agent
2. Spinning up a separate agent you can't guarantee its behavior, and we wanted to enforce that only a single LLM call was run to read the snapshot and analyze the selector. You can guarantee this with an API call but not with a local coding agent
Sorry, yeah, it was a bit vague. I was thinking about creating a Libretto MCP since it's a/the standard way to share AI tooling nowadays, and that would make it usable in more contexts.
In that case, the protocol has a feature called "sampling" that allows the MCP server (Libretto) to send completion requests to the MCP client (the main agent/harness the user interacts with). That means Libretto would not need its own LLM API keys to work; it would piggyback on the LLMs configured in the main harness (sampling supports "picking" the style of model you prefer too: smart vs. fast, etc.).
This is what I found doing playwright based extraction against anti-bot defenses. Runtime agents were brittle. It felt like trying to debug/audit a black box.
We used to deal with RPA stuff at work. Always fragile. Good to see evolution in the space.
Love it! Do you have a BAA with Claude though? Otherwise, your demo is likely exposing PHI to 3rd parties and exposing you to risk related to HIPAA
It's a good callout. We have a BAA + ZDR with Anthropic and OpenAI, and if you want to use libretto for healthcare use cases having a BAA is essential. Was using Codex in the demo, and we've seen that both Claude and Codex work pretty well
just adding to michael's reply - we took care to make sure no PHI was exposed in our demo video as well.
Curious how you handle target site changes - does the agent get triggered to regenerate, or do you just wait for the script to fail in prod first?
For our scripts running in prod we handle this in 2 ways:
- We use runtime agents in very specific places. For example, Availity frequently shows popups right after you log in, so if there's a failure right after login we spin up an agent to close the popup and then resume the flow, with basically a try/catch
- We wait for it to fail and then tell the agent to look at the error logs and use `libretto run` command to rerun the workflow and fix the error
We're thinking of extending libretto to handle these better though. Some of our ideas:
- Adding global/custom fallback steps to every page action. This way we could, for example, add our popup-handler error recovery to all page actions or some subset of them
- Having a hosted version which flags errors and has a coding agent examine the issue and open a PR with the fix
Curious if you have any other ideas!
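For reference, the fallback-step idea above could be sketched as a generic wrapper (hypothetical names, not an existing Libretto API):

```javascript
// Hypothetical sketch of a "global fallback step": wrap each page action
// in a try/catch that invokes a recovery hook (e.g. a runtime agent that
// dismisses popups) and then retries the deterministic step once.
async function withFallback(action, recover) {
  try {
    return await action();
  } catch (err) {
    await recover(err); // e.g. spin up an agent to close a popup
    return await action(); // retry the deterministic step once
  }
}
```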
Very interesting idea: old-school solutions but with new methods. But maybe we can't make everything deterministic for complex cases, the scenarios that opened up after LLMs arrived on the scene. Maybe we need a mix of both.
Thanks! I think the right solution is 100% a mixture. Currently thinking it should be mostly deterministic with some intentional/limited usage of runtime AI. And then AI debugging tooling on top of that
I like the pre-gen approach! Curious how it responds to JS that changes how components are rendered at run-time.
There are a couple ways to handle JS components rendered at runtime:
- Libretto prefers network requests over DOM interaction when possible, so this will circumvent a lot of complex JS rendering issues
- When you do need the DOM, playwright can handle a lot of the complexity out of the box: playwright will re-query the live DOM at action time and automatically wait for elements to populate. Libretto is also set up to pick selectors like data-testid, aria-label, role, id over class names or positional stuff that's likely to be dynamic.
- At the end of the day the files still live as code so you could always just throw a browser agent at it to handle a part of a workflow if nothing else works
I built something very similar for my company internally. The idea was that the maintenance of the code is on the agent, and the code is purely an optimization. If it breaks, the agent runs it iteratively and fixes the code for next time. Happy to replace my tool with this and see how it does!
Super cool! Please let me know how it goes. Since agents are so good at writing code, we think letting the agent rewrite/test the code on failure is better than just using a prompt at runtime
Thanks for this! We have clear answers for things that are 100% and 0% automated, but it’s always that 80%-99% automated slice where the frontier is, great idea.
Curious how it compares to Browser Use
how does it differ from playwright-cli?
At its core, libretto generates, validates, and helps debug RPA scripts. As far as I understand, tools like playwright CLI are more focused on letting your agent use playwright to perform one-off automations.
The implementation is also pretty different:
- libretto gives your agent a single exec tool (instead of different tools for each action) so it can write arbitrary playwright/javascript and is more context efficient
- We also gave libretto instructions on bot-detection avoidance, so it will prefer using network requests for automation (something other tools don't support) but will fall back to playwright if it identifies network requests as too risky
playwright-cli is very simple and meant for humans - it basically generates a first draft of a script, and was originally meant for writing e2e tests. You need to do a lot of post-processing on it to get it to be a reliable automation.
libretto gives a similar ability for agents for building scripts but:
- agents automatically run, debug, and test the integrations they write
- they have a much better understanding of the semantics of the actions you take (vs. playwright auto-assuming based on where you clicked)
- they can parse network requests and use those to make direct API calls instead
there's fundamentally a mismatch where playwright-cli is for building e2e test scripts for your own app but libretto is for building robust web automations
how does it differ from https://github.com/browserbase/stagehand ?
we started using stagehand initially! But it doesn't follow the same model of pre-generating deterministic code. Your code is meant to look like this:
    // Let AI click
    await stagehand.act("click on the comments link for the top story");
the issue with this is that there's now runtime non-determinism. We move the AI work to dev-time: AI explores and crawls the website first, and generates a deterministic, legible script.
Tangentially, Stagehand's model may have worked 2 years ago when humans still wrote the code, but it's no longer the case. We want to empower agents to do the heavy lifting of building a browser automation for us but reap the benefits of running deterministic, fast, cheap, straightforward code.
How does it have deterministic output when using LLMs that are non-deterministic by nature?
I believe it generates playwright scripts (non-deterministically), which are saved and executed again (deterministically).
I've wanted something like this for ages, excited to try this out!
glad to hear! Please reach out on Discord or GitHub Issues if you run into issues!
What is the license?
Edit: nevermind. I see from the website it is MIT. Probably should add a COPYING.md or LICENSE.md to the repository itself.
Sorry! Yes, MIT. Forgot to lift it up when I converted to a monorepo, but it's in packages/libretto
Cool. Thank you for sharing. While AI tools are extremely powerful, packages like this help create some good standards and stepping stones for connectivity that the models haven’t gotten around to yet. Thanks again.
Ofc! Please try it out. Stop by in the Discord or Github Issues if you have any questions!
this is interesting
Thanks! Please try it out. Stop by in the Discord or Github Issues if you have any questions!
this looks awesome
Thanks! Please try it out. Stop by in the Discord or Github Issues if you have any questions!
[flagged]
Lol sorry for the misleading click. We named it libretto after the term in theater, inspired by Playwright. No retro gaming here, just browser automation!
Ok, but please don't post unsubstantive comments to HN.
[flagged]
Right now libretto only captures HTTP requests, which the coding agent can use to determine how to perform the automation.
For more complex cases where libretto can't validate that the network approach would produce the right data (like sites that rely on WebSockets or heavy client-side logic), it falls back to using the DOM with playwright.
https://news.ycombinator.com/newsguidelines.html#generated