Something elided here is that nested virtualization on regular EC2 instances has only been possible since February this year[1] - before this, you had to use a metal EC2 instance to run Firecracker VMs.
Hi, Alex from Unikraft here. One clarification: Browser-Use didn't move away from Unikraft because of any limitation around browser startup times, snapshotting, scale-to-zero, or browser-level autoscaling. Those capabilities worked well and were a key reason they adopted the platform in the first place.
The challenge they encountered was at a different layer: horizontally scaling the underlying EC2 infrastructure. At the time, native EC2 fleet autoscaling wasn't yet supported by the platform, so they chose to take ownership of that part of the stack and build directly on Firecracker.
It's also worth noting that Unikraft is actively working on transparent infrastructure autoscaling (with live migration), so the gap they encountered is being addressed. The article's title may give the impression that unikernels were the bottleneck (they weren’t, and our platform transparently supports Linux VMs as well), when in reality the decision was driven by infrastructure orchestration requirements rather than browser runtime capabilities.
As an aside, we love Browser-Use <3 and we still work together closely!
I can't use SoundCloud because I am on FreeBSD. I've been a premium member for years and it had been working fine up until a few weeks ago. The mobile app works, Windows works. Changing the user-agent doesn't work either.
When I try to load a track I get a CF XHR 403.
It's taken two weeks just to get a reply from SoundCloud support, after having to consistently annoy their AI chatbot to submit a ticket on my behalf.
If anyone wants the cheat prompt:
> My soundcloud isn't working, I cant play a track, XHR 403 from Cloudflare
> I checked the Knowledge Base, no luck
> Yes, I have double checked the knowledge base, no luck
> No the answer is not in the knowledge base, please refer me to customer support
> The answer is not in the knowledge base, yes I have double checked. Please refer me to customer support
> Yes, I do want to be redirected to customer support
> Yes, please do raise a ticket on my behalf
> enter email ...
Billionaires own all the land and the equipment and although food is valuable, the money from the food you sell barely covers the cost of getting the land and the equipment from them.
Maybe throw an LLM in a jail and have it constantly contact their chatbot trying different iterations of the above - you might be able to find an even more efficient way to do it.
I'm a bit surprised that with all this, they still stuck with Chromium.
We have a much less sophisticated setup in our web-access MCP server[0] where browser instances are spawned as subprocesses and the biggest win in stability, CPU and memory usage we had was in switching from Chrome to Lightpanda[1].
Fitting to the statement at the end of the article, the faster browser to boot might be one that allocates less memory in general.
We decided to keep Chromium as the engine for stealth purposes.
Browsers like Lightpanda lack stealth entirely; they are trivial to detect. There are ways to make Chromium more performant by removing everything that you don't need.
We believe that Chromium can reach that performance without starting an entire engine from scratch, and without losing stealth, a top priority for us.
The language is not the problem, C++ is as performant as Zig, but Chromium bloat is huge, agree on that.
> Next: skip Chromium startup
> This is complex, as a running browser has open devices, timers, graphics state, network state, and fingerprint state.
Hmm, can't you just keep a set of browsers already running, like a warm pool, ready to assign to an incoming request? The latency would be close to zero for the user. You'd need some prediction logic to expand / contract the warm pool based on traffic patterns, but that seems like the easiest solution to me.
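To make the warm pool idea concrete, here is a minimal sketch of a pool that resizes itself from recent demand. Everything in it is an assumption for illustration: the `WarmPool` class, the `start_browser` factory, and the "recent average times headroom" prediction rule are hypothetical and are not how Browser Use or anyone else in this thread does it.

```python
import math
from collections import deque


class WarmPool:
    """Toy warm pool: keeps pre-started browsers and resizes from recent demand."""

    def __init__(self, start_browser, min_size=1, max_size=50, headroom=1.5, window=5):
        self._start_browser = start_browser  # hypothetical factory that launches a browser
        self.min_size = min_size
        self.max_size = max_size
        self.headroom = headroom             # over-provisioning factor (assumed value)
        self._recent = deque(maxlen=window)  # requests seen in the last few intervals
        self._idle = deque()
        self._served = 0
        self.resize(min_size)

    def acquire(self):
        """Hand out a warm browser if one is idle, otherwise pay the cold-start cost."""
        self._served += 1
        if self._idle:
            return self._idle.popleft()
        return self._start_browser()

    def target_size(self):
        """Predict demand as the recent per-interval average times the headroom factor."""
        if not self._recent:
            return self.min_size
        average = sum(self._recent) / len(self._recent)
        wanted = math.ceil(average * self.headroom)
        return max(self.min_size, min(self.max_size, wanted))

    def resize(self, target):
        """Start or drop idle browsers until the idle count matches the target."""
        while len(self._idle) < target:
            self._idle.append(self._start_browser())
        while len(self._idle) > target:
            self._idle.pop()  # a real pool would shut the browser down here

    def tick(self):
        """Call once per interval: record demand, then expand or contract the pool."""
        self._recent.append(self._served)
        self._served = 0
        self.resize(self.target_size())

    def idle_count(self):
        return len(self._idle)
```

A caller would invoke `acquire()` per request and `tick()` on a timer. The sketch also shows the cost raised in the replies: every idle browser in `_idle` holds memory whether or not a request ever arrives.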
Yes, warm pools work, but our goal is to replace them entirely.
Warm pools are nice, but in the end they also consume resources, and you always need to keep the pool warm, start browsers to balance it, etc.
With the upcoming changes we will keep Chromium startup and the VM will be ready in 50ms, beating warm pools entirely.
Also, some customers need special parameters and features, which increases warm pool complexity. The happy path will be fast but the edge case will be extremely slow, and we want to guarantee fast speeds no matter which features you need on the requested browser.
Do you see much of a difference between started Chromium instances with the same configuration in terms of the contents of allocated memory? Are they deterministic?
If not, could you template the memory and apply runtime patches (like timers or other initialized values) before releasing the process to run?
Would forcing the isolates to allocate memory better help at all, such as reducing fragmentation making your 2MB page sizes more effective?
We run a screenshot API (ApiFlash) with Chromium packaged in an AWS Lambda container image instead of Firecracker on EC2. AWS Lambda gives you isolation and autoscaling for free, which is ideal for spiky stateless work like screenshots. I believe we get mostly the same benefits as the browser-use solution, but with a much simpler architecture. The tradeoff is AWS Lambda cold starts, but in practice sequential AWS Lambda invocations reuse a hot function. As a result, with a large enough volume, spikes are smoothed out and cold starts are not that frequent.
Not all use cases require all the features that we built
A few issues we had with Lambdas:
- Limited running time (15 min), we support up to 4 hours (we can run longer if needed)
- Price
- Lack of snapshotting mechanisms
- Lack of low-level control over the running host
But yeah, lambda is way more than enough for most common use cases automating the web
From our production stats, the median screenshot capture takes 5.7s. Browser-use bills per minute, not per millisecond like Lambda does. As is, it's around 2x more expensive than Lambda for our use case.
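To illustrate why billing granularity matters for short jobs, here is a small sketch that compares billed time only. It assumes one capture per billed session and says nothing about unit prices, which are not given in this thread, so it does not reproduce the 2x figure; `billed_ms` is a hypothetical helper.

```python
def billed_ms(duration_ms: int, granularity_ms: int) -> int:
    """Round a run up to the next whole billing increment."""
    increments = -(-duration_ms // granularity_ms)  # ceiling division on integers
    return increments * granularity_ms


MEDIAN_CAPTURE_MS = 5_700  # the 5.7s median quoted above

per_minute = billed_ms(MEDIAN_CAPTURE_MS, 60_000)  # billed in whole minutes
per_millisecond = billed_ms(MEDIAN_CAPTURE_MS, 1)  # billed in whole milliseconds
overhead = per_minute / per_millisecond            # ratio of billed time, not of price

print(f"per-minute: {per_minute} ms, per-millisecond: {per_millisecond} ms, "
      f"ratio: {overhead:.1f}x")
```

Under these assumptions a single 5.7s capture is billed as a full minute, so the final price gap depends on how much cheaper the per-second rate is and on whether several captures can share one billed session.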
> Plain headless Chromium is easy to detect by websites with anti-bot measures. Plain headless Chromium avoided getting blocked by websites only 2% of the time, according to our stealth benchmark.
> Our browsers avoid blocks 81% of the time on our stealth benchmark, and 84.8% on Halluminate BrowserBench, the highest of any provider.
Seems very unethical, no? Who uses service providers like this? The whole point of anti-bot measures is to get rid of bots - you are not wanted there.
These kinds of services inevitably make the web more human-hostile and expensive. Websites will continue pushing back on automated usage, meaning more hurdles to access content.
No doubt part of why we see this push for verified ID on the web - not just age gating and "protect the children", but also protect sites from bots, and protect ad revenue (not a statement of support; just seems like an obvious higher order effect)
> Seems very unethical, no? Who uses service providers like this? The whole point of anti-bot measures is to get rid of bots - you are not wanted there.
Unethical just because it does something someone else doesn't want? I guess it depends on why and what the intention is. I don't have time to sit 24/7 in front of a computer to get a ticket to some events, does that mean it's unethical for me to use my own bot so I can purchase a ticket to bands I'm a fan of? Probably not. But if I did so for scalping purposes? Then yeah, I'd agree it's unethical.
The whole point of anti-anti-bot measures is to be able to do things even if others don't think that thing should be automated, so among the Hacker News audience, I think quite a lot of us have at one point or another engaged in stuff like that. Doing so merely for profit of course stinks, but for you to be able to have a fighting chance against scalpers? Probably OK.
An example I ran into recently: I wanted to scrape pricing data for used cars, to better inform a friend's decision about what to purchase.
I know there's a relationship between mileage and depreciation, but wanted to have a better sense of what that relationship is to know whether a given car was over or underpriced.
Similarly, if I was pulling that data to build a service of my own to offer to users... is that unethical?
All of these questions are easily answered by the question: can I run the bot on the same PC I use regularly? If so, then do it there. If not, then don’t do it at all.
Why should technical capability to evade countermeasures dictate whether or not something is ethical? My view is that scraping remains ethical as long as your actions aren't causing technical problems for the operator. If anything, a retailer attempting to hide pricing data is what's unethical in my view.
Time was you could get lovely JSON feeds from every site by iterating on the inspector's curl statement. Nowadays you can't even use Selenium without Cloudflare getting grouchy. Last fall I had to build my spreadsheet like a caveperson: Control-C, Control-V. It wouldn't be so bad if the dealer aggregators' coverage was XOR, but you have to dedupe listings. Then there is the whole issue of online salespeople who don't show up at the dealership.
There's a JavaScript property called navigator.webdriver that returns true if Selenium is in use. Obviously, every antibot system checks it. Obviously, you can patch it to always say false.
> even if others don't think that thing should be automated
It's an interesting thought that can be further explored. Could anything that's considered "unwanted" by a third party be considered unethical, if I do it anyway?
If the hotel self-service restaurant has a sign "don't take the food out" and I take 1 apple in my pocket for a snack, is it unethical? Or maybe the sign is just for people that would otherwise take $100 of watermelons out of the cantina daily and try to resell it on the beach.
We ("society") do things to people that they don't want all the time, namely punishing them via the legal system. Then there are things that other people do that are immoral and should be punished even though they are not illegal. And there is the whole class of inactions, where we don't, e.g., buy something because it's overpriced, and that's certainly something the seller doesn't like.
Mutual consent to what? I didn't agree for rich people to own all the housing around here, is it mutual consent if I agree to pay one to get some, or is it extortion?
Was Rosa Parks unethical for sitting down on a bus?
The point is that the context matters: both the users context and the context of the restriction. It’s not as clear cut as “ignoring restrictions = bad”.
The restriction itself can be unethical, in the same way that bypassing a restriction can be unethical.
> As a discussion regarding if it’s ethical to ignore restrictions progresses, the probability of someone bringing up a famous case where someone ignored unethical restrictions approaches one
Seems reasonable to me. Substitute Rosa Parks with another example of unethical restrictions if you wish - there are many.
Do you think it is a problem that someone said it's always unethical to violate a restriction, and someone else brought up Rosa Parks?
I propose a new law myself: as an online discussion gets longer, the probability of someone trying to defeat an argument by stating that it mentioned Rosa Parks or Hitler without engaging with the substance of that argument approaches one.
Look at what Google's doing right now with Chrome. On June 30 Chrome will remove the last flag that let uBlock keep working, and there's no workaround. Google says it's about security and performance, but is it? $239 billion in ad revenue last year seems to be the motivational factor. The "restriction" is a rule written by the company that profits when you can't block its ads, dressed up as protecting you. But... CISA recommends ad blockers as a defense against malware spread through ad networks.
The rules aren't always right and sometimes have unintended consequences. I think a bigger issue than Browser Use is all of the copyrighted material in every LLM. Given that precedent has been set with zero legal consequences, I'm not sure there's much of a leg for you to stand on here.
There are many ethical reasons to bypass restrictions. Colloquially, we just call them exceptions.
There are many valid ethical exceptions for evading anti-bot detections. For example: you are a white hat actor scraping a black hat site. There are hundreds of other plausible examples.
If the sign says 1 per person, the reason it's unethical to take more than 1 isn't because you're disobeying a sign - it's because someone else might not be able to get one, and the sign is indicating to you this is likely to be a problem. If the store is about to throw out all the unsold ones in 5 minutes, then ignoring the sign is completely ethical.
We use headless browser providers because the companies we interact with don't and won't create a proper API for us to use. Lots of legacy web apps/portals. Saves thousands of man hours.
> But if I did so for scalping purposes? Then yeah, I'd agree it's unethical.
Is scalping actually unethical though? Sure it's unpopular, but I'd argue that's just your average person not properly grasping supply and demand and thinking through the consequences. If you want to sell something below market then you should raffle it off and take extensive measures to prevent transfer of ownership. The current practice is trying to pretend it's an open market while fixing the price, then getting angry when the obvious consequences materialize. Scalpers are merely the agents that correct a market inefficiency introduced by a dysfunctional status quo.
I'm not defending the overall arrangement, simply pointing out that the blame is misplaced. If you want to sell below market value in a capitalist system then you must take appropriate measures to prevent a market from forming. More or less by design if one can form then it will.
Similarly I do not attempt to blame the rank and file employees of the ad tech industry for the actions of their employers, nor of the defense industry for the actions of the government. As an individual living under such a system either you move to make money when the opportunity arises or someone else will instead and you will lose out.
This is called "victim blaming". You are saying the blame for a problem shouldn't be on those who directly caused the problem, but on those who failed to prevent them from causing the problem.
You're right but in a different way. Scalpers aren't independent, they work for the artists to maximise artist revenue while absorbing the PR hit themselves.
No, I am saying that the people who went and created conditions that they knew would lead to the problem are the ones to blame. They are not victims except perhaps of their own poor decisions. I explicitly do not think that scalpers are doing anything wrong given that in a capitalistic system someone is always going to arbitrage things.
You don't get to enact poor policy, stick your fingers in your ears, then blame everyone but yourself for the place burning down.
Scalping is ethical if and only if you are marketpilled. If you think anything outside the market matters, such as maximizing human enjoyment of the concert, then it's unethical.
(I haven't tried this out yet.) My use case would be to take a snapshot of each HN story. This is surprisingly hard, because most websites prevent bots from doing that.
For example, Claude has a lot of trouble reading HN's front page. HN itself is fine, but the moment you ask it to pick out an article, it often chokes. The website has put up a verification captcha, or it's a paywall, etc. Paywalls can be bypassed by reading HN comments and looking for archive links. But those archives often block bots too, so you're back to square one.
Whether it's unethical is an interesting question. I believe I should have the right to do what I want with internet content, as long as I'm not abusive. Merely having a bot isn't abusive. It would be one thing if the bot is hammering a server or vacuuming up training data, but having a bot at all is presently very hard.
This service caught my attention because it could potentially solve the problem I'm running into. Simply taking snapshots of articles that hit HN shouldn't be so hard, but it is. HN sends millions of views to websites; one bot taking a snapshot isn't going to make a difference. I don't think it counts as "unethical" just because we're going against the website owner's wishes. When you post content to the internet, you sign up to share that content with everyone, other than what's denied by robots.txt. If it's not blacklisted by robots.txt, it should be possible for well-behaved bots to access.
I don't expect very many people here to care about the poor bot creators. Most of the bot creators are malicious anyway. But I personally lament the loss of being able to write a program that can process information from the browser in arbitrary ways. You should be able to, yet we're buying into the notion that it's okay for website owners to say "this content is only accessible by approved bots like Google, and everyone else can sod off."
HN proves it doesn't need to be like that. It gets tens of millions of page views a day, a lot of which is bot traffic. HN only uses captchas for creating accounts or logging in. You're free to scrape any content as long as you respect the crawl delay of 30 seconds specified in robots.txt, and don't try to visit links that perform actions a human would take (like adding things to favorites or voting). That's how the internet should work: just deliver content.
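For anyone who wants their bot to be well-behaved in this sense, here is a minimal sketch of honoring `Disallow` rules and a crawl delay with Python's standard `urllib.robotparser`. The robots.txt text, the `example.com` URLs, the `/vote` and `/fave` paths, and the `plan_fetch` helper are all made up for illustration; only the 30-second delay comes from the comment above, and this is not a copy of HN's actual file.

```python
import urllib.robotparser

# Hypothetical robots.txt contents, written for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /vote
Disallow: /fave
Crawl-delay: 30
"""


def load_rules(robots_txt: str) -> urllib.robotparser.RobotFileParser:
    """Parse robots.txt text that was fetched (or cached) elsewhere."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def plan_fetch(parser, user_agent: str, url: str, default_delay: float = 1.0):
    """Return (allowed, seconds to wait between requests) for this agent and URL."""
    delay = parser.crawl_delay(user_agent)  # None when no Crawl-delay applies
    wait = float(delay) if delay is not None else default_delay
    return parser.can_fetch(user_agent, url), wait


rules = load_rules(ROBOTS_TXT)
print(plan_fetch(rules, "my-bot", "https://example.com/item?id=1"))
print(plan_fetch(rules, "my-bot", "https://example.com/vote?id=1"))
```

A crawler loop would skip any URL where the first element is false and sleep for the returned delay between requests.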
I don't think half is a realistic number, and any realistic numbers are not going to change the server load much. The real cost of that scenario is all the AI compute.
Web archival/preservation services/projects that need to get past captchas and other bot checks are a prime target for a service like this... but I think their main customers are people just mass scraping parts of the internet for less altruistic reasons.
> Seems very unethical, no? Who uses service providers like this? The whole point of anti-bot measures is to get rid of bots - you are not wanted there.
I'm familiar with companies automating access to software only accessible via the web with poor/no API support. This is software they pay (usually a lot of money) for, and it usually has built-in captchas to guard logins. They aren't a large enough customer to ask for the removal of these captchas or to be whitelisted (just one out of many SaaS tenants), so they simply work around that restriction.
Exactly. Crappy companies like Browser Use are causing more captchas, etc. All these scraper companies should have been heavily regulated. They use residential proxies, creating an incentive for hacking IoT devices, etc.
I don't think one can judge it ethically without considering the context. Are we talking about mass automated scraping? Or are we talking about me trying to get a good deal by scraping local used car dealership listings once per day for my personal need (just so I don't have to do it manually)?
One of these is strictly more ethical, but both will be blocked by Cloudflare, for example. I'd happily use such a service in my personal case.
I wish simpler bots existed for consumers. I want to know when someone replies to me, when a price drops, when airlines open new seat reservations, when a new seat opens for a college class, when a concert is coming to my area for a musician I listen to, when my local grocer has new stock, when a new Hyatt offer is available in a city I want to visit, etc. That doesn’t mean I’m abusive. I can have it check once a day. In almost all those cases, I want to spend money with the business but I don’t want to manually check.
I briefly tried a job like this, scraping Steam for CS:GO skins (think a knife skin for $2,000.00), and yeah, trying to find proxy providers and get around the IP limit is a tough one. But there is a market for it: people were paying for the tool (not mine).
Once again I'd like to remind everyone that violating Terms of Service isn't the same as violating some moral ethics. They are literally just expectations with no enforceable or legal boundaries.
For example I could write in my Terms of Service that you do not view more than one page on my website and expect you to send me a written permission to read the rest. I don't expect anybody to follow and I sure don't think less of those that do.
The push for verified IDs is not related to this; it's more of a politically motivated attempt at selling fear to justify more surveillance.
Whether or not scraping publicly available websites is unethical is probably up for debate. In some cases at least, courts have found it to be legal, even when the site is throwing up technical barriers or issuing cease and desists.
What is likely unethical is the fact that they offer residential proxies. The residential providers of those proxies are frequently not aware they’ve been opted in to provide such a service.
I don't know if this falls under good ethical behavior but the one time I needed to mass scrape sites was to build an affordable housing directory for the Bay Area. Whether or not a unit of housing was under rent control was only available through private entities or apartment hosting sites and it was scattered and hard to find.
In my opinion, a directory of subsidized housing should have been provided by local governments and not through a plethora of real estate websites.
Obviously I don't know what percentage represents "legit" use cases vs. other, more morally questionable ones, but in our case we have a CMS where the content team can include external links, and we need to verify periodically whether those links work or not, which is not as easy as making GET requests with a client.
The people who've been in charge of the web (i.e., mostly the browser makers, but also the owners of the most popular sites) have made decisions that are IMHO severely anti-user. Although these anti-user design decisions have been accumulating for 30 years, users have had no alternative because all the content was on the web with no way to get it other than to visit web sites with a web browser.
Now that there is an alternative (namely AI), people (including me) are flocking to the alternative. You want to frame this as unethical bots versus ethically-acceptable human site visitors, but the main motivation for the use of scraping bots these days is to provide services (i.e., AI-based question answering) that users (like me) consider far superior to going directly to web sites for information, because visiting web sites with a web browser is a frustrating, tedious experience.
I use change detection to monitor all sorts of websites for changes. Some of my favorite authors don't have RSS.
I always set up price monitoring for any big ticket item I'm considering like appliances so I can see how their pricing changes over time.
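For the curious, the core of change detection can be as small as hashing the page text and comparing against the last stored hash. This is a minimal sketch under my own assumptions: `fingerprint` and `detect_change` are hypothetical helpers, fetching and storage are left out, and whitespace normalization is just one possible way to ignore cosmetic differences.

```python
import hashlib


def fingerprint(page_text: str) -> str:
    """Hash the page text after collapsing whitespace, so reflowed markup is ignored."""
    normalized = " ".join(page_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def detect_change(previous_fingerprint: str, page_text: str):
    """Return (changed, new_fingerprint) for the freshly fetched text."""
    current = fingerprint(page_text)
    return current != previous_fingerprint, current


stored = fingerprint("Price: $499")
print(detect_change(stored, "Price:   $499\n")[0])  # whitespace-only difference
print(detect_change(stored, "Price: $449")[0])      # actual content difference
```

Running this once a day per page and storing the returned fingerprint is enough for simple price or article monitoring.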
I also use scrapers for websites that don't have an API. I like having all of my purchase history indexed in a database where I can do analysis.
> These kinds of services inevitably make the web more human-hostile and expensive.
I would rather not have to spend more time circumventing stupid bot detection things. I would be more than happy to pay for access to some of this data that I cannot access any other way.. but sure, let's keep burning resources on a cat and mouse game that scrapers will always be able to win.
I don't know who 'they' is here but that's not the point? I would bet that a decent chunk of scraping happens because there is no API (or other machine-focused interface, like RSS).
"pay to crawl" sounds like the absolutely laziest possible way that a particular site could bolt on an API.
This attitude, and by proxy this business are the epitome of selfish entitlement.
You state that you believe you deserve access to others’ resources, at their cost, despite their clear attempts to stop you from using them, simply because you want it.
You can already access these resources. What does it matter if you do the clicking or you have headless chrome do the clicking while you make a cup of coffee?
I'd counter that your attitude is a techno-authoritarian one. Why should anyone have any say over how I access and use a publicly available resource? At least so long as my actions don't directly cause technical problems for the service operator.
Look, it wasn't _my_ request that made the server fall over, it must have been one of the other several thousand thoughtless scrapers running on the website that caused it to die.
If you're claiming that the operators of high volume AI scrapers that wantonly disregard rate limits and all common sense are unethical then I'm right there with you. But that's not at all what was described upthread nor is it the only way in which bots get used by any stretch of the imagination.
As far as anti-bot countermeasures go I quite like proof of work solutions since those disproportionately impact high volume scrapers without noticeably impeding a small hobby project.
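As a rough illustration of why proof of work hits high-volume scrapers hardest, here is a hashcash-style sketch. The challenge format, the SHA-256 choice, the 12-bit difficulty, and the function names are assumptions for illustration, not the scheme of any particular anti-bot product.

```python
import hashlib
import itertools


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def solve(challenge: bytes, difficulty_bits: int) -> int:
    """Client side: brute-force a nonce; cost doubles with each extra difficulty bit."""
    for nonce in itertools.count():
        digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
        if leading_zero_bits(digest) >= difficulty_bits:
            return nonce


def verify(challenge: bytes, nonce: int, difficulty_bits: int) -> bool:
    """Server side: a single hash is enough to check the client's work."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits


challenge = b"session-123"  # hypothetical per-visitor challenge issued by the server
nonce = solve(challenge, difficulty_bits=12)
print(nonce, verify(challenge, nonce, 12))
```

The asymmetry is the point: the server pays for one hash per request, while the client pays a few thousand on average at this difficulty. That is negligible for a hobby bot making one request a day and adds up quickly for a scraper making millions.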
Unfortunately the operators of many major websites appear to want something akin to DRM with the excuse of bots used merely as window dressing.
There was a time when a person could walk through a few department stores every week (or even every day) just to take note of some prices along the way, and ultimately tabulate them to try to identify and snatch up the best deal once it happens.
And if everyone did this, it'd be a real problem. The stores would be clogged up by geeks writing notes in little books with Parker Jotters and just basically wasting space and taking up air conditioning while they sleuth out the best way to put the screws to the company for a few measly dollars.
That'd be awful.
But not many people ever did that in stores, and not many individual people are doing that today with the web. It's really not a problem.
(And if a website in 2026 can't stand the burn of several thousand personal scrapers that are operated by people who actually want to buy stuff from it, then maybe that system simply sucks and needs to be rethought.)
> At least so long as my actions don't directly cause technical problems for the service operator.
That's the point of the criticism. The praise of their anti-anti-bot features reads like it is commonly used to cause technical problems to the service providers, be it intended or accepted for the cause.
> At least so long as my actions don't directly cause technical problems for the service operator.
But they do.
The reasoning you’re describing is not altruistic. It’s the same reasoning used by every AI scraper.
It’s the very reason I am paying a couple hundred dollars out of my own pocket every month to keep the websites of hundreds of small businesses and hobbyists online while I try to help them move to bigger cloud hosts, when I used to turn a small profit from it.
That’s a great theory, unfortunately it’s defeated by the fact that I didn’t need to use anti-bot solutions until I was charged for 38,000x my normal ingress traffic in a single month by bot traffic.
> The reasoning you’re describing is not altruistic. It’s the same reasoning used by every AI scraper.
I think that's bad faith on your part. Clearly AI scrapers are aware of what they are doing and simply don't care. The entire purpose of my including the bit you quoted there was to explicitly exclude that sort of behavior.
I built a similar system for an identity protection service that automated removing PII from directory websites like whitepages. Which was less ethical, stealth browser automation or monetized privacy invasion?
I didn't use this company, but there are some legitimate purposes for scraping.
For example, at a startup a few years ago, one of the many technological things we needed to do was to monitor marketplaces for suspected counterfeit and contract-violating gray market goods for ~100 brands. And we couldn't just ask for data feeds, because, well, the marketplaces make money off of all those sales. And the off-the-shelf third-party data solutions were useless crap quality, worse than your average vibe-coding. So I made a bespoke crawler that gently and accurately tracked the data we needed, including global geofencing. So gently, I never got a whiff of disapproval or countermeasures (like throttling, 403, nor data poisoning). We were putting insignificant load on the marketplaces, for the purpose of helping to make the market better for both consumers and legitimate businesses. It was like a single "secret shopper" unobtrusively walking around some parts of a store. (And I also made an iOS app that did something different for actual secret shoppers in physical stores, for legitimate supply chain traceability for customers' brands.) Personally, I love the marketplaces, and hate the counterfeits, and this was my version of PG's advice that startups should be a little bit naughty.
Two of the problems with the current AI scrapers, which are destroying servers, and inviting backlash:
1. The gold rush situation brings out many of the crappiest people in the world. And also many who aren't crappy might behave in a crappy manner. (The latter, maybe because they're just emulating what they see, or extrapolating from the ethical temperature of prior industry norms, like surveillance capitalism in everything.)
2. Many of these scrapers are shockingly bad at what they do, and grossly inefficient. Almost like they're just pounding the same unchanging resources to DoS the servers for competitors. Or to drive sites to a protection racket company that's set up so they can also monitor cleartext. Or (Occam's Razor) just plain bad at what they do, and the people who pay for the salaries and computer resources either don't know or don't care.
As if Cloudflare and other protection services don't make the web more human-hostile by blocking unapproved browsers and tools like cURL. Captchas and the like already get solved by AIs nowadays; they're just friction used to collect labeling data for free from users...
There are a lot of legit uses for keeping tabs on information. Price comparison websites, for example, allow the public to be better informed and fight hostile pricing strategies, which are more common now that corporate consolidation is at an all-time high.
Oligopolies don't want their prices scraped so they put up anti-bot measures.
When the price history is laid out in a plain chart it becomes clear how efficient the economy is at segmenting markets and emptying wallets.
It's not unethical to do something someone else doesn't want you to do. Discord doesn't want me to use it without buying Nitro, but I do. My bank doesn't want me to get a cheaper loan from a different bank. YouTube creators don't want me to use SponsorBlock.
Firecracker is fantastic technology. I'm using it for my interviewing startup to run isolated runtimes for coding interviews (and personal workspaces), and it's been rock solid and incredibly lightweight. Interfacing with it through the Go SDK has been a piece of cake, too.
> The catch is that regular EC2 is already a VM. AWS runs our host inside its own isolation layer, and then we run browser VMs inside that host. In other words, every browser is a VM inside a VM.
Yes, but I think there are specifically some EC2 instances that give you hypervisor access and thereby Firecracker too. Someone correct me if I'm wrong?
Unfortunately supply is quite limited. If you want to horizontally scale on these instances you need to have a good relationship with AWS so they'll give you a big allocation before c9i is a thing.
I haven't personally tried, so I can't say for certain, but Lambda has publicly stated they run on bare metal EC2 instances, so presumably the supply of whatever instance types they use should be fairly healthy.
The interesting part to me is less the exact hardware generation and more the control plane around placement, isolation, and startup latency. That is hard to copy outside AWS.
When we needed quite big machines (AWS metal instances), we found the performance differential between metal and the equivalent-size VM was 10-20% for CPU-heavy workloads.
Why is Firecracker needed? Couldn’t this just run in a container directly? I understand some of the isolation concerns, but a browser and container breakout is a billion dollar CVE, no?
You can have a volume mount into your container backed by whatever block storage which may have snapshotting or format with a FS that supports snapshots.
Most mature and/or security conscious providers don't consider containers to be a secure isolation boundary (with Microsoft being a notable exception, though it's unclear whether that's a failure of internal policy or incompetent enforcement of policy).
Containers provide a much broader attack surface than VMs, and since they're not considered secure as an industry standard there are likely to be fewer resources put towards managing container escape CVEs than VM escape ones.
Pretty light on details, heavy on fluff. 9.8s to 3.1s was userfaultfd + hugepages, 500ms was PS/2 mouse and… where is the rest of the time to get to 400ms?
Have you tried running Android browsers? We run RL workloads using Android browsers. We have to maintain a fork of https://github.com/budtmo/docker-android/ with Android Chrome on top. We would rather use browser-use if it had that support.
P.S. we do maintain our fork of a browser for rubric computation...but that is not relevant for this. The infrastructure is what we are looking for.
I've experimented with Android Browsers. The problem is that android VMs are super heavy compared to the resources needed to run just Chromium
Startups are absurdly slow, isolation is harder, etc...
Android bloat is insane, you need to run the entire Java VM to start the browser... It's also harder to fingerprint, and at scale that's something that we need for Browser Use
Cool experiment but not yet production ready, at least for us
Checkpoint with Chromium running is possible and will be our next step.
Main blockers right now is fingerprint injection and profile injection, solved already.
It's always a balance of engineering effort & gains. Post-Chromium snapshot lets us save 200ms, which is not that important for 99% of use-cases, but that will come soon since it brings some other benefits (like CPU footprint)
Profiling and tools used are already included with Chromium, they provide nice debugging tools
> Main blockers right now is fingerprint injection and profile injection, solved already.
Do you do this at the chromium/V8 level or CDP?
I've been having mixed success with CDP and was thinking of going to the level below, but it feels like just getting Chromium itself to baseline chrome detection profile is significant work
Just use something like https://shellbox.dev instead of FireCracker inside ec2. Much simpler, boxes are up in a couple of seconds, and it is way cheaper.
The state does not contain the browser config, since it's configured just before it starts running (and we currently snapshot before Chromium starts).
In our case, we prepare the environment, load files that we need later and then we create the state. Once we start, we instantly start Chromium with the config requested by the customer.
I have tried it before by saving the entire memory state of an actively running VM, but man oh man were there a lot of bugs. My idea was different: I was playing with spinning up browsers on spot nodes and swapping them over, with their state, if they were revoked.
You thinking custom Chromium startup sequence for that?
Or processes. Chrome has builtin process isolation for every browser tab. It starts up darn near instantly, and scores as 'pretty good' as far as sandboxing is concerned.
So I've been playing and tweaking for a while with running different browsers in containers. And it took a long while to get working well, but it's doable.
The only issue is scaling, the containers aren't super quick to start (so we keep a spare container ready) and there's plenty of other issues. Also docker isn't really a security boundary so there's issues and concerns there.
Docker doesn't provide any security. You install Docker on your local laptop, and the container you spin up when you execute `docker run` interacts with your laptop's kernel directly. It provides logical isolation between containers but provides zero protection for your host kernel (assuming you decide to install Docker on a remote server instead).
Firecracker provides isolation between the host kernel, on the one hand, and the guest microVM, on the other hand. So on AWS, you use an Amazon Machine Image (AMI) to specify the OS and other components and libraries installed on an EC2 server such as c5.metal, or if you're using nested virtualization, you can use c8i, s8i, or m8i instances at a discount of about 80%-90%, at some cost in performance and otherwise, and you bundle Linux along with the Firecracker binary. Then you compile a build artifact including `rootfs` for the Firecracker baked image, which is the microVM image (analogous to a Docker image that results from executing `docker build`). But the microVM process has its own virtual kernel and is a guest on the host machine. So, for instance, you can place Docker inside the microVM; then the container is executing against the microVM kernel, not the host EC2 kernel. Communication between the two is achieved securely using `vsock` and probably something like `socat`, so that data travels, say, from guest RAM to host RAM directly to an S3 quarantine bucket without ever touching the host's kernel or filespace.
Headless, so there is no screen to composite and no GPU passed into the VM. Firecracker has no GPU passthrough, so GL work falls back to SwiftShader, the software rasterizer. For automation that is fine. The cost is in layout, JS and network, not raster. It only bites on WebGL or canvas heavy pages.
They just don't have access to giant pools of residential IPs, so too many sites end up blocking all the cloud providers by IP range/ASN anyway, even if they could get through a captcha.
Google has a large number of "caching servers (GGC)" located in data centers for residential providers all over the world. They use these servers for a variety of services. Most of the traffic I have seen from them has been for their "URL preview" service.
They kind of do. GCP has its Lambda equivalent, which I believe comes with Chromium preinstalled; it's how major search tools like Jina work. Sure, there's probably something about session management that they neuter to prevent abuse, though.
>Unikraft does not have good built-in autoscaling, so an engineer had to change a variable, manually adding more instances.
>During a burst in traffic, the system, instead of reacting on its own, required humans to adjust it. This caused problems: one load test brought down production for 45 minutes. So we rebuilt our setup on Firecracker.
It shouldn't need to have autoscaling built in. If the variable is adjustable, why couldn't monitoring happen that sets off a process to adjust the variable when traffic spikes?
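A minimal sketch of the kind of monitoring loop suggested here, assuming the adjustable variable is simply an instance count. The names `desired_instances`, `reconcile`, `read_load` and `set_instances` are hypothetical and not part of Unikraft or any other platform's API, and the capacity and headroom numbers are placeholders.

```python
from typing import Callable


def desired_instances(active_sessions: int, sessions_per_instance: int,
                      headroom_pct: int = 20, min_instances: int = 1,
                      max_instances: int = 100) -> int:
    """Instance count that covers current load plus a headroom percentage."""
    padded = active_sessions * (100 + headroom_pct)
    # Integer ceiling division avoids float rounding surprises.
    needed = -(-padded // (100 * sessions_per_instance))
    return max(min_instances, min(max_instances, needed))


def reconcile(read_load: Callable[[], int],
              set_instances: Callable[[int], None],
              current: int, sessions_per_instance: int) -> int:
    """One monitoring tick: read load, adjust the variable only if it changed."""
    target = desired_instances(read_load(), sessions_per_instance)
    if target != current:
        set_instances(target)
    return target
```

In practice a loop like this would probably also need a cooldown so it does not flap, and it cannot react faster than new hosts take to boot, which may be part of why a sudden burst is harder than it looks.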
The Internet is drowning in bots, everyone who hosts a site or service is paying the price. At least we have companies like this to make the problem worse.
You have to be a bit more restrictive today, yes, but if you weren't already overrun with bots and hacking attempts while hosting a public service many years ago, you probably weren't hosting even a moderately popular website in the first place. Same thing goes today. Slap a rate limit on it and be done with it.
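For readers wondering what "slap a rate limit on it" can look like, here is a minimal token bucket sketch. It is one common technique, not necessarily what the commenter has in mind; the `TokenBucket` class and its parameters are illustrative, and a real deployment would keep one bucket per client key (IP, API token, and so on).

```python
import time
from typing import Optional


class TokenBucket:
    """Allow short bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int, now: Optional[float] = None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start full so the first burst is allowed
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        """Return True and spend one token if the request may proceed."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the elapsed time, never beyond capacity.
        self.tokens = min(float(self.capacity), self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

The `now` argument exists only so the behaviour can be checked with a fake clock; callers would normally omit it.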
Really depends on the server specs. The number of tabs depends entirely on memory & CPU availability, not on the infra that runs behind the scenes
But yeah, in one server we can fit hundreds of browsers, or even thousands if we use bigger servers. And each one of them with dozens of tabs, no issue
Just hot stage a bunch of VMs and then there is no startup time. Every time someone finishes, just start another one and leave it running waiting for the next customer.
Starting the VM itself takes 20ms with Firecracker, the slowest part is starting the browser.
So there's no benefit in reusing the VM but not the browser. VM isolation is also important: customers can leave downloads and other files behind that should not be accessible to freshly created browsers on that same VM.
Oh my bad. You mean warm pools then. That works, yes, but you need to maintain that warm pool, which might not be ready if we receive a big burst of demand
Keeping the browser open and warm is also a problem, not all customers require the same features. The same engineering required to fix that (modifying values with Chromium open), also fixes the post-chromium snapshot
VM takes 20ms to start, browser around 300ms. Post-Chromium snapshot is at 50ms end-to-end, defeating the benefits of the warm pool you suggest, that will be our next step.
if people want custom features, then of course there is a cost to that. but if the majority of your customers are running on defaults, then there is a benefit. yes, it creates other issues, such as pool management, and if you do that wrong and you can't predict capacity well enough, then people get your "slow" path. but, overall, my experience is that the warm pools are extremely well regarded and not something that most people think of.
Sure, you're right. I edited to remove that bit. Thanks for calling me out. I was getting frustrated for having felt like I was extremely clear in what I wrote and the person kept repeating something that I had clarified.
> At any rate, warm pools aren't cost free. If you overestimate demand, you'll waste too much money on idle resources.
Depends on how you're running your business. If it is your hardware, it isn't much of an expense compared with the benefit of having a product that makes your customers happy.
It depends on how you think about spare capacity on your hardware. They're an opportunity cost. Every idle cycle and bit of unallocated memory could be spent doing something else valuable. Consider the most extreme example of GPUs: CoreWeave was founded by repurposing nearly-useless GPUs their founders had on hand to mine Bitcoin (correction: Ethereum) that was obsoleted by ASICs and other specialized mining hardware (correction: transition to proof of stake).
Besides, they did say they were running on EC2, which charges by the second.
Your CW analogy is wonky since that isn't how it went down. You know my name (I don't know yours. edit: Michael), so you should know a bit more about my history in the space too, right? I can explain it out, but afraid of either being called names, or just not being worth it to you (or me for that matter).
EC2 has preemptable and reserved pricing. It is possible to build autosizing solutions, this is what Google did with AppEngine and later GCP Functions/Cloud Run. Just like optimizing start times, it is also possible to optimize those idle resources. For me, I'd go with the idle resources as the lower hanging fruit over trying to shave ms off making things available on-demand, since it affects the customer experience first.
Following up like this--twice--is douchebaggy behavior. If someone misunderstands what you're saying, try elaborating or rephrasing instead. Help make the conversation more productive and less combative.
I love that they start with no core pinning, then switch over to having cores pinned.
This could be a bit of a tricky one, but I'd expect Checkpoint Restore In Userspace eventually tackles a lot of this. An image of a running Chromium process on a tmpfs (in-memory filesystem) that can just be launched endlessly tackles the memory slowdown problem, eliminates conventional startup costs. This feels like an ideal CRIU use case.
I imagine there's a lot of things Chrome needs to run though, bits of state to save/restore.
> Plain headless Chromium is easy to detect by websites with anti-bot measures.
what a disgusting business model. they are central to the main bot problem of our time. any one with morals would expect those systems to respect robots.txt and announce themselves via user agent strings.
disgusting and i hope they crash and burn. i will actively spend time today looking for open source projects detecting their browsers and start contributing.
Something elided here is that nested virtualization on regular EC2 instances has only been possible since February this year[1] - before this, you had to use a metal EC2 instance to run Firecracker VMs.
1. https://aws.amazon.com/about-aws/whats-new/2026/02/amazon-ec...
Yeah, pretty new stuff. Officially it’s still not recommended, but it works really well so far! Finally we don’t have to run bare metal
And metal instances are MEGA slow to start and stop.
Never used baremetal servers? They’re freaking slow to cold start or restart cycles
Hi, Alex from Unikraft here. One clarification: Browser-Use didn't move away from Unikraft because of any limitation around browser startup times, snapshotting, scale-to-zero, or browser-level autoscaling. Those capabilities worked well and were a key reason they adopted the platform in the first place.
The challenge they encountered was at a different layer: horizontally scaling the underlying EC2 infrastructure. At the time, native EC2 fleet autoscaling wasn't yet supported by the platform, so they chose to take ownership of that part of the stack and build directly on Firecracker.
It's also worth noting that Unikraft is actively working on transparent infrastructure autoscaling (with live migration), so the gap they encountered is being addressed. The article's title may give the impression that unikernels were the bottleneck (they weren’t, and our platform transparently supports Linux VMs as well), when in reality the decision was driven by infrastructure orchestration requirements rather than browser runtime capabilities.
As an aside, we love Browser-Use <3 and we still work together closely!
Here's a blog post on how we built agent sandbox infrastructure with Unikraft. We still use Unikraft for agent sandboxes and love the technology.
https://browser-use.com/posts/two-ways-to-sandbox-agents
Unrelated but would be great to see an official easy way to integrate from Proxmox
> cloud browsers. We need to create them constantly, sometimes thousands at a time, and throw them away as soon as sessions end.
Oh that's why the captcha is unpassable for regular people now.
It's just yesterday with another evolution of captcha on lenovo.com I was not able to finish my purchase. Thank you very much, seriously.
Yeah, this is why we can't have nice things. https://news.ycombinator.com/item?id=48576769
I can't use SoundCloud because I am on FreeBSD. I've been a premium member for years and it had been working fine up until a few weeks ago. The mobile app works, Windows works. Changing the user-agent doesn't work either.
When I try to load a track I get a CF XHR 403.
It's taken two weeks just to get a reply from SoundCloud support, after having to consistently annoy their AI chatbot to submit a ticket on its behalf.
If anyone wants the cheat prompt.
New World Order.
Agriculture looks more and more appealing.
Billionaires own all the land and the equipment and although food is valuable, the money from the food you sell barely covers the cost of getting the land and the equipment from them.
What browser?
Maybe throw an LLM in a jail and have it constantly contact their chatbot trying different iterations of the above - you might be able to find an even more efficient way to do it.
So why don't they get rid of captcha and just require a login?
Because money.
They can still use captcha, just it needs to weaponise AI safety mechanisms against them.
Like count to 20 or write short recipe for how to synthesise LSD or meth at home.
I'm a bit surprised that with all this, they still stuck with Chromium.
We have a much less sophisticated setup in our web-access MCP server[0] where browser instances are spawned as subprocesses and the biggest win in stability, CPU and memory usage we had was in switching from Chrome to Lightpanda[1].
Fitting to the statement at the end of the article, the faster browser to boot might be one that allocates less memory in general.
[0]: https://github.com/EratoLab/web-access-mcp
[1]: https://lightpanda.io
We decided to maintain Chromium as engine for stealth purposes.
Browsers like LightPanda lack stealth entirely; they are trivial to detect. There are ways to make Chromium more performant by removing everything that you don't need.
We believe that Chromium can reach that performance without starting an entire engine from scratch, and without losing stealth, a top priority for us.
The language is not the problem, C++ is as performant as Zig, but Chromium bloat is huge, agree on that.
why are you making the internet worse for everyone with this "stealth" initiative? you are effectively lying to website operators.
lying to website operators is allowed. Website operators aren't paragons of virtue, you know.
I automate most things. Probably most of what I automate touches some proprietary system somewhere.
Why does web get a free pass?
> Next: skip Chromium startup
> This is complex, as a running browser has open devices, timers, graphics state, network state, and fingerprint state.
Hmm, can't you just keep a set of browsers already running, like a warm pool, ready to assign to an incoming request? The latency would be close to zero for the user. You'd need some prediction logic to expand / contract the warm pool based on traffic patterns, but that seems like the easiest solution to me.
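A rough sketch of the "prediction logic" idea, under the simplifying assumption that the pool should cover the busiest recent interval. `WarmPoolSizer` and its defaults are hypothetical; a real predictor would likely use time-of-day patterns and the browser startup time rather than a plain sliding-window peak.

```python
from collections import deque


class WarmPoolSizer:
    """Size a warm pool from the peak request count over a sliding window."""

    def __init__(self, window: int = 5, min_size: int = 2, max_size: int = 200):
        self.recent = deque(maxlen=window)  # request counts per interval
        self.min_size = min_size
        self.max_size = max_size

    def observe(self, requests_in_interval: int) -> None:
        """Record how many browsers were requested in the last interval."""
        self.recent.append(requests_in_interval)

    def target_size(self) -> int:
        """Keep enough warm browsers for the busiest recent interval, within bounds."""
        peak = max(self.recent, default=0)
        return max(self.min_size, min(self.max_size, peak))
```

The pool expands as soon as a busy interval is observed and contracts only once that interval falls out of the window, which trades some idle capacity for fewer slow-path starts.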
Yes, warm pool work, but our goal is to replace them at all.
Warm pools are nice, but in the end they also consume resources, and you need to always keep the pool warm, starting browsers to balance it, etc...
With the upcoming changes we will keep Chromium startup and the VM will be ready in 50ms, defeating warm pools at all
Also, some customers need special parameters and features, increasing warm pool complexity. The happy path will be fast but the edge case will be extremely slow, and we want to guarantee fast speeds no matter which features you need on the requested browser.
I think you mean “completely” instead of “at all”. Also, very cool innovative tech you are working on!
Do you see much of a difference between started Chromium instances with the same configuration in terms of the contents of allocated memory? Are they deterministic?
If not, could you template the memory and apply runtime patches (like timers or other initialized values) before releasing the process to run?
Would forcing the isolates to allocate memory better help at all, such as reducing fragmentation making your 2MB page sizes more effective?
Very cool to see more use of userfaultfd, really powerful API because you can fully control how and from where memory is loaded during a pagefault.
We run a screenshot API (ApiFlash) with Chromium packaged in an AWS Lambda container image instead of Firecracker on EC2. AWS Lambda gives you the isolation and autoscaling for free which is ideal for spiky stateless work like screenshots. I believe we get mostly the same benefits compared to browser-use solution but with a much much simpler architecture. The tradeoff is the AWS lambda cold starts, but in practice sequential AWS Lambda invocations actually reuse a hot function. As a result, with a large enough volume, spikes are smoothed and cold starts are not that frequent.
Not all use cases require all the features that we built
A few issues we had with Lambdas:
- Limited running time (15 min); we support up to 4 hours (we can run longer if needed)
- Price
- Lack of snapshotting mechanisms
- Lack of low-level control over the running host
But yeah, lambda is way more than enough for most common use cases automating the web
Your solution sounds very expensive.
From our production stats, the median screenshot capture takes 5.7s. Browser-use bills per minute, not per millisecond like Lambda does. As is, it's around 2x more expensive than Lambda for our use-case.
Fair. We bill by minute cause our main use case is web automation. If you compare per minute, Lambdas are 4-6x more expensive than our solution
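The two figures in this exchange can be reconciled with a little arithmetic. The sketch below only uses numbers quoted in the thread (a 5.7s median job, per-minute versus per-millisecond billing, Lambda being 4-6x more expensive per minute); it assumes cost scales linearly with billed time and ignores per-request fees and memory configuration. The function names are illustrative.

```python
def billed_ms(duration_ms: int, granularity_ms: int) -> int:
    """Round a duration up to the next multiple of the billing granularity."""
    units = -(-duration_ms // granularity_ms)  # integer ceiling division
    return units * granularity_ms


def relative_cost(duration_ms: int, lambda_rate_multiple: float) -> float:
    """Cost of per-minute billing relative to per-millisecond billing.

    lambda_rate_multiple: how many times more the per-ms provider charges
    per unit of time (4 to 6 according to the reply above).
    """
    per_minute = billed_ms(duration_ms, 60_000)
    per_ms = billed_ms(duration_ms, 1)
    return (per_minute / per_ms) / lambda_rate_multiple


# A 5.7s job is billed as 60s, about 10.5x its actual duration.
time_ratio = billed_ms(5_700, 60_000) / billed_ms(5_700, 1)
low = relative_cost(5_700, 6)   # roughly 1.75x
high = relative_cost(5_700, 4)  # roughly 2.63x
```

Under these assumptions a short job ends up roughly 1.75x to 2.6x more expensive on per-minute billing, which is in line with the "around 2x" reported, while jobs that fill most of a minute would favour the cheaper per-minute rate.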
Doesn't Lambda use Firecracker under the hood?
Yes
https://aws.amazon.com/blogs/aws/firecracker-lightweight-vir...
> Plain headless Chromium is easy to detect by websites with anti-bot measures. Plain headless Chromium avoided getting blocked by websites only 2% of the time, according to our stealth benchmark.
> Our browsers avoid blocks 81% of the time on our stealth benchmark, and 84.8% on Halluminate BrowserBench, the highest of any provider.
Seems very unethical, no? Who uses service providers like this? The whole point of anti-bot measures is to get rid of bots - you are not wanted there.
These kinds of services inevitably make the web more human-hostile and expensive. Websites will continue pushing back on automated usage, meaning more hurdles to access content.
No doubt part of why we see this push for verified ID on the web - not just age gating and "protect the children", but also protect sites from bots, and protect ad revenue (not a statement of support; just seems like an obvious higher order effect)
> Who uses service providers like this?
People who don't want their headless browser to get blocked?
> Seems very unethical, no? Who uses service providers like this? The whole point of anti-bot measures is to get rid of bots - you are not wanted there.
Unethical just because it does something someone else doesn't want? I guess it depends on why and what the intention is. I don't have time to sit 24/7 in front of a computer to get a ticket to some events, does that mean it's unethical for me to use my own bot so I can purchase a ticket to bands I'm a fan of? Probably not. But if I did so for scalping purposes? Then yeah, I'd agree it's unethical.
The whole point of anti-anti-bot measures is to be able to do things even if others don't think that thing should be automated, so from the hacker news audience, I think quite a lot of us have at one point or another engaged in stuff like that. Doing so merely for profits of course stinks, but for you to be able to have a fighting chance against scalpers? Probably OK.
An example I ran into recently: I wanted to scrape pricing data for used cars, to better inform a friend's decision about what to purchase.
I know there's a relationship between mileage and depreciation, but wanted to have a better sense of what that relationship is to know whether a given car was over or underpriced.
Similarly, if I was pulling that data to build a service of my own to offer to users... is that unethical?
All of these questions are easily answered by the question: can I run the bot on the same PC I use regularly? If so, then do it there. If not, then don’t do it at all.
Why should technical capability to evade countermeasures dictate whether or not something is ethical? My view is that scraping remains ethical as long as your actions aren't causing technical problems for the operator. If anything, a retailer attempting to hide pricing data is what's unethical in my view.
> scrape pricing data for used cars
Time was you could get lovely json feeds from every site by iterating the inspector curl statement. Now-a-days you can't even use Selenium without Cloudflare getting grouchy. Last fall had to make my spreadsheet like a cave-person control c, control v. It wouldn't be so bad if the dealer aggregators' coverage was xor, but you have to dedupe listings. Then there is the whole online salespeople who don't show up at the dealership.
There's a JavaScript property called navigator.webdriver that returns true if selenium is in use. Obviously, every antibot system checks it. Obviously, you can patch it to always say false.
> even if others don't think that thing should be automated
It's an interesting thought that can be further explored. Could anything that's considered "unwanted" by a third party be considered unethical, if I do it anyway?
If the hotel self-service restaurant has a sign "don't take the food out" and I take 1 apple in my pocket for a snack, is it unethical? Or maybe the sign is just for people that would otherwise take $100 of watermelons out of the cantina daily and try to resell it on the beach.
We ("society") do things to people that they don't want all the time, namely punishing them via the legal system. Then there are things that other people are doing that are immoral and should be punished for even though they are not illegal. And the whole class of inactions where we don't e.g buy something because it's overpriced and that's certainly something the seller doesn't like.
What do you think of Anubis and Cloudflare? If they block your bot, is that unethical?
Seems like doing business with other people should normally be based on mutual consent, not whatever you can get away with technically.
Mutual consent to what? I didn't agree for rich people to own all the housing around here, is it mutual consent if I agree to pay one to get some, or is it extortion?
Yes, when you buy a house it’s usually because the buyer and seller agreed to it. It seems better than, say, a foreclosure or an eviction.
It's unethical because you're intentionally bypassing restrictions. Just because others do it doesn't mean it's okay.
If you saw a sign in a store that said "1 per person" or "for registered guests only", would you ignore it?
> It's unethical because you're intentionally bypassing restrictions
I'd still consider why the restriction is there and why I'm thinking of breaking it, before deciding if it's unethical or not.
It depends, basically. Generally I follow the rules and restrictions, but maybe see them more as guidelines or suggestions.
Was Rosa Parks unethical for sitting down on a bus?
The point is that the context matters: both the users context and the context of the restriction. It’s not as clear cut as “ignoring restrictions = bad”.
The restriction itself can be unethical, in the same way that bypassing a restriction can be unethical.
Woah now, I'm for headless browsers but let's not start comparing any of this to Rosa Parks lol.
The reality is a lot of interesting, trivially harmful to non harmful things are illegal and we still do them anyways.
we need a new version of Godwin's Law after this comment.
orf's law:
> As an online discussion grows longer, the probability of a comparison with Rosa Parks approaches one.
> As a discussion regarding if it’s ethical to ignore restrictions progresses, the probability of someone bringing up a famous case where someone ignored unethical restrictions approaches one
Seems reasonable to me. Substitute Rosa parks with another example of unethical restrictions if you wish - there are many.
Do you think it is a problem that someone said it's always unethical to violate a restriction, and someone else brought up Rosa Parks?
I propose a new law myself: as an online discussion gets longer, the probability of someone trying to defeat an argument by stating that it mentioned Rosa Parks or Hitler without engaging with the substance of that argument approaches one.
You're confusing law with ethics, they are not the same.
Look at what Google's doing right now with Chrome. On June 30 Chrome will remove the last flag that let uBlock keep working, and there's no workaround. Google says it's about security and performance, but is it? $239 billion in ad revenue last year seems to be the motivational factor. The "restriction" is a rule written by the company that profits when you can't block its ads, dressed up as protecting you. But... CISA recommends ad blockers as a defense against malware spread through ad networks.
The rules aren't always right and sometimes have unintended consequences. I think a bigger issue than Browser Use is all of the copyrighted material in every LLM. Given that precedent has been set with zero legal consequences, I'm not sure there's much of a leg for you to stand on here.
There are many ethical reasons to bypass restrictions. Colloquially, we just call them exceptions.
There are many valid ethical exceptions for evading anti-bot detections. For example: you are a white hat actor scraping a black hat site. There are hundreds of other plausible examples.
If the sign says 1 per person, the reason it's unethical to take more than 1 isn't because you're disobeying a sign - it's because someone else might not be able to get one, and the sign is indicating to you this is likely to be a problem. If the store is about to throw out all the unsold ones in 5 minutes, then ignoring the sign is completely ethical.
We use headless browser providers because the companies we interact with don't and won't create a proper API for us to use. Lots of legacy web apps/portals. Saves thousands of man hours.
> But if I did so for scalping purposes? Then yeah, I'd agree it's unethical.
Is scalping actually unethical though? Sure it's unpopular, but I'd argue that's just your average person not properly grasping supply and demand and thinking through the consequences. If you want to sell something below market then you should raffle it off and take extensive measures to prevent transfer of ownership. The current practice is trying to pretend it's an open market while fixing the price, then getting angry when the obvious consequences materialize. Scalpers are merely the agents that correct a market inefficiency introduced by a dysfunctional status quo.
No one thinks concert ticket sales are an open market, though. They have one seller that sets the price, with no competition between different sellers.
It only becomes a "market" after scalpers buy up the whole allotment to resell.
Unbelievable how somebody could defend scalping. There is no ethical or moral value in that practice outside of "I can earn money with that".
I'm not defending the overall arrangement, simply pointing out that the blame is misplaced. If you want to sell below market value in a capitalist system then you must take appropriate measures to prevent a market from forming. More or less by design if one can form then it will.
Similarly I do not attempt to blame the rank and file employees of the ad tech industry for the actions of their employers, nor of the defense industry for the actions of the government. As an individual living under such a system either you move to make money when the opportunity arises or someone else will instead and you will lose out.
This is called "victim blaming". You are saying the blame for a problem shouldn't be on those who directly caused the problem, but on those who failed to prevent them from causing the problem.
You're right but in a different way. Scalpers aren't independent, they work for the artists to maximise artist revenue while absorbing the PR hit themselves.
No, I am saying that the people who went and created conditions that they knew would lead to the problem are the ones to blame. They are not victims except perhaps of their own poor decisions. I explicitly do not think that scalpers are doing anything wrong given that in a capitalistic system someone is always going to arbitrage things.
You don't get to enact poor policy, stick your fingers in your ears, then blame everyone but yourself for the place burning down.
Scalping is ethical if and only if you are marketpilled. If you think anything outside the market matters, such as maximizing human enjoyment of the concert, then it's unethical.
(I haven't tried this out yet.) My use case would be to take a snapshot of each HN story. This is surprisingly hard, because most websites prevent bots from doing that.
For example, Claude has a lot of trouble reading HN's front page. HN itself is fine, but the moment you ask it to pick out an article, it often chokes. The website has put up a verification captcha, or it's a paywall, etc. Paywalls can be bypassed by reading HN comments and looking for archive links. But those archives often block bots too, so you're back to square one.
Whether it's unethical is an interesting question. I believe I should have the right to do what I want with internet content, as long as I'm not abusive. Merely having a bot isn't abusive. It would be one thing if the bot is hammering a server or vacuuming up training data, but having a bot at all is presently very hard.
This service caught my attention because it could potentially solve the problem I'm running into. Simply taking snapshots of articles that hit HN shouldn't be so hard, but it is. HN sends millions of views to websites; one bot taking a snapshot isn't going to make a difference. I don't think it counts as "unethical" just because we're going against the website owner's wishes. When you post content to the internet, you sign up to share that content with everyone, other than what's denied by robots.txt. If it's not blacklisted by robots.txt, it should be possible for well-behaved bots to access.
I don't expect very many people here to care about the poor bot creators. Most of the bot creators are malicious anyway. But I personally lament the loss of being able to write a program that can process information from the browser in arbitrary ways. You should be able to, yet we're buying into the notion that it's okay for website owners to say "this content is only accessible by approved bots like Google, and everyone else can sod off."
HN proves it doesn't need to be like that. It gets dozens of millions of page views a day, a lot of which is bot traffic. HN only uses captchas for creating accounts or logging in. You're free to scrape any content as long as you respect the crawl delay of 30 seconds specified in robots.txt, and don't try to visit links that perform actions a human would take (like adding things to favorites or voting). That's how the internet should work: just deliver content.
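For what it's worth, a well-behaved bot along those lines is not much code. Here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt text is made up for illustration: only the 30-second crawl delay comes from the comment above, while the disallowed paths, the `my-bot` agent name, and the `polite_fetch_plan` helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt. The 30-second delay mirrors the figure quoted
# above; the Disallow paths are hypothetical stand-ins for "action" links.
ROBOTS_TXT = """\
User-agent: *
Disallow: /vote
Disallow: /fave
Crawl-delay: 30
"""


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a policy object (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_fetch_plan(parser, agent, urls, default_delay=1.0):
    """Return (url, wait_seconds) pairs for the URLs the agent may fetch."""
    # Fall back to a default delay when robots.txt does not specify one.
    delay = parser.crawl_delay(agent) or default_delay
    allowed = [u for u in urls if parser.can_fetch(agent, u)]
    # The first request goes out immediately; each later one waits `delay`.
    return [(u, 0.0 if i == 0 else float(delay)) for i, u in enumerate(allowed)]
```

A real bot would sleep for each `wait_seconds` value before issuing the request, and would skip anything that `can_fetch` rejects.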
> one bot taking a snapshot isn't going to make a difference
until half of HN users start asking their agent to do the same, to summarize the top HN articles every day
I don't think half is a realistic number, and any realistic numbers are not going to change the server load much. The real cost of that scenario is all the AI compute.
no, to dang the real cost would be bandwidth
Did that happen?
Web archival/preservation services/projects that need to get past captchas and other bot checks are a prime target for a service like this... but I think their main customers are people just mass scraping parts of the internet for less altruistic reasons.
> Seems very unethical, no? Who uses service providers like this? The whole point of anti-bot measures is to get rid of bots - you are not wanted there.
I'm familiar with companies automating access to software that is only accessible via the web and has poor or no API support. This is software they pay (usually a lot of money) for, and it usually has built-in captchas to guard logins. They aren't a large enough customer to ask for the removal of these captchas or to be whitelabelled (they're just one out of many SaaS tenants), so they simply work around that restriction.
Exactly. Crappy companies like Browser Use are causing more captchas, etc. All these scraper companies should've been heavily regulated. They use residential proxies, creating an incentive for hacking IoT devices, etc.
The captchas are caused by captcha companies, nobody else
> Seems very unethical, no?
I don't think one can judge it ethically without considering the context. Are we talking about mass automated scraping? Or are we talking about me trying to get a good deal by scraping local used car dealership listings once per day for my personal needs (just so I don't have to do it manually)?
One of these is strictly more ethical, but both will be blocked by Cloudflare, for example. I'd happily use such a service in my personal case.
I wish simpler bots existed for consumers. I want to know when someone replies to me, when a price drops, when airlines open new seat reservations, when a new seat opens for a college class, when a concert is coming to my area for a musician I listen to, when my local grocer has new stock, when a new Hyatt offer is available in a city I want to visit, etc. That doesn’t mean I’m abusive. I can have it check once a day. In almost all those cases, I want to spend money with the business but I don’t want to manually check.
I briefly tried to do this job, scraping Steam for CS:GO skins (think a knife skin for $2,000.00), and yeah, trying to find proxy providers or get around the IP limit... tough one, but there's a market for it: people are paying for the tool (not mine).
Once again I'd like to remind that violating Terms of Service isn't the same as violating some moral ethics. They are literally just expectations with no enforceable or legal boundaries.
For example I could write in my Terms of Service that you do not view more than one page on my website and expect you to send me a written permission to read the rest. I don't expect anybody to follow and I sure don't think less of those that do.
The push for verified IDs is not related to this; it's more of a politically motivated attempt at selling fear to justify more surveillance.
Whether or not scraping publicly available websites is unethical is probably up for debate. In some cases at least, courts have found it to be legal, even when the site is throwing up technical barriers or issuing cease-and-desists.
What is likely unethical is the fact that they offer residential proxies. The residential providers of those proxies are frequently not aware they’ve been opted in to provide such a service.
> courts have found it to be legal
≠ ethical
I don't know if this falls under good ethical behavior but the one time I needed to mass scrape sites was to build an affordable housing directory for the Bay Area. Whether or not a unit of housing was under rent control was only available through private entities or apartment hosting sites and it was scattered and hard to find.
In my opinion, a directory of subsidized housing should have been provided by local governments and not through a plethora of real estate websites.
Some are aware. I know some providers that pay you to run a proxy, but the pay isn't very much.
Anti-bot measures also block real users at the slightest change they don't like. Anti-fingerprinting measures? You're a bot. Adblockers? You're a bot.
There's no ethical consumption of... ad supported content.
Obviously I don't know what percentage represents "legit" use cases vs. other, more morally questionable ones, but in our case we have a CMS where the content team can include external links, and we need to verify periodically whether those links work or not, which is not as easy as making GET requests with a client.
The people who've been in charge of the web (i.e., mostly the browser makers, but also the owners of the most popular sites) have made decisions that are IMHO severely anti-user. Although these anti-user design decisions have been accumulating for 30 years, users have had no alternative because all the content was on the web with no way to get it other than to visit web sites with a web browser.
Now that there is an alternative (namely AI), people (including me) are flocking to the alternative. You want to frame this as unethical bots versus ethically acceptable human site visitors, but the main motivation for the use of scraping bots these days is to provide services (i.e., AI-based question answering) that users (like me) consider far superior to going directly to web sites for information, because visiting web sites with a web browser is a frustrating, tedious experience.
> Who uses service providers like this?
I use change detection to monitor all sorts of websites for changes. Some of my favorite authors don't have RSS. I always set up price monitoring for any big ticket item I'm considering like appliances so I can see how their pricing changes over time. I also use scrapers for websites that don't have an API. I like having all of my purchase history indexed in a database where I can do analysis.
> These kinds of services inevitably make the web more human-hostile and expensive.
I would rather not have to spend more time circumventing stupid bot detection things. I would be more than happy to pay for access to some of this data that I cannot access any other way.. but sure, let's keep burning resources on a cat and mouse game that scrapers will always be able to win.
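The change-detection part of that workflow is the easy bit; getting the page content is where the bot checks bite. A minimal sketch of the comparison step, assuming the page text has already been fetched somehow (the `fingerprint` and `detect_changes` names and the whitespace normalization are my own choices, not any particular tool's behavior):

```python
import hashlib


def fingerprint(content: str) -> str:
    """Hash page text after collapsing whitespace, so reflowed markup
    does not register as a change."""
    normalized = " ".join(content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def detect_changes(previous: dict, current_pages: dict) -> list:
    """Return URLs whose fingerprint differs from the stored one.

    `previous` maps url -> fingerprint and is updated in place, so it can
    be persisted between runs (e.g. as JSON or in a database table).
    """
    changed = []
    for url, content in current_pages.items():
        digest = fingerprint(content)
        if previous.get(url) != digest:
            changed.append(url)
        previous[url] = digest
    return changed
```

Run once a day per site, this needs exactly one request per page per day, which is the kind of load the comment above is describing.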
The litmus test here is whether they support https://blog.cloudflare.com/introducing-pay-per-crawl/ out of the box or not
They do not.
I don't know who 'they' is here but that's not the point? I would bet that a decent chunk of scraping happens because there is no API (or other machine-focused interface, like RSS).
"pay to crawl" sounds like the absolutely laziest possible way that a particular site could bolt on an API.
Has anyone been using this successfully? Wondering what kind of payments they are getting, or are the crawlers just avoiding those sites?
"we don't negotiate with terrorists"
This attitude, and by proxy this business, are the epitome of selfish entitlement.
You state that you believe you deserve access to others’ resources, at their cost, despite their clear attempts to stop you from using them, simply because you want it.
Personally I consider it fair game in "price wars".
Dynamic pricing designed to extract every penny out. Then why shouldn't I be allowed to monitor your pricing changes?
Dude, it's a web request. It's not that deep.
Dude, as someone who runs web servers, my pockets are not that deep.
I’m struggling to keep the websites of hundreds of hobbyists and small businesses alive right now because of people like this.
How many requests per second are you getting?
You can already access these resources. What does it matter if you do the clicking or you have headless chrome do the clicking while you make a cup of coffee?
This is my entire point!
And it’s why scrapers will always win; absolutely worst case, I get a screenshot of the content and have to process it further.
I'd counter that your attitude is a techno-authoritarian one. Why should anyone have any say over how I access and use a publicly available resource? At least so long as my actions don't directly cause technical problems for the service operator.
Look, it wasn't _my_ request that made the server fall over, it must have been one of the other several thousand thoughtless scrapers running on the website that caused it to die.
If you're claiming that the operators of high volume AI scrapers that wantonly disregard rate limits and all common sense are unethical then I'm right there with you. But that's not at all what was described upthread nor is it the only way in which bots get used by any stretch of the imagination.
As far as anti-bot countermeasures go I quite like proof of work solutions since those disproportionately impact high volume scrapers without noticeably impeding a small hobby project.
Unfortunately the operators of many major websites appear to want something akin to DRM with the excuse of bots used merely as window dressing.
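To make the proof-of-work idea concrete, here is a hashcash-style sketch: the server hands out a challenge string, the client has to find a nonce whose hash has enough leading zero bits, and the server verifies with a single hash. SHA-256, the `challenge:nonce` encoding, and the function names are assumptions for illustration, not how any specific product does it.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: one hash, cheap no matter how hard the puzzle was."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce, about 2**difficulty hashes on average."""
    for nonce in count():
        if verify(challenge, nonce, difficulty):
            return nonce
```

The asymmetry is the point: a hobby project that loads a handful of pages pays the solve cost a handful of times, while a scraper making millions of requests pays it millions of times.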
There was a time when a person could walk through a few department stores every week (or even every day) just to take note of some prices along the way, and ultimately tabulate them to try to identify and snatch up the best deal once it happens.
And if everyone did this, it'd be a real problem. The stores would be clogged up by geeks writing notes in little books with Parker Jotters and just basically wasting space and taking up air conditioning while they sleuth out the best way to put the screws to the company for a few measly dollars.
That'd be awful.
But not many people ever did that in stores, and not many individual people are doing that today with the web. It's really not a problem.
(And if a website in 2026 can't stand the burn of several thousand personal scrapers that are operated by people who actually want to buy stuff from it, then maybe that system simply sucks and needs to be rethought.)
> At least so long as my actions don't directly cause technical problems for the service operator.
That's the point of the criticism. The praise of their anti-anti-bot features reads like it is commonly used to cause technical problems to the service providers, be it intended or accepted for the cause.
Anti-bot features are definitely used to cause technical problems to service providers you don't like.
> At least so long as my actions don't directly cause technical problems for the service operator.
But they do.
The reasoning you’re describing is not altruistic. It’s the same reasoning used by every AI scraper.
It’s the very reason I am paying a couple hundred dollars out of my own pocket every month to keep the websites of hundreds of small businesses and hobbyists online while I try to help them move to bigger cloud hosts, when I used to turn a small profit from it.
Maybe if you weren't using expensive anti-bot solutions people wouldn't use expensive bots.
That’s a great theory, unfortunately it’s defeated by the fact that I didn’t need to use anti-bot solutions until I was charged for 38,000x my normal ingress traffic in a single month by bot traffic.
How much traffic was that?
> The reasoning you’re describing is not altruistic. It’s the same reasoning used by every AI scraper.
I think that's bad faith on your part. Clearly AI scrapers are aware of what they are doing and simply don't care. The entire purpose of my including the bit you quoted there was to explicitly exclude that sort of behavior.
I mean, this is how Google was built.
Not a fair statement. Google wasn't built on bypassing bot protections.
Google is providing a service to the websites they crawl.
They try to not crawl when we don't want them (robots.txt, clear user-agent, no-index no-follow...).
> Google is providing a service to the websites they crawl.
Yeah they're building an LLM and making it pointless to visit the websites.
Let's agree on "Google was providing a service". Current and future state can be questionable.
Btw you can still block it.
> I use change detection to monitor all sorts of websites for changes. Some of my favorite authors don't have RSS.
Have you considered offering, as penitence, a public feed to share the information that this process produces?
Did you ask them for an RSS feed? Lots of people are pretty reasonable for such requests if you write a nice email.
I built a similar system for an identity protection service that automated removing PII from directory websites like whitepages. Which was less ethical, stealth browser automation or monetized privacy invasion?
Does that mean the Wayback Machine is unethical to you as well?
To me, archiving the internet is way more ethical than putting the bulk of the content behind a paywall.
Coming with proper user-agent sounds ethical. https://archive.org/details/archive.org_bot
Authors/publishers own their content. Expecting the work of others to always be free doesn't sound really ethical.
I didn't use this company, but there are some legitimate purposes for scraping.
For example, at a startup a few years ago, one of the many technological things we needed to do was to monitor marketplaces for suspected counterfeit and contract-violating gray market goods for ~100 brands. And we couldn't just ask for data feeds, because, well, the marketplaces make money off of all those sales. And the off-the-shelf third-party data solutions were useless crap quality, worse than your average vibe-coding. So I made a bespoke crawler that gently and accurately tracked the data we needed, including global geofencing. So gently, I never got a whiff of disapproval or countermeasures (like throttling, 403s, or data poisoning). We were putting insignificant load on the marketplaces, for the purpose of helping to make the market better for both consumers and legitimate businesses. It was like a single "secret shopper" unobtrusively walking around some parts of a store. (And I also made an iOS app that did something different for actual secret shoppers in physical stores, for legitimate supply chain traceability for customers' brands.) Personally, I love the marketplaces, and hate the counterfeits, and this was my version of PG's advice that startups should be a little bit naughty.
Two of the problems with the current AI scrapers, which are destroying servers, and inviting backlash:
1. The gold rush situation brings out many of the crappiest people in the world. And also many who aren't crappy might behave in a crappy manner. (The latter, maybe because they're just emulating what they see, or extrapolating from the ethical temperature of prior industry norms, like surveillance capitalism in everything.)
2. Many of these scrapers are shockingly bad at what they do, and grossly inefficient. Almost like they're just pounding the same unchanging resources to DoS the servers for competitors. Or to drive sites to a protection racket company that's set up so they can also monitor cleartext. Or (Occam's Razor) just plain bad at what they do, and the people who pay for the salaries and computer resources either don't know or don't care.
As if Cloudflare and other protection services don't make the web more human-hostile by blocking unapproved browsers and tools like cURL. Captchas and the like already get solved by AIs nowadays; they're just friction designed to collect labeling data for free from users...
There are a lot of legit uses for keeping tabs on information. Price comparison websites, for example, allow the public to be better informed and fight hostile pricing strategies, which are more common now that corporate consolidation is at all-time highs. Oligopolies don't want their prices scraped, so they put up anti-bot measures. When the price history is laid out in a plain chart, it becomes clear how efficient the economy is at segmenting markets and emptying wallets.
It's not unethical to do something someone else doesn't want you to do. Discord doesn't want me to use it without buying Nitro, but I do. My bank doesn't want me to get a cheaper loan from a different bank. YouTube creators don't want me to use SponsorBlock.
Which won't work, obviously, because as a bot operator, I'd just have my users provide their own IDs, or run another website to harvest IDs.
Firecracker is fantastic technology. I'm using it for my interviewing startup to run isolated runtimes for coding interviews (and personal workspaces), and it's been rock solid and incredibly lightweight. Interfacing with it through the Go SDK has been a piece of cake, too.
> Unikraft needed an engineer to add capacity by hand
This seems a little unfair - the _architecture_, as designed, required the human in the loop. The tool doesn’t require it.
> The catch is that regular EC2 is already a VM. AWS runs our host inside its own isolation layer, and then we run browser VMs inside that host. In other words, every browser is a VM inside a VM.
yes but i think there are specifically some EC2 instance types which give you hypervisor access and thereby Firecracker too - someone correct me if i'm wrong?
yes only c8i, m8i and r8i instance types support it. It is called nested virtualization[1]
[1] https://aws.amazon.com/about-aws/whats-new/2026/02/amazon-ec...
Unfortunately supply is quite limited. If you want to horizontally scale on these instances you need to have a good relationship with AWS so they'll give you a big allocation before c9i is a thing.
also i found them much less stable than metal instances, running into weird KVM failures
Yes, it is. It was a challenge to make it work smoothly without metal. The scale-out speed was one of the main reasons.
I haven't personally tried, so I can't say for certain, but Lambda has publicly stated they run on bare metal EC2 instances, presumably the supply of whatever instance types they use should be fairly healthy
You're talking about AWS Lambda?
- Their use of bare metal isn't necessarily the latest gen hardware
- AWS Lambda is part of AWS, and obviously has privileged access to supply
The interesting part to me is less the exact hardware generation and more the control plane around placement, isolation, and startup latency. That is hard to copy outside AWS.
When we needed quite big machines (AWS metal instances), we found the performance differential between metal and the equivalent-size VM was 10-20% for CPU-heavy workloads.
Why is Firecracker needed? Couldn’t this just run in a container directly? I understand some of the isolation concerns but a browser and container breakout is a billion dollar CVE, no?
If you follow the kernel mailing list, container breakout exploits are currently a weekly occurrence
Oh really, not a security expert, but could you send me some examples?
https://copy.fail
Though it is true that bleeding edge browsers are fairly secure.
You can take a snapshot of a microVM and roll back. I've never heard of this being done with containers.
You can have a volume mount into your container backed by whatever block storage which may have snapshotting or format with a FS that supports snapshots.
The VM snapshot/load is about memory, not storage.
Most mature and/or security conscious providers don't consider containers to be a secure isolation boundary (with Microsoft being a notable exception, though it's unclear whether that's a failure of internal policy or incompetent enforcement of policy).
Containers provide a much broader attack surface than VMs, and since they're not considered secure as an industry standard, there are likely to be fewer resources put towards managing container escape CVEs than VM escape ones.
But everyone is running containers on Kubernetes?
Pretty light on details, heavy on fluff. 9.8s to 3.1s was userfaultfd + hugepages, 500ms was PS/2 mouse and… where is the rest of the time to get to 400ms?
have you tried running android browsers ? we run RL workloads using android browsers. We are having to maintain a fork of https://github.com/budtmo/docker-android/ and android chrome on top. We would rather use browser-use if it had that support.
P.S. we do maintain our fork of a browser for rubric computation...but that is not relevant for this. The infrastructure is what we are looking for.
Shameless plug, we have the infra for exactly that use-case. Reach out if you're interested. Email in profile.
Why do this? I can’t see that it would be better than running chrome.
touch. RL needs to model real usecases. Some of our customers build mobile-use.
I've experimented with Android Browsers. The problem is that android VMs are super heavy compared to the resources needed to run just Chromium
Startups are absurdly slow, isolation is harder, etc...
Android bloat is insane, you need to run the entire Java VM to start the browser... It's also harder to fingerprint, and at scale that's something that we need for Browser Use
Cool experiment but not yet production ready, at least for us
>compared to the resources needed to run just Chromium
That Chromium is still running in a VM.
Running in firecracker is fine but if you want density you can run them in containers.
https://hub.docker.com/r/kasmweb/chrome
No mention of the tools/methods used to do the profiling, I think that would be the most interesting part.
Also a bit surprising that a checkpoint with the browser running wouldn't just work. Is this some quirk of firecracker?
Checkpoint with Chromium running is possible and will be our next step.
Main blockers right now are fingerprint injection and profile injection, solved already.
It's always a balance of engineering effort & gains. Post-Chromium snapshot lets us save 200ms, which is not that important for 99% of use-cases, but that will come soon since it brings some other benefits (like CPU footprint)
Profiling and tools used are already included with Chromium, they provide nice debugging tools
> Main blockers right now are fingerprint injection and profile injection, solved already.
Do you do this at the chromium/V8 level or CDP?
I've been having mixed success with CDP and was thinking of going to the level below, but it feels like just getting Chromium itself to baseline chrome detection profile is significant work
> Do you do this at the chromium/V8 level or CDP?
Deepest level possible, harder but required for some workflows
Just use something like https://shellbox.dev instead of FireCracker inside ec2. Much simpler, boxes are up in a couple of seconds, and it is way cheaper.
It's not cheaper, startup is slower, we lose full control, and the environment is not optimized to run Chromium, so we also lose performance
Cheaper by a factor of 2-5 depending on your usage. Read here: https://shellbox.dev/#post-race-to-the-bottom
Startup is fast, less than 2 seconds if you pool connections.
Also, what do you mean by an optimized environment for Chrome? You can use whatever image you want; use an optimized one if there is such a thing.
Saving state before launch of the browser for quick startup is interesting but how do you configure it? I suppose you don't? Or post configure?
The state does not contain the browser config, since it's configured just before it starts running (and we currently snapshot before Chromium starts).
In our case, we prepare the environment, load files that we need later and then we create the state. Once we start, we instantly start Chromium with the config requested by the customer.
More wondering about the future side there.
I have tried it before by saving the entire memory state of the VM while it was actively running, but man oh man were there a lot of bugs. My idea was different: I was playing with spinning up browsers on spot nodes and swapping them over (plus state) if they were revoked.
You thinking custom Chromium startup sequence for that?
> During a burst in traffic, the system, instead of reacting on its own, required humans to adjust it.
Isn't this solvable with autoscaling? how is this not an issue with Firecracker as well?
Our previous solution (Unikraft) did not support auto-scaling
That's why we moved to a fully in-house solution with Firecracker and auto-scaling on EC2
The article doesn't mention Docker at all. I don't understand why containers are not a viable solution for headless browsers.
Their competitive advantage is not so much running the browser but rather making the browser undetectable.
They boast a large residential proxy network too, which tells you all you need to know.
Yeah, where is the blog post on the residential network?
docker is not a security boundary but a resource boundary.
It is a security boundary but a weak one. Escaping from docker is very hard.
> Escaping from docker is very hard.
You mean a microVM.
A Docker LPE (local privilege escalation) requires a kernel exploit; something like Copyfail would work under Docker but not in a microVM.
Docker does not isolate, consumes more resources and is slower
Or processes. Chrome has builtin process isolation for every browser tab. It starts up darn near instantly, and scores as 'pretty good' as far as sandboxing is concerned.
Startup time probably. They can start firecracker from a snapshot state.
So I've been playing and tweaking for a while with running different browsers in containers. And it took a long while to get working well, but it's doable.
The only issue is scaling, the containers aren't super quick to start (so we keep a spare container ready) and there's plenty of other issues. Also docker isn't really a security boundary so there's issues and concerns there.
Docker doesn't provide any security. You install Docker on your local laptop, and the container you spin up when you execute `docker run` interacts with your laptop's kernel directly. It provides logical isolation between containers but provides zero protection for your host kernel (assuming you decide to install Docker on a remote server instead).
Firecracker provides isolation between the host kernel, on the one hand, and the guest microVM, on the other. So on AWS, you use an Amazon Machine Image (AMI) to specify the OS and other components and libraries installed on an EC2 server such as c5.metal (or, if you're using nested virtualization, a c8i, m8i, or r8i instance at a discount of about 80%-90%, at some performance and other cost), and you bundle Linux along with the Firecracker binary. Then you compile a build artifact including `rootfs` for the Firecracker baked image, which is the microVM image (analogous to a Docker image that results from executing `docker build`). But the microVM process has its own virtual kernel and is a guest on the host machine. So, for instance, you can place Docker inside the microVM; then the container is executing against the microVM kernel, not the host EC2 kernel. Communication is achieved securely between the two using `vsock` and probably something like `socat` so that data travels, say, from guest RAM to host RAM directly to an S3 quarantine bucket, without ever touching the host's kernel or filespace.
can i use these from python script without any ai? i need a small data retriever that wont get blocked, so no same ip every time
But Firecracker is not compatible with GPU for Chrome, is that right?
That means Chrome is slow - quite the tradeoff.
Our browsers beat competitors in performance too. Chrome uses mainly CPU, not GPU
We support GPU via software tho
Headless, so there is no screen to composite and no GPU passed into the VM. Firecracker has no GPU passthrough, so GL work falls back to SwiftShader, the software rasterizer. For automation that is fine. The cost is in layout, JS and network, not raster. It only bites on WebGL or canvas heavy pages.
To do this on non-`.metal` instances, you would need to patch the kernel, right? Is the PVM patch required?
crazy that the maker of chrome(google) and also the owner of a massive amount of cloud services has not made a cloud product identical to this yet
They have IMO: https://web.archive.org/web/20180823072111/https://cloud.goo...
They just don't have access to giant pools of residential IPs, so too many sites end up blocking all the cloud providers by IP range/ASN anyway, even if they could get through a captcha.
Google has a large number of "caching servers" (GGC) located in data centers of residential providers all over the world. They use these servers for a variety of services. Most of the traffic I have seen from them has been for their "URL preview" service.
they kind of do. GCP has their Lambda equivalent, which I believe comes with Chromium preinstalled; it's how major search tools like Jina work. Sure, there's probably something about session management that they neuter to prevent abuse, though.
not google but cloudflare has a similar product - though I am not sure how good it is
>Unikraft does not have good built-in autoscaling, so an engineer had to change a variable, manually adding more instances.
>During a burst in traffic, the system, instead of reacting on its own, required humans to adjust it. This caused problems: one load test brought down production for 45 minutes. So we rebuilt our setup on Firecracker.
It shouldn't need to have autoscaling built in. If the variable is adjustable, why couldn't monitoring happen that sets off a process to adjust the variable when traffic spikes?
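Something like the following is what I have in mind: a reconcile step that a monitoring job runs every few seconds. This is only a sketch. `set_instances` is a hypothetical callback standing in for whatever actually changes the variable, and the headroom percentage and the min/max limits are made-up defaults, not anything from the article.

```python
def desired_instances(active_sessions, sessions_per_instance,
                      headroom_pct=20, min_instances=1, max_instances=100):
    """Instances needed to serve the current load plus some headroom."""
    # Integer ceiling division avoids floating-point edge cases.
    numerator = active_sessions * (100 + headroom_pct)
    denominator = 100 * sessions_per_instance
    needed = -(-numerator // denominator)
    return max(min_instances, min(max_instances, needed))


def reconcile(current_instances, active_sessions, sessions_per_instance,
              set_instances, **limits):
    """One pass of the control loop: compute the target, apply it if it changed."""
    target = desired_instances(active_sessions, sessions_per_instance, **limits)
    if target != current_instances:
        # Hypothetical hook: whatever the engineer did by hand goes here.
        set_instances(target)
    return target
```

Whether this is enough depends on how fast new instances come up; if adding capacity takes minutes, the loop only helps with slow ramps, not bursts.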
The Internet is drowning in bots, everyone who hosts a site or service is paying the price. At least we have companies like this to make the problem worse.
You have to be a bit more restrictive today, yes, but if you weren't already overrun with bots and hacking attempts while hosting a public service many years ago, you probably weren't hosting an even medium-popular website in the first place. Same thing goes today. Slap a rate limit on it and be done with it.
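By "slap a rate limit on it" I mean something as simple as a token bucket per client. A minimal sketch, with the clock passed in explicitly so it is easy to reason about; the class name, the rate, and the capacity are illustrative, and in practice you would keep one bucket per IP or API key:

```python
class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so a new client can burst
        self.updated = now

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` may proceed."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = max(self.updated, now)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A rejected request would typically get a 429, so a once-a-day personal scraper never notices the limit while a hammering one does.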
Yes, externalize the expense of dealing with it to each operator.
Been like that since what, early 2000s?
It... wasn't okay then. And now it's scaled up immensely.
Kubernetes + Kata would have helped with scaling here, I think.
EC2 is more scalable than your wallet. What do you need Kubernetes for?
How many tabs do you use per server?
Really depends on the server specs. The number of tabs depends entirely on memory & CPU availability, not on the infra that runs behind the scenes
But yeah, in one server we can fit hundreds of browsers, or even thousands if we use bigger servers. And each one of them with dozens of tabs, no issue
Just hot stage a bunch of VMs and then there is no startup time. Every time someone finishes, just start another one and leave it running waiting for the next customer.
Browsers can't be reused between customers. They contain sensitive and private data. Everything needs to be isolated and ephemeral.
I never suggested reuse.
Starting the VM itself takes 20ms with Firecracker, the slowest part is starting the browser.
So there's no benefit in reusing the VM but not the browser. VM isolation is also important: customers can leave downloads and other files that should not be accessible to freshly created browsers on that same VM.
I never suggested reuse.
Oh my bad. You mean warm pools then. That works, yes, but you need to maintain that warm pool, which might not be ready if we receive a big burst of demand
Keeping the browser open and warm is also a problem, since not all customers require the same features. The same engineering required to fix that (modifying values with Chromium open) also fixes the post-Chromium snapshot
VM takes 20ms to start, browser around 300ms. A post-Chromium snapshot is at 50ms end-to-end, which defeats the benefits of the warm pool you suggest. That will be our next step.
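To make the trade-off concrete, here is a rough sketch of the warm-pool pattern being discussed: hand out pre-started VMs while any are idle, and fall back to a cold start when a burst drains the pool. `WarmPool` and `start_vm` are hypothetical names, not anything from the article:

```python
from collections import deque
from typing import Any, Callable, Tuple


class WarmPool:
    """Keep up to `target_size` pre-started VMs ready; never reuse one."""

    def __init__(self, start_vm: Callable[[], Any], target_size: int):
        self._start_vm = start_vm      # hypothetical: boots VM + browser
        self._target_size = target_size
        self._idle = deque()
        self.refill()

    def refill(self) -> int:
        """Top the pool back up; returns how many VMs were started."""
        started = 0
        while len(self._idle) < self._target_size:
            self._idle.append(self._start_vm())
            started += 1
        return started

    def acquire(self) -> Tuple[Any, bool]:
        """Return (vm, was_warm). A drained pool means a cold start."""
        if self._idle:
            return self._idle.popleft(), True
        return self._start_vm(), False
```

Each VM is handed out once and then discarded by the caller, so nothing is shared between customers. The cold path in `acquire` is exactly the burst case mentioned above, and `target_size` is what you pay for in idle resources.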
if people want custom features, then of course there is a cost to that. but if the majority of your customers are running on defaults, then there is a benefit. yes, it creates other issues, such as pool management, and if you do that wrong and you can't predict capacity well enough, then people get your "slow" path. but, overall, my experience is that the warm pools are extremely well regarded and not something that most people think of.
> case in point, it took me saying the same thing twice, for you to catch on
Quit being a dick, Jon. It doesn't suit your stature.
At any rate, warm pools aren't cost free. If you overestimate demand, you'll waste too much money on idle resources.
Sure, you're right. I edited to remove that bit. Thanks for calling me out. I was getting frustrated for having felt like I was extremely clear in what I wrote and the person kept repeating something that I had clarified.
> At any rate, warm pools aren't cost free. If you overestimate demand, you'll waste too much money on idle resources.
Depends on how you're running your business. If it is your hardware, it isn't much of an expense relative to the benefit of having a product that makes your customers happy.
It depends on how you think about spare capacity on your hardware. It's an opportunity cost. Every idle cycle and bit of unallocated memory could be spent doing something else valuable. Consider the most extreme example of GPUs: CoreWeave was founded by repurposing nearly-useless GPUs their founders had on hand to mine Bitcoin (correction: Ethereum) that was obsoleted by ASICs and other specialized mining hardware (correction: transition to proof of stake).
Besides, they did say they were running on EC2, which charges by the second.
Your CW analogy is wonky since that isn't how it went down. You know my name (I don't know yours. edit: Michael), so you should know a bit more about my history in the space too, right? I can explain it, but I'm afraid of either being called names, or of it just not being worth it to you (or me for that matter).
EC2 has preemptable and reserved pricing. It is possible to build autosizing solutions, this is what Google did with AppEngine and later GCP Functions/Cloud Run. Just like optimizing start times, it is also possible to optimize those idle resources. For me, I'd go with the idle resources as the lower hanging fruit over trying to shave ms off making things available on-demand, since it affects the customer experience first.
You're right to call me out on the facts about CW as the story's a bit different from what I remembered from reading an article about them a few months back (https://www.wired.com/story/coreweave-scrappy-cryptominer-mu...).
That said, I think you get the point about idle capital being waste. EC2 Spot Instances are a great example about how to turn that waste into revenue.
Following up like this--twice--is douchebaggy behavior. If someone misunderstands what you're saying, try elaborating or rephrasing instead. Help make the conversation more productive and less combative.
I love that they start with no core pinning, then switch over to having cores pinned.
This could be a bit of a tricky one, but I'd expect Checkpoint Restore In Userspace eventually tackles a lot of this. An image of a running Chromium process on a tmpfs (in-memory filesystem) that can just be launched endlessly tackles the memory slowdown problem, eliminates conventional startup costs. This feels like an ideal CRIU use case.
I imagine there's a lot of things Chrome needs to run though, bits of state to save/restore.
Assuming CRIU can checkpoint and restore Chrome, and especially recent versions of Chrome, just fine is a bit of a stretch.
How do you handle browser sessions?
We persist profiles to maintain sessions if needed. This includes cookies, session storage and everything needed to keep your account logged in
> Plain headless Chromium is easy to detect by websites with anti-bot measures.
what a disgusting business model. they are central to the main bot problem of our time. anyone with morals would expect those systems to respect robots.txt and announce themselves via user agent strings.
disgusting and i hope they crash and burn. i will actively spend time today looking for open source projects detecting their browsers and start contributing.
I'm super interested in that, please let me know if you manage to detect their browsers!
fancy terms aside... they likely just run alpine linux.
“click this button, type this text, read this page, take this screenshot.”
You left in the AI’s instructions. lol
Interesting read though, thanks
well that's how browser agents work in a nutshell lol