In the last year, I have bought an M3 Ultra Mac Studio with 512 GB, a Macbook Pro M5 MAX with 128 GB and an RTX 6000 Pro. I have spent around $25k so far, not including electricity. I figured worst case scenario I can sell them in the next year and only take a haircut as opposed to losing my entire investment.
Compared to just paying for tokens, the tokens would have been much cheaper and much, much faster. I've been running Gemma4:31b and Qwen3.5 and 3.6, getting local LLMs to solve AMC 8/10 math questions, and it's about 10-100x slower than just doing it online. When I tried it with ChatGPT late last year, it took about one night and $25 to solve about 1000 questions. Using my RTX 6000 and M3 Ultra, with Gemma4:31b on both, I got answers to about 40 questions in 7 hours while drawing around 800 watts (600 for the RTX and 200 for the M3 Ultra), and I haven't checked how good the answers are yet.
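For a rough sense of scale, the figures quoted above can be turned into per-question numbers. This is a back-of-the-envelope sketch, not a measurement: the electricity price (`PRICE_PER_KWH`) is an assumed placeholder that does not come from the post, and hardware cost is deliberately left out.

```python
# Back-of-the-envelope numbers from the run described above.
# PRICE_PER_KWH is an assumption, not a figure from the post.
PRICE_PER_KWH = 0.15  # USD, hypothetical residential rate

local_watts = 600 + 200   # RTX 6000 + M3 Ultra, as quoted
local_hours = 7
local_questions = 40

api_cost_usd = 25         # ChatGPT run, as quoted
api_questions = 1000

energy_kwh = local_watts * local_hours / 1000
minutes_per_question = local_hours * 60 / local_questions
local_power_cost_per_q = energy_kwh * PRICE_PER_KWH / local_questions
api_cost_per_q = api_cost_usd / api_questions

print(f"local energy used:       {energy_kwh:.1f} kWh")
print(f"local time per question: {minutes_per_question:.1f} min")
print(f"local power $/question:  ${local_power_cost_per_q:.3f}")
print(f"API $/question:          ${api_cost_per_q:.3f}")
```

With these inputs the local run comes to 5.6 kWh and about 10.5 minutes per question. The electricity cost per question depends entirely on the assumed rate, so swap in your own.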
At the very least I'm going to try to sell my M3 Ultra if I can find a reliable place to sell it without getting ripped off by scammers.
I’m not usually one to ask this because learning to do a thing can be fun, but why exactly have you spent 25 thousand dollars on getting an LLM someone else made to answer maths exam questions?
It's just a project I'm working on. I'm working on projects where AIs are processing and classifying large amounts of data that would be a lot of work for humans to do.
I think of LLMs as being well equipped for handling dynamic data or adapting to unforeseen circumstances (random code requests, websites' ever-changing layouts, typos, non-standard formatting in docs, grokking out important info, etc.), but math problems are by definition a very specific set of instructions to run, so is the overhead and "thinking" aspect of an LLM/AI even needed here? I'm genuinely curious, btw, I'm not asking sarcastically. Can't these math problems just be yanked from some test file and rapid fired directly at a gpu/compute unit?
> Can't these math problems just be yanked from some test file and rapid fired directly at a gpu/compute unit?
Yes this is exactly what I'm doing. I isolated the actual math question, and then sent it to my two servers to process and that's what's taking 10m+ to return. I'm asking them to solve the question and return the full answer along with their steps. I care about correctness so taking time is okay but I can't use 10m per solution.
Nono, parent was asking “They’re bad and inefficient at that, so why have an LLM do math? Why not just use some code and the CPU/GPU that’s already good and efficient at basic math?”
Privacy and offline operation are valuable or non-negotiable in some cases, but the difference is pretty categorical between what can run on a single card and what can run on a DGX GB200 NVL72 cabinet. Doesn't mean it's not worth seeing how far local models can be pushed. Not every problem needs a senior engineer.
I know it's one of those "if you have to ask" situations, but curiosity got the better of me. Here's the search assist response:
"The DGX GB200 NVL72 AI server costs approximately $3 million per unit. This system includes 72 Blackwell GPUs and 36 Grace CPUs, making it one of the most powerful AI servers available."
That $25k spend by GGGP seems like nothing in comparison. That's ~1/3 of one chip in that cabinet. Good gawd I'm old and out of touch with modern AI data centers.
It's The Circle of Computing Life. The pendulum swings between centralised mainframe timesharing-for-hire and desktop individuality.
We've been in a centralised phase for longer than usual - first cloud everything, then AI - but at some point in the next decade prices will crash and a market will appear for personal, local intelligence.
By comparison, the Colossus 1 data center had 32,000 GB200s (as well as 150,000 H100 GPUs, 50,000 H200 GPUs), and they are bringing another 110,000 GB200s online (although this might be Colossus 2?)
There are bigger data centers than Colossus 1 around too.
There is a reason NVidia is the most valuable company on the planet.
> the difference is pretty categorical between what can run on a single card and what can run on a DGX GB200 NVL72 cabinet.
A better way of putting it is that you can run plenty of things on a single ordinary system, but you may be disappointed at the performance. Generally, you can't expect inference to be as quick as with cloud for SOTA-like models. You have to run smaller models for quick replies, and large models with a lot of real-world knowledge for less time-critical inference, possibly batching many requests simultaneously to improve throughput.
The cost is obviously not as big of a factor for OP as it might be for others. It's actually refreshing to hear the candid viewpoint that he expresses here.
25k is definitely a lot, but I did the risk analysis and figured worst case I would lose $1,000-2,000 after a year of playing around with it, so I look at it more like renting (I'm going to keep the MacBook Pro no matter what since I needed a new one).
Nitpicking, but the worst case of spending $25k is unforeseen circumstances that write off the entire asset. I don’t think -$2000 is a conservative enough figure for standard depreciation either (a lot can happen in a year)
Either I don't understand the used Apple market... or I agree this is crazy. Someone spends $25k on new hardware, waits a year, and expects to sell it for $23k? Unless the RAM issues save him, and the cost of new goes up, I don't see how that was going to work.
This is the case in lots of markets, e.g. look at used cars, luxury goods and more. Some of it is driven by inflation/the rapid devaluation of the dollar. General and AI-adjacent compute in particular hasn't come down in price in a long while.
Oh young grasshopper, I see you don't know that money launderers love the eBay hype cycle. It's REALLY common on high-dollar hot items to have phantom transactions where the same parties are on both sides of the transaction to clean illicit money. The high price tag and high volume of transactions hide the illicit signal. I have tried to buy a few of these Mac Studios only to have the transaction cancelled because I wasn't the dirty money on the other side.
It's still very contrarian to expect GPUs won't depreciate rapidly. Yes 3090s were a good investment then, but way worse than just buying Nvidia stock directly
How did we go from "I expect to lose only 1000-2000 if I try to sell my used equipment" to "you should have just bought NVDA to get a better return"? The point wasn't the better return; the point is that I wouldn't lose all my initial investment if I decided I wanted to sell it.
And the fact of the matter is that in 2026, all electronics have gone up, not down, and sought-after GPUs have gone up in price in the used market.
A lot of expensive things hold their value well. I have a friend who is really into telescopes and he now owns a $100,000 telescope but he didn't directly buy such an expensive scope. He started out with much cheaper ones and was able to sell them for about what he bought them for to help fund more expensive ones over 20 years. It is really interesting.
Apple products have had relatively high resale for a while. Only losing 8% in a year is probably extra unusual, and 1-year-old wasn't really ever the sweet spot, but a "sell used privately after a few years, roll onto the new one" has been a relatively common play.
Doing this particular one is definitely expecting the market squeeze to continue. "Worst case" is back to more "normal" depreciation. Where I'd expect to only be able to recoup more like 18k. But... if you look at GPU prices the last 3 years... it's not a crazy assumption that it won't drop that fast.
iPhone example, since those are easiest to find in quantity: new iPhone 16 Pro Max for $1200, Gazelle would want $866 for "excellent" condition. Lost ~28% for one-model-back. iPhone 15 Pro Max, though: excellent priced at $667 here, only down another 23%, and gives you basically a half-priced upgrade if you can sell it for that and roll into the newest.
So as a rough estimate at today's value-holding, to never be more than one model old you'd be out $3600 for three new phones, getting $1732 of that back, for a net $1868 (with a $334-per-year incremental cost of upgrade).
For never-more-than-two-models-back you'd be out $2400, getting back $866, for a net $1534 spend, with a $167 incremental per-year upgrade cost once you buy the first one. Pretty good if you keep the phone in excellent condition and are happy to budget a bit over $10/month to be on an every-two-year upgrade train.
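The arithmetic in this comment can be checked with a few lines. This is only a sketch that reuses the prices quoted above; the helper names `pct_drop` and `net_spend` are made up for illustration.

```python
# Prices quoted in the comment above (USD).
NEW_PRICE = 1200        # new iPhone 16 Pro Max
RESALE_ONE_BACK = 866   # "excellent" one-model-old price
RESALE_TWO_BACK = 667   # "excellent" two-model-old price

def pct_drop(before: float, after: float) -> float:
    """Percentage of value lost going from `before` to `after`."""
    return (before - after) / before * 100

def net_spend(phones_bought: int, phones_sold: int, resale: float) -> float:
    """Total outlay on new phones minus resale proceeds."""
    return phones_bought * NEW_PRICE - phones_sold * resale

print(round(pct_drop(NEW_PRICE, RESALE_ONE_BACK), 1))        # drop after one model
print(round(pct_drop(RESALE_ONE_BACK, RESALE_TWO_BACK), 1))  # further drop after two
print(net_spend(3, 2, RESALE_ONE_BACK))   # three phones, two resold (yearly upgrades)
print(net_spend(2, 1, RESALE_ONE_BACK))   # two phones, one resold, as in the comment
print(NEW_PRICE - RESALE_ONE_BACK)        # incremental cost of each yearly upgrade
```

The two-year figure follows the comment in using the $866 resale price; passing `RESALE_TWO_BACK` instead would model selling a phone that is two models old.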
What you describe is something people with enough to make the first purchase and eat the cost when it breaks have been doing for years e.g. with cars. People on the lower end of money scale tend to use products for well over their economic lifetime saving way more and buying a cheap replacement if it breaks. Notable exception as stated being phones for some reason as it likely is a status symbol for more people in a [insert preferred external sexual characteristic] measuring contest.
> I don’t think -$2000 is a conservative enough figure for standard depreciation either (a lot can happen in a year)
We aren't exactly in "standard" times and haven't been for quite a while. Even five year old graphics cards are worth more today than they were just a year ago. Things will obviously depreciate at some point, but you gotta throw your existing notions of how quickly and how much hardware will depreciate out the window. There's just been too much money dumped into AI for a "well I guess this won't ever pan out, let's dump all this hardware to recoup our costs" moment to happen and tank the price of everything suddenly IMO.
And that's not even getting into the other geopolitical stuff going on right now. Strange times.
Sure, they took a gamble that they wouldn't be able to sell it used.
If you are able to tie up $25k for a few years just for shiggles, you clearly are able to make do fine without that money and if lost it would be at worst annoying, not catastrophic.
I wouldn't call this nitpicking. This is how people who are careful with money think. I learned embarrassingly late to stop justifying purchases by making predictions about future returns. I treat everything as having zero value as soon as I purchase it. Thinking otherwise is, for me, always a dangerous rationalization -- always a craving that's trying to outmaneuver sense.
I am (clearly) not as far down the rabbit hole as the commenter you're replying to, but almost certainly not. Streaming 4K Blu-ray is on the order of ~100Mb/s, which means that on a LAN, bog-standard gigabit ethernet and associated networking hardware would be more than sufficient.
This is taking a hobby to its extremes, in much the same way that a $5k boat and $500k boat let you catch the same fish.
One year ago finetuned local LLMs had a significant edge over ChatGPT or Claude. Look up on YouTube all the DIY videos testing LLMs on their own machines with different setups.
Remember: one year turned out to be a gigantic leap in quality of results and innovation in the AI space. Agents weren't really a thing, and vibe coding hadn't even been coined as a term, because the top-notch tools at the time were lousy, with Lovable being the frontrunner with its (in my view) sorry Tailwind recombination tool shaming AI to do the work.
Then fall 2025 hit us, then New Year's Eve, and suddenly there was a massive surge of innovation and competition, with ChatGPT Codex suddenly showing up.
Remember: one year ago many now commonly used tools, like Nano Banana or Codex, weren't yet available.
"The 25k are so vast" - Yes, and no. For example, if the machine is bought for business use, I can deduct the cost from taxes. This amounts to roughly 50% of the financial burden.
So I jokingly say that I pay only half the price for my Apple business machines. And yes, I am strict in this regard. Business means business. No private emails etc., nothing on my company computers.
Maybe there are other options as well to reduce the financial expenses the dude mentions, but it doesn't seem so.
I would also go for leasing; this way the monthly payments can already be deducted and I don't need to buy and maybe resell the machine.
Apple is a luxury good. Without business usage or at least partly using it for business as well as private (mixed usage in tax reports) I wouldn't buy the devices or think twice.
Apple under Cook evolved into a Gucci-like luxury brand that is more and more a rip-off than quality delivered, especially considering the latest OS updates for Mac, iOS and iPad. Apple is a mess, happily following in Microsoft Windows' footsteps, because the CEO is, as has been correctly assessed, no product guy.
But I stop with my rant here.
Always try to use tax deduction as leverage for your computer expenses. Every citizen should invest in basic knowledge about that.
Even a 10-20% professional usage for work (mixed usage) gives you a noticeable advantage over normal pay.
I didn't spend that much, only $6500 AUD for a GB10-based Asus GX10, which is even slower than OP's, but I spent that because it makes for a great learning platform. There's not much else that lets me fiddle with 128GB of RAM for my graphics processor, and it's quite lovely to be able to run things as long as I like without worrying about my cloud instance being shut down.
It's not financially a good idea: renting really does beat owning, and cloud beats both if you're only running inference on these machines. But I'm not just doing inference, and as a thing I can do silly stuff on to learn, it's hard to beat!
Spot pricing and instance availability don’t apply to on metal hosting. You’d have your own machine dedicated to your own use only, at a locked in price.
Because buying Macs is not about performance, it's about feeling like you are rich.
That money could have been spent on way more bang/buck performance in the form of a set of 4 graphics cards.
Also, I would probably put the odds at 70:30 that Apple marketing is astroturfing on HN, given the number of posts about running LLMs on MacBooks, because in reality the inference speed of any decent LLM is unusable on a MacBook despite the ability to fit it into RAM.
Excuse me for this comment, really, but I can't comprehend the absurdity, some people are buying GPUs when other people have no money for insulin so they literally die. I don't mean anything towards op or gp, quite the opposite I'm truly happy they have this kind of freedom, it must feel really nice, I just hate this game so much.
This is making me feel a lot better about my plan to lease a $25k EV simply because it's available at a massive discount. I'll probably end up using less electricity, too.
If you don't need cash right away, I'd wait until the M5 Ultra comes out and see how things shape up. There have been some early efforts aimed at combining the prefill performance of a GPU with the high throughput achievable with the Mac's unified memory architecture (see various YouTube videos by Ziskind and others, as well as https://old.reddit.com/r/LocalLLM/comments/1r6drpi/exo_clust... ).
Point being, once the M5 Ultra is available, I suspect a lot of people will get very serious about making Macs work with RTX GPUs because that will yield an inference platform with a good bang:buck ratio. If so, you may find that your existing hardware is more powerful than it seems today. And it may be a lot more expensive to replace later if you sell it now.
I've seen a lot of sales on eBay for over $20k, but I don't know if I believe it. Plus the lack of seller protection and the prevalence of scams on eBay make me too hesitant to actually want to risk it so I don't know what to do haha
Haha, yeah, it's about $23k or so. Should be twice what you paid for it if you got it last year. Tbh I don't know why. The RAM is large but the bandwidth and the compute aren't nearly enough. You can fit DeepSeek V3 on it quantized but inference is like 10 tok/s. Honestly, you'll be able to sell it locally for that in cash, and I would in your place.
I saw your heat comments about the RTX 6000 Pro as well. I bought a few of them recently and I'm running 2 of them in a 2U case in a colo. You need a lot of active airflow to keep them cool. Mine range from 23 C to 80 C.
RTX 6000 is somewhat obviously my fastest card, but my biggest problem with the RTX 6000 is the immense heat. The GPU itself is almost 200F and the exhaust from the fans is over 150F. I'm worried that my hard drives are going to fail. I was told that the GDDR7 is even hotter than the GPU, which is surprising to me.
After my last run, I'm going to wait for the new case I ordered to come in and cannibalize my kid's PC that we built beginning of this year to form an entirely separate computer. And then figure out better ways to deal with the heat, especially with summer coming up. I'll have to play around with undervolting and running vents directly outside my house to see if that helps.
From my failed and expensive affair with GPU mining 5 years ago: you can get great heat dissipation by using an open case with a lot of directed fans, at the expense of a bit of dust and lots of noise.
That's about what my OC'd and watercooled 4090 runs at. The cards are designed for it. Only problem I have is when sitting next to the computer under load -- I either have to open windows or blast the AC. Too bad I don't live in a cold climate -- that 60c heat output would come in handy :)
I've always thought about doing something like this in the Midwest US, but was always a bit nervous about condensation damaging the components over time; did you run with that sort of setup consistently, or only when pushing high scores? Ever run into issues with components failing?
You'll probably make a profit by selling them today. I bought an M1 Max Studio with 64 GB last year off FB Marketplace for $1000 and today I'm seeing numerous 32 GB M1 Maxes for $1200-1500.
Yes the prices on eBay for the Mac Studio are all over the place, but I've seen sales for over $20k. I don't know if I believe it but there's enough to make me think if I can sell it for that price it would be worth it, but eBay has basically no seller protection so I'm not willing to take that chance.
I looked into the M3 Ultra 512GB Mac Studio before it was discontinued and, as best as I could determine, it just wasn't worth it... yet. The GFLOPS and memory bandwidth just aren't there even though it can hold a much larger model in memory.
But the trend here is interesting. I think by 2030 you'll be able to buy, fairly cheaply, hardware that is currently $10k+. I don't know what this does to the trillions invested in AI data centers, because the next NVidia architecture after Blackwell will essentially halve the value of purchased cards overnight.
I'm not convinced Apple has yet pivoted the Mac Studio line towards this market, and the expected M5 Ultras in Q3 2026 will likely be an incremental improvement rather than a big leap forward, but I'd like to be proven wrong.
I agree that all these datacenter companies like Coreweave are investing billions in technology that has a very fast depreciation curve, and I don't know how they will sustain income. The same goes for datacenters in space: what happens when those chips are obsolete? Will they send astronauts to replace them, or will they let them burn up and send new ones into orbit every year?
I feel that the open weight models pale in comparison to the frontier models, and I believe that if the gap closes quickly, that the open weight vendors will stop releasing it for free.
I'm not really asking this from the perspective of whether I should buy hardware. I'm trying to understand the economics.
The AI space is moving so fast that it is hard to know which conclusions are stable. After all the discussion around local models, is the practical conclusion still that API/frontier providers have a huge structural advantage because of datacenter hardware, high utilization, batching, optimized inference stacks, and perhaps strategic pricing?
In a comparison like this, a $25k local setup versus buying tokens, what multiple are we really talking about? 10x? 100x? Or is it too workload-dependent to reduce to a single number?
Has someone written a good breakdown that separates true infrastructure efficiency from temporary underpricing/subsidy? The part I'm trying to understand is less ideological (local vs. cloud) and more basic economics.
The speed of results for an API call to ChatGPT is 10-100x faster than my local LLM. I haven't exactly quantified the results but I was getting results in a few seconds vs 10+ minutes for my local LLM. I'm going to do a deep dive this weekend and try to get better results, but it was staggering. I'll also do a deep dive on how to optimize my setup and see if I can get things to perform much quicker.
An M3 Ultra Mac Studio can run models that do not fit in similarly priced computers with multiple Nvidia GPUs. And it will use a lot less electricity while still having good enough performance, except for prefill performance, which is quite poor on the M3.
If you buy a Mac, get at least 256GB of RAM; otherwise just buy a bunch of Nvidia cards. It really does not make sense otherwise if you are looking for performance / $. The Mac (Studio) is unique in that it has more RAM than the alternatives (i.e., consumer Nvidia cards or Spark stuff), so it can fit bigger models, but otherwise its performance is worse.
>> find a reliable place to sell it without getting ripped off by scammers.
This is a real problem and why I've just about given up on ebay or fb marketplace, esp for computers. If you are in Canada though sellit9.com is a great solution to having to deal with sketchy buyers.
If you're in a decent sized city, you should be able to find a local buyer on Craigslist or FB Marketplace... Beyond that, for higher value, smaller items like your M3 Ultra, I would talk to your local police department and/or library to see if you can do the exchange there. Larger libraries usually have a police officer on site or nearby, and the PD office near you may also provide a "safe" exchange location... I'd bring a monitor/keyboard/mouse so you can demonstrate the system working properly.
YMMV but between your nearest PD office and Library, you should be able to use one or the other for your exchange of goods/money. The biggest thing I've sold is a mid-range video card during late covid (I managed to get a better one via newegg shuffle) so I sold the old one (RX 5700XT -> RTX 2080) to make up the difference a bit. I just did the exchange at the Starbucks near me for that.
You don't have to... but it's a matter of a safe location for both parties. If it was more expensive, I'd probably work through a broker (like a car or house).
The buyer doesn't know who the seller is, and vice-versa... the level of trust you can bear depends on how much you're willing to lose. My advice is only in that there are safe venues you can use to make such an exchange.
Not really. Every country has a nonzero number of criminals. It's entirely a matter of the risk/reward tradeoff. A small consumer item over $10k is well into dangerous territory.
Are we talking about a cash transaction? If so >$10k is dangerous as the police may want to steal it themselves.
If it is an electronic payment, I'm not sure how completing the transaction in front of a police station will help any. Well, it will help the buyer to see it working, but the seller gets no additional protection besides seeing "a person."
Why would the seller worry about ... himself? The seller worries that the buyer might not have any intention to pay for the small, expensive, easy to fence item to begin with. Conversely, the buyer worries that the seller might not have brought an item at all.
> If it is an electronic payment, I'm not sure how completing the transaction in front of a police station will help any.
That's not the point of going to the "police safe exchange zone".
The point is to hopefully prevent the possibility of the buyer showing up with a .38 in hand, and demanding to be given the easy to fence "item" unless the seller wants to get a .38 slug embedded in their gut.
The risk of a "hold up" increases with dollar value and with items that are easier to fence.
I got an RTX 6000 pro too. I like running locally, I've learned a lot more than if I had used an API and there's less worry about overspending tokens. I accidentally spent $100 on claude api in like 2 days because I didn't know what I was doing.
The problem is that while one of these GPUs is a huge improvement over a laptop or a single 3090, you very quickly wish you had more. I would buy a second one, but I did the math and realized that with the current crop of models, 2 Blackwells don't buy me any new capability that I didn't have with one. So I would need a 3rd one. And when I buy a 3rd one I will feel like I want to run a higher quant, so then I will want a 4th.
A pair of RTX6000 cards will give you a good performance boost due to tensor parallelism, though. I haven't tried the newest predictive quants but I see about 35 tps when running the 8-bit Qwen 3.6 27B model on one board and about 50 tps on two. Probably could come close to 100 tps on an optimized setup with the latest GGUFs.
Also, the 4-bit quants of MiniMax 2.7 will run at 100 tps or so with two cards, which is pretty decent. It doesn't go any faster at all with 4 GPUs from what I've seen, so if you don't actively need 384 GB of VRAM, 2x RTX6000 is a good place to be.
This is, sadly, obvious and inevitable in retrospect.
The two major drivers of inference costs are GPUs and electricity. You can't get cheaper GPUs, but you can make existing GPUs not sit idle, and you do that by utilizing them 24/7, processing user B's request when user A is thinking, and handling many requests in parallel, neither of which you can do as an individual. You can get cheaper electricity... by moving, and it's much easier to move your AI workload than to move yourself.
This is a completely different dynamic than renting houses or apartments, as you can't really rent out the same house to different people at different times of day.
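The utilization argument can be made concrete with a toy cost model. This is a simplified sketch under stated assumptions, not anyone's real pricing: the function name and all parameters are hypothetical, hardware is amortized over its whole wall-clock lifetime whether busy or idle, and electricity is only charged for busy hours.

```python
def cost_per_million_tokens(
    hardware_usd: float,
    lifetime_hours: float,
    watts: float,
    usd_per_kwh: float,
    tokens_per_second: float,
    utilization: float,
) -> float:
    """Amortized hardware plus electricity cost per 1M generated tokens.

    Assumptions (illustrative only): the hardware cost is spread over every
    token produced during its lifetime, and power is drawn only while busy.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    busy_hours = lifetime_hours * utilization
    tokens = tokens_per_second * 3600 * busy_hours
    power_cost = watts / 1000 * busy_hours * usd_per_kwh
    return (hardware_usd + power_cost) / tokens * 1_000_000


# Hypothetical inputs: same box, used 5% of the time vs. 90% of the time.
three_years = 3 * 365 * 24
for util in (0.05, 0.90):
    usd = cost_per_million_tokens(10_000, three_years, 600, 0.15, 50, util)
    print(f"utilization {util:.0%}: ${usd:.2f} per 1M tokens")
```

In this model the hardware term shrinks in proportion to utilization, which is the point the comment makes about keeping GPUs busy around the clock. Batching, which raises aggregate tokens per second, would lower the figure further and is not modeled here.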
Yea. LLM inference requires batch processing to have a shred of hope at being cost efficient. Batch processing requires a not so insignificant amount of scale (but probably not as much as people think).
I'm very pro local models, but not to have parity with SoTA frontier models. Just contextually trained small models doing smaller specific tasks.
Trying to run bigger LLMs for an individual user to do big tasks is not going to be a good time.
It is unfortunately still common practice among irregular agricultural workers in many parts of the world (I’m Italian so I definitely remember news about busts in southern Italy)
You can definitely run many requests in parallel as a single user, you just have to be OK with a significant slowdown for any single request. Cloud inference can't reach that ratio of total throughput per hardware cost since they are heavily incented to get the most expensive hardware available and to then minimize latency (and RAM occupation over time) even at the cost of throughput. Running slower inference with cheaper hardware is just not workable in a cloud setting.
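As a minimal sketch of what "many requests in parallel as a single user" can look like on the client side: the `ask_model` function below is a hypothetical stand-in for whatever call reaches your local inference server, and the sleep only simulates latency. Whether the server actually batches concurrent requests depends on the serving software, which this sketch does not assume anything about.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def ask_model(question: str) -> str:
    """Hypothetical stand-in for a request to a local inference server.

    Replace the body with a real client call; the sleep simulates latency.
    """
    time.sleep(0.1)
    return f"answer to: {question}"


def solve_all(questions: list[str], max_parallel: int = 8) -> list[str]:
    """Keep up to `max_parallel` requests in flight; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(ask_model, questions))


if __name__ == "__main__":
    qs = [f"question {i}" for i in range(16)]
    start = time.perf_counter()
    answers = solve_all(qs)
    elapsed = time.perf_counter() - start
    print(f"{len(answers)} answers in {elapsed:.2f} s")
```

Each individual request may take longer when the server is shared this way, but total questions answered per hour is the number that matters for a bulk job.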
You got numbers? Because it seems perfectly possible to me. OpenAI and Anthropic’s marginal cost for inference is certainly far less than their API pricing.
Because I’ve looked at what it would cost my company to self-host a SOTA sized model. For us it wasn’t worth it because the hardware is all bought up by frontier labs and we can’t get any supply. But if we could, at the prices they’re paying, it would pay for itself in 10-ish months. I assume further that they have economies of scale on top of what I was estimating.
Everything there is extremely speculative and I don't see anything that contradicts that inference itself could be profitable at massive scale. See https://youtu.be/xmkSf5IS-zw for example.
Whether the companies as a whole are destined to be profitable, or worth their valuations, is a very different question. The only people who can truly answer that have time machines.
Yes, once you have modeled the problem correctly and you know all the input parameters. This is not that: Session# * tps * 86400 (secs in a day) * 30 days.
I don't think there is enough public information to check Anthropic's claims regarding inference profitability. It depends not just on unknown technical factors but also on agreements they have with other companies.
I agree that we don't know how expensive SOTA is. But yes, my math should give you the max amount of tokens you can sell per month, and it's not remotely profitable for most of the larger open source models (at their current pricing). I'm not sure why a 10x larger model that is more in demand would be profitable when it's only 5x the price.
It's possible you could pay off hardware for Kimi 2.6 after maybe 2-3 yrs (by providing low tps / high concurrency) but you're now out of warranty and have been running your machines full throttle 24/7 for 2-3 years.
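The upper-bound formula from this exchange (sessions × tps × 86400 × 30) and the payback estimate built on it can be sketched as follows. The function names, the price per million tokens, and the optional monthly operating cost are hypothetical inputs for illustration, and the model assumes 100% utilization, which is exactly the simplification the reply above objects to.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30


def max_tokens_per_month(sessions: int, tps_per_session: float) -> float:
    """Upper bound from the thread: sessions * tps * 86400 * 30."""
    return sessions * tps_per_session * SECONDS_PER_DAY * DAYS_PER_MONTH


def payback_months(
    hardware_usd: float,
    sessions: int,
    tps_per_session: float,
    usd_per_million_tokens: float,
    monthly_opex_usd: float = 0.0,
) -> float:
    """Months to recoup the hardware if every token is sold; inf if never."""
    tokens = max_tokens_per_month(sessions, tps_per_session)
    revenue = tokens / 1_000_000 * usd_per_million_tokens
    margin = revenue - monthly_opex_usd
    return float("inf") if margin <= 0 else hardware_usd / margin


# One session at 1 token/s for a 30-day month.
print(max_tokens_per_month(1, 1))  # 2592000
```

Because real utilization is below 100% and prices move, any payback figure from this model is a best case rather than a forecast.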
This is why Moonshot attempted to double the price when they released 2.6, but then it got driven down by North American capital subsidies.
We should specify which subscription plan we are talking about. You seem to be talking about the Anthropic Claude Max plan. I think the consensus is that these flat-rate subscriptions are loss leaders, as they come with T&C restrictions on how you can use the API, namely only with Claude Code et al. They are meant to hook developers into their products.
Shouldn't we compare the API pricing, where we pay per token? The whole point of local inference is that we don't have any restrictions regarding product use or time limits, so it would only be fair if we compare it to a plan that offers the same. And even that is only a first approximation, because the commercial models are usually much more capable than the open weight models.
> I could easily say that everyone who says it's profitable is making unsubstantiated claims lol.
And people who don't understand the difference between capex and opex are making uneducated claims. It's not basic math.
Running an inference data center is a mix of variable and fixed costs. The fixed costs are currently in the billions of dollars for pretty much any investment in this space. Many of those fixed costs have (currently) unknown refresh cycles. So, unless you have access to the financial books of these companies, it's currently just speculation whether inference is profitable.
To some degree I think there's a hope that it becomes like a gym membership. If everybody used their membership, the gym would be too crowded. It's all of those memberships that people feel like they need to have but don't use where the extra profit comes in.
As long as the power users are paying per token, everything is good.
Really? This is what we expect from this amazing world changing technology? People will sign up for it and not use it? Good business plan, how can I invest? /s
"operationally" implies that capex (which I would assume includes datacenters, gpus, and r&d) is not in. So the big news is that they can now pay for electricity and sysadmin.
High usage seems to change the economics. The author of the article had a payback period of about 14 months which is excellent by any standards and an order of magnitude better than rent vs buy for a house in most places.
Yep, the great theoretical promise of local models remains theoretical, no matter how much die-hard engineers want to push it... Who would have thought, right?
I don't think this changes the final conclusion - but have you considered calculating against depreciation -- i.e. figuring out how much your M3 ultra is worth today, and only charging yourself for the delta? In my mind you might even have made money on the hardware.
I administer a simple AI server in the office, which just uses a single RTX 5090 but is able to serve ~80 people throughout the day. I'm impressed by Qwen3.6-27b's capabilities in agentic coding/tasks so far. Devs say it's not much different from Sonnet 4.6 on many tasks (sometimes it even outperformed it), 40-60 tok/sec, up to 260k context. The server cost about $10k with all the bells and whistles.
I spent a lot of time researching/adding/benchmarking many custom modifications to the software stack and its settings to make the server optimally handle the load with just 1 RTX 5090 without losing quality, but it's still not enough, and the wait times in the queue are getting longer. We're at the limits of the hardware, and I'm out of tricks.
The experiment was kind of a success, and the CTO agrees we should scale it. With our own infra, we could run agents 24/7 on everything. Currently, a lot of use cases for the cloud providers are completely blocked by PII/trade secret concerns (our infosec department doesn't buy the "zero retention" promise), plus you don't have to think about billing/budgets/etc. anymore.
Now I can't decide how to scale it. On one hand, I'd like to run larger models. And we have the budget to buy, say, 8xH200. But in many benchmarks, the larger models that do fit in 8xH200 comfortably and can serve many parallel requests with acceptable speed/quality don't seem to outperform Qwen3.6 that much in agentic coding/tasks to justify the price.
So another option is just to buy a bunch of RTX 6000s and scale horizontally instead: run a copy of a midrange LLM like Qwen3.6 on each GPU. It's cheaper and easier to scale/replace, but then we'll run into problems running larger models in the future if we have to, because of no NVLink support (say, if Alibaba & Co. stop releasing ~30b models and/or ~30b models start falling behind 400b+ models considerably)
Does anyone here have experience running large models in a multi-GPU setup with several RTX 6000s in a high-concurrency regime and with large context lengths? (something like Deepseek 4 Flash, Minimax 2.7 etc.)
Wouldn't that be a fairly ideal setup for layer parallelism? That doesn't need the high-performance communication of tensor parallelism, and the high-concurrency regime would make it easy to keep the pipeline full with microbatches. You'd also be able to scale out your KV cache storage since that naturally splits layer-wise.
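The "keep the pipeline full" point can be quantified with the standard idealized bubble estimate for a GPipe-style schedule: with p stages and m microbatches in flight, roughly (p - 1) / (m + p - 1) of the stage time is idle. A minimal sketch, assuming equal per-stage compute and ignoring inter-GPU transfer time; the function name is mine, not from any library.

```python
def pipeline_bubble_fraction(stages: int, microbatches: int) -> float:
    """Idealized idle fraction of a pipeline with `stages` GPUs and `microbatches` in flight.

    Assumes every stage takes the same time per microbatch and that
    communication between stages is free (both are simplifications).
    """
    if stages < 1 or microbatches < 1:
        raise ValueError("stages and microbatches must be >= 1")
    return (stages - 1) / (microbatches + stages - 1)


# Four GPUs, one request at a time: most of the hardware sits idle.
print(pipeline_bubble_fraction(4, 1))   # 0.75
# Four GPUs, 29 concurrent microbatches: the bubble shrinks to under 10%.
print(pipeline_bubble_fraction(4, 29))  # 0.09375
```

Under those assumptions, a high-concurrency server is exactly the case where layer-wise splitting loses little to idle time.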
I have a 5090 machine sitting idle that I'm considering turning into a machine for my own small team (3 devs).
Are you willing to share any lessons learned, etc. that I could make use of? We are evaluating paying for a SOTA sub or trying this, and the talk about Qwen3.6-27B makes me want to try deploying this machine.
Sell the machine for $4K, use it to pay for Codex Pro for everyone for a year. Everyone will be significantly more productive and happy.
It's not even a real comparison if they are actually using them for coding.
If you are deploying always running agents (e.g. monitoring logs and services) then sure - a QWEN local server is a good choice. But for coding the cost in productivity of using a lower performing model is way too high.
Anyone who frivolously suggests throwing away possible independence in favor of dependence on a Silicon Valley company is either incredibly naïve or acting in bad faith.
I'll choose not to respond to your personal attack.
But in terms of actually running a dev team - you are free to use QWEN or another quantized local model that can run on an RTX 5090 for coding if it makes you feel more independent. However, you would struggle and spend many, many more hours achieving the same thing, with a lot more debugging time, long delays before it's done, and many more prompts.
It's just not the right approach. I use QWEN and other local models all the time, but for more clearly defined monitoring and classification tasks.
Not necessarily so. I can see how a bid to predict how things will be in 1 year in AI-based coding is likely a losing one. So the idea is to extract the maximum value now, and turn it into profits that would buy you whatever is adequate for the next steps. For comparison, the AI-based coding landscape a year ago, in May 2025, wasn't even close to what we have now, and half the key tools did not exist.
OTOH, as we see, the larger models demonstrate diminishing returns, smaller models demonstrate improvements, and hardware does not show any signs of becoming cheaper, so holding on to existing decent GPUs may, too, be a winning strategy in the longer term.
The 5h quota of Codex Pro on GPT 5.4 Medium lasts me for around an hour and a half, maybe 2 hours.
And this is already the "savvy" setup. Enable GPT 5.5 High fast and you will be beached in 30 minutes with active development.
For continuous all-day work you definitely need a higher-tier subscription.
I'm actually looking into deploying a GPU at my company because we can not give out our code.
Qwen 3.6 looks good
Right, I did swap that.
Still, you have to pay that 4k then every year and give out the code.
I also assume that prices will go up as no AI company (but NVIDIA -> selling shovels) is currently making any money.
For some projects, giving out the code might be OK (I use Codex there too), but for the core app at the company I'm working at there is currently a strict no-AI policy.
A local GPU solves this.
> Does anyone here have experience running large models in a multi-GPU setup with several RTX 6000s in a high-concurrency regime and with large context lengths? (something like Deepseek 4 Flash, Minimax 2.7 etc.)
For what it's worth, I've been seeing ~100 tps with 4-bit MiniMax 2.7 on two RTX 6000 boards, just running under llama-server without any optimization effort at all. I have no serious long-context experience with that setup, but at 30K context it's still above 90 tps.
If you are happy with Qwen 3.6 27B, I would personally switch the 5090 out for 2x RTX 6000s and keep running 27B. That will give you ~2x your current throughput with a lot more headroom for multiple users. More important, it would buy time to see how things develop over the next few months before you spend a whole lot of money.
> our infosec department doesn't buy the "zero retention" promise
They are wise to be skeptical! It is neither a promise nor zero data retention.
Look at Anthropic's Zero Data Retention policy -- and remember, this is the policy that applies to the exclusively eligible enterprise partners who can even qualify for a ZDR agreement with Anthropic:
> When ZDR is enabled, prompts and model responses generated during Claude Code sessions are processed in real time and not stored by Anthropic after the response is returned, *except where needed to comply with law or combat misuse*.
> Even with ZDR enabled, Anthropic may retain data where required by law or to address Usage Policy violations. If a session is flagged for a policy violation, *Anthropic may retain the associated inputs and outputs for up to 2 years*....
This means that Anthropic is actively inspecting all of your data with machine learning classifiers. When the usage is flagged for whatever reason as violating any aspect of Anthropic's Usage Policy, then they get to keep your data for 2 years, with no apparent limitation on what they can then use it for.
Crucially, you have ZERO guarantees about the sensitivity or specificity of these classifiers. For all anyone knows, Anthropic is silently flagging 75% of queries and retaining the data.
I think it’s a cost/opportunity tradeoff at best with any agreement, regardless. The rest of the contract may make it difficult to impossible to do anything about it, starting with basic arbitration clauses and ending in a ton of other provisions that can make any legal action futile. I doubt there’s much room to negotiate too.
Given that all labs need to diversify to become profitable, they’ll end up competing with their customers, and there’s nothing that exposes a business more than having AI offload every job function for every account, every mail etc.
I wonder how AWS handles this in Bedrock. Do they use Anthropic's classifiers? Or their own? Or none? Would their data policing be different in Bedrock than in their other services?
Thank you for the insight. This makes me feel confident that the L40S with 48GB VRAM we are about to acquire for engineering applications should be useful for agentic coding as well.
> Does anyone here have experience running large models in a multi-GPU setup with several RTX 6000s in a high-concurrency regime and with large context lengths? (something like Deepseek 4 Flash, Minimax 2.7 etc.)
They don't use the server all at once. In the UI, users typically ask a question, get a response, and continue with their work. In the case of autonomous agentic loops, an agent simply waits its turn until the server is ready to accept the request. Agents don't hammer the server 24/7 every second either, because they either need to be triggered or are busy doing other work, such as compiling or running tests.
It would be more interesting to know how many simultaneous users this setup can serve. Otherwise I can just say it serves 500 users but not all of them use it at the same time which doesn't communicate the right level of detail.
With a parallelism of 16 you can still get around 25 to 30 tokens per second per user when all 16 channels are running.
Not everyone will use the model at the same time but it certainly will be tight, especially for agentic coding.
For pure chat applications this should be quite fine.
The problem with wide parallelism with most models is that it blows up your KV cache. There are open models with KV caches lean enough to parallelize inference or even to offload the KV cache itself to disk without immediately running into wearout concerns, but they're quite exceptional.
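A back-of-the-envelope sketch of why the KV cache blows up with concurrency. It assumes plain grouped-query attention where every layer caches a full K and V tensor per token; models using sliding-window, linear, or latent attention will come in lower. The shape numbers below are hypothetical and not those of any specific model.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_tokens: int, sequences: int,
                   bytes_per_value: int = 2) -> int:
    """Total KV cache size for `sequences` concurrent requests of `context_tokens` each.

    The leading factor 2 accounts for storing both K and V in every layer.
    bytes_per_value=2 corresponds to fp16/bf16 cache entries.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * sequences


GIB = 2 ** 30
# Hypothetical model shape: 64 layers, 8 KV heads, head dimension 128, fp16 cache.
one_user = kv_cache_bytes(64, 8, 128, context_tokens=32_768, sequences=1)
sixteen_users = kv_cache_bytes(64, 8, 128, context_tokens=32_768, sequences=16)
print(one_user / GIB)       # 8.0
print(sixteen_users / GIB)  # 128.0
```

With those assumed shapes, 16 parallel 32K-token sessions would need more cache memory than the weights of many midrange models, which is the scaling problem described above.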
Subagent swarms are actually great for the local inference scenario because they can share a whole lot of KV cache. You get to raise the compute intensity of decode (i.e. the aggregate tok/s) essentially for free.
> don't seem to outperform Qwen3.6 that much in agentic coding/tasks
idk i imagine you'll hit less edges with a larger model just because.. more data
if you think of them as a kind of NN compression, it's ~obvious that the larger model can have more stuff encoded in it and hopefully accessible
i don't use LLMs much right now but using midrange models seems like an unnecessary compromise in most cases, especially since the big open models sound to be rivaling opus and not just sonnet :p
> The mentality shift of renting vs. owning the gpus is huge. When renting, each experiment costs money and I had to ask myself is it worth it. When owning, it feels like not running experiments is costing me money.
I feel like there is some very deep generalizable wisdom buried here.
Also something about subscriptions vs pay-for-usage. I feel the need to use all my weekly tokens or I'm wasting and I bet they would never get this kind of usage out of me if AI ended up being same price per token.
I always buy software/assets/dev tools for my hobbies (like CAD, music production, game dev) instead of paying subscriptions, even if that would very likely be way, way cheaper and would give me access to really cool tools. I don’t want to feel bad not using something and I know that’s the case with a subscription
You should be able to achieve this mentality shift without owning a GPU. You just need to commit some money upfront to cloud GPU spend, in a way that is not feasible to go back on.
That way you get the experimentation-encouraging mentality shift to "If I don't use this, I'm wasting my money", without the cost inefficiencies associated with actually buying an accelerator, discussed by others in this thread -> you'll never be able to match the utilisation and thus the cost amortisation of cloud GPUs.
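The utilisation argument can be sketched numerically. This is a simplified model with hypothetical inputs: it spreads the purchase price over the hours the card is actually busy during an assumed service life, and ignores resale value, cooling, and maintenance.

```python
def cost_per_used_hour(purchase_price: float, lifetime_hours: float,
                       utilization: float, power_cost_per_hour: float = 0.0) -> float:
    """Amortized cost of one busy GPU-hour.

    utilization is the fraction of lifetime_hours the hardware is doing useful work.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return purchase_price / (lifetime_hours * utilization) + power_cost_per_hour


# Hypothetical card: 10,000 purchase price, 20,000 hours of assumed service life.
for u in (0.1, 0.5, 0.9):
    print(u, round(cost_per_used_hour(10_000, 20_000, u), 2))
```

The same card gets several times more expensive per useful hour when it is busy 10% of the time instead of 90%, which is the gap a home rig has to close against a shared cloud fleet.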
This article appears to lack any reason for "needing" this beast, or any real comparison with alternatives, both of which are required to answer the question posed in the title. It's a summary of how much they spent and some light anecdotal comparison to what they might have spent on cloud services, but clearly they didn't do an exhaustive hunt for value.
The real question is whether or not they could have done whatever it is they did with less hardware. Is there a business idea here that could have been proven on cheaper hardware that could be upgraded as demand increased? Is the expected ROI there based on future earnings?
Absent any indication that this was needed in the first place, I can only conclude that it wasn't worth anything.
Post-hoc justification. There's no analysis of whether that level of hardware was necessary to launch, only that they did get that hardware and did launch.
Looking at the GPU utilization graph, it certainly seems like the hardware was saturated for many days/weeks on end.
Was it worth it to spend that amount up front, yak shave while building the system, etc. vs. pay for cloud GPUs? Probably not in terms of dollars, when their time is also valued in dollars.
Was it worth it for this person? It seems, unequivocally, yes.
Abstract/TLDR: LLMs are notoriously formulaic at writing, overusing certain tokens or phrases. I show that models trained with SFT fail to match the distribution of the training data by using Maximum Mean Discrepancy (MMD), Judge Model Quality (JMQ), and L2 Token Distribution.
Idk if this turns into revenue or some financial metric, but even if it does and it was a good outcome for the author, it still says nothing of risk. What if he loses his timing opportunity / gets beat to market because he's unnecessarily futzing around with hardware? AI is rapidly advancing and he spent 2 years on this to save what was probably <2 months of FAANG income. There are multiple other angles I could dissect this from in terms of risk. I'm all for taking risks, but at least acknowledge them and preferably measure them as part of making big decisions like this to save a little bit of cash.
Let’s be clear, though, FAANG (as someone who has spent an awful lot of my life working at FAANG) pays well but crushes your soul. There was a time, a very long time ago, when it didn’t, and there are the soulless soul crushers that love it there, but I would rather futz around with a mid-range car’s worth of hardware and be happy than spend a moment longer prostituting my soul for their money.
There was a time in this industry that it paid about as well as an accountant and people did it because they loved what they did. Then the money flooded in, a bunch of people switched majors from business to CS, washed out in industry, got their MBA, and became product managers and engineering managers and sucked all joy from it. God bless those that find that joy again.
> would rather futz around with a mid-range car’s worth of hardware and be happy than spend a moment longer prostituting my soul for their money.
So only 2 options in this profession are 1) sell your soul to 1 of 5 evil corporations that just so happen to also pay excessively well or 2) choose to be unemployed for years while spending a significant amount of money on hardware trying to turn a hobby into a business
Also by your reasoning, these GPUs are blood diamonds and the author's future product/business should warrant preemptive boycott by all the perfect people like you
That is like saying my new restaurant was a success, therefore powering it with a generator was better than connecting it to the grid.
The raw infra being local didn't enable any of that. Now if he was building ASICs at TSMC, that would be a different thing, because you'd then be using something different locally.
UPDATE: Launch was a success! 400K+ views, and multiple companies reached to use my IP.
It seems that he managed to get what he wanted from the hardware and I'm happy for them.
He said something interesting at the beginning of his post: he compared the cost of the hardware to the cost of his time based on his FAANG salary. That is an interesting way to think about it, but the rest of the article didn't make it clear to me whether in the end he saved money/time compared to just renting in the cloud.
Also, outside of the power cost, hardware has other costs too: you need to operate it, maintain it, set it up, etc., all of which require time. I mean, even the process of figuring out whether it had a good enough ROI compared to cloud takes up your time (collecting data, analyzing data, etc.).
Doubt it, feels basically like just an ad to get attention "Oh look, that's where the magic happens" vs running their code on existing infrastructure and thus just showing the results, like everybody else. This "feels" more "tangible".
I can't imagine spending $48K on a home GPU server, but I did just splurge and buy a PC with an RTX 5090, specifically to hold the largest models you can fit in 32GB. It's a top of the line PC with water cooled high end CPUs, 64GB RAM, RTX 5090 for $5K. To me the jury is still out whether this was a worthwhile investment, but I do expect to use this machine for a decade. I don't run it at 100% power (it's mostly idle, except for times when I'm training or doing batch inference). It has the nice property of being blackwell generation, similar to the machines we use at work.
It just scares me to own a box that is $48K in my house, especially if it breaks, or gets stolen.
I have a second computer with an RTX 4090 for gaming (running Windows). I also used the new RTX 5090 running Linux to evaluate whether Proton/Wine allow me to run Windows games on linux (yes, it works, but the compatibility and frame rate issues make me stick to native Windows for now).
I wonder what's going wrong there? Personally I found compatibility and performance on Linux to be extremely good. And just keeps getting better. And that's not even just me, that's all kinds of benchmarks out there. Sorry to hear that. : ' (
No idea. I agree that in principle I should have close to the same performance on Linux. I just didn't want to spend a bunch of time customizing configs and updating software so I could reach parity with Windows when I had two computers.
If you want a GPU that has comparable performance on Linux and Windows, you want AMD. NVIDIA drivers are notoriously bad. Many of my games run better on Linux with the open source AMD drivers. (CachyOS rolling rolling rolling).
I have no interest in moving to AMD for video cards right now- the network effect of NVIDIA is just too high, and their peak performance is insane. I also haven't noticed any major issues with nvidia drivers, unless you mean specifically running Windows games on Linux machines with nvidia cards, where I have zero experience.
Network effect for graphics cards? Literally what? Your friends don’t care what GPU you run my guy and there is not much benefit of having brand loyalty to a company like Nvidia that gives absolutely zero fucks about people that aren’t their enterprise customers buying GPUs by the thousands. If there’s any “network effect” for gaming GPUs on Linux it’s in favor of AMD because of the immense amount of work Valve has been putting in to make it work well for their steam* hardware.
Nvidia’s drivers are trash for gaming on Linux and the majority of your “compatibility and framerate issues” are because you’re using a sub-par product for the job.
I am also an enterprise customer that buys GPUs by the thousands, you can see a bit more about my work here: https://www.gene.com/media/press-releases/15010/2023-11-21/g... and https://blogs.nvidia.com/blog/roche-ai-factories-omniverse/ and have worked with nvidia since the mid-2000s on high performance computing for scientific research (in addition to having nvidia graphics cards since the Riva TNT, running both Windows and Linux). So having a blackwell graphics card I can evaluate with linux and windows, both for ML training, inference, and gaming, is a huge network effect.
We’re talking about your gaming PC here. Nobody is forcing you to ONLY buy Nvidia graphics for your personal gaming rig when you ALSO have a purpose built AI rig. Nvidia just removed “gaming” as a segment from their financial reports. They give zero fucks. This absurd blind loyalty serves no purpose.
Sadly if you want a GPU with good AI performance you gotta go with NVIDIA. It might sound crazy but as a 7900XTX owner.. My 12GB 3060 on my linux server outperformed the 7900XTX by 40%. The 3060 only has half the vram of the AMD card. Proprietary drivers under Arch Linux.
On top of the significantly worse software on AMD's side (literally didn't work on windows in particular - so the "performs as good on both systems" is a nonstarter, some GGUF library dependency just doesn't work/exist under AMD on windows). Had me running the AMD card on windows under WSL (not a problem with nvidia though, that ran just fine on windows-side directly).
Aaaand also the other AMD bugs, such as the pink squares display corruption that has been an active issue for my GPU in particular (7900XTX) for over a year, maybe approaching two at this point, with no fix in sight from the AMD team (barely an ack at all - not in a single set of patch notes, just a bunch of reddit discussion). Really regret spending so much on an AMD gpu.
I have the same rig as you minus watercooling, and I assume you have AMD Ryzen 7 9800X3D? Anyway, it's my only PC now, I game, dev, run local models, edit photos, edit videos, all in Manjaro. I get ~70FPS in Cyberpunk at 4k, every setting at "Insane" or whatever goofy thing they call it, Ray tracing on path tracing off, with no framegen but with DLSS set to quality. Without DLSS I get around 40fps. Seems equivalent to what I see online with people with a similar build on Windows.
I run hyprland, seems to be the only wayland based keyboard-forward WM that has good nvidia support (and, allegedly, supports HDR, though I haven't got this working). I heard gnome was pretty good otherwise. I was running i3 before and it also worked fine, however once I got into wanting to get streaming working, there wasn't good compatibility between i3/xorg and tools like sunshine. I believe steam streaming worked fine on it though iirc.
The only thing I miss from windows: easy streaming with sunshine/moonlight. Steam streaming works (usually heh) but it took me a couple days of fiddling to get a stream to work at all through sunshine, and it is choppy. But for local gaming, I don't miss windows at all, I'm so glad to finally have all my drives converted from NTFS to ext4.
No, it's an Alienware R51 with Intel Core Ultra 9 285K 3.2GHz Processor; NVIDIA GeForce RTX 5090 32GB GDDR7; 64GB DDR5-6400 RAM; 4TB Solid State Drive; Microsoft Windows 11 Home; 2.5GbE LAN; 2x2 Intel Killer WiFi 7 BE1750+Bluetooth 5.4; Liquid Cooler
I don't see it on the Dell site anymore, only more expensive, lesser configurations (good timing on my part?).
Yeah, I really want to put in the time to try out various games, but realistically, the whole point of getting a second computer and installing Linux was to be able to train and serve models, and switching between serving a model (that people in my house want to use at random times) and gaming didn't seem like a great choice. If I did get good results, I'd seriously consider wiping Windows 11 from my older machine (an older Alienware with a 4090), but to be honest, I'm perfectly comfortable on Windows desktop.
Having built an almost identical rig earlier this year can promise at least one similarly-spec'd machine gets equal use between AI and gaming (Both on Linux). Stupid-excited for the Steam Frame to finally come out.
Personally, playing with AI models is way more fun than getting sucked into a game loop. Game loops feel like busy work hooked to an engineered dopamine drip. AI models are new frontiers and are exciting to build with, modify, lobotomize, and hack around with.
I remember playing Quake III which had user-programmable bots and thinking "wow, this is a really hard computer vision and reasoning problem". And then realizing "huh, that's a major research area, I should work on that". Later I learned that the bots were fairly simple and worked on far simpler world representations (nav meshes).
And some of us are doing AI stuff all freaking day at our day job and just want to play some Tekken for 30 minutes when we get home, after the kid is in bed. But now PlayStations are $1000 and RAM and GPU prices are astronomical.
Not everyone is hustling 24/7 like some kind of lunatic.
I would probably hate someone if they were buying the same hardware as me but doing something actually useful with it. Any game worth playing doesn't require high specs anyway. There is such a large catalog of old games.
I specifically got the previous model so I could play AAA games with all the settings set to Ultra, at 4K. Cyberpunk 2077 struggled even with my 4090, so I had to disable ray tracing and enable DLSS. Since I've run out of new AAA games I've been playing older ones and it's crazy how fast they are.
Yes! It scared me too. I tried to insure it under my renter's insurance policy, but they not surprisingly refused. I had to get business insurance to cover it
You showed this setup to a business insurance underwriter and they gave you a policy? Can I ask how much the premium is? Or is this just theft insurance?
>> To me the jury is still out whether this was a worthwhile investment, but I do expect to use this machine for a decade.
The high cost and power consumption are both signs of the death of Moore's law, so you are probably correct that this system will be near state of the art for some time.
I was looking at Ultras for sale and had the same worry, so I didn't end up getting one. I have some peace of mind about AppleCare and technical repair, but I couldn't find insurance that would cover theft (or rather, I did, but it was too expensive).
Last fall, seeing the writing on the wall, I pieced together an "AI" rig, 96GB ram, 2x RTX 3090, 9950X - not exactly top of the line, but it came in around CDN$3000 all in all, with most parts second hand. I don't think I could build that for CDN$10000 today.
I've been using it pretty steadily for a variety of personal projects, and the only improvement (aside from the obvious "more VRAM") I feel pressed to make is a portable AC unit / some kind of a focused cooling solution. The rig raises the ambient temperature in the office by 4C at least.
Now with the murmurs of even the large players reconsidering their AI spend, and usage-based pricing shifts, having a self-contained, owned, and independently administered compute resource is looking better and better.
It's comparing laptops to dedicated GPUs in a server environment. The best comparison would be the Mac Studio but the current release is almost 2 years old at this point. We'll see what a likely M5 Ultra Mac Studio looks like, probably in Q3 this year.
But yes, for pure inference, the M5 Max MacBook Pros probably aren't there yet. They have other utility though, of course. And you can get 64GB and 128GB MBPs at a discount. Micro Center will currently let you buy a 64GB M5 Max MBP for under $4k, for example.
Why didn't you take into account batching, input tokens, different costs of electricity, and the fact that a laptop can still hold a decent % of its resale value, and is useful for many other tasks than running an LLM?
> Why didn't you take into account [...] the fact that a laptop can still hold a decent % of its resale value, and is useful for many other tasks than running an LLM?
Because that wasn't what they claimed to research?
>> for inference it's definitely not worth it.
It's entirely fine if you enjoy local LLMs on your computer, there are people doing horribly inefficient inference on smartphones now. But for pure inference tasks, it's pretty obvious why M5s and Mac Studios aren't replacing TPUs and GPUs.
Who is going to buy a $4299 M5 Max MBP with 64GB of RAM just to run Gemma 4 31b? Firstly, you don't need 64GB for that model. Secondly, if you want a machine that sits in the corner and does nothing but LLM inference, you don't buy a MacBook Pro, you buy some GPUs which are going to cost you a fraction of that (~$1k for ~64GB of VRAM is possible). The people buying Apple Silicon for inference generally aim for the Mac Studios with enormous amounts of RAM (128-512GB), to run very large models.
The idea is obviously to be running the LLM on your work laptop. As a developer I'd need a laptop with 24GB of RAM for work anyway, and 48GB, which is enough for a very good quant of Gemma, is just $400 extra.
What quant? You should have no problem running it at Q4 with 256K context, Q5 or Q6 even although maybe not at full context. I can run Q4 on a 4090 with just 24GB VRAM.
Not a single new 64GB GPU, but multiple used GPUs.
They’ve significantly increased in price (so much for hardware depreciation…) but you can still get a modded 22GB 2080 Ti for $320, or a Mi50 32GB for ~$450 each (used to be $150 a few months ago, alas), or a Mi50 16GB for <$200, but you’d need to stack 4 of them.
There’s also some more exotic configurations but those are probably the simplest options. You won’t get the performance of an RTX Pro 6000 Blackwell of course, and the power consumption will be pretty high so it’s only worth it if you have cheap electricity. But it is possible.
Except this math is 10x too high (unless accelerated depreciation is all of it) - a million tokens at 28 tokens/sec and 75W and 20c/kwh should cost $0.15 not $1.50. (And less with MTP.)
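The arithmetic behind that correction, as a short Python check. The three inputs are the figures quoted in the comment above (28 tokens/sec, 75 W, 20 cents per kWh); the function name is just for illustration, and only electricity is counted, not depreciation.

```python
def energy_cost_per_million_tokens(tokens_per_second: float, watts: float,
                                   price_per_kwh: float) -> float:
    """Electricity cost of generating one million tokens at a steady rate and power draw."""
    seconds = 1_000_000 / tokens_per_second   # time to produce 1M tokens
    kwh = watts * seconds / 3_600_000         # watt-seconds (joules) to kWh
    return kwh * price_per_kwh


cost = energy_cost_per_million_tokens(28, 75, 0.20)
print(round(cost, 2))  # 0.15
```

At 28 tokens/sec a million tokens takes just under 10 hours, which at 75 W is about 0.74 kWh, so the electricity alone comes to roughly 15 cents.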
Not necessarily. I was spending ~$150/month on vultr's kubernetes hosting. I spent $5k building out a pretty awesome 1U server and I put it in a colo that costs me $50/month. Next year I will break even financially and everything after that is saving money. I also am getting so much more out of this server than I was getting on vultr because I over-spec'd the machine. In addition to running more on my cluster, I spin up large virtual machines for development, experiments, and for offloading distributed builds. No shade to vultr, but owning my hardware instead of renting was absolutely the way to go. Unfortunately today the ram alone would cost over $5k, so the math has changed.
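For anyone wanting to run the same break-even check on their own numbers, a minimal sketch. The example uses the rough figures from the comment above (about $5k upfront, about $150/month before, $50/month colo after); it ignores resale value, spare parts, and the value of the extra capacity, and the function name is hypothetical.

```python
import math
from typing import Optional


def break_even_months(upfront_cost: float, old_monthly: float,
                      new_monthly: float) -> Optional[int]:
    """Whole months until cumulative savings cover the upfront purchase.

    Returns None if the new setup does not save money each month.
    """
    monthly_saving = old_monthly - new_monthly
    if monthly_saving <= 0:
        return None
    return math.ceil(upfront_cost / monthly_saving)


print(break_even_months(5_000, 150, 50))  # 50
```

The payback period is very sensitive to the monthly saving: the same $5k purchase against a $550/month cloud bill would pay back in 10 months instead of 50.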
This is interesting but I am unsure how you make money out of this home setup, I would imagine if one would be offering consultancy to a business the business would make their own equipment/infrastructure available, which would also give a better control of their data. But perhaps I am thinking this because I am thinking about very big companies. Then, on very small business I don’t see they having the use case with the budget to match the need. So is this for specific services for medium sized businesses? Can you explain this a bit?
Nice analysis, I would have loved a short overview of the kinds of experiments that were running on the machine (I know the results are given).
I find the "independent researcher" business model quite interesting. In the linked post he writes """DFT is a proprietary training algorithm, however, I’m currently offering a beta for a model training service where I will train your model for you using DFT."""
I'm curious how successful this is. Essentially market some AI breakthrough as a service instead of publishing a paper like my academic brain is trained to do.
As an aside, one thing that I always loved about our field was that the startup cost for many business ideas was "a laptop, internet connection and some grit". In the age of AI it's quite a bit more and I feel one of the sad side effects of this is that it crowds out poorer and younger developers.
One of these things is not like the others. If you don't spend the money on one of them, you can get a visit from government officials that might decide to take that "item" from you. You'd also be a worthless human to spend that money on the other 3 while not on the one.
a P40 from 10 years ago costs 5x what it did in early 2023. a 3090 from 6 years ago costs as much now as it did then. RAM costs over three times what it did a year ago. M3 512gb now costs twice what it did at release.
"soon" isn't happening until China saves us from the Taiwanese semiconductor cabal and their Talmudic markups.
'If you google “plugging a PC into multiple outlets”, you get lots of warnings that if you even consider such a setup you will instantly burst into flames. So I hired a professional PC builder make sure it was safe.'
I read this as: the "professional PC builder" would carry some sort of insurance. So it isn't really "safer", but if something goes wrong, the investment is (potentially) still safe.
Probably means hiring someone who has more knowledge about PSUs and especially about having two simultaneous PSUs. There are questions like: when you press the power button how do the two PSUs turn up and in what sequence? How do you deal with the PWR_OK signal? What if there are voltage differences between the two PSUs? What about power backfeeding?
I guess it was supposed to be a humorous aside, but it wasn't actually helpful because the relevant issue is when you pull more total amps from a single circuit than it's fused for (usually 15 or 20 amps in U.S. residences). The failure mode is usually tripping the circuit breaker.
That issue can often be addressed fairly easily by splitting the power draw between two adjacent circuits. You can have an electrician do it permanently or temporarily DIY it with an appropriately rated extension cord. The real issue was OP was in an apartment at the time so an electrician would have been difficult. I assume they decided to just have a system integrator build it because they didn't want to figure out how to segment and route the power rails in a dual power supply system, but it's not exactly rocket science. Problems are often more due to choosing power supplies that aren't up to their claimed spec, not pre-testing them under load or using incorrect or under-spec cables.
I think the relevant issue is that you could conceivably have a house with two outlets on opposite phases. Bussing them together in the PSU would then create a short.
This is actually THE standard in the US, which is actually fundamentally a 240V power grid but with an electrode stuck halfway down the secondary winding on every pole transformer, which becomes your "neutral". The two ends become L1 and L2, so that L1-N is 120Vrms, L2-N is 120Vrms, and L1-L2 is 240Vrms, and this is what goes into every home.
The power outlets connected to L1 are all opposite phase to all the ones connected to L2.
Rather than bussing the two outlets together, what you can safely do is get an electrician to just wire up an outlet with L1 and L2 and voila you have a 240V outlet. This is how you get all your dryer outlets, EV charging outlets, electric stove outlets, etc.
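The split-phase arithmetic described above (120 V from each leg to neutral, 240 V between the legs) can be checked numerically. This is only an illustrative sketch with idealised sine waves; the sample count and variable names are arbitrary choices, not anything from the thread.

```python
import math


def rms(samples):
    """Root-mean-square value of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


N = 1000                     # samples over exactly one cycle
PEAK = 120 * math.sqrt(2)    # peak voltage of a 120 V RMS sine wave

# L1 and L2 measured against neutral: same amplitude, 180 degrees apart.
l1 = [PEAK * math.sin(2 * math.pi * k / N) for k in range(N)]
l2 = [-v for v in l1]

l1_to_n = rms(l1)
l2_to_n = rms(l2)
l1_to_l2 = rms([a - b for a, b in zip(l1, l2)])

print(round(l1_to_n), round(l2_to_n), round(l1_to_l2))  # 120 120 240
```

Because the two legs are exact mirror images, the difference between them has twice the amplitude of either leg, which is where the 240 V figure comes from.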
I used to work for a company where we made test rigs and their safety guys were strictly against having a single machine with multiple power inputs. It wasn't about power draw. Once you have two plugs:
1. You no longer have the nice property that unplugging it guarantees (more or less) that it isn't electrified.
2. You open up the possibility of mains voltage from one plug appearing on the unplugged prongs of the other plug.
3. It possibly messes with RCDs, depending on what you do exactly.
Although in this case it's probably fine because he's just plugging totally separate power supplies in and they're already fully enclosed.
Agreed, if anything an electrician was needed, not a professional PC builder.
The picture shows two power supplies. Powering what is effectively one appliance from different circuits is a definite no-no, and I can't think of any circumstance where it wouldn't be in a home.
If his mains supply was sufficient to run the server and the house in the first place then the simplest solution would be to simply upgrade one of the MCBs/RCBOs on one of the circuits to the required capacity. I am not sure a landlord would even notice something like that, and if the house is wired correctly in the first place, it's unlikely to be dangerous. So going from say, 6A to 12A, on a 20A mains supply is generally fine if the gauge of wiring is correct.
I did this with used parts and cheaper consumer cards (3090s) and did much of the same calculations. I found it was way cheaper for me as well.
The main advantage, however, is that the friction of "this is going to cost me in tokens to even try" goes away. I was so much more willing to take chances and try new things on my own hardware than I would have been if I were paying API costs. I feel like this point isn't made clearly enough by those of us who run these absurd self-hosted inference systems.
Thanks for the write up, was a fun read. I spent an order of magnitude less, but I could relate to your story from beginning to end.
"The point of buying the server wasn’t to save money, it was to build something cool." In the end, this is always the real answer - one that I'm sure we can all agree is the 'correct' one too.
Sounds fun/stressful/rewarding. I'm most interested in the update at the end though 'Launch was a success! 400K+ views, and multiple companies reached to use my IP.' I too, like probably 1 in 5 of the people reading this, think I have figured out some major problems with LLMs (context and computation research) but have wondered the best way to 'release' and get value out of it. I can see training being a little easier in that you release weights against a known model arch but not the training code. My stuff is all custom layers though. Any thoughts on a release strategy where you need to release the layer code for people to see test weights/the benefits?
My first advice is to have a test set with clear improvement, and a clear "wow" demo use case. There are lots of "breakthroughs" that seem good but aren't (e.g. some new architecture that doesn't mask past tokens correctly and leaks information), so people will assume it is wrong. To prevent this, you need to be extremely rigorous in your launch materials. If you can make it into a product that people can try out themselves, that goes a long way. You don't need to open source any code (I haven't yet) if people can try it out some other way like a demo website. Good luck! Ping me if you want to chat more
This is a difficult calculation to make because you wouldn't rent time on the exact same system in the cloud. Depending on what you're running, a bigger server with better inter-GPU interconnects in the cloud might complete the task so much faster that the additional per-hour expense is more than covered.
Agreed. And the gained time either goes toward 1) more experiments, or 2) leisure, which makes you sharper in the lab and happier overall.
Not sure the "I saved $17,000 so far" framing is the most useful way to look at it, but it's a cool project and I love that people are doing this kind of thing.
FYI: If you're in a similar situation, think very carefully before you build your own. The $17000 might sound like a lot; but when you take into account your time and risk tolerance, renting might be a much better solution.
I built a very similar server myself [0] with a similar setup. I run different models for different purposes, but the primary one currently is kimi 2.6. I run kimi as the orchestrator model and then qwen, Gemma and others for specific tasks (sometimes loaded dynamically based on the task at hand), all exposed through the pi harness. I also use Hermes for some personal repeated tasks which connects to the same models, hosted on my local Mac Studio.
I am not even going to pretend that this is a financially reasonable option. I simply wanted to have local models. Maybe down the line, as cloud models become less subsidized, I might benefit from having a local setup, but for now, it wasn't the most prudent financial decision.
But one big benefit is that I never have to worry about my account being randomly banned, nor do I have to worry about running out of quota. I still use codex and opus for some specific tasks, but as tools are improving, I need them less and less.
missing from most of these cost discussions: privacy. for some workloads the entire value of local is zero data leaving the network, and cloud cost is irrelevant
Just curious OP (if you're the one posting) -- what do you mean by independent researcher? What are you researching and are you making $$ from it or are you living off previous built up savings? Seems like an interesting path. What research have you looked into so far?
I am not the author, but he has been training/tuning? a model that produces text that mimics the source material in a more natural way. So getting the LLMs to produce less bland and boring LLMisms, according to the follow-up blog post.
"I spent a long time trying high risk/high reward experiments and failing. But now I have something good. I’ve solved a major problem with LLMs. And I’m launching next Monday so we will soon see if it’s actually a breakthrough or just LLM psychosis "
Maybe ai companies today have some bounty program?
(I would assume they haven't made a lot of $ off of this, if nothing else because they've only just put out that post and demo. They do seem to have produced a model that doesn't sound very LLM-y to my ear, though it also seems rather weak for its size.)
Shallow take: They made an LLM that uses fewer emdashes.
Cynical take: They made an LLM that can bypass existing AI slop detectors.
Realistic take: They found a research problem they found interesting, dumped a bunch of capital and sweat equity into and (claimed to have, at least) found a solution. Neat!
Or they just have lots of money and a hobby. Someone else might blow $48K to get an old Cessna and go have fun flying around. Not everything needs to have a purpose.
Just curious - What exactly are you using that rig for? I see that you said research work. Are you building a product or training models? I ask because whether something is worth it or not depends largely on what you get out of it and how you value what you get. It's perfectly fine to leave a FANG job and go for, say, a pottery hobby. What gives you happiness and your value system - these will qualify your decisions.
Any kind of fixed capacity usage model seems to be a dead end. Paying per token might seem like an exploitative arrangement at first glance, but it's a luxury if you are experimenting or deploying greenfield.
Provisioned capacity is a really high end thing. I feel like you'd need to be spending more than $1000/day on tokens for this model to make any sense. You lose a lot of flexibility once you start dumping capital into specific pieces of hardware. Maybe start by renting the GPU server for a few days...
Great article. I'm about to embark on a similar journey.... Doing a ton of AI development right now. Don't need a server, but a very, very high end workstation is super appealing to me right now. Looking at $50-$80k. 1TB RAM. 2x RTX Pro 6000s. 64 core Threadripper Pro. As many 4tb or 8tb nvme drives as I can stuff.
I envision NixOS at the core... then everything I need virtualized on top with KVM/QEMU. Maybe a dual boot setup with Windows for gaming and Flight Simulator (but I could virtualize that too with easy GPU passthrough.)
Lingering questions I'm working to figure out:
- Will 2 RTX Pro 6000s run on a 1600 watt PSU? Not sure how much higher I can go without calling an electrician. (standard US home.)
- Assuming I plop this into my home office, should I expect the PC to run significantly hotter than my current rig? (3960x threadripper, 128GB RAM, 1600watt psu, overclocked and watercooled 4090.) My water temp, measured at radiator, is about 60c at peak load. (This is the only number I care about, as this is what I have to consider to be comfortable sitting next to it.)
What do you want to do with the workstation? I have a similar setup:
- 512 GB
- Epyc 9684x
- 2x RTX 6000 Pro
- 1400 W PSU x 2 but in redundant mode
Mine is in a colo where it stays nice and cool. In my case, I went with less RAM and more GPUs (bought 4). Secondarily, the Max-Q blower version of an RTX 6000 Pro Blackwell is easier to keep cool and also only needs 300 W at the cost of very little performance. The non-Max-Q cards also only really use 300 W during inference, but the good thing about lower power use is that you can put more GPUs in very safely.
I assume you want the Threadripper Pro to maximize single-core performance? So you're spending a lot of time on CPU? Interesting stuff.
I gained a lot putting the machine somewhere else. TTFT on a thing like this is between 100-800 ms depending on batching and model size and so on, and your nearest datacenter is likely <10 ms. It sits on nice dual redundant power in a place where it's blown icy cool.
Good luck with your setup. If you get around to it, and end up writing about your setup on a blog, do share. Email in profile.
Very nice. Primary use case is application development, where the applications leverage a mixture of cloud based and local models. Modelling complex architectures. My work is primarily in the aerospace and defense arena, so hybrid and on-prem are important, as are ITAR and CMMC compliance. The idea is to have the local rig to build and validate architectural deployments that can sit on prem on customer hardware, in cloud, in gov cloud, or in a mix.
Not really looking at colocation, as this machine would double as a heavy duty gaming and flight sim rig. That means at least one regular RTX 6000 Pro. Not sure if I can mix and match with the Max-Q version, or if I even want blower fans in a desktop case (last time I did that was about 16-18 years ago with an ATI card... wasn't a fan--pun intended.)
The other advantage of the local GPU is that you are not feeding your data into cloud providers. I'm not sure how much you can really trust Anthropic and OpenAI not to be improving their models based on your input.
How much do you trust OpenAI or Anthropic to not use it as training data anyway? What if you are building a startup and they can just use their visibility into it to copy your IP instead of buying your company?
If nothing else, rosmine's DFT [1], which is what they were working on with this setup, seems like a worthwhile investigation.
While I'm skeptical that there is much of a moat, at least for the large players, it should at least hopefully set rosmine up for the next job :)
It does seem to fix the current biggest issues with using LLMs for writing at various publishers. If you're The Economist, you have a very specific house style and you have a decent corpus of articles written in that style. At least on my reading of it, rosmine can use DFT to get a model to closely match its outputs, in terms of the language quirks that are generated, to that of the corpus it is fine tuned on. ie it will very much match the house style, particularly as it is used in writing, vs giving a system prompt to an LLM that has some Economist articles in its vast training set, and telling it to write in that style- it will do an ok job, but still exhibit LLM language quirks despite itself. Even if you feed it the specific "style guide" that they give their authors, I dare say the reality of their writing is the best place to learn, and it sounds like DFT can ground the writing of a model in a specific corpus like that.
Giving an LLM samples and telling it to apply the style in the samples works a lot better than just telling it to copy a style it may have seen, or giving it a list of rules.
They do it well enough that it'd take really good output to beat.
If your goal is to, say, write science fiction, their reversion to classic LLM-isms is really distracting and is what makes people say at a glance that it was written by an LLM. You basically can't use them at the moment in any real "natural" long-form writing. Everyone will call "slop" pretty quickly on the current frontier models.
I have seen examples that show otherwise, including from a client that tested it extensively by paying people who thought they were being paid to help detect AI-generated content. They did little more than what I described. It works very well. Some people still insist they are able to tell the difference, but in the tests I saw, people did little better than random chance.
Some of it you could probably tell with statistical analysis, but actually people are far worse at judging whether content is AI generated than they think they are.
If you need to beat an AI testing tool, you need to do marginally more work than to stop people from recognising it, but not all that much.
The nature of it is that you don't "see" most of the stuff that is well done because few people want to talk about it.
From the author’s POV it seems like they were going to do this research regardless, so this is asking what the most cost-effective way to do that research is.
Or, for a person who did have a great way to monetize the same workload they’d probably find a lot of value in reading this post.
I've already seen one question like this in the thread, but let me rephrase it slightly more sharply: did you consider renting out your setup on vast.ai, and if so, how much money could it generate per month after deducting electricity?
Also, sorry for the noob question, but doesn't such a server generate an enormous amount of heat? Did you not use any special cooling system?
(For reference I’m talking about the DFT post from the same blog.) I love that ML is still in the “gentleman researcher” stage where relatively small amounts of startup capital can buy a ticket into frontier research.
For a lot of research questions 6 GPUs is even overkill.
It’s one of the reasons I’m skeptical of the “trillion dollar supercluster” idea [0]. I think what we need is more reasonably smart people investigating medium-sized problems. A “GPU middle class” you might say.
People are doing the economics against cloud GPUs, and of course cloud GPUs are going to be cheaper. But also, is generating tokens all you do with your computer? I can play games on a DGX Spark and also do LLM inference, so sometimes the economics work out, apart from having fun with it.
Quick tip for people who want to experiment with local models: A lot of the common smaller models are also available on openrouter or other services. Dirt cheap.
I know it's not the same. But a lot of people buy expensive GPUs, just to find out they have no real use for smaller models.
Openrouter is great for experimenting with models. I did exactly what you're saying to test smaller models that will run on commodity hardware and determine if it might be worth it to drop $10k on hardware. For me the answer was no, but it's close. I'm very excited for the next few innovation cycles to arrive.
The $48K also isn't fully sunk cost - there's a non-trivial residual value for those GPUs at the moment and likely for a few years yet. The server has a depreciation curve that's pretty enviable, actually!
The idea is similar to maintaining on-prem vs cloud.
Cloud is optimized for development velocity, but because it is a high-margin business, on-prem eventually looks more promising.
It could be too late, but it might be worth looking into tax savings if you have a business. Depreciation of an asset counts as a loss and may reduce your taxable income. (I'm NOT a tax expert)
Cloud servers have cheaper electricity, the scale of industrial-level cooling, no issues for you (as a user) with hardware failure (ie you just use a different server; it's not your problem) and can amortize their cost by running 24x7. I've seen H100 compute hours for as little as $2.
As the author notes, there are also electrical/wiring issues that cap how much compute gear you can run in a space not designed for it. I suspect a standard 20A 110V circuit can probably handle 2x RTX 6000 Pros. 15A probably can but that requires more research. Anything more than that and you're using multiple circuits, which has issues, or you need an upgraded circuit (eg 40A 240V) with all that entails (eg heavier duty cables, custom plug, etc).
> I suspect a standard 20A 110V circuit can probably handle 2x RTX 6000 Pros. 15A probably can but that requires more research.
During initial setup of the server I am putting together, I found that a machine with 4x Blackwell cards derated to 300W can get by on a single 120V 20A circuit. It's tight but doable. A lot depends on the power supply. I don't think it's a great idea to run 4 high-power GPUs on a single ATX-style PSU, even a beefy 1600W job.
The other questionable part is whether all four cards can temporarily spike at full power during boot, before the wattage limit is applied by the OS. Some accounts say this is possible, and if so it could shut down the party in a hurry. But I didn't see any misbehavior when I tried it.
My earlier research suggests NVIDIA does not actually cap spikes, it caps the average over short periods of time. So setting the power limit is no guarantee.
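The single-circuit budget discussed above can be sketched as a quick calculation. The 80% derating for continuous loads is a commonly cited rule of thumb and is treated here as an assumption, as is the 500 W allowance for everything that is not a GPU; neither number comes from the thread, and the function name is made up.

```python
def circuit_headroom_watts(volts, breaker_amps, load_watts, continuous_factor=0.8):
    """Watts left on a circuit after derating the breaker for continuous load."""
    usable_watts = volts * breaker_amps * continuous_factor
    return usable_watts - load_watts


# 4 GPUs capped at 300 W each (from the comment above), plus a guessed
# allowance for CPU, fans, drives and PSU losses.
gpu_watts = 4 * 300
rest_of_system_watts = 500  # hypothetical, not from the thread
load_watts = gpu_watts + rest_of_system_watts

for amps in (15, 20):
    headroom = circuit_headroom_watts(120, amps, load_watts)
    print(f"{amps} A breaker: {headroom:+.0f} W headroom")
```

Under these assumptions the 20 A circuit has a small positive margin and the 15 A circuit does not, which matches the "tight but doable" description. The sketch models average draw only, so it says nothing about the short spikes mentioned above.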
"If I were to do this again, I wouldn’t do a custom build like this. I would buy a standard datacenter server and rent space in a colocation center. But then I would miss saying Hi to grumbl once in a while."
I have four old 24gb Nvidia cards. They're not great but they're not useless either. The problem is that I haven't really figured out a good way to actually use them.
Genuine question; would anyone here recommend any specific motherboard to best utilize these cards?
Depends what you want to do and which cards you have, but usually going with any older (3rd gen+) threadripper pro setup will give you a lot of pcie lanes.
I myself run a Gigabyte TRX40 AORUS XTREME, but since it's a regular Threadripper (not Pro), with 4 GPUs two of them will run at x16 and two at x8 speeds.
It actually won't have "had" any experiences though. Yes, it can aggregate stuff from blog posts and reviews and marketing material. That's hardly the same thing.
So some things have changed since this rig was first built (2024). The most relevant is that $6800 RTX 6000 Ada 48GB has arguably been supplanted by the $9500 RTX 6000 Pro 96GB.
The Ada has a memory bandwidth of 960GB/s. The Pro has 1.8TB/s and about 40-50% better performance so is at least equivalent in processing power, much better in memory bandwidth (important for inference) and can hold larger models on a single card.
I've considered buying a rig with 1-2 6000 Pros for similar reasons but I want to see what happens with this year's Mac Studios with a likely M5 Ultra. Macs have a shared memory architecture whereas NVidia segments the market based on max memory where the biggest consumer card (RTX 5090) has 32GB of VRAM but still excellent memory bandwidth (1.8TB/s). A RTX 5090 rig will still trounce a Mac Studio seems to be the conventional wisdom. Despite being able to hold larger models and being able to chain Mac Studios on TB5, their lower memory bandwidth (~900GB/s) and lower overall GFLOPS mean they still come out behind.
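To see why memory bandwidth matters so much for inference, here is a back-of-envelope sketch. It assumes a dense model whose weights are all read once per generated token, and it ignores KV-cache traffic, compute limits, and batching, so it gives a ceiling rather than a prediction. The bandwidth figures are the ones quoted in this thread; the 20 GB weight size and the function name are hypothetical.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s, weights_gb):
    """Rough upper bound on single-stream decode speed for a dense model.

    Assumes each generated token requires reading all weights from memory once.
    """
    return bandwidth_gb_s / weights_gb


weights_gb = 20  # hypothetical quantized model size, chosen for illustration

# Bandwidth numbers as quoted in the comments above.
for name, bandwidth_gb_s in [
    ("RTX 6000 Ada", 960),
    ("RTX 6000 Pro", 1800),
    ("Mac Studio", 900),
]:
    ceiling = decode_ceiling_tokens_per_s(bandwidth_gb_s, weights_gb)
    print(f"{name}: at most ~{ceiling:.0f} tokens/s")
```

The ceiling scales linearly with bandwidth, so doubling it roughly doubles the best-case single-stream speed, while memory capacity only decides whether the model fits at all.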
That being said, the current Mac Studios are relatively long in the tooth, being released in 2024.
I'm still not sure any of this is really worth it because things are still changing so fast. I think there's a decent chance of a number of large AI companies going bust in the next 2-3 years such that you'll be able to buy enterprise AI hardware at cents on the dollar, a bit like how Google bought data centers in the post-dot-com crash.
But anyway, nowadays I'd be looking at the RTX 6000 Pro as the sweet spot, having anywhere from 1-4 in a single server.
The electrical issues the author mentions are interesting. I hadn't really thought about the max amperage on a residential circuit. In a DC, these would typically operate on three phase power and much higher overall amperage. I wonder if there's a device you can buy that can combine multiple residential circuits into a single power source for a server this power hungry?
I have the Macbook M5 MAX with 128 GB of RAM. I put its performance at roughly equivalent to the RTX 5070 Ti. The M3 Ultra 512 GB for me is about half the performance of the RTX 5070 Ti but obviously it has the ability to do more because of the increased memory.
I don't think anything compares to the nVidia chips at all.
Why are these sockets "ruled out"? Pipeline/layer parallelism doesn't need high bandwidth between nodes, and tensor parallelism has middling performance unless you have very fast networking and very slow compute. It all depends on what you're doing.
You are correct that bandwidth requirements depend a lot on the exact workload, and that in specific cases it might be doable to use AM5 for multiple RTX 6000 Pros. The parent mentioned workloads that are general, and broader than inference-only. In that case I would consider spending a bit extra on the motherboard to ensure that PCIe bandwidth is not an issue.
That's a nice problem to have. I can't afford a $48K GPU server, even though I have worked as a developer for 25 years, because I live in the wrong place.
That’s very cool and very expensive - I think the cadastre value of the apartment that I live in is like 35k EUR or thereabout.
I wonder how much worse just a bunch of Intel Arc B70s might have been, software fuckery aside. Ofc if I’d need to run local inference or simple fine tunes and learning stuff, I’d probably get one of the SFF options - Mac Minis and all of those Sparks or new AMD AI chips. Then again, I’m broke so go figure.
I just fork over some money every month to Anthropic, have been trying out more DeepSeek and also Mistral (their Vibe tool is surprisingly passable under WSL).
> if more powerful GPUs could help me make my work be successful just 2 months earlier than I would have with a smaller machine, then buying a more powerful server would be worth it.
> Because of this I got a motherboard with slow GPU interconnect. It’s good for running many small experiments in parallel (which is my main use case) but horrible for any models split across gpus.
:( you paid a professional pc builder and you weren't told this?
Honestly, I made the same mistake when I added a GPU to my (not $48K) existing homelab. I got an Ada 4000 for its slim form factor and low wattage, but realized after I bought it that it does not support NVLink, so I can't really effectively double it up later if I wanted to. Live and learn. I suppose you might research that a little before blowing that much money though LOL :)
Consumer motherboards can still make sense even if you leave some performance on the table. Running an actual 8x GPU server is not something you'd want to do in an apartment. Imagine the old Lucasfilm "THX" trailer where an unearthly-sounding foghorn whine rises to a sweeping crescendo at reference level, only without the decay at the end.
At the time he put this rig together, there weren't a lot of open-weight LLMs that could run well on 6x48=288 GB, so it probably wasn't a huge loss. There still aren't, really.
Right now I'm in the process of cramming Blackwell cards into an old DDR4-based Milan server, where the important thing is to be able to run large models at all. The GPU fans alone burn over 400 watts at full throttle.
That was an option, but having decided on a true server chassis for other reasons, it made sense to use server-edition cards to take advantage of all those fans. I downclock them to 300W anyway for longevity, but it's nice to have the option to go to 600W if needed.
The server is going to live in the garage, so I'm not that concerned with noise. But I had no idea what to expect when I flipped the switch for the first time. It sounds like something out of the Book of Revelation. No way, no how could something like this be used in an inhabited area.
I wonder why using 2 PSUs resulted in having slower interconnect.
There are no specs in this blog post regarding the CPU/motherboard choice, but Threadripper Pro has had 128 PCIe lanes for some time now, so running all GPUs at full speed shouldn't be a problem.
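The lane arithmetic behind that claim is simple enough to sketch. The 128-lane figure is the one given in the comment; the sketch assumes the motherboard actually routes those lanes to the GPU slots and ignores lanes used by NVMe drives and network cards. The function names are made up for illustration.

```python
def lanes_needed(slot_widths):
    """Total PCIe lanes consumed by a list of slot widths (e.g. 16 for x16)."""
    return sum(slot_widths)


def fits(slot_widths, cpu_lanes):
    """True if the CPU exposes enough lanes for every slot at its full width."""
    return lanes_needed(slot_widths) <= cpu_lanes


six_gpus_at_x16 = [16] * 6
print(lanes_needed(six_gpus_at_x16), fits(six_gpus_at_x16, cpu_lanes=128))  # 96 True

# The mixed layout described in the earlier comment: two cards at x16, two at x8.
mixed = [16, 16, 8, 8]
print(lanes_needed(mixed))  # 48
```

Six cards at x16 need 96 lanes, which leaves 32 of the 128 for storage and networking under these assumptions.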
If you split models using pipeline/layer parallelism you don't have to care about a slow interconnect, you're just slowed down a lot when running a single inference at a time as opposed to a fully pipelined minibatch. But tensor parallelism requires much faster interconnects than you could get in your average server, so I'm not sure that a different motherboard would help all that much.
They did not. That's a mining rig not a workstation. It's visible from the photo and the chart showing multiple failures over a short period of time including the risers -- which are visibly very low quality -- failing twice.
You have 50K, you call a real expert like Puget Systems or Digital Storm.
The research that's presented in another article on the same site is way more interesting than the Betteridge's law article linked here. It'll be very useful in my own latest project if this research is incorporated into some model I can rent by the token!
You guys are nuts... I hope you're making enough money to justify this level of investment and power use (not to mention noise and heat management) in your home...
I'm just putting a 2nd hand 12gb 3060 into my lab box, but it's only for use with HA/Paperless/Plex etc type things. I don't need multi-model agentic behavior for private use.
If I did I reckon I'd be renting infrastructure rather than filling my home with that sort of gear.
Out of curiosity, did you check how much it would cost to rent a cage in a colocation space? Having to power your computer from two different outlets sounds wild.
"If I were to do this again, I wouldn’t do a custom build like this. I would buy a standard datacenter server and rent space in a colocation center. But then I would miss saying Hi to grumbl once in a while."
Yes, I mean, he could rent a cage and run grumbl there. It doesn't have to be a standard datacenter server, even though a standard datacenter server would be better and cheaper.
A cage[0] is ~100x larger than what you need to host a single server. Many data centers will colocate by the rack unit. At others you can get a quarter or half cabinet[1]. Even at the very largest enterprise datacenters you can colocate a single cabinet.
> A cage[0] is ~100x larger than what you need to host a single server.
Yup, but i was assuming that he wanted to experiment building gpu rigs. For sure standard GPU servers are cheaper and easy to maintain. I have two lenovos, bought them used, already EOL.. was cheap and better than any custom gpu rig.. but i was pragmatic, because my goal was to put it in production, and not to research...
The cheaper, easier solution would probably be just to get an electrician to wire up a high amperage 240V outlet just like your electric stove or dryer has, and then get a PSU that connects to that.
Would probably cost you $500-1000 depending on how difficult your home is.
The article stated that this was dismissed because the author lives in a rental apartment and he was not certain the landlord would agree to making this change.
I did not see any indication that the landlord was ever actually asked, it appeared to be the author's "sense" that any answer would be "no" from the landlord.
It doesn't cover risk. If one or more GPUs die, who pays for it? If you rent, you are guaranteed to be insulated from this risk. But owning, you might not have the best return policy from the vendor. And if you are actually at fault for breaking it, they have every right to deny a return. Or if your apartment is burglarized or catches fire (possibly from overloading the circuit) you are out the entire investment.
Also a lightning strike or surge from the electric utility could fry the whole rig. Proper protection costs thousands, and even then it's not guaranteed to protect everything
I'm talking about standard surge protectors. Properly installed they are enough except for direct lightning strikes, these will fry everything. But unfortunately, even in code-obsessed Germany landlords are not required to retrofit SPDs.
To protect a large electrical device investment, you would want an EMP shield whole-home SPD, in addition to an SPD right at the electrical device. The first one shields exterior surges (including non-terrestrial), but the second shields against internal surges. And yeah lightning will blast through both of them. So the best bet is probably a lightning strike detector combined with renters insurance.
In the article, he wrote "I tried to insure it under my renter’s insurance policy. They didn’t like that. I had to get business insurance to cover it.", but he didn't say how much it cost, either.
> I thought that I could not get a standard datacenter server because my apartment wouldn’t let me upgrade the circuits, so I needed to have 2 power supplies plugged into different circuits.
Why didn't they just put a higher amp breaker in the box?
It is unsafe for wires to carry more current than they are rated for, because the wires act like very low-ohm resistors. At a high enough current I, the power dissipated in the wire, P = I^2 R, is mostly heat, and enough of it will melt the wires.
> Why didn't they just put a higher amp breaker in the box?
1) note the word "apartment" -- they rent, not own, and doing so not only would likely be illegal, but might also get them kicked out of the apartment.
2) Unless the wiring on the circuit drop, and all the end points are rated to handle the higher current, doing so would be an electrical code violation (and therefore trip into that "illegal" arena that might result in getting kicked out of the apartment).
Most residences are wired using the minimum size wire rated for the installed breaker (because doing so saves costs). So a 15a breaker in the box would mean 14gauge (the US NEC minimum size for 15a circuits) wiring in the walls and 15a rated outlets/switches. Installing a 20a breaker in the box would be a code violation, and in many jurisdictions also illegal.
And all the above is without considering that installing a 20a breaker on wires rated for 15a increases the fire risk tremendously if those wires are now asked to actually carry 20a for any length of time.
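To put a number on the fire-risk point, a small sketch: the table holds the usual NEC small-conductor limits for copper branch wiring, and the heating ratio follows from P = I^2 R with the wire resistance held fixed. The function names are made up for illustration, and this is not a substitute for an electrician.

```python
# Usual NEC limits for small copper conductors: wire gauge (AWG) -> max breaker amps.
MAX_BREAKER_AMPS = {14: 15, 12: 20, 10: 30}


def breaker_ok(wire_awg: int, breaker_amps: int) -> bool:
    """True if the breaker does not exceed what the wire gauge allows."""
    return breaker_amps <= MAX_BREAKER_AMPS[wire_awg]


def heating_ratio(new_amps: float, old_amps: float) -> float:
    """Relative resistive heating in the same wire: P = I^2 * R, R unchanged."""
    return (new_amps / old_amps) ** 2


print(breaker_ok(14, 15))   # 14 gauge on a 15 A breaker
print(breaker_ok(14, 20))   # 14 gauge on a 20 A breaker
print(f"{heating_ratio(20, 15):.2f}x the heat at 20 A vs 15 A")
```

So a 33% increase in current means roughly 78% more heat in the same wire, which is why the breaker has to match the wire and not the other way around.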
"quit my FAANG job" as in they simultaneously worked in Facebook, Amazon, Apple, Nvidia and Google? Or did the op work at Netflix and is too ashamed to admit that :P
In the last year, I have bought an M3 Ultra Mac Studio with 512 GB, a Macbook Pro M5 MAX with 128 GB and an RTX 6000 Pro. I have spent around $25k so far, not including electricity. I figured worst case scenario I can sell them in the next year and only take a haircut as opposed to losing my entire investment.
In comparison to just paying for tokens, the tokens would have been much cheaper and much, much faster. I've been running Gemma4:31b, Qwen3.5 and 3.6, getting local LLMs to solve AMC 8/10 math questions, and it's about 10-100x slower than just doing it online. When I tried it with ChatGPT late last year, it took about one night and $25 to solve about 1000 questions. Using my RTX 6000 and M3 Ultra with Gemma4:31b on both, it answered about 40 questions in 7 hours, and I haven't checked how good the answers are yet. That's 800 watts (600 for the RTX and 200 for the M3 Ultra) running for 7 hours to solve around 40 questions.
At the very least I'm going to try to sell my M3 Ultra if I can find a reliable place to sell it without getting ripped off by scammers.
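The figures above can be turned into per-question numbers with a quick sketch. The electricity price is an assumption (it isn't given anywhere in the thread), and hardware cost is deliberately left out, so treat the output as a rough comparison only.

```python
def run_stats(watts: float, hours: float, questions: int) -> dict:
    """Energy and throughput for one local run."""
    return {
        "kwh": watts * hours / 1000,
        "wh_per_question": watts * hours / questions,
        "questions_per_hour": questions / hours,
    }


ASSUMED_USD_PER_KWH = 0.15  # assumption: not stated in the thread

local = run_stats(watts=800, hours=7, questions=40)
local_usd_per_question = local["kwh"] * ASSUMED_USD_PER_KWH / 40
api_usd_per_question = 25 / 1000  # $25 for about 1000 questions

print(local)
print(f"local electricity per question: ${local_usd_per_question:.3f}")
print(f"API cost per question:          ${api_usd_per_question:.3f}")
```

At that assumed rate the electricity alone lands in the same ballpark as the API bill per question, before counting any of the $25k in hardware.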
I’m not usually one to ask this because learning to do a thing can be fun, but why exactly have you spent 25 thousand dollars on getting an LLM someone else made to answer maths exam questions?
It's just a project I'm working on. I'm working on projects where AIs are processing and classifying large amounts of data that would be a lot of work for humans to do.
I think of LLMs as being well equipped for handling dynamic data or adapting to unforeseen circumstances (random code requests, websites' ever-changing layouts, typos, non-standard formatting in docs, grokking the important info, etc.), but math problems are by definition a very specific set of instructions to run, so is the overhead and "thinking" aspect of an LLM/AI even needed here? I'm genuinely curious, btw, I'm not asking sarcastically. Can't these math problems just be yanked from some test file and rapid fired directly at a gpu/compute unit?
> Can't these math problems just be yanked from some test file and rapid fired directly at a gpu/compute unit?
Yes this is exactly what I'm doing. I isolated the actual math question, and then sent it to my two servers to process and that's what's taking 10m+ to return. I'm asking them to solve the question and return the full answer along with their steps. I care about correctness so taking time is okay but I can't use 10m per solution.
Nono, parent was asking “They’re bad and inefficient at that, so why have an LLM do math? Why not just use some code and the CPU/GPU that’s already good and efficient at basic math?”
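To make the parent's point concrete, here is what "just use some code" looks like for a mechanical counting question. The question is made up for illustration (it is not taken from an actual AMC paper), and plenty of contest problems need insight or reading comprehension that a loop like this won't give you.

```python
def count_multiples(limit: int, divisors: tuple[int, ...]) -> int:
    """Count positive integers below `limit` divisible by at least one divisor."""
    return sum(1 for n in range(1, limit) if any(n % d == 0 for d in divisors))


# Illustrative question: how many positive integers below 1000
# are divisible by 3 or by 5?
print(count_multiples(1000, (3, 5)))  # 466
```

That runs in well under a millisecond on any CPU. The hard part the LLM is being paid for is turning the English into those three lines, not the arithmetic.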
Privacy and offline operation are valuable or non-negotiable in some cases, but the difference is pretty categorical between what can run on a single card and what can run on a DGX GB200 NVL72 cabinet. Doesn't mean it's not worth seeing how far local models can be pushed. Not every problem needs a senior engineer.
I know it's one of those "if you have to ask" situations, but curiosity got the better part of me. Here's the search assist response:
"The DGX GB200 NVL72 AI server costs approximately $3 million per unit. This system includes 72 Blackwell GPUs and 36 Grace CPUs, making it one of the most powerful AI servers available."
The search assist actually credited a source used with: https://www.tweaktown.com/news/98292/nvidias-new-gb200-super...
That $25k spend by GGGP seems like nothing in comparison. At $3M for 72 GPUs, that's about $42k per GPU, so $25k is roughly 60% of one chip's share of that cabinet. Good gawd I'm old and out of touch with modern AI data centers.
It's The Circle of Computing Life. The pendulum swings between centralised mainframe timesharing-for-hire and desktop individuality.
We've been in a centralised phase for longer than usual - first cloud everything, then AI - but at some point in the next decade prices will crash and a market will appear for personal, local intelligence.
By comparison, the Colossus 1 data center had 32,000 GB200s (as well as 150,000 H100 GPUs, 50,000 H200 GPUs), and they are bringing another 110,000 GB200s online (although this might be Colossus 2?)
There are bigger data centers than Colossus 1 around too.
There is a reason NVidia is the most valuable company on the planet.
https://en.wikipedia.org/wiki/Colossus_(supercomputer)#Curre...
> the difference is pretty categorical between what can run on a single card and what can run on a DGX GB200 NVL72 cabinet.
A better way of putting it is that you can run plenty of things on a single ordinary system, but you may be disappointed at the performance. Generally, you can't expect inference to be as quick as with cloud for SOTA-like models. You have to run smaller models for quick replies, and large models with a lot of real-world knowledge for less time-critical inference, possibly batching many requests simultaneously to improve throughput.
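A minimal sketch of the "batch many requests at once" idea, assuming the local server can actually process several requests concurrently (most serving stacks with continuous batching can, but that is an assumption about your setup). `solve` is a hypothetical stand-in for one blocking call to the server, not a real client API.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def solve(question: str) -> str:
    """Hypothetical stand-in for one blocking request to a local inference server."""
    time.sleep(0.05)  # pretend the server takes a while to answer
    return f"answer to {question}"


def solve_batch(questions: list[str], max_in_flight: int = 8) -> list[str]:
    """Keep up to `max_in_flight` requests open at once; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(solve, questions))


questions = [f"q{i}" for i in range(16)]
start = time.perf_counter()
answers = solve_batch(questions)
print(len(answers), f"{time.perf_counter() - start:.2f}s")
```

Each individual answer is no faster this way. What improves is questions per hour, which is the number that matters for a job like grinding through a thousand exam problems overnight.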
The cost is obviously not as big a factor for OP as it might be for others. It's actually refreshing to hear the candid viewpoint that he expresses here.
$25k is definitely a lot, but I did the risk analysis and figured that worst case I would lose $1,000-2,000 after a year of playing around with it, so I look at it more like renting (I'm going to keep the Macbook Pro no matter what since I needed a new one).
Nitpicking, but the worst case of spending $25k is unforeseen circumstances that write off the entire asset. I don’t think -$2000 is a conservative enough figure for standard depreciation either (a lot can happen in a year)
Either I don't understand the used apple market.. or I agree this is crazy. Someone spends $25k on new hardware, waits a year, and expects to sell it for $23k? Unless the ram issues save him, and cost of new goes up, I don't see how that was going to work.
This is the case in lots of markets, e.g. look at used cars, luxury goods and more. Some of it is driven by inflation/the rapid devaluation of the dollar. General and AI-adjacent compute in particular hasn't come down in price in a long while.
Well, Apple is literally not offering the M3 512GB studios currently. You can’t even back order one.
They are selling on EBay for over $20k, used.
It’s hard to know if any of these eBay listings are real or actual sales. Lots of scams.
The ones sold for $25k from established sellers are legit. Filter by "sold."
The 0-reputation account in Spain selling an M3U 512GB for $4200 is 100% fraud.
oh young grasshopper, I see you don't know that money launderers love the eBay hype cycle. It's REALLY common on high dollar hot items to have phantom transactions where the same parties are on both sides of the transaction to clean illicit money. The high price tag and high volume of transactions hide the illicit signal. I have tried to buy a few of these Mac Studios only to have the transaction cancelled because I wasn't the dirty money on the other side.
Bizarre. They don't care that eBay takes 14%?
14% seems like a pretty low fee to clean drug money, if we're being honest
Traditional money laundering loses upwards of 40%, so hell yeah.
What am I going to do with 40 subscriptions to Vibe?
Funny, they have the "honesty" to cancel the transaction and not take your money, just to keep their ebay reputation high ?
eBay and PayPal almost always side with the buyer.
The transaction was cancelled? Sounds like you weren't defrauded.
GP didn’t say they were defrauded. They said the listing was a cover for laundering money.
"weren't scammed" might have been a better choice of words, which they said one post up.
Ah, I see what you mean now.
You could almost sell a RTX 3090 for more today than what it cost brand new when it came out six years ago
It's still very contrarian to expect GPUs won't depreciate rapidly. Yes 3090s were a good investment then, but way worse than just buying Nvidia stock directly
Waiting for them to come down any day now. Been waiting since 2017.
First it was crypto, now AI. Just because the market can stay irrational for very long doesn't mean crashes don't happen. What nobody knows is when.
at some point you have to accept that the market is actually rational
Yup, just like crypto and tulips were rational behaviors for the global economy. Or investing 5% of world GDP in half-baked AI is.
wanting to get rich quick is completely rational if you ask me ;)
Only if they do get rich, if they lose money, it's stupid. Technically stupidity is also rational? The bottom of rationality? :-))
How did we go from "I expect to lose only 1000-2000 if I try to sell my used equipment" to "you should have just bought NVDA to get a better return." The point wasn't the better return, the point is that I wouldn't lose all my initial investment if i decided I wanted to sell it.
And the fact of the matter is that in 2026, all electronics has gone up, not down, and sought-after GPUs have gone up in price in the used market.
A lot of expensive things hold their value well. I have a friend who is really into telescopes and he now owns a $100,000 telescope but he didn't directly buy such an expensive scope. He started out with much cheaper ones and was able to sell them for about what he bought them for to help fund more expensive ones over 20 years. It is really interesting.
Thank god. I’ve been waiting to have a reason to tell someone I collect fountain pens.
Computers depreciate because they are obviously being supplanted by newer better models—until they become vintage and then move into collectibles.
Apple products have had relatively high resale for a while. Only losing 8% in a year is probably extra unusual, and 1-year-old wasn't really ever the sweet spot, but a "sell used privately after a few years, roll onto the new one" has been a relatively common play.
Doing this particular one is definitely expecting the market squeeze to continue. "Worst case" is back to more "normal" depreciation. Where I'd expect to only be able to recoup more like 18k. But... if you look at GPU prices the last 3 years... it's not a crazy assumption that it won't drop that fast.
iPhone example, since those are easiest to find in quantity: a new iPhone 16 Pro Max was $1200, and Gazelle wants $866 for one in "excellent" condition, so it lost ~28% as a one-model-back phone. The iPhone 15 Pro Max in excellent condition is priced at $667 here, only down another 23%, which gives you basically a half-priced upgrade if you can sell it for that and roll into the newest.
So as a rough estimate at today's value-holding, staying never more than one model old means you'd be out $3600 for three new phones and get $1732 of that back, for a net of $1868 (a $334-per-year incremental cost of upgrade).
For never-more-than-two-models-back you'd be out $2400, getting back $866, for net $1534 spend, with a $167 incremental per-year upgrade cost once you buy the first one. Pretty good if you keep the phone in excellent condition and are happy to budget a bit over $10/month to be on a every-two-year upgrade train.
Well, you'd also eat the tax...
https://buy.gazelle.com/products/iphone-16-pro-max-256gb-unl...
https://buy.gazelle.com/products/iphone-15-pro-max-256gb-unl...
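The percentages and the yearly-upgrade total above can be checked with a few lines. The prices are the ones quoted in this comment. Treating the listed price as what a private seller would actually get is an assumption, and tax is ignored.

```python
def pct_drop(old: float, new: float) -> float:
    """Percentage lost going from `old` to `new`."""
    return (old - new) / old * 100


NEW_PRICE = 1200      # new Pro Max, as quoted
ONE_BACK = 866        # one-model-back, "excellent" condition
TWO_BACK = 667        # two-models-back, "excellent" condition

print(round(pct_drop(NEW_PRICE, ONE_BACK)))   # 28
print(round(pct_drop(ONE_BACK, TWO_BACK)))    # 23

# Yearly upgrade: buy three phones, resell the first two at the one-back price.
spent = 3 * NEW_PRICE
recovered = 2 * ONE_BACK
print(spent, recovered, spent - recovered)    # 3600 1732 1868
print(NEW_PRICE - ONE_BACK)                   # 334 per upgrade
```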
What you describe is something people with enough to make the first purchase and eat the cost when it breaks have been doing for years e.g. with cars. People on the lower end of money scale tend to use products for well over their economic lifetime saving way more and buying a cheap replacement if it breaks. Notable exception as stated being phones for some reason as it likely is a status symbol for more people in a [insert preferred external sexual characteristic] measuring contest.
> "Worst case" is back to more "normal" depreciation.
I would absolutely not count on that, if and when it drops it will drop hard.
> I don’t think -$2000 is a conservative enough figure for standard depreciation either (a lot can happen in a year)
We aren't exactly in "standard" times and haven't been for quite a while. Even five year old graphics cards are worth more today than they were just a year ago. Things will obviously depreciate at some point, but you gotta throw your existing notions of how quickly and how much hardware will depreciate out the window. There's just been too much money dumped into AI for a "well I guess this won't ever pan out, let's dump all this hardware to recoup our costs" moment to happen and tank the price of everything suddenly IMO.
And that's not even getting into the other geopolitical stuff going on right now. Strange times.
I assume he is calculating the loss as depreciation - what they would have spent on cloud bills if they hadn’t been doing this locally.
Aren't things like this seeing 'negative depreciation' these days?
"inflation"
"appreciation"
Sure, they took a gamble that they wouldn't be able to sell it used.
If you are able to tie up $25k for a few years just for shiggles, you clearly are able to make do fine without that money and if lost it would be at worst annoying, not catastrophic.
I wouldn't call this nitpicking. This is how people who are careful with money think. I learned embarrassingly late to stop justifying purchases by making predictions about future returns. I treat everything as having zero value as soon as I purchase it. Thinking otherwise is, for me, always a dangerous rationalization -- always a craving that's trying to outmaneuver sense.
I mean whatever. It's workstation/server class hardware, that's how much it's been for a long time
Was this risk analysis just AI slop?
I think op would make a really good pope too.
https://news.ycombinator.com/item?id=48118672
That hardware is costing him ~1$/hour over 3 years. Presumably having it answer math questions was a tiny fraction of what he was using it for.
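The ~$1/hour figure is just the purchase price spread over every hour of three years, assuming zero resale value and ignoring electricity. A quick sketch:

```python
def hourly_cost(price_usd: float, years: float) -> float:
    """Purchase price amortized over every hour in the period (24/7, no resale)."""
    return price_usd / (years * 365 * 24)


print(f"${hourly_cost(25_000, 3):.2f} per hour")
```

If the machines only run a few hours a day, the effective rate per hour of actual use is correspondingly higher.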
I’ve spent twice that on hosting movies and tv for Plex, so… I think they are worthy of my praise. What a healthy outlet for money.
You spent 50k for plex hosting? Why so expensive?
Half a petabyte of RAID6 is the biggest line item, then the redundant 40gb networking and compute follow closely. I have a lot… too much even?
That’s a lot of blurays…
I found them dumped out in the “street” in a place ordained by law as public domain. So I just grabbed up the media and use it in private.
How many Blurays are we talking about?
500 TB / 25 GB = 20,000
If some have more than one layer it could be fewer, but that's the order of magnitude.
If each Bluray is 2 hours long, that's 4.5 years of nonstop watching.
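Same estimate as a sketch, so the disc size can be swapped out. It assumes decimal units (1 TB = 1000 GB), a fully used array, and 2 hours per disc as above.

```python
def disc_count(array_tb: float, disc_gb: float) -> float:
    """How many discs of `disc_gb` fit in `array_tb` (decimal units)."""
    return array_tb * 1000 / disc_gb


def years_of_watching(discs: float, hours_per_disc: float = 2.0) -> float:
    """Nonstop viewing time in years."""
    return discs * hours_per_disc / (24 * 365)


discs = disc_count(500, 25)
print(f"{discs:.0f} discs, {years_of_watching(discs):.1f} years nonstop")
```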
Just parallelize watching.
Can't reply to the other poster, but I have 4K HDR Blu-ray copies from discs I found in the street too, which are more in the 60GB ballpark.
What kind of "streets" are you guys hanging around in?
Does one really need 40Gb networking to stream Blu-rays?
I am (clearly) not as far down the rabbit hole as the commenter you're replying to, but almost certainly not. Streaming a 4K Blu-ray is on the order of ~100Mb/s, which means that on a LAN, bog-standard gigabit ethernet and associated networking hardware would be more than sufficient.
This is taking a hobby to its extremes, in much the same way that a $5k boat and $500k boat let you catch the same fish.
It’s about being able to rapidly move files between the arrays and future proofing.
You are totally right, it’s mostly just for backups and transfers to rebalance data.
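To see why the fat link matters for rebalancing but not for streaming, a back-of-the-envelope sketch. It uses raw line rate only and ignores protocol overhead and disk throughput, which in practice will often be the real bottleneck.

```python
def transfer_hours(terabytes: float, link_gbps: float) -> float:
    """Hours to move `terabytes` over a link at `link_gbps`, at full line rate."""
    bits = terabytes * 1e12 * 8
    return bits / (link_gbps * 1e9) / 3600


def concurrent_streams(link_mbps: float, stream_mbps: float = 100.0) -> int:
    """How many ~100 Mb/s streams fit on a link."""
    return int(link_mbps // stream_mbps)


print(f"500 TB over 1 Gb/s:  {transfer_hours(500, 1) / 24:.1f} days")
print(f"500 TB over 40 Gb/s: {transfer_hours(500, 40):.1f} hours")
print(f"4K streams on gigabit: {concurrent_streams(1000)}")
```

Moving the whole half petabyte takes weeks on gigabit and about a day at 40Gb, while a household's worth of 4K streams fits on gigabit with room to spare.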
Damn, at this scale this is just art.
One year ago finetuned local LLMs had a significant edge over ChatGPT or Claude. Look up in YouTube all the DIY videos testing LLMs on their own machines with different setups.
Remember: one year turned out to be a gigantic leap in quality of results and innovation in the AI space. Agents weren't really a thing, and vibe coding hadn't even been coined as a term, because the top-notch tools at the time were lousy, with Lovable being the frontrunner with its (in my view) sorry Tailwind recombination tool shaming AI to do the work.
Then fall 2025 hit us, then New Year's Eve, and suddenly there was a massive surge of innovation and competition, with ChatGPT Codex suddenly showing up.
Remember: one year ago many now commonly used tools weren't yet available like Nano Banana or Codex.
"The 25k are so vast" - Yes, and no. For example, if the machine is bought for business usage I can deduct the costs from taxes. This amounts to roughly 50% of the financial burden.
So I jokingly say that I pay only half the price for my Apple business machines. And yes, I am strict in this regard. Business means business. No private emails etc., nothing on my company computers.
Maybe there are other options as well to reduce the financial expenses the dude mentions, but it doesn't seem so.
I would also go for leasing; that way the monthly payments can be deducted right away, and I don't need to buy and maybe resell the machine.
Apple is a luxury good. Without business usage or at least partly using it for business as well as private (mixed usage in tax reports) I wouldn't buy the devices or think twice.
Apple under Cook evolved into a Gucci like luxury brand, that is more and more a rip off than quality delivered, especially considering the latest OS updates for Mac, iOS and iPad. Apple is a mess, following Microsoft Windows' footsteps happily, because the CEO is as has been correctly assessed, no product guy.
But I stop with my rant here.
Always try to use tax deduction as leverage for your computer expenses. Every citizen should invest in basic knowledge about that.
Even a 10-20% professional usage for work (mixed usage) gives you a noticeable advantage over normal pay.
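A deliberately simplified sketch of that arithmetic. The 50% combined marginal rate is an assumption chosen to match the "half the price" claim above; real rules on depreciation schedules, mixed use and VAT differ by country, so this is illustration and not tax advice.

```python
def effective_cost(price: float, marginal_rate: float, business_share: float = 1.0) -> float:
    """Out-of-pocket cost after deducting the business-use share of the price."""
    tax_saved = price * business_share * marginal_rate
    return price - tax_saved


ASSUMED_RATE = 0.5  # assumption: combined marginal rate implied by "roughly 50%"

print(effective_cost(25_000, ASSUMED_RATE))                       # full business use
print(effective_cost(25_000, ASSUMED_RATE, business_share=0.2))   # 20% mixed use
```

With full business use the $25k outlay drops to $12.5k under that assumption, while 20% mixed use only knocks off $2.5k.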
I didn't spend that much, only $6500 AUD for a GB10-based Asus GX10, which is even slower than OP's, but I spent that because it makes for a great learning platform. There's not much else that lets me fiddle with 128GB of RAM for my graphics processor, and it's quite lovely to be able to run things as long as I like without worrying about my cloud instance being shut down.
It's not financially a good idea: renting really does beat owning, and cloud beats both if you're only running inference on these machines. But I'm not just doing inference, and as a thing I can do silly stuff on to learn, it's hard to beat!
When you say you are not just doing inference, you mean you are also training your own llms? I am curious what other things can be done.
Fine tuning, and yeah training my own, experimenting with architectures and learning how it all works. Been a lot of fun
$6500 AUD can get you a good chunk of B200 time on any of the GPU neoclouds :)
Less than I expected, though! And I get to run this all through the night
I do still use Vast and Runpod for things too, but it’s much nicer to test a fine tuning run here to make sure I’m in the ballpark
I also did literally say “It's not financially a good idea, renting is better than owning” so I’m confused why I have two people telling me that
Also it’s just far more fun to play with something tangible to me :)
You could just rent a bare metal server with those specs
Yes I could, but that is annoying because of spot pricing and having my instance shut down, and it has fluctuating prices
It’s also annoying because then I need to make sure my little “lab” setup is well automated, and I’m lazy :)
Also, I literally said “ It's not financially a good idea” so I’m confused why you think I don’t know that.
Spot pricing and instance availability don’t apply to on metal hosting. You’d have your own machine dedicated to your own use only, at a locked in price.
> renting really does beat owning, and cloud beats both
Because buying Macs is not about performance, it's about feeling like you are rich.
That money could have been spent on way more bang/buck performance in the form of a set of 4 graphics cards.
Also I would probably put the odds 70:30 that Apple marketing is astroturfing on HN from the amount of posts about running llms on Macbooks, because in reality, the inference speed of any decent llm is unusable on a Macbook despite the ability to fit it into RAM.
Or it could have had way more bang/buck by feeding a family of real brains for a year or two
Excuse me for this comment, really, but I can't comprehend the absurdity, some people are buying GPUs when other people have no money for insulin so they literally die. I don't mean anything towards op or gp, quite the opposite I'm truly happy they have this kind of freedom, it must feel really nice, I just hate this game so much.
40-80 tok/s is unusable to you? Ok.
If you like having a box with 8-12 fans blasting hot air and noise into your office all day, nobody's stopping you.
This is making me feel a lot better about my plan to lease a $25k EV simply because it's available at a massive discount. I'll probably end up using less electricity, too.
How do you use the RTX 6000 with the Macs? Exo? I would think that would be pretty snappy if configured properly.
This is on a separate Windows PC, I don't have it integrated with the Macs.
If you don't need cash right away, I'd wait until the M5 Ultra comes out and see how things shape up. There have been some early efforts aimed at combining the prefill performance of a GPU with the high throughput achievable with the Mac's unified memory architecture (see various YouTube videos by Ziskind and others, as well as https://old.reddit.com/r/LocalLLM/comments/1r6drpi/exo_clust... ).
Point being, once the M5 Ultra is available, I suspect a lot of people will get very serious about making Macs work with RTX GPUs because that will yield an inference platform with a good bang:buck ratio. If so, you may find that your existing hardware is more powerful than it seems today. And it may be a lot more expensive to replace later if you sell it now.
All of these have appreciated in value. How much are you looking for the Ultra?
I've seen a lot of sales on eBay for over $20k, but I don't know if I believe it. Plus the lack of seller protection and the prevalence of scams on eBay make me too hesitant to actually want to risk it so I don't know what to do haha
Haha, yeah, it's about $23k or so. Should be twice what you paid for it if you got it last year. Tbh I don't know why. The RAM is large but the bandwidth and the compute aren't nearly enough. You can fit DeepSeek V3 on it quantized but inference is like 10 tok/s. Honestly, you'll be able to sell it locally for that in cash, and I would in your place.
I saw your heat comments about the RTX 6000 Pro as well. I bought a few of them recently and I'm running 2 of them in a 2U case in a colo. You need a lot of active airflow to keep them cool. Mine range from 23 C to 80 C.
Which of these has been the most productive for you? Sounds like you've enjoyed the RTX6000 the most?
The RTX 6000 is somewhat obviously my fastest card, but my biggest problem with the RTX 6000 is the immense heat. The GPU itself is almost 200F and the exhaust from the fans is over 150F. I'm worried that my hard drives are going to fail. I was told that the GDDR7 is even hotter than the GPU, which is surprising to me.
After my last run, I'm going to wait for the new case I ordered to come in and cannibalize my kid's PC that we built beginning of this year to form an entirely separate computer. And then figure out better ways to deal with the heat, especially with summer coming up. I'll have to play around with undervolting and running vents directly outside my house to see if that helps.
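Since the rest of the thread quotes temperatures in Celsius, here are those Fahrenheit readings converted:

```python
def f_to_c(fahrenheit: float) -> float:
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (fahrenheit - 32) * 5 / 9


for label, temp_f in [("GPU", 200), ("exhaust", 150)]:
    print(f"{label}: {temp_f}F = {f_to_c(temp_f):.0f}C")
```

So "almost 200F" is a bit over 90C on the die and the exhaust is in the mid 60s C.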
From my failed and expensive affair with GPU mining 5 years ago: you can get great heat dissipation by using an open case with a lot of directed fans, at the expense of a bit of dust and lots of noise.
I take it this wasn't the half-wattage Max Q version with blower fan?
Since you are not running realtime 3d grafix, could you put the card in an external chassis so the heat is not in the same box as the SSDs?
That's about what my OC'd and watercooled 4090 runs at. The cards are designed for it. Only problem I have is when sitting next to the computer under load -- I either have to open windows or blast the AC. Too bad I don't live in a cold climate -- that 60c heat output would come in handy :)
> Too bad I don't live in a cold climate -- that 60c heat output would come in handy :)
Used to overclock back in the day during winter with an intake duct rigged to suck in outside air, best thing about -30c :)
I've always thought about doing something like this in the Midwest US, but was always a bit nervous about condensation damaging the components over time; did you run with that sort of setup consistently, or only when pushing high scores? Ever run into issues with components failing?
That was 25 odd years ago, less sensitive hardware and cheaper... Nothing that failed though, did have some sketchy moments with condensation yeah :)
Not consistently, I did start using petroleum jelly till I upgraded and found out that wasn't very fun to clean up.
You'll probably make a profit by selling them today. I bought a M1 Max Studio with 64 GB last year off FB Marketplace for $1000 and today I'm seeing numerous 32 GB M1 Maxes for $1200-1500.
Yes the prices on eBay for the Mac Studio are all over the place, but I've seen sales for over $20k. I don't know if I believe it but there's enough to make me think if I can sell it for that price it would be worth it, but eBay has basically no seller protection so I'm not willing to take that chance.
I looked into the M3 Ultra 512GB Mac Studio before it was discontinued and, as best as I could determine, it just wasn't worth it... yet. The GFLOPS and memory bandwidth just aren't there even though it can hold a much larger model in memory.
But the trend here is interesting. I think by 2030 you'll be able to buy fairly cheap hardware that is currently $10k+. I don't know what this does to the trillions invested in AI data centers, because the next NVidia architecture after Blackwell will essentially halve the value of purchased cards overnight.
I'm not convinced Apple has yet pivoted the Mac Studio line towards this market and the expected M5 Ultras in Q3 2026 will likely be an incremental improvement rather than big leap forward but I'd like to be proven wrong.
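On the memory bandwidth point, a common rule of thumb is that single-stream decoding can't go faster than bandwidth divided by the bytes of weights read per token. A sketch of that ceiling follows. The 819 GB/s figure for the M3 Ultra and the ~37B active parameters for DeepSeek V3 are the published numbers as I recall them, and the 4-bit quantization is an assumption.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_params_b: float, bits_per_weight: float) -> float:
    """Upper bound on tokens/s when every active weight is read once per token.

    Ignores KV-cache reads, compute limits, and prefill entirely,
    so real throughput will be lower.
    """
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# Assumed inputs: M3 Ultra ~819 GB/s, DeepSeek V3 ~37B active params, 4-bit weights.
print(f"{decode_ceiling_tok_s(819, 37, 4):.0f} tok/s ceiling")
```

That ceiling is an idealized best case. Reported real-world numbers sit well below it, and it says nothing about prompt processing, which is limited by compute more than bandwidth.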
I agree that all these datacenter companies like Coreweave are investing billions in technology that has a very fast depreciation curve, and I don't know how they will sustain income. The same goes for datacenters in space: what happens when those chips are obsolete? Will they send astronauts to replace them, or will they let them burn up and send new ones into orbit every year?
I feel that the open weight models pale in comparison to the frontier models, and I believe that if the gap closes quickly, that the open weight vendors will stop releasing it for free.
Data centers in space aren’t realistic.
Higher radiation, space insulations, etc.
Underwater data centers provide a lot of the same benefits and can (much more) easily be hauled to the surface
I'll buy your macbook if you're trying to get rid of it!
I'm keeping that one for sure, I love it!
I'm not really asking this from the perspective of whether I should buy hardware. I'm trying to understand the economics.
The AI space is moving so fast that it is hard to know which conclusions are stable. After all the discussion around local models, is the practical conclusion still that API/frontier providers have a huge structural advantage because of datacenter hardware, high utilization, batching, optimized inference stacks, and perhaps strategic pricing?
In a comparison like this, a $25k local setup versus buying tokens, what multiple are we really talking about? 10x? 100x? Or is it too workload-dependent to reduce to a single number?
Has someone written a good breakdown that separates true infrastructure efficiency from temporary underpricing/subsidy? The part I'm trying to understand is less ideological (local vs. cloud) and more basic economics.
The speed of results for an API call to ChatGPT is 10-100x faster than my local LLM. I haven't exactly quantified the results but I was getting results in a few seconds vs 10+ minutes for my local LLM. I'm going to do a deep dive this weekend and try to get better results, but it was staggering. I'll also do a deep dive on how to optimize my setup and see if I can get things to perform much quicker.
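For the quantifying part, a minimal timing harness would do: pass in whatever function sends one prompt to a backend and compare the summaries. `fake_backend` here is a placeholder, not a real client, and the 5 s figure in the ratio is an assumed stand-in for "a few seconds".

```python
import statistics
import time
from typing import Callable


def time_calls(send: Callable[[str], object], prompts: list[str]) -> dict:
    """Wall-clock latency of `send(prompt)` for each prompt, summarized."""
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        send(prompt)
        latencies.append(time.perf_counter() - start)
    return {
        "n": len(latencies),
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
    }


def fake_backend(prompt: str) -> str:
    """Placeholder for a real API or local-server call."""
    time.sleep(0.01)
    return "ok"


print(time_calls(fake_backend, ["q1", "q2", "q3"]))

# Ratio implied by the comment: 10 minutes locally vs an assumed ~5 s via the API.
print(f"{10 * 60 / 5:.0f}x")
```

Running the same prompt set through both backends gives a like-for-like number instead of an impression, and the median is less sensitive to one stuck request than the mean.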
Well if it makes you feel better those frontier LLMs are all technically taking a big loss, and they may all be in your shoes after a few years.
If you are in the bay area, i'm happy to buy that M3 Ultra from you, i've been unsuccessfully looking for one and can't find any.
Running LLMs on Macs is still terribly slow. They simply lack the optimizations other platforms have.
An RTX 6000 pro Blackwell is a pretty good card
An M3 Ultra Mac Studio can run models that do not fit in similarly priced computers with multiple Nvidia GPUs. And it will use a lot less electricity while still having good enough performance, except for prefill performance, which is quite poor on the M3.
M5 pro 48GB should be good and future proof
If you buy a Mac, get at least 256GB of RAM; otherwise just buy a bunch of Nvidia cards. It really does not make sense otherwise if you are looking for performance / $. The Mac (Studio) is unique in that it has more RAM than the alternatives (i.e. consumer Nvidia cards or Spark stuff), so it can fit bigger models, but otherwise its performance is worse.
>> find a reliable place to sell it without getting ripped off by scammers.
This is a real problem and why I've just about given up on ebay or fb marketplace, esp for computers. If you are in Canada though sellit9.com is a great solution to having to deal with sketchy buyers.
No harm in listing it for $20k, and if it sells, that's an easy $5-10k for you.
If you're in a decent sized city, you should be able to find a local buyer on Craigslist or FB Marketplace... Beyond that, for higher value, smaller items like your M3 Ultra, I would talk to your local police department and/or library to see if you can do the exchange there. Larger libraries usually have a police officer on site or nearby, and the PD office near you may also provide a "safe" exchange location... I'd bring a monitor/keyboard/mouse so you can demonstrate the system working properly.
YMMV but between your nearest PD office and Library, you should be able to use one or the other for your exchange of goods/money. The biggest thing I've sold is a mid-range video card during late covid (I managed to get a better one via newegg shuffle) so I sold the old one (RX 5700XT -> RTX 2080) to make up the difference a bit. I just did the exchange at the Starbucks near me for that.
Something is very wrong in some countries if you have to get police protection to sell a f* computer. I get it’s on the expensive side but still….
You don't have to... but it's a matter of a safe location for both parties. If it was more expensive, I'd probably work through a broker (like a car or house).
The buyer doesn't know who the seller is, and vice-versa... the level of trust you can bear depends on how much you're willing to lose. My advice is only in that there are safe venues you can use to make such an exchange.
See e.g. https://www.murphytx.org/843/Safe-Exchange https://www.ottawapolice.ca/en/community-safety-and-crime-pr...
Police "safe trade zones" are basically a parking space outside a police station, with a sign.
Not really. Every country has a nonzero number of criminals. It's entirely a matter of the risk/reward tradeoff. A small consumer item over $10k is well into dangerous territory.
Are we talking about a cash transaction? If so >$10k is dangerous as the police may want to steal it themselves.
If it is an electronic payment, I'm not sure how completing the transaction in front of a police station will help any. Well, it will help the buyer to see it working, but the seller gets no additional protection besides seeing "a person."
Why would the seller worry about ... himself? The seller worries that the buyer might not have any intention to pay for the small, expensive, easy to fence item to begin with. Conversely, the buyer worries that the seller might not have brought an item at all.
> If it is an electronic payment, I'm not sure how completing the transaction in front of a police station will help any.
That's not the point of going to the "police safe exchange zone".
The point is to hopefully prevent the possibility of the buyer showing up with a .38 in hand, and demanding to be given the easy to fence "item" unless the seller wants to get a .38 slug embedded in their gut.
The risk of a "hold up" increases with dollar value and with items that are easier to fence.
I love that it never occurred to you that the "buyer" could just steal the item. Be safe out there.
Ask a question, get a condescending answer.
I got an RTX 6000 pro too. I like running locally, I've learned a lot more than if I had used an API and there's less worry about overspending tokens. I accidentally spent $100 on claude api in like 2 days because I didn't know what I was doing.
The problem is that while one of these GPUs is a huge improvement over a laptop or a single 3090, you very quickly wish you had more. I would buy a second one, but I did the math and realized that with the current crop of models, 2 Blackwells don't buy me any new capability that I didn't have with one. So I would need a 3rd one. And when I buy a 3rd one I will feel like I want to run a higher quant, so then I will want a 4th.
A pair of RTX6000 cards will give you a good performance boost due to tensor parallelism, though. I haven't tried the newest predictive quants but I see about 35 tps when running the 8-bit Qwen 3.6 27B model on one board and about 50 tps on two. Probably could come close to 100 tps on an optimized setup with the latest GGUFs.
Also, the 4-bit quants of MiniMax 2.7 will run at 100 tps or so with two cards, which is pretty decent. It doesn't go any faster at all with 4 GPUs from what I've seen, so if you don't actively need 384 GB of VRAM, 2x RTX6000 is a good place to be.
You can get 70-80 tps on qwen3.6-27b f16 with MTP on a single card
You can fit Deepseek 4 Flash on two with TP 2 and 6 different streams at 65k context. 150 tok/s
What kind of machine did you build around it ?
> I figured worst case scenario I can sell them in the next year and only take a haircut as opposed to losing my entire investment.
It's going to be a non-trivial haircut. This stuff depreciates pretty fast.
Bizarrely, I bought a GPU new in Jun 2024, and there are sold eBay listings saying the used GPU is worth 4% more today.
Of course, this is an unusual state of affairs; I see my GPU purchase as consumption, not investment.
You definitely want to get rid of your M3 Ultra before the M5 Ultra gets officially announced.
Given the global memory shortage, the M5 will be both delayed and restricted to lower RAM tiers. I don't think we will see a 512GB RAM model until 2030.
I have three m3 512gb units and want a fourth to run an exo set up. Like you, I am worried about scammers. Let’s discuss if you still want to sell.
https://calendly.com/ryanwmartin/open-office-hours
This is, sadly, obvious and inevitable in retrospect.
The two major drivers of inference costs are GPUs and electricity. You can't get cheaper GPUs, but you can make existing GPUs not sit idle, and you do that by utilizing them 24/7, processing user B's request when user A is thinking, and handling many requests in parallel, neither of which you can do as an individual. You can get cheaper electricity... by moving, and it's much easier to move your AI workload than to move yourself.
This is a completely different dynamic than renting houses or apartments, as you can't really rent out the same house to different people at different times of day.
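As a rough sketch of the utilization argument above, the amortized cost per million tokens can be written as (hardware + electricity) / tokens produced. Every figure below is a hypothetical placeholder rather than a measurement from this thread, and the function assumes the hardware has no resale value and draws power only while busy.

```python
def cost_per_million_tokens(hw_cost_usd, lifetime_years, power_watts,
                            usd_per_kwh, tokens_per_sec, utilization):
    """Amortized hardware plus electricity cost per 1M generated tokens.

    Simplifying assumptions: the hardware is worth nothing at the end of
    its lifetime, and it draws power only while it is busy.
    """
    busy_hours = lifetime_years * 365 * 24 * utilization
    tokens = tokens_per_sec * busy_hours * 3600
    energy_usd = power_watts / 1000 * busy_hours * usd_per_kwh
    return (hw_cost_usd + energy_usd) / tokens * 1_000_000


# Hypothetical individual: one user, GPU busy 5% of the time, no batching.
solo = cost_per_million_tokens(10_000, 3, 600, 0.15, 50, 0.05)

# Hypothetical provider: same card, batching raises aggregate throughput,
# busy 80% of the time, cheaper electricity.
shared = cost_per_million_tokens(10_000, 3, 600, 0.05, 1_000, 0.80)

print(f"solo:   ${solo:.2f} per 1M tokens")
print(f"shared: ${shared:.2f} per 1M tokens")
```

The hardware term dominates at low utilization, which is why idle time hurts an individual owner far more than the electricity bill does.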
Yea. LLM inference requires batch processing to have a shred of hope at being cost efficient. Batch processing requires a not so insignificant amount of scale (but probably not as much as people think).
I'm very pro local models, but not to have parity with SoTA frontier models. Just contextually trained small models doing smaller specific tasks.
Trying to run bigger LLMs for an individual user to do big tasks is not going to be a good time.
Wasn't this pretty evident to pretty much anyone who knew even a bit about inferencing?
Idk what people were thinking. I’ve never seen anyone offer a plausible way to sidestep batch processing for example.
Historically it was not uncommon for beds to be rented out to multiple people.
See military submarines, for a modern version.
It is unfortunately still common practice among irregular agricultural workers in many parts of the world (I’m Italian so I definitely remember news about busts in southern Italy)
The word for this type of boarding is “flophouse.”
This is the type of place one might be “waiting for the other shoe to drop.” Which carries a variety of potential meanings in this moment of AI.
Tangentially related: Mack and the boys lived in the “Palace Flophouse and Grill” in Cannery Row.
I suppose I must have looked up flophouse when reading all the Steinbeck I could get my hands on and it’s stuck w me.
Yeah there are good accounts of this in Down and Out in Paris and London and also one of Hemingway's books - forgot which one.
You can definitely run many requests in parallel as a single user, you just have to be OK with a significant slowdown for any single request. Cloud inference can't reach that ratio of total throughput per hardware cost since they are heavily incented to get the most expensive hardware available and to then minimize latency (and RAM occupation over time) even at the cost of throughput. Running slower inference with cheaper hardware is just not workable in a cloud setting.
It also doesn't help that they probably sell tokens below cost.
On top of that, AI providers are also eating a big loss on the service.
Are they? I only ever see unsubstantiated claims for this whereas I see many justifications that inference is comfortably profitable in isolation.
It's basic math: go calculate max sessions for a certain tps on any hardware. Session# * tps * 86400 (secs in a day) * 30 days.
You'll realize real quick it's not profitable. You can't just say things you don't like to hear are unsubstantiated without verifying.
Not to mention subscriptions: $2mm in GPUs being given out for 5 hrs a day at a cost of $200 a month.
I could easily say that everyone who says it's profitable is making unsubstantiated claims lol.
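The formula above can be sketched in a few lines. This is only an upper bound under the commenter's framing: it assumes 100% utilization around the clock and counts output tokens only. The session count, tps, and price below are hypothetical examples, not figures for any real provider.

```python
SECONDS_PER_DAY = 86_400


def max_monthly_tokens(sessions, tps_per_session, days=30):
    """Session# * tps * 86400 * days: tokens sold if never idle."""
    return sessions * tps_per_session * SECONDS_PER_DAY * days


def max_monthly_revenue(sessions, tps_per_session, usd_per_million, days=30):
    """Upper bound on revenue at a flat price per 1M output tokens."""
    tokens = max_monthly_tokens(sessions, tps_per_session, days)
    return tokens / 1_000_000 * usd_per_million


# Hypothetical node: 64 concurrent sessions at 30 tok/s each, $3 per 1M tokens.
tokens = max_monthly_tokens(64, 30)
revenue = max_monthly_revenue(64, 30, 3.00)
print(f"{tokens:,} tokens/month, at most ${revenue:,.2f}/month")
```

Comparing that ceiling against the monthly cost of the hardware is the comparison being argued about; the result is only as good as the session count and tps you plug in.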
You got numbers? Because it seems perfectly possible to me. OpenAI and Anthropic’s marginal cost for inference is certainly far less than their API pricing.
How can you say that with such certainty? You have no idea what it costs to run a 10T parameter model at extremely high concurrency.
These 1T param models running at <$3.00 per 1M tokens are certainly not profitable.
Because I’ve looked at what it would cost my company to self-host a SOTA sized model. For us it wasn’t worth it because the hardware is all bought up by frontier labs and we can’t get any supply. But if we could, at the prices they’re paying, it would pay for itself in 10-ish months. I assume further that they have economies of scale on top of what I was estimating.
See: https://www.wheresyoured.at/ He's been "numbering" for quite a while now.
Everything there is extremely speculative and I don't see anything that contradicts that inference itself could be profitable at massive scale. See https://youtu.be/xmkSf5IS-zw for example.
Whether the companies as a whole are destined to be profitable, or worth their valuations, is a very different question. The only people who can truly answer that have time machines.
> It's basic math
Yes, once you have modeled the problem correctly and you know all the input parameters. This is not that: Session# * tps * 86400 (secs in a day) * 30 days.
I don't think there is enough public information to check Anthropic's claims regarding inference profitability. It depends not just on unknown technical factors but also on agreements they have with other companies.
I agree that we don't know how expensive SOTA is. But yes, my math should give you the max amount of tokens you can sell per month, and it's not remotely profitable for most of the larger open source models (at their current pricing). I'm not sure why a 10x larger model that is more in demand would be profitable when it's only 5x the price.
It's possible you could pay off hardware for Kimi 2.6 after maybe 2-3 yrs (by providing low tps / high concurrency), but you're now out of warranty and have been running your machines full throttle 24/7 for 2-3 years.
This is why Moonshot attempted to double the price when they released 2.6, but then it got driven down by North American capital subsidies.
We should specify which subscription plan we are talking about. You seem to be talking about the Anthropic Claude Max plan. I think it's consensus that these flat rate type of subscriptions are loss leaders, as they come with restrictions how you can use the API via T&C, namely only with Claude Code et al. They are meant to hook developers into their products.
Shouldn't we compare the API pricing, where we pay per token? The whole point of local inference is that we don't have any restrictions regarding product use or time limits, so it would only be fair if we compare it to a plan that offers the same. And even that is only a first approximation, because the commercial models are usually much more capable than the open weight models.
> I could easily say that everyone who says it's profitable is making unsubstantiated claims lol.
And people who don't understand the difference between capex and opex are making uneducated claims. It's not basic math.
Running an inference data center is a mix of variable and fixed costs. The fixed costs are currently in the billions of dollars for pretty much any investment in this space. Many of those fixed costs have (currently) unknown refresh cycles. So, unless you have access to the financial books of these companies, it's currently just speculation whether inference is profitable.
To some degree I think there's a hope that it becomes like a gym membership. If everybody used their membership, the gym would be too crowded. It's all of those memberships that people feel like they need to have but don't use where the extra profit comes in.
As long as the power users are paying per token, everything is good.
Really? This is what we expect from this amazing world changing technology? People will sign up for it and not use it? Good business plan, how can I invest? /s
Just speculating on the math.
SpaceX has disclosed that they're losing $2bn a quarter on AI, and rising, in their IPO documents.
Anthropic told the Department of War-nee-Defence that they'd made $5bn total, which is a lot, LOT less than what they're spending.
We'll see what's in OpenAI's IPO later this year, I guess. I'll be very surprised if they're losing less than $100bn a year.
Especially since their costs might be multi-year investments. It's too early to judge the quality of those investments.
Supposedly Anthropic just reported that they’re operationally profitable. So maybe not?
"operationally" implies that capex (which I would assume includes datacenters, gpus, and r&d) is not in. So the big news is that they can now pay for electricity and sysadmin.
I believe they also excluded stock-based compensation from their calculation, which could easily tip them in the non-profitable direction.
High usage seems to change the economics. The author of the article had a payback period of about 14 months which is excellent by any standards and an order of magnitude better than rent vs buy for a house in most places.
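For readers who want to redo the payback estimate with their own numbers, here is a minimal sketch. The inputs are hypothetical, not the article author's actual figures, and the model ignores setup time, maintenance, and the cost of capital.

```python
import math


def payback_months(hardware_usd, cloud_usd_per_month,
                   power_usd_per_month, resale_usd=0.0):
    """Months until avoided cloud spend covers the net hardware outlay.

    Returns math.inf when running locally saves nothing per month.
    """
    monthly_saving = cloud_usd_per_month - power_usd_per_month
    if monthly_saving <= 0:
        return math.inf
    return (hardware_usd - resale_usd) / monthly_saving


# Hypothetical: $48,000 server replacing $4,000/month of cloud GPU time,
# with $500/month of electricity.
print(round(payback_months(48_000, 4_000, 500), 1))

# Same setup, assuming the hardware could later be resold for $12,000.
print(round(payback_months(48_000, 4_000, 500, resale_usd=12_000), 1))
```

The cloud figure only counts if you would really have rented that many GPU hours, so light usage stretches the payback period quickly.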
> You can't get cheaper GPUs
You absolutely can. OpenAI et al are paying a fortune for GPUs but they are not paying retail prices.
The entire business model of retail is to sell above cost.
I'd buy that Mac studio m3
Better sell it fast before the M5 ones come out.
Given that the tokens are being subsidised by a couple orders of magnitude, would it still be as cost effective long term?
I don't follow this last part. What is the scam they try to run?
A buyer can claim they never received it or that the box was empty, thus receiving a full refund.
For something listed at $25k I would not list on eBay at all. eBay corporate will pocket $3400 in fees and will also dock you local taxes on the $25k.
I’ve had the best luck selling in Craigslist. Every other platform has been sub par.
How are you using the 6000 with a Mac ?
If you run it in the winter the electricity is “free” because it’s replacing a portion of whatever else heats your house.
Yep, the great theoretical promise of local models remains theoretical, no matter how much die-hard engineers want to push it... Who would have thought, right?
I don't think this changes the final conclusion - but have you considered calculating against depreciation -- i.e. figuring out how much your M3 ultra is worth today, and only charging yourself for the delta? In my mind you might even have made money on the hardware.
I administer a simple AI server in the office, which just uses a single RTX 5090 but is able to serve ~80 people throughout the day. I'm impressed by Qwen3.6-27b's capabilities in agentic coding/tasks so far. Devs say it's not much different from Sonnet 4.6 on many tasks (sometimes it even outperformed it), 40-60 tok/sec, up to 260k context. The server cost about $10k with all the bells and whistles.
I spent a lot of time researching/adding/benchmarking many custom modifications to the software stack and its settings to make the server optimally handle the load with just 1 RTX 5090 without losing quality, but it's still not enough, and the wait times in the queue are getting longer. We're at the limits of the hardware, and I'm out of tricks.
The experiment was kind of a success, and the CTO agrees we should scale it. With our own infra, we could run agents 24/7 on everything. Currently, a lot of use cases for the cloud providers are completely blocked by PII/trade secret concerns (our infosec department doesn't buy the "zero retention" promise), plus you don't have to think about billing/budgets/etc. anymore.
Now I can't decide how to scale it. On one hand, I'd like to run larger models. And we have the budget to buy, say, 8xH200. But in many benchmarks, the larger models that do fit in 8xH200 comfortably and can serve many parallel requests with acceptable speed/quality don't seem to outperform Qwen3.6 that much in agentic coding/tasks to justify the price.
So another option is just to buy a bunch of RTX 6000s and scale horizontally instead: run a copy of a midrange LLM like Qwen3.6 on each GPU. It's cheaper and easier to scale/replace, but then we'll run into problems running larger models in the future if we have to, because of no NVLink support (say, if Alibaba & Co. stop releasing ~30b models and/or ~30b models start falling behind 400b+ models considerably)
Does anyone here have experience running large models in a multi-GPU setup with several RTX 6000s in a high-concurrency regime and with large context lengths? (something like Deepseek 4 Flash, Minimax 2.7 etc.)
Wouldn't that be a fairly ideal setup for layer parallelism? That doesn't need the high-performance communication of tensor parallelism, and the high-concurrency regime would make it easy to keep the pipeline full with microbatches. You'd also be able to scale out your KV cache storage since that naturally splits layer-wise.
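A minimal sketch of why microbatches matter for layer (pipeline) parallelism, assuming a simple fill-and-drain schedule where every stage takes the same time per microbatch: with p stages and m microbatches the run takes m + p - 1 time slots, and each stage is busy for m of them. Real schedulers and uneven stage times will deviate from this idealized model.

```python
def pipeline_utilization(stages, microbatches):
    """Fraction of time slots in which a stage is busy, assuming equal
    stage times and a simple fill-and-drain schedule."""
    total_slots = microbatches + stages - 1
    return microbatches / total_slots


def bubble_fraction(stages, microbatches):
    """Idle share caused by filling and draining the pipeline."""
    return 1.0 - pipeline_utilization(stages, microbatches)


# Hypothetical 4-GPU pipeline: one request at a time vs. many in flight.
for m in (1, 4, 16, 64):
    print(f"{m:>3} microbatches: {pipeline_utilization(4, m):.0%} busy")
```

With a single request in flight each of the 4 GPUs sits idle three quarters of the time, while a high-concurrency workload keeps the pipeline mostly full.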
I have a 5090 machine sitting idle that I'm considering turning into a machine for my own small team (3 devs).
Are you willing to share any lessons learned, etc. that I could make use of? We are evaluating paying for a SOTA sub or trying this, and the talk about Qwen3.6-27B makes me want to try deploying this machine.
Sell the machine for $4K, use it to pay for Codex Pro for everyone for a year. Everyone will be significantly more productive and happy.
It's not even a real comparison if they are actually using them for coding.
If you are deploying always running agents (e.g. monitoring logs and services) then sure - a QWEN local server is a good choice. But for coding the cost in productivity of using a lower performing model is way too high.
Anyone who frivolously suggests throwing away possible independence in favor of dependence on a Silicon Valley company is either incredibly naïve or acting in bad faith.
I'll choose not to respond to your personal attack.
But in terms of actually running a dev team: you are free to use Qwen or another quantized local model that can run on an RTX 5090 for coding if it makes you feel more independent. However, you would struggle and spend many, many more hours achieving the same thing, with a lot more debugging time, long delays before it's done, and many more prompts.
It's just not the right approach. I use Qwen and other local models all the time, but for more clearly defined monitoring and classification tasks.
Not necessarily so. I can see how a bid to predict how things will be in 1 year in AI-based coding is likely a losing one. So the idea is to extract the maximum value now, and turn it into profits that would buy you whatever is adequate for the next steps. For comparison, the AI-based coding landscape a year ago, in May 2025, wasn't even close to what we have now, and half the key tools did not exist.
OTOH, as we see, the larger models demonstrate diminishing returns, smaller models demonstrate improvements, and hardware does not show any signs of becoming cheaper, so holding on to existing decent GPUs may, too, be a winning strategy in the longer term.
The 5h quota of Codex Pro on GPT 5.4 Medium lasts me for around an hour and a half, maybe 2 hours. And this is already the "savvy" setup. Enable GPT 5.5 High fast and you will be beached in 30 minutes with active development.
For continuous all-day work you definitely need a higher tier sub level.
I'm actually looking into deploying a GPU at my company because we can not give out our code. Qwen 3.6 looks good
This might be true for the Plus account. For the "Pro" tier ($100-$200/month) the 5h limit is never a problem.
Right, I did swap that. Still, you have to pay that 4k then every year and give out the code. I also assume that prices will go up as no AI company (but NVIDIA -> selling shovels) is currently making any money.
For some projects the giving out the code part might be ok (i use Codex there too) but for the core app at the company I'm working at there is currently a strict no-AI policy. A local GPU solves this.
> Does anyone here have experience running large models in a multi-GPU setup with several RTX 6000s in a high-concurrency regime and with large context lengths? (something like Deepseek 4 Flash, Minimax 2.7 etc.)
For what it's worth, I've been seeing ~100 tps with 4-bit MiniMax 2.7 on two RTX 6000 boards, just running under llama-server without any optimization effort at all. I have no serious long-context experience with that setup, but at 30K context it's still above 90 tps.
If you are happy with Qwen 3.6 27B, I would personally switch the 5090 out for 2x RTX 6000s and keep running 27B. That will give you ~2x your current throughput with a lot more headroom for multiple users. More important, it would buy time to see how things develop over the next few months before you spend a whole lot of money.
Qwen 3.6 27B is fine but it's not in the same ballpark as GLM-5.1 or Kimi K2.6.
If you truly want to scale up, you should get the 8xH200 with NVLink.
> our infosec department doesn't buy the "zero retention" promise
They are wise to be skeptical! It is neither a promise nor zero data retention.
Look at Anthropic's Zero Data Retention policy -- and remember, this is the policy that applies to the exclusively eligible enterprise partners who can even qualify for a ZDR agreement with Anthropic:
> When ZDR is enabled, prompts and model responses generated during Claude Code sessions are processed in real time and not stored by Anthropic after the response is returned, *except where needed to comply with law or combat misuse*.
> Even with ZDR enabled, Anthropic may retain data where required by law or to address Usage Policy violations. If a session is flagged for a policy violation, *Anthropic may retain the associated inputs and outputs for up to 2 years*....
This means that Anthropic is actively inspecting all of your data with machine learning classifiers. When the usage is flagged for whatever reason as violating any aspect of Anthropic's Usage Policy, then they get to keep your data for 2 years, with no apparent limitation on what they can then use it for.
Crucially, you have ZERO guarantees about the sensitivity or specificity of these classifiers. For all anyone knows, Anthropic is silently flagging 75% of queries and retaining the data.
https://code.claude.com/docs/en/zero-data-retention
I think it’s a cost/opportunity tradeoff at best with any agreement, regardless. The rest of the contract may make it difficult to impossible to do anything about it, starting with basic arbitration clauses and ending in a ton of other provisions that can make any legal action futile. I doubt there’s much room to negotiate too.
Given that all labs need to diversify to become profitable, they'll end up competing with their customers, and there's nothing that exposes a business more than having AI offload every job function for every account, every mail, etc.
Assuming this won’t be an issue is naive at best.
I wonder how AWS handles this in Bedrock. Do they use Anthropic's classifiers? Or their own? Or none? Would their data policing be different in Bedrock than in their other services?
Thank you for the insight. This makes me feel confident, the L40S we are about to acquire with 48GB VRAM for engineering application should be useful for agentic coding as well.
> Does anyone here have experience running large models in a multi-GPU setup with several RTX 6000s in a high-concurrency regime and with large context lengths? (something like Deepseek 4 Flash, Minimax 2.7 etc.)
Join the RTX6kPRO tribe!
- https://discord.gg/pYCvaQTf
- https://github.com/local-inference-lab/rtx6kpro
How can a single 5090 serve 80 people? Something doesn't add up here.
Are they using it as an assistant, not running multiple fully automated agent loops?
They don't use the server all at once. In the UI, users typically ask a question, get a response, and continue with their work. In the case of autonomous agentic loops, an agent simply waits its turn until the server is ready to accept the request. Agents don't hammer the server 24/7 every second either, because they either need to be triggered or are busy doing other work, such as compiling or running tests.
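One way to put numbers on "they don't use the server all at once" is Little's law: the average number of requests in the system equals the arrival rate times the average time each request spends being served. The usage pattern below is a hypothetical illustration, not data from this office, and averages say nothing about bursts or queueing delay at peak times.

```python
def average_concurrent_requests(users, requests_per_user_per_hour,
                                seconds_per_request):
    """Little's law: L = arrival_rate * time_in_system."""
    arrivals_per_second = users * requests_per_user_per_hour / 3600
    return arrivals_per_second * seconds_per_request


# Hypothetical: 80 people, 6 requests per person per hour, 30 s per response.
print(average_concurrent_requests(80, 6, 30))
```

Under those assumed numbers the server sees only a handful of requests in flight on average, even with 80 registered users.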
It would be more interesting to know how many simultaneous users this setup can serve. Otherwise I can just say it serves 500 users but not all of them use it at the same time which doesn't communicate the right level of detail.
Depends on TTFT and tokens per second you want.
With parallelism of 16 you can still get around 25 to 30 tokens per user when all 16 channels are running. Not everyone will use the model at the same time but it certainly will be tight, especially for agentic coding. For pure chat applications this should be quite fine.
The problem with wide parallelism with most models is that it blows up your KV cache. There are open models with KV caches lean enough to parallelize inference or even to offload the KV cache itself to disk without immediately running into wearout concerns, but they're quite exceptional.
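To make the KV cache concern concrete, here is a minimal sketch of the usual size estimate for a standard attention model: 2 (keys and values) x attention layers x KV heads x head dimension x bytes per element x tokens x parallel sequences. The model dimensions below are hypothetical, not those of any model named in this thread; for hybrid architectures only the layers that keep a full KV cache should be counted.

```python
GIB = 2**30


def kv_cache_bytes(attention_layers, kv_heads, head_dim,
                   bytes_per_elem, tokens, sequences=1):
    """Bytes needed to hold keys and values for every cached token."""
    per_token = 2 * attention_layers * kv_heads * head_dim * bytes_per_elem
    return per_token * tokens * sequences


# Hypothetical model: 64 attention layers, 8 KV heads, head_dim 128.
one_fp16 = kv_cache_bytes(64, 8, 128, 2, tokens=65_536)
sixteen_fp16 = kv_cache_bytes(64, 8, 128, 2, tokens=65_536, sequences=16)
sixteen_int8 = kv_cache_bytes(64, 8, 128, 1, tokens=65_536, sequences=16)

print(f"1 stream, fp16:   {one_fp16 / GIB:.0f} GiB")
print(f"16 streams, fp16: {sixteen_fp16 / GIB:.0f} GiB")
print(f"16 streams, 8-bit: {sixteen_int8 / GIB:.0f} GiB")
```

Memory grows linearly with the number of parallel streams, so quantizing the cache or counting fewer attention layers changes how many users fit far more than it changes single-stream speed.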
I also call this "bollocks": there is no way this workflow is even 1/10 of what you can get with Codex/Claude Code.
A normal engineer may be running a couple of sessions with every session spawning sub agents left and right.
80 persons or even 10 having this workflow on this setup doesn't work, and this is the standard engineer workflow today.
Subagent swarms are actually great for the local inference scenario because they can share a whole lot of KV cache. You get to raise the compute intensity of decode (i.e. the aggregate tok/s) essentially for free.
> 260k context
with a single 5090?
Yep, Gated DeltaNet in Qwen3.6 requires much less VRAM for the KV cache than previous generations. Plus the KV cache is 8-bit.
is it in llama.cpp?
> don't seem to outperform Qwen3.6 that much in agentic coding/tasks
idk i imagine you'll hit less edges with a larger model just because.. more data
if you think of them as a kind of NN compression, it's ~obvious that the larger model can have more stuff encoded in it and hopefully accessible
i don't use LLMs much right now but using midrange models seems like an unnecessary compromise in most cases, especially since the big open models sound to be rivaling opus and not just sonnet :p
I thought NVLINK didn't matter anymore because of the latest PCI-E speeds. Am I wrong there?
> The mentality shift of renting vs. owning the gpus is huge. When renting, each experiment costs money and I had to ask myself is it worth it. When owning, it feels like not running experiments is costing me money.
I feel like there is some very deep generalizable wisdom buried here.
Also something about subscriptions vs pay-for-usage. I feel the need to use all my weekly tokens or I'm wasting and I bet they would never get this kind of usage out of me if AI ended up being same price per token.
I always buy software/assets/dev tools for my hobbies (like CAD, music production, game dev) instead of paying subscriptions, even if that would very likely be way, way cheaper and would give me access to really cool tools. I don’t want to feel bad not using something and I know that’s the case with a subscription
You should be able to achieve this mentality shift without owning a GPU. You just need to commit some money upfront to cloud GPU spend, in a way that is not feasible to go back on.
That way you get the experimentation-encouraging mentality shift to "If I don't use this, I'm wasting my money", without the cost inefficiencies associated with actually buying an accelerator, discussed by others in this thread: you'll never be able to match the utilisation and thus the cost amortisation of cloud GPUs.
This article appears to lack any reason for "needing" this beast, or any real comparison with alternatives, both of which are required to answer the question posed in the title. It's a summary of how much they spent and some light anecdotal comparison to what they might have spent on cloud services, but clearly they didn't do an exhaustive hunt for value.
The real question is whether or not they could have done whatever it is they did with less hardware. Is there a business idea here that could have been proven on cheaper hardware that could be upgraded as demand increased? Is the expected ROI there based on future earnings?
Absent any indication that this was needed in the first place, I can only conclude that it wasn't worth anything.
At the end of the article, the author has this to say:
> UPDATE: Launch was a success! 400K+ views, and multiple companies reached to use my IP. Read more here[0]
[0]https://rosmine.ai/2026/05/18/fixing-llm-writing-with-distri...
Post-hoc justification. There's no analysis of whether that level of hardware was necessary to launch, only that they did get that hardware and did launch.
Looking at the GPU utilization graph, it certainly seems like the hardware was saturated for many days/weeks on end.
Was it worth it to spend that amount up front, yak shave while building the system, etc. vs. pay for cloud GPUs? Probably not in terms of dollars, when their time is also valued in dollars.
Was it worth it for this person? It seems, unequivocally, yes.
Their more recent post seems to suggest it was worthwhile. https://rosmine.ai/2026/05/18/fixing-llm-writing-with-distri...
Abstract/TLDR: LLMs are notoriously formulaic at writing, overusing certain tokens or phrases. I show that models trained with SFT fail to match the distribution of the training data by using Maximum Mean Discrepancy (MMD), Judge Model Quality (JMQ), and L2 Token Distribution.
Idk if this turns into revenue or some financial metric but even if it does and it was a good outcome for author, it still says nothing of risk. What if he loses his timing opportunity / gets beat to market because he's unnecessarily futzing around with hardware? AI is rapidly advancing and he spent 2 years on this to save what was probably <2 months of faang income. There's multiple other angles I could dissect this from a risk perspective. I'm all for taking risks, but at least acknowledge them and preferably measure them as part of making big decisions like this to save a little bit of cash.
Let’s be clear, though: FAANG (as someone who has spent an awful lot of my life working at FAANG) pays well but crushes your soul. There was a time, a very long time ago, when it didn’t, and there are the soulless soul crushers that love it there, but I would rather futz around with a mid-range car’s worth of hardware and be happy than spend a moment longer prostituting my soul for their money.
There was a time in this industry when it paid about as well as accounting, and people did it because they loved what they did. Then the money flooded in, a bunch of people switched majors from business to CS, washed out in industry, got their MBA, and became product managers and engineering managers and sucked all joy from it. God bless those that find that joy again.
> would rather futz around with a mid-range car’s worth of hardware and be happy than spend a moment longer prostituting my soul for their money.
So only 2 options in this profession are 1) sell your soul to 1 of 5 evil corporations that just so happen to also pay excessively well or 2) choose to be unemployed for years while spending a significant amount of money on hardware trying to turn a hobby into a business
Also, by your reasoning, these GPUs are blood diamonds and the author's future product/business should warrant a preemptive boycott by all the perfect people like you.
That is like saying my new restaurant was a success, therefore powering it with a generator was better than connecting it to the grid.
The raw infra being local didn't enable any of that. Now, if he was building ASICs at TSMC, that would be a different thing, because he'd then be using something different locally.
> UPDATE: Launch was a success! 400K+ views, and multiple companies reached to use my IP. Read more here
It seems that he managed to get what he wanted from the hardware and I'm happy for them.
He said something interesting at the beginning of his post, he compared the cost of the hardware to the cost of his time based on his FAANG salary. Which is an interesting way to think of this, but the rest of the article didn't make me understand if at the end he did save money/time compared to just renting in the cloud.
Also, outside of the power cost, hardware has other costs too, you need to operate it, maintain it, set it up, etc., all of which require time. I mean, even the process of figuring out if it had a good enough ROI compared to cloud takes from your time (collecting data, analyzing data, etc etc).
Doubt it, feels basically like just an ad to get attention "Oh look, that's where the magic happens" vs running their code on existing infrastructure and thus just showing the results, like everybody else. This "feels" more "tangible".
I can't imagine spending $48K on a home GPU server, but I did just splurge and buy a PC with an RTX 5090, specifically to hold the largest models you can fit in 32GB. It's a top of the line PC with water cooled high end CPUs, 64GB RAM, RTX 5090 for $5K. To me the jury is still out whether this was a worthwhile investment, but I do expect to use this machine for a decade. I don't run it at 100% power (it's mostly idle, except for times when I'm training or doing batch inference). It has the nice property of being blackwell generation, similar to the machines we use at work.
It just scares me to own a box that is $48K in my house, especially if it breaks, or gets stolen.
Not even a single mention of gaming.
No wonder gamers hate AI bros.
I have a second computer with an RTX 4090 for gaming (running Windows). I also used the new RTX 5090 running Linux to evaluate whether Proton/Wine allow me to run Windows games on linux (yes, it works, but the compatibility and frame rate issues make me stick to native Windows for now).
I wonder what's going wrong there? Personally I found compatibility and performance on Linux to be extremely good. And just keeps getting better. And that's not even just me, that's all kinds of benchmarks out there. Sorry to hear that. : ' (
No idea. I agree that in principle I should have close to the same performance on Linux. I just didn't want to spend a bunch of time customizing configs and updating software so I could reach parity with Windows when I had two computers.
If you want a GPU that has comparable performance on Linux to Windows, you want AMD. NVIDIA drivers are notoriously bad. Many of my games run better on Linux with the open source AMD drivers. (CachyOS rolling rolling rolling).
I have no interest in moving to AMD for video cards right now- the network effect of NVIDIA is just too high, and their peak performance is insane. I also haven't noticed any major issues with nvidia drivers, unless you mean specifically running Windows games on Linux machines with nvidia cards, where I have zero experience.
Network effect for graphics cards? Literally what? Your friends don’t care what GPU you run my guy and there is not much benefit of having brand loyalty to a company like Nvidia that gives absolutely zero fucks about people that aren’t their enterprise customers buying GPUs by the thousands. If there’s any “network effect” for gaming GPUs on Linux it’s in favor of AMD because of the immense amount of work Valve has been putting in to make it work well for their steam* hardware.
Nvidia’s drivers are trash for gaming on Linux and the majority of your “compatibility and framerate issues” are because you’re using a sub-par product for the job.
I am also an enterprise customer that buys GPUs by the thousands, you can see a bit more about my work here: https://www.gene.com/media/press-releases/15010/2023-11-21/g... and https://blogs.nvidia.com/blog/roche-ai-factories-omniverse/ and have worked with nvidia since the mid-2000s on high performance computing for scientific research (in addition to having nvidia graphics cards since the Riva TNT, running both Windows and Linux). So having a blackwell graphics card I can evaluate with linux and windows, both for ML training, inference, and gaming, is a huge network effect.
We’re talking about your gaming PC here. Nobody is forcing you to ONLY buy Nvidia graphics for your personal gaming rig when you ALSO have a purpose built AI rig. Nvidia just removed “gaming” as a segment from their financial reports. They give zero fucks. This absurd blind loyalty serves no purpose.
Sadly if you want a GPU with good AI performance you gotta go with NVIDIA. It might sound crazy but as a 7900XTX owner.. My 12GB 3060 on my linux server outperformed the 7900XTX by 40%. The 3060 only has half the vram of the AMD card. Proprietary drivers under Arch Linux.
On top of the significantly worse software on AMD's side (literally didn't work on windows in particular - so the "performs as good on both systems" is a nonstarter, some GGUF library dependency just doesn't work/exist under AMD on windows). Had me running the AMD card on windows under WSL (not a problem with nvidia though, that ran just fine on windows-side directly).
Aaaand also the other AMD bugs, such as the pink squares display corruption that has been an active issue for my GPU in particular (7900XTX) for over a year, maybe approaching two at this point, with no fix in sight from the AMD team (barely an ack at all, not in a single set of patch notes, just a bunch of reddit discussion). Really regret spending so much on an AMD gpu.
I have the same rig as you minus watercooling, and I assume you have AMD Ryzen 7 9800X3D? Anyway, it's my only PC now, I game, dev, run local models, edit photos, edit videos, all in Manjaro. I get ~70FPS in Cyberpunk at 4k, every setting at "Insane" or whatever goofy thing they call it, Ray tracing on path tracing off, with no framegen but with DLSS set to quality. Without DLSS I get around 40fps. Seems equivalent to what I see online with people with a similar build on Windows.
I run hyprland, seems to be the only wayland based keyboard-forward WM that has good nvidia support (and, allegedly, supports HDR, though I haven't got this working). I heard gnome was pretty good otherwise. I was running i3 before and it also worked fine, however once I got into wanting to get streaming working, there wasn't good compatibility between i3/xorg and tools like sunshine. I believe steam streaming worked fine on it though iirc.
The only thing I miss from windows: easy streaming with sunshine/moonlight. Steam streaming works (usually heh) but it took me a couple days of fiddling to get a stream to work at all through sunshine, and it is choppy. But for local gaming, I don't miss windows at all, I'm so glad to finally have all my drives converted from NTFS to ext4.
No, it's an Alienware R51 with Intel Core Ultra 9 285K 3.2GHz Processor; NVIDIA GeForce RTX 5090 32GB GDDR7; 64GB DDR5-6400 RAM; 4TB Solid State Drive; Microsoft Windows 11 Home; 2.5GbE LAN; 2x2 Intel Killer WiFi 7 BE1750+Bluetooth 5.4; Liquid Cooler
I don't see it on the Dell site anymore, only more expensive, lesser configurations (good timing on my part?).
Yeah, I really want to put in the time to try out various games, but realistically, the whole point of getting a second computer and installing Linux was to be able to train and serve models, and switching between serving a model (that people in my house want to use at random times) and gaming didn't seem like a great choice. If I did get good results, I'd seriously consider wiping Windows 11 from my older machine (an older Alienware with a 4090), but to be honest, I'm perfectly comfortable on Windows desktop.
https://www.protondb.com/
Having built an almost identical rig earlier this year, I can promise at least one similarly-spec'd machine gets equal use between AI and gaming (both on Linux). Stupid-excited for the Steam Frame to finally come out.
or crypto... what's old is new again.
> No wonder gamers hate AI bros.
Personally, playing with AI models is way more fun than getting sucked into a game loop. Game loops feel like busy work hooked to an engineered dopamine drip. AI models are new frontiers and are exciting to build with, modify, lobotomize, and hack around with.
I remember playing Quake III which had user-programmable bots and thinking "wow, this is a really hard computer vision and reasoning problem". And then realizing "huh, that's a major research area, I should work on that". Later I learned that the bots were fairly simple and worked on far simpler world representations (nav meshes).
It looks like DM took a crack at it: https://deepmind.google/blog/capture-the-flag-the-emergence-...
And some of us are doing AI stuff all freaking day at day job and just want to play some Tekken when we get home for 30 minutes after the kid is in bed. But now Playstations are 1000$ and Ram and GPU prices are astronomical.
Not everyone is hustling 24/7 like some kind of lunatic.
I would probably hate someone if they were buying the same hardware as me but doing something actually useful with it. Any game worth playing doesn't require high specs anyway. There is such a large catalog of old games.
I specifically got the previous model so I could play AAA games with all the settings set to Ultra, at 4K. Cyberpunk 2077 struggled even with my 4090, so I had to disable ray tracing and enable DLSS. Since I've run out of new AAA games I've been playing older ones and it's crazy how fast they are.
I don't think you can dismiss gaming as not "actually useful".
Yes! It scared me too. I tried to insure it under my renter's insurance policy, but they not surprisingly refused. I had to get business insurance to cover it
You showed this setup to a business insurance underwriter and they gave you a policy? Can I ask how much the premium is? Or is this just theft insurance?
>> To me the jury is still out whether this was a worthwhile investment, but I do expect to use this machine for a decade.
The high cost and power consumption are both signs of the death of Moore's law, so you are probably correct that this system will be near state of the art for some time.
I was looking at Ultras for sale, and had same worry, so didn't end up getting one. I have some peace of mind comfort about applecare and technical repair, but i couldn't find insurance that would cover theft (or rather, i did, but it was too expensive)
Well a lot of people have that in their garage, even "worst" it's on wheels so even easier to steal.
I'm not saying it's worth it just that it's not such a crazy amount in comparison.
Last fall, seeing the writing on the wall, I pieced together an "AI" rig, 96GB ram, 2x RTX 3090, 9950X - not exactly top of the line, but it came in around CDN$3000 all in all, with most parts second hand. I don't think I could build that for CDN$10000 today.
I've been using it pretty steadily for a variety of personal projects, and the only improvement (aside from the obvious "more VRAM") I feel pressed to make is a portable AC unit / some kind of a focused cooling solution. The rig raises the ambient temperature in the office by 4C at least.
Now with the murmurs of even the large players reconsidering their AI spend, and usage-based pricing shifts, having a self-contained, owned, and independently administered compute resource is looking better and better.
I did the math at least on a Macbook pro, and for inference it's definitely not worth it.
- https://www.williamangel.net/blog/2026/05/17/offline-llm-ene... - Discussion: https://news.ycombinator.com/item?id=48168198
It's comparing laptops to dedicated GPUs in a server environment. The best comparison would be the Mac Studio but the current release is almost 2 years old at this point. We'll see what a likely M5 Ultra Mac Studio looks like, probably in Q3 this year.
But yes, for pure inference, the M5 Max Macbook Pros probably aren't there yet. They have other utility though of course. And you can get 64GB and 128GB MBPs at a discount. Micro Center will currently let you buy a 64GB M5 Max MBP for under $4k, for example.
Why didn't you take into account batching, input tokens, different costs of electricity, and the fact that a laptop can still hold a decent % of its resale value, and is useful for many other tasks than running an LLM?
> Why didn't you take into account [...] the fact that a laptop can still hold a decent % of its resale value, and is useful for many other tasks than running an LLM?
Because that wasn't what they claimed to research?
It's entirely fine if you enjoy local LLMs on your computer, there are people doing horribly inefficient inference on smartphones now. But for pure inference tasks, it's pretty obvious why M5s and Mac Studios aren't replacing TPUs and GPUs.
Who is going to buy a $4299 M5 Max MBP with 64GB of RAM just to run Gemma 4 31b? Firstly you don't need 64GB for that model. Secondly if you want a machine that sits in the corner and does nothing but LLM inference, you don't buy a MacBook Pro, you buy some GPUs which are going to cost you a fraction of that (~$1k for ~64GB of VRAM is possible). The people buying Apple Silicon for inference generally aim for the Mac Studios with enormous amounts of RAM (128-512GB), to run very large models.
The idea is obviously to be running the LLM on your work laptop. As a developer I'd need a laptop with 24GB of RAM for work anyway, and 48GB, which is enough for a very good quant of Gemma, is just $400 extra.
> Gemma 4 31b? Firstly you don't need 64GB for that model.
You don't? It for sure doesn't run on my 32 GB M2 MAX.
What quant? You should have no problem running it at Q4 with 256K context, Q5 or Q6 even although maybe not at full context. I can run Q4 on a 4090 with just 24GB VRAM.
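A rough way to sanity-check the quant question is to estimate the weight footprint from parameter count and bits per weight. This is only a minimal sketch: the bits-per-weight figures below are assumed round numbers (real quant formats vary by scheme), and KV cache and runtime overhead are not counted.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Assumed effective bits per weight; illustrative values, not exact format specs.
ASSUMED_BITS_PER_WEIGHT = {"Q4": 4.5, "Q5": 5.5, "Q6": 6.5, "FP16": 16.0}

for quant, bits in ASSUMED_BITS_PER_WEIGHT.items():
    # 31 is the parameter count (in billions) of the model discussed in the thread.
    print(f"{quant}: ~{weight_memory_gb(31, bits):.1f} GB of weights")
```

Under these assumptions the Q4 weights come to roughly 17 GB, which is consistent with fitting on a 24GB card with some room left for context.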
> Firstly you don't need 64GB for that model.
You might need that to run it with a longer context, KV cache size is a known issue with that model series.
24GB GPUs are $700-2500. Please show me the 64GB GPU for $1k.
Not a single new 64GB GPU, but multiple used GPUs.
They’ve significantly increased in price (so much for hardware depreciation…) but you can still get a modded 22GB 2080 ti for $320, or a Mi50 32GB for ~$450 each (used to be $150 a few months ago, alas), or a Mi50 16GB for <$200 but you’d need to stack 4 of them.
There’s also some more exotic configurations but those are probably the simplest options. You won’t get the performance of an RTX Pro 6000 Blackwell of course, and the power consumption will be pretty high so it’s only worth it if you have cheap electricity. But it is possible.
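The "~64GB for ~$1k" claim can be checked with simple stacking arithmetic. A small sketch, using the per-card prices quoted in the comment above (the 16GB Mi50 price is treated as its "<$200" upper bound); risers, PSUs and power draw are ignored:

```python
import math

# name -> (VRAM per card in GB, price per card in USD), as quoted in the comment.
QUOTED_CARDS = {
    "2080 Ti 22GB (modded)": (22, 320),
    "Mi50 32GB": (32, 450),
    "Mi50 16GB": (16, 200),  # upper bound, quoted as "<$200"
}


def stack_for(target_gb: int, gb_per_card: int, usd_per_card: int):
    """Return (card count, total VRAM in GB, total USD) needed to reach target_gb."""
    count = math.ceil(target_gb / gb_per_card)
    return count, count * gb_per_card, count * usd_per_card


for name, (gb, usd) in QUOTED_CARDS.items():
    count, total_gb, total_usd = stack_for(64, gb, usd)
    print(f"{name}: {count} cards, {total_gb} GB, ${total_usd}")
```

At the quoted prices every option lands at or under $1k; the totals scale directly with whatever the cards actually cost on the day.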
All the 22GB 2080 Ti are now $450-600.
Except this math is 10x too high (unless accelerated depreciation is all of it) - a million tokens at 28 tokens/sec and 75W and 20c/kwh should cost $0.15 not $1.50. (And less with MTP.)
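For anyone who wants to redo that arithmetic, here is a minimal sketch of the electricity-only cost per million tokens. It uses the figures from the comment above (28 tokens/sec, 75W, 20c/kWh) and assumes constant power draw and throughput, with no depreciation included:

```python
def energy_cost_per_million_tokens(tokens_per_sec: float, watts: float,
                                   usd_per_kwh: float) -> float:
    """Electricity cost in USD to generate one million tokens."""
    seconds = 1_000_000 / tokens_per_sec
    kwh = watts * seconds / 3600 / 1000  # watt-seconds -> watt-hours -> kWh
    return kwh * usd_per_kwh


cost = energy_cost_per_million_tokens(tokens_per_sec=28, watts=75, usd_per_kwh=0.20)
print(f"${cost:.2f} per million tokens")
```

That is about 9.9 hours of generation and roughly 0.74 kWh, which rounds to $0.15.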
That's the case with Self-hosting anything. It is the privacy that matters.
Privacy of what in this case?
Not necessarily. I was spending ~$150/month on vultr's kubernetes hosting. I spent $5k building out a pretty awesome 1U server and I put it in a colo that costs me $50/month. Next year I will break even financially and everything after that is saving money. I also am getting so much more out of this server than I was getting on vultr because I over-spec'd the machine. In addition to running more on my cluster, I spin up large virtual machines for development, experiments, and for offloading distributed builds. No shade to vultr, but owning my hardware instead of renting was absolutely the way to go. Unfortunately today the ram alone would cost over $5k, so the math has changed.
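The break-even point in that comparison can be sketched as below, using only the figures given in the comment ($5k upfront, $150/month before, $50/month after). It assumes flat monthly prices and ignores resale value, repairs, and the value of the extra capacity:

```python
def break_even_months(upfront_usd: float, old_monthly_usd: float,
                      new_monthly_usd: float) -> float:
    """Months until the upfront spend is repaid by the lower monthly bill."""
    monthly_savings = old_monthly_usd - new_monthly_usd
    if monthly_savings <= 0:
        raise ValueError("no monthly savings, so the purchase never breaks even")
    return upfront_usd / monthly_savings


months = break_even_months(upfront_usd=5000, old_monthly_usd=150, new_monthly_usd=50)
print(f"break-even after {months:.0f} months (~{months / 12:.1f} years)")
```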
One value of learning on my Macbook is that mps is not as well supported as cuda which forces me to go down roads I would not have traveled.
That's more of a disadvantage. CUDA is an industry standard, MPS/MLX/Metal compute shaders are a novelty.
This is interesting but I am unsure how you make money out of this home setup, I would imagine if one would be offering consultancy to a business the business would make their own equipment/infrastructure available, which would also give a better control of their data. But perhaps I am thinking this because I am thinking about very big companies. Then, for very small businesses, I don’t see them having the use case with the budget to match the need. So is this for specific services for medium sized businesses? Can you explain this a bit?
At the end they briefly mentioned how they started a service to post-train LLMs on producing more human, less formulaic-obviously-AI text.
Nice analysis, I would have loved a short overview of the kinds of experiments that were running on the machine (I know the results are given).
I find the "independent researcher" business model quite interesting. In the linked post he writes """DFT is a proprietary training algorithm, however, I’m currently offering a beta for a model training service where I will train your model for you using DFT.""" I'm curious how successful this is. Essentially market some AI breakthrough as a service instead of publishing a paper like my academic brain is trained to do.
As an aside, one thing that I always loved about our field was that the startup cost for many business ideas was "a laptop, internet connection and some grit". In the age of AI it's quite a bit more and I feel one of the sad side effects of this is that it crowds out poorer and younger developers.
Other things people spend "too much money" on:
- muscle cars, with all the stuff, driven occasionally.
- boats, that don't get taken out much
- gamer x, where x=system or laptop or keyboard or mouse or desk or glasses or mousepad or speakers or ... usually with "> too much RGB"
- children
$48k for something constructive even if ai related? no problem, refreshing even.
One of these things is not like the others. If you don't spend the money on one of them, you can get a visit from government officials that might decide to take that "item" from you. You'd also be a worthless human to spend that money on the other 3 while not on the one.
I (probably not obviously) meant the "too much" part, where kids fail to grow/launch when (optional) things are given to them too easily.
I didn't mean food, shelter, medical, education, lego, others-where-required-by-law.
I have profited richly from my children, but not in a monetary sense.
However much it has cost me monetarily, it has repaid itself ten times over in value to my very soul.
It reminds me of the Crypto mining bubble - I look forward to buying my heavily discounted Mac Studio soon.
a P40 from 10 years ago costs 5x what it did in early 2023. a 3090 from 6 years ago costs as much now as it did then. RAM costs over three times what it did a year ago. M3 512gb now costs twice what it did at release.
"soon" isn't happening until China saves us from the Taiwanese semiconductor cabal and their Talmudic markups.
'If you google “plugging a PC into multiple outlets”, you get lots of warnings that if you even consider such a setup you will instantly burst into flames. So I hired a professional PC builder make sure it was safe.'
Not really sure how that makes it safe but OK!
I read this as; the "professional PC builder" would carry some sort of insurance. So it isn't really "safer", but if something goes wrong, the investment is (potentially) still safe.
Just an assumption, though!
Probably means hiring someone who has more knowledge about PSUs and especially about having two simultaneous PSUs. There are questions like: when you press the power button how do the two PSUs turn up and in what sequence? How do you deal with the PWR_OK signal? What if there are voltage differences between the two PSUs? What about power backfeeding?
I guess it was supposed to be a humorous aside, but it wasn't actually helpful because the relevant issue is when you pull more total amps from a single circuit than it's fused for (usually 15 or 20 amps in U.S. residences). The failure mode is usually tripping the circuit breaker.
That issue can often be addressed fairly easily by splitting the power draw between two adjacent circuits. You can have an electrician do it permanently or temporarily DIY it with an appropriately rated extension cord. The real issue was OP was in an apartment at the time so an electrician would have been difficult. I assume they decided to just have a system integrator build it because they didn't want to figure out how to segment and route the power rails in a dual power supply system, but it's not exactly rocket science. Problems are often more due to choosing power supplies that aren't up to their claimed spec, not pre-testing them under load or using incorrect or under-spec cables.
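The per-circuit limit mentioned above is easy to put numbers on. A minimal sketch, assuming 120 V circuits and the commonly cited 80% derating for continuous loads (both are assumptions here, as is the 1600 W example load); check local code and the actual breaker before relying on it:

```python
def continuous_watts(volts: float, breaker_amps: float, derate: float = 0.8) -> float:
    """Usable continuous power on one circuit, after the assumed derating."""
    return volts * breaker_amps * derate


def fits_on_circuit(load_watts: float, volts: float, breaker_amps: float) -> bool:
    """True if the load stays within the derated capacity of a single circuit."""
    return load_watts <= continuous_watts(volts, breaker_amps)


HYPOTHETICAL_LOAD_W = 1600  # illustrative figure, not a measurement

for amps in (15, 20):
    budget = continuous_watts(120, amps)
    verdict = "fits" if fits_on_circuit(HYPOTHETICAL_LOAD_W, 120, amps) else "does not fit"
    print(f"{amps} A circuit: ~{budget:.0f} W continuous, {HYPOTHETICAL_LOAD_W} W {verdict}")
```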
I think the relevant issue is you could conceivably have a house with two outlets with opposite phases. Bussing them together in the PSU will then create a short
> two outlets with opposite phases
This is actually THE standard in the US, which is actually fundamentally a 240V power grid but with an electrode stuck halfway down the secondary winding on every pole transformer, which becomes your "neutral". The two ends become L1 and L2, so that L1-N is 120Vrms, L2-N is 120Vrms, and L1-L2 is 240Vrms, and this is what goes into every home.
The power outlets connected to L1 are all opposite phase to all the ones connected to L2.
Rather than bussing the two outlets together, what you can safely do is get an electrician to just wire up an outlet with L1 and L2 and voila you have a 240V outlet. This is how you get all your dryer outlets, EV charging outlets, electric stove outlets, etc.
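The split-phase arithmetic above can be checked numerically. This is an idealized sketch (pure sine waves, exactly 120 Vrms per leg, the two legs exactly 180 degrees apart), with leg names following the comment:

```python
import math

V_RMS_PER_LEG = 120.0
V_PEAK = V_RMS_PER_LEG * math.sqrt(2)


def voltage(node: str, cycle_fraction: float) -> float:
    """Instantaneous voltage of 'L1', 'L2' or 'N' relative to neutral."""
    if node == "N":
        return 0.0
    sign = 1.0 if node == "L1" else -1.0  # L2 is the opposite end of the winding
    return sign * V_PEAK * math.sin(2 * math.pi * cycle_fraction)


def rms_between(a: str, b: str, samples: int = 1000) -> float:
    """RMS voltage measured between two nodes over one full cycle."""
    total = 0.0
    for i in range(samples):
        t = i / samples
        total += (voltage(a, t) - voltage(b, t)) ** 2
    return math.sqrt(total / samples)


for pair in (("L1", "N"), ("L2", "N"), ("L1", "L2"), ("L1", "L1")):
    print(pair, round(rms_between(*pair), 1))
```

Two outlets on the same leg show no voltage between their hot conductors, while outlets on opposite legs show the full 240 Vrms, which is why tying them together is a short.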
I used to work for a company where we made test rigs and their safety guys were strictly against having a single machine with multiple power inputs. It wasn't about power draw. Once you have two plugs:
1. You no longer have the nice property that unplugging it guarantees (more or less) that it isn't electrified.
2. You open up the possibility of mains voltage from one plug appearing on the unplugged prongs of the other plug.
3. It possibly messes with RCDs, depending on what you do exactly.
Although in this case it's probably fine because he's just plugging totally separate power supplies in and they're already fully enclosed.
Agreed, if anything an electrician was needed, not a professional PC builder.
The picture shows two power supplies. Powering what is effectively one appliance from different circuits is a definite no-no, and I can't think of any circumstance where it wouldn't be in a home.
If his mains supply was sufficient to run the server and the house in the first place then the simplest solution would be to simply upgrade one of the MCBs/RCBOs on one of the circuits to the required capacity. I am not sure a landlord would even notice something like that, and if the house is wired correctly in the first place, it's unlikely to be dangerous. So going from say, 6A to 12A, on a 20A mains supply is generally fine if the gauge of wiring is correct.
Hi! Thank you so much for posting this! I got bad luck/timing when I tried, so happy it made it to the front page! (I am the author)
I did this with used parts and cheaper consumer cards (3090s) and did much of the same calculations. I found it was way cheaper for me as well.
The main advantage, however, is that the friction of "this is going to cost me in tokens to even try" goes away. I was so much more willing to take chances and try new things on my own hardware than I would have been if I were paying API costs. I feel like this point isn't made clearly enough by those of us who run these absurd self-hosted inference systems.
Thanks for the write up, was a fun read. I spent an order of magnitude less, but I could relate to your story from beginning to end.
Epyc (Milan), 512gb ram, 4x 3090
You kind of bury the lede in that Article, it's a good article, well done getting interest in your work.
Will you now be selling these GPUs for a profit?
"The point of buying the server wasn’t to save money, it was to build something cool." In the end, this is always the real answer - one that I'm sure we can all agree is the 'correct' one too.
I'm sure the plan was to build a holodeck all the time..
Sounds fun/stressful/rewarding. I'm most interested in the update at the end though 'Launch was a success! 400K+ views, and multiple companies reached to use my IP.' I too, like probably 1 in 5 of the people reading this, think I have figured out some major problems with LLMs (context and computation research) but have wondered the best way to 'release' and get value out of it. I can see training being a little easier in that you release weights against a known model arch but not the training code. My stuff is all custom layers though. Any thoughts on a release strategy where you need to release the layer code for people to see test weights/the benefits?
My first advice is to have a test set with clear improvement, and a clear "wow" demo use case. There are lots of "breakthroughs" that seem good but aren't (e.g. some new architecture that doesn't mask past tokens correctly and leaks information), so people will assume it is wrong. To prevent this, you need to be extremely rigorous in your launch materials. If you can make it into a product that people can try out themselves, that goes a long way. You don't need to open source any code (I haven't yet) if people can try it out some other way like a demo website. Good luck! Ping me if you want to chat more
And the net result is a way for LLMs to use more variety in their writing style.
Didn't Sam Altman create LLMs to cure cancer and stuff? Why does their writing style matter as long as the information they are conveying is accurate?
This is a difficult calculation to make because you wouldn't rent time on the exact same system in the cloud. Depending on what you're running, a bigger server with better inter-GPU interconnects in the cloud might complete the task so much faster that the additional per-hour expense is more than covered.
Agreed. And the gained time either goes toward 1) more experiments, or 2) leisure, which makes you sharper in the lab and happier overall. Not sure the "I saved $17,000 so far" framing is the most useful way to look at it, but it's a cool project and I love that people are doing this kind of thing.
Right, you can rent a v100 from llama cloud for $0.79/hr. An h100 is $3.99 /hr.
$48000 is equal to 12000 hours of renting an h100, which is about as long as you’d spend at your job for 6 years!
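A quick sketch of that conversion, using the hourly prices quoted above. The 2000 working hours per year (40 hours x 50 weeks) is an assumption added here to turn rental hours into "years at your job":

```python
ASSUMED_WORK_HOURS_PER_YEAR = 2000  # assumption: 40 h/week * 50 weeks


def rental_hours(budget_usd: float, usd_per_hour: float) -> float:
    """How many rented GPU-hours a fixed budget buys."""
    return budget_usd / usd_per_hour


for gpu, rate in (("h100", 3.99), ("v100", 0.79)):
    hours = rental_hours(48_000, rate)
    years = hours / ASSUMED_WORK_HOURS_PER_YEAR
    print(f"{gpu}: {hours:,.0f} hours, ~{years:.1f} work-years")
```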
Could turn the system into a multi-seat gaming rig with 6 separate gaming seats, using loginctl on Linux.
The fine print at the end, with "Advice/Other notes", is the most interesting.
FYI: If you're in a similar situation, think very carefully before you build your own. The $17000 might sound like a lot; but when you take into account your time and risk tolerance, renting might be a much better solution.
I think their retrospective at the end of the article is grounded and logical:
"If I were to do this again, I wouldn’t do a custom build like this. I would buy a standard datacenter server and rent space in a colocation center"
I'm sure there are use cases when renting makes sense, but it can get crazy expensive really fast if you're not careful.
I built a very similar server myself [0] with a similar setup. I run different models for different purposes, but the primary one currently is kimi 2.6. I run kimi as the orchestrator model and then qwen, Gemma and others for specific tasks (sometimes loaded dynamically based on the task at hand), all exposed through the pi harness. I also use Hermes for some personal repeated tasks which connects to the same models, hosted on my local Mac Studio.
I am not even going to pretend that this is a financially reasonable option. I simply wanted to have local models. Maybe down the line, as cloud models become less subsidized, I might benefit from having a local setup, but for now, it wasn't the most prudent financial decision.
But one big benefit is that I never have to worry about my account being randomly banned, nor do I have to worry about running out of quota. I still use codex and opus for some specific tasks, but as tools are improving, I need them less and less.
[0] https://x.com/synopsi/status/2024235558193811778?s=20
missing from most of these cost discussions: privacy. for some workloads the entire value of local is zero data leaving the network, and cloud cost is irrelevant
Just curious OP (if you're the one posting) -- what do you mean by independent researcher? What are you researching and are you making $$ from it or are you living off previous built up savings? Seems like an interesting path. What research have you looked into so far?
I am not the author, but he has been training/tuning? a model that produces text that mimics the source material in a more natural way. So getting the LLMs to produce less bland and boring LLMisms, according to the follow-up blog post.
citing from the article:
"I spent a long time trying high risk/high reward experiments and failing. But now I have something good. I’ve solved a major problem with LLMs. And I’m launching next Monday so we will soon see if it’s actually a breakthrough or just LLM psychosis "
Maybe ai companies today have some bounty program?
They have a subsequent post (from Monday) about what they've been working on: https://rosmine.ai/2026/05/18/fixing-llm-writing-with-distri...
(I would assume they haven't made a lot of $ off of this, if nothing else because they've only just put out that post and demo. They do seem to have produced a model that doesn't sound very LLM-y to my ear, though it also seems rather weak for its size.)
Shallow take: They made an LLM that uses fewer emdashes.
Cynical take: They made an LLM that can bypass existing AI slop detectors.
Realistic take: They found a research problem they found interesting, dumped a bunch of capital and sweat equity into it and (claimed to have, at least) found a solution. Neat!
Or they just have lots of money and a hobby. Someone else might blow $48K to get an old Cessna and go have fun flying around. Not everything needs to have a purpose.
A self-described "broke grad student [who has] been saving up for this for years."
Risking their own money and time instead of leveraging a PowerPoint to hire other peoples' labor with other peoples' money. I can respect that.
You read that line wrong.
You were on the money with the Cynical take lol:
https://rosmine.ai/2026/05/18/fixing-llm-writing-with-distri...
Huh? That's them intentionally demonstrating the slop style.
Just curious - What exactly are you using that rig for? I see that you said research work. Are you building a product or training models? I ask because whether something is worth it or not depends largely on what you get out of it and how you value what you get. It's perfectly fine to leave a FANG job and go for, say, a pottery hobby. What gives you happiness and your value system - these will qualify your decisions.
In the article the author says they are doing reinforcement learning with LLMs.
Seems like they just want to play PewDiePie after making tons of money from their salaried job and have a bunch of spare time now.
They posted their research results, it's linked at the end of the article.
https://rosmine.ai/2026/05/18/fixing-llm-writing-with-distri...
Buy one of these next time, https://tinygrad.org/#tinybox. At least geohot knows what he is doing.
Any kind of fixed capacity usage model seems to be a dead end. Paying per token might seem like an exploitative arrangement at first glance, but it's a luxury if you are experimenting or deploying greenfield.
Provisioned capacity is a really high end thing. I feel like you'd need to be spending more than $1000/day on tokens for this model to make any sense. You lose a lot of flexibility once you start dumping capital into specific pieces of hardware. Maybe start by renting the GPU server for a few days...
Great article. I'm about to embark on a similar journey.... Doing a ton of AI development right now. Don't need a server, but a very, very high end workstation is super appealing to me right now. Looking at $50-$80k. 1TB RAM. 2x RTX Pro 6000s. 64 core Threadripper Pro. As many 4tb or 8tb nvme drives as I can stuff.
I envision NixOS at the core... then everything I need virtualized on top with KVM/QEMU. Maybe a dual boot setup with Windows for gaming and Flight Simulator (but I could virtualize that too with easy GPU passthrough.)
Lingering questions I'm working to figure out:
- Will 2 RTX Pro 6000s run on a 1600 watt PSU? Not sure how much higher I can go without calling an electrician. (standard US home.)
- Assuming I plop this into my home office, should I expect the PC to run significantly hotter than my current rig? (3960x threadripper, 128GB RAM, 1600watt psu, overclocked and watercooled 4090.) My water temp, measured at radiator, is about 60c at peak load. (This is the only number I care about, as this is what I have to consider to be comfortable sitting next to it.)
What do you want to do with the workstation? I have a similar setup:
- 512 GB
- Epyc 9684x
- 2x RTX 6000 Pro
- 1400 W PSU x 2 but in redundant mode
Mine is in a colo where it stays nice and cool. In my case, I went with less RAM and more GPUs (bought 4). Secondarily, the Max-Q blower version of an RTX 6000 Pro Blackwell is easier to keep cool and also only needs 300 W at the cost of very little performance. The non-max-q also only really use 300 W during inference, but the good thing about a lower power use is you can put more GPUs in very safely.
I assume you want the Threadripper Pro to maximize single-core performance? So you're spending a lot of time on CPU? Interesting stuff.
I gained a lot putting the machine somewhere else. TTFT on a thing like this is between 100-800 ms depending on batching and model size and so on, and your nearest datacenter is likely <10 ms. It sits on nice dual redundant power in a place where it's blown icy cool.
Good luck with your setup. If you get around to it, and end up writing about your setup on a blog, do share. Email in profile.
Very nice. Primary use case is application development, where the applications leverage a mixture of cloud based and local models. Modelling complex architectures. My work is primarily in the aerospace and defense arena, so hybrid and on-prem are important, as are ITAR and CMMC compliance. The idea is to have the local rig to build and validate architectural deployments that can sit on prem on customer hardware, in cloud, in gov cloud, or in a mix.
Not really looking at colocation, as this machine would double as a heavy duty gaming and flight sim rig. That means at least one regular RTX 6000 Pro. Not sure if I can mix and match with the Max-Q version, or if I even want blower fans in a desktop case (last time I did that was about 16-18 years ago with an ATI card... wasn't a fan--pun intended.)
The other advantage of the local GPU is that you are not feeding your data into cloud providers. I'm not sure how much you can really trust Anthropic and OpenAI not to be improving their models based on your input.
Doesn't it benefit me if the models I use improve?
Do you value the infinitesimal improvement of model quality more than your privacy?
You can turn off training with Codex and Gemini. Not sure about Anthropic.
I can check a box.
How much do you trust OpenAI or Anthropic to not use it as training data anyway? What if you are building a startup and they can just use their visibility into it to copy your IP instead of buying your company?
And here I felt like I was wasting money on an Intel B70 to run LLMs locally.
Stuff like this + OpenClaw with Mac Minis a while back is sort of exposing a probable local AI flywheel waiting to happen.
Someone needs to solve proper distribution of packaged GPUs with some Tesla-like wall connector for a consumer grade box that is plug and play.
Maybe John Ternus ends up doing that at Apple since they sit closer to this consumer profile.
So the answer is: "TBD if I can actually make money to pay this back"
If nothing else, rosmine's DFT [1], which is what they were working on with this setup, seems like a worthwhile investigation.
While I'm skeptical that there is much of a moat, at least for the large players, it should at least hopefully set rosmine up for the next job :)
It does seem to fix the current biggest issues with using LLMs for writing at various publishers. If you're The Economist, you have a very specific house style and you have a decent corpus of articles written in that style. At least on my reading of it, rosmine can use DFT to get a model to closely match its outputs, in terms of the language quirks that are generated, to that of the corpus it is fine tuned on. ie it will very much match the house style, particularly as it is used in writing, vs giving a system prompt to an LLM that has some Economist articles in its vast training set, and telling it to write in that style- it will do an ok job, but still exhibit LLM language quirks despite itself. Even if you feed it the specific "style guide" that they give their authors, I dare say the reality of their writing is the best place to learn, and it sounds like DFT can ground the writing of a model in a specific corpus like that.
[1]: https://rosmine.ai/2026/05/18/fixing-llm-writing-with-distri...
Giving an LLM samples and telling it to apply the style in the samples works a lot better than just telling it to copy a style it may have seen, or giving it a list of rules.
They do it well enough that it'd take really good output to beat.
They really don't.
If your goal is to, say, write science fiction, their reversion to classic LLM-isms is really distracting and is what makes people say at a glance that it was written by an LLM. You basically can't use them at the moment in any real "natural" long-form writing. Everyone will call "slop" pretty quickly on the current frontier models.
rosmine's DFT paper is worth a read.
I have seen examples that show otherwise, including from a client that tested it extensively by paying people who thought they were paid to help detect AI generated content. They did little more than what I described. It works very well. Some people still insist they are able to tell the difference, but in the tests I saw, people did little better than random chance.
Some of it you could probably tell with statistical analysis, but people are actually far worse at judging whether content is AI generated than they think they are.
If you need to beat an AI testing tool, you need to do marginally more work than to stop people from recognising it, but not all that much.
The nature of it is that you don't "see" most of the stuff that is well done because few people want to talk about it.
From the author’s POV it seems like they were going to do this research regardless, so this is asking what the most cost-effective way to do that research is.
Or, for a person who did have a great way to monetize the same workload they’d probably find a lot of value in reading this post.
I've already seen one question like this in the thread, but let me rephrase it a bit more sharply: did you consider renting out your setup on vast.ai, and if so, how much money could it generate per month after deducting electricity?
Also, sorry for the noob question, but doesn't such a server generate an enormous amount of heat? Did you not use any special cooling system?
(For reference I’m talking about the DFT post from the same blog.) I love that ML is still in the “gentleman researcher” stage where relatively small amounts of startup capital can buy a ticket into frontier research.
For a lot of research questions 6 GPUs is even overkill.
It’s one of the reasons I’m skeptical of the “trillion dollar supercluster” idea [0]. I think what we need is more reasonably smart people investigating medium-sized problems. A “GPU middle class” you might say.
[0] https://situational-awareness.ai/racing-to-the-trillion-doll...
I agree :) Also, I heard Teknium trained the original Hermes model on 2x 4090. You can do a lot with a little compute
People are doing the economics with cloud GPUs, and of course cloud GPUs are going to be cheaper. But also, is generating tokens all you do with your computer? I can play games on DGX Spark and also do LLM inference, so sometimes the economics work out, apart from having fun with it.
These top AI "independent researchers" that live in underpowered apartments and work off their parents' basement...
Is that California?
Questions:
1) Was the energy bill factored in? 2) Have you extracted any comparable value out of this?
Quick tip for people who want to experiment with local models: A lot of the common smaller models are also available on openrouter or other services. Dirt cheap.
I know it's not the same. But a lot of people buy expensive GPUs, just to find out they have no real use for smaller models.
Openrouter is great for experimenting with models. I did exactly what you're saying to test smaller models that will run on commodity hardware and determine if it might be worth it to drop $10k on hardware. For me the answer was no, but it's close. I'm very excited for the next few innovation cycles to arrive.
The $48K also isn't fully sunk cost - there's a non-trivial residual value for those GPUs at the moment and likely for a few years yet. The server has a depreciation curve that's pretty enviable, actually!
The idea is similar to maintaining on-prem vs cloud
Cloud is optimized for development velocity but its nature of high margin business eventually makes on-prem more promising
It could be too late, but it might be worth looking into tax savings if you have a business. Depreciation of an asset is a loss and may be deductible from your income. (I'm NOT a tax expert)
Cloud servers have cheaper electricity, the scale of industrial-level cooling, no issues for you (as a user) with hardware failure (ie you just use a different server; it's not your problem) and can amortize their cost by running 24x7. I've seen H100 compute hours for as little as $2.
As the author notes, there are also electrical/wiring issues that cap how much compute gear you can run in a space not designed for it. I suspect a standard 20A 110V circuit can probably handle 2x RTX 6000 Pros. 15A probably can but that requires more research. Anything more than that and you're using multiple circuits, which has issues, or you need an upgraded circuit (eg 40A 240V) with all that entails (eg heavier duty cables, custom plug, etc).
> I suspect a standard 20A 110V circuit can probably handle 2x RTX 6000 Pros. 15A probably can but that requires more research.
During initial setup of the server I am putting together, I found that a machine with 4x Blackwell cards derated to 300W can get by on a single 120V 20A circuit. It's tight but doable. A lot depends on the power supply. I don't think it's a great idea to run 4 high-power GPUs on a single ATX-style PSU, even a beefy 1600W job.
The other questionable part is whether all four cards can temporarily spike at full power during boot, before the wattage limit is applied by the OS. Some accounts say this is possible, and if so it could shut down the party in a hurry. But I didn't see any misbehavior when I tried it.
My earlier research suggests NVIDIA does not actually cap spikes, it caps the average over short periods of time. So setting the power limit is no guarantee.
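For anyone who wants to redo the circuit arithmetic in this subthread, here is a rough sketch. Assumptions that are mine and not from the thread: the common US rule of thumb of keeping continuous loads to 80% of the breaker rating, a hypothetical 400 W for CPU, fans and drives, and a hypothetical 92% PSU efficiency. It only looks at sustained draw and ignores the short spikes discussed above.

```python
def continuous_budget_watts(volts: float, breaker_amps: float, derate: float = 0.8) -> float:
    """Wall power available for a continuous load on one circuit.

    derate=0.8 is the assumed 80% rule of thumb for continuous loads.
    """
    return volts * breaker_amps * derate


def wall_draw_watts(gpu_count: int, gpu_watts: float,
                    system_watts: float = 400.0,    # hypothetical CPU/fans/drives
                    psu_efficiency: float = 0.92    # hypothetical PSU efficiency
                    ) -> float:
    """Estimated draw at the outlet: DC load divided by PSU efficiency."""
    dc_load = gpu_count * gpu_watts + system_watts
    return dc_load / psu_efficiency


def fits(volts: float, breaker_amps: float, gpu_count: int, gpu_watts: float) -> bool:
    """True if the estimated sustained draw stays within the derated circuit budget."""
    return wall_draw_watts(gpu_count, gpu_watts) <= continuous_budget_watts(volts, breaker_amps)


if __name__ == "__main__":
    scenarios = [
        ("4 GPUs at 300 W on 120 V / 20 A", 120, 20, 4, 300),
        ("2 GPUs at 600 W on 120 V / 20 A", 120, 20, 2, 600),
        ("2 GPUs at 600 W on 120 V / 15 A", 120, 15, 2, 600),
    ]
    for label, volts, amps, count, watts in scenarios:
        budget = continuous_budget_watts(volts, amps)
        draw = wall_draw_watts(count, watts)
        verdict = "fits" if draw <= budget else "does not fit"
        print(f"{label}: ~{draw:.0f} W of {budget:.0f} W budget, {verdict}")
```

Under these assumptions the 4x 300 W case lands inside the 20 A budget with little headroom, which matches the "tight but doable" experience above. Change `system_watts` and `psu_efficiency` to match your own build before trusting the result.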
Jensen Huang said 'the more you buy, the more you save,' and you actually took it personally.
I've hit command+f and then looked for this.
Glad I could fulfill your search query. Doing my part for the SEO of this comment section.
"If I were to do this again, I wouldn’t do a custom build like this. I would buy a standard datacenter server and rent space in a colocation center. But then I would miss saying Hi to grumbl once in a while."
I have four old 24gb Nvidia cards. They're not great but they're not useless either. The problem is that I haven't really figured out a good way to actually use them.
Genuine question; would anyone here recommend any specific motherboard to best utilize these cards?
Depends what you want to do and which cards you have, but usually going with any older (3rd gen+) threadripper pro setup will give you a lot of pcie lanes.
I myself run with gigabyte trx40 aorus xtreme, but since it's regular threadripper (not pro) with 4 GPUs 2 of them will run at x16 and two of them at x8 speeds
You could ask AI and get pretty far reading the answer.
I know. But this is a forum filled with technical professionals and I would like to get actual opinions from actual humans.
AI is cool but it's not going to have all the good and bad experiences that humans have had with different motherboards.
Actually that's the best part of AI. It has access to experience with way more than the select sample size here.
I'm not entirely sure what your point is here; me asking for humans to give an opinion does not preclude me from also asking AI.
I was just making a correction based on what you said.
"AI is cool but it's not going to have all the good and bad experiences that humans have had with different motherboards."
AI will have more access to experiences than you'll find here.
It actually won't have "had" any experiences though. Yes, it can aggregate stuff from blog posts and reviews and marketing material. That's hardly the same thing.
Can anyone recommend what type of server would be required to run an RTX 6000 Pro?
So some things have changed since this rig was first built (2024). The most relevant is that $6800 RTX 6000 Ada 48GB has arguably been supplanted by the $9500 RTX 6000 Pro 96GB.
The Ada has a memory bandwidth of 960GB/s. The Pro has 1.8TB/s and about 40-50% better performance so is at least equivalent in processing power, much better in memory bandwidth (important for inference) and can hold larger models on a single card.
I've considered buying a rig with 1-2 6000 Pros for similar reasons but I want to see what happens with this year's Mac Studios with a likely M5 Ultra. Macs have a shared memory architecture whereas NVidia segments the market based on max memory where the biggest consumer card (RTX 5090) has 32GB of VRAM but still excellent memory bandwidth (1.8TB/s). A RTX 5090 rig will still trounce a Mac Studio seems to be the conventional wisdom. Despite being able to hold larger models and being able to chain Mac Studios on TB5, their lower memory bandwidth (~900GB/s) and lower overall GFLOPS mean they still come out behind.
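The reason bandwidth keeps coming up: in single-stream decoding, each generated token has to read roughly all of the active weights once, so memory bandwidth divided by model size gives a rough ceiling on tokens per second. A back-of-envelope sketch using the bandwidth figures quoted in this comment. The 70B-parameter model at 4 bits per weight is a hypothetical example of mine, and real throughput will be lower for a single stream (KV cache reads, kernel overhead) or higher in aggregate with batching.

```python
def model_bytes(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in bytes."""
    return params_billions * 1e9 * bits_per_weight / 8


def decode_ceiling_tokens_per_s(bandwidth_gb_s: float, weight_bytes: float) -> float:
    """Upper bound on single-stream decode speed if every token reads all weights once."""
    return bandwidth_gb_s * 1e9 / weight_bytes


if __name__ == "__main__":
    # Bandwidth figures as quoted in the comment above (GB/s).
    devices = {
        "RTX 6000 Ada": 960,
        "RTX 6000 Pro": 1800,
        "Mac Studio (approx.)": 900,
    }
    # Hypothetical workload: 70B parameters quantized to 4 bits per weight.
    weights = model_bytes(params_billions=70, bits_per_weight=4)
    for name, bandwidth in devices.items():
        ceiling = decode_ceiling_tokens_per_s(bandwidth, weights)
        print(f"{name}: at most ~{ceiling:.0f} tokens/s")
```

This only captures the memory-bound part of the comparison. Prompt processing is compute-bound, which is where the GFLOPS gap mentioned above matters.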
That being said, the current Mac Studios are relatively long in the tooth, having been released in 2025.
I'm still not sure any of this is really worth it because things are still changing so fast. I think there's a decent chance of a number of large AI companies going bust in the next 2-3 years such that you'll be able to buy enterprise AI hardware at cents on the dollar, a bit like how Google bought data centers in the post-dot-com crash.
But anyway, nowadays I'd be looking at the RTX 6000 Pro as the sweet spot, having anywhere from 1-4 in a single server.
The electrical issues the author mentions are interesting. I hadn't really thought about the max amperage on a residential circuit. In a DC, these would typically operate on three phase power and much higher overall amperage. I wonder if there's a device you can buy that can combine multiple residential circuits into a single power source for a server this power hungry?
I have the Macbook M5 MAX with 128 GB of RAM. I put its performance at roughly equivalent to the RTX 5070 Ti. The M3 Ultra 512 GB for me is about half the performance of the RTX 5070 Ti but obviously it has the ability to do more because of the increased memory.
I don't think anything compares to the nVidia chips at all.
I am also considering buying 3-4x RTX 6000 Pro 96GB plus some Ryzen workstation with a grant.
Is this the best general-purpose choice as of 2026 with $50k for training, fine-tuning and running large open models?
For multi-GPU, make sure you have enough PCIe lanes! That rules out consumer-grade sockets like AM5; you would need Threadripper or EPYC.
Why are these sockets "ruled out"? Pipeline/layer parallelism doesn't need high bandwidth between nodes, and tensor parallelism has middling performance unless you have very fast networking and very slow compute. It all depends on what you're doing.
You are correct that bandwidth requirements depend a lot on the exact workload, and that in specific cases it might be doable to run multiple RTX 6000 Pros on AM5. The parent mentioned workloads that are general, and broader than inference-only. In that case I would consider spending a bit extra on the motherboard to ensure that PCIe bandwidth is not an issue.
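A quick way to sanity-check the lane question: divide the CPU's usable lanes among the GPUs and round down to a standard link width. The 128-lane figure for Threadripper Pro comes from elsewhere in this thread; the 24 usable lanes for AM5 is an assumption from memory, and `reserved_lanes` is a hypothetical knob for NVMe drives or NICs. Real boards hard-wire specific slots, so treat this as a first filter and check the motherboard manual.

```python
STANDARD_WIDTHS = (16, 8, 4, 2, 1)  # PCIe link widths, widest first


def widest_uniform_link(cpu_lanes: int, gpu_count: int, reserved_lanes: int = 0) -> int:
    """Largest standard PCIe width that every GPU could get at once.

    Returns 0 if there are not enough lanes for even x1 per GPU.
    """
    available = cpu_lanes - reserved_lanes
    for width in STANDARD_WIDTHS:
        if width * gpu_count <= available:
            return width
    return 0


if __name__ == "__main__":
    platforms = {
        "Threadripper Pro (128 lanes, per this thread)": 128,
        "AM5 (assumed ~24 usable lanes)": 24,
    }
    for name, lanes in platforms.items():
        for gpus in (2, 4, 6):
            width = widest_uniform_link(lanes, gpus)
            print(f"{name}, {gpus} GPUs: x{width} each")
```

Whether the narrower links actually hurt depends on the workload, as the reply above points out: pipeline-parallel inference tolerates them far better than training with heavy inter-GPU traffic.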
You would install a 240v circuit (in the US) like for an electric clothes dryer.
Edit: I now see the author was in an apartment and couldn't do this, so I concede this is not responsive here.
That's a nice problem to have. I can't afford a $48K GPU server, even if I worked as a developer since 25 years ago, because I live in the wrong place.
Love to see what u gonna build on $48k GPU
well as i need to process medical documents i really should not use anything offsite.
privacy has a steep cost
He didn't consider the possibility of renting it out during the downtime to Vast.ai to make some money back.
That’s very cool and very expensive - I think the cadastre value of the apartment that I live in is like 35k EUR or thereabout.
I wonder how much worse just a bunch of Intel Arc B70s might have been, software fuckery aside. Ofc if I’d need to run local inference or simple fine tunes and learning stuff, I’d probably get one of the SFF options - Mac Minis and all of those Sparks or new AMD AI chips. Then again, I’m broke so go figure.
I just fork over some money every month to Anthropic, have been trying out more DeepSeek and also Mistral (their Vibe tool is surprisingly passable under WSL).
> if more powerful GPUs could help me make my work be successful just 2 months earlier than I would have with a smaller machine, then buying a more powerful server would be worth it.
Jesus I got a migraine trying to parse that
I'll buy it from you!
“Being less able to detect whether a text is AI-generated is exactly what nobody asked for — except villains.” ("—")
that adds up fast over a year.
> Because of this I got a motherboard with slow GPU interconnect. It’s good for running many small experiments in parallel (which is my main use case) but horrible for any models split across gpus.
:( you paid a professional pc builder and you weren't told this?
Don't those Ada 6000 GPUs support NVLink? I think I can even see the cover for the connectors in OP's pic.
edit: Hm, finding mixed information online on whether that's still supported or not. Apparently it was removed in workstation GPUs.
Nope, they don't support it. And afair even if they did, you would be limited to connecting only in pairs, not all 6 together
Honestly, I made the same mistake when I added a GPU to my (not $48K) existing homelab. I got an Ada 4000 for its slim form factor and low wattage, but realized after I bought it that it does not support NVLink, so I can't really effectively double it up later if I wanted to. Live and learn. I suppose you might research that a little before blowing that much money though LOL :)
Consumer motherboards can still make sense even if you leave some performance on the table. Running an actual 8x GPU server is not something you'd want to do in an apartment. Imagine the old Lucasfilm "THX" trailer where an unearthly-sounding foghorn whine rises to a sweeping crescendo at reference level, only without the decay at the end.
At the time he put this rig together, there weren't a lot of open-weight LLMs that could run well on 6x48=288 GB, so it probably wasn't a huge loss. There still aren't, really.
Right now I'm in the process of cramming Blackwell cards into an old DDR4-based Milan server, where the important thing is to be able to run large models at all. The GPU fans alone burn over 400 watts at full throttle.
Did you think about Max-Q cards? 300W and they aren't that noisy either, 14% lower perf than non-Max-Q card.
That was an option, but having decided on a true server chassis for other reasons, it made sense to use server-edition cards to take advantage of all those fans. I downclock them to 300W anyway for longevity, but it's nice to have the option to go to 600W if needed.
The server is going to live in the garage, so I'm not that concerned with noise. But I had no idea what to expect when I flipped the switch for the first time. It sounds like something out of the Book of Revelation. No way, no how could something like this be used in an inhabited area.
I wonder why using 2 PSUs resulted in having slower interconnect.
There are no specs in this blog post regarding the CPU/motherboard choice, but Threadripper Pro has had 128 PCIe lanes for some time now, so running all GPUs at full speed shouldn't be a problem if you go that route.
what is a "professional pc builder" in 2026
A guy on Facebook with more confidence and better insurance
If you split models using pipeline/layer parallelism you don't have to care about a slow interconnect, you're just slowed down a lot when running a single inference at a time as opposed to a fully pipelined minibatch. But tensor parallelism requires much faster interconnects than you could get in your average server, so I'm not sure that a different motherboard would help all that much.
> paid a professional pc builder
They did not. That's a mining rig not a workstation. It's visible from the photo and the chart showing multiple failures over a short period of time including the risers -- which are visibly very low quality -- failing twice.
You have 50K, you call a real expert like Puget Systems or Digital Storm.
The research that's presented in another article on the same site is way more interesting than the Betteridge's law article linked here. It'll be very useful in my own latest project if this research is incorporated into some model I can rent by the token!
no
You guys are nuts... I hope you're making enough money to justify this level of investment and power use (not to mention noise and heat management) in your home...
I'm just putting a 2nd hand 12gb 3060 into my lab box, but it's only for use with HA/Paperless/Plex etc type things. I don't need multi-model agentic behavior for private use.
If I did I reckon I'd be renting infrastructure rather than filling my home with that sort of gear.
There's a reason folks like these can afford this and you can't. Go on with your cheap self
Out of curiosity, did you check how much it would cost to rent a cage in a colocation space? Having to power your computer from two different outlets sounds wild.
the very last line of the article:
"If I were to do this again, I wouldn’t do a custom build like this. I would buy a standard datacenter server and rent space in a colocation center. But then I would miss saying Hi to grumbl once in a while."
Yes, I mean, he could rent a cage and run grumbl there. It doesn't have to be a standard datacenter server, even though a standard datacenter server would be better and cheaper.
A cage[0] is ~100x larger than what you need to host a single server. Many data centers will colocate by the rack unit. At others you can get a quarter or half cabinet[1]. Even at the very largest enterprise datacenters you can colocate a single cabinet.
[0]: https://static.cisco-eagle.com/images/category/WireCrafters/...
[1]: https://www.edpeurope.com/wp-content/uploads/EDP-3-Compartme...
> A cage[0] is ~100x larger than what you need to host a single server.
Yup, but i was assuming that he wanted to experiment building gpu rigs. For sure standard GPU servers are cheaper and easy to maintain. I have two lenovos, bought them used, already EOL.. was cheap and better than any custom gpu rig.. but i was pragmatic, because my goal was to put it in production, and not to research...
You could fit 10 million dollars of GPU rigs in the smallest of cages. A cage is an entire room. You don't need that to run a few servers.
The cheaper, easier solution would probably be just to get an electrician to wire up a high amperage 240V outlet just like your electric stove or dryer has, and then get a PSU that connects to that.
Would probably cost you $500-1000 depending on how difficult your home is.
The article stated that this was dismissed because the author lives in a rental apartment and he was not certain the landlord would agree to making this change.
I did not see any indication that the landlord was ever actually asked, it appeared to be the author's "sense" that any answer would be "no" from the landlord.
https://en.wikipedia.org/wiki/Betteridge's_law_of_headlines
It doesn't cover risk. If one or more gpus dies, who pays for it? If you rent, you are guaranteed to be insulated from this risk. But owning, you might not have the best return policy from the vendor. And if you are actually at fault for breaking it, they have every right to deny a return. Or if your apartment is burglarized or catches fire (possibly from overloading the circuit) you are out the entire investment.
Also a lightning strike or surge from the electric utility could fry the whole rig. Proper protection costs thousands, and even then it's not guaranteed to protect everything
> Proper protection costs thousands
Frankly that's something a landlord should provide. And there's insurance against losses from electrical issues.
Why should a domestic landlord provide you with data center-level power protection instead of just the normal household utility connection?
I'm talking about standard surge protectors. Properly installed they are enough except for direct lightning strikes, these will fry everything. But unfortunately, even in code-obsessed Germany landlords are not required to retrofit SPDs.
To protect a large electrical device investment, you would want an EMP shield whole-home SPD, in addition to an SPD right at the electrical device. The first one shields exterior surges (including non-terrestrial), but the second shields against internal surges. And yeah lightning will blast through both of them. So the best bet is probably a lightning strike detector combined with renters insurance.
In the article, he wrote "I tried to insure it under my renter’s insurance policy. They didn’t like that. I had to get business insurance to cover it.", but he didn't say how much it cost, either.
> I thought that I could not get a standard datacenter server because my apartment wouldn’t let me upgrade the circuits, so I needed to have 2 power supplies plugged into different circuits.
Why didn't they just put a higher amp breaker in the box?
It is unsafe for wires to carry more current than they are rated for, because the wires act like very low ohm resistors. At a high enough current I, they dissipate power P=I^2R, which is mostly heat, and that can melt the wires.
> Why didn't they just put a higher amp breaker in the box?
1) note the word "apartment" -- they rent, not own, and doing so not only would likely be illegal, but might also get them kicked out of the apartment.
2) Unless the wiring on the circuit drop, and all the end points are rated to handle the higher current, doing so would be an electrical code violation (and therefore trip into that "illegal" arena that might result in getting kicked out of the apartment).
Most residences are wired using the minimum size wire rated for the installed breaker (because doing so saves costs). So a 15a breaker in the box would mean 14gauge (the US NEC minimum size for 15a circuits) wiring in the walls and 15a rated outlets/switches. Installing a 20a breaker in the box would be a code violation, and in many jurisdictions also illegal.
And all the above is without considering that installing a 20a breaker on wires rated for 15a increases the fire risk tremendously if those wires are now asked to actually carry 20a for any length of time.
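To put a number on the P=I^2R point made earlier in this subthread: resistive heating scales with the square of the current, so going from 15 A to 20 A on the same wire raises the heat dissipated in the wire by about 78%. A small sketch. The resistance figure for 14-gauge copper (roughly 2.5 ohms per 1000 ft) is an approximate room-temperature value I am assuming, and the 50 ft run is hypothetical.

```python
OHMS_PER_1000FT_14AWG = 2.525  # approximate, copper at room temperature (assumed)


def wire_heat_watts(current_amps: float, one_way_length_ft: float,
                    ohms_per_1000ft: float = OHMS_PER_1000FT_14AWG) -> float:
    """Heat dissipated in the circuit wiring, P = I^2 * R.

    Current flows out on the hot conductor and back on the neutral,
    so the conductor length is twice the one-way run.
    """
    resistance = ohms_per_1000ft * (2 * one_way_length_ft) / 1000
    return current_amps ** 2 * resistance


if __name__ == "__main__":
    run_ft = 50  # hypothetical distance from panel to outlet
    heat_15 = wire_heat_watts(15, run_ft)
    heat_20 = wire_heat_watts(20, run_ft)
    print(f"15 A: ~{heat_15:.0f} W of heat in the wire")
    print(f"20 A: ~{heat_20:.0f} W of heat in the wire")
    print(f"ratio: {heat_20 / heat_15:.2f}x")
```

The ratio is independent of wire length and gauge, since it is just (20/15)^2. The absolute wattage is what depends on the assumed resistance and run length.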
"quit my FAANG job" as in they simultaneously worked in Facebook, Amazon, Apple, Nvidia and Google? Or did the op work at Netflix and is too ashamed to admit that :P