We are a 1000-2000 person company and we have probably on the order of $100M of servers and data centers and whatnot, and I think we spend about 2/3rds of that every year on power/maintenance/rent/upgrades/etc.
We don't generally trust cloud providers to meet our requirements for:
* uptime (network and machine - both because we are good at reliability [and we're willing to spend extra on it] and because we have lots of fancy redundant infrastructure that we can't rely on from cloud companies)
* latency (this is a big one)
* security, to some degree
* if something crazy is happening, that's when we need hardware, and that's when hardware is hard to get. Consider how Azure was running out of space during the last few months. It would have cost us an insane amount of money if we couldn't grow our data centers during Corona! We probably have at least 20-30% free hot capacity in our datacenters, so we can grow quickly.
We also have a number of machines with specs that would be hard to get e.g. on AWS.
We have some machines on external cloud services, but probably less than 1% of our deployed boxes.
We move a lot of bandwidth internally (tens of terabytes a day at least, hundreds some days), and I'm not sure we could do that cheaply on AWS (maybe you could).
We do use <insert big cloud provider> for backup, but that's the only thing we've thought it was economical to really use them for.
What kind of business is that? That level of spending seems insane based on headcount. At least without knowing anything about the business. Are you running public facing services or something? What does that look like as a percentage of revenue?
Infrastructure. Datacenter costs are a relatively high percentage of revenue. No public facing services - only large clients.
It sounds like you are operating a cloud.....
Sure... but at least they're not using the cloud! Just a whole bunch of servers!
There is no cloud, it's just someone else’s computer :)
Working at a medium-sized bank the data center costs were very significant.
How many employees do medium sized banks have?
This one had 600 on-site employees serving 12 million customers. Not sure how many were off-site.
Hundreds of terabytes a day is really not that much; it depends on what latency you can accept. I often run computations over datasets that are petabytes in size, just for my own needs. A big data move would be at least tens of petabytes, or more like hundreds or thousands.
Also surprised about latency, latency from what to what? Big cloud providers have excellent globally spanning networks. Long distance networking is crazy expensive, though, compared to the peanuts it costs to transfer data within a data center.
Reliability - again, not sure I buy it. Reliability is "solved" at low levels (such as data storage), most failures occur directly at service levels, regardless of whether you have the service in house or in the cloud.
The rest of your points make sense.
> Hundreds of terabytes a day is really not that much
How much would it cost to move this across boxes in EC2? I actually don't know, that's not a rhetorical question. A lot of our servers have 10-40gbit links that we saturate for minutes/hours at a time, which I suspect would be expensive without the kind of topology optimization we do in our datacenters.
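For a sense of scale, here is a back-of-the-envelope sketch of how much data one saturated link moves. The helper name and the use of decimal units (1 TB = 10^12 bytes) are assumptions made for illustration, and protocol overhead is ignored.

```python
def terabytes_moved(link_gbit_per_s, seconds):
    """Data moved over a fully saturated link, in decimal terabytes.

    Assumes 1 Gbit = 1e9 bits and 1 TB = 1e12 bytes, and ignores protocol overhead.
    """
    bytes_moved = link_gbit_per_s * 1e9 / 8 * seconds
    return bytes_moved / 1e12


one_hour = 3600
print(terabytes_moved(10, one_hour))  # 4.5
print(terabytes_moved(40, one_hour))  # 18.0
```

So a single 40gbit link saturated for an hour is already in the tens of terabytes.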
> Also surprised about latency
We've spent a surprising amount of money reducing latency :) We're not a high frequency trading firm or anything, but an extra 1ms (say) between datacenters is generally bad for us and measurably reduces performance of some systems.
> Reliability is "solved" at low levels
To whatever extent this may be true, it's certainly not true for cloud providers. One obvious example is that EC2 has "scheduled maintenance events" where they force you to reboot your box. This would cost us a lot of money (mostly in dev time, to work around it).
Also, multi-second network dropouts in big cloud datacenters are not uncommon (in my limited experience), but that would be really bad for us. We have millisecond-scale failover with 2x or 3x redundancy on important systems.
Seems you’re trying to bring your on-premise structure and concepts to the cloud, that won’t work. EC2 instances are cattle, not pets.
I believe I saw a slide that average lifespan of an EC2 instance at Netflix is 14 minutes.
I’m not necessarily saying cloud will work for your setup, but you can’t compare like that.
It seems to me like they're not trying to bring on-prem to the cloud, and that's very much the point.
Most businesses are not Netflix, where most of the work is done by their OpenConnect CDN appliances serving video content outside of cloud providers.
> How much would it cost to move this across boxes in EC2?
Nothing. You generally only pay for data going out of cloud providers. Not data going in or data being transferred within the same region.
> One obvious example is that EC2 has "scheduled maintenance events" where they force you to reboot your box. This would cost us a lot of money (mostly in dev time, to work around it).
You're not going to have a successful cloud experience unless you build your applications in a cloud suitable way. This means not all legacy applications are a good fit for the public cloud. Most companies really embracing the cloud are mitigating those risks by distributing workloads across multiple instances so you don't care if any one needs to be restarted, especially within a planned window.
> Also, multi-second network dropouts in big cloud datacenters are not uncommon (in my limited experience), but that would be really bad for us. We have millisecond-scale failover with 2x or 3x redundancy on important systems.
Are these inter-region network dropouts or between the internet and the cloud data center? You're not going to be relying on a public internet connection to the cloud for critical workloads.
All that being said, there are plenty of workloads which I don't think fit well in the cloud operating model. You may very well have one of them.
You pay for cross-AZ Traffic in AWS, and that adds up really fast.
Yes. You've got to be aware of where those boundaries are when adopting the cloud, and the cost information around these cases is inadequate at best. Too many people get surprise bills.
The first step is to stop blaming the victim.
It's nobody's fault that the billing structure at Amazon is so complicated and confusing.
Except Amazon's.
Yep. Got bitten HARD by this recently, $1.5k inter-az transfer charges that we never saw coming.
Our fault, I suppose -- but multi-az is prohibitively expensive if you need to run anything data heavy distributed.
I'm working on reducing a $50K per month bill for Inter-AZ traffic at the moment.
> but multi-az is prohibitively expensive if you need to run anything data heavy distributed.
If you communicate between your AZs via ALBs, multi-az is effectively free. Our bill is so high because within our Kubernetes cluster, our mesh isn't locality aware; it randomly routes to any available pod. 2/3rds of our traffic crosses AZs.
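A minimal sketch of why random routing gives a number like that, assuming callers and pods are spread across AZs in proportion to the pod counts. The function name and AZ labels are hypothetical; the 2/3 figure would be consistent with three evenly loaded AZs, which is an assumption rather than something stated above.

```python
from fractions import Fraction


def cross_az_fraction(pods_per_az):
    """Fraction of requests that cross an AZ boundary when each caller picks
    a destination pod uniformly at random (no locality awareness).

    Assumes callers are distributed across AZs in proportion to pod counts.
    """
    total = sum(pods_per_az.values())
    # Probability that caller and destination land in the same AZ.
    same_az = sum(Fraction(n, total) ** 2 for n in pods_per_az.values())
    return 1 - same_az


print(cross_az_fraction({"az-a": 10, "az-b": 10, "az-c": 10}))  # 2/3
print(cross_az_fraction({"az-a": 10, "az-b": 10}))              # 1/2
```

With N evenly loaded AZs the cross-AZ share is (N-1)/N, so adding AZs without locality-aware routing makes the share larger, not smaller.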
> unless you build your applications in a cloud suitable way
Right, we have not done this. We basically decided it was cheaper to keep doing the “old school” thing and not spend a bunch of dev time trying to do it in a way that supports arbitrary failure of N boxes. We just spent the money to make it unlikely our boxes or networks will fail, and if they do fail we may be sad (but it rarely happens, and has not yet happened in a catastrophic way).
> Are these inter-region network dropouts or between the internet and the cloud data center?
I’ve seen multi-second dropouts within a DC (cloud, not my current company), and multi-hour single-path failures between DCs (usually something like a fire or construction cutting a line). But all of our DCs have at least 2 physically independent routes to the internet, so it’s never taken us fully offline.
> Nothing. You generally only pay for data going out of cloud providers. Not data going in or data being transferred within the same region.
This is not true. AWS charges for all cross-AZ traffic, so if you want resilience within a region (which you mentioned in the next paragraph) you will do a lot of that cross-AZ talking, and it ends up being a non-trivial cost. You can optimize somewhat by making your apps aware of which AZ they are in, but most places don't do that; places that run everything from a single AZ are more common.
> How much would it cost to move this across boxes in EC2? I actually don't know, that's not a rhetorical question.
Data transfer between instances in the same AZ is free. If the data crosses AZs, you're charged $0.01 per GB in both directions. This is for instances on a VPC. I think the pricing model is different for classic EC2.
There are some exceptions like all traffic between EC2 and ALBs being free.
Edit: Pricing is described at https://aws.amazon.com/ec2/pricing/on-demand/#Data_Transfer
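As a rough worked example, taking the $0.01 per GB in each direction figure above at face value (it may be out of date, so check the pricing page). The function name, the decimal TB-to-GB conversion, and the `cross_az_share` parameter are assumptions for illustration.

```python
RATE_CENTS_PER_GB_EACH_DIRECTION = 1  # $0.01, the figure quoted above
GB_PER_TB = 1000                      # decimal units (assumption)


def daily_cross_az_cost_usd(tb_per_day, cross_az_share=1.0):
    """Estimated daily cross-AZ transfer charge in USD.

    Each cross-AZ gigabyte is counted twice: once leaving the source AZ
    and once entering the destination AZ.
    """
    cross_az_gb = tb_per_day * GB_PER_TB * cross_az_share
    return cross_az_gb * RATE_CENTS_PER_GB_EACH_DIRECTION * 2 / 100


print(daily_cross_az_cost_usd(100))       # 2000.0
print(daily_cross_az_cost_usd(100, 0.0))  # 0.0 (everything stays in one AZ)
```

Under these assumptions, 100 TB a day that all crosses AZs comes to about $2,000 a day, while the same volume kept inside one AZ costs nothing in transfer charges.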
And if you go for resilience, you will do a lot of that cross-AZ talking, unless you also design your apps to know the infrastructure and take it into account when communicating (most companies don't do it).
> A lot of our servers have 10-40gbit links that we saturate for minutes/hours at a time, which I suspect would be expensive without the kind of topology optimization we do in our datacenters.
I think everyone does something of this sort nowadays, that's why networking is ~free within data centers :)
> but an extra 1ms (say) between datacenters is generally bad for us and measurably reduces performance of some systems
That speaks to me. You will always be just the n-th client unless you own the cross-datacenter data links (i.e. have full autonomy on deciding the priority of the traffic). It's similar to the covid provisioning problems you had mentioned.
> One obvious example is that EC2 has "scheduled maintenance events" where they force you to reboot your box.
Yeah like others pointed out - that's just what "cloud" is, and is generally a good idea. You're supposed to handle a certain % of your machines going dark without a warning without violating any SLO (or even worse, certain % of your machines "pretending" they're up but actually being ridiculously slow for this or that reason; and don't even get me started on CPU/RAM bitflips).
It sounds to me that you run an extremely highly sensitive service, something for which paying for true ownership of the hardware just makes sense to remove those kinds of risks that most services don't care about. At the end of the day "cloud" is a shared resource, and no resource separation efforts will be 100% effective.
> One obvious example is that EC2 has "scheduled maintenance events" where they force you to reboot your box. This would cost us a lot of money (mostly in dev time, to work around it).
It's interesting that you say this, because I think this just boils down to how you treat failures. Boxes will fail. It's inevitable, even if you run your own hardware.
We treat failure as an every-day event and have set up our systems so it doesn't matter. One box fails, an automated system notices and brings up another to replace it, while the remaining boxes take slightly more load for a few minutes.
Sure, that kind of failure tolerance doesn't come for free. But, again, you're going to have failures, no matter how much work you put into reliability (which also doesn't come for free!).
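A toy sketch of that "notice and replace" loop, just to show the shape of it. The `Pool` class, the box names, and the health-check callback are all hypothetical; a real setup would lean on whatever autoscaling or orchestration tooling is in use.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class Pool:
    """Keeps a fixed number of healthy boxes by replacing any that fail."""
    target_size: int
    healthy: set = field(default_factory=set)
    _ids: itertools.count = field(default_factory=itertools.count, repr=False)

    def launch(self):
        box = f"box-{next(self._ids)}"
        self.healthy.add(box)
        return box

    def reconcile(self, is_healthy):
        """Drop boxes that fail the health check, then launch replacements."""
        failed = {box for box in self.healthy if not is_healthy(box)}
        self.healthy -= failed
        missing = self.target_size - len(self.healthy)
        replacements = [self.launch() for _ in range(missing)]
        return failed, replacements


pool = Pool(target_size=3)
for _ in range(3):
    pool.launch()

# Pretend box-1 just died; the next reconcile pass replaces it.
failed, replacements = pool.reconcile(lambda box: box != "box-1")
print(failed, replacements)  # {'box-1'} ['box-3']
```

Between the failure and the replacement coming up, the two surviving boxes each carry half the load instead of a third, which is the "slightly more load for a few minutes" part.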
Computation over petabytes of data sounds fairly expensive; jobs that I was running over close to a PB could cost hundreds of dollars. Am I misremembering, doing it wrong, or underestimating your team’s cloud budget?
> We also have a number of machines with specs that would be hard to get e.g. on AWS.
What specs are those? I was under the impression AWS has everything from extremely tiny to giant terabytes-of-ram-for-SAP instance types.
We would need something like an X1 instance in terms of RAM, but it's hard to find something on AWS that has a well-tuned balance of RAM/CPU/disk for our needs. A lot of the big specialized instances are tuned for one particular limiting factor (RAM/GPU/storage/CPU/bandwidth/whatever) and I don't recall them having a good selection for "really big everything".
Amazon is constantly expanding the selection, so it's possible they've added something since we last seriously looked into this.
Just curious, can you share general details on the types of services that need really big everything?
For most every workload I can think of, as their load increases it's always one resource in particular that's the limiting factor. (RAM/CPU/Storage/etc). So it makes sense to me that AWS instances focus on optimizing one particular resource type.
Would be interesting to hear about types of workloads that break this pattern. (Or maybe it's a few different types of workloads/services that need to be tightly integrated on one machine?)
Redlining RAM + CPU is not difficult for the right kind of database application that needs in-memory latency. RAM is a limit on your (hot, at least) data size, and CPU is a limit on your query workload.
In my experience, harder to also balance I/O so you’re close to hitting three limits, but update-heavy transactional workloads can manage it for disk-based DBMS’s.
> For most every workload I can think of, as their load increases it's always one resource in particular that's the limiting factor.
So this is a rediscovery, at least an example, of what could be called the bottleneck principle of system performance analysis and optimization! Sooo, just look for the bottleneck(s), work on those, and f'get about everything else until, say, everything is equally a bottleneck at which time have a well balanced, call it optimized, which can be appropriate, configuration!
At times, e.g., at IBM's Watson lab, there has been a lot of work on applying queueing theory, analysis, and simulation to analyzing and then optimizing such systems, but, more closely to fully true than one might guess, all that really mattered were the bottleneck(s)!
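The bottleneck idea can be sketched in a few lines: given what each request costs per resource and how much of each resource a box has, the resource that runs out first sets the throughput ceiling. All names and numbers below are made up for illustration.

```python
def bottleneck(demand_per_request, capacity_per_second):
    """Return (resource, max_requests_per_second) for the resource that
    saturates first. Both dicts must share the same resource keys."""
    limits = {
        resource: capacity_per_second[resource] / demand
        for resource, demand in demand_per_request.items()
    }
    tightest = min(limits, key=limits.get)
    return tightest, limits[tightest]


# Hypothetical box: 64 cores (64,000 CPU-ms per second), 100k IOPS, 5,000 MB/s of network.
capacity = {"cpu_ms": 64_000, "disk_ops": 100_000, "net_mb": 5_000}
# Hypothetical request: 2 ms of CPU, 4 disk operations, 0.5 MB over the network.
demand = {"cpu_ms": 2, "disk_ops": 4, "net_mb": 0.5}

print(bottleneck(demand, capacity))  # ('net_mb', 10000.0)
```

In this made-up case the network caps the box at 10,000 requests per second, so extra CPU or disk would buy nothing until that limit moves. A "well balanced" box is one where the per-resource limits come out roughly equal.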
We just launched (Oracle Cloud) an AMD Epyc 2 Compute service (called E3 on our compute page) that is the beginning of 'shapeless' or flexible computing - you can spec the cores for your shape from 1-64 and drive the balance on storage and memory. That might be a fit for you, as it's really inexpensive for the power.
It could be that they need hardware that can support less common architectures, like Solaris or AIX.
Sort of sounds like why some large users will opt to go with a wholesaler like Digital realty, or maybe even Equinix.
Without revealing too much:
What industry are you in?
Are you concentrated in 1 geo location or... across USA / across 1 country / global?
> What industry are you in
Something plumbing-y; the company is not well known
> 1 geo location or...
Global, although probably 50% in the US
Guessing here...ad-tech
100% ad tech or some Mulesoft style middleware app.
MS and Boomi run on AWS. Not those ones.
Mulesoft on AWS is a mess, though I'm not sure if that's Mule's fault, AWS's fault, or the fact that it's an integration platform and it's always a damn mess.
> because we have lots of fancy redundant infrastructure that we can't rely on from cloud companies)
Haha. This can't possibly be true.
Could a saturated network interface prioritize some other company's traffic over your own in AWS? Could the same happen in your own private network?
Consider that building fault-tolerance into their application may be very difficult, or impossible. Cloud would be incompatible with that.
Just to give a few examples, we are typically much more aggressive with RAID, ECC, redundant network links, redundant time sources, redundant cooling, redundant power supplies, etc. than cloud providers.
Cost notwithstanding (heh) and as a relative novice to the cloud world, it looks to me like there are no bounds to the level of redundancy in the big three clouds. The trick is to use cloud-native tooling vs. EC2.
Everything you mentioned here is table stakes for the major cloud providers.
I actually am surprised you would say something like that. Public cloud is infamous for their VMs being unreliable. The idea is that you should assume that VM disappears at any time, and you need to design your applications in a way to successfully handle it. It's kind of how Internet Protocol (IP) is unreliable, doesn't guarantee packet delivery or even that packets arrive in order, but TCP protocol can provide a reliable service.
I've seen on-premises machines, in places where reliability wasn't even an objective, that easily ran for 5+ years without any interruption. Now, the parent said that they actually need reliability, and you can achieve it using many technologies, starting with RAID, dual PSUs, or even hot-swappable RAM or CPUs (I remember SPARC machines allowed this). With full control of networking you can also make a standby node take over nearly instantaneously, whereas in AWS it might take a couple of minutes. You can achieve nearly any availability as long as you have enough money. In AWS you don't have any control; your only way is through designing your application in specific ways, and that still has limitations. Just take a look at RDS, where everyone would want the takeover to be instantaneous, but it usually takes a few minutes.