centimeter 6 years ago

We are a 1000-2000 person company and we have probably on the order of $100M of servers and data centers and whatnot, and I think we spend about 2/3rds of that every year on power/maintenance/rent/upgrades/etc.

We don't generally trust cloud providers to meet our requirements for:

* uptime (network and machine - both because we are good at reliability [and we're willing to spend extra on it] and because we have lots of fancy redundant infrastructure that we can't rely on from cloud companies)

* latency (this is a big one)

* security, to some degree

* if something crazy is happening, that's when we need hardware, and that's when hardware is hard to get. Consider how Azure was running out of space during the last few months. It would have cost us an insane amount of money if we couldn't grow our data centers during Corona! We probably have at least 20-30% free hot capacity in our datacenters, so we can grow quickly.

We also have a number of machines with specs that would be hard to get e.g. on AWS.

We have some machines on external cloud services, but probably less than 1% of our deployed boxes.

We move a lot of bandwidth internally (tens of terabytes a day at least, hundreds some days), and I'm not sure we could do that cheaply on AWS (maybe you could).

We do use <insert big cloud provider> for backup, but that's the only thing we've thought it was economical to really use them for.

jm4 6 years ago

What kind of business is that? That level of spending seems insane based on headcount. At least without knowing anything about the business. Are you running public facing services or something? What does that look like as a percentage of revenue?

  • centimeter 6 years ago

    Infrastructure. Datacenter costs are a relatively high percentage of revenue. No public facing services - only large clients.

    • flokie 6 years ago

      It sounds like you are operating a cloud.....

      • SamBam 6 years ago

        Sure... but at least they're not using the cloud! Just a whole bunch of servers!

        • pojzon 6 years ago

          There is no cloud, it’s just someone else’s computer :)

  • Guest42 6 years ago

    Working at a medium-sized bank the data center costs were very significant.

    • derision 6 years ago

      How many employees do medium sized banks have?

      • Guest42 6 years ago

        This one had 600 on-site employees serving 12 million customers. Not sure how many were off-site.

H8crilA 6 years ago

Hundreds of terabytes a day is really not that much, depending on what latency you can accept. I often run computations over datasets that are petabytes in size, just for my own needs. A big data move would be at least tens of petabytes, or more like hundreds or thousands.

Also surprised about latency, latency from what to what? Big cloud providers have excellent globally spanning networks. Long distance networking is crazy expensive, though, compared to the peanuts it costs to transfer data within a data center.

Reliability - again, not sure I buy it. Reliability is "solved" at low levels (such as data storage), most failures occur directly at service levels, regardless of whether you have the service in house or in the cloud.

The rest of your points make sense.

  • centimeter 6 years ago

    > Hundreds of terabytes a day is really not that much

    How much would it cost to move this across boxes in EC2? I actually don't know, that's not a rhetorical question. A lot of our servers have 10-40gbit links that we saturate for minutes/hours at a time, which I suspect would be expensive without the kind of topology optimization we do in our datacenters.
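For a rough sense of scale on the link speeds and daily volumes mentioned here, a minimal Python sketch that converts between link speed, saturation time, and data moved. It assumes decimal units (1 TB = 10^12 bytes) and a perfectly saturated link with no protocol overhead; the function names are made up for illustration.

```python
BITS_PER_BYTE = 8
BYTES_PER_TB = 10**12      # assumption: decimal terabytes
SECONDS_PER_DAY = 86_400


def tb_moved(link_gbit_per_s, seconds):
    """Terabytes moved by one fully saturated link over `seconds`."""
    bytes_moved = link_gbit_per_s * 10**9 / BITS_PER_BYTE * seconds
    return bytes_moved / BYTES_PER_TB


def average_gbit_per_s(tb_per_day):
    """Average aggregate rate needed to move `tb_per_day` within 24 hours."""
    bits_per_day = tb_per_day * BYTES_PER_TB * BITS_PER_BYTE
    return bits_per_day / SECONDS_PER_DAY / 10**9


print(tb_moved(40, 3600))        # one 40 Gbit/s link saturated for an hour
print(average_gbit_per_s(100))   # 100 TB/day expressed as a steady rate
```

Under those assumptions a single 40 Gbit/s link saturated for an hour moves about 18 TB, and 100 TB a day averages out to a bit over 9 Gbit/s in aggregate.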

    > Also surprised about latency

    We've spent a surprising amount of money reducing latency :) We're not a high frequency trading firm or anything, but an extra 1ms (say) between datacenters is generally bad for us and measurably reduces performance of some systems.

    > Reliability is "solved" at low levels

    To whatever extent this may be true, it's certainly not true for cloud providers. One obvious example is that EC2 has "scheduled maintenance events" where they force you to reboot your box. This would cost us a lot of money (mostly in dev time, to work around it).

    Also, multi-second network dropouts in big cloud datacenters are not uncommon (in my limited experience), but that would be really bad for us. We have millisecond-scale failover with 2x or 3x redundancy on important systems.

    • HatchedLake721 6 years ago

      Seems you’re trying to bring your on-premise structure and concepts to the cloud, that won’t work. EC2 instances are cattle, not pets.

      I believe I saw a slide saying that the average lifespan of an EC2 instance at Netflix is 14 minutes.

      I’m not necessarily saying cloud will work for your setup, but you can’t compare like that.

      • pgwhalen 6 years ago

        It seems to me like they're not trying to bring on-prem to the cloud, and that's very much the point.

      • toomuchtodo 6 years ago

        Most businesses are not Netflix, where most of the work is done by their OpenConnect CDN appliances serving video content outside of cloud providers.

    • tstrimple 6 years ago

      > How much would it cost to move this across boxes in EC2?

      Nothing. You generally only pay for data going out of cloud providers. Not data going in or data being transferred within the same region.

      > One obvious example is that EC2 has "scheduled maintenance events" where they force you to reboot your box. This would cost us a lot of money (mostly in dev time, to work around it).

      You're not going to have a successful cloud experience unless you build your applications in a cloud suitable way. This means not all legacy applications are a good fit for the public cloud. Most companies really embracing the cloud are mitigating those risks by distributing workloads across multiple instances so you don't care if any one needs to be restarted, especially within a planned window.

      > Also, multi-second network dropouts in big cloud datacenters are not uncommon (in my limited experience), but that would be really bad for us. We have millisecond-scale failover with 2x or 3x redundancy on important systems.

      Are these inter-region network dropouts or between the internet and the cloud data center? You're not going to be relying on a public internet connection to the cloud for critical workloads.

      All that being said, there are plenty of workloads which I don't think fit well in the cloud operating model. You may very well have one of them.

      • iampims 6 years ago

        You pay for cross-AZ Traffic in AWS, and that adds up really fast.

        • tstrimple 6 years ago

          Yes. You've got to be aware of where those boundaries are when adopting the cloud, and the cost information around these cases is inadequate at best. Too many people get surprise bills.

          • hinkley 6 years ago

            The first step is to stop blaming the victim.

            It's nobody's fault that the billing structure at Amazon is so complicated and confusing.

            Except Amazon's.

        • Wintereise 6 years ago

          Yep. Got bitten HARD by this recently, $1.5k inter-az transfer charges that we never saw coming.

          Our fault, I suppose -- but multi-az is prohibitively expensive if you need to run anything data heavy distributed.

          • resonator 6 years ago

            I'm working on reducing a $50K per month bill for Inter-AZ traffic at the moment.

            > but multi-az is prohibitively expensive if you need to run anything data heavy distributed.

            If you communicate between your AZs via ALBs, multi-az is effectively free. Our bill is so high because within our Kubernetes cluster, our mesh isn't locality aware; it randomly routes to any available pod. 2/3rds of our traffic crosses AZs.
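The "2/3rds of our traffic crosses AZs" figure is what uniform random routing would give across three zones. A minimal Python sketch of that arithmetic, plus a zone-preferring picker; it assumes backends are spread evenly across AZs, and the `pick_backend` helper and the `az` field are hypothetical, not any particular mesh's API.

```python
import random


def cross_az_fraction_random(num_azs):
    """Fraction of requests that leave the caller's AZ when backends are
    spread evenly across AZs and one is picked uniformly at random."""
    return (num_azs - 1) / num_azs


def pick_backend(caller_az, backends, rng=random):
    """Prefer a backend in the caller's AZ; fall back to any backend
    if none is available locally."""
    local = [b for b in backends if b["az"] == caller_az]
    return rng.choice(local or backends)


backends = [
    {"name": "pod-1", "az": "a"},
    {"name": "pod-2", "az": "b"},
    {"name": "pod-3", "az": "c"},
]
print(cross_az_fraction_random(3))
print(pick_backend("b", backends)["name"])
```

The fallback matters: strict same-AZ routing would trade the transfer bill for outages whenever a zone has no healthy backend.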

      • centimeter 6 years ago

        > unless you build your applications in a cloud suitable way

        Right, we have not done this. We basically decided it was cheaper to keep doing the “old school” thing and not spend a bunch of dev time trying to do it in a way that supports arbitrary failure of N boxes. We just spent the money to make it unlikely our boxes or networks will fail, and if they do fail we may be sad (but it rarely happens, and has not yet happened in a catastrophic way).

        > Are these inter-region network dropouts or between the internet and the cloud data center?

        I’ve seen multi-second dropouts within a DC (cloud, not my current company), and multi-hour single-path failures between DCs (usually something like a fire or construction cutting a line). But all of our DCs have at least 2 physically independent routes to the internet, so it’s never taken us fully offline.

      • takeda 6 years ago

        > Nothing. You generally only pay for data going out of cloud providers. Not data going in or data being transferred within the same region.

        This is not true. AWS charges for all cross-AZ traffic, so if you want resilience within a region (which you mentioned in the next paragraph) you will do a lot of that talking, and this ends up being a non-trivial cost. You can do some optimization, though, by making your apps aware of the AZs they are in, but most places don't do that; more common are places that run everything from a single AZ.

    • resonator 6 years ago

      > How much would it cost to move this across boxes in EC2? I actually don't know, that's not a rhetorical question.

      Data transfer between instances in the same AZ is free. If the data crosses AZs, you're charged $0.01 per GB in both directions. This is for instances on a VPC. I think the pricing model is different for classic EC2.

      There are some exceptions like all traffic between EC2 and ALBs being free.

      Edit: Pricing is described at https://aws.amazon.com/ec2/pricing/on-demand/#Data_Transfer
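To put the $0.01 per GB figure next to the "tens to hundreds of terabytes a day" volume from upthread, a back-of-the-envelope Python sketch. It assumes the rate quoted in this comment, reads "both directions" as $0.01 out plus $0.01 in per GB, uses decimal units (1 TB = 1000 GB), and takes a 30-day month; the function name and the cross-AZ fraction parameter are illustrative, and actual pricing should be checked against the page linked above.

```python
GB_PER_TB = 1000  # assumption: decimal units


def cross_az_cost_usd(tb_per_day, cross_az_fraction=1.0,
                      usd_per_gb_each_direction=0.01, days=30):
    """Estimated cross-AZ transfer cost over `days` days."""
    gb_crossing = tb_per_day * GB_PER_TB * cross_az_fraction * days
    # Billed once leaving the source AZ and once entering the destination AZ.
    return gb_crossing * usd_per_gb_each_direction * 2


for tb in (10, 100):
    print(f"{tb} TB/day, all cross-AZ: ${cross_az_cost_usd(tb):,.0f} per 30 days")
```

Under those assumptions, 100 TB a day crossing AZs comes to roughly $60,000 a month, and nothing if the same traffic stays inside one AZ.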

      • takeda 6 years ago

        and if you go for resilience you will do a lot of that cross AZ talking, unless you also design your apps to know the infrastructure and consider that when communicating (most companies don't do it)

    • H8crilA 6 years ago

      > A lot of our servers have 10-40gbit links that we saturate for minutes/hours at a time, which I suspect would be expensive without the kind of topology optimization we do in our datacenters.

      I think everyone does something of this sort nowadays, that's why networking is ~free within data centers :)

      > but an extra 1ms (say) between datacenters is generally bad for us and measurably reduces performance of some systems

      That speaks to me. You will always be just the n-th client unless you own the cross-datacenter data links (i.e. have full autonomy on deciding the priority of the traffic). It's similar to the covid provisioning problems you had mentioned.

      > One obvious example is that EC2 has "scheduled maintenance events" where they force you to reboot your box.

      Yeah like others pointed out - that's just what "cloud" is, and is generally a good idea. You're supposed to handle a certain % of your machines going dark without a warning without violating any SLO (or even worse, certain % of your machines "pretending" they're up but actually being ridiculously slow for this or that reason; and don't even get me started on CPU/RAM bitflips).

      It sounds to me that you run an extremely highly sensitive service, something for which paying for true ownership of the hardware just makes sense to remove those kinds of risks that most services don't care about. At the end of the day "cloud" is a shared resource, and no resource separation efforts will be 100% effective.

    • kelnos 6 years ago

      > One obvious example is that EC2 has "scheduled maintenance events" where they force you to reboot your box. This would cost us a lot of money (mostly in dev time, to work around it).

      It's interesting that you say this, because I think this just boils down to how you treat failures. Boxes will fail. It's inevitable, even if you run your own hardware.

      We treat failure as an every-day event and have set up our systems so it doesn't matter. One box fails, an automated system notices and brings up another to replace it, while the remaining boxes take slightly more load for a few minutes.

      Sure, that kind of failure tolerance doesn't come for free. But, again, you're going to have failures, no matter how much work you put into reliability (which also doesn't come for free!).
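A minimal Python sketch of the two quantities behind "bring up another box while the rest take slightly more load". It assumes load is spread evenly across identical boxes; both function names are hypothetical, not part of any orchestration tool.

```python
def replacements_needed(desired, healthy):
    """How many boxes an automated system should launch to get back
    to the desired count."""
    return max(0, desired - healthy)


def load_increase_factor(total_boxes, failed_boxes):
    """Factor by which per-box load grows while failed boxes are
    missing, assuming load is spread evenly over the survivors."""
    survivors = total_boxes - failed_boxes
    if survivors <= 0:
        raise ValueError("no surviving boxes to take the load")
    return total_boxes / survivors


print(replacements_needed(desired=10, healthy=9))
print(load_increase_factor(total_boxes=10, failed_boxes=1))
```

With ten boxes and one failure, each survivor carries about 11% more load until the replacement arrives, which is why this approach needs some headroom on every box.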

  • runT1ME 6 years ago

    Computation over petabytes of data sounds fairly expensive; jobs that I was running over close to a PB could cost hundreds of dollars. Am I misremembering, doing it wrong, or underestimating your team’s cloud budget?

sneak 6 years ago

> We also have a number of machines with specs that would be hard to get e.g. on AWS.

What specs are those? I was under the impression AWS has everything from extremely tiny to giant terabytes-of-ram-for-SAP instance types.

  • centimeter 6 years ago

    We would need something like an X1 instance in terms of RAM, but it's hard to find something on AWS that has a well-tuned balance of RAM/CPU/disk for our needs. A lot of the big specialized instances are tuned for one particular limiting factor (RAM/GPU/storage/CPU/bandwidth/whatever) and I don't recall them having a good selection for "really big everything".

    Amazon is constantly expanding the selection, so it's possible they've added something since we last seriously looked into this.

    • varenc 6 years ago

      Just curious, can you share general details on the types of services that need really big everything?

      For most every workload I can think of, as their load increases it's always one resource in particular that's the limiting factor. (RAM/CPU/Storage/etc). So it makes sense to me that AWS instances focus on optimizing one particular resource type.

      Would be interesting to hear about types of workloads that break this pattern. (Or maybe it's a few different types of workloads/services that need to be tightly integrated on one machine?)

      • twoodfin 6 years ago

        Redlining RAM + CPU is not difficult for the right kind of database application that needs in-memory latency. RAM is a limit on your (hot, at least) data size, and CPU is a limit on your query workload.

        In my experience, harder to also balance I/O so you’re close to hitting three limits, but update-heavy transactional workloads can manage it for disk-based DBMS’s.

      • graycat 6 years ago

        > For most every workload I can think of, as their load increases it's always one resource in particular that's the limiting factor.

        So this is a rediscovery, at least an example, of what could be called the bottleneck principle of system performance analysis and optimization! Sooo, just look for the bottleneck(s), work on those, and f'get about everything else until, say, everything is equally a bottleneck at which time have a well balanced, call it optimized, which can be appropriate, configuration!

        At times, e.g., at IBM's Watson lab, there has been a lot of work on applying queueing theory, analysis, and simulation to analyzing and then optimizing such systems, but, more closely to fully true than one might guess, all that really mattered were the bottleneck(s)!

    • hey_ross 6 years ago

      We just launched (Oracle Cloud) an AMD Epyc 2 Compute service (called E3 on our compute page) that is the beginning of 'shapeless' or flexible computing - you can spec the cores for your shape from 1-64 and drive the balance on storage and memory. That might be a fit for you, as it's really inexpensive for the power.

  • frakkingcylons 6 years ago

    It could be that they need hardware that can support less common platforms, like Solaris or AIX.

hkmurakami 6 years ago

Sort of sounds like why some large users will opt to go with a wholesaler like Digital Realty, or maybe even Equinix.

willart4food 6 years ago

Without revealing too much:

What industry are you in?

Are you concentrated in 1 geo location or... across USA / across 1 country / global?

  • centimeter 6 years ago

    > What industry are you in

    Something plumbing-y; the company is not well known

    > 1 geo location or...

    Global, although probably 50% in the US

    • philg_jr 6 years ago

      Guessing here...ad-tech

      • renewiltord 6 years ago

        100% ad tech or some Mulesoft style middleware app.

        • Muley 6 years ago

          MS and Boomi run on AWS. Not those ones.

          • blaser-waffle 6 years ago

            Mulesoft on AWS is a mess, though I'm not sure if that's Mule's fault, AWS's fault, or the fact that it's an integration platform and it's always a damn mess.

bradhe 6 years ago

> because we have lots of fancy redundant infrastructure that we can't rely on from cloud companies)

Haha. This can't possibly be true.

  • winkeltripel 6 years ago

    Could a saturated network interface prioritize some other company's traffic over your own in AWS? Could the same happen in your own private network?

    Consider that building fault-tolerance into their application may be very difficult, or impossible. Cloud would be incompatible with that.

  • centimeter 6 years ago

    Just to give a few examples, we are typically much more aggressive with RAID, ECC, redundant network links, redundant time sources, redundant cooling, redundant power supplies, etc. than cloud providers.

    • unethical_ban 6 years ago

      Cost notwithstanding (heh) and as a relative novice to the cloud world, it looks to me like there are no bounds to the level of redundancy in the big three clouds. The trick is to use cloud-native tooling vs. EC2.

    • joshuamorton 6 years ago

      Everything you mentioned here is table stakes for the major cloud providers.

  • takeda 6 years ago

    I actually am surprised you would say something like that. Public cloud is infamous for its VMs being unreliable. The idea is that you should assume a VM can disappear at any time, and you need to design your applications to handle that successfully. It's kind of like how the Internet Protocol (IP) is unreliable and doesn't guarantee packet delivery, or even that packets arrive in order, yet TCP can provide a reliable service on top of it.

    I've seen on-premises machines, where reliability wasn't even an objective, that easily ran for 5+ years without any interruption. Now, the parent said that they actually need reliability, and you can achieve it using many technologies, starting with RAID, dual PSUs, or even hot-swappable RAM or CPUs (I remember SPARC machines allowed this). With full control of networking you can also make a standby node take over nearly instantaneously, whereas in AWS it might take a couple of minutes. You can achieve nearly any availability as long as you have enough money. In AWS you don't have that control; your only option is to design your application in specific ways, and that still has limitations. Just take a look at RDS: everyone would want failover to be instantaneous, but it usually takes a few minutes.
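The "nearly any availability as long as you have enough money" point, and the 2x or 3x redundancy mentioned upthread, can be sketched with the textbook formula for redundant replicas. This is a minimal Python sketch that assumes replica failures are independent and failover is instantaneous, which real systems only approximate; the 99% figure is a made-up input, not a number from this thread.

```python
def combined_availability(single, replicas):
    """Availability of a service that is up while at least one of
    `replicas` independent copies is up."""
    return 1 - (1 - single) ** replicas


def downtime_minutes_per_year(availability):
    """Expected downtime in minutes over a 365-day year."""
    return (1 - availability) * 365 * 24 * 60


for n in (1, 2, 3):
    a = combined_availability(0.99, n)
    print(n, round(a, 6), round(downtime_minutes_per_year(a), 2))
```

Slow failover eats into this quickly: if a takeover takes minutes rather than milliseconds, every primary failure adds those minutes to the downtime regardless of how many replicas exist.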