University research group here.
Simply, _cost_
Our compute servers crunch numbers and data at > 80% util.
Our servers are optimized for the work we have.
They run 24/7 picking jobs from the queue. Cloud burst is often irrelevant here.
They deal with Terabytes or even Petabytes of moving data. I’d cry paying for bandwidth costs if charged €/GB.
A sysadmin (yours truly) would be needed even if it all ran in the cloud.
We run our machines beyond 4 years if they are still fit for purpose.
We control the infra and data. So, a little more peace and self-reliance.
No surprise bills because some bot pounded on an S3 dataset.
Our heavy users are connected to the machines at a single hop :-) No need to go across WAN for work.
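To put a rough number on the bandwidth point, here is a minimal sketch of flat per-GB pricing. The rate below is a made-up placeholder, not any provider's actual price; plug in whatever your provider charges.

```python
def transfer_cost_eur(data_tb: float, eur_per_gb: float) -> float:
    """Cost of moving data_tb terabytes at a flat per-GB rate (1 TB = 1000 GB)."""
    return data_tb * 1000 * eur_per_gb


# HYPOTHETICAL rate, chosen only to make the arithmetic concrete.
ASSUMED_EUR_PER_GB = 0.05

for tb in (1, 100, 1000):  # 1000 TB = 1 PB
    cost = transfer_cost_eur(tb, ASSUMED_EUR_PER_GB)
    print(f"{tb:>5} TB -> {cost:,.0f} EUR")
```

The cost grows linearly with volume, so anything that is merely annoying at terabyte scale is a thousand times worse at petabyte scale.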
In Germany it's pretty common for universities to have some servers of their own.
1. Their use case is kind of different. The servers mostly run heavy CS research workloads. E.g. they might have heavy CPU load and heavy traffic between their servers, but they less often have heavy traffic to the "normal internet" (if they do have heavy traffic to the outside, it's normally to other research institutes, which quite often have dedicated wire connections).
2. They might run target-specific, optimized CPU- or GPU-heavy compute tasks that go on for weeks at a time. This is really expensive in the cloud, which is mostly focused on things like web services.
3. When the research groups aren't running such tasks, they want to let their juniors run their research tasks "for free", which wouldn't work with a payment model like the cloud's.
4. They don't want to rely on some external company.
Also, I'm not sure there even are (affordable) cloud systems with a compatible spec (like 4+ TB of RAM; I'm not kidding, this is a requirement for some kinds of tasks, or they will take way too long and require additional complexity from special data structures that support partially offline data in the right way, which can be very costly in dev time).
AWS has systems with up to 24TiB of RAM (u-24tb1.metal) [0], but the pricing is "call us". For almost 4TiB of RAM (x1e.32xlarge), it's about $28/hr [1]
[0] https://aws.amazon.com/about-aws/whats-new/2019/10/now-avail... [1] https://aws.amazon.com/ec2/pricing/on-demand/
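For a sense of scale, a quick back-of-the-envelope sketch using the roughly $28/hr figure quoted above. It assumes plain on-demand pricing with the instance running the whole time; reserved or negotiated pricing would change the numbers.

```python
HOURLY_USD = 28.0  # approximate on-demand price quoted above for x1e.32xlarge


def on_demand_cost_usd(hours: float, hourly_usd: float = HOURLY_USD) -> float:
    """Total cost of keeping one instance up for the given number of hours."""
    return hours * hourly_usd


print(f"one week:  ${on_demand_cost_usd(7 * 24):,.0f}")
print(f"two weeks: ${on_demand_cost_usd(14 * 24):,.0f}")
print(f"one year:  ${on_demand_cost_usd(365 * 24):,.0f}")
```

A single job running "for weeks at a time" lands in the thousands of dollars, and a machine kept busy all year is in the hundreds of thousands.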
For a researcher, that means running an experiment changes from "running the experiment" to six months of meetings with higher management establishing business cases, budget justifications, getting competitive quotes from different providers, etc. (in other words it just doesn't happen).
The choice is not between "on premises" and "cloud", it's between "on premises" and "our glossy faculty brochure says you can do this but really you can't".
This struck a nerve with me. I was a SWE at a company that moved to the cloud last year, after having a couple racks at a local colo previously. I was able to experiment and iterate so much faster when we owned hardware, despite all the cloud's elasticity. After we migrated, everything was request, wait, explain, request, wait...
I know some organizations handle this really well, but when mine didn't, it sucked to be a developer there.
It's not just CS. The computational chemistry and materials science crystallography folks can have jobs that run for days or weeks too.
I'm at a center for computational biology - our genomics guys have been known to use 90% of our university's HPC capacity ;-) My own work (ecological modelling) is not as heavy, but when I run a full experiment, that takes a 32 core machine about two weeks to complete.
UCSD here. The CS department building has a large room off in the back filled with cages where we rack up our own machines. There's an internal process by which you can request VM allocations, which ultimately also run on top of a set of on-prem servers. So we have both bare metal and a kind of mini managed cloud to work with, without having to go through purchasing approval just to run a fuzzer or do a web crawl.
It’s extremely common for academic research groups to have their own hardware. Often, it doesn’t make sense from an overhead/maintenance point of view. If you’re a single lab large enough to need dedicated compute, you often don’t have the budget for a good admin. If you’re part of a larger group (like it sounds like you are) it is easier and you can afford some of the extra overhead.
But, I’ve found that there are two reasons why there is still a lot of on-premise academic compute: 1) legal, 2) cap-ex.
For the first, there are still many agreements for data sharing that require strict security compliance. It is certainly possible for this to work on the cloud, but it is more difficult than setting up a hardened cluster that’s not exposed to the internet. IT is normally better setup to audit and approve these on-premises systems than cloud systems.
For the second, it is difficult to estimate and write in all of the cloud costs for a grant budget. Trying to manage op-ex can be more of a hassle than just budgeting X amount for “servers” (as a cap-ex). Often, you just care about getting the job done, but less about how long it takes. So, a set capital expense has lower risk than underestimating your cloud needs and not being able to finish your project. (Also, as a bonus, once the equipment is paid for, you get to keep it to use on the next project, or unfunded research).
When I was in academia I was stuck using the HPC facilities at my institution. I was lucky to get about 30 cores, and my jobs took many months to get out of the queue and run (overall I used about a decade of CPU time).
I'm quite sure that 2 months of my time waiting for results (and the impact on colleagues occupying their HPC cluster) was worth the $9k or so it would have cost to rent enough VMs to get my results in about 4 hours or so instead of 2 months.
That's not counting the hours of machine and my time wasted building enough support for the awful queueing and execution environment it used which was almost impossible to mock up locally.
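The figures in this comment can be sanity-checked with a bit of arithmetic. This is only a sketch: it reads "a decade of CPU time" as ten core-years, ignores leap days, and assumes the workload parallelizes perfectly.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap days


def core_hours_for_years(cpu_years: float) -> float:
    """Convert CPU time in core-years to core-hours."""
    return cpu_years * HOURS_PER_YEAR


def cores_needed(core_hours: float, wall_hours: float) -> float:
    """Cores required to finish in wall_hours, assuming perfect scaling."""
    return core_hours / wall_hours


total = core_hours_for_years(10)        # "a decade of CPU time"
print(f"core-hours:          {total:,.0f}")
print(f"cores for 4 hours:   {cores_needed(total, 4):,.0f}")
print(f"implied $/core-hour: {9000 / total:.3f}")
```

So the "$9k in 4 hours" scenario implies renting on the order of twenty thousand cores at about ten cents per core-hour.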
Can you share the architecture and stack?
What form do these jobs have?
How do you manage workloads?
How do you manage resources? Do users have quota for compute and storage?
Do you use GPUs? If so, how do you deal with malfunction?
Is this a distributed processing? Are the machines heterogeneous? What do you use for that cluster?
What if a job requires dependencies? Do you create a compute environment on the fly or do all jobs have the same dependencies and these don't change much?
How do you do data governance? Is the data read only? Do you have an API to fetch the data from the job code?
I wrote about it here:
https://aravindh.net/post/sysadmin/
> What form do these jobs have?
Mostly batch jobs written as bash scripts. Occasionally, some users run singularity containers. But, all through SLURM.
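For readers who have not seen one, a batch job of this kind is a bash script with `#SBATCH` directives at the top. Below is a minimal sketch that renders such a script; the job name, command, and resource numbers are made up for illustration and are not taken from this cluster.

```python
def render_sbatch_script(job_name: str, command: str, cpus: int = 4,
                         mem: str = "16G", time_limit: str = "24:00:00") -> str:
    """Return the text of a single-node SLURM batch script."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        "#SBATCH --nodes=1",                # one node per job
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem}",
        f"#SBATCH --time={time_limit}",
        "set -euo pipefail",                # stop the script on the first error
        command,
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job: the script name and argument are placeholders.
print(render_sbatch_script("align", "./run_alignment.sh sample01"))
```

Saved to a file, the output would be submitted with `sbatch`. For the container case, the command line could instead be something like `singularity exec image.sif ./tool`.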
> How do you manage workloads?
SLURM
> How do you manage resources?
As a sysadmin, my inventory is via Ansible. All activity on servers happens via Ansible only.
> Do users have quota for compute and storage?
Yes. Users can typically run 72 cores and 512 GiB of memory at a time. The rest is queued until resources are released. Disk quota applies only to home directories: 400 GiB (enforced by ZFS refquota).
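The per-user cap can be pictured with a toy model like the one below. This is only an illustration of the rule as described (72 cores and 512 GiB per user, anything beyond that waits); it is not how SLURM implements limits, and the `Job` type and user names are invented for the example.

```python
from dataclasses import dataclass

MAX_CORES_PER_USER = 72
MAX_MEM_GIB_PER_USER = 512


@dataclass(frozen=True)
class Job:
    user: str
    cores: int
    mem_gib: int


def can_start(job: Job, running: list[Job]) -> bool:
    """True if starting `job` keeps its owner within both per-user caps."""
    mine = [j for j in running if j.user == job.user]
    cores = sum(j.cores for j in mine) + job.cores
    mem = sum(j.mem_gib for j in mine) + job.mem_gib
    return cores <= MAX_CORES_PER_USER and mem <= MAX_MEM_GIB_PER_USER


running = [Job("alice", 48, 256)]
print(can_start(Job("alice", 24, 256), running))  # fits exactly under both caps
print(can_start(Job("alice", 32, 64), running))   # would exceed the core cap
print(can_start(Job("bob", 72, 512), running))    # other users are unaffected
```

A job that fails the check is not rejected; it simply stays queued until enough of that user's running jobs finish.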
> Do you use GPUs? If so, how do you deal with malfunction?
No. Weirdly, our workloads (genetics and genomics) don’t fit GPUs very well, as they are sparse matrix walks with wide precision floats. But we plan to try them for some other stuff soon.
> Is this a distributed processing?
No. Jobs run one node at a time.
> Are the machines heterogeneous?
Yes, Intel-based servers all the way from Haswell to Cascade Lake.
> What do you use for that cluster?
SLURM
> What if a job requires dependencies?
Taken care of by SLURM.
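Assuming "dependencies" here means one job waiting on another, SLURM expresses that with `sbatch --dependency`. A small sketch that only builds the command line (it does not submit anything); the script names and job ID are placeholders.

```python
def sbatch_argv(script: str, after_ok: tuple[int, ...] = ()) -> list[str]:
    """Build an sbatch command; with after_ok, the job starts only after
    the listed job IDs have finished with exit code zero."""
    argv = ["sbatch", "--parsable"]  # --parsable prints just the job ID
    if after_ok:
        ids = ":".join(str(job_id) for job_id in after_ok)
        argv.append(f"--dependency=afterok:{ids}")
    argv.append(script)
    return argv


print(sbatch_argv("step1.sh"))
print(sbatch_argv("step2.sh", after_ok=(1234,)))  # 1234 is a made-up job ID
```

In a real pipeline the job ID printed by the first submission would be captured and passed as `after_ok` for the second.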
> Do you create a compute environment on the fly or do all jobs have the same dependencies and these don't change much?
I guess you mean the software libraries and tools that jobs use? If so, our central software repo is NFS mounted on all compute nodes. Users can install things they need if admin privileges are not needed. If they are, it's either Singularity or an email to me.
> How do you do data governance?
This is the painful and human oriented task. We lock down data transport to outside world, educate the users about data policies and then spend a lot of time looking at data flows with hope.
> Is the data read only? Do you have an API to fetch the data from the job code?
Sorry, I could not understand this question. Do you mean metadata about a job?
Great article, thanks for sharing :)
I used to work in an environment not dissimilar from what's described above. The uni has some hardware details here: https://in.nau.edu/hpc/details/. Many of my answers would be similar, although the presence of physics, astronomy, and a few other areas apparently motivates GPUs these days. I was doing genomics workloads in the 900-1200 GB RAM range with 30+ cores.
There's a pretty interesting NSF-wide project for managing clusters in a more commoditized way as part of https://www.xsede.org. You might describe it as a "private heterogeneous almost-cloud" in its goals? That might be saying a bit much.
Density, latency, storage throughput, etc. were in favor of DIY (plus the pricey professionals to run it) rather than cloud offerings. When I was there (2016) I did some basic math for being able to use a cloud provider for some of our lighter workloads when the local cluster was loaded down. Astronomical without a contract, which is quite the thing to set up, etc.
Worth noting that while they do alright, NAU (first link) is hardly a top-tier university with bleeding edge technical requirements.
I work in a kind of similar environment, though a bigger one.
To manage runtime dependencies we use cvmfs.
It's a read-only distributed filesystem, kind of like a very specific NFS for software distribution.
It has never been a problem; cvmfs is quite stable and widely used in our field.
Fascinating blog post of yours, thanks for sharing it!
Yeah, cloud dev-ops and HPC are so similar and still oh-so different (I run a smaller HPC cluster for genomics). When reading these questions, someone with an HPC background would probably already know the answers (or just accept SLURM as an answer for everything).
With respect to the original questions -- I've investigated running a full SLURM cluster on the cloud and it just never seems worth the effort. For larger clusters than mine, maybe, but then when you start to hit the levels you're talking about, I just don't see the point in moving to the cloud. It would be hard to hit that sweet spot where the costs would make sense. Amazon even has a series on scaling a SLURM cluster with AWS EC2 provisioning, but it just seemed like more work than was justified for us.
See reply above[^0]. I'm interested in Jupyter cell level job granularity for "MLOps".
[^0]: https://news.ycombinator.com/item?id=23129417
The article says you're in Aarhus, so I assume you work at Aarhus University? I'm a student from Denmark myself, and I'm curious: which departments mainly need the compute power?
Thank you for the reply and the post. We're running JupyterLab for some students, and when several of them are training models, they consume GPU memory.
Now, we thought of several ways of solving this. We have a branch that uses Kubernetes but this is not the one that is deployed.
We know about SLURM, and it is something we will support for coarse granularity, i.e. notebook-level jobs. This is a common scenario and we'll need it for our AppBook (we allow users to turn a notebook into an application with one click and we automatically generate the form fields for the features, and the API endpoint for the model).
However, I'm interested in finer granularity computation; as in: I want a job for the cell. This involves looking into Jupyter kernels to circumvent the front-end<-->kernel disconnection (we'll have to do that for another use case in low internet bandwidth scenario).
>Sorry, I could not understand this question. Do you mean metadata about A job?
How do your users upload data and manage data access, and is the data mutable and versioned? What version of the code ran on which version of the data, who changed that data, etc.
How do users access the data from a Jupyter notebook? Is the data in different object storages? Do you proxy the requests and handle access, things like that.
Interesting. How many machines do you have, roughly? Are you the only sysadmin for this? Also, do you have any public facing services hosted from that infrastructure?
I am the only one administering this cluster.
We have approximately 1000 cores across 45 servers.
A few public facing services - web servers, small APIs, a web front end tool for a big genetics database. Nothing big.
Public services are cordoned off from the compute cluster.
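Some rough arithmetic on what those numbers add up to. This sketch assumes the ">80% util" figure from the top of the thread applies to this same cluster and that the 72-core per-user cap mentioned earlier is the only limit; both are assumptions.

```python
CORES = 1000
SERVERS = 45
UTILISATION = 0.80         # lower bound quoted at the top of the thread
HOURS_PER_YEAR = 365 * 24  # 8760
PER_USER_CORE_CAP = 72


def avg_cores_per_server(cores: int, servers: int) -> float:
    return cores / servers


def busy_core_hours_per_year(cores: int, utilisation: float) -> float:
    """Core-hours actually consumed per year at the given utilisation."""
    return cores * utilisation * HOURS_PER_YEAR


print(f"avg cores per server: {avg_cores_per_server(CORES, SERVERS):.1f}")
print(f"busy core-hours/year: {busy_core_hours_per_year(CORES, UTILISATION):,.0f}")
print(f"users at full cap:    {CORES // PER_USER_CORE_CAP}")
```

That is roughly seven million core-hours a year, which is the figure to multiply by a per-core-hour cloud price when comparing costs.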
We had a similar setup at our University as well. Execs decided we needed to be a part of school-wide clusters though, and are now trying to retire even that computing hardware and push us to the cloud without covering any of the costs.