Less than 24 hours later [0], another GitHub incident. This time it is Actions going down once again.
Really, you would get more uptime if you self-hosted with GitLab, Gitea, etc. Even open-source projects like WireGuard, RedoxOS, ReactOS, GNU Ring, and many others are doing this with no issues.
Centralizing on GitHub [1] really isn't a good idea, and it has been showing for years that it is beyond unreliable.
GitHub is going great, and it has never been better. /s
My experience is that it is usually OK to set up these kinds of services. It takes a few days, and it is interesting and fun work to get new stuff going and tuned.
Problems arise with the operations. Nobody is confident doing the updates since they happen so rarely. If the system is critical, do you dare to do updates during business hours? Is there a proper test system to practice updates? Is anybody prepared to handle recovery scenarios? Who is testing backups? How do you transfer the knowledge when the people who originally built the system leave?
I run a Gitea instance on a moderate SBC backed by a NAS over iSCSI.
The SBC runs updates weekly, rebooting as needed as per zypper ps. It takes about 2 minutes for a reboot cycle, and reboots about 50% of the time. During this, the service is hard down.
The NAS self-updates roughly monthly, and it takes about an hour. The service is stopped during this time.
This gives me a "perfect hardware" uptime of about 99.8%.
I've had a single hardware failure in the last year, but it had the service down for 3 days waiting on shipping for a part, which gives an empirical hardware uptime of 99.17%.
I've not had any major maintenance on this setup in the last few years, though I expect to replace the NAS's drives in stages over the next few weeks. However, as a best case, I'll give it 100%.
My internet has not gone out in 2.5 years, so I am going to give it an optimistic 100%.
My power, however, has been out for a total of 30 hours, of which 28 hours outlasted my UPS, measured from the syslog entry where NUT calls shutdown until the first timestamp on reboot. That gives me a Grid+UPS power availability of 99.68%.
If we compose all these, we have a total uptime of 98.65% across all services, which is far less reliable for basic operations (git CLI interactions) than any of Git(Lab|Hub), Bitbucket, SourceHut.
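For anyone who wants to check the composition, here is a minimal sketch. It assumes the components fail independently and that the service needs all of them at once, so the availabilities simply multiply. The dictionary labels and the function name are made up for illustration; the figures are the ones quoted above.

```python
# Availability figures as quoted in the comment above, as fractions.
# The labels are illustrative names, not anything from a real tool.
components = {
    "scheduled updates": 0.998,
    "hardware failure": 0.9917,
    "major maintenance": 1.0,
    "internet": 1.0,
    "grid + UPS power": 0.9968,
}

HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def compose_availability(values):
    """Multiply availabilities, assuming independent serial dependencies."""
    total = 1.0
    for value in values:
        total *= value
    return total


total = compose_availability(components.values())
downtime_hours = (1.0 - total) * HOURS_PER_YEAR

print(f"composed availability: {total * 100:.2f}%")
print(f"expected downtime: {downtime_hours:.0f} hours/year")
```

This reproduces the 98.65% figure, which works out to roughly 118 hours of full outage per year under those assumptions.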
This only covers cases where all services are down; my single-system setup doesn't often partially fail, and isn't engineered to hide maintenance (doing so would be far out of scope and budget). Large services tend to fail partially, as they are composed of many individual systems with varied duties, and the system is often designed to fall back to basic function in the event of failure.
Not to mention the question of capacity: my 4c/8t of amd64 comes nowhere close to the CDN-ish power of the major SaaSes in this space.
I would like to see real stats on this. Most people don't properly quantify availability and assume just because a big system goes down a lot that one they run could be more reliable, but it's usually apples and oranges.
I think even if the percentage uptime is lower on a self-hosted one, at least you get more control over when to upgrade systems (perhaps in the evenings / weekends), with less impact on the devs.
From an SLA uptime perspective, it’s also worth considering that you don’t have a whole team working to fix your self-hosted GitLab server when it goes down. So one outage overnight of your self-hosted server could be more downtime than more frequent, shorter GitHub outages.
This is kind of a straw man, don’t you think? For really small startups, sure, but I’m sure many places that self-host have an ops team that can and does respond to outages on their systems.
Your example also suggests another factor: downtime overnight might be less consequential than shorter outages that occur during working hours.
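To make that distinction concrete, here is a rough sketch that counts only the outage minutes falling inside working hours. The 09:00 to 17:00 weekday window, the function name, and both example outages are hypothetical, chosen purely for illustration.

```python
from datetime import datetime, timedelta


def working_minutes_lost(start, end, day_start=9, day_end=17):
    """Count outage minutes on weekdays between day_start and day_end hours.

    The working-hours window is an assumption; adjust it for a real team.
    """
    lost = 0
    current = start
    while current < end:
        # weekday() is 0..4 for Monday..Friday
        if current.weekday() < 5 and day_start <= current.hour < day_end:
            lost += 1
        current += timedelta(minutes=1)
    return lost


# Hypothetical 8-hour overnight outage: Tuesday 22:00 to Wednesday 06:00.
overnight = working_minutes_lost(
    datetime(2024, 1, 2, 22, 0), datetime(2024, 1, 3, 6, 0)
)

# Hypothetical 30-minute outage in the middle of a Wednesday afternoon.
daytime = working_minutes_lost(
    datetime(2024, 1, 3, 14, 0), datetime(2024, 1, 3, 14, 30)
)

print("overnight outage, working minutes lost:", overnight)
print("daytime outage, working minutes lost:", daytime)
```

By raw duration the overnight outage is sixteen times longer, but under this weighting it costs the team nothing while the short daytime one costs 30 minutes.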
Have you ever troubleshot an outage with a whole team? I strongly believe that having more than two people working on an outage makes it last longer.
Properly run incident management can have three dozen people involved with no negative impact. You need the incident manager to coordinate, communicate, run interference. You need a clear set of rules for who will have what role and responsibility. Combination of voice and text, multiple chat channels.
But yeah, if you have no plan or organization, too many cooks is detrimental.
I've worked with self-hosted GitLab instances for more than 6 years, and there has been almost no downtime, mostly quick updates lasting a few minutes from time to time. Companies tend to fear self-hosted services because tech giants are supposed to be way more reliable and do everything for you, but owning your infrastructure and having people in place who actually know what's going on can be a lifesaver in many situations.
Unfortunately this experience doesn't pan out the same for a company:
- A large number of users will increase the likelihood that one of them will encounter a temporary issue; even if problems are intermittent, a single person using the system simply won't notice them, but lots of people will.
- More users tend to mean more features in use, more system resources consumed, etc., which increases the likelihood of an issue from either buggy/complex features or resource exhaustion.
- Updates for a team typically involve being more careful and performing updates on a regular schedule, in order to minimize downtime.
- If a problem does occur and you're not around, the team is stuck until you are available. If you go on vacation or are hit by a bus, they're stuck longer.
- Having more users tends to require things like disaster recovery contingencies, security, etc. Somebody has to do this extra work, which is a cost in time and labor.
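The first point in the list above can be sketched numerically. This assumes each user independently has the same small chance of hitting an intermittent issue, which is a simplification; the 1% figure and the function name are hypothetical.

```python
def p_any_user_affected(p_per_user: float, n_users: int) -> float:
    """Probability that at least one of n_users hits the issue,
    assuming each is affected independently with probability p_per_user."""
    return 1.0 - (1.0 - p_per_user) ** n_users


# Hypothetical figure: a 1% chance per user of hitting an intermittent issue.
for n in (1, 10, 100):
    print(n, round(p_any_user_affected(0.01, n), 3))
```

With one user the issue shows up 1% of the time; with 100 users, somebody hits it roughly 63% of the time, so the same system feels far less reliable.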
Where self-hosted shines is in keeping complexity down and changes minimal. GitHub is a giant system, making it more likely for problems to interrupt service. They constantly ship changes, increasing the likelihood of interrupted service. A self-hosted option can use the same simple system and version for much longer, only taking security patches until the version is EOL and needs to be upgraded.