drinchev 8 years ago

A bit weird. GitHub says they fixed it, but on the other hand CircleCI still considers it an outage:

> Monitoring

> May 31, 2017 3:08 PM

> GitHub have declared the outage resolved and we are starting to see incoming GitHub hooks. Builds are being triggered again. However we are still seeing failures with the GitHub API. This continues to prevent our webapp from fetching data from GitHub. We are monitoring the situation and will ensure sufficient capacity for when their service resumes normal operations.

  • amorphid 8 years ago

    Maybe fixing the problem is different than recovering? Like stopping blood loss vs slowly replacing the blood.

runeks 8 years ago

Good thinking on GitHub's part not using github.com/github/status to host the content of status.github.com. Amazon, take notice.

  • brian_herman 8 years ago

    Were they doing this before Amazon had that major outage?

    • imbriaco 8 years ago

      Yes. They've been hosting it outside of their production infrastructure for several years.

  • mhils 8 years ago

    Nitpick: They would still fail for DNS issues with *.github.com, so a domain like githubstatus.com would be even more resilient.

    • peterwwillis 8 years ago

      Nitpick: www.githubstatus.com is more flexible, potentially more resilient.

      • aerovistae 8 years ago

        Nitpick: www.statushub.com is less obviously related, and therefore won't be attacked in tandem. If they want to go all the way, maybe just something like www.wesellnikesdiscount.com.

tyingq 8 years ago

Seeing an interesting thing where GitHub issue comments I've just posted are apparently shown as posted a short time in the future. http://imgur.com/a/eQSc9

Doesn't seem to break anything, but it is a bit curious. May not be new though...I just happened to notice it today.

  • i336_ 8 years ago

    I've noticed this behavior with a lot of services.

    I can only chalk it up to something like clock drift between the processing node and the database server.

    Irritatingly, I can't remember which site it was, but I posted something somewhere a couple days ago and immediately after hitting enter the site marked what I'd said as submitted "a few seconds from now". I never fail to be amused that the fuzzy-time library being used has code specifically designed to handle this edge case. :D

    • toomuchtodo 8 years ago

      Message queues and eventual consistency. Unless your request requires something "atomic", a 200/201 response should be a sign of "got the message, will get to work on it when we can".

      • peterwwillis 8 years ago

        201 is "request has been fulfilled, resource has been created". It is explicitly not "we will get to it when we can". You are thinking of 202, "request has been accepted for processing", for asynchronous request processing.
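        A minimal sketch of the distinction, with made-up handler and store names: return 201 only when the resource exists at response time, and 202 when the work has merely been queued.

```python
def handle_create(payload, store, queue):
    """Return (status_code, body) for a resource-creation request."""
    if payload.get("defer"):
        # Work is only queued, not done: 202 Accepted is the honest answer.
        queue.append(payload)
        return 202, {"status": "accepted"}
    # The resource exists by the time we respond, so 201 Created is accurate.
    resource_id = len(store) + 1
    store[resource_id] = payload
    return 201, {"id": resource_id}

store, queue = {}, []
print(handle_create({"name": "x"}, store, queue))                 # (201, {'id': 1})
print(handle_create({"name": "y", "defer": True}, store, queue))  # (202, {'status': 'accepted'})
```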

        • toomuchtodo 8 years ago

          I assure you, 201 is used all the time to say something has been done when it's only been queued, regardless of what the RFC says. This is based on real world integration experience.

          HTTP response status codes rarely align with RFC guidelines.

          • peterwwillis 8 years ago

            A lot of people back up to tape and never test the tapes, too. Doesn't mean bad practice should be expected.

    • epicide 8 years ago

      I don't think it's really an edge case. Probably one of the main uses, actually.

      Sure, if you use it to show comment age, you shouldn't ever see it, but I'm sure they fully support using it for countdowns, too.

      EDIT: it's the 4th example under relative time for Moment.js (https://momentjs.com/).

      • i336_ 8 years ago

        I can totally agree about relative timestamping in both directions (past+future) - my argument is more about the UX of situations where you're canonically referring to a past event.

        So as not to spam with my reply to a similar comment, I'll link it: https://news.ycombinator.com/item?id=14452335

    • Piskvorrr 8 years ago

      Actually, most of the "relative time" JS libraries can be used for countdown as well ("this event is scheduled in two hours"). Accidentally getting a timestamp that's supposed to be in the past shouldn't break it :)

      • i336_ 8 years ago

        Very true.

        I just think there should be an intent argument specifiable to the library to indicate that the duration in question refers to an event that has happened in the past.

        In such a scenario, the library should mark the duration as happening "just now" and possibly flag a warning or raise an exception.

        The reason I say this is that, humanly speaking, "Your message was sent 23 seconds from now" is amusing at best to developers who know what's happening (negative time delta) and linguistically confusing to general users ("was sent" vs "from now").
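        A sketch of what that intent argument might look like, with an invented API: given past intent, a small negative delta collapses to "just now" rather than rendering as "from now".

```python
def fuzzy_time(delta_seconds, intent="any", drift_tolerance=60):
    """Render a time delta (positive = in the past) as fuzzy text."""
    if delta_seconds < 0 and intent == "past":
        if -delta_seconds <= drift_tolerance:
            return "just now"   # treat small clock drift as "now"
        raise ValueError("timestamp is in the future for a past event")
    if delta_seconds < 0:
        return f"in {-delta_seconds} seconds"
    return f"{delta_seconds} seconds ago"

print(fuzzy_time(-23, intent="past"))   # "just now"
print(fuzzy_time(-23))                  # "in 23 seconds"
print(fuzzy_time(23))                   # "23 seconds ago"
```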

    • yeukhon 8 years ago

      Git timestamps can be set with arbitrary timezones, so technically GitHub can analyze the timestamp (which they do) and compare it against the actual clock for that timezone. There's only a small problem: latency to the clock-check server must be low, otherwise we'd be a second ahead by the time we get a response, and another second or two after the webpage renders.

      So from a UX standpoint, we should simply discount time drift of ±5 seconds and say "blah blah just now", while anything larger might warrant a warning to the author (GitHub could reject the commit and ask for confirmation). Re-editing a commit just to change its timestamp is a hassle, though. It's more of a "please make sure your computer's time is synced next time."

      • i336_ 8 years ago

        Wow, interesting. This info is definitely filed away, thanks.

        I don't quite remember but I think I might've been commenting on something on GitHub when I saw the time glitch.

        For a moment I thought "why not just have OCD local NTP tracking?", but then I realized that time glitching around (even at the millisecond level) can be disastrous. One way to solve this is to obsess about keeping up to date with NTP, but instead of updating the time, update a global reference to the offset. Then your time source is simply (system date) ± (saved offset), which should be super fast. And of course this can run on the node generating the HTML.
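        That offset idea can be sketched like this (class and method names are invented):

```python
import time

class OffsetClock:
    """Keep a saved NTP offset instead of stepping the system clock."""
    def __init__(self):
        self.offset = 0.0   # seconds ahead (+) or behind (-) the local clock

    def sync(self, reference_now):
        # reference_now would come from a periodic background NTP query.
        self.offset = reference_now - time.time()

    def now(self):
        # Cheap hot path: one clock read plus one addition, no network.
        return time.time() + self.offset

clock = OffsetClock()
clock.sync(time.time() + 2.5)   # pretend NTP says we're 2.5s behind
assert abs(clock.now() - (time.time() + 2.5)) < 0.1
```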

        • yeukhon 8 years ago

          If you look at "man git-commit-tree"

                 While parent object ids are provided on the command line, author and committer information is taken from the following environment variables, if set:
          
                     GIT_AUTHOR_NAME
                     GIT_AUTHOR_EMAIL
                     GIT_AUTHOR_DATE
                     GIT_COMMITTER_NAME
                     GIT_COMMITTER_EMAIL
                     GIT_COMMITTER_DATE
          
                 (nb "<", ">" and "\n"s are stripped)
          
                 In case (some of) these environment variables are not set, the information is taken from the configuration items user.name and user.email, or, if not present, the environment variable EMAIL, or, if that is
                 not set, system user name and the hostname used for outgoing mail (taken from /etc/mailname and falling back to the fully qualified hostname when that file does not exist).
          
                 A commit comment is read from stdin. If a changelog entry is not provided via "<" redirection, git commit-tree will just wait for one to be entered and terminated with ^D.
          
              DATE FORMATS
                 The GIT_AUTHOR_DATE, GIT_COMMITTER_DATE environment variables support the following date formats:
          
                 Git internal format
                     It is <unix timestamp> <time zone offset>, where <unix timestamp> is the number of seconds since the UNIX epoch.  <time zone offset> is a positive or negative offset from UTC. For example CET (which is
                     2 hours ahead UTC) is +0200.
          
                 RFC 2822
                     The standard email format as described by RFC 2822, for example Thu, 07 Apr 2005 22:13:13 +0200.
          
                 ISO 8601
                     Time and date specified by the ISO 8601 standard, for example 2005-04-07T22:13:13. The parser accepts a space instead of the T character as well.
          
                         Note
                         In addition, the date part is accepted in the following formats: YYYY.MM.DD, MM/DD/YYYY and DD.MM.YYYY.
          
          

          Edit see this repo and screenshot:

          * https://github.com/yeukhon/demos/commits/master/git-date

          * https://github.com/yeukhon/demos/blame/a9fc9dfe6d35c5ffe14af...

          * https://github.com/yeukhon/demos/commits/a9fc9dfe6d35c5ffe14...

          You can see I got "3 minutes ago" (using local time) and then "4 hours ago", because I set the timestamp manually in the most recent commits. So yes, you can set the time in the future or the past.
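          The internal format quoted from the man page is easy to generate; here's a small sketch (the helper name is hypothetical) producing a value suitable for GIT_AUTHOR_DATE / GIT_COMMITTER_DATE:

```python
def git_internal_date(unix_ts, tz_offset_minutes):
    """Format '<unix timestamp> <time zone offset>' as git expects."""
    sign = "+" if tz_offset_minutes >= 0 else "-"
    minutes = abs(tz_offset_minutes)
    return f"{unix_ts} {sign}{minutes // 60:02d}{minutes % 60:02d}"

# The +0200 example from the man page (120 minutes ahead of UTC):
print(git_internal_date(1112911993, 120))    # "1112911993 +0200"
print(git_internal_date(1112911993, -330))   # "1112911993 -0530"
```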

          • i336_ 8 years ago

            Very interesting. Thanks for the demo and references!

  • wiredfool 8 years ago

    I've seen that, but I've chalked it up to clock skew on the client. It only seems to happen on one of my machines, and only after a couple months of uptime.

apeace 8 years ago

As others have said, Github's postmortems are always great.

But frankly, I'd rather they have better uptime. Every couple months is too much. I pay them. My work pays them.

If their CEO is serious about zero downtime, how about he offers his paying customers a credit for time they cannot access the service?

  • richardwhiuk 8 years ago

    They are? https://status.github.com/messages/2017-01-18 has a bunch of major service outages and no link to any post-mortem.

    The vague rumour always seems to be 'DDoS attack I guess' but there's very little in the way of formal reporting as far as I can tell...

  • Xylakant 8 years ago

    > I pay them. My work pays them.

    Hmm. You pay them to uphold a contract. What does that contract say about SLAs and availability? Probably the same as the TOS that I agreed to when paying and those specifically say:

        GitHub does not warrant that the Service will meet your requirements; 
        that the Service will be uninterrupted, timely, secure, or error-free; 
        that the information provided through the Service is accurate, reliable or 
        correct; that any defects or errors will be corrected; that the Service will 
        be available at any particular time or location; or that the Service is free 
        of viruses or other harmful components. You assume full responsibility and 
        risk of loss resulting from your downloading and/or use of files, 
        information, content or other material obtained from the Service.
    

    If you negotiate, you might get better terms and guarantees, for example with github enterprise. You might also have to pay substantially more for those.

    I understand, it sucks when github is down. But we all get what we pay for and we all don't want to pay for more. And yes, I do have clients that meticulously mirror all their dependencies from outside sources and spend significant money on this - money that pays off in exactly these situations.

    • minxomat 8 years ago

      Huh? You don't have to use GitHub Enterprise (self-hosted) to get an SLA. GitHub Business, which is hosted on github.com, has a 99.95% uptime SLA: https://github.com/pricing

      An upgrade from Team to Business is "only" a 2.3x price bump per dev. I have no experience with this though; my team is still on the Team plan and thus suffered from the outage today.

      • dromedary512 8 years ago

        Huh indeed. 99.95% uptime -- AKA three and a half nines. My quick math tells me that 99.95% uptime equates to a downtime of ~4h23m/yr. If GitHub is down for an hour once every few months, I'd say they're likely well within their stated SLA.
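        The quick math is easy to reproduce:

```python
# 99.95% uptime: how much downtime per year does the SLA allow?
hours_per_year = 365.25 * 24                      # ~8766 hours
allowed_downtime = (1 - 0.9995) * hours_per_year
print(round(allowed_downtime, 2))                 # ~4.38 hours, i.e. about 4h23m/yr
```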

  • tchaffee 8 years ago

    My problem with a credit is that it never even comes close to what I'm losing in income. An ISP is an excellent example. I might get a $10 credit for 24 hours of downtime. I'm charging slightly more per hour than that... /s

    Maybe switch to bitbucket or other competition for a while?

    • eof 8 years ago

      switched to gitlab a few weeks ago, didn't notice the outage until I saw this thread.

      • tekism 8 years ago

        I use VSTS, it's actually improved much over the years!

    • scott_karana 8 years ago

      If the cost of your lost business is disproportionately large compared to the price of the service, that's a problem at your end: you needed to calculate the risk vs. return for redundancy.

      Eg, if your Internet costs $100/mo, but you'd lose $100/hour when it's down during business hours, buy a fallback connection from a competing ISP. ;)

      • tchaffee 8 years ago

        > a competing ISP

        Wow! That actually exists in some places? ;-)

        Infrastructure so often becomes a monopoly. I can't pay a competing bridge service to drive to work quicker, I can't pay a competing gas company to deliver gas via different pipelines to my house. And I can't pay a competing electric company that uses different wires.

        I actually am lucky enough to live in a city where there are many competing high speed ISPs. But guess what? I've paid for fallback connections in the past and when one goes down, the other goes down, so I go out to lunch and see the guys working on the wires in the cabinet down the street. The wires that both my ISPs share. I suppose I could get a satellite ISP? That latency. True redundancy for infrastructure is actually very expensive in most cases.

peterjlee 8 years ago

Funny thing is Github sometimes makes more sales after an outage because clients want to upgrade to the enterprise edition to host on their own servers.

Jayakumark 8 years ago

A little funny, like a Silicon Valley episode: I saw the news from GitHub's CEO yesterday saying their goal is zero downtime, and now it's down.

  • IanCal 8 years ago

    Perhaps an issue with punctuation?

    Goal: zero downtime.

    vs

    Goal zero: downtime.

    • pavement 8 years ago

      Works on contingency?

      No, money down!
  • Xylakant 8 years ago

    Zero downtime is always a goal and never achieved for any complex service. They literally all go down sooner or later.

  • yeukhon 8 years ago

    You sure it isn't zero-downtime deployment? But I thought GitHub runs infrastructure globally? I remember some outages were caused by DDoS, and some were software bugs / bad config.

    Probably a good idea to do rolling deployments. I'd be surprised if they haven't, given the kind of top engineering team they're running.

saosebastiao 8 years ago

Business idea: GitHub hosting failover. You'd probably need a modified git client, but if you can't push/pull/whatever from GitHub, it transparently fails over to your service, which syncs back up with GitHub once they've recovered.

Even better idea: github should stop failing.

  • prh8 8 years ago

    GitLab has both push and pull mirroring; I wonder if it would be possible to use them together to accomplish this.

  • peterjlee 8 years ago

    Why not just use Github Enterprise then?

  • emars 8 years ago

    When expanded to show the monthly trend, it shows 99.6% availability.

    Serious question: are there enough people who would pay for that 0.4% to support a business?
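    For scale, a rough conversion of that figure into hours (assuming a 30-day month):

```python
# 99.6% monthly availability expressed as downtime.
hours_per_month = 30 * 24
monthly_downtime = (1 - 0.996) * hours_per_month
print(round(monthly_downtime, 2))   # ~2.88 hours of downtime per month
```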

  • tedchs 8 years ago

    A couple times I have configured multiple "remotes" for my local git repo and pushed to both, e.g. GitHub + Google Cloud Source Repository, or even just a bare repo on a VPS.
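    A single remote with multiple push URLs gets most of the way there without a modified client; a minimal sketch, using local bare repos as stand-ins for the two hosts:

```shell
# Sketch: one remote, two push URLs; a single push fans out to both.
git init -q --bare /tmp/mirror-a.git
git init -q --bare /tmp/mirror-b.git
cd "$(mktemp -d)" && git init -q .
git config user.email you@example.com
git config user.name you
git commit -q --allow-empty -m init
git remote add origin /tmp/mirror-a.git
# Note: the first --add must repeat the existing URL, because setting
# any push URL stops git from falling back to the fetch URL.
git remote set-url --add --push origin /tmp/mirror-a.git
git remote set-url --add --push origin /tmp/mirror-b.git
git push -q origin HEAD   # both mirrors now have the commit
```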

i336_ 8 years ago

As of right now...

On the one hand, I see "Everything operating normally." at the top in green, and no flags or alerts.

On the other hand, the charts mostly look good, but "App server availability" looks interesting: the right edge of the chart is pretty much at 0%.

nadim 8 years ago

MEAN WEB RESPONSE TIME - 262ms

98TH PERC. WEB RESPONSE TIME - 1134ms

4.3x?

  • maxyme 8 years ago

    And the 98th percentile is still faster than the 50th percentile of GitLab...

samgranieri 8 years ago

Apparently it's resolved. I'd like to read their postmortem on it. They write those extremely well.

thejosh 8 years ago

Looks like their CDN is having problems as well now; seeing timeouts when trying to download archives.

  • r3bl 8 years ago

    Was getting random failures when I tried opening a certain project and its wiki ~1 hour ago. Nothing big, and a couple of refreshes fixed it; just a minor annoyance.

citrusui 8 years ago

The outage seems to be resolved as of 8:58 EDT

  • DeepWinter 8 years ago

    Yep. Now I wonder what the issue was. Ghost in the shell?

gionn 8 years ago

The web looks fine, but repositories are not so responsive.

JohnHaugeland 8 years ago

feel free to make a better one

  • edoceo 8 years ago

    It's called GitLab. Not 100% uptime, but better (and constantly improving).

    • goralph 8 years ago

      GitLab, really ?

      • edoceo 8 years ago

        I'm very happy I switched. I even posted here on their last outage. That was a stressful 15 minutes.

        Not perfect, just working loads better for me and my teams

    • xrjn 8 years ago

      GitLab does not have feature parity with GitHub. I personally also stumbled upon a bizarre bug that doesn't allow a friend to add me to any of his private repos. Asked on IRC and nothing much came of it, unfortunately.

      • cmatija 8 years ago

        Hiya,

        Which features would you like to see in GitLab? We'd love to talk about it. You could also open a feature proposal issue in https://gitlab.com/gitlab-org/gitlab-ce/issues.

        • justinclift 8 years ago

          Err... you're responding to someone who very clearly said they hit a (serious for them) bug which needs fixing. That's definitely not a feature proposal. :D

    • Filligree 8 years ago

      It will take a long time before GitLab can even begin to regain my trust from their missing-backups outage.

      Problems are to be expected. But as great as it is that they had multiple levels of backup, none of them worked. They hadn't even been tested.

      • jjawssd 8 years ago

        Do you have any evidence that Github does any better?

        • Xylakant 8 years ago

          They haven't lost my data so far and they had to restore the production database at least once in their history. All circumstantial, but we'll have to wait and see.

          • foxylion 8 years ago

            They did not lose that much data, only a few hours' worth.

      • ChartsNGraffs 8 years ago

        They actually gained my trust from that outage.

      • jonknee 8 years ago

        If you paid for GitLab there was no missing-backups outage...

  • alekratz 8 years ago

    This is the biggest cop-out of a reply. I hate it. OP has already stated:

    >I pay them. My work pays them.

    GitHub is raking in oodles of cash and they STILL can't keep their service from going down, quoted from OP, "[e]very couple months".

    It's not about "making a better one", nor is it about paying for the fancier/premium features; it's about the uninterrupted service, which Github keeps failing to provide.

    • Arizhel 8 years ago

      I fail to see what the problem is. If you don't like their service, then switch to a competitor, or just set up your own git server. It's like this for any vendor: if you don't like the product or service you're getting, you can either bitch and complain endlessly, or you can look for alternatives. One of these choices is more productive than the other.

  • wand3r 8 years ago

    I think hindsight is 20/20. They obviously did not think they needed to find a competing service, nor build one, since they were already paying for one. Given that they have been inconvenienced by downtime, they may switch or make their own.

    > Feel free to make a better one

    The GitLab team did that already so I use their service. ;)

runn1ng 8 years ago

Git access to repos seems to be working for me (pull, push).