It's surprisingly easy, depending on your scale/scope of course. But in general, I've managed to build CI/CD pipelines that are tolerant of GitHub (or any service) failures by following these steps:
1. Use as little of the configuration language provided by the CI as possible (prefer shell scripts that you call in CI instead of having each step in a YAML config, for example)
2. Make sure static content is in a Git repository (same or different) that is also available on multiple SCM systems (I usually use GitHub + GitLab mirroring + a private VPS that also mirrors GitHub)
3. Have a bastion host for doing updates. Make CI push changes via the bastion host, and have at least four devs (if you're at that scale, otherwise you just) with access to it, requiring multisig of 2 of them to access
Now when the service goes down, you just need 2 developers to sign the login for the bastion host, then manually run the shell script locally to push your update. You'll always be able to update now :)
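For the mirroring in step 2, here is a minimal push-based sketch. It is not necessarily how the setup above does it (GitLab has its own mirroring feature); it assumes you already have one git remote configured per mirror. The function name and the remote names in the usage line are made up for illustration.

```shell
#!/bin/sh
# push_to_mirrors REMOTE...: push all branches and tags to each named remote.
# Keeps going when one remote is down and returns non-zero if any push failed.
# Remote names are whatever you configured with `git remote add` (hypothetical here).
push_to_mirrors() {
    failed=""
    for remote in "$@"; do
        # Branches and tags are pushed in two separate calls.
        if git push "$remote" --all && git push "$remote" --tags; then
            echo "pushed to $remote"
        else
            echo "push to $remote failed" >&2
            failed="$failed $remote"
        fi
    done
    [ -z "$failed" ]
}

# Example: push_to_mirrors github gitlab vps
```

Continuing past a failed remote is deliberate: one provider being down shouldn't stop the other mirrors from getting the update.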
> our CI can't clone the PR to run tests. What do other folks use to avoid this situation?
Multiple remotes can help and is certainly something you should have as a backup. However I don't think it solves the root cause which is how the CI is configured.
I'm a firm proponent of keeping your CI as dumb as possible. That's not to say unsophisticated; I mean it should be decoupled as much as possible from the how of the actions it's taking.
If you have a CI pipeline that consists of Clone, Build, Test, and Deploy stages, then I think your actual CI configuration should look as close as possible to the following pseudocode:
  stages:
    - clone: git clone $REPO_URL
    - build: sh ./scripts/build.sh
    - test: sh ./scripts/test.sh
    - deploy: sh ./scripts/deploy.sh
Each of these scripts should be something you can run on anything from your local machine to a hardened bastion, at least given the right credentials/access for the deploy step. They don't have to be shell scripts, they could be npm scripts or makefiles or whatever, as long as all the CI is doing is calling one with very simple or no arguments.
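To make that concrete, here is a rough sketch of a tiny runner that works the same on a laptop, a bastion, or in CI. It assumes the stage scripts live in ./scripts and are named after their stage, as in the pseudocode; run_stage, run_pipeline and SCRIPTS_DIR are names I made up for this example.

```shell
#!/bin/sh
# Directory holding build.sh, test.sh, deploy.sh (assumed layout).
SCRIPTS_DIR="${SCRIPTS_DIR:-./scripts}"

# run_stage NAME: run a single project-level stage script.
run_stage() {
    script="$SCRIPTS_DIR/$1.sh"
    if [ ! -f "$script" ]; then
        echo "unknown stage: $1" >&2
        return 2
    fi
    echo "==> $1"
    sh "$script"
}

# run_pipeline NAME...: run stages in order, stopping at the first failure.
run_pipeline() {
    for s in "$@"; do
        run_stage "$s" || { echo "stage failed: $s" >&2; return 1; }
    done
}

# Example: run_pipeline build test deploy
```

The CI config then only needs to call one stage per step, and anyone with the right credentials can run the exact same thing by hand.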
This doesn't rule out using CI specific features, such as an approval stage. Just don't mix CI level operations with project level operations.
As a side benefit this helps avoid a bunch of commits that look like "Actually really for real this time fix deployment for srs" by letting you run these stages manually during development instead of pushing something you think works.
More importantly though, it makes it substantially easier to migrate between CI providers, recover from a CI/VCS crash, or onboard someone who's responsible for CI but maybe hasn't used your specific tool.
You really just need a TCP pathway between your CI and some machine with the git repo on it.
Or take your local copy and use some git-fu to create a bare repo of it (git clone --bare does this) that you can compress and put somewhere like S3. Then download it in CI and check out from that.
Or just tarball your app source, who cares about git, and do the same (s3, give it a direct path to the asset)
All of this is potentially useless info though. Hard to say without understanding how your CI works. If all you need is the source code, there are a half dozen ways to get that source into CI without git.
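For the plain tarball route, a minimal sketch could look like this. The function names are made up, and the upload/download step is left out on purpose since it depends on where you store the archive (S3 or anything else CI can reach).

```shell
#!/bin/sh
# pack_source SRC_DIR OUT_TGZ: archive the source tree as-is.
# Keep OUT_TGZ outside SRC_DIR so the archive doesn't try to include itself.
pack_source() {
    tar -czf "$2" -C "$1" .
}

# unpack_source TGZ DEST_DIR: restore the tree on the CI side.
unpack_source() {
    mkdir -p "$2" && tar -xzf "$1" -C "$2"
}

# Example (paths are hypothetical):
#   pack_source ./myapp /tmp/myapp-src.tgz     # on your machine, then upload
#   unpack_source ./myapp-src.tgz ./workdir    # in CI, after downloading
```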
Presumably, manually running builds locally is an automatic failsafe option if you have the same people around who originally set the build pipeline up in the first place.
In 2021, basic business continuity plans for software companies should incorporate these sorts of concerns. You should have a published procedure somewhere that a person could follow for producing the final build artifacts of your software on any machine once backups are made available. Situations like these are why I check in 100% of my dependencies to source control as well.
Ideally your CI/CD is just calling Make/Python/Whatever scripts that are one shot actions. You should be able to run the same action locally from a clean git repo (assuming you have the right permissions).
The anti-pattern to watch out for is long, complex scripts that live in your CI system’s config file. These are hard to test and replicate when you need to.
Well unfortunately it seems everything I said 11 days ago has become a reality I'm afraid and I was still downvoted for pointing this truth out. [0]
Too many times I suggested that everyone begin self-hosting or have that as a backup, but once again some think 'going all in on GitHub' is worth it. (It really is not the case)
Something I've learned on HN is that upvotes/downvotes mean nothing but how popular an opinion is. You can be 100% right, honest, straightforward and kind, but if the hive-mind does not agree, it does not agree and will downvote your well-written opinion.
Don't read too much into it and comment freely as normal. In the end, it's just internet points.
Doesn't help that this occurred in the same week as a patent pending MS Patch Tuesday that borked a lot of corporate machines. I'm still cleaning up messes from the changes they pushed out that break Kyocera drivers.
When I built CI stuff at my previous job there were two remote repos that could be cloned from: GitHub and a repo on a system on the LAN that the CI's user had SSH access to. Which one was used was controlled by a toggle-able environment variable in the CI system.
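A toggle like that can be very small. This is only a sketch of the idea, not the setup described above: the variable names (CI_USE_LAN_MIRROR, LAN_REPO_URL, GITHUB_REPO_URL) and both default URLs are placeholders I made up.

```shell
#!/bin/sh
# pick_clone_url: print the repo URL CI should clone from.
# Set CI_USE_LAN_MIRROR=1 in the CI system to switch to the LAN copy.
# All variable names and default URLs here are hypothetical placeholders.
pick_clone_url() {
    if [ "${CI_USE_LAN_MIRROR:-0}" = "1" ]; then
        echo "${LAN_REPO_URL:-ci@git.lan.example:/srv/git/project.git}"
    else
        echo "${GITHUB_REPO_URL:-https://github.com/example/project.git}"
    fi
}

# Example: git clone "$(pick_clone_url)" workdir
```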
Great advice, thank you.
What facility are you using for multi signature login?
Can you make a local build? Our fallback is that someone does the CI/CD steps manually.
Unfortunately we heavily shard our tests, running on a single laptop would take a while.
[0] https://news.ycombinator.com/item?id=26301750
> Have a GitLab instance or similar that you can pull from instead for CI?
GitLab, mirrored repo basically.
Could you just pull the PR from the submitter’s machine? They could serve it via e.g. git http-backend and you could then point the CI there to pull.
> What do other folks use to avoid this situation?
Don't use Microsoft?
GitLab deleted some DB at some time, kernel.org was hacked years ago...nothing is perfect.