GitHub recently experienced several availability incidents, both long-running and shorter in duration. We have since mitigated these incidents and all systems are now operating normally. Read on for more details about what caused these incidents and what we’re doing to mitigate them in the future.
What I learned from taking ownership of a lesser maintained service and bringing it up to a better standard.
Hot take: Anyone dunking on Atlassian about the fact that an outage like this could happen should not be trusted near production environments because they're either lying or don't have the experience to know what they're talking about.
Mark Imbriaco (@markimbriaco) Fri, 15 Apr 2022 13:00 +0000
It was MySQL, with the resource contention, in the database cluster
This week we’re joined by Nora Jones, founder and CEO at Jeli, where they help teams gain insight and learnings from incidents. Back in December Nora shared her thoughts in a Changelog post titled “Incident” shouldn’t be a four-letter word - which got a lot of attention from our readers. Today we’re talking with Nora a...
sorry, really disagree that it’s just an aesthetic. cynical tweets about other companies do great on twitter, so hugops is a less trendy but more empathetic response from a specific part of the field recognizing job complexities; a _more_ advanced conversation IMO.
Kara Sowles (@FeyNudibranch) Mon, 29 Nov 2021 02:48 GMT
Alexa, show me someone who’s never been responsible for a production workload.
“hugops” is just a way to tweet about a company’s outage without looking like a jerk, but y’all aren’t ready for that conversation
Sam Kottler (@samkottler) Sun, 28 Nov 2021 15:40 GMT
Tim Banks stands 5 feet, 8 inches (@elchefe) Mon, 29 Nov 2021 07:20 GMT
Use (End-to-End) Tracing or Correlation IDs (4 mins read).
Why you should be requesting, and logging, a unique identifier per request for better supportability.
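As a sketch of the idea, here is a minimal Python handler that reuses a caller-supplied correlation ID or mints a new one, logs it, and echoes it back. The header name and function names are illustrative assumptions (`X-Correlation-Id` is a common convention, not mandated by the post):

```python
import logging
import uuid

# Hypothetical header name; X-Request-Id and X-Correlation-Id are common conventions.
CORRELATION_HEADER = "X-Correlation-Id"


def get_or_create_correlation_id(headers: dict) -> str:
    """Reuse the caller's correlation ID if present, otherwise mint one."""
    return headers.get(CORRELATION_HEADER) or str(uuid.uuid4())


def handle_request(headers: dict, path: str) -> dict:
    correlation_id = get_or_create_correlation_id(headers)
    # Include the ID in every log line so a single request can be traced
    # across services when support needs to investigate.
    logging.info("handling %s correlation_id=%s", path, correlation_id)
    # Echo the ID back so the caller (and downstream services) can log it too.
    return {CORRELATION_HEADER: correlation_id, "status": 200}
```

Because the ID is propagated rather than regenerated at each hop, grepping your logs for one value surfaces every service a given request touched.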
Same with automated testing prior to deployment. This is a good thing though. Those hard problems were always there, you just never had time to look for them due to all the simpler issues taking up all your time.
Tom Binns (@fullstacktester) Wed, 10 Nov 2021 16:08 GMT
it's a bit counterintuitive, but the better-instrumented and the more mature your systems are, the fewer problems you'll find with automated alerting and the more you'll have to find by sifting around in production by hand.
I'm reminded of Clifford Stoll's "The Cuckoo's Egg" where he detected severe security vulnerabilities based on a monthly $0.25 accounting discrepancy that had no good explanation. Systems weren't falling over, nothing was tripping an alert but the system was subtly off-normal
arclight (@arclight) Fri, 29 Oct 2021 09:12 +0000
Charity Majors (@mipsytipsy) Tue, 09 Nov 2021 22:13 GMT
I organized a department-wide storytime titled “That Time I Broke Prod” and I think it may have been my favorite hour that I ever spent with colleagues. Normalize talking about failure, and remember that your manager (me) has fucked up way worse than you ever will 🥰
Denise Yu 💉💉 (@deniseyu21) Sat, 16 Oct 2021 22:10 +0000
I'm starting to see incidents as essential for knowledge sharing. If you're not experiencing any, it then makes sense to periodically introduce controlled incidents to learn about your infrastructure and how it behaves. Note: Hardly an original thought/realisation.
Lou ☁️ 👨💻🏋️♂️🎸🚴🏻♂️🏍 (@loujaybee) Mon, 02 Aug 2021 12:41 +0000
Everyone should be on-call because everyone should share the load. The load exists because companies don't invest in proper infrastructure and ample headcount. Yet another way this industry will grind you to dust.
principal engineers should be on call there - i said it
kris nóva (@krisnova) Sat, 10 Jul 2021 18:46 +0000
bletchley punk (@alicegoldfuss) Sun, 11 Jul 2021 20:34 +0000
In general, staying calm during an incident is a superpower. It’s like the difference between how you code normally and how you code during an interview. It’s also a skill that you can learn over time. Even if you’re an anxious person, you can get better at it.
Lorin Hochstein (@norootcause) Sat, 22 May 2021 22:21 +0000
The full postmortem of the Google outage this week is now up:
- incomplete migrations can be dangerous
- it continues to baffle me that Google systems aren’t designed to minimize the blast radius.
- Automated tools shouldn’t have the ability to make *global* config changes.
Cindy Sridharan (@copyconstruct) Sat, 19 Dec 2020 03:42 GMT
Lot of armchair quarterbacking about the AWS/Cloudflare outages. I never comment on outages because, frankly, "the internet" is a Rube Goldberg machine, built on buckets of tears and non-obvious (and often non-technical) constraints. It's a miracle it works 1% as well as it does.
Matt Klein (@mattklein123) Sat, 28 Nov 2020 21:25 GMT
“... 50 years ago the conclusion "pilot error" as the main cause was virtually banned from accident investigation. The new mindset is that any system or procedure where a single human error can cause an incident is a broken system.” This is a great take re software outages 👏
Ross Wilson (@rossalexwilson) Sun, 18 Oct 2020 11:20 +0000
As I've said before, I'm a big fan of how Monzo handles their production incidents because it's quite polished and transparent
This is a really interesting read from Monzo about a recent incident they had. I really enjoy reading their incident management writeups because they show a tonne of detail, yet are stakeholder-friendly.
It's always interesting to see how other banks deal with issues like this, and what they would do to make things better next time.