Last night, two major IT disasters struck:
- Microsoft Azure’s Central US region went down for about 4 hours. The official post-mortem isn’t out yet, but rumor has it that while decommissioning legacy storage services, the product group deleted the wrong thing.
- CrowdStrike pushed a bad update, leading to blue screens of death on Windows systems worldwide, affecting banking, healthcare, airlines, and more.
If you were affected by one of those outages, you have my warmest virtual hug. At times like this, the stress can be really tough, and I hope you can take care of yourself. Remember that your self-worth is not determined by the IT solutions you work on.
If you weren’t affected by one of those outages, it’s a good time to spend an hour writing up a few things:
- Which of our production services are hosted entirely in a single region, availability zone, or data center?
- How are we monitoring the status of that single point of failure? During a widespread outage like last night’s, how much time are we going to waste troubleshooting our own services before we realize there’s a bigger problem?
- When our single-region or single-AZ production services go down, which users or customers would be affected?
- How will we communicate the outage to those affected users? Can we write that notification ahead of time so that it’s ready to go quickly in the event of the next disaster like this?
- How much would it cost us (monthly or annually) to add in a second region or availability zone for protection from these kinds of incidents?
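If your answers to those questions live in people's heads, even a tiny script can turn them into something you can hand over. Here's a minimal sketch, not a real tool: the service names, regions, and dollar figures are all made up, and the "a second region costs about as much as the first" multiplier is an assumption you should replace with an actual quote.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Service:
    name: str
    regions: List[str]        # regions (or AZs / data centers) the service runs in
    affected_users: str       # who notices when it goes down
    monthly_cost_usd: float   # current monthly hosting cost (hypothetical figures)


def single_points_of_failure(services):
    """Return the services that run in exactly one region."""
    return [s for s in services if len(set(s.regions)) == 1]


def second_region_estimate(service, multiplier=1.0):
    """Rough monthly cost of adding a second region.

    Assumes the second region costs `multiplier` times the current
    monthly spend. That is a placeholder, not a pricing rule.
    """
    return service.monthly_cost_usd * multiplier


def report(services):
    """Build one summary line per single-region service."""
    lines = []
    for s in single_points_of_failure(services):
        lines.append(
            f"{s.name}: only in {s.regions[0]}; affects {s.affected_users}; "
            f"second region ~${second_region_estimate(s):,.0f}/month"
        )
    return "\n".join(lines)


# Hypothetical inventory, for illustration only.
inventory = [
    Service("orders-db", ["centralus"], "all online customers", 4200.0),
    Service("web-frontend", ["centralus", "eastus2"], "all online customers", 1800.0),
    Service("reporting", ["centralus"], "internal finance team", 950.0),
]

print(report(inventory))
```

With this sample inventory, the report lists `orders-db` and `reporting` and skips `web-frontend`, since it already runs in two regions. The output is only a starting point for the write-up: the monitoring and communication questions still need human answers.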
Summarize that, pass it up to your manager in writing, and it’ll help them have discussions this morning with their managers and executives. Today, a lot of business folks are going to be asking questions, and having these answers will help get you the resources you want.
(Or, it’ll help you feel more comfortable that the business understands the risks of putting all their eggs in a single basket, and that when that basket breaks, it’s not your fault. You warned ’em, and they chose not to spend the money to double up on baskets.)