BCDR / Security Incidents ☄️

Developers often get frustrated investing effort in documentation that is seldom used, or in planning for improbable edge cases. Business continuity and disaster recovery (BCDR) is one such case: disasters rarely happen, but when one does and nothing was prepared, the disruption is severe and the costs can be substantial.

Every technology team in a company, from Site Reliability Engineering (SRE) to regular development teams, must prepare for disaster scenarios. When an outage occurs, the priority is to restore stability as quickly as possible and limit the financial impact.

Some view business continuity planning as a formality and fill in arbitrary values for the Recovery Time Objective (RTO, the maximum acceptable downtime) and the Recovery Point Objective (RPO, the maximum acceptable window of data loss). Teams should instead force themselves to plan their disaster response thoroughly, however rare such events are. A detailed strategy minimizes disruption when disaster does strike.

What do I need from my organization when it happens?

Your management will need to provide:

the scope of the impact
the communication to have with external partners
a target timeframe for resolution
continuous availability of managers and platform teams
identification and prioritization of the affected systems

As a team, how can I prepare?

Business continuity planning should start at the outset of any project, with system architectures designed so that critical components can be restored after a failure. Solutions should be engineered with recoverability in mind from day one. Below are several domains to look at:

Documentation 📃

High-level diagrams - you may have to rotate credentials for old projects that have been running for a long time, so you need to be able to quickly understand the different parts involved.
Third-party vendors - properly defined SLAs, points of contact, and an escalation process so you can reach them efficiently. Don’t hesitate to let your manager handle the communication, as long as the technical need from the vendor is clearly defined.
Credential rotation process - how-to instructions for certificates and any other kind of credentials (API keys, OAuth2, platform credentials, …)

Data persistence 🔥

A platform team works at a very different scale and may never recover your data: the decision can be made to sacrifice data in order to get services back up and running. If that happens, will you have lost data you could not afford to lose?

Do you need to snapshot your data on a regular schedule or replicate it to remote storage? Do you keep the test datasets you use in non-prod in a place where they could be lost?
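One way to make the snapshot question concrete is to compare the age of the most recent snapshot with the RPO the team has committed to. The following is a minimal sketch under assumptions: the function name `rpo_breached`, the 4-hour RPO, and the timestamps are all hypothetical, and in practice the snapshot time would come from your backup tooling.

```python
from datetime import datetime, timedelta, timezone


def rpo_breached(last_snapshot: datetime, now: datetime, rpo: timedelta) -> bool:
    """Return True when the newest snapshot is older than the RPO allows."""
    return now - last_snapshot > rpo


# Hypothetical values: a 4-hour RPO and a snapshot taken 6 hours ago.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_snapshot = now - timedelta(hours=6)

# Losing the data right now would mean losing 6 hours of writes,
# which is more than the 4 hours the team agreed to tolerate.
print(rpo_breached(last_snapshot, now, rpo=timedelta(hours=4)))  # True
```

A check like this could run on a schedule and alert before an incident happens, rather than being discovered during one.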

CI/CD ✅

Your pipeline should always be in working order, even in the middle of a migration to a new Kubernetes cluster, a new ingress, etc. You need to be able to patch across platforms and environments, ideally without any manual change.

For example, my team is migrating from ECS to EKS: my non-prod environments are on Kubernetes and prod is still on ECS. Can I still roll out a patch everywhere through the pipeline?
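The migration question can be turned into a small coverage check: for every environment, does the pipeline have a deploy job for the platform that environment currently runs on? This is only a sketch under assumptions. The environment names, the `ENVIRONMENTS` and `PIPELINE_TARGETS` inventories, and the function name are hypothetical, and a real check would read them from the pipeline configuration.

```python
# Hypothetical inventory: the platform each environment currently runs on.
ENVIRONMENTS = {"dev": "eks", "staging": "eks", "prod": "ecs"}

# Hypothetical: platforms the pipeline still has a working deploy job for.
# Here the ECS job was dropped too early during the migration.
PIPELINE_TARGETS = {"eks"}


def uncovered_environments(environments: dict[str, str], targets: set[str]) -> list[str]:
    """List environments whose platform has no deploy job in the pipeline."""
    return sorted(
        env for env, platform in environments.items() if platform not in targets
    )


# Any environment listed here can only be patched by hand.
print(uncovered_environments(ENVIRONMENTS, PIPELINE_TARGETS))  # ['prod']
```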

Project Dependencies 📚

If a critical CVE (Common Vulnerabilities and Exposures) is published, do I manage my versions in such a way that I can bump the affected dependency in a single place (BOM, version catalog, …)?

If I need to apply a fix while the third-party dependency is being patched, can it be propagated through a new library version, or does every project need a copy-paste of the fix?

Do I have a tool that can parse and edit a declarative file and apply a change across the board by opening pull requests?
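To illustrate what bumping a version in a single place can look like, here is a minimal sketch that edits one entry in a `key = "x.y.z"` style file, such as the versions section of a version catalog. Everything in it is an assumption made for illustration: the function name `bump_version`, the library names, and the version numbers are hypothetical, and a real tool would use a proper parser for the file format and then open the pull request.

```python
import re


def bump_version(catalog_text: str, key: str, new_version: str) -> str:
    """Replace the version assigned to `key` in a `key = "x.y.z"` style file."""
    pattern = re.compile(rf'^(\s*{re.escape(key)}\s*=\s*")[^"]*(")', re.MULTILINE)
    updated, count = pattern.subn(
        lambda m: f"{m.group(1)}{new_version}{m.group(2)}", catalog_text
    )
    # Refuse to guess: a missing or duplicated entry needs a human to look at it.
    if count != 1:
        raise ValueError(f"expected exactly one entry for {key!r}, found {count}")
    return updated


# Hypothetical catalog content and library names.
catalog = '[versions]\nexamplelib = "1.2.3"\notherlib = "3.0.0"\n'
print(bump_version(catalog, "examplelib", "1.2.4"))
```

Because every project reads the version from this one file, the same edit can be applied to each repository and submitted for review without touching individual build files.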

Secrets and Certificates 🔐

Are my secrets decoupled from my container orchestration platform (Vault, SOPS, etc.)?
If I have team credentials, do I use them in such a way that every consumer inherits them from the same place?
How do I store my certificate private keys? Are they impacted?
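Answering "is it impacted?" quickly is easier with an inventory that records where each credential is stored and who consumes it. The sketch below assumes such an inventory exists. The `Credential` type, the store labels, and the entries are hypothetical and only show the shape of the lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Credential:
    name: str
    store: str                  # where the secret lives, e.g. "vault" or "k8s-secret"
    consumers: tuple[str, ...]  # services that read this credential


def impacted(credentials: list[Credential], compromised_store: str) -> list[Credential]:
    """Return the credentials kept in the store that was compromised."""
    return [c for c in credentials if c.store == compromised_store]


# Hypothetical inventory.
INVENTORY = [
    Credential("payments-api-key", "vault", ("checkout",)),
    Credential("tls-private-key", "k8s-secret", ("ingress",)),
    Credential("team-db-password", "k8s-secret", ("checkout", "billing")),
]

# If the orchestration platform's own secret storage is compromised,
# these are the credentials to rotate first.
to_rotate = impacted(INVENTORY, "k8s-secret")
print([c.name for c in to_rotate])  # ['tls-private-key', 'team-db-password']
```

The `consumers` field also tells you which services need a restart or redeploy once the rotation is done.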

What will it look like?

Theo - t3.gg (@theo), 1:00 AM · Aug 16, 2023: "The senior eng during a sev1 outage" (video posted on X)