Failure Domains
The Site Reliability Framework is organised into 14 domains. Four are enabling domains (Governance, Personnel, Incident Management, and Capacity Management) — organisational capabilities whose absence amplifies the severity of every incident but which rarely initiate one directly. The remainder are failure domains: distinct categories of production failure, each derived from clustering real incident post-mortems by their primary causal pattern.
This page covers the eight failure domains — what they are, why each one exists, and the granularity of what they assess.
The eight failure domains
Each domain is weighted by incident frequency in the underlying dataset of 192 real-world production incidents. Infrastructure leads with 50 primary-cause incidents, followed by Database & Storage with 33; Authentication & Access and Dependency carry narrower but critical scope.
| Domain | Key | Sub-domains | Incidents (primary cause) | Primary failure patterns |
|---|---|---|---|---|
| Code | CODE | 7 | 26 | Resource exhaustion cascades, latent bugs triggered by environmental conditions, insufficient testing coverage, component side effects, race conditions, type and unit mismatches |
| Dependency | DEP | 5 | 9 primary (30%+ contributing) | Cascading failures from tight coupling, unplanned transitive dependencies, backlog accumulation, retry storms amplifying upstream blips |
| Infrastructure | INFRA | 8 | 50 | Network capacity exhaustion, single points of failure, load balancer failures, hardware failures without redundancy |
| Deployment | DEPLOY | 6 | 23 | Incorrect execution of manual procedures, stale deployment artefacts, insufficient pre-deployment testing, autoscaling cascades from bad deploys, manual production access without safeguards |
| Configuration | CONFIG | 6 | 19 | Config values accepted without validation, network configuration changes causing routing errors, configuration drift across environments, incorrect feature flags and control plane configuration |
| Database & Storage | DB | 7 | 33 | Lock contention cascades, replication lag and failover failures, insufficient capacity for peak load, network partitions, failed automatic recovery procedures |
| Network | NET | 6 | Carved from INFRA/DB | Network capacity exhaustion, routing errors, DNS failures, DDoS attacks, certificate issues |
| Authentication & Access | AUTH | 5 | 22 | Weak identity verification enabling impersonation, stolen or compromised credentials, broken abuse reporting paths, insufficient audit logging |
Totals: 62 sub-domains across the failure domains. Add 19 sub-domains across the four enabling domains for 81 sub-domains in total, producing 243 assessment points (81 sub-domains × 3 axes).
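The ranking and the assessment-point arithmetic can be checked with a short sketch. The incident counts are copied from the table and the sub-domain totals from the sentence above. Treating a domain's weight as its share of the 192 incidents is an assumption made for illustration, since the exact weighting formula is not stated here, and the function names are hypothetical.

```python
# Primary-cause incident counts copied from the table above.
# NET is omitted: it was carved out of INFRA/DB and has no separate count.
PRIMARY_INCIDENTS = {
    "CODE": 26,
    "DEP": 9,
    "INFRA": 50,
    "DEPLOY": 23,
    "CONFIG": 19,
    "DB": 33,
    "AUTH": 22,
}
DATASET_SIZE = 192  # incidents in the underlying dataset


def incident_share(key: str) -> float:
    """Share of the dataset attributed to a domain.

    Assumed weighting for illustration, not the framework's own formula.
    """
    return PRIMARY_INCIDENTS[key] / DATASET_SIZE


def assessment_points(failure_sub_domains: int,
                      enabling_sub_domains: int,
                      axes: int = 3) -> int:
    """Each sub-domain is assessed once per axis."""
    return (failure_sub_domains + enabling_sub_domains) * axes


# Domain keys ordered from most to fewest primary-cause incidents.
ranked = sorted(PRIMARY_INCIDENTS, key=PRIMARY_INCIDENTS.get, reverse=True)

if __name__ == "__main__":
    for key in ranked:
        print(f"{key:<7}{PRIMARY_INCIDENTS[key]:>3}  {incident_share(key):.1%}")
    print(assessment_points(62, 19))  # totals as stated in the text
```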
Why these eight domains
Incident-driven clustering
The domain structure was not designed top-down. The first step was to read and structurally analyse each post-mortem in the 192-incident dataset, reconstruct its causal chain, and tag the primary failure category. Seven technical clusters emerged naturally: Code, Dependency, Infrastructure, Deployment, Configuration, Database & Storage, and Authentication & Access.
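The tagging step can be pictured as a tally over tagged post-mortems. The sketch below is illustrative only: the `PostMortem` record, its fields, and the four sample entries are hypothetical and are not drawn from the dataset.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PostMortem:
    """Hypothetical record for one analysed incident."""
    incident_id: str
    primary_cause: str                   # domain key, e.g. "INFRA"
    contributing: tuple[str, ...] = ()   # other domains in the causal chain


def primary_counts(reports):
    """Count incidents by the domain tagged as the primary cause."""
    return Counter(r.primary_cause for r in reports)


def contributing_rate(reports, key):
    """Fraction of incidents in which `key` is a contributing factor."""
    reports = list(reports)
    if not reports:
        return 0.0
    hits = sum(1 for r in reports if key in r.contributing)
    return hits / len(reports)


# Invented sample data, for illustration only.
SAMPLE = [
    PostMortem("pm-001", "INFRA", ("DEP",)),
    PostMortem("pm-002", "DB"),
    PostMortem("pm-003", "INFRA", ("CONFIG",)),
    PostMortem("pm-004", "DEPLOY", ("DEP", "CODE")),
]

if __name__ == "__main__":
    print(primary_counts(SAMPLE).most_common())
    print(contributing_rate(SAMPLE, "DEP"))
```

Keeping primary and contributing tags separate is what allows a domain to be small by primary-cause count yet frequent as a contributing factor.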
Network was subsequently carved out as a separate domain rather than left split across Infrastructure and Database. The rationale is specificity: network-specific failure patterns — DNS misconfiguration, BGP route propagation, DDoS mitigation, TLS certificate expiry — have detection signals, mitigation strategies, and response playbooks that are distinct enough from compute infrastructure and database networking to warrant their own assessment track. Treating them as a sub-category of INFRA or DB would have blurred those distinctions and produced less actionable profiles.
Sub-domain count reflects failure diversity
The number of sub-domains per domain is itself empirically grounded. INFRA has eight sub-domains because infrastructure failures are the most diverse in the dataset: capacity, load balancing, compute redundancy, storage, network hardware, and others each have distinct causal structures. AUTH has five sub-domains because, while the consequences of authentication failures are severe, the failure patterns cluster into fewer distinct categories. The framework allocates assessment granularity where the data shows the most complexity.
Dependency as a special case
The DEP domain appears to have modest scope by primary-cause count — nine incidents where a dependency failure was the initiating event. This understates its importance considerably.
Dependency issues appear as contributing factors in over 30% of all incidents in the dataset. The pattern is consistent: a minor upstream blip — a slow response, a brief timeout, a transient 503 — is amplified by missing circuit breakers and misconfigured retry logic into a cascading failure that affects downstream services. The dependency itself didn't fail catastrophically; the absence of defensive patterns turned a contained problem into a multi-service outage.
This is why DEP exists as a full domain with five sub-domains and a complete D-M-R (detection, mitigation, response) assessment profile, rather than as a footnote inside another domain. Assessing only primary causes would leave the most common amplification mechanism unexamined.
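The defensive patterns named in this section, circuit breakers and bounded retries, can be sketched in a few lines. This is a minimal, single-threaded illustration rather than a production implementation: `CircuitBreaker`, `CircuitOpenError`, `call_with_retries`, and all thresholds are hypothetical names and values chosen for the example.

```python
import random
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without contacting the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after      # cool-down in seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None              # None means the circuit is closed

    def call(self, fn):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("dependency circuit is open")
            # Cool-down elapsed: allow a trial call (half-open).
            # One more failure re-opens the circuit immediately.
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result


def call_with_retries(breaker, fn, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Bounded retries with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise  # never retry against an open circuit
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Randomised delay so that clients do not retry in lockstep.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```

The breaker caps how many calls a struggling dependency receives, and the jittered backoff spreads the remaining retries out, so a brief upstream blip is less likely to be amplified into a retry storm.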
What each domain assesses
Each domain's sub-domains represent distinct reliability concerns with their own detection signals, mitigation controls, and response procedures. The granularity is deliberate: grouping all database concerns as "database reliability" would produce a score too coarse to drive remediation.
Code sub-domains
The CODE domain's seven sub-domains illustrate the range within a single failure category:
- Resource Management — memory, CPU, file descriptors, thread pools; resource exhaustion cascades
- Concurrency & State Management — race conditions, deadlocks, shared-state corruption
- Error Handling & Recovery — unhandled exceptions, silent failures, propagation of errors across service boundaries
- Testing Coverage & Quality — unit, integration, and load testing coverage; latent bugs that pass staging
- Dependency Version & Vulnerability Management — outdated dependencies introducing known vulnerabilities or breaking changes
- Performance & Efficiency — algorithmic complexity, N+1 queries, blocking I/O in hot paths
- Code Review & Quality Gates — review processes, static analysis, and the controls that catch issues before production
Database & Storage sub-domains
The DB domain's seven sub-domains reflect the diversity of database failure modes observed in the dataset:
- Database Capacity & Performance — query performance, index coverage, capacity headroom for peak load
- Database Contention — lock contention cascades, deadlock detection, long-running transactions
- Database High Availability — primary/replica topology, automatic failover, promotion procedures
- Backup & Recovery — backup frequency, restoration testing, recovery time objectives
- Schema Migration & Data Integrity — migration safety, rollback capability, data validation
- Connection Pool & Session Management — pool sizing, connection limits, leak detection
- Data Replication & Consistency — replication lag monitoring, consistency guarantees, partition behaviour
The same degree of granularity applies in every domain. INFRA distinguishes compute redundancy from load balancing from storage availability. DEPLOY distinguishes artefact management from deployment procedure from rollback capability. The assessment is structured so that each sub-domain score points to a specific area of investment, not a vague domain-level weakness.
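A small sketch shows why a domain-level number alone is too coarse. The sub-domain names come from the Database & Storage list above; the 0-5 scale, every score, and the helper names are invented for illustration, and the three axes are assumed to be detection (D), mitigation (M), and response (R).

```python
# Hypothetical D-M-R scores on an assumed 0-5 scale.
DB_SCORES = {
    "Database Capacity & Performance": {"D": 4, "M": 4, "R": 4},
    "Database Contention": {"D": 4, "M": 3, "R": 4},
    "Database High Availability": {"D": 5, "M": 4, "R": 4},
    "Backup & Recovery": {"D": 1, "M": 2, "R": 1},
    "Schema Migration & Data Integrity": {"D": 4, "M": 4, "R": 3},
    "Connection Pool & Session Management": {"D": 4, "M": 4, "R": 4},
    "Data Replication & Consistency": {"D": 4, "M": 3, "R": 4},
}


def sub_domain_score(axes):
    """Mean of one sub-domain's axis scores."""
    return sum(axes.values()) / len(axes)


def domain_average(scores):
    """Single domain-level number: the mean of the sub-domain means."""
    return sum(sub_domain_score(a) for a in scores.values()) / len(scores)


def weakest(scores, n=1):
    """Names of the n lowest-scoring sub-domains."""
    return sorted(scores, key=lambda name: sub_domain_score(scores[name]))[:n]


if __name__ == "__main__":
    print(round(domain_average(DB_SCORES), 2))
    print(weakest(DB_SCORES))
```

With these invented numbers the domain average looks respectable, while the per-sub-domain view singles out Backup & Recovery as the area needing investment.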
Relationship to the enabling domains
The failure domains explain what breaks. The four enabling domains explain why things break badly even when the technical controls appear adequate.
Governance failures appear as contributing factors when the change that caused the outage bypassed review. Personnel failures appear when the on-call engineer lacked the context to diagnose the issue. Incident Management failures appear when the response was uncoordinated and the outage lasted four times longer than it needed to. Capacity Management failures appear when headroom was exhausted without warning, or when the absence of tiered load-shedding meant premium customers degraded alongside free-tier users.
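Tiered load-shedding means rejecting lower-priority traffic first as headroom shrinks, so that premium traffic is the last to degrade. The sketch below is one simple way to express the idea; the tier names and utilisation thresholds are hypothetical and would need tuning for a real system.

```python
# Hypothetical thresholds: a tier is shed once utilisation reaches its limit.
# Tiers are listed from lowest to highest priority.
SHED_AT = {
    "free": 0.80,
    "standard": 0.90,
    "premium": 0.98,
}


def should_admit(tier: str, utilisation: float) -> bool:
    """Admit a request unless utilisation has reached the tier's shed threshold."""
    if utilisation < 0:
        raise ValueError("utilisation must be non-negative")
    return utilisation < SHED_AT[tier]


def admitted_tiers(utilisation: float) -> list[str]:
    """Tiers still being served at the given utilisation (0.0 to 1.0)."""
    return [tier for tier in SHED_AT if should_admit(tier, utilisation)]


if __name__ == "__main__":
    for u in (0.50, 0.85, 0.95, 0.99):
        print(u, admitted_tiers(u))
```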
Per-domain D-M-R profiles across both failure and enabling domains together constitute a complete reliability posture. A high INFRA score paired with a low Incident Management (INC) score describes an organisation with excellent architecture that handles incidents badly. A high INC score paired with a low DEPLOY score describes an organisation that responds well to the avoidable incidents it keeps creating.