Failure Domains

The Site Reliability Framework is organised into 14 domains. Four are enabling domains (Governance, Personnel, Incident Management, and Capacity Management) — organisational capabilities whose absence amplifies the severity of every incident but which rarely initiate one directly. The remainder are failure domains: distinct categories of production failure, each derived from clustering real incident post-mortems by their primary causal pattern.

This page covers the eight failure domains — what they are, why each one exists, and the granularity of what they assess.

The eight failure domains

Each domain is weighted by incident frequency in the underlying dataset of 192 real-world production incidents. Infrastructure leads with 50 primary-cause incidents, followed by Database & Storage with 33. Authentication and Dependency have the fewest sub-domains (five each), giving them a narrower but still critical scope.

Domain	Key	Sub-domains	Incidents (primary cause)	Primary failure patterns
Code	CODE	7	26	Resource exhaustion cascades, latent bugs triggered by environmental conditions, insufficient testing coverage, component side effects, race conditions, type and unit mismatches
Dependency	DEP	5	9 primary (30%+ contributing)	Cascading failures from tight coupling, unplanned transitive dependencies, backlog accumulation, retry storms amplifying upstream blips
Infrastructure	INFRA	8	50	Network capacity exhaustion, single points of failure, load balancer failures, hardware failures without redundancy
Deployment	DEPLOY	6	23	Incorrect execution of manual procedures, stale deployment artefacts, insufficient pre-deployment testing, autoscaling cascades from bad deploys, manual production access without safeguards
Configuration	CONFIG	6	19	Config values accepted without validation, network configuration changes causing routing errors, configuration drift across environments, incorrect feature flags and control plane configuration
Database & Storage	DB	7	33	Lock contention cascades, replication lag and failover failures, insufficient capacity for peak load, network partitions, failed automatic recovery procedures
Network	NET	6	Carved from INFRA/DB	Network capacity exhaustion, routing errors, DNS failures, DDoS attacks, certificate issues
Authentication & Access	AUTH	5	22	Weak identity verification enabling impersonation, stolen or compromised credentials, broken abuse reporting paths, insufficient audit logging
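
As a rough illustration, the primary-cause counts in the table can be held as plain data and ranked. The `Domain` type and its field names are hypothetical and not part of the framework. NET carries no count here because the table lists it as carved from INFRA/DB rather than giving it a figure.

```python
from typing import NamedTuple, Optional


class Domain(NamedTuple):
    key: str
    sub_domains: int
    primary_incidents: Optional[int]  # None where the table gives no count


# Values copied from the table; the structure itself is illustrative only.
FAILURE_DOMAINS = [
    Domain("CODE", 7, 26),
    Domain("DEP", 5, 9),
    Domain("INFRA", 8, 50),
    Domain("DEPLOY", 6, 23),
    Domain("CONFIG", 6, 19),
    Domain("DB", 7, 33),
    Domain("NET", 6, None),
    Domain("AUTH", 5, 22),
]


def rank_by_primary_incidents(domains):
    """Return the keys of domains with a known count, most incidents first."""
    counted = [d for d in domains if d.primary_incidents is not None]
    ordered = sorted(counted, key=lambda d: d.primary_incidents, reverse=True)
    return [d.key for d in ordered]


ranking = rank_by_primary_incidents(FAILURE_DOMAINS)
print(ranking)
```

Ranking by primary cause alone places DEP last, which is the gap the contributing-factor figure in its row is meant to correct.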

Totals: 62 sub-domains across the failure domains. Add 19 sub-domains across the four enabling domains for 81 sub-domains in total, producing 243 assessment points (81 sub-domains × 3 axes).
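
The arithmetic can be checked with a short sketch. The axis names are an assumption: the page refers to a D-M-R profile and to detection signals, mitigation controls, and response procedures, so the three axes are taken here to be Detection, Mitigation, and Response.

```python
FAILURE_SUB_DOMAINS = 62   # stated total for the failure domains
ENABLING_SUB_DOMAINS = 19  # stated total for the four enabling domains
AXES = ("Detection", "Mitigation", "Response")  # assumed expansion of D-M-R

# Every sub-domain is assessed once per axis.
total_sub_domains = FAILURE_SUB_DOMAINS + ENABLING_SUB_DOMAINS
assessment_points = total_sub_domains * len(AXES)

print(total_sub_domains, assessment_points)  # 81 243
```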

Why these eight domains

Incident-driven clustering

The domain structure was not designed top-down. The initial step was reading and structurally analysing each post-mortem in the 192-incident dataset, reconstructing the causal chain, and tagging the primary failure category. Seven technical clusters emerged naturally: Code, Dependency, Infrastructure, Deployment, Configuration, Database & Storage, and Authentication.

Network was subsequently carved out as a separate domain rather than left split across Infrastructure and Database. The rationale is specificity: network-specific failure patterns — DNS misconfiguration, BGP route propagation, DDoS mitigation, TLS certificate expiry — have detection signals, mitigation strategies, and response playbooks that are distinct enough from compute infrastructure and database networking to warrant their own assessment track. Treating them as a sub-category of INFRA or DB would have blurred those distinctions and produced less actionable profiles.

Sub-domain count reflects failure diversity

The number of sub-domains per domain is itself empirically grounded. INFRA has eight sub-domains because infrastructure failures are the most diverse in the dataset — capacity, load balancing, compute redundancy, storage, network hardware, and more each have distinct causal structures. AUTH has five sub-domains because, while the consequences of authentication failures are severe, the failure patterns cluster into fewer distinct categories. The framework allocates assessment granularity where the data shows complexity lives.

Dependency as a special case

The DEP domain appears to have modest scope by primary-cause count — nine incidents where a dependency failure was the initiating event. This understates its importance considerably.

Dependency issues appear as contributing factors in over 30% of all incidents in the dataset. The pattern is consistent: a minor upstream blip — a slow response, a brief timeout, a transient 503 — is amplified by missing circuit breakers and misconfigured retry logic into a cascading failure that affects downstream services. The dependency itself didn't fail catastrophically; the absence of defensive patterns turned a contained problem into a multi-service outage.
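
The defensive patterns named above can be sketched as a minimal circuit breaker combined with bounded, backed-off retries. This is an illustration under stated assumptions, not the framework's prescribed implementation: the class and function names, the threshold of three failures, and the 30-second cool-down are all hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without contacting the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed value
        self.reset_after_s = reset_after_s          # assumed value
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Fail fast instead of adding load to a struggling dependency.
                raise CircuitOpenError("dependency call skipped")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def call_with_retries(breaker, fn, max_attempts=3, base_delay_s=0.1, sleep=time.sleep):
    """Retry a bounded number of times with exponential backoff.

    Stops immediately if the breaker is open, so retries cannot pile up
    behind a dependency that is already failing.
    """
    for attempt in range(max_attempts):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay_s * 2 ** attempt)
```

The retry count is capped and each failure is counted by the breaker, so a persistent upstream fault produces a small, fixed number of calls and then fast local failures rather than a growing retry storm.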

This is why DEP exists as a full domain with five sub-domains and a complete D-M-R assessment profile, rather than as a footnote inside another domain. Assessing only primary causes would leave the most common amplification mechanism unexamined.
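
The difference between counting primary causes and counting any involvement can be shown with a small tally. The incident records below are invented for illustration and are not drawn from the dataset; the `primary` and `contributing` field names are likewise hypothetical.

```python
from collections import Counter

# Hypothetical post-mortem tags, invented for illustration only.
incidents = [
    {"primary": "INFRA", "contributing": ["DEP"]},
    {"primary": "DEPLOY", "contributing": ["DEP", "CONFIG"]},
    {"primary": "DB", "contributing": []},
    {"primary": "DEP", "contributing": ["CODE"]},
    {"primary": "CODE", "contributing": ["DEP"]},
]

# Count each domain once per incident where it was the initiating cause.
primary_counts = Counter(i["primary"] for i in incidents)

# Count each domain once per incident where it appears in any role.
involved_counts = Counter(
    domain for i in incidents for domain in {i["primary"], *i["contributing"]}
)

print(primary_counts["DEP"], involved_counts["DEP"])  # 1 4
```

In this toy data DEP is the primary cause once but is involved in four of five incidents, which mirrors the shape of the argument above without reproducing its figures.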

What each domain assesses

Each domain's sub-domains represent distinct reliability concerns with their own detection signals, mitigation controls, and response procedures. The granularity is deliberate: grouping all database concerns as "database reliability" would produce a score too coarse to drive remediation.

Code sub-domains

The CODE domain's seven sub-domains illustrate the range within a single failure category:

Resource Management — memory, CPU, file descriptors, thread pools; resource exhaustion cascades
Concurrency & State Management — race conditions, deadlocks, shared-state corruption
Error Handling & Recovery — unhandled exceptions, silent failures, propagation of errors across service boundaries
Testing Coverage & Quality — unit, integration, and load testing coverage; latent bugs that pass staging
Dependency Version & Vulnerability Management — outdated dependencies introducing known vulnerabilities or breaking changes
Performance & Efficiency — algorithmic complexity, N+1 queries, blocking I/O in hot paths
Code Review & Quality Gates — review processes, static analysis, and the controls that catch issues before production

Database & Storage sub-domains

The DB domain's seven sub-domains reflect the diversity of database failure modes observed in the dataset:

Database Capacity & Performance — query performance, index coverage, capacity headroom for peak load
Database Contention — lock contention cascades, deadlock detection, long-running transactions
Database High Availability — primary/replica topology, automatic failover, promotion procedures
Backup & Recovery — backup frequency, restoration testing, recovery time objectives
Schema Migration & Data Integrity — migration safety, rollback capability, data validation
Connection Pool & Session Management — pool sizing, connection limits, leak detection
Data Replication & Consistency — replication lag monitoring, consistency guarantees, partition behaviour

The same degree of granularity applies in every domain. INFRA distinguishes compute redundancy from load balancing from storage availability. DEPLOY distinguishes artefact management from deployment procedure from rollback capability. The assessment is structured so that each sub-domain score points to a specific area of investment, not a vague domain-level weakness.
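
A minimal sketch of how sub-domain scores might point to specific investments follows. The scores, the 0-5 scale, and the `D`/`M`/`R` keys are assumptions made for illustration; the page does not state the framework's scoring scale.

```python
# Hypothetical scores on an assumed 0-5 scale for three DB sub-domains.
db_scores = {
    "Database Capacity & Performance": {"D": 4, "M": 3, "R": 3},
    "Database Contention": {"D": 2, "M": 1, "R": 2},
    "Backup & Recovery": {"D": 4, "M": 4, "R": 1},
}


def weakest_points(scores, limit=2):
    """Return the lowest-scoring (sub-domain, axis, score) triples."""
    flat = [
        (name, axis, value)
        for name, axes in scores.items()
        for axis, value in axes.items()
    ]
    # sorted() is stable, so ties keep their original order.
    return sorted(flat, key=lambda item: item[2])[:limit]


for name, axis, value in weakest_points(db_scores):
    print(f"{name} / {axis}: {value}")
```

A single domain-level average over these nine values would hide the fact that the two weakest points sit in different sub-domains and on different axes.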

Relationship to the enabling domains

The failure domains explain what breaks. The four enabling domains explain why things break badly even when the technical controls appear adequate.

Governance failures appear as contributing factors when the change that caused the outage bypassed review. Personnel failures appear when the on-call engineer lacked the context to diagnose the issue. Incident Management failures appear when the response was uncoordinated and the outage lasted four times longer than it needed to. Capacity Management failures appear when headroom was exhausted without warning, or when the absence of tiered load-shedding meant premium customers degraded alongside free-tier users.
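
Tiered load-shedding, mentioned above, can be sketched as a per-tier admission threshold. The tier names other than premium and free, the threshold values, and the function name are hypothetical; a real system would derive its utilisation signal and thresholds from its own capacity data.

```python
# Assumed thresholds: lower-priority traffic is shed first as utilisation rises.
SHED_ABOVE = {"free": 0.80, "standard": 0.90, "premium": 0.98}


def should_admit(tier: str, utilisation: float) -> bool:
    """Admit a request unless utilisation exceeds the shed threshold for its tier."""
    if not 0.0 <= utilisation <= 1.0:
        raise ValueError("utilisation must be between 0.0 and 1.0")
    return utilisation <= SHED_ABOVE[tier]


# At 85% utilisation free-tier requests are shed while premium requests are served.
print(should_admit("free", 0.85), should_admit("premium", 0.85))  # False True
```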

Per-domain D-M-R profiles across both failure and enabling domains together constitute a complete reliability posture. A high INFRA score paired with a low INC score describes an organisation with excellent architecture that handles incidents badly. A high INC score paired with a low DEPLOY score describes an organisation that responds well to the avoidable incidents it keeps creating.
