From Incidents to Insight: The Methodology Behind the Site Reliability Framework

sitereliability.consulting | April 2026

The problem with reliability today

Most organisations don't know how reliable they are. They know when things break — but between incidents, reliability is a feeling, not a measurement. Teams inherit checklists from previous roles, adopt practices they read about in blog posts, and assume that what worked at a previous employer will work here. The result is a reliability posture built on anecdote rather than evidence.

This wouldn't be acceptable in any other risk domain. Financial auditors don't assess solvency based on vibes. Safety engineers don't certify bridges based on what worked at the last bridge. Yet in production operations — where a single outage can cost millions in revenue, erode years of customer trust, and trigger regulatory scrutiny — we routinely assess reliability with unstructured checklists and subjective self-assessment.

The Site Reliability Framework (SRF) was built to change that. It is a structured, empirically grounded methodology for assessing, scoring, and improving production reliability. This paper explains how it works — not what it contains, but the thinking underneath: how raw incident data becomes a usable assessment framework, why the assessment model is shaped the way it is, and how scoring translates maturity into actionable insight.

Starting from incidents, not theory

The SRF's foundation is a dataset of 192 real-world production incidents. The dataset includes incidents from organisations of all sizes, across cloud providers, financial services, consumer technology, government infrastructure, and more.

Each incident in the dataset was structurally analysed and classified. This wasn't a keyword exercise — each post-mortem was read, the causal chain was reconstructed, and the incident was tagged with:

Primary failure domain — what category of failure initiated the incident
Contributing factors — what other domains played a role in the severity or duration
Detection quality — how the incident was discovered, and how long it took
Resilience posture — whether architectural protections existed and whether they worked
Mitigation effectiveness — how the organisation contained and resolved the incident
Outage duration — where available (172 of 192 incidents had known duration data)

This classification was the raw material. Everything else in the SRF — the domains, the sub-domains, the weighting, the maturity model — was derived from patterns in this data, not from theoretical risk modelling.
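As a rough illustration of what one classified incident might look like as a record, here is a minimal Python sketch. The class name, field names, and sample values are assumptions made for this example; the SRF dataset's actual schema is not published in this paper.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IncidentRecord:
    """One classified post-mortem. Field names are illustrative only."""
    primary_domain: str                       # category that initiated the incident
    contributing_domains: list[str] = field(default_factory=list)
    detection_quality: str = "unknown"        # how it was discovered, and how fast
    resilience_posture: str = "unknown"       # did protections exist, did they work
    mitigation_effectiveness: str = "unknown" # how it was contained and resolved
    outage_minutes: Optional[int] = None      # None when no duration was published


def with_known_duration(records: list[IncidentRecord]) -> list[IncidentRecord]:
    """Keep only incidents whose outage duration is recorded."""
    return [r for r in records if r.outage_minutes is not None]


# Made-up sample records, not taken from the SRF dataset.
sample = [
    IncidentRecord("Configuration", ["Governance"], outage_minutes=95),
    IncidentRecord("Infrastructure", ["Dependency", "Incident Management"]),
]
print(len(with_known_duration(sample)))
```

Keeping the duration optional mirrors the fact that only some incidents in the dataset have known duration data, so duration statistics are computed over a subset.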

Why start with incidents?

There's a philosophical choice here worth making explicit. Most frameworks start with best practices: a group of experts sit down and write what organisations should do. The resulting framework reflects the experts' experience, which is valuable — but it also reflects their biases, their blind spots, and the particular contexts they've worked in.

Starting from incidents inverts this. Instead of asking "what should organisations do?", we asked "what actually goes wrong, and why?" The framework then works backward from each observed failure to the controls that failed and the capability that would have prevented, contained, or shortened it.

This means the SRF has an empirical answer to a question that most frameworks dodge: why does this control matter? Every capability statement in the framework is traceable to real incidents where that capability was absent or insufficient. When the SRF says you need circuit breakers on your external dependencies, it's because we observed specific incidents where the absence of circuit breakers turned a minor upstream blip into a cascading outage lasting hours.

From incidents to domains

The 192 incidents naturally cluster into categories, not because we imposed categories, but because production systems fail in recognisable patterns. A memory leak cascading into a service outage has a fundamentally different causal structure from a DNS misconfiguration, which is different again from a deployment that passed staging but failed in production.

The initial clustering identified seven technical failure categories: Code, Dependency, Infrastructure, Deployment, Configuration, Database & Storage, and Authentication. These are the incident-driven domains — each one directly traceable to a cluster of real failures.

However, analysing the contributing factors across incidents revealed something important: nearly every major incident had governance, people, or process failures amplifying the technical root cause. A config change that caused an outage was technically a configuration failure — but the deeper problem was often that no change review process existed, or the on-call engineer didn't have the knowledge to diagnose it, or the incident response was uncoordinated. These patterns appeared so consistently that they warranted their own domains.

This led to the two-tier structure:

Enabling domains (Governance, Personnel, Incident Management) — cross-cutting organisational capabilities that don't cause incidents on their own, but whose absence makes every incident worse. These are derived from contributing-factor analysis rather than primary failure classification.

Technical domains (the remaining eight) — specific categories of production failure, each weighted by how frequently they appear as primary causes in the dataset.

The split matters for assessment. An organisation might have strong technical controls but weak governance — or excellent incident management but poor code quality. The two-tier structure prevents one from masking the other.

Sub-domains: the unit of assessment

Each domain is further broken into sub-domains — typically four to eight per domain, for a total of 64 across the framework. Sub-domain granularity is driven by the principle that each sub-domain should represent a distinct reliability concern with its own detection, mitigations, and response profile.

For example, the Database & Storage domain has seven sub-domains: Capacity & Performance, Contention, High Availability, Backup & Recovery, Schema Migration, Connection Pool Management, and Data Replication. Each of these has different failure characteristics, different controls, and different people responsible for them. Lumping them together as "database reliability" would produce an assessment too coarse to be actionable.

The number of sub-domains per domain is itself empirically informed. Infrastructure has eight sub-domains because infrastructure failures are the most diverse and frequent in the dataset. Authentication has five because, while serious, auth failures cluster into fewer distinct patterns. The framework allocates assessment granularity where the data says complexity lives.

The Detection-Mitigation-Response model: three questions for every control

The core of the SRF's assessment methodology is the D-M-R model: Detection, Mitigations, and Response. Every one of the 64 sub-domains is assessed across all three axes, producing 192 individual assessment points.

This isn't an arbitrary taxonomy. It reflects a pattern observed across the entire incident dataset: failures are not events — they are sequences.

Consider a typical outage. A database begins experiencing lock contention under load. If detection is strong, monitoring fires an alert before users are affected, and the team can intervene proactively. If detection is weak but mitigations are strong, the read traffic automatically shifts to replicas and the contention resolves itself. If both detection and mitigations fail but response is strong, the on-call engineer follows a runbook to kill blocking queries and restore throughput within minutes. The outage duration and severity depend on which of these defences holds.

The D-M-R model captures this reality:

Detection asks: Can you see it coming? This covers monitoring, alerting, proactive scanning, and risk identification. The key insight from the data is that detection is necessary but not sufficient — many of the worst outages were detected quickly but still lasted hours because there was nothing in place to contain them.

Mitigations asks: What have you put in place? This covers architectural redundancy, defensive design patterns, policy enforcement, and controls that reduce the probability or blast radius of failure. Mitigations is the dimension that matters most for preventing outages entirely, and it's consistently the weakest across the industry (average score 1.1 out of 3.0 in the dataset). This is the SRF's most important finding: organisations invest heavily in detection (monitoring, alerting) and response (incident response, runbooks) but underinvest in the architectural mitigations that would make many incidents non-events.

Response asks: Can you fix it fast? This covers incident response, rollback, failover, recovery procedures, and post-incident learning. Response is the last line of defence — it determines how long an incident lasts once detection and mitigations have failed to prevent it.
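The sequence described above, where each defence only matters if the previous one has failed, can be sketched as a simple fall-through. This is an illustrative model of the lock-contention scenario, not part of the SRF itself; the function name and the boolean inputs are assumptions.

```python
# Outcome descriptions paraphrase the lock-contention scenario in the text.
OUTCOMES = {
    "detection": "alert fires before users are affected; the team intervenes",
    "mitigations": "read traffic shifts to replicas; contention resolves itself",
    "response": "on-call follows the runbook and restores throughput",
    "none": "no defence holds; the outage continues until someone improvises",
}


def first_defence_that_holds(detection_ok: bool,
                             mitigations_ok: bool,
                             response_ok: bool) -> str:
    """Return the name of the first defence layer that stops the failure."""
    if detection_ok:
        return "detection"
    if mitigations_ok:
        return "mitigations"
    if response_ok:
        return "response"
    return "none"


layer = first_defence_that_holds(False, True, True)
print(layer, "->", OUTCOMES[layer])
```

The ordering is the point of the sketch: the later a defence engages, the longer users have already been affected by the time it does.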

Why not just "maturity"?

Many frameworks assess controls on a single maturity axis: you either have the control or you don't, and if you have it, it's at some level of sophistication. The problem is that this conflates fundamentally different capabilities. An organisation might have excellent monitoring (Detection: Level 3) but no redundancy (Mitigations: Level 1) and no rollback capability (Response: Level 1). A single-axis maturity score of "Level 2" would be misleading — it would obscure the specific gap that's going to hurt them.

The three-axis model instead produces a profile, and each profile implies a different remediation strategy. One organisation needs to invest in architecture, another needs additional monitoring, and a third, strong only in Response, needs both better detection and stronger mitigations: it can recover from anything, but only after the failure has already caused damage.
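To see how a single averaged number hides the gap, the sketch below compares three hypothetical organisations whose profiles are permutations of the Detection 3 / Mitigations 1 / Response 1 example. The organisation names and the helper function are assumptions for illustration.

```python
from statistics import mean

AXES = ("Detection", "Mitigations", "Response")


def weakest_axes(profile):
    """Return the axis names that share the lowest score in a (D, M, R) profile."""
    low = min(profile)
    return [axis for axis, score in zip(AXES, profile) if score == low]


# Hypothetical profiles: same scores, different axes.
profiles = {"org_a": (3, 1, 1), "org_b": (1, 3, 1), "org_c": (1, 1, 3)}

for name, profile in profiles.items():
    # Every organisation averages to the same figure, yet the gaps differ.
    print(name, round(mean(profile), 2), weakest_axes(profile))
```

All three average to about 1.67, which a single-axis model would round to the same level, while the weakest axes differ in every case.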

Maturity levels: what "good" looks like at each stage

Each D-M-R axis is scored on a four-level scale (0 through 3). The levels are designed to be recognisable to anyone who's worked with maturity models (CMMI, ISO 33001, COBIT), but they're calibrated specifically for production reliability:

Level 0 — None. No capability exists. The organisation hasn't addressed this area at all. This is more common than you'd think: in the dataset, many incidents occurred in areas where the organisation had literally no controls — no monitoring, no redundancy, no documented response procedure.

Level 1 — Ad-hoc. Capability exists but is informal and person-dependent. The senior engineer knows how to check the database locks. The team lead has a mental model of the dependency graph. If that person is on holiday, the capability effectively drops to zero. The majority of incidents in the dataset occurred in organisations operating at Level 1 across most dimensions.

Level 2 — Defined. Documented processes and manual thresholds exist. There's a runbook. There's a monitoring dashboard. Someone has written down the escalation path. But it's manual, it requires human judgment to trigger, and it may not be consistently followed. This is where most organisations think they are — and where many actually are for Detection and Response, though typically not for Mitigations.

Level 3 — Managed. Automated, consistent, and measurable. Alerting fires automatically. Circuit breakers engage without human intervention. Deployments roll back on canary failure. The effectiveness of controls is tracked through metrics. This is the level where reliability becomes a system property rather than a human performance issue.
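One way an assessment tool might encode the four levels is as an ordered enumeration, so that scores can be compared and averaged. This is a sketch only; the class and helper names are assumptions, not identifiers from the SRF Standard.

```python
from enum import IntEnum


class Maturity(IntEnum):
    """The four maturity levels, ordered so comparisons work naturally."""
    NONE = 0     # no capability exists
    AD_HOC = 1   # informal and person-dependent
    DEFINED = 2  # documented, but manual
    MANAGED = 3  # automated, consistent, measurable


def is_person_dependent(level: Maturity) -> bool:
    """Level 1 capability disappears when the person who holds it is away."""
    return level == Maturity.AD_HOC


def is_automated(level: Maturity) -> bool:
    """Only Level 3 operates without a human having to trigger it."""
    return level >= Maturity.MANAGED


print(Maturity(2).name, is_automated(Maturity(2)))
```

Using an integer-backed enum keeps the names readable in reports while still allowing the numeric averaging that the scoring step relies on.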

The gap between Level 1 and Level 3

The empirical data reveals that the most impactful reliability improvement comes from moving from Level 1 (ad-hoc) to Level 3 (managed) on the Mitigations axis. This is because:

Level 1 Mitigations means your architecture has minimal protection against failure — most failures become catastrophic incidents.
Level 3 Mitigations means your architecture absorbs most failures automatically — only the exceptional ones become incidents.

The difference in expected downtime between a Mitigations score of 1 and a Mitigations score of 3 is dramatic. The dataset shows that incidents in organisations with strong mitigations (Level 3) averaged 2.1 hours duration, while those with weak mitigations (Level 0–1) averaged 8.7 hours. More importantly, the strong-mitigations organisations had fewer incidents reaching production impact in the first place — their architectural protections converted what would have been outages into non-events.
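The size of that difference is easy to work out from the two averages quoted above. The short calculation below uses only those two figures; the variable names are chosen for this example.

```python
strong_hours = 2.1  # average incident duration, Mitigations Level 3
weak_hours = 8.7    # average incident duration, Mitigations Level 0-1

ratio = weak_hours / strong_hours
reduction_pct = 100 * (weak_hours - strong_hours) / weak_hours

print(f"Weak-mitigation incidents last {ratio:.1f}x longer")
print(f"Strong mitigations shorten the average incident by {reduction_pct:.0f}%")
```

This compares average duration per incident only. It does not capture the second effect described above, that strong-mitigations organisations also have fewer incidents reaching production impact.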

Scoring: from maturity to meaning

Raw maturity scores (one number between 0 and 3 for each assessment point) are useful for practitioners but need translation for executives, boards, and procurement teams. The SRF derives three outputs from the assessment: a reliability grade, per-domain D-M-R profiles, and an estimated annual downtime figure.

The reliability grade (A through F)

The overall grade is a weighted average of all D-M-R scores, mapped to a letter scale. The weighting reflects empirical incident frequency: domains that cause more outages in the real world carry more weight in the score. Infrastructure, Database, and Code contribute more to the overall grade than Authentication or Dependency — because that's where the incidents are.

The grading thresholds are calibrated against the dataset:

Grade A (average 2.6–3.0): Organisations at this level would have prevented or rapidly contained the vast majority of the 192 incidents in the dataset
Grade C (average 1.8–2.2): This is approximately where the median organisation sits, based on the dataset's average scores of 1.8/1.1/1.6 across D-M-R
Grade F (average 0.0–1.3): Critical gaps across multiple domains; the organisation is operating with essentially no reliability management

The letter grade is deliberately simple. It's designed to be communicable — a CEO can understand "we're a C and we need to be a B" in a way they can't understand "our weighted D-M-R score is 1.72".
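A minimal sketch of the weighted-average-to-letter mapping might look like the following. The A, C, and F bands come from the thresholds listed above. The B and D bands are assumed to fill the gaps between them, and every domain weight except Infrastructure's 28% incident share is a hypothetical placeholder, since the paper does not publish the actual weights.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-domain scores, normalised by the weights used."""
    total_weight = sum(weights[d] for d in scores)
    return sum(scores[d] * weights[d] for d in scores) / total_weight


def grade(score: float) -> str:
    """Map a 0-3 score to a letter. B and D bands are assumptions."""
    s = round(score, 1)
    if s >= 2.6:
        return "A"
    if s >= 2.3:
        return "B"  # assumed band between C and A
    if s >= 1.8:
        return "C"
    if s >= 1.4:
        return "D"  # assumed band between F and C
    return "F"


# Hypothetical per-domain averages and weights for a three-domain example.
scores = {"Infrastructure": 1.5, "Database & Storage": 1.5, "Authentication": 3.0}
weights = {"Infrastructure": 0.28, "Database & Storage": 0.15, "Authentication": 0.05}

overall = weighted_score(scores, weights)
unweighted = sum(scores.values()) / len(scores)
print(round(overall, 2), grade(overall), "vs unweighted", grade(unweighted))
```

In this example the strong Authentication score lifts the unweighted average into a higher band than the weighted one, which is the effect the frequency weighting is meant to prevent.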

Per-domain D-M-R profiles

Beneath the overall grade, each domain produces a three-number profile (e.g., Detection 2.4 / Mitigations 1.1 / Response 1.8). This is where the actionable insight lives. An organisation with strong Detection but weak Mitigations needs to invest in architecture, not more monitoring. An organisation with weak Response needs incident response and rollback capabilities, not more preventive controls.

The profiles also reveal patterns. It's common to see Detection consistently higher than Mitigations — monitoring tools are easier to buy and deploy than architectural redesign. The SRF makes this pattern visible and quantifiable, rather than leaving it as a vague feeling that "we should probably improve our architecture."
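As an illustration of how a per-domain profile could be produced, the sketch below averages sub-domain scores on each axis and reports the Detection-minus-Mitigations gap. Simple averaging is an assumption here, as are the function names and sample scores; the SRF Standard may aggregate differently.

```python
def domain_profile(sub_scores):
    """Average a list of (D, M, R) sub-domain scores into one domain profile."""
    columns = zip(*sub_scores)  # regroup scores by axis
    return tuple(round(sum(col) / len(sub_scores), 1) for col in columns)


def detection_mitigation_gap(profile):
    """How far Detection runs ahead of Mitigations in a (D, M, R) profile."""
    detection, mitigations, _ = profile
    return round(detection - mitigations, 1)


# Made-up scores for three sub-domains of a single domain.
sub_scores = [(3, 1, 2), (2, 1, 2), (2, 1, 1)]

profile = domain_profile(sub_scores)
print(profile, "gap:", detection_mitigation_gap(profile))
```

A positive gap is the quantified form of the pattern described above: monitoring running ahead of architecture.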

Estimated annual downtime

The most commercially powerful output is the estimated annual downtime figure. Using the correlation between D-M-R scores and observed outage durations in the dataset, the SRF produces an estimate of how many hours of downtime per year the organisation's current controls would be expected to produce.

This is necessarily an estimate — it's based on statistical correlation, not deterministic prediction. But it translates abstract maturity scores into a number that finance and risk teams understand. "We estimate 47 hours of annual downtime based on your current controls" is a fundamentally different conversation from "your Mitigations score is 1.1."

The downtime estimate also powers the ROI case for improvement. If the current estimate is 47 hours and raising Mitigations from Level 1 to Level 3 across three key domains would reduce it to 12 hours, the business case writes itself — especially when you can attach revenue-per-hour-of-downtime to the calculation.
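The business case arithmetic can be sketched in a few lines. The 47-hour and 12-hour figures are the illustrative numbers from the paragraph above; the revenue-per-hour value is a hypothetical placeholder that each organisation would replace with its own.

```python
current_hours = 47         # estimated annual downtime today (from the text)
projected_hours = 12       # estimate after raising Mitigations (from the text)
revenue_per_hour = 20_000  # hypothetical cost of one hour of downtime

hours_avoided = current_hours - projected_hours
annual_value = hours_avoided * revenue_per_hour

print(f"{hours_avoided} hours avoided, worth {annual_value:,} per year")
```

Comparing that annual value with the cost of the architectural work gives a first-order ROI figure, subject to the caveat that the downtime numbers are statistical estimates.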

The assessment process

The SRF is designed to be assessed at three levels of rigour:

Self-assessment (free, online tool) — The organisation answers structured questions for each applicable sub-domain. The tool scores responses against the maturity model and produces a grade, D-M-R profiles, gap analysis, and estimated downtime. This is designed to be completable in under 20 minutes by someone with broad operational knowledge of their environment. The self-assessment is intentionally generous — it takes the organisation's answers at face value without evidence verification.

Professional attestation (formal engagement) — An assessor conducts interviews, reviews evidence (dashboards, configurations, runbooks, incident records), and independently scores each sub-domain. This produces a formal attestation letter, detailed findings report, and embeddable reliability badge with 12-month validity. The attestation process is modelled on ISO 27001 certification audits: scoping, evidence collection, assessment, reporting, and remediation tracking.

Continuous monitoring (tooling-supported) — Through integration with the organisation's existing tooling (source control, cloud platforms, observability stack), the SRF assessment is kept current as controls change. This shifts reliability from a point-in-time audit to an ongoing posture measurement.

Each level builds on the previous one. The self-assessment reveals the landscape. The attestation validates it with evidence. Continuous monitoring keeps it current.

Framework design principles

Several deliberate design choices shape the SRF and distinguish it from adjacent frameworks:

Empirical over theoretical. Every structural decision — which domains exist, how many sub-domains each has, what capabilities are assessed — is traceable to observed incidents. This means the framework has built-in answers to "why does this matter?" that theoretical frameworks lack.

Assessment over prescription. The SRF defines what needs to be true (e.g., "circuit breakers are configured for all external dependencies") but not how to achieve it (e.g., which circuit breaker library to use). This makes the framework technology-agnostic and applicable across cloud providers, programming languages, and architectural styles.

Profiles over scores. A single reliability number is seductive but misleading. The D-M-R profile reveals where the investment is needed. Two organisations with the same overall score might need completely different remediation strategies.

Familiar structure. The framework is organised in numbered clauses with normative and informative annexes, following the conventions of ISO management system standards. This is deliberate — it means that anyone who's been through an ISO 27001 audit, an ITIL assessment, or a SOC 2 examination will recognise the structure immediately. The SRF doesn't require organisations to learn a new assessment paradigm; it applies an existing, well-understood paradigm to a domain that has lacked one.

Complementary, not competing. The SRF is designed to sit alongside ISO 27001, ITIL, DORA, and SOC 2 — not to replace them. Many controls overlap (particularly in governance, access control, and incident management). The SRF adds depth in areas those frameworks treat lightly: code reliability, dependency management, deployment safety, database availability, and network resilience. Optional mapping tables show where SRF controls align with controls in other frameworks, making it straightforward to integrate the SRF into an existing compliance programme.

What the data tells us

Across 192 incidents, the dataset reveals several patterns that shaped the framework and inform its use:

Mitigations is the weakest link. Average scores across the dataset: Detection 1.8, Mitigations 1.1, Response 1.6. The gap between Detection and Mitigations — 0.7 points on a 3-point scale — is the single most important finding. Organisations are investing in watching things break but not in building things that don't break.

Infrastructure is the biggest category. 28% of incidents are infrastructure-related, which is why INFRA has eight sub-domains — more than any other domain. But infrastructure incidents are also among the most preventable: they tend to involve known failure modes (capacity exhaustion, single points of failure) with well-understood mitigations.

Dependencies are understated. Only 5% of incidents have Dependency as the primary cause, but dependency issues appear as contributing factors in over 30% of all incidents. This pattern — where dependencies amplify other failures rather than causing them directly — is why the DEP domain exists despite its low primary-cause count.
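The distinction between primary-cause share and contributing-factor share can be made concrete with a small tally. The four incidents below are made up for the example and are not drawn from the dataset; the variable and function names are likewise illustrative.

```python
from collections import Counter

# Each tuple is (primary domain, set of contributing domains). Made-up data.
incidents = [
    ("Infrastructure", {"Dependency"}),
    ("Code", {"Dependency", "Governance"}),
    ("Dependency", set()),
    ("Deployment", {"Governance"}),
]

primary = Counter(p for p, _ in incidents)
contributing = Counter(d for _, factors in incidents for d in factors)


def share(counter: Counter, domain: str, total: int) -> float:
    """Fraction of incidents in which the domain appears in this role."""
    return counter[domain] / total


n = len(incidents)
print("primary:", share(primary, "Dependency", n))
print("contributing:", share(contributing, "Dependency", n))
```

Counting the two roles separately is what lets a domain with a low primary-cause share still show up as significant.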

Average outage duration is 6.4 hours. Across 172 incidents with duration data, the mean outage was 6.4 hours, with a total of 66,498 minutes (over 1,100 hours) of captured downtime. The distribution is heavily right-skewed: most outages are resolved in 1–4 hours, but a long tail of multi-day incidents pulls the average up.

The 6.4-hour average is not a law of nature. It's the average given the maturity levels observed in the dataset (roughly 1.8/1.1/1.6). Organisations with higher maturity — particularly higher Mitigations — experience shorter and fewer outages. The SRF's downtime estimation model quantifies this relationship, making it possible to project the impact of targeted improvement.
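The headline figures can be cross-checked from the totals quoted above, and a small made-up sample shows how a long tail pulls the mean above the typical outage. Only the 66,498 minutes and 172 incidents come from the text; the sample durations are invented for illustration.

```python
from statistics import mean, median

total_minutes = 66_498        # captured downtime (from the text)
incidents_with_duration = 172 # incidents with duration data (from the text)

total_hours = total_minutes / 60
mean_hours = total_hours / incidents_with_duration
print(f"{total_hours:.0f} hours in total, {mean_hours:.1f} hours per incident")

# Invented right-skewed sample: mostly short outages plus one multi-day tail.
sample_hours = [1, 2, 2, 3, 4, 48]
print("mean:", mean(sample_hours), "median:", median(sample_hours))
```

In a distribution shaped like this, the median is a better description of the typical outage, while the mean is what drives total downtime.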

Conclusion

The Site Reliability Framework is, at its core, an argument that production reliability should be measured with the same rigour we apply to information security, financial controls, and workplace safety. The methodology — starting from real incidents, deriving domains from observed failure patterns, assessing across three axes at four maturity levels, and translating scores into actionable profiles and downtime estimates — is designed to make that measurement structured, repeatable, and grounded in evidence.

The framework doesn't claim that Level 3 maturity everywhere is the right answer for every organisation. It claims that knowing where you are is the prerequisite for deciding where you need to be — and that decisions about reliability investment should be informed by data about how systems actually fail, not by inherited assumptions about what "good" looks like.

SRF 1.0 | sitereliability.consulting | 2026

For the full normative framework including all 11 domains, 67 sub-domains, and 201 assessment points, see the SRF Standard v1.0.
