Governance Models

Four of the SRF's 14 domains are enabling domains: Governance (GOV), Personnel (PER), Incident Management (INC), and Capacity Management (CAP). Unlike the failure domains, enabling domains rarely initiate incidents directly — but their weaknesses appear as contributing factors in nearly every major incident. A configuration change that caused an outage may be technically a CONFIG failure; the deeper problem is often that no change review process existed (GOV), the on-call engineer lacked the diagnostic knowledge (PER), the response was unco-ordinated (INC), or headroom had been exhausted without warning (CAP).

This page is about programme governance — how an organisation structures, runs, and sustains an SRF assessment programme over time. That is a different question from what the GOV domain's sub-domains assess, though the two are related: an organisation with weak GOV domain scores is also likely to struggle with the operational discipline that a mature SRF programme demands.

Assessment cadence: three tiers

The SRF is designed to be assessed at three levels of rigour, each building on the one before.

Tier 1: Self-assessment

The free, online self-assessment tool can be completed in under 20 minutes by someone with broad operational knowledge of their environment. It runs entirely in memory with no persistence: responses are not stored or transmitted. The tool scores answers against the maturity model and produces a grade, per-domain D-M-R profiles, a gap analysis, and an estimated annual downtime figure.

The self-assessment is intentionally generous: it takes the organisation's answers at face value without evidence verification. This is a deliberate design choice. The goal at this tier is to reveal the landscape — to surface which domains and axes are weakest and focus attention on where to look. It is not a certification and it does not claim to be.

Tier 2: Professional attestation

A professional attestation engagement involves an independent assessor conducting structured interviews and reviewing evidence — monitoring dashboards, configuration records, runbooks, deployment pipelines, incident records, and access control policies. The assessor independently scores each sub-domain against the maturity model.

The process is modelled on ISO 27001 certification audits: scoping, evidence collection, assessment, reporting, and remediation tracking. The output is a formal attestation letter with 12-month validity, a detailed findings report with per-sub-domain scores and narrative, and an embeddable reliability badge for external communication.

Attestation validates what the self-assessment reveals. An organisation that self-assessed at Grade B may attest at Grade C once evidence is reviewed — not because the self-assessment was dishonest, but because capability claims that feel accurate informally often have gaps when subjected to evidence review.

Tier 3: Continuous monitoring

At the highest tier, the SRF assessment is kept current through integration with the organisation's existing tooling: source control systems, cloud platform APIs, observability stacks, and deployment pipelines. Controls that can be verified programmatically — deployment frequency, alert coverage, backup recency, dependency vulnerability counts — are monitored continuously rather than assessed at a point in time.

Continuous monitoring shifts reliability from an annual audit posture to an ongoing measurement posture. Scores update as controls change. Remediation is visible in near-real time. Regression from a previous grade triggers a notification rather than a surprise finding at the next annual review.
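The regression notification described above can be sketched in a few lines of Python. This is a minimal illustration, not the SRF's implementation: the A-to-F grade scale, the function names, and the `notify` callback are all assumptions, since the source does not say how the monitoring tier represents grades or delivers notifications.

```python
# Assumed grade scale, best first. The SRF's actual scale may differ.
GRADE_ORDER = ["A", "B", "C", "D", "E", "F"]


def grade_regressed(previous: str, current: str) -> bool:
    """Return True when `current` is a worse grade than `previous`."""
    return GRADE_ORDER.index(current) > GRADE_ORDER.index(previous)


def check_for_regression(previous: str, current: str, notify) -> bool:
    """Call `notify` with a message if the grade has regressed.

    `notify` is any callable taking one string (hypothetical hook,
    for example a chat or paging integration).
    """
    if grade_regressed(previous, current):
        notify(f"SRF grade regressed from {previous} to {current}")
        return True
    return False


if __name__ == "__main__":
    check_for_regression("B", "C", print)  # regression: a message is printed
    check_for_regression("C", "B", print)  # improvement: nothing is printed
```

Passing the notifier in as a callable keeps the comparison logic independent of whichever alerting channel an organisation already uses.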

Programme roles

An SRF programme is a cross-functional effort. The following roles are typically involved, regardless of how an organisation's hierarchy is formally structured.

Engineering and SRE teams

Engineering and SRE teams are the primary assessors. They complete the capability checklists and, crucially, they know the current state of controls — which monitoring exists, how deployments actually work, what the real rollback procedure is as opposed to what the runbook says. Without their involvement, self-assessment produces aspirational scores rather than accurate ones.

SRE or platform lead

The SRE or platform lead acts as assessment owner. They are responsible for scoping which parts of the system are in scope, co-ordinating inputs from across engineering teams, and tracking remediation against the backlog that the assessment produces. In smaller organisations, this role is often filled by a senior engineer or engineering manager.

Engineering manager or VP Engineering

The engineering manager or VP Engineering reviews domain profiles and prioritises remediation investment. The SRF assessment produces more improvement candidates than any team can act on simultaneously; this role makes the triage decisions — which sub-domain gaps to close first, which can wait, and what the next 90-day target looks like.

Executive sponsor

The executive sponsor — typically CTO or CIO — receives the overall grade and the estimated annual downtime figure. They own the business case for improvement investment, communicate reliability posture to the board and to customers, and set the organisational expectation that reliability is measured rather than assumed.

CISO and GRC team

Where the SRF feeds into a broader compliance or risk programme, the CISO and GRC team provide the integration point. They map SRF findings against existing control frameworks, incorporate the assessment into risk registers, and use the per-domain profiles to prioritise controls audits.

Cross-framework integration

The SRF is designed to sit alongside established frameworks, not to replace them. For organisations already operating ISO 27001, SOC 2, ITIL, or DORA programmes, the SRF adds depth in areas those frameworks treat lightly — particularly in code reliability, dependency management, deployment safety, and database availability.

ISO 27001

ISO 27001 is a comprehensive information security management standard. In the 2013 edition's numbering, its Annex A controls include operations security (A.12) and information security incident management (A.16), which overlap with SRF's INC and GOV domains; the 2022 edition reorganises Annex A, so the clause numbers differ there. However, ISO 27001 is thin on production reliability specifics: it establishes that logging should exist and that incidents should be managed, but does not assess the maturity of monitoring, the quality of architectural mitigations, or the adequacy of recovery procedures in any operational depth.

The SRF adds reliability depth in CODE, INFRA, DB, and DEPLOY — areas where an ISO 27001 certified organisation may still be operating at Level 1 maturity. The framework includes mapping guidance that aligns SRF sub-domains with the relevant ISO 27001 Annex A controls, making it straightforward to present SRF findings within an existing ISO programme.

SOC 2

SOC 2's Availability and Processing Integrity trust service criteria map heavily to SRF's INFRA, DEPLOY, and DB domains. An organisation preparing for a SOC 2 Type II audit can use the SRF assessment to identify which availability controls need strengthening before the audit window, and to build the evidence base that auditors will review. The SRF's framework mapping guidance identifies the specific SOC 2 criteria that each SRF sub-domain supports.

ITIL 4

ITIL 4's incident management and service continuity practices align closely with SRF's INC domain. The SRF's contribution is quantitative: where ITIL describes processes (how should incidents be managed?), the SRF provides a maturity score (how well are they managed?). An organisation with ITIL processes in place can use the SRF to distinguish between process documentation that exists and processes that are consistently followed at adequate quality.

DORA metrics

DORA's four key metrics — deployment frequency, lead time for changes, mean time to restore (MTTR), and change failure rate — correlate directly with SRF domain scores. A high SRF DEPLOY score predicts high deployment frequency and low change failure rate: organisations that have strong deployment controls deploy more often and break production less. A high SRF INC score predicts low MTTR: organisations with mature incident response restore service faster.

The relationship runs in both directions. DORA metrics can be used to validate SRF scores — an organisation claiming Level 3 DEPLOY maturity but with a change failure rate above 20% warrants scrutiny. The SRF provides the diagnostic depth to explain why the DORA metrics are what they are, where DORA itself only measures outcomes.
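The cross-check in the example above (a Level 3 DEPLOY claim alongside a change failure rate above 20%) might be automated along the following lines. This is a sketch under stated assumptions: the function name and parameters are hypothetical, and applying the check to claims above Level 3 as well is an extension of the example, not something the framework specifies.

```python
def deploy_claim_warrants_scrutiny(
    claimed_level: int,
    failed_changes: int,
    total_changes: int,
    level_floor: int = 3,       # claims at or above this level are checked (assumption)
    cfr_ceiling: float = 0.20,  # the 20% figure from the example in the text
) -> bool:
    """Flag a DEPLOY maturity claim that the change failure rate contradicts."""
    if total_changes <= 0:
        return False  # no deployment data, so nothing to cross-check
    change_failure_rate = failed_changes / total_changes
    return claimed_level >= level_floor and change_failure_rate > cfr_ceiling


if __name__ == "__main__":
    # 25 failed changes out of 100 is a 25% change failure rate.
    print(deploy_claim_warrants_scrutiny(3, 25, 100))
```

A flag from a check like this is a prompt for evidence review, not a finding in itself: the DORA metric says that something is off, and the SRF sub-domain scores are where the explanation would be sought.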

Remediation governance

An assessment produces more improvement candidates than a team can address simultaneously. The value of the SRF lies not in producing a grade but in structuring the improvement work that follows.

Reading the profiles

Per-domain D-M-R profiles reveal where to invest, not just how much. An organisation with Detection consistently above Mitigations across all domains needs architectural investment — more monitoring is not the answer. An organisation with weak Response scores needs incident response and rollback capability investment, regardless of how strong its preventive controls are.

The industry average makes the priority clear: Detection 1.8, Mitigations 1.1, Response 1.6. Mitigations is the weakest axis across the dataset and the one with the highest ROI for improvement. Incidents in organisations with Mitigations at Level 3 or above averaged 2.1 hours in duration; those at Level 0–1 averaged 8.7 hours. Closing the Mitigations gap is the single most impactful action most organisations can take.
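The two reading rules in this section can be expressed as a small Python helper. It is an illustrative sketch only: the function name, the data layout, the sample scores, and the 2.0 cut-off for a "weak" Response score are assumptions, as the source does not define a numeric threshold.

```python
def recommend_investment(profiles, weak_response_below=2.0):
    """Suggest where to invest from per-domain D-M-R scores.

    `profiles` maps a domain name to a (detection, mitigations, response)
    tuple. `weak_response_below` is an assumed cut-off, not an SRF value.
    """
    recommendations = []
    # Detection ahead of Mitigations in every domain: more monitoring is not the answer.
    if profiles and all(d > m for d, m, _ in profiles.values()):
        recommendations.append("architectural mitigations")
    # Weak Response needs attention regardless of the other axes.
    weak = sorted(name for name, (_, _, r) in profiles.items() if r < weak_response_below)
    if weak:
        recommendations.append("incident response and rollback: " + ", ".join(weak))
    return recommendations


# Made-up scores for two of the SRF's domains.
example = {
    "DEPLOY": (2.5, 1.0, 1.5),
    "DB": (2.0, 1.5, 2.5),
}
print(recommend_investment(example))
```

With these sample scores, Detection leads Mitigations in both domains, and only DEPLOY falls under the assumed Response cut-off, so both recommendations are returned.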

From assessment to backlog

Each unchecked capability in the assessment export represents a specific, actionable control gap. The Export Backlog function surfaces these as a structured list — sub-domain, axis, capability, and prerequisite chain — formatted for direct import into project management tooling.

The recommended approach for triaging this list is to score each sub-domain by the product of impact weight (derived from incident frequency in the dataset) and gap size (the distance between current and target maturity). Address the top three sub-domains in the first 90-day cycle, set a specific target maturity level for each, and reassess at the cycle end. Progress is measurable because the scoring model is deterministic: adding specific capabilities moves the axis score by a predictable, calculable amount.
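The impact-times-gap triage can be sketched as follows. The sub-domain names, impact weights, and maturity levels here are invented for illustration, and the class and function names are hypothetical; real impact weights would come from the framework's incident dataset.

```python
from dataclasses import dataclass


@dataclass
class SubDomainGap:
    name: str             # hypothetical sub-domain identifier
    impact_weight: float  # assumed to be derived from incident frequency data
    current_level: int
    target_level: int

    @property
    def priority(self) -> float:
        """Impact weight multiplied by the gap to target maturity."""
        gap = max(self.target_level - self.current_level, 0)
        return self.impact_weight * gap


def top_priorities(gaps, count=3):
    """Return the `count` sub-domains with the highest priority score."""
    return sorted(gaps, key=lambda g: g.priority, reverse=True)[:count]


gaps = [
    SubDomainGap("DEPLOY-rollback", 0.5, 1, 3),   # 0.5 * 2 = 1.0
    SubDomainGap("DB-failover", 0.25, 0, 3),      # 0.25 * 3 = 0.75
    SubDomainGap("CODE-review", 0.5, 2, 3),       # 0.5 * 1 = 0.5
    SubDomainGap("INFRA-capacity", 0.125, 1, 2),  # 0.125 * 1 = 0.125
]
for gap in top_priorities(gaps):
    print(gap.name, gap.priority)
```

Clamping the gap at zero means a sub-domain that already meets its target never competes for a place in the 90-day cycle.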

Remediation cycle

A practical remediation cycle operates as follows:

1. Assess: complete the self-assessment or attestation to establish a baseline.
2. Prioritise: identify the three sub-domains with the highest (impact × gap) score.
3. Target: set a specific maturity level target for each selected sub-domain (typically +1 or +2 levels).
4. Implement: address the specific capability gaps identified by the assessment.
5. Reassess: re-run the assessment after 90 days to measure progress and re-prioritise.

This cycle is designed to produce visible improvement in the overall grade within a single quarter when the highest-impact gaps are targeted. An organisation moving from Grade C to Grade B through one or two remediation cycles will find the business case for the next cycle materially easier to make.
