/// Engineering & Reliability

Mean Time to Recovery (MTTR)

Mean Time to Recovery (MTTR) measures the average time required to restore a service or system to normal operation after a failure or incident. It is one of the four DORA metrics and a critical reliability KPI. MTTR directly determines the amount of downtime caused by each incident and the impact on users and revenue. Shorter MTTR requires strong observability, clear incident response processes, and empowered on-call engineers.

MTTR is meaningfully different from Mean Time Between Failures (MTBF): MTTR measures how quickly you recover from failures, while MTBF measures how often failures occur. Both matter for overall availability.

Formula
Sum of Recovery Time per Incident ÷ Number of Incidents
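The formula can be sketched in a few lines of Python; the incident timestamps below are illustrative:

```python
from datetime import datetime

# Each incident records when impact began and when service was restored.
incidents = [
    (datetime(2024, 5, 1, 9, 0),  datetime(2024, 5, 1, 9, 45)),   # 45 min
    (datetime(2024, 5, 8, 14, 0), datetime(2024, 5, 8, 16, 30)),  # 150 min
    (datetime(2024, 5, 20, 3, 0), datetime(2024, 5, 20, 3, 30)),  # 30 min
]

def mttr_minutes(incidents):
    """Sum of recovery time per incident ÷ number of incidents."""
    total_seconds = sum(
        (restored - started).total_seconds() for started, restored in incidents
    )
    return total_seconds / len(incidents) / 60

print(mttr_minutes(incidents))  # 75.0
```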
Where It Lives
  • PagerDuty: incident lifecycle tracking from alert to resolution
  • Datadog: incident management and MTTR reporting
  • Opsgenie: on-call management and incident timeline tracking
  • FireHydrant: incident orchestration with MTTR analytics
What Drives It
  • Observability depth (logs, metrics, traces) enabling fast diagnosis
  • On-call response time and runbook quality
  • Incident command process clarity and escalation procedures
  • Automated rollback capability for deployment-caused incidents
  • Blast radius of failures (isolated microservices recover faster than monoliths)
Causal Analysis: Post-incident reviews (PIRs) identify which diagnostic delays, communication breakdowns, or missing runbook steps caused MTTR to exceed target, driving specific process improvements.
Benchmark

DORA elite teams recover in under 1 hour; high performers in under 24 hours; medium performers in less than 1 week; low performers in over 1 week.
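The DORA tiers above amount to simple thresholds, which can be expressed directly (a minimal sketch; the cutoffs mirror the benchmark text):

```python
def dora_tier(mttr_hours: float) -> str:
    """Map an MTTR value (in hours) to the DORA performance tier."""
    if mttr_hours < 1:
        return "elite"        # under 1 hour
    if mttr_hours < 24:
        return "high"         # under 24 hours
    if mttr_hours < 24 * 7:
        return "medium"       # under 1 week
    return "low"              # over 1 week

print(dora_tier(0.5))   # elite
print(dora_tier(36))    # medium
```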

Common Mistake
Starting the MTTR clock at incident acknowledgment rather than at the first user impact, which understates actual customer-impacting downtime.
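The understatement is easy to see with a single incident timeline (illustrative timestamps):

```python
from datetime import datetime

# Timeline for one incident.
first_user_impact = datetime(2024, 6, 1, 10, 0)
acknowledged      = datetime(2024, 6, 1, 10, 25)  # pager acknowledged
restored          = datetime(2024, 6, 1, 11, 40)

# Correct: the clock starts at first user impact.
recovery_correct = (restored - first_user_impact).total_seconds() / 60      # 100 min

# Understated: starting at acknowledgment hides 25 min of customer downtime.
recovery_understated = (restored - acknowledged).total_seconds() / 60       # 75 min
```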

How Different Roles Think About This Metric

Each function reads MTTR through a different lens and takes different actions when it changes.

CTO
The CTO sets MTTR targets in SLOs and ensures the organization invests in observability and incident response tooling that enables fast recovery.
VP Engineering
VP Engineering owns the incident management process and reviews MTTR trends to identify systemic improvements to on-call practices and runbook quality.
Director Engineering
Directors run post-incident reviews and implement the remediation actions that reduce MTTR for future incidents.
COO
The COO monitors MTTR as a customer experience metric, particularly for enterprise customers with SLA-based downtime penalties.

Common Questions About Mean Time to Recovery


What is the difference between MTTR and MTTD (Mean Time to Detect)?
MTTD (Mean Time to Detect) measures how long it takes to discover that an incident is occurring. MTTR measures the total time from failure to recovery, which includes detection, diagnosis, and restoration. Improving MTTD shortens MTTR by starting the response process sooner. Many organizations separately track MTTD, MTTI (Mean Time to Investigate), and MTTF (Mean Time to Fix) to understand where in the incident lifecycle the most time is lost.
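The lifecycle decomposition described above can be sketched from incident timestamps; the phase boundaries here are illustrative:

```python
from datetime import datetime

# Illustrative incident timeline; the phases mirror MTTD / MTTI / MTTF.
failure_start = datetime(2024, 7, 3, 2, 0)
detected      = datetime(2024, 7, 3, 2, 20)   # alert fired
diagnosed     = datetime(2024, 7, 3, 3, 10)   # root cause found
restored      = datetime(2024, 7, 3, 3, 25)   # fix deployed

def minutes(start, end):
    return (end - start).total_seconds() / 60

phases = {
    "detect (MTTD)":      minutes(failure_start, detected),   # 20 min
    "investigate (MTTI)": minutes(detected, diagnosed),       # 50 min
    "fix (MTTF)":         minutes(diagnosed, restored),       # 15 min
}
total_recovery = minutes(failure_start, restored)             # 85 min

# The phases sum to total recovery time, showing where time is lost:
# here, investigation dominates, pointing at an observability gap.
```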
How can observability improve MTTR?
Observability (the ability to understand system state from external outputs) reduces the diagnosis phase of incident response, which is often the longest part of MTTR. Comprehensive distributed tracing shows exactly where a request failed. Correlated logs and metrics allow engineers to quickly identify the root cause. Without good observability, engineers spend the majority of MTTR in blind investigation rather than implementing a fix.
What is a blameless post-incident review?
A blameless post-incident review (also called a postmortem) documents what happened, why it happened, and what will be done to prevent recurrence, without assigning personal blame to the individuals involved. The blameless culture, popularized by Google's SRE practices, encourages engineers to report incidents honestly and participate in reviews without fear of punishment. This produces better systemic fixes than blame-based cultures that incentivize hiding problems.
How does runbook quality affect MTTR?
Runbooks are step-by-step guides for diagnosing and resolving known failure modes. A high-quality runbook can reduce MTTR from hours to minutes for known incident types by giving on-call engineers a clear path to resolution without requiring deep knowledge of every system component. Runbooks should be reviewed and updated after every relevant incident to capture new diagnostic steps and automated remediation procedures discovered during recovery.
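One way teams keep runbooks reviewable is to store them as code alongside the service. A minimal sketch (the failure mode and every step below are hypothetical examples, not prescribed content):

```python
# Hypothetical runbook for one known failure mode, encoded as ordered steps
# so it can be versioned, reviewed, and updated after each incident.
RUNBOOK_DB_CONNECTION_EXHAUSTION = [
    "Check connection-pool saturation on the primary database dashboard",
    "Identify which service is holding the most open connections",
    "Restart that service to release leaked connections",
    "Confirm error rate returns to baseline before resolving the incident",
]

def print_runbook(steps):
    """Print the numbered steps an on-call engineer should follow."""
    for i, step in enumerate(steps, 1):
        print(f"{i}. {step}")

print_runbook(RUNBOOK_DB_CONNECTION_EXHAUSTION)
```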

Related Metrics

Metrics that are commonly analyzed alongside MTTR.

Role Guides That Include This Metric

See how each role uses MTTR in context with the full set of metrics they own.

/// get started

See What’s Actually Moving Your MTTR

askotter connects your data sources and applies causal analysis to tell you exactly why your metrics are changing, not just that they changed.

Book a Conversation →