/// Engineering & Reliability

Mean Time to Recovery (MTTR)

Mean Time to Recovery (MTTR) measures the average time required to restore a service or system to normal operation after a failure or incident. It is one of the four DORA metrics and a critical reliability KPI. MTTR directly determines the amount of downtime caused by each incident and the impact on users and revenue. Shorter MTTR requires strong observability, clear incident response processes, and empowered on-call engineers.

MTTR is meaningfully different from Mean Time Between Failures (MTBF): MTTR measures how quickly you recover from failures, while MTBF measures how often failures occur. Both matter for overall availability.

Formula
Sum of Recovery Time per Incident ÷ Number of Incidents
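The formula can be sketched in a few lines of Python; the incident timestamps below are illustrative:

```python
from datetime import datetime

# Each incident records when impact began and when service was restored.
incidents = [
    (datetime(2024, 5, 1, 9, 0),  datetime(2024, 5, 1, 9, 45)),   # 45 min
    (datetime(2024, 5, 8, 14, 0), datetime(2024, 5, 8, 16, 30)),  # 150 min
    (datetime(2024, 5, 20, 3, 0), datetime(2024, 5, 20, 3, 30)),  # 30 min
]

def mttr_minutes(incidents):
    """Sum of recovery time per incident ÷ number of incidents."""
    total_seconds = sum(
        (restored - started).total_seconds() for started, restored in incidents
    )
    return total_seconds / len(incidents) / 60

print(mttr_minutes(incidents))  # 75.0
```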
Where It Lives
  • PagerDuty: incident lifecycle tracking from alert to resolution
  • Datadog: incident management and MTTR reporting
  • Opsgenie: on-call management and incident timeline tracking
  • FireHydrant: incident orchestration with MTTR analytics
What Drives It
  • Observability depth (logs, metrics, traces) enabling fast diagnosis
  • On-call response time and runbook quality
  • Incident command process clarity and escalation procedures
  • Automated rollback capability for deployment-caused incidents
  • Blast radius of failures (isolated microservices recover faster than monoliths)
Causal Analysis: Post-incident reviews (PIRs) identify which diagnostic delays, communication breakdowns, or missing runbook steps caused MTTR to exceed target, driving specific process improvements.
Benchmark

DORA elite teams recover in under 1 hour; high performers in under 24 hours; medium performers in less than 1 week; low performers in over 1 week.
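The DORA tiers above amount to simple thresholds, which can be expressed directly (a minimal sketch; the cutoffs mirror the benchmark text):

```python
def dora_tier(mttr_hours: float) -> str:
    """Map an MTTR value (in hours) to the DORA performance tier."""
    if mttr_hours < 1:
        return "elite"        # under 1 hour
    if mttr_hours < 24:
        return "high"         # under 24 hours
    if mttr_hours < 24 * 7:
        return "medium"       # under 1 week
    return "low"              # over 1 week

print(dora_tier(0.5))   # elite
print(dora_tier(36))    # medium
```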

Common Mistake
Starting the MTTR clock at incident acknowledgment rather than at the first user impact, which understates actual customer-impacting downtime.
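The understatement is easy to see with a single incident timeline (illustrative timestamps):

```python
from datetime import datetime

# Timeline for one incident.
first_user_impact = datetime(2024, 6, 1, 10, 0)
acknowledged      = datetime(2024, 6, 1, 10, 25)  # pager acknowledged
restored          = datetime(2024, 6, 1, 11, 40)

# Correct: the clock starts at first user impact.
recovery_correct = (restored - first_user_impact).total_seconds() / 60      # 100 min

# Understated: starting at acknowledgment hides 25 min of customer downtime.
recovery_understated = (restored - acknowledged).total_seconds() / 60       # 75 min
```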

How Different Roles Think About This Metric

Each function reads MTTR through a different lens and takes different actions when it changes.

CTO
The CTO sets MTTR targets in SLOs and ensures the organization invests in observability and incident response tooling that enables fast recovery.
VP Engineering
VP Engineering owns the incident management process and reviews MTTR trends to identify systemic improvements to on-call practices and runbook quality.
Director Engineering
Directors run post-incident reviews and implement the remediation actions that reduce MTTR for future incidents.
COO
The COO monitors MTTR as a customer experience metric, particularly for enterprise customers with SLA-based downtime penalties.

Common Questions About Mean Time to Recovery


What is the difference between MTTR and MTTD (Mean Time to Detect)?
MTTD (Mean Time to Detect) measures how long it takes to discover that an incident is occurring. MTTR measures the total time from failure to recovery, which includes detection, diagnosis, and restoration. Improving MTTD shortens MTTR by starting the response process sooner. Many organizations separately track MTTD, MTTI (Mean Time to Investigate), and MTTF (Mean Time to Fix) to understand where in the incident lifecycle the most time is lost.
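The lifecycle decomposition described above can be sketched from incident timestamps; the phase boundaries here are illustrative:

```python
from datetime import datetime

# Illustrative incident timeline; the phases mirror MTTD / MTTI / MTTF.
failure_start = datetime(2024, 7, 3, 2, 0)
detected      = datetime(2024, 7, 3, 2, 20)   # alert fired
diagnosed     = datetime(2024, 7, 3, 3, 10)   # root cause found
restored      = datetime(2024, 7, 3, 3, 25)   # fix deployed

def minutes(start, end):
    return (end - start).total_seconds() / 60

phases = {
    "detect (MTTD)":      minutes(failure_start, detected),   # 20 min
    "investigate (MTTI)": minutes(detected, diagnosed),       # 50 min
    "fix (MTTF)":         minutes(diagnosed, restored),       # 15 min
}
total_recovery = minutes(failure_start, restored)             # 85 min

# The phases sum to total recovery time, showing where time is lost:
# here, investigation dominates, pointing at an observability gap.
```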
How can observability improve MTTR?
Observability (the ability to understand system state from external outputs) reduces the diagnosis phase of incident response, which is often the longest part of MTTR. Comprehensive distributed tracing shows exactly where a request failed. Correlated logs and metrics allow engineers to quickly identify the root cause. Without good observability, engineers spend the majority of MTTR in blind investigation rather than implementing a fix.
What is a blameless post-incident review?
A blameless post-incident review (also called a postmortem) documents what happened, why it happened, and what will be done to prevent recurrence, without assigning personal blame to the individuals involved. The blameless culture, popularized by Google's SRE practices, encourages engineers to report incidents honestly and participate in reviews without fear of punishment. This produces better systemic fixes than blame-based cultures that incentivize hiding problems.
How does runbook quality affect MTTR?
Runbooks are step-by-step guides for diagnosing and resolving known failure modes. A high-quality runbook can reduce MTTR from hours to minutes for known incident types by giving on-call engineers a clear path to resolution without requiring deep knowledge of every system component. Runbooks should be reviewed and updated after every relevant incident to capture new diagnostic steps and automated remediation procedures discovered during recovery.
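One way teams keep runbooks reviewable is to store them as code alongside the service. A minimal sketch (the failure mode and every step below are hypothetical examples, not prescribed content):

```python
# Hypothetical runbook for one known failure mode, encoded as ordered steps
# so it can be versioned, reviewed, and updated after each incident.
RUNBOOK_DB_CONNECTION_EXHAUSTION = [
    "Check connection-pool saturation on the primary database dashboard",
    "Identify which service is holding the most open connections",
    "Restart that service to release leaked connections",
    "Confirm error rate returns to baseline before resolving the incident",
]

def print_runbook(steps):
    """Print the numbered steps an on-call engineer should follow."""
    for i, step in enumerate(steps, 1):
        print(f"{i}. {step}")

print_runbook(RUNBOOK_DB_CONNECTION_EXHAUSTION)
```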

Related Metrics

Metrics that are commonly analyzed alongside MTTR.

Role Guides That Include This Metric

See how each role uses MTTR in context with the full set of metrics they own.

/// get started

See What’s Actually Moving Your MTTR

askotter connects your data sources and applies causal analysis to tell you exactly why your metrics are changing, not just that they changed.

Book a Conversation →