/// Engineering & Reliability

Error Rate

Error Rate measures the percentage of API requests, user sessions, or transactions that result in an error (typically HTTP 5xx server errors or application-level exceptions). It is a real-time health signal for production systems and is used as an SLO indicator alongside latency and uptime. Sudden spikes in error rate are often the first observable signal of a production incident.

Error rate should be tracked at multiple levels: total platform error rate, per-endpoint error rate, and per-user-segment error rate. These breakdowns catch localized failures that would otherwise blend into an acceptable aggregate number.
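As a minimal sketch of this multi-level tracking, the helper below groups request records by an arbitrary field (the record shape and field names are illustrative assumptions, not a specific tool's schema):

```python
from collections import defaultdict

def error_rates_by_key(requests, key):
    """Compute per-group error rates (as percentages) from request records.

    Each record is assumed to be a dict with a grouping field
    (e.g. 'endpoint' or 'segment') and an integer 'status' code;
    5xx responses are counted as errors.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for r in requests:
        totals[r[key]] += 1
        if r["status"] >= 500:
            errors[r[key]] += 1
    return {k: errors[k] / totals[k] * 100 for k in totals}

requests = [
    {"endpoint": "/checkout", "status": 500},
    {"endpoint": "/checkout", "status": 200},
    {"endpoint": "/search", "status": 200},
]
# error_rates_by_key(requests, "endpoint")
# -> {"/checkout": 50.0, "/search": 0.0}
```

Running the same function with `key="segment"` would surface a failure confined to one user cohort that a platform-wide rate hides.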

Formula
Error Responses ÷ Total Requests × 100
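The formula can be expressed directly in code; the zero-traffic guard is an implementation assumption, not part of the formula itself:

```python
def error_rate(error_responses: int, total_requests: int) -> float:
    """Error responses as a percentage of total requests."""
    if total_requests == 0:
        return 0.0  # no traffic: report 0 rather than divide by zero
    return error_responses / total_requests * 100

# e.g. 42 errors across 10,000 requests -> 0.42 (percent)
```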
Where It Lives
  • Datadog: Real-time error rate monitoring with alerting
  • Sentry: Application error tracking with stack traces and context
  • New Relic: Error rate dashboards by service and endpoint
  • Elastic APM: Distributed error tracking across microservices
What Drives It
  • Code bugs introduced by recent deployments
  • Database connectivity or query failures
  • Third-party API dependency failures
  • Resource exhaustion (memory, connection pool) under load
  • Configuration changes affecting application behavior
Causal Analysis: Correlating error rate spikes with deployment timestamps, infrastructure changes, or traffic pattern changes in observability tools provides direct causal attribution for incident investigation.
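A simple form of this deploy-to-spike correlation can be sketched as follows (the 15-minute lookback window and epoch-second timestamps are illustrative assumptions):

```python
def nearest_deploy(spike_ts, deploy_timestamps, window=900):
    """Return the most recent deployment within `window` seconds
    before an error-rate spike, or None if no deploy qualifies.

    Timestamps are Unix epoch seconds; a match suggests (but does
    not prove) the deploy as the likely cause for investigation.
    """
    candidates = [d for d in deploy_timestamps if 0 <= spike_ts - d <= window]
    return max(candidates) if candidates else None
```

Observability platforms apply the same idea across infrastructure changes and traffic shifts, not just deployments.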
Benchmark

Production error rates below 0.1% are generally considered excellent; above 1% typically warrants immediate investigation; SLO targets are typically set at 0.1%–0.5% depending on service criticality.

Common Mistake
Alerting only on absolute error counts rather than error rate, which leads to alert fatigue during high-traffic periods and misses proportionally significant errors during low-traffic periods.
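A rate-based alert condition avoids both failure modes; the threshold and minimum-request guard below are illustrative values, not recommended defaults:

```python
def should_alert(errors: int, total: int,
                 rate_threshold_pct: float = 1.0,
                 min_requests: int = 100) -> bool:
    """Alert on error *rate*, not absolute counts.

    The minimum-request guard suppresses alerts when traffic is too
    low for the rate to be statistically meaningful.
    """
    if total < min_requests:
        return False
    return errors / total * 100 >= rate_threshold_pct
```

Note that 50 errors out of 1,000 requests (5%) fires, while the same 50 errors out of 100,000 requests (0.05%) does not.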

How Different Roles Think About This Metric

Each function reads Error Rate through a different lens and takes different actions when it changes.

VP Engineering
VP Engineering monitors error rate as a primary production health metric and uses it to triage incident severity and prioritize response.
Director Engineering
Directors own the alerting thresholds and escalation paths for error rate breaches and ensure on-call teams have the runbooks to diagnose error spikes quickly.
CTO
The CTO reviews error rate trends as part of platform reliability reporting and uses persistent error types to prioritize architectural investment.

Common Questions About Error Rate


What is the difference between 4xx and 5xx errors in error rate tracking?
HTTP 4xx errors are client errors (invalid requests, authentication failures, not-found responses) that are often expected behavior from the application's perspective. HTTP 5xx errors are server errors indicating the application failed to process a valid request. Error rate tracking for reliability purposes typically focuses on 5xx errors, which represent genuine application failures. 4xx rates can also be tracked separately as a security and UX signal.
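The status-code distinction above can be sketched as a small classifier (the label names are illustrative):

```python
def classify_status(code: int) -> str:
    """Classify an HTTP status code for error-rate accounting."""
    if 500 <= code <= 599:
        return "server_error"   # counts toward the reliability error rate
    if 400 <= code <= 499:
        return "client_error"   # tracked separately as a security/UX signal
    return "success"
```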
How do I set meaningful error rate SLOs?
Start by measuring your current error rate baseline over 30 days to understand normal variance. Set your SLO at a level that represents meaningful degradation from your normal state, typically the 95th percentile of your baseline error rate plus a buffer. For example, if normal error rate is 0.05%, an SLO of 0.5% allows for 10× normal before breaching. Tighten the SLO as you improve reliability.
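The baseline-plus-buffer approach can be sketched as below; the 10× multiplier mirrors the example in the answer, and the percentile math is a simple nearest-rank assumption rather than any tool's exact method:

```python
import math

def suggest_slo(daily_error_rates, buffer_multiplier=10.0):
    """Suggest an SLO threshold from ~30 days of daily error rates.

    Takes the 95th-percentile (nearest-rank) of the baseline and
    multiplies by a buffer so normal variance does not breach.
    """
    ordered = sorted(daily_error_rates)
    idx = min(len(ordered) - 1, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[idx] * buffer_multiplier
```

With a flat 0.05% baseline over 30 days, this suggests the 0.5% SLO from the example.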
What should trigger an immediate incident vs. a ticket?
An immediate incident response should be triggered when error rate exceeds your SLO threshold and is confirmed to be affecting real users rather than a monitoring anomaly. A ticket is appropriate for error rates that are elevated but within SLO, indicating a quality issue that needs investigation but not immediate all-hands response. Use PagerDuty or similar tools to define severity levels with different escalation paths based on error rate thresholds and impacted services.
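The triage rule above reduces to a small decision function; the response labels are illustrative, not a specific tool's severity scheme:

```python
def triage(error_rate_pct: float, slo_pct: float,
           user_impact_confirmed: bool) -> str:
    """Map an elevated error rate to a response level per the rule
    described in the text."""
    if error_rate_pct > slo_pct and user_impact_confirmed:
        return "incident"      # page on-call immediately
    if error_rate_pct > slo_pct:
        return "investigate"   # confirm real user impact first
    return "ticket"            # elevated but within SLO
```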
How do microservices architectures affect error rate tracking?
Microservices create cascading failure risks: an error in a downstream service can cause errors to propagate upstream, making root cause identification harder. Track error rates at both the service-mesh level (between services) and the external API boundary (user-facing). Distributed tracing is essential for following an error through multiple service hops to its origin. Circuit breakers can prevent cascading failures from amplifying error rates across the system.
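As a minimal sketch of the circuit-breaker idea, the class below opens after a run of consecutive failures and half-opens after a cooldown; the thresholds are illustrative assumptions, not any library's defaults:

```python
import time

class CircuitBreaker:
    """Minimal illustrative circuit breaker."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        """Whether a call to the downstream service should proceed."""
        if self.opened_at is None:
            return True  # circuit closed: normal operation
        # half-open: permit a trial request after the cooldown elapses
        return time.monotonic() - self.opened_at >= self.reset_after

    def record(self, success: bool) -> None:
        """Record a call outcome, opening the circuit on repeated failure."""
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
```

By refusing calls while open, the breaker stops a failing downstream dependency from inflating error rates in every upstream service.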

