/// Engineering & Reliability

Error Rate

Error Rate measures the percentage of API requests, user sessions, or transactions that result in an error (typically HTTP 5xx server errors or application-level exceptions). It is a real-time health signal for production systems and is used as an SLO indicator alongside latency and uptime. Sudden spikes in error rate are often the first observable signal of a production incident.

Error rate should be tracked at multiple levels: total platform error rate, per-endpoint error rate, and per-user-segment error rate. These breakdowns catch localized failures that would otherwise blend into an acceptable aggregate number.
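As a minimal sketch of this multi-level tracking, the helper below groups request records by an arbitrary field (the record shape and field names are illustrative assumptions, not a specific tool's schema):

```python
from collections import defaultdict

def error_rates_by_key(requests, key):
    """Compute per-group error rates (as percentages) from request records.

    Each record is assumed to be a dict with a grouping field
    (e.g. 'endpoint' or 'segment') and an integer 'status' code;
    5xx responses are counted as errors.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for r in requests:
        totals[r[key]] += 1
        if r["status"] >= 500:
            errors[r[key]] += 1
    return {k: errors[k] / totals[k] * 100 for k in totals}

requests = [
    {"endpoint": "/checkout", "status": 500},
    {"endpoint": "/checkout", "status": 200},
    {"endpoint": "/search", "status": 200},
]
# error_rates_by_key(requests, "endpoint")
# -> {"/checkout": 50.0, "/search": 0.0}
```

Running the same function with `key="segment"` would surface a failure confined to one user cohort that a platform-wide rate hides.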

Formula
Error Responses ÷ Total Requests × 100
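The formula can be expressed directly in code; the zero-traffic guard is an implementation assumption, not part of the formula itself:

```python
def error_rate(error_responses: int, total_requests: int) -> float:
    """Error responses as a percentage of total requests."""
    if total_requests == 0:
        return 0.0  # no traffic: report 0 rather than divide by zero
    return error_responses / total_requests * 100

# e.g. 42 errors across 10,000 requests -> 0.42 (percent)
```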
Where It Lives
  • Datadog: Real-time error rate monitoring with alerting
  • Sentry: Application error tracking with stack traces and context
  • New Relic: Error rate dashboards by service and endpoint
  • Elastic APM: Distributed error tracking across microservices
What Drives It
  • Code bugs introduced by recent deployments
  • Database connectivity or query failures
  • Third-party API dependency failures
  • Resource exhaustion (memory, connection pool) under load
  • Configuration changes affecting application behavior
Causal Analysis: Correlating error rate spikes with deployment timestamps, infrastructure changes, or traffic pattern changes in observability tools provides direct causal attribution for incident investigation.
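A simple form of this deploy-to-spike correlation can be sketched as follows (the 15-minute lookback window and epoch-second timestamps are illustrative assumptions):

```python
def nearest_deploy(spike_ts, deploy_timestamps, window=900):
    """Return the most recent deployment within `window` seconds
    before an error-rate spike, or None if no deploy qualifies.

    Timestamps are Unix epoch seconds; a match suggests (but does
    not prove) the deploy as the likely cause for investigation.
    """
    candidates = [d for d in deploy_timestamps if 0 <= spike_ts - d <= window]
    return max(candidates) if candidates else None
```

Observability platforms apply the same idea across infrastructure changes and traffic shifts, not just deployments.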
Benchmark

Production error rates below 0.1% are generally considered excellent; above 1% typically warrants immediate investigation; SLO targets are typically set at 0.1%–0.5% depending on service criticality.

Common Mistake
Alerting only on absolute error counts rather than error rate, which leads to alert fatigue during high-traffic periods and misses proportionally significant errors during low-traffic periods.
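A rate-based alert condition avoids both failure modes; the threshold and minimum-request guard below are illustrative values, not recommended defaults:

```python
def should_alert(errors: int, total: int,
                 rate_threshold_pct: float = 1.0,
                 min_requests: int = 100) -> bool:
    """Alert on error *rate*, not absolute counts.

    The minimum-request guard suppresses alerts when traffic is too
    low for the rate to be statistically meaningful.
    """
    if total < min_requests:
        return False
    return errors / total * 100 >= rate_threshold_pct
```

Note that 50 errors out of 1,000 requests (5%) fires, while the same 50 errors out of 100,000 requests (0.05%) does not.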

How Different Roles Think About This Metric

Each function reads Error Rate through a different lens and takes different actions when it changes.

VP Engineering
VP Engineering monitors error rate as a primary production health metric and uses it to triage incident severity and prioritize response.
Director Engineering
Directors own the alerting thresholds and escalation paths for error rate breaches and ensure on-call teams have the runbooks to diagnose error spikes quickly.
CTO
The CTO reviews error rate trends as part of platform reliability reporting and uses persistent error types to prioritize architectural investment.

Common Questions About Error Rate


What is the difference between 4xx and 5xx errors in error rate tracking?
HTTP 4xx errors are client errors (invalid requests, authentication failures, not-found responses) that are often expected behavior from the application's perspective. HTTP 5xx errors are server errors indicating the application failed to process a valid request. Error rate tracking for reliability purposes typically focuses on 5xx errors, which represent genuine application failures. 4xx rates can also be tracked separately as a security and UX signal.
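The status-code distinction above can be sketched as a small classifier (the label names are illustrative):

```python
def classify_status(code: int) -> str:
    """Classify an HTTP status code for error-rate accounting."""
    if 500 <= code <= 599:
        return "server_error"   # counts toward the reliability error rate
    if 400 <= code <= 499:
        return "client_error"   # tracked separately as a security/UX signal
    return "success"
```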
How do I set meaningful error rate SLOs?
Start by measuring your current error rate baseline over 30 days to understand normal variance. Set your SLO at a level that represents meaningful degradation from your normal state, typically the 95th percentile of your baseline error rate plus a buffer. For example, if normal error rate is 0.05%, an SLO of 0.5% allows for 10× normal before breaching. Tighten the SLO as you improve reliability.
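The baseline-plus-buffer approach can be sketched as below; the 10× multiplier mirrors the example in the answer, and the percentile math is a simple nearest-rank assumption rather than any tool's exact method:

```python
import math

def suggest_slo(daily_error_rates, buffer_multiplier=10.0):
    """Suggest an SLO threshold from ~30 days of daily error rates.

    Takes the 95th-percentile (nearest-rank) of the baseline and
    multiplies by a buffer so normal variance does not breach.
    """
    ordered = sorted(daily_error_rates)
    idx = min(len(ordered) - 1, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[idx] * buffer_multiplier
```

With a flat 0.05% baseline over 30 days, this suggests the 0.5% SLO from the example.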
What should trigger an immediate incident vs. a ticket?
An immediate incident response should be triggered when error rate exceeds your SLO threshold and is confirmed to be affecting real users rather than a monitoring anomaly. A ticket is appropriate for error rates that are elevated but within SLO, indicating a quality issue that needs investigation but not immediate all-hands response. Use PagerDuty or similar tools to define severity levels with different escalation paths based on error rate thresholds and impacted services.
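The triage rule above reduces to a small decision function; the response labels are illustrative, not a specific tool's severity scheme:

```python
def triage(error_rate_pct: float, slo_pct: float,
           user_impact_confirmed: bool) -> str:
    """Map an elevated error rate to a response level per the rule
    described in the text."""
    if error_rate_pct > slo_pct and user_impact_confirmed:
        return "incident"      # page on-call immediately
    if error_rate_pct > slo_pct:
        return "investigate"   # confirm real user impact first
    return "ticket"            # elevated but within SLO
```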
How do microservices architectures affect error rate tracking?
Microservices create cascading failure risks: an error in a downstream service can cause errors to propagate upstream, making root cause identification harder. Track error rates at both the service-mesh level (between services) and the external API boundary (user-facing). Distributed tracing is essential for following an error through multiple service hops to its origin. Circuit breakers can prevent cascading failures from amplifying error rates across the system.
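As a minimal sketch of the circuit-breaker idea, the class below opens after a run of consecutive failures and half-opens after a cooldown; the thresholds are illustrative assumptions, not any library's defaults:

```python
import time

class CircuitBreaker:
    """Minimal illustrative circuit breaker."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        """Whether a call to the downstream service should proceed."""
        if self.opened_at is None:
            return True  # circuit closed: normal operation
        # half-open: permit a trial request after the cooldown elapses
        return time.monotonic() - self.opened_at >= self.reset_after

    def record(self, success: bool) -> None:
        """Record a call outcome, opening the circuit on repeated failure."""
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
```

By refusing calls while open, the breaker stops a failing downstream dependency from inflating error rates in every upstream service.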

