/// Engineering & Reliability

API Latency / Response Time P95

API Latency measures the time elapsed between a client sending a request to an API and receiving a complete response. It is most meaningfully expressed as a percentile distribution (P50, P95, P99) rather than an average, because averages obscure the experience of users who encounter the slowest responses. P95 latency (the response time below which 95% of requests complete) is the most commonly tracked production reliability target.
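As a concrete sketch, the snippet below computes P50, P95, and P99 from a batch of raw per-request timings; the sample values are invented for illustration.

```python
# Minimal sketch: computing latency percentiles from raw request timings.
# The sample latencies are invented for illustration.
import numpy as np

latencies_ms = np.array([42, 51, 48, 95, 47, 1250, 53, 49, 310, 46])

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"P50: {p50:.0f} ms, P95: {p95:.0f} ms, P99: {p99:.0f} ms")
print(f"Mean: {latencies_ms.mean():.0f} ms")  # the mean is dragged up by two outliers
```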

Tail latency (P99 and above) is especially important for user-facing APIs; the users experiencing the slowest 1% of responses are often the most engaged and highest-value users, and degraded performance for them disproportionately affects business outcomes.

Formula
Time from Request Sent to Complete Response Received, measured at the Pth percentile (note: the server-side variant, request received to response sent, excludes network transit time)
Where It Lives
  • Datadog: API latency distribution, APM traces, and P95/P99 dashboards
  • New Relic: Transaction response time percentile tracking
  • AWS CloudWatch: API Gateway latency metrics and alarms
  • Grafana / Prometheus: Custom latency histograms and SLO tracking (see the instrumentation sketch below)
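As a sketch of the Prometheus approach above, the snippet below records request durations into a histogram using the prometheus_client library; the metric name and bucket boundaries are illustrative choices, not a standard.

```python
# Sketch: instrumenting request latency with prometheus_client.
from prometheus_client import Histogram

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Time spent handling a request",
    buckets=(0.05, 0.1, 0.3, 0.5, 1.0, 3.0, 10.0),  # align buckets with SLO thresholds
)

def handle_request():
    with REQUEST_LATENCY.time():  # observes elapsed time into the histogram
        ...  # application logic goes here

# In Grafana, P95 is then derived from the histogram with PromQL, e.g.:
# histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```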
What Drives It
  • Database query performance and indexing
  • External API and third-party service dependencies
  • Application code inefficiency and N+1 query patterns (illustrated in the sketch after this list)
  • Network latency and geographic distribution
  • Resource contention under high concurrency
Causal Analysis: Distributed tracing (Jaeger, Datadog APM) causally attributes latency to specific service calls and code paths, enabling targeted optimization rather than guesswork.
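To illustrate the N+1 driver from the list above, here is a minimal sketch; `db` stands in for a hypothetical query helper, not any specific library.

```python
# N+1 pattern: one query for the orders, then one query per order.
def get_orders_n_plus_one(db):
    orders = db.query("SELECT * FROM orders LIMIT 100")
    for order in orders:
        order["customer"] = db.query_one(
            "SELECT * FROM customers WHERE id = %s", (order["customer_id"],)
        )
    return orders  # 101 database round trips

# Batched alternative: two queries total, joined in application code.
def get_orders_batched(db):
    orders = db.query("SELECT * FROM orders LIMIT 100")
    ids = [o["customer_id"] for o in orders]
    customers = {
        c["id"]: c
        for c in db.query("SELECT * FROM customers WHERE id = ANY(%s)", (ids,))
    }
    for order in orders:
        order["customer"] = customers[order["customer_id"]]
    return orders  # 2 database round trips
```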
Benchmark

User-facing APIs typically target P95 below 300ms and P99 below 1,000ms; a P95 above 3,000ms is generally considered unacceptable for interactive applications.

Common Mistake
Monitoring only average response time and missing P95/P99 tail latency spikes that affect a significant minority of users and often indicate emerging infrastructure problems.

How Different Roles Think About This Metric

Each function reads P95 through a different lens and takes different actions when it changes.

CTO
The CTO sets latency SLOs that balance infrastructure investment against user experience requirements and monitors tail latency trends as a product quality signal.
VP Engineering
VP Engineering monitors P95 latency across all critical API endpoints and escalates when SLOs are breached to trigger optimization sprints.
Director Engineering
Directors own the performance engineering roadmap and use distributed tracing data to prioritize the highest-impact latency optimization work.

Common Questions About API Latency / Response Time

Why measure P95 instead of average latency?
Average latency is heavily skewed by the many fast requests in the distribution and masks the experience of users hitting slower responses. P95 latency means 95% of requests complete faster than this value, so it directly captures the experience of the slowest 5% of requests. P99 is even more conservative. Systems that look fine on average often have severe tail latency issues that degrade the experience for a significant minority of users.
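A small synthetic demonstration of this effect (all values invented): a distribution where 10% of requests hit a slow path can show a reassuring average while P95 tells the real story.

```python
# Sketch: a mostly-fast distribution whose average looks healthy
# while the tail does not. Values are synthetic.
import numpy as np

rng = np.random.default_rng(0)
fast = rng.normal(80, 10, size=9_000)      # 90% of requests around 80 ms
slow = rng.normal(2_000, 300, size=1_000)  # 10% hit a slow path around 2 s
latencies_ms = np.concatenate([fast, slow])

print(f"Mean: {latencies_ms.mean():.0f} ms")              # ~270 ms, looks acceptable
print(f"P95:  {np.percentile(latencies_ms, 95):.0f} ms")  # ~2,000 ms, clearly not
```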
What causes high tail latency (P99)?
Common causes of P99 latency spikes include: garbage collection pauses in JVM or Go applications, database lock contention, cache miss storms when cache is cold or invalidated, network packet loss causing TCP retransmits, and resource exhaustion under high concurrency. Distributed tracing is the most effective tool for isolating where in the request path the latency is occurring.
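As a sketch of that isolation step, the snippet below wraps each stage of a hypothetical request path in OpenTelemetry spans (tracer provider and exporter setup omitted); the stage names and stub functions are illustrative.

```python
# Sketch: attributing latency to request-path stages with OpenTelemetry spans.
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")

# Hypothetical stand-ins for real dependencies.
def load_cart(order): return {}
def get_prices(cart): return {}
def charge(cart, prices): pass

def handle_checkout(order):
    with tracer.start_as_current_span("handle_checkout"):
        with tracer.start_as_current_span("db.load_cart"):
            cart = load_cart(order)      # database lock contention surfaces in this span
        with tracer.start_as_current_span("cache.get_prices"):
            prices = get_prices(cart)    # cache miss storms surface here
        with tracer.start_as_current_span("external.payment"):
            charge(cart, prices)         # third-party latency surfaces here
```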
How do I set API latency SLOs?
Start from the user experience requirement: what response time feels fast enough for your users given the nature of the interaction? Interactive UI actions should generally be under 200ms and complex queries under 1,000ms. Then measure your current P95/P99 baseline, set initial SLOs just within that performance, and tighten them as you improve. Tie SLOs to error budgets so teams know when latency regressions are consuming budget and when to prioritize reliability work over feature work.
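A minimal sketch of the error budget arithmetic, with assumed volume and SLO numbers:

```python
# Sketch: translating a latency SLO into an error budget.
# Assumed SLO: 95% of requests faster than 300 ms over a 30-day window.
slo_target = 0.95
requests_in_window = 10_000_000   # assumed monthly request volume
slow_requests_observed = 620_000  # requests slower than 300 ms so far

budget = (1 - slo_target) * requests_in_window  # 500,000 slow requests allowed
consumed = slow_requests_observed / budget
print(f"Error budget consumed: {consumed:.0%}")  # 124%: SLO breached, halt risky work
```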
What is the difference between latency and throughput?
Latency is the time to complete one request; throughput is the number of requests the system can handle per unit of time. They are related but distinct. A system can have low latency at low throughput but degrade as concurrency increases. Load testing at production-scale concurrency reveals the latency-throughput trade-off curve and identifies the concurrency level at which tail latency begins to degrade unacceptably.
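As a rough sketch of probing that curve, the snippet below measures P95 at increasing concurrency levels against a placeholder endpoint; a production load test would use a dedicated tool such as k6 or Locust.

```python
# Sketch: measuring P95 latency at increasing concurrency levels.
# TARGET_URL is a placeholder, not a real endpoint.
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import requests

TARGET_URL = "https://api.example.com/health"  # placeholder

def timed_request(_):
    start = time.perf_counter()
    requests.get(TARGET_URL, timeout=10)
    return (time.perf_counter() - start) * 1000  # latency in ms

for concurrency in (1, 10, 50, 100):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_request, range(concurrency * 20)))
    print(f"concurrency={concurrency:>3}  P95={np.percentile(latencies, 95):.0f} ms")
```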

Related Metrics

Metrics that are commonly analyzed alongside P95.

Role Guides That Include This Metric

See how each role uses P95 in context with the full set of metrics they own.

/// get started

See What’s Actually Moving Your P95

askotter connects your data sources and applies causal analysis to tell you exactly why your metrics are changing, not just that they changed.

Book a Conversation →