/// Engineering & Reliability

API Latency / Response Time P95

API Latency measures the time elapsed between a client sending a request to an API and receiving a complete response. It is most meaningfully expressed as a percentile distribution (P50, P95, P99) rather than an average, because averages obscure the experience of users who encounter the slowest responses. P95 latency (the response time below which 95% of requests complete) is the most commonly tracked production reliability target.
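As a concrete sketch, the snippet below computes P50, P95, and P99 from a batch of raw per-request timings; the sample values are invented for illustration.

```python
# Minimal sketch: computing latency percentiles from raw request timings.
# The sample latencies are invented for illustration.
import numpy as np

latencies_ms = np.array([42, 51, 48, 95, 47, 1250, 53, 49, 310, 46])

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"P50: {p50:.0f} ms, P95: {p95:.0f} ms, P99: {p99:.0f} ms")
print(f"Mean: {latencies_ms.mean():.0f} ms")  # the mean is dragged up by two outliers
```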

Tail latency (P99 and above) is especially important for user-facing APIs; the users experiencing the slowest 1% of responses are often the most engaged and highest-value users, and degraded performance for them disproportionately affects business outcomes.

Formula
Time from Request Sent to Complete Response Received, measured at the Pth percentile (note: the server-side variant, request received to response sent, excludes network transit time)
Where It Lives
  • Datadog: API latency distribution, APM traces, and P95/P99 dashboards
  • New Relic: Transaction response time percentile tracking
  • AWS CloudWatch: API Gateway latency metrics and alarms
  • Grafana / Prometheus: Custom latency histograms and SLO tracking (see the instrumentation sketch below)
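As a sketch of the Prometheus approach above, the snippet below records request durations into a histogram using the prometheus_client library; the metric name and bucket boundaries are illustrative choices, not a standard.

```python
# Sketch: instrumenting request latency with prometheus_client.
from prometheus_client import Histogram

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Time spent handling a request",
    buckets=(0.05, 0.1, 0.3, 0.5, 1.0, 3.0, 10.0),  # align buckets with SLO thresholds
)

def handle_request():
    with REQUEST_LATENCY.time():  # observes elapsed time into the histogram
        ...  # application logic goes here

# In Grafana, P95 is then derived from the histogram with PromQL, e.g.:
# histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```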
What Drives It
  • Database query performance and indexing
  • External API and third-party service dependencies
  • Application code inefficiency and N+1 query patterns (illustrated in the sketch after this list)
  • Network latency and geographic distribution
  • Resource contention under high concurrency
Causal Analysis: Distributed tracing (Jaeger, Datadog APM) causally attributes latency to specific service calls and code paths, enabling targeted optimization rather than guesswork.
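To illustrate the N+1 driver from the list above, here is a minimal sketch; `db` stands in for a hypothetical query helper, not any specific library.

```python
# N+1 pattern: one query for the orders, then one query per order.
def get_orders_n_plus_one(db):
    orders = db.query("SELECT * FROM orders LIMIT 100")
    for order in orders:
        order["customer"] = db.query_one(
            "SELECT * FROM customers WHERE id = %s", (order["customer_id"],)
        )
    return orders  # 101 database round trips

# Batched alternative: two queries total, joined in application code.
def get_orders_batched(db):
    orders = db.query("SELECT * FROM orders LIMIT 100")
    ids = [o["customer_id"] for o in orders]
    customers = {
        c["id"]: c
        for c in db.query("SELECT * FROM customers WHERE id = ANY(%s)", (ids,))
    }
    for order in orders:
        order["customer"] = customers[order["customer_id"]]
    return orders  # 2 database round trips
```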
Benchmark

User-facing APIs typically target P95 below 300ms and P99 below 1,000ms; a P95 above 3,000ms is generally considered unacceptable for interactive applications.

Common Mistake
Monitoring only average response time and missing P95/P99 tail latency spikes that affect a significant minority of users and often indicate emerging infrastructure problems.

How Different Roles Think About This Metric

Each function reads P95 through a different lens and takes different actions when it changes.

CTO
The CTO sets latency SLOs that balance infrastructure investment against user experience requirements and monitors tail latency trends as a product quality signal.
VP Engineering
VP Engineering monitors P95 latency across all critical API endpoints and escalates when SLOs are breached to trigger optimization sprints.
Director Engineering
Directors own the performance engineering roadmap and use distributed tracing data to prioritize the highest-impact latency optimization work.

Common Questions About API Latency / Response Time

Why measure P95 instead of average latency?
Average latency is heavily skewed by the many fast requests in the distribution and masks the experience of users hitting slower responses. P95 latency means 95% of requests complete faster than this value, so it directly captures the experience of the slowest 5% of requests. P99 is even more conservative. Systems that look fine on average often have severe tail latency issues that degrade the experience for a significant minority of users.
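A small synthetic demonstration of this effect (all values invented): a distribution where 10% of requests hit a slow path can show a reassuring average while P95 tells the real story.

```python
# Sketch: a mostly-fast distribution whose average looks healthy
# while the tail does not. Values are synthetic.
import numpy as np

rng = np.random.default_rng(0)
fast = rng.normal(80, 10, size=9_000)      # 90% of requests around 80 ms
slow = rng.normal(2_000, 300, size=1_000)  # 10% hit a slow path around 2 s
latencies_ms = np.concatenate([fast, slow])

print(f"Mean: {latencies_ms.mean():.0f} ms")              # ~270 ms, looks acceptable
print(f"P95:  {np.percentile(latencies_ms, 95):.0f} ms")  # ~2,000 ms, clearly not
```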
What causes high tail latency (P99)?
Common causes of P99 latency spikes include: garbage collection pauses in JVM or Go applications, database lock contention, cache miss storms when cache is cold or invalidated, network packet loss causing TCP retransmits, and resource exhaustion under high concurrency. Distributed tracing is the most effective tool for isolating where in the request path the latency is occurring.
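As a sketch of that isolation step, the snippet below wraps each stage of a hypothetical request path in OpenTelemetry spans (tracer provider and exporter setup omitted); the stage names and stub functions are illustrative.

```python
# Sketch: attributing latency to request-path stages with OpenTelemetry spans.
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")

# Hypothetical stand-ins for real dependencies.
def load_cart(order): return {}
def get_prices(cart): return {}
def charge(cart, prices): pass

def handle_checkout(order):
    with tracer.start_as_current_span("handle_checkout"):
        with tracer.start_as_current_span("db.load_cart"):
            cart = load_cart(order)      # database lock contention surfaces in this span
        with tracer.start_as_current_span("cache.get_prices"):
            prices = get_prices(cart)    # cache miss storms surface here
        with tracer.start_as_current_span("external.payment"):
            charge(cart, prices)         # third-party latency surfaces here
```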
How do I set API latency SLOs?
Start from the user experience requirement: what response time feels fast enough for your users given the nature of the interaction? Interactive UI actions should generally be under 200ms and complex queries under 1,000ms. Then measure your current P95/P99 baseline, set initial SLOs just within that performance, and tighten them as you improve. Tie SLOs to error budgets so teams know when latency regressions are consuming budget and when to prioritize reliability work over feature work.
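A minimal sketch of the error budget arithmetic, with assumed volume and SLO numbers:

```python
# Sketch: translating a latency SLO into an error budget.
# Assumed SLO: 95% of requests faster than 300 ms over a 30-day window.
slo_target = 0.95
requests_in_window = 10_000_000   # assumed monthly request volume
slow_requests_observed = 620_000  # requests slower than 300 ms so far

budget = (1 - slo_target) * requests_in_window  # 500,000 slow requests allowed
consumed = slow_requests_observed / budget
print(f"Error budget consumed: {consumed:.0%}")  # 124%: SLO breached, halt risky work
```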
What is the difference between latency and throughput?
Latency is the time to complete one request; throughput is the number of requests the system can handle per unit of time. They are related but distinct. A system can have low latency at low throughput but degrade as concurrency increases. Load testing at production-scale concurrency reveals the latency-throughput trade-off curve and identifies the concurrency level at which tail latency begins to degrade unacceptably.
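As a rough sketch of probing that curve, the snippet below measures P95 at increasing concurrency levels against a placeholder endpoint; a production load test would use a dedicated tool such as k6 or Locust.

```python
# Sketch: measuring P95 latency at increasing concurrency levels.
# TARGET_URL is a placeholder, not a real endpoint.
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import requests

TARGET_URL = "https://api.example.com/health"  # placeholder

def timed_request(_):
    start = time.perf_counter()
    requests.get(TARGET_URL, timeout=10)
    return (time.perf_counter() - start) * 1000  # latency in ms

for concurrency in (1, 10, 50, 100):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_request, range(concurrency * 20)))
    print(f"concurrency={concurrency:>3}  P95={np.percentile(latencies, 95):.0f} ms")
```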

Related Metrics

Metrics that are commonly analyzed alongside P95.

Role Guides That Include This Metric

See how each role uses P95 in context with the full set of metrics they own.

/// get started

See What’s Actually Moving Your P95

askotter connects your data sources and applies causal analysis to tell you exactly why your metrics are changing, not just that they changed.

Book a Conversation →