03 Deep dive

Tail-latency amplification in microservice chains

We measure end-to-end p99 latency of a 19-service microservice application (DeathStarBench hotel-reservation) under co-located best-effort CPU load, comparing stock CFS against Temper's kernel-enforced tiers on the same nodes. Under CFS, end-to-end p99 grew 9.4× from the bottom to the top of the background ladder; in fully attached Temper windows the tail held at roughly 1.5×. A replication on a fresh cluster and a right-sized node reproduces the effect and identifies a node-sizing boundary condition.

workload: DeathStarBench hotel-reservation (19 services)
clusters: EKS (m5) · GKE (c2, e2)
records: 3 reports + forensics
method: benchmark methodology

Background: why chains amplify tails

A monolith's p99 is one service's p99. A call graph is different: if a request touches n services and each hop independently has probability p of hitting a slow scheduling event, the end-to-end request hits at least one with probability 1 − (1 − p)ⁿ. For small p this is approximately n·p, so the per-hop tail probability is effectively multiplied by the number of hops. On a co-located node, the dominant slow event is a runnable thread waiting behind lower-priority work, and CFS distributes that wait across every service in the chain. The measurement below is the end-to-end result observed at the load generator, not a hop-by-hop decomposition; the raw records preserve the distinction.
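The amplification formula can be evaluated directly. The following is a minimal sketch, assuming independent hops and a request that touches all 19 services; the per-hop probabilities are illustrative values, not measurements from the records, and the function name is hypothetical.

```python
def chain_tail_probability(p: float, n: int) -> float:
    """Probability that at least one of n independent hops hits a slow event."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    hops = 19  # assumption: every request touches all 19 services
    for per_hop in (0.001, 0.01, 0.05):  # illustrative per-hop probabilities
        end_to_end = chain_tail_probability(per_hop, hops)
        print(f"p={per_hop:.3f}  n={hops}  end-to-end={end_to_end:.4f}")
```

With a per-hop probability of 1%, roughly 17% of requests would hit at least one slow hop under these assumptions, which is why a per-hop event too rare to move a single service's p99 can still dominate the end-to-end p99.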

Experimental setup

The workload is upstream DeathStarBench hotel-reservation: 19 services, including five MongoDB instances, four memcached instances, consul, and jaeger, all in the Guaranteed QoS class and pinned to one 4-vCPU EKS node (m5.xlarge, requests totaling 3550m). Load is generated open-loop by wrk2 from a second node at a fixed 200 requests/s (the application saturates near 315 requests/s), with 60 s measurement windows and three repetitions per step. Contention is a ladder of co-located BestEffort single-thread spinners (0/1/2/4/8) on the application node. The CFS arm is the same node with Temper's safe-mode annotation set.
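The measurement matrix described above can be written down as data. This is a sketch only: the arm labels, the `Window` type, and the function names are hypothetical, and warm-up or settling time between windows is not modeled because the text does not specify it.

```python
from dataclasses import dataclass
from itertools import product

LADDER = (0, 1, 2, 4, 8)   # co-located BestEffort spinners per step
REPETITIONS = 3            # repetitions per step
WINDOW_SECONDS = 60        # measurement window length
OFFERED_RPS = 200          # fixed open-loop request rate
SATURATION_RPS = 315       # approximate saturation point of the application
ARMS = ("cfs", "temper")   # hypothetical labels for the two arms


@dataclass(frozen=True)
class Window:
    arm: str
    spinners: int
    repetition: int


def build_plan() -> list[Window]:
    """Enumerate every measurement window in the A/B ladder."""
    reps = range(1, REPETITIONS + 1)
    return [Window(arm, bg, rep) for arm, bg, rep in product(ARMS, LADDER, reps)]


def offered_load_fraction() -> float:
    """Offered load as a fraction of the saturation rate."""
    return OFFERED_RPS / SATURATION_RPS


if __name__ == "__main__":
    plan = build_plan()
    print(f"{len(plan)} windows, {len(plan) * WINDOW_SECONDS} s of measurement")
    print(f"offered load is {offered_load_fraction():.0%} of saturation")
```

Holding the offered rate well below saturation keeps queueing at the application itself from masking the scheduling effect the ladder is meant to expose.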

Two adjustments to the upstream benchmark were required for measurement validity, and both generalize to published microservice benchmarks. First, the Go services observe 4 CPUs while running under sub-core CFS quotas; garbage-collection bursts exhausted the quota window and produced 100–300 ms stalls that destabilized even the zero-background baseline. The services therefore run with GOMAXPROCS=1. Second, the upstream request mix's 0.5% reserve writes grow a MongoDB collection that the search path rescans, so p99 climbs with wall-clock time; the mix was made read-only. Without these adjustments the ladder measures quota-throttling and dataset-growth artifacts rather than scheduling.

Results

[Figure: DeathStarBench, 19 services, end-to-end p99 growth across the ladder. CFS: 9.4× · Temper attached: ~1.5×]

End-to-end p99 growth from bg=0 to the top of the spinner ladder; p50 remained 1.8–6.2 ms in both arms — the effect is confined to the tail.

CFS end-to-end p99 grew from 19.6 ms (bg=0) to 183.9 ms (bg=8), a 9.4× increase; the report's prose summary states 5–8× across the mid-ladder. Fully attached Temper windows: 18.1 ms → 28.4 ms (~1.5×). Attached operating points: −75% at bg=2 (25.8 vs 104.6 ms) and −83% at bg=4 (28.4 vs 164.2 ms). Caveat inherited from the CPU-limits article: cpu.max is not enforced under sched_ext, so attached-arm wins are labeled “fence + limit-bypass” in the record.

source: docs/training-artifacts/deathstarbench/REPORT.md · EKS 1.36, 2× m5.xlarge
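The headline ratios follow from the quoted p99 values by simple arithmetic. The sketch below recomputes them; the function names are hypothetical and the inputs are only the millisecond figures quoted in the text.

```python
def growth_ratio(baseline_ms: float, loaded_ms: float) -> float:
    """Multiplicative growth of p99 from the baseline step to a loaded step."""
    return loaded_ms / baseline_ms


def relative_change_pct(treatment_ms: float, control_ms: float) -> float:
    """Signed percentage change of the treatment arm against the control arm."""
    return (treatment_ms - control_ms) / control_ms * 100.0


if __name__ == "__main__":
    print(f"CFS growth bg=0 to bg=8:  {growth_ratio(19.6, 183.9):.2f}x")
    print(f"Temper attached growth:   {growth_ratio(18.1, 28.4):.2f}x")
    print(f"bg=2 Temper vs CFS:       {relative_change_pct(25.8, 104.6):+.0f}%")
    print(f"bg=4 Temper vs CFS:       {relative_change_pct(28.4, 164.2):+.0f}%")
```

The attached-arm ratio computes to about 1.57, which the text reports as roughly 1.5×.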

A second run on a freshly provisioned cluster of the same shape (same manifests, kernel, and instance type) produced a materially better CFS baseline (bg=1 p99 ≈ 14 ms vs 54.8 ms), so absolute values are not comparable across the two runs and the records prohibit mixing rows between them; each run is a self-consistent A/B against its own baseline. In the second run, Temper's tail was flat at 17–20 ms from bg=2 upward while CFS climbed to 29.5 ms: −19% / −28% / −33% at bg=2/4/8. No watchdog ejections occurred during the full ~35-minute saturated arm of that run.

The second run also produced a negative result at exactly one point of the ladder: +156% against CFS at bg=1. The mechanism is node sizing, not the fence: the application requests 3.55 cores on a node with two physical cores, so confinement placed all 19 services on the SMT siblings of one physical core while CFS exploited the half-idle second core; at bg≥2 that core saturates and confinement wins. Re-run on a right-sized GKE node (c2-standard-8, application scaled so critical demand fits within physical cores minus one), the bg=1 point measures +14%, Temper wins from bg=2 (−39% at bg=4), and Temper's tail is flat at 8.5–9.2 ms at every density. The resulting sizing rule — critical demand must fit within physical cores minus one — is stated as first-class guidance in operations.
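The sizing rule lends itself to a pre-flight check. The following is a minimal sketch, assuming two hardware threads per physical core (which matches the m5.xlarge description above: 4 vCPUs, two physical cores); the function names and the 3.0-core demand used for the c2-standard-8 case are hypothetical, since the text does not state the scaled demand.

```python
def physical_cores(vcpus: int, threads_per_core: int = 2) -> int:
    """Physical core count, assuming SMT with the given threads per core."""
    return vcpus // threads_per_core


def critical_core_budget(vcpus: int, threads_per_core: int = 2) -> int:
    """Cores available to critical demand: physical cores minus one."""
    return physical_cores(vcpus, threads_per_core) - 1


def fits_sizing_rule(critical_demand_cores: float, vcpus: int,
                     threads_per_core: int = 2) -> bool:
    """True if critical demand fits within physical cores minus one."""
    return critical_demand_cores <= critical_core_budget(vcpus, threads_per_core)


if __name__ == "__main__":
    # m5.xlarge: 4 vCPUs, application requests 3.55 cores (from the text)
    print("m5.xlarge, 3.55 cores:", fits_sizing_rule(3.55, vcpus=4))
    # c2-standard-8: 8 vCPUs; 3.0 cores is a hypothetical scaled demand
    print("c2-standard-8, 3.0 cores:", fits_sizing_rule(3.0, vcpus=8))
```

On the undersized node the budget is a single physical core against 3.55 cores of requests, which is consistent with the bg=1 regression described above.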

Scheduler attribution and the watchdog fallback

sched_ext provides a kernel-enforced fail-safe: if any runnable task is not serviced within a 30 s watchdog timeout, the kernel ejects the BPF scheduler and the node reverts to stock CFS automatically. The node and its workloads keep running; the failure direction is loss of the fence, not an outage (the full contract is described in the failure-modes article). For measurement, this property means an attached-arm window is valid only if the scheduler was actually attached for its duration.

Windows are therefore attributed by PSI fingerprinting. Node-level CPU pressure (cpu some) is captured before and after every window, and the two schedulers produce distinct pressure signatures at the same density: attached windows run 46–65% (confined spinners accumulate as runnable-but-waiting) while CFS windows run 6–35% (the pressure is distributed into the primary workload). Any window whose fingerprint indicates CFS is asterisked in the raw tables and excluded from attached-arm claims, never silently dropped. The classification is corroborated by the data itself: one excluded bg=2 window measured 102.6 ms against a CFS median of 104.6 ms, consistent with the window having run on CFS.
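One way such a fingerprint could be computed is from the cumulative `total` field of the `some` line in `/proc/pressure/cpu`, sampled before and after each window. This is a sketch under assumptions: the text does not say whether the records use the `total` delta or the `avg*` fields, the function names are hypothetical, and the band edges are simply the ranges quoted above, with anything outside both bands left unclassified.

```python
import re

# Matches the cumulative stall time (microseconds) on the "some" line.
_SOME_TOTAL = re.compile(r"^some\s.*\btotal=(\d+)", re.MULTILINE)


def read_cpu_psi(path: str = "/proc/pressure/cpu") -> str:
    """Return the raw PSI text for CPU (Linux with PSI enabled)."""
    with open(path, encoding="ascii") as handle:
        return handle.read()


def parse_some_total_us(psi_text: str) -> int:
    """Extract the cumulative 'some' stall time in microseconds."""
    match = _SOME_TOTAL.search(psi_text)
    if match is None:
        raise ValueError("no 'some ... total=' line in PSI text")
    return int(match.group(1))


def some_pressure_pct(before: str, after: str, window_seconds: float) -> float:
    """Share of the window during which at least one task was stalled on CPU."""
    delta_us = parse_some_total_us(after) - parse_some_total_us(before)
    return 100.0 * delta_us / (window_seconds * 1_000_000)


def classify_window(pressure_pct: float) -> str:
    """Map a window's 'cpu some' percentage onto the quoted signature bands."""
    if 46.0 <= pressure_pct <= 65.0:
        return "attached"
    if 6.0 <= pressure_pct <= 35.0:
        return "cfs"
    return "unclassified"  # outside both bands: needs manual review
```

In use, `read_cpu_psi()` would be called immediately before and after each 60 s window and the two samples passed to `some_pressure_pct`. Keeping an explicit unclassified outcome avoids forcing a window into either arm when its pressure falls in the gap between the two bands.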

The shipping release includes two behaviors relevant to this fallback path. The scheduler provides a per-CPU starvation guarantee: CPU-pinned kernel threads, which cannot migrate to another CPU, are serviced on their own CPU regardless of layer accounting — the starvation class that can otherwise trip the watchdog on small, saturated nodes. The agent additionally runs a scheduler supervisor: an unexpected scheduler exit is detected, the advertised status is updated, and the scheduler is respawned within 3–5 s (measured by SIGKILL test). Zero watchdog ejections were observed across the validation runs of the current release.
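The supervisor behavior described above (detect exit, update advertised status, respawn) can be illustrated with a generic loop. This is a hypothetical sketch, not the agent's implementation: the `spawn` and `on_status` callables, the status strings, and the bounded respawn count are all assumptions made so the example terminates.

```python
import time
from typing import Callable, Protocol


class Child(Protocol):
    """Anything with a blocking wait(), e.g. subprocess.Popen."""

    def wait(self) -> int: ...


def supervise(spawn: Callable[[], Child],
              on_status: Callable[[str], None],
              max_respawns: int,
              respawn_delay_s: float = 0.0) -> int:
    """Run a child, advertise status changes, respawn on exit.

    Returns the total number of times the child was started.
    """
    starts = 0
    while True:
        child = spawn()
        starts += 1
        on_status("attached")              # advertise that the scheduler is up
        exit_code = child.wait()           # blocks until the child exits
        on_status(f"exited({exit_code})")  # advertise loss of the scheduler
        if starts > max_respawns:
            return starts                  # give up after the respawn budget
        time.sleep(respawn_delay_s)
```

With a real process, `spawn` could be `lambda: subprocess.Popen(cmd)` for some scheduler command `cmd`. Updating the advertised status before respawning matters for measurement, because any window overlapping the gap between exit and respawn would otherwise be attributed to the wrong arm.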

Summary of findings

Kernel-enforced tiers control end-to-end tail amplification on a real 19-service application without application changes — the strongest end-to-end fence datum in the evidence set.
The failure direction of a lost fence is stock CFS: performance degrades to the baseline, not below it.
Window-level scheduler attribution (PSI fingerprinting, asterisked windows) is part of the committed methodology, and the exclusions are published in the raw tables.

Limitations

cpu.max is not enforced for tasks under sched_ext, so the attached arm effectively runs limit-free while the CFS arm is quota-throttled; attached-arm wins are labeled “fence + limit-bypass” in the record (see the CPU-limits article).
The two EKS runs used separate fresh clusters with materially different CFS baselines; results are quoted per run and never mixed across runs.
On a node where critical demand exceeds physical cores minus one, confinement can lose to CFS at low background density (+156% at bg=1 on the undersized node); the right-sized replication bounds this to +14%.
The measurement is end-to-end at the load generator; no hop-by-hop latency decomposition is claimed.

Raw records

docs/training-artifacts/deathstarbench/REPORT.md
docs/training-artifacts/deathstarbench/v14-validation/REPORT.md
docs/training-artifacts/pwb-v14-validation/raw/gke-e2-repro-agent.log
docs/training-artifacts/forensics/FINDINGS.md
docs/training-artifacts/forensics/FINDINGS-2-stranded-exclusive.md
docs/training-artifacts/headroom/gke-c2-http/v10-validation/REPORT.md
docs/training-artifacts/headroom/gke-c2-http/v11-validation/REPORT.md

Committed benchmark records in the product repository; design partners get the full artifact tree. Windows that ran after a watchdog ejection are asterisked in the raw tables, never silently dropped.