03 Deep dive

Tail-latency amplification in microservice chains

We measure end-to-end p99 latency of a 19-service microservice application (DeathStarBench hotel-reservation) under co-located best-effort CPU load, comparing stock CFS against Temper's kernel-enforced tiers on the same nodes. Under CFS, end-to-end p99 grew 9.4× from the bottom to the top of the background ladder; in fully attached Temper windows the tail held at roughly 1.5×. A replication on a fresh cluster and a right-sized node reproduces the effect and identifies a node-sizing boundary condition.

Background: why chains amplify tails

A monolith's p99 is one service's p99. A call graph is different: if a request touches n services and each hop independently has probability p of hitting a slow scheduling event, the end-to-end request hits at least one with probability 1 − (1 − p)^n. For small p this is approximately n·p, so the per-hop tail probability is effectively multiplied by the number of services touched. On a co-located node, the dominant slow event is a runnable thread waiting behind lower-priority work, and CFS distributes that wait across every service in the chain. The measurement below is the end-to-end result observed at the load generator, not a hop-by-hop decomposition; the raw records preserve the distinction.
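As a rough illustration of the amplification formula above, the following sketch evaluates 1 − (1 − p)^n next to the small-p approximation n·p. The per-hop probabilities are hypothetical values chosen for illustration, not measurements, and n = 19 treats every service as being on the request path, which is an upper bound rather than a property of any particular request.

```python
def p_any_slow_hop(p: float, n: int) -> float:
    """Probability that at least one of n independent hops hits a slow event."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    n = 19  # service count of the application; assumes all are on the path
    for p in (0.001, 0.01, 0.05):  # hypothetical per-hop probabilities
        exact = p_any_slow_hop(p, n)
        # n * p is the union bound; it overestimates as p grows
        print(f"p={p:<6} exact={exact:.4f}  n*p={n * p:.4f}")
```

The independence assumption is the one stated in the text; correlated slow events on a shared node would make the exact figure smaller than this model suggests.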

Experimental setup

The workload is upstream DeathStarBench hotel-reservation: 19 services, including five MongoDB instances, four memcached instances, consul, and jaeger, all in the Guaranteed QoS class and pinned to one 4-vCPU EKS node (m5.xlarge, requests totaling 3550m). Load is generated open-loop by wrk2 from a second node at a fixed 200 requests/s (the application saturates near 315 requests/s), with 60 s measurement windows and three repetitions per step. Contention is a ladder of co-located BestEffort single-thread spinners (0/1/2/4/8) on the application node. The CFS arm is the same node with Temper's safe-mode annotation set.

Two adjustments to the upstream benchmark were required for measurement validity, and both generalize to published microservice benchmarks. First, the Go services observe 4 CPUs while running under sub-core CFS quotas; garbage-collection bursts exhausted the quota window and produced 100–300 ms stalls that destabilized even the zero-background baseline. The services therefore run with GOMAXPROCS=1. Second, the upstream request mix's 0.5% reserve writes grow a MongoDB collection that the search path rescans, so p99 climbs with wall-clock time; the mix was made read-only. Without these adjustments the ladder measures quota-throttling and dataset-growth artifacts rather than scheduling.

Results

Figure: DeathStarBench (19 services), end-to-end p99 growth from bg=0 to the top of the spinner ladder. CFS: 9.4× (19.6 → 183.9 ms); Temper attached: ~1.5× (18.1 → 28.4 ms). p50 remained 1.8–6.2 ms in both arms; the effect is confined to the tail.

CFS end-to-end p99 grew from 19.6 ms (bg=0) to 183.9 ms (bg=8), a 9.4× increase; the report's prose summary states 5–8× across the mid-ladder. Fully attached Temper windows: 18.1 ms → 28.4 ms (~1.5×). Attached operating points: −75% at bg=2 (25.8 vs 104.6 ms) and −83% at bg=4 (28.4 vs 164.2 ms). Caveat inherited from the CPU-limits article: cpu.max is not enforced under sched_ext, so attached-arm wins are labeled “fence + limit-bypass” in the record.

Source: docs/training-artifacts/deathstarbench/REPORT.md · EKS 1.36, 2× m5.xlarge
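The growth factors and percentage reductions quoted above follow directly from the reported p99 values. A minimal sketch of that arithmetic, using only the figures stated in the text (the function names are illustrative, not taken from the benchmark tooling):

```python
def growth_factor(base_ms: float, top_ms: float) -> float:
    """Ratio of the top-of-ladder p99 to the bg=0 p99."""
    return top_ms / base_ms


def relative_change_pct(temper_ms: float, cfs_ms: float) -> float:
    """Signed change of the Temper p99 relative to the CFS p99, in percent."""
    return (temper_ms - cfs_ms) / cfs_ms * 100.0


if __name__ == "__main__":
    # p99 values in milliseconds, as reported in the text
    print(f"CFS growth:    {growth_factor(19.6, 183.9):.1f}x")
    print(f"Temper growth: {growth_factor(18.1, 28.4):.2f}x")
    print(f"bg=2: {relative_change_pct(25.8, 104.6):+.0f}%")
    print(f"bg=4: {relative_change_pct(28.4, 164.2):+.0f}%")
```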

A second run on a freshly provisioned cluster of the same shape (same manifests, kernel, and instance type) produced a materially better CFS baseline (bg=1 p99 ≈ 14 ms vs 54.8 ms), so absolute values are not comparable across the two runs and the records prohibit mixing rows between them; each run is a self-consistent A/B against its own baseline. In the second run, Temper's tail was flat at 17–20 ms from bg=2 upward while CFS climbed to 29.5 ms: −19% / −28% / −33% at bg=2/4/8. No watchdog ejections occurred during the full ~35-minute saturated arm of that run.

The second run also produced a negative result at exactly one point of the ladder: +156% against CFS at bg=1. The mechanism is node sizing, not the fence: the application requests 3.55 cores on a node with two physical cores, so confinement placed all 19 services on the SMT siblings of one physical core while CFS exploited the half-idle second core; at bg≥2 that core saturates and confinement wins. Re-run on a right-sized GKE node (c2-standard-8, application scaled so critical demand fits within physical cores minus one), the bg=1 point measures +14%, Temper wins from bg=2 (−39% at bg=4), and Temper's tail is flat at 8.5–9.2 ms at every density. The resulting sizing rule — critical demand must fit within physical cores minus one — is stated as first-class guidance in operations.
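The sizing rule can be written as a simple pre-flight check. The sketch below is an illustration under stated assumptions, not part of Temper: the function names are hypothetical, two hardware threads per physical core is assumed for both node types, and the 3.0-core demand used for the larger node is a made-up value, since the text does not give the scaled request total.

```python
def physical_cores(vcpus: int, threads_per_core: int = 2) -> int:
    """Physical core count, assuming SMT with threads_per_core siblings."""
    if vcpus <= 0 or threads_per_core <= 0:
        raise ValueError("vcpus and threads_per_core must be positive")
    return vcpus // threads_per_core


def fits_sizing_rule(critical_demand_cores: float, vcpus: int,
                     threads_per_core: int = 2) -> bool:
    """True if critical demand fits within physical cores minus one."""
    budget = physical_cores(vcpus, threads_per_core) - 1
    return critical_demand_cores <= budget


if __name__ == "__main__":
    # 4-vCPU node from the text: 2 physical cores, budget of 1 core
    print(fits_sizing_rule(3.55, vcpus=4))
    # 8-vCPU node with a hypothetical scaled demand of 3.0 cores
    print(fits_sizing_rule(3.0, vcpus=8))
```

Under these assumptions the 3550m application fails the check on the 4-vCPU node, which matches the bg=1 regression described above.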

Scheduler attribution and the watchdog fallback

sched_ext provides a kernel-enforced fail-safe: if any runnable task is not serviced within a 30 s watchdog timeout, the kernel ejects the BPF scheduler and the node reverts to stock CFS automatically. The node and its workloads keep running; the failure direction is loss of the fence, not an outage (the full contract is described in the failure-modes article). For measurement, this property means an attached-arm window is valid only if the scheduler was actually attached for its duration.

Windows are therefore attributed by PSI fingerprinting. Node-level CPU pressure (cpu some) is captured before and after every window, and the two schedulers produce distinct pressure signatures at the same density: attached windows run 46–65% (confined spinners accumulate as runnable-but-waiting) while CFS windows run 6–35% (the pressure is distributed into the primary workload). Any window whose fingerprint indicates CFS is asterisked in the raw tables and excluded from attached-arm claims, never silently dropped. The classification is corroborated by the data itself: one excluded bg=2 window measured 102.6 ms against a CFS median of 104.6 ms, consistent with the window having run on CFS.
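A minimal sketch of how such a fingerprint could be computed from two snapshots of /proc/pressure/cpu. The text does not say which PSI field the records use; this example assumes the delta of the "some" line's total counter (microseconds of stall time) divided by the window length. The function names are hypothetical, and the band edges are the ranges quoted above, with anything outside them left unclassified.

```python
import re

# Matches the cumulative stall counter on the "some" line of a PSI file
_SOME_TOTAL = re.compile(r"^some\s.*\btotal=(\d+)", re.MULTILINE)


def some_total_us(psi_text: str) -> int:
    """Extract the cumulative 'some' stall time in microseconds."""
    match = _SOME_TOTAL.search(psi_text)
    if match is None:
        raise ValueError("no 'some ... total=' line found")
    return int(match.group(1))


def some_pct(before: str, after: str, window_s: float) -> float:
    """Share of the window with at least one task stalled on CPU, in percent."""
    delta_us = some_total_us(after) - some_total_us(before)
    return delta_us / (window_s * 1_000_000) * 100.0


def classify(pct: float) -> str:
    """Label a window using the pressure bands quoted in the text."""
    if 46.0 <= pct <= 65.0:
        return "attached"
    if 6.0 <= pct <= 35.0:
        return "cfs"
    return "unclassified"  # outside both observed bands


if __name__ == "__main__":
    # Synthetic snapshots in the /proc/pressure/cpu format
    before = "some avg10=0.00 avg60=0.00 avg300=0.00 total=1000000\n"
    after = "some avg10=0.00 avg60=0.00 avg300=0.00 total=31000000\n"
    print(classify(some_pct(before, after, window_s=60.0)))
```

Leaving the gap between the two bands unclassified is a deliberate choice in this sketch: a window that lands there would be flagged for inspection rather than assigned to either arm.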

The shipping release includes two behaviors relevant to this fallback path. The scheduler provides a per-CPU starvation guarantee: CPU-pinned kernel threads, which cannot migrate to another CPU, are serviced on their own CPU regardless of layer accounting — the starvation class that can otherwise trip the watchdog on small, saturated nodes. The agent additionally runs a scheduler supervisor: an unexpected scheduler exit is detected, the advertised status is updated, and the scheduler is respawned within 3–5 s (measured by SIGKILL test). Zero watchdog ejections were observed across the validation runs of the current release.

Summary of findings

Limitations

Raw records

  • docs/training-artifacts/deathstarbench/REPORT.md
  • docs/training-artifacts/deathstarbench/v14-validation/REPORT.md
  • docs/training-artifacts/pwb-v14-validation/raw/gke-e2-repro-agent.log
  • docs/training-artifacts/forensics/FINDINGS.md
  • docs/training-artifacts/forensics/FINDINGS-2-stranded-exclusive.md
  • docs/training-artifacts/headroom/gke-c2-http/v10-validation/REPORT.md
  • docs/training-artifacts/headroom/gke-c2-http/v11-validation/REPORT.md

Committed benchmark records in the product repository; design partners get the full artifact tree. Windows that ran after a watchdog eject are asterisked in the raw tables, never silently dropped.