Idle-capacity harvesting under tail-latency constraints

Latency-critical Kubernetes nodes are typically provisioned with substantial idle headroom because packing is driven by declared CPU requests rather than usage. We evaluate whether batch work can occupy that headroom without degrading a protected service's tail latency, using memcached and redis primaries under background-load ladders on GKE and EKS. With kernel-enforced scheduling tiers, primary p99 remained flat at node utilizations of 0.81–0.95 where CFS baselines degraded 3–12×; the same enforcement enabled +72% pod placement under request overcommit and a 3→2 node consolidation.

workloads: memcached · redis · batch fillers
clusters: GKE (e2, c2, c3) · EKS (m5)
records: 8 reports
method: benchmark methodology

Background: request-based packing and idle headroom

Kubernetes packs nodes by declared CPU requests. Requests are routinely padded two to four times above observed usage because the request is a pod's only protection at admission time: once pods share a node, the Linux scheduler — not the request — arbitrates every microsecond of CPU. Operators compensate by provisioning latency-critical services with headroom that sits idle most of the time.

Idle-capacity harvesting places Background-tier batch work (builds, encoding, analytics) into the idle cycles of protected nodes, under the constraint that batch work yields immediately when the protected tier has demand. The constraint is the operative requirement: if batch work can delay even one wakeup of the protected service, the headroom was cheaper than the colocation.

Scheduling mechanism

CFS arbitrates with proportional weights (cpu.weight, the successor of cpu.shares): a Guaranteed pod outweighs a BestEffort spinner by a large ratio, and over a scheduling period each side receives CPU in that ratio. Proportional sharing is not preemptive priority, with three consequences relevant here:

Weights divide time; they do not answer wakeups. When the protected service's thread wakes, the batch thread currently on the CPU is entitled to finish out its time slice. On a loaded node, that wait appears in the p99.
Wakeup placement degrades under load. The idle-CPU search that makes CFS fast on an empty node finds progressively less; woken threads increasingly queue behind running ones instead of starting immediately.
BestEffort still contends. A weight of 1 is not a weight of 0 — batch threads and the service time-slice the same runqueues, so the primary's tail tracks node occupancy. This is consistent with the CFS baselines reported below.
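The first consequence can be illustrated with a toy model. The sketch below is not CFS and does not reproduce any measured number: it assumes a single CPU, a batch thread that is always runnable, and a fixed slice length (`slice_ms`, a hypothetical parameter). It compares the delay a woken service thread sees when it must wait out the running slice with the delay when the wakeup preempts immediately.

```python
import math
import random

def wakeup_delays(n_wakeups, slice_ms, preempt_on_wakeup, seed=0):
    """Toy single-CPU model: a batch thread runs back-to-back slices.

    Each service wakeup lands at a uniformly random point inside the
    current batch slice. Without preemption the service waits for the
    rest of the slice; with preemption it starts immediately.
    """
    rng = random.Random(seed)
    delays = []
    for _ in range(n_wakeups):
        used = rng.uniform(0.0, slice_ms)   # time already consumed in the slice
        remaining = slice_ms - used
        delays.append(0.0 if preempt_on_wakeup else remaining)
    return delays

def percentile(values, q):
    """Nearest-rank percentile, q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]

if __name__ == "__main__":
    for preempt in (False, True):
        d = wakeup_delays(10_000, slice_ms=3.0, preempt_on_wakeup=preempt)
        print(f"preempt={preempt}: p50={percentile(d, 50):.2f} ms, "
              f"p99={percentile(d, 99):.2f} ms")
```

In this model the median wait is about half a slice, but the p99 approaches the full slice length, which is why the effect shows up in the tail rather than the mean.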

Temper's node engine replaces this arbitration with explicit layers in the kernel's scheduling path (scx_layered on sched_ext). The Critical tier gets a protected, fenced layer. Background gets an Open layer that runs only on CPUs the fence is not using. The protected-while-busy policy governs the loan: a protected layer's CPUs are loaned out only when that layer is idle everywhere, and loaned CPUs are preempted back at the layer's idle→demand transition, so batch occupancy does not add a slice-length delay to the woken owner.
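The loan rule can be written down as a small state machine. This is a sketch of the policy as described in the paragraph above, not scx_layered's implementation; the class and method names (`ProtectedLayer`, `try_loan`, `on_demand`, `on_idle`) are hypothetical, and reclaiming every loaned CPU at once on the first demand is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class ProtectedLayer:
    cpus: frozenset                              # CPUs fenced for the Critical tier
    busy: set = field(default_factory=set)       # CPUs currently running protected work
    loaned: set = field(default_factory=set)     # CPUs currently lent to the Open layer

    def try_loan(self):
        """Lend CPUs only when the layer is idle on every one of its CPUs."""
        if self.busy:
            return set()
        self.loaned = set(self.cpus)
        return set(self.loaned)

    def on_demand(self, cpu):
        """Idle->demand transition: take back all loaned CPUs immediately.

        The returned set is where Open-layer tasks would be preempted.
        """
        reclaimed = set(self.loaned)
        self.loaned.clear()
        self.busy.add(cpu)
        return reclaimed

    def on_idle(self, cpu):
        """A protected thread went to sleep on `cpu`."""
        self.busy.discard(cpu)

if __name__ == "__main__":
    layer = ProtectedLayer(cpus=frozenset({0, 1}))
    print(layer.try_loan())     # layer idle everywhere: both CPUs loaned
    print(layer.on_demand(0))   # demand returns: both CPUs reclaimed
    print(layer.try_loan())     # layer busy: nothing loaned
```

The point of the "idle everywhere" condition is that a partially busy layer never lends, so a protected wakeup on a second CPU does not have to wait for a reclaim.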

Multi-node results

[Figure: memcached p99 vs. batch-filler density, multi-node. Series: CFS (stock kube-scheduler), Temper (scx_layered). Headline: Temper flat · CFS 3.1×.]

3-node GKE cluster (e2-standard-4), two memcached primaries under memtier load, batch-filler ladder 0→18 plus 6 background spinners. Node utilization 0.81–0.92.

Single run per arm; ±0.2 ms is within observed noise, and the flat-vs-3.1× delta is far outside it. The second memcached instance tracked the CFS arm (~1.6 ms) in this GKE run; the asymmetry is unexplained and did not reproduce on EKS, where both instances held near-flat (1.2–1.3× vs CFS ~3×).

source: docs/training-artifacts/binpack/REPORT.md · EKS replication: docs/training-artifacts/binpack/records/eks/

The same ladder was replicated on EKS. Two instrumentation corrections were required for a valid run: the memtier client was CFS-throttled by its own 500m CPU limit, masking the signal, and the initial nodegroup spanned availability zones, introducing a ~1 ms cross-AZ RTT floor that dominated a sub-millisecond effect. With single-AZ placement and an unthrottled client, the EKS run shows the flat line directly: p99 held 0.343–0.399 ms from node utilization 0.36 to 0.95 while background work delivered ~3.9 cores of a 4-vCPU node; CFS incurred a +33% tail increase for the same reclaim. The earlier EKS run with the defective instrumentation is retained in the record set, marked invalid.
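A client throttled by its own CPU limit can be detected before a run is trusted. The sketch below assumes cgroup v2, where a container's `cpu.stat` file exposes the `nr_periods` and `nr_throttled` counters; it parses two snapshots and reports the fraction of enforcement periods in which the cgroup was throttled. The function names and the sample text are illustrative and not taken from the benchmark records, and the location of the file depends on the node's cgroup layout, so reading it is left to the caller.

```python
def parse_cpu_stat(text):
    """Parse the 'key value' lines of a cgroup v2 cpu.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats

def throttled_fraction(before, after):
    """Fraction of CFS bandwidth periods that were throttled between snapshots."""
    periods = after["nr_periods"] - before["nr_periods"]
    throttled = after["nr_throttled"] - before["nr_throttled"]
    return throttled / periods if periods > 0 else 0.0

if __name__ == "__main__":
    # Illustrative snapshots taken before and after a load step.
    start = parse_cpu_stat("usage_usec 1000\nnr_periods 100\nnr_throttled 10\nthrottled_usec 500\n")
    end = parse_cpu_stat("usage_usec 9000\nnr_periods 200\nnr_throttled 50\nthrottled_usec 4500\n")
    frac = throttled_fraction(start, end)
    print(f"client throttled in {frac:.0%} of periods")
    if frac > 0.0:
        print("load generator is CPU-limited; latency readings may be masked")
```

Any nonzero fraction on the load generator during a measurement step is a reason to raise or remove its CPU limit and rerun, as was done for the EKS replication.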

Operating-point sensitivity

The cost of colocation under CFS depends on how hard the primary itself is driven. A sensitivity sweep on EKS quantifies this at three operating points. With a nearly idle primary (1 thread × 4 connections), both schedulers are flat and at parity within noise; there is no tail to protect. At the default operating point, Temper's p99 is 23% lower than CFS's at the top of the ladder. At the heavy point (4 threads × 8 connections, the primary saturating its allocation), CFS p99 rises 256% across the ladder while Temper stays within +70% of its own baseline, which leaves Temper's p99 3.3× lower (−70%) at the top step. The advantage therefore scales with primary load; all three operating points are reported.
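The results in this article are quoted both as multipliers ("3.3× lower") and as percent changes ("−70%", "+256%"). The helpers below are a small convenience for converting between the two forms; the function names are illustrative.

```python
def pct_change(new, old):
    """Percent change from `old` to `new` (negative means lower)."""
    return (new - old) / old * 100.0

def pct_from_times_lower(times_lower):
    """Express 'N× lower' as a percent change: (1/N - 1) * 100."""
    return (1.0 / times_lower - 1.0) * 100.0

def ratio_from_pct_increase(pct):
    """Express a '+P%' increase as a multiplier: 1 + P/100."""
    return 1.0 + pct / 100.0

if __name__ == "__main__":
    print(f"3.3x lower  -> {pct_from_times_lower(3.3):.1f}%")
    print(f"+256%       -> {ratio_from_pct_increase(256.0):.2f}x")
    print(f"18 -> 31    -> {pct_change(31, 18):+.0f}%")
```

A 3.3× reduction is about −69.7%, which rounds to the −70% quoted for the heavy operating point, and a +256% rise is a 3.56× multiplier.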

[Figure: memcached, heavy operating point, GKE dedicated cores. Series: CFS, Temper. Headline: −88% p99 at the knee.]

Saturating memcached primary (4t×8c) vs. background-spinner ladder. 2× c2-standard-4, 2026-07-02.

−88% p99 at 4 spinners (0.415 vs 3.231 ms), −71% at idle; CFS +202% across the ladder, Temper flat. Cross-cloud: EKS heavy point −70%. One Temper step in this run reported background_cores=0.0 with p99 unaffected; per-tier core attribution is a known open instrumentation item.

source: docs/training-artifacts/headroom/gke-c2-mc-heavy/REPORT.md · docs/training-artifacts/headroom/eks-sensitivity/REPORT.md

Redis shows the same operating-point dependence. At the default operating point a single hot redis thread is the most favorable case for CFS weight arbitration, and the difference is modest (−17% to −24%); Temper is flat, with its worst step +13% over its own baseline while background reclaims ~3.8 cores at 0.98 node utilization. At the heavy point the difference widens to −55% at the knee and −40% on the idle node, where Confined placement isolates the primary from system-pod noise before any background load is added.

Single-node density with tier QoS alone

[Figure: memcached vs. density, tier QoS only. Series: CFS, Temper. Headline: CFS degrades 12.3× · Temper flat.]

Single node (c3-standard-8), density ladder 0→8. No workload profile; tiers are derived from PriorityClass alone.

CFS broke the SLO at density 2 (0.335 ms) and reached 1.951 ms at density 8; Temper held 0.183 ms through an up-and-down ladder. At zero load, fenced placement idles ~20% above CFS idle (0.191 vs 0.159 ms), and background was limited to 1.84 cores vs CFS's 6.22. A strictly protective fence trades reclaim for protection; the reclaim side is addressed by the loan policy in the next section.

source: docs/training-artifacts/OVERNIGHT-REPORT.md · docs/training-artifacts/memcached/SUMMARY.md

Capacity reclaim and request overcommit

Holding the tail is one half of the problem; reclaiming the idle cycles is the other. A strictly protective layer strands capacity: under a bursty primary, a fence that held its CPUs unconditionally left background work at 1.89 cores and node utilization at 0.40 while the primary idled 60% of the time. The protected-while-busy loan policy (loans only when the layer is idle everywhere, preemption on demand return) takes the same node to 5.65 background cores and 0.85 utilization at −7% primary throughput (the report's figure; raw samples/s read 25.5→23.3, which works out to about −8.6%). Steady-state primary throughput was unaffected.

With enforcement in place, the request padding itself becomes recoverable. On the same 3-node cluster, halving the declared requests of non-Critical pods (the opt-in overcommit webhook; Critical requests untouched, originals preserved in annotations) moved the request-packing wall from 18 to 31 fillers: +72% pods placed, with p99 still under 1.9 ms. A 16-pod fleet that does not fit on two nodes at stock requests ran entirely on two after overcommit, a 3→2 node consolidation (−33%). The p99 spot-check for that fleet was 1.56 ms, taken with the client co-located with a server, which biases the reading upward. With Karpenter performing placement in both arms, the same load and SLO required 40% fewer provisioned vCPUs (12 vs 20) with Temper enforcement underneath.
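The request mutation described above can be sketched as a pure function over a pod manifest. This is an illustration of the behavior the text describes (scale non-Critical CPU requests, leave Critical pods alone, keep the originals in an annotation), not the webhook's actual code. The annotation key, the `critical` PriorityClass name, and the function names are hypothetical, and the quantity parser handles only plain and millicore CPU values.

```python
import copy

# Hypothetical annotation key; the real webhook's key is not given in the text.
ORIGINAL_CPU_ANNOTATION = "example.invalid/original-cpu-requests"

def parse_millicores(value):
    """Parse a CPU quantity such as '500m', '2' or '0.5' into millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return int(float(value) * 1000)

def overcommit(pod, factor=0.5, critical_class="critical"):
    """Return a copy of `pod` with non-Critical CPU requests scaled by `factor`."""
    if pod["spec"].get("priorityClassName") == critical_class:
        return pod  # Critical requests are left untouched
    out = copy.deepcopy(pod)
    originals = []
    for container in out["spec"]["containers"]:
        requests = container.get("resources", {}).get("requests", {})
        if "cpu" not in requests:
            continue
        originals.append(f'{container["name"]}={requests["cpu"]}')
        scaled = max(1, int(parse_millicores(requests["cpu"]) * factor))
        requests["cpu"] = f"{scaled}m"
    if originals:
        annotations = out.setdefault("metadata", {}).setdefault("annotations", {})
        annotations[ORIGINAL_CPU_ANNOTATION] = ",".join(originals)
    return out

if __name__ == "__main__":
    filler = {
        "metadata": {"name": "filler-0"},
        "spec": {"containers": [
            {"name": "app", "resources": {"requests": {"cpu": "500m"}}},
        ]},
    }
    print(overcommit(filler))
```

Halving requests does not double the packing wall on its own: the Critical pods keep their full requests, which is consistent with the wall moving from 18 to 31 fillers rather than to 36.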

Limitations

All ladder and packing runs are single-run-per-arm. ±0.2 ms p99 differences are within noise; the flat-vs-3.1× and 18-vs-31 deltas are far outside it.
The second memcached instance in the GKE multi-node run tracked the CFS arm (~1.6 ms) rather than holding flat; the asymmetry is unexplained and did not reproduce on EKS.
At a nearly idle primary operating point, the two schedulers are at parity: there is no tail contention to remove, so no improvement is expected or observed.
Fenced placement carries an idle-latency cost (~20% above CFS idle at zero load in the single-node density run) and, without the loan policy, reduces background reclaim.
+72% is a requests-packing number at overcommit factor 0.5 on a workload whose true usage fits. In a fleet whose real usage exceeds capacity, Open-tier throughput degrades first, by design; Critical p99 held in every measured state.
The 3→2 consolidation shown here was performed manually (cordon + delete). The consolidation controller automates the same shrink; the measured run and its caveats are in the node-consolidation article.
One EKS run was invalidated by client CPU throttling and cross-AZ placement; it is retained in the record set and marked invalid.

Raw records

docs/training-artifacts/binpack/REPORT.md
docs/training-artifacts/binpack/SAVINGS-REPORT.md
docs/training-artifacts/binpack/records/eks/
docs/training-artifacts/headroom/gke-c2/REPORT.md
docs/training-artifacts/headroom/gke-c2-mc-heavy/REPORT.md
docs/training-artifacts/headroom/gke-c2-redis/REPORT.md
docs/training-artifacts/headroom/gke-c2-redis-heavy/REPORT.md
docs/training-artifacts/headroom/eks-valid/REPORT.md
docs/training-artifacts/headroom/eks-sensitivity/REPORT.md
docs/training-artifacts/headroom/eks-inconclusive/FINDINGS.md (invalidated run, kept)
docs/training-artifacts/karpenter/REPORT.md
docs/training-artifacts/memcached/SUMMARY.md
docs/training-artifacts/OVERNIGHT-REPORT.md

Committed benchmark records in the product repository; design partners get the full artifact tree. Single-run arms are labeled in each report; anomalies are published, not pruned.