02 Deep dive

CFS bandwidth control and database tail latency

We measure the tail-latency effect of CFS bandwidth control on quota-limited PostgreSQL, MySQL, and Cassandra pods on GKE, and attribute the mechanism using cgroup throttle counters. Replacing quota freezing with layer-based placement confinement reduced p99 latency 4–7× on the tested shapes. One scoping fact is central to interpreting the results: cpu.max is not kernel-enforced while a sched_ext scheduler is attached, so a direct quota-parity consumption measurement is included.

Background: CFS bandwidth control

A Kubernetes CPU limit becomes cgroup cpu.max: a quota of CPU-microseconds per enforcement period (100 ms by default). CFS bandwidth control charges every fair-class thread's runtime against the quota; when it is exhausted, the kernel freezes every thread in the cgroup until the next period refill. The freeze is indiscriminate. For a database it can land mid-transaction, while locks are held, and every queued client behind that transaction inherits the stall. This is why quota-limited databases show 30–70 ms p99 cliffs on nodes that are otherwise idle: no co-tenant is required, because the workload's own burst exhausts the quota.
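As a minimal sketch of the mapping described above, the following converts a CPU limit in millicores into the "quota period" pair written to cpu.max, and parses such a pair back into cores. The function names are hypothetical, and the 100 ms period is the default stated above; a node configured with a different period would produce different numbers.

```python
from typing import Optional, Tuple


def cpu_max_for_limit(limit_millicores: int, period_us: int = 100_000) -> str:
    """Return the cgroup v2 cpu.max string for a CPU limit in millicores.

    Assumes the quota scales linearly with the limit:
    quota_us = limit_millicores * period_us / 1000.
    """
    quota_us = limit_millicores * period_us // 1000
    return f"{quota_us} {period_us}"


def parse_cpu_max(text: str) -> Tuple[Optional[float], int]:
    """Parse a cpu.max string into (cores, period_us).

    cores is None when the quota field is "max" (no limit set).
    """
    quota, period = text.split()
    period_us = int(period)
    if quota == "max":
        return None, period_us
    return int(quota) / period_us, period_us


# The 1500m limit used in the experiments below.
print(cpu_max_for_limit(1500))          # 150000 100000
print(parse_cpu_max("150000 100000"))   # (1.5, 100000)
```

Read this way, a 1500m pod may consume 150 ms of CPU time in each 100 ms window; a burst across several threads can spend that budget well before the window ends, and the remainder of the window is the stall.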

A common operational response is to remove limits, which trades the throttle cliff for unbounded contention and motivates the request padding examined in the idle-capacity harvesting study. The experiments below test whether containment can be retained without the freeze.

Results: throttling verified with kernel counters

PostgreSQL under CPU limits: throttle tails eliminated

pgbench (-c 8) p99, quota-limited postgres (requests=limits=1500m, Critical tier), background ladder. 2× c2-standard-4, 2026-07-02.

[Figure: p99 (ms) against background spinners (ladder: 0, 1, 2, 4, 8), CFS arm vs. Temper arm. Labeled values at bg=0: 33.6 ms (CFS), 4.3 ms (Temper).]

5–7× lower p99 at every step, including the idle node (33.6 vs 4.3 ms at bg=0). Mechanism attribution via kernel counters: in a 20 s window the CFS arm's cgroup logged nr_throttled +199 and 16.48 s of throttled time; the Temper arm logged zero of both. The interpretation of that zero is scoped in the enforcement section below. Single run per arm. source: docs/training-artifacts/headroom/gke-c2-pgbench/REPORT.md
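The counter attribution above amounts to sampling the pod cgroup's cpu.stat twice and differencing two fields. A minimal sketch follows, assuming cgroup v2 and its nr_throttled and throttled_usec fields; the helper names are hypothetical and the two samples are synthetic values chosen to reproduce the reported deltas, not the raw records.

```python
from typing import Dict


def parse_cpu_stat(text: str) -> Dict[str, int]:
    """Parse the 'key value' lines of a cgroup v2 cpu.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_delta(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, float]:
    """Difference two cpu.stat samples into throttle events and seconds."""
    return {
        "nr_throttled": after["nr_throttled"] - before["nr_throttled"],
        "throttled_s": (after["throttled_usec"] - before["throttled_usec"]) / 1e6,
    }


# Synthetic samples (illustrative only), taken 20 s apart.
before = parse_cpu_stat("nr_periods 5000\nnr_throttled 1000\nthrottled_usec 5000000\n")
after = parse_cpu_stat("nr_periods 5200\nnr_throttled 1199\nthrottled_usec 21480000\n")

print(throttle_delta(before, after))
```

In practice the text would be read from the pod's cpu.stat under /sys/fs/cgroup at the start and end of the window. If the default 100 ms period applies, a 20 s window spans about 200 enforcement periods, so +199 means the cgroup was throttled in nearly every period.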

Doubling the client pressure (-c 16 -j 4) shows that the throttle cliff is self-inflicted and scales with offered load, not with co-tenants. CFS p99 stayed flat at 66.9–68.0 ms across every background step, because the throttling is driven by the workload's own quota rather than by contention. Temper started at 13.4 ms and degraded to 47.2 ms as sixteen clients exceeded what a 1500m pod can serve under background pressure; it remained below CFS at every step. The degradation is expected: removing the throttle cliff does not remove queueing under genuine overload.

The same pattern held on two more engines:

cpu.max enforcement under sched_ext

The Temper arm's zero throttle count is a frozen counter, not a pacing result. CFS bandwidth control is fair-class machinery: quota is charged only on the fair scheduling class's accounting path, sched_ext tasks never run on a fair runqueue, and so they never charge the quota at all. This was verified against the kernel source on 6.12; the 6.17 ops.cgroup_set_bandwidth interface is notification-only. While scx_layered is attached, the kernel does not enforce cpu.max and the cgroup's throttle counters stop moving. This is a property of the kernel feature, not a Temper design choice, and it means the two benchmark arms above were not limit-identical: the CFS arm was quota-throttled, while the Temper arm was bounded by its layer placement.
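Because a zero throttle count means different things in the two arms, it helps to record whether a sched_ext scheduler was attached when the counters were sampled. The sketch below assumes the sysfs files described in the kernel's sched_ext documentation (/sys/kernel/sched_ext/state and /sys/kernel/sched_ext/root/ops); the function name is hypothetical, and on kernels without sched_ext the files are simply absent.

```python
from pathlib import Path
from typing import Dict, Optional


def sched_ext_status(sysfs_root: str = "/sys/kernel/sched_ext") -> Dict[str, object]:
    """Report whether a sched_ext scheduler appears to be attached.

    Missing files are treated as 'not attached' rather than as an error.
    """
    root = Path(sysfs_root)

    def read(rel: str) -> Optional[str]:
        try:
            return (root / rel).read_text().strip()
        except OSError:
            return None

    state = read("state")
    return {
        "state": state,            # e.g. "enabled" or "disabled"
        "ops": read("root/ops"),   # name of the loaded scheduler, if any
        "attached": state == "enabled",
    }


print(sched_ext_status())
```

Stored next to each cpu.stat sample, an attached status of True marks a flat nr_throttled as "counter not charged" rather than "no throttling needed".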

Under Temper, the binding constraint on a pod is the layer ceiling: a Confined layer's cpus_range and utilization band cap where and how much the pod's threads run. On the MySQL run above, that ceiling (a whole-core [2,2] allocation) corresponds to roughly 1.9 effective cores against a 1.5-core quota, so that record cannot apportion how much of its 4× improvement came from removing the throttle freeze versus the additional fraction of a core; the report states this explicitly.

To resolve the confound we measured consumption directly. The quota-parity measurement (2026-07-04): the same 1.5-CPU Guaranteed MySQL pod on a c2-standard-8 with four BestEffort spinners, 120 s per arm, consumption read from the pod's own cpu.stat usage_usec delta:

Arm     Cores consumed  Quota                                tps  p95
CFS     1.486           1.5 (kernel-enforced)                675  56.8 ms
Temper  1.353           1.5 (not kernel-enforced under scx)  592  30.8 ms

The Temper arm consumed less than its quota: on this shape the whole-core layer ceiling binds below cpu.max, so the latency improvement is not attributable to excess CPU consumption. The trade-off at this shape is −12% throughput for −46% p95: confinement removes the refill-stall cliff and also the burst headroom. source: docs/training-artifacts/mysql-oltp/REPORT.md (quota-parity addendum) · raw: mysql-oltp/quota-parity-v15/qp.txt
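The consumption figures and the trade-off percentages follow from simple arithmetic, sketched here. The helper names are hypothetical, and the usage_usec delta in the first example is back-computed from the reported 1.353 cores over 120 s for illustration; it is not a value taken from qp.txt.

```python
def cores_consumed(usage_usec_start: int, usage_usec_end: int, elapsed_s: float) -> float:
    """Average cores used: CPU-microseconds consumed per wall-clock microsecond."""
    return (usage_usec_end - usage_usec_start) / (elapsed_s * 1_000_000)


def pct_change(baseline: float, value: float) -> float:
    """Signed percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100


# Illustrative delta: 162,360,000 CPU-usec over a 120 s arm.
print(cores_consumed(0, 162_360_000, 120))   # 1.353

# Trade-off from the reported values, Temper relative to CFS.
print(round(pct_change(675, 592)))           # -12  (throughput, tps)
print(round(pct_change(56.8, 30.8)))         # -46  (p95 latency, ms)
```

Reading usage_usec from the pod's own cgroup, rather than from a node-level metric, keeps the measurement independent of whether the kernel is enforcing the quota.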

The supported claim is therefore “measured consumption at-or-below quota on the tested shapes,” not “limits enforced.” The layer ceiling is currently derived from requests, not from cpu.max; a pod whose limit sits far below the whole-core granularity of its layer could consume above its limit. Three properties are unconditional: memory limits are unaffected (only CPU quota semantics change), usage accounting continues to function, and safe mode restores CFS with quota enforcement immediately. Roadmap items: deriving layer ceilings from cpu.max so limits are honored equivalently, and implementing the ≥6.17 bandwidth callback. Where strict CPU quota enforcement is a compliance requirement on a node, Temper should not be attached to that node; mixed fleets are supported.

Limitations

Raw records

  • docs/training-artifacts/headroom/gke-c2-pgbench/REPORT.md
  • docs/training-artifacts/headroom/gke-c2-pgbench-c16/REPORT.md
  • docs/training-artifacts/mysql-oltp/REPORT.md (incl. quota-parity addendum)
  • docs/training-artifacts/mysql-oltp/quota-parity-v15/qp.txt
  • docs/training-artifacts/cassandra/REPORT.md
  • docs/security/WHITEPAPER.md §8.0 (the cpu.max enforcement disclosure)

These are committed benchmark records in the product repository; design partners receive the full artifact tree.