02 Deep dive

CFS bandwidth control and database tail latency

We measure the tail-latency effect of CFS bandwidth control on quota-limited PostgreSQL, MySQL, and Cassandra pods on GKE, and attribute the mechanism using cgroup throttle counters. Replacing quota freezing with layer-based placement confinement reduced p99 latency by 4–7× on the tested shapes. One scoping fact is central to interpreting the results: cpu.max is not kernel-enforced while a sched_ext scheduler is attached, so a direct quota-parity consumption measurement is included.

workloads: PostgreSQL · MySQL · Cassandra
clusters: GKE (c2)
records: 5 reports
method: benchmark methodology

Background: CFS bandwidth control

A Kubernetes CPU limit becomes cgroup cpu.max: a quota of CPU-microseconds per enforcement period (100 ms by default). CFS bandwidth control charges every fair-class thread's runtime against the quota; when it is exhausted, the kernel freezes every thread in the cgroup until the next period refill. The freeze is indiscriminate. For a database it can land mid-transaction, while locks are held, and every queued client behind that transaction inherits the stall. This is why quota-limited databases show 30–70 ms p99 cliffs on nodes that are otherwise idle: no co-tenant is required, because the workload's own burst exhausts the quota.
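The limit-to-quota arithmetic described above can be sketched in a few lines. This is a minimal illustration, not the kubelet's implementation: the function names are hypothetical, and the only facts assumed are the cgroup v2 cpu.max format ("quota period" in microseconds, or "max" for unlimited) and the 100 ms default period.

```python
from typing import Optional, Tuple


def cpu_limit_to_cpu_max(limit: str, period_us: int = 100_000) -> str:
    """Translate a Kubernetes CPU limit ('1500m' or '1.5') into a cpu.max line.

    Hypothetical helper: mirrors the arithmetic only (quota = cores * period).
    """
    if limit.endswith("m"):
        millicores = int(limit[:-1])
    else:
        millicores = round(float(limit) * 1000)
    quota_us = millicores * period_us // 1000
    return f"{quota_us} {period_us}"


def parse_cpu_max(text: str) -> Tuple[Optional[float], int]:
    """Parse a cpu.max line into (quota in cores, period in microseconds).

    A quota of 'max' means no limit and is returned as None.
    """
    quota, period = text.split()
    period_us = int(period)
    if quota == "max":
        return None, period_us
    return int(quota) / period_us, period_us


if __name__ == "__main__":
    line = cpu_limit_to_cpu_max("1500m")
    print(line)                 # cpu.max content for a 1500m limit
    print(parse_cpu_max(line))  # quota expressed back in cores
```

For the 1500m pods used in the experiments below, this yields 150,000 µs of runtime per 100,000 µs period, shared by every thread in the cgroup.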

A common operational response is to remove limits, which trades the throttle cliff for unbounded contention and motivates the request padding examined in the idle-capacity harvesting study. The experiments below test whether containment can be retained without the freeze.

Results: throttling verified with kernel counters

PostgreSQL under CPU limits: throttle tails eliminated

Figure: pgbench (-c 8) p99 for a quota-limited postgres pod (requests=limits=1500m, Critical tier) across the background-load ladder. 2× c2-standard-4, 2026-07-02. Series: CFS, Temper.

Temper's p99 was 5–7× lower at every background step, including the idle node (33.6 ms under CFS vs 4.3 ms under Temper at bg=0). Mechanism attribution via kernel counters: in a 20 s window the CFS arm's cgroup logged nr_throttled +199 and 16.48 s of throttled time; the Temper arm logged zero of both. The interpretation of that zero is scoped in the enforcement section below. Single run per arm. source: docs/training-artifacts/headroom/gke-c2-pgbench/REPORT.md
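A counter delta of this kind can be obtained by snapshotting the pod cgroup's cpu.stat at the start and end of the window. The sketch below assumes the standard cgroup v2 cpu.stat field names (nr_throttled, throttled_usec); the function names are hypothetical, and the absolute counter values in the sample snapshots are illustrative, chosen only so that the deltas reproduce the CFS-arm figures quoted above.

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse cgroup v2 cpu.stat content ('key value' per line) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_delta(before: str, after: str) -> dict:
    """Return the throttle-counter deltas between two cpu.stat snapshots."""
    b, a = parse_cpu_stat(before), parse_cpu_stat(after)
    return {
        "nr_throttled": a["nr_throttled"] - b["nr_throttled"],
        "throttled_s": (a["throttled_usec"] - b["throttled_usec"]) / 1e6,
    }


if __name__ == "__main__":
    # Illustrative snapshots; absolute values are made up, deltas match the text.
    before = "usage_usec 1000000\nnr_throttled 1000\nthrottled_usec 5000000\n"
    after = "usage_usec 31000000\nnr_throttled 1199\nthrottled_usec 21480000\n"
    print(throttle_delta(before, after))
```

On a live node the two snapshots would be read from the pod's cpu.stat file under /sys/fs/cgroup; the exact path depends on the cgroup driver and is not assumed here.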

Doubling the client pressure (-c 16 -j 4) shows that the throttle cliff is self-inflicted and scales with offered load, not with co-tenants. CFS p99 held at 66.9–68.0 ms at every background step. The curve is flat because the throttling is driven by the workload's own quota rather than by contention. Temper started at 13.4 ms and degraded to 47.2 ms as sixteen clients exceeded what a 1500m pod can serve under background pressure; it remained below CFS at every step. The degradation is expected: removing the throttle cliff does not remove queueing under genuine overload.

The same pattern held on two more engines:

MySQL (sysbench oltp_read_write, 8 threads, 1500m requests=limits): CFS p99 60.0–65.7 ms at every step; Temper 15.6–16.7 ms, flat through bg=8 while the node ran at 0.99 utilization. Kernel counters in a 20 s window: CFS nr_throttled +200, 17.93 s throttled.
Cassandra (JVM, 3-core quota, tier-only, no profile): the CFS arm hit nr_throttled +228 (7.55 s throttled) in 20 s on an idle node. The JVM sized its thread pools from availableProcessors()=4 while running under a 3-core quota; the pod ran 84 threads, so quota exhaustion occurred without external load. Idle-node p99: 11.8 ms (CFS) vs 2.6 ms (Temper); a manual re-check measured 10.9 ms, so the idle-point delta ranges −69…−78%. JVM thread structure is examined further in the workload-profile study.
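Why a workload can throttle itself on an idle node follows from simple arithmetic: when more CPUs are busy with the cgroup's threads than the quota covers, the period's budget is spent before the period ends. The sketch below is an idealized model, not a prediction of the measured figures. It assumes the threads are continuously runnable on busy_cpus CPUs and ignores how the kernel distributes quota slices per CPU; the function name is hypothetical.

```python
def stall_per_period_ms(quota_cores: float, busy_cpus: int,
                        period_ms: float = 100.0) -> float:
    """Idealized stall per enforcement period under continuous demand.

    The cgroup may run quota_cores * period_ms of CPU time per period. With
    busy_cpus CPUs consuming it in parallel, that budget lasts
    quota_cores * period_ms / busy_cpus of wall time; the remainder of the
    period is spent throttled.
    """
    if busy_cpus <= quota_cores:
        return 0.0  # demand fits inside the quota, no throttling
    run_ms = quota_cores * period_ms / busy_cpus
    return period_ms - run_ms


if __name__ == "__main__":
    # 3-core quota with thread pools sized for 4 processors (Cassandra shape).
    print(stall_per_period_ms(3, 4))
    # 1.5-core quota with all 4 CPUs of a 4-vCPU node busy.
    print(stall_per_period_ms(1.5, 4))
```

Real workloads are bursty rather than continuously runnable, so observed stalls depend on when in the period the burst lands; the model only shows that no co-tenant is needed for the quota to run out.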

cpu.max enforcement under sched_ext

The Temper arm's zero throttle count is a frozen counter, not a pacing result. CFS bandwidth control is fair-class machinery: quota is charged only on the fair scheduling class's accounting path, sched_ext tasks never run on a fair runqueue, and so they never charge the quota at all. This was verified against the kernel source on 6.12; the 6.17 ops.cgroup_set_bandwidth interface is notification-only. While scx_layered is attached, the kernel does not enforce cpu.max and the cgroup's throttle counters stop moving. This is a property of the kernel feature, not a Temper design choice, and it means the two benchmark arms above were not limit-identical: the CFS arm was quota-throttled, while the Temper arm was bounded by its layer placement.

Under Temper, the binding constraint on a pod is the layer ceiling: a Confined layer's cpus_range and utilization band cap where and how much the pod's threads run. On the MySQL run above, that ceiling (a whole-core [2,2] allocation) corresponds to roughly 1.9 effective cores against a 1.5-core quota, so that record cannot apportion how much of its 4× improvement came from removing the throttle freeze versus the additional fraction of a core; the report states this explicitly.

To resolve the confound we measured consumption directly. The quota-parity measurement (2026-07-04) ran the same 1.5-CPU Guaranteed MySQL pod on a c2-standard-8 with four BestEffort spinners for 120 s per arm, reading consumption from the usage_usec delta in the pod's own cpu.stat:

Arm	Cores consumed	Quota (cores)	Throughput (tps)	p95 latency
CFS	1.486	1.5 (kernel-enforced)	675	56.8 ms
Temper	1.353	1.5 (not kernel-enforced under scx)	592	30.8 ms

The Temper arm consumed less than its quota: on this shape the whole-core layer ceiling binds below cpu.max, so the latency improvement is not attributable to excess CPU consumption. The trade-off at this shape is −12% throughput for −46% p95: confinement removes the refill-stall cliff and also the burst headroom. source: docs/training-artifacts/mysql-oltp/REPORT.md (quota-parity addendum) · raw: mysql-oltp/quota-parity-v15/qp.txt
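The two derived quantities in this comparison, cores consumed and the relative trade-off, can be reproduced with the arithmetic below. This is a sketch with hypothetical function names. The usage_usec values in the example are back-computed from the reported 1.486 cores over a 120 s window for illustration; they are not taken from the raw record.

```python
def cores_consumed(usage_usec_start: int, usage_usec_end: int,
                   wall_seconds: float) -> float:
    """Average cores used over a window, from the cpu.stat usage_usec delta."""
    return (usage_usec_end - usage_usec_start) / (wall_seconds * 1_000_000)


def pct_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate against baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Illustrative counters: 178.32 s of CPU time over 120 s of wall time.
    print(round(cores_consumed(0, 178_320_000, 120), 3))
    # Trade-off from the table: Temper vs CFS.
    print(round(pct_change(675, 592)))    # throughput (tps)
    print(round(pct_change(56.8, 30.8)))  # p95 latency (ms)
```

Reading consumption from the pod's own usage accounting, rather than from the throttle counters, is what makes the comparison meaningful when the counters are frozen in one arm.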

The supported claim is therefore “measured consumption at-or-below quota on the tested shapes,” not “limits enforced.” The layer ceiling is currently derived from requests, not from cpu.max; a pod whose limit sits far below the whole-core granularity of its layer could consume above its limit. Three properties are unconditional: memory limits are unaffected (only CPU quota semantics change), usage accounting continues to function, and safe mode restores CFS with quota enforcement immediately. Roadmap items: deriving layer ceilings from cpu.max so limits are honored equivalently, and implementing the ≥6.17 bandwidth callback. Where strict CPU quota enforcement is a compliance requirement on a node, Temper should not be attached to that node; mixed fleets are supported.

Limitations

Single run per arm throughout; 20–120 s windows. The throttle-counter deltas are ~10× above noise; the p99 deltas are 4–7×.
p99/p95 figures come from the load tools' own client-side percentiles; closed-loop clients are subject to coordinated omission, which affects both arms identically.
Cassandra: single-node ring, RF=1, fsync-light, 4-CPU node. This is a contained lab shape, not a production ring. Its first CFS window (11.8 ms) was likely elevated by post-seed compaction; the re-checked idle point is 10.9 ms.
The MySQL Temper arm ran with the mysql-innodb workload profile active; a tier-only comparison arm was not run, so the profile's marginal contribution is not isolated in that record.

Raw records

docs/training-artifacts/headroom/gke-c2-pgbench/REPORT.md
docs/training-artifacts/headroom/gke-c2-pgbench-c16/REPORT.md
docs/training-artifacts/mysql-oltp/REPORT.md (incl. quota-parity addendum)
docs/training-artifacts/mysql-oltp/quota-parity-v15/qp.txt
docs/training-artifacts/cassandra/REPORT.md
docs/security/WHITEPAPER.md §8.0 (the cpu.max enforcement disclosure)

Committed benchmark records in the product repository; design partners get the full artifact tree.