02 Deep dive
CFS bandwidth control and database tail latency
We measure the tail-latency effect of CFS bandwidth control on quota-limited PostgreSQL,
MySQL, and Cassandra pods on GKE, attributing the mechanism with cgroup throttle counters.
Replacing quota freezing with layer-based placement confinement reduced p99 latency
by 4–7× on the tested shapes. One scoping fact is central to interpreting the
results: cpu.max is not kernel-enforced while a sched_ext scheduler is
attached, so a direct quota-parity consumption measurement is included.
Background: CFS bandwidth control
A Kubernetes CPU limit becomes cgroup cpu.max: a quota of CPU-microseconds per
enforcement period (100 ms by default). CFS bandwidth control charges every fair-class
thread's runtime against the quota; when it is exhausted, the kernel freezes every thread
in the cgroup until the next period refill. The freeze is indiscriminate. For a database
it can land mid-transaction, while locks are held, and every queued client behind that
transaction inherits the stall. This is why quota-limited databases show 30–70 ms
p99 cliffs on nodes that are otherwise idle: no co-tenant is required, because the
workload's own burst exhausts the quota.
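The quota arithmetic can be sketched as follows. This is a minimal illustration, not code from the benchmark harness: the function names are hypothetical, and `burst_runway_ms` assumes a worst case in which a fixed number of CPUs run the cgroup's threads continuously from the start of the period.

```python
from typing import Optional


def cpu_max_for_limit(millicores: int, period_us: int = 100_000) -> str:
    """Return the cgroup v2 cpu.max string ("<quota_us> <period_us>") for a CPU limit."""
    quota_us = millicores * period_us // 1000
    return f"{quota_us} {period_us}"


def burst_runway_ms(millicores: int, busy_cpus: int,
                    period_us: int = 100_000) -> Optional[float]:
    """Wall-clock ms into a period at which the quota runs out.

    Assumes (illustratively) that busy_cpus CPUs run the cgroup's threads
    continuously from the start of the period. Returns None when the quota
    cannot be exhausted at that level of parallelism.
    """
    quota_us = millicores * period_us // 1000
    if busy_cpus * period_us <= quota_us:
        return None
    return quota_us / busy_cpus / 1000


if __name__ == "__main__":
    print(cpu_max_for_limit(1500))    # cpu.max contents for a 1500m limit
    print(burst_runway_ms(1500, 4))   # ms of runtime before the freeze on 4 busy CPUs
```

Under that worst-case assumption, a 1500m pod bursting across four CPUs spends its 150 ms of quota in 37.5 ms of wall-clock time and is frozen for the remaining 62.5 ms of the period, with no co-tenant involved.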
A common operational response is to remove limits, which trades the throttle cliff for unbounded contention and motivates the request padding examined in the idle-capacity harvesting study. The experiments below test whether containment can be retained without the freeze.
Results: throttling verified with kernel counters
PostgreSQL under CPU limits: throttle tails eliminated
pgbench (-c 8) p99, quota-limited postgres (requests=limits=1500m, Critical tier), background ladder. 2× c2-standard-4, 2026-07-02.
5–7× lower p99 at every step, including the idle node (CFS 33.6 ms vs Temper 4.3 ms at bg=0).
Mechanism attribution via kernel counters: in a 20 s window the CFS arm's cgroup logged
nr_throttled +199 and 16.48 s of throttled time; the Temper arm logged zero of both.
The interpretation of that zero is scoped in the enforcement section below. Single run per arm.
source: docs/training-artifacts/headroom/gke-c2-pgbench/REPORT.md
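A counter attribution of this kind can be reproduced by diffing the pod cgroup's `cpu.stat` across a window. The sketch below is an assumption about how such a reading could be taken, not the harness used in the report; the function names and the cgroup directory argument are hypothetical, while `nr_throttled` and `throttled_usec` are the standard cgroup v2 `cpu.stat` keys.

```python
import time
from pathlib import Path


def parse_cpu_stat(text: str) -> dict[str, int]:
    """Parse cgroup v2 cpu.stat content ("key value" per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_delta(before: dict[str, int], after: dict[str, int]) -> dict[str, float]:
    """Return the change in throttle counters between two cpu.stat snapshots."""
    return {
        "nr_throttled": after["nr_throttled"] - before["nr_throttled"],
        "throttled_s": (after["throttled_usec"] - before["throttled_usec"]) / 1e6,
    }


def sample_window(cgroup_dir: str, seconds: float = 20.0) -> dict[str, float]:
    """Snapshot cpu.stat, sleep for the window, snapshot again, and diff.

    cgroup_dir is the pod's cgroup directory (hypothetical argument; the
    actual path depends on the node's cgroup driver and layout).
    """
    path = Path(cgroup_dir) / "cpu.stat"
    before = parse_cpu_stat(path.read_text())
    time.sleep(seconds)
    after = parse_cpu_stat(path.read_text())
    return throttle_delta(before, after)
```

A delta of zero from this reading only means the counters did not move during the window; whether that reflects an absence of throttling depends on whether the kernel was enforcing the quota at all.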
Doubling the client pressure (-c 16 -j 4) shows that the throttle cliff is
self-inflicted and scales with offered load, not with co-tenants. CFS p99 held at
66.9–68.0 ms at every background step — flat, because the throttling is
driven by the workload's own quota rather than by contention. Temper started at
13.4 ms and degraded to 47.2 ms as sixteen clients exceeded what a 1500m pod
can serve under background pressure; it remained below CFS at every step. The degradation
is expected: removing the throttle cliff does not remove queueing under genuine
overload.
The same pattern held on two more engines:
- MySQL (sysbench oltp_read_write, 8 threads, 1500m requests=limits): CFS p99 60.0–65.7 ms at every step; Temper 15.6–16.7 ms, flat through bg=8 while the node ran at 0.99 utilization. Kernel counters in a 20 s window: CFS nr_throttled +200, 17.93 s throttled.
- Cassandra (JVM, 3-core quota, tier-only, no profile): the CFS arm hit nr_throttled +228 (7.55 s throttled) in 20 s on an idle node. The JVM sized its thread pools from availableProcessors()=4 while running under a 3-core quota; the pod ran 84 threads, so quota exhaustion occurs without external load. Idle-node p99: 11.8 vs 2.6 ms (a manual re-check measured 10.9 ms, so the idle-point delta ranges −69…−78%). JVM thread structure is examined further in the workload-profile study.
cpu.max enforcement under sched_ext
The Temper arm's zero throttle count is a frozen counter, not a pacing result. CFS
bandwidth control is fair-class machinery: quota is charged only on the fair scheduling
class's accounting path, sched_ext tasks never run on a fair runqueue, and so they never
charge the quota at all. This was verified against the kernel source on 6.12; the 6.17
ops.cgroup_set_bandwidth interface is notification-only. While
scx_layered is attached, the kernel does not enforce cpu.max
and the cgroup's throttle counters stop moving. This is a property of the kernel feature,
not a Temper design choice, and it means the two benchmark arms above were not
limit-identical: the CFS arm was quota-throttled, while the Temper arm was bounded by its
layer placement.
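Because a frozen counter and a genuinely unthrottled pod look identical in `cpu.stat`, a measurement script can guard its interpretation by checking whether a sched_ext scheduler is attached. The sketch below assumes the sysfs state file documented for sched_ext (`/sys/kernel/sched_ext/state`, which reads `enabled` while a BPF scheduler is loaded); the function names are hypothetical and this is not part of the report's tooling.

```python
from pathlib import Path

# Assumed location of the sched_ext state file, per the kernel's sched_ext docs.
SCX_STATE_PATH = "/sys/kernel/sched_ext/state"


def is_scx_attached(state_text: str) -> bool:
    """Interpret the content of the sched_ext state file."""
    return state_text.strip() == "enabled"


def sched_ext_attached(state_path: str = SCX_STATE_PATH) -> bool:
    """Return True if a sched_ext scheduler appears to be attached.

    A missing or unreadable state file is treated as "not attached",
    which is also what a kernel built without sched_ext would produce.
    """
    try:
        return is_scx_attached(Path(state_path).read_text())
    except OSError:
        return False


def throttle_counters_meaningful(state_path: str = SCX_STATE_PATH) -> bool:
    """Throttle counters reflect quota enforcement only when no scx scheduler is attached."""
    return not sched_ext_attached(state_path)
```

Recording this flag next to each counter delta keeps a zero `nr_throttled` reading from being mistaken for evidence of pacing.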
Under Temper, the binding constraint on a pod is the layer ceiling: a Confined layer's
cpus_range and utilization band cap where and how much the pod's threads run.
On the MySQL run above, that ceiling (a whole-core [2,2] allocation) corresponds to roughly
1.9 effective cores against a 1.5-core quota, so that record cannot apportion how much of
its 4× improvement came from removing the throttle freeze versus the additional
fraction of a core; the report states this explicitly.
To resolve the confound we measured consumption directly. The quota-parity
measurement (2026-07-04) ran the same 1.5-CPU Guaranteed MySQL pod on a c2-standard-8 with
four BestEffort spinners for 120 s per arm, reading consumption from the usage_usec delta
in the pod's own cpu.stat:
| Arm | Cores consumed | Quota | tps | p95 |
|---|---|---|---|---|
| CFS | 1.486 | 1.5 (kernel-enforced) | 675 | 56.8 ms |
| Temper | 1.353 | 1.5 (not kernel-enforced under scx) | 592 | 30.8 ms |
The Temper arm consumed less than its quota: on this shape the whole-core layer ceiling binds below
cpu.max, so the latency improvement is not attributable to excess CPU consumption. The trade-off
at this shape is −12% throughput for −46% p95: confinement removes the refill-stall cliff and also
the burst headroom.
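The consumption and trade-off figures follow from simple arithmetic, sketched here. The helper names are hypothetical, and the `usage_usec` value in the usage example is back-computed from the reported 1.486 cores over 120 s for illustration; it is not taken from the raw record.

```python
def cores_consumed(usage_usec_start: int, usage_usec_end: int,
                   wall_seconds: float) -> float:
    """Average cores consumed: CPU-time delta divided by wall-clock time."""
    return (usage_usec_end - usage_usec_start) / (wall_seconds * 1e6)


def pct_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate against baseline, in percent."""
    return (candidate - baseline) / baseline * 100


if __name__ == "__main__":
    # Illustrative delta: 178.32 s of CPU time over a 120 s arm.
    print(cores_consumed(0, 178_320_000, 120))
    # Reported CFS vs Temper figures from the quota-parity table.
    print(round(pct_change(675, 592)))    # throughput change, percent
    print(round(pct_change(56.8, 30.8)))  # p95 change, percent
```

Reading consumption from `usage_usec` rather than from throttle counters matters here because usage accounting continues to function under sched_ext while the throttle counters do not.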
source: docs/training-artifacts/mysql-oltp/REPORT.md (quota-parity addendum) · raw: mysql-oltp/quota-parity-v15/qp.txt
The supported claim is therefore “measured consumption at-or-below quota on the
tested shapes,” not “limits enforced.” The layer ceiling is currently
derived from requests, not from cpu.max; a pod whose limit sits far below the
whole-core granularity of its layer could consume above its limit. Three properties are
unconditional: memory limits are unaffected (only CPU quota semantics change), usage
accounting continues to function, and
safe mode restores CFS with quota enforcement
immediately. Roadmap items: deriving layer ceilings from cpu.max so limits are
honored equivalently, and implementing the ≥6.17 bandwidth callback. Where strict CPU
quota enforcement is a compliance requirement on a node, Temper should not be attached to
that node; mixed fleets are supported.
Limitations
- Single run per arm throughout; 20–120 s windows. The throttle-counter deltas are ~10× above noise; the p99 deltas are 4–7×.
- p99/p95 figures come from the load tools' own client-side percentiles; closed-loop clients are subject to coordinated omission — identically in both arms.
- Cassandra: single-node ring, RF=1, fsync-light, 4-CPU node — a contained lab shape, not a production ring. Its first CFS window (11.8 ms) was likely elevated by post-seed compaction; the re-checked idle point is 10.9 ms.
- The MySQL Temper arm ran with the mysql-innodb workload profile active; a tier-only comparison arm was not run, so the profile's marginal contribution is not isolated in that record.
Raw records
- docs/training-artifacts/headroom/gke-c2-pgbench/REPORT.md
- docs/training-artifacts/headroom/gke-c2-pgbench-c16/REPORT.md
- docs/training-artifacts/mysql-oltp/REPORT.md (incl. quota-parity addendum)
- docs/training-artifacts/mysql-oltp/quota-parity-v15/qp.txt
- docs/training-artifacts/cassandra/REPORT.md
- docs/security/WHITEPAPER.md §8.0 (the cpu.max enforcement disclosure)
Committed benchmark records in the product repository; design partners get the full artifact tree.