05 Deep dive
Thread-level scheduling with workload profiles
Container-level CPU metrics average over threads with heterogeneous scheduling requirements and expose neither thread placement nor SMT topology. We evaluate workload profiles — declarative thread-group scheduling policies applied beneath pod QoS tiers — on ONNX inference, llama.cpp, MySQL, and Cassandra across GKE and EKS node shapes. Exclusive-core profiles reach idle-node throughput parity with CFS while holding throughput and tail latency flat under background load; a tier-only Cassandra run measures −78% idle-node p99 and motivates a JVM profile.
Mechanism: whole-core allocation and SMT geometry
scx_layered allocates CPU to Confined layers in whole physical cores. On a 2-core node, a 2–3 vCPU Confined layer occupies exactly one core — both of its hot threads therefore run on the two hyperthreads of a single physical core, while CFS spreads the same threads across cores. This condition is invisible to pod-level aggregates: the affected pod reads as adequately provisioned and lightly utilized.
The condition is directly measurable at thread granularity. llama.cpp inference
(Qwen2.5-0.5B Q4, 2 threads, Guaranteed 2 CPU, Critical tier) on a 2-core EKS m5.xlarge
ran ~20–25% slower at median under a tier-only Temper configuration than under CFS at
every background level, including on an idle node. The layer's cpus_used read
1.23, ruling out CPU-quantity starvation. Per-thread placement from the
/observe endpoint showed both inference threads on CPUs 0 and 2 — on the
m5 topology (sibling pairs (0,2)/(1,3)), the two hyperthreads of one physical core.
A workload profile declaring the hot threads an exclusive-core group recovered ~60% of the
idle-node gap (−12% median). The placement linter's smt_collision
invariant evaluates pinned tier layers as well as profile thread-group layers, so sibling
stacking on unprofiled workloads is surfaced as a lint violation.
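The sibling-stacking condition can be checked mechanically from per-thread placement. The following is a minimal Python sketch of such a check, not the placement linter's implementation: the function name `find_smt_collisions`, the thread names, and the input shapes are assumptions made for illustration. Only the m5 sibling pairs (0,2)/(1,3) and the observed placement on CPUs 0 and 2 come from the measurement above.

```python
from itertools import combinations


def find_smt_collisions(placement, sibling_groups):
    """Return (thread_a, thread_b, core_index) for threads sharing a physical core.

    placement: dict of thread name -> logical CPU id (hypothetical input shape).
    sibling_groups: list of tuples of logical CPUs that share one physical core.
    """
    # Map each logical CPU to the index of its physical core.
    core_of = {}
    for core_index, group in enumerate(sibling_groups):
        for cpu in group:
            core_of[cpu] = core_index

    collisions = []
    for (name_a, cpu_a), (name_b, cpu_b) in combinations(sorted(placement.items()), 2):
        # Two threads on different hyperthreads of the same core are SMT-stacked.
        if cpu_a != cpu_b and core_of[cpu_a] == core_of[cpu_b]:
            collisions.append((name_a, name_b, core_of[cpu_a]))
    return collisions


# Sibling pairs of the m5 topology described above.
M5_SIBLINGS = [(0, 2), (1, 3)]

# Observed tier-only placement: both inference threads on CPUs 0 and 2.
stacked = find_smt_collisions({"infer-0": 0, "infer-1": 2}, M5_SIBLINGS)
# A spread placement across two physical cores, for contrast.
spread = find_smt_collisions({"infer-0": 0, "infer-1": 1}, M5_SIBLINGS)

print(stacked)
print(spread)
```

On Linux, the sibling groups themselves can be read from `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list` rather than hard-coded.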
A workload profile groups a pod's threads by name pattern into separate scheduler layers with distinct treatment — exclusive cores for compute threads, latency treatment for wake chains, low weight for housekeeping — while the pod's QoS tier continues to govern its standing against other pods. The mechanics are described in the workload-profiles documentation; the remainder of this article presents the measured results.
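As an illustration of grouping by name pattern, the sketch below assigns each thread to the first group whose glob pattern matches its name. It is a hedged example only: the rule list, group names, patterns, and thread names are hypothetical and do not reflect the actual profile schema, which is defined in the workload-profiles documentation.

```python
from fnmatch import fnmatchcase

# Hypothetical profile: ordered (group, glob pattern, treatment) rules.
# The first matching rule wins; the final "*" rule is a catch-all.
PROFILE_RULES = [
    ("compute", "infer-*", "exclusive-cores"),
    ("wake-chain", "rpc-*", "latency"),
    ("housekeeping", "*", "low-weight"),
]


def group_threads(thread_names, rules):
    """Assign each thread name to the first group whose pattern matches it."""
    groups = {group: [] for group, _, _ in rules}
    for thread in thread_names:
        for group, pattern, _ in rules:
            if fnmatchcase(thread, pattern):
                groups[group].append(thread)
                break
    return groups


# Illustrative thread names, not taken from any measured workload.
groups = group_threads(
    ["infer-0", "infer-1", "rpc-accept", "logger", "gc"], PROFILE_RULES
)
print(groups)
```

Ordering the rules from most to least specific keeps the catch-all from absorbing threads that a narrower pattern should claim.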
ONNX inference (ResNet-50)
ONNX Runtime CPU inference (ResNet-50, batch 1, 3 intra-op threads, Guaranteed 3 CPU, Critical tier) on a 4-core c2-standard-8 was measured in 120 s windows across a background-spinner ladder (bg = 0/1/2/4/8). In the tier-only configuration (no profile), Temper held throughput flat across the ladder (±1%, p99 ~45 ms) but at a large cost in peak throughput: CFS reached 43–44 samples/s on the idle node by spreading the three threads over three physical cores and then degraded by 38% at bg=4, whereas Temper's whole-core tier layer sized itself to two cores and SMT-paired the threads (22.9 vs 43 samples/s at idle). The generic tier configuration cannot express “three whole cores with idle siblings”; that allocation is expressible only as a profile.
| bg spinners | CFS samples/s (p99) | Temper + exclusive profile samples/s (p99) |
|---|---|---|
| 0 | 44.45 (26.2 ms) | 44.46 (23.9 ms) |
| 1 | 44.27 (27.9 ms) | 44.60 (22.8 ms) |
| 2 | 28.15 (38.7 ms) | 44.71 (22.7 ms) |
| 4 | 22.65 (45.8 ms) | 44.59 (22.8 ms) |
| 8 | 30.58 (36.0 ms) | 44.61 (22.8 ms) |
With the exclusive-core profile (3 hot threads → 3 whole cores, siblings idle), Temper reaches idle-node parity with CFS (44.5 vs 44.5 samples/s) and holds within ±0.3% under density, while CFS loses up to 49%. The tier-only 22.9-vs-43 gap is attributable entirely to SMT pairing. Single run per arm; c2-standard-8, 2026-07-02. Source: docs/training-artifacts/onnx-inference/REPORT.md
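The two summary figures can be re-derived from the table. The short Python sketch below does so; the helper names are illustrative, and "variation" is assumed here to mean half the min-to-max range relative to its midpoint, which the report may define differently.

```python
# Throughput (samples/s) per background-spinner step, copied from the table.
BG_STEPS = [0, 1, 2, 4, 8]
CFS_SPS = [44.45, 44.27, 28.15, 22.65, 30.58]
TEMPER_SPS = [44.46, 44.60, 44.71, 44.59, 44.61]


def worst_loss_pct(series):
    """Largest throughput drop relative to the idle (bg=0) point, in percent."""
    idle = series[0]
    return max((idle - value) / idle * 100 for value in series)


def half_range_pct(series):
    """Half of the min-to-max spread, relative to the midpoint, in percent."""
    lo, hi = min(series), max(series)
    return (hi - lo) / 2 / ((hi + lo) / 2) * 100


cfs_loss = worst_loss_pct(CFS_SPS)
temper_variation = half_range_pct(TEMPER_SPS)

print(f"CFS worst-case loss: {cfs_loss:.1f}%")
print(f"Temper variation: +/-{temper_variation:.1f}%")
```

The CFS worst case falls at bg=4 (22.65 against 44.45 samples/s at idle), which is where the 49% figure comes from.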
The same report contains a negative result on the 2-core c2-standard-4: no profile can allocate a third core for three hot threads on a node with two. Strict whole-core confinement of the three threads into a single physical core reduced throughput by 62% at bg=4; the configuration generator therefore demotes a Confined layer whose minimum whole-core demand exceeds the node budget to Grouped, which bounds the worst case at −27% with a monotonic decline — bounded degradation, not parity. The resulting sizing rule: critical pods requesting more than 2 CPUs should not be placed on 2-core nodes.
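The demotion rule can be sketched as a single comparison. This is an illustrative Python reading of the rule as stated above, not the configuration generator's code: the function and parameter names are hypothetical, and the node budget is assumed to equal the node's physical core count.

```python
def resolve_layer_kind(min_whole_cores, node_core_budget):
    """Choose the layer kind for a group that asks for exclusive whole cores.

    min_whole_cores: physical cores the group needs at minimum (hypothetical name).
    node_core_budget: physical cores the node can offer (hypothetical name).
    """
    if min_whole_cores > node_core_budget:
        # Demand cannot be met: fall back to Grouped, which bounds the
        # degradation instead of stacking threads inside too few cores.
        return "Grouped"
    return "Confined"


# Three hot threads, one whole core each, on the two node shapes discussed above.
on_2_core_node = resolve_layer_kind(min_whole_cores=3, node_core_budget=2)
on_4_core_node = resolve_layer_kind(min_whole_cores=3, node_core_budget=4)

print(on_2_core_node, on_4_core_node)
```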
llama.cpp inference
On a 4-core m5.2xlarge with no profile, the tier-only configuration is sufficient: Temper medians hold flat at ~1.98–2.02 s across the ladder while CFS degrades +46% under background load, so Temper measures −9% at bg=4 and −21% at bg=8 relative to CFS. Whole-core allocation spreads hot threads across cores when enough cores exist; SMT stacking is a property of 2-core geometry, not of the allocator in general. A residual ~12% idle-node confinement gap remains on that shape without a profile, relative to CFS running unconstrained on the whole node.
On c2-standard-8, the exclusive-core profile removes the residual gap: idle-node
parity (1512 vs 1514 ms median), and under density a worst-case drift of
+3.7% while CFS degrades +27% — at bg=8 Temper measures −18% median and
−24% p90 relative to CFS. One profile-sizing detail is relevant to reproduction: the
group's cpu_fraction is 0.85 rather than 0.9, because the pod's aggregate
request (2250m) multiplied by 0.9 rounds up to three hot threads — three exclusive
cores for a two-thread server. The committed profile TOML documents the arithmetic.
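The sizing arithmetic can be reproduced in a few lines. The sketch below assumes the hot-thread count is the ceiling of the pod request, in whole CPUs, multiplied by cpu_fraction; that matches the "rounds up" description above, but the function name and the exact rounding rule are assumptions, and the committed profile TOML remains the authoritative record.

```python
import math


def hot_thread_count(request_millicores, cpu_fraction):
    """Whole exclusive cores implied by a request and a cpu_fraction (assumed rule)."""
    return math.ceil(request_millicores * cpu_fraction / 1000)


# Pod aggregate request from the text: 2250m.
with_090 = hot_thread_count(2250, 0.90)  # 2.025 CPUs rounds up to 3
with_085 = hot_thread_count(2250, 0.85)  # 1.9125 CPUs rounds up to 2

print(with_090, with_085)
```

Any fraction that keeps the product at or below 2.0 CPUs would yield two cores; 0.85 is the value the committed profile uses.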
MySQL (sysbench OLTP)
The MySQL arm (sysbench oltp_read_write, 8 threads, 30 s windows;
mysql:8.0, Guaranteed 1500m requests=limits, Critical tier; quota-limited — the
bandwidth-throttling mechanism is analyzed in
the CFS bandwidth control article) ran with the builtin
mysql-innodb profile active: connection threads in a Confined exclusive layer, InnoDB
internals (ib_*) in a weight-boosted Open layer, and the remainder in a
low-weight Open layer. Result: p99 of 15.6–16.7 ms, flat through bg=8, against
CFS's 60.0–65.7 ms, with ~3.3 background cores reclaimed at 0.99 node
utilization.
Profile specifications are validated against the scheduler's supported thread-matcher set at load time; a profile that fails validation is reported as a load failure rather than leaving the node silently running the default scheduler. One caveat attaches to this record: the run has no tier-only comparison arm, so the profile's marginal contribution over plain tiers is not isolated here.
Cassandra (tier-only baseline)
Cassandra (cassandra:4.1, 2 GB fixed heap, Guaranteed 3000m requests=limits,
Critical tier; cassandra-stress mixed 1:3 write/read, 16 threads, 30 s windows) ran
tier-only — no profile exists for it — and measured −78% p99 on
the idle node (2.6 vs 11.8 ms; a re-check bounds the idle-point delta at
−69…−78%), with every density step below CFS
(−13…−62%). The largest relative improvement occurs at idle, where no
neighbors exist and the latency driver is the pod's own internal thread contention: 84 JVM
threads (request handlers against GC and compaction) burst past the 3-core quota into
refill freezes under CFS — an idle-node A/B measured nr_throttled +228
and 7.55 s throttled over a 20 s window. This is intra-pod structure of
exactly the kind a profile expresses (GC and compaction threads to batch treatment, request
threads to latency treatment); a Cassandra/JVM profile is the next increment, and this
tier-only run is its baseline.
Limitations
- All runs above are single-run-per-arm measurements on lab node shapes. The ONNX and llama.cpp profile results replicate across two node shapes each; the MySQL and Cassandra results are single-shape.
- The MySQL run lacks a tier-only arm, so the profile's marginal contribution over tier QoS alone is not isolated in that record. Tier QoS alone carries several of the platform's headline results (memcached, for example, has never run with a profile); profiles are a second stage that closes peak-throughput gaps created by whole-core tier confinement on small shapes and encodes intra-pod structure that container-level averages cannot.
- The quota-limited MySQL and Cassandra comparisons carry the CPU-limit enforcement caveat analyzed in the CFS bandwidth control article: cpu.max is not enforced for sched_ext-class tasks on the tested kernel, so the two arms did not enforce identical CPU ceilings.
- Profiles are measured artifacts generated by the training pipeline (observe → analyze → synthesize → refine), not hand-tuned configurations. The committed pipeline cycle record (phase4) is a mixed result: the synthesized PyTorch profile trailed CFS on that ladder and one refinement step regressed, which is why refinement retains only measured improvements.
- Sizing constraint from the 2-core negative result: critical pods requesting more than 2 CPUs should not be placed on 2-core nodes; the Grouped demotion bounds the degradation when this guidance is violated but cannot restore parity.
Raw records
- docs/training-artifacts/onnx-inference/REPORT.md (+ onnx-inference.toml)
- docs/training-artifacts/llm-inference/FINDINGS.md
- docs/training-artifacts/llm-inference/smt-fix/FINDINGS.md (+ profile TOMLs)
- docs/training-artifacts/mysql-oltp/REPORT.md
- docs/training-artifacts/cassandra/REPORT.md
- docs/training-artifacts/phase4/REPORT.md (training-mode cycle, mixed result)
Committed benchmark records in the product repository; design partners get the full artifact tree.