05 Deep dive

Thread-level scheduling with workload profiles

Container-level CPU metrics average over threads with heterogeneous scheduling requirements and expose neither thread placement nor SMT topology. We evaluate workload profiles — declarative thread-group scheduling policies applied beneath pod QoS tiers — on ONNX inference, llama.cpp, MySQL, and Cassandra across GKE and EKS node shapes. Exclusive-core profiles reach idle-node throughput parity with CFS while holding throughput and tail latency flat under background load; a tier-only Cassandra run measures −78% idle-node p99 and motivates a JVM profile.

Mechanism: whole-core allocation and SMT geometry

scx_layered allocates CPU to Confined layers in whole physical cores. On a 2-core node, a 2–3 vCPU Confined layer occupies exactly one core — both of its hot threads therefore run on the two hyperthreads of a single physical core, while CFS spreads the same threads across cores. This condition is invisible to pod-level aggregates: the affected pod reads as adequately provisioned and lightly utilized.

The condition is directly measurable at thread granularity. llama.cpp inference (Qwen2.5-0.5B Q4, 2 threads, Guaranteed 2 CPU, Critical tier) on a 2-core EKS m5.xlarge ran ~20–25% slower at median under a tier-only Temper configuration than under CFS at every background level, including on an idle node. The layer's cpus_used read 1.23, ruling out CPU-quantity starvation. Per-thread placement from the /observe endpoint showed both inference threads on CPUs 0 and 2 — on the m5 topology (sibling pairs (0,2)/(1,3)), the two hyperthreads of one physical core. A workload profile declaring the hot threads an exclusive-core group recovered ~60% of the idle-node gap (−12% median). The placement linter's smt_collision invariant evaluates pinned tier layers as well as profile thread-group layers, so sibling stacking on unprofiled workloads is surfaced as a lint violation.
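The sibling-stacking check described above can be sketched in a few lines. This is a minimal illustration, not the placement linter's implementation: the function names and thread names are hypothetical, and the topology is hard-coded to the m5 sibling pairs quoted in the text. On a live Linux node the sibling lists could instead be read from /sys/devices/system/cpu/cpuN/topology/thread_siblings_list, whose list form uses the comma and range syntax parsed below.

```python
from collections import defaultdict


def parse_cpu_list(text):
    """Parse a Linux CPU list such as "0,2" or "0-1,4" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def core_of(cpu, sibling_groups):
    """Return the index of the sibling group (physical core) containing cpu."""
    for idx, group in enumerate(sibling_groups):
        if cpu in group:
            return idx
    raise ValueError(f"CPU {cpu} not in topology")


def smt_collisions(thread_cpus, sibling_groups):
    """Map core index -> thread names for cores hosting more than one hot thread."""
    by_core = defaultdict(list)
    for name, cpu in thread_cpus.items():
        by_core[core_of(cpu, sibling_groups)].append(name)
    return {core: sorted(names) for core, names in by_core.items() if len(names) > 1}


# m5 topology from the text: sibling pairs (0,2) and (1,3).
M5_SIBLINGS = [parse_cpu_list("0,2"), parse_cpu_list("1,3")]

# Hypothetical thread names; CPUs 0 and 2 are the placement reported above.
stacked = {"infer-0": 0, "infer-1": 2}
spread = {"infer-0": 0, "infer-1": 1}

print(smt_collisions(stacked, M5_SIBLINGS))  # {0: ['infer-0', 'infer-1']}
print(smt_collisions(spread, M5_SIBLINGS))   # {}
```

The first placement is flagged because CPUs 0 and 2 are hyperthreads of the same physical core; the second is clean because CPUs 0 and 1 sit on different cores.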

A workload profile groups a pod's threads by name pattern into separate scheduler layers with distinct treatment — exclusive cores for compute threads, latency treatment for wake chains, low weight for housekeeping — while the pod's QoS tier continues to govern its standing against other pods. The mechanics are described in the workload-profiles documentation; the remainder of this article presents the measured results.

ONNX inference (ResNet-50)

ONNX Runtime CPU inference (ResNet-50, batch 1, 3 intra-op threads, Guaranteed 3 CPU, Critical tier) on a 4-core c2-standard-8 was measured in 120 s windows across a background-spinner ladder (bg = 0/1/2/4/8). In the tier-only configuration (no profile), Temper held throughput flat across the ladder (±1%, p99 ~45 ms), but at a large cost in peak throughput. CFS reached 43–44 samples/s on the idle node by spreading the three threads over three physical cores, then lost 38% by bg=4. Temper's whole-core tier layer sized itself to two cores and SMT-paired the threads, reaching 22.9 samples/s at idle against CFS's 43. The generic tier configuration cannot express “three whole cores with idle siblings”; that allocation is expressible only as a profile.

| bg spinners | CFS, samples/s (p99) | Temper + exclusive profile, samples/s (p99) |
|---|---|---|
| 0 | 44.45 (26.2 ms) | 44.46 (23.9 ms) |
| 1 | 44.27 (27.9 ms) | 44.60 (22.8 ms) |
| 2 | 28.15 (38.7 ms) | 44.71 (22.7 ms) |
| 4 | 22.65 (45.8 ms) | 44.59 (22.8 ms) |
| 8 | 30.58 (36.0 ms) | 44.61 (22.8 ms) |

With the exclusive-core profile (3 hot threads → 3 whole cores, siblings idle), Temper reaches idle-node parity with CFS (44.5 vs 44.5 samples/s) and varies by ±0.3% under density, while CFS loses up to 49%. The tier-only 22.9-vs-43 gap is attributable entirely to SMT pairing. Single run per arm; c2-standard-8, 2026-07-02. Source: docs/training-artifacts/onnx-inference/REPORT.md
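The two summary figures can be recomputed from the table. The sketch below takes the CFS loss as the largest drop relative to the bg=0 row, and the Temper variation as half the min-to-max spread relative to the midpoint. The second definition is an assumption about how the ±0.3% was derived; the report may define it differently.

```python
# (bg spinners, CFS samples/s, Temper + exclusive profile samples/s), from the table.
ROWS = [
    (0, 44.45, 44.46),
    (1, 44.27, 44.60),
    (2, 28.15, 44.71),
    (4, 22.65, 44.59),
    (8, 30.58, 44.61),
]


def worst_loss_vs_idle(values):
    """Largest throughput drop relative to the first (bg=0) entry, in percent."""
    idle = values[0]
    return max((idle - v) / idle * 100 for v in values)


def half_range_pct(values):
    """Half the min-to-max spread, relative to the midpoint, in percent."""
    lo, hi = min(values), max(values)
    return (hi - lo) / (hi + lo) * 100


cfs = [row[1] for row in ROWS]
temper = [row[2] for row in ROWS]

print(f"CFS worst loss: {worst_loss_vs_idle(cfs):.1f}%")     # CFS worst loss: 49.0%
print(f"Temper spread: +/-{half_range_pct(temper):.2f}%")    # Temper spread: +/-0.28%
```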

The same report contains a negative result on the 2-core c2-standard-4: no profile can allocate a third core for three hot threads on a node with two. Strict whole-core confinement of the three threads into a single physical core reduced throughput by 62% at bg=4; the configuration generator therefore demotes a Confined layer whose minimum whole-core demand exceeds the node budget to Grouped, which bounds the worst case at −27% with a monotonic decline — bounded degradation, not parity. The resulting sizing rule: critical pods requesting more than 2 CPUs should not be placed on 2-core nodes.
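The demotion and sizing rules above reduce to two comparisons. The following is a minimal sketch, not the configuration generator's code: the function names are hypothetical, the demand model assumes one whole core per hot thread, and the node budget is assumed to be the full physical-core count.

```python
def place_layer(hot_threads, node_physical_cores):
    """Choose a layer kind for a group that wants one whole core per hot thread.

    Assumption: the node budget equals the number of physical cores.
    """
    demand = hot_threads  # one whole core per hot thread, siblings left idle
    if demand > node_physical_cores:
        return "Grouped"  # demoted: bounded degradation instead of core stacking
    return "Confined"


def fits_sizing_rule(cpu_request_millicores, node_physical_cores):
    """Sizing rule from the text: critical pods above 2 CPUs avoid 2-core nodes."""
    return not (node_physical_cores == 2 and cpu_request_millicores > 2000)


print(place_layer(3, 2))          # Grouped  (c2-standard-4: three threads, two cores)
print(place_layer(3, 4))          # Confined (c2-standard-8: three threads, four cores)
print(fits_sizing_rule(3000, 2))  # False
print(fits_sizing_rule(3000, 4))  # True
```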

llama.cpp inference

On a 4-core m5.2xlarge with no profile, the tier-only configuration is sufficient under load: Temper medians hold flat at ~1.98–2.02 s across the ladder while CFS degrades +46% under background load; relative to CFS, Temper measures −9% at bg=4 and −21% at bg=8. Whole-core allocation spreads hot threads across cores when enough cores exist, so SMT stacking is a property of 2-core geometry, not of the allocator in general. A residual ~12% idle-node confinement gap remains on that shape without a profile, relative to CFS running unconstrained on the whole node.

On c2-standard-8, the exclusive-core profile removes the residual gap: idle-node parity (1512 vs 1514 ms median), and under density a worst-case drift of +3.7% while CFS degrades +27% — at bg=8 Temper measures −18% median and −24% p90 relative to CFS. One profile-sizing detail is relevant to reproduction: the group's cpu_fraction is 0.85 rather than 0.9, because the pod's aggregate request (2250m) multiplied by 0.9 rounds up to three hot threads — three exclusive cores for a two-thread server. The committed profile TOML documents the arithmetic.
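The rounding behind the 0.85 choice can be checked directly. The sketch below assumes the hot-thread count is the ceiling of the request in CPUs multiplied by cpu_fraction, which is what the sentence above implies; the function name is hypothetical, and the committed profile TOML remains the authoritative statement of the arithmetic.

```python
import math


def hot_thread_count(request_millicores, cpu_fraction):
    """Assumed sizing: round the fractional CPU share up to whole hot threads."""
    return math.ceil(request_millicores * cpu_fraction / 1000)


# 2250m * 0.9  = 2.025 CPUs  -> rounds up to 3 exclusive cores
# 2250m * 0.85 = 1.9125 CPUs -> rounds up to 2 exclusive cores
print(hot_thread_count(2250, 0.9))   # 3
print(hot_thread_count(2250, 0.85))  # 2
```

Under this assumption, any cpu_fraction at or below 2000/2250 ≈ 0.889 keeps the group at two cores for this request.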

MySQL (sysbench OLTP)

The MySQL arm (sysbench oltp_read_write, 8 threads, 30 s windows; mysql:8.0, Guaranteed 1500m requests=limits, Critical tier; quota-limited — the bandwidth-throttling mechanism is analyzed in the CFS bandwidth control article) ran with the builtin mysql-innodb profile active: connection threads in a Confined exclusive layer, InnoDB internals (ib_*) in a weight-boosted Open layer, and the remainder in a low-weight Open layer. Result: p99 of 15.6–16.7 ms, flat through bg=8, against CFS's 60.0–65.7 ms, with ~3.3 background cores reclaimed at 0.99 node utilization.
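The three-way split can be pictured as ordered name-pattern matching where the first match wins. This is an illustrative sketch only: the ib_* pattern comes from the text, while the "connection" pattern, the layer labels, and the sample thread names are hypothetical placeholders rather than the contents of the builtin mysql-innodb profile.

```python
from fnmatch import fnmatchcase

# Ordered rules, first match wins. Only "ib_*" is taken from the text;
# the other pattern and all layer labels are hypothetical.
RULES = [
    ("connection", "confined-exclusive"),  # connection threads
    ("ib_*", "open-boosted"),              # InnoDB internals
    ("*", "open-low-weight"),              # everything else
]


def classify(thread_name, rules=RULES):
    """Return the layer label of the first rule whose glob matches the name."""
    for pattern, layer in rules:
        if fnmatchcase(thread_name, pattern):
            return layer
    raise ValueError(f"no rule matches {thread_name!r}")


# Illustrative thread names, not captured from the benchmark run.
for name in ["connection", "ib_log_flush", "sig_handler"]:
    print(name, "->", classify(name))
```

Keeping the catch-all pattern last means every thread lands in some layer, so an unmatched thread never falls outside the profile.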

Profile specifications are validated against the scheduler's supported thread-matcher set at load time; a profile that fails validation is reported as a load failure rather than leaving the node silently running the default scheduler. One caveat attaches to this record: the run has no tier-only comparison arm, so the profile's marginal contribution over plain tiers is not isolated here.

Cassandra (tier-only baseline)

Cassandra (cassandra:4.1, 2 GB fixed heap, Guaranteed 3000m requests=limits, Critical tier; cassandra-stress mixed 1:3 write/read, 16 threads, 30 s windows) ran tier-only — no profile exists for it — and measured −78% p99 on the idle node (2.6 vs 11.8 ms; a re-check bounds the idle-point delta at −69…−78%), with every density step below CFS (−13…−62%). The largest relative improvement occurs at idle, where no neighbors exist and the latency driver is the pod's own internal thread contention: 84 JVM threads (request handlers against GC and compaction) burst past the 3-core quota into refill freezes under CFS — an idle-node A/B measured nr_throttled +228 and 7.55 s throttled over a 20 s window. This is intra-pod structure of exactly the kind a profile expresses (GC and compaction threads to batch treatment, request threads to latency treatment); a Cassandra/JVM profile is the next increment, and this tier-only run is its baseline.
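Throttling deltas of this kind can be taken by diffing two snapshots of the pod cgroup's cpu.stat file. The sketch below assumes cgroup v2, where that file exposes nr_throttled and throttled_usec; the file's location depends on the node's cgroup layout and is not shown. The snapshot values are made up for illustration and chosen only so that the deltas match the figures quoted above.

```python
def parse_cpu_stat(text):
    """Parse cgroup v2 cpu.stat content ("key value" per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_delta(before, after):
    """Difference in throttle counters between two cpu.stat snapshots."""
    b, a = parse_cpu_stat(before), parse_cpu_stat(after)
    return {
        "nr_throttled": a["nr_throttled"] - b["nr_throttled"],
        "throttled_s": (a["throttled_usec"] - b["throttled_usec"]) / 1_000_000,
    }


# Illustrative snapshots (trimmed to the two relevant fields), not recorded data.
BEFORE = """nr_throttled 100
throttled_usec 5000000
"""
AFTER = """nr_throttled 328
throttled_usec 12550000
"""

print(throttle_delta(BEFORE, AFTER))  # {'nr_throttled': 228, 'throttled_s': 7.55}
```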

Limitations

Raw records

  • docs/training-artifacts/onnx-inference/REPORT.md (+ onnx-inference.toml)
  • docs/training-artifacts/llm-inference/FINDINGS.md
  • docs/training-artifacts/llm-inference/smt-fix/FINDINGS.md (+ profile TOMLs)
  • docs/training-artifacts/mysql-oltp/REPORT.md
  • docs/training-artifacts/cassandra/REPORT.md
  • docs/training-artifacts/phase4/REPORT.md (training-mode cycle, mixed result)

These are committed benchmark records in the product repository; design partners get the full artifact tree.