04 Deep dive

CPU-side scheduling and accelerator utilization

Kubernetes assigns a GPU to a pod exclusively, but the CPU threads that feed it are scheduled under ordinary CFS weights alongside every other pod on the node. We measure the effect of CPU-side contention on accelerator utilization using ResNet-18 training on an NVIDIA L4 (GKE g2-standard-8) and vLLM serving on the same hardware. Under CFS, 16 batch neighbors reduced training throughput by 25% and collapsed GPU utilization; under kernel-enforced tiers (the Temper arm), throughput and GPU utilization held flat. Small-model vLLM serving showed parity between the two arms; the workload is GPU-bound and the result is reported as a negative finding.

Background: scheduling asymmetry on GPU nodes

Kubernetes treats a GPU as an indivisible, exclusively assigned resource: one pod owns nvidia.com/gpu: 1 and no other pod can use it. The CPU side of the same training pod — the main loop, the DataLoader workers performing decode and augmentation, the copy threads — receives no such treatment. It competes under ordinary CFS weights with every other pod on the node. When batch neighbors outweigh the trainer, the feeder threads stall, the input pipeline drains, and the GPU idles while it remains reserved and billed.

The precondition is stated up front: this starvation mode exists when the trainer's CPU demand exceeds its request. That configuration is common in practice — DataLoader-heavy training pods are frequently provisioned with small CPU requests next to expensive GPUs — but it is a condition, not a universal property.

Results: ResNet-18 training on NVIDIA L4

ResNet-18 on NVIDIA L4, batch neighbors: Temper flat · CFS −25%

Burstable trainer (2-CPU request, no CPU limit, ~7 vCPU demand, 6 DataLoader workers) vs. batch-spinner ladder on one g2-standard-8. 2026-07-01.

[Chart: trainer throughput in samples/s (axis 0–600) vs. batch neighbors on the GPU node (0, 4, 8, 16), arms CFS and Temper. At 16 neighbors: CFS 471 (−25%), GPU util collapsing; Temper 637, flat, GPU ~85% steady.]

At 16 neighbors CFS lost 25% of trainer throughput (629→471 samples/s) and the L4’s utilization collapsed into a 0–81% band (mean ~40%); under Temper throughput held at 636–642 samples/s and GPU utilization at ~85% at every step. Isolated nvidia-smi zero-samples are sampling artifacts (present at idle too); the starvation signal is the sustained 12–70% band, not single zeros. Single run per arm.

source: docs/training-artifacts/gpu-wedge/REPORT.md · GKE g2-standard-8 + 1× NVIDIA L4
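The distinction between isolated zero samples and a sustained low band can be sketched in a few lines. This is a minimal illustration, not the analysis used in the report: the function names, the sample readings, and the 75% threshold are all hypothetical, and the input is assumed to be a plain list of utilization percentages already parsed from nvidia-smi output.

```python
def drop_isolated_zeros(samples):
    """Remove zero readings whose immediate neighbors are both non-zero."""
    kept = []
    for i, value in enumerate(samples):
        prev_nonzero = i > 0 and samples[i - 1] > 0
        next_nonzero = i < len(samples) - 1 and samples[i + 1] > 0
        if value == 0 and prev_nonzero and next_nonzero:
            continue  # lone zero between two busy samples: treated as an artifact
        kept.append(value)
    return kept


def longest_low_run(samples, threshold):
    """Length of the longest consecutive run of samples below threshold."""
    longest = current = 0
    for value in samples:
        current = current + 1 if value < threshold else 0
        longest = max(longest, current)
    return longest


# Hypothetical readings (percent GPU utilization), not measured data.
steady = [85, 84, 0, 86, 85, 83]
starved = [70, 45, 12, 30, 0, 0, 55, 81, 20]
LOW_THRESHOLD = 75  # hypothetical cut-off for "below steady state"

steady_run = longest_low_run(drop_isolated_zeros(steady), LOW_THRESHOLD)
starved_run = longest_low_run(drop_isolated_zeros(starved), LOW_THRESHOLD)
print("steady:", steady_run, "starved:", starved_run)
```

In the steady trace the single zero is discarded and no low run remains; in the starved trace the zeros sit inside a run of low readings, so they are kept and count toward the sustained band.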

A control configuration bounds the effect. With the trainer Guaranteed and demand that fit inside its 4-CPU request, both arms were flat (~616–643 samples/s, GPU ~80–85%): kubelet's QoS weighting fully defends a Guaranteed pod whose demand fits its request, and Temper added only +3–4% at high density. The control run is retained in the report because it defines the boundary condition: the starvation effect appears when demand exceeds request (measured configuration: 2-CPU request, ~7 vCPU demand), the point at which proportional weights favor the aggregate batch neighbors over the trainer. In utilization terms, at 16 neighbors CFS forfeits approximately a quarter of the accelerator's throughput; with enforcement the same node carries the batch overflow with no trainer degradation.
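The boundary condition can be illustrated with an idealized proportional-share model. The sketch below assumes kubelet's cgroup v1 mapping from a CPU request to cpu.shares (millicores × 1024 / 1000; on cgroup v2 the value is converted to cpu.weight) and a weighted max-min allocation over a saturated 8-vCPU node. The neighbor request (200m), the neighbor demand (1 CPU each), and the control trainer's demand (3.5 CPUs) are hypothetical values chosen for illustration, not figures from the report, and the model says nothing about how lost CPU translates into lost samples/s.

```python
def cpu_shares(request_millicores):
    """cgroup v1 cpu.shares derived from a CPU request (minimum 2)."""
    return max(2, request_millicores * 1024 // 1000)


def fair_allocation(pods, node_cpus):
    """Weighted max-min allocation.

    pods maps name -> (weight, demand_in_cpus). Each pod receives CPU in
    proportion to its weight, capped at its demand; capacity left over by
    capped pods is redistributed among the remaining pods.
    """
    alloc = {}
    active = dict(pods)
    remaining = float(node_cpus)
    while active:
        total_weight = sum(w for w, _ in active.values())
        # Pods whose demand fits inside their proportional share are satisfied.
        satisfied = {
            name: demand
            for name, (w, demand) in active.items()
            if demand <= remaining * w / total_weight
        }
        if not satisfied:
            # Every remaining pod is contended: split what is left by weight.
            for name, (w, _) in active.items():
                alloc[name] = remaining * w / total_weight
            break
        for name, demand in satisfied.items():
            alloc[name] = demand
            remaining -= demand
            del active[name]
    return alloc


def build_node(trainer_request_m, trainer_demand, neighbors=16,
               neighbor_request_m=200, neighbor_demand=1.0):
    """Hypothetical node: one trainer plus identical batch neighbors."""
    pods = {"trainer": (cpu_shares(trainer_request_m), trainer_demand)}
    for i in range(neighbors):
        pods[f"batch-{i}"] = (cpu_shares(neighbor_request_m), neighbor_demand)
    return pods


NODE_CPUS = 8  # g2-standard-8 vCPU count; system reservations are ignored

# Measured shape: 2-CPU request, ~7 vCPU demand.
burstable = fair_allocation(build_node(2000, 7.0), NODE_CPUS)
# Control shape: 4-CPU request, demand assumed to be 3.5 CPUs.
guaranteed = fair_allocation(build_node(4000, 3.5), NODE_CPUS)

print(f"burstable trainer:  {burstable['trainer']:.2f} of 7.0 CPUs demanded")
print(f"guaranteed trainer: {guaranteed['trainer']:.2f} of 3.5 CPUs demanded")
```

Under these assumptions the 2-CPU trainer is held to about 3.1 CPUs against a demand of 7, while the 4-CPU trainer's demand fits inside its weighted share and is met in full, which is the qualitative split between the starvation run and the control.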

Comparison with quota limits and static CPU pinning

PyTorch training at density 8, CPU-only node: +67% samples/s (Temper vs. CFS default)

Guaranteed 3-CPU trainer next to 8 noisy neighbors, c3-standard-8 (SMT-2). Four arms: CFS default, CFS + quota limits, CFS + static cpuset, Temper. 2026-06-12.

[Chart: trainer throughput in samples/s (axis 0–20) by arm. CFS 14.8 · CFS+quota 16.5 · CFS+cpuset 14.6 · Temper 24.7]

Quota limits partition CPU time manually and land mid-pack (16.5 samples/s). Static CPU pinning, kubelet’s own cpuset manager, was the lowest-performing primary arm: the pin was SMT-blind (three logical CPUs sharing physical cores) and capped the trainer at 14.6–14.8 samples/s even on an idle node. Whole-core, SMT-aware placement accounts for part of the enforcement result. In the same table, Temper’s fence limited background reclaim to ~1.9 cores (CFS delivered 5.0); the reclaim side of that trade-off is measured in the idle-capacity harvesting article.

source: docs/training-artifacts/OVERNIGHT-REPORT.md · docs/training-artifacts/arms/FOUR-ARM-SUMMARY.md
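The difference between an SMT-blind pin and whole-core placement can be sketched as follows. This is an illustration of the general idea only, not Temper's or kubelet's implementation: the function names are hypothetical, the sibling map is an assumed SMT-2 layout in which logical CPU i shares a core with i+4, and rounding a 3-CPU request up to two whole cores is one possible policy chosen for the example.

```python
def physical_cores(siblings):
    """Group logical CPUs into physical cores.

    siblings maps each logical CPU to the tuple of logical CPUs that share
    its physical core (itself included).
    """
    return sorted({tuple(sorted(group)) for group in siblings.values()})


def pick_whole_cores(siblings, logical_cpus_needed):
    """Build a cpuset from whole physical cores until the request is covered."""
    chosen = []
    for core in physical_cores(siblings):
        if len(chosen) >= logical_cpus_needed:
            break
        chosen.extend(core)
    return chosen


def partially_owned_cores(siblings, cpuset):
    """Cores where the cpuset holds some, but not all, sibling threads."""
    owned = set(cpuset)
    return [
        core for core in physical_cores(siblings)
        if owned & set(core) and not set(core) <= owned
    ]


# Assumed SMT-2 topology: 8 logical CPUs on 4 physical cores.
topology = {cpu: (cpu % 4, cpu % 4 + 4) for cpu in range(8)}

smt_blind_pin = [0, 1, 4]                  # three logical CPUs, picked by count
whole_core_pin = pick_whole_cores(topology, 3)

print("SMT-blind pin leaves shared cores:", partially_owned_cores(topology, smt_blind_pin))
print("whole-core pin:", whole_core_pin,
      "shared cores:", partially_owned_cores(topology, whole_core_pin))
```

With the assumed layout, the count-based pin leaves one physical core half-owned, so a neighbor thread can run on its sibling, while the whole-core pin owns both threads of every core it touches.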

Negative result: GPU-bound serving

The serving analogue of the training experiment was run and shows no effect. vLLM serving a small model (Qwen2.5-0.5B) on the same L4 with 8 concurrent request loops was measured in two configurations, a right-sized Guaranteed one and a deliberately starvation-shaped one (demand over request, weighted neighbors), and the CFS and Temper arms were at parity in both. Right-sized, both arms were flat: p99 ~393–411 ms at idle and ~398–404 ms at bg=8, with throughput within a few percent. In the starvation-shaped configuration, CFS showed only mild degradation at bg=4 (+16% vs Temper's +7%) with throughput parity, over single, noisy runs; the bg=0 rows include post-rollout warm-up effects and the record marks them unsettled.

The mechanism is consistent with the training result rather than in tension with it: tokenization and scheduling for a 0.5B model at this concurrency cost a fraction of one core, so the workload is GPU-bound and there is no CPU-side contention for a CPU scheduler to remove. The training result therefore does not generalize to this serving shape. The CPU-side effect applies to DataLoader-heavy training and preprocessing-heavy pipelines (long prompts, large tokenizers, multimodal encode) — configurations in which substantive CPU work sits between storage and the accelerator — and does not apply to workloads whose CPU side is negligible.

Limitations

Raw records

  • docs/training-artifacts/gpu-wedge/REPORT.md
  • docs/training-artifacts/vllm-l4/REPORT.md
  • docs/training-artifacts/OVERNIGHT-REPORT.md
  • docs/training-artifacts/arms/FOUR-ARM-SUMMARY.md
  • docs/training-artifacts/arms/STAGE1-SUMMARY.md
  • docs/training-artifacts/shapes/SUMMARY.md (partial run, shape-comparison caveat)

Committed benchmark records in the product repository; design partners get the full artifact tree.