Deep dives

Technical reports over committed benchmark records.

Seven reports, one per deployment scenario, sharing a single documented method. Each describes the mechanism under test, the experimental setup, the measured results, and the limitations that apply to them. Single-run arms are labeled, negative results are reported, and every figure cites a committed report path in the repository. The benchmark harness ships with the product, so all runs are reproducible on independent clusters.

CFS baseline in every comparison · Kernel-counter verification where applicable · Limitations stated with each result · Raw record paths in every footer

00 · method

Benchmark methodology

The common experimental method behind every report: same-node two-arm protocol (CFS baseline via safe mode vs. attached tiers), the stepped background-load ladder, load-generator validity rules, metric conventions, scheduler attribution, and the committed-record discipline.
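The two-arm ladder comparison can be sketched as follows. This is a minimal illustration, not the shipped harness: the function names, the SLO threshold, and all latency values are made up for the example, and it assumes each arm yields one p99 figure per background-load step.

```python
from typing import Dict, Optional, Sequence


def first_slo_break(ladder: Sequence[int],
                    p99_ms: Sequence[float],
                    slo_ms: float) -> Optional[int]:
    """Return the first background-load step whose p99 exceeds the SLO.

    Returns None if the SLO holds at every step of the ladder.
    """
    for step, p99 in zip(ladder, p99_ms):
        if p99 > slo_ms:
            return step
    return None


def compare_arms(ladder: Sequence[int],
                 baseline_p99_ms: Sequence[float],
                 attached_p99_ms: Sequence[float],
                 slo_ms: float) -> Dict[str, Optional[int]]:
    """Find the SLO break point in each arm over the same ladder."""
    if not (len(ladder) == len(baseline_p99_ms) == len(attached_p99_ms)):
        raise ValueError("both arms need one p99 value per ladder step")
    return {
        "cfs_break": first_slo_break(ladder, baseline_p99_ms, slo_ms),
        "attached_break": first_slo_break(ladder, attached_p99_ms, slo_ms),
    }


# Hypothetical numbers, for illustration only (not from any committed record).
ladder = [0, 2, 4, 8]                    # background pods per step
cfs_p99 = [0.6, 0.9, 1.8, 3.1]           # baseline arm, ms
attached_p99 = [0.6, 0.6, 0.7, 0.7]      # attached arm, ms

result = compare_arms(ladder, cfs_p99, attached_p99, slo_ms=1.0)
print(result)
```

Both arms use the same ladder, so the break points are directly comparable; a `None` means the arm held the SLO through the highest step tested, not that it would hold at any density.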

01 · capacity

Idle-capacity harvesting under tail-latency constraints

Cluster CPU allocation exceeds usage because requests are provisioned for peaks; harvesting the difference under CFS raises tail latency at the wakeup path. The report measures memcached and Redis density ladders in which batch work fills protected nodes while p99 remains flat.

flat @ 0.92 · p99 vs node util (CFS 3.1×)
−88% · p99, heavy point
+72% · pods at equal SLO

Read the deep dive →

02 · databases

CFS bandwidth control and database tail latency

CFS bandwidth control suspends every thread in a cgroup when its quota is exhausted, producing tail-latency cliffs on otherwise idle nodes; measured with kernel throttle counters on PostgreSQL, MySQL, and Cassandra. Includes the cpu.max enforcement semantics under sched_ext and a direct quota-parity measurement.

33–68→4–6 ms · pgbench p99
+199 · throttles/20s under CFS (0 under Temper*)
1.353 < 1.5 · cores consumed vs quota, measured
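The two kernel-side quantities quoted above, throttle events per window and cores consumed against quota, can both be derived from the cgroup v2 files `cpu.stat` and `cpu.max`. The sketch below is not the report's tooling: it parses text snapshots instead of reading `/sys/fs/cgroup`, the helper names are invented, and the sample values are illustrative.

```python
from typing import Dict, Optional


def parse_cpu_stat(text: str) -> Dict[str, int]:
    """Parse cgroup v2 cpu.stat content ("key value" per line)."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def quota_cores(cpu_max_text: str) -> Optional[float]:
    """Convert cpu.max ("<quota> <period>" or "max <period>") to cores.

    Returns None when no quota is set.
    """
    quota, period = cpu_max_text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def window_summary(before: Dict[str, int],
                   after: Dict[str, int],
                   elapsed_s: float) -> Dict[str, float]:
    """Throttle events and average cores consumed between two snapshots."""
    used_usec = after["usage_usec"] - before["usage_usec"]
    return {
        "throttles": after.get("nr_throttled", 0) - before.get("nr_throttled", 0),
        "cores_used": used_usec / (elapsed_s * 1_000_000),
    }


# Illustrative snapshots taken 20 s apart (values are made up).
before = parse_cpu_stat(
    "usage_usec 1000000\nnr_periods 10\nnr_throttled 3\nthrottled_usec 5000\n")
after = parse_cpu_stat(
    "usage_usec 21000000\nnr_periods 210\nnr_throttled 8\nthrottled_usec 90000\n")

summary = window_summary(before, after, elapsed_s=20.0)
limit = quota_cores("150000 100000")
print(summary, limit)
```

On a live node the same functions would be fed the contents of the pod's `cpu.stat` and `cpu.max` at the start and end of the measurement window. Comparing `cores_used` against `limit` is one way to check quota parity without relying on the throttle counter.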

Read the deep dive →

03 · service chains

Tail-latency amplification in microservice chains

A 19-service DeathStarBench application under co-located background load. Per-hop queueing delay compounds across the request chain under CFS; end-to-end p99 growth is measured for both arms at each density step.

9.4×→~1.5× · end-to-end p99 growth
−75/−83% · p99 at bg=2/4, attached
19 · services in the request chain
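One way to see why chain length matters: if each hop independently has a small probability of landing in its own slow tail, the chance that a request meets at least one slow hop grows quickly with the number of hops. The sketch below is a back-of-the-envelope model, not the report's analysis; it assumes every request traverses all hops in sequence and that hop delays are independent, neither of which is stated in the summary above.

```python
def p_any_slow_hop(hops: int, p_slow: float) -> float:
    """Probability that at least one of `hops` independent hops is slow.

    Assumes independence between hops: 1 - (1 - p_slow) ** hops.
    """
    if hops < 0:
        raise ValueError("hops must be non-negative")
    if not 0.0 <= p_slow <= 1.0:
        raise ValueError("p_slow must be a probability")
    return 1.0 - (1.0 - p_slow) ** hops


# A per-hop 1% tail event, evaluated for chains of different lengths.
for hops in (1, 5, 19):
    print(hops, round(p_any_slow_hop(hops, 0.01), 4))
```

Under these assumptions, an event that is a 1-in-100 outlier at a single service becomes common at the request level in a long chain, which is why per-hop wakeup delay shows up so strongly in end-to-end p99.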

Read the deep dive →

04 · accelerators

CPU-side scheduling and accelerator utilization

Descheduled data-loading threads stall GPU pipelines, idling the accelerator. Measurements cover PyTorch training on NVIDIA L4 under co-location density, a CPU-training comparison against standard Kubernetes remedies, and a negative result on GPU-bound serving.

−25% vs flat · CFS vs Temper on L4 at density
+67% · samples/s at density 8
parity · small-model vLLM (negative result)

Read the deep dive →

05 · inside the pod

Thread-level scheduling with workload profiles

Container-level metrics aggregate over threads with heterogeneous roles. Workload profiles assign per-thread-group scheduling policy; measurements cover ONNX and llama.cpp under SMT contention and a MySQL profile validation.

44.5 = 44.5 · ONNX peak throughput parity, flat under density
+3.7% vs +27% · llama drift under density
−78% · Cassandra p99 at idle, tier-only

Read the deep dive →

06 · operations

Failure modes and rollback behavior

Failure behavior of the sched_ext attachment: the kernel fallback contract, measured agent-kill failover, an 8-hour soak, reconfiguration cost, watchdog-initiated fallback to CFS, and annotation-based fleet-wide disable.

0.61/0.64/0.61 · p99 ms across an agent kill
8 h clean · soak: flat memory, no drift
~52 ms · CFS gap per reconfig

Read the deep dive →

07 · cost

Automated node consolidation under an SLO guard

A consolidation controller executes an automated 3→2 node reduction on a live GKE cluster: plan-hash approval, tier-ordered drain honoring PodDisruptionBudgets under a continuous SLO guard, with the node reclaimed by the cluster autoscaler. A guard-triggered abort run is also reported.

3→2 nodes · automated end to end, p99 flat
13 m 17 s · approval → node deleted by CA
$97.82/mo · one e2-standard-4 returned, at list
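Plan-hash approval, in general terms, binds an approval to the exact content of a plan so that a plan modified after review cannot be executed under the old approval. The sketch below shows the generic technique only. The plan fields, tier names, and function names are hypothetical and are not taken from the controller described above.

```python
import hashlib
import json


def plan_hash(plan: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the plan.

    Sorting keys makes the hash independent of dict insertion order.
    """
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def may_execute(plan: dict, approved_hash: str) -> bool:
    """Allow execution only if the plan still matches what was approved."""
    return plan_hash(plan) == approved_hash


# Hypothetical plan: field names and tier labels are illustrative only.
plan = {
    "remove_node": "node-c",
    "drain_order": ["batch", "standard", "protected"],
}
approved = plan_hash(plan)           # the operator approves this exact hash

same_plan_reordered = {
    "drain_order": ["batch", "standard", "protected"],
    "remove_node": "node-c",
}
tampered = dict(plan, remove_node="node-a")

print(may_execute(same_plan_reordered, approved))
print(may_execute(tampered, approved))
```

Hashing a canonical encoding means key order does not change the hash, while any change to the target node or drain order invalidates the approval.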

Read the deep dive →

Methodology, in one paragraph: each comparison runs the same workload, same nodes, same load generator in two arms: stock Kubernetes on CFS (obtained by putting Temper’s nodes in safe mode, so hardware and noise are held constant), then Temper attached. Density tests step a background-workload ladder and record where the primary’s SLO breaks in each arm. Where the mechanism can be verified with kernel counters instead of inferred from latency, it is.

* Under an attached sched_ext scheduler, the cgroup throttle counter does not advance; enforcement semantics are analyzed in the CFS bandwidth control article.