00 Method
Benchmark methodology
Every technical report on this site uses a common experimental method. This page defines it once: the two-arm same-node comparison protocol, the stepped background-load ladder, the validity rules applied to load generators, the metric and aggregation conventions, how measurement windows are attributed to a scheduler, and the record discipline that makes each figure traceable to a committed artifact. Individual reports state only their deviations from this method.
Environments
Experiments run on live managed-Kubernetes clusters, not synthetic single-machine rigs:
GKE Standard node pools on Container-Optimized OS (kernel 6.12 with
sched_ext), and EKS on AL2023 and Bottlerocket. Node shapes are stated per
report (e.g. e2-standard-4: 4 vCPU as 2 SMT-2 physical cores; c2-standard-4/8;
m5.xlarge). System daemons, kubelet, and managed-platform agents remain in place;
reports do not exclude their interference. Where a result is shape-sensitive (for
example, whole-core allocation on 2-core nodes), the report identifies the boundary
condition explicitly.
Two-arm protocol
Comparisons are same-node A/B, not cluster-vs-cluster. Both arms run the identical manifests, on the same nodes, driven by the same load generators:
- Arm (a), baseline: stock Linux CFS. The Temper agent remains installed but is placed in safe mode via the temper.codes/safe-mode-requested node annotation, which detaches scx_layered; the node then schedules under the stock kernel path. This keeps the software environment identical across arms except for the scheduler itself.
- Arm (b), attached: Temper's QoS tiers enforced by scx_layered. Workloads are assigned tiers through standard PriorityClasses; where a report evaluates workload profiles, the profile is the only additional variable and is stated.
Arms are run back-to-back on the same cluster state. For stateful services the arms are compared at matched warm-up time, and services are not restarted between an arm and its comparison arm. Results from different bring-ups of a cluster are never compared to each other; each report's ladder is self-contained.
The background ladder
The independent variable in most reports is co-located background load. A latency-critical primary workload runs at a fixed offered load; best-effort background replicas (CPU-bound batch workers, unless stated) are added in steps, typically bg ∈ {0, 1, 2, 4, 8} instances. The top of the ladder deliberately exceeds the node's capacity: the point is to measure the primary's tail latency as idle cycles disappear entirely. After each step the system is given a stabilization period before measurement begins, so step boundaries are excluded from the measured windows. “Bottom of the ladder” means bg=0 (the primary alone); “top” means the highest step in that report.
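The ladder timing described above can be sketched as a schedule of measurement windows. This is a minimal illustration, not the harness itself: the function name, the Window type, and the 60-second stabilization and 300-second measurement durations are assumptions chosen for the example, since the actual durations are stated per report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    bg: int         # number of background replicas at this step
    start_s: float  # measurement start, seconds from ladder start
    end_s: float    # measurement end, seconds from ladder start


def ladder_windows(steps=(0, 1, 2, 4, 8), stabilize_s=60.0, measure_s=300.0):
    """Return one measurement window per ladder step.

    Each step occupies stabilize_s + measure_s seconds. The window covers
    only the measure_s portion, so the step boundary (the moment background
    replicas are added) is never inside a measured window.
    The default durations are hypothetical.
    """
    windows = []
    t = 0.0
    for bg in steps:
        start = t + stabilize_s  # skip the stabilization period
        end = start + measure_s
        windows.append(Window(bg, start, end))
        t = end                  # the next step begins when this window ends
    return windows


if __name__ == "__main__":
    for w in ladder_windows():
        print(f"bg={w.bg}: measure {w.start_s:.0f}s to {w.end_s:.0f}s")
```

Keeping the schedule as data makes it easy to check later that no recorded sample falls inside a stabilization period.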
Load generation and instrument validity
Latency is measured at the client (memtier_benchmark, pgbench, sysbench, wrk2, or the application's own load driver), from bounded fixed-duration runs per ladder step. Three validity rules apply to the instrument itself:
- The client is never the constrained resource. Client pods are provisioned so they do not hit their own CPU quota during a run; a throttled client reports its own scheduling delay as server latency and invalidates the measurement.
- Placement is controlled. Clients are anti-affine to the servers they measure (a co-located client competes with its own server for CPU), and client and server run in the same availability zone so network variance does not dominate sub-millisecond percentiles.
- One experiment per node at a time. Nodes never host two concurrent test rigs; results contaminated by a co-resident experiment are discarded and re-run.
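One way to check the first rule is to compare the client container's cgroup v2 cpu.stat before and after a run: if the nr_throttled counter increased, the client hit its CPU quota. The sketch below assumes the harness can read that file's contents at both points; the function names are illustrative and are not taken from the published harness.

```python
def parse_cpu_stat(text):
    """Parse the contents of a cgroup v2 cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def client_was_throttled(before_text, after_text):
    """Return True if nr_throttled increased between the two snapshots.

    An increase means the client cgroup was throttled at least once during
    the run, so the run should be marked invalid under the first rule.
    """
    before = parse_cpu_stat(before_text)
    after = parse_cpu_stat(after_text)
    return after.get("nr_throttled", 0) > before.get("nr_throttled", 0)
```

A run would be accepted only when client_was_throttled returns False for every client pod.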
Metrics and aggregation
The primary metric is client-observed p99 latency per ladder step; throughput (operations or requests per second) is reported alongside it, since a scheduler can trade one for the other. Where a report uses repetitions, the aggregation (median of N runs, with spread) is stated in the table; single-run arms are labeled as such. Derived quantities — amplification factors, percentage deltas, cost figures — are computed from the tabled values, and cost uses on-demand list prices with the price basis stated.
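The aggregation conventions can be illustrated with a few small helpers. This is a sketch under stated assumptions: the load generators report their own percentiles, so the nearest-rank p99 here only shows the idea; "amplification" is assumed to mean the ratio of a step's p99 to the bg=0 p99 in the same arm; and min/max is one possible choice of spread.

```python
import statistics


def p99(latencies):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]             # rank is 1-based


def aggregate_runs(p99_per_run):
    """Median of N per-run p99 values, with min and max as the spread."""
    return {
        "n": len(p99_per_run),
        "median": statistics.median(p99_per_run),
        "min": min(p99_per_run),
        "max": max(p99_per_run),
    }


def amplification(p99_at_step, p99_at_bg0):
    """Assumed definition: p99 at a ladder step relative to bg=0."""
    return p99_at_step / p99_at_bg0


def pct_delta(attached, baseline):
    """Percentage change of the attached arm relative to the baseline arm."""
    return 100.0 * (attached - baseline) / baseline
```

Computing derived quantities from the tabled values, rather than from raw samples, keeps every derived number reproducible from the published table alone.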
Scheduler attribution
Because the kernel can eject a sched_ext scheduler and revert to CFS as a
fail-safe, attached-arm windows are attributed, not assumed: the agent's status stream
records attach/detach transitions, and pressure-stall (PSI) fingerprints distinguish
windows scheduled by scx_layered from windows on the CFS fallback. Windows
that ran partly under fallback are marked in the record and either excluded or reported
separately; they are never silently averaged into an attached arm.
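The attach/detach half of this attribution can be sketched as a classifier over the status stream. The event format, a list of (timestamp, state) pairs, and the label names are assumptions made for illustration; the PSI fingerprint check is not modeled here.

```python
def classify_window(start_s, end_s, transitions):
    """Classify a measurement window from attach/detach transitions.

    transitions: iterable of (timestamp_s, state) pairs, where state is
    "attached" or "detached". A window with no attach recorded before it
    is treated as not attached.
    Returns "attached", "fallback", or "mixed".
    """
    state = "detached"
    for ts, new_state in sorted(transitions):
        if ts <= start_s:
            state = new_state  # last known state when the window opens
        elif ts < end_s and new_state != state:
            return "mixed"     # scheduler changed inside the window
    return "attached" if state == "attached" else "fallback"
```

Under this scheme only windows labeled "attached" would enter the attached arm; "mixed" windows would be excluded or reported separately.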
Records and reproducibility
Every figure quoted in a report traces to a committed machine-readable record
(JSON/JSONL) plus a written report under docs/training-artifacts/ in the
product repository, produced by the same harness that ran the experiment. Anomalies and
negative results are retained in the records and reported; invalidated runs are kept,
marked invalid, rather than deleted. Reports state the cluster, node shape, date, and
record paths sufficient to re-run the experiment with the published harness.
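The keep-and-mark discipline can be illustrated with an append-only JSONL file. The field names and helper functions below are hypothetical and do not describe the harness's actual record schema.

```python
import json


def make_record(cluster, node_shape, date, arm, bg, p99_ms, ops_per_s,
                valid=True, note=""):
    """Build one ladder-step record (hypothetical field names)."""
    return {
        "cluster": cluster, "node_shape": node_shape, "date": date,
        "arm": arm, "bg": bg, "p99_ms": p99_ms, "ops_per_s": ops_per_s,
        "valid": valid, "note": note,
    }


def append_record(path, record):
    """Append one record as a JSON line; existing lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")


def load_all(path):
    """Load every record, including those marked invalid."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def load_valid(path):
    """Load only the records that may feed a reported figure."""
    return [r for r in load_all(path) if r["valid"]]
```

Because invalid runs stay in the file with a note, a reader can see both what was measured and why a run was set aside.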
Limitations of the method
These are lab workloads on small clusters: representative open-source services and generators, not production traffic, and typically two to three nodes per arm. Same-node A/B controls for hardware but not for long-horizon effects (fragmentation, thermal behavior, neighbor churn) that only production exposure reveals. Cost figures are list-price arithmetic, not invoices. Where a report's claim depends on a specific node shape or a stated boundary condition, that dependence is part of the result, not a footnote.