
Benchmark methodology

Every technical report on this site uses a common experimental method. This page defines it once: the two-arm same-node comparison protocol, the stepped background-load ladder, the validity rules applied to load generators, the metric and aggregation conventions, how measurement windows are attributed to a scheduler, and the record discipline that makes each figure traceable to a committed artifact. Individual reports state only their deviations from this method.

Environments

Experiments run on live managed-Kubernetes clusters, not synthetic single-machine rigs: GKE Standard node pools on Container-Optimized OS (kernel 6.12 with sched_ext), and EKS on AL2023 and Bottlerocket. Node shapes are stated per report (e.g. e2-standard-4: 4 vCPU as 2 SMT-2 physical cores; c2-standard-4/8; m5.xlarge). System daemons, kubelet, and managed-platform agents remain in place, and their interference is part of every measurement rather than filtered out. Where a result is shape-sensitive (for example, whole-core allocation on 2-core nodes), the report identifies the boundary condition explicitly.

Two-arm protocol

Comparisons are same-node A/B, not cluster-vs-cluster. Both arms run the identical manifests, on the same nodes, driven by the same load generators:

Arms are run back-to-back on the same cluster state. For stateful services the arms are compared at matched warm-up time, and services are not restarted between an arm and its comparison arm. Results from different bring-ups of a cluster are never compared to each other; each report's ladder is self-contained.
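The comparability rule above can be expressed as a small guard. This is a minimal sketch, not the harness's code: the `Run` type, its field names (including `bringup_id`), and the arm labels "baseline" and "attached" are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    bringup_id: str  # hypothetical identifier for one bring-up of a cluster
    node: str        # node the primary workload ran on
    arm: str         # assumed labels: "baseline" or "attached"
    bg: int          # background-ladder step
    p99_ms: float    # client-observed p99 for this step


def comparable(a: Run, b: Run) -> bool:
    """True only for a same-node A/B pair within one cluster bring-up."""
    return (
        a.bringup_id == b.bringup_id  # never compare across bring-ups
        and a.node == b.node          # same node, not cluster-vs-cluster
        and a.bg == b.bg              # same ladder step
        and a.arm != b.arm            # two different arms
    )
```

A pair that fails this check would simply not be tabled against each other; the values themselves stay in the record.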

The background ladder

The independent variable in most reports is co-located background load. A latency-critical primary workload runs at a fixed offered load; best-effort background replicas (CPU-bound batch workers, unless stated) are added in steps, typically bg ∈ {0, 1, 2, 4, 8} instances. The top of the ladder deliberately exceeds the node's capacity: the point is to measure the primary's tail latency as idle cycles disappear entirely. After each step the system is given a stabilization period before measurement begins, so step boundaries are excluded from the measured windows. “Bottom of the ladder” means bg=0 (the primary alone); “top” means the highest step in that report.
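The ladder's timing can be sketched as a schedule of measurement windows. The code below is an illustration under stated assumptions, not the published harness: the function name `ladder_windows`, the 60-second stabilization period, and the 300-second measurement duration are placeholders, and each report states its own values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    bg: int         # background replicas running during this window
    start_s: float  # measurement start, seconds from ladder start
    end_s: float    # measurement end, seconds from ladder start


def ladder_windows(steps=(0, 1, 2, 4, 8), stabilize_s=60.0, measure_s=300.0):
    """Return one measurement window per ladder step.

    Every step begins with an unmeasured stabilization period, so the
    moment at which background replicas are added never falls inside
    a measured window.
    """
    windows = []
    t = 0.0
    for bg in steps:
        t += stabilize_s  # let the system settle after the step change
        windows.append(Window(bg, t, t + measure_s))
        t += measure_s    # the next step starts when this window ends
    return windows


if __name__ == "__main__":
    for w in ladder_windows():
        print(f"bg={w.bg}: measure from {w.start_s:.0f}s to {w.end_s:.0f}s")
```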

Load generation and instrument validity

Latency is measured at the client (memtier_benchmark, pgbench, sysbench, wrk2, or the application's own load driver), from bounded fixed-duration runs per ladder step. Three validity rules apply to the instrument itself:

Metrics and aggregation

The primary metric is client-observed p99 latency per ladder step; throughput (operations or requests per second) is reported alongside it, since a scheduler can trade one for the other. Where a report uses repetitions, the aggregation (median of N runs, with spread) is stated in the table; single-run arms are labeled as such. Derived quantities (amplification factors, percentage deltas, cost figures) are computed from the tabled values; cost figures use on-demand list prices, with the price basis stated.
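The conventions above amount to a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the nearest-rank percentile is one common definition (load generators may compute p99 differently, for example from a histogram), and the min-max range is an assumed choice of spread.

```python
import statistics


def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty latency sample."""
    ordered = sorted(latencies_ms)
    # ceil(0.99 * n) computed in integers to avoid floating-point rounding
    rank = -(-99 * len(ordered) // 100)
    return ordered[rank - 1]


def aggregate(per_run_p99_ms):
    """Median of N per-run p99 values, with the min-max range as spread."""
    return {
        "n": len(per_run_p99_ms),
        "median": statistics.median(per_run_p99_ms),
        "min": min(per_run_p99_ms),
        "max": max(per_run_p99_ms),
    }


def amplification(p99_top_ms, p99_bottom_ms):
    """Tail-latency growth from the bottom (bg=0) to the top of the ladder."""
    return p99_top_ms / p99_bottom_ms


def pct_delta(arm_ms, baseline_ms):
    """Signed percentage change of one arm relative to the other."""
    return 100.0 * (arm_ms - baseline_ms) / baseline_ms
```

Computing the derived quantities from the tabled medians, rather than from raw samples, keeps every derived number checkable by a reader who has only the table.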

Scheduler attribution

Because the kernel can eject a sched_ext scheduler and revert to CFS as a fail-safe, attached-arm windows are attributed, not assumed: the agent's status stream records attach/detach transitions, and pressure-stall (PSI) fingerprints distinguish windows scheduled by scx_layered from windows on the CFS fallback. Windows that ran partly under fallback are marked in the record and either excluded or reported separately; they are never silently averaged into an attached arm.
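The transition-based half of this attribution can be sketched as follows. It is a simplified illustration with assumed names and data shapes: `attribute`, the `(timestamp, state)` tuples, and the three labels are hypothetical, and the PSI fingerprint check is not modeled here.

```python
from bisect import bisect_right


def attribute(window, transitions, initial="detached"):
    """Classify a measurement window from attach/detach transitions.

    window:      (start_s, end_s)
    transitions: list of (timestamp_s, state) sorted by timestamp, where
                 state is "attached" or "detached"
    Returns "attached", "fallback", or "mixed".
    """
    start, end = window
    times = [t for t, _ in transitions]
    i = bisect_right(times, start)  # number of transitions at or before start
    states = {transitions[i - 1][1] if i else initial}
    for t, state in transitions[i:]:
        if t >= end:
            break
        states.add(state)  # a transition inside the window changes the state
    if states == {"attached"}:
        return "attached"
    if states == {"detached"}:
        return "fallback"
    return "mixed"  # ran partly under fallback: exclude or report separately
```

A window labeled "mixed" corresponds to the partly-fallback case described above and would be marked in the record instead of being averaged into the attached arm.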

Records and reproducibility

Every figure quoted in a report traces to a committed machine-readable record (JSON/JSONL) plus a written report under docs/training-artifacts/ in the product repository, produced by the same harness that ran the experiment. Anomalies and negative results are retained in the records and reported; invalidated runs are kept and marked invalid rather than deleted. Reports state the cluster, node shape, date, and record paths sufficient to re-run the experiment with the published harness.
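As an illustration of the keep-but-mark rule, the sketch below reads JSONL records and separates valid from invalidated runs without discarding either. The field names (`arm`, `bg`, `p99_ms`, `valid`, `invalid_reason`) and the sample values are invented for this example and are not the harness's actual schema or data.

```python
import json

# Invented sample lines; field names and values are placeholders only.
SAMPLE_JSONL = """\
{"arm": "baseline", "bg": 4, "p99_ms": 20.0, "valid": true}
{"arm": "attached", "bg": 4, "p99_ms": 30.0, "valid": false, "invalid_reason": "example reason"}
{"arm": "attached", "bg": 4, "p99_ms": 10.0, "valid": true}
"""


def load_records(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def split_by_validity(records):
    """Return (valid, invalid); invalid runs are retained, not dropped."""
    valid = [r for r in records if r.get("valid") is True]
    invalid = [r for r in records if r.get("valid") is not True]
    return valid, invalid


if __name__ == "__main__":
    valid, invalid = split_by_validity(load_records(SAMPLE_JSONL))
    print(f"{len(valid)} valid, {len(invalid)} retained as invalid")
```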

Limitations of the method

These are lab workloads on small clusters: representative open-source services and generators, not production traffic, and typically two to three nodes per arm. Same-node A/B controls for hardware but not for long-horizon effects (fragmentation, thermal behavior, neighbor churn) that only production exposure reveals. Cost figures are list-price arithmetic, not invoices. Where a report's claim depends on a specific node shape or a stated boundary condition, that dependence is part of the result, not a footnote.