06 Deep dive

Failure modes and rollback behavior

Adoptability of kernel-level scheduling enforcement depends on its failure behavior. This article collects the measured failure-path results in one place: the sched_ext fallback contract, agent-kill failover on GKE and EKS, an 8-hour soak under continuous pod churn, per-reconfiguration cost, watchdog fallback behavior, and the rollback mechanisms.

The sched_ext fallback contract

sched_ext (upstream since Linux 6.12) is designed so that a BPF scheduler cannot take the node down with it. When the scheduler is disabled — by request or by error — the kernel moves all tasks back to the stock fair-class scheduler atomically. The kernel itself ejects the BPF scheduler on any of: the scheduler process dying (crash, OOM-kill, pod deletion), any BPF-side error, a runnable-task stall caught by a kernel-enforced watchdog (a hard 30 s cap: if any runnable task is not scheduled within the timeout, the scheduler is ejected), or an operator request. The consequence for risk assessment is structural rather than promised: the worst case of a Temper scheduling failure is the scheduler the cluster runs today — not a kernel panic, not a hung node; degradation to stock CPU behavior, on that node only.

The same direction holds for nodes that never qualify: where the kernel lacks sched_ext, the scheduler fails to attach and the agent enters safe mode — the node runs stock CFS exactly as it would without Temper. Mixed fleets degrade per-node, never per-cluster.

Measured failover: agent termination under load

The node agent was force-killed mid-benchmark, with a latency-critical memcached under memtier load on the node:

0.607 / 0.639 / 0.607 p99 (ms) before the kill / during the CFS-fallback window / after the replacement agent re-attached docs/training-artifacts/reliability/failover-record.json
15.0 s replacement agent (a DaemonSet) Ready and scheduler re-attached docs/training-artifacts/binpack/SAVINGS-REPORT.md
0.271 → 0.239 → 0.215 the same experiment replicated on EKS (p99 ms, before/during/after) — no SLO deviation, agent Ready in 14.8 s docs/training-artifacts/reliability/eks/failover-record.json

Two properties follow from the records. First, scx_layered keeps running when the agent dies — the scheduler is not dependent on the agent process; the EKS record shows no workload-visible effect at all. Second, when the scheduler itself exits, the fallback is the kernel revert described above, so the full kill-and-recover cycle on GKE amounted to roughly a 5% p99 deviation and fifteen seconds of DaemonSet rescheduling.

Eight-hour soak under continuous churn

A soak run held the full benchmark workload (two Critical memcacheds under load, Normal fillers, Background spinners) for 8 hours while a churn generator created and deleted a pod every ~36 seconds — 399 cycles, which drove 798 scheduler reconfiguration restarts on the node that received them. Verdict, from the committed report: pass — no memory leak, no SLO drift, no agent instability.

192 Mi flat agent memory, all three agents, stable to within 1 Mi across 8 h of continuous churn (50 samples each) docs/training-artifacts/reliability/agent-mem.csv
0.21% of the churned node’s time spent in CFS fallback across 798 reconfig restarts docs/training-artifacts/reliability/SOAK-REPORT.md
0 crashes the 798 restarts are clean reconfigs, not failures — agent restartCount stayed 0 on every pod; SLO envelope stable first hour to last docs/training-artifacts/reliability/SOAK-REPORT.md

Reconfiguration cost

When QoS assignments change on a node, the agent regenerates the scheduler config and restarts scx_layered; the node runs stock CFS for the gap. A deliberate churn storm (50 assignment cycles in 10 minutes → 100 restarts on one node, with 48 no-op reloads correctly skipped) measured 52.6 ms of CFS fallback per restart from log kill→spawn timestamps — 0.88% of the busiest node's time at that extreme rate. The EKS parity run showed the same blast-radius shape (all restarts confined to the churned node, zero on the others) with a different per-restart cost model (~172 ms implied); the records note that the models differ, so the measured GKE figure and the shape are quoted rather than a cross-cloud average. The failure direction during every gap is the same as elsewhere on this page: absence of benefit, not harm.

Watchdog fallback behavior

The watchdog path is exercised, not merely specified. Sustained starvation of any runnable task — from any cause — results in the kernel ejecting the sched_ext scheduler and reverting the node to stock CFS automatically; workloads continue running throughout. This is a fail-safe, not an outage: the ejection is the safety system, and it has been observed operating as specified under saturation testing on a 19-service application (methodology and measurements in the microservice-chains article, including the PSI fingerprinting used to attribute each measurement window to the scheduler actually live during it).

The shipping release carries two behaviors on this path. The scheduler provides a per-CPU starvation guarantee — CPU-pinned kernel threads, which cannot migrate, are serviced on their own CPU regardless of layer accounting — removing the starvation class most likely to trip the watchdog on small, saturated nodes; zero watchdog ejections were observed across the validation runs of the current release, including a full ~35-minute saturated arm. The agent additionally runs a scheduler supervisor: a SIGKILL test against the live scheduler showed reap, status flip, and respawn in 3–5 s.

Safe mode: two rollback speeds

Fleet-wide rollback is one node annotation — temper.codes/safe-mode-requested — honored directly by each node's agent with no control-plane round trip, so it works even when everything above the node engine is down. Entering safe mode always succeeds (it kills the scheduler; the kernel does the rest), and the agent also auto-enters safe mode on scheduler crash. The operational point, spelled out in the runbook: enforcement rollback (milliseconds, kernel-native) is deliberately decoupled from software rollback (minutes, helm) — returning to stock scheduling never waits on an image pull. Safe mode also restores CFS with CPU quotas, which matters given the cpu.max disclosure.

For upgrades, the chart ships a canary mechanism: a second DaemonSet with the candidate image on labeled nodes, with hard mutual-exclusion affinity against the main one. A misbehaving candidate's blast radius is the labeled nodes, and each of them fails toward stock CFS — the same direction as every other failure on this page.

What is measured vs. what is guaranteed

Limitations

Raw records

  • docs/training-artifacts/reliability/SOAK-REPORT.md
  • docs/training-artifacts/reliability/failover-record.json
  • docs/training-artifacts/reliability/churn-record.json
  • docs/training-artifacts/reliability/soak-churn-record.json
  • docs/training-artifacts/reliability/agent-mem.csv · slo.csv
  • docs/training-artifacts/reliability/eks/ (failover + churn parity)
  • docs/training-artifacts/binpack/SAVINGS-REPORT.md (§4 reliability)
  • docs/training-artifacts/deathstarbench/v14-validation/REPORT.md (watchdog validation + supervisor kill test)
  • docs/security/WHITEPAPER.md (§2, the kernel contract and kill switch)

Committed benchmark records in the product repository; design partners get the full artifact tree.