Safety & rollback: the kernel takes back over

What decides whether kernel-level enforcement is adoptable is not the upside but the failure modes. This page is the complete, honest list, with measurements.

The fail-safe is the kernel’s contract

sched_ext was designed so that a BPF scheduler cannot take the system down with it: if the scheduler misbehaves, stalls, crashes, or detaches for any reason, the kernel ejects it and atomically resumes scheduling with the stock scheduler. This is not Temper code — it is the kernel feature Temper is built on. The consequence is structural: the worst case is the scheduler you already run today, per node, never per cluster.
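One way to observe this contract on a node is to read the state the kernel exposes in sysfs. The sketch below assumes a kernel built with sched_ext, which publishes /sys/kernel/sched_ext/state and, while a scheduler is attached, /sys/kernel/sched_ext/root/ops. The function name scx_status and its optional directory argument are illustrative and not part of Temper.

```shell
#!/bin/sh
# Report whether a sched_ext (BPF) scheduler is currently attached on this node.
# scx_status is a hypothetical helper; the optional argument overrides the
# sysfs directory so the function can be exercised against a fixture.
scx_status() {
    root="${1:-/sys/kernel/sched_ext}"
    if [ ! -r "$root/state" ]; then
        echo "sched_ext not available"
        return 0
    fi
    state=$(cat "$root/state")
    if [ "$state" = "enabled" ] && [ -r "$root/root/ops" ]; then
        # root/ops holds the name of the attached BPF scheduler
        echo "attached: $(cat "$root/root/ops")"
    else
        # any other state means no BPF scheduler is fully attached
        echo "state: $state"
    fi
}
```

Calling scx_status with no argument on a node reads the real sysfs files. After an ejection or a safe-mode request, the expectation is that it reports a state of disabled.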

Measured failover

The failure paths are benchmarked, not asserted: we force-killed the node agent under load (on GKE and EKS), soaked the system for 8 hours under continuous pod churn, and measured the per-reconfiguration cost. The numbers, their caveats, and the raw record paths are collected in one place: deep dive: failure & rollback engineering.

The kill switch

Fleet-wide rollback is one annotation — no helm operation, no control-plane dependency, honored directly by each node’s agent:

# stand the scheduler down everywhere; pods run stock CFS
kubectl annotate node --all temper.codes/safe-mode-requested=true

# re-engage
kubectl annotate node --all temper.codes/safe-mode-requested-

Safe mode can also be targeted at single nodes, toggled from the dashboard (audit-logged), or driven by the optional controller via a TemperPolicy resource. Entering safe mode always succeeds: it kills the scheduler. Exiting regenerates the configuration and re-attaches.
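Targeting a single node uses the same annotation with a node name in place of --all. A minimal sketch follows: safe_mode and the KUBECTL override are hypothetical conveniences, only the annotation key comes from the commands above, and --overwrite is added on the assumption that the annotation may already be set.

```shell
#!/bin/sh
# Hypothetical wrapper around the documented annotation.
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the command.
KUBECTL="${KUBECTL:-kubectl}"
ANNOTATION="temper.codes/safe-mode-requested"

# usage: safe_mode on|off <node-name>
safe_mode() {
    action="$1"
    node="$2"
    if [ -z "$node" ]; then
        echo "usage: safe_mode on|off <node-name>" >&2
        return 2
    fi
    case "$action" in
        # request safe mode: the node's agent stands the scheduler down
        on)  "$KUBECTL" annotate node "$node" "$ANNOTATION=true" --overwrite ;;
        # trailing "-" removes the annotation, asking the agent to re-engage
        off) "$KUBECTL" annotate node "$node" "$ANNOTATION-" ;;
        *)   echo "usage: safe_mode on|off <node-name>" >&2; return 2 ;;
    esac
}
```

For example, safe_mode on NODE_NAME stands the scheduler down on that node only, and safe_mode off NODE_NAME re-engages it. Setting KUBECTL=echo prints the kubectl arguments without touching the cluster.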

Reconfiguration churn cost

When QoS assignments change on a node (a pod joins or leaves a tier), the agent regenerates the scheduler configuration and restarts the kernel scheduler. The cost is a brief window during which the node runs stock CFS, measured at ~52 ms per reconfiguration (measurement and caveats). The window is node-local and bounded, and it fails in the same safe direction as everything else here: absence of benefit, not harm. Pod churn is debounced and batched so a busy node does not thrash.
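To put the ~52 ms figure in proportion, the time a node spends on the stock scheduler scales linearly with how often it reconfigures. The sketch below is plain arithmetic: the rate of 60 reconfigurations per hour is a hypothetical input, not a measurement, and stock_window_pct is an illustrative name.

```shell
#!/bin/sh
# Percentage of wall-clock time a node runs the stock scheduler because of
# reconfiguration, given reconfigurations per hour ($1) and the cost per
# event in milliseconds ($2). Both inputs are assumptions from the caller.
stock_window_pct() {
    awk -v n="$1" -v ms="$2" 'BEGIN { printf "%.3f\n", n * ms / 3600000 * 100 }'
}

# 60 reconfigurations/hour (hypothetical) at ~52 ms each
stock_window_pct 60 52   # prints 0.087
```

At that assumed rate the node would spend under a tenth of a percent of each hour on the stock scheduler. Debouncing and batching lower the effective rate further, though by how much depends on the workload.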

The cpu.max disclosure

Stated plainly, because it is the one behavioral difference you must know: while Temper’s scheduler is attached, cgroup cpu.max CPU quotas are not enforced by the kernel. This is a property of sched_ext scheduling, not a Temper choice. Containment of greedy workloads comes from Temper’s layer ceilings instead, measured at-or-below quota on the tested shapes; the kernel mechanics and the quota-parity measurement are written up in the CPU-limits deep dive. Quota-derived layer ceilings are on the roadmap to close the semantic gap. Two mitigations are unconditional: memory limits are unaffected (only CPU quota semantics change), and the kill switch restores CFS with quotas instantly. If strict CPU quota enforcement is a compliance requirement for a node, do not attach Temper to that node; mixed fleets are fully supported.
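Before attaching Temper to a node, it can help to know which cgroups carry a CPU quota that would stop being enforced. In cgroup v2, cpu.max holds two fields, quota and period in microseconds, with the literal max meaning no quota. The sketch below converts that content into a CPU count; cpu_max_to_cpus is an illustrative helper and not part of Temper.

```shell
#!/bin/sh
# Convert cgroup v2 cpu.max content ("<quota> <period>" or "max <period>")
# into an effective CPU limit. Illustrative helper, not part of Temper.
cpu_max_to_cpus() {
    # $1 is the file content, e.g. "200000 100000"
    printf '%s\n' "$1" | {
        read -r quota period
        if [ "$quota" = "max" ]; then
            # no quota configured, so nothing changes for this cgroup
            echo "unlimited"
        else
            # quota / period = number of CPUs the cgroup may consume
            awk -v q="$quota" -v p="$period" 'BEGIN { printf "%.2f\n", q / p }'
        fi
    }
}
```

On a node, the input would come from reading the cpu.max file of the pod's cgroup directory, whose path depends on the container runtime and cgroup driver. A result other than unlimited marks a workload whose quota semantics change while the scheduler is attached.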

Privileged DaemonSet posture

Loading a kernel scheduler requires privileged + hostPID and /sys access, the standard posture of node agents like Falco or Datadog. What bounds it: the agent serves only in-cluster endpoints, executes no remote code, makes zero external calls, manages only its own scheduler process, and writes only node annotations. Every permission is justified line by line in the security whitepaper. Full posture, supply chain, and disclosure policy: security & trust.