Safety & rollback: the kernel takes back over
Whether kernel-level enforcement is adoptable is decided by its failure modes, not its upside. This page is the complete, honest list, with measurements.
The fail-safe is the kernel’s contract
sched_ext was designed so that a BPF scheduler cannot take the system down
with it: if the scheduler misbehaves, stalls, crashes, or detaches for any reason, the
kernel ejects it and atomically resumes scheduling with the stock scheduler. This is not
Temper code — it is the kernel feature Temper is built on. The consequence is structural:
the worst case is the scheduler you already run today, per node, never per cluster.
Measured failover
The failure paths are benchmarked, not asserted: we force-killed the node agent under load (on GKE and EKS), soaked the system for 8 hours under continuous pod churn, and measured the per-reconfiguration cost. The numbers, their caveats, and the raw record paths are collected in one place: deep dive: failure & rollback engineering.
The kill switch
Fleet-wide rollback is one annotation, with no Helm operation and no control-plane dependency; each node’s agent honors it directly:
# stand the scheduler down everywhere; pods run stock CFS
kubectl annotate node --all temper.codes/safe-mode-requested=true
# re-engage
kubectl annotate node --all temper.codes/safe-mode-requested-
Safe mode can also be targeted at single nodes, toggled from the
dashboard (audit-logged), or driven by the optional controller via
a TemperPolicy resource. Entering safe mode always succeeds: it kills the
scheduler. Exiting regenerates the configuration and re-attaches the scheduler.
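For a single node, the same annotation can be applied by node name instead of with --all. The sketch below wraps that in small shell functions and adds a check of the current annotation value. The function names and the node name in the usage note are illustrative; only the annotation key comes from the commands above.

```shell
# Hypothetical helpers; only the annotation key is taken from this page.
SAFE_MODE_KEY="temper.codes/safe-mode-requested"

# Request safe mode on one node (--overwrite makes the call repeatable).
temper_safe_mode_on() {
  kubectl annotate node "$1" "${SAFE_MODE_KEY}=true" --overwrite
}

# Remove the annotation so the agent can re-attach the scheduler.
temper_safe_mode_off() {
  kubectl annotate node "$1" "${SAFE_MODE_KEY}-"
}

# Print the annotation value; empty output means it is not set.
temper_safe_mode_status() {
  kubectl get node "$1" \
    -o jsonpath='{.metadata.annotations.temper\.codes/safe-mode-requested}'
}
```

With a placeholder node name, `temper_safe_mode_on worker-7` followed by `temper_safe_mode_status worker-7` should print `true` once the annotation is stored.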
Reconfiguration churn cost
When QoS assignments change on a node (a pod joins or leaves a tier), the agent regenerates the scheduler configuration and restarts the kernel scheduler. The cost is a brief window during which the node runs stock CFS, measured at ~52 ms per reconfiguration (measurement and caveats). The window is node-local and bounded, and it fails in the same safe direction as everything else here: absence of benefit, not harm. Pod churn is debounced and batched so a busy node does not thrash.
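To put the ~52 ms figure in proportion, this back-of-the-envelope sketch converts a reconfiguration rate into the share of an hour a node spends on the stock scheduler. The rate of 60 reconfigurations per hour is an arbitrary illustration, not a measurement, and the calculation assumes every reconfiguration costs the full 52 ms with no overlap.

```shell
# Share of one hour spent on the stock scheduler, in percent.
# $1 = reconfigurations per hour (hypothetical input)
# $2 = cost per reconfiguration in ms (defaults to the ~52 ms quoted above)
stock_window_pct() {
  awk -v n="$1" -v ms="${2:-52}" \
    'BEGIN { printf "%.4f\n", n * ms / 3600000 * 100 }'
}

stock_window_pct 60   # 60 * 52 ms = 3120 ms per hour -> prints 0.0867
```

Under those assumptions, one reconfiguration per minute keeps a node on the stock scheduler for less than a tenth of a percent of the hour.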
The cpu.max disclosure
Stated plainly, because it is the one behavioral difference you must know:
while Temper’s scheduler is attached, cgroup cpu.max CPU quotas are not
enforced by the kernel. This is a property of sched_ext scheduling, not a Temper choice.
Containment of greedy workloads comes from Temper’s layer ceilings instead,
which measured at or below quota on the tested shapes. The kernel mechanics and the
quota-parity measurement are written up in the CPU-limits deep dive, and quota-derived
layer ceilings are on the roadmap to close the semantic gap.
Two mitigations are unconditional: memory limits are unaffected (only
CPU quota semantics change), and the kill switch restores CFS with quotas instantly.
If strict CPU quota enforcement is a compliance requirement for a node, do not attach Temper
to that node — mixed fleets are fully supported.
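To see which workloads rely on a quota in the first place, it helps to read cpu.max directly. On cgroup v2 the file holds two fields, a quota and a period in microseconds, or the literal max when no quota is set. The sketch below converts such a line into an equivalent CPU count; the function name is illustrative and the cgroup path in the usage note is a placeholder.

```shell
# Convert a cgroup v2 cpu.max line ("<quota> <period>" or "max <period>"),
# read from stdin, into an equivalent number of CPUs.
cpu_max_to_cpus() {
  awk '{
    if ($1 == "max") print "unlimited"
    else printf "%.2f\n", $1 / $2
  }'
}

echo "50000 100000" | cpu_max_to_cpus   # prints 0.50
echo "max 100000"   | cpu_max_to_cpus   # prints unlimited
```

On a node, the input would come from the pod's cgroup, for example `cpu_max_to_cpus < /sys/fs/cgroup/<pod-cgroup-path>/cpu.max`, where the path depends on the container runtime and cgroup driver.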
Privileged DaemonSet posture
Loading a kernel scheduler requires privileged + hostPID and /sys access
— the standard posture of node agents like Falco or Datadog. What bounds it: the agent
serves only in-cluster endpoints, executes no remote code, makes zero external calls, and
modifies only its own scheduler process and node annotations. Every permission is justified
line by line in the security whitepaper. Full posture, supply chain, and disclosure policy:
security & trust.
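The posture described above can be checked against what is actually deployed. This is a minimal sketch that reads hostPID and the containers' privileged flag from a DaemonSet spec; the namespace and DaemonSet name passed in are assumptions, so substitute whatever your installation uses.

```shell
# Print hostPID and the privileged flag(s) of a DaemonSet's pod template.
# $1 = namespace, $2 = DaemonSet name (both hypothetical; not named on this page)
temper_agent_posture() {
  kubectl get daemonset "$2" -n "$1" \
    -o jsonpath='hostPID={.spec.template.spec.hostPID} privileged={.spec.template.spec.containers[*].securityContext.privileged}{"\n"}'
}
```

For example, `temper_agent_posture temper temper-agent` prints one line with both values for that (assumed) DaemonSet.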