Operations: running it like you mean it

Upgrade patterns, the rollback runbook, what to monitor, and how to size protected tiers so the enforcement math works in your favor.

Upgrades: canary first

The chart has first-class support for canarying a new agent build on a subset of nodes before fleet rollout: enabling canary mode renders a second DaemonSet with the candidate image tag, targeted by a node selector, while the main DaemonSet excludes those nodes.

# label the canary nodes, then:
helm upgrade temper deploy/helm/temper --reuse-values \
  --set agent.canary.enabled=true \
  --set-string agent.canary.image.tag=$NEW_TAG \
  --set-string agent.canary.nodeSelector.temper-canary=true

Watch the canary nodes’ linter metrics and your SLOs; if the candidate misbehaves, the blast radius is the labeled nodes, and each of them fails toward stock CFS. Promote by moving the main image tag and disabling the canary.
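
The labeling and promotion steps around that command can be sketched as shell functions. This is a sketch under assumptions, not the chart's documented interface: `agent.image.tag` is an assumed value path for the main DaemonSet image (only `agent.canary.image.tag` appears above), and the function names are illustrative. Check the chart's values file before relying on it.

```shell
# Sketch only. The label key/value matches the node selector used above.

# Label the nodes that should run the candidate build.
label_canary_nodes() {
  for node in "$@"; do
    kubectl label node "$node" temper-canary=true --overwrite
  done
}

# Promote: move the main image tag to the candidate, then disable the canary.
# ASSUMPTION: agent.image.tag is the main image tag value in this chart.
promote_canary() {
  helm upgrade temper deploy/helm/temper --reuse-values \
    --set-string agent.image.tag="$1" \
    --set agent.canary.enabled=false
}

# Optional housekeeping: drop the canary label once promotion is done.
unlabel_canary_nodes() {
  for node in "$@"; do
    kubectl label node "$node" temper-canary-
  done
}
```

Typical use would be `label_canary_nodes node-a node-b` before the canary upgrade, then `promote_canary "$NEW_TAG"` once the canary nodes look healthy.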

Rollback runbook

  1. Stop enforcement first, everywhere it hurts — safe mode is instant and does not require helm:
    kubectl annotate node --all --overwrite temper.codes/safe-mode-requested=true
  2. Then roll back the release at leisure:
    helm rollback temper
  3. Re-engage by removing the annotation once the fleet is on the version you trust:
    kubectl annotate node --all temper.codes/safe-mode-requested-

The ordering matters and is the point of the design: enforcement rollback (milliseconds, kernel-native) is decoupled from software rollback (minutes, helm). You never wait on an image pull to get back to stock scheduling. Details: safety & rollback.
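
One way to keep that ordering from being skipped under pressure is to wrap the runbook in two small functions. This is a minimal sketch: the function names are illustrative, and `--overwrite` is added here so that kubectl does not reject the update on nodes that already carry the annotation.

```shell
# Steps 1 and 2: request safe mode everywhere, then roll back the release.
temper_rollback() {
  # Enforcement rollback comes first. If it fails, stop and surface the error
  # rather than starting the slow helm step with enforcement still active.
  if ! kubectl annotate node --all --overwrite temper.codes/safe-mode-requested=true; then
    echo "safe-mode request failed; enforcement may still be active" >&2
    return 1
  fi
  helm rollback temper
}

# Step 3: re-engage enforcement once the fleet is on a trusted version.
temper_reengage() {
  kubectl annotate node --all temper.codes/safe-mode-requested-
}
```

Stopping when the annotation step fails is a design choice in this sketch: it makes a failed kill switch impossible to miss. An operator who prefers to proceed with the helm rollback anyway can run it by hand.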

Monitoring the agent

| Signal | Where | Alert when |
| --- | --- | --- |
| Agent status | temper.codes/agent-status node annotation | Not ready on a node that should be enforcing |
| Agent pods | kubectl -n temper get pods / kube-state-metrics | CrashLoop or not Running on schedulable nodes |
| Linter violations | temper_lint_violation on /metrics | Persistently nonzero: config and reality have drifted |
| Safe-mode state | Metrics + node annotations | Unexpected safe-mode entries (someone pulled the kill switch) |
| Config generation | temper.codes/config-generation annotation | Rapid churn: tier assignments are thrashing |

Remember the failure direction when triaging: an agent that is down means the node runs stock CFS — your workloads are un-protected, not broken.
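
Two of these signals are easy to spot-check from a shell. The sketch below rests on assumptions: the healthy value of the `temper.codes/agent-status` annotation is taken to be `ready`, the metrics endpoint is taken to emit samples without timestamps, and the function names are illustrative.

```shell
# List nodes whose agent-status annotation is missing or not "ready".
# ASSUMPTION: "ready" is the healthy annotation value.
nodes_not_ready() {
  kubectl get nodes -o \
    jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.temper\.codes/agent-status}{"\n"}{end}' \
  | awk -F'\t' '$2 != "ready" { print $1 }'
}

# Sum every temper_lint_violation sample from Prometheus text on stdin.
# ASSUMPTION: samples carry no trailing timestamp, so the last field is the value.
lint_violations_total() {
  awk '/^temper_lint_violation([{ ]|$)/ { sum += $NF } END { print sum + 0 }'
}
```

For example, `curl -s "$METRICS_URL" | lint_violations_total` prints the current total, where `METRICS_URL` is a placeholder for wherever the agent's /metrics endpoint is reachable. Any node printed by `nodes_not_ready` is running stock CFS, so treat it as unprotected, not broken.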

Sizing guidance for protected tiers