07 Deep dive
Automated node consolidation under an SLO guard
Packing improvements reduce cost only when a node actually leaves the cluster. We describe a consolidation controller that plans node-count reductions from kernel-level telemetry and executes them through hash-bound approvals, tier-ordered PDB-honoring drains, and a live SLO guard, delegating instance deletion to the resident autoscaler. On a GKE cluster (3× e2-standard-4) the controller executed an automated 3→2 reduction with the latency-critical workload's p99 flat throughout; a separate run in which the controller's pending-replacement guard aborted the operation and rolled back completely is also reported.
System design: scale-down decision, delegated actuation
The division of labor is fixed by design. Scale-up remains delegated to Cluster Autoscaler or Karpenter: they react to unschedulable pods, and their scale-up path is the safety net if the request-level picture is ever wrong — pods that do not fit trigger a new node, and the failure costs money rather than availability. The scale-down question — can these pods co-locate at SLO? — is computable only from kernel-level telemetry: what each pod actually consumes, what the protected tiers’ cgroups report under pressure, and what the scheduler underneath will enforce once they share a node. That decision belongs to the controller; the actuation surface is kept deliberately thin.
One invariant constrains the design: the worst case must be a forgone reduction, never the removal of a node whose workloads cannot co-locate at SLO. Each mechanism below — the approval hash, the eviction-only drain, the live SLO guard, the abort path — exists to keep the failure direction on that side. The complementary scale-up result: with Karpenter running placement in both arms, Temper underneath provisioned 40% fewer vCPUs at equal load and equal SLO (the idle-capacity harvesting article, docs/training-artifacts/karpenter/).
Execution protocol: plan, hash-bound approval, drain, reap
The decision engine computes a ConsolidationPlan from Temper telemetry only:
the requests annotation the agent already publishes, measured usage and machine shape from
the agent’s /observe snapshot, and protected-tier cgroup PSI from the
same source. No cloud API and no metrics-server dependency are involved. The plan —
victim, survivors, feasibility checks, predicted post-state — is published in the
TemperPolicy status with a hash over its inputs, and the dashboard displays it
beside the savings it would realize. Applying it is an audited one-click action, and
the approval is the plan hash: the dashboard returns the hash of the exact plan it
displayed, and execution re-validates the full plan against live objects before the first
mutation. If the fleet has drifted since the plan was computed, no mutation occurs.
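The hash-bound approval described above can be sketched in a few lines. This is a minimal illustration, not the controller's implementation: SHA-256, the canonical-JSON encoding, the 16-character truncation, and the names `plan_hash` and `approve_and_validate` are all assumptions, and re-hashing the live inputs stands in for the fuller re-validation against live objects.

```python
import hashlib
import json


def plan_hash(plan_inputs: dict) -> str:
    """Hash a canonical JSON encoding so key order cannot change the result.

    SHA-256 and the 16-hex-character truncation are assumptions for this sketch.
    """
    canonical = json.dumps(plan_inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def approve_and_validate(approved_hash: str, published_inputs: dict, live_inputs: dict) -> bool:
    """Return True only when it is safe to start mutating.

    The approval must name the plan that was published, and inputs recomputed
    from live objects must still hash to the same value.
    """
    if approved_hash != plan_hash(published_inputs):
        return False  # the approval refers to a different plan
    if plan_hash(live_inputs) != approved_hash:
        return False  # the fleet drifted after the plan was computed
    return True
```

Because the approval carries the hash rather than a plan identifier, an approval for a stale plan fails closed: the check returns False and no mutation occurs.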
Sequencing is ordered so that resident autoscalers cannot race the operation. Before
the victim is cordoned, every survivor is annotated with
karpenter.sh/do-not-disrupt,
cluster-autoscaler.kubernetes.io/scale-down-disabled, and the Cast AI
equivalent, so whichever autoscaler is resident is instructed not to disturb the nodes the
plan depends on. The drain is tier-ordered (Background first, Critical last) and
uses policy/v1 evictions only — PDB-honoring, never a force-delete
— with a live SLO guard sampling protected-tier cgroup PSI between evictions.
Degradation or a stuck replacement aborts the run and rolls back. The empty node is then
reaped by the resident autoscaler (Mode A: zero cloud credentials in the controller
— the emptiness detection that CA or Karpenter already runs deletes the instance).
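The sequencing above (protect survivors, cordon, tier-ordered eviction with a guard check between evictions) can be sketched as follows. The cluster operations are injected as callables so the ordering logic stands alone; `run_drain`, its parameters, and the position of the filler tier between Background and Critical are assumptions of this sketch. The Cast AI annotation is left out because the text does not name it.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Annotation keys named in the text, with the "true" value both autoscalers use.
SURVIVOR_ANNOTATIONS: Dict[str, str] = {
    "karpenter.sh/do-not-disrupt": "true",
    "cluster-autoscaler.kubernetes.io/scale-down-disabled": "true",
}

# Background first, Critical last (from the text). Placing "Filler" between
# them is an assumption.
TIER_ORDER = {"Background": 0, "Filler": 1, "Critical": 2}


def drain_order(pods: Iterable[dict]) -> List[dict]:
    """Sort pods by tier; sorted() is stable, so order within a tier is kept."""
    return sorted(pods, key=lambda p: TIER_ORDER[p["tier"]])


def run_drain(
    pods: Iterable[dict],
    protect: Callable[[Dict[str, str]], None],  # annotate every survivor
    cordon: Callable[[], None],                 # cordon the victim
    evict: Callable[[dict], None],              # one policy/v1 eviction
    slo_ok: Callable[[], bool],                 # protected-tier PSI guard
) -> Tuple[str, List[str]]:
    protect(SURVIVOR_ANNOTATIONS)  # survivors are shielded before the cordon
    cordon()
    evicted: List[str] = []
    for pod in drain_order(pods):
        evict(pod)
        evicted.append(pod["name"])
        if not slo_ok():           # guard sampled between evictions
            return "aborted", evicted
    return "drained", evicted
```

In a real controller the `evict` callable would create a policy/v1 Eviction and handle a PDB refusal by waiting and retrying; that handling is omitted here.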
Results: automated 3→2 reduction
The acceptance run automated a node reduction previously performed manually (cordon, drain, delete) on the same cluster shape — 3× e2-standard-4, with memtier→memcached as the latency-critical primary and filler and background tiers alongside.
The safety-relevant measurement is the primary workload’s tail latency through the operation:
| Window | p99 (ms) | Samples |
|---|---|---|
| Baseline, 3 nodes (pre-approval) | 6.111 / 6.111 | 2 |
| During the drain | 6.079 | 1 |
| Awaiting reap (workloads settled on survivors) | 6.015 – 6.143 | 12 |
| After reap, 2 nodes | 5.951 / 6.111 / 6.111 | 3 |
The p99 stayed flat within sample noise end to end, through nine evictions, resettlement onto two survivors, and the node deletion: all 18 samples fall between 5.951 ms and 6.143 ms. The co-location the plan predicted to be safe was measured to be safe.
Source: docs/training-artifacts/consolidation/records/p99-timeline.jsonl
Reaping a cordoned node: taint escalation
One protocol detail is relevant to any system that delegates node deletion to a resident
autoscaler. After the drain, the victim node sat empty and GKE’s cluster-autoscaler
did not reap it: CA interprets a cordoned node as an operator statement and
excludes it from scale-down. The controller therefore escalates: at +5 minutes it replaces
the cordon with a temper.codes/consolidating:NoSchedule taint and uncordons
the node — still unschedulable for workloads, no longer an operator statement. In the
measured run, CA marked the node DeletionCandidate within a minute of the
escalation and deleted it after its standard 10-minute unneeded window; the timestamps are
in the run record.
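The escalation step can be expressed as a pure transformation of the node object. This is a sketch under assumptions: the node is represented as a plain dict shaped like the Kubernetes Node resource, and `escalate_if_unreaped` is a hypothetical name. The taint key, effect, and 5-minute delay come from the text.

```python
import copy

CONSOLIDATING_TAINT = {"key": "temper.codes/consolidating", "effect": "NoSchedule"}
ESCALATE_AFTER_S = 5 * 60  # the +5 minute mark from the text


def escalate_if_unreaped(node: dict, seconds_since_drain: float) -> dict:
    """Return the desired node object; the input is never modified.

    Before the deadline the node is returned unchanged (still cordoned).
    After it, the taint is added and the cordon is lifted in the same object,
    so the node is never schedulable for workloads in between.
    """
    if seconds_since_drain < ESCALATE_AFTER_S:
        return node
    desired = copy.deepcopy(node)
    spec = desired.setdefault("spec", {})
    taints = spec.setdefault("taints", [])
    if not any(t.get("key") == CONSOLIDATING_TAINT["key"] for t in taints):
        taints.append(dict(CONSOLIDATING_TAINT))  # idempotent on repeat calls
    spec["unschedulable"] = False                 # uncordon
    return desired
```

Applying both changes in one patch matters: uncordoning first and tainting second would leave a window in which the scheduler could place pods on the node being removed.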
Guard-triggered abort
A separate run on the same cluster demonstrates the abort path under real conditions. The run (plan 8605baef2fb7ccb3, also dashboard-applied and audit-logged) drained the victim cleanly, but the replacement pods for the measurement clients failed to schedule: the clients carried required anti-affinity, the kube-scheduler preempted them to place an evicted Critical pod (priority 1000000 preempts 100000; required anti-affinity is enforced symmetrically), and on the two-node end state their replacements had no legal placement. The pending-replacement guard aborted the operation at its 5-minute threshold, automatically uncordoned the victim, removed every protection the controller had added, and left zero annotation residue (records/abort-pending-replacement.json). After the client anti-affinity was relaxed to preferred, a subsequent plan executed to completion. In both runs the failure direction was the designed one: the worst observed outcome was a forgone reduction.
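A pending-replacement guard of the kind described above can be sketched as a small timer over the set of replacement pods that are still Pending, paired with a rollback that removes only what the controller added. The class and function names are hypothetical; only the 5-minute threshold and the rollback behavior (uncordon, remove added protections, no residue) come from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

PENDING_ABORT_S = 5 * 60  # the 5-minute threshold from the text


@dataclass
class PendingReplacementGuard:
    threshold_s: float = PENDING_ABORT_S
    first_seen: Dict[str, float] = field(default_factory=dict)

    def observe(self, now: float, pending: Set[str]) -> bool:
        """Record which replacements are Pending; True means abort the run."""
        # Drop pods that have scheduled since the last observation.
        self.first_seen = {p: t for p, t in self.first_seen.items() if p in pending}
        for pod in pending:
            self.first_seen.setdefault(pod, now)
        return any(now - t >= self.threshold_s for t in self.first_seen.values())


def strip_added(annotations: Dict[str, str], added_keys: Set[str]) -> Dict[str, str]:
    """Rollback helper: remove only the keys the controller itself added."""
    return {k: v for k, v in annotations.items() if k not in added_keys}
```

Tracking the keys the controller added, rather than removing the annotation names unconditionally, is what lets a rollback leave no residue without deleting an annotation an operator had set beforehand.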
Limitations and operational notes
- The engine does not plan without measurements. If the agent’s observation layer is disabled or unreachable, no plan is emitted; the engine never falls back to requests-only estimation, because a consolidation decision without measured usage and protected-tier PSI has no defensible safety basis.
- Node-level signals are not valid inputs on a Temper node. Because the Background tier absorbs idle capacity by design, node-level CPU PSI and total utilization are uninformative: in this session the node-level PSI read 78 while the critical cgroup’s PSI was 0.0, and total utilization sits near 100% on a healthy node. The engine gates on protected-tier cgroup PSI and non-Background measured usage. Node-level dashboards will show near-saturation on healthy Temper nodes; evaluation should use protected-tier metrics.
- The placement heuristic is deliberately conservative. It predicts placement proportionally to free capacity and refuses fleets that manual analysis might pack further. In this session it refused to plan while critical requests exceeded the whole-core budget of the 2-core nodes (critical-whole-core: predicted 3 cores > 1) and produced a plan only after the fleet was right-sized. The asymmetry is intentional: refusing a feasible reduction costs money; approving an infeasible one costs an SLO.
- The results above are from a single acceptance run and a single abort run on one cluster shape.
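The note above on node-level signals can be made concrete with the kernel's pressure-stall format, which is the same for `/proc/pressure/cpu` and a cgroup v2 `cpu.pressure` file. The sketch below parses that format and gates on the protected tier only. The threshold, the choice of the `some avg10` field, the function names, and the sample lines (constructed to match the 78 and 0.0 readings reported above) are assumptions.

```python
from typing import Dict


def parse_psi(text: str) -> Dict[str, Dict[str, float]]:
    """Parse PSI lines such as 'some avg10=0.00 avg60=0.00 avg300=0.00 total=0'."""
    parsed: Dict[str, Dict[str, float]] = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        parsed[kind] = {k: float(v) for k, v in (f.split("=", 1) for f in fields)}
    return parsed


def protected_tier_ok(psi: Dict[str, Dict[str, float]], max_some_avg10: float) -> bool:
    """Gate on the protected cgroup's own pressure, never the node-wide figure."""
    return psi["some"]["avg10"] <= max_some_avg10


# Illustrative samples only; the avg60, avg300, and total values are invented.
NODE_SAMPLE = "some avg10=78.00 avg60=75.10 avg300=70.02 total=123456\n"
CRITICAL_SAMPLE = (
    "some avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)
```

Fed the node-level sample, the same gate would refuse a healthy node, which is why the engine reads the critical cgroup's file instead.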
Raw records
- docs/training-artifacts/consolidation/REPORT.md
- docs/training-artifacts/consolidation/records/ (plan, final status, phase log, p99 timeline, audit export, abort record, node/pod snapshots, dashboard screenshots)
- docs/design/consolidator.md (design document)
- docs/training-artifacts/karpenter/REPORT.md (the scale-up side: −40% provisioned vCPUs under Karpenter)
These are committed benchmark records in the product repository; design partners get the full artifact tree.