07 Deep dive

Automated node consolidation under an SLO guard

Packing improvements reduce cost only when a node actually leaves the cluster. We describe a consolidation controller that plans node-count reductions from kernel-level telemetry and executes them through hash-bound approvals, tier-ordered PDB-honoring drains, and a live SLO guard, delegating instance deletion to the resident autoscaler. On a GKE cluster (3× e2-standard-4) the controller executed an automated 3→2 reduction with the latency-critical workload's p99 flat throughout; a separate run in which the controller's pending-replacement guard aborted the operation and rolled back completely is also reported.

run: automated 3→2 node shrink
cluster: GKE, 3× e2-standard-4
records: 1 report + machine records
method: benchmark methodology

System design: scale-down decision, delegated actuation

The division of labor is fixed by design. Scale-up remains delegated to Cluster Autoscaler or Karpenter: they react to unschedulable pods, and their scale-up path is the safety net if the request-level picture is ever wrong — pods that do not fit trigger a new node, and the failure costs money rather than availability. The scale-down question — can these pods co-locate at SLO? — is computable only from kernel-level telemetry: what each pod actually consumes, what the protected tiers’ cgroups report under pressure, and what the scheduler underneath will enforce once they share a node. That decision belongs to the controller; the actuation surface is kept deliberately thin.

One invariant constrains the design: the worst case must be a forgone reduction, never the removal of a node whose workloads cannot co-locate at SLO. Each mechanism below (the approval hash, the eviction-only drain, the live SLO guard, the abort path) exists to keep the failure direction on that side. The complementary scale-up result is reported separately: with Karpenter running placement in both arms, the arm with Temper underneath provisioned 40% fewer vCPUs at equal load and equal SLO (see the idle-capacity harvesting article, docs/training-artifacts/karpenter/).

Execution protocol: plan, hash-bound approval, drain, reap

The decision engine computes a ConsolidationPlan from Temper telemetry only: the requests annotation the agent already publishes, measured usage and machine shape from the agent’s /observe snapshot, and protected-tier cgroup PSI from the same source. No cloud API and no metrics-server dependency are involved. The plan — victim, survivors, feasibility checks, predicted post-state — is published in the TemperPolicy status with a hash over its inputs, and the dashboard displays it beside the savings it would realize. Applying it is an audited one-click action, and the approval is the plan hash: the dashboard returns the hash of the exact plan it displayed, and execution re-validates the full plan against live objects before the first mutation. If the fleet has drifted since the plan was computed, no mutation occurs.
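The hash-bound approval can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the field names, the canonical-JSON encoding, SHA-256, and the truncation to 16 hex characters are not taken from the controller, and the real re-validation checks the full plan against live objects rather than only comparing hashes.

```python
import hashlib
import json


def plan_hash(plan_inputs: dict) -> str:
    """Hash a canonical JSON encoding of the plan inputs (hypothetical scheme)."""
    canonical = json.dumps(plan_inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def may_execute(approved_hash: str, live_inputs: dict) -> bool:
    """Allow the first mutation only if the live fleet still matches the approval."""
    return plan_hash(live_inputs) == approved_hash


# Hypothetical plan inputs as displayed on the dashboard.
displayed = {
    "victim": "node-c",
    "survivors": ["node-a", "node-b"],
    "critical_requests_millicores": 1500,
}
approved = plan_hash(displayed)

# Fleet unchanged since the plan was shown: execution may proceed.
unchanged_ok = may_execute(approved, dict(displayed))

# Fleet drifted (a request changed after approval): no mutation occurs.
drifted = dict(displayed, critical_requests_millicores=2500)
drifted_ok = may_execute(approved, drifted)
```

Because the approval is the hash rather than a plan name or a timestamp, a stale click cannot authorize a plan other than the one the operator actually saw.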

Sequencing is ordered so that resident autoscalers cannot race the operation. Before the victim is cordoned, every survivor is annotated with karpenter.sh/do-not-disrupt, cluster-autoscaler.kubernetes.io/scale-down-disabled, and the Cast AI equivalent, so whichever autoscaler is resident is instructed not to disturb the nodes the plan depends on. The drain is tier-ordered (Background first, Critical last) and uses policy/v1 evictions only — PDB-honoring, never a force-delete — with a live SLO guard sampling protected-tier cgroup PSI between evictions. Degradation or a stuck replacement aborts the run and rolls back. The empty node is then reaped by the resident autoscaler (Mode A: zero cloud credentials in the controller — the emptiness detection that CA or Karpenter already runs deletes the instance).
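The drain loop described above might look roughly like the following sketch. The `evict` and `sample_psi` callables are hypothetical stand-ins for a policy/v1 Eviction request and a protected-tier cgroup PSI read; the intermediate tier name and the PSI limit are assumptions, not values from the controller.

```python
# Lower rank drains first. "Background" and "Critical" come from the text;
# the intermediate "Filler" tier name is an assumption.
TIER_RANK = {"Background": 0, "Filler": 1, "Critical": 2}


def drain_in_tier_order(pods, evict, sample_psi, psi_limit):
    """Evict pods lowest tier first, checking the SLO guard around each eviction.

    Returns (status, evicted_names); status is "drained" or "aborted".
    """
    evicted = []
    # sorted() is stable, so pods keep their relative order within a tier.
    for pod in sorted(pods, key=lambda p: TIER_RANK[p["tier"]]):
        if sample_psi() > psi_limit:
            return "aborted", evicted  # caller is expected to roll back
        evict(pod["name"])
        evicted.append(pod["name"])
    # Final guard sample after the last eviction.
    if sample_psi() > psi_limit:
        return "aborted", evicted
    return "drained", evicted


pods = [
    {"name": "memcached-0", "tier": "Critical"},
    {"name": "batch-1", "tier": "Background"},
    {"name": "filler-1", "tier": "Filler"},
]

# Healthy run: PSI stays at zero throughout.
healthy = drain_in_tier_order(pods, evict=lambda name: None,
                              sample_psi=lambda: 0.0, psi_limit=5.0)

# Degrading run: the third guard sample exceeds the (hypothetical) limit.
readings = iter([0.0, 0.0, 12.5])
degraded = drain_in_tier_order(pods, evict=lambda name: None,
                               sample_psi=lambda: next(readings), psi_limit=5.0)
```

In the degrading run the guard trips before the Critical pod is touched, which is the point of draining the protected tier last.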

Results: automated 3→2 reduction

The acceptance run automated a node reduction previously performed manually (cordon, drain, delete) on the same cluster shape — 3× e2-standard-4, with memtier→memcached as the latency-critical primary and filler and background tiers alongside.

3 → 2 nodes: plan 6112738e9b5c9d63, dashboard-approved; victim drained with 9 evictions, tier-ordered, PDB-honoring (source: docs/training-artifacts/consolidation/records/plan.json)

13 m 17 s: approval to node removed by cluster-autoscaler (797 s in the status record); the controller deleted no infrastructure itself (source: docs/training-artifacts/consolidation/records/final-status.json)

$97.82/mo: one e2-standard-4 returned to the pool, at list price (source: docs/training-artifacts/consolidation/REPORT.md)

The safety-relevant measurement is the primary workload’s tail latency through the operation:

Window	p99 (ms)	Samples
Baseline, 3 nodes (pre-approval)	6.111 / 6.111	2
During the drain	6.079	1
Awaiting reap (workloads settled on survivors)	6.015 – 6.143	12
After reap, 2 nodes	5.951 / 6.111 / 6.111	3

The p99 stayed flat within sample noise end to end: through nine evictions, resettlement onto two survivors, and the node deletion. The co-location the plan predicted to be safe was measured to be safe.

source: docs/training-artifacts/consolidation/records/p99-timeline.jsonl
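As a quick check on the table, the arithmetic below computes the spread of the reported p99 values. It uses only the numbers shown above. For the awaiting-reap window only the minimum and maximum of the 12 samples are reported, so only those two endpoints enter the calculation.

```python
# p99 readings in ms, transcribed from the table.
baseline = [6.111, 6.111]
during_drain = [6.079]
awaiting_reap_range = (6.015, 6.143)  # min and max of 12 samples
after_reap = [5.951, 6.111, 6.111]

baseline_p99 = sum(baseline) / len(baseline)
observed = baseline + during_drain + list(awaiting_reap_range) + after_reap

# Total spread between the lowest and highest reported reading.
spread_ms = round(max(observed) - min(observed), 3)

# Highest reading relative to the pre-approval baseline, in percent.
worst_increase_pct = round(100 * (max(observed) - baseline_p99) / baseline_p99, 2)
```

The spread comes to 0.192 ms, and the highest reading sits about 0.5% above the baseline, which is consistent with the description of the p99 as flat.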

Reaping a cordoned node: taint escalation

One protocol detail is relevant to any system that delegates node deletion to a resident autoscaler. After the drain, the victim node sat empty and GKE’s cluster-autoscaler did not reap it: CA interprets a cordoned node as an operator statement and excludes it from scale-down. The controller therefore escalates: at +5 minutes it replaces the cordon with a temper.codes/consolidating:NoSchedule taint and uncordons the node — still unschedulable for workloads, no longer an operator statement. In the measured run, CA marked the node DeletionCandidate within a minute of the escalation and deleted it after its standard 10-minute unneeded window; the timestamps are in the run record.
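A minimal sketch of the escalation step, under stated assumptions: the function name is hypothetical, the node is a plain dict shaped like a Kubernetes Node object, and the 5-minute wait is measured from a reference point the caller chooses (the text does not say which event starts the clock). The returned dict is the kind of spec patch a client would send; sending it is out of scope here.

```python
from typing import Optional

ESCALATE_AFTER_S = 5 * 60
CONSOLIDATING_TAINT = {"key": "temper.codes/consolidating", "effect": "NoSchedule"}
# Kubernetes itself adds this taint to cordoned nodes and removes it on uncordon.
CORDON_TAINT_KEY = "node.kubernetes.io/unschedulable"


def escalation_patch(node: dict, seconds_waiting: float) -> Optional[dict]:
    """Return a spec patch that swaps the cordon for the taint, or None if not due."""
    spec = node.get("spec", {})
    if not spec.get("unschedulable", False):
        return None  # not cordoned: nothing to escalate
    if seconds_waiting < ESCALATE_AFTER_S:
        return None  # give the autoscaler its initial chance to reap
    # Keep unrelated taints; drop the cordon taint and any stale copy of ours.
    kept = [t for t in spec.get("taints", [])
            if t.get("key") not in (CORDON_TAINT_KEY, CONSOLIDATING_TAINT["key"])]
    kept.append(dict(CONSOLIDATING_TAINT))
    return {"spec": {"unschedulable": False, "taints": kept}}


victim = {"spec": {"unschedulable": True,
                   "taints": [{"key": CORDON_TAINT_KEY, "effect": "NoSchedule"}]}}

too_early = escalation_patch(victim, seconds_waiting=120)
due = escalation_patch(victim, seconds_waiting=300)
```

The node remains unschedulable for workloads in both states; only the mechanism that makes it so changes, which is what lets the autoscaler treat it as an ordinary empty node.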

Guard-triggered abort

A separate run on the same cluster demonstrates the abort path under real conditions. The run (plan 8605baef2fb7ccb3, also dashboard-applied and audit-logged) drained the victim cleanly, but the replacement pods for the measurement clients failed to schedule: the clients carried required anti-affinity, the kube-scheduler preempted them to place an evicted Critical pod (priority 1000000 preempts 100000; required anti-affinity is enforced symmetrically), and on the two-node end state their replacements had no legal placement. The pending-replacement guard aborted the operation at its 5-minute threshold, automatically uncordoned the victim, removed every protection the controller had added, and left zero annotation residue (records/abort-pending-replacement.json). After the client anti-affinity was relaxed to preferred, a subsequent plan executed to completion. In both runs the failure direction was the designed one: the worst observed outcome was a forgone reduction.
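The pending-replacement guard and the rollback it triggers can be sketched as follows. The function names, pod names, and the action tuples are hypothetical; the two annotation keys are the ones named earlier in this article, and the Cast AI key is omitted because the text does not give it.

```python
PENDING_LIMIT_S = 5 * 60

PROTECTION_ANNOTATIONS = [
    "karpenter.sh/do-not-disrupt",
    "cluster-autoscaler.kubernetes.io/scale-down-disabled",
]


def stuck_replacements(pending_since: dict, now: float,
                       limit_s: float = PENDING_LIMIT_S) -> list:
    """Names of replacement pods that have been Pending for at least limit_s."""
    return sorted(name for name, since in pending_since.items()
                  if now - since >= limit_s)


def rollback_actions(victim: str, survivors: list) -> list:
    """Ordered undo list: uncordon the victim, then strip every added protection."""
    actions = [("uncordon", victim)]
    for node in survivors:
        for key in PROTECTION_ANNOTATIONS:
            actions.append(("remove-annotation", node, key))
    return actions


# Hypothetical state: one client replacement Pending since t=0, another since t=120.
pending_since = {"memtier-client-1": 0.0, "memtier-client-2": 120.0}

stuck = stuck_replacements(pending_since, now=300.0)
undo = rollback_actions("node-c", ["node-a", "node-b"]) if stuck else []
```

Deriving the undo list from the same constants used to add the protections is one way to make "zero annotation residue" a structural property rather than something to verify after the fact.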

Limitations and operational notes

The engine does not plan without measurements. If the agent’s observation layer is disabled or unreachable, no plan is emitted; the engine never falls back to requests-only estimation, because a consolidation decision without measured usage and protected-tier PSI has no defensible safety basis.
Node-level signals are not valid inputs on a Temper node. Because the Background tier absorbs idle capacity by design, node-level CPU PSI and total utilization are uninformative: in this session the node-level PSI read 78 while the critical cgroup’s PSI was 0.0, and total utilization sits near 100% on a healthy node. The engine gates on protected-tier cgroup PSI and non-Background measured usage. Node-level dashboards will show near-saturation on healthy Temper nodes; evaluation should use protected-tier metrics.
The placement heuristic is deliberately conservative. It predicts placement proportionally to free capacity and refuses fleets that manual analysis might pack further. In this session it refused to plan while critical requests exceeded the whole-core budget of the 2-core nodes (critical-whole-core: predicted 3 cores > 1) and produced a plan only after the fleet was right-sized. The asymmetry is intentional: refusing a feasible reduction costs money; approving an infeasible one costs an SLO.
The results above are from a single acceptance run and a single abort run on one cluster shape.
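The refusal behavior in the notes above can be summarized in a small sketch. The function, the field names, the reason strings other than critical-whole-core, and the PSI limit are assumptions for illustration; the numbers in the example (node-level PSI 78, critical PSI 0.0, predicted 3 cores against a budget of 1) are the ones quoted above.

```python
from typing import Optional


def refusal_reason(observation: Optional[dict], psi_limit: float,
                   whole_core_budget: int) -> Optional[str]:
    """Return why no plan is emitted, or None if planning may proceed."""
    if observation is None:
        # No fallback to requests-only estimation.
        return "no-measurements"
    # Node-level PSI is deliberately ignored; only the protected tier gates.
    if observation["protected_psi"] > psi_limit:
        return "protected-tier-pressure"
    if observation["predicted_critical_cores"] > whole_core_budget:
        return "critical-whole-core"
    return None


before = {"node_psi": 78.0, "protected_psi": 0.0, "predicted_critical_cores": 3}
# Hypothetical post-right-sizing prediction that fits the budget.
after = {"node_psi": 78.0, "protected_psi": 0.0, "predicted_critical_cores": 1}

no_agent = refusal_reason(None, psi_limit=5.0, whole_core_budget=1)
oversized = refusal_reason(before, psi_limit=5.0, whole_core_budget=1)
right_sized = refusal_reason(after, psi_limit=5.0, whole_core_budget=1)
```

Every branch that is uncertain returns a refusal, so the sketch errs toward a forgone reduction in the same way the engine is described as doing.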

Raw records

docs/training-artifacts/consolidation/REPORT.md
docs/training-artifacts/consolidation/records/ (plan, final status, phase log, p99 timeline, audit export, abort record, node/pod snapshots, dashboard screenshots)
docs/design/consolidator.md (design document)
docs/training-artifacts/karpenter/REPORT.md (the scale-up side: −40% provisioned vCPUs under Karpenter)

These are committed benchmark records in the product repository; design partners receive the full artifact tree.