SLO-Driven Kubernetes Capacity Planning: Ending Cluster Sprawl in 90 Days

Kubernetes sprawl is common: dozens of clusters, uneven node sizes, and opaque costs. This post shows how to replace infrastructure-first tuning with SLO-driven capacity planning that most enterprises can pilot in 90 days—without rewriting applications.

[Traffic] -> [SLOs] -> [Policies] -> [Schedulers/Autoscalers] -> [Cloud Spend]
                 ^
[Observability]--+-> SLI metrics & cost signals

Why cluster sprawl happens 📦

- Teams share no definition of "good enough" performance.
- CPU/memory requests are set once and never revisited; stale limits then trigger CPU throttling or OOM kills.
- Environment drift (dev/stage/prod) hides true headroom and regression risk.
- Cost signals are detached from SRE dashboards, so teams can't see $/user or $/request.

Measure what matters 📈

Anchor capacity to service objectives, not node counts.

Service SLOs
- Latency: p95 under a threshold tied to user journeys (e.g., checkout p95 ≤ 300 ms during peak).
- Availability: 99.9% for critical APIs; error budget managed per release.
SLIs and resource signals
- CPU saturation, memory pressure (RSS vs. requests), queue depth, GC pauses, tail latency.
- Platform signals: node headroom %, pending pods, eviction rate, image pull time.
Cost signals
- Cost per RPS or per batch job, spot/on‑demand mix, egress and storage IOPS.
- Team/namespace allocation with OpenCost or cloud billing exports + tags/labels.
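
To make the availability and cost signals above concrete, here is a minimal Python sketch of the arithmetic behind them. The function names, the 30-day window, and the sample figures are illustrative assumptions, not output from OpenCost or any billing export.

```python
"""Sketch: turn an availability SLO and a cost figure into comparable signals.

All names and sample numbers are illustrative assumptions.
"""


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def cost_per_million_requests(cost_usd: float, requests: int) -> float:
    """Unit cost of a service; cost and request count must cover the same period."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return cost_usd / requests * 1_000_000


if __name__ == "__main__":
    # Hypothetical service: 99.9% SLO, $1,200 of allocated spend, 48M requests.
    print(round(error_budget_minutes(0.999), 1))
    print(round(cost_per_million_requests(1200.0, 48_000_000), 2))
```

A 99.9% target over 30 days leaves roughly 43 minutes of budget, which is the number product owners and SREs can negotiate over instead of node counts.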

Tip: start with two services (one latency-sensitive, one batch) to prove the loop before scaling out. ✅

A 90-day rollout plan 🧭

Weeks 1–2: Baseline and accountabilities
- Map services → namespaces → owners; enforce labels (team, service, env, cost-center).
- Enable cost allocation (OpenCost or billing export) and SLI pipelines in your observability stack.
- Deliverable: current-state scorecard (SLOs present?, request/limit coverage, $/service).
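
The label check behind the scorecard can be sketched in a few lines of Python. The required keys come from the list above; the input shape (namespace name mapped to a label dict) and the function names are assumptions for illustration, and in practice the data would come from the cluster API or a policy engine.

```python
# Label keys taken from the rollout plan; everything else here is hypothetical.
REQUIRED_LABELS = ("team", "service", "env", "cost-center")


def missing_labels(labels: dict) -> list:
    """Return required label keys that are absent or blank, in a stable order."""
    return [key for key in REQUIRED_LABELS if not str(labels.get(key, "")).strip()]


def label_scorecard(namespaces: dict) -> dict:
    """Map each non-compliant namespace to the labels it still needs."""
    report = {}
    for name, labels in sorted(namespaces.items()):
        gaps = missing_labels(labels)
        if gaps:
            report[name] = gaps
    return report
```

Running this against a namespace export gives a concrete to-do list per owner rather than a pass/fail flag.
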
Weeks 3–4: Define SLOs and error budgets
- Draft 1–2 SLOs per pilot service (availability + latency); review with product owners.
- Add burn-rate alerts (fast/slow). Tie paging to user impact, not just CPU alarms.
- Deliverable: SLOs in code (Git), dashboards and alerts live.
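
As a rough illustration of fast/slow burn-rate alerting, the sketch below computes a burn rate and requires both a long and a short window to exceed the threshold. The thresholds (14.4 and 6) are commonly cited values for a 30-day window and are an assumption here, as are all names; tune them to your own SLO window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRule:
    name: str
    threshold: float  # burn-rate multiple that triggers the rule


# Assumed thresholds for a 30-day SLO window.
FAST = BurnRule("fast", 14.4)  # about 2% of the budget burned in 1 hour
SLOW = BurnRule("slow", 6.0)   # about 5% of the budget burned in 6 hours


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("slo must be below 1.0")
    return error_ratio / budget


def should_alert(rule: BurnRule, long_window_errors: float,
                 short_window_errors: float, slo: float) -> bool:
    """Fire only when both windows exceed the threshold, which limits flapping."""
    return (burn_rate(long_window_errors, slo) >= rule.threshold
            and burn_rate(short_window_errors, slo) >= rule.threshold)
```

The short window makes the alert reset quickly once the incident is over, while the long window keeps brief blips from paging anyone.
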
Weeks 5–6: Rightsize containers 🛠️
- Turn on Vertical Pod Autoscaler (recommendation mode) and capture 7 days of data.
- Run synthetic or replayed load; set initial requests to the p50 recommendation and limits to 1.5–2× the request for CPU-bound services, higher for bursty ones.
- Deliverable: PRs adjusting requests/limits; reduced throttling and OOMs.
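
The request/limit rule of thumb above can be expressed as a small helper. This sketch uses raw CPU usage samples as a stand-in for VPA output, which is an assumption; a real job would read the VPA recommendation instead. The function name and the millicore unit are also illustrative.

```python
import statistics


def suggest_cpu(usage_millicores: list, limit_factor: float = 2.0) -> dict:
    """Suggest request = median (p50) usage and limit = request * limit_factor.

    limit_factor of 1.5 to 2.0 follows the guidance for CPU-bound services;
    bursty services would pass a higher value.
    """
    if not usage_millicores:
        raise ValueError("need at least one usage sample")
    if limit_factor < 1.0:
        raise ValueError("a limit below the request is invalid")
    request = int(round(statistics.median(usage_millicores)))
    return {"request_m": request, "limit_m": int(round(request * limit_factor))}
```

For example, `suggest_cpu([100, 200, 300, 400, 1000], limit_factor=1.5)` proposes a 300m request with a 450m limit; the outlier at 1000m does not move the median, which is why the limit still needs headroom.
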
Weeks 7–8: Autoscale to SLOs, not averages
- Configure HPA with custom or external metrics (e.g., queue depth or p95 latency proxy), not just CPU%.
- Tune Cluster Autoscaler or Karpenter through node groups (NodePools in Karpenter) and priorities; reserve a small on-demand baseline and use spot for burst where safe.
- Deliverable: autoscaling policies that hold SLOs during a controlled load step.
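
To see how a non-CPU metric drives scaling, the sketch below reproduces the replica calculation documented for the Horizontal Pod Autoscaler: desired = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default). The function name, the replica bounds, and the per-pod queue-depth example are assumptions for illustration.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 50, tolerance: float = 0.1) -> int:
    """Replica count for a per-pod metric such as queue depth per pod."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: do not scale
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

With 4 replicas, a target of 30 queued items per pod, and 90 observed, the function asks for 12 replicas. Working through a few values like this before the load step helps sanity-check the target you put in the HPA spec.
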
Weeks 9–10: Quotas and multi-tenancy guardrails
- Apply LimitRange and ResourceQuota per namespace; adopt PriorityClasses for critical paths.
- Enforce pod disruption budgets; use topology spread to avoid noisy-neighbor hotspots.
- Deliverable: guardrails preventing runaway usage and protecting critical workloads.
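
The arithmetic a CPU request quota enforces can be sketched as follows. This only mimics the bookkeeping for illustration; real enforcement happens in the API server at admission time. The helper names are hypothetical, and the parser handles only plain and millicore CPU quantities.

```python
def parse_cpu(quantity: str) -> int:
    """Convert a CPU quantity such as '500m', '2', or '0.5' to millicores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def fits_quota(existing_requests: list, new_request: str, quota: str) -> bool:
    """True if the namespace's summed CPU requests stay within the quota."""
    used = sum(parse_cpu(q) for q in existing_requests)
    return used + parse_cpu(new_request) <= parse_cpu(quota)
```

A dry run like `fits_quota(["500m", "1"], "750m", "2")` returns `False`, which is a cheap way to preview which deployments a proposed quota would block before you apply it.
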
Weeks 11–12: Decommission & steady state
- Consolidate node groups; remove idle nodes and unused instance types.
- Archive unused images; retire legacy clusters/services identified in week 1.
- Schedule monthly SLO/cost reviews; automate a "rightsizing" job based on VPA recs.
- Deliverable: documented runbook and quarterly review cadence.
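
One possible shape for the recurring rightsizing job is a drift check that proposes a change only when the recommendation differs materially from the current request. The 20% threshold, the input shape, and the function name are assumptions; the recommended values would come from VPA in practice.

```python
def rightsizing_actions(workloads: dict, drift_threshold: float = 0.2) -> dict:
    """Return workload -> recommended CPU request (millicores) where drift is large.

    workloads maps a name to (current_request_m, recommended_m).
    """
    actions = {}
    for name, (current, recommended) in sorted(workloads.items()):
        if current <= 0:
            # No usable request is set, so always propose the recommendation.
            actions[name] = recommended
            continue
        drift = abs(recommended - current) / current
        if drift > drift_threshold:
            actions[name] = recommended
    return actions
```

Keeping a threshold avoids a stream of trivial PRs; the job's output can feed the monthly SLO/cost review rather than being applied blindly.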

Definition of done 🎯

- SLOs declared and enforced for pilot services.
- Requests/limits coverage ≥ 90% of pods; fewer throttling/OOM events than the weeks 1–2 baseline.
- Dashboards show SLI, autoscaler decisions, and $/service; alerts tied to burn rate.
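
The coverage criterion can be checked mechanically. This sketch assumes a pod counts as covered only when every container sets both requests and limits, which is one reasonable reading of the target; the input shape (a list of pods, each a list of container resource dicts) and the names are hypothetical.

```python
def container_covered(resources: dict) -> bool:
    """A container is covered when it declares both requests and limits."""
    return bool(resources.get("requests")) and bool(resources.get("limits"))


def coverage(pods: list) -> float:
    """Fraction of pods whose containers all declare requests and limits."""
    if not pods:
        return 0.0
    covered = sum(
        1 for containers in pods
        if containers and all(container_covered(c) for c in containers)
    )
    return covered / len(pods)


def meets_target(pods: list, target: float = 0.90) -> bool:
    """Compare coverage with the 90% definition-of-done threshold."""
    return coverage(pods) >= target
```

Counting a pod as uncovered when any sidecar lacks limits is deliberately strict, since a single unbounded container can still starve its neighbors.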

Common pitfalls and guardrails ⚠️

Pitfall: Over-tight CPU limits cause throttling that inflates p95 latency.
- Guardrail: Start with generous CPU limits; tighten after observing stable latency.
Pitfall: HPA on CPU for I/O-bound services.
- Guardrail: Use queue depth or latency SLI via custom metrics.
Pitfall: VPA and HPA fighting.
- Guardrail: Keep VPA in recommendation mode (updateMode "Off") when HPA scales the same workload on CPU or memory; apply recs via CI.
Pitfall: Aggressive overcommit leads to eviction storms during failover.
- Guardrail: Cap node overcommit; reserve headroom for DaemonSets and spikes.
Pitfall: Hidden costs (egress, storage IOPS) mask compute savings.
- Guardrail: Add egress and IOPS to the cost panel; set budgets and alerts per team.
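
For the first pitfall, a throttling ratio gives a direct signal of whether CPU limits are too tight. The sketch below derives it from two samples of the cAdvisor counters `container_cpu_cfs_periods_total` and `container_cpu_cfs_throttled_periods_total`; the function names and the 5% threshold are assumptions to adjust against your own latency data.

```python
def throttle_ratio(periods_start: float, periods_end: float,
                   throttled_start: float, throttled_end: float) -> float:
    """Share of CFS periods in which the container was throttled.

    Inputs are counter values sampled at the start and end of an interval.
    """
    periods = periods_end - periods_start
    throttled = throttled_end - throttled_start
    if periods <= 0:
        return 0.0  # no elapsed periods, or a counter reset: report nothing
    return max(0.0, throttled) / periods


def limit_too_tight(ratio: float, max_ratio: float = 0.05) -> bool:
    """Flag the container when throttling exceeds the assumed 5% threshold."""
    return ratio > max_ratio
```

Plotting this ratio next to p95 latency makes it easier to tell whether a latency regression comes from the limit or from the application itself.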

Closing

SLO-first capacity planning aligns Kubernetes with business outcomes. Start narrow, measure relentlessly, and automate guardrails. When teams can see the link between SLOs, autoscaler decisions, and cost, sprawl recedes and reliability becomes predictable.