
Kubernetes Upgrades, Done Right (Part 2): Technical Guide

A complete technical guide to automating, validating, and streamlining Kubernetes upgrades across EKS, AKS, and GKE — making upgrades fast, safe, and routine.

Younes Hairej · 10 min read

Introduction

Kubernetes upgrades often evolve into time-consuming, high-risk projects that strain platform and DevOps teams. This technical guide provides a step-by-step framework to automate and streamline upgrades, minimizing risk while improving operational efficiency across AWS EKS, Azure AKS, and Google GKE environments.

It builds on our Strategic Guide to Kubernetes Upgrades: where that guide focuses on organizational transformation and business value, this article offers hands-on frameworks, tools, and best practices that platform teams can use to operationalize upgrades efficiently across EKS, AKS, and GKE.

Tools That Make Kubernetes Upgrades Routine

Every successful Kubernetes upgrade strategy begins with the right toolkit. Below are essential tools that help transform upgrades from manual projects into automated, repeatable operations.

[Table: Essential Kubernetes upgrade tools]

Enterprise-Grade Tool Categories for Kubernetes Upgrades

While individual tools are critical, building an enterprise-grade Kubernetes upgrade strategy also requires a broader toolchain across observability, automation, testing, and security.

[Table: Kubernetes upgrade tool categories]

Integration and Workflow Tools

1. GitOps Tools (ArgoCD/Flux)

  • Maintain declarative configurations in git repositories
  • Automate the deployment of applications and infrastructure
  • Support progressive delivery patterns for upgrades (see the registration sketch after this list)

2. CI/CD Pipelines (Jenkins/GitHub Actions/GitLab CI)

  • Automate testing and validation of cluster changes
  • Integrate with approval workflows
  • Create audit trails of changes and approvals
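
For illustration, here is one way to register an application with Argo CD from the CLI so it is reconciled from Git during and after an upgrade. The repository URL, path, and application name below are placeholders; in practice most teams commit a declarative Application manifest instead.

# Register a placeholder application with Argo CD (names and URLs are hypothetical)
argocd app create platform-app \
  --repo https://github.com/example-org/platform-config.git \
  --path overlays/production \
  --dest-server https://kubernetes.default.svc \
  --dest-namespace production \
  --sync-policy automated

# Confirm the app reconciles cleanly after cluster changes
argocd app get platform-app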

Why Infrastructure as Code is Critical for Upgrades

Infrastructure as Code forms the backbone of a reliable Kubernetes upgrade strategy. By defining your infrastructure in code, you create:

  1. Versioned history of all cluster configurations
  2. Repeatable processes for creating and updating environments
  3. Reviewable changes through pull requests and code reviews
  4. Testing opportunities before applying to production

Terraform Implementation for EKS

Using Infrastructure as Code (IaC) creates a consistent, auditable approach to Kubernetes cluster management. Here's an example of how to implement this with Terraform for EKS:

# EKS cluster configuration with managed node groups
module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  cluster_name    = "my-cluster"
  cluster_version = "1.30"  # Target version
  
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  # Managed node group with update configuration
  eks_managed_node_groups = {
    primary = {
      name = "primary-node-group"
      instance_types = ["m5.large"]
      min_size     = 2
      max_size     = 10
      desired_size = 3
      
      # Control maximum unavailable nodes during upgrades
      update_config = {
        max_unavailable_percentage = 25
      }
      
      ami_type = "AL2023_x86_64_STANDARD"
      labels = { Environment = "production" }
    }
  }
  
  # Keep add-ons current
  cluster_addons = {
    coredns = { most_recent = true }
    kube-proxy = { most_recent = true }
    vpc-cni = { most_recent = true }
  }
}

Each cloud provider has specific approaches to Kubernetes management:

[Table: Cloud provider approaches to Kubernetes upgrades]

Key IaC Best Practices

  1. Explicit versioning - Always specify exact Kubernetes versions
  2. Module usage - Leverage community modules rather than building from scratch
  3. Separate concerns - Use different Terraform modules for different components
  4. State management - Use remote state with proper locking
  5. CI/CD integration - Automate plan and apply runs in your pipelines (see the sketch below)
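
As a minimal sketch of that last practice, a pipeline step might run plan and apply with the target version passed explicitly; the cluster_version variable name here is an assumption about your module's inputs.

# Hypothetical CI step: plan the upgrade, apply only after review
terraform init -input=false
terraform validate
terraform plan -input=false -out=upgrade.tfplan -var 'cluster_version=1.30'

# A separate, approval-gated job applies the reviewed plan
terraform apply -input=false upgrade.tfplan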

Zero-Downtime Upgrade Strategies

The technical approach to upgrades must balance risk mitigation with operational efficiency. Here are the key patterns to consider:

[Diagram: Kubernetes workload deployment strategies]

1. Canary Deployments (Best for EKS)

[Diagram: Canary deployment for Kubernetes upgrades]

This approach involves creating a new node group with the upgraded Kubernetes version while maintaining your existing node groups. It's particularly effective for EKS, where node groups provide a natural boundary for canary testing.

Implementation steps (a command-level sketch follows the list):

  1. Create a new node group with the target Kubernetes version
  2. Target specific workloads to the new node group
  3. Monitor application health on the new nodes using Prometheus and logs
  4. Gradually increase canary traffic by scaling the canary node group up and old node groups down
  5. Complete the migration once validation is successful
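
A minimal sketch of these steps on EKS, assuming the control plane has already been upgraded (new node groups then default to its version); cluster, node group, and application names are placeholders:

# Create a canary node group that picks up the upgraded control plane version
eksctl create nodegroup --cluster my-cluster --name canary-v130 \
  --node-type m5.large --nodes 2

# Steer a test workload onto the canary nodes with a label and nodeSelector
kubectl label nodes -l eks.amazonaws.com/nodegroup=canary-v130 upgrade-canary=true
kubectl -n production patch deployment sample-app \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"upgrade-canary":"true"}}}}}'

# Once validated, drain old nodes gradually to complete the migration
kubectl drain <old-node-name> --ignore-daemonsets --delete-emptydir-data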

2. Blue/Green Cluster Deployment

For maximum safety in regulated environments or when major version jumps are necessary:

  1. Create a completely new cluster with the target Kubernetes version
  2. Backup all resources from the old cluster using Velero
  3. Restore workloads to the new cluster
  4. Validate functionality thoroughly
  5. Switch traffic by updating DNS or load balancer configurations
  6. Decommission the old cluster once traffic has been successfully migrated

This approach eliminates in-place upgrade risks at the cost of additional infrastructure and coordination complexity; the Velero backup-and-restore step is sketched below.
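
A minimal sketch of that step, assuming Velero is installed in both clusters and pointed at the same backup storage location; the backup and restore names are placeholders:

# On the old cluster: back up all namespaces
velero backup create pre-upgrade-full --wait

# On the new cluster (same backup storage location): restore workloads
velero restore create pre-upgrade-restore --from-backup pre-upgrade-full --wait

# Confirm the restore completed without errors
velero restore describe pre-upgrade-restore --details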

3. Rolling Updates with Pod Disruption Budgets (PDBs)

For managed node groups in any cloud provider, Pod Disruption Budgets are essential to control workload availability during upgrades:

# Example PDB ensuring 70% minimum availability
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 70%
  selector:
    matchLabels:
      app: critical-service

When combined with the appropriate node group configuration, PDBs ensure that enough pods remain available during the upgrade process to maintain service continuity.

Implementation in node group settings (CLI equivalents are sketched after the list):

  • EKS: Use update_config.max_unavailable_percentage in node group configuration
  • AKS: Use max_surge in node pool settings
  • GKE: Use upgrade_settings.max_surge and upgrade_settings.max_unavailable
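
For illustration, the equivalent CLI calls are sketched below; resource group, cluster, and pool names are placeholders, and gcloud may additionally need --zone or --region:

# EKS: cap unavailable nodes at 25% during node group updates
aws eks update-nodegroup-config --cluster-name my-cluster \
  --nodegroup-name primary-node-group \
  --update-config maxUnavailablePercentage=25

# AKS: surge up to 33% extra nodes during node pool upgrades
az aks nodepool update --resource-group my-rg --cluster-name my-cluster \
  --name nodepool1 --max-surge 33%

# GKE: one surge node at a time, none unavailable
gcloud container node-pools update default-pool --cluster my-cluster \
  --max-surge-upgrade 1 --max-unavailable-upgrade 0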

Automated Validation Frameworks

A robust testing framework transforms "hope-based" upgrades into confident, validated processes:

[Diagram: Kubernetes upgrade validation workflow]

Pre-Upgrade Validation

Before initiating any upgrade, run these essential validations:

1. API Compatibility Check:

# Use Pluto to identify deprecated APIs
pluto detect-all-in-cluster -o wide

# Look specifically for critical workloads
kubectl get deployments,statefulsets -A -o yaml | pluto detect -

2. Security Validation:

# Run kube-bench against CIS benchmarks; by default it auto-detects the
# benchmark matching your Kubernetes version (pin one with --benchmark if needed)
kube-bench

3. Cluster Health Assessment:

# Check control plane health (kubectl get componentstatuses is deprecated since v1.19)
kubectl get --raw='/readyz?verbose'

# Verify all nodes are ready
kubectl get nodes

# Check for pending or failed pods
kubectl get pods --all-namespaces | grep -v Running

4. Application Readiness:

# Ensure PDBs exist for critical workloads
kubectl get pdb --all-namespaces

# Verify horizontal pod autoscalers
kubectl get hpa --all-namespaces

Post-Upgrade Validation

After the upgrade completes, implement comprehensive validation:

1. Core Kubernetes API Testing:

# Test creating temporary resources
kubectl run test-pod --image=busybox -- sleep 300
kubectl expose pod test-pod --port=80 --target-port=8080

# Clean up the temporary resources once the checks pass
kubectl delete service/test-pod pod/test-pod

2. Storage Validation:

# Test persistent volume provisioning
kubectl apply -f test-pvc.yaml
kubectl get pvc

3. Application Testing:

# Verify all deployments are available
kubectl get deployments --all-namespaces

# Check statefulsets
kubectl get statefulsets --all-namespaces

# Run application-specific health checks
./run-app-tests.sh

4. Smoke Testing:

After validating core Kubernetes resources and application health, it’s critical to run smoke tests that simulate real-world usage patterns.

Smoke testing helps catch hidden issues such as:

  • Networking misconfigurations
  • Latency spikes
  • Unexpected service errors
  • Application readiness regressions

Example approach using K6 for quick smoke testing:

k6 run smoke-test.js
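
The smoke-test.js above is whatever suits your services; as a minimal, hypothetical starting point (target URL, virtual users, and duration are placeholders):

# Write a minimal k6 smoke test inline (URL and load shape are placeholders)
cat > smoke-test.js <<'EOF'
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = { vus: 2, duration: '1m' };

export default function () {
  // Hit a representative endpoint on the upgraded cluster
  const res = http.get('https://app.example.com/healthz');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}
EOF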

Tip: Watch out for our upcoming comprehensive guide on using K6 for load and stress testing in Kubernetes environments!

Version Management

Managing Kubernetes version compatibility is critical to ensuring smooth upgrades and minimizing operational risks. This section provides a complete framework for handling version compatibility across the control plane, node groups, and workloads.

1. Compatibility Matrix

Different Kubernetes-related components, such as service meshes, monitoring tools, and certificate managers, often have strict version requirements tied to specific Kubernetes releases.
Here’s a sample compatibility matrix for common components with Kubernetes 1.30:

[Table: Component compatibility with Kubernetes 1.30]

Before upgrading, ensure all critical workloads are compatible with the target Kubernetes version.

2. Version Skew Management

Each cloud provider enforces rules about how far apart the control plane and worker node versions can be:

Command to Check Node Version Skew:

kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion | sort -k2

3. Maximum Version Skew Limits

Here’s a simple policy reminder you should enforce in your cluster management:

  • EKS: Control Plane 1.30 → Nodes must be 1.28, 1.29, or 1.30.
  • GKE: Control Plane 1.30 → Nodes must be 1.28, 1.29, or 1.30.
  • AKS: Control Plane 1.30 → Nodes should be 1.29 or 1.30 (at most one minor version behind).

Skew beyond the allowed limit can break cluster functionality or block upgrades.

4. End-of-Life (EOL) Dates

Always be aware of the support lifecycle for each Kubernetes minor version on your cloud platform. Staying ahead of EOL deadlines ensures that you avoid forced upgrades, extended support fees, and security risks.

  • Amazon EKS: Kubernetes 1.30 reaches End-of-Life (Standard Support) in July 2025.
  • Azure AKS: Kubernetes 1.30 reaches End-of-Life (Standard Support) in July 2025.
  • Google GKE: Kubernetes 1.30 reaches End-of-Life (Stable Channel Support) in July 2025.

Tip: Plan upgrades at least 3–6 months before EOL to avoid rushed migrations or extended support fees.

5. CNI Plugin and Networking Compatibility

Networking components are critical during Kubernetes upgrades, and CNI plugins must be compatible with the new Kubernetes version:

Amazon EKS:

  • Upgrade the vpc-cni plugin after upgrading the control plane.
  • Use aws eks describe-addon-versions --addon-name vpc-cni to check compatibility.

Azure AKS:

  • Upgrade Azure CNI as part of automatic node pool upgrades.

Google GKE:

  • GKE manages CNI automatically, but manual validation is recommended when using custom CNI plugins.

Always verify CNI versions and upgrade them before or immediately after node pool upgrades; an EKS-flavored sketch follows.
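
As an illustration, the commands below list vpc-cni versions compatible with the target Kubernetes version and then upgrade the add-on; the add-on version shown is a placeholder to replace with output from the first command:

# List vpc-cni versions compatible with the target Kubernetes version
aws eks describe-addon-versions --addon-name vpc-cni \
  --kubernetes-version 1.30 \
  --query 'addons[].addonVersions[].addonVersion'

# Upgrade the add-on to a compatible version (placeholder shown)
aws eks update-addon --cluster-name my-cluster --addon-name vpc-cni \
  --addon-version v1.18.1-eksbuild.1 --resolve-conflicts PRESERVE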

Troubleshooting Common Upgrade Issues

Even with careful planning, Kubernetes upgrades can encounter operational challenges. Below is a comprehensive guide to the most common cluster-specific issues, their root causes, recommended solutions, and key commands to resolve them quickly.

Common Upgrade Issues Within a Cluster

These are the most frequent problems that occur during Kubernetes upgrades within a single cluster environment, particularly in EKS, AKS, and GKE managed clusters.

[Table: Common upgrade issues within a cluster]

Cross-Cloud Kubernetes Upgrade Issues

When upgrading Kubernetes across different cloud providers, additional platform-agnostic issues can emerge. This table summarizes the most common cross-cloud upgrade problems and how to address them effectively.

[Table: Cross-cloud Kubernetes upgrade issues]

Proactive Prevention

The best approach is to prevent issues before they occur:

  1. Run pre-flight checks using EKS cluster insights or your provider's equivalent (a sketch follows this list)
  2. Test upgrades thoroughly in non-production environments
  3. Create backups before beginning any upgrade
  4. Document rollback procedures and test them before production upgrades
  5. Monitor actively during and after the upgrade process
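
On EKS, for example, cluster insights surface upgrade-blocking findings ahead of time; a minimal check, assuming a cluster named my-cluster:

# List upgrade-readiness findings from EKS cluster insights
aws eks list-insights --cluster-name my-cluster

# Drill into a specific finding (ID comes from the list output)
aws eks describe-insight --cluster-name my-cluster --id <insight-id>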

Future-Proofing Your Kubernetes Strategy

Looking ahead to Kubernetes 1.31 and beyond, prepare for these upcoming changes:

1. External Cloud Provider Migration

Kubernetes is removing in-tree cloud provider integrations. To prepare:

1. Test external cloud provider components in non-production

2. Update your Terraform configurations to include external controllers:

# AWS Cloud Controller Manager deployment
resource "kubernetes_deployment" "aws_cloud_controller_manager" {
  metadata {
    name      = "aws-cloud-controller-manager"
    namespace = "kube-system"
  }
  # Configuration details...
}

2. Removal of Deprecated Flags

Flags like --keep-terminated-pod-volumes are being removed. Audit your configurations:

# Check for usage in your configurations
grep -r "keep-terminated-pod-volumes" ./k8s-manifests/

# Review kubelet configurations
kubectl get cm -n kube-system kubelet-config -o yaml

3. Container Storage Interface (CSI) Migration

In-tree volume plugins are moving to CSI implementations:

# Verify CSI drivers are installed and registered
kubectl get pods -n kube-system | grep csi
kubectl get csidrivers

# Check for StorageClass configurations using CSI
kubectl get storageclass

Implementing a Comprehensive Upgrade Strategy

Bringing all the elements together, here's a comprehensive approach to making Kubernetes upgrades business-as-usual:

1. Establish a Regular Cadence

Implement a predictable upgrade schedule:

  • Kubernetes minor versions: Every 6 months
  • Patch versions: Monthly or as needed for security
  • Establish clear dates for standard support expiration

2. Define Your Rollout Strategy

For most enterprises, a phased approach works best:

  1. Dev environments: Upgrade immediately when a new version reaches general availability
  2. Test/QA environments: Upgrade 1 month after dev validation
  3. Production: Upgrade 1-2 months after test validation

3. Automate Everything

To truly transform Kubernetes upgrades from risky projects into routine operations, automation must be embedded across every stage of the upgrade lifecycle. The goal is simple: no manual steps, no guesswork, no surprises.

Focus on automating the following (a combined sketch follows the list):

Discovery and Validation:

  • Use tools like Pluto, kube-bench, and custom scripts to automatically detect deprecated APIs, security gaps, and readiness issues before upgrades.

Backup and Recovery:

  • Automate cluster snapshots and application backups with Velero before every control plane and node group upgrade.

Infrastructure as Code (IaC):

  • Manage cluster configurations, node groups, and add-ons entirely through Terraform, Pulumi, or AWS CDK.
  • Enable GitOps pipelines (ArgoCD, Flux) to manage application deployments during upgrades.

Strategic Rollouts:

  • Implement automated canary deployments, blue/green clusters, and rolling updates using scripts and CI/CD workflows.

Testing Frameworks:

  • Deploy pre-upgrade and post-upgrade validation jobs automatically after each upgrade phase.
  • Integrate smoke testing and workload-specific health checks into your CI/CD pipelines.

Monitoring and Observability:

  • Automate health monitoring using Prometheus alerts and Grafana dashboards tied to upgrade phases.

Governance Integration:

  • Automatically generate change requests (e.g., ServiceNow) and update ITSM tickets as part of the upgrade process.
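
To make this concrete, a minimal pre-upgrade gate might chain a few of the checks above into a single script; tool choices mirror those discussed earlier and the backup name is a placeholder:

#!/usr/bin/env bash
# Hypothetical pre-upgrade gate: fail fast, then back up, then plan
set -euo pipefail

# 1. Surface deprecated APIs before the control plane moves
pluto detect-all-in-cluster -o wide

# 2. Snapshot cluster state with Velero before any change
velero backup create "pre-upgrade-$(date +%Y%m%d)" --wait

# 3. Produce the infrastructure plan for human review
terraform plan -input=false -out=upgrade.tfplan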

Conclusion

By treating Kubernetes upgrades as a continuous, automated process rather than isolated projects, platform teams can dramatically reduce risk, improve cluster health, and drive faster innovation cycles. A robust upgrade strategy today is the foundation for operational excellence tomorrow. Start small, automate everything, and make upgrades boring — that's the goal.

Community and Support Resources

A successful Kubernetes strategy relies on continuous learning and community engagement. Essential starting points for platform engineering teams include each cloud provider's official documentation (EKS, AKS, and GKE), Kubernetes community resources, and technical learning materials.

Looking for the business case and organizational strategies behind seamless Kubernetes upgrades?
Explore Part 1: Strategic Guide to complete the full playbook.

Ready to simplify your Kubernetes upgrades?
Book a Demo or Contact Us to learn how Aokumo can help automate and streamline your operations.