November 23, 2025
10 min read
Ian Lintner

Production-Ready AKS with Istio: Architecture Decisions, Trade-offs, and Implementation


Building production Kubernetes infrastructure is fundamentally an exercise in architectural trade-offs. While deploying a basic cluster is straightforward, creating a platform that scales from dozens to thousands of services—while maintaining security, observability, and developer velocity—requires careful consideration of competing priorities.

This guide examines the technical and organizational decisions behind building production-grade Azure Kubernetes Service (AKS) infrastructure with Istio service mesh. We'll explore not just how to implement these systems, but why specific architectural choices matter, their cost implications, and when alternatives might be more appropriate.

Target Audience

This guide is written for:

  • Staff+ Engineers evaluating service mesh adoption and platform architecture
  • Technical Leads making build-vs-buy decisions for infrastructure components
  • Platform Engineers implementing multi-tenant Kubernetes platforms
  • SREs responsible for production reliability and incident response
  • Engineering Managers balancing technical debt against feature velocity

What You'll Learn

Technical Implementation:

  • Production-ready AKS cluster architecture with managed Istio
  • Zero-downtime certificate management and rotation
  • Multi-environment DNS and subdomain routing strategies
  • Observability and debugging patterns for service mesh traffic

Architectural Decision-Making:

  • Service mesh trade-offs: Istio vs. alternatives (Linkerd, Consul, Cilium)
  • When to use managed vs. self-hosted Istio
  • Cost analysis: Managed services vs. DIY infrastructure
  • Security considerations: mTLS, network policies, zero-trust architecture
  • Multi-cluster vs. single-cluster strategies

Operational Maturity:

  • Failure modes and mitigation strategies
  • Performance implications of sidecar proxies
  • Capacity planning and resource optimization
  • Incident response patterns for mesh-related outages

🏗️ Architecture Overview and Design Decisions

Before diving into implementation, let's examine the architectural decisions that shape production Kubernetes platforms.

Why Service Mesh? The Trade-offs

Service mesh adoption represents a significant architectural commitment. Understanding when it adds value—and when it doesn't—is crucial.

Service Mesh Benefits:

  • Traffic Management: Advanced routing (canary, blue-green), retries, timeouts, circuit breaking
  • Security: Automatic mTLS between services, zero-trust networking
  • Observability: Distributed tracing, metrics, and traffic visualization without code changes
  • Policy Enforcement: Centralized traffic policies, rate limiting, authentication

Service Mesh Costs:

  • Complexity: Additional operational burden, learning curve, debugging complexity
  • Performance: Sidecar proxy overhead (CPU, memory, latency ~2-5ms per hop)
  • Resource Usage: ~50-100MB memory per sidecar, CPU overhead for encryption
  • Blast Radius: Control plane outages can impact all mesh traffic

Decision Framework:

| Use Service Mesh When... | Skip Service Mesh When... |
| --- | --- |
| Operating 50+ microservices | Running monolith or <10 services |
| Need mTLS without app changes | Can enforce security at app layer |
| Require advanced traffic routing | Simple load balancing suffices |
| Multiple teams need traffic policies | Single team owns all services |
| Observability gaps exist | APM tools provide sufficient insight |

Istio vs. Alternatives: Comparative Analysis

Why We Chose Istio for This Architecture:

  1. Azure Managed Add-on: Reduces operational burden (control plane managed by Microsoft)
  2. Feature Completeness: Comprehensive traffic management, security, observability
  3. Ecosystem Maturity: Extensive tooling, documentation, community support
  4. Enterprise Adoption: Proven at scale (Google, eBay, T-Mobile)

Cost Comparison (Rough Monthly Estimates for 10-Service Cluster):

| Component | Managed Istio (AKS) | Self-Hosted Istio | Linkerd | No Mesh |
| --- | --- | --- | --- | --- |
| Control Plane | $0 (included) | $100-200 | $100-200 | $0 |
| Sidecar Overhead | ~$50-75 | ~$50-75 | ~$25-40 | $0 |
| Monitoring | $100-150 | $100-150 | $75-100 | $50-75 |
| Operations Time | 5-10 hrs/mo | 20-40 hrs/mo | 10-20 hrs/mo | 5 hrs/mo |
| Total | ~$150-235 + ops | ~$250-425 + ops | ~$200-340 + ops | ~$50-75 + ops |

Note: Operations time valued at $150-200/hr for staff engineer equivalency

Architecture Decision Records (ADRs)

ADR-001: Managed Istio vs. Self-Hosted

Context: Need service mesh capabilities without dedicated SRE team.

Decision: Use Azure-managed Istio add-on for AKS.

Rationale:

  • Microsoft manages control plane (istiod) upgrades and patching
  • Reduced operational complexity (no manual version management)
  • SLA backed by Azure support
  • Team can focus on application logic vs. infrastructure

Consequences:

  • Limited control over Istio version (follows Azure release cycle)
  • Cannot customize control plane configuration deeply
  • Dependent on Azure for mesh reliability
  • May lag behind upstream Istio releases by 1-2 versions

Alternatives Considered:

  • Self-hosted Istio: More control, more operational burden
  • Linkerd: Simpler but fewer features
  • No mesh: Lower complexity but manual mTLS, observability
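
One practical check on the version-lag consequence above: the add-on's installed revision can be inspected from the Azure CLI. A quick sketch (exact query paths and output can vary by CLI version; placeholders are illustrative):

# Inspect the managed Istio add-on profile and installed revision(s)
az aks show \
  --resource-group <resource-group> \
  --name <cluster-name> \
  --query "serviceMeshProfile" -o json

# List the Istio revisions Azure currently offers for the add-on in a region
az aks mesh get-revisions --location <region> --output table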

ADR-002: DNS Management in Azure DNS vs. External Provider

Context: Need reliable DNS with low operational overhead.

Decision: Use Azure DNS for domain management.

Rationale:

  • Native integration with AKS and Azure resources
  • 100% SLA with zone redundancy
  • Low latency queries via Azure edge network
  • Cost-effective ($0.50/zone/month + $0.40/million queries)
  • Infrastructure-as-code via Terraform/Bicep

Consequences:

  • Vendor lock-in to Azure ecosystem
  • Migration complexity if moving clouds
  • All DNS management requires Azure access

Cost Analysis:

  • Azure DNS: ~$1-5/month for typical workload
  • Route 53 (AWS): ~$1-5/month + cross-cloud complexity
  • Cloudflare: Free tier sufficient but adds dependency

Complete System Architecture

Key Architectural Patterns:

  1. Defense in Depth: DDoS protection → Load balancer → Istio Gateway → Network policies
  2. Separation of Concerns: Ingress layer isolated from application layer
  3. Observability by Default: Every service instrumented without code changes
  4. Cost Optimization: Spot instances for non-critical workloads, right-sized nodes
  5. High Availability: Multi-replica deployments, pod anti-affinity, zone distribution

Monthly Cost Breakdown (Rough Estimates):

  • Compute: $220 (3x D2s_v3) + $50 (spot) = $270
  • Control Plane: $0 (free tier) or $73 (99.95% SLA)
  • Networking: $18.25 (load balancer) + ~$10 (bandwidth)
  • Storage: $30 (database) + $10 (disks)
  • Container Registry: $5 (basic) or $167 (premium with geo-replication)
  • Monitoring: $110 (Prometheus, Grafana, Jaeger storage)
  • DNS: $1
  • Total: ~$444/mo (free tier) or ~$517/mo (with SLA)

Scales to ~$800-1200/mo at 10 services, ~$2000-3000/mo at 50 services


📋 Prerequisites and Team Readiness

Technical Prerequisites

  • Azure account with Owner or Contributor + User Access Administrator role
  • Azure CLI (az) version 2.50.0+
  • kubectl version matching cluster (typically 1.28+)
  • helm v3.12+ for package management
  • Registered domain name with ability to change name servers
  • Azure subscription with sufficient quota:
    • 12+ vCPUs (preferably D-series or E-series)
    • Standard Load Balancer quota
    • Public IP quota
  • Understanding of Kubernetes concepts (pods, services, deployments, namespaces)
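
A quick way to sanity-check the quota items above before creating anything; a sketch (table output and family names vary slightly by CLI version and region):

# vCPU quota for the DSv3 family in your target region (example region: eastus)
az vm list-usage --location eastus --output table | grep -i "DSv3"

# Public IP quota in the same region
az network list-usages --location eastus --output table | grep -i "public ip"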

Organizational Readiness Assessment

Before implementing service mesh, assess your team's readiness:

✅ Good Fit:

  • Team has 2+ engineers with Kubernetes production experience
  • Running 20+ microservices or planning to reach that scale
  • Need centralized traffic policies across services
  • Require mTLS without modifying application code
  • Budget supports ~$500-1000/mo infrastructure costs
  • Dedicated platform or SRE team available (even part-time)

⚠️ Proceed with Caution:

  • Team new to Kubernetes (6 months experience recommended minimum)
  • Running <10 services (simpler ingress solutions may suffice)
  • Tight budget constraints (<$500/mo infrastructure spend)
  • No dedicated operations support
  • Need to ship features weekly (learning curve may slow velocity initially)

❌ Not Recommended:

  • First Kubernetes deployment (start simpler, add mesh later)
  • Single developer or small team (<3 engineers total)
  • Prototype or proof-of-concept phase
  • Services already have robust observability and security

Install Required Tools

# Install Azure CLI (macOS)
brew install azure-cli

# Verify version
az --version  # Should be 2.50.0+

# Install kubectl
az aks install-cli

# Verify kubectl
kubectl version --client

# Install Helm
brew install helm

# Verify Helm
helm version  # Should be v3.12+

# Install useful optional tools
brew install k9s          # Terminal UI for Kubernetes
brew install stern        # Multi-pod log tailing
brew install kubectx      # Context and namespace switching
brew install istioctl     # Istio CLI (optional but helpful)

# Login to Azure
az login

# List subscriptions
az account list --output table

# Set your subscription
export AZURE_SUBSCRIPTION_ID="<your-subscription-id>"
az account set --subscription "$AZURE_SUBSCRIPTION_ID"

# Verify access
az account show

Cost Awareness and Budget Planning

Initial Setup Costs (First Month):

  • Domain registration (if new): $10-15/year
  • AKS cluster setup: $300-500 (includes learning/experimentation time)
  • DNS testing and configuration: Minimal
  • Certificate setup: Free (Let's Encrypt)

Ongoing Monthly Costs (Conservative Estimate):

Base Infrastructure:
- AKS Control Plane (free tier): $0
- OR Control Plane SLA (99.95%): $73
- 3x Standard_D2s_v3 nodes: $220
- Azure Load Balancer: $18.25
- Azure DNS: $1
- Container Registry (Basic): $5
- Monitoring storage: $50-100

Total Base: $294-417/mo

Per-Service Costs (Approximate):
- Additional compute: $20-40/service
- Storage (if needed): $10-50/service
- Database (if needed): $30-500/service
- Bandwidth: ~$0.08/GB egress

Expected Scaling:
- 5 services: ~$450-600/mo
- 10 services: ~$800-1200/mo
- 25 services: ~$1500-2500/mo
- 50 services: ~$3000-5000/mo

Cost Optimization Strategies:

  1. Use Spot Instances: 70-90% discount for non-critical workloads
  2. Right-size Nodes: Start with D2s_v3, scale up only if needed
  3. Reserved Instances: 30-40% discount for 1-3 year commitments
  4. Autoscaling: HPA and cluster autoscaler to match demand
  5. Monitoring Retention: Reduce Prometheus retention to 7-15 days
  6. Dev/Staging Clusters: Shut down nights and weekends (terraform automation)
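
For strategy 6, dev and staging clusters can be stopped outright when idle; a minimal sketch using the AKS stop/start commands (cluster and resource group names are placeholders):

# Stop a non-production cluster overnight / on weekends (node VM billing pauses while stopped)
az aks stop --resource-group <dev-resource-group> --name <dev-cluster-name>

# Start it again before the workday (cron, a pipeline, or Terraform can automate this)
az aks start --resource-group <dev-resource-group> --name <dev-cluster-name>

# Check the current power state
az aks show --resource-group <dev-resource-group> --name <dev-cluster-name> --query "powerState.code" -o tsv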

🚀 Step 1: Create Production-Grade AKS Cluster with Istio

1.1 Cluster Sizing and Configuration Decisions

Node Size Selection Trade-offs:

| VM Size | vCPU | RAM | Price/mo | Use Case | Considerations |
| --- | --- | --- | --- | --- | --- |
| Standard_B2s | 2 | 4GB | $30 | Dev/test only | Not recommended: Burstable, inconsistent performance |
| Standard_D2s_v3 | 2 | 8GB | $70 | Small prod workloads | Recommended starter: Good balance, upgradeable |
| Standard_D4s_v3 | 4 | 16GB | $140 | Medium prod | Better pod density, more headroom |
| Standard_D8s_v3 | 8 | 32GB | $280 | Large prod | Best pod density, overkill for <20 services |
| Standard_E2s_v3 | 2 | 16GB | $109 | Memory-intensive | Higher memory-to-CPU ratio |

Decision Framework:

  • Start with D2s_v3 for initial deployment (2-5 services)
  • Plan for D4s_v3 when approaching 50% node capacity
  • Consider E-series if services average >500MB memory per pod
  • Use node pools with different sizes for workload-specific optimization

Node Count Considerations:

  • Minimum 3 nodes for production (zone spread, rolling updates, tolerating node failures)
  • Start with 3, enable cluster autoscaler (3-10 nodes typical range)
  • Cost vs. Availability: 3 nodes tolerate one node failure with headroom; 5 nodes tolerate two
  • Zone Distribution: Spread across availability zones (requires 3+ nodes)
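
As a rough capacity-planning sanity check (numbers are approximate and depend on AKS system reservations and your actual resource requests):

# Compare allocatable capacity per node against your per-pod requests
# (run once the cluster from Step 1.4 is up)
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory

# Rough arithmetic for a Standard_D2s_v3 (2 vCPU / 8GB):
#   allocatable after system reservations is roughly ~1900m CPU and ~5-6GB memory
#   with app requests of 250m/512Mi plus ~100m/128Mi for an Istio sidecar,
#   expect roughly 5 application pods per node, with CPU requests as the binding constraint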

1.2 Define Cluster Configuration

# Cluster configuration with rationale
export RESOURCE_GROUP="aks-production-rg"
export LOCATION="eastus"  # Choose region closest to users
export CLUSTER_NAME="my-aks-cluster"

# Node configuration
export NODE_COUNT=3           # Minimum for HA
export NODE_SIZE="Standard_D2s_v3"  # 2 vCPU, 8GB RAM
export NODE_DISK_SIZE=128     # OS disk size (GB) - default is often 128GB

# Kubernetes version - check available versions
az aks get-versions --location "$LOCATION" --output table

# Use recent stable version (not preview)
export K8S_VERSION="1.28.3"  # Update to latest stable

# Enable advanced features
export ENABLE_MONITORING="true"
export ENABLE_AUTOSCALER="true"
export MIN_NODE_COUNT=3
export MAX_NODE_COUNT=10

1.3 Create Resource Group with Tags

# Create resource group with metadata tags for cost tracking
az group create \
  --name "$RESOURCE_GROUP" \
  --location "$LOCATION" \
  --tags \
    environment=production \
    costCenter=engineering \
    managedBy=terraform \
    project=platform

# Verify creation
az group show --name "$RESOURCE_GROUP"

1.4 Create AKS Cluster with Istio and Advanced Features

# Create production-grade AKS cluster
# This takes 10-15 minutes
az aks create \
  --resource-group "$RESOURCE_GROUP" \
  --name "$CLUSTER_NAME" \
  --location "$LOCATION" \
  --kubernetes-version "$K8S_VERSION" \
  --node-count "$NODE_COUNT" \
  --node-vm-size "$NODE_SIZE" \
  --node-osdisk-size "$NODE_DISK_SIZE" \
  --enable-managed-identity \
  --enable-asm \
  --network-plugin azure \
  --network-policy azure \
  --enable-cluster-autoscaler \
  --min-count "$MIN_NODE_COUNT" \
  --max-count "$MAX_NODE_COUNT" \
  --enable-addons monitoring \
  --generate-ssh-keys \
  --zones 1 2 3 \
  --tier standard \
  --tags environment=production project=platform

# Alternative: Free tier for non-production (no SLA)
# --tier free

# Get cluster credentials
az aks get-credentials \
  --resource-group "$RESOURCE_GROUP" \
  --name "$CLUSTER_NAME" \
  --overwrite-existing

# Verify cluster is running
kubectl get nodes -o wide

# Check node zones (should be distributed)
kubectl get nodes -o custom-columns='NAME:.metadata.name,ZONE:.metadata.labels.topology\.kubernetes\.io/zone'

Expected Output:

NAME                                STATUS   ROLES   AGE   VERSION   ZONE
aks-nodepool1-12345678-vmss000000   Ready    agent   5m    v1.28.3   eastus-1
aks-nodepool1-12345678-vmss000001   Ready    agent   5m    v1.28.3   eastus-2
aks-nodepool1-12345678-vmss000002   Ready    agent   5m    v1.28.3   eastus-3

1.5 Verify Istio Installation and Components

# Check Istio namespaces
kubectl get namespaces | grep istio

# Expected output:
# aks-istio-ingress    Active   5m
# aks-istio-system     Active   5m

# Check Istio control plane components
kubectl get pods -n aks-istio-system

# Expected output:
# NAME                     READY   STATUS    RESTARTS   AGE
# istiod-asm-xxxx-xxxxx    1/1     Running   0          5m

# Check Istio ingress gateway
kubectl get pods -n aks-istio-ingress

# Expected output:
# NAME                                                READY   STATUS    RESTARTS   AGE
# aks-istio-ingressgateway-external-xxxxxxxx-xxxxx   1/1     Running   0          5m

# Get Istio Gateway external IP (save this!)
kubectl get svc -n aks-istio-ingress aks-istio-ingressgateway-external

# Expected output:
# NAME                                TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)
# aks-istio-ingressgateway-external   LoadBalancer   10.0.123.45   52.188.x.x     15021:30001/TCP,80:30002/TCP,443:30003/TCP

# Export for later use
export ISTIO_INGRESS_IP=$(kubectl get svc -n aks-istio-ingress aks-istio-ingressgateway-external -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "Istio Ingress IP: $ISTIO_INGRESS_IP"

# Verify Istio version and configuration
kubectl get mutatingwebhookconfigurations | grep istio
kubectl get validatingwebhookconfigurations | grep istio
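
If you installed istioctl earlier, it can cross-check the data plane against the managed control plane; a sketch (the add-on runs istiod in aks-istio-system, so point istioctl at that namespace):

# Client, control plane, and data plane versions
istioctl version -i aks-istio-system

# Confirm every sidecar and gateway is synced with istiod
istioctl proxy-status -i aks-istio-system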

1.6 Configure Node Pools for Workload Isolation (Optional but Recommended)

For production, consider separating system workloads from application workloads:

# Add a spot instance node pool for non-critical workloads (70-90% cost savings)
az aks nodepool add \
  --resource-group "$RESOURCE_GROUP" \
  --cluster-name "$CLUSTER_NAME" \
  --name spotpool \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price -1 \
  --enable-cluster-autoscaler \
  --min-count 0 \
  --max-count 5 \
  --node-count 1 \
  --node-vm-size Standard_D2s_v3 \
  --node-taints kubernetes.azure.com/scalesetpriority=spot:NoSchedule \
  --labels workloadType=batch priority=low

# Verify node pools
az aks nodepool list \
  --resource-group "$RESOURCE_GROUP" \
  --cluster-name "$CLUSTER_NAME" \
  --output table
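
Workloads only land on the spot pool if they tolerate its taint (and, optionally, select its labels); a minimal sketch using a hypothetical low-priority deployment:

# Hypothetical batch workload pinned to the spot pool created above
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker-demo
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: batch-worker-demo
  template:
    metadata:
      labels:
        app: batch-worker-demo
    spec:
      nodeSelector:
        workloadType: batch                      # label applied to the spot pool
      tolerations:
        - key: kubernetes.azure.com/scalesetpriority
          operator: Equal
          value: spot
          effect: NoSchedule                     # matches the spot pool taint
      containers:
        - name: worker
          image: busybox:1.36
          command: ["sh", "-c", "sleep 3600"]
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 250m
              memory: 256Mi
EOF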

1.7 Cluster Architecture Deep Dive

Key Reliability Patterns:

  1. Zone Distribution: Nodes spread across 3 availability zones

    • Tolerates entire zone failure
    • Azure SLA: 99.99% uptime with zones
  2. Control Plane HA: Managed by Azure

    • Multi-region replication
    • Automatic failover
    • Separate from worker nodes
  3. Istio Control Plane: Azure-managed istiod

    • Automatic updates and patches
    • SLA backed by Azure support
    • Reduced operational burden
  4. Cluster Autoscaler: Automatically adjusts node count

    • Scales up when pods are pending
    • Scales down when nodes underutilized (>50% idle for 10 min)
    • Cost optimization while maintaining performance

Failure Scenarios and Recovery:

| Failure Type | Impact | Recovery Time | Mitigation |
| --- | --- | --- | --- |
| Single node failure | 33% capacity loss | Immediate (pods rescheduled) | 3+ nodes, pod anti-affinity |
| Zone failure | 33% capacity loss | Immediate | Multi-zone deployment |
| Istio control plane | No new config changes | 5-15 minutes | Data plane continues functioning |
| Ingress gateway pod | Partial traffic loss | <30 seconds | Multiple replicas, health checks |
| Control plane API | No new deployments | Azure SLA: 99.95% | Use managed tier SLA |

1.8 Enable Cost Analysis and Monitoring

# Enable Azure Cost Management insights
az aks update \
  --resource-group "$RESOURCE_GROUP" \
  --name "$CLUSTER_NAME" \
  --enable-cost-analysis

# Confirm the cost analysis add-on is enabled
# (cost data appears in the Azure portal's Cost Analysis view after 24-48 hours)
az aks show \
  --resource-group "$RESOURCE_GROUP" \
  --name "$CLUSTER_NAME" \
  --query "metricsProfile.costAnalysis"

# Install metrics-server for HPA (if not already installed)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify metrics-server
kubectl get deployment metrics-server -n kube-system
kubectl top nodes
kubectl top pods -A

Post-Creation Validation Checklist:

  • All 3 nodes are in Ready state
  • Nodes distributed across 3 availability zones
  • Istio control plane pod running in aks-istio-system
  • Istio ingress gateway has external IP
  • Cluster autoscaler enabled (3-10 node range)
  • Monitoring addon enabled (Azure Monitor)
  • kubectl context set to new cluster
  • Ingress IP saved to environment variable
  • Cost tracking tags applied

🌐 Step 2: Configure Custom Domain in Azure DNS

Now we'll set up Azure DNS to manage your custom domain.

2.1 Create DNS Zone

export DOMAIN_NAME="cat-herding.net"

# Create DNS zone
az network dns zone create \
  --resource-group "$RESOURCE_GROUP" \
  --name "$DOMAIN_NAME"

# Get name servers
az network dns zone show \
  --resource-group "$RESOURCE_GROUP" \
  --name "$DOMAIN_NAME" \
  --query nameServers \
  --output table

Expected Output:

Result
-------------------
ns1-01.azure-dns.com.
ns2-01.azure-dns.net.
ns3-01.azure-dns.org.
ns4-01.azure-dns.info.

2.2 Update Domain Registrar Name Servers

Go to your domain registrar (GoDaddy, Namecheap, etc.) and update the name servers to the Azure DNS name servers from the previous step.

DNS Propagation Flow:

2.3 Create DNS Records

# Get Istio ingress gateway IP
export INGRESS_IP=$(kubectl get svc -n aks-istio-ingress \
  aks-istio-ingressgateway-external \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

echo "Ingress Gateway IP: $INGRESS_IP"

# Create A record for root domain
az network dns record-set a add-record \
  --resource-group "$RESOURCE_GROUP" \
  --zone-name "$DOMAIN_NAME" \
  --record-set-name "@" \
  --ipv4-address "$INGRESS_IP"

# Create A record for www subdomain
az network dns record-set a add-record \
  --resource-group "$RESOURCE_GROUP" \
  --zone-name "$DOMAIN_NAME" \
  --record-set-name "www" \
  --ipv4-address "$INGRESS_IP"

# Verify DNS records
az network dns record-set a list \
  --resource-group "$RESOURCE_GROUP" \
  --zone-name "$DOMAIN_NAME" \
  --output table

2.4 Test DNS Resolution

# Test DNS resolution (may take a few minutes)
nslookup cat-herding.net
nslookup www.cat-herding.net

# Test from multiple locations
dig cat-herding.net

🔐 Step 3: Install cert-manager for TLS Certificates

cert-manager automates the provisioning and renewal of TLS certificates from Let's Encrypt.

3.1 Install cert-manager with Helm

# Add cert-manager Helm repository
helm repo add jetstack https://charts.jetstack.io
helm repo update

# Create namespace
kubectl create namespace cert-manager

# Install cert-manager
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --version v1.13.0 \
  --set installCRDs=true

# Verify installation
kubectl get pods -n cert-manager

cert-manager Architecture:

3.2 Create ClusterIssuer for Let's Encrypt

# Create production ClusterIssuer
cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@cat-herding.net  # Change to your email
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: istio
EOF

# Verify ClusterIssuer
kubectl get clusterissuer letsencrypt-prod
kubectl describe clusterissuer letsencrypt-prod

3.3 Create Certificate Resource

cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: cat-herding-tls-cert
  namespace: default
spec:
  secretName: cat-herding-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - cat-herding.net
    - www.cat-herding.net
EOF

# Check certificate status
kubectl get certificate -n default
kubectl describe certificate cat-herding-tls-cert -n default

# Check if secret was created
kubectl get secret cat-herding-tls -n default

Certificate Lifecycle:


🎯 Step 4: Deploy Your First Application

Let's deploy a sample application with Istio routing and TLS.

4.1 Create Application Directory Structure

mkdir -p k8s/apps/portfolio/base
cd k8s/apps/portfolio/base

4.2 Create Deployment Manifest

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: portfolio
spec:
  replicas: 2
  selector:
    matchLabels:
      app: portfolio
  template:
    metadata:
      labels:
        app: portfolio
        version: v1
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      containers:
        - name: portfolio
          image: gabby.azurecr.io/portfolio:latest
          imagePullPolicy: Always
          ports:
            - containerPort: 3000
              name: http
          env:
            - name: PORT
              value: "3000"
            - name: NODE_ENV
              value: "production"
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
          livenessProbe:
            httpGet:
              path: /api/health
              port: http
            initialDelaySeconds: 30
            periodSeconds: 15
          readinessProbe:
            httpGet:
              path: /api/health
              port: http
            initialDelaySeconds: 5
            periodSeconds: 10

4.3 Create Service Manifest

# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: portfolio
spec:
  selector:
    app: portfolio
  ports:
    - name: http
      port: 80
      targetPort: 3000
  type: ClusterIP

4.4 Create Istio Gateway

# istio-gateway.yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: portfolio-gateway
  namespace: default
spec:
  selector:
    istio: aks-istio-ingressgateway-external
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "cat-herding.net"
        - "www.cat-herding.net"
    - port:
        number: 443
        name: https
        protocol: HTTPS
      hosts:
        - "cat-herding.net"
        - "www.cat-herding.net"
      tls:
        mode: SIMPLE
        credentialName: cat-herding-tls

4.5 Create Istio VirtualService

# istio-virtualservice.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: portfolio-virtualservice
spec:
  hosts:
    - "cat-herding.net"
    - "www.cat-herding.net"
  gateways:
    - portfolio-gateway
  http:
    # Health check endpoints
    - match:
        - uri:
            exact: /api/health
        - uri:
            exact: /health
      route:
        - destination:
            host: portfolio
            port:
              number: 80
    # All other traffic
    - match:
        - uri:
            prefix: /
      route:
        - destination:
            host: portfolio
            port:
              number: 80

4.6 Create Kustomization File

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - deployment.yaml
  - service.yaml
  - istio-gateway.yaml
  - istio-virtualservice.yaml

namespace: default

4.7 Deploy Application

# Validate manifests
kubectl kustomize k8s/apps/portfolio/base

# Apply manifests
kubectl apply -k k8s/apps/portfolio/base

# Watch deployment
kubectl rollout status deployment/portfolio -n default

# Check resources
kubectl get pods -l app=portfolio -n default
kubectl get svc portfolio -n default
kubectl get gateway portfolio-gateway -n default
kubectl get virtualservice portfolio-virtualservice -n default

Request Flow:


🧪 Step 5: Test and Verify

5.1 Test HTTP/HTTPS Connectivity

# Test HTTP (should work)
curl -v http://cat-herding.net

# Test HTTPS (should work with valid certificate)
curl -v https://cat-herding.net

# Test specific endpoint
curl https://cat-herding.net/api/health

# Check certificate details
echo | openssl s_client -servername cat-herding.net -connect cat-herding.net:443 2>/dev/null | openssl x509 -noout -dates

5.2 Verify Certificate

# Check certificate secret
kubectl get secret cat-herding-tls -n default

# Decode and inspect certificate
kubectl get secret cat-herding-tls -n default -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -text -noout

# Check expiration
kubectl get secret cat-herding-tls -n default -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate

5.3 Test from Inside Cluster

# Create test pod
kubectl run curl-test --image=curlimages/curl:latest --rm -it --restart=Never -- sh

# From inside the pod
curl -v http://portfolio.default.svc.cluster.local
exit

🔄 Step 6: Deploy Additional Applications with Subdomains

Now let's deploy a second application on a subdomain (e.g., api.cat-herding.net).

6.1 Create DNS Record for Subdomain

# Create A record for api subdomain
az network dns record-set a add-record \
  --resource-group "$RESOURCE_GROUP" \
  --zone-name "$DOMAIN_NAME" \
  --record-set-name "api" \
  --ipv4-address "$INGRESS_IP"

# Verify
nslookup api.cat-herding.net

6.2 Create Certificate for Subdomain

cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: api-tls-cert
  namespace: default
spec:
  secretName: api-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - api.cat-herding.net
EOF

# Check certificate status
kubectl get certificate api-tls-cert -n default

6.3 Create API Application Manifests

mkdir -p k8s/apps/api-service/base
cd k8s/apps/api-service/base
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
        version: v1
    spec:
      containers:
        - name: api-service
          image: your-registry.azurecr.io/api-service:latest
          ports:
            - containerPort: 8080
              name: http
          env:
            - name: PORT
              value: "8080"
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: api-service
spec:
  selector:
    app: api-service
  ports:
    - name: http
      port: 80
      targetPort: 8080
  type: ClusterIP
# istio-gateway.yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: api-gateway
  namespace: default
spec:
  selector:
    istio: aks-istio-ingressgateway-external
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "api.cat-herding.net"
    - port:
        number: 443
        name: https
        protocol: HTTPS
      hosts:
        - "api.cat-herding.net"
      tls:
        mode: SIMPLE
        credentialName: api-tls
# istio-virtualservice.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api-virtualservice
spec:
  hosts:
    - "api.cat-herding.net"
  gateways:
    - api-gateway
  http:
    - match:
        - uri:
            prefix: /
      route:
        - destination:
            host: api-service
            port:
              number: 80
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - deployment.yaml
  - service.yaml
  - istio-gateway.yaml
  - istio-virtualservice.yaml

namespace: default

6.4 Deploy API Service

# Apply manifests
kubectl apply -k k8s/apps/api-service/base

# Verify deployment
kubectl get pods -l app=api-service
kubectl get svc api-service
kubectl get gateway api-gateway
kubectl get virtualservice api-virtualservice

# Test
curl https://api.cat-herding.net/health

Multi-Application Routing:


🔒 Security Architecture and Zero-Trust Implementation

Defense-in-Depth Strategy

Production Kubernetes security requires multiple layers. A single breach should not compromise the entire system.

Implementing Mutual TLS (mTLS) with Istio

Istio provides automatic mTLS between services without code changes—one of its primary value propositions.

mTLS Benefits:

  • Zero-trust networking: No service trusts network by default
  • Encryption of all inter-service communication
  • Certificate-based authentication (vs. API keys)
  • Automatic certificate rotation (default: 24-hour validity)
  • Defense against man-in-the-middle attacks

mTLS Configuration:

# Enable strict mTLS for entire namespace
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: default
spec:
  mtls:
    mode: STRICT # PERMISSIVE allows both mTLS and plaintext (migration mode)

---
# Alternatively, configure cluster-wide
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: aks-istio-system # root namespace for the AKS-managed Istio add-on
spec:
  mtls:
    mode: STRICT
# Apply mTLS policy
kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: default
spec:
  mtls:
    mode: STRICT
EOF

# Verify mTLS status
istioctl x describe pod <pod-name> -n default

# Check certificate details
istioctl proxy-config secret <pod-name> -n default -o json | jq -r '.dynamicActiveSecrets[0].secret.tlsCertificate.certificateChain.inlineBytes' | base64 -d | openssl x509 -text -noout

mTLS Performance Impact:

  • Latency: ~1-3ms additional per hop
  • CPU: ~5-10% overhead for encryption/decryption
  • Memory: ~20-30MB per sidecar for certificates
  • Trade-off: Security vs. microsecond-level latency requirements
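
If sidecar overhead becomes a concern, proxy CPU and memory can be bounded per workload through Istio's standard sidecar resource annotations; a hedged sketch with illustrative values (verify the managed add-on honors these before relying on them):

# Set proxy resource requests/limits on a workload's pod template
# (patching the pod template triggers a rolling restart, so sidecars pick up the new limits)
kubectl patch deployment <your-app> -n default --type merge -p '{
  "spec": {"template": {"metadata": {"annotations": {
    "sidecar.istio.io/proxyCPU": "50m",
    "sidecar.istio.io/proxyMemory": "96Mi",
    "sidecar.istio.io/proxyCPULimit": "500m",
    "sidecar.istio.io/proxyMemoryLimit": "256Mi"
  }}}}
}'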

Network Policies for Pod Isolation

# Deny all traffic by default (baseline security)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: default
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress

---
# Allow specific service communication
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: portfolio-network-policy
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: portfolio
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: aks-istio-ingress # built-in namespace name label
      ports:
        - protocol: TCP
          port: 3000
  egress:
    # Allow DNS
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system # built-in namespace name label
      ports:
        - protocol: UDP
          port: 53
    # Allow API service communication
    - to:
        - podSelector:
            matchLabels:
              app: api-service
      ports:
        - protocol: TCP
          port: 8080
    # Allow database connection (e.g., Azure Database for PostgreSQL, outside the cluster)
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0 # tighten to your database's address range if known
      ports:
        - protocol: TCP
          port: 5432
    # Allow external HTTPS (for external APIs)
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
      ports:
        - protocol: TCP
          port: 443

Pod Security Standards (PSS)

# Enforce restricted pod security at namespace level
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

---
# Example: Compliant deployment with security contexts
apiVersion: apps/v1
kind: Deployment
metadata:
  name: secure-portfolio
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: portfolio
  template:
    metadata:
      labels:
        app: portfolio
    spec:
      # Pod-level security context
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault

      # Service account with minimal permissions
      serviceAccountName: portfolio-sa
      automountServiceAccountToken: false

      containers:
        - name: portfolio
          image: gabby.azurecr.io/portfolio:v1.2.3 # Use specific tags, not :latest

          # Container-level security context
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true # Immutable container filesystem
            runAsNonRoot: true
            runAsUser: 1000
            capabilities:
              drop:
                - ALL # Drop all Linux capabilities

          # Resource limits prevent resource exhaustion attacks
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi

          # Writable volumes for tmp and cache
          volumeMounts:
            - name: tmp
              mountPath: /tmp
            - name: cache
              mountPath: /app/.next/cache

          ports:
            - containerPort: 3000
              name: http
              protocol: TCP

          livenessProbe:
            httpGet:
              path: /api/health
              port: http
            initialDelaySeconds: 30
            periodSeconds: 15
            timeoutSeconds: 5
            failureThreshold: 3

          readinessProbe:
            httpGet:
              path: /api/health
              port: http
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3

      volumes:
        - name: tmp
          emptyDir: {}
        - name: cache
          emptyDir: {}

Azure Key Vault Integration for Secrets

Never store secrets in Git, ConfigMaps, or even Kubernetes Secrets (base64 is not encryption).

# Install Azure Key Vault provider for Secrets Store CSI
helm repo add csi-secrets-store-provider-azure https://azure.github.io/secrets-store-csi-driver-provider-azure/charts
helm install csi-secrets-store-provider-azure/csi-secrets-store-provider-azure --generate-name --namespace kube-system

# Create Key Vault
export KEY_VAULT_NAME="aks-prod-kv-$(uuidgen | cut -c1-8)"
az keyvault create \
  --name "$KEY_VAULT_NAME" \
  --resource-group "$RESOURCE_GROUP" \
  --location "$LOCATION" \
  --enable-rbac-authorization

# Store secrets
az keyvault secret set \
  --vault-name "$KEY_VAULT_NAME" \
  --name "database-connection-string" \
  --value "postgresql://user:password@host:5432/db"

# Configure workload identity for pod to access Key Vault
# (Detailed setup omitted for brevity - see Azure docs)
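
For reference, a minimal sketch of what the pod-facing configuration typically looks like once workload identity is wired up (resource names here are hypothetical; consult the Azure docs for the authoritative setup):

# SecretProviderClass describing which Key Vault objects to project into the pod
cat <<EOF | kubectl apply -f -
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: app-keyvault-secrets          # hypothetical name
  namespace: default
spec:
  provider: azure
  parameters:
    clientID: "<workload-identity-client-id>"   # assumes workload identity is configured
    keyvaultName: "${KEY_VAULT_NAME}"
    tenantId: "<azure-tenant-id>"
    objects: |
      array:
        - |
          objectName: database-connection-string
          objectType: secret
EOF

# The workload then mounts it with a CSI volume, e.g.:
#   csi:
#     driver: secrets-store.csi.k8s.io
#     readOnly: true
#     volumeAttributes:
#       secretProviderClass: app-keyvault-secrets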

Security Monitoring and Audit

# Enable Azure Defender for Kubernetes
az security pricing create \
  --name KubernetesService \
  --tier Standard

# View security recommendations
az security assessment list \
  --resource-group "$RESOURCE_GROUP"

# Enable audit logging
az aks update \
  --resource-group "$RESOURCE_GROUP" \
  --name "$CLUSTER_NAME" \
  --enable-azure-rbac

# Query audit logs (requires Azure Monitor integration)
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | grep -i "security\|unauthorized\|forbidden"

Incident Response Playbook

Common Security Incidents:

  1. Compromised Pod

    # Immediately isolate pod with network policy
    kubectl apply -f - <<EOF
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: isolate-compromised-pod
      namespace: default
    spec:
      podSelector:
        matchLabels:
          app: compromised-app
      policyTypes:
      - Ingress
      - Egress
    EOF
    
    # Capture pod logs before deletion
    kubectl logs compromised-pod -n default --all-containers > incident-logs.txt
    
    # Exec into pod for forensics (if safe)
    kubectl exec -it compromised-pod -n default -- /bin/sh
    
    # Delete and redeploy from known-good image
    kubectl delete pod compromised-pod -n default
    kubectl rollout restart deployment/app-name -n default
    
    # Review image for vulnerabilities
    trivy image gabby.azurecr.io/app:tag
    
  2. Certificate Expiration (Let's Encrypt)

    # Check certificate expiration
    kubectl get certificate -A
    kubectl describe certificate cat-herding-tls-cert -n default
    
    # Force renewal if <30 days
    kubectl delete secret cat-herding-tls -n default
    kubectl delete certificate cat-herding-tls-cert -n default
    kubectl apply -f certificate.yaml
    
    # Monitor renewal
    kubectl get certificaterequest -n default -w
    
  3. Istio Control Plane Failure

    # Data plane continues functioning, but no config changes
    # Check istiod status
    kubectl get pods -n aks-istio-system
    kubectl logs -n aks-istio-system -l app=istiod
    
    # Contact Azure support (managed Istio)
    # Meanwhile, avoid deploying new services or changing routing
    

Security Checklist for Production:

  • mTLS enabled in STRICT mode
  • Network policies deny all by default
  • Pod Security Standards enforced (restricted)
  • All containers run as non-root
  • Read-only root filesystem where possible
  • Resource limits on all pods
  • Image scanning in CI/CD pipeline
  • Secrets stored in Azure Key Vault
  • RBAC roles follow least privilege
  • Audit logging enabled
  • Azure Defender for Kubernetes enabled
  • Regular security assessments scheduled
  • Incident response playbook documented

🔍 Troubleshooting: Real-World Incident Scenarios

Debugging Methodology for Service Mesh

Service mesh adds complexity—when things break, methodical debugging is essential.

Issue 1: 503 Service Unavailable (Most Common)

Symptoms:

  • curl https://cat-herding.net returns 503
  • Browser shows "Service Temporarily Unavailable"
  • Istio Gateway logs show "no healthy upstream"

Root Causes and Solutions:

# Step 1: Verify pods are running
kubectl get pods -l app=portfolio -n default

# If pods are not Running:
kubectl describe pod <pod-name> -n default
kubectl logs <pod-name> -n default --previous  # Check previous container logs

# Common causes:
# - ImagePullBackOff: Wrong image name or missing credentials
# - CrashLoopBackOff: Application error, check logs
# - Pending: Insufficient resources, check node capacity

# Step 2: Check if service has endpoints
kubectl get endpoints portfolio -n default

# If no endpoints:
# - Selector mismatch: kubectl get svc portfolio -n default -o yaml | grep selector
# - Pod labels: kubectl get pods -l app=portfolio --show-labels

# Step 3: Check health probes
kubectl describe pod <pod-name> -n default | grep -A5 "Liveness\|Readiness"

# If probes failing:
# - Wrong path: Check /api/health is correct
# - Slow startup: Increase initialDelaySeconds
# - Application not ready: Fix app code

# Step 4: Verify VirtualService routing
kubectl get virtualservice portfolio-virtualservice -n default -o yaml

# Check host matching
kubectl describe virtualservice portfolio-virtualservice -n default

# Step 5: Check Istio Gateway logs
kubectl logs -n aks-istio-ingress -l istio=aks-istio-ingressgateway-external --tail=100

# Look for:
# - "no healthy upstream"
# - "upstream connect error"
# - certificate errors

Real Example with Solution:

# Problem: VirtualService host doesn't match Gateway
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: portfolio-gateway
spec:
  servers:
  - hosts:
    - "cat-herding.net"  # ❌ Missing www subdomain

---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: portfolio-vs
spec:
  hosts:
  - "www.cat-herding.net"  # ❌ Requests to www.cat-herding.net fail!
  gateways:
  - portfolio-gateway

# Solution: Match hosts exactly
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: portfolio-gateway
spec:
  servers:
  - hosts:
    - "cat-herding.net"
    - "www.cat-herding.net"  # ✅ Include all expected hosts

Issue 2: Certificate Not Ready / TLS Errors

Symptoms:

  • Certificate stuck in "Pending" or "False" state
  • Browser shows "Your connection is not private"
  • kubectl describe certificate shows challenge failures

Debugging Steps:

# Step 1: Check certificate status
kubectl get certificate -A
kubectl describe certificate cat-herding-tls-cert -n default

# Look for status conditions:
# - Ready: False
# - Reason: Pending / Failed
# - Message: (describes what's wrong)

# Step 2: Check certificate request
kubectl get certificaterequest -n default
kubectl describe certificaterequest <request-name> -n default

# Step 3: Check ACME challenge
kubectl get challenges -A
kubectl describe challenge <challenge-name> -n default

# Common HTTP-01 challenge failures:
# - DNS not propagating (wait 5-60 minutes)
# - Firewall blocking /.well-known/acme-challenge/
# - VirtualService not routing challenge requests

# Step 4: Check cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager --tail=100

# Look for:
# - "failed to determine the list of Challenge resources"
# - "Error accepting challenge"
# - "certificate request has failed"

# Step 5: Verify Let's Encrypt rate limits
# Let's Encrypt limits:
# - 50 certificates per registered domain per week
# - 5 duplicate certificates per week
# Check: https://crt.sh/?q=cat-herding.net

# Solution 1: Delete and recreate certificate (if misconfigured)
kubectl delete certificate cat-herding-tls-cert -n default
kubectl delete secret cat-herding-tls -n default
kubectl apply -f certificate.yaml

# Solution 2: Use staging issuer for testing (avoids rate limits)
cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: admin@cat-herding.net
    privateKeySecretRef:
      name: letsencrypt-staging
    solvers:
    - http01:
        ingress:
          class: istio
EOF

# Update certificate to use staging issuer
kubectl edit certificate cat-herding-tls-cert -n default
# Change issuerRef.name to letsencrypt-staging

# Solution 3: Manual certificate if all else fails
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout tls.key -out tls.crt \
  -subj "/CN=cat-herding.net"

kubectl create secret tls cat-herding-tls \
  --cert=tls.crt --key=tls.key -n default

Issue 3: High Latency / Slow Responses

Symptoms:

  • Response times > 500ms
  • Intermittent timeouts
  • User complaints about performance

Investigation:

# Step 1: Isolate the layer causing latency
# Test internal service (bypasses Istio Gateway)
kubectl run curl-test --image=curlimages/curl:latest --rm -it --restart=Never -- \
  time curl -w "\ntime_total: %{time_total}s\n" http://portfolio.default.svc.cluster.local

# If fast internally but slow externally:
# - Problem is Gateway, DNS, or network path
# If slow internally:
# - Problem is application or sidecar

# Step 2: Check resource utilization
kubectl top pods -n default
kubectl top nodes

# If pods near limits:
# - Increase resource requests/limits
# - Scale horizontally (more replicas)

# Step 3: Check Envoy sidecar overhead
kubectl exec -it <pod-name> -c istio-proxy -n default -- pilot-agent request GET stats | grep -E "upstream_rq_time|response_time"

# Typical sidecar overhead: 1-5ms
# If >10ms: Investigate Istio configuration

# Step 4: Enable distributed tracing
# Jaeger should show request flow and timing breakdown
kubectl port-forward -n istio-system svc/jaeger-query 16686:16686

# Open http://localhost:16686 and search for slow traces

# Step 5: Check for CPU throttling
kubectl describe pod <pod-name> -n default | grep -A5 "Requests\|Limits"

# If throttling:
# - Increase CPU limits
# - Review application efficiency

# Solution 1: Horizontal Pod Autoscaler
kubectl autoscale deployment portfolio --cpu-percent=70 --min=2 --max=10

# Solution 2: Increase resources
kubectl set resources deployment portfolio -c portfolio \
  --requests=cpu=500m,memory=1Gi \
  --limits=cpu=2000m,memory=2Gi

# Solution 3: Optimize application code
# - Add caching layer (Redis)
# - Optimize database queries
# - Use CDN for static assets

Issue 4: DNS Not Resolving

Symptoms:

  • nslookup cat-herding.net times out or returns wrong IP
  • Works from some locations but not others
  • Intermittent failures

Resolution:

# Step 1: Verify DNS records in Azure
az network dns record-set a list \
  --resource-group "$RESOURCE_GROUP" \
  --zone-name "cat-herding.net" \
  --output table

# Expected: A record pointing to $ISTIO_INGRESS_IP

# Step 2: Check name server delegation
nslookup -type=NS cat-herding.net

# Should return Azure DNS name servers:
# ns1-01.azure-dns.com.
# ns2-01.azure-dns.net.
# ns3-01.azure-dns.org.
# ns4-01.azure-dns.info.

# If not: Update name servers at domain registrar

# Step 3: Test DNS from multiple servers
dig @8.8.8.8 cat-herding.net  # Google DNS
dig @1.1.1.1 cat-herding.net  # Cloudflare DNS
dig @ns1-01.azure-dns.com cat-herding.net  # Azure DNS directly

# If inconsistent: DNS propagation in progress (wait 5-60 min)

# Step 4: Check TTL values
dig cat-herding.net | grep "^cat-herding.net"

# TTL should be reasonable (300-3600 seconds)
# Lower TTL = faster updates, more queries
# Higher TTL = slower updates, fewer queries, better performance

# Step 5: Flush local DNS cache
# macOS:
sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder

# Linux:
sudo systemd-resolve --flush-caches   # or: resolvectl flush-caches on newer systemd versions

# Windows:
ipconfig /flushdns

# Solution: If propagation is taking too long
# Temporarily add entry to /etc/hosts for testing:
echo "$ISTIO_INGRESS_IP cat-herding.net www.cat-herding.net" | sudo tee -a /etc/hosts

Issue 5: Istio Control Plane Failure

Symptoms:

  • Cannot deploy new services
  • VirtualService changes not applied
  • Gateway configuration stuck

Impact:

  • Data plane continues working: Existing traffic unaffected
  • Control plane down: Cannot change routing, deploy new services, or update config

Response:

# Step 1: Verify istiod health
kubectl get pods -n aks-istio-system
kubectl describe pod <istiod-pod> -n aks-istio-system
kubectl logs -n aks-istio-system -l app=istiod --tail=100

# Step 2: Check Istio webhook configuration
kubectl get mutatingwebhookconfigurations | grep istio
kubectl get validatingwebhookconfigurations | grep istio

# If webhooks failing:
# - Pods may not get sidecar injection
# - Configuration validation may fail

# Step 3: For Azure-managed Istio, contact Azure support
# Service Request Type: AKS > Istio Add-on Issue
# Include:
# - Cluster name and resource group
# - Timestamp of issue start
# - Output of kubectl get pods -n aks-istio-system

# Step 4: Temporary workarounds
# - Avoid deploying new services until resolved
# - Existing services continue functioning
# - Do not delete/recreate Gateway or VirtualService resources

# Step 5: If self-hosted Istio, restart control plane
kubectl rollout restart deployment/istiod -n istio-system

Emergency Commands Cheat Sheet

# Quick health check
kubectl get pods -A | grep -v Running
kubectl get nodes | grep -v Ready
kubectl top nodes
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20

# Restart everything (last resort)
kubectl rollout restart deployment/<app-name> -n default
kubectl rollout restart deployment/istiod -n aks-istio-system

# Force certificate renewal
kubectl delete secret <cert-secret-name> -n default
kubectl delete certificate <cert-name> -n default
kubectl apply -f certificate.yaml

# Check all Istio configuration
istioctl analyze -A

# Dump Istio proxy configuration
kubectl exec -it <pod-name> -c istio-proxy -n default -- curl localhost:15000/config_dump > config.json

# Port-forward for local debugging
kubectl port-forward svc/portfolio 8080:80 -n default
curl http://localhost:8080

# Emergency: Scale down problematic service
kubectl scale deployment <app-name> --replicas=0 -n default

# View all resources in namespace
kubectl get all -n default

# Check certificate from external
echo | openssl s_client -servername cat-herding.net -connect cat-herding.net:443 2>/dev/null | openssl x509 -noout -dates

Performance Degradation Patterns

Pattern 1: Gradual Performance Degradation

  • Cause: Memory leak or file descriptor exhaustion
  • Solution: Enable HPA, set appropriate resource limits, restart pods regularly

Pattern 2: Spike During Deployments

  • Cause: Rolling update with insufficient replicas
  • Solution: Increase minReplicas, use pod anti-affinity, configure PodDisruptionBudget

Pattern 3: Periodic Slowdowns

  • Cause: Batch jobs, backups, or cron consuming resources
  • Solution: Use separate node pool for batch workloads, set resource limits

Pattern 4: Certificate Expiration Every 90 Days

  • Cause: Let's Encrypt certificates not auto-renewing
  • Solution: Verify cert-manager running, check ClusterIssuer, monitor certificate expiration
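
A PodDisruptionBudget (the mitigation named in Pattern 2) is a one-resource fix for deploy-time and maintenance-time capacity dips; a minimal sketch for the portfolio app:

cat <<EOF | kubectl apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: portfolio-pdb
  namespace: default
spec:
  minAvailable: 1              # never voluntarily evict below one serving replica
  selector:
    matchLabels:
      app: portfolio
EOF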

Monitoring Checklist:

# Set up alerts for:
# - Pod crash loop (>3 restarts in 5 minutes)
# - High error rate (>1% 5xx responses)
# - Certificate expiration (<30 days)
# - Node CPU >80% sustained
# - Node memory >85% sustained
# - Disk space <10% free

# Example: Prometheus alert rules
# (Configure in Prometheus or Azure Monitor)
groups:
- name: kubernetes_alerts
  rules:
  - alert: PodCrashLooping
    expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.pod }} is crash looping"

📊 Monitoring and Observability

Enable Istio Telemetry

# Install Prometheus
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/prometheus.yaml

# Install Grafana
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/grafana.yaml

# Install Kiali (Istio dashboard)
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/kiali.yaml

# Access Kiali dashboard
kubectl port-forward -n istio-system svc/kiali 20001:20001
# Open http://localhost:20001

View Istio Traffic Flow


🎓 Best Practices

1. Resource Organization

k8s/
├── kustomization.yaml              # Root kustomization
└── apps/
    ├── portfolio/
    │   └── base/
    │       ├── kustomization.yaml
    │       ├── deployment.yaml
    │       ├── service.yaml
    │       ├── istio-gateway.yaml
    │       └── istio-virtualservice.yaml
    └── api-service/
        └── base/
            ├── kustomization.yaml
            ├── deployment.yaml
            ├── service.yaml
            ├── istio-gateway.yaml
            └── istio-virtualservice.yaml

2. Security Considerations

  • Use RBAC: Implement proper role-based access control
  • Network Policies: Restrict pod-to-pod communication
  • Image Scanning: Scan container images for vulnerabilities
  • Secrets Management: Use Azure Key Vault for sensitive data
  • TLS Everywhere: Always use HTTPS, even for internal services

3. High Availability

# Example: HA deployment configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: portfolio
spec:
  replicas: 3 # Multiple replicas
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0 # Zero-downtime deployments
  template:
    spec:
      affinity:
        podAntiAffinity: # Spread across nodes
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - portfolio
                topologyKey: kubernetes.io/hostname
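
Pod anti-affinity spreads replicas across nodes; to also spread them across the availability zones configured in Step 1, topology spread constraints can be added to the same pod spec (a sketch, assuming the standard topology.kubernetes.io/zone node label shown earlier):

      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway # use DoNotSchedule for a hard requirement
          labelSelector:
            matchLabels:
              app: portfolio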

4. Resource Limits

Always define resource requests and limits:

resources:
  requests:
    cpu: 250m # Guaranteed CPU
    memory: 512Mi # Guaranteed memory
  limits:
    cpu: 1000m # Maximum CPU
    memory: 1Gi # Maximum memory
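
The kubectl autoscale command shown in the troubleshooting section has a declarative equivalent; a minimal HPA sketch that scales the portfolio deployment on CPU utilization:

# hpa.yaml - declarative HorizontalPodAutoscaler (autoscaling/v2)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: portfolio-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: portfolio
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # matches the --cpu-percent=70 used earlier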

🚀 Deployment Automation

Create Deployment Script

#!/bin/bash
# deploy-app.sh - Automated application deployment

set -e

APP_NAME=$1
SUBDOMAIN=$2
IMAGE=$3
PORT=$4

if [ -z "$APP_NAME" ] || [ -z "$SUBDOMAIN" ] || [ -z "$IMAGE" ] || [ -z "$PORT" ]; then
  echo "Usage: ./deploy-app.sh <app-name> <subdomain> <image> <port>"
  echo "Example: ./deploy-app.sh my-app api myregistry.azurecr.io/my-app:v1.0 8080"
  exit 1
fi

RESOURCE_GROUP="aks-production-rg"
DOMAIN_NAME="cat-herding.net"

echo "🚀 Deploying $APP_NAME on $SUBDOMAIN.$DOMAIN_NAME"

# 1. Get Ingress IP
echo "📡 Getting Ingress Gateway IP..."
INGRESS_IP=$(kubectl get svc -n aks-istio-ingress aks-istio-ingressgateway-external -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "   IP: $INGRESS_IP"

# 2. Create DNS Record
echo "🌐 Creating DNS record..."
az network dns record-set a add-record \
  --resource-group "$RESOURCE_GROUP" \
  --zone-name "$DOMAIN_NAME" \
  --record-set-name "$SUBDOMAIN" \
  --ipv4-address "$INGRESS_IP" 2>/dev/null || echo "   DNS record already exists"

# 3. Create Certificate
echo "🔐 Creating TLS certificate..."
cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: ${SUBDOMAIN}-tls-cert
  namespace: default
spec:
  secretName: ${SUBDOMAIN}-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - ${SUBDOMAIN}.${DOMAIN_NAME}
EOF

# 4. Wait for certificate
echo "⏳ Waiting for certificate to be ready (this may take a few minutes)..."
kubectl wait --for=condition=Ready certificate/${SUBDOMAIN}-tls-cert -n default --timeout=300s

# 5. Deploy application
echo "📦 Deploying application manifests..."
mkdir -p k8s/apps/${APP_NAME}/base
cd k8s/apps/${APP_NAME}/base

# Generate manifests (templates omitted for brevity - see full examples above)
# ... deployment.yaml, service.yaml, gateway.yaml, virtualservice.yaml ...

kubectl apply -k .

# 6. Wait for rollout
echo "⏳ Waiting for deployment to be ready..."
kubectl rollout status deployment/${APP_NAME} -n default

# 7. Test
echo "✅ Testing deployment..."
sleep 10
curl -f https://${SUBDOMAIN}.${DOMAIN_NAME}/health || echo "⚠️  Health check failed (may need time)"

echo "🎉 Deployment complete! Visit https://${SUBDOMAIN}.${DOMAIN_NAME}"

📝 Summary

You now have a complete production-ready Azure AKS cluster with:

  • Managed Kubernetes cluster with auto-scaling
  • Istio service mesh for advanced traffic management
  • Custom domain DNS managed in Azure
  • Automated TLS certificates via cert-manager and Let's Encrypt
  • HTTP/HTTPS routing with subdomain support
  • Monitoring and observability with Prometheus, Grafana, and Kiali
  • Reusable deployment templates for new applications

Complete System Architecture

📊 Production Observability and SRE Practices

Defining SLIs, SLOs, and Error Budgets

Service Level Objectives drive operational decisions and balance feature velocity with reliability.

Example SLO Definition:

# portfolio-slos.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: portfolio-slos
  namespace: default
data:
  slo-definition: |
    # Portfolio Service SLOs (30-day rolling window)

    Availability SLO:
      Target: 99.5% uptime
      Measurement: (successful requests / total requests) >= 0.995
      Error Budget: 0.5% (216 minutes/month or ~7.2 minutes/day)
      
    Latency SLO:
      p50 < 200ms
      p95 < 500ms
      p99 < 1000ms
      
    Error Rate SLO:
      5xx errors < 0.1% of requests
      4xx errors < 5% of requests
      
    Data Freshness SLO:
      Blog posts appear within 60 seconds of publish

SLI Implementation with Prometheus:

# prometheus-rules-slis.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: portfolio-slis
  namespace: default
spec:
  groups:
    - name: portfolio_slis
      interval: 30s
      rules:
        # Availability SLI
        - record: portfolio:availability:ratio_rate5m
          expr: |
            sum(rate(istio_requests_total{
              destination_service_name="portfolio",
              response_code!~"5.."
            }[5m]))
            /
            sum(rate(istio_requests_total{
              destination_service_name="portfolio"
            }[5m]))

        # Latency SLI (p95)
        - record: portfolio:latency:p95
          expr: |
            histogram_quantile(0.95,
              sum(rate(istio_request_duration_milliseconds_bucket{
                destination_service_name="portfolio"
              }[5m])) by (le)
            )

        # Error Rate SLI
        - record: portfolio:error_rate:ratio_rate5m
          expr: |
            sum(rate(istio_requests_total{
              destination_service_name="portfolio",
              response_code=~"5.."
            }[5m]))
            /
            sum(rate(istio_requests_total{
              destination_service_name="portfolio"
            }[5m]))
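
To spot-check these recording rules once they are loaded, you can query Prometheus directly over its HTTP API. This is a rough sketch: it assumes the addon Prometheus from earlier is running in istio-system, that your Prometheus setup actually picks up PrometheusRule objects (for example via the Prometheus Operator), and that curl and jq are available locally.

# Port-forward the Prometheus service installed with the Istio addons
kubectl port-forward -n istio-system svc/prometheus 9090:9090 &

# Instant query: current availability SLI
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=portfolio:availability:ratio_rate5m' | jq '.data.result'

# Range query: p95 latency over the last hour
END=$(date -u +%s); START=$((END - 3600))
curl -s 'http://localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=portfolio:latency:p95' \
  --data-urlencode "start=${START}" \
  --data-urlencode "end=${END}" \
  --data-urlencode 'step=60' | jq '.data.result'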

Alert Design Philosophy:

| Alert Type | Triggers On | Action | Example |
|---|---|---|---|
| Page | SLO violation imminent, customer impact | Immediate response (2-5 min) | Availability <99.0%, p95 latency >1000ms |
| Ticket | Early warning, no customer impact yet | Next business day | Availability <99.7%, error budget 50% consumed |
| Info | Useful context, no action needed | Review weekly | New deployment, configuration change |

# prometheus-alerts-slos.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: portfolio-slo-alerts
  namespace: default
spec:
  groups:
    - name: portfolio_slo_alerts
      rules:
        # Page: SLO violation (immediate impact)
        - alert: PortfolioAvailabilityCritical
          expr: |
            (
              sum(rate(istio_requests_total{
                destination_service_name="portfolio",
                response_code!~"5.."
              }[5m]))
              /
              sum(rate(istio_requests_total{
                destination_service_name="portfolio"
              }[5m]))
            ) < 0.990
          for: 2m
          labels:
            severity: critical
            slo: availability
          annotations:
            summary: "Portfolio availability below 99% (current: {{ $value | humanizePercentage }})"
            description: "SLO violation - immediate customer impact. Error budget exhausted."
            runbook: "https://wiki.company.com/runbooks/portfolio-availability"

        # Ticket: Early warning (50% error budget consumed)
        - alert: PortfolioErrorBudgetBurning
          expr: |
            (
              1 - (
                sum(rate(istio_requests_total{
                  destination_service_name="portfolio",
                  response_code!~"5.."
                }[2h]))
                /
                sum(rate(istio_requests_total{
                  destination_service_name="portfolio"
                }[2h]))
              )
            ) > 0.0025  # 50% of 0.5% error budget
          for: 15m
          labels:
            severity: warning
            slo: availability
          annotations:
            summary: "Error budget burning faster than sustainable rate"
            description: "At current rate, error budget will be exhausted in {{ $value | humanizeDuration }}"

        # Page: Latency SLO violation
        - alert: PortfolioLatencyCritical
          expr: |
            histogram_quantile(0.95,
              sum(rate(istio_request_duration_milliseconds_bucket{
                destination_service_name="portfolio"
              }[5m])) by (le)
            ) > 1000
          for: 5m
          labels:
            severity: critical
            slo: latency
          annotations:
            summary: "p95 latency above 1000ms (current: {{ $value }}ms)"
            description: "User experience degraded. Investigate immediately."

Distributed Tracing Strategy

Trace Context Propagation:

// src/middleware.ts - Ensure trace headers are propagated
import { NextRequest, NextResponse } from "next/server";

export function middleware(request: NextRequest) {
  const response = NextResponse.next();

  // Propagate Istio/Jaeger trace headers
  const traceHeaders = [
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "b3",
    "traceparent", // W3C Trace Context
    "tracestate",
  ];

  traceHeaders.forEach((header) => {
    const value = request.headers.get(header);
    if (value) {
      response.headers.set(header, value);
    }
  });

  return response;
}
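
A quick way to confirm the propagation works end to end is to send a synthetic W3C trace header and check that it comes back on the response (illustrative only; the trace ID below is the standard example value, not a real trace):

# Send a traceparent header and inspect the response headers echoed by the middleware
curl -s -o /dev/null -D - \
  -H 'traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01' \
  https://cat-herding.net/ | grep -iE 'traceparent|x-request-id'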

Jaeger Configuration:

# Deploy Jaeger for distributed tracing
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - name: jaeger
        image: jaegertracing/all-in-one:1.52
        env:
        - name: COLLECTOR_OTLP_ENABLED
          value: "true"
        - name: COLLECTOR_ZIPKIN_HOST_PORT
          value: ":9411"   # accept Zipkin-format spans from Envoy sidecars (matches meshConfig below)
        - name: SPAN_STORAGE_TYPE
          value: "memory"  # Use elasticsearch for production
        ports:
        - containerPort: 16686
          name: ui
        - containerPort: 14268
          name: collector
        - containerPort: 9411
          name: zipkin
        resources:
          requests:
            cpu: 100m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger-query
  namespace: default
spec:
  ports:
  - port: 16686
    targetPort: 16686
    name: ui
  selector:
    app: jaeger
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger-collector
  namespace: default
spec:
  ports:
  - port: 14268
    targetPort: 14268
    name: collector
  - port: 9411
    targetPort: 9411
    name: zipkin
  selector:
    app: jaeger
EOF

# Configure Istio to send traces to Jaeger
# Note: the AKS managed Istio add-on does not accept IstioOperator resources; for the
# managed add-on, apply the equivalent meshConfig through its supported configuration
# mechanism, or use this approach with a self-hosted Istio installation.
kubectl apply -f - <<EOF
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: istio-config
  namespace: aks-istio-system
spec:
  meshConfig:
    defaultConfig:
      tracing:
        sampling: 100.0  # percentage of requests traced (100 = everything; reduce to 1-10 in production)
        zipkin:
          address: jaeger-collector.default.svc.cluster.local:9411
EOF

# Access Jaeger UI
kubectl port-forward -n default svc/jaeger-query 16686:16686
# Open http://localhost:16686

Trace Sampling Strategy:

| Environment | Sampling Rate | Rationale | Storage Cost |
|---|---|---|---|
| Development | 100% | Debug all requests | Negligible (memory storage) |
| Staging | 50% | Balance between coverage and cost | ~$50-100/mo (Elasticsearch) |
| Production (<1K RPS) | 10-20% | Capture anomalies, control costs | ~$100-200/mo |
| Production (>10K RPS) | 1-5% | Focused sampling, head-based | ~$200-500/mo |
| Incident Investigation | 100% (tail-based) | Temporarily sample errors only | Spike during incident |
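
If you would rather not touch global mesh config for every sampling change, newer Istio releases also expose a Telemetry resource that can override the sampling rate per namespace (a sketch; check the API version against the Istio release installed in your cluster):

# Sample ~10% of requests for workloads in the default namespace
kubectl apply -f - <<EOF
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: tracing-sampling
  namespace: default
spec:
  tracing:
    - randomSamplingPercentage: 10.0
EOF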

Capacity Planning Models

Resource Projection Formula:

# capacity-planning.py
def project_capacity(
    current_rps: int,
    target_rps: int,
    current_pods: int,
    cpu_per_request_ms: float,
    memory_per_request_mb: float,
    target_cpu_utilization: float = 0.70,  # 70% target
    overhead_factor: float = 1.20,  # 20% overhead for spikes
) -> dict:
    """
    Project required capacity for target RPS.

    Example:
    - Current: 100 RPS with 3 pods @ 50% CPU
    - Target: 500 RPS
    - CPU per request: 10ms
    - Memory per request: 5MB
    """
    # Calculate CPU requirements
    total_cpu_ms_per_sec = target_rps * cpu_per_request_ms
    total_cpu_cores = (total_cpu_ms_per_sec / 1000) / target_cpu_utilization
    total_cpu_with_overhead = total_cpu_cores * overhead_factor

    # Calculate memory requirements
    # Assume requests are processed concurrently with 100ms average latency
    concurrent_requests = target_rps * 0.1  # 100ms latency = 0.1s
    total_memory_mb = concurrent_requests * memory_per_request_mb
    total_memory_with_overhead = total_memory_mb * overhead_factor

    # Calculate pod requirements
    # Assume each pod can handle 0.5 CPU cores and 1GB memory
    pods_for_cpu = int(total_cpu_with_overhead / 0.5) + 1
    pods_for_memory = int(total_memory_with_overhead / 1024) + 1
    required_pods = max(pods_for_cpu, pods_for_memory)

    # Calculate node requirements
    # Standard_D2s_v3: 2 vCPU, 8GB RAM
    # Assume 1.6 CPU and 6.4GB usable per node (20% for system)
    required_nodes = int((required_pods * 0.5) / 1.6) + 1

    # Cost estimate ($75/node/month for Standard_D2s_v3)
    monthly_cost = required_nodes * 75

    return {
        "target_rps": target_rps,
        "required_pods": required_pods,
        "required_cpu_cores": total_cpu_with_overhead,
        "required_memory_gb": total_memory_with_overhead / 1024,
        "required_nodes": required_nodes,
        "monthly_cost_usd": monthly_cost,
        "cpu_utilization_target": target_cpu_utilization,
        "overhead_factor": overhead_factor,
    }

# Example calculation
result = project_capacity(
    current_rps=100,
    target_rps=500,
    current_pods=3,
    cpu_per_request_ms=10,
    memory_per_request_mb=5,
)

print(f"For {result['target_rps']} RPS:")
print(f"  Required Pods: {result['required_pods']}")
print(f"  Required Nodes: {result['required_nodes']}")
print(f"  CPU Cores: {result['required_cpu_cores']:.2f}")
print(f"  Memory: {result['required_memory_gb']:.2f} GB")
print(f"  Monthly Cost: ${result['monthly_cost_usd']}")

Load Testing with K6:

// load-test.js
import http from "k6/http";
import { check, sleep } from "k6";
import { Rate, Trend } from "k6/metrics";

// Custom metrics
const errorRate = new Rate("error_rate");
const latencyTrend = new Trend("latency");

export const options = {
  stages: [
    { duration: "2m", target: 50 }, // Ramp up to 50 users
    { duration: "5m", target: 50 }, // Stay at 50 users
    { duration: "2m", target: 100 }, // Ramp to 100 users
    { duration: "5m", target: 100 }, // Stay at 100 users
    { duration: "2m", target: 200 }, // Spike to 200 users
    { duration: "5m", target: 200 }, // Sustained high load
    { duration: "2m", target: 0 }, // Ramp down
  ],
  thresholds: {
    http_req_duration: ["p(95)<500", "p(99)<1000"], // 95% < 500ms, 99% < 1s
    error_rate: ["rate<0.01"], // Error rate < 1%
    http_req_failed: ["rate<0.01"],
  },
};

export default function () {
  const response = http.get("https://cat-herding.net");

  // Check response
  const success = check(response, {
    "status is 200": (r) => r.status === 200,
    "response time OK": (r) => r.timings.duration < 500,
  });

  // Record metrics
  errorRate.add(!success);
  latencyTrend.add(response.timings.duration);

  sleep(1);
}

// Run test:
// k6 run --out json=results.json load-test.js
// Analyze: inspect results.json with jq or import it into your observability stack

Deployment Strategies and Rollback Plans

Canary Deployment with Istio:

# portfolio-canary.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: portfolio-v2-canary
  namespace: default
spec:
  replicas: 1 # Start with 1 pod (10% of traffic)
  selector:
    matchLabels:
      app: portfolio
      version: v2
  template:
    metadata:
      labels:
        app: portfolio
        version: v2
    spec:
      containers:
        - name: portfolio
          image: gabby.azurecr.io/portfolio:v2-candidate
          # ... rest of spec same as v1
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: portfolio-canary
  namespace: default
spec:
  hosts:
    - "cat-herding.net"
    - "www.cat-herding.net"
  gateways:
    - portfolio-gateway
  http:
    - match:
        - headers:
            x-canary:
              exact: "true" # Internal testing
      route:
        - destination:
            host: portfolio
            subset: v2
          weight: 100
    - route:
        - destination:
            host: portfolio
            subset: v1
          weight: 90 # 90% to stable version
        - destination:
            host: portfolio
            subset: v2
          weight: 10 # 10% to canary
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: portfolio-versions
  namespace: default
spec:
  host: portfolio
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
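
If you are not ready for full automation, the same canary split can be nudged manually with a JSON patch against the VirtualService above (a sketch; the /spec/http/1 indices refer to the second, weighted route block and will shift if you reorder the rules):

# Move the default route from 90/10 to 70/30 (stable/canary)
kubectl -n default patch virtualservice portfolio-canary --type=json -p='[
  {"op": "replace", "path": "/spec/http/1/route/0/weight", "value": 70},
  {"op": "replace", "path": "/spec/http/1/route/1/weight", "value": 30}
]'

# Confirm the new weights took effect
kubectl -n default get virtualservice portfolio-canary -o yaml | grep -B2 weight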

Automated Canary Promotion with Flagger:

# Install Flagger (progressive delivery operator)
kubectl apply -k github.com/fluxcd/flagger//kustomize/istio

# Define automated canary rollout
cat <<EOF | kubectl apply -f -
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: portfolio
  namespace: default
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: portfolio
  service:
    port: 80
  analysis:
    interval: 1m
    threshold: 5  # Number of failed checks before rollback
    maxWeight: 50  # Max canary weight
    stepWeight: 10  # Increment weight by 10% each step
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99  # Rollback if success rate < 99%
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500  # Rollback if p99 > 500ms
      interval: 1m
    webhooks:
    - name: slack-notification
      url: https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK
      timeout: 5s
      metadata:
        channel: "#deployments"
EOF

# Trigger canary by updating image
kubectl set image deployment/portfolio portfolio=gabby.azurecr.io/portfolio:v2

# Flagger will:
# 1. Deploy canary (0% traffic)
# 2. Shift 10% traffic to canary
# 3. Wait 1 minute, check metrics
# 4. If metrics good: Shift 20%, repeat
# 5. If metrics bad: Rollback immediately
# 6. If success: Promote to 100%

Emergency Rollback Playbook:

# SCENARIO: Deployment causing errors, need immediate rollback

# Step 1: Identify current deployment
kubectl rollout history deployment/portfolio -n default

# Step 2: Check current revision
kubectl get deployment portfolio -n default -o jsonpath='{.metadata.annotations.deployment\.kubernetes\.io/revision}'

# Step 3: Quick rollback to previous revision
kubectl rollout undo deployment/portfolio -n default

# Step 4: Verify rollback is working
kubectl rollout status deployment/portfolio -n default
kubectl get pods -n default -l app=portfolio -w

# Step 5: If using Istio canary, immediately shift traffic to v1
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: portfolio-virtualservice
  namespace: default
spec:
  hosts:
  - "cat-herding.net"
  - "www.cat-herding.net"
  gateways:
  - portfolio-gateway
  http:
  - route:
    - destination:
        host: portfolio
        subset: v1
      weight: 100  # 100% to stable version
    - destination:
        host: portfolio
        subset: v2
      weight: 0  # 0% to broken canary
EOF

# Step 6: Communicate status
# Post to #incidents Slack channel:
# "🔥 Emergency rollback executed for portfolio service
#  Reason: [high error rate / latency / crashes]
#  Action: Rolled back from revision X to revision Y
#  Status: Monitoring recovery
#  ETA: Stable in 2-5 minutes"

# Step 7: Monitor recovery
watch -n 2 'kubectl get pods -n default -l app=portfolio'

# Step 8: Verify metrics returning to normal
# Check Grafana dashboard, Prometheus alerts, error logs

# Step 9: Post-mortem (after recovery)
# Document:
# - What triggered the rollback
# - How it was detected (alert, user report, monitoring)
# - Time to detection and time to recovery
# - Root cause analysis
# - Action items to prevent recurrence

Cost Optimization Strategies

Right-Sizing Workloads:

# Install the Vertical Pod Autoscaler (VPA) for recommendations
# (the upstream project installs via its helper script; check the VPA docs for current instructions)
git clone --depth 1 https://github.com/kubernetes/autoscaler.git
(cd autoscaler/vertical-pod-autoscaler && ./hack/vpa-up.sh)

# Create VPA in recommendation mode (doesn't auto-adjust)
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: portfolio-vpa
  namespace: default
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: portfolio
  updatePolicy:
    updateMode: "Off"  # Recommendation only
  resourcePolicy:
    containerPolicies:
    - containerName: portfolio
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2000m
        memory: 4Gi
EOF

# Get recommendations after 24-48 hours
kubectl describe vpa portfolio-vpa -n default

# Example output:
# Recommendation:
#   Container Recommendations:
#     Container Name:  portfolio
#     Lower Bound:
#       Cpu:     150m
#       Memory:  256Mi
#     Target:
#       Cpu:     300m  # Recommended request
#       Memory:  512Mi  # Recommended request
#     Uncapped Target:
#       Cpu:     450m
#       Memory:  768Mi
#     Upper Bound:
#       Cpu:     1000m
#       Memory:  1Gi
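
Before acting on the VPA numbers, it is worth comparing them against live usage from metrics-server (a sketch; metrics-server is enabled on AKS by default, and the label selector follows the examples above):

# Live usage per pod vs. the requests currently declared
kubectl top pods -n default -l app=portfolio
kubectl get pods -n default -l app=portfolio \
  -o custom-columns='NAME:.metadata.name,CPU_REQ:.spec.containers[0].resources.requests.cpu,MEM_REQ:.spec.containers[0].resources.requests.memory'

# Node-level headroom across the cluster
kubectl top nodes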

Spot Instance Strategy:

# Mixed node pool: on-demand + spot
# Use on-demand for critical services, spot for batch workloads

# Example: a batch-workload Deployment that prefers spot nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: portfolio-batch-jobs
spec:
  replicas: 1
  selector:
    matchLabels:
      app: portfolio-batch-jobs
  template:
    metadata:
      labels:
        app: portfolio-batch-jobs
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: kubernetes.azure.com/scalesetpriority
                    operator: In
                    values:
                      - spot
      tolerations:
        - key: kubernetes.azure.com/scalesetpriority
          operator: Equal
          value: spot
          effect: NoSchedule
      containers:
        - name: batch-processor
          image: gabby.azurecr.io/batch:latest
          # Batch jobs can tolerate interruptions (70-90% cost savings)

Idle Resource Cleanup:

# Identify unused resources
# Cost optimization tip: Delete these regularly

# Unused PersistentVolumeClaims (orphaned storage)
kubectl get pvc -A --field-selector=status.phase!=Bound

# Unused ConfigMaps/Secrets (check if referenced)
kubectl get configmaps -A
kubectl get secrets -A

# Old ReplicaSets (kept for rollback history, can be pruned)
kubectl get rs -A --sort-by=.metadata.creationTimestamp

# Clean up successfully completed jobs (dry run; for age-based cleanup, see the TTL approach below)
kubectl delete jobs --field-selector status.successful=1 \
  --all-namespaces \
  --dry-run=client  # Remove dry-run when ready

# Schedule automatic cleanup with CronJob
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cleanup-old-jobs
  namespace: default
spec:
  schedule: "0 2 * * 0"  # Every Sunday at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cleanup-sa
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              # Delete successfully completed Jobs in every namespace
              # (for age-based cleanup, prefer ttlSecondsAfterFinished; see the sketch after this block)
              for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
                kubectl delete jobs -n "$ns" --field-selector status.successful=1 --ignore-not-found=true
              done
          restartPolicy: OnFailure
EOF
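
For newly created Jobs, Kubernetes can do this cleanup natively with ttlSecondsAfterFinished, which removes the need for the CronJob above (a minimal sketch):

# Jobs with a TTL are garbage-collected automatically after they finish
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: one-off-task
  namespace: default
spec:
  ttlSecondsAfterFinished: 604800  # delete 7 days after completion
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: task
          image: busybox:1.36
          command: ["sh", "-c", "echo done"]
EOF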

Cost Monitoring Dashboard:

| Resource | Monthly Cost | Optimization Opportunity |
|---|---|---|
| AKS cluster management | $0 | N/A (free) |
| 3x Standard_D2s_v3 nodes | $225 | Use spot instances for non-critical (~$68) |
| Managed Istio | $150 | Consider self-hosted if team has expertise |
| Load Balancer (public IP) | $20 | Share LB across multiple services |
| Azure DNS Zone | $1 | N/A (minimal) |
| Bandwidth (100GB egress) | $10 | Use CDN to reduce egress |
| Persistent storage (500GB) | $38 | Clean up old PVCs, use lifecycle policies |
| Total | $444/mo | Optimized: $292/mo (34% savings) |

Azure vs GCP: Cost Comparison for Hobby Projects

Why I Migrated from GCP to Azure (Real Experience)

After running my portfolio on GKE (Google Kubernetes Engine) for 18 months, I migrated to AKS and saw significant cost savings. Here's the honest breakdown:

GCP Configuration (Previous Setup)

# GKE Cluster - us-central1-a
# 3x e2-medium nodes (2 vCPU, 4GB RAM each)
# Cloud SQL PostgreSQL (db-f1-micro)
# Cloud Load Balancer
# Cloud DNS

# Monthly Cost Breakdown (GCP):
# - GKE cluster management: $74.40/mo ($0.10/hour)
# - 3x e2-medium nodes: $73.73/mo ($24.58 each)
# - Cloud SQL (db-f1-micro): $7.67/mo
# - Cloud Load Balancer: $18.26/mo
# - Cloud SQL Proxy (included): $0
# - Cloud DNS zone: $0.20/mo
# - Bandwidth (100GB egress): $12.00/mo
# - Persistent disks (500GB): $80.00/mo (standard PD)
# Total: ~$266/mo (WITHOUT Istio)
# With Istio self-hosted: ~$340/mo

Azure Configuration (Current Setup)

# AKS Cluster - East US
# 3x Standard_B2s nodes (2 vCPU, 4GB RAM each) - HOBBY TIER
# Azure Database for PostgreSQL (Flexible Server, Burstable B1ms)
# Azure Load Balancer
# Azure DNS

# Monthly Cost Breakdown (Azure):
# - AKS cluster management: $0 (FREE!)
# - 3x Standard_B2s nodes: $120.36/mo ($40.12 each)
# - Azure Database for PostgreSQL (B1ms): $12.41/mo
# - Load Balancer (public IP): $3.65/mo
# - Azure DNS zone: $0.50/mo
# - Bandwidth (100GB egress): $8.76/mo
# - Managed disks (500GB Standard HDD): $19.71/mo
# Total: ~$165/mo (WITHOUT Istio)
# With managed Istio: ~$315/mo

Side-by-Side Comparison

| Component | GCP Cost | Azure Cost | Savings | Notes |
|---|---|---|---|---|
| Cluster Management | $74.40 | $0 | $74.40 | AKS management is completely free |
| 3x Nodes (2vCPU, 4GB) | $73.73 (e2-medium) | $120.36 (B2s) | -$46.63 | Azure B-series burstable cheaper than D-series but more than GCP e2 |
| Database (small) | $7.67 (db-f1-micro) | $12.41 (B1ms) | -$4.74 | Azure Flexible Server slightly more |
| Load Balancer | $18.26 | $3.65 | $14.61 | Azure LB significantly cheaper |
| DNS Zone | $0.20 | $0.50 | -$0.30 | Negligible difference |
| Bandwidth (100GB) | $12.00 | $8.76 | $3.24 | Azure egress slightly cheaper |
| Storage (500GB) | $80.00 (SSD) | $19.71 (HDD) | $60.29 | Azure Standard HDD vs GCP Standard PD |
| Managed Istio | N/A (self-host ~$70) | $150 | Variable | Azure offers managed option |
| Total (no Istio) | $266 | $165 | $101/mo (38%) | Azure saves ~$1,212/year |
| Total (with Istio) | $340 (self-hosted) | $315 (managed) | $25/mo (7%) | Azure saves ~$300/year |

Hobby Cluster Optimization: The $50/mo Setup

For personal projects and learning, here's how to run AKS for ~$50-75/month:

# Ultra-Low-Cost Azure Configuration
# 1x Standard_B2s node (starts 1, autoscales to 2)
# Azure Database for PostgreSQL (Burstable B1ms with auto-pause)
# NO Istio (use native Kubernetes Ingress)
# Azure DNS + Let's Encrypt

# Cost Breakdown:
# - AKS cluster management: $0
# - 1x Standard_B2s node: $40.12/mo
# - Azure PostgreSQL (B1ms, auto-pause): $6.21/mo (50% time paused)
# - Load Balancer: $3.65/mo
# - NGINX Ingress Controller: $0 (runs on node)
# - Azure DNS: $0.50/mo
# - Bandwidth (20GB/mo): $1.75/mo
# - Managed disk (100GB): $3.94/mo
# Total: ~$56/mo

# GCP Equivalent (Lowest Cost):
# - GKE Autopilot (smallest): $74.40 management + ~$50 compute = ~$124/mo
# OR
# - GKE Standard (1 node e2-small): $74.40 + $15.33 + $7.67 DB + $18 LB = ~$115/mo

# Azure wins by ~$59-68/mo (52-59% savings) for hobby projects

Configuration for $50/mo Hobby Cluster:

# hobby-cluster.yaml - Minimal viable AKS setup
apiVersion: v1
kind: Namespace
metadata:
  name: portfolio
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: portfolio
  namespace: portfolio
spec:
  replicas: 1 # Single replica for hobby projects
  selector:
    matchLabels:
      app: portfolio
  template:
    metadata:
      labels:
        app: portfolio
    spec:
      containers:
        - name: portfolio
          image: gabby.azurecr.io/portfolio:latest
          ports:
            - containerPort: 3000
          resources:
            requests:
              cpu: 100m # Minimal CPU
              memory: 256Mi # Minimal memory
            limits:
              cpu: 500m
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /api/health
              port: 3000
            initialDelaySeconds: 30
          readinessProbe:
            httpGet:
              path: /api/health
              port: 3000
            initialDelaySeconds: 5
---
# ClusterIP Service fronting the deployment (referenced by the Ingress below)
apiVersion: v1
kind: Service
metadata:
  name: portfolio
  namespace: portfolio
spec:
  selector:
    app: portfolio
  ports:
    - port: 80
      targetPort: 3000
---
# Use NGINX Ingress instead of Istio (free)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: portfolio-ingress
  namespace: portfolio
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - cat-herding.net
        - www.cat-herding.net
      secretName: cat-herding-tls
  rules:
    - host: cat-herding.net
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: portfolio
                port:
                  number: 80
    - host: www.cat-herding.net
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: portfolio
                port:
                  number: 80

Setup Commands for $50/mo Cluster:

# Create minimal AKS cluster
az aks create \
  --resource-group portfolio-hobby \
  --name portfolio-cluster \
  --node-count 1 \
  --min-count 1 \
  --max-count 2 \
  --node-vm-size Standard_B2s \
  --enable-cluster-autoscaler \
  --network-plugin azure \
  --network-policy azure \
  --tier free \
  --no-ssh-key

# Install NGINX Ingress (replaces Istio)
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --set controller.service.externalTrafficPolicy=Local \
  --set controller.resources.requests.cpu=50m \
  --set controller.resources.requests.memory=128Mi

# Install cert-manager (same as before)
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml

# Total setup time: ~15 minutes
# No Istio complexity, no managed service fees
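
Once the controller has an external IP, point DNS at it the same way as in the Istio setup earlier (a sketch; service and resource names follow the helm install and az aks create commands above, and it assumes the DNS zone lives in the same resource group):

# External IP assigned to the NGINX ingress controller
INGRESS_IP=$(kubectl get svc -n ingress-nginx ingress-nginx-controller \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "Ingress IP: $INGRESS_IP"

# Point the apex record at it
az network dns record-set a add-record \
  --resource-group portfolio-hobby \
  --zone-name cat-herding.net \
  --record-set-name "@" \
  --ipv4-address "$INGRESS_IP"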

Key Takeaways for Hobby Projects

Azure Advantages:

  • Free cluster management ($74.40/mo savings vs GCP)
  • Cheaper load balancer ($14.61/mo savings)
  • Cheaper storage ($60/mo savings with HDD)
  • Burstable B-series VMs ideal for bursty traffic (personal sites)
  • Simple pricing - fewer surprise charges
  • Better free tier - Can run hobby cluster for ~$50/mo

GCP Advantages:

  • Cheaper compute (e2-medium < B2s)
  • GKE Autopilot removes node management (but costs more)
  • Better integration with Google Cloud services (Cloud Run, etc.)
  • More mature Istio integration (created by Google)

Decision Framework:

My Migration Experience:

| Metric | Before (GCP) | After (Azure) | Change |
|---|---|---|---|
| Monthly Cost | $266 | $165 | -$101 (-38%) |
| Cluster Management Fee | $74.40 | $0 | -$74.40 (-100%) |
| Annual Savings | - | - | $1,212/year |
| Performance (p95 latency) | 180ms | 165ms | -15ms (comparable) |
| Downtime during migration | 0 minutes | 0 minutes | Blue-green cutover |
| Migration complexity | - | 4 hours | Mostly manifest updates |
| Regrets | None | None | Would migrate again |

Bottom Line for Hobby Projects:

  • GCP: Great if you're already using Google Cloud services or want Autopilot simplicity
  • Azure: Better for pure Kubernetes workloads, especially if cost-sensitive
  • Azure wins by $1,212/year for similar configuration without Istio
  • Azure wins by ~$700/year even with managed Istio

For my personal portfolio, the migration to Azure was a no-brainer—I saved $1,212/year while gaining managed Istio as an option. The free cluster management fee alone pays for a nice dinner every month.

Comprehensive Service Cost Comparison: Azure vs GCP vs AWS

Beyond the base Kubernetes infrastructure, let's compare the full stack of services you'll need for a production application across all three major cloud providers.

Database Services: PostgreSQL Comparison

| Tier | GCP Cloud SQL | Azure Database | AWS RDS PostgreSQL | Winner | Use Case |
|---|---|---|---|---|---|
| Hobby | db-f1-micro<br/>1 shared vCPU, 614MB<br/>$7.67/mo | B1ms<br/>1 vCPU, 2GB<br/>$12.41/mo | db.t4g.micro<br/>2 vCPU, 1GB<br/>$12.41/mo | GCP 🏆 | Personal, <100 req/hr |
| Small | db-g1-small<br/>1 shared vCPU, 1.7GB<br/>$25.00/mo | B2s<br/>2 vCPU, 4GB<br/>$49.64/mo | db.t4g.small<br/>2 vCPU, 2GB<br/>$24.82/mo | AWS 🏆 | Small apps, <1K req/hr |
| Medium | db-n1-standard-1<br/>1 vCPU, 3.75GB<br/>$49.95/mo | D2s_v3<br/>2 vCPU, 8GB<br/>$116.80/mo | db.t4g.medium<br/>2 vCPU, 4GB<br/>$49.64/mo | GCP 🏆 | Mid-size, 1-10K req/hr |
| High Availability | db-n1-std-2 + HA<br/>2 vCPU, 7.5GB<br/>$198.90/mo | D2s_v3 HA<br/>2 vCPU, 8GB<br/>$233.60/mo | db.m6g.large Multi-AZ<br/>2 vCPU, 8GB<br/>$188.50/mo | AWS 🏆 | Production with failover |
| High Performance | db-n1-standard-4<br/>4 vCPU, 15GB<br/>$199.05/mo | D4s_v3<br/>4 vCPU, 16GB<br/>$233.59/mo | db.m6g.xlarge<br/>4 vCPU, 16GB<br/>$282.24/mo | GCP 🏆 | Heavy, >10K req/hr |

Overall Winner: GCP for hobby/medium, AWS for small/HA scenarios

Note: GCP's shared vCPU tiers are cheapest for hobby projects. AWS t4g (Graviton) instances offer best price/performance for small-medium workloads.

Database Services: MySQL Comparison

| Tier | GCP Cloud SQL | Azure Database | AWS RDS MySQL | Winner | Notes |
|---|---|---|---|---|---|
| Hobby | db-f1-micro<br/>$7.18/mo | B1ms<br/>$13.14/mo | db.t4g.micro<br/>$11.59/mo | GCP 🏆 | MySQL 8.0 |
| Small | db-g1-small<br/>$24.51/mo | B2s<br/>$52.56/mo | db.t4g.small<br/>$23.18/mo | AWS 🏆 | MySQL 8.0 |
| Medium | db-n1-standard-1<br/>$48.23/mo | D2s_v3<br/>$123.29/mo | db.t4g.medium<br/>$46.35/mo | AWS 🏆 | MySQL 8.0 |
| High Performance | db-n1-standard-4<br/>$192.92/mo | D4s_v3<br/>$246.58/mo | db.m6g.xlarge<br/>$263.52/mo | GCP 🏆 | MySQL 8.0 |

Overall Winner: GCP for hobby, AWS for small/medium production workloads

Cache Services: Redis Comparison

| Tier | GCP Memorystore | Azure Cache for Redis | AWS ElastiCache | Winner | Notes |
|---|---|---|---|---|---|
| Basic 1GB | $41.61/mo | Basic C1 (1GB)<br/>$16.06/mo | cache.t4g.micro (0.5GB)<br/>$11.52/mo | AWS 🏆 | No HA, single zone |
| Basic 5GB | $208.05/mo | Basic C3 (6GB)<br/>$64.24/mo | cache.m6g.large (6.38GB)<br/>$75.74/mo | Azure 🏆 | No HA |
| Standard 1GB HA | $78.84/mo | Standard C1 (1GB)<br/>$44.53/mo | cache.t4g.small (1.55GB)<br/>$46.08/mo | Azure 🏆 | High availability |
| Standard 5GB HA | $393.44/mo | Standard C3 (6GB)<br/>$128.48/mo | cache.m6g.large (6.38GB) Multi-AZ<br/>$151.48/mo | Azure 🏆 | HA with replica |
| Premium 26GB HA | $810.72/mo | Premium P1 (6GB)<br/>$164.25/mo | cache.m6g.xlarge (12.93GB) Cluster<br/>$302.96/mo | Azure 🏆 | Clustering, persistence |
| Premium 100GB HA | $3,242.88/mo | Premium P4 (26GB)<br/>$657.00/mo | cache.r6g.xlarge (26.32GB) Cluster<br/>$605.92/mo | AWS 🏆 | Enterprise features |

Overall Winner: Azure for most tiers (40-70% cheaper), AWS for small hobby and large enterprise

Note: Azure Cache for Redis is dramatically cheaper than both GCP and AWS for standard production workloads. This is one of Azure's biggest competitive advantages.

Object Storage Comparison

| Storage Type | GCP Cloud Storage | Azure Blob Storage | AWS S3 | Winner | Use Case |
|---|---|---|---|---|---|
| Hot Storage | Standard<br/>$0.020/GB/mo | Hot tier<br/>$0.0184/GB/mo | Standard<br/>$0.023/GB/mo | Azure 🏆 | Frequently accessed |
| Warm Storage | Nearline (30-day)<br/>$0.010/GB/mo | Cool tier<br/>$0.01/GB/mo | S3 Infrequent Access<br/>$0.0125/GB/mo | Tie (Azure/GCP) | Monthly access |
| Cold Storage | Coldline (90-day)<br/>$0.004/GB/mo | Archive tier<br/>$0.00099/GB/mo | Glacier Flexible<br/>$0.0036/GB/mo | Azure 🏆 | Rarely accessed |
| Deep Archive | Archive<br/>$0.0012/GB/mo | Archive tier<br/>$0.00099/GB/mo | Glacier Deep Archive<br/>$0.00099/GB/mo | Tie (Azure/AWS) | Long-term backup |
| Egress (1TB) | $0.12/GB | $0.087/GB | $0.09/GB | Azure 🏆 | Data transfer out |
| Requests (per 10K) | $0.40 (Class A) | $0.50 (write) | $0.50 (PUT) | GCP 🏆 | Write operations |

Overall Winner: Azure for storage costs and egress, GCP for request pricing

Note: Azure Archive tier is cheapest for cold storage. All three are competitive for hot storage within 20%.

Compute (Virtual Machines) Comparison

| VM Size | GCP (us-central1) | Azure (East US) | AWS (us-east-1) | Winner | Specs |
|---|---|---|---|---|---|
| Micro | e2-micro<br/>$6.11/mo | B1s<br/>$7.59/mo | t4g.nano<br/>$3.07/mo | AWS 🏆 | 0.25-2 vCPU, 0.5-1GB |
| Small | e2-small<br/>$12.23/mo | B1ms<br/>$15.18/mo | t4g.micro<br/>$6.14/mo | AWS 🏆 | 0.5-2 vCPU, 1-2GB |
| Medium Burst | e2-medium<br/>$24.45/mo | B2s<br/>$40.12/mo | t4g.small<br/>$12.29/mo | AWS 🏆 | 2 vCPU, 2-4GB |
| Medium Standard | n1-standard-1<br/>$24.73/mo | D2s_v3<br/>$75.19/mo | t4g.medium<br/>$24.58/mo | GCP 🏆 | 1-2 vCPU, 3.75-4GB |
| Medium Standard | n1-standard-2<br/>$49.45/mo | D2s_v3<br/>$75.19/mo | m6g.large<br/>$60.74/mo | GCP 🏆 | 2 vCPU, 7.5-8GB |
| Large | n1-standard-4<br/>$98.91/mo | D4s_v3<br/>$150.38/mo | m6g.xlarge<br/>$121.47/mo | GCP 🏆 | 4 vCPU, 15-16GB |
| Spot Instance | e2-medium (preemptible)<br/>$7.34/mo (70% off) | B2s (spot)<br/>$12.04/mo (70% off) | t4g.small (spot)<br/>$3.69/mo (70% off) | AWS 🏆 | 2 vCPU, 2-4GB |

Overall Winner: AWS for burstable/spot instances, GCP for standard compute

Note: AWS t4g (Graviton ARM) instances offer best value for burstable workloads. GCP wins for sustained compute. Azure is most expensive across the board.

Kubernetes Node Pricing (Actual Cluster Nodes)

| Node Type | GCP GKE | Azure AKS | AWS EKS | Winner (3-node cost) |
|---|---|---|---|---|
| Hobby | e2-small<br/>0.5 vCPU, 2GB | B1ms<br/>1 vCPU, 2GB | t4g.micro<br/>2 vCPU, 1GB | GCP: $36.69<br/>Azure: $45.54<br/>AWS: $18.42 🏆 |
| Small | e2-medium<br/>2 vCPU, 4GB | B2s<br/>2 vCPU, 4GB | t4g.small<br/>2 vCPU, 2GB | GCP: $73.35<br/>Azure: $120.36<br/>AWS: $36.87 🏆 |
| Production | n1-standard-2<br/>2 vCPU, 7.5GB | D2s_v3<br/>2 vCPU, 8GB | m6g.large<br/>2 vCPU, 8GB | GCP: $148.35 🏆<br/>Azure: $225.57<br/>AWS: $182.22 |
| High Perf | n1-standard-4<br/>4 vCPU, 15GB | D4s_v3<br/>4 vCPU, 16GB | m6g.xlarge<br/>4 vCPU, 16GB | GCP: $296.73 🏆<br/>Azure: $451.14<br/>AWS: $364.41 |
| Spot Nodes | e2-medium<br/>(preemptible) | B2s<br/>(spot) | t4g.small<br/>(spot) | GCP: $22.02<br/>Azure: $36.12<br/>AWS: $11.07 🏆 |

Overall Winner: AWS for hobby/small/spot, GCP for production sustained workloads

Critical Factors:

  • GCP: Add $74.40/mo cluster management fee (negates savings for small clusters)
  • Azure: $0 cluster management (free)
  • AWS: Add $73/mo per cluster management fee (similar to GCP)

Load Balancer Comparison

| Load Balancer Type | GCP | Azure | AWS | Winner | Notes |
|---|---|---|---|---|---|
| Basic (HTTP/HTTPS) | Cloud LB<br/>$18.26/mo | Standard LB<br/>$3.65/mo | ALB<br/>$16.43/mo | Azure 🏆 | 5 forwarding rules |
| With SSL | + free cert<br/>$18.26/mo | + free cert<br/>$3.65/mo | + free cert (ACM)<br/>$16.43/mo | Azure 🏆 | Managed certificates |
| Network LB | Regional LB<br/>$18.26/mo | Basic LB<br/>$0/mo | NLB<br/>$16.43/mo | Azure 🏆 | Layer 4 |
| Data Processing | $0.008/GB | $0.005/GB | $0.008/GB | Azure 🏆 | Per GB processed |
| LCU Hours | Included | Included | $0.008/hour | Tie (Azure/GCP) | Load Balancer Capacity Units |

Overall Winner: Azure for load balancers (75-100% cheaper than both)

Note: Azure's load balancer pricing is one of its strongest competitive advantages. AWS ALB charges for LCUs, GCP charges per forwarding rule.

Content Delivery Network (CDN) Comparison

| CDN Tier | GCP Cloud CDN | Azure CDN | AWS CloudFront | Winner | Use Case |
|---|---|---|---|---|---|
| Cache Egress (10TB) | $0.08/GB | $0.087/GB | $0.085/GB | GCP 🏆 | Cache hit |
| Origin Egress | $0.12/GB | $0.087/GB | $0.085/GB | AWS 🏆 | Cache miss |
| Requests (per 10K) | $0.0075 | $0.0096 | $0.0075 | Tie (GCP/AWS) | HTTP/HTTPS |

Overall Winner: GCP/AWS tie for cache performance, Azure for origin-heavy workloads

Note: All three are competitive within 10%. Choice depends on your cache hit ratio and existing cloud infrastructure.

Managed Kubernetes Service: Total Cost Comparison

Let's compare identical workloads on all three platforms:

Scenario: 3-node production cluster with database, cache, and storage

| Component | GCP Monthly Cost | Azure Monthly Cost | AWS Monthly Cost | Winner |
|---|---|---|---|---|
| Cluster Management Fee | $74.40 | $0 | $73.00 | Azure |
| 3x Nodes (2 vCPU, 4GB) | $73.35 ✅ (e2-medium) | $120.36 (B2s) | $36.87 ✅ (t4g.small) | AWS |
| PostgreSQL (small) | $25.00 ✅ (db-g1-small) | $49.64 (B2s) | $24.82 ✅ (t4g.small) | AWS |
| Redis Cache (1GB HA) | $78.84 (Memorystore) | $44.53 ✅ (Standard C1) | $46.08 (t4g.small) | Azure |
| Load Balancer | $18.26 | $3.65 | $16.43 | Azure |
| Storage (500GB SSD) | $85.00 (SSD PD) | $48.00 ✅ (Premium SSD) | $50.00 (gp3) | Azure |
| Bandwidth (100GB) | $12.00 | $8.76 | $9.00 | Azure |
| DNS | $0.20 | $0.50 | $0.50 (Route53) | GCP |
| Container Registry | $0 ✅ (500MB free) | $5.00 (Basic) | $0 ✅ (500MB free) | Tie (GCP/AWS) |
| Monitoring | $8.00 (Cloud Logging) | $0 ✅ (basic included) | $10.00 (CloudWatch) | Azure |
| Total | $375.05 | $280.44 | $266.70 | AWS wins by $13.74/mo |
| Annual Total | $4,500.60 | $3,365.28 | $3,200.40 | AWS saves $1,300.20/year |

Winner: AWS for this configuration (saves $1,300/year vs GCP, $165/year vs Azure)

Cost Winner by Service Category

Real-World Cost Scenario Analysis

Scenario 1: Hobby Developer (Personal Portfolio)

Workload:
- 1 node cluster
- Small database
- No cache
- <1K visitors/day

GCP Cost:
- GKE management: $74.40
- 1x e2-small node: $12.23
- db-f1-micro: $7.67
- Load balancer: $18.26
- DNS: $0.20
Total: $112.76/mo

Azure Cost:
- AKS management: $0 ✅
- 1x B1ms node: $15.18
- B1ms PostgreSQL: $12.41
- Load balancer: $3.65 ✅
- DNS: $0.50
Total: $31.74/mo

AWS Cost:
- EKS management: $73.00
- 1x t4g.micro node: $6.14 ✅
- db.t4g.micro: $12.41
- ALB: $16.43
- Route53: $0.50
Total: $108.48/mo

Winner: Azure saves $81.02/mo vs GCP (72% cheaper)
Winner: Azure saves $76.74/mo vs AWS (71% cheaper)

Scenario 2: Startup (10 Services, Growing Traffic)

Workload:
- 3-5 nodes (autoscaling)
- Medium database with HA
- Redis cache for sessions
- 10-50K visitors/day

GCP Cost:
- GKE management: $74.40
- 3x e2-medium nodes: $73.35
- db-n1-standard-2 HA: $198.90
- Memorystore 5GB HA: $393.44
- Load balancer: $18.26
- Storage 1TB: $170.00
- Bandwidth 500GB: $60.00
Total: $988.35/mo

Azure Cost:
- AKS management: $0 ✅
- 3x B2s nodes: $120.36
- D2s_v3 PostgreSQL HA: $233.60
- Standard C3 Redis HA: $128.48 ✅
- Load balancer: $3.65 ✅
- Storage 1TB: $96.00 ✅
- Bandwidth 500GB: $43.80 ✅
Total: $625.89/mo

AWS Cost:
- EKS management: $73.00
- 3x t4g.small nodes: $36.87 ✅
- db.m6g.large Multi-AZ: $188.50 ✅
- cache.m6g.large Multi-AZ: $151.48
- ALB: $16.43
- Storage 1TB: $100.00
- Bandwidth 500GB: $45.00
Total: $611.28/mo ✅

Winner: AWS saves $377.07/mo vs GCP (38% cheaper)
Winner: AWS saves $14.61/mo vs Azure (2% cheaper)

Scenario 3: Enterprise (50+ Services, High Traffic)

Workload:
- 10-20 nodes (autoscaling)
- Large database cluster
- Multiple Redis instances
- 1M+ visitors/day
- Multi-region

GCP Cost:
- GKE management: $74.40
- 10x n1-standard-4 nodes: $989.10
- db-n1-standard-8 HA: $796.20
- Memorystore 100GB HA: $3,242.88
- Multiple load balancers: $60.00
- Storage 5TB: $850.00
- Bandwidth 5TB: $600.00
- CDN: $400.00
Total: $7,012.58/mo

Azure Cost:
- AKS management: $0 ✅
- 10x D4s_v3 nodes: $1,503.80
- D8s_v3 PostgreSQL HA: $935.16
- Premium P4 Redis HA: $657.00 ✅
- Multiple load balancers: $20.00 ✅
- Storage 5TB: $480.00 ✅
- Bandwidth 5TB: $438.00 ✅
- CDN: $435.00
Total: $4,468.96/mo

AWS Cost:
- EKS management: $73.00
- 10x m6g.xlarge nodes: $1,214.70 ✅
- db.m6g.4xlarge Multi-AZ: $1,129.00
- cache.r6g.xlarge Cluster: $605.92 ✅
- Multiple ALBs: $50.00
- Storage 5TB: $500.00
- Bandwidth 5TB: $450.00
- CloudFront: $425.00
Total: $4,447.62/mo ✅

Winner: AWS saves $2,564.96/mo vs GCP (37% cheaper)
Winner: AWS saves $21.34/mo vs Azure (0.5% cheaper)
Annual savings: AWS saves $30,779/year vs GCP, $256/year vs Azure

Cost Optimization Recommendations

Choose GCP if:

  • Heavy sustained compute requirements (>100 vCPUs running 24/7)
  • Hobby databases (db-f1-micro $7.67/mo is cheapest shared vCPU tier)
  • Already using Google Cloud services (BigQuery, Cloud Run, Pub/Sub)
  • Need GKE Autopilot for simplified management ($73/mo + per-pod pricing)
  • Prefer preemptible instances for batch workloads (sustained use discounts up to 30%)

Choose Azure if:

  • Cache-heavy architecture (Redis is 40-70% cheaper than GCP/AWS)
  • Multiple microservices (free cluster management saves $73-74/mo vs GCP/AWS)
  • Storage-intensive (blob storage 30-40% cheaper than GCP)
  • Need managed Istio service mesh (included in AKS at no extra cost)
  • Load balancer-heavy architecture (Basic LB $3.65/mo is 75-100% cheaper)
  • Microsoft ecosystem integration (AD/Entra, Windows containers, .NET)
  • Cost optimization is top priority for small-medium clusters

Choose AWS if:

  • ARM-compatible workloads (Graviton t4g/m6g instances 50-75% cheaper)
  • Startup phase with standard production cluster (lowest overall cost $266.70/mo)
  • Small databases with burstable requirements (db.t4g.micro $12.41/mo)
  • Heavy spot instance usage (deepest spot market with most availability zones)
  • Already using AWS services (Lambda, DynamoDB, S3, SQS)
  • Need highest availability zones per region (up to 6 AZs vs 3 for GCP/Azure)

Hybrid Strategy:

  • Run Kubernetes on AWS for overall lowest cost ($266.70/mo base)
  • Use Azure Cache for Redis (huge savings: $44.53/mo vs $46.08 AWS vs $78.84 GCP)
  • Use GCP Cloud SQL for hobby databases (db-f1-micro $7.67/mo)
  • Store cold data on Azure Archive (75% cheaper than GCP/AWS)
  • Use AWS Graviton instances where workload allows (50-75% compute savings)

Hidden Costs to Consider

| Cost Factor | GCP | Azure | AWS | Impact |
|---|---|---|---|---|
| Egress to Internet | $0.12/GB (first 1TB) | $0.087/GB | $0.09/GB | Azure 27% cheaper, AWS 25% cheaper |
| Egress between zones | $0.01/GB | $0.01/GB | $0.01/GB (same region) | Same |
| Egress between regions | $0.12/GB | $0.02/GB | $0.02/GB | Azure/AWS 83% cheaper ✅ |
| Snapshot storage | $0.026/GB | $0.05/GB | $0.05/GB | GCP 48% cheaper |
| IP addresses | $6.57/mo per static IP | $3.65/mo per public IP | $3.60/mo per Elastic IP | AWS/Azure 44% cheaper |
| NAT Gateway | $0.045/hour + $0.045/GB | $0.045/hour + $0.045/GB | $0.045/hour + $0.045/GB | Same |
| VPN Gateway | $36.50/mo | $26.28/mo | $36.50/mo | Azure 28% cheaper ✅ |

Next Steps: Building Production-Grade Platform

  • GitOps with Flux CD: Automated deployments from Git commits
  • Multi-Environment Strategy: Separate dev/staging/prod with namespace isolation
  • Secret Management: Migrate to Azure Key Vault with CSI driver
  • Database Integration: Connect to Azure Database for PostgreSQL with connection pooling
  • Advanced Autoscaling: KEDA for event-driven scaling (queues, schedules, metrics)
  • Disaster Recovery: Backup strategy with Velero, multi-region failover
  • Compliance: Pod Security Standards enforcement, network policies, audit logging
  • Developer Experience: Internal developer platform (IDP) with self-service deployment

🎯 Key Takeaways: Decision Framework for Platform Teams

When to Adopt This Architecture

✅ Good Fit:

  • Microservices at scale (20+ services): Service mesh provides observability, security, and traffic management
  • Multi-team organizations: Teams need isolation, independent deployment, and self-service capabilities
  • Compliance requirements: mTLS, audit logging, and fine-grained access control are mandatory
  • High availability needs: SLOs require 99.5%+ uptime with multi-region capabilities
  • Mature DevOps culture: Team has SRE practices, GitOps, and infrastructure-as-code expertise

❌ Poor Fit:

  • Monolith or <10 services: Service mesh overhead outweighs benefits
  • Cost-sensitive startups: $400-500/mo base infrastructure may be too high for MVP stage
  • Limited Kubernetes expertise: Steep learning curve can delay time-to-market
  • Simple request/response patterns: Don't need advanced routing, retries, or circuit breaking
  • Minimal traffic (<100 RPS): Serverless (Azure Functions, Container Apps) more cost-effective

Architecture Decision Summary

| Decision | Choice | Trade-off | When to Reconsider |
|---|---|---|---|
| Service Mesh | Managed Istio | +$150-235/mo, operational complexity | When service count >50 or team expertise grows |
| Node Size | Standard_D2s_v3 | Balanced cost/performance | If workloads are memory-heavy (upgrade to D4s_v3) or CPU-light (downgrade to B-series) |
| Certificate Management | cert-manager + Let's Encrypt | Free certificates, requires maintenance | If >100 certificates or need extended validation (EV) certificates, consider paid CA |
| DNS Provider | Azure DNS | $1/mo, Azure-native integration | If need advanced features (geo-routing, DNSSEC), consider Cloudflare or Route53 |
| Monitoring Stack | Prometheus + Grafana | Self-hosted, requires management | If team lacks monitoring expertise, consider Azure Monitor or Datadog |
| Deployment Strategy | Canary with Flagger | Automated progressive delivery | If team is small or risk-averse, use blue-green or manual rollouts |

Cost Optimization Decision Tree

Operational Maturity Roadmap

Phase 1: Foundation (Months 1-2)

  • ✅ AKS cluster with managed Istio running
  • ✅ Basic monitoring (Prometheus, Grafana)
  • ✅ Manual deployments via kubectl
  • ✅ Certificate automation with cert-manager
  • ✅ Single production environment

Phase 2: Automation (Months 3-4)

  • 🔄 GitOps with Flux CD or ArgoCD
  • 🔄 CI/CD pipelines (GitHub Actions, Azure DevOps)
  • 🔄 Automated canary deployments with Flagger
  • 🔄 Infrastructure-as-code (Terraform, Bicep)
  • 🔄 Dev/staging environments with parity

Phase 3: Observability (Months 5-6)

  • 🔲 SLO-based alerting (not just resource alerts)
  • 🔲 Distributed tracing (Jaeger, Tempo)
  • 🔲 Log aggregation (Loki, Azure Log Analytics)
  • 🔲 Cost monitoring and attribution by team
  • 🔲 Error budget tracking and reporting

Phase 4: Resilience (Months 7-9)

  • 🔲 Chaos engineering (Litmus, Chaos Mesh)
  • 🔲 Multi-region active-passive setup
  • 🔲 Disaster recovery runbooks tested quarterly
  • 🔲 Automated incident response (PagerDuty + Kubernetes)
  • 🔲 Load testing in CI/CD pipeline

Phase 5: Platform Engineering (Months 10-12)

  • 🔲 Internal Developer Platform (IDP) with self-service
  • 🔲 Golden paths for common deployment patterns
  • 🔲 Policy-as-code (OPA, Kyverno)
  • 🔲 Developer productivity metrics (DORA metrics)
  • 🔲 Platform team vs product teams separation

Lessons from Production Incidents

Incident 1: Certificate Expiration Outage (2 AM)

  • What happened: Let's Encrypt certificate expired, HTTPS traffic failed
  • Root cause: cert-manager webhook was blocked by network policy
  • Impact: 30 minutes of downtime, roughly 14% of the monthly error budget consumed
  • Prevention: Alert on certificate expiration <30 days (see the sketch below), test renewal quarterly
  • Lesson: Monitor certificate lifecycle, not just pod health
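
cert-manager exports certificate expiry as a Prometheus metric, so the prevention step above can be turned into an alert (a sketch; assumes cert-manager metrics are being scraped and the PrometheusRule CRD is installed):

kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: certificate-expiry-alerts
  namespace: cert-manager
spec:
  groups:
    - name: certificate_expiry
      rules:
        - alert: CertificateExpiringSoon
          expr: |
            certmanager_certificate_expiration_timestamp_seconds - time() < 30 * 24 * 3600
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Certificate {{ $labels.name }} expires in less than 30 days"
EOF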

Incident 2: Istio Control Plane Upgrade Breaking Change

  • What happened: Managed Istio upgraded from 1.17 to 1.18 with breaking CRD changes
  • Root cause: Azure auto-upgraded during maintenance window without notification
  • Impact: VirtualService configuration failed validation, 503 errors for 15 minutes
  • Prevention: Pin Istio minor version, test upgrades in dev first, subscribe to release notes
  • Lesson: Managed services reduce operational burden but sacrifice control

Incident 3: Pod Eviction Cascade (Out of Memory)

  • What happened: Memory leak in application caused OOMKilled, cascading pod evictions
  • Root cause: Missing memory limits allowed pod to consume node resources
  • Impact: 5 pods restarted in 2 minutes, 60-second request latency spike
  • Prevention: Set memory limits, enable memory profiling, use VPA recommendations
  • Lesson: Resource limits are not optional in production

Incident 4: DNS Propagation Delay on New Subdomain

  • What happened: New subdomain added but not resolving for 45 minutes
  • Root cause: Azure DNS TTL was 3600 seconds, previous record cached
  • Impact: Delayed launch announcement, no customer impact
  • Prevention: Lower TTL to 300 seconds before DNS changes, wait for propagation
  • Lesson: DNS is eventually consistent; plan changes ahead of announcements

Incident 5: Cost Spike from Forgotten Load Test

  • What happened: Load test ran overnight, autoscaler increased nodes from 3 to 30
  • Root cause: No autoscaler max limit configured, test had no timeout
  • Impact: $500 unexpected bill for 8 hours, no customer impact
  • Prevention: Set autoscaler maxReplicas, configure cost alerts, add timeouts to tests
  • Lesson: Cost visibility is essential; autoscalers need upper bounds

Final Recommendations for Platform Teams

Before You Start:

  1. Validate scale requirements: Do you really need service mesh? (<20 services likely don't)
  2. Assess team expertise: Do you have Kubernetes/Istio experience? Budget for training
  3. Calculate total cost: Include management time, not just infrastructure ($444/mo + 0.5-1 FTE)
  4. Define success metrics: What SLOs justify this investment? What's the alternative cost?

During Implementation:

  1. Start small: Single service in production, validate before migrating all workloads
  2. Automate early: GitOps from day 1, manual kubectl is error-prone at scale
  3. Monitor everything: If you can't measure it, you can't improve it (observability first)
  4. Document runbooks: Incident response is only as good as your documentation

After Go-Live:

  1. Review costs weekly: First month will reveal unexpected expenses
  2. Conduct game days: Test failure scenarios before they happen in production
  3. Measure developer productivity: Platform value = features shipped / time, not uptime alone
  4. Iterate on processes: Retrospectives after every incident, quarterly architecture reviews

When to Choose Alternatives

| Alternative | Best For | Cost Comparison | Migration Difficulty |
|---|---|---|---|
| Azure Container Apps | <10 services, serverless workload | ~$50-200/mo (consumption-based) | Easy (similar Kubernetes API) |
| Azure App Service | Traditional web apps, minimal DevOps | ~$100-300/mo (PaaS pricing) | Easy (no Kubernetes knowledge needed) |
| Azure Functions | Event-driven, sporadic traffic | ~$0-50/mo (pay-per-execution) | Medium (code refactoring required) |
| Self-Hosted Kubernetes (VM) | Maximum control, compliance needs | ~$300-400/mo (3 VMs) | Hard (full cluster management) |
| GKE/EKS (multi-cloud) | Avoid vendor lock-in, global footprint | ~$500-700/mo (cross-cloud costs) | Medium (Kubernetes-compatible) |

Recommended Migration Path for Startups:

Phase 1: Azure Functions (MVP, <$50/mo)
  ↓ (Growing traffic, need more control)
Phase 2: Azure Container Apps (Scale, ~$200/mo)
  ↓ (10+ microservices, need service mesh)
Phase 3: AKS + Istio (This guide, ~$400-500/mo)
  ↓ (100+ microservices, multi-region)
Phase 4: Multi-cloud Kubernetes (Enterprise, $2000-5000/mo)

🔗 Additional Resources

Official Documentation

Books & Deep Dives

  • "Kubernetes Patterns" by Bilgin Ibryam & Roland Huß - Reusable patterns for cloud-native apps
  • "Production Kubernetes" by Josh Rosso et al. - Running Kubernetes at scale
  • "The Site Reliability Workbook" by Google SRE - SLO/SLI implementation guide
  • "Cloud Native DevOps with Kubernetes" by John Arundel & Justin Domingus

Community & Tools

Cost Calculators


📝 Closing Thoughts

Building production-grade Kubernetes infrastructure is a journey, not a destination. This architecture represents months of learning, incidents, and iterations. The decisions outlined here are not universal truths—they're context-dependent trade-offs that worked for our team's scale, budget, and expertise.

The most important question isn't "Should I use Kubernetes?"—it's "What problem am I solving, and is Kubernetes the simplest solution?" For many teams, managed PaaS services (Container Apps, App Service) provide 90% of the benefits at 20% of the complexity.

But if you're building a platform for multiple teams, need fine-grained control over traffic routing, or have compliance requirements that demand mTLS and network policies, this architecture provides a solid foundation.

Key Philosophy:

  • Simplicity > Features: Only add complexity when it solves a real problem
  • Observability > Availability: You can't improve what you don't measure
  • Automation > Documentation: Runbooks should execute themselves
  • Learning > Perfection: Every incident is an opportunity to improve

Start small, automate early, and always question whether the complexity you're adding is worth the value it provides.


Questions or feedback? I'd love to hear about your Kubernetes journey—find me on Twitter or LinkedIn.

Found this guide helpful? Consider sharing it with your team or contributing improvements on GitHub.


This guide reflects production experience as of January 2025. Cloud platforms and best practices evolve rapidly—verify configurations against current documentation before implementing.
