Production-Ready AKS with Istio: Architecture Decisions, Trade-offs, and Implementation
Building production Kubernetes infrastructure is fundamentally an exercise in architectural trade-offs. While deploying a basic cluster is straightforward, creating a platform that scales from dozens to thousands of services—while maintaining security, observability, and developer velocity—requires careful consideration of competing priorities.
This guide examines the technical and organizational decisions behind building production-grade Azure Kubernetes Service (AKS) infrastructure with Istio service mesh. We'll explore not just how to implement these systems, but why specific architectural choices matter, their cost implications, and when alternatives might be more appropriate.
Target Audience
This guide is written for:
- Staff+ Engineers evaluating service mesh adoption and platform architecture
- Technical Leads making build-vs-buy decisions for infrastructure components
- Platform Engineers implementing multi-tenant Kubernetes platforms
- SREs responsible for production reliability and incident response
- Engineering Managers balancing technical debt against feature velocity
What You'll Learn
Technical Implementation:
- Production-ready AKS cluster architecture with managed Istio
- Zero-downtime certificate management and rotation
- Multi-environment DNS and subdomain routing strategies
- Observability and debugging patterns for service mesh traffic
Architectural Decision-Making:
- Service mesh trade-offs: Istio vs. alternatives (Linkerd, Consul, Cilium)
- When to use managed vs. self-hosted Istio
- Cost analysis: Managed services vs. DIY infrastructure
- Security considerations: mTLS, network policies, zero-trust architecture
- Multi-cluster vs. single-cluster strategies
Operational Maturity:
- Failure modes and mitigation strategies
- Performance implications of sidecar proxies
- Capacity planning and resource optimization
- Incident response patterns for mesh-related outages
🏗️ Architecture Overview and Design Decisions
Before diving into implementation, let's examine the architectural decisions that shape production Kubernetes platforms.
Why Service Mesh? The Trade-offs
Service mesh adoption represents a significant architectural commitment. Understanding when it adds value—and when it doesn't—is crucial.
Service Mesh Benefits:
- Traffic Management: Advanced routing (canary, blue-green), retries, timeouts, circuit breaking
- Security: Automatic mTLS between services, zero-trust networking
- Observability: Distributed tracing, metrics, and traffic visualization without code changes
- Policy Enforcement: Centralized traffic policies, rate limiting, authentication
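As a concrete illustration of the traffic-management capabilities above, a VirtualService can declare retries, timeouts, and a weighted canary split declaratively. This is a sketch only; the `reviews` service and its `v1`/`v2` subsets are hypothetical names, not part of this guide's deployments:

```yaml
# Hypothetical example: 90/10 canary split with retries and a timeout,
# applied without any application code changes.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews-canary
spec:
  hosts:
    - reviews              # in-mesh service name (hypothetical)
  http:
    - timeout: 5s          # fail fast instead of hanging
      retries:
        attempts: 3
        perTryTimeout: 2s
        retryOn: 5xx,connect-failure
      route:
        - destination:
            host: reviews
            subset: v1
          weight: 90
        - destination:
            host: reviews
            subset: v2     # canary version
          weight: 10
```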
Service Mesh Costs:
- Complexity: Additional operational burden, learning curve, debugging complexity
- Performance: Sidecar proxy overhead (CPU, memory, latency ~2-5ms per hop)
- Resource Usage: ~50-100MB memory per sidecar, CPU overhead for encryption
- Blast Radius: Control plane outages can impact all mesh traffic
Decision Framework:
| Use Service Mesh When... | Skip Service Mesh When... |
|---|---|
| Operating 50+ microservices | Running monolith or <10 services |
| Need mTLS without app changes | Can enforce security at app layer |
| Require advanced traffic routing | Simple load balancing suffices |
| Multiple teams need traffic policies | Single team owns all services |
| Observability gaps exist | APM tools provide sufficient insight |
Istio vs. Alternatives: Comparative Analysis
Why We Chose Istio for This Architecture:
- Azure Managed Add-on: Reduces operational burden (control plane managed by Microsoft)
- Feature Completeness: Comprehensive traffic management, security, observability
- Ecosystem Maturity: Extensive tooling, documentation, community support
- Enterprise Adoption: Proven at scale (Google, eBay, T-Mobile)
Cost Comparison (Rough Monthly Estimates for 10-Service Cluster):
| Component | Managed Istio (AKS) | Self-Hosted Istio | Linkerd | No Mesh |
|---|---|---|---|---|
| Control Plane | $0 (included) | $100-200 | $100-200 | $0 |
| Sidecar Overhead | ~$50-75 | ~$50-75 | ~$25-40 | $0 |
| Monitoring | $100-150 | $100-150 | $75-100 | $50-75 |
| Operations Time | 5-10 hrs/mo | 20-40 hrs/mo | 10-20 hrs/mo | 5 hrs/mo |
| Total | ~$150-225 + ops | ~$250-425 + ops | ~$200-340 + ops | ~$50-75 + ops |
Note: Operations time valued at $150-200/hr for staff engineer equivalency
Architecture Decision Records (ADRs)
ADR-001: Managed Istio vs. Self-Hosted
Context: Need service mesh capabilities without dedicated SRE team.
Decision: Use Azure-managed Istio add-on for AKS.
Rationale:
- Microsoft manages control plane (istiod) upgrades and patching
- Reduced operational complexity (no manual version management)
- SLA backed by Azure support
- Team can focus on application logic vs. infrastructure
Consequences:
- Limited control over Istio version (follows Azure release cycle)
- Cannot customize control plane configuration deeply
- Dependent on Azure for mesh reliability
- May lag behind upstream Istio releases by 1-2 versions
Alternatives Considered:
- Self-hosted Istio: More control, more operational burden
- Linkerd: Simpler but fewer features
- No mesh: Lower complexity but manual mTLS, observability
ADR-002: DNS Management in Azure DNS vs. External Provider
Context: Need reliable DNS with low operational overhead.
Decision: Use Azure DNS for domain management.
Rationale:
- Native integration with AKS and Azure resources
- 100% SLA with zone redundancy
- Low latency queries via Azure edge network
- Cost-effective ($0.50/zone/month + $0.40/million queries)
- Infrastructure-as-code via Terraform/Bicep
Consequences:
- Vendor lock-in to Azure ecosystem
- Migration complexity if moving clouds
- All DNS management requires Azure access
Cost Analysis:
- Azure DNS: ~$1-5/month for typical workload
- Route 53 (AWS): ~$1-5/month + cross-cloud complexity
- Cloudflare: Free tier sufficient but adds dependency
Complete System Architecture
Key Architectural Patterns:
- Defense in Depth: DDoS protection → Load balancer → Istio Gateway → Network policies
- Separation of Concerns: Ingress layer isolated from application layer
- Observability by Default: Every service instrumented without code changes
- Cost Optimization: Spot instances for non-critical workloads, right-sized nodes
- High Availability: Multi-replica deployments, pod anti-affinity, zone distribution
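The anti-affinity piece of the high-availability pattern can be sketched as a Deployment spec fragment (illustrative only; it assumes the `app: portfolio` label used by the sample application in this guide):

```yaml
# Deployment spec fragment: prefer spreading replicas across nodes
# so a single node failure cannot take down all replicas at once.
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: portfolio
                topologyKey: kubernetes.io/hostname
```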
Monthly Cost Breakdown (Rough Estimates):
- Compute: $220 (3x D2s_v3) + $50 (spot) = $270
- Control Plane: $0 (free tier) or $73 (99.95% SLA)
- Networking: $18.25 (load balancer) + ~$10 (bandwidth)
- Storage: $30 (database) + $10 (disks)
- Container Registry: $5 (basic) or $167 (premium with geo-replication)
- Monitoring: $110 (Prometheus, Grafana, Jaeger storage)
- DNS: $1
- Total: ~$455/mo (free tier) or ~$528/mo (with SLA)
Scales to ~$800-1200/mo at 10 services, ~$3000-5000/mo at 50 services
📋 Prerequisites and Team Readiness
Technical Prerequisites
- Azure account with Owner or Contributor + User Access Administrator role
- Azure CLI (`az`) version 2.50.0+
- `kubectl` version matching the cluster (typically 1.28+)
- `helm` v3.12+ for package management
- Registered domain name with ability to change name servers
- Azure subscription with sufficient quota:
- 12+ vCPUs (preferably D-series or E-series)
- Standard Load Balancer quota
- Public IP quota
- Understanding of Kubernetes concepts (pods, services, deployments, namespaces)
Organizational Readiness Assessment
Before implementing service mesh, assess your team's readiness:
✅ Good Fit:
- Team has 2+ engineers with Kubernetes production experience
- Running 20+ microservices or planning to reach that scale
- Need centralized traffic policies across services
- Require mTLS without modifying application code
- Budget supports ~$500-1000/mo infrastructure costs
- Dedicated platform or SRE team available (even part-time)
⚠️ Proceed with Caution:
- Team new to Kubernetes (6 months experience recommended minimum)
- Running <10 services (simpler ingress solutions may suffice)
- Tight budget constraints (<$500/mo infrastructure spend)
- No dedicated operations support
- Need to ship features weekly (learning curve may slow velocity initially)
❌ Not Recommended:
- First Kubernetes deployment (start simpler, add mesh later)
- Single developer or small team (<3 engineers total)
- Prototype or proof-of-concept phase
- Services already have robust observability and security
Install Required Tools
# Install Azure CLI (macOS)
brew install azure-cli
# Verify version
az --version # Should be 2.50.0+
# Install kubectl
az aks install-cli
# Verify kubectl
kubectl version --client
# Install Helm
brew install helm
# Verify Helm
helm version # Should be v3.12+
# Install useful optional tools
brew install k9s # Terminal UI for Kubernetes
brew install stern # Multi-pod log tailing
brew install kubectx # Context and namespace switching
brew install istioctl # Istio CLI (optional but helpful)
# Login to Azure
az login
# List subscriptions
az account list --output table
# Set your subscription
export AZURE_SUBSCRIPTION_ID="<your-subscription-id>"
az account set --subscription "$AZURE_SUBSCRIPTION_ID"
# Verify access
az account show
Cost Awareness and Budget Planning
Initial Setup Costs (First Month):
- Domain registration (if new): $10-15/year
- AKS cluster setup: $300-500 (includes learning/experimentation time)
- DNS testing and configuration: Minimal
- Certificate setup: Free (Let's Encrypt)
Ongoing Monthly Costs (Conservative Estimate):
Base Infrastructure:
- AKS Control Plane (free tier): $0
- OR Control Plane SLA (99.95%): $73
- 3x Standard_D2s_v3 nodes: $220
- Azure Load Balancer: $18.25
- Azure DNS: $1
- Container Registry (Basic): $5
- Monitoring storage: $50-100
Total Base: $294-417/mo
Per-Service Costs (Approximate):
- Additional compute: $20-40/service
- Storage (if needed): $10-50/service
- Database (if needed): $30-500/service
- Bandwidth: ~$0.08/GB egress
Expected Scaling:
- 5 services: ~$450-600/mo
- 10 services: ~$800-1200/mo
- 25 services: ~$1500-2500/mo
- 50 services: ~$3000-5000/mo
Cost Optimization Strategies:
- Use Spot Instances: 70-90% discount for non-critical workloads
- Right-size Nodes: Start with D2s_v3, scale up only if needed
- Reserved Instances: 30-40% discount for 1-3 year commitments
- Autoscaling: HPA and cluster autoscaler to match demand
- Monitoring Retention: Reduce Prometheus retention to 7-15 days
- Dev/Staging Clusters: Shut down nights and weekends (terraform automation)
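The HPA mentioned above complements the cluster autoscaler: the HPA scales pods, the cluster autoscaler scales nodes to fit them. A minimal sketch targeting the sample portfolio Deployment deployed later in this guide:

```yaml
# HorizontalPodAutoscaler: scale pods on CPU utilization; the cluster
# autoscaler then adds or removes nodes to match the resulting demand.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: portfolio-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: portfolio
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```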
🚀 Step 1: Create Production-Grade AKS Cluster with Istio
1.1 Cluster Sizing and Configuration Decisions
Node Size Selection Trade-offs:
| VM Size | vCPU | RAM | Price/mo | Use Case | Considerations |
|---|---|---|---|---|---|
| Standard_B2s | 2 | 4GB | $30 | Dev/test only | Not recommended: Burstable, inconsistent performance |
| Standard_D2s_v3 | 2 | 8GB | $70 | Small prod workloads | Recommended starter: Good balance, upgradeable |
| Standard_D4s_v3 | 4 | 16GB | $140 | Medium prod | Better pod density, more headroom |
| Standard_D8s_v3 | 8 | 32GB | $280 | Large prod | Best pod density, overkill for <20 services |
| Standard_E2s_v3 | 2 | 16GB | $109 | Memory-intensive | Higher memory-to-CPU ratio |
Decision Framework:
- Start with D2s_v3 for initial deployment (2-5 services)
- Plan for D4s_v3 when approaching 50% node capacity
- Consider E-series if services average >500MB memory per pod
- Use node pools with different sizes for workload-specific optimization
Node Count Considerations:
- Minimum 3 nodes for production (quorum, rolling updates, node failures)
- Start with 3, enable cluster autoscaler (3-10 nodes typical range)
- Cost vs. Availability: 3 nodes = 2-node failures tolerated, 5 nodes = 3-node failures
- Zone Distribution: Spread across availability zones (requires 3+ nodes)
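Zone distribution at the pod level can be enforced with topology spread constraints (a Deployment spec fragment, assuming the `app: portfolio` label used by the sample app in this guide):

```yaml
# Deployment spec fragment: keep replicas evenly spread across
# availability zones so a zone outage leaves capacity elsewhere.
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: portfolio
```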
1.2 Define Cluster Configuration
# Cluster configuration with rationale
export RESOURCE_GROUP="aks-production-rg"
export LOCATION="eastus" # Choose region closest to users
export CLUSTER_NAME="my-aks-cluster"
# Node configuration
export NODE_COUNT=3 # Minimum for HA
export NODE_SIZE="Standard_D2s_v3" # 2 vCPU, 8GB RAM
export NODE_DISK_SIZE=128 # OS disk size (GB) - default is often 128GB
# Kubernetes version - check available versions
az aks get-versions --location "$LOCATION" --output table
# Use recent stable version (not preview)
export K8S_VERSION="1.28.3" # Update to latest stable
# Enable advanced features
export ENABLE_MONITORING="true"
export ENABLE_AUTOSCALER="true"
export MIN_NODE_COUNT=3
export MAX_NODE_COUNT=10
1.3 Create Resource Group with Tags
# Create resource group with metadata tags for cost tracking
az group create \
--name "$RESOURCE_GROUP" \
--location "$LOCATION" \
--tags \
environment=production \
costCenter=engineering \
managedBy=terraform \
project=platform
# Verify creation
az group show --name "$RESOURCE_GROUP"
1.4 Create AKS Cluster with Istio and Advanced Features
# Create production-grade AKS cluster
# This takes 10-15 minutes
az aks create \
--resource-group "$RESOURCE_GROUP" \
--name "$CLUSTER_NAME" \
--location "$LOCATION" \
--kubernetes-version "$K8S_VERSION" \
--node-count "$NODE_COUNT" \
--node-vm-size "$NODE_SIZE" \
--node-osdisk-size "$NODE_DISK_SIZE" \
--enable-managed-identity \
--enable-azure-service-mesh \
--network-plugin azure \
--network-policy azure \
--enable-cluster-autoscaler \
--min-count "$MIN_NODE_COUNT" \
--max-count "$MAX_NODE_COUNT" \
--enable-addons monitoring \
--generate-ssh-keys \
--zones 1 2 3 \
--tier standard \
--tags environment=production project=platform
# Alternative: Free tier for non-production (no SLA)
# --tier free
# Get cluster credentials
az aks get-credentials \
--resource-group "$RESOURCE_GROUP" \
--name "$CLUSTER_NAME" \
--overwrite-existing
# Verify cluster is running
kubectl get nodes -o wide
# Check node zones (should be distributed)
kubectl get nodes -o custom-columns='NAME:.metadata.name,ZONE:.metadata.labels.topology\.kubernetes\.io/zone'
Expected Output:
NAME STATUS ROLES AGE VERSION ZONE
aks-nodepool1-12345678-vmss000000 Ready agent 5m v1.28.3 eastus-1
aks-nodepool1-12345678-vmss000001 Ready agent 5m v1.28.3 eastus-2
aks-nodepool1-12345678-vmss000002 Ready agent 5m v1.28.3 eastus-3
1.5 Verify Istio Installation and Components
# Check Istio namespaces
kubectl get namespaces | grep istio
# Expected output:
# aks-istio-ingress Active 5m
# aks-istio-system Active 5m
# Check Istio control plane components
kubectl get pods -n aks-istio-system
# Expected output:
# NAME READY STATUS RESTARTS AGE
# istiod-asm-xxxx-xxxxx 1/1 Running 0 5m
# Check Istio ingress gateway
kubectl get pods -n aks-istio-ingress
# Expected output:
# NAME READY STATUS RESTARTS AGE
# aks-istio-ingressgateway-external-xxxxxxxx-xxxxx 1/1 Running 0 5m
# Get Istio Gateway external IP (save this!)
kubectl get svc -n aks-istio-ingress aks-istio-ingressgateway-external
# Expected output:
# NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S)
# aks-istio-ingressgateway-external LoadBalancer 10.0.123.45 52.188.x.x 15021:30001/TCP,80:30002/TCP,443:30003/TCP
# Export for later use
export ISTIO_INGRESS_IP=$(kubectl get svc -n aks-istio-ingress aks-istio-ingressgateway-external -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "Istio Ingress IP: $ISTIO_INGRESS_IP"
# Verify Istio version and configuration
kubectl get mutatingwebhookconfigurations | grep istio
kubectl get validatingwebhookconfigurations | grep istio
1.6 Configure Node Pools for Workload Isolation (Optional but Recommended)
For production, consider separating system workloads from application workloads:
# Add a spot instance node pool for non-critical workloads (70-90% cost savings)
az aks nodepool add \
--resource-group "$RESOURCE_GROUP" \
--cluster-name "$CLUSTER_NAME" \
--name spotpool \
--priority Spot \
--eviction-policy Delete \
--spot-max-price -1 \
--enable-cluster-autoscaler \
--min-count 0 \
--max-count 5 \
--node-count 1 \
--node-vm-size Standard_D2s_v3 \
--node-taints kubernetes.azure.com/scalesetpriority=spot:NoSchedule \
--labels workloadType=batch priority=low
# Verify node pools
az aks nodepool list \
--resource-group "$RESOURCE_GROUP" \
--cluster-name "$CLUSTER_NAME" \
--output table
1.7 Cluster Architecture Deep Dive
Key Reliability Patterns:
- Zone Distribution: Nodes spread across 3 availability zones
  - Tolerates entire zone failure
  - Azure SLA: 99.99% uptime with zones
- Control Plane HA: Managed by Azure
  - Multi-region replication
  - Automatic failover
  - Separate from worker nodes
- Istio Control Plane: Azure-managed istiod
  - Automatic updates and patches
  - SLA backed by Azure support
  - Reduced operational burden
- Cluster Autoscaler: Automatically adjusts node count
  - Scales up when pods are pending
  - Scales down when nodes underutilized (>50% idle for 10 min)
  - Cost optimization while maintaining performance
Failure Scenarios and Recovery:
| Failure Type | Impact | Recovery Time | Mitigation |
|---|---|---|---|
| Single node failure | 33% capacity loss | Immediate (pods rescheduled) | 3+ nodes, pod anti-affinity |
| Zone failure | 33% capacity loss | Immediate | Multi-zone deployment |
| Istio control plane | No new config changes | 5-15 minutes | Data plane continues functioning |
| Ingress gateway pod | Partial traffic loss | <30 seconds | Multiple replicas, health checks |
| Control plane API | No new deployments | Azure SLA: 99.95% | Use managed tier SLA |
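Several of the mitigations above depend on replicas surviving voluntary disruptions (node drains, cluster upgrades) as well as crashes. A PodDisruptionBudget makes that guarantee explicit; a minimal sketch for the sample portfolio app:

```yaml
# PodDisruptionBudget: keep at least one portfolio replica available
# during voluntary disruptions such as node drains and rolling upgrades.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: portfolio-pdb
  namespace: default
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: portfolio
```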
1.8 Enable Cost Analysis and Monitoring
# Enable Azure Cost Management insights
az aks update \
--resource-group "$RESOURCE_GROUP" \
--name "$CLUSTER_NAME" \
--enable-cost-analysis
# Verify cost analysis is enabled (cost data appears in the Azure portal
# after 24-48 hours of collection)
az aks show \
--resource-group "$RESOURCE_GROUP" \
--name "$CLUSTER_NAME" \
--query "metricsProfile.costAnalysis"
# Install metrics-server for HPA (if not already installed)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Verify metrics-server
kubectl get deployment metrics-server -n kube-system
kubectl top nodes
kubectl top pods -A
Post-Creation Validation Checklist:
- All 3 nodes are in `Ready` state
- Nodes distributed across 3 availability zones
- Istio control plane pod running in `aks-istio-system`
- Istio ingress gateway has external IP
- Cluster autoscaler enabled (3-10 node range)
- Monitoring addon enabled (Azure Monitor)
- `kubectl` context set to new cluster
- Ingress IP saved to environment variable
- Cost tracking tags applied
🌐 Step 2: Configure Custom Domain in Azure DNS
Now we'll set up Azure DNS to manage your custom domain.
2.1 Create DNS Zone
export DOMAIN_NAME="cat-herding.net"
# Create DNS zone
az network dns zone create \
--resource-group "$RESOURCE_GROUP" \
--name "$DOMAIN_NAME"
# Get name servers
az network dns zone show \
--resource-group "$RESOURCE_GROUP" \
--name "$DOMAIN_NAME" \
--query nameServers \
--output table
Expected Output:
Result
-------------------
ns1-01.azure-dns.com.
ns2-01.azure-dns.net.
ns3-01.azure-dns.org.
ns4-01.azure-dns.info.
2.2 Update Domain Registrar Name Servers
Go to your domain registrar (GoDaddy, Namecheap, etc.) and update the name servers to the Azure DNS name servers from the previous step.
DNS Propagation Flow:
2.3 Create DNS Records
# Get Istio ingress gateway IP
export INGRESS_IP=$(kubectl get svc -n aks-istio-ingress \
aks-istio-ingressgateway-external \
-o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "Ingress Gateway IP: $INGRESS_IP"
# Create A record for root domain
az network dns record-set a add-record \
--resource-group "$RESOURCE_GROUP" \
--zone-name "$DOMAIN_NAME" \
--record-set-name "@" \
--ipv4-address "$INGRESS_IP"
# Create A record for www subdomain
az network dns record-set a add-record \
--resource-group "$RESOURCE_GROUP" \
--zone-name "$DOMAIN_NAME" \
--record-set-name "www" \
--ipv4-address "$INGRESS_IP"
# Verify DNS records
az network dns record-set a list \
--resource-group "$RESOURCE_GROUP" \
--zone-name "$DOMAIN_NAME" \
--output table
2.4 Test DNS Resolution
# Test DNS resolution (may take a few minutes)
nslookup cat-herding.net
nslookup www.cat-herding.net
# Test from multiple locations
dig cat-herding.net
🔐 Step 3: Install cert-manager for TLS Certificates
cert-manager automates the provisioning and renewal of TLS certificates from Let's Encrypt.
3.1 Install cert-manager with Helm
# Add cert-manager Helm repository
helm repo add jetstack https://charts.jetstack.io
helm repo update
# Create namespace
kubectl create namespace cert-manager
# Install cert-manager
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--version v1.13.0 \
--set installCRDs=true
# Verify installation
kubectl get pods -n cert-manager
cert-manager Architecture:
3.2 Create ClusterIssuer for Let's Encrypt
# Create production ClusterIssuer
cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@cat-herding.net  # Change to your email
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - http01:
          ingress:
            class: istio
EOF
# Verify ClusterIssuer
kubectl get clusterissuer letsencrypt-prod
kubectl describe clusterissuer letsencrypt-prod
3.3 Create Certificate Resource
cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: cat-herding-tls-cert
  # The Istio ingress gateway reads credentialName secrets from its own
  # namespace, so create the certificate (and its secret) there
  namespace: aks-istio-ingress
spec:
  secretName: cat-herding-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - cat-herding.net
    - www.cat-herding.net
EOF
# Check certificate status
kubectl get certificate -n aks-istio-ingress
kubectl describe certificate cat-herding-tls-cert -n aks-istio-ingress
# Check if secret was created
kubectl get secret cat-herding-tls -n aks-istio-ingress
Certificate Lifecycle:
🎯 Step 4: Deploy Your First Application
Let's deploy a sample application with Istio routing and TLS.
4.1 Create Application Directory Structure
mkdir -p k8s/apps/portfolio/base
cd k8s/apps/portfolio/base
4.2 Create Deployment Manifest
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: portfolio
spec:
  replicas: 2
  selector:
    matchLabels:
      app: portfolio
  template:
    metadata:
      labels:
        app: portfolio
        version: v1
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      containers:
        - name: portfolio
          image: gabby.azurecr.io/portfolio:latest
          imagePullPolicy: Always
          ports:
            - containerPort: 3000
              name: http
          env:
            - name: PORT
              value: "3000"
            - name: NODE_ENV
              value: "production"
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
          livenessProbe:
            httpGet:
              path: /api/health
              port: http
            initialDelaySeconds: 30
            periodSeconds: 15
          readinessProbe:
            httpGet:
              path: /api/health
              port: http
            initialDelaySeconds: 5
            periodSeconds: 10
4.3 Create Service Manifest
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: portfolio
spec:
  selector:
    app: portfolio
  ports:
    - name: http
      port: 80
      targetPort: 3000
  type: ClusterIP
4.4 Create Istio Gateway
# istio-gateway.yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: portfolio-gateway
  namespace: default
spec:
  selector:
    istio: aks-istio-ingressgateway-external
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "cat-herding.net"
        - "www.cat-herding.net"
    - port:
        number: 443
        name: https
        protocol: HTTPS
      hosts:
        - "cat-herding.net"
        - "www.cat-herding.net"
      tls:
        mode: SIMPLE
        credentialName: cat-herding-tls
4.5 Create Istio VirtualService
# istio-virtualservice.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: portfolio-virtualservice
spec:
  hosts:
    - "cat-herding.net"
    - "www.cat-herding.net"
  gateways:
    - portfolio-gateway
  http:
    # Health check endpoints
    - match:
        - uri:
            exact: /api/health
        - uri:
            exact: /health
      route:
        - destination:
            host: portfolio
            port:
              number: 80
    # All other traffic
    - match:
        - uri:
            prefix: /
      route:
        - destination:
            host: portfolio
            port:
              number: 80
4.6 Create Kustomization File
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
  - istio-gateway.yaml
  - istio-virtualservice.yaml
namespace: default
4.7 Deploy Application
# Validate manifests
kubectl kustomize k8s/apps/portfolio/base
# Apply manifests
kubectl apply -k k8s/apps/portfolio/base
# Watch deployment
kubectl rollout status deployment/portfolio -n default
# Check resources
kubectl get pods -l app=portfolio -n default
kubectl get svc portfolio -n default
kubectl get gateway portfolio-gateway -n default
kubectl get virtualservice portfolio-virtualservice -n default
Request Flow:
🧪 Step 5: Test and Verify
5.1 Test HTTP/HTTPS Connectivity
# Test HTTP (should work)
curl -v http://cat-herding.net
# Test HTTPS (should work with valid certificate)
curl -v https://cat-herding.net
# Test specific endpoint
curl https://cat-herding.net/api/health
# Check certificate details
echo | openssl s_client -servername cat-herding.net -connect cat-herding.net:443 2>/dev/null | openssl x509 -noout -dates
5.2 Verify Certificate
# Check certificate secret (lives in the ingress gateway namespace)
kubectl get secret cat-herding-tls -n aks-istio-ingress
# Decode and inspect certificate
kubectl get secret cat-herding-tls -n aks-istio-ingress -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -text -noout
# Check expiration
kubectl get secret cat-herding-tls -n aks-istio-ingress -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate
5.3 Test from Inside Cluster
# Create test pod
kubectl run curl-test --image=curlimages/curl:latest --rm -it --restart=Never -- sh
# From inside the pod
curl -v http://portfolio.default.svc.cluster.local
exit
🔄 Step 6: Deploy Additional Applications with Subdomains
Now let's deploy a second application on a subdomain (e.g., api.cat-herding.net).
6.1 Create DNS Record for Subdomain
# Create A record for api subdomain
az network dns record-set a add-record \
--resource-group "$RESOURCE_GROUP" \
--zone-name "$DOMAIN_NAME" \
--record-set-name "api" \
--ipv4-address "$INGRESS_IP"
# Verify
nslookup api.cat-herding.net
6.2 Create Certificate for Subdomain
cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: api-tls-cert
  # As before, the ingress gateway reads this secret from its own namespace
  namespace: aks-istio-ingress
spec:
  secretName: api-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - api.cat-herding.net
EOF
# Check certificate status
kubectl get certificate api-tls-cert -n aks-istio-ingress
6.3 Create API Application Manifests
mkdir -p k8s/apps/api-service/base
cd k8s/apps/api-service/base
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
        version: v1
    spec:
      containers:
        - name: api-service
          image: your-registry.azurecr.io/api-service:latest
          ports:
            - containerPort: 8080
              name: http
          env:
            - name: PORT
              value: "8080"
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: api-service
spec:
  selector:
    app: api-service
  ports:
    - name: http
      port: 80
      targetPort: 8080
  type: ClusterIP
# istio-gateway.yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: api-gateway
  namespace: default
spec:
  selector:
    istio: aks-istio-ingressgateway-external
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "api.cat-herding.net"
    - port:
        number: 443
        name: https
        protocol: HTTPS
      hosts:
        - "api.cat-herding.net"
      tls:
        mode: SIMPLE
        credentialName: api-tls
# istio-virtualservice.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api-virtualservice
spec:
  hosts:
    - "api.cat-herding.net"
  gateways:
    - api-gateway
  http:
    - match:
        - uri:
            prefix: /
      route:
        - destination:
            host: api-service
            port:
              number: 80
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
  - istio-gateway.yaml
  - istio-virtualservice.yaml
namespace: default
6.4 Deploy API Service
# Apply manifests
kubectl apply -k k8s/apps/api-service/base
# Verify deployment
kubectl get pods -l app=api-service
kubectl get svc api-service
kubectl get gateway api-gateway
kubectl get virtualservice api-virtualservice
# Test
curl https://api.cat-herding.net/health
Multi-Application Routing:
🔒 Security Architecture and Zero-Trust Implementation
Defense-in-Depth Strategy
Production Kubernetes security requires multiple layers. A single breach should not compromise the entire system.
Implementing Mutual TLS (mTLS) with Istio
Istio provides automatic mTLS between services without code changes—one of its primary value propositions.
mTLS Benefits:
- Zero-trust networking: No service trusts network by default
- Encryption of all inter-service communication
- Certificate-based authentication (vs. API keys)
- Automatic certificate rotation (default: 24-hour validity)
- Defense against man-in-the-middle attacks
mTLS Configuration:
# Enable strict mTLS for entire namespace
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: default
spec:
  mtls:
    mode: STRICT  # PERMISSIVE allows both mTLS and plaintext (migration mode)
---
# Alternatively, configure mesh-wide in the Istio root namespace
# (aks-istio-system for the AKS managed add-on)
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: aks-istio-system
spec:
  mtls:
    mode: STRICT
# Apply mTLS policy
kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: default
spec:
  mtls:
    mode: STRICT
EOF
# Verify mTLS status
istioctl x describe pod <pod-name> -n default
# Check certificate details
istioctl proxy-config secret <pod-name> -n default -o json | jq -r '.dynamicActiveSecrets[0].secret.tlsCertificate.certificateChain.inlineBytes' | base64 -d | openssl x509 -text -noout
mTLS Performance Impact:
- Latency: ~1-3ms additional per hop
- CPU: ~5-10% overhead for encryption/decryption
- Memory: ~20-30MB per sidecar for certificates
- Trade-off: Security vs. microsecond-level latency requirements
Network Policies for Pod Isolation
# Deny all traffic by default (baseline security)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: default
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# Allow specific service communication
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: portfolio-network-policy
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: portfolio
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: aks-istio-ingress
      ports:
        - protocol: TCP
          port: 3000
  egress:
    # Allow DNS
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
    # Allow API service communication
    - to:
        - podSelector:
            matchLabels:
              app: api-service
      ports:
        - protocol: TCP
          port: 8080
    # Allow database connection (Azure PostgreSQL is outside the cluster,
    # so no pod/namespace selector can match it; restrict by port instead)
    - ports:
        - protocol: TCP
          port: 5432
    # Allow external HTTPS (for external APIs)
    - ports:
        - protocol: TCP
          port: 443
Pod Security Standards (PSS)
# Enforce restricted pod security at namespace level
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
---
# Example: Compliant deployment with security contexts
apiVersion: apps/v1
kind: Deployment
metadata:
  name: secure-portfolio
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: portfolio
  template:
    metadata:
      labels:
        app: portfolio
    spec:
      # Pod-level security context
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      # Service account with minimal permissions
      serviceAccountName: portfolio-sa
      automountServiceAccountToken: false
      containers:
        - name: portfolio
          image: gabby.azurecr.io/portfolio:v1.2.3 # Use specific tags, not :latest
          # Container-level security context
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true # Immutable container filesystem
            runAsNonRoot: true
            runAsUser: 1000
            capabilities:
              drop:
                - ALL # Drop all Linux capabilities
          # Resource limits prevent resource exhaustion attacks
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
          # Writable volumes for tmp and cache
          volumeMounts:
            - name: tmp
              mountPath: /tmp
            - name: cache
              mountPath: /app/.next/cache
          ports:
            - containerPort: 3000
              name: http
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /api/health
              port: http
            initialDelaySeconds: 30
            periodSeconds: 15
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /api/health
              port: http
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
      volumes:
        - name: tmp
          emptyDir: {}
        - name: cache
          emptyDir: {}
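The hardening rules in this manifest are easy to regress in later edits. A minimal sketch of a pre-commit lint (the `lint_container` helper is hypothetical; field names follow the Kubernetes container spec) that flags deviations from the posture above:

```python
def lint_container(spec: dict) -> list:
    """Return a list of violations of the restricted security posture."""
    problems = []
    sc = spec.get("securityContext", {})
    if sc.get("allowPrivilegeEscalation", True):
        problems.append("allowPrivilegeEscalation must be false")
    if not sc.get("readOnlyRootFilesystem", False):
        problems.append("readOnlyRootFilesystem should be true")
    if not sc.get("runAsNonRoot", False):
        problems.append("runAsNonRoot must be true")
    if "ALL" not in sc.get("capabilities", {}).get("drop", []):
        problems.append("capabilities.drop must include ALL")
    image = spec.get("image", "")
    if image.endswith(":latest") or ":" not in image:
        problems.append("image must use a pinned tag, not :latest")
    if "limits" not in spec.get("resources", {}):
        problems.append("resource limits are required")
    return problems

good = {
    "image": "gabby.azurecr.io/portfolio:v1.2.3",
    "securityContext": {
        "allowPrivilegeEscalation": False,
        "readOnlyRootFilesystem": True,
        "runAsNonRoot": True,
        "capabilities": {"drop": ["ALL"]},
    },
    "resources": {"limits": {"cpu": "1000m", "memory": "1Gi"}},
}
print(lint_container(good))                     # no violations
print(lint_container({"image": "app:latest"}))  # several violations
```

In practice you'd wire this into CI against parsed manifests; tools like Kyverno or Gatekeeper enforce the same rules at admission time.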
Azure Key Vault Integration for Secrets
Never store secrets in Git, ConfigMaps, or even Kubernetes Secrets (base64 is not encryption).
# Install Azure Key Vault provider for Secrets Store CSI
helm repo add csi-secrets-store-provider-azure https://azure.github.io/secrets-store-csi-driver-provider-azure/charts
helm install csi-secrets-store-provider-azure/csi-secrets-store-provider-azure --generate-name --namespace kube-system
# Create Key Vault
export KEY_VAULT_NAME="aks-prod-kv-$(uuidgen | cut -c1-8)"
az keyvault create \
--name "$KEY_VAULT_NAME" \
--resource-group "$RESOURCE_GROUP" \
--location "$LOCATION" \
--enable-rbac-authorization
# Store secrets
az keyvault secret set \
--vault-name "$KEY_VAULT_NAME" \
--name "database-connection-string" \
--value "postgresql://user:password@host:5432/db"
# Configure workload identity for pod to access Key Vault
# (Detailed setup omitted for brevity - see Azure docs)
Security Monitoring and Audit
# Enable Azure Defender for Kubernetes
az security pricing create \
--name KubernetesService \
--tier Standard
# View security recommendations
az security assessment list \
--resource-group "$RESOURCE_GROUP"
# Enable Azure RBAC for Kubernetes authorization
# (full audit logs require diagnostic settings in Azure Monitor)
az aks update \
--resource-group "$RESOURCE_GROUP" \
--name "$CLUSTER_NAME" \
--enable-azure-rbac
# Query audit logs (requires Azure Monitor integration)
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | grep -i "security\|unauthorized\|forbidden"
Incident Response Playbook
Common Security Incidents:
1. Compromised Pod

# Immediately isolate pod with network policy
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: isolate-compromised-pod
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: compromised-app
  policyTypes:
    - Ingress
    - Egress
EOF
# Capture pod logs before deletion
kubectl logs compromised-pod -n default --all-containers > incident-logs.txt
# Exec into pod for forensics (if safe)
kubectl exec -it compromised-pod -n default -- /bin/sh
# Delete and redeploy from known-good image
kubectl delete pod compromised-pod -n default
kubectl rollout restart deployment/app-name -n default
# Review image for vulnerabilities
trivy image gabby.azurecr.io/app:tag

2. Certificate Expiration (Let's Encrypt)

# Check certificate expiration
kubectl get certificate -A
kubectl describe certificate cat-herding-tls-cert -n default
# Force renewal if <30 days remain
kubectl delete secret cat-herding-tls -n default
kubectl delete certificate cat-herding-tls-cert -n default
kubectl apply -f certificate.yaml
# Monitor renewal
kubectl get certificaterequest -n default -w

3. Istio Control Plane Failure

# Data plane continues functioning, but no config changes are possible
# Check istiod status
kubectl get pods -n aks-istio-system
kubectl logs -n aks-istio-system -l app=istiod
# Contact Azure support (managed Istio)
# Meanwhile, avoid deploying new services or changing routing
Security Checklist for Production:
- mTLS enabled in STRICT mode
- Network policies deny all by default
- Pod Security Standards enforced (restricted)
- All containers run as non-root
- Read-only root filesystem where possible
- Resource limits on all pods
- Image scanning in CI/CD pipeline
- Secrets stored in Azure Key Vault
- RBAC roles follow least privilege
- Audit logging enabled
- Azure Defender for Kubernetes enabled
- Regular security assessments scheduled
- Incident response playbook documented
🔍 Troubleshooting: Real-World Incident Scenarios
Debugging Methodology for Service Mesh
Service mesh adds complexity—when things break, methodical debugging is essential.
Issue 1: 503 Service Unavailable (Most Common)
Symptoms:
- `curl https://cat-herding.net` returns 503
- Browser shows "Service Temporarily Unavailable"
- Istio Gateway logs show "no healthy upstream"
Root Causes and Solutions:
# Step 1: Verify pods are running
kubectl get pods -l app=portfolio -n default
# If pods are not Running:
kubectl describe pod <pod-name> -n default
kubectl logs <pod-name> -n default --previous # Check previous container logs
# Common causes:
# - ImagePullBackOff: Wrong image name or missing credentials
# - CrashLoopBackOff: Application error, check logs
# - Pending: Insufficient resources, check node capacity
# Step 2: Check if service has endpoints
kubectl get endpoints portfolio -n default
# If no endpoints:
# - Selector mismatch: kubectl get svc portfolio -n default -o yaml | grep selector
# - Pod labels: kubectl get pods -l app=portfolio --show-labels
# Step 3: Check health probes
kubectl describe pod <pod-name> -n default | grep -A5 "Liveness\|Readiness"
# If probes failing:
# - Wrong path: Check /api/health is correct
# - Slow startup: Increase initialDelaySeconds
# - Application not ready: Fix app code
# Step 4: Verify VirtualService routing
kubectl get virtualservice portfolio-virtualservice -n default -o yaml
# Check host matching
kubectl describe virtualservice portfolio-virtualservice -n default
# Step 5: Check Istio Gateway logs
kubectl logs -n aks-istio-ingress -l istio=aks-istio-ingressgateway-external --tail=100
# Look for:
# - "no healthy upstream"
# - "upstream connect error"
# - certificate errors
Real Example with Solution:
# Problem: VirtualService host doesn't match Gateway
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: portfolio-gateway
spec:
  servers:
    - hosts:
        - "cat-herding.net" # ❌ Missing www subdomain
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: portfolio-vs
spec:
  hosts:
    - "www.cat-herding.net" # ❌ Requests to www.cat-herding.net fail!
  gateways:
    - portfolio-gateway

# Solution: Match hosts exactly
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: portfolio-gateway
spec:
  servers:
    - hosts:
        - "cat-herding.net"
        - "www.cat-herding.net" # ✅ Include all expected hosts
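The mismatch above is easy to reason about once you see how hosts are compared. A simplified sketch of Istio-style host matching (exact hosts plus a leading `*.` wildcard label; real Istio matching has more cases):

```python
def host_matches(pattern: str, host: str) -> bool:
    """Simplified Istio-style host match: exact, or a leading '*.' wildcard."""
    if pattern.startswith("*."):
        return host.endswith(pattern[1:])  # "*.x.net" matches "www.x.net"
    return pattern == host

gateway_hosts = ["cat-herding.net", "www.cat-herding.net"]
print(any(host_matches(p, "www.cat-herding.net") for p in gateway_hosts))  # True
print(any(host_matches(p, "api.cat-herding.net") for p in gateway_hosts))  # False
```

This is why a `VirtualService` host has no effect unless the attached `Gateway` also lists (or wildcards) that host.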
Issue 2: Certificate Not Ready / TLS Errors
Symptoms:
- Certificate stuck in "Pending" or "False" state
- Browser shows "Your connection is not private"
- `kubectl describe certificate` shows challenge failures
Debugging Steps:
# Step 1: Check certificate status
kubectl get certificate -A
kubectl describe certificate cat-herding-tls-cert -n default
# Look for status conditions:
# - Ready: False
# - Reason: Pending / Failed
# - Message: (describes what's wrong)
# Step 2: Check certificate request
kubectl get certificaterequest -n default
kubectl describe certificaterequest <request-name> -n default
# Step 3: Check ACME challenge
kubectl get challenges -A
kubectl describe challenge <challenge-name> -n default
# Common HTTP-01 challenge failures:
# - DNS not propagating (wait 5-60 minutes)
# - Firewall blocking /.well-known/acme-challenge/
# - VirtualService not routing challenge requests
# Step 4: Check cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager --tail=100
# Look for:
# - "failed to determine the list of Challenge resources"
# - "Error accepting challenge"
# - "certificate request has failed"
# Step 5: Verify Let's Encrypt rate limits
# Let's Encrypt limits:
# - 50 certificates per registered domain per week
# - 5 duplicate certificates per week
# Check: https://crt.sh/?q=cat-herding.net
# Solution 1: Delete and recreate certificate (if misconfigured)
kubectl delete certificate cat-herding-tls-cert -n default
kubectl delete secret cat-herding-tls -n default
kubectl apply -f certificate.yaml
# Solution 2: Use staging issuer for testing (avoids rate limits)
cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: admin@cat-herding.net
    privateKeySecretRef:
      name: letsencrypt-staging
    solvers:
      - http01:
          ingress:
            class: istio
EOF
# Update certificate to use staging issuer
kubectl edit certificate cat-herding-tls-cert -n default
# Change issuerRef.name to letsencrypt-staging
# Solution 3: Manual certificate if all else fails
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
-keyout tls.key -out tls.crt \
-subj "/CN=cat-herding.net"
kubectl create secret tls cat-herding-tls \
--cert=tls.crt --key=tls.key -n default
Issue 3: High Latency / Slow Responses
Symptoms:
- Response times > 500ms
- Intermittent timeouts
- User complaints about performance
Investigation:
# Step 1: Isolate the layer causing latency
# Test internal service (bypasses Istio Gateway)
kubectl run curl-test --image=curlimages/curl:latest --rm -it --restart=Never -- \
  curl -s -o /dev/null -w "\ntime_total: %{time_total}s\n" http://portfolio.default.svc.cluster.local
# If fast internally but slow externally:
# - Problem is Gateway, DNS, or network path
# If slow internally:
# - Problem is application or sidecar
# Step 2: Check resource utilization
kubectl top pods -n default
kubectl top nodes
# If pods near limits:
# - Increase resource requests/limits
# - Scale horizontally (more replicas)
# Step 3: Check Envoy sidecar overhead
kubectl exec -it <pod-name> -c istio-proxy -n default -- pilot-agent request GET stats | grep -E "upstream_rq_time|response_time"
# Typical sidecar overhead: 1-5ms
# If >10ms: Investigate Istio configuration
# Step 4: Enable distributed tracing
# Jaeger should show request flow and timing breakdown
kubectl port-forward -n istio-system svc/jaeger-query 16686:16686
# Open http://localhost:16686 and search for slow traces
# Step 5: Check for CPU throttling
kubectl describe pod <pod-name> -n default | grep -A5 "Requests\|Limits"
# If throttling:
# - Increase CPU limits
# - Review application efficiency
# Solution 1: Horizontal Pod Autoscaler
kubectl autoscale deployment portfolio --cpu-percent=70 --min=2 --max=10
# Solution 2: Increase resources
kubectl set resources deployment portfolio -c portfolio \
--requests=cpu=500m,memory=1Gi \
--limits=cpu=2000m,memory=2Gi
# Solution 3: Optimize application code
# - Add caching layer (Redis)
# - Optimize database queries
# - Use CDN for static assets
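The `kubectl autoscale ... --cpu-percent=70 --min=2 --max=10` command above creates an HPA that scales by the documented formula `desiredReplicas = ceil(currentReplicas × currentMetric / target)`, clamped to the min/max bounds. A quick sketch of that arithmetic (values are illustrative):

```python
import math

def hpa_desired_replicas(current: int, current_cpu_pct: float,
                         target_cpu_pct: float = 70,
                         min_replicas: int = 2, max_replicas: int = 10) -> int:
    """Horizontal Pod Autoscaler scaling rule, clamped to min/max."""
    desired = math.ceil(current * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, desired))

print(hpa_desired_replicas(2, 140))  # load at double the target -> 4 replicas
print(hpa_desired_replicas(4, 35))   # half the target -> scale down to 2
print(hpa_desired_replicas(8, 300))  # clamped at max_replicas -> 10
```

Knowing this formula makes HPA behavior during a latency incident predictable: sustained 140% utilization on 2 pods converges to 4, not to some opaque value.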
Issue 4: DNS Not Resolving
Symptoms:
- `nslookup cat-herding.net` times out or returns the wrong IP
- Works from some locations but not others
- Intermittent failures
Resolution:
# Step 1: Verify DNS records in Azure
az network dns record-set a list \
--resource-group "$RESOURCE_GROUP" \
--zone-name "cat-herding.net" \
--output table
# Expected: A record pointing to $ISTIO_INGRESS_IP
# Step 2: Check name server delegation
nslookup -type=NS cat-herding.net
# Should return Azure DNS name servers:
# ns1-01.azure-dns.com.
# ns2-01.azure-dns.net.
# ns3-01.azure-dns.org.
# ns4-01.azure-dns.info.
# If not: Update name servers at domain registrar
# Step 3: Test DNS from multiple servers
dig @8.8.8.8 cat-herding.net # Google DNS
dig @1.1.1.1 cat-herding.net # Cloudflare DNS
dig @ns1-01.azure-dns.com cat-herding.net # Azure DNS directly
# If inconsistent: DNS propagation in progress (wait 5-60 min)
# Step 4: Check TTL values
dig cat-herding.net | grep "^cat-herding.net"
# TTL should be reasonable (300-3600 seconds)
# Lower TTL = faster updates, more queries
# Higher TTL = slower updates, fewer queries, better performance
# Step 5: Flush local DNS cache
# macOS:
sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder
# Linux:
sudo resolvectl flush-caches  # older distros: systemd-resolve --flush-caches
# Windows:
ipconfig /flushdns
# Solution: If propagation is taking too long
# Temporarily add entry to /etc/hosts for testing:
echo "$ISTIO_INGRESS_IP cat-herding.net www.cat-herding.net" | sudo tee -a /etc/hosts
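To make the TTL trade-off in Step 4 concrete, here is a rough model. It assumes each caching resolver re-queries the authoritative zone once per TTL expiry, which is a simplification (real resolver behavior varies):

```python
def authoritative_queries_per_day(unique_resolvers: int, ttl_seconds: int) -> int:
    """Rough upper bound on daily queries hitting the authoritative DNS zone."""
    seconds_per_day = 86_400
    return unique_resolvers * (seconds_per_day // ttl_seconds)

# 10,000 distinct caching resolvers worldwide:
print(authoritative_queries_per_day(10_000, 300))   # TTL 300s  -> 2,880,000/day
print(authoritative_queries_per_day(10_000, 3600))  # TTL 3600s ->   240,000/day
```

A 12x longer TTL cuts authoritative query volume 12x, at the cost of changes taking up to an hour to propagate.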
Issue 5: Istio Control Plane Failure
Symptoms:
- Cannot deploy new services
- VirtualService changes not applied
- Gateway configuration stuck
Impact:
- Data plane continues working: Existing traffic unaffected
- Control plane down: Cannot change routing, deploy new services, or update config
Response:
# Step 1: Verify istiod health
kubectl get pods -n aks-istio-system
kubectl describe pod <istiod-pod> -n aks-istio-system
kubectl logs -n aks-istio-system -l app=istiod --tail=100
# Step 2: Check Istio webhook configuration
kubectl get mutatingwebhookconfigurations | grep istio
kubectl get validatingwebhookconfigurations | grep istio
# If webhooks failing:
# - Pods may not get sidecar injection
# - Configuration validation may fail
# Step 3: For Azure-managed Istio, contact Azure support
# Service Request Type: AKS > Istio Add-on Issue
# Include:
# - Cluster name and resource group
# - Timestamp of issue start
# - Output of kubectl get pods -n aks-istio-system
# Step 4: Temporary workarounds
# - Avoid deploying new services until resolved
# - Existing services continue functioning
# - Do not delete/recreate Gateway or VirtualService resources
# Step 5: If self-hosted Istio, restart control plane
kubectl rollout restart deployment/istiod -n istio-system
Emergency Commands Cheat Sheet
# Quick health check
kubectl get pods -A | grep -v Running
kubectl get nodes | grep -v Ready
kubectl top nodes
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20
# Restart everything (last resort)
kubectl rollout restart deployment/<app-name> -n default
kubectl rollout restart deployment/istiod -n aks-istio-system
# Force certificate renewal
kubectl delete secret <cert-secret-name> -n default
kubectl delete certificate <cert-name> -n default
kubectl apply -f certificate.yaml
# Check all Istio configuration
istioctl analyze -A
# Dump Istio proxy configuration
kubectl exec -it <pod-name> -c istio-proxy -n default -- curl localhost:15000/config_dump > config.json
# Port-forward for local debugging
kubectl port-forward svc/portfolio 8080:80 -n default
curl http://localhost:8080
# Emergency: Scale down problematic service
kubectl scale deployment <app-name> --replicas=0 -n default
# View all resources in namespace
kubectl get all -n default
# Check certificate from external
echo | openssl s_client -servername cat-herding.net -connect cat-herding.net:443 2>/dev/null | openssl x509 -noout -dates
Performance Degradation Patterns
Pattern 1: Gradual Performance Degradation
- Cause: Memory leak or file descriptor exhaustion
- Solution: Enable HPA, set appropriate resource limits, restart pods regularly
Pattern 2: Spike During Deployments
- Cause: Rolling update with insufficient replicas
- Solution: Increase minReplicas, use pod anti-affinity, configure PodDisruptionBudget
Pattern 3: Periodic Slowdowns
- Cause: Batch jobs, backups, or cron consuming resources
- Solution: Use separate node pool for batch workloads, set resource limits
Pattern 4: Certificate Expiration Every 90 Days
- Cause: Let's Encrypt certificates not auto-renewing
- Solution: Verify cert-manager running, check ClusterIssuer, monitor certificate expiration
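Knowing when renewal should fire helps you alert before expiry rather than at it. cert-manager's default `renewBefore` is one third of the certificate duration (30 days for a 90-day Let's Encrypt cert), so renewal is first attempted at day 60 of the lifetime. A small sketch of that date arithmetic:

```python
from datetime import date, timedelta

def cert_renewal_date(not_before: date, duration_days: int = 90,
                      renew_before_days: int = 30) -> date:
    """Date on which cert-manager first attempts renewal."""
    return not_before + timedelta(days=duration_days - renew_before_days)

print(cert_renewal_date(date(2024, 1, 1)))  # 2024-03-01, day 60 of a 90-day cert
```

If `kubectl get certificate` shows a cert well past this point and still not renewed, something in the issuer or challenge path is broken.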
Monitoring Checklist:
# Set up alerts for:
# - Pod crash loop (>3 restarts in 5 minutes)
# - High error rate (>1% 5xx responses)
# - Certificate expiration (<30 days)
# - Node CPU >80% sustained
# - Node memory >85% sustained
# - Disk space <10% free
# Example: Prometheus alert rules
# (Configure in Prometheus or Azure Monitor)
groups:
  - name: kubernetes_alerts
    rules:
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
📊 Monitoring and Observability
Enable Istio Telemetry
# Install Prometheus
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/prometheus.yaml
# Install Grafana
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/grafana.yaml
# Install Kiali (Istio dashboard)
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/kiali.yaml
# Access Kiali dashboard
kubectl port-forward -n istio-system svc/kiali 20001:20001
# Open http://localhost:20001
View Istio Traffic Flow
Open the Kiali Graph view to see live request flow between services, including mTLS status, request rates, and error percentages per edge.
🎓 Best Practices
1. Resource Organization
k8s/
├── kustomization.yaml             # Root kustomization
└── apps/
    ├── portfolio/
    │   └── base/
    │       ├── kustomization.yaml
    │       ├── deployment.yaml
    │       ├── service.yaml
    │       ├── istio-gateway.yaml
    │       └── istio-virtualservice.yaml
    └── api-service/
        └── base/
            ├── kustomization.yaml
            ├── deployment.yaml
            ├── service.yaml
            ├── istio-gateway.yaml
            └── istio-virtualservice.yaml
2. Security Considerations
- Use RBAC: Implement proper role-based access control
- Network Policies: Restrict pod-to-pod communication
- Image Scanning: Scan container images for vulnerabilities
- Secrets Management: Use Azure Key Vault for sensitive data
- TLS Everywhere: Always use HTTPS, even for internal services
3. High Availability
# Example: HA deployment configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: portfolio
spec:
  replicas: 3 # Multiple replicas
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0 # Zero-downtime deployments
  template:
    spec:
      affinity:
        podAntiAffinity: # Spread across nodes
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - portfolio
                topologyKey: kubernetes.io/hostname
4. Resource Limits
Always define resource requests and limits:
resources:
  requests:
    cpu: 250m     # Guaranteed CPU
    memory: 512Mi # Guaranteed memory
  limits:
    cpu: 1000m    # Maximum CPU
    memory: 1Gi   # Maximum memory
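Requests also determine pod density per node, which drives node-pool sizing. A sketch of that bin-packing arithmetic (the allocatable figures are assumptions for a 2 vCPU / 8 GB node after system reservations; check `kubectl describe node` for real values):

```python
def pods_per_node(cpu_request_m: int = 250, mem_request_mi: int = 512,
                  allocatable_cpu_m: int = 1600,
                  allocatable_mem_mi: int = 6554) -> int:
    """Pods that fit one node, limited by whichever resource runs out first."""
    by_cpu = allocatable_cpu_m // cpu_request_m
    by_mem = allocatable_mem_mi // mem_request_mi
    return min(by_cpu, by_mem)

print(pods_per_node())                    # 6 -- CPU-bound: 1600m / 250m per pod
print(pods_per_node(cpu_request_m=500))   # 3 -- doubling requests halves density
```

Note that scheduling uses requests, not limits, so over-generous requests waste nodes even if actual usage is low.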
🚀 Deployment Automation
Create Deployment Script
#!/bin/bash
# deploy-app.sh - Automated application deployment
set -e
APP_NAME=$1
SUBDOMAIN=$2
IMAGE=$3
PORT=$4
if [ -z "$APP_NAME" ] || [ -z "$SUBDOMAIN" ] || [ -z "$IMAGE" ] || [ -z "$PORT" ]; then
  echo "Usage: ./deploy-app.sh <app-name> <subdomain> <image> <port>"
  echo "Example: ./deploy-app.sh my-app api myregistry.azurecr.io/my-app:v1.0 8080"
  exit 1
fi
RESOURCE_GROUP="aks-production-rg"
DOMAIN_NAME="cat-herding.net"
echo "🚀 Deploying $APP_NAME on $SUBDOMAIN.$DOMAIN_NAME"
# 1. Get Ingress IP
echo "📡 Getting Ingress Gateway IP..."
INGRESS_IP=$(kubectl get svc -n aks-istio-ingress aks-istio-ingressgateway-external -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo " IP: $INGRESS_IP"
# 2. Create DNS Record
echo "🌐 Creating DNS record..."
az network dns record-set a add-record \
--resource-group "$RESOURCE_GROUP" \
--zone-name "$DOMAIN_NAME" \
--record-set-name "$SUBDOMAIN" \
--ipv4-address "$INGRESS_IP" 2>/dev/null || echo " DNS record already exists"
# 3. Create Certificate
echo "🔐 Creating TLS certificate..."
cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: ${SUBDOMAIN}-tls-cert
  namespace: default
spec:
  secretName: ${SUBDOMAIN}-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - ${SUBDOMAIN}.${DOMAIN_NAME}
EOF
# 4. Wait for certificate
echo "⏳ Waiting for certificate to be ready (this may take a few minutes)..."
kubectl wait --for=condition=Ready certificate/${SUBDOMAIN}-tls-cert -n default --timeout=300s
# 5. Deploy application
echo "📦 Deploying application manifests..."
mkdir -p k8s/apps/${APP_NAME}/base
cd k8s/apps/${APP_NAME}/base
# Generate manifests (templates omitted for brevity - see full examples above)
# ... deployment.yaml, service.yaml, gateway.yaml, virtualservice.yaml ...
kubectl apply -k .
# 6. Wait for rollout
echo "⏳ Waiting for deployment to be ready..."
kubectl rollout status deployment/${APP_NAME} -n default
# 7. Test
echo "✅ Testing deployment..."
sleep 10
curl -f https://${SUBDOMAIN}.${DOMAIN_NAME}/health || echo "⚠️ Health check failed (may need time)"
echo "🎉 Deployment complete! Visit https://${SUBDOMAIN}.${DOMAIN_NAME}"
📝 Summary
You now have a complete production-ready Azure AKS cluster with:
✅ Managed Kubernetes cluster with auto-scaling
✅ Istio service mesh for advanced traffic management
✅ Custom domain DNS managed in Azure
✅ Automated TLS certificates via cert-manager and Let's Encrypt
✅ HTTP/HTTPS routing with subdomain support
✅ Monitoring and observability with Prometheus, Grafana, and Kiali
✅ Reusable deployment templates for new applications
📊 Production Observability and SRE Practices
Defining SLIs, SLOs, and Error Budgets
Service Level Objectives drive operational decisions and balance feature velocity with reliability.
Example SLO Definition:
# portfolio-slos.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: portfolio-slos
  namespace: default
data:
  slo-definition: |
    # Portfolio Service SLOs (30-day rolling window)
    Availability SLO:
      Target: 99.5% uptime
      Measurement: (successful requests / total requests) >= 0.995
      Error Budget: 0.5% (216 minutes/month or ~7.2 minutes/day)
    Latency SLO:
      p50 < 200ms
      p95 < 500ms
      p99 < 1000ms
    Error Rate SLO:
      5xx errors < 0.1% of requests
      4xx errors < 5% of requests
    Data Freshness SLO:
      Blog posts appear within 60 seconds of publish
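The error-budget figures in the SLO definition fall directly out of the availability target. A quick check of that arithmetic:

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed full downtime in the rolling window."""
    return (1 - slo_target) * window_days * 24 * 60

budget = error_budget_minutes(0.995)
print(f"{budget:.0f} minutes/month")    # 216 minutes/month
print(f"{budget / 30:.1f} minutes/day") # 7.2 minutes/day
```

Running the same function for 99.9% gives about 43 minutes/month, which makes the cost of each extra "nine" concrete during SLO negotiations.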
SLI Implementation with Prometheus:
# prometheus-rules-slis.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: portfolio-slis
  namespace: default
spec:
  groups:
    - name: portfolio_slis
      interval: 30s
      rules:
        # Availability SLI
        - record: portfolio:availability:ratio_rate5m
          expr: |
            sum(rate(istio_requests_total{
              destination_service_name="portfolio",
              response_code!~"5.."
            }[5m]))
            /
            sum(rate(istio_requests_total{
              destination_service_name="portfolio"
            }[5m]))
        # Latency SLI (p95)
        - record: portfolio:latency:p95
          expr: |
            histogram_quantile(0.95,
              sum(rate(istio_request_duration_milliseconds_bucket{
                destination_service_name="portfolio"
              }[5m])) by (le)
            )
        # Error Rate SLI
        - record: portfolio:error_rate:ratio_rate5m
          expr: |
            sum(rate(istio_requests_total{
              destination_service_name="portfolio",
              response_code=~"5.."
            }[5m]))
            /
            sum(rate(istio_requests_total{
              destination_service_name="portfolio"
            }[5m]))
Alert Design Philosophy:
| Alert Type | Triggers On | Action | Example |
|---|---|---|---|
| Page | SLO violation imminent, customer impact | Immediate response (2-5 min) | Availability <99.0%, p95 latency >1000ms |
| Ticket | Early warning, no customer impact yet | Next business day | Availability <99.7%, error budget 50% consumed |
| Info | Useful context, no action needed | Review weekly | New deployment, configuration change |
# prometheus-alerts-slos.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: portfolio-slo-alerts
  namespace: default
spec:
  groups:
    - name: portfolio_slo_alerts
      rules:
        # Page: SLO violation (immediate impact)
        - alert: PortfolioAvailabilityCritical
          expr: |
            (
              sum(rate(istio_requests_total{
                destination_service_name="portfolio",
                response_code!~"5.."
              }[5m]))
              /
              sum(rate(istio_requests_total{
                destination_service_name="portfolio"
              }[5m]))
            ) < 0.990
          for: 2m
          labels:
            severity: critical
            slo: availability
          annotations:
            summary: "Portfolio availability below 99% (current: {{ $value | humanizePercentage }})"
            description: "SLO violation - immediate customer impact. Error budget exhausted."
            runbook: "https://wiki.company.com/runbooks/portfolio-availability"
        # Ticket: Early warning (error budget burning)
        - alert: PortfolioErrorBudgetBurning
          expr: |
            (
              1 - (
                sum(rate(istio_requests_total{
                  destination_service_name="portfolio",
                  response_code!~"5.."
                }[2h]))
                /
                sum(rate(istio_requests_total{
                  destination_service_name="portfolio"
                }[2h]))
              )
            ) > 0.0025 # 50% of the 0.5% error budget rate
          for: 15m
          labels:
            severity: warning
            slo: availability
          annotations:
            summary: "Error budget burning faster than sustainable rate"
            description: "2h error ratio is {{ $value | humanizePercentage }}; at this rate the 30-day budget is consumed early"
        # Page: Latency SLO violation
        - alert: PortfolioLatencyCritical
          expr: |
            histogram_quantile(0.95,
              sum(rate(istio_request_duration_milliseconds_bucket{
                destination_service_name="portfolio"
              }[5m])) by (le)
            ) > 1000
          for: 5m
          labels:
            severity: critical
            slo: latency
          annotations:
            summary: "p95 latency above 1000ms (current: {{ $value }}ms)"
            description: "User experience degraded. Investigate immediately."
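Burn-rate alerts like `PortfolioErrorBudgetBurning` are easier to tune once the underlying arithmetic is explicit: a burn rate of 1 consumes the 30-day budget in exactly 30 days, and higher rates exhaust it proportionally faster. A sketch:

```python
def hours_to_budget_exhaustion(observed_error_ratio: float,
                               slo_target: float = 0.995,
                               window_days: int = 30) -> float:
    """How long the rolling error budget lasts at a sustained error ratio."""
    budget_ratio = 1 - slo_target               # 0.005 for a 99.5% SLO
    burn_rate = observed_error_ratio / budget_ratio
    return window_days * 24 / burn_rate

print(round(hours_to_budget_exhaustion(0.005), 1))  # 720.0 h -- exactly sustainable
print(round(hours_to_budget_exhaustion(0.01), 1))   # 360.0 h -- 2x burn rate
```

This is the reasoning behind multi-window burn-rate alerting: page on fast burns measured over short windows, ticket on slow burns measured over long ones.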
Distributed Tracing Strategy
Trace Context Propagation:
// src/middleware.ts - Ensure trace headers are propagated
import { NextRequest, NextResponse } from "next/server";

export function middleware(request: NextRequest) {
  const response = NextResponse.next();

  // Propagate Istio/Jaeger trace headers
  const traceHeaders = [
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "b3",
    "traceparent", // W3C Trace Context
    "tracestate",
  ];

  traceHeaders.forEach((header) => {
    const value = request.headers.get(header);
    if (value) {
      response.headers.set(header, value);
    }
  });

  return response;
}
Jaeger Configuration:
# Deploy Jaeger for distributed tracing
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.52
          env:
            - name: COLLECTOR_OTLP_ENABLED
              value: "true"
            - name: COLLECTOR_ZIPKIN_HOST_PORT
              value: ":9411" # Enable the Zipkin-compatible endpoint Istio sends to
            - name: SPAN_STORAGE_TYPE
              value: "memory" # Use elasticsearch for production
          ports:
            - containerPort: 16686
              name: ui
            - containerPort: 14268
              name: collector
            - containerPort: 9411
              name: zipkin
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger-query
  namespace: default
spec:
  ports:
    - port: 16686
      targetPort: 16686
      name: ui
  selector:
    app: jaeger
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger-collector
  namespace: default
spec:
  ports:
    - port: 14268
      targetPort: 14268
      name: collector
    - port: 9411
      targetPort: 9411
      name: zipkin # Istio's zipkin tracer targets this port
  selector:
    app: jaeger
---
EOF

# Configure Istio to send traces to Jaeger
kubectl apply -f - <<EOF
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: istio-config
  namespace: aks-istio-system
spec:
  meshConfig:
    defaultConfig:
      tracing:
        sampling: 100.0 # Percentage of requests sampled (0-100); use 1-10 in production
        zipkin:
          address: jaeger-collector.default.svc.cluster.local:9411
EOF
# Access Jaeger UI
kubectl port-forward -n default svc/jaeger-query 16686:16686
# Open http://localhost:16686
Trace Sampling Strategy:
| Environment | Sampling Rate | Rationale | Storage Cost |
|---|---|---|---|
| Development | 100% | Debug all requests | Negligible (memory storage) |
| Staging | 50% | Balance between coverage and cost | ~$50-100/mo (Elasticsearch) |
| Production (<1K RPS) | 10-20% | Capture anomalies, control costs | ~$100-200/mo |
| Production (>10K RPS) | 1-5% | Focused sampling, head-based | ~$200-500/mo |
| Incident Investigation | 100% (tail-based) | Temporarily sample errors only | Spike during incident |
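The storage figures in the table follow from a simple volume model: traces retained scale linearly with request rate and sampling fraction. A sketch for sizing trace storage:

```python
def traces_stored_per_day(rps: float, sample_rate: float) -> int:
    """Sampled traces retained per day (spans per trace multiply this further)."""
    seconds_per_day = 86_400
    return int(rps * seconds_per_day * sample_rate)

print(traces_stored_per_day(1_000, 0.10))   # 8,640,000 traces/day
print(traces_stored_per_day(10_000, 0.01))  # also 8,640,000 -- same storage load
```

Note the symmetry: 10x the traffic at one tenth the sampling rate produces the same storage load, which is why high-RPS services drop to 1-5% sampling.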
Capacity Planning Models
Resource Projection Formula:
# capacity-planning.py
def project_capacity(
    current_rps: int,
    target_rps: int,
    current_pods: int,
    cpu_per_request_ms: float,
    memory_per_request_mb: float,
    target_cpu_utilization: float = 0.70,  # 70% target
    overhead_factor: float = 1.20,  # 20% overhead for spikes
) -> dict:
    """
    Project required capacity for target RPS.

    Example:
    - Current: 100 RPS with 3 pods @ 50% CPU
    - Target: 500 RPS
    - CPU per request: 10ms
    - Memory per request: 5MB
    """
    # Calculate CPU requirements
    total_cpu_ms_per_sec = target_rps * cpu_per_request_ms
    total_cpu_cores = (total_cpu_ms_per_sec / 1000) / target_cpu_utilization
    total_cpu_with_overhead = total_cpu_cores * overhead_factor

    # Calculate memory requirements
    # Assume requests are processed concurrently with 100ms average latency
    concurrent_requests = target_rps * 0.1  # 100ms latency = 0.1s
    total_memory_mb = concurrent_requests * memory_per_request_mb
    total_memory_with_overhead = total_memory_mb * overhead_factor

    # Calculate pod requirements
    # Assume each pod can handle 0.5 CPU cores and 1GB memory
    pods_for_cpu = int(total_cpu_with_overhead / 0.5) + 1
    pods_for_memory = int(total_memory_with_overhead / 1024) + 1
    required_pods = max(pods_for_cpu, pods_for_memory)

    # Calculate node requirements
    # Standard_D2s_v3: 2 vCPU, 8GB RAM
    # Assume 1.6 CPU and 6.4GB usable per node (20% for system)
    required_nodes = int((required_pods * 0.5) / 1.6) + 1

    # Cost estimate ($75/node/month for Standard_D2s_v3)
    monthly_cost = required_nodes * 75

    return {
        "target_rps": target_rps,
        "required_pods": required_pods,
        "required_cpu_cores": total_cpu_with_overhead,
        "required_memory_gb": total_memory_with_overhead / 1024,
        "required_nodes": required_nodes,
        "monthly_cost_usd": monthly_cost,
        "cpu_utilization_target": target_cpu_utilization,
        "overhead_factor": overhead_factor,
    }


# Example calculation
result = project_capacity(
    current_rps=100,
    target_rps=500,
    current_pods=3,
    cpu_per_request_ms=10,
    memory_per_request_mb=5,
)
print(f"For {result['target_rps']} RPS:")
print(f"  Required Pods: {result['required_pods']}")
print(f"  Required Nodes: {result['required_nodes']}")
print(f"  CPU Cores: {result['required_cpu_cores']:.2f}")
print(f"  Memory: {result['required_memory_gb']:.2f} GB")
print(f"  Monthly Cost: ${result['monthly_cost_usd']}")
Load Testing with K6:
// load-test.js
import http from "k6/http";
import { check, sleep } from "k6";
import { Rate, Trend } from "k6/metrics";

// Custom metrics
const errorRate = new Rate("error_rate");
const latencyTrend = new Trend("latency");

export const options = {
  stages: [
    { duration: "2m", target: 50 },  // Ramp up to 50 users
    { duration: "5m", target: 50 },  // Stay at 50 users
    { duration: "2m", target: 100 }, // Ramp to 100 users
    { duration: "5m", target: 100 }, // Stay at 100 users
    { duration: "2m", target: 200 }, // Spike to 200 users
    { duration: "5m", target: 200 }, // Sustained high load
    { duration: "2m", target: 0 },   // Ramp down
  ],
  thresholds: {
    http_req_duration: ["p(95)<500", "p(99)<1000"], // 95% < 500ms, 99% < 1s
    error_rate: ["rate<0.01"], // Error rate < 1%
    http_req_failed: ["rate<0.01"],
  },
};

export default function () {
  const response = http.get("https://cat-herding.net");

  // Check response
  const success = check(response, {
    "status is 200": (r) => r.status === 200,
    "response time OK": (r) => r.timings.duration < 500,
  });

  // Record metrics
  errorRate.add(!success);
  latencyTrend.add(response.timings.duration);

  sleep(1);
}

// Run test:
//   k6 run --out json=results.json load-test.js
// Analyze results.json with jq, or stream metrics to Grafana/InfluxDB
Deployment Strategies and Rollback Plans
Canary Deployment with Istio:
# portfolio-canary.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: portfolio-v2-canary
namespace: default
spec:
replicas: 1 # Start with 1 pod (10% of traffic)
selector:
matchLabels:
app: portfolio
version: v2
template:
metadata:
labels:
app: portfolio
version: v2
spec:
containers:
- name: portfolio
image: gabby.azurecr.io/portfolio:v2-candidate
# ... rest of spec same as v1
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: portfolio-canary
namespace: default
spec:
hosts:
- "cat-herding.net"
- "www.cat-herding.net"
gateways:
- portfolio-gateway
http:
- match:
- headers:
x-canary:
exact: "true" # Internal testing
route:
- destination:
host: portfolio
subset: v2
weight: 100
- route:
- destination:
host: portfolio
subset: v1
weight: 90 # 90% to stable version
- destination:
host: portfolio
subset: v2
weight: 10 # 10% to canary
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: portfolio-versions
namespace: default
spec:
host: portfolio
subsets:
- name: v1
labels:
version: v1
- name: v2
labels:
version: v2
Automated Canary Promotion with Flagger:
# Install Flagger (progressive delivery operator)
kubectl apply -k github.com/fluxcd/flagger//kustomize/istio
# Define automated canary rollout
cat <<EOF | kubectl apply -f -
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: portfolio
namespace: default
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: portfolio
service:
port: 80
analysis:
interval: 1m
threshold: 5 # Number of failed checks before rollback
maxWeight: 50 # Max canary weight
stepWeight: 10 # Increment weight by 10% each step
metrics:
- name: request-success-rate
thresholdRange:
min: 99 # Rollback if success rate < 99%
interval: 1m
- name: request-duration
thresholdRange:
max: 500 # Rollback if p99 > 500ms
interval: 1m
webhooks:
- name: slack-notification
url: https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK
timeout: 5s
metadata:
channel: "#deployments"
EOF
# Trigger canary by updating image
kubectl set image deployment/portfolio portfolio=gabby.azurecr.io/portfolio:v2
# Flagger will:
# 1. Deploy canary (0% traffic)
# 2. Shift 10% traffic to canary
# 3. Wait 1 minute, check metrics
# 4. If metrics good: Shift 20%, repeat
# 5. If metrics bad: Rollback immediately
# 6. If success: Promote to 100%
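The promotion loop described above can be sketched as a small state machine. This is a hypothetical simulation of the stepWeight/threshold behavior, not Flagger's actual code; `check_metrics` stands in for the Prometheus queries Flagger runs each interval:

```python
def run_canary(check_metrics, step_weight=10, max_weight=50, threshold=5) -> str:
    """Simulate Flagger-style progressive promotion.

    check_metrics(weight) -> True when success-rate/latency checks pass
    at the given canary weight. Returns "promoted" or "rolled_back".
    """
    weight, failed = 0, 0
    while weight < max_weight:
        weight = min(weight + step_weight, max_weight)  # shift more traffic
        if check_metrics(weight):
            failed = 0  # healthy step: reset the failure counter
        else:
            failed += 1
            weight = max(weight - step_weight, 0)  # step traffic back
            if failed >= threshold:
                return "rolled_back"  # too many consecutive failed checks
    return "promoted"  # reached max_weight with healthy metrics
```

With defaults, a canary that stays healthy is promoted after five 10% steps; one that fails five consecutive checks is rolled back, mirroring `threshold: 5` in the manifest.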
Emergency Rollback Playbook:
# SCENARIO: Deployment causing errors, need immediate rollback
# Step 1: Identify current deployment
kubectl rollout history deployment/portfolio -n default
# Step 2: Check current revision
kubectl get deployment portfolio -n default -o jsonpath='{.metadata.annotations.deployment\.kubernetes\.io/revision}'
# Step 3: Quick rollback to previous revision
kubectl rollout undo deployment/portfolio -n default
# Step 4: Verify rollback is working
kubectl rollout status deployment/portfolio -n default
kubectl get pods -n default -l app=portfolio -w
# Step 5: If using Istio canary, immediately shift traffic to v1
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: portfolio-virtualservice
namespace: default
spec:
hosts:
- "cat-herding.net"
- "www.cat-herding.net"
gateways:
- portfolio-gateway
http:
- route:
- destination:
host: portfolio
subset: v1
weight: 100 # 100% to stable version
- destination:
host: portfolio
subset: v2
weight: 0 # 0% to broken canary
EOF
# Step 6: Communicate status
# Post to #incidents Slack channel:
# "🔥 Emergency rollback executed for portfolio service
# Reason: [high error rate / latency / crashes]
# Action: Rolled back from revision X to revision Y
# Status: Monitoring recovery
# ETA: Stable in 2-5 minutes"
# Step 7: Monitor recovery
watch -n 2 'kubectl get pods -n default -l app=portfolio'
# Step 8: Verify metrics returning to normal
# Check Grafana dashboard, Prometheus alerts, error logs
# Step 9: Post-mortem (after recovery)
# Document:
# - What triggered the rollback
# - How it was detected (alert, user report, monitoring)
# - Time to detection and time to recovery
# - Root cause analysis
# - Action items to prevent recurrence
Cost Optimization Strategies
Right-Sizing Workloads:
# Install Vertical Pod Autoscaler (VPA) for recommendations
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler && ./hack/vpa-up.sh
# Create VPA in recommendation mode (doesn't auto-adjust)
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: portfolio-vpa
namespace: default
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: portfolio
updatePolicy:
updateMode: "Off" # Recommendation only
resourcePolicy:
containerPolicies:
- containerName: portfolio
minAllowed:
cpu: 100m
memory: 128Mi
maxAllowed:
cpu: 2000m
memory: 4Gi
EOF
# Get recommendations after 24-48 hours
kubectl describe vpa portfolio-vpa -n default
# Example output:
# Recommendation:
# Container Recommendations:
# Container Name: portfolio
# Lower Bound:
# Cpu: 150m
# Memory: 256Mi
# Target:
# Cpu: 300m # Recommended request
# Memory: 512Mi # Recommended request
# Uncapped Target:
# Cpu: 450m
# Memory: 768Mi
# Upper Bound:
# Cpu: 1000m
# Memory: 1Gi
Spot Instance Strategy:
# Mixed node pool: on-demand + spot
# Use on-demand for critical services, spot for batch workloads
# Label portfolio deployment to prefer spot instances
apiVersion: apps/v1
kind: Deployment
metadata:
name: portfolio-batch-jobs
spec:
template:
spec:
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
preference:
matchExpressions:
- key: kubernetes.azure.com/scalesetpriority
operator: In
values:
- spot
tolerations:
- key: kubernetes.azure.com/scalesetpriority
operator: Equal
value: spot
effect: NoSchedule
containers:
- name: batch-processor
image: gabby.azurecr.io/batch:latest
# Batch jobs can tolerate interruptions (70-90% cost savings)
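The "70-90% cost savings" claim deserves a caveat: evictions force some work to be re-run, which eats into the discount. A small sketch for estimating the effective saving (the rework fraction is something you'd measure from your own batch jobs):

```python
def spot_effective_savings(on_demand_hr: float, spot_hr: float,
                           rework_fraction: float) -> float:
    """Fraction saved running a batch job on spot, net of re-run overhead.

    rework_fraction: share of work repeated after evictions (e.g. 0.2 = 20%).
    """
    effective_hr = spot_hr * (1 + rework_fraction)  # pay again for redone work
    return 1 - effective_hr / on_demand_hr
```

At a 70% spot discount, even 20% rework still leaves a net saving of about 64%, which is why interruption-tolerant batch workloads are the canonical spot use case.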
Idle Resource Cleanup:
# Identify unused resources
# Cost optimization tip: Delete these regularly
# Unused PersistentVolumeClaims (orphaned storage)
kubectl get pvc -A | grep -vw Bound  # PVCs don't support a status.phase field selector
# Unused ConfigMaps/Secrets (check if referenced)
kubectl get configmaps -A
kubectl get secrets -A
# Old ReplicaSets (kept for rollback history, can be pruned)
kubectl get rs -A --sort-by=.metadata.creationTimestamp
# Clean up completed jobs older than 7 days
kubectl delete jobs --field-selector status.successful=1 \
--all-namespaces \
--dry-run=client # Remove dry-run when ready
# Schedule automatic cleanup with CronJob
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: CronJob
metadata:
name: cleanup-old-jobs
namespace: default
spec:
schedule: "0 2 * * 0" # Every Sunday at 2 AM
jobTemplate:
spec:
template:
spec:
serviceAccountName: cleanup-sa
containers:
- name: kubectl
image: bitnami/kubectl:latest
command:
- /bin/sh
- -c
- |
# NOTE: requires jq in the image; kubectl cannot delete named resources with -A
kubectl get jobs -A -o json \
| jq -r '.items[] | select(.status.completionTime != null and (now - (.status.completionTime | fromdateiso8601)) > 604800) | "\(.metadata.namespace) \(.metadata.name)"' \
| while read -r ns name; do
kubectl delete job -n "$ns" "$name" --ignore-not-found=true
done
restartPolicy: OnFailure
EOF
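The jq filter inside the CronJob selects Jobs whose completionTime is more than 7 days (604,800 s) old. The same selection sketched in Python against `kubectl get jobs -A -o json` output, handy for auditing what would be deleted before enabling the CronJob (hypothetical helper):

```python
import json
from datetime import datetime, timedelta, timezone

WEEK = timedelta(days=7)


def stale_jobs(jobs_json: str, now=None):
    """Return (namespace, name) pairs for Jobs completed more than 7 days ago.

    jobs_json: the output of `kubectl get jobs -A -o json`.
    """
    now = now or datetime.now(timezone.utc)
    stale = []
    for item in json.loads(jobs_json).get("items", []):
        done = item.get("status", {}).get("completionTime")
        if not done:
            continue  # still running, or failed without a completion time
        finished = datetime.fromisoformat(done.replace("Z", "+00:00"))
        if now - finished > WEEK:
            stale.append((item["metadata"]["namespace"], item["metadata"]["name"]))
    return stale
```

Pipe the kubectl output into this before the first automated run to confirm nothing load-bearing matches the age filter.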
Cost Monitoring Dashboard:
| Resource | Monthly Cost | Optimization Opportunity |
|---|---|---|
| AKS cluster management | $0 | N/A (free) |
| 3x Standard_D2s_v3 nodes | $225 | Use spot instances for non-critical (~$68) |
| Managed Istio | $150 | Consider self-hosted if team has expertise |
| Load Balancer (public IP) | $20 | Share LB across multiple services |
| Azure DNS Zone | $1 | N/A (minimal) |
| Bandwidth (100GB egress) | $10 | Use CDN to reduce egress |
| Persistent storage (500GB) | $38 | Clean up old PVCs, use lifecycle policies |
| Total | $444/mo | Optimized: $292/mo (34% savings) |
Azure vs GCP: Cost Comparison for Hobby Projects
Why I Migrated from GCP to Azure (Real Experience)
After running my portfolio on GKE (Google Kubernetes Engine) for 18 months, I migrated to AKS and saw significant cost savings. Here's the honest breakdown:
GCP Configuration (Previous Setup)
# GKE Cluster - us-central1-a
# 3x e2-medium nodes (2 vCPU, 4GB RAM each)
# Cloud SQL PostgreSQL (db-f1-micro)
# Cloud Load Balancer
# Cloud DNS
# Monthly Cost Breakdown (GCP):
# - GKE cluster management: $74.40/mo ($0.10/hour)
# - 3x e2-medium nodes: $73.73/mo ($24.58 each)
# - Cloud SQL (db-f1-micro): $7.67/mo
# - Cloud Load Balancer: $18.26/mo
# - Cloud SQL Proxy (included): $0
# - Cloud DNS zone: $0.20/mo
# - Bandwidth (100GB egress): $12.00/mo
# - Persistent disks (500GB): $80.00/mo (standard PD)
# Total: ~$266/mo (WITHOUT Istio)
# With Istio self-hosted: ~$340/mo
Azure Configuration (Current Setup)
# AKS Cluster - East US
# 3x Standard_B2s nodes (2 vCPU, 4GB RAM each) - HOBBY TIER
# Azure Database for PostgreSQL (Flexible Server, Burstable B1ms)
# Azure Load Balancer
# Azure DNS
# Monthly Cost Breakdown (Azure):
# - AKS cluster management: $0 (FREE!)
# - 3x Standard_B2s nodes: $120.36/mo ($40.12 each)
# - Azure Database for PostgreSQL (B1ms): $12.41/mo
# - Load Balancer (public IP): $3.65/mo
# - Azure DNS zone: $0.50/mo
# - Bandwidth (100GB egress): $8.76/mo
# - Managed disks (500GB Standard HDD): $19.71/mo
# Total: ~$165/mo (WITHOUT Istio)
# With managed Istio: ~$315/mo
Side-by-Side Comparison
| Component | GCP Cost | Azure Cost | Savings | Notes |
|---|---|---|---|---|
| Cluster Management | $74.40 | $0 | $74.40 | AKS management is completely free |
| 3x Nodes (2vCPU, 4GB) | $73.73 (e2-medium) | $120.36 (B2s) | -$46.63 | Azure B-series burstable cheaper than D-series but more than GCP e2 |
| Database (small) | $7.67 (db-f1-micro) | $12.41 (B1ms) | -$4.74 | Azure Flexible Server slightly more |
| Load Balancer | $18.26 | $3.65 | $14.61 | Azure LB significantly cheaper |
| DNS Zone | $0.20 | $0.50 | -$0.30 | Negligible difference |
| Bandwidth (100GB) | $12.00 | $8.76 | $3.24 | Azure egress slightly cheaper |
| Storage (500GB) | $80.00 (SSD) | $19.71 (HDD) | $60.29 | Azure Standard HDD vs GCP Standard PD |
| Managed Istio | N/A (self-host ~$70) | $150 | Variable | Azure offers managed option |
| Total (no Istio) | $266 | $165 | $101/mo (38%) | Azure saves ~$1,212/year |
| Total (with Istio) | $340 (self-hosted) | $315 (managed) | $25/mo (7%) | Azure saves ~$300/year |
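The bottom line of the table can be reproduced from the per-item figures (the article's own estimates, not live pricing). A small helper, useful for re-running the comparison as prices drift:

```python
# Monthly line items from the comparison above (article estimates, USD)
GCP = {"mgmt": 74.40, "nodes": 73.73, "db": 7.67, "lb": 18.26,
       "dns": 0.20, "egress": 12.00, "disk": 80.00}
AZURE = {"mgmt": 0.00, "nodes": 120.36, "db": 12.41, "lb": 3.65,
         "dns": 0.50, "egress": 8.76, "disk": 19.71}


def monthly_savings(before: dict, after: dict):
    """(monthly, annual) savings from moving stack `before` to stack `after`."""
    delta = sum(before.values()) - sum(after.values())
    return round(delta, 2), round(delta * 12, 2)
```

`monthly_savings(GCP, AZURE)` gives $100.87/mo and $1,210.44/year, which the table rounds to $101/mo and ~$1,212/year.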
Hobby Cluster Optimization: The $50/mo Setup
For personal projects and learning, here's how to run AKS for ~$50-75/month:
# Ultra-Low-Cost Azure Configuration
# 1x Standard_B2s node (starts 1, autoscales to 2)
# Azure Database for PostgreSQL (Burstable B1ms, manually stopped when idle)
# NO Istio (use native Kubernetes Ingress)
# Azure DNS + Let's Encrypt
# Cost Breakdown:
# - AKS cluster management: $0
# - 1x Standard_B2s node: $40.12/mo
# - Azure PostgreSQL (B1ms, stopped ~50% of the time): $6.21/mo
# - Load Balancer: $3.65/mo
# - NGINX Ingress Controller: $0 (runs on node)
# - Azure DNS: $0.50/mo
# - Bandwidth (20GB/mo): $1.75/mo
# - Managed disk (100GB): $3.94/mo
# Total: ~$56/mo
# GCP Equivalent (Lowest Cost):
# - GKE Autopilot (smallest): $74.40 management + ~$50 compute = ~$124/mo
# OR
# - GKE Standard (1 node e2-small): $74.40 + $15.33 + $7.67 DB + $18 LB = ~$115/mo
# Azure wins by ~$59-68/mo (51-55% savings) for hobby projects
Configuration for $50/mo Hobby Cluster:
# hobby-cluster.yaml - Minimal viable AKS setup
apiVersion: v1
kind: Namespace
metadata:
name: portfolio
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: portfolio
namespace: portfolio
spec:
replicas: 1 # Single replica for hobby projects
selector:
matchLabels:
app: portfolio
template:
metadata:
labels:
app: portfolio
spec:
containers:
- name: portfolio
image: gabby.azurecr.io/portfolio:latest
ports:
- containerPort: 3000
resources:
requests:
cpu: 100m # Minimal CPU
memory: 256Mi # Minimal memory
limits:
cpu: 500m
memory: 512Mi
livenessProbe:
httpGet:
path: /api/health
port: 3000
initialDelaySeconds: 30
readinessProbe:
httpGet:
path: /api/health
port: 3000
initialDelaySeconds: 5
---
# Use NGINX Ingress instead of Istio (free)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: portfolio-ingress
namespace: portfolio
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
ingressClassName: nginx
tls:
- hosts:
- cat-herding.net
- www.cat-herding.net
secretName: cat-herding-tls
rules:
- host: cat-herding.net
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: portfolio
port:
number: 80
- host: www.cat-herding.net
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: portfolio
port:
number: 80
Setup Commands for $50/mo Cluster:
# Create minimal AKS cluster
az aks create \
--resource-group portfolio-hobby \
--name portfolio-cluster \
--node-count 1 \
--min-count 1 \
--max-count 2 \
--node-vm-size Standard_B2s \
--enable-cluster-autoscaler \
--network-plugin azure \
--network-policy azure \
--tier free \
--no-ssh-key
# Install NGINX Ingress (replaces Istio)
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install ingress-nginx ingress-nginx/ingress-nginx \
--namespace ingress-nginx \
--create-namespace \
--set controller.service.externalTrafficPolicy=Local \
--set controller.resources.requests.cpu=50m \
--set controller.resources.requests.memory=128Mi
# Install cert-manager (same as before)
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
# Total setup time: ~15 minutes
# No Istio complexity, no managed service fees
Key Takeaways for Hobby Projects
Azure Advantages:
- ✅ Free cluster management ($74.40/mo savings vs GCP)
- ✅ Cheaper load balancer ($14.61/mo savings)
- ✅ Cheaper storage ($60/mo savings with HDD)
- ✅ Burstable B-series VMs ideal for bursty traffic (personal sites)
- ✅ Simple pricing - fewer surprise charges
- ✅ Better free tier - Can run hobby cluster for ~$50/mo
GCP Advantages:
- ✅ Cheaper compute (e2-medium < B2s)
- ✅ GKE Autopilot removes node management (but costs more)
- ✅ Better integration with Google Cloud services (Cloud Run, etc.)
- ✅ More mature Istio integration (created by Google)
Decision Framework:
My Migration Experience:
| Metric | Before (GCP) | After (Azure) | Change |
|---|---|---|---|
| Monthly Cost | $266 | $165 | -$101 (-38%) |
| Cluster Management Fee | $74.40 | $0 | -$74.40 (-100%) |
| Annual Savings | - | - | $1,212/year |
| Performance (p95 latency) | 180ms | 165ms | -15ms (comparable) |
| Downtime during migration | 0 minutes | 0 minutes | Blue-green cutover |
| Migration complexity | - | 4 hours | Mostly manifest updates |
| Regrets | None | None | Would migrate again |
Bottom Line for Hobby Projects:
- GCP: Great if you're already using Google Cloud services or want Autopilot simplicity
- Azure: Better for pure Kubernetes workloads, especially if cost-sensitive
- Azure wins by $1,212/year for similar configuration without Istio
- Azure wins by ~$300/year even with managed Istio
For my personal portfolio, the migration to Azure was a no-brainer: I saved $1,212/year while gaining managed Istio as an option. The $74.40/month cluster management fee I no longer pay covers a nice dinner every month.
Comprehensive Service Cost Comparison: Azure vs GCP vs AWS
Beyond the base Kubernetes infrastructure, let's compare the full stack of services you'll need for a production application across all three major cloud providers.
Database Services: PostgreSQL Comparison
| Tier | GCP Cloud SQL | Azure Database | AWS RDS PostgreSQL | Winner | Use Case |
|---|---|---|---|---|---|
| Hobby | db-f1-micro<br/>1 shared vCPU, 614MB<br/>$7.67/mo | B1ms<br/>1 vCPU, 2GB<br/>$12.41/mo | db.t4g.micro<br/>2 vCPU, 1GB<br/>$12.41/mo | GCP 🏆 | Personal, <100 req/hr |
| Small | db-g1-small<br/>1 shared vCPU, 1.7GB<br/>$25.00/mo | B2s<br/>2 vCPU, 4GB<br/>$49.64/mo | db.t4g.small<br/>2 vCPU, 2GB<br/>$24.82/mo | AWS 🏆 | Small apps, <1K req/hr |
| Medium | db-n1-standard-1<br/>1 vCPU, 3.75GB<br/>$49.95/mo | D2s_v3<br/>2 vCPU, 8GB<br/>$116.80/mo | db.t4g.medium<br/>2 vCPU, 4GB<br/>$49.64/mo | GCP 🏆 | Mid-size, 1-10K req/hr |
| High Availability | db-n1-std-2 + HA<br/>2 vCPU, 7.5GB<br/>$198.90/mo | D2s_v3 HA<br/>2 vCPU, 8GB<br/>$233.60/mo | db.m6g.large Multi-AZ<br/>2 vCPU, 8GB<br/>$188.50/mo | AWS 🏆 | Production with failover |
| High Performance | db-n1-standard-4<br/>4 vCPU, 15GB<br/>$199.05/mo | D4s_v3<br/>4 vCPU, 16GB<br/>$233.59/mo | db.m6g.xlarge<br/>4 vCPU, 16GB<br/>$282.24/mo | GCP 🏆 | Heavy, >10K req/hr |
Overall Winner: GCP for hobby/medium, AWS for small/HA scenarios
Note: GCP's shared vCPU tiers are cheapest for hobby projects. AWS t4g (Graviton) instances offer best price/performance for small-medium workloads.
Database Services: MySQL Comparison
| Tier | GCP Cloud SQL | Azure Database | AWS RDS MySQL | Winner | Notes |
|---|---|---|---|---|---|
| Hobby | db-f1-micro<br/>$7.18/mo | B1ms<br/>$13.14/mo | db.t4g.micro<br/>$11.59/mo | GCP 🏆 | MySQL 8.0 |
| Small | db-g1-small<br/>$24.51/mo | B2s<br/>$52.56/mo | db.t4g.small<br/>$23.18/mo | AWS 🏆 | MySQL 8.0 |
| Medium | db-n1-standard-1<br/>$48.23/mo | D2s_v3<br/>$123.29/mo | db.t4g.medium<br/>$46.35/mo | AWS 🏆 | MySQL 8.0 |
| High Performance | db-n1-standard-4<br/>$192.92/mo | D4s_v3<br/>$246.58/mo | db.m6g.xlarge<br/>$263.52/mo | GCP 🏆 | MySQL 8.0 |
Overall Winner: GCP for hobby, AWS for small/medium production workloads
Cache Services: Redis Comparison
| Tier | GCP Memorystore | Azure Cache for Redis | AWS ElastiCache | Winner | Notes |
|---|---|---|---|---|---|
| Basic 1GB | $41.61/mo | Basic C1 (1GB)<br/>$16.06/mo | cache.t4g.micro (0.5GB)<br/>$11.52/mo | AWS 🏆 | No HA, single zone |
| Basic 5GB | $208.05/mo | Basic C3 (6GB)<br/>$64.24/mo | cache.m6g.large (6.38GB)<br/>$75.74/mo | Azure 🏆 | No HA |
| Standard 1GB HA | $78.84/mo | Standard C1 (1GB)<br/>$44.53/mo | cache.t4g.small (1.55GB)<br/>$46.08/mo | Azure 🏆 | High availability |
| Standard 5GB HA | $393.44/mo | Standard C3 (6GB)<br/>$128.48/mo | cache.m6g.large (6.38GB) Multi-AZ<br/>$151.48/mo | Azure 🏆 | HA with replica |
| Premium 26GB HA | $810.72/mo | Premium P1 (6GB)<br/>$164.25/mo | cache.m6g.xlarge (12.93GB) Cluster<br/>$302.96/mo | Azure 🏆 | Clustering, persistence |
| Premium 100GB HA | $3,242.88/mo | Premium P4 (26GB)<br/>$657.00/mo | cache.r6g.xlarge (26.32GB) Cluster<br/>$605.92/mo | AWS 🏆 | Enterprise features |
Overall Winner: Azure for most tiers (40-70% cheaper), AWS for small hobby and large enterprise
Note: Azure Cache for Redis is dramatically cheaper than both GCP and AWS for standard production workloads. This is one of Azure's biggest competitive advantages.
Object Storage Comparison
| Storage Type | GCP Cloud Storage | Azure Blob Storage | AWS S3 | Winner | Use Case |
|---|---|---|---|---|---|
| Hot Storage | Standard<br/>$0.020/GB/mo | Hot tier<br/>$0.0184/GB/mo | Standard<br/>$0.023/GB/mo | Azure 🏆 | Frequently accessed |
| Warm Storage | Nearline (30-day)<br/>$0.010/GB/mo | Cool tier<br/>$0.01/GB/mo | S3 Infrequent Access<br/>$0.0125/GB/mo | Tie (Azure/GCP) | Monthly access |
| Cold Storage | Coldline (90-day)<br/>$0.004/GB/mo | Archive tier<br/>$0.00099/GB/mo | Glacier Flexible<br/>$0.0036/GB/mo | Azure 🏆 | Rarely accessed |
| Deep Archive | Archive<br/>$0.0012/GB/mo | Archive tier<br/>$0.00099/GB/mo | Glacier Deep Archive<br/>$0.00099/GB/mo | Tie (Azure/AWS) | Long-term backup |
| Egress (1TB) | $0.12/GB | $0.087/GB | $0.09/GB | Azure 🏆 | Data transfer out |
| Requests (per 10K) | $0.40 (Class A) | $0.50 (write) | $0.50 (PUT) | GCP 🏆 | Write operations |
Overall Winner: Azure for storage costs and egress, GCP for request pricing
Note: Azure Archive tier is cheapest for cold storage. All three are competitive for hot storage within 20%.
Compute (Virtual Machines) Comparison
| VM Size | GCP (us-central1) | Azure (East US) | AWS (us-east-1) | Winner | Specs |
|---|---|---|---|---|---|
| Micro | e2-micro<br/>$6.11/mo | B1s<br/>$7.59/mo | t4g.nano<br/>$3.07/mo | AWS 🏆 | 0.25-2 vCPU, 0.5-1GB |
| Small | e2-small<br/>$12.23/mo | B1ms<br/>$15.18/mo | t4g.micro<br/>$6.14/mo | AWS 🏆 | 0.5-2 vCPU, 1-2GB |
| Medium Burst | e2-medium<br/>$24.45/mo | B2s<br/>$40.12/mo | t4g.small<br/>$12.29/mo | AWS 🏆 | 2 vCPU, 2-4GB |
| Medium Standard | n1-standard-1<br/>$24.73/mo | D2s_v3<br/>$75.19/mo | t4g.medium<br/>$24.58/mo | GCP 🏆 | 1-2 vCPU, 3.75-4GB |
| Medium Standard | n1-standard-2<br/>$49.45/mo | D2s_v3<br/>$75.19/mo | m6g.large<br/>$60.74/mo | GCP 🏆 | 2 vCPU, 7.5-8GB |
| Large | n1-standard-4<br/>$98.91/mo | D4s_v3<br/>$150.38/mo | m6g.xlarge<br/>$121.47/mo | GCP 🏆 | 4 vCPU, 15-16GB |
| Spot Instance | e2-medium (preemptible)<br/>$7.34/mo (70% off) | B2s (spot)<br/>$12.04/mo (70% off) | t4g.small (spot)<br/>$3.69/mo (70% off) | AWS 🏆 | 2 vCPU, 2-4GB |
Overall Winner: AWS for burstable/spot instances, GCP for standard compute
Note: AWS t4g (Graviton ARM) instances offer best value for burstable workloads. GCP wins for sustained compute. Azure is most expensive across the board.
Kubernetes Node Pricing (Actual Cluster Nodes)
| Node Type | GCP GKE | Azure AKS | AWS EKS | Winner (3-node cost) |
|---|---|---|---|---|
| Hobby | e2-small<br/>0.5 vCPU, 2GB | B1ms<br/>1 vCPU, 2GB | t4g.micro<br/>2 vCPU, 1GB | GCP: $36.69<br/>Azure: $45.54<br/>AWS: $18.42 🏆 |
| Small | e2-medium<br/>2 vCPU, 4GB | B2s<br/>2 vCPU, 4GB | t4g.small<br/>2 vCPU, 2GB | GCP: $73.35<br/>Azure: $120.36<br/>AWS: $36.87 🏆 |
| Production | n1-standard-2<br/>2 vCPU, 7.5GB | D2s_v3<br/>2 vCPU, 8GB | m6g.large<br/>2 vCPU, 8GB | GCP: $148.35 🏆<br/>Azure: $225.57<br/>AWS: $182.22 |
| High Perf | n1-standard-4<br/>4 vCPU, 15GB | D4s_v3<br/>4 vCPU, 16GB | m6g.xlarge<br/>4 vCPU, 16GB | GCP: $296.73 🏆<br/>Azure: $451.14<br/>AWS: $364.41 |
| Spot Nodes | e2-medium<br/>(preemptible) | B2s<br/>(spot) | t4g.small<br/>(spot) | GCP: $22.02<br/>Azure: $36.12<br/>AWS: $11.07 🏆 |
Overall Winner: AWS for hobby/small/spot, GCP for production sustained workloads
Critical Factors:
- GCP: Add $74.40/mo cluster management fee (negates savings for small clusters)
- Azure: $0 cluster management (free)
- AWS: Add $73/mo per-cluster management fee (similar to GCP)
Load Balancer Comparison
| Load Balancer Type | GCP | Azure | AWS | Winner | Notes |
|---|---|---|---|---|---|
| Basic (HTTP/HTTPS) | Cloud LB<br/>$18.26/mo | Standard LB<br/>$3.65/mo | ALB<br/>$16.43/mo | Azure 🏆 | 5 forwarding rules |
| With SSL | + free cert<br/>$18.26/mo | + free cert<br/>$3.65/mo | + free cert (ACM)<br/>$16.43/mo | Azure 🏆 | Managed certificates |
| Network LB | Regional LB<br/>$18.26/mo | Basic LB<br/>$0/mo | NLB<br/>$16.43/mo | Azure 🏆 | Layer 4 |
| Data Processing | $0.008/GB | $0.005/GB | $0.008/GB | Azure 🏆 | Per GB processed |
| LCU Hours | Included | Included | $0.008/hour | Tie (Azure/GCP) | Load Balancer Capacity Units |
Overall Winner: Azure for load balancers (75-100% cheaper than both)
Note: Azure's load balancer pricing is one of its strongest competitive advantages. AWS ALB charges for LCUs, GCP charges per forwarding rule.
Content Delivery Network (CDN) Comparison
| CDN Tier | GCP Cloud CDN | Azure CDN | AWS CloudFront | Winner | Use Case |
|---|---|---|---|---|---|
| Cache Egress (10TB) | $0.08/GB | $0.087/GB | $0.085/GB | GCP 🏆 | Cache hit |
| Origin Egress | $0.12/GB | $0.087/GB | $0.085/GB | AWS 🏆 | Cache miss |
| Requests (per 10K) | $0.0075 | $0.0096 | $0.0075 | Tie (GCP/AWS) | HTTP/HTTPS |
Overall Winner: GCP/AWS tie for cache performance, Azure for origin-heavy workloads
Note: All three are competitive within 10%. Choice depends on your cache hit ratio and existing cloud infrastructure.
Managed Kubernetes Service: Total Cost Comparison
Let's compare identical workloads on all three platforms:
Scenario: 3-node production cluster with database, cache, and storage
| Component | GCP Monthly Cost | Azure Monthly Cost | AWS Monthly Cost | Winner |
|---|---|---|---|---|
| Cluster Management Fee | $74.40 | $0 ✅ | $73.00 | Azure |
| 3x Nodes (2 vCPU, 4GB) | $73.35 ✅ (e2-medium) | $120.36 (B2s) | $36.87 ✅ (t4g.small) | AWS |
| PostgreSQL (small) | $25.00 ✅ (db-g1-small) | $49.64 (B2s) | $24.82 ✅ (t4g.small) | AWS |
| Redis Cache (1GB HA) | $78.84 (Memorystore) | $44.53 ✅ (Standard C1) | $46.08 (t4g.small) | Azure |
| Load Balancer | $18.26 | $3.65 ✅ | $16.43 | Azure |
| Storage (500GB SSD) | $85.00 (SSD PD) | $48.00 ✅ (Premium SSD) | $50.00 (gp3) | Azure |
| Bandwidth (100GB) | $12.00 | $8.76 ✅ | $9.00 | Azure |
| DNS | $0.20 ✅ | $0.50 | $0.50 (Route53) | GCP |
| Container Registry | $0 ✅ (500MB free) | $5.00 (Basic) | $0 ✅ (500MB free) | Tie (GCP/AWS) |
| Monitoring | $8.00 (Cloud Logging) | $0 ✅ (basic included) | $10.00 (CloudWatch) | Azure |
| Total | $375.05 | $280.44 ✅ | $266.70 ✅ | AWS wins by $13.74/mo |
| Annual Total | $4,500.60 | $3,365.28 | $3,200.40 ✅ | AWS saves $1,300.20/year |
Winner: AWS for this configuration (saves $1,300/year vs GCP, $165/year vs Azure)
Cost Winner by Service Category
Real-World Cost Scenario Analysis
Scenario 1: Hobby Developer (Personal Portfolio)
Workload:
- 1 node cluster
- Small database
- No cache
- <1K visitors/day
GCP Cost:
- GKE management: $74.40
- 1x e2-small node: $12.23
- db-f1-micro: $7.67
- Load balancer: $18.26
- DNS: $0.20
Total: $112.76/mo
Azure Cost:
- AKS management: $0 ✅
- 1x B1ms node: $15.18
- B1ms PostgreSQL: $12.41
- Load balancer: $3.65 ✅
- DNS: $0.50
Total: $31.74/mo
AWS Cost:
- EKS management: $73.00
- 1x t4g.micro node: $6.14 ✅
- db.t4g.micro: $12.41
- ALB: $16.43
- Route53: $0.50
Total: $108.48/mo
Winner: Azure saves $81.02/mo vs GCP (72% cheaper)
Winner: Azure saves $76.74/mo vs AWS (71% cheaper)
Scenario 2: Startup (10 Services, Growing Traffic)
Workload:
- 3-5 nodes (autoscaling)
- Medium database with HA
- Redis cache for sessions
- 10-50K visitors/day
GCP Cost:
- GKE management: $74.40
- 3x e2-medium nodes: $73.35
- db-n1-standard-2 HA: $198.90
- Memorystore 5GB HA: $393.44
- Load balancer: $18.26
- Storage 1TB: $170.00
- Bandwidth 500GB: $60.00
Total: $988.35/mo
Azure Cost:
- AKS management: $0 ✅
- 3x B2s nodes: $120.36
- D2s_v3 PostgreSQL HA: $233.60
- Standard C3 Redis HA: $128.48 ✅
- Load balancer: $3.65 ✅
- Storage 1TB: $96.00 ✅
- Bandwidth 500GB: $43.80 ✅
Total: $625.89/mo
AWS Cost:
- EKS management: $73.00
- 3x t4g.small nodes: $36.87 ✅
- db.m6g.large Multi-AZ: $188.50 ✅
- cache.m6g.large Multi-AZ: $151.48
- ALB: $16.43
- Storage 1TB: $100.00
- Bandwidth 500GB: $45.00
Total: $611.28/mo ✅
Winner: AWS saves $377.07/mo vs GCP (38% cheaper)
Winner: AWS saves $14.61/mo vs Azure (2% cheaper)
Scenario 3: Enterprise (50+ Services, High Traffic)
Workload:
- 10-20 nodes (autoscaling)
- Large database cluster
- Multiple Redis instances
- 1M+ visitors/day
- Multi-region
GCP Cost:
- GKE management: $74.40
- 10x n1-standard-4 nodes: $989.10
- db-n1-standard-8 HA: $796.20
- Memorystore 100GB HA: $3,242.88
- Multiple load balancers: $60.00
- Storage 5TB: $850.00
- Bandwidth 5TB: $600.00
- CDN: $400.00
Total: $7,012.58/mo
Azure Cost:
- AKS management: $0 ✅
- 10x D4s_v3 nodes: $1,503.80
- D8s_v3 PostgreSQL HA: $935.16
- Premium P4 Redis HA: $657.00 ✅
- Multiple load balancers: $20.00 ✅
- Storage 5TB: $480.00 ✅
- Bandwidth 5TB: $438.00 ✅
- CDN: $435.00
Total: $4,468.96/mo
AWS Cost:
- EKS management: $73.00
- 10x m6g.xlarge nodes: $1,214.70 ✅
- db.m6g.4xlarge Multi-AZ: $1,129.00
- cache.r6g.xlarge Cluster: $605.92 ✅
- Multiple ALBs: $50.00
- Storage 5TB: $500.00
- Bandwidth 5TB: $450.00
- CloudFront: $425.00
Total: $4,447.62/mo ✅
Winner: AWS saves $2,564.96/mo vs GCP (37% cheaper)
Winner: AWS saves $21.34/mo vs Azure (0.5% cheaper)
Annual savings: AWS saves $30,779/year vs GCP, $256/year vs Azure
Cost Optimization Recommendations
Choose GCP if:
- Heavy sustained compute requirements (>100 vCPUs running 24/7)
- Hobby databases (db-f1-micro $7.67/mo is cheapest shared vCPU tier)
- Already using Google Cloud services (BigQuery, Cloud Run, Pub/Sub)
- Need GKE Autopilot for simplified management ($74.40/mo + per-pod pricing)
- Prefer preemptible instances for batch workloads (60-91% off), plus sustained-use discounts (up to 30%) on regular VMs
Choose Azure if:
- Cache-heavy architecture (Redis is 40-70% cheaper than GCP/AWS)
- Multiple microservices (free cluster management saves $73-74/mo vs GCP/AWS)
- Storage-intensive (blob storage 30-40% cheaper than GCP)
- Need managed Istio service mesh (AKS add-on; budgeted here at ~$150/mo of overhead)
- Load balancer-heavy architecture (Basic LB $3.65/mo is 75-100% cheaper)
- Microsoft ecosystem integration (AD/Entra, Windows containers, .NET)
- Cost optimization is top priority for small-medium clusters
Choose AWS if:
- ARM-compatible workloads (Graviton t4g/m6g instances 50-75% cheaper)
- Startup phase with standard production cluster (lowest overall cost $266.70/mo)
- Small databases with burstable requirements (db.t4g.micro $12.41/mo)
- Heavy spot instance usage (deepest spot market with most availability zones)
- Already using AWS services (Lambda, DynamoDB, S3, SQS)
- Need highest availability zones per region (up to 6 AZs vs 3 for GCP/Azure)
Hybrid Strategy:
- Run Kubernetes on AWS for overall lowest cost ($266.70/mo base)
- Use Azure Cache for Redis (huge savings: $44.53/mo vs $46.08 AWS vs $78.84 GCP)
- Use GCP Cloud SQL for hobby databases (db-f1-micro $7.67/mo)
- Store cold data on Azure Archive (75% cheaper than GCP/AWS)
- Use AWS Graviton instances where workload allows (50-75% compute savings)
Hidden Costs to Consider
| Cost Factor | GCP | Azure | AWS | Impact |
|---|---|---|---|---|
| Egress to Internet | $0.12/GB (first 1TB) | $0.087/GB | $0.09/GB | Azure 27% cheaper, AWS 25% cheaper |
| Egress between zones | $0.01/GB | $0.01/GB | $0.01/GB (same region) | Same |
| Egress between regions | $0.12/GB | $0.02/GB | $0.02/GB | Azure/AWS 83% cheaper ✅ |
| Snapshot storage | $0.026/GB | $0.05/GB | $0.05/GB | GCP 48% cheaper |
| IP addresses | $6.57/mo per static IP | $3.65/mo per public IP | $3.60/mo per Elastic IP | AWS/Azure 44% cheaper |
| NAT Gateway | $0.045/hour + $0.045/GB | $0.045/hour + $0.045/GB | $0.045/hour + $0.045/GB | Same |
| VPN Gateway | $36.50/mo | $26.28/mo | $36.50/mo | Azure 28% cheaper ✅ |
Next Steps: Building Production-Grade Platform
- GitOps with Flux CD: Automated deployments from Git commits
- Multi-Environment Strategy: Separate dev/staging/prod with namespace isolation
- Secret Management: Migrate to Azure Key Vault with CSI driver
- Database Integration: Connect to Azure Database for PostgreSQL with connection pooling
- Advanced Autoscaling: KEDA for event-driven scaling (queues, schedules, metrics)
- Disaster Recovery: Backup strategy with Velero, multi-region failover
- Compliance: Pod Security Standards enforcement, network policies, audit logging
- Developer Experience: Internal developer platform (IDP) with self-service deployment
🎯 Key Takeaways: Decision Framework for Platform Teams
When to Adopt This Architecture
✅ Good Fit:
- Microservices at scale (20+ services): Service mesh provides observability, security, and traffic management
- Multi-team organizations: Teams need isolation, independent deployment, and self-service capabilities
- Compliance requirements: mTLS, audit logging, and fine-grained access control are mandatory
- High availability needs: SLOs require 99.5%+ uptime with multi-region capabilities
- Mature DevOps culture: Team has SRE practices, GitOps, and infrastructure-as-code expertise
❌ Poor Fit:
- Monolith or <10 services: Service mesh overhead outweighs benefits
- Cost-sensitive startups: $400-500/mo base infrastructure may be too high for MVP stage
- Limited Kubernetes expertise: Steep learning curve can delay time-to-market
- Simple request/response patterns: Don't need advanced routing, retries, or circuit breaking
- Minimal traffic (<100 RPS): Serverless (Azure Functions, Container Apps) more cost-effective
Architecture Decision Summary
| Decision | Choice | Trade-off | When to Reconsider |
|---|---|---|---|
| Service Mesh | Managed Istio | +$150-235/mo, operational complexity | When service count >50 or team expertise grows |
| Node Size | Standard_D2s_v3 | Balanced cost/performance | If workloads are memory-heavy (upgrade to D4s_v3) or CPU-light (downgrade to B-series) |
| Certificate Management | cert-manager + Let's Encrypt | Free certificates, requires maintenance | If >100 certificates or need extended validation (EV) certificates, consider paid CA |
| DNS Provider | Azure DNS | $1/mo, Azure-native integration | If need advanced features (geo-routing, DNSSEC), consider Cloudflare or Route53 |
| Monitoring Stack | Prometheus + Grafana | Self-hosted, requires management | If team lacks monitoring expertise, consider Azure Monitor or Datadog |
| Deployment Strategy | Canary with Flagger | Automated progressive delivery | If team is small or risk-averse, use blue-green or manual rollouts |
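To make the canary row in the table concrete, here is a minimal sketch of a Flagger `Canary` resource. The target deployment name, ports, and thresholds are illustrative assumptions, not values from this guide's cluster:

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp               # hypothetical workload
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  service:
    port: 80
    targetPort: 8080
  analysis:
    interval: 1m            # how often Flagger evaluates metrics
    threshold: 5            # failed checks before automatic rollback
    maxWeight: 50           # cap canary traffic at 50%
    stepWeight: 10          # shift traffic in 10% increments
    metrics:
      - name: request-success-rate   # Flagger built-in metric
        thresholdRange:
          min: 99                    # roll back if success rate dips below 99%
        interval: 1m
```

Flagger drives the Istio `VirtualService` weights itself, which is why the table pairs it with the mesh rather than with plain ingress.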
Operational Maturity Roadmap
Phase 1: Foundation (Months 1-2)
- ✅ AKS cluster with managed Istio running
- ✅ Basic monitoring (Prometheus, Grafana)
- ✅ Manual deployments via kubectl
- ✅ Certificate automation with cert-manager
- ✅ Single production environment
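The Phase 1 certificate automation item typically boils down to a single `ClusterIssuer`. A minimal sketch for Let's Encrypt production follows; the email address is a placeholder, and the `istio` ingress class assumes the Istio ingress gateway serves the HTTP-01 challenge:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform-team@example.com          # hypothetical contact for expiry notices
    privateKeySecretRef:
      name: letsencrypt-prod-account-key      # ACME account key stored as a Secret
    solvers:
      - http01:
          ingress:
            class: istio                      # assumes Istio ingress handles challenges
```

Certificates referencing this issuer renew automatically roughly 30 days before expiry, provided nothing blocks cert-manager's webhook (see Incident 1 below).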
Phase 2: Automation (Months 3-4)
- 🔄 GitOps with Flux CD or ArgoCD
- 🔄 CI/CD pipelines (GitHub Actions, Azure DevOps)
- 🔄 Automated canary deployments with Flagger
- 🔄 Infrastructure-as-code (Terraform, Bicep)
- 🔄 Dev/staging environments with parity
Phase 3: Observability (Months 5-6)
- 🔲 SLO-based alerting (not just resource alerts)
- 🔲 Distributed tracing (Jaeger, Tempo)
- 🔲 Log aggregation (Loki, Azure Log Analytics)
- 🔲 Cost monitoring and attribution by team
- 🔲 Error budget tracking and reporting
Phase 4: Resilience (Months 7-9)
- 🔲 Chaos engineering (Litmus, Chaos Mesh)
- 🔲 Multi-region active-passive setup
- 🔲 Disaster recovery runbooks tested quarterly
- 🔲 Automated incident response (PagerDuty + Kubernetes)
- 🔲 Load testing in CI/CD pipeline
Phase 5: Platform Engineering (Months 10-12)
- 🔲 Internal Developer Platform (IDP) with self-service
- 🔲 Golden paths for common deployment patterns
- 🔲 Policy-as-code (OPA, Kyverno)
- 🔲 Developer productivity metrics (DORA metrics)
- 🔲 Clear separation of responsibilities between the platform team and product teams
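Policy-as-code in Phase 5 can start with a single guardrail. This is a minimal sketch of a Kyverno `ClusterPolicy` that enforces the memory-limit lesson from Incident 3 below; the policy name and message are illustrative:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce     # reject non-compliant pods at admission
  rules:
    - name: check-memory-limits
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Containers must declare memory limits."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    memory: "?*"       # any non-empty value satisfies the pattern
```

Start with `validationFailureAction: Audit` in dev to surface violations without blocking deployments, then flip to `Enforce` once teams have remediated.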
Lessons from Production Incidents
Incident 1: Certificate Expiration Outage (2 AM)
- What happened: Let's Encrypt certificate expired, HTTPS traffic failed
- Root cause: cert-manager webhook was blocked by network policy
- Impact: 30 minutes of downtime (roughly 0.07% of the month's total uptime lost)
- Prevention: Alert on certificate expiration <30 days, test renewal quarterly
- Lesson: Monitor certificate lifecycle, not just pod health
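The "alert on expiration <30 days" prevention can be expressed as a `PrometheusRule` against cert-manager's exported metrics. A minimal sketch, assuming the Prometheus Operator from the monitoring stack above and cert-manager's `certmanager_certificate_expiration_timestamp_seconds` metric:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: certificate-expiry
  namespace: monitoring
spec:
  groups:
    - name: cert-manager
      rules:
        - alert: CertificateExpiringSoon
          # fires when any certificate has under 30 days of validity left
          expr: certmanager_certificate_expiration_timestamp_seconds - time() < 30 * 24 * 3600
          for: 1h                      # avoid flapping on scrape gaps
          labels:
            severity: warning
          annotations:
            summary: "Certificate {{ $labels.name }} expires in under 30 days"
```

Because renewal normally happens around 30 days before expiry, this alert only fires when automation has already failed, which is exactly the signal you want.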
Incident 2: Istio Control Plane Upgrade Breaking Change
- What happened: Managed Istio upgraded from 1.17 to 1.18 with breaking CRD changes
- Root cause: Azure auto-upgraded during maintenance window without notification
- Impact: VirtualService configuration failed validation, 503 errors for 15 minutes
- Prevention: Pin Istio minor version, test upgrades in dev first, subscribe to release notes
- Lesson: Managed services reduce operational burden but sacrifice control
Incident 3: Pod Eviction Cascade (Out of Memory)
- What happened: Memory leak in application caused OOMKilled, cascading pod evictions
- Root cause: Missing memory limits allowed one pod to exhaust node memory
- Impact: 5 pods restarted in 2 minutes, 60-second request latency spike
- Prevention: Set memory limits, enable memory profiling, use VPA recommendations
- Lesson: Resource limits are not optional in production
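The fix from this incident is a few lines of manifest. A minimal sketch of a Deployment with the relevant `resources` block; the image name and sizing values are illustrative placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:1.0.0   # hypothetical image
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              memory: 512Mi   # a leak OOMKills this pod instead of starving the node
```

Note the deliberate omission of a CPU limit: CPU is compressible, so throttling is usually preferable to none, while a missing memory limit is what caused the cascade here.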
Incident 4: DNS Propagation Delay on New Subdomain
- What happened: New subdomain added but not resolving for 45 minutes
- Root cause: Azure DNS TTL was 3600 seconds, previous record cached
- Impact: Delayed launch announcement, no customer impact
- Prevention: Lower TTL to 300 seconds before DNS changes, wait for propagation
- Lesson: DNS is eventually consistent; plan changes ahead of announcements
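The TTL-lowering prevention step is one Azure CLI call, made at least one old-TTL interval before the actual record change. A sketch with placeholder resource group, zone, and record names (not the values from this guide's environment):

```bash
# Lower the TTL ahead of the planned change so stale caches age out quickly
az network dns record-set a update \
  --resource-group my-dns-rg \
  --zone-name example.com \
  --name app \
  --set ttl=300
```

Remember to raise the TTL back afterward; permanently low TTLs increase resolver query volume for no benefit once the record is stable.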
Incident 5: Cost Spike from Forgotten Load Test
- What happened: Load test ran overnight, autoscaler increased nodes from 3 to 30
- Root cause: No autoscaler max limit configured, test had no timeout
- Impact: $500 unexpected bill for 8 hours, no customer impact
- Prevention: Cap the cluster autoscaler's max node count and HPA maxReplicas, configure cost alerts, add timeouts to tests
- Lesson: Cost visibility is essential; autoscalers need upper bounds
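Upper bounds live in two places: the HPA caps pod replicas, and the node pool caps node count. A minimal HPA sketch with an explicit ceiling (the workload name and thresholds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3
  maxReplicas: 12      # hard ceiling, even under runaway load-test traffic
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

For the node side, the AKS node pool's cluster autoscaler bounds (`--min-count`/`--max-count` on `az aks nodepool update`) provide the equivalent ceiling; without both limits, one runaway test can scale straight through to your subscription quota.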
Final Recommendations for Platform Teams
Before You Start:
- Validate scale requirements: Do you really need a service mesh? (Teams with <20 services likely don't)
- Assess team expertise: Do you have Kubernetes/Istio experience? Budget for training
- Calculate total cost: Include management time, not just infrastructure ($444/mo + 0.5-1 FTE)
- Define success metrics: What SLOs justify this investment? What's the alternative cost?
During Implementation:
- Start small: Single service in production, validate before migrating all workloads
- Automate early: GitOps from day 1, manual kubectl is error-prone at scale
- Monitor everything: If you can't measure it, you can't improve it (observability first)
- Document runbooks: Incident response is only as good as your documentation
After Go-Live:
- Review costs weekly: First month will reveal unexpected expenses
- Conduct game days: Test failure scenarios before they happen in production
- Measure developer productivity: Platform value = features shipped / time, not uptime alone
- Iterate on processes: Retrospectives after every incident, quarterly architecture reviews
When to Choose Alternatives
| Alternative | Best For | Cost Comparison | Migration Difficulty |
|---|---|---|---|
| Azure Container Apps | <10 services, serverless workload | ~$50-200/mo (consumption-based) | Easy (similar Kubernetes API) |
| Azure App Service | Traditional web apps, minimal DevOps | ~$100-300/mo (PaaS pricing) | Easy (no Kubernetes knowledge needed) |
| Azure Functions | Event-driven, sporadic traffic | ~$0-50/mo (pay-per-execution) | Medium (code refactoring required) |
| Self-Hosted Kubernetes (VM) | Maximum control, compliance needs | ~$300-400/mo (3 VMs) | Hard (full cluster management) |
| GKE/EKS (multi-cloud) | Avoid vendor lock-in, global footprint | ~$500-700/mo (cross-cloud costs) | Medium (Kubernetes-compatible) |
Recommended Migration Path for Startups:
Phase 1: Azure Functions (MVP, <$50/mo)
↓ (Growing traffic, need more control)
Phase 2: Azure Container Apps (Scale, ~$200/mo)
↓ (10+ microservices, need service mesh)
Phase 3: AKS + Istio (This guide, ~$400-500/mo)
↓ (100+ microservices, multi-region)
Phase 4: Multi-cloud Kubernetes (Enterprise, $2000-5000/mo)
🔗 Additional Resources
Official Documentation
- Azure AKS Documentation
- Istio Documentation
- cert-manager Documentation
- Azure DNS Documentation
- Prometheus Operator
Books & Deep Dives
- "Kubernetes Patterns" by Bilgin Ibryam & Roland Huß - Reusable patterns for cloud-native apps
- "Production Kubernetes" by Josh Rosso et al. - Running Kubernetes at scale
- "The Site Reliability Workbook" by Google SRE - SLO/SLI implementation guide
- "Cloud Native DevOps with Kubernetes" by John Arundel & Justin Domingus
Community & Tools
- CNCF Landscape - Comprehensive cloud-native ecosystem map
- Awesome Kubernetes - Curated tools and resources
- Kubernetes Failure Stories - Learn from production incidents
- Learnk8s - Kubernetes training and best practices
Cost Calculators
- Azure Pricing Calculator - Estimate infrastructure costs
- KubeCost - Kubernetes cost monitoring and optimization
📝 Closing Thoughts
Building production-grade Kubernetes infrastructure is a journey, not a destination. This architecture represents months of learning, incidents, and iterations. The decisions outlined here are not universal truths—they're context-dependent trade-offs that worked for our team's scale, budget, and expertise.
The most important question isn't "Should I use Kubernetes?"—it's "What problem am I solving, and is Kubernetes the simplest solution?" For many teams, managed PaaS services (Container Apps, App Service) provide 90% of the benefits at 20% of the complexity.
But if you're building a platform for multiple teams, need fine-grained control over traffic routing, or have compliance requirements that demand mTLS and network policies, this architecture provides a solid foundation.
Key Philosophy:
- Simplicity > Features: Only add complexity when it solves a real problem
- Observability > Availability: You can't improve what you don't measure
- Automation > Documentation: Runbooks should execute themselves
- Learning > Perfection: Every incident is an opportunity to improve
Start small, automate early, and always question whether the complexity you're adding is worth the value it provides.
Questions or feedback? I'd love to hear about your Kubernetes journey—find me on Twitter or LinkedIn.
Found this guide helpful? Consider sharing it with your team or contributing improvements on GitHub.
This guide reflects production experience as of January 2025. Cloud platforms and best practices evolve rapidly—verify configurations against current documentation before implementing.