Production-Ready AKS with Istio: Architecture Decisions, Trade-offs, and Implementation
Building production Kubernetes infrastructure is fundamentally an exercise in architectural trade-offs. While deploying a basic cluster is straightforward, creating a platform that scales from dozens to thousands of services—while maintaining security, observability, and developer velocity—requires careful consideration of competing priorities.
This guide examines the technical and organizational decisions behind building production-grade Azure Kubernetes Service (AKS) infrastructure with Istio service mesh. We'll explore not just how to implement these systems, but why specific architectural choices matter, their cost implications, and when alternatives might be more appropriate.
Target Audience
This guide is written for:
- Staff+ Engineers evaluating service mesh adoption and platform architecture
- Technical Leads making build-vs-buy decisions for infrastructure components
- Platform Engineers implementing multi-tenant Kubernetes platforms
- SREs responsible for production reliability and incident response
- Engineering Managers balancing technical debt against feature velocity
What You'll Learn
Technical Implementation:
- Production-ready AKS cluster architecture with managed Istio
- Zero-downtime certificate management and rotation
- Multi-environment DNS and subdomain routing strategies
- Observability and debugging patterns for service mesh traffic
Architectural Decision-Making:
- Service mesh trade-offs: Istio vs. alternatives (Linkerd, Consul, Cilium)
- When to use managed vs. self-hosted Istio
- Cost analysis: Managed services vs. DIY infrastructure
- Security considerations: mTLS, network policies, zero-trust architecture
- Multi-cluster vs. single-cluster strategies
Operational Maturity:
- Failure modes and mitigation strategies
- Performance implications of sidecar proxies
- Capacity planning and resource optimization
- Incident response patterns for mesh-related outages
🏗️ Architecture Overview and Design Decisions
Before diving into implementation, let's examine the architectural decisions that shape production Kubernetes platforms.
Why Service Mesh? The Trade-offs
Service mesh adoption represents a significant architectural commitment. Understanding when it adds value—and when it doesn't—is crucial.
Service Mesh Benefits:
- Traffic Management: Advanced routing (canary, blue-green), retries, timeouts, circuit breaking
- Security: Automatic mTLS between services, zero-trust networking
- Observability: Distributed tracing, metrics, and traffic visualization without code changes
- Policy Enforcement: Centralized traffic policies, rate limiting, authentication
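As a concrete illustration of the traffic-management capabilities above, a VirtualService can declare retries, timeouts, and a weighted canary split declaratively. This is a sketch only; the `reviews` service and its `v1`/`v2` subsets are hypothetical names, not part of this guide's deployments:

```yaml
# Hypothetical example: 90/10 canary split with retries and a timeout,
# applied without any application code changes.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews-canary
spec:
  hosts:
    - reviews              # in-mesh service name (hypothetical)
  http:
    - timeout: 5s          # fail fast instead of hanging
      retries:
        attempts: 3
        perTryTimeout: 2s
        retryOn: 5xx,connect-failure
      route:
        - destination:
            host: reviews
            subset: v1
          weight: 90
        - destination:
            host: reviews
            subset: v2     # canary version
          weight: 10
```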
Service Mesh Costs:
- Complexity: Additional operational burden, learning curve, debugging complexity
- Performance: Sidecar proxy overhead (CPU, memory, latency ~2-5ms per hop)
- Resource Usage: ~50-100MB memory per sidecar, CPU overhead for encryption
- Blast Radius: Control plane outages can impact all mesh traffic
Decision Framework:
| Use Service Mesh When... | Skip Service Mesh When... |
|---|---|
| Operating 50+ microservices | Running monolith or <10 services |
| Need mTLS without app changes | Can enforce security at app layer |
| Require advanced traffic routing | Simple load balancing suffices |
| Multiple teams need traffic policies | Single team owns all services |
| Observability gaps exist | APM tools provide sufficient insight |
Istio vs. Alternatives: Comparative Analysis
Why We Chose Istio for This Architecture:
- Azure Managed Add-on: Reduces operational burden (control plane managed by Microsoft)
- Feature Completeness: Comprehensive traffic management, security, observability
- Ecosystem Maturity: Extensive tooling, documentation, community support
- Enterprise Adoption: Proven at scale (Google, eBay, T-Mobile)
Cost Comparison (Rough Monthly Estimates for 10-Service Cluster):
| Component | Managed Istio (AKS) | Self-Hosted Istio | Linkerd | No Mesh |
|---|---|---|---|---|
| Control Plane | $0 (included) | $100-200 | $100-200 | $0 |
| Sidecar Overhead | ~$50-75 | ~$50-75 | ~$25-40 | $0 |
| Monitoring | $100-150 | $100-150 | $75-100 | $50-75 |
| Operations Time | 5-10 hrs/mo | 20-40 hrs/mo | 10-20 hrs/mo | 5 hrs/mo |
| Total | ~$150-225 + ops | ~$250-425 + ops | ~$200-340 + ops | ~$50-75 + ops |
Note: Operations time valued at $150-200/hr for staff engineer equivalency
Architecture Decision Records (ADRs)
ADR-001: Managed Istio vs. Self-Hosted
Context: Need service mesh capabilities without dedicated SRE team.
Decision: Use Azure-managed Istio add-on for AKS.
Rationale:
- Microsoft manages control plane (istiod) upgrades and patching
- Reduced operational complexity (no manual version management)
- SLA backed by Azure support
- Team can focus on application logic vs. infrastructure
Consequences:
- Limited control over Istio version (follows Azure release cycle)
- Cannot customize control plane configuration deeply
- Dependent on Azure for mesh reliability
- May lag behind upstream Istio releases by 1-2 versions
Alternatives Considered:
- Self-hosted Istio: More control, more operational burden
- Linkerd: Simpler but fewer features
- No mesh: Lower complexity but manual mTLS, observability
ADR-002: DNS Management in Azure DNS vs. External Provider
Context: Need reliable DNS with low operational overhead.
Decision: Use Azure DNS for domain management.
Rationale:
- Native integration with AKS and Azure resources
- 100% SLA with zone redundancy
- Low latency queries via Azure edge network
- Cost-effective ($0.50/zone/month + $0.40/million queries)
- Infrastructure-as-code via Terraform/Bicep
Consequences:
- Vendor lock-in to Azure ecosystem
- Migration complexity if moving clouds
- All DNS management requires Azure access
Cost Analysis:
- Azure DNS: ~$1-5/month for typical workload
- Route 53 (AWS): ~$1-5/month + cross-cloud complexity
- Cloudflare: Free tier sufficient but adds dependency
Complete System Architecture
Key Architectural Patterns:
- Defense in Depth: DDoS protection → Load balancer → Istio Gateway → Network policies
- Separation of Concerns: Ingress layer isolated from application layer
- Observability by Default: Every service instrumented without code changes
- Cost Optimization: Spot instances for non-critical workloads, right-sized nodes
- High Availability: Multi-replica deployments, pod anti-affinity, zone distribution
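The anti-affinity piece of the high-availability pattern can be sketched as a Deployment spec fragment (illustrative only; it assumes the `app: portfolio` label used by the sample application in this guide):

```yaml
# Deployment spec fragment: prefer spreading replicas across nodes
# so a single node failure cannot take down all replicas at once.
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: portfolio
                topologyKey: kubernetes.io/hostname
```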
Monthly Cost Breakdown (Rough Estimates):
- Compute: $220 (3x D2s_v3) + $50 (spot) = $270
- Control Plane: $0 (free tier) or $73 (99.95% SLA)
- Networking: $18.25 (load balancer) + ~$10 (bandwidth)
- Storage: $30 (database) + $10 (disks)
- Container Registry: $5 (basic) or $167 (premium with geo-replication)
- Monitoring: $110 (Prometheus, Grafana, Jaeger storage)
- DNS: $1
- Total: ~$455/mo (free tier) or ~$528/mo (with SLA)
Scales to ~$800-1200/mo at 10 services, ~$3000-5000/mo at 50 services
📋 Prerequisites and Team Readiness
Technical Prerequisites
- Azure account with Owner or Contributor + User Access Administrator role
- Azure CLI (`az`) version 2.50.0+
- `kubectl` version matching the cluster (typically 1.28+)
- `helm` v3.12+ for package management
- Registered domain name with ability to change name servers
- Azure subscription with sufficient quota:
- 12+ vCPUs (preferably D-series or E-series)
- Standard Load Balancer quota
- Public IP quota
- Understanding of Kubernetes concepts (pods, services, deployments, namespaces)
Organizational Readiness Assessment
Before implementing service mesh, assess your team's readiness:
✅ Good Fit:
- Team has 2+ engineers with Kubernetes production experience
- Running 20+ microservices or planning to reach that scale
- Need centralized traffic policies across services
- Require mTLS without modifying application code
- Budget supports ~$500-1000/mo infrastructure costs
- Dedicated platform or SRE team available (even part-time)
⚠️ Proceed with Caution:
- Team new to Kubernetes (6 months experience recommended minimum)
- Running <10 services (simpler ingress solutions may suffice)
- Tight budget constraints (<$500/mo infrastructure spend)
- No dedicated operations support
- Need to ship features weekly (learning curve may slow velocity initially)
❌ Not Recommended:
- First Kubernetes deployment (start simpler, add mesh later)
- Single developer or small team (<3 engineers total)
- Prototype or proof-of-concept phase
- Services already have robust observability and security
Install Required Tools
# Install Azure CLI (macOS)
brew install azure-cli
# Verify version
az --version # Should be 2.50.0+
# Install kubectl
az aks install-cli
# Verify kubectl
kubectl version --client
# Install Helm
brew install helm
# Verify Helm
helm version # Should be v3.12+
# Install useful optional tools
brew install k9s # Terminal UI for Kubernetes
brew install stern # Multi-pod log tailing
brew install kubectx # Context and namespace switching
brew install istioctl # Istio CLI (optional but helpful)
# Login to Azure
az login
# List subscriptions
az account list --output table
# Set your subscription
export AZURE_SUBSCRIPTION_ID="<your-subscription-id>"
az account set --subscription "$AZURE_SUBSCRIPTION_ID"
# Verify access
az account show
Cost Awareness and Budget Planning
Initial Setup Costs (First Month):
- Domain registration (if new): $10-15/year
- AKS cluster setup: $300-500 (includes learning/experimentation time)
- DNS testing and configuration: Minimal
- Certificate setup: Free (Let's Encrypt)
Ongoing Monthly Costs (Conservative Estimate):
Base Infrastructure:
- AKS Control Plane (free tier): $0
- OR Control Plane SLA (99.95%): $73
- 3x Standard_D2s_v3 nodes: $220
- Azure Load Balancer: $18.25
- Azure DNS: $1
- Container Registry (Basic): $5
- Monitoring storage: $50-100
Total Base: $294-417/mo
Per-Service Costs (Approximate):
- Additional compute: $20-40/service
- Storage (if needed): $10-50/service
- Database (if needed): $30-500/service
- Bandwidth: ~$0.08/GB egress
Expected Scaling:
- 5 services: ~$450-600/mo
- 10 services: ~$800-1200/mo
- 25 services: ~$1500-2500/mo
- 50 services: ~$3000-5000/mo
Cost Optimization Strategies:
- Use Spot Instances: 70-90% discount for non-critical workloads
- Right-size Nodes: Start with D2s_v3, scale up only if needed
- Reserved Instances: 30-40% discount for 1-3 year commitments
- Autoscaling: HPA and cluster autoscaler to match demand
- Monitoring Retention: Reduce Prometheus retention to 7-15 days
- Dev/Staging Clusters: Shut down nights and weekends (terraform automation)
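The HPA mentioned above complements the cluster autoscaler: the HPA scales pods, the cluster autoscaler scales nodes to fit them. A minimal sketch targeting the sample portfolio Deployment deployed later in this guide:

```yaml
# HorizontalPodAutoscaler: scale pods on CPU utilization; the cluster
# autoscaler then adds or removes nodes to match the resulting demand.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: portfolio-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: portfolio
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```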
🚀 Step 1: Create Production-Grade AKS Cluster with Istio
1.1 Cluster Sizing and Configuration Decisions
Node Size Selection Trade-offs:
| VM Size | vCPU | RAM | Price/mo | Use Case | Considerations |
|---|---|---|---|---|---|
| Standard_B2s | 2 | 4GB | $30 | Dev/test only | Not recommended: Burstable, inconsistent performance |
| Standard_D2s_v3 | 2 | 8GB | $70 | Small prod workloads | Recommended starter: Good balance, upgradeable |
| Standard_D4s_v3 | 4 | 16GB | $140 | Medium prod | Better pod density, more headroom |
| Standard_D8s_v3 | 8 | 32GB | $280 | Large prod | Best pod density, overkill for <20 services |
| Standard_E2s_v3 | 2 | 16GB | $109 | Memory-intensive | Higher memory-to-CPU ratio |
Decision Framework:
- Start with D2s_v3 for initial deployment (2-5 services)
- Plan for D4s_v3 when approaching 50% node capacity
- Consider E-series if services average >500MB memory per pod
- Use node pools with different sizes for workload-specific optimization
Node Count Considerations:
- Minimum 3 nodes for production (quorum, rolling updates, node failures)
- Start with 3, enable cluster autoscaler (3-10 nodes typical range)
- Cost vs. Availability: 3 nodes = 2-node failures tolerated, 5 nodes = 3-node failures
- Zone Distribution: Spread across availability zones (requires 3+ nodes)
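Zone distribution at the pod level can be enforced with topology spread constraints (a Deployment spec fragment, assuming the `app: portfolio` label used by the sample app in this guide):

```yaml
# Deployment spec fragment: keep replicas evenly spread across
# availability zones so a zone outage leaves capacity elsewhere.
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: portfolio
```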
1.2 Define Cluster Configuration
# Cluster configuration with rationale
export RESOURCE_GROUP="aks-production-rg"
export LOCATION="eastus" # Choose region closest to users
export CLUSTER_NAME="my-aks-cluster"
# Node configuration
export NODE_COUNT=3 # Minimum for HA
export NODE_SIZE="Standard_D2s_v3" # 2 vCPU, 8GB RAM
export NODE_DISK_SIZE=128 # OS disk size (GB) - default is often 128GB
# Kubernetes version - check available versions
az aks get-versions --location "$LOCATION" --output table
# Use recent stable version (not preview)
export K8S_VERSION="1.28.3" # Update to latest stable
# Enable advanced features
export ENABLE_MONITORING="true"
export ENABLE_AUTOSCALER="true"
export MIN_NODE_COUNT=3
export MAX_NODE_COUNT=10
1.3 Create Resource Group with Tags
# Create resource group with metadata tags for cost tracking
az group create \
--name "$RESOURCE_GROUP" \
--location "$LOCATION" \
--tags \
environment=production \
costCenter=engineering \
managedBy=terraform \
project=platform
# Verify creation
az group show --name "$RESOURCE_GROUP"
1.4 Create AKS Cluster with Istio and Advanced Features
# Create production-grade AKS cluster
# This takes 10-15 minutes
az aks create \
--resource-group "$RESOURCE_GROUP" \
--name "$CLUSTER_NAME" \
--location "$LOCATION" \
--kubernetes-version "$K8S_VERSION" \
--node-count "$NODE_COUNT" \
--node-vm-size "$NODE_SIZE" \
--node-osdisk-size "$NODE_DISK_SIZE" \
--enable-managed-identity \
--enable-azure-service-mesh \
--network-plugin azure \
--network-policy azure \
--enable-cluster-autoscaler \
--min-count "$MIN_NODE_COUNT" \
--max-count "$MAX_NODE_COUNT" \
--enable-addons monitoring \
--generate-ssh-keys \
--zones 1 2 3 \
--tier standard \
--tags environment=production project=platform
# Alternative: Free tier for non-production (no SLA)
# --tier free
# Get cluster credentials
az aks get-credentials \
--resource-group "$RESOURCE_GROUP" \
--name "$CLUSTER_NAME" \
--overwrite-existing
# Verify cluster is running
kubectl get nodes -o wide
# Check node zones (should be distributed)
kubectl get nodes -o custom-columns='NAME:.metadata.name,ZONE:.metadata.labels.topology\.kubernetes\.io/zone'
Expected Output:
NAME STATUS ROLES AGE VERSION ZONE
aks-nodepool1-12345678-vmss000000 Ready agent 5m v1.28.3 eastus-1
aks-nodepool1-12345678-vmss000001 Ready agent 5m v1.28.3 eastus-2
aks-nodepool1-12345678-vmss000002 Ready agent 5m v1.28.3 eastus-3
1.5 Verify Istio Installation and Components
# Check Istio namespaces
kubectl get namespaces | grep istio
# Expected output:
# aks-istio-ingress Active 5m
# aks-istio-system Active 5m
# Check Istio control plane components
kubectl get pods -n aks-istio-system
# Expected output:
# NAME READY STATUS RESTARTS AGE
# istiod-asm-xxxx-xxxxx 1/1 Running 0 5m
# Check Istio ingress gateway
kubectl get pods -n aks-istio-ingress
# Expected output:
# NAME READY STATUS RESTARTS AGE
# aks-istio-ingressgateway-external-xxxxxxxx-xxxxx 1/1 Running 0 5m
# Get Istio Gateway external IP (save this!)
kubectl get svc -n aks-istio-ingress aks-istio-ingressgateway-external
# Expected output:
# NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S)
# aks-istio-ingressgateway-external LoadBalancer 10.0.123.45 52.188.x.x 15021:30001/TCP,80:30002/TCP,443:30003/TCP
# Export for later use
export ISTIO_INGRESS_IP=$(kubectl get svc -n aks-istio-ingress aks-istio-ingressgateway-external -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "Istio Ingress IP: $ISTIO_INGRESS_IP"
# Verify Istio version and configuration
kubectl get mutatingwebhookconfigurations | grep istio
kubectl get validatingwebhookconfigurations | grep istio
1.6 Configure Node Pools for Workload Isolation (Optional but Recommended)
For production, consider separating system workloads from application workloads:
# Add a spot instance node pool for non-critical workloads (70-90% cost savings)
az aks nodepool add \
--resource-group "$RESOURCE_GROUP" \
--cluster-name "$CLUSTER_NAME" \
--name spotpool \
--priority Spot \
--eviction-policy Delete \
--spot-max-price -1 \
--enable-cluster-autoscaler \
--min-count 0 \
--max-count 5 \
--node-count 1 \
--node-vm-size Standard_D2s_v3 \
--node-taints kubernetes.azure.com/scalesetpriority=spot:NoSchedule \
--labels workloadType=batch priority=low
# Verify node pools
az aks nodepool list \
--resource-group "$RESOURCE_GROUP" \
--cluster-name "$CLUSTER_NAME" \
--output table
1.7 Cluster Architecture Deep Dive
Key Reliability Patterns:
- Zone Distribution: Nodes spread across 3 availability zones
  - Tolerates entire zone failure
  - Azure SLA: 99.99% uptime with zones
- Control Plane HA: Managed by Azure
  - Multi-region replication
  - Automatic failover
  - Separate from worker nodes
- Istio Control Plane: Azure-managed istiod
  - Automatic updates and patches
  - SLA backed by Azure support
  - Reduced operational burden
- Cluster Autoscaler: Automatically adjusts node count
  - Scales up when pods are pending
  - Scales down when nodes underutilized (>50% idle for 10 min)
  - Cost optimization while maintaining performance
Failure Scenarios and Recovery:
| Failure Type | Impact | Recovery Time | Mitigation |
|---|---|---|---|
| Single node failure | 33% capacity loss | Immediate (pods rescheduled) | 3+ nodes, pod anti-affinity |
| Zone failure | 33% capacity loss | Immediate | Multi-zone deployment |
| Istio control plane | No new config changes | 5-15 minutes | Data plane continues functioning |
| Ingress gateway pod | Partial traffic loss | <30 seconds | Multiple replicas, health checks |
| Control plane API | No new deployments | Azure SLA: 99.95% | Use managed tier SLA |
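Several of the mitigations above depend on replicas surviving voluntary disruptions (node drains, cluster upgrades) as well as crashes. A PodDisruptionBudget makes that guarantee explicit; a minimal sketch for the sample portfolio app:

```yaml
# PodDisruptionBudget: keep at least one portfolio replica available
# during voluntary disruptions such as node drains and rolling upgrades.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: portfolio-pdb
  namespace: default
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: portfolio
```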
1.8 Enable Cost Analysis and Monitoring
# Enable Azure Cost Management insights
az aks update \
--resource-group "$RESOURCE_GROUP" \
--name "$CLUSTER_NAME" \
--enable-cost-analysis
# Verify cost analysis is enabled (cost data appears in the Azure portal
# after 24-48 hours of collection)
az aks show \
--resource-group "$RESOURCE_GROUP" \
--name "$CLUSTER_NAME" \
--query "metricsProfile.costAnalysis"
# Install metrics-server for HPA (if not already installed)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Verify metrics-server
kubectl get deployment metrics-server -n kube-system
kubectl top nodes
kubectl top pods -A
Post-Creation Validation Checklist:
- All 3 nodes are in `Ready` state
- Nodes distributed across 3 availability zones
- Istio control plane pod running in `aks-istio-system`
- Istio ingress gateway has external IP
- Cluster autoscaler enabled (3-10 node range)
- Monitoring addon enabled (Azure Monitor)
- `kubectl` context set to new cluster
- Ingress IP saved to environment variable
- Cost tracking tags applied
🌐 Step 2: Configure Custom Domain in Azure DNS
Now we'll set up Azure DNS to manage your custom domain.
2.1 Create DNS Zone
export DOMAIN_NAME="cat-herding.net"
# Create DNS zone
az network dns zone create \
--resource-group "$RESOURCE_GROUP" \
--name "$DOMAIN_NAME"
# Get name servers
az network dns zone show \
--resource-group "$RESOURCE_GROUP" \
--name "$DOMAIN_NAME" \
--query nameServers \
--output table
Expected Output:
Result
-------------------
ns1-01.azure-dns.com.
ns2-01.azure-dns.net.
ns3-01.azure-dns.org.
ns4-01.azure-dns.info.
2.2 Update Domain Registrar Name Servers
Go to your domain registrar (GoDaddy, Namecheap, etc.) and update the name servers to the Azure DNS name servers from the previous step.
DNS Propagation Flow:
2.3 Create DNS Records
# Get Istio ingress gateway IP
export INGRESS_IP=$(kubectl get svc -n aks-istio-ingress \
aks-istio-ingressgateway-external \
-o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "Ingress Gateway IP: $INGRESS_IP"
# Create A record for root domain
az network dns record-set a add-record \
--resource-group "$RESOURCE_GROUP" \
--zone-name "$DOMAIN_NAME" \
--record-set-name "@" \
--ipv4-address "$INGRESS_IP"
# Create A record for www subdomain
az network dns record-set a add-record \
--resource-group "$RESOURCE_GROUP" \
--zone-name "$DOMAIN_NAME" \
--record-set-name "www" \
--ipv4-address "$INGRESS_IP"
# Verify DNS records
az network dns record-set a list \
--resource-group "$RESOURCE_GROUP" \
--zone-name "$DOMAIN_NAME" \
--output table
2.4 Test DNS Resolution
# Test DNS resolution (may take a few minutes)
nslookup cat-herding.net
nslookup www.cat-herding.net
# Test from multiple locations
dig cat-herding.net
🔐 Step 3: Install cert-manager for TLS Certificates
cert-manager automates the provisioning and renewal of TLS certificates from Let's Encrypt.
3.1 Install cert-manager with Helm
# Add cert-manager Helm repository
helm repo add jetstack https://charts.jetstack.io
helm repo update
# Create namespace
kubectl create namespace cert-manager
# Install cert-manager
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--version v1.13.0 \
--set installCRDs=true
# Verify installation
kubectl get pods -n cert-manager
cert-manager Architecture:
3.2 Create ClusterIssuer for Let's Encrypt
# Create production ClusterIssuer
cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@cat-herding.net  # Change to your email
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - http01:
          ingress:
            class: istio
EOF
# Verify ClusterIssuer
kubectl get clusterissuer letsencrypt-prod
kubectl describe clusterissuer letsencrypt-prod
3.3 Create Certificate Resource
cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: cat-herding-tls-cert
  # The Istio ingress gateway reads credentialName secrets from its own
  # namespace, so create the certificate (and its secret) there
  namespace: aks-istio-ingress
spec:
  secretName: cat-herding-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - cat-herding.net
    - www.cat-herding.net
EOF
# Check certificate status
kubectl get certificate -n aks-istio-ingress
kubectl describe certificate cat-herding-tls-cert -n aks-istio-ingress
# Check if secret was created
kubectl get secret cat-herding-tls -n aks-istio-ingress
Certificate Lifecycle:
🎯 Step 4: Deploy Your First Application
Let's deploy a sample application with Istio routing and TLS.
4.1 Create Application Directory Structure
mkdir -p k8s/apps/portfolio/base
cd k8s/apps/portfolio/base
4.2 Create Deployment Manifest
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: portfolio
spec:
  replicas: 2
  selector:
    matchLabels:
      app: portfolio
  template:
    metadata:
      labels:
        app: portfolio
        version: v1
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      containers:
        - name: portfolio
          image: gabby.azurecr.io/portfolio:latest
          imagePullPolicy: Always
          ports:
            - containerPort: 3000
              name: http
          env:
            - name: PORT
              value: "3000"
            - name: NODE_ENV
              value: "production"
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
          livenessProbe:
            httpGet:
              path: /api/health
              port: http
            initialDelaySeconds: 30
            periodSeconds: 15
          readinessProbe:
            httpGet:
              path: /api/health
              port: http
            initialDelaySeconds: 5
            periodSeconds: 10
4.3 Create Service Manifest
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: portfolio
spec:
  selector:
    app: portfolio
  ports:
    - name: http
      port: 80
      targetPort: 3000
  type: ClusterIP
4.4 Create Istio Gateway
# istio-gateway.yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: portfolio-gateway
  namespace: default
spec:
  selector:
    istio: aks-istio-ingressgateway-external
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "cat-herding.net"
        - "www.cat-herding.net"
    - port:
        number: 443
        name: https
        protocol: HTTPS
      hosts:
        - "cat-herding.net"
        - "www.cat-herding.net"
      tls:
        mode: SIMPLE
        credentialName: cat-herding-tls
4.5 Create Istio VirtualService
# istio-virtualservice.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: portfolio-virtualservice
spec:
  hosts:
    - "cat-herding.net"
    - "www.cat-herding.net"
  gateways:
    - portfolio-gateway
  http:
    # Health check endpoints
    - match:
        - uri:
            exact: /api/health
        - uri:
            exact: /health
      route:
        - destination:
            host: portfolio
            port:
              number: 80
    # All other traffic
    - match:
        - uri:
            prefix: /
      route:
        - destination:
            host: portfolio
            port:
              number: 80
4.6 Create Kustomization File
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
  - istio-gateway.yaml
  - istio-virtualservice.yaml
namespace: default
4.7 Deploy Application
# Validate manifests
kubectl kustomize k8s/apps/portfolio/base
# Apply manifests
kubectl apply -k k8s/apps/portfolio/base
# Watch deployment
kubectl rollout status deployment/portfolio -n default
# Check resources
kubectl get pods -l app=portfolio -n default
kubectl get svc portfolio -n default
kubectl get gateway portfolio-gateway -n default
kubectl get virtualservice portfolio-virtualservice -n default
Request Flow:
🧪 Step 5: Test and Verify
5.1 Test HTTP/HTTPS Connectivity
# Test HTTP (should work)
curl -v http://cat-herding.net
# Test HTTPS (should work with valid certificate)
curl -v https://cat-herding.net
# Test specific endpoint
curl https://cat-herding.net/api/health
# Check certificate details
echo | openssl s_client -servername cat-herding.net -connect cat-herding.net:443 2>/dev/null | openssl x509 -noout -dates
5.2 Verify Certificate
# Check certificate secret (lives in the ingress gateway namespace)
kubectl get secret cat-herding-tls -n aks-istio-ingress
# Decode and inspect certificate
kubectl get secret cat-herding-tls -n aks-istio-ingress -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -text -noout
# Check expiration
kubectl get secret cat-herding-tls -n aks-istio-ingress -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate
5.3 Test from Inside Cluster
# Create test pod
kubectl run curl-test --image=curlimages/curl:latest --rm -it --restart=Never -- sh
# From inside the pod
curl -v http://portfolio.default.svc.cluster.local
exit
🔄 Step 6: Deploy Additional Applications with Subdomains
Now let's deploy a second application on a subdomain (e.g., api.cat-herding.net).
6.1 Create DNS Record for Subdomain
# Create A record for api subdomain
az network dns record-set a add-record \
--resource-group "$RESOURCE_GROUP" \
--zone-name "$DOMAIN_NAME" \
--record-set-name "api" \
--ipv4-address "$INGRESS_IP"
# Verify
nslookup api.cat-herding.net
6.2 Create Certificate for Subdomain
cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: api-tls-cert
  # As before, the ingress gateway reads this secret from its own namespace
  namespace: aks-istio-ingress
spec:
  secretName: api-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - api.cat-herding.net
EOF
# Check certificate status
kubectl get certificate api-tls-cert -n aks-istio-ingress
6.3 Create API Application Manifests
mkdir -p k8s/apps/api-service/base
cd k8s/apps/api-service/base
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
        version: v1
    spec:
      containers:
        - name: api-service
          image: your-registry.azurecr.io/api-service:latest
          ports:
            - containerPort: 8080
              name: http
          env:
            - name: PORT
              value: "8080"
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: api-service
spec:
  selector:
    app: api-service
  ports:
    - name: http
      port: 80
      targetPort: 8080
  type: ClusterIP
# istio-gateway.yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: api-gateway
  namespace: default
spec:
  selector:
    istio: aks-istio-ingressgateway-external
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "api.cat-herding.net"
    - port:
        number: 443
        name: https
        protocol: HTTPS
      hosts:
        - "api.cat-herding.net"
      tls:
        mode: SIMPLE
        credentialName: api-tls
# istio-virtualservice.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api-virtualservice
spec:
  hosts:
    - "api.cat-herding.net"
  gateways:
    - api-gateway
  http:
    - match:
        - uri:
            prefix: /
      route:
        - destination:
            host: api-service
            port:
              number: 80
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
  - istio-gateway.yaml
  - istio-virtualservice.yaml
namespace: default
6.4 Deploy API Service
# Apply manifests
kubectl apply -k k8s/apps/api-service/base
# Verify deployment
kubectl get pods -l app=api-service
kubectl get svc api-service
kubectl get gateway api-gateway
kubectl get virtualservice api-virtualservice
# Test
curl https://api.cat-herding.net/health
Multi-Application Routing:
🔒 Security Architecture and Zero-Trust Implementation
Defense-in-Depth Strategy
Production Kubernetes security requires multiple layers. A single breach should not compromise the entire system.
Implementing Mutual TLS (mTLS) with Istio
Istio provides automatic mTLS between services without code changes—one of its primary value propositions.
mTLS Benefits:
- Zero-trust networking: No service trusts network by default
- Encryption of all inter-service communication
- Certificate-based authentication (vs. API keys)
- Automatic certificate rotation (default: 24-hour validity)
- Defense against man-in-the-middle attacks
mTLS Configuration:
# Enable strict mTLS for entire namespace
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: default
spec:
  mtls:
    mode: STRICT  # PERMISSIVE allows both mTLS and plaintext (migration mode)
---
# Alternatively, configure mesh-wide in the Istio root namespace
# (aks-istio-system for the AKS managed add-on)
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: aks-istio-system
spec:
  mtls:
    mode: STRICT
# Apply mTLS policy
kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: default
spec:
  mtls:
    mode: STRICT
EOF
# Verify mTLS status
istioctl x describe pod <pod-name> -n default
# Check certificate details
istioctl proxy-config secret <pod-name> -n default -o json | jq -r '.dynamicActiveSecrets[0].secret.tlsCertificate.certificateChain.inlineBytes' | base64 -d | openssl x509 -text -noout
mTLS Performance Impact:
- Latency: ~1-3ms additional per hop
- CPU: ~5-10% overhead for encryption/decryption
- Memory: ~20-30MB per sidecar for certificates
- Trade-off: Security vs. microsecond-level latency requirements
Network Policies for Pod Isolation
# Deny all traffic by default (baseline security)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: default
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# Allow specific service communication
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: portfolio-network-policy
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: portfolio
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: aks-istio-ingress
      ports:
        - protocol: TCP
          port: 3000
  egress:
    # Allow DNS
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
    # Allow API service communication
    - to:
        - podSelector:
            matchLabels:
              app: api-service
      ports:
        - protocol: TCP
          port: 8080
    # Allow database connection (Azure PostgreSQL is outside the cluster,
    # so no pod/namespace selector can match it; restrict by port instead)
    - ports:
        - protocol: TCP
          port: 5432
    # Allow external HTTPS (for external APIs)
    - ports:
        - protocol: TCP
          port: 443
Pod Security Standards (PSS)
# Enforce restricted pod security at namespace level
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
---
# Example: Compliant deployment with security contexts
apiVersion: apps/v1
kind: Deployment
metadata:
  name: secure-portfolio
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: portfolio
  template:
    metadata:
      labels:
        app: portfolio
    spec:
      # Pod-level security context
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      # Service account with minimal permissions
      serviceAccountName: portfolio-sa
      automountServiceAccountToken: false
      containers:
        - name: portfolio
          image: gabby.azurecr.io/portfolio:v1.2.3 # Use specific tags, not :latest
          # Container-level security context
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true # Immutable container filesystem
            runAsNonRoot: true
            runAsUser: 1000
            capabilities:
              drop:
                - ALL # Drop all Linux capabilities
          # Resource limits prevent resource exhaustion attacks
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
          # Writable volumes for tmp and cache
          volumeMounts:
            - name: tmp
              mountPath: /tmp
            - name: cache
              mountPath: /app/.next/cache
          ports:
            - containerPort: 3000
              name: http
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /api/health
              port: http
            initialDelaySeconds: 30
            periodSeconds: 15
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /api/health
              port: http
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
      volumes:
        - name: tmp
          emptyDir: {}
        - name: cache
          emptyDir: {}
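The hardening rules in this manifest are easy to regress in later edits. A minimal sketch of a pre-commit lint (the `lint_container` helper is hypothetical; field names follow the Kubernetes container spec) that flags deviations from the posture above:

```python
def lint_container(spec: dict) -> list:
    """Return a list of violations of the restricted security posture."""
    problems = []
    sc = spec.get("securityContext", {})
    if sc.get("allowPrivilegeEscalation", True):
        problems.append("allowPrivilegeEscalation must be false")
    if not sc.get("readOnlyRootFilesystem", False):
        problems.append("readOnlyRootFilesystem should be true")
    if not sc.get("runAsNonRoot", False):
        problems.append("runAsNonRoot must be true")
    if "ALL" not in sc.get("capabilities", {}).get("drop", []):
        problems.append("capabilities.drop must include ALL")
    image = spec.get("image", "")
    if image.endswith(":latest") or ":" not in image:
        problems.append("image must use a pinned tag, not :latest")
    if "limits" not in spec.get("resources", {}):
        problems.append("resource limits are required")
    return problems

good = {
    "image": "gabby.azurecr.io/portfolio:v1.2.3",
    "securityContext": {
        "allowPrivilegeEscalation": False,
        "readOnlyRootFilesystem": True,
        "runAsNonRoot": True,
        "capabilities": {"drop": ["ALL"]},
    },
    "resources": {"limits": {"cpu": "1000m", "memory": "1Gi"}},
}
print(lint_container(good))                     # no violations
print(lint_container({"image": "app:latest"}))  # several violations
```

In practice you'd wire this into CI against parsed manifests; tools like Kyverno or Gatekeeper enforce the same rules at admission time.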
Azure Key Vault Integration for Secrets
Never store secrets in Git, ConfigMaps, or even Kubernetes Secrets (base64 is not encryption).
# Install Azure Key Vault provider for Secrets Store CSI
helm repo add csi-secrets-store-provider-azure https://azure.github.io/secrets-store-csi-driver-provider-azure/charts
helm install csi-secrets-store-provider-azure/csi-secrets-store-provider-azure --generate-name --namespace kube-system
# Create Key Vault
export KEY_VAULT_NAME="aks-prod-kv-$(uuidgen | cut -c1-8)"
az keyvault create \
--name "$KEY_VAULT_NAME" \
--resource-group "$RESOURCE_GROUP" \
--location "$LOCATION" \
--enable-rbac-authorization
# Store secrets
az keyvault secret set \
--vault-name "$KEY_VAULT_NAME" \
--name "database-connection-string" \
--value "postgresql://user:password@host:5432/db"
# Configure workload identity for pod to access Key Vault
# (Detailed setup omitted for brevity - see Azure docs)
Security Monitoring and Audit
# Enable Azure Defender for Kubernetes
az security pricing create \
--name KubernetesService \
--tier Standard
# View security recommendations
az security assessment list \
--resource-group "$RESOURCE_GROUP"
# Enable Azure RBAC for Kubernetes authorization
# (full audit logs require diagnostic settings in Azure Monitor)
az aks update \
--resource-group "$RESOURCE_GROUP" \
--name "$CLUSTER_NAME" \
--enable-azure-rbac
# Query audit logs (requires Azure Monitor integration)
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | grep -i "security\|unauthorized\|forbidden"
Incident Response Playbook
Common Security Incidents:
1. Compromised Pod

# Immediately isolate pod with network policy
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: isolate-compromised-pod
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: compromised-app
  policyTypes:
    - Ingress
    - Egress
EOF
# Capture pod logs before deletion
kubectl logs compromised-pod -n default --all-containers > incident-logs.txt
# Exec into pod for forensics (if safe)
kubectl exec -it compromised-pod -n default -- /bin/sh
# Delete and redeploy from known-good image
kubectl delete pod compromised-pod -n default
kubectl rollout restart deployment/app-name -n default
# Review image for vulnerabilities
trivy image gabby.azurecr.io/app:tag

2. Certificate Expiration (Let's Encrypt)

# Check certificate expiration
kubectl get certificate -A
kubectl describe certificate cat-herding-tls-cert -n default
# Force renewal if <30 days remain
kubectl delete secret cat-herding-tls -n default
kubectl delete certificate cat-herding-tls-cert -n default
kubectl apply -f certificate.yaml
# Monitor renewal
kubectl get certificaterequest -n default -w

3. Istio Control Plane Failure

# Data plane continues functioning, but no config changes are possible
# Check istiod status
kubectl get pods -n aks-istio-system
kubectl logs -n aks-istio-system -l app=istiod
# Contact Azure support (managed Istio)
# Meanwhile, avoid deploying new services or changing routing
Security Checklist for Production:
- mTLS enabled in STRICT mode
- Network policies deny all by default
- Pod Security Standards enforced (restricted)
- All containers run as non-root
- Read-only root filesystem where possible
- Resource limits on all pods
- Image scanning in CI/CD pipeline
- Secrets stored in Azure Key Vault
- RBAC roles follow least privilege
- Audit logging enabled
- Azure Defender for Kubernetes enabled
- Regular security assessments scheduled
- Incident response playbook documented
🔍 Troubleshooting: Real-World Incident Scenarios
Debugging Methodology for Service Mesh
Service mesh adds complexity—when things break, methodical debugging is essential.
Issue 1: 503 Service Unavailable (Most Common)
Symptoms:
- `curl https://cat-herding.net` returns 503
- Browser shows "Service Temporarily Unavailable"
- Istio Gateway logs show "no healthy upstream"
Root Causes and Solutions:
# Step 1: Verify pods are running
kubectl get pods -l app=portfolio -n default
# If pods are not Running:
kubectl describe pod <pod-name> -n default
kubectl logs <pod-name> -n default --previous # Check previous container logs
# Common causes:
# - ImagePullBackOff: Wrong image name or missing credentials
# - CrashLoopBackOff: Application error, check logs
# - Pending: Insufficient resources, check node capacity
# Step 2: Check if service has endpoints
kubectl get endpoints portfolio -n default
# If no endpoints:
# - Selector mismatch: kubectl get svc portfolio -n default -o yaml | grep selector
# - Pod labels: kubectl get pods -l app=portfolio --show-labels
# Step 3: Check health probes
kubectl describe pod <pod-name> -n default | grep -A5 "Liveness\|Readiness"
# If probes failing:
# - Wrong path: Check /api/health is correct
# - Slow startup: Increase initialDelaySeconds
# - Application not ready: Fix app code
# Step 4: Verify VirtualService routing
kubectl get virtualservice portfolio-virtualservice -n default -o yaml
# Check host matching
kubectl describe virtualservice portfolio-virtualservice -n default
# Step 5: Check Istio Gateway logs
kubectl logs -n aks-istio-ingress -l istio=aks-istio-ingressgateway-external --tail=100
# Look for:
# - "no healthy upstream"
# - "upstream connect error"
# - certificate errors
Real Example with Solution:
# Problem: VirtualService host doesn't match Gateway
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: portfolio-gateway
spec:
  servers:
    - hosts:
        - "cat-herding.net" # ❌ Missing www subdomain
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: portfolio-vs
spec:
  hosts:
    - "www.cat-herding.net" # ❌ Requests to www.cat-herding.net fail!
  gateways:
    - portfolio-gateway

# Solution: Match hosts exactly
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: portfolio-gateway
spec:
  servers:
    - hosts:
        - "cat-herding.net"
        - "www.cat-herding.net" # ✅ Include all expected hosts
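The mismatch above is easy to reason about once you see how hosts are compared. A simplified sketch of Istio-style host matching (exact hosts plus a leading `*.` wildcard label; real Istio matching has more cases):

```python
def host_matches(pattern: str, host: str) -> bool:
    """Simplified Istio-style host match: exact, or a leading '*.' wildcard."""
    if pattern.startswith("*."):
        return host.endswith(pattern[1:])  # "*.x.net" matches "www.x.net"
    return pattern == host

gateway_hosts = ["cat-herding.net", "www.cat-herding.net"]
print(any(host_matches(p, "www.cat-herding.net") for p in gateway_hosts))  # True
print(any(host_matches(p, "api.cat-herding.net") for p in gateway_hosts))  # False
```

This is why a `VirtualService` host has no effect unless the attached `Gateway` also lists (or wildcards) that host.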
Issue 2: Certificate Not Ready / TLS Errors
Symptoms:
- Certificate stuck in "Pending" or "False" state
- Browser shows "Your connection is not private"
- `kubectl describe certificate` shows challenge failures
Debugging Steps:
# Step 1: Check certificate status
kubectl get certificate -A
kubectl describe certificate cat-herding-tls-cert -n default
# Look for status conditions:
# - Ready: False
# - Reason: Pending / Failed
# - Message: (describes what's wrong)
# Step 2: Check certificate request
kubectl get certificaterequest -n default
kubectl describe certificaterequest <request-name> -n default
# Step 3: Check ACME challenge
kubectl get challenges -A
kubectl describe challenge <challenge-name> -n default
# Common HTTP-01 challenge failures:
# - DNS not propagating (wait 5-60 minutes)
# - Firewall blocking /.well-known/acme-challenge/
# - VirtualService not routing challenge requests
# Step 4: Check cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager --tail=100
# Look for:
# - "failed to determine the list of Challenge resources"
# - "Error accepting challenge"
# - "certificate request has failed"
# Step 5: Verify Let's Encrypt rate limits
# Let's Encrypt limits:
# - 50 certificates per registered domain per week
# - 5 duplicate certificates per week
# Check: https://crt.sh/?q=cat-herding.net
# Solution 1: Delete and recreate certificate (if misconfigured)
kubectl delete certificate cat-herding-tls-cert -n default
kubectl delete secret cat-herding-tls -n default
kubectl apply -f certificate.yaml
# Solution 2: Use staging issuer for testing (avoids rate limits)
cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: admin@cat-herding.net
    privateKeySecretRef:
      name: letsencrypt-staging
    solvers:
      - http01:
          ingress:
            class: istio
EOF
# Update certificate to use staging issuer
kubectl edit certificate cat-herding-tls-cert -n default
# Change issuerRef.name to letsencrypt-staging
# Solution 3: Manual certificate if all else fails
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
-keyout tls.key -out tls.crt \
-subj "/CN=cat-herding.net"
kubectl create secret tls cat-herding-tls \
--cert=tls.crt --key=tls.key -n default
Issue 3: High Latency / Slow Responses
Symptoms:
- Response times > 500ms
- Intermittent timeouts
- User complaints about performance
Investigation:
# Step 1: Isolate the layer causing latency
# Test internal service (bypasses Istio Gateway)
kubectl run curl-test --image=curlimages/curl:latest --rm -it --restart=Never -- \
  curl -s -o /dev/null -w "\ntime_total: %{time_total}s\n" http://portfolio.default.svc.cluster.local
# If fast internally but slow externally:
# - Problem is Gateway, DNS, or network path
# If slow internally:
# - Problem is application or sidecar
# Step 2: Check resource utilization
kubectl top pods -n default
kubectl top nodes
# If pods near limits:
# - Increase resource requests/limits
# - Scale horizontally (more replicas)
# Step 3: Check Envoy sidecar overhead
kubectl exec -it <pod-name> -c istio-proxy -n default -- pilot-agent request GET stats | grep -E "upstream_rq_time|response_time"
# Typical sidecar overhead: 1-5ms
# If >10ms: Investigate Istio configuration
# Step 4: Enable distributed tracing
# Jaeger should show request flow and timing breakdown
kubectl port-forward -n istio-system svc/jaeger-query 16686:16686
# Open http://localhost:16686 and search for slow traces
# Step 5: Check for CPU throttling
kubectl describe pod <pod-name> -n default | grep -A5 "Requests\|Limits"
# If throttling:
# - Increase CPU limits
# - Review application efficiency
# Solution 1: Horizontal Pod Autoscaler
kubectl autoscale deployment portfolio --cpu-percent=70 --min=2 --max=10
# Solution 2: Increase resources
kubectl set resources deployment portfolio -c portfolio \
--requests=cpu=500m,memory=1Gi \
--limits=cpu=2000m,memory=2Gi
# Solution 3: Optimize application code
# - Add caching layer (Redis)
# - Optimize database queries
# - Use CDN for static assets
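The `kubectl autoscale ... --cpu-percent=70 --min=2 --max=10` command above creates an HPA that scales by the documented formula `desiredReplicas = ceil(currentReplicas × currentMetric / target)`, clamped to the min/max bounds. A quick sketch of that arithmetic (values are illustrative):

```python
import math

def hpa_desired_replicas(current: int, current_cpu_pct: float,
                         target_cpu_pct: float = 70,
                         min_replicas: int = 2, max_replicas: int = 10) -> int:
    """Horizontal Pod Autoscaler scaling rule, clamped to min/max."""
    desired = math.ceil(current * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, desired))

print(hpa_desired_replicas(2, 140))  # load at double the target -> 4 replicas
print(hpa_desired_replicas(4, 35))   # half the target -> scale down to 2
print(hpa_desired_replicas(8, 300))  # clamped at max_replicas -> 10
```

Knowing this formula makes HPA behavior during a latency incident predictable: sustained 140% utilization on 2 pods converges to 4, not to some opaque value.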
Issue 4: DNS Not Resolving
Symptoms:
- `nslookup cat-herding.net` times out or returns the wrong IP
- Works from some locations but not others
- Intermittent failures
Resolution:
# Step 1: Verify DNS records in Azure
az network dns record-set a list \
--resource-group "$RESOURCE_GROUP" \
--zone-name "cat-herding.net" \
--output table
# Expected: A record pointing to $ISTIO_INGRESS_IP
# Step 2: Check name server delegation
nslookup -type=NS cat-herding.net
# Should return Azure DNS name servers:
# ns1-01.azure-dns.com.
# ns2-01.azure-dns.net.
# ns3-01.azure-dns.org.
# ns4-01.azure-dns.info.
# If not: Update name servers at domain registrar
# Step 3: Test DNS from multiple servers
dig @8.8.8.8 cat-herding.net # Google DNS
dig @1.1.1.1 cat-herding.net # Cloudflare DNS
dig @ns1-01.azure-dns.com cat-herding.net # Azure DNS directly
# If inconsistent: DNS propagation in progress (wait 5-60 min)
# Step 4: Check TTL values
dig cat-herding.net | grep "^cat-herding.net"
# TTL should be reasonable (300-3600 seconds)
# Lower TTL = faster updates, more queries
# Higher TTL = slower updates, fewer queries, better performance
# Step 5: Flush local DNS cache
# macOS:
sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder
# Linux:
sudo resolvectl flush-caches  # older distros: systemd-resolve --flush-caches
# Windows:
ipconfig /flushdns
# Solution: If propagation is taking too long
# Temporarily add entry to /etc/hosts for testing:
echo "$ISTIO_INGRESS_IP cat-herding.net www.cat-herding.net" | sudo tee -a /etc/hosts
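To make the TTL trade-off in Step 4 concrete, here is a rough model. It assumes each caching resolver re-queries the authoritative zone once per TTL expiry, which is a simplification (real resolver behavior varies):

```python
def authoritative_queries_per_day(unique_resolvers: int, ttl_seconds: int) -> int:
    """Rough upper bound on daily queries hitting the authoritative DNS zone."""
    seconds_per_day = 86_400
    return unique_resolvers * (seconds_per_day // ttl_seconds)

# 10,000 distinct caching resolvers worldwide:
print(authoritative_queries_per_day(10_000, 300))   # TTL 300s  -> 2,880,000/day
print(authoritative_queries_per_day(10_000, 3600))  # TTL 3600s ->   240,000/day
```

A 12x longer TTL cuts authoritative query volume 12x, at the cost of changes taking up to an hour to propagate.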
Issue 5: Istio Control Plane Failure
Symptoms:
- Cannot deploy new services
- VirtualService changes not applied
- Gateway configuration stuck
Impact:
- Data plane continues working: Existing traffic unaffected
- Control plane down: Cannot change routing, deploy new services, or update config
Response:
# Step 1: Verify istiod health
kubectl get pods -n aks-istio-system
kubectl describe pod <istiod-pod> -n aks-istio-system
kubectl logs -n aks-istio-system -l app=istiod --tail=100
# Step 2: Check Istio webhook configuration
kubectl get mutatingwebhookconfigurations | grep istio
kubectl get validatingwebhookconfigurations | grep istio
# If webhooks failing:
# - Pods may not get sidecar injection
# - Configuration validation may fail
# Step 3: For Azure-managed Istio, contact Azure support
# Service Request Type: AKS > Istio Add-on Issue
# Include:
# - Cluster name and resource group
# - Timestamp of issue start
# - Output of kubectl get pods -n aks-istio-system
# Step 4: Temporary workarounds
# - Avoid deploying new services until resolved
# - Existing services continue functioning
# - Do not delete/recreate Gateway or VirtualService resources
# Step 5: If self-hosted Istio, restart control plane
kubectl rollout restart deployment/istiod -n istio-system
Emergency Commands Cheat Sheet
# Quick health check
kubectl get pods -A | grep -v Running
kubectl get nodes | grep -v Ready
kubectl top nodes
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20
# Restart everything (last resort)
kubectl rollout restart deployment/<app-name> -n default
kubectl rollout restart deployment/istiod -n aks-istio-system
# Force certificate renewal
kubectl delete secret <cert-secret-name> -n default
kubectl delete certificate <cert-name> -n default
kubectl apply -f certificate.yaml
# Check all Istio configuration
istioctl analyze -A
# Dump Istio proxy configuration
kubectl exec -it <pod-name> -c istio-proxy -n default -- curl localhost:15000/config_dump > config.json
# Port-forward for local debugging
kubectl port-forward svc/portfolio 8080:80 -n default
curl http://localhost:8080
# Emergency: Scale down problematic service
kubectl scale deployment <app-name> --replicas=0 -n default
# View all resources in namespace
kubectl get all -n default
# Check certificate from external
echo | openssl s_client -servername cat-herding.net -connect cat-herding.net:443 2>/dev/null | openssl x509 -noout -dates
Performance Degradation Patterns
Pattern 1: Gradual Performance Degradation
- Cause: Memory leak or file descriptor exhaustion
- Solution: Enable HPA, set appropriate resource limits, restart pods regularly
Pattern 2: Spike During Deployments
- Cause: Rolling update with insufficient replicas
- Solution: Increase minReplicas, use pod anti-affinity, configure PodDisruptionBudget
Pattern 3: Periodic Slowdowns
- Cause: Batch jobs, backups, or cron consuming resources
- Solution: Use separate node pool for batch workloads, set resource limits
Pattern 4: Certificate Expiration Every 90 Days
- Cause: Let's Encrypt certificates not auto-renewing
- Solution: Verify cert-manager running, check ClusterIssuer, monitor certificate expiration
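Knowing when renewal should fire helps you alert before expiry rather than at it. cert-manager's default `renewBefore` is one third of the certificate duration (30 days for a 90-day Let's Encrypt cert), so renewal is first attempted at day 60 of the lifetime. A small sketch of that date arithmetic:

```python
from datetime import date, timedelta

def cert_renewal_date(not_before: date, duration_days: int = 90,
                      renew_before_days: int = 30) -> date:
    """Date on which cert-manager first attempts renewal."""
    return not_before + timedelta(days=duration_days - renew_before_days)

print(cert_renewal_date(date(2024, 1, 1)))  # 2024-03-01, day 60 of a 90-day cert
```

If `kubectl get certificate` shows a cert well past this point and still not renewed, something in the issuer or challenge path is broken.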
Monitoring Checklist:
# Set up alerts for:
# - Pod crash loop (>3 restarts in 5 minutes)
# - High error rate (>1% 5xx responses)
# - Certificate expiration (<30 days)
# - Node CPU >80% sustained
# - Node memory >85% sustained
# - Disk space <10% free
# Example: Prometheus alert rules
# (Configure in Prometheus or Azure Monitor)
groups:
  - name: kubernetes_alerts
    rules:
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
📊 Monitoring and Observability
Enable Istio Telemetry
# Install Prometheus
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/prometheus.yaml
# Install Grafana
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/grafana.yaml
# Install Kiali (Istio dashboard)
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/kiali.yaml
# Access Kiali dashboard
kubectl port-forward -n istio-system svc/kiali 20001:20001
# Open http://localhost:20001
View Istio Traffic Flow
Open the Kiali Graph view to see live request flow between services, including mTLS status, request rates, and error percentages per edge.
🎓 Best Practices
1. Resource Organization
k8s/
├── kustomization.yaml             # Root kustomization
└── apps/
    ├── portfolio/
    │   └── base/
    │       ├── kustomization.yaml
    │       ├── deployment.yaml
    │       ├── service.yaml
    │       ├── istio-gateway.yaml
    │       └── istio-virtualservice.yaml
    └── api-service/
        └── base/
            ├── kustomization.yaml
            ├── deployment.yaml
            ├── service.yaml
            ├── istio-gateway.yaml
            └── istio-virtualservice.yaml
2. Security Considerations
- Use RBAC: Implement proper role-based access control
- Network Policies: Restrict pod-to-pod communication
- Image Scanning: Scan container images for vulnerabilities
- Secrets Management: Use Azure Key Vault for sensitive data
- TLS Everywhere: Always use HTTPS, even for internal services
3. High Availability
# Example: HA deployment configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: portfolio
spec:
  replicas: 3 # Multiple replicas
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0 # Zero-downtime deployments
  template:
    spec:
      affinity:
        podAntiAffinity: # Spread across nodes
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - portfolio
                topologyKey: kubernetes.io/hostname
4. Resource Limits
Always define resource requests and limits:
resources:
  requests:
    cpu: 250m     # Guaranteed CPU
    memory: 512Mi # Guaranteed memory
  limits:
    cpu: 1000m    # Maximum CPU
    memory: 1Gi   # Maximum memory
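Requests also determine pod density per node, which drives node-pool sizing. A sketch of that bin-packing arithmetic (the allocatable figures are assumptions for a 2 vCPU / 8 GB node after system reservations; check `kubectl describe node` for real values):

```python
def pods_per_node(cpu_request_m: int = 250, mem_request_mi: int = 512,
                  allocatable_cpu_m: int = 1600,
                  allocatable_mem_mi: int = 6554) -> int:
    """Pods that fit one node, limited by whichever resource runs out first."""
    by_cpu = allocatable_cpu_m // cpu_request_m
    by_mem = allocatable_mem_mi // mem_request_mi
    return min(by_cpu, by_mem)

print(pods_per_node())                    # 6 -- CPU-bound: 1600m / 250m per pod
print(pods_per_node(cpu_request_m=500))   # 3 -- doubling requests halves density
```

Note that scheduling uses requests, not limits, so over-generous requests waste nodes even if actual usage is low.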
🚀 Deployment Automation
Create Deployment Script
#!/bin/bash
# deploy-app.sh - Automated application deployment
set -e
APP_NAME=$1
SUBDOMAIN=$2
IMAGE=$3
PORT=$4
if [ -z "$APP_NAME" ] || [ -z "$SUBDOMAIN" ] || [ -z "$IMAGE" ] || [ -z "$PORT" ]; then
  echo "Usage: ./deploy-app.sh <app-name> <subdomain> <image> <port>"
  echo "Example: ./deploy-app.sh my-app api myregistry.azurecr.io/my-app:v1.0 8080"
  exit 1
fi
RESOURCE_GROUP="aks-production-rg"
DOMAIN_NAME="cat-herding.net"
echo "🚀 Deploying $APP_NAME on $SUBDOMAIN.$DOMAIN_NAME"
# 1. Get Ingress IP
echo "📡 Getting Ingress Gateway IP..."
INGRESS_IP=$(kubectl get svc -n aks-istio-ingress aks-istio-ingressgateway-external -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo " IP: $INGRESS_IP"
# 2. Create DNS Record
echo "🌐 Creating DNS record..."
az network dns record-set a add-record \
--resource-group "$RESOURCE_GROUP" \
--zone-name "$DOMAIN_NAME" \
--record-set-name "$SUBDOMAIN" \
--ipv4-address "$INGRESS_IP" 2>/dev/null || echo " DNS record already exists"
# 3. Create Certificate
echo "🔐 Creating TLS certificate..."
cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: ${SUBDOMAIN}-tls-cert
  namespace: default
spec:
  secretName: ${SUBDOMAIN}-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - ${SUBDOMAIN}.${DOMAIN_NAME}
EOF
# 4. Wait for certificate
echo "⏳ Waiting for certificate to be ready (this may take a few minutes)..."
kubectl wait --for=condition=Ready certificate/${SUBDOMAIN}-tls-cert -n default --timeout=300s
# 5. Deploy application
echo "📦 Deploying application manifests..."
mkdir -p k8s/apps/${APP_NAME}/base
cd k8s/apps/${APP_NAME}/base
# Generate manifests (templates omitted for brevity - see full examples above)
# ... deployment.yaml, service.yaml, gateway.yaml, virtualservice.yaml ...
kubectl apply -k .
# 6. Wait for rollout
echo "⏳ Waiting for deployment to be ready..."
kubectl rollout status deployment/${APP_NAME} -n default
# 7. Test
echo "✅ Testing deployment..."
sleep 10
curl -f https://${SUBDOMAIN}.${DOMAIN_NAME}/health || echo "⚠️ Health check failed (may need time)"
echo "🎉 Deployment complete! Visit https://${SUBDOMAIN}.${DOMAIN_NAME}"
📝 Summary
You now have a complete production-ready Azure AKS cluster with:
✅ Managed Kubernetes cluster with auto-scaling
✅ Istio service mesh for advanced traffic management
✅ Custom domain DNS managed in Azure
✅ Automated TLS certificates via cert-manager and Let's Encrypt
✅ HTTP/HTTPS routing with subdomain support
✅ Monitoring and observability with Prometheus, Grafana, and Kiali
✅ Reusable deployment templates for new applications
📊 Production Observability and SRE Practices
Defining SLIs, SLOs, and Error Budgets
Service Level Objectives drive operational decisions and balance feature velocity with reliability.
Example SLO Definition:
# portfolio-slos.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: portfolio-slos
  namespace: default
data:
  slo-definition: |
    # Portfolio Service SLOs (30-day rolling window)
    Availability SLO:
      Target: 99.5% uptime
      Measurement: (successful requests / total requests) >= 0.995
      Error Budget: 0.5% (216 minutes/month or ~7.2 minutes/day)
    Latency SLO:
      p50 < 200ms
      p95 < 500ms
      p99 < 1000ms
    Error Rate SLO:
      5xx errors < 0.1% of requests
      4xx errors < 5% of requests
    Data Freshness SLO:
      Blog posts appear within 60 seconds of publish
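The error-budget figures in the SLO definition fall directly out of the availability target. A quick check of that arithmetic:

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed full downtime in the rolling window."""
    return (1 - slo_target) * window_days * 24 * 60

budget = error_budget_minutes(0.995)
print(f"{budget:.0f} minutes/month")    # 216 minutes/month
print(f"{budget / 30:.1f} minutes/day") # 7.2 minutes/day
```

Running the same function for 99.9% gives about 43 minutes/month, which makes the cost of each extra "nine" concrete during SLO negotiations.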
SLI Implementation with Prometheus:
# prometheus-rules-slis.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: portfolio-slis
  namespace: default
spec:
  groups:
    - name: portfolio_slis
      interval: 30s
      rules:
        # Availability SLI
        - record: portfolio:availability:ratio_rate5m
          expr: |
            sum(rate(istio_requests_total{
              destination_service_name="portfolio",
              response_code!~"5.."
            }[5m]))
            /
            sum(rate(istio_requests_total{
              destination_service_name="portfolio"
            }[5m]))
        # Latency SLI (p95)
        - record: portfolio:latency:p95
          expr: |
            histogram_quantile(0.95,
              sum(rate(istio_request_duration_milliseconds_bucket{
                destination_service_name="portfolio"
              }[5m])) by (le)
            )
        # Error Rate SLI
        - record: portfolio:error_rate:ratio_rate5m
          expr: |
            sum(rate(istio_requests_total{
              destination_service_name="portfolio",
              response_code=~"5.."
            }[5m]))
            /
            sum(rate(istio_requests_total{
              destination_service_name="portfolio"
            }[5m]))
Alert Design Philosophy:
| Alert Type | Triggers On | Action | Example |
|---|---|---|---|
| Page | SLO violation imminent, customer impact | Immediate response (2-5 min) | Availability <99.0%, p95 latency >1000ms |
| Ticket | Early warning, no customer impact yet | Next business day | Availability <99.7%, error budget 50% consumed |
| Info | Useful context, no action needed | Review weekly | New deployment, configuration change |
# prometheus-alerts-slos.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: portfolio-slo-alerts
  namespace: default
spec:
  groups:
    - name: portfolio_slo_alerts
      rules:
        # Page: SLO violation (immediate impact)
        - alert: PortfolioAvailabilityCritical
          expr: |
            (
              sum(rate(istio_requests_total{
                destination_service_name="portfolio",
                response_code!~"5.."
              }[5m]))
              /
              sum(rate(istio_requests_total{
                destination_service_name="portfolio"
              }[5m]))
            ) < 0.990
          for: 2m
          labels:
            severity: critical
            slo: availability
          annotations:
            summary: "Portfolio availability below 99% (current: {{ $value | humanizePercentage }})"
            description: "SLO violation - immediate customer impact. Error budget exhausted."
            runbook: "https://wiki.company.com/runbooks/portfolio-availability"
        # Ticket: Early warning (error budget burning)
        - alert: PortfolioErrorBudgetBurning
          expr: |
            (
              1 - (
                sum(rate(istio_requests_total{
                  destination_service_name="portfolio",
                  response_code!~"5.."
                }[2h]))
                /
                sum(rate(istio_requests_total{
                  destination_service_name="portfolio"
                }[2h]))
              )
            ) > 0.0025 # 50% of the 0.5% error budget rate
          for: 15m
          labels:
            severity: warning
            slo: availability
          annotations:
            summary: "Error budget burning faster than sustainable rate"
            description: "2h error ratio is {{ $value | humanizePercentage }}; at this rate the 30-day budget is consumed early"
        # Page: Latency SLO violation
        - alert: PortfolioLatencyCritical
          expr: |
            histogram_quantile(0.95,
              sum(rate(istio_request_duration_milliseconds_bucket{
                destination_service_name="portfolio"
              }[5m])) by (le)
            ) > 1000
          for: 5m
          labels:
            severity: critical
            slo: latency
          annotations:
            summary: "p95 latency above 1000ms (current: {{ $value }}ms)"
            description: "User experience degraded. Investigate immediately."
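Burn-rate alerts like `PortfolioErrorBudgetBurning` are easier to tune once the underlying arithmetic is explicit: a burn rate of 1 consumes the 30-day budget in exactly 30 days, and higher rates exhaust it proportionally faster. A sketch:

```python
def hours_to_budget_exhaustion(observed_error_ratio: float,
                               slo_target: float = 0.995,
                               window_days: int = 30) -> float:
    """How long the rolling error budget lasts at a sustained error ratio."""
    budget_ratio = 1 - slo_target               # 0.005 for a 99.5% SLO
    burn_rate = observed_error_ratio / budget_ratio
    return window_days * 24 / burn_rate

print(round(hours_to_budget_exhaustion(0.005), 1))  # 720.0 h -- exactly sustainable
print(round(hours_to_budget_exhaustion(0.01), 1))   # 360.0 h -- 2x burn rate
```

This is the reasoning behind multi-window burn-rate alerting: page on fast burns measured over short windows, ticket on slow burns measured over long ones.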
Distributed Tracing Strategy
Trace Context Propagation:
// src/middleware.ts - Ensure trace headers are propagated
import { NextRequest, NextResponse } from "next/server";

export function middleware(request: NextRequest) {
  const response = NextResponse.next();

  // Propagate Istio/Jaeger trace headers
  const traceHeaders = [
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "b3",
    "traceparent", // W3C Trace Context
    "tracestate",
  ];

  traceHeaders.forEach((header) => {
    const value = request.headers.get(header);
    if (value) {
      response.headers.set(header, value);
    }
  });

  return response;
}
Jaeger Configuration:
# Deploy Jaeger for distributed tracing
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.52
          env:
            - name: COLLECTOR_OTLP_ENABLED
              value: "true"
            - name: COLLECTOR_ZIPKIN_HOST_PORT
              value: ":9411" # Enable the Zipkin-compatible endpoint Istio sends to
            - name: SPAN_STORAGE_TYPE
              value: "memory" # Use elasticsearch for production
          ports:
            - containerPort: 16686
              name: ui
            - containerPort: 14268
              name: collector
            - containerPort: 9411
              name: zipkin
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger-query
  namespace: default
spec:
  ports:
    - port: 16686
      targetPort: 16686
      name: ui
  selector:
    app: jaeger
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger-collector
  namespace: default
spec:
  ports:
    - port: 14268
      targetPort: 14268
      name: collector
    - port: 9411
      targetPort: 9411
      name: zipkin # Istio's zipkin tracer targets this port
  selector:
    app: jaeger
---
EOF

# Configure Istio to send traces to Jaeger
kubectl apply -f - <<EOF
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: istio-config
  namespace: aks-istio-system
spec:
  meshConfig:
    defaultConfig:
      tracing:
        sampling: 100.0 # Percentage of requests sampled (0-100); use 1-10 in production
        zipkin:
          address: jaeger-collector.default.svc.cluster.local:9411
EOF
# Access Jaeger UI
kubectl port-forward -n default svc/jaeger-query 16686:16686
# Open http://localhost:16686
Trace Sampling Strategy:
| Environment | Sampling Rate | Rationale | Storage Cost |
|---|---|---|---|
| Development | 100% | Debug all requests | Negligible (memory storage) |
| Staging | 50% | Balance between coverage and cost | ~$50-100/mo (Elasticsearch) |
| Production (<1K RPS) | 10-20% | Capture anomalies, control costs | ~$100-200/mo |
| Production (>10K RPS) | 1-5% | Focused sampling, head-based | ~$200-500/mo |
| Incident Investigation | 100% (tail-based) | Temporarily sample errors only | Spike during incident |
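The storage figures in the table follow from a simple volume model: traces retained scale linearly with request rate and sampling fraction. A sketch for sizing trace storage:

```python
def traces_stored_per_day(rps: float, sample_rate: float) -> int:
    """Sampled traces retained per day (spans per trace multiply this further)."""
    seconds_per_day = 86_400
    return int(rps * seconds_per_day * sample_rate)

print(traces_stored_per_day(1_000, 0.10))   # 8,640,000 traces/day
print(traces_stored_per_day(10_000, 0.01))  # also 8,640,000 -- same storage load
```

Note the symmetry: 10x the traffic at one tenth the sampling rate produces the same storage load, which is why high-RPS services drop to 1-5% sampling.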
Capacity Planning Models
Resource Projection Formula:
# capacity-planning.py
def project_capacity(
    current_rps: int,
    target_rps: int,
    current_pods: int,
    cpu_per_request_ms: float,
    memory_per_request_mb: float,
    target_cpu_utilization: float = 0.70,  # 70% target
    overhead_factor: float = 1.20,  # 20% overhead for spikes
) -> dict:
    """
    Project required capacity for target RPS.

    Example:
    - Current: 100 RPS with 3 pods @ 50% CPU
    - Target: 500 RPS
    - CPU per request: 10ms
    - Memory per request: 5MB
    """
    # Calculate CPU requirements
    total_cpu_ms_per_sec = target_rps * cpu_per_request_ms
    total_cpu_cores = (total_cpu_ms_per_sec / 1000) / target_cpu_utilization
    total_cpu_with_overhead = total_cpu_cores * overhead_factor

    # Calculate memory requirements
    # Assume requests are processed concurrently with 100ms average latency
    concurrent_requests = target_rps * 0.1  # 100ms latency = 0.1s
    total_memory_mb = concurrent_requests * memory_per_request_mb
    total_memory_with_overhead = total_memory_mb * overhead_factor

    # Calculate pod requirements
    # Assume each pod can handle 0.5 CPU cores and 1GB memory
    pods_for_cpu = int(total_cpu_with_overhead / 0.5) + 1
    pods_for_memory = int(total_memory_with_overhead / 1024) + 1
    required_pods = max(pods_for_cpu, pods_for_memory)

    # Calculate node requirements
    # Standard_D2s_v3: 2 vCPU, 8GB RAM
    # Assume 1.6 CPU and 6.4GB usable per node (20% for system)
    required_nodes = int((required_pods * 0.5) / 1.6) + 1

    # Cost estimate ($75/node/month for Standard_D2s_v3)
    monthly_cost = required_nodes * 75

    return {
        "target_rps": target_rps,
        "required_pods": required_pods,
        "required_cpu_cores": total_cpu_with_overhead,
        "required_memory_gb": total_memory_with_overhead / 1024,
        "required_nodes": required_nodes,
        "monthly_cost_usd": monthly_cost,
        "cpu_utilization_target": target_cpu_utilization,
        "overhead_factor": overhead_factor,
    }


# Example calculation
result = project_capacity(
    current_rps=100,
    target_rps=500,
    current_pods=3,
    cpu_per_request_ms=10,
    memory_per_request_mb=5,
)
print(f"For {result['target_rps']} RPS:")
print(f"  Required Pods: {result['required_pods']}")
print(f"  Required Nodes: {result['required_nodes']}")
print(f"  CPU Cores: {result['required_cpu_cores']:.2f}")
print(f"  Memory: {result['required_memory_gb']:.2f} GB")
print(f"  Monthly Cost: ${result['monthly_cost_usd']}")
Load Testing with K6:
// load-test.js
import http from "k6/http";
import { check, sleep } from "k6";
import { Rate, Trend } from "k6/metrics";

// Custom metrics
const errorRate = new Rate("error_rate");
const latencyTrend = new Trend("latency");

export const options = {
  stages: [
    { duration: "2m", target: 50 },  // Ramp up to 50 users
    { duration: "5m", target: 50 },  // Stay at 50 users
    { duration: "2m", target: 100 }, // Ramp to 100 users
    { duration: "5m", target: 100 }, // Stay at 100 users
    { duration: "2m", target: 200 }, // Spike to 200 users
    { duration: "5m", target: 200 }, // Sustained high load
    { duration: "2m", target: 0 },   // Ramp down
  ],
  thresholds: {
    http_req_duration: ["p(95)<500", "p(99)<1000"], // 95% < 500ms, 99% < 1s
    error_rate: ["rate<0.01"], // Error rate < 1%
    http_req_failed: ["rate<0.01"],
  },
};

export default function () {
  const response = http.get("https://cat-herding.net");

  // Check response
  const success = check(response, {
    "status is 200": (r) => r.status === 200,
    "response time OK": (r) => r.timings.duration < 500,
  });

  // Record metrics
  errorRate.add(!success);
  latencyTrend.add(response.timings.duration);

  sleep(1);
}

// Run test:
//   k6 run --out json=results.json load-test.js
// Analyze results.json with jq, or stream metrics to Grafana/InfluxDB
Deployment Strategies and Rollback Plans
Canary Deployment with Istio:
# portfolio-canary.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: portfolio-v2-canary
namespace: default
spec:
replicas: 1 # Start with 1 pod (10% of traffic)
selector:
matchLabels:
app: portfolio
version: v2
template:
metadata:
labels:
app: portfolio
version: v2
spec:
containers:
- name: portfolio
image: gabby.azurecr.io/portfolio:v2-candidate
# ... rest of spec same as v1
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: portfolio-canary
namespace: default
spec:
hosts:
- "cat-herding.net"
- "www.cat-herding.net"
gateways:
- portfolio-gateway
http:
- match:
- headers:
x-canary:
exact: "true" # Internal testing
route:
- destination:
host: portfolio
subset: v2
weight: 100
- route:
- destination:
host: portfolio
subset: v1
weight: 90 # 90% to stable version
- destination:
host: portfolio
subset: v2
weight: 10 # 10% to canary
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: portfolio-versions
namespace: default
spec:
host: portfolio
subsets:
- name: v1
labels:
version: v1
- name: v2
labels:
version: v2
Automated Canary Promotion with Flagger:
# Install Flagger (progressive delivery operator)
kubectl apply -k github.com/fluxcd/flagger//kustomize/istio
# Define automated canary rollout
cat <<EOF | kubectl apply -f -
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: portfolio
namespace: default
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: portfolio
service:
port: 80
analysis:
interval: 1m
threshold: 5 # Number of failed checks before rollback
maxWeight: 50 # Max canary weight
stepWeight: 10 # Increment weight by 10% each step
metrics:
- name: request-success-rate
thresholdRange:
min: 99 # Rollback if success rate < 99%
interval: 1m
- name: request-duration
thresholdRange:
max: 500 # Rollback if p99 > 500ms
interval: 1m
webhooks:
- name: slack-notification
url: https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK
timeout: 5s
metadata:
channel: "#deployments"
EOF
# Trigger canary by updating image
kubectl set image deployment/portfolio portfolio=gabby.azurecr.io/portfolio:v2
# Flagger will:
# 1. Deploy canary (0% traffic)
# 2. Shift 10% traffic to canary
# 3. Wait 1 minute, check metrics
# 4. If metrics good: Shift 20%, repeat
# 5. If metrics bad: Rollback immediately
# 6. If success: Promote to 100%
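The promotion loop described above can be sketched as a small state machine. This is a hypothetical simulation of the stepWeight/threshold behavior, not Flagger's actual code; `check_metrics` stands in for the Prometheus queries Flagger runs each interval:

```python
def run_canary(check_metrics, step_weight=10, max_weight=50, threshold=5) -> str:
    """Simulate Flagger-style progressive promotion.

    check_metrics(weight) -> True when success-rate/latency checks pass
    at the given canary weight. Returns "promoted" or "rolled_back".
    """
    weight, failed = 0, 0
    while weight < max_weight:
        weight = min(weight + step_weight, max_weight)  # shift more traffic
        if check_metrics(weight):
            failed = 0  # healthy step: reset the failure counter
        else:
            failed += 1
            weight = max(weight - step_weight, 0)  # step traffic back
            if failed >= threshold:
                return "rolled_back"  # too many consecutive failed checks
    return "promoted"  # reached max_weight with healthy metrics
```

With defaults, a canary that stays healthy is promoted after five 10% steps; one that fails five consecutive checks is rolled back, mirroring `threshold: 5` in the manifest.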
Emergency Rollback Playbook:
# SCENARIO: Deployment causing errors, need immediate rollback
# Step 1: Identify current deployment
kubectl rollout history deployment/portfolio -n default
# Step 2: Check current revision
kubectl get deployment portfolio -n default -o jsonpath='{.metadata.annotations.deployment\.kubernetes\.io/revision}'
# Step 3: Quick rollback to previous revision
kubectl rollout undo deployment/portfolio -n default
# Step 4: Verify rollback is working
kubectl rollout status deployment/portfolio -n default
kubectl get pods -n default -l app=portfolio -w
# Step 5: If using Istio canary, immediately shift traffic to v1
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: portfolio-virtualservice
namespace: default
spec:
hosts:
- "cat-herding.net"
- "www.cat-herding.net"
gateways:
- portfolio-gateway
http:
- route:
- destination:
host: portfolio
subset: v1
weight: 100 # 100% to stable version
- destination:
host: portfolio
subset: v2
weight: 0 # 0% to broken canary
EOF
# Step 6: Communicate status
# Post to #incidents Slack channel:
# "🔥 Emergency rollback executed for portfolio service
# Reason: [high error rate / latency / crashes]
# Action: Rolled back from revision X to revision Y
# Status: Monitoring recovery
# ETA: Stable in 2-5 minutes"
# Step 7: Monitor recovery
watch -n 2 'kubectl get pods -n default -l app=portfolio'
# Step 8: Verify metrics returning to normal
# Check Grafana dashboard, Prometheus alerts, error logs
# Step 9: Post-mortem (after recovery)
# Document:
# - What triggered the rollback
# - How it was detected (alert, user report, monitoring)
# - Time to detection and time to recovery
# - Root cause analysis
# - Action items to prevent recurrence
Cost Optimization Strategies
Right-Sizing Workloads:
# Install Vertical Pod Autoscaler (VPA) for recommendations
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler && ./hack/vpa-up.sh
# Create VPA in recommendation mode (doesn't auto-adjust)
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: portfolio-vpa
namespace: default
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: portfolio
updatePolicy:
updateMode: "Off" # Recommendation only
resourcePolicy:
containerPolicies:
- containerName: portfolio
minAllowed:
cpu: 100m
memory: 128Mi
maxAllowed:
cpu: 2000m
memory: 4Gi
EOF
# Get recommendations after 24-48 hours
kubectl describe vpa portfolio-vpa -n default
# Example output:
# Recommendation:
# Container Recommendations:
# Container Name: portfolio
# Lower Bound:
# Cpu: 150m
# Memory: 256Mi
# Target:
# Cpu: 300m # Recommended request
# Memory: 512Mi # Recommended request
# Uncapped Target:
# Cpu: 450m
# Memory: 768Mi
# Upper Bound:
# Cpu: 1000m
# Memory: 1Gi
Spot Instance Strategy:
# Mixed node pool: on-demand + spot
# Use on-demand for critical services, spot for batch workloads
# Label portfolio deployment to prefer spot instances
apiVersion: apps/v1
kind: Deployment
metadata:
name: portfolio-batch-jobs
spec:
template:
spec:
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
preference:
matchExpressions:
- key: kubernetes.azure.com/scalesetpriority
operator: In
values:
- spot
tolerations:
- key: kubernetes.azure.com/scalesetpriority
operator: Equal
value: spot
effect: NoSchedule
containers:
- name: batch-processor
image: gabby.azurecr.io/batch:latest
# Batch jobs can tolerate interruptions (70-90% cost savings)
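The "70-90% cost savings" claim deserves a caveat: evictions force some work to be re-run, which eats into the discount. A small sketch for estimating the effective saving (the rework fraction is something you'd measure from your own batch jobs):

```python
def spot_effective_savings(on_demand_hr: float, spot_hr: float,
                           rework_fraction: float) -> float:
    """Fraction saved running a batch job on spot, net of re-run overhead.

    rework_fraction: share of work repeated after evictions (e.g. 0.2 = 20%).
    """
    effective_hr = spot_hr * (1 + rework_fraction)  # pay again for redone work
    return 1 - effective_hr / on_demand_hr
```

At a 70% spot discount, even 20% rework still leaves a net saving of about 64%, which is why interruption-tolerant batch workloads are the canonical spot use case.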
Idle Resource Cleanup:
# Identify unused resources
# Cost optimization tip: Delete these regularly
# Unused PersistentVolumeClaims (orphaned storage)
kubectl get pvc -A | grep -vw Bound  # PVCs don't support a status.phase field selector
# Unused ConfigMaps/Secrets (check if referenced)
kubectl get configmaps -A
kubectl get secrets -A
# Old ReplicaSets (kept for rollback history, can be pruned)
kubectl get rs -A --sort-by=.metadata.creationTimestamp
# Clean up completed jobs older than 7 days
kubectl delete jobs --field-selector status.successful=1 \
--all-namespaces \
--dry-run=client # Remove dry-run when ready
# Schedule automatic cleanup with CronJob
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: CronJob
metadata:
name: cleanup-old-jobs
namespace: default
spec:
schedule: "0 2 * * 0" # Every Sunday at 2 AM
jobTemplate:
spec:
template:
spec:
serviceAccountName: cleanup-sa
containers:
- name: kubectl
image: bitnami/kubectl:latest
command:
- /bin/sh
- -c
- |
# NOTE: requires jq in the image; kubectl cannot delete named resources with -A
kubectl get jobs -A -o json \
| jq -r '.items[] | select(.status.completionTime != null and (now - (.status.completionTime | fromdateiso8601)) > 604800) | "\(.metadata.namespace) \(.metadata.name)"' \
| while read -r ns name; do
kubectl delete job -n "$ns" "$name" --ignore-not-found=true
done
restartPolicy: OnFailure
EOF
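The jq filter inside the CronJob selects Jobs whose completionTime is more than 7 days (604,800 s) old. The same selection sketched in Python against `kubectl get jobs -A -o json` output, handy for auditing what would be deleted before enabling the CronJob (hypothetical helper):

```python
import json
from datetime import datetime, timedelta, timezone

WEEK = timedelta(days=7)


def stale_jobs(jobs_json: str, now=None):
    """Return (namespace, name) pairs for Jobs completed more than 7 days ago.

    jobs_json: the output of `kubectl get jobs -A -o json`.
    """
    now = now or datetime.now(timezone.utc)
    stale = []
    for item in json.loads(jobs_json).get("items", []):
        done = item.get("status", {}).get("completionTime")
        if not done:
            continue  # still running, or failed without a completion time
        finished = datetime.fromisoformat(done.replace("Z", "+00:00"))
        if now - finished > WEEK:
            stale.append((item["metadata"]["namespace"], item["metadata"]["name"]))
    return stale
```

Pipe the kubectl output into this before the first automated run to confirm nothing load-bearing matches the age filter.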
Cost Monitoring Dashboard:
| Resource | Monthly Cost | Optimization Opportunity |
|---|---|---|
| AKS cluster management | $0 | N/A (free) |
| 3x Standard_D2s_v3 nodes | $225 | Use spot instances for non-critical (~$68) |
| Managed Istio | $150 | Consider self-hosted if team has expertise |
| Load Balancer (public IP) | $20 | Share LB across multiple services |
| Azure DNS Zone | $1 | N/A (minimal) |
| Bandwidth (100GB egress) | $10 | Use CDN to reduce egress |
| Persistent storage (500GB) | $38 | Clean up old PVCs, use lifecycle policies |
| Total | $444/mo | Optimized: $292/mo (34% savings) |
Azure vs GCP: Cost Comparison for Hobby Projects
Why I Migrated from GCP to Azure (Real Experience)
After running my portfolio on GKE (Google Kubernetes Engine) for 18 months, I migrated to AKS and saw significant cost savings. Here's the honest breakdown:
GCP Configuration (Previous Setup)
# GKE Cluster - us-central1-a
# 3x e2-medium nodes (2 vCPU, 4GB RAM each)
# Cloud SQL PostgreSQL (db-f1-micro)
# Cloud Load Balancer
# Cloud DNS
# Monthly Cost Breakdown (GCP):
# - GKE cluster management: $74.40/mo ($0.10/hour)
# - 3x e2-medium nodes: $73.73/mo ($24.58 each)
# - Cloud SQL (db-f1-micro): $7.67/mo
# - Cloud Load Balancer: $18.26/mo
# - Cloud SQL Proxy (included): $0
# - Cloud DNS zone: $0.20/mo
# - Bandwidth (100GB egress): $12.00/mo
# - Persistent disks (500GB): $80.00/mo (standard PD)
# Total: ~$266/mo (WITHOUT Istio)
# With Istio self-hosted: ~$340/mo
Azure Configuration (Current Setup)
# AKS Cluster - East US
# 3x Standard_B2s nodes (2 vCPU, 4GB RAM each) - HOBBY TIER
# Azure Database for PostgreSQL (Flexible Server, Burstable B1ms)
# Azure Load Balancer
# Azure DNS
# Monthly Cost Breakdown (Azure):
# - AKS cluster management: $0 (FREE!)
# - 3x Standard_B2s nodes: $120.36/mo ($40.12 each)
# - Azure Database for PostgreSQL (B1ms): $12.41/mo
# - Load Balancer (public IP): $3.65/mo
# - Azure DNS zone: $0.50/mo
# - Bandwidth (100GB egress): $8.76/mo
# - Managed disks (500GB Standard HDD): $19.71/mo
# Total: ~$165/mo (WITHOUT Istio)
# With managed Istio: ~$315/mo
Side-by-Side Comparison
| Component | GCP Cost | Azure Cost | Savings | Notes |
|---|---|---|---|---|
| Cluster Management | $74.40 | $0 | $74.40 | AKS management is completely free |
| 3x Nodes (2vCPU, 4GB) | $73.73 (e2-medium) | $120.36 (B2s) | -$46.63 | Azure B-series burstable cheaper than D-series but more than GCP e2 |
| Database (small) | $7.67 (db-f1-micro) | $12.41 (B1ms) | -$4.74 | Azure Flexible Server slightly more |
| Load Balancer | $18.26 | $3.65 | $14.61 | Azure LB significantly cheaper |
| DNS Zone | $0.20 | $0.50 | -$0.30 | Negligible difference |
| Bandwidth (100GB) | $12.00 | $8.76 | $3.24 | Azure egress slightly cheaper |
| Storage (500GB) | $80.00 (SSD) | $19.71 (HDD) | $60.29 | Azure Standard HDD vs GCP Standard PD |
| Managed Istio | N/A (self-host ~$70) | $150 | Variable | Azure offers managed option |
| Total (no Istio) | $266 | $165 | $101/mo (38%) | Azure saves ~$1,212/year |
| Total (with Istio) | $340 (self-hosted) | $315 (managed) | $25/mo (7%) | Azure saves ~$300/year |
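The bottom line of the table can be reproduced from the per-item figures (the article's own estimates, not live pricing). A small helper, useful for re-running the comparison as prices drift:

```python
# Monthly line items from the comparison above (article estimates, USD)
GCP = {"mgmt": 74.40, "nodes": 73.73, "db": 7.67, "lb": 18.26,
       "dns": 0.20, "egress": 12.00, "disk": 80.00}
AZURE = {"mgmt": 0.00, "nodes": 120.36, "db": 12.41, "lb": 3.65,
         "dns": 0.50, "egress": 8.76, "disk": 19.71}


def monthly_savings(before: dict, after: dict):
    """(monthly, annual) savings from moving stack `before` to stack `after`."""
    delta = sum(before.values()) - sum(after.values())
    return round(delta, 2), round(delta * 12, 2)
```

`monthly_savings(GCP, AZURE)` gives $100.87/mo and $1,210.44/year, which the table rounds to $101/mo and ~$1,212/year.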
Hobby Cluster Optimization: The $50/mo Setup
For personal projects and learning, here's how to run AKS for ~$50-75/month:
# Ultra-Low-Cost Azure Configuration
# 1x Standard_B2s node (starts 1, autoscales to 2)
# Azure Database for PostgreSQL (Burstable B1ms, manually stopped when idle)
# NO Istio (use native Kubernetes Ingress)
# Azure DNS + Let's Encrypt
# Cost Breakdown:
# - AKS cluster management: $0
# - 1x Standard_B2s node: $40.12/mo
# - Azure PostgreSQL (B1ms, stopped ~50% of the time): $6.21/mo
# - Load Balancer: $3.65/mo
# - NGINX Ingress Controller: $0 (runs on node)
# - Azure DNS: $0.50/mo
# - Bandwidth (20GB/mo): $1.75/mo
# - Managed disk (100GB): $3.94/mo
# Total: ~$56/mo
# GCP Equivalent (Lowest Cost):
# - GKE Autopilot (smallest): $74.40 management + ~$50 compute = ~$124/mo
# OR
# - GKE Standard (1 node e2-small): $74.40 + $15.33 + $7.67 DB + $18 LB = ~$115/mo
# Azure wins by ~$59-68/mo (51-55% savings) for hobby projects
Configuration for $50/mo Hobby Cluster:
# hobby-cluster.yaml - Minimal viable AKS setup
apiVersion: v1
kind: Namespace
metadata:
name: portfolio
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: portfolio
namespace: portfolio
spec:
replicas: 1 # Single replica for hobby projects
selector:
matchLabels:
app: portfolio
template:
metadata:
labels:
app: portfolio
spec:
containers:
- name: portfolio
image: gabby.azurecr.io/portfolio:latest
ports:
- containerPort: 3000
resources:
requests:
cpu: 100m # Minimal CPU
memory: 256Mi # Minimal memory
limits:
cpu: 500m
memory: 512Mi
livenessProbe:
httpGet:
path: /api/health
port: 3000
initialDelaySeconds: 30
readinessProbe:
httpGet:
path: /api/health
port: 3000
initialDelaySeconds: 5
---
# Use NGINX Ingress instead of Istio (free)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: portfolio-ingress
namespace: portfolio
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
ingressClassName: nginx
tls:
- hosts:
- cat-herding.net
- www.cat-herding.net
secretName: cat-herding-tls
rules:
- host: cat-herding.net
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: portfolio
port:
number: 80
- host: www.cat-herding.net
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: portfolio
port:
number: 80
Setup Commands for $50/mo Cluster:
# Create minimal AKS cluster
az aks create \
--resource-group portfolio-hobby \
--name portfolio-cluster \
--node-count 1 \
--min-count 1 \
--max-count 2 \
--node-vm-size Standard_B2s \
--enable-cluster-autoscaler \
--network-plugin azure \
--network-policy azure \
--tier free \
--no-ssh-key
# Install NGINX Ingress (replaces Istio)
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install ingress-nginx ingress-nginx/ingress-nginx \
--namespace ingress-nginx \
--create-namespace \
--set controller.service.externalTrafficPolicy=Local \
--set controller.resources.requests.cpu=50m \
--set controller.resources.requests.memory=128Mi
# Install cert-manager (same as before)
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
# Total setup time: ~15 minutes
# No Istio complexity, no managed service fees
Key Takeaways for Hobby Projects
Azure Advantages:
- ✅ Free cluster management ($74.40/mo savings vs GCP)
- ✅ Cheaper load balancer ($14.61/mo savings)
- ✅ Cheaper storage ($60/mo savings with HDD)
- ✅ Burstable B-series VMs ideal for bursty traffic (personal sites)
- ✅ Simple pricing - fewer surprise charges
- ✅ Better free tier - Can run hobby cluster for ~$50/mo
GCP Advantages:
- ✅ Cheaper compute (e2-medium < B2s)
- ✅ GKE Autopilot removes node management (but costs more)
- ✅ Better integration with Google Cloud services (Cloud Run, etc.)
- ✅ More mature Istio integration (created by Google)
Decision Framework:
My Migration Experience:
| Metric | Before (GCP) | After (Azure) | Change |
|---|---|---|---|
| Monthly Cost | $266 | $165 | -$101 (-38%) |
| Cluster Management Fee | $74.40 | $0 | -$74.40 (-100%) |
| Annual Savings | - | - | $1,212/year |
| Performance (p95 latency) | 180ms | 165ms | -15ms (comparable) |
| Downtime during migration | 0 minutes | 0 minutes | Blue-green cutover |
| Migration complexity | - | 4 hours | Mostly manifest updates |
| Regrets | None | None | Would migrate again |
Bottom Line for Hobby Projects:
- GCP: Great if you're already using Google Cloud services or want Autopilot simplicity
- Azure: Better for pure Kubernetes workloads, especially if cost-sensitive
- Azure wins by $1,212/year for similar configuration without Istio
- Azure wins by ~$300/year even with managed Istio
For my personal portfolio, the migration to Azure was a no-brainer: I saved $1,212/year while gaining managed Istio as an option. The $74.40/month cluster management fee I no longer pay covers a nice dinner every month.
Comprehensive Service Cost Comparison: Azure vs GCP vs AWS
Beyond the base Kubernetes infrastructure, let's compare the full stack of services you'll need for a production application across all three major cloud providers.
Database Services: PostgreSQL Comparison
| Tier | GCP Cloud SQL | Azure Database | AWS RDS PostgreSQL | Winner | Use Case |
|---|---|---|---|---|---|
| Hobby | db-f1-micro<br/>1 shared vCPU, 614MB<br/>$7.67/mo | B1ms<br/>1 vCPU, 2GB<br/>$12.41/mo | db.t4g.micro<br/>2 vCPU, 1GB<br/>$12.41/mo | GCP 🏆 | Personal, <100 req/hr |
| Small | db-g1-small<br/>1 shared vCPU, 1.7GB<br/>$25.00/mo | B2s<br/>2 vCPU, 4GB<br/>$49.64/mo | db.t4g.small<br/>2 vCPU, 2GB<br/>$24.82/mo | AWS 🏆 | Small apps, <1K req/hr |
| Medium | db-n1-standard-1<br/>1 vCPU, 3.75GB<br/>$49.95/mo | D2s_v3<br/>2 vCPU, 8GB<br/>$116.80/mo | db.t4g.medium<br/>2 vCPU, 4GB<br/>$49.64/mo | GCP 🏆 | Mid-size, 1-10K req/hr |
| High Availability | db-n1-std-2 + HA<br/>2 vCPU, 7.5GB<br/>$198.90/mo | D2s_v3 HA<br/>2 vCPU, 8GB<br/>$233.60/mo | db.m6g.large Multi-AZ<br/>2 vCPU, 8GB<br/>$188.50/mo | AWS 🏆 | Production with failover |
| High Performance | db-n1-standard-4<br/>4 vCPU, 15GB<br/>$199.05/mo | D4s_v3<br/>4 vCPU, 16GB<br/>$233.59/mo | db.m6g.xlarge<br/>4 vCPU, 16GB<br/>$282.24/mo | GCP 🏆 | Heavy, >10K req/hr |
Overall Winner: GCP for hobby/medium, AWS for small/HA scenarios
Note: GCP's shared vCPU tiers are cheapest for hobby projects. AWS t4g (Graviton) instances offer best price/performance for small-medium workloads.
Database Services: MySQL Comparison
| Tier | GCP Cloud SQL | Azure Database | AWS RDS MySQL | Winner | Notes |
|---|---|---|---|---|---|
| Hobby | db-f1-micro<br/>$7.18/mo | B1ms<br/>$13.14/mo | db.t4g.micro<br/>$11.59/mo | GCP 🏆 | MySQL 8.0 |
| Small | db-g1-small<br/>$24.51/mo | B2s<br/>$52.56/mo | db.t4g.small<br/>$23.18/mo | AWS 🏆 | MySQL 8.0 |
| Medium | db-n1-standard-1<br/>$48.23/mo | D2s_v3<br/>$123.29/mo | db.t4g.medium<br/>$46.35/mo | AWS 🏆 | MySQL 8.0 |
| High Performance | db-n1-standard-4<br/>$192.92/mo | D4s_v3<br/>$246.58/mo | db.m6g.xlarge<br/>$263.52/mo | GCP 🏆 | MySQL 8.0 |
Overall Winner: GCP for hobby, AWS for small/medium production workloads
Cache Services: Redis Comparison
| Tier | GCP Memorystore | Azure Cache for Redis | AWS ElastiCache | Winner | Notes |
|---|---|---|---|---|---|
| Basic 1GB | $41.61/mo | Basic C1 (1GB)<br/>$16.06/mo | cache.t4g.micro (0.5GB)<br/>$11.52/mo | AWS 🏆 | No HA, single zone |
| Basic 5GB | $208.05/mo | Basic C3 (6GB)<br/>$64.24/mo | cache.m6g.large (6.38GB)<br/>$75.74/mo | Azure 🏆 | No HA |
| Standard 1GB HA | $78.84/mo | Standard C1 (1GB)<br/>$44.53/mo | cache.t4g.small (1.55GB)<br/>$46.08/mo | Azure 🏆 | High availability |
| Standard 5GB HA | $393.44/mo | Standard C3 (6GB)<br/>$128.48/mo | cache.m6g.large (6.38GB) Multi-AZ<br/>$151.48/mo | Azure 🏆 | HA with replica |
| Premium 26GB HA | $810.72/mo | Premium P1 (6GB)<br/>$164.25/mo | cache.m6g.xlarge (12.93GB) Cluster<br/>$302.96/mo | Azure 🏆 | Clustering, persistence |
| Premium 100GB HA | $3,242.88/mo | Premium P4 (26GB)<br/>$657.00/mo | cache.r6g.xlarge (26.32GB) Cluster<br/>$605.92/mo | AWS 🏆 | Enterprise features |
Overall Winner: Azure for most tiers (40-70% cheaper), AWS for small hobby and large enterprise
Note: Azure Cache for Redis is dramatically cheaper than both GCP and AWS for standard production workloads. This is one of Azure's biggest competitive advantages.
Object Storage Comparison
| Storage Type | GCP Cloud Storage | Azure Blob Storage | AWS S3 | Winner | Use Case |
|---|---|---|---|---|---|
| Hot Storage | Standard<br/>$0.020/GB/mo | Hot tier<br/>$0.0184/GB/mo | Standard<br/>$0.023/GB/mo | Azure 🏆 | Frequently accessed |
| Warm Storage | Nearline (30-day)<br/>$0.010/GB/mo | Cool tier<br/>$0.01/GB/mo | S3 Infrequent Access<br/>$0.0125/GB/mo | Tie (Azure/GCP) | Monthly access |
| Cold Storage | Coldline (90-day)<br/>$0.004/GB/mo | Archive tier<br/>$0.00099/GB/mo | Glacier Flexible<br/>$0.0036/GB/mo | Azure 🏆 | Rarely accessed |
| Deep Archive | Archive<br/>$0.0012/GB/mo | Archive tier<br/>$0.00099/GB/mo | Glacier Deep Archive<br/>$0.00099/GB/mo | Tie (Azure/AWS) | Long-term backup |
| Egress (1TB) | $0.12/GB | $0.087/GB | $0.09/GB | Azure 🏆 | Data transfer out |
| Requests (per 10K) | $0.40 (Class A) | $0.50 (write) | $0.50 (PUT) | GCP 🏆 | Write operations |
Overall Winner: Azure for storage costs and egress, GCP for request pricing
Note: Azure Archive tier is cheapest for cold storage. All three are competitive for hot storage within 20%.
Compute (Virtual Machines) Comparison
| VM Size | GCP (us-central1) | Azure (East US) | AWS (us-east-1) | Winner | Specs |
|---|---|---|---|---|---|
| Micro | e2-micro<br/>$6.11/mo | B1s<br/>$7.59/mo | t4g.nano<br/>$3.07/mo | AWS 🏆 | 0.25-2 vCPU, 0.5-1GB |
| Small | e2-small<br/>$12.23/mo | B1ms<br/>$15.18/mo | t4g.micro<br/>$6.14/mo | AWS 🏆 | 0.5-2 vCPU, 1-2GB |
| Medium Burst | e2-medium<br/>$24.45/mo | B2s<br/>$40.12/mo | t4g.small<br/>$12.29/mo | AWS 🏆 | 2 vCPU, 2-4GB |
| Medium Standard | n1-standard-1<br/>$24.73/mo | D2s_v3<br/>$75.19/mo | t4g.medium<br/>$24.58/mo | GCP 🏆 | 1-2 vCPU, 3.75-4GB |
| Medium Standard | n1-standard-2<br/>$49.45/mo | D2s_v3<br/>$75.19/mo | m6g.large<br/>$60.74/mo | GCP 🏆 | 2 vCPU, 7.5-8GB |
| Large | n1-standard-4<br/>$98.91/mo | D4s_v3<br/>$150.38/mo | m6g.xlarge<br/>$121.47/mo | GCP 🏆 | 4 vCPU, 15-16GB |
| Spot Instance | e2-medium (preemptible)<br/>$7.34/mo (70% off) | B2s (spot)<br/>$12.04/mo (70% off) | t4g.small (spot)<br/>$3.69/mo (70% off) | AWS 🏆 | 2 vCPU, 2-4GB |
Overall Winner: AWS for burstable/spot instances, GCP for standard compute
Note: AWS t4g (Graviton ARM) instances offer best value for burstable workloads. GCP wins for sustained compute. Azure is most expensive across the board.
Kubernetes Node Pricing (Actual Cluster Nodes)
| Node Type | GCP GKE | Azure AKS | AWS EKS | Winner (3-node cost) |
|---|---|---|---|---|
| Hobby | e2-small<br/>0.5 vCPU, 2GB | B1ms<br/>1 vCPU, 2GB | t4g.micro<br/>2 vCPU, 1GB | GCP: $36.69<br/>Azure: $45.54<br/>AWS: $18.42 🏆 |
| Small | e2-medium<br/>2 vCPU, 4GB | B2s<br/>2 vCPU, 4GB | t4g.small<br/>2 vCPU, 2GB | GCP: $73.35<br/>Azure: $120.36<br/>AWS: $36.87 🏆 |
| Production | n1-standard-2<br/>2 vCPU, 7.5GB | D2s_v3<br/>2 vCPU, 8GB | m6g.large<br/>2 vCPU, 8GB | GCP: $148.35 🏆<br/>Azure: $225.57<br/>AWS: $182.22 |
| High Perf | n1-standard-4<br/>4 vCPU, 15GB | D4s_v3<br/>4 vCPU, 16GB | m6g.xlarge<br/>4 vCPU, 16GB | GCP: $296.73 🏆<br/>Azure: $451.14<br/>AWS: $364.41 |
| Spot Nodes | e2-medium<br/>(preemptible) | B2s<br/>(spot) | t4g.small<br/>(spot) | GCP: $22.02<br/>Azure: $36.12<br/>AWS: $11.07 🏆 |
Overall Winner: AWS for hobby/small/spot, GCP for production sustained workloads
Critical Factors:
- GCP: Add $74.40/mo cluster management fee (negates savings for small clusters)
- Azure: $0 cluster management (free)
- AWS: Add $73/mo per-cluster management fee (similar to GCP)
Load Balancer Comparison
| Load Balancer Type | GCP | Azure | AWS | Winner | Notes |
|---|---|---|---|---|---|
| Basic (HTTP/HTTPS) | Cloud LB<br/>$18.26/mo | Standard LB<br/>$3.65/mo | ALB<br/>$16.43/mo | Azure 🏆 | 5 forwarding rules |
| With SSL | + free cert<br/>$18.26/mo | + free cert<br/>$3.65/mo | + free cert (ACM)<br/>$16.43/mo | Azure 🏆 | Managed certificates |
| Network LB | Regional LB<br/>$18.26/mo | Basic LB<br/>$0/mo | NLB<br/>$16.43/mo | Azure 🏆 | Layer 4 |
| Data Processing | $0.008/GB | $0.005/GB | $0.008/GB | Azure 🏆 | Per GB processed |
| LCU Hours | Included | Included | $0.008/hour | Tie (Azure/GCP) | Load Balancer Capacity Units |
Overall Winner: Azure for load balancers (75-100% cheaper than both)
Note: Azure's load balancer pricing is one of its strongest competitive advantages. AWS ALB charges for LCUs, GCP charges per forwarding rule.
Content Delivery Network (CDN) Comparison
| CDN Tier | GCP Cloud CDN | Azure CDN | AWS CloudFront | Winner | Use Case |
|---|---|---|---|---|---|
| Cache Egress (10TB) | $0.08/GB | $0.087/GB | $0.085/GB | GCP 🏆 | Cache hit |
| Origin Egress | $0.12/GB | $0.087/GB | $0.085/GB | AWS 🏆 | Cache miss |
| Requests (per 10K) | $0.0075 | $0.0096 | $0.0075 | Tie (GCP/AWS) | HTTP/HTTPS |
Overall Winner: GCP/AWS tie for cache performance, Azure for origin-heavy workloads
Note: All three are competitive within 10%. Choice depends on your cache hit ratio and existing cloud infrastructure.
Managed Kubernetes Service: Total Cost Comparison
Let's compare identical workloads on all three platforms:
Scenario: 3-node production cluster with database, cache, and storage
| Component | GCP Monthly Cost | Azure Monthly Cost | AWS Monthly Cost | Winner |
|---|---|---|---|---|
| Cluster Management Fee | $74.40 | $0 ✅ | $73.00 | Azure |
| 3x Nodes (2 vCPU, 4GB) | $73.35 ✅ (e2-medium) | $120.36 (B2s) | $36.87 ✅ (t4g.small) | AWS |
| PostgreSQL (small) | $25.00 ✅ (db-g1-small) | $49.64 (B2s) | $24.82 ✅ (t4g.small) | AWS |
| Redis Cache (1GB HA) | $78.84 (Memorystore) | $44.53 ✅ (Standard C1) | $46.08 (t4g.small) | Azure |
| Load Balancer | $18.26 | $3.65 ✅ | $16.43 | Azure |
| Storage (500GB SSD) | $85.00 (SSD PD) | $48.00 ✅ (Premium SSD) | $50.00 (gp3) | Azure |
| Bandwidth (100GB) | $12.00 | $8.76 ✅ | $9.00 | Azure |
| DNS | $0.20 ✅ | $0.50 | $0.50 (Route53) | GCP |
| Container Registry | $0 ✅ (500MB free) | $5.00 (Basic) | $0 ✅ (500MB free) | Tie (GCP/AWS) |
| Monitoring | $8.00 (Cloud Logging) | $0 ✅ (basic included) | $10.00 (CloudWatch) | Azure |
| Total | $375.05 | $280.44 ✅ | $266.70 ✅ | AWS wins by $13.74/mo |
| Annual Total | $4,500.60 | $3,365.28 | $3,200.40 ✅ | AWS saves $1,300.20/year |
Winner: AWS for this configuration (saves $1,300/year vs GCP, $165/year vs Azure)
Cost Winner by Service Category
Real-World Cost Scenario Analysis
Scenario 1: Hobby Developer (Personal Portfolio)
Workload:
- 1 node cluster
- Small database
- No cache
- <1K visitors/day
GCP Cost:
- GKE management: $74.40
- 1x e2-small node: $12.23
- db-f1-micro: $7.67
- Load balancer: $18.26
- DNS: $0.20
Total: $112.76/mo
Azure Cost:
- AKS management: $0 ✅
- 1x B1ms node: $15.18
- B1ms PostgreSQL: $12.41
- Load balancer: $3.65 ✅
- DNS: $0.50
Total: $31.74/mo
AWS Cost:
- EKS management: $73.00
- 1x t4g.micro node: $6.14 ✅
- db.t4g.micro: $12.41
- ALB: $16.43
- Route53: $0.50
Total: $108.48/mo
Winner: Azure saves $81.02/mo vs GCP (72% cheaper)
Winner: Azure saves $76.74/mo vs AWS (71% cheaper)
Scenario 2: Startup (10 Services, Growing Traffic)
Workload:
- 3-5 nodes (autoscaling)
- Medium database with HA
- Redis cache for sessions
- 10-50K visitors/day
GCP Cost:
- GKE management: $74.40
- 3x e2-medium nodes: $73.35
- db-n1-standard-2 HA: $198.90
- Memorystore 5GB HA: $393.44
- Load balancer: $18.26
- Storage 1TB: $170.00
- Bandwidth 500GB: $60.00
Total: $988.35/mo
Azure Cost:
- AKS management: $0 ✅
- 3x B2s nodes: $120.36
- D2s_v3 PostgreSQL HA: $233.60
- Standard C3 Redis HA: $128.48 ✅
- Load balancer: $3.65 ✅
- Storage 1TB: $96.00 ✅
- Bandwidth 500GB: $43.80 ✅
Total: $625.89/mo
AWS Cost:
- EKS management: $73.00
- 3x t4g.small nodes: $36.87 ✅
- db.m6g.large Multi-AZ: $188.50 ✅
- cache.m6g.large Multi-AZ: $151.48
- ALB: $16.43
- Storage 1TB: $100.00
- Bandwidth 500GB: $45.00
Total: $611.28/mo ✅
Winner: AWS saves $377.07/mo vs GCP (38% cheaper)
Winner: AWS saves $14.61/mo vs Azure (2% cheaper)
Scenario 3: Enterprise (50+ Services, High Traffic)
Workload:
- 10-20 nodes (autoscaling)
- Large database cluster
- Multiple Redis instances
- 1M+ visitors/day
- Multi-region
GCP Cost:
- GKE management: $74.40
- 10x n1-standard-4 nodes: $989.10
- db-n1-standard-8 HA: $796.20
- Memorystore 100GB HA: $3,242.88
- Multiple load balancers: $60.00
- Storage 5TB: $850.00
- Bandwidth 5TB: $600.00
- CDN: $400.00
Total: $7,012.58/mo
Azure Cost:
- AKS management: $0 ✅
- 10x D4s_v3 nodes: $1,503.80
- D8s_v3 PostgreSQL HA: $935.16
- Premium P4 Redis HA: $657.00 ✅
- Multiple load balancers: $20.00 ✅
- Storage 5TB: $480.00 ✅
- Bandwidth 5TB: $438.00 ✅
- CDN: $435.00
Total: $4,468.96/mo
AWS Cost:
- EKS management: $73.00
- 10x m6g.xlarge nodes: $1,214.70 ✅
- db.m6g.4xlarge Multi-AZ: $1,129.00
- cache.r6g.xlarge Cluster: $605.92 ✅
- Multiple ALBs: $50.00
- Storage 5TB: $500.00
- Bandwidth 5TB: $450.00
- CloudFront: $425.00
Total: $4,447.62/mo ✅
Winner: AWS saves $2,564.96/mo vs GCP (37% cheaper)
Winner: AWS saves $21.34/mo vs Azure (0.5% cheaper)
Annual savings: AWS saves $30,779/year vs GCP, $256/year vs Azure
Cost Optimization Recommendations
Choose GCP if:
- Heavy sustained compute requirements (>100 vCPUs running 24/7)
- Hobby databases (db-f1-micro $7.67/mo is cheapest shared vCPU tier)
- Already using Google Cloud services (BigQuery, Cloud Run, Pub/Sub)
- Need GKE Autopilot for simplified management ($74.40/mo + per-pod pricing)
- Prefer preemptible instances for batch workloads (60-91% off), plus sustained-use discounts (up to 30%) on regular VMs
Choose Azure if:
- Cache-heavy architecture (Redis is 40-70% cheaper than GCP/AWS)
- Multiple microservices (free cluster management saves $73-74/mo vs GCP/AWS)
- Storage-intensive (blob storage 30-40% cheaper than GCP)
- Need managed Istio service mesh (AKS add-on; budgeted here at ~$150/mo of overhead)
- Load balancer-heavy architecture (Basic LB $3.65/mo is 75-100% cheaper)
- Microsoft ecosystem integration (AD/Entra, Windows containers, .NET)
- Cost optimization is top priority for small-medium clusters
Choose AWS if:
- ARM-compatible workloads (Graviton t4g/m6g instances 50-75% cheaper)
- Startup phase with standard production cluster (lowest overall cost $266.70/mo)
- Small databases with burstable requirements (db.t4g.micro $12.41/mo)
- Heavy spot instance usage (deepest spot market with most availability zones)
- Already using AWS services (Lambda, DynamoDB, S3, SQS)
- Need highest availability zones per region (up to 6 AZs vs 3 for GCP/Azure)
Hybrid Strategy:
- Run Kubernetes on AWS for overall lowest cost ($266.70/mo base)
- Use Azure Cache for Redis (huge savings: $44.53/mo vs $46.08 AWS vs $78.84 GCP)
- Use GCP Cloud SQL for hobby databases (db-f1-micro $7.67/mo)
- Store cold data on Azure Archive (75% cheaper than GCP/AWS)
- Use AWS Graviton instances where workload allows (50-75% compute savings)
Hidden Costs to Consider
| Cost Factor | GCP | Azure | AWS | Impact |
|---|---|---|---|---|
| Egress to Internet | $0.12/GB (first 1TB) | $0.087/GB | $0.09/GB | Azure 27% cheaper, AWS 25% cheaper |
| Egress between zones | $0.01/GB | $0.01/GB | $0.01/GB (same region) | Same |
| Egress between regions | $0.12/GB | $0.02/GB | $0.02/GB | Azure/AWS 83% cheaper ✅ |
| Snapshot storage | $0.026/GB | $0.05/GB | $0.05/GB | GCP 48% cheaper |
| IP addresses | $6.57/mo per static IP | $3.65/mo per public IP | $3.60/mo per Elastic IP | AWS/Azure 44% cheaper |
| NAT Gateway | $0.045/hour + $0.045/GB | $0.045/hour + $0.045/GB | $0.045/hour + $0.045/GB | Same |
| VPN Gateway | $36.50/mo | $26.28/mo | $36.50/mo | Azure 28% cheaper ✅ |
Next Steps: Building Production-Grade Platform
- GitOps with Flux CD: Automated deployments from Git commits
- Multi-Environment Strategy: Separate dev/staging/prod with namespace isolation
- Secret Management: Migrate to Azure Key Vault with CSI driver
- Database Integration: Connect to Azure Database for PostgreSQL with connection pooling
- Advanced Autoscaling: KEDA for event-driven scaling (queues, schedules, metrics)
- Disaster Recovery: Backup strategy with Velero, multi-region failover
- Compliance: Pod Security Standards enforcement, network policies, audit logging
- Developer Experience: Internal developer platform (IDP) with self-service deployment
🎯 Key Takeaways: Decision Framework for Platform Teams
When to Adopt This Architecture
✅ Good Fit:
- Microservices at scale (20+ services): Service mesh provides observability, security, and traffic management
- Multi-team organizations: Teams need isolation, independent deployment, and self-service capabilities
- Compliance requirements: mTLS, audit logging, and fine-grained access control are mandatory
- High availability needs: SLOs require 99.5%+ uptime with multi-region capabilities
- Mature DevOps culture: Team has SRE practices, GitOps, and infrastructure-as-code expertise
❌ Poor Fit:
- Monolith or <10 services: Service mesh overhead outweighs benefits
- Cost-sensitive startups: $400-500/mo base infrastructure may be too high for MVP stage
- Limited Kubernetes expertise: Steep learning curve can delay time-to-market
- Simple request/response patterns: Don't need advanced routing, retries, or circuit breaking
- Minimal traffic (<100 RPS): Serverless (Azure Functions, Container Apps) more cost-effective
Architecture Decision Summary
| Decision | Choice | Trade-off | When to Reconsider |
|---|---|---|---|
| Service Mesh | Managed Istio | +$150-235/mo, operational complexity | When service count >50 or team expertise grows |
| Node Size | Standard_D2s_v3 | Balanced cost/performance | If workloads are memory-heavy (upgrade to D4s_v3) or CPU-light (downgrade to B-series) |
| Certificate Management | cert-manager + Let's Encrypt | Free certificates, requires maintenance | If >100 certificates or need extended validation (EV) certificates, consider paid CA |
| DNS Provider | Azure DNS | $1/mo, Azure-native integration | If need advanced features (geo-routing, DNSSEC), consider Cloudflare or Route53 |
| Monitoring Stack | Prometheus + Grafana | Self-hosted, requires management | If team lacks monitoring expertise, consider Azure Monitor or Datadog |
| Deployment Strategy | Canary with Flagger | Automated progressive delivery | If team is small or risk-averse, use blue-green or manual rollouts |
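To make the canary row in the table concrete, here is a minimal sketch of a Flagger `Canary` resource. The target deployment name, ports, and thresholds are illustrative assumptions, not values from this guide's cluster:

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp               # hypothetical workload
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  service:
    port: 80
    targetPort: 8080
  analysis:
    interval: 1m            # how often Flagger evaluates metrics
    threshold: 5            # failed checks before automatic rollback
    maxWeight: 50           # cap canary traffic at 50%
    stepWeight: 10          # shift traffic in 10% increments
    metrics:
      - name: request-success-rate   # Flagger built-in metric
        thresholdRange:
          min: 99                    # roll back if success rate dips below 99%
        interval: 1m
```

Flagger drives the Istio `VirtualService` weights itself, which is why the table pairs it with the mesh rather than with plain ingress.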
Operational Maturity Roadmap
Phase 1: Foundation (Months 1-2)
- ✅ AKS cluster with managed Istio running
- ✅ Basic monitoring (Prometheus, Grafana)
- ✅ Manual deployments via kubectl
- ✅ Certificate automation with cert-manager
- ✅ Single production environment
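The Phase 1 certificate automation item typically boils down to a single `ClusterIssuer`. A minimal sketch for Let's Encrypt production follows; the email address is a placeholder, and the `istio` ingress class assumes the Istio ingress gateway serves the HTTP-01 challenge:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform-team@example.com          # hypothetical contact for expiry notices
    privateKeySecretRef:
      name: letsencrypt-prod-account-key      # ACME account key stored as a Secret
    solvers:
      - http01:
          ingress:
            class: istio                      # assumes Istio ingress handles challenges
```

Certificates referencing this issuer renew automatically roughly 30 days before expiry, provided nothing blocks cert-manager's webhook (see Incident 1 below).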
Phase 2: Automation (Months 3-4)
- 🔄 GitOps with Flux CD or ArgoCD
- 🔄 CI/CD pipelines (GitHub Actions, Azure DevOps)
- 🔄 Automated canary deployments with Flagger
- 🔄 Infrastructure-as-code (Terraform, Bicep)
- 🔄 Dev/staging environments with parity
Phase 3: Observability (Months 5-6)
- 🔲 SLO-based alerting (not just resource alerts)
- 🔲 Distributed tracing (Jaeger, Tempo)
- 🔲 Log aggregation (Loki, Azure Log Analytics)
- 🔲 Cost monitoring and attribution by team
- 🔲 Error budget tracking and reporting
Phase 4: Resilience (Months 7-9)
- 🔲 Chaos engineering (Litmus, Chaos Mesh)
- 🔲 Multi-region active-passive setup
- 🔲 Disaster recovery runbooks tested quarterly
- 🔲 Automated incident response (PagerDuty + Kubernetes)
- 🔲 Load testing in CI/CD pipeline
Phase 5: Platform Engineering (Months 10-12)
- 🔲 Internal Developer Platform (IDP) with self-service
- 🔲 Golden paths for common deployment patterns
- 🔲 Policy-as-code (OPA, Kyverno)
- 🔲 Developer productivity metrics (DORA metrics)
- 🔲 Clear separation of responsibilities between the platform team and product teams
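Policy-as-code in Phase 5 can start with a single guardrail. This is a minimal sketch of a Kyverno `ClusterPolicy` that enforces the memory-limit lesson from Incident 3 below; the policy name and message are illustrative:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce     # reject non-compliant pods at admission
  rules:
    - name: check-memory-limits
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Containers must declare memory limits."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    memory: "?*"       # any non-empty value satisfies the pattern
```

Start with `validationFailureAction: Audit` in dev to surface violations without blocking deployments, then flip to `Enforce` once teams have remediated.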
Lessons from Production Incidents
Incident 1: Certificate Expiration Outage (2 AM)
- What happened: Let's Encrypt certificate expired, HTTPS traffic failed
- Root cause: cert-manager webhook was blocked by network policy
- Impact: 30 minutes of downtime (roughly 0.07% of the month's total uptime lost)
- Prevention: Alert on certificate expiration <30 days, test renewal quarterly
- Lesson: Monitor certificate lifecycle, not just pod health
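The "alert on expiration <30 days" prevention can be expressed as a `PrometheusRule` against cert-manager's exported metrics. A minimal sketch, assuming the Prometheus Operator from the monitoring stack above and cert-manager's `certmanager_certificate_expiration_timestamp_seconds` metric:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: certificate-expiry
  namespace: monitoring
spec:
  groups:
    - name: cert-manager
      rules:
        - alert: CertificateExpiringSoon
          # fires when any certificate has under 30 days of validity left
          expr: certmanager_certificate_expiration_timestamp_seconds - time() < 30 * 24 * 3600
          for: 1h                      # avoid flapping on scrape gaps
          labels:
            severity: warning
          annotations:
            summary: "Certificate {{ $labels.name }} expires in under 30 days"
```

Because renewal normally happens around 30 days before expiry, this alert only fires when automation has already failed, which is exactly the signal you want.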
Incident 2: Istio Control Plane Upgrade Breaking Change
- What happened: Managed Istio upgraded from 1.17 to 1.18 with breaking CRD changes
- Root cause: Azure auto-upgraded during maintenance window without notification
- Impact: VirtualService configuration failed validation, 503 errors for 15 minutes
- Prevention: Pin Istio minor version, test upgrades in dev first, subscribe to release notes
- Lesson: Managed services reduce operational burden but sacrifice control
Incident 3: Pod Eviction Cascade (Out of Memory)
- What happened: Memory leak in application caused OOMKilled, cascading pod evictions
- Root cause: Missing memory limits allowed one pod to exhaust node memory
- Impact: 5 pods restarted in 2 minutes, 60-second request latency spike
- Prevention: Set memory limits, enable memory profiling, use VPA recommendations
- Lesson: Resource limits are not optional in production
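The fix from this incident is a few lines of manifest. A minimal sketch of a Deployment with the relevant `resources` block; the image name and sizing values are illustrative placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:1.0.0   # hypothetical image
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              memory: 512Mi   # a leak OOMKills this pod instead of starving the node
```

Note the deliberate omission of a CPU limit: CPU is compressible, so throttling is usually preferable to none, while a missing memory limit is what caused the cascade here.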
Incident 4: DNS Propagation Delay on New Subdomain
- What happened: New subdomain added but not resolving for 45 minutes
- Root cause: Azure DNS TTL was 3600 seconds, previous record cached
- Impact: Delayed launch announcement, no customer impact
- Prevention: Lower TTL to 300 seconds before DNS changes, wait for propagation
- Lesson: DNS is eventually consistent; plan changes ahead of announcements
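The TTL-lowering prevention step is one Azure CLI call, made at least one old-TTL interval before the actual record change. A sketch with placeholder resource group, zone, and record names (not the values from this guide's environment):

```bash
# Lower the TTL ahead of the planned change so stale caches age out quickly
az network dns record-set a update \
  --resource-group my-dns-rg \
  --zone-name example.com \
  --name app \
  --set ttl=300
```

Remember to raise the TTL back afterward; permanently low TTLs increase resolver query volume for no benefit once the record is stable.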
Incident 5: Cost Spike from Forgotten Load Test
- What happened: Load test ran overnight, autoscaler increased nodes from 3 to 30
- Root cause: No autoscaler max limit configured, test had no timeout
- Impact: $500 unexpected bill for 8 hours, no customer impact
- Prevention: Cap the cluster autoscaler's max node count and HPA maxReplicas, configure cost alerts, add timeouts to tests
- Lesson: Cost visibility is essential; autoscalers need upper bounds
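Upper bounds live in two places: the HPA caps pod replicas, and the node pool caps node count. A minimal HPA sketch with an explicit ceiling (the workload name and thresholds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3
  maxReplicas: 12      # hard ceiling, even under runaway load-test traffic
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

For the node side, the AKS node pool's cluster autoscaler bounds (`--min-count`/`--max-count` on `az aks nodepool update`) provide the equivalent ceiling; without both limits, one runaway test can scale straight through to your subscription quota.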
Final Recommendations for Platform Teams
Before You Start:
- Validate scale requirements: Do you really need a service mesh? (Teams with <20 services likely don't)
- Assess team expertise: Do you have Kubernetes/Istio experience? Budget for training
- Calculate total cost: Include management time, not just infrastructure ($444/mo + 0.5-1 FTE)
- Define success metrics: What SLOs justify this investment? What's the alternative cost?
During Implementation:
- Start small: Single service in production, validate before migrating all workloads
- Automate early: GitOps from day 1, manual kubectl is error-prone at scale
- Monitor everything: If you can't measure it, you can't improve it (observability first)
- Document runbooks: Incident response is only as good as your documentation
After Go-Live:
- Review costs weekly: First month will reveal unexpected expenses
- Conduct game days: Test failure scenarios before they happen in production
- Measure developer productivity: Platform value = features shipped / time, not uptime alone
- Iterate on processes: Retrospectives after every incident, quarterly architecture reviews
When to Choose Alternatives
| Alternative | Best For | Cost Comparison | Migration Difficulty |
|---|---|---|---|
| Azure Container Apps | <10 services, serverless workload | ~$50-200/mo (consumption-based) | Easy (similar Kubernetes API) |
| Azure App Service | Traditional web apps, minimal DevOps | ~$100-300/mo (PaaS pricing) | Easy (no Kubernetes knowledge needed) |
| Azure Functions | Event-driven, sporadic traffic | ~$0-50/mo (pay-per-execution) | Medium (code refactoring required) |
| Self-Hosted Kubernetes (VM) | Maximum control, compliance needs | ~$300-400/mo (3 VMs) | Hard (full cluster management) |
| GKE/EKS (multi-cloud) | Avoid vendor lock-in, global footprint | ~$500-700/mo (cross-cloud costs) | Medium (Kubernetes-compatible) |
Recommended Migration Path for Startups:
Phase 1: Azure Functions (MVP, <$50/mo)
↓ (Growing traffic, need more control)
Phase 2: Azure Container Apps (Scale, ~$200/mo)
↓ (10+ microservices, need service mesh)
Phase 3: AKS + Istio (This guide, ~$400-500/mo)
↓ (100+ microservices, multi-region)
Phase 4: Multi-cloud Kubernetes (Enterprise, $2000-5000/mo)
🔗 Additional Resources
Official Documentation
- Azure AKS Documentation
- Istio Documentation
- cert-manager Documentation
- Azure DNS Documentation
- Prometheus Operator
Books & Deep Dives
- "Kubernetes Patterns" by Bilgin Ibryam & Roland Huß - Reusable patterns for cloud-native apps
- "Production Kubernetes" by Josh Rosso et al. - Running Kubernetes at scale
- "The Site Reliability Workbook" by Google SRE - SLO/SLI implementation guide
- "Cloud Native DevOps with Kubernetes" by John Arundel & Justin Domingus
Community & Tools
- CNCF Landscape - Comprehensive cloud-native ecosystem map
- Awesome Kubernetes - Curated tools and resources
- Kubernetes Failure Stories - Learn from production incidents
- Learnk8s - Kubernetes training and best practices
Cost Calculators
- Azure Pricing Calculator - Estimate infrastructure costs
- KubeCost - Kubernetes cost monitoring and optimization
📝 Closing Thoughts
Building production-grade Kubernetes infrastructure is a journey, not a destination. This architecture represents months of learning, incidents, and iterations. The decisions outlined here are not universal truths—they're context-dependent trade-offs that worked for our team's scale, budget, and expertise.
The most important question isn't "Should I use Kubernetes?"—it's "What problem am I solving, and is Kubernetes the simplest solution?" For many teams, managed PaaS services (Container Apps, App Service) provide 90% of the benefits at 20% of the complexity.
But if you're building a platform for multiple teams, need fine-grained control over traffic routing, or have compliance requirements that demand mTLS and network policies, this architecture provides a solid foundation.
Key Philosophy:
- Simplicity > Features: Only add complexity when it solves a real problem
- Observability > Availability: You can't improve what you don't measure
- Automation > Documentation: Runbooks should execute themselves
- Learning > Perfection: Every incident is an opportunity to improve
Start small, automate early, and always question whether the complexity you're adding is worth the value it provides.
Questions or feedback? I'd love to hear about your Kubernetes journey—find me on Twitter or LinkedIn.
Found this guide helpful? Consider sharing it with your team or contributing improvements on GitHub.
This guide reflects production experience as of January 2025. Cloud platforms and best practices evolve rapidly—verify configurations against current documentation before implementing.