kernel_panic

2025-04-15 · 8 min read

Zero-Downtime Deployment Strategies Compared: Blue-Green vs Canary vs Rolling

Blue-green, canary, and rolling deployments all promise zero downtime — but they have very different tradeoffs in complexity, cost, and rollback speed. Here's how to choose the right one for your team.

You Shouldn't Have to Schedule Deployments

If your team still has a "deployment window" — a specific time when you push code because it's the least risky — that's a sign your deployment process needs work. Modern deployment strategies make it possible to ship code at 2 PM on a Tuesday with the same confidence you'd have at 3 AM on a Sunday.

We've implemented all three major zero-downtime strategies for our clients. Each one works. None of them is universally "best." The right choice depends on your traffic patterns, infrastructure, team size, and risk tolerance. Let's break them down honestly.

Rolling Deployments

Rolling deployments replace instances of your application one (or a few) at a time. At any given moment during the deployment, some instances are running the old version and some are running the new version.

How It Works

If you have four instances of your API, a rolling deployment might:

  1. Take instance 1 out of the load balancer
  2. Deploy the new version to instance 1
  3. Health check instance 1
  4. Put instance 1 back in the load balancer
  5. Repeat for instances 2, 3, and 4

In Kubernetes, this is the default deployment strategy:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  template:
    spec:
      containers:
        - name: api
          image: ghcr.io/yourorg/api:v2.0.0
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10

With maxUnavailable: 1, at most one pod can be unavailable during the update; maxSurge: 1 lets Kubernetes create one extra pod above the desired replica count to speed things up. The readiness probe ensures traffic isn't sent to a pod until it's actually ready to handle requests.
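Those two knobs amount to a pair of invariants: total pods never exceed replicas + maxSurge, and available pods never drop below replicas - maxUnavailable. Here's a toy simulation of a rolling update under those constraints — not Kubernetes' actual controller logic, and it assumes every new pod passes its readiness probe immediately:

```python
# Rough simulation of how maxUnavailable/maxSurge bound a rolling update.
# Assumes maxUnavailable and maxSurge are not both zero (Kubernetes
# rejects that configuration too).

def rolling_update(replicas: int, max_unavailable: int, max_surge: int):
    old, new_ready = replicas, 0
    steps = []
    while new_ready < replicas:
        # Surge: create new pods, up to replicas + max_surge total.
        creating = min(replicas + max_surge - (old + new_ready),
                       replicas - new_ready)
        new_ready += creating  # assume they become ready immediately
        # Terminate old pods, keeping available >= replicas - max_unavailable.
        available = old + new_ready
        removable = min(old, available - (replicas - max_unavailable))
        old -= removable
        steps.append((old, new_ready))
    return steps

for old, new in rolling_update(replicas=4, max_unavailable=1, max_surge=1):
    print(f"old={old} new={new}")
# old=2 new=1
# old=0 new=3
# old=0 new=4
```

Note that at every step the total stays within 3–5 pods: the update makes progress without ever violating either bound.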

Tradeoffs

Pros:

  • Simplest to set up, especially on Kubernetes where it's the default
  • No extra infrastructure required
  • Resource-efficient — you don't need a full duplicate environment

Cons:

  • During deployment, two versions are serving traffic simultaneously. If v2 introduces a breaking database schema change, you'll have v1 instances failing against the new schema
  • Rollback requires another rolling update, which takes the same amount of time as the deployment
  • Harder to test the new version with real traffic before full rollout

Best for: Teams that deploy frequently, have good backward compatibility practices, and want the simplest possible setup. This is where most teams should start.

Blue-Green Deployments

Blue-green maintains two identical production environments. One (let's say "blue") serves all traffic. When you deploy, you deploy to the idle environment ("green"), verify it works, then switch traffic from blue to green instantly.

How It Works

                    ┌──────────────┐
                    │ Load Balancer│
                    └──────┬───────┘
                           │
              ┌────────────┼────────────┐
              │                         │
     ┌────────▼────────┐     ┌─────────▼────────┐
     │   Blue (v1.0)   │     │  Green (v2.0)    │
     │   LIVE TRAFFIC  │     │  IDLE / TESTING  │
     └─────────────────┘     └──────────────────┘

After verification, you flip the switch:

                    ┌──────────────┐
                    │ Load Balancer│
                    └──────┬───────┘
                           │
              ┌────────────┼────────────┐
              │                         │
     ┌────────▼────────┐     ┌─────────▼────────┐
     │   Blue (v1.0)   │     │  Green (v2.0)    │
     │      IDLE       │     │   LIVE TRAFFIC   │
     └─────────────────┘     └──────────────────┘

In AWS, this can be implemented with target groups on an ALB:

#!/bin/bash
# deploy-blue-green.sh
set -euo pipefail

NEW_TG="arn:aws:elasticloadbalancing:...:targetgroup/green/..."
OLD_TG="arn:aws:elasticloadbalancing:...:targetgroup/blue/..."
LISTENER="arn:aws:elasticloadbalancing:...:listener/..."

# Deploy new version to green target group
# ... (update ECS service, ASG, etc.)

# Wait for the green targets to pass health checks
aws elbv2 wait target-in-service \
  --target-group-arn "$NEW_TG"

# Switch all traffic to green
aws elbv2 modify-listener \
  --listener-arn "$LISTENER" \
  --default-actions "Type=forward,TargetGroupArn=$NEW_TG"

echo "Traffic switched to green. Blue is now idle."
echo "To roll back: aws elbv2 modify-listener --listener-arn $LISTENER --default-actions Type=forward,TargetGroupArn=$OLD_TG"

On Kubernetes, you can implement blue-green with two Deployments and a Service that switches its selector:

# Blue deployment (currently live)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-blue
spec:
  replicas: 4
  selector:
    matchLabels:
      app: api
      version: blue
  template:
    metadata:
      labels:
        app: api
        version: blue
    spec:
      containers:
        - name: api
          image: ghcr.io/yourorg/api:v1.0.0
---
# Service pointing to blue
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
    version: blue  # Change to "green" to switch
  ports:
    - port: 8080

Deploy the green version, verify it, then update the Service selector from version: blue to version: green. Instant cutover.
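The selector switch can be modeled as a single pointer change. This toy sketch (names and versions are illustrative, not a Kubernetes API) shows why cutover and rollback are both instant — every request routes through one field:

```python
# Toy model of the Service-selector cutover: the "service" is just a
# pointer to whichever environment its selector currently matches.

class Service:
    def __init__(self, selector: str):
        self.selector = selector
        self.environments = {}  # label -> deployed version

    def register(self, label: str, version: str):
        self.environments[label] = version

    def route(self) -> str:
        # Every request goes to the environment the selector points at.
        return self.environments[self.selector]

svc = Service(selector="blue")
svc.register("blue", "v1.0.0")
svc.register("green", "v2.0.0")  # deployed, receiving no traffic yet

print(svc.route())               # v1.0.0 -- blue is live

svc.selector = "green"           # the cutover: one field change
print(svc.route())               # v2.0.0 -- all traffic moves at once

svc.selector = "blue"            # rollback is the same single change
print(svc.route())               # v1.0.0
```

The flip side of this atomicity: there's no gradual ramp. The moment the selector changes, 100% of traffic hits the new version.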

Tradeoffs

Pros:

  • Instant rollback — just switch traffic back to the old environment
  • You can fully test the new version in production (with production data, production config) before any users see it
  • No mixed-version traffic during deployment

Cons:

  • Doubles your infrastructure cost (you need two full environments running)
  • Database migrations are tricky — both environments typically share a database, so migrations must be backward-compatible
  • More complex to set up and maintain
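The shared-database caveat is usually handled with the expand/contract pattern: add the new schema first, deploy code that uses it, and only drop the old schema once no running version depends on it. A minimal sketch using an in-memory SQLite database (table and column names are illustrative):

```python
# Expand/contract: a backward-compatible rename of full_name -> display_name.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT)")
db.execute("INSERT INTO users (full_name) VALUES ('Ada Lovelace')")

# Expand: add the new column. Old code ignores it; new code can use it.
db.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Backfill: copy existing data so new code reads consistent values.
db.execute("UPDATE users SET display_name = full_name "
           "WHERE display_name IS NULL")

# Contract (much later, after every old instance is gone): drop full_name.
# Deferred here precisely because blue and green share this database.

row = db.execute("SELECT display_name FROM users").fetchone()
print(row[0])  # Ada Lovelace
```

At no point during the cutover does either environment see a schema it can't handle — which is the property that makes an instant traffic switch safe.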

Best for: Teams where rollback speed is critical, where you can afford the extra infrastructure cost, and where you need to validate the new version thoroughly before switching traffic.

Canary Deployments

Canary deployments route a small percentage of traffic to the new version while the majority continues hitting the old version. You gradually increase the percentage as you gain confidence, watching metrics the whole time.

How It Works

A typical canary rollout might look like:

  1. Deploy new version alongside the old one
  2. Route 5% of traffic to the new version
  3. Monitor error rates, latency, and business metrics for 10 minutes
  4. If metrics are healthy, increase to 25%
  5. Monitor for another 10 minutes
  6. Increase to 50%, then 100%
  7. If metrics degrade at any step, route 100% back to the old version
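The steps above boil down to a simple control loop: ramp the weight, check health at each stage, and abort to zero on any failure. A sketch, where health_check stands in for your real metrics query (error rate, latency, business KPIs):

```python
# Staged canary rollout as a plain control loop.

def run_canary(stages, health_check) -> int:
    """Walk traffic through `stages` (percent to canary); abort to 0 on failure."""
    weight = 0
    for target in stages:
        weight = target
        if not health_check(weight):
            return 0  # metrics degraded: route everything back to stable
    return weight     # reached the final stage: canary fully promoted

healthy = lambda weight: True
flaky   = lambda weight: weight <= 25  # degrades once past 25% of traffic

print(run_canary([5, 25, 50, 100], healthy))  # 100
print(run_canary([5, 25, 50, 100], flaky))    # 0
```

The flaky case illustrates why staged ramps matter: a bug that only shows up under load gets caught at 50%, not at 100%.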

With Nginx, you can do weighted traffic splitting:

upstream api {
    server api-stable:8080 weight=95;
    server api-canary:8080 weight=5;
}
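Nginx implements those weights as a deterministic weighted round-robin; the effect can be approximated with random sampling. A quick sanity check that 95:5 weights produce roughly a 5% canary share (backend names mirror the upstream config above):

```python
# Approximate the 95:5 nginx split with weighted random sampling.
import random

def pick_backend(rng: random.Random) -> str:
    # random.choices honors relative weights, like the nginx upstream config
    return rng.choices(["api-stable", "api-canary"], weights=[95, 5])[0]

rng = random.Random(42)  # seeded for reproducibility
hits = {"api-stable": 0, "api-canary": 0}
for _ in range(10_000):
    hits[pick_backend(rng)] += 1

print(hits)  # roughly 9500 stable / 500 canary
```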

In Kubernetes with Argo Rollouts, canary deployments become declarative:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 4
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 10m }
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100
      canaryService: api-canary
      stableService: api-stable
      trafficRouting:
        nginx:
          stableIngress: api-ingress
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: ghcr.io/yourorg/api:v2.0.0

The real power comes when you add automated analysis. Argo Rollouts can integrate with Prometheus, Datadog, or other monitoring tools to automatically promote or roll back based on metrics:

      steps:
        - setWeight: 5
        - pause: { duration: 10m }
        - analysis:
            templates:
              - templateName: error-rate-check
            args:
              - name: service
                value: api
        - setWeight: 50
        - pause: { duration: 10m }
        - analysis:
            templates:
              - templateName: error-rate-check
            args:
              - name: service
                value: api

With automated analysis, the deployment promotes itself if error rates stay below your threshold, and rolls back automatically if they don't. No human in the loop required.
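At its core, an analysis step is just a decision function over metrics. A minimal stand-in for what a template like the error-rate-check above might compute — the 1% threshold and the "inconclusive" guard for low traffic are illustrative choices, not Argo defaults:

```python
# Promote/rollback decision based on the canary's observed error rate.

def analyze(requests: int, errors: int, threshold: float = 0.01) -> str:
    if requests == 0:
        return "inconclusive"  # not enough traffic to judge either way
    rate = errors / requests
    return "promote" if rate <= threshold else "rollback"

print(analyze(requests=2000, errors=8))    # promote  (0.4% error rate)
print(analyze(requests=2000, errors=120))  # rollback (6% error rate)
```

Real analysis templates query this from Prometheus or Datadog on an interval and require several consecutive passing measurements, but the decision logic is this simple.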

Tradeoffs

Pros:

  • Limits the blast radius of a bad deployment to a small percentage of traffic
  • Metrics-driven promotion gives you confidence that's based on data, not hope
  • Can be fully automated with tools like Argo Rollouts or Flagger

Cons:

  • Most complex to set up — requires traffic splitting infrastructure, monitoring integration, and analysis rules
  • Requires enough traffic for statistical significance (if you get 10 requests per hour, a 5% canary split is meaningless)
  • Mixed-version traffic, similar to rolling deployments — backward compatibility matters

Best for: High-traffic services where a bad deployment could affect thousands of users, teams with mature monitoring and observability, and organizations that deploy very frequently and need automated safety nets.

Decision Matrix

Factor                 | Rolling        | Blue-Green       | Canary
-----------------------|----------------|------------------|-----------------
Setup complexity       | Low            | Medium           | High
Infrastructure cost    | Low            | High (2x)        | Medium
Rollback speed         | Slow (minutes) | Instant          | Fast (seconds)
Mixed-version traffic  | Yes            | No               | Yes (controlled)
Minimum traffic needed | Any            | Any              | High
Best Kubernetes tool   | Built-in       | Service selector | Argo Rollouts

Our Recommendation

Start with rolling deployments. They're built into Kubernetes, they require no extra tooling, and they work fine for the vast majority of teams. Invest your time in good health checks, backward-compatible database migrations, and solid monitoring.

When you outgrow rolling deployments — typically because rollback speed becomes critical or because you need more control over traffic routing — move to blue-green if rollback speed is your priority, or canary if blast radius control is your priority.

Don't let perfect be the enemy of good. A team deploying ten times a day with rolling deployments is in a better position than a team deploying once a week with a sophisticated canary setup they don't fully understand.


Need help implementing zero-downtime deployments for your infrastructure? We've set up all three strategies for teams at every scale. Let's figure out what's right for your team.
