2025-04-15 · 8 min read
Zero-Downtime Deployment Strategies Compared: Blue-Green vs Canary vs Rolling
Blue-green, canary, and rolling deployments all promise zero downtime — but they have very different tradeoffs in complexity, cost, and rollback speed. Here's how to choose the right one for your team.
You Shouldn't Have to Schedule Deployments
If your team still has a "deployment window" — a specific time when you push code because it's the least risky — that's a sign your deployment process needs work. Modern deployment strategies make it possible to ship code at 2 PM on a Tuesday with the same confidence you'd have at 3 AM on a Sunday.
We've implemented all three major zero-downtime strategies for our clients. Each one works. None of them is universally "best." The right choice depends on your traffic patterns, infrastructure, team size, and risk tolerance. Let's break them down honestly.
Rolling Deployments
Rolling deployments replace instances of your application one (or a few) at a time. At any given moment during the deployment, some instances are running the old version and some are running the new version.
How It Works
If you have four instances of your API, a rolling deployment might:
- Take instance 1 out of the load balancer
- Deploy the new version to instance 1
- Health check instance 1
- Put instance 1 back in the load balancer
- Repeat for instances 2, 3, and 4
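The loop above can be sketched in a few lines of Python. This is a toy model — `versions` maps instance names to running versions, and the deploy and health-check steps are stand-ins for your real provisioning and readiness probes (the function names are illustrative, not a real API):

```python
# Toy model of a rolling deployment: drain, deploy, probe, re-add — one
# instance at a time, so the pool is never more than one instance short.
def rolling_deploy(versions, new_version, health_check):
    in_pool = set(versions)                 # instances receiving traffic
    for name in sorted(versions):
        in_pool.discard(name)               # take out of the load balancer
        versions[name] = new_version        # deploy the new version
        if not health_check(name):          # probe before re-adding
            raise RuntimeError(f"{name} failed its health check; rollout halted")
        in_pool.add(name)                   # back into rotation
    return in_pool

versions = {f"api-{i}": "v1" for i in range(1, 5)}
rolling_deploy(versions, "v2", lambda name: True)
print(versions)  # every instance now runs v2, upgraded one at a time
```

Note that a failed health check halts the rollout immediately, leaving the remaining instances on the old version — which is exactly the containment you want.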
In Kubernetes, this is the default deployment strategy:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 4
  selector:
    matchLabels:
      app: api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: ghcr.io/yourorg/api:v2.0.0
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
The maxUnavailable: 1 means at most one pod can be down during the update. maxSurge: 1 means Kubernetes can create one extra pod above the desired count to speed things up. The readiness probe ensures traffic isn't sent to a pod until it's actually ready to handle requests.
Tradeoffs
Pros:
- Simplest to set up, especially on Kubernetes where it's the default
- No extra infrastructure required
- Resource-efficient — you don't need a full duplicate environment
Cons:
- During deployment, two versions are serving traffic simultaneously. If v2 introduces a breaking database schema change, you'll have v1 instances failing against the new schema
- Rollback requires another rolling update, which takes the same amount of time as the deployment
- Harder to test the new version with real traffic before full rollout
Best for: Teams that deploy frequently, have good backward compatibility practices, and want the simplest possible setup. This is where most teams should start.
Blue-Green Deployments
Blue-green maintains two identical production environments. One (let's say "blue") serves all traffic. When you deploy, you deploy to the idle environment ("green"), verify it works, then switch traffic from blue to green instantly.
How It Works
              ┌──────────────┐
              │ Load Balancer│
              └──────┬───────┘
                     │
         ┌───────────┴───────────┐
         │                       │
┌────────▼────────┐    ┌─────────▼────────┐
│   Blue (v1.0)   │    │   Green (v2.0)   │
│  LIVE TRAFFIC   │    │  IDLE / TESTING  │
└─────────────────┘    └──────────────────┘
After verification, you flip the switch:
              ┌──────────────┐
              │ Load Balancer│
              └──────┬───────┘
                     │
         ┌───────────┴───────────┐
         │                       │
┌────────▼────────┐    ┌─────────▼────────┐
│   Blue (v1.0)   │    │   Green (v2.0)   │
│      IDLE       │    │   LIVE TRAFFIC   │
└─────────────────┘    └──────────────────┘
In AWS, this can be implemented with target groups on an ALB:
#!/bin/bash
# deploy-blue-green.sh
set -euo pipefail

NEW_TG="arn:aws:elasticloadbalancing:...:targetgroup/green/..."
OLD_TG="arn:aws:elasticloadbalancing:...:targetgroup/blue/..."
LISTENER="arn:aws:elasticloadbalancing:...:listener/..."

# Deploy new version to green target group
# ... (update ECS service, ASG, etc.)

# Wait for health checks to pass
aws elbv2 wait target-in-service \
  --target-group-arn "$NEW_TG"

# Switch traffic
aws elbv2 modify-listener \
  --listener-arn "$LISTENER" \
  --default-actions Type=forward,TargetGroupArn="$NEW_TG"

echo "Traffic switched to green. Blue is now idle."
echo "To roll back: aws elbv2 modify-listener --listener-arn $LISTENER --default-actions Type=forward,TargetGroupArn=$OLD_TG"
On Kubernetes, you can implement blue-green with two Deployments and a Service that switches its selector:
# Blue deployment (currently live)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-blue
spec:
  replicas: 4
  selector:
    matchLabels:
      app: api
      version: blue
  template:
    metadata:
      labels:
        app: api
        version: blue
    spec:
      containers:
        - name: api
          image: ghcr.io/yourorg/api:v1.0.0
---
# Service pointing to blue
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
    version: blue   # Change to "green" to switch
  ports:
    - port: 8080
Deploy the green version, verify it, then update the Service selector from version: blue to version: green. Instant cutover.
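Conceptually, the cutover is nothing more than a label match. A toy Python model (illustrative only, not the Kubernetes API) makes the mechanics — and the instant rollback — clear:

```python
# The "Service" is just a selector; traffic goes to whichever deployment's
# labels match it. Flipping one field moves all traffic at once.
def route(selector, deployments):
    return next(d["name"] for d in deployments
                if all(d["labels"].get(k) == v for k, v in selector.items()))

deployments = [
    {"name": "api-blue",  "labels": {"app": "api", "version": "blue"}},
    {"name": "api-green", "labels": {"app": "api", "version": "green"}},
]
selector = {"app": "api", "version": "blue"}
print(route(selector, deployments))   # api-blue

selector["version"] = "green"         # the cutover: one field changes
print(route(selector, deployments))   # api-green
# Rollback is the same one-field change in reverse.
```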
Tradeoffs
Pros:
- Instant rollback — just switch traffic back to the old environment
- You can fully test the new version in production (with production data, production config) before any users see it
- No mixed-version traffic during deployment
Cons:
- Doubles your infrastructure cost (you need two full environments running)
- Database migrations are tricky — both environments typically share a database, so migrations must be backward-compatible
- More complex to set up and maintain
Best for: Teams where rollback speed is critical, where you can afford the extra infrastructure cost, and where you need to validate the new version thoroughly before switching traffic.
Canary Deployments
Canary deployments route a small percentage of traffic to the new version while the majority continues hitting the old version. You gradually increase the percentage as you gain confidence, watching metrics the whole time.
How It Works
A typical canary rollout might look like:
- Deploy new version alongside the old one
- Route 5% of traffic to the new version
- Monitor error rates, latency, and business metrics for 10 minutes
- If metrics are healthy, increase to 25%
- Monitor for another 10 minutes
- Increase to 50%, then 100%
- If metrics degrade at any step, route 100% back to the old version
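The staged loop above can be sketched in Python. This is a sketch, assuming `set_weight` and `metrics_healthy` wrap your real traffic router and monitoring queries — both names are illustrative:

```python
import time

def canary_rollout(set_weight, metrics_healthy, steps=(5, 25, 50), bake_seconds=600):
    """Shift traffic to the canary in stages; bail out if metrics degrade."""
    for weight in steps:
        set_weight(weight)              # route `weight`% to the canary
        time.sleep(bake_seconds)        # let metrics accumulate
        if not metrics_healthy():
            set_weight(0)               # route everything back to stable
            return "rolled back"
    set_weight(100)                     # full promotion
    return "promoted"
```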
With Nginx, you can do weighted traffic splitting:
upstream api {
    server api-stable:8080 weight=95;
    server api-canary:8080 weight=5;
}
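A quick simulation (illustrative Python, separate from the Nginx config) shows what that 95/5 weighting means in practice — and hints at why low-traffic services struggle to get a meaningful canary sample:

```python
import random

random.seed(42)  # deterministic for the sake of the example
hits = {"api-stable": 0, "api-canary": 0}
for _ in range(10_000):
    backend = random.choices(["api-stable", "api-canary"], weights=[95, 5])[0]
    hits[backend] += 1
print(hits)  # roughly 9,500 stable / 500 canary
```

At 10,000 requests the canary sees about 500; at 10 requests per hour it would see one every other hour, which is why the statistical-significance caveat below matters.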
In Kubernetes with Argo Rollouts, canary deployments become declarative:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 4
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 10m }
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100
      canaryService: api-canary
      stableService: api-stable
      trafficRouting:
        nginx:
          stableIngress: api-ingress
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: ghcr.io/yourorg/api:v2.0.0
The real power comes when you add automated analysis. Argo Rollouts can integrate with Prometheus, Datadog, or other monitoring tools to automatically promote or roll back based on metrics:
steps:
  - setWeight: 5
  - pause: { duration: 10m }
  - analysis:
      templates:
        - templateName: error-rate-check
      args:
        - name: service
          value: api
  - setWeight: 50
  - pause: { duration: 10m }
  - analysis:
      templates:
        - templateName: error-rate-check
      args:
        - name: service
          value: api
With automated analysis, the deployment promotes itself if error rates stay below your threshold, and rolls back automatically if they don't. No human in the loop required.
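The decision an analysis step makes boils down to a threshold check. Here's a minimal sketch — the 1% threshold and the raw counts are illustrative, and in Argo Rollouts these numbers would come from your metrics provider rather than function arguments:

```python
def analysis_verdict(errors, requests, max_error_rate=0.01):
    """Promote if the observed error rate stays at or under the threshold."""
    if requests == 0:
        return "inconclusive"           # not enough traffic to judge
    rate = errors / requests
    return "promote" if rate <= max_error_rate else "rollback"

print(analysis_verdict(3, 1000))    # promote  (0.3% error rate)
print(analysis_verdict(50, 1000))   # rollback (5% error rate)
```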
Tradeoffs
Pros:
- Limits the blast radius of a bad deployment to a small percentage of traffic
- Metrics-driven promotion gives you confidence that's based on data, not hope
- Can be fully automated with tools like Argo Rollouts or Flagger
Cons:
- Most complex to set up — requires traffic splitting infrastructure, monitoring integration, and analysis rules
- Requires enough traffic for statistical significance (if you get 10 requests per hour, a 5% canary split is meaningless)
- Mixed-version traffic, similar to rolling deployments — backward compatibility matters
Best for: High-traffic services where a bad deployment could affect thousands of users, teams with mature monitoring and observability, and organizations that deploy very frequently and need automated safety nets.
Decision Matrix
| Factor | Rolling | Blue-Green | Canary |
|---|---|---|---|
| Setup complexity | Low | Medium | High |
| Infrastructure cost | Low | High (2x) | Medium |
| Rollback speed | Slow (minutes) | Instant | Fast (seconds) |
| Mixed-version traffic | Yes | No | Yes (controlled) |
| Minimum traffic needed | Any | Any | High |
| Best Kubernetes tool | Built-in | Service selector | Argo Rollouts |
Our Recommendation
Start with rolling deployments. They're built into Kubernetes, they require no extra tooling, and they work fine for the vast majority of teams. Invest your time in good health checks, backward-compatible database migrations, and solid monitoring.
When you outgrow rolling deployments — typically because rollback speed becomes critical or because you need more control over traffic routing — move to blue-green if rollback speed is your priority, or canary if blast radius control is your priority.
Don't let perfect be the enemy of good. A team deploying ten times a day with rolling deployments is in a better position than a team deploying once a week with a sophisticated canary setup they don't fully understand.
Need help implementing zero-downtime deployments for your infrastructure? We've set up all three strategies for teams at every scale. Let's figure out what's right for your team.