2025-03-15 · 7 min read
Terraform State Management: Remote Backends, Locking, and Workspaces Explained
Terraform state is where most teams get burned. Here's how to set up remote backends, state locking, workspace isolation, and recovery strategies that keep your infrastructure safe.
Why State Management Matters More Than You Think
Terraform state is the single most critical file in your infrastructure-as-code setup. It's the mapping between what Terraform thinks exists and what actually exists in your cloud provider. Corrupt it, lose it, or let two people write to it simultaneously, and you're in for a very bad day.
We've been called in to recover from every state disaster imaginable: state files checked into git with secrets exposed, two engineers running terraform apply at the same time and clobbering each other's changes, and state files that simply vanished because someone ran rm in the wrong directory. Every one of these is preventable with the right setup.
Remote Backends: Stop Storing State Locally
If your state file lives on someone's laptop, you have a single point of failure with legs. Remote backends solve this by storing state in a shared, durable location.
The most common setup for AWS shops is S3 with DynamoDB for locking:
# backend.tf
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "infrastructure/production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}
Set up the S3 bucket with versioning enabled. This is non-negotiable. Versioning gives you the ability to recover previous state versions if something goes wrong:
# Create the state bucket with versioning
aws s3api create-bucket \
--bucket mycompany-terraform-state \
--region us-east-1
aws s3api put-bucket-versioning \
--bucket mycompany-terraform-state \
--versioning-configuration Status=Enabled
# Create the lock table
aws dynamodb create-table \
--table-name terraform-state-lock \
--attribute-definitions AttributeName=LockID,AttributeType=S \
--key-schema AttributeName=LockID,KeyType=HASH \
--billing-mode PAY_PER_REQUEST
For GCP teams, the equivalent is a GCS backend:
terraform {
  backend "gcs" {
    bucket = "mycompany-terraform-state"
    prefix = "infrastructure/production"
  }
}
GCS handles locking natively, so you don't need a separate lock table.
Securing the State Bucket
Your state file contains sensitive data: connection strings, IAM role ARNs, and resource IDs that could be useful to an attacker. Lock down the bucket:
# state-bucket.tf — run this once, separately
resource "aws_s3_bucket" "terraform_state" {
  bucket = "mycompany-terraform-state"

  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
Yes, there's a chicken-and-egg problem here: you need somewhere to store the state for the Terraform that creates your state bucket. We usually bootstrap this with local state, apply once, then migrate to the remote backend with terraform init -migrate-state.
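That bootstrap sequence looks roughly like this (the `bootstrap/` directory name is illustrative):

```shell
# 1. In a separate bootstrap directory with NO backend block,
#    create the bucket and lock table using local state.
cd bootstrap/
terraform init
terraform apply

# 2. Add the backend "s3" block to backend.tf, then migrate
#    the local state into the new remote backend.
terraform init -migrate-state

# 3. Once the remote copy is confirmed authoritative, remove
#    the local state files so nobody applies against them.
rm terraform.tfstate terraform.tfstate.backup
```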
State Locking: Preventing Concurrent Writes
State locking prevents two people (or two CI pipelines) from running terraform apply at the same time. Without it, you get race conditions that can leave your state inconsistent with reality.
If you're using the S3 backend with DynamoDB, locking is automatic: every plan and apply acquires a lock and releases it when done. If a lock is already held, subsequent runs wait or fail with a clear message.
Occasionally, locks get stuck. Someone's laptop dies mid-apply, or a CI runner gets terminated. You can manually release a stuck lock:
# Find the lock ID from the error message, then:
terraform force-unlock LOCK_ID
Use force-unlock sparingly and only when you're certain no other operation is running. If you're wrong, you're back to the race condition you were trying to avoid.
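Before forcing, you can check who actually holds the lock: the backend writes lock metadata (operation, user, timestamp) into the DynamoDB table. A quick look, assuming the table name from earlier:

```shell
# Show current lock entries; the Info attribute records which
# operation, user, and timestamp acquired the lock.
aws dynamodb scan \
  --table-name terraform-state-lock \
  --projection-expression "LockID, Info"
```

If the Info entry points at a CI runner that you know was terminated, the unlock is safe; if it points at a colleague, go talk to them first.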
CI/CD and Locking
In CI pipelines, we recommend a pattern where terraform plan runs on pull requests and terraform apply runs only on merge to main. This naturally serializes applies through your merge queue:
# .github/workflows/terraform.yml
jobs:
  plan:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform plan -no-color -out=plan.tfplan
      - name: Post plan to PR
        uses: actions/github-script@v7
        with:
          script: |
            // Post the plan output as a PR comment

  apply:
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    runs-on: ubuntu-22.04
    concurrency:
      group: terraform-apply
      cancel-in-progress: false
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform apply -auto-approve
The concurrency block is critical. It ensures only one apply job runs at a time, and cancel-in-progress: false means an in-flight apply is never killed by a newer one. One caveat: GitHub Actions keeps at most one pending run per concurrency group, so intermediate queued applies get superseded by the newest merge. That's fine here, because applying the latest main converges on the same end state.
Workspaces vs. Directory Structure
Terraform workspaces let you manage multiple environments (dev, staging, production) from the same configuration. The alternative is separate directories per environment. Both approaches work. Both have tradeoffs.
Workspaces
terraform workspace new staging
terraform workspace new production
terraform workspace select staging
terraform apply -var-file="staging.tfvars"
# main.tf
resource "aws_instance" "app" {
  instance_type = var.instance_type

  tags = {
    Environment = terraform.workspace
  }
}
Workspaces share the same backend but create separate state files. They're good when your environments are structurally identical and differ only in sizing or configuration values.
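When environments differ only in sizing, a common pattern is to key those values off terraform.workspace. A sketch (the map contents and variable name are illustrative):

```hcl
variable "instance_types" {
  type = map(string)
  default = {
    staging    = "t3.small"
    production = "m5.large"
  }
}

locals {
  # Pick sizing by workspace; the fallback keeps ad-hoc
  # workspaces (feature branches, experiments) cheap.
  instance_type = lookup(var.instance_types, terraform.workspace, "t3.micro")
}
```

The resource then references local.instance_type instead of a hardcoded value.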
Directory Structure
infrastructure/
  modules/
    vpc/
    ecs-cluster/
    rds/
  environments/
    dev/
      main.tf
      backend.tf
      terraform.tfvars
    staging/
      main.tf
      backend.tf
      terraform.tfvars
    production/
      main.tf
      backend.tf
      terraform.tfvars
Each environment has its own backend configuration, its own state file, and its own main.tf that composes shared modules. This gives you full isolation and the ability to have structural differences between environments.
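An environment's main.tf stays thin, just composing the shared modules. A sketch, with module inputs and outputs (cidr_block, vpc_id, and so on) assumed for illustration:

```hcl
# environments/production/main.tf
module "vpc" {
  source     = "../../modules/vpc"
  cidr_block = "10.0.0.0/16"
}

module "rds" {
  source         = "../../modules/rds"
  vpc_id         = module.vpc.vpc_id
  instance_class = "db.r6g.large"
  multi_az       = true # structural difference: dev simply omits this
}
```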
Our Recommendation
We almost always recommend the directory structure approach for teams managing real production infrastructure. Workspaces are fine for simple setups, but they fall apart when your production environment has resources that don't exist in dev (like a WAF, a more complex networking setup, or compliance-related logging). The directory approach lets each environment evolve independently while sharing logic through modules.
State Recovery: When Things Go Wrong
Despite best practices, state issues happen. Here's how to recover.
Recovering a previous state version from S3:
# List state versions
aws s3api list-object-versions \
--bucket mycompany-terraform-state \
--prefix infrastructure/production/terraform.tfstate
# Download a specific version
aws s3api get-object \
--bucket mycompany-terraform-state \
--key infrastructure/production/terraform.tfstate \
--version-id "abc123previousversion" \
recovered-state.tfstate
# Push the recovered state
terraform state push recovered-state.tfstate
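Before pushing a recovered state, snapshot whatever is currently in the backend so the recovery itself is reversible:

```shell
# Save the current (possibly bad) remote state before overwriting it.
terraform state pull > pre-recovery-backup.tfstate

# Push the recovered file, then verify with a plan; an empty plan
# means state and reality agree again.
terraform state push recovered-state.tfstate
terraform plan
```

Note that terraform state push will refuse a state file whose serial is older than the remote copy; terraform state push -force overrides that check, so use it deliberately.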
Importing existing resources into state:
If a resource exists in your cloud provider but not in state (maybe someone created it manually, or it was removed from state accidentally):
terraform import aws_instance.app i-0123456789abcdef0
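On Terraform 1.5+, the same import can also be expressed declaratively with an import block, which has the advantage of showing up in the plan before anything touches state:

```hcl
# Reviewed in the PR like any other change; remove the block
# after the import has been applied.
import {
  to = aws_instance.app
  id = "i-0123456789abcdef0"
}
```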
Removing a resource from state without destroying it:
terraform state rm aws_instance.legacy_server
This is useful when you're migrating a resource to a different Terraform configuration or taking it out of IaC management entirely.
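On Terraform 1.7+, there's a declarative equivalent here too: a removed block drops the resource from state on the next apply without destroying the real infrastructure:

```hcl
# Terraform forgets this resource but leaves the instance running.
removed {
  from = aws_instance.legacy_server

  lifecycle {
    destroy = false
  }
}
```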
The Patterns That Keep You Safe
After managing Terraform state for dozens of organizations, here's what we always set up on day one:
- Remote backend with versioning — always, no exceptions
- State locking — DynamoDB for AWS, native for GCS
- Encryption at rest — KMS for S3, default for GCS
- Least-privilege access — only CI/CD and senior engineers can write state
- Separate state per environment — via directories, not workspaces
- Regular state backups — S3 versioning handles this, but verify it works
- Plan on PR, apply on merge — serialized through CI concurrency controls
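For the least-privilege item, one sketch is a bucket policy that denies state writes from any principal except the CI role (var.ci_role_arn is an assumed input; real setups usually allow a break-glass role as well):

```hcl
data "aws_iam_policy_document" "state_writes" {
  statement {
    sid    = "DenyStateWritesExceptCI"
    effect = "Deny"

    principals {
      type        = "*"
      identifiers = ["*"]
    }

    actions   = ["s3:PutObject", "s3:DeleteObject"]
    resources = ["${aws_s3_bucket.terraform_state.arn}/*"]

    condition {
      test     = "StringNotEquals"
      variable = "aws:PrincipalArn"
      values   = [var.ci_role_arn]
    }
  }
}

resource "aws_s3_bucket_policy" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  policy = data.aws_iam_policy_document.state_writes.json
}
```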
None of this is cutting-edge. It's just the boring, reliable infrastructure work that keeps your team from having a bad weekend.
Need help getting your Terraform state management right? We've untangled state messes for companies of all sizes and we can get yours sorted quickly. Let's talk.