kernel_panic

2026-02-03 · 5 min read

AI Infrastructure Costs Are Out of Control: Here's How to Fix It

GPU instances are expensive. Most teams are wasting 40-60% of their AI compute budget. Practical strategies for cutting costs without slowing down your ML pipeline.

The $100K/Month GPU Problem

Here's a pattern I keep seeing: a startup ships an AI feature, spins up some GPU instances to run inference, and three months later someone notices the AWS bill has quietly tripled. By the time anyone looks closely, they're spending $50-100K/month on compute, and nobody can explain exactly why.

GPU instances are 10-20x more expensive than general compute. The cost optimization strategies that work for regular workloads — rightsizing, reserved instances, auto-scaling — still apply, but the stakes are much higher and the mistakes are more expensive.

Start With Visibility

You can't optimize what you can't see. Before changing anything, answer these questions:

  • Which GPU instance types are you running?
  • What's the actual GPU utilization? (Not CPU — GPU specifically)
  • How many hours per day are your GPUs actually processing requests?
  • Are you running inference, training, or both?
  • What's the per-request cost of your AI features?

# Check GPU utilization on your nodes
kubectl exec -it <gpu-pod> -- nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used,memory.total --format=csv -l 5

Most teams find their GPU utilization is below 30%. That means 70% of their GPU spend is wasted.
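The arithmetic behind that claim is worth making explicit. A minimal sketch, where the bill and utilization figures are placeholders you should replace with your own numbers:

```python
# Back-of-the-envelope GPU waste estimate.
# The $60K bill and 25% utilization below are illustrative, not measured values.
def wasted_spend(monthly_gpu_bill: float, avg_gpu_utilization: float) -> float:
    """Dollars per month spent on idle GPU capacity."""
    idle_fraction = 1.0 - avg_gpu_utilization
    return monthly_gpu_bill * idle_fraction

# A $60K/month GPU bill at 25% average utilization:
print(wasted_spend(60_000, 0.25))  # 45000.0
```

Run the same calculation against your actual nvidia-smi numbers before deciding where to optimize first.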

Right-Size Your Instances

The most common mistake is running the biggest GPU instance available "just to be safe." A p4d.24xlarge with 8 A100 GPUs costs $32/hour. If your model runs fine on a single A10G (g5.xlarge at $1/hour), you're overspending by 32x.

Profile your model's actual requirements:

  • VRAM usage: How much GPU memory does your model need? A 7B parameter model typically needs 14-28GB depending on quantization
  • Compute throughput: How many requests per second do you need?
  • Latency requirements: What's your acceptable P99 latency?
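The VRAM line item is simple arithmetic: parameter count times bytes per parameter, before KV cache and activation overhead. A quick estimator, using the standard byte widths for each precision:

```python
# Rough VRAM estimate for model weights alone.
# Excludes KV cache, activations, and framework overhead, so leave headroom.
# bytes_per_param: 4.0 for FP32, 2.0 for FP16/BF16, 1.0 for INT8, 0.5 for 4-bit.
def weight_vram_gb(params_billions: float, bytes_per_param: float) -> float:
    # params (billions) x bytes/param -- the factors of 1e9 cancel
    return params_billions * bytes_per_param

print(weight_vram_gb(7, 2))  # 14.0 -- FP16, fits a 24GB A10G with headroom
print(weight_vram_gb(7, 4))  # 28.0 -- FP32, does not fit a 24GB A10G
```

This is where the "14-28GB for a 7B model" range comes from, and why quantization directly changes which instance you can run on.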

Then match to the right instance:

Use Case                Instance Type    GPU                    Cost/hr
Small model inference   g5.xlarge        1x A10G (24GB)         ~$1.00
Medium model inference  g5.2xlarge       1x A10G (24GB)         ~$1.21
Large model inference   p4d.24xlarge     8x A100 (40GB)         ~$32.77
Fine-tuning             g5.12xlarge      4x A10G (96GB total)   ~$5.67

Spot Instances for Training

Training jobs are interruptible. If your training jobs checkpoint properly (and they should), spot instances can cut training costs by 60-90%:

# Karpenter provisioner for spot GPU instances
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-spot-training
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["g5.xlarge", "g5.2xlarge", "g5.4xlarge"]
    - key: nvidia.com/gpu
      operator: Exists
  limits:
    resources:
      nvidia.com/gpu: "8"
  ttlSecondsAfterEmpty: 60

Key rules for spot GPU training:

  • Checkpoint frequently (every 15-30 minutes)
  • Use multiple instance type options for better spot availability
  • Implement graceful shutdown handlers that save state on interruption
  • Keep on-demand as a fallback for deadline-critical training runs

Scale Inference to Zero

If your AI feature isn't used 24/7, why are GPUs running 24/7? Scale inference to zero during off-hours:

  • Use KEDA (Kubernetes Event-Driven Autoscaler) to scale based on queue depth or request rate
  • For HTTP inference, use Knative or a custom HPA with request-based metrics
  • Accept a cold-start penalty (30-60 seconds to load a model) during low-traffic periods

A service that handles 1000 requests during business hours and 10 requests overnight doesn't need the same capacity at 3 AM.
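Whichever autoscaler you pick, the core decision it makes is the same: map current demand to a replica count that is allowed to reach zero. A sketch of that mapping, where the target of 50 queued requests per replica and the cap of 8 replicas are assumed tuning values, not recommendations:

```python
import math

def desired_replicas(queue_depth: int, target_per_replica: int, max_replicas: int) -> int:
    # Scale to zero when there is no work; otherwise provision one replica
    # per `target_per_replica` queued requests, capped at max_replicas.
    if queue_depth == 0:
        return 0
    return min(max_replicas, math.ceil(queue_depth / target_per_replica))

print(desired_replicas(0, 50, 8))     # 0 -- overnight, no GPUs running
print(desired_replicas(1000, 50, 8))  # 8 -- business-hours load, capped
```

KEDA implements essentially this loop for you from external metrics like queue depth; the sketch just makes the math visible.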

Model Optimization

Before throwing hardware at the problem, make the model itself more efficient:

Quantization

Converting a model from FP32 to INT8 can reduce memory usage by 4x and improve throughput by 2-3x with minimal quality loss:

# Example with ONNX Runtime
import onnxruntime as ort

session_options = ort.SessionOptions()
session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# Use quantized model
session = ort.InferenceSession(
    "model_quantized_int8.onnx",
    session_options,
    providers=['CUDAExecutionProvider']
)

Batching

Process multiple requests together instead of one at a time. Dynamic batching can improve GPU utilization from 20% to 80%+:

  • Use Triton Inference Server's built-in dynamic batching
  • Or implement a simple request queue that batches every 50ms
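A minimal version of that 50ms batching queue, assuming requests arrive on a standard `queue.Queue` (the window and batch size are illustrative tuning values):

```python
import queue
import time

def collect_batch(q: "queue.Queue", max_batch: int, window_s: float = 0.05):
    """Drain up to max_batch requests, waiting at most window_s for more."""
    batch = [q.get()]  # block until at least one request arrives
    deadline = time.monotonic() + window_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch  # run one GPU forward pass over the whole batch
```

The trade-off is explicit: each request waits up to 50ms longer so the GPU processes many requests per kernel launch instead of one.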

Model Distillation

If you're running a 70B model but a fine-tuned 7B model would give you 95% of the quality, that's a 10x cost reduction. Test smaller models before assuming you need the biggest one.

Reserved Capacity and Savings Plans

For baseline inference workloads that run 24/7, reserved instances or savings plans are a no-brainer:

  • 1-year reserved: ~40% discount
  • 3-year reserved: ~60% discount
  • Savings Plans (AWS): Commit to a $/hour spend, flexible across instance types

Only reserve what you know you'll use as a baseline. Use on-demand or spot for burst capacity.
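The savings are easy to size once you know your baseline. A quick estimate, using the approximate discount rates above (actual rates vary by instance family, region, and payment option):

```python
# Illustrative only: ~40% (1-year) and ~60% (3-year) are rough AWS figures.
def annual_savings(on_demand_hourly: float, discount: float) -> float:
    hours_per_year = 24 * 365
    return on_demand_hourly * discount * hours_per_year

# One g5.xlarge (~$1.00/hr) running 24/7 on a 1-year reservation:
print(round(annual_savings(1.00, 0.40)))  # 3504
```

Multiply by your baseline instance count to see what an unreserved steady-state fleet is costing you.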

The Optimization Playbook

In order of effort vs. impact:

  1. Turn off what you're not using (immediate, biggest savings)
  2. Right-size instances to actual model requirements
  3. Use spot for training jobs
  4. Implement autoscaling for inference
  5. Optimize models (quantization, batching, distillation)
  6. Reserve baseline capacity once you know your steady-state needs

Most teams can cut their AI infrastructure costs by 40-60% with just the first three steps.


GPU bill getting out of hand? I've cut cloud costs in half before and can do it again. Let's talk about your AI infrastructure.
