kernel_panic

2026-02-10 · 4 min read

Securing AI Workloads in Kubernetes: A Practical Guide

Running ML models and AI pipelines in Kubernetes introduces unique security challenges. Here's how to lock them down without killing your team's velocity.

AI in Production Is a Different Beast

Everyone's shipping AI features now. The problem is that most teams bolt GPU workloads onto existing Kubernetes clusters with the same security posture they use for a CRUD API — and that's not going to cut it.

AI workloads are different. They're resource-hungry, they handle sensitive training data, they pull massive model artifacts from external registries, and they often run with elevated privileges for GPU access. Each of those is an attack surface.

Here's how to secure them without grinding your ML team to a halt.

GPU Node Isolation

Don't run AI workloads on the same nodes as your application tier. GPU nodes should be isolated with taints and tolerations:

# Taint your GPU nodes
kubectl taint nodes gpu-node-1 nvidia.com/gpu=true:NoSchedule

# In your AI workload deployment
spec:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  nodeSelector:
    node.kubernetes.io/instance-type: "g5.xlarge"

This keeps non-GPU workloads off expensive GPU nodes and, more importantly, keeps a compromised web pod out of your model training environment.
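To sanity-check that the taint actually took effect (node name from the example above):

```shell
# List taints on the GPU node; expect nvidia.com/gpu=true:NoSchedule
kubectl get node gpu-node-1 -o jsonpath='{.spec.taints}'
```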

Network Policies Are Non-Negotiable

AI pipelines often need to pull model weights from S3, HuggingFace, or internal registries. That doesn't mean they need unrestricted egress. Lock it down:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ai-pipeline-egress
  namespace: ml-workloads
spec:
  podSelector:
    matchLabels:
      app: model-inference
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.0.0/8  # Internal services only
    - to:
        - namespaceSelector:
            matchLabels:
              name: model-registry
      ports:
        - port: 443

If your inference service doesn't need to talk to the internet, it shouldn't be able to. Period.
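The cleanest way to enforce that is a default-deny egress policy for the whole namespace, with allow rules like the one above layered on top (policy name is illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: ml-workloads
spec:
  podSelector: {}    # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress: []         # nothing allowed unless another policy permits it
```

One gotcha: a full egress deny also blocks DNS, so you'll typically need an additional rule allowing port 53 to kube-dns or pods won't resolve anything.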

Secrets and API Keys

LLM deployments are a magnet for API keys — OpenAI, Anthropic, vector database credentials, embedding service tokens. These need proper secrets management:

  • Use an external secrets operator (AWS Secrets Manager, Vault) instead of Kubernetes secrets
  • Rotate keys on a schedule, not "when we remember"
  • Scope API keys to the minimum required permissions
  • Never bake keys into container images or model artifacts

Here's what that looks like pulling keys from AWS Secrets Manager via the External Secrets Operator:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: ai-api-keys
  namespace: ml-workloads
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: ai-api-keys
  data:
    - secretKey: OPENAI_API_KEY
      remoteRef:
        key: /prod/ai/openai-key

Model Artifact Integrity

When you pull a 70B parameter model from a registry, how do you know it hasn't been tampered with? Supply chain attacks on model artifacts are a real and growing threat.

  • Sign your model artifacts using cosign or Notary
  • Use digest-based references instead of tags (just like container images)
  • Run a private model registry for production models instead of pulling from public sources at runtime
  • Scan model files for embedded code — pickle files in particular can execute arbitrary Python on load
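As a sketch, signing a model file with cosign and verifying it in the deploy pipeline might look like this (file and key names are placeholders):

```shell
# Sign the model artifact and write a detached signature
cosign sign-blob --key cosign.key --output-signature model.safetensors.sig model.safetensors

# In the deploy pipeline: refuse to load the model unless the signature checks out
cosign verify-blob --key cosign.pub --signature model.safetensors.sig model.safetensors
```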

RBAC for ML Teams

Your data scientists don't need cluster-admin. Create scoped roles:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: ml-workloads
  name: ml-engineer
rules:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "list", "delete"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list"]

Give them what they need to submit training jobs and check logs, not the ability to modify network policies or access secrets from other namespaces.
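Binding that role to your ML team (the group name is an assumption about how your SSO/OIDC groups are mapped):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ml-engineer-binding
  namespace: ml-workloads
subjects:
  - kind: Group
    name: ml-engineers    # mapped from your identity provider's groups
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: ml-engineer
  apiGroup: rbac.authorization.k8s.io
```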

Data Protection

Training data is often the most sensitive asset in your cluster. If you're fine-tuning on customer data, PII, or proprietary datasets:

  • Encrypt volumes at rest (EBS encryption, KMS)
  • Use ephemeral volumes for training that get destroyed after the job completes
  • Implement audit logging on all data access
  • Consider running training jobs in isolated namespaces with strict network policies and no egress
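The ephemeral-volume point can be sketched with a generic ephemeral volume, which provisions a PVC for the job and deletes it when the pod goes away (the image and StorageClass names are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: fine-tune-job
  namespace: ml-workloads
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: registry.internal/trainer:latest   # hypothetical image
          volumeMounts:
            - name: scratch
              mountPath: /data
      volumes:
        - name: scratch
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes: ["ReadWriteOnce"]
                storageClassName: encrypted-gp3     # assumes an encrypted StorageClass
                resources:
                  requests:
                    storage: 200Gi
```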

Runtime Security

AI containers often need more privileges than typical workloads (GPU device access, shared memory for PyTorch). That makes runtime monitoring even more important:

  • Use Falco or a similar runtime security tool to detect anomalous behavior
  • Alert on unexpected process execution inside AI containers
  • Monitor for cryptocurrency miners — GPU nodes are a prime target
  • Set resource limits to prevent a single job from consuming the entire node

Resource limits on the container spec cover that last point:
resources:
  limits:
    nvidia.com/gpu: 1
    memory: "32Gi"
    cpu: "8"
  requests:
    nvidia.com/gpu: 1
    memory: "16Gi"
    cpu: "4"
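And a minimal Falco rule for the miner case might look like this (the process list is illustrative, not exhaustive):

```yaml
- rule: Miner Launched in Container
  desc: Detect known cryptominer binaries spawning inside containers
  condition: spawned_process and container and proc.name in (xmrig, minerd, ethminer)
  output: "Possible miner (command=%proc.cmdline container=%container.name image=%container.image.repository)"
  priority: CRITICAL
  tags: [mitre_execution, ml-workloads]
```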

The Bottom Line

AI workloads amplify every existing infrastructure security concern and add new ones. The teams that get this right treat AI infrastructure as a distinct security domain — with its own network boundaries, access controls, and monitoring.

The teams that get it wrong end up with a $50K/month GPU bill mining crypto for someone in Eastern Europe.


Running AI workloads in Kubernetes and want to make sure you're not leaving the door open? Let's talk.
