2026-02-10 · 4 min read
Securing AI Workloads in Kubernetes: A Practical Guide
Running ML models and AI pipelines in Kubernetes introduces unique security challenges. Here's how to lock them down without killing your team's velocity.
AI in Production Is a Different Beast
Everyone's shipping AI features now. The problem is that most teams bolt GPU workloads onto existing Kubernetes clusters with the same security posture they use for a CRUD API — and that's not going to cut it.
AI workloads are different. They're resource-hungry, they handle sensitive training data, they pull massive model artifacts from external registries, and they often run with elevated privileges for GPU access. Each of those is an attack surface.
Here's how to secure them without grinding your ML team to a halt.
GPU Node Isolation
Don't run AI workloads on the same nodes as your application tier. GPU nodes should be isolated with taints and tolerations:
# Taint your GPU nodes
kubectl taint nodes gpu-node-1 nvidia.com/gpu=true:NoSchedule
# In your AI workload deployment
spec:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  nodeSelector:
    node.kubernetes.io/instance-type: "g5.xlarge"
This prevents non-GPU workloads from landing on expensive GPU nodes, and more importantly, keeps a compromised application pod from ever sharing a node — and a kernel — with your model training environment.
Network Policies Are Non-Negotiable
AI pipelines often need to pull model weights from S3, HuggingFace, or internal registries. That doesn't mean they need unrestricted egress. Lock it down:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ai-pipeline-egress
  namespace: ml-workloads
spec:
  podSelector:
    matchLabels:
      app: model-inference
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.0.0/8  # Internal services only
    - to:
        - namespaceSelector:
            matchLabels:
              name: model-registry
      ports:
        - port: 443
If your inference service doesn't need to talk to the internet, it shouldn't be able to. Period.
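The cleanest way to get there is a default-deny egress policy for the whole namespace, with allow-list policies like the one above layered on top. A minimal sketch:

```yaml
# Deny all egress for every pod in ml-workloads by default;
# explicit allow policies then open specific paths.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: ml-workloads
spec:
  podSelector: {}  # empty selector = every pod in the namespace
  policyTypes:
    - Egress
```

One practical caveat: with default-deny in place you'll almost always need an additional rule allowing DNS (port 53 to kube-dns), or pods can't resolve anything at all.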
Secrets and API Keys
LLM deployments are a magnet for API keys — OpenAI, Anthropic, vector database credentials, embedding service tokens. These need proper secrets management:
- Use an external secrets operator backed by a real store (AWS Secrets Manager, Vault) rather than hand-managed Kubernetes Secrets
- Rotate keys on a schedule, not "when we remember"
- Scope API keys to the minimum required permissions
- Never bake keys into container images or model artifacts
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: ai-api-keys
  namespace: ml-workloads
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: ai-api-keys
  data:
    - secretKey: OPENAI_API_KEY
      remoteRef:
        key: /prod/ai/openai-key
Model Artifact Integrity
When you pull a 70B parameter model from a registry, how do you know it hasn't been tampered with? Supply chain attacks on model artifacts are a real and growing threat.
- Sign your model artifacts using cosign or Notary
- Use digest-based references instead of tags (just like container images)
- Run a private model registry for production models instead of pulling from public sources at runtime
- Scan model files for embedded code — pickle files in particular can execute arbitrary Python on load
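Signing and verifying a model artifact with cosign looks roughly like this — key file names and the artifact path are illustrative:

```shell
# Generate a signing keypair once (or use keyless signing with an OIDC identity)
cosign generate-key-pair

# Sign the artifact and store the signature alongside it
cosign sign-blob --key cosign.key \
  --output-signature model.safetensors.sig model.safetensors

# In the deploy pipeline: refuse to load the model unless verification succeeds
cosign verify-blob --key cosign.pub \
  --signature model.safetensors.sig model.safetensors
```

The verify step belongs in the pipeline that pulls the artifact, not in a human runbook — if verification fails, the job fails.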
RBAC for ML Teams
Your data scientists don't need cluster-admin. Create scoped roles:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: ml-workloads
  name: ml-engineer
rules:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "list", "delete"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list"]
Give them what they need to submit training jobs and check logs, not the ability to modify network policies or access secrets from other namespaces.
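The Role does nothing until it's bound to people. A minimal RoleBinding, assuming an ml-engineers group coming from your identity provider:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ml-engineer-binding
  namespace: ml-workloads
subjects:
  - kind: Group
    name: ml-engineers  # hypothetical group mapped from your OIDC/SSO provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: ml-engineer
  apiGroup: rbac.authorization.k8s.io
```

Because it's a namespaced RoleBinding, the grant stops at the ml-workloads boundary — there's no path from here to secrets in other namespaces.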
Data Protection
Training data is often the most sensitive asset in your cluster. If you're fine-tuning on customer data, PII, or proprietary datasets:
- Encrypt volumes at rest (EBS encryption, KMS)
- Use ephemeral volumes for training data so the storage is destroyed when the job completes
- Implement audit logging on all data access
- Consider running training jobs in isolated namespaces with strict network policies and no egress
Runtime Security
AI containers often need more privileges than typical workloads (GPU device access, shared memory for PyTorch). That makes runtime monitoring even more important:
- Use Falco or a similar runtime security tool to detect anomalous behavior
- Alert on unexpected process execution inside AI containers
- Monitor for cryptocurrency miners — GPU nodes are a prime target
- Set resource limits to prevent a single job from consuming the entire node
resources:
  limits:
    nvidia.com/gpu: 1
    memory: "32Gi"
    cpu: "8"
  requests:
    nvidia.com/gpu: 1
    memory: "16Gi"
    cpu: "4"
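To make the Falco point concrete, here's a sketch of a rule that flags interactive shells inside inference containers — the image repository prefix is an assumption, and you'd tune the condition to your own base images:

```yaml
# Falco rule: alert when a shell starts inside a model-inference container
- rule: Shell in AI Container
  desc: A shell was spawned in an inference container, which should not happen in normal operation
  condition: >
    spawned_process and container
    and container.image.repository startswith "registry.internal.example/model-inference"
    and proc.name in (bash, sh, zsh)
  output: "Shell in AI container (user=%user.name container=%container.name command=%proc.cmdline)"
  priority: WARNING
```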
The Bottom Line
AI workloads amplify every existing infrastructure security concern and add new ones. The teams that get this right treat AI infrastructure as a distinct security domain — with its own network boundaries, access controls, and monitoring.
The teams that get it wrong end up with a $50K/month GPU bill mining crypto for someone in Eastern Europe.
Running AI workloads in Kubernetes and want to make sure you're not leaving the door open? Let's talk.