Running Containers and Kubernetes on GPU Virtual Machines

Introduction: Why Containers + Kubernetes on GPU VMs Matter

The rapid growth of AI and machine learning has fundamentally changed how infrastructure is designed and managed. Traditional VM-centric models, where applications are tightly coupled to long-lived virtual machines, are no longer sufficient for modern, GPU-intensive workloads. Today, organizations are shifting toward container-native architectures orchestrated by platforms like Kubernetes, running on GPU-enabled virtual machines. This combination provides the scalability, flexibility, and performance required for AI at scale.

Shift from VM-Centric to Container-Native AI Workloads

In the traditional model:

Applications were deployed directly onto virtual machines.
Each VM was configured manually with specific drivers, libraries, and dependencies.
Scaling required provisioning additional VMs, often leading to overprovisioning or underutilization.

AI workloads intensified the limitations of this model:

GPU drivers and CUDA versions had to be carefully matched.
ML frameworks required specific runtime environments.
Reproducibility became challenging across development, testing, and production.

Containers changed this paradigm:

Applications are packaged with all dependencies.
Environments are consistent across stages.
Infrastructure becomes abstracted and declarative.
Scaling is automated through orchestration.

With Kubernetes, workloads are scheduled dynamically across clusters. Instead of treating VMs as the unit of deployment, containers become the unit of execution. GPU-enabled VMs simply become resource pools that Kubernetes manages intelligently.

Benefits of Combining GPUs with Containers

Running containers on GPU VMs offers a powerful combination of performance and agility:

Efficient GPU Utilization: Kubernetes can schedule workloads onto nodes with available GPUs, allocate GPUs as resources (e.g., 1, 2, or fractional GPUs depending on configuration) and prevent resource contention across workloads. This ensures GPUs, often the most expensive infrastructure component, are fully utilized.
Environment Isolation: Containers encapsulate CUDA, cuDNN, and framework versions, avoid dependency conflicts and allow multiple teams to share the same GPU cluster safely. For example, one team may run PyTorch with CUDA 12, while another uses TensorFlow with CUDA 11 without conflict.
Portability Across Cloud Providers: Whether running on GPU VMs in AWS, Azure, or on-premises environments, containerized workloads remain portable. Kubernetes provides a consistent control plan across infrastructure providers. This reduces vendor lock-in and simplifies hybrid or multi-cloud strategies.
Elastic Scaling for AI Workloads: AI workloads are rarely steady training jobs spike during experimentation, Inference scales with traffic and batch processing may run overnight.

Kubernetes enables horizontal scaling of inference services, auto-scaling GPU nodes and job-based scheduling for large training runs.

Faster Deployment and CI/CD Integration: Containers integrate naturally with CI/CD pipelines that make models can be packaged as container images, versioning is simplified and rollbacks are immediate. This shortens the path from model development to production deployment.

Common Use Cases

Model Training: Large scale distributed training jobs require multiple GPUs, often span multiple nodes and need scheduling guarantees. Kubernetes supports distributed training frameworks (e.g., MPI, PyTorch distributed) and manages GPU allocation efficiently. Training jobs can be run as batch workloads and terminated automatically upon completion.
Model Inference: Production inference workloads demand low latency, must scale horizontally and require high availability. GPU-backed inference services can run as Kubernetes Deployments with autoscaling enabled. As traffic increases, additional pods are scheduled on available GPU nodes.
Batch AI Processing:

Examples: Video processing, Data labeling automation, Large-scale embeddings generation, and Scientific simulations

These jobs run intermittently, consume significant GPU resources and benefit from queue-based scheduling. Kubernetes Jobs and CronJobs allow efficient orchestration of these workloads.

1. Why This Architecture Is Becoming the Standard

The convergence of Containerization, GPU acceleration and Cloud-native orchestration has created a new default for AI infrastructure. Instead of provisioning GPU VMs manually and treating them as isolated silos, organizations now build shared GPU clusters managed by Kubernetes. This enables better cost efficiency, faster experimentation, scalable production deployment and improved resource governance.

2. GPU Virtual Machines vs Container-Native GPUs

Role of GPU VMs in Containerized Environments

GPU Virtual Machines (VMs) provide the hardware foundation for running GPU workloads in the cloud. In a containerized environment, platforms like Kubernetes schedule containers on these GPU-enabled VMs.

In simple terms:

The GPU VM provides the actual GPU hardware.
The container packages the application.
The orchestrator (like Kubernetes) manages and schedules workloads.

GPU VMs act as resource pools. Containers request GPU resources, and the system assigns them from available GPU VMs. This allows multiple AI workloads (training, inference, batch jobs) to share the same infrastructure efficiently.

Bare Metal vs GPU VM for Container Workloads

Bare Metal (Physical Server with GPU):

Direct access to hardware
Slightly better raw performance
No virtualization layer
Requires manual setup and maintenance
Higher upfront cost

GPU Virtual Machine:

Runs on virtualized infrastructure
Slightly small performance overhead
Easier to scale up or down
Faster deployment
Managed by cloud provider

For most container workloads, GPU VMs are preferred because they provide flexibility and easier scaling, even if bare metal offers slightly higher performance.

When GPU VMs Make More Sense Than On-Prem GPUs

GPU VMs are usually better when:

You need quick setup without buying hardware
Workloads are temporary or variable
You want pay-as-you-go pricing
You need scaling on demand
Your team does not want to manage physical hardware

On-prem GPUs make sense when:

Workloads are constant and predictable
You already own hardware
You need maximum performance control
Data must remain strictly on-site

In most modern AI environments, GPU VMs combined with containers provide better flexibility, scalability, and operational simplicity compared to traditional on-prem GPU setups.

3. Prerequisites for Running Containers on GPU VMs

Before running containers with GPU support, some basic requirements must be met. GPUs do not work automatically inside containers. Proper drivers and software must be installed on the host VM first.

Supported GPU Models and Drivers

Not all GPUs support compute workloads.

You need GPUs that support CUDA, such as:

NVIDIA Tesla series
NVIDIA A100, V100
NVIDIA T4
NVIDIA RTX (some models)

Cloud providers usually offer GPU VMs with supported NVIDIA data center GPUs.

Always check the GPU model, supported driver version and CUDA compatibility.

Check GPU Model: lshw -c video

NVIDIA Driver Requirements

The NVIDIA driver must be installed on the GPU VM (host system).

Important points:

Driver must match the GPU model
Driver version must support the required CUDA version
Drivers must be installed on the host, not inside the container

You can check if the driver is working by running: nvidia-smi

If installed correctly, it will show GPU model, Driver version and GPU memory usage. If this command fails, containers will not detect the GPU.

CUDA and cuDNN Compatibility Basics

CUDA is required for GPU compute workloads. cuDNN is used for deep learning acceleration.

Key compatibility rule:

NVIDIA Driver ≥ CUDA version required
Container CUDA version must be supported by host driver

Example: If the container uses CUDA 12, the host driver must support CUDA 12.

Mismatch between driver version, CUDA version and Deep learning framework can cause runtime errors. Always check NVIDIA’s CUDA compatibility matrix before deployment.

Verify cuDNN Version: dpkg -l | grep cudnn

Verify cuDNN Version

Check Installed CUDA Toolkit Version: nvcc –version

OS Considerations (Ubuntu, AlmaLinux, etc.)

The operating system must support NVIDIA drivers,,Docker or container runtime and NVIDIA Container Toolkit.

Common supported OS options:

Ubuntu (most popular and recommended)
AlmaLinux
Rocky Linux
CentOS (older deployments)

Ubuntu is widely used because of better documentation, strong community support and easier NVIDIA installation process.

Make sure the kernel version supports the GPU driver and the system is fully updated before installing drivers.

Confirm Kernel Is Supported by NVIDIA Driver: lspci | grep -i nvidia

4. GPU Enablement for Containers

By default, containers do not have access to the GPU. Extra configuration is required.

Why GPUs Don’t Work in Containers by Default

Containers are isolated environments. They do not automatically have access to Host hardware, GPU devices and NVIDIA drivers. Even if the host has a GPU, containers cannot use it unless it is explicitly exposed. This isolation improves security but requires special configuration for GPU workloads.

NVIDIA Container Toolkit Overview

The NVIDIA Container Toolkit allows containers to use GPUs. It connects container runtime with NVIDIA drivers, passes GPU devices into the container and mounts required driver libraries inside the container. Without this toolkit, GPU-based containers will fail. After installation, Docker can run GPU-enabled containers using:

docker run –gpus all image_name

How GPU Pass-Through Works for Containers

GPU pass-through allows the container to access /dev/nvidia* device files, NVIDIA driver libraries and GPU hardware resources.

The host driver remains installed on the VM. The container uses those drivers via shared libraries. The GPU is not virtualized. It is directly shared with the container. Kubernetes works similarly by advertising GPUs as resources and scheduling pods that request GPUs.

Verifying GPU Access Inside a Container

After setup, test GPU access.

Run a test container: docker run –gpus all nvidia/cuda:11.1.1-cudnn8-runtime nvidia-smi

If configured correctly, you will see the GPU name, Driver version and CUDA version. This confirms the container can access the GPU.

If it fails check the NVIDIA driver, check the NVIDIA Container Toolkit, and restart Docker service.

5. Running GPU Workloads with Docker

Docker is one of the easiest ways to run GPU workloads on a GPU VM. Once your NVIDIA drivers and NVIDIA Container Toolkit are installed, you can start running AI containers.

Installing Docker on GPU VMs

On most Linux systems (like Ubuntu), install Docker using:

sudo apt update

sudo apt install ca-certificates curl

sudo install -m 0755 -d /etc/apt/keyrings

sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc

sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:

sudo tee /etc/apt/sources.list.d/docker.sources <<EOF

Types: deb

URIs: https://download.docker.com/linux/ubuntu

Suites: $(. /etc/os-release && echo “${UBUNTU_CODENAME:-$VERSION_CODENAME}”)

Components: stable

Signed-By: /etc/apt/keyrings/docker.asc

EOF

sudo apt update

sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

Then enable and start Docker:

sudo systemctl enable docker

sudo systemctl start docker

sudo systemctl status docker

Verify installation:

docker –version

Docker must be installed before configuring GPU support.

Configuring Docker to Use GPUs

After installing NVIDIA driver and NVIDIA Container Toolkit restart Docker:

sudo systemctl restart docker

Now Docker can access GPUs through the NVIDIA runtime.

Test if NVIDIA runtime is available: docker info | grep -i nvidia

Test GPU access:

docker run –gpus all -it nvidia/cuda:13.1.1-cudnn-devel-ubuntu24.04 nvidia-smi

If it shows GPU details, the configuration is correct.

Common Docker Images for AI/ML Workloads

Popular GPU-ready images include NVIDIA CUDA base images, PyTorch images, TensorFlow images and Jupyter notebook images with GPU support. These images already include CUDA libraries and required dependencies. They reduce setup time and prevent compatibility issues.

Container Runtime Performance Considerations

When running GPU containers:

GPU performance is near-native
Minimal virtualization overhead
CPU and disk speed still matter
Network performance impacts distributed training

Best practices use SSD storage, allocate enough CPU and RAM and avoid running too many GPU containers on one GPU. Proper resource planning improves performance.

Container Images for GPU Workloads

Choosing the right container image is important for stability and performance.

Using Official NVIDIA CUDA Images

Official CUDA images are provided by NVIDIA and include CUDA runtime, Development libraries and Compatible drivers.

https://hub.docker.com/r/nvidia/cuda

They are good base images for Custom AI applications, Framework installations and Testing GPU access, They ensure CUDA compatibility with drivers.

Framework-Specific Images (PyTorch, TensorFlow)

Instead of building from scratch, you can use ready-made framework images such as PyTorch with CUDA support or TensorFlow GPU versions.

These images already include CUDA, cuDNN and Framework libraries. They are useful for Training deep learning models, Running inference and Research experiments. They save time and reduce setup errors.

Custom Image Best Practices

When building your own GPU image Use official CUDA base image, Install only required packages, Pin specific versions of libraries and Keep image clean and minimal. Avoid installing unnecessary tools. Smaller images start faster, use less storage, deploy quicker.

Image Size, Layers, and Startup Performance

Container images are built in layers. Each command in a Dockerfile creates a new layer. Best practices combine commands where possible, remove temporary files and use slim base images.

Large images take longer to pull, increase deployment time and slow down scaling. Optimized images improve startup speed, especially in Kubernetes clusters.

7. Kubernetes Architecture for GPU Workloads

Kubernetes allows you to manage multiple GPU workloads across nodes in a cluster.

How Kubernetes Schedules GPU Resources?

Kubernetes treats GPUs as special resources. When a pod requests a GPU Kubernetes finds a node with available GPU, schedules the pod there and reserves the GPU for that pod. This prevents resource conflicts.

GPU as an Extended Resource

In Kubernetes, GPUs are not standard CPU or memory resources. They are added as extended resources, such as:

nvidia.com/gpu

When defining a pod:

resources:

limits:

nvidia.com/gpu: 1

This tells Kubernetes the pod requires one GPU.

Node-Level GPU Allocation

GPUs are attached to specific nodes. Kubernetes does not split GPUs like CPU cores (unless using special GPU partitioning like MIG).

This means one pod typically gets one full GPU, GPU cannot be shared unless configured and scheduling depends on available GPUs per node. If no GPU is free, the pod remains pending.

Differences Between CPU and GPU Scheduling

CPU scheduling:

CPUs are shared
Can allocate fractions (e.g., 0.5 CPU)
Multiple pods share CPU cores

GPU scheduling:

GPUs are allocated as whole units
Typically not shared by default
More strict allocation rules

Because GPUs are expensive and limited, Kubernetes scheduling must carefully manage them.

8. NVIDIA Device Plugin for Kubernetes

Why the Device Plugin Is Required?

Kubernetes does not detect GPUs automatically. Even if the node has a GPU and drivers installed, Kubernetes cannot use it without a special component.

The NVIDIA Device Plugin allows Kubernetes to discover GPUs on each node, advertise them as resources and assign them to pods. Without this plugin, GPU scheduling will not work.

How It Exposes GPUs to Kubernetes

The device plugin runs on every GPU node. It communicates with the NVIDIA driver, detects available GPUs and registers them as a resource like: nvidia.com/gpu

Kubernetes then knows how many GPUs are available per node.

Installation Approaches

DaemonSet (Common Method): The plugin runs as a DaemonSet. This ensures one instance runs on every GPU node.
Helm Chart: Helm simplifies installation and upgrades. It is useful in production clusters.

Both methods achieve the same goal, exposing GPUs to Kubernetes.

Verifying GPU Availability in Pods

After installation check nodes: kubectl describe node <node-name>

You should see: nvidia.com/gpu: 1

Run a test pod requesting a GPU and execute: nvidia-smi

If GPU details appear, the setup is correct.

9. Scheduling and Resource Management

Requesting GPUs in Pod Specs

In your pod YAML file:

resources:

limits:

nvidia.com/gpu: 1

This tells Kubernetes the pod requires one GPU.

GPU Limits vs Requests

For GPUs only limits are typically used, requests must equal limits. Unlike CPUs, GPUs cannot be overcommitted by default.

Preventing GPU Oversubscription

Kubernetes prevents multiple pods from using the same GPU unless special sharing is configured.

This ensures stable performance, no resource conflicts and fair scheduling.

Multi-Tenant GPU Isolation

In shared clusters use namespaces, apply resource quotas and use node selectors or taints. This separates workloads between teams and prevents misuse.

10. Running AI/ML Workloads on Kubernetes

Training Jobs vs Inference Services

Training Jobs heavy GPU usage, run for hours or days and often use multiple GPUs.

Inference Services serve predictions, low latency required and scale based on traffic.

Batch Jobs, CronJobs, and Long-Running Pods

Jobs → Run once and exit
CronJobs → Run on schedule
Deployments → Long-running services

Training is often run as Jobs. Inference is usually run as Deployments.

Model-Serving Patterns (REST / gRPC)

Common approaches REST API and gRPC services. Model servers expose endpoints for prediction requests.

Autoscaling Considerations

Autoscaling can be based on CPU usage, custom metrics and request count. GPU-based autoscaling is more complex because GPUs are limited and expensive.

11. Multi-GPU and Distributed Workloads

Using Multiple GPUs in a Single Pod

You can request more than one GPU:

limits:

nvidia.com/gpu: 2

The container can then use 2 GPUs on that node.

Distributed Training Across Nodes

Large models require multiple nodes. Each node provides GPUs. Training is synchronized across nodes.

NCCL and Inter-Node Communication

NCCL is commonly used for GPU communication. It enables fast GPU-to-GPU data transfer and efficient gradient sharing. High-speed networking is required.

Network Requirements

Distributed workloads need high bandwidth, low latency and 10Gbps or higher recommended. Slow networks reduce training efficiency.

12. Performance Considerations and Optimization

Container Overhead on GPU Workloads

Container overhead is minimal. GPU performance inside containers is nearly the same as native.

CPU, Memory, and Storage Impact

GPU workloads still depend on CPU for data preprocessing, RAM for loading datasets and fast storage (SSD/NVMe). Poor storage slows down training.

NUMA and PCIe Considerations

GPUs connect via PCIe. For best performance align CPU and GPU locality and avoid cross-NUMA traffic. This improves throughput.

Benchmarking Containerized GPU Workloads

Use tools like: nvidia-smi, Framework benchmarks or Monitoring tools

Compare GPU utilization, memory usage and training time. This helps optimize resource allocation.

13. GPU Sharing and Advanced Scheduling

Time-Slicing vs Full GPU Allocation

Full Allocation:

One pod = one GPU
Best performance
Strong isolation

Time-Slicing:

Multiple pods share GPU
Better utilization
Slight performance impact

NVIDIA vGPU and MIG Overview

vGPU stands for Virtual GPU instances. It is used in virtualized environments.

MIG stands for Multi-Instance GPU. Splits one GPU into multiple logical GPUs and provides hardware-level isolation. It is useful for multi-tenant clusters.

When GPU Sharing Makes Sense

GPU sharing is beneficial in scenarios where full GPU capacity is not required by a single workload. Since GPUs are expensive and powerful resources, allocating an entire GPU to a small task can lead to underutilization. In such cases, sharing allows better efficiency and cost optimization.

GPU sharing makes sense in the following situations:

1. Lightweight Workloads: If applications use only a small portion of GPU memory and compute power, assigning a full GPU is unnecessary.

2. Small Inference Tasks: Inference workloads often require less GPU power compared to model training. For example, serving predictions for smaller models or handling low traffic applications may not fully utilize a GPU. In such cases, multiple inference services can safely share a single GPU, improving overall utilization.

3. Development and Testing Environments: Development teams usually need GPU access for experimentation, debugging, or testing code changes. These workloads are often short-lived and do not require full GPU performance.

Sharing GPUs in development environments reduces infrastructure costs while still providing necessary access.

Trade-Offs of GPU Sharing

While GPU sharing improves utilization, it also introduces certain trade-offs. It is important to understand the balance between efficiency and performance.

More Sharing

When multiple workloads share a GPU:

Higher Utilization: More of the GPU’s compute and memory capacity is used, reducing idle resources.
Lower Isolation: Workloads may compete for GPU resources, which can affect performance stability. If one workload spikes in usage, others may experience slowdowns.

This approach is ideal for cost-sensitive or non-critical workloads.

Less Sharing (Dedicated GPU Allocation):

When a GPU is dedicated to a single workload:

Better Performance: The application gets full access to GPU resources, ensuring consistent and predictable performance.
Higher Cost: Some GPU capacity may remain unused, leading to lower overall utilization and higher infrastructure cost per workload.

This approach is best for production training jobs, latency-sensitive inference services, and mission-critical applications.

14. Monitoring and Observability

Monitoring and observability are critical when running GPU workloads in containers. GPUs are expensive resources, and without proper monitoring, they can easily become underutilized, overheated, or overloaded. A well-designed monitoring system helps ensure stability, performance, and cost efficiency across your Kubernetes cluster.

Monitoring GPU Usage in Containers

Monitoring GPU usage allows you to avoid idle GPUs and maximize return on investment, identify overheating, memory exhaustion, or abnormal behavior before failures occur and track how GPUs are being used and eliminate waste from unused or oversized allocations.

In Kubernetes environments, GPU monitoring should cover both:

Node-level metrics (physical GPU health and performance)
Pod-level metrics (which workload is consuming GPU resources)

This visibility helps administrators understand how workloads behave in real time.

Key Metrics to Track

Tracking the right metrics is essential for performance tuning and troubleshooting.

GPU Utilization (%): Shows how much of the GPU compute capacity is being used. Low utilization may indicate inefficient workloads or oversized allocations. High sustained utilization may signal resource saturation.
GPU Memory Usage: Tracks how much GPU memory (VRAM) is consumed. Out-of-memory errors can crash training or inference jobs.
Temperature: High GPU temperatures may indicate insufficient cooling, heavy sustained workloads and hardware issues. Monitoring temperature helps prevent hardware damage.
Power ConsumptionL Helps analyze performance efficiency, energy usage patterns and cost optimization opportunities.
Pod-Level Resource Usage: It is important to map GPU usage to specific Kubernetes pods. This helps identify which workload is consuming GPU resources, whether resource limits are correctly configured and If scaling adjustments are needed.

Logging and Alerting

Monitoring alone is not enough. You must configure alerts to react quickly to issues.

Set Alerts For:

High GPU temperature
Low GPU utilization (indicating waste)
Out-of-memory (OOM) errors
Node failures or GPU crashes

Proactive alerts allow teams to respond before workloads fail. Logs also play an important role in troubleshooting:

Framework logs help diagnose model crashes
Kubernetes logs show pod failures
Node logs reveal driver or hardware issues

Combining logs with metrics provides complete observability.

Popular Tools and Exporters

Several tools are commonly used to monitor GPU workloads in Kubernetes environments.

NVIDIA DCGM Exporter: Collects GPU metrics directly from NVIDIA drivers. Exposes metrics for monitoring systems.

Prometheus: Scrapes metrics from exporters and stores time-series data.
Widely used in Kubernetes environments.

Grafana: Provides dashboards and visualizations. Displays GPU utilization, memory, temperature, and more.

Kubernetes Metrics Server: Collects basic CPU and memory metrics for pods and nodes. Used for autoscaling decisions.

Together, these tools provide dashboards, monitoring, and alerting for GPU clusters.

15. Security and Isolation in Containerized GPU Environments

Security is critical when running GPU workloads in shared environments. Since GPUs are powerful and expensive resources, proper isolation and access control must be enforced.

Container Isolation vs VM Isolation

Virtual Machines (VMs):

Provide strong isolation
Each VM has its own kernel
Better separation between tenants

Containers:

Share the host kernel
Lightweight and fast
Less isolated than VMs by default

In GPU environments, containers rely on the host system for drivers and device access. This makes security configuration very important.

If strict isolation is required (for example, in multi-tenant or enterprise environments), combining VMs and containers is often recommended.

Risks of GPU Device Access

When a container is given GPU access, it can interact with: /dev/nvidia* device files, driver-level resources and GPU memory.

Potential risks include resource abuse, data leakage between workloads and denial-of-service from heavy GPU usage. Improper configuration may allow containers more access than intended.

Securing Container Runtimes

To improve security:

Avoid running containers as root
Use minimal base images
Keep NVIDIA drivers updated
Limit privileged container access
Apply security profiles (AppArmor or seccomp)

Do not grant GPU access unless required. Only trusted workloads should receive GPU resources.

Role of RBAC and Namespaces

In Kubernetes, security is controlled through:

RBAC (Role-Based Access Control):

Controls who can create pods
Limits who can request GPU resources
Restricts administrative actions

Namespaces:

Separate teams or environments
Apply resource quotas
Control GPU consumption per team

Together, RBAC and namespaces prevent unauthorized GPU usage and enforce workload separation.

16. Failure Scenarios and Troubleshooting

Even with a proper setup, GPU workloads in Kubernetes can sometimes fail. Because GPUs involve drivers, container runtimes, Kubernetes components, and hardware, issues can occur at multiple layers. Understanding common failure scenarios helps reduce downtime and speeds up troubleshooting.

GPU Not Visible in Containers:

One of the most common problems is when the GPU is not accessible inside a container.

If running the following command inside a container fails: nvidia-smi

It usually means the GPU is not properly exposed to the container.

Possible Causes

NVIDIA driver is not installed on the host
NVIDIA Container Toolkit is missing
NVIDIA device plugin is not running in Kubernetes
Docker is not configured with GPU support

Step-by-Step Checks

1. Check GPU on the Host First: Run on the VM (not inside container):

nvidia-smi

If this fails, the issue is at the driver level. Install or reinstall the NVIDIA driver.

2. Verify NVIDIA Container Toolkit: Ensure the toolkit is installed and Docker is restarted.

Check installed package (Ubuntu): dpkg -l | grep -i nvidia-container

3. Check Device Plugin in Kubernetes:Verify that the NVIDIA device plugin pod is running:

kubectl get pods -n kube-system

4. Confirm GPU Resource Is Registered: Run:

kubectl describe node <node-name>

Make sure you see: nvidia.com/gpu

If it is not listed, Kubernetes cannot schedule GPU workloads.

Pods Stuck in Pending State

Another common issue is when a GPU pod remains in Pending state and never starts.

Common Reasons

No available GPUs on the cluster
NVIDIA device plugin not installed
Incorrect resource request in pod YAML
Node selector or taint mismatch

Troubleshooting Steps

Check pod events: kubectl describe pod <pod-name>

Look for messages such as “Insufficient nvidia.com/gpu” or “0/3 nodes available”.

Then verify node resources: kubectl describe node <node-name>

Ensure nvidia.com/gpu is listed and available.

If GPUs are fully allocated, you may need to wait for a GPU to free up, add more GPU nodes or reduce GPU requests.

Driver and CUDA Mismatches

Driver and CUDA compatibility is critical.

Errors often occur when the container CUDA version is higher than host driver supports and host driver version is outdated.

Important Rule: Host driver version must be greater than or equal to the required CUDA version inside the container.

Example: If a container uses CUDA 12, the host driver must support CUDA 12.

Mismatch can cause runtime errors, library loading failures or silent crashes. Always verify compatibility using NVIDIA’s CUDA compatibility matrix before deployment.

Debugging GPU Crashes in Kubernetes

If GPU workloads crash after starting, deeper investigation is required.

Step 1: Check Pod Logs: kubectl logs <pod-name>

Look for: CUDA errors, Out-of-memory (OOM) errors and Segmentation faults.

Step 2: Check Node Logs: journalctl -u kubelet

This may show device plugin failures, container runtime errors and node instability.

Step 3: Verify GPU Health

Run on the node: nvidia-smi

Look for GPU memory exhaustion, high temperature, GPU reset events and hardware errors. Monitoring tools can help detect recurring GPU failures and performance drops.

17. Cost Management and Efficiency

GPU infrastructure is expensive. Without proper planning, costs can increase quickly. Efficient resource usage is essential for sustainable operations.

GPU Cost Drivers in Kubernetes

Major cost factors include GPU type (for example, A100 is much more expensive than T4), Number of GPUs per node, Idle time, Storage and network usage and Long-running idle pods.

Even if a pod is idle but still requesting a GPU, you are paying for that allocation. Unused GPU time significantly increases operational costs.

Avoiding Idle GPUs

To reduce wasted GPU resources use cluster autoscaling, shut down unused GPU nodes, monitor GPU utilization regularly and use job queues for batch workloads.

If GPUs show consistently low utilization, consider consolidating workloads, reducing node size and enabling GPU sharing.

Right-Sizing Nodes and Pods

Over-allocation leads to waste.

For example:

A small inference model may only require 1 GPU
A large distributed training job may require multiple GPUs

Always match GPU requests to actual workload needs.

Proper resource requests improve utilization, prevent overspending, and improve scheduling efficiency.

Spot / Preemptible GPU Nodes

Many cloud providers offer discounted GPU virtual machines under special pricing models such as Spot instances or Preemptible nodes. These instances provide the same GPU hardware as regular on-demand instances but at a much lower price.

The lower cost comes with a trade-off: the cloud provider can terminate (stop) these instances at any time when capacity is needed elsewhere.

What Are Spot and Preemptible GPU Nodes?

Spot Instances: Discounted instances that can be interrupted when demand increases.
Preemptible Nodes: Short-lived instances that run at a reduced price but may automatically shut down after a fixed time or when the provider reclaims resources.

Both options are designed for workloads that can handle interruptions.

Advantages

Significantly Lower Cost: Spot or preemptible GPU nodes can cost much less than standard GPU instances. This makes them very attractive for organizations running large AI training workloads.
Ideal for Batch Training Jobs: Training jobs that run for several hours and can restart from checkpoints are good candidates. If the node is terminated, the job can resume from the last saved state.
Suitable for Fault-Tolerant Workloads: Workloads that are designed to handle failures—such as distributed training frameworks with checkpointing, can run efficiently on spot nodes without major issues.

Disadvantages

1. Can Be Terminated at Any Time: The biggest drawback is unpredictability. The cloud provider may shut down the instance with short notice.

This can interrupt Long-running jobs without checkpoints and real-time processing tasks.

2. Not Reliable for Critical Inference Services: Production inference systems that require High availability, Low latency and continuous uptime. should not depend on spot or preemptible GPU nodes.

Best Practice

Use spot or preemptible GPU nodes for non-critical training jobs, experimental workloads, batch processing tasks and restartable AI training with checkpointing.

Avoid using them for production inference APIs, customer-facing services and mission-critical applications.

A common strategy is to run baseline workloads on regular GPU nodes and use spot nodes to scale out training jobs when additional capacity is needed.

By combining both types wisely, organizations can significantly reduce GPU infrastructure costs while maintaining reliability where it matters most.

18. Best Practices for Production Deployments

Running GPU workloads in production requires careful planning to ensure reliability, reproducibility, and optimal performance. Following best practices helps avoid downtime and maximize GPU utilization.

1. Use Immutable Images:

Build container images once and avoid changing them in production. Immutable images ensure consistency across environments and help in reproducing experiments and debugging issues.

2. Pin CUDA and Driver Versions:

ismatched CUDA and NVIDIA drivers can cause runtime failures. Always specify exact versions in your Dockerfile and host driver installation.

Example: CUDA 12.1 → NVIDIA driver version ≥ 525.x

Pinning prevents incompatibility issues when updating containers or nodes.

3. Separate Training and Inference Clusters:

Training workloads: Long-running, GPU-intensive, batch-oriented

Inference workloads: Latency-sensitive, smaller GPU usage

Running them on separate clusters or node pools avoids resource contention and improves predictability.

4. Use Node Labels and Taints for GPU Nodes:

Labels: Tag nodes with gpu=true for GPU workloads
Taints: Prevent non-GPU workloads from being scheduled on GPU nodes
Tolerations: Allow GPU pods to run on tainted nodes

This ensures proper scheduling, avoids accidental GPU resource usage, and improves cluster organization.

19. Provider-Specific Considerations for GPU VMs

When choosing GPU VMs from cloud providers, several factors can impact performance, cost, and reliability.

1. GPU Availability by Region:

Certain GPU models (A100, V100, T4, etc.) may be limited in specific regions. Check availability before provisioning. Multi-region deployments may be required for redundancy.

2. Network Bandwidth Limits:

Distributed AI training often requires high-speed inter-node communication. Network performance may vary by region or VM type. Consider VMs with enhanced networking for multi-GPU or multi-node setups.

3. Storage Performance:

Large datasets require fast read/write speeds. Use SSDs or NVMe storage for training datasets. Slow storage can bottleneck GPU throughput.

4. SLA and Uptime Implications:

Review provider SLAs for GPU VMs. Spot or preemptible instances may not guarantee uptime. Production inference workloads require stable, high-availability GPU instances.

Conclusion: When to Use Containers and Kubernetes on GPU VMs

Containers and Kubernetes provide flexibility, scalability, and efficient resource management for GPU workloads. However, they are not always the best choice.

Ideal Workloads

AI/ML training with multiple GPUs
Distributed deep learning workloads
Batch inference pipelines
Multi-tenant GPU clusters with shared resources

Kubernetes excels at scheduling, autoscaling, and managing GPU resources across nodes.

When Simpler Setups Are Better

Small-scale or single-GPU workloads
Development or experimental projects
Applications that do not require autoscaling or multi-node orchestration

For these cases, a single GPU VM with Docker may be simpler and faster to manage.

Future Trends

Serverless GPUs: Pay-per-use GPU services for short-lived workloads
Managed AI platforms: Platforms that handle scaling, GPU management, and monitoring automatically
GPU virtualization and sharing improvements: More efficient utilization of GPU clusters in multi-tenant environments

Containers and Kubernetes on GPU VMs are powerful for modern AI workloads, but choosing the right level of complexity depends on your project’s scale, performance needs, and cost considerations.

This completes a detailed guide on deploying GPU workloads using containers and Kubernetes, from setup to monitoring, optimization, and production best practices.

Visited 25 times, 1 visit(s) today