vGPU vs. GPU VM-Passthrough: A Comparative Summary
1. Purpose of This Document
- This document serves as an internal reference for discussions about GPU resource management with vGPU or VM passthrough, and as a guide to how GPU resources can be utilized in Kubernetes.
- It explains the differences between vGPU and VM passthrough, and why the KubeVirt setup in the Cosmic Cluster opts for VM passthrough over vGPU.
2. Summary of Key Terms
2.1 vGPU
- Put simply, vGPU is NVIDIA's technology for allocating GPU resources in virtualized environments, allowing multiple VMs to share a single physical GPU.
- Traditionally, resources like CPU and memory could be partitioned and used within virtual environments, but this wasn’t feasible for GPUs until vGPU technology emerged.
- Interestingly, vGPU wasn't originally developed for AI workloads, but to run graphics-intensive work on servers that laptops and client machines would struggle to handle themselves.
- However, the downside is that it requires a costly license and only supports a limited range of GPU models, making it difficult to use with widely available GPUs.
2.2 VM-Passthrough
- VM passthrough allows a virtual machine to directly access an I/O device installed in the host (commonly GPUs or NICs), bypassing hypervisor emulation.
- KubeVirt uses this VM passthrough method, and the NVIDIA GPU Operator can enable it for VM workloads; a minimal sketch follows.
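As a rough sketch of what enabling this looks like with Helm (the flag and label names are taken from NVIDIA's GPU Operator with KubeVirt guide; verify them against the Operator version you deploy):
# Assumes the nvidia Helm repo is added (helm repo add nvidia https://helm.ngc.nvidia.com/nvidia)
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set sandboxWorkloads.enabled=true
# Opt a node into passthrough mode (as opposed to containers or vGPU) via a label:
kubectl label node <node-name> nvidia.com/gpu.workload.config=vm-passthrough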
2.3 Understanding the vfio-pci Driver
- It’s essential to understand VFIO first.
- VFIO is a framework in the Linux kernel that allows non-privileged userspace drivers to securely access physical hardware devices.
- This functionality is central to implementing device passthrough, where virtual machines directly use physical devices.
- The "VFIO PCI driver" refers to a specific kernel module (vfio-pci.ko) within this framework that handles devices on the PCI or PCI Express (PCIe) bus.
- Because high-performance devices such as GPUs, NICs, and NVMe controllers typically use PCIe, the vfio-pci module plays a key role in passthrough setups.
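A minimal sketch of reserving a GPU for vfio-pci at boot via modprobe configuration (an alternative to the kernel command line shown in section 4; 10de:xxxx is a placeholder for the vendor:device pair reported by lspci -nn on your host):
root@gpu:/# echo "options vfio-pci ids=10de:xxxx" > /etc/modprobe.d/vfio.conf
root@gpu:/# echo "softdep nvidia pre: vfio-pci" >> /etc/modprobe.d/vfio.conf  # make vfio-pci claim the card before nvidia/nouveau can
root@gpu:/# update-initramfs -u  # rebuild the initramfs, then reboot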
Once vfio-pci claims the GPU, the lspci output looks like the following:
root@gpu:/# lspci -nnk | grep -A2 -i nvidia
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation AD102GL [RTX 6000 Ada Generation] [vendors:id] (rev a1)
Subsystem: NVIDIA Corporation Device [vendors:id]
Kernel driver in use: vfio-pci   # this line should appear
Kernel modules: nvidiafb, nouveau
3. In-depth Explanation of vGPU Functionality
- To better understand how vGPU is implemented:
- A physical GPU is installed in a server running a hypervisor like VMware ESXi or Citrix XenServer. On top of this, vGPU management software such as the NVIDIA Virtual GPU Manager is installed.
- This vGPU manager partitions the physical GPU’s resources into multiple vGPU instances.
- Each vGPU instance is assigned a fixed amount of GPU memory and a virtual display output; a sketch of creating one such instance follows.
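On a Linux/KVM host, NVIDIA documents this flow as creating mediated (mdev) devices through sysfs (see the RHEL-with-KVM reference at the end). A minimal sketch with placeholder values (the PCI address and the nvidia-xx profile name must come from your own host):
root@gpu:/# cd /sys/class/mdev_bus/0000:01:00.0/mdev_supported_types
root@gpu:/# cat nvidia-*/name                     # list the human-readable vGPU profiles
root@gpu:/# echo "$(uuidgen)" > nvidia-xx/create  # create one vGPU instance of that profile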
4. In-depth Explanation of VM-Passthrough Functionality
Here’s how VM passthrough works in more detail:
- Enable IOMMU: You need to activate IOMMU (Intel VT-d or AMD-Vi) in the BIOS or UEFI of the host system. This ensures device memory access is securely mapped to the VM’s memory space.
- While you can edit /etc/default/grub to enable IOMMU on the kernel side, the corresponding BIOS/UEFI setting must also be turned on for passthrough devices like GPUs.
root@gpu:/# cat /etc/default/grub
# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
# info -f grub -n 'Simple configuration'
GRUB_DEFAULT=0
GRUB_TIMEOUT_STYLE=hidden
GRUB_TIMEOUT=0
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash rd.driver.blacklist=nouveau modprobe.blacklist=nouveau amd_iommu=on kvm.ignore_msrs=1 video=efifb:off vfio-pci.ids=vendor_ids"
GRUB_CMDLINE_LINUX=""- Use the VFIO Driver: On Linux, the VFIO framework isolates PCI devices from the host and passes them to VMs.
- Use the VFIO driver: On Linux, the VFIO framework isolates PCI devices from the host and passes them to VMs. VFIO safely manages Direct Memory Access (DMA) to these devices, and the vfio-pci driver ensures that the host does not load other drivers (e.g., nvidia, nouveau), leaving the device free for the VM.
- Assign devices: Use tools like virsh or virt-manager to allocate specific PCI devices to a VM. In a KubeVirt setup, KubeVirt itself handles this step: the device is exposed via permittedHostDevices in the KubeVirt CR and requested in the VM spec, as sketched below.
- Install drivers in the VM: Inside the VM, install the appropriate GPU drivers after confirming the device is visible (e.g., via lspci).
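For the KubeVirt path, a minimal sketch of a VM spec requesting the passed-through GPU (disks and networks omitted; the resource name nvidia.com/AD102GL_RTX_6000_ADA is an assumption and must match what you expose under permittedHostDevices in the KubeVirt CR, per the kubevirt.io host-devices guide):
cat <<'EOF' | kubectl apply -f -
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: gpu-vmi
spec:
  domain:
    devices:
      gpus:
      - deviceName: nvidia.com/AD102GL_RTX_6000_ADA  # hypothetical resource name
        name: gpu1
    resources:
      requests:
        memory: 4Gi
EOF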
5. Conclusion
If available, vGPU offers a more straightforward configuration path. However, it comes with critical limitations:
- High licensing costs
- Compatibility only with a narrow list of supported GPUs
On the other hand, VM passthrough is a native feature in Linux and does not restrict GPU models.
It allows virtual machines to achieve near-native performance by directly interfacing with physical devices. However, it also comes with certain caveats:
- Because passthrough grants a VM direct access to the host's GPU, it widens the attack surface and introduces potential security vulnerabilities.
- Changes a VM makes to the device (e.g., persistent GPU or firmware state) can survive even after the VM shuts down.
- Since the hypervisor is largely bypassed during device operation, it cannot inspect or mediate the data moving between the VM and the physical GPU; that traffic is only observable at the hardware level.
Choose the approach that fits your scenario, and if the server is exposed to external access, weigh the security implications of passthrough carefully.
6. Personal Notes
I learned for the first time that settings like IOMMU must be toggled in the BIOS because direct device access starts at the hardware level.
This makes sense in hindsight — just like how CPU and memory overclocking is managed through BIOS, so too is device passthrough.
When operating KubeVirt in production, we should also verify that passthrough delivers the full GPU performance we expect; a quick check is sketched below.
Since fewer layers are involved, it’s reasonable to expect improved performance in terms of latency and bandwidth.
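A simple version of that check, inside the VM after installing the driver (bandwidthTest is built from the CUDA samples, so treat its presence and path as assumptions):
nvidia-smi        # the GPU should appear as a full physical device
./bandwidthTest   # CUDA sample measuring host<->device PCIe copy throughput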
7. References
- https://www.youtube.com/watch?v=NiXtswuE1MI&t=5s
- https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-kubevirt.html
- https://docs.nvidia.com/vgpu/latest/grid-vgpu-user-guide/index.html
- https://kubevirt.io/user-guide/compute/host-devices/
- https://docs.nvidia.com/ai-enterprise/deployment/rhel-with-kvm/latest/setting-vgpu-devices.html
- https://docs.nvidia.com/vgpu/gpus-supported-by-vgpu.html
- https://gruuuuu.github.io/linux/kvm-gpu-passthrough/
