May 092024
 
NVIDIA vGPU Network Licensing Token

When deploying NVIDIA vGPU across a VDI environment, I often see IT teams deploy the licensing token directly on the persistent VMs, or on the non-persistent base golden image. This often causes a nightmare when the client activation token must be updated.

I highly recommend considering network placement of the NVIDIA vGPU Licensing Client Configuration token file for your deployments.

In this post we’ll review the Client Configuration Token File, why you’d want to place it on the network, and how to do so.

What is the Client Configuration Token File

The Client Configuration Token File, tells the NVIDIA vGPU driver on your VM where to find the licensing server information. This token will point the driver to either the CLS or DLS licensing server and request the applicable license to be issued.

By default, the vGPU driver will check the following location for the token:

C:\Program Files\NVIDIA Corporation\vGPU Licensing\ClientConfigToken\

While this is common, there’s a much better (and easier) method that you can use to deploy the Client Configuration Tokens, using Network Shares, to ease management of these files.

Placing the NVIDIA vGPU Licensing client configuration token on a network share

Using the Windows Registry, along with a GPO (Group Policy Object), you can configure a network location for the NVIDIA Client Configuration Token, so that your systems whether Persistent or Non-Persistent will use this location.

In the event of a token change, you can simply delete and remove the old token, and place a new configuration token, and all the systems will have immediate access to it, without manually updating individual systems.

Here we’ll use the registry and a GPO to configure the token location:

  1. Using an administrative account, create a folder called “vGPU-Licensing” on your domain SYSVOL share.
    • Example: \\Domain.com\SYSVOL\Domain.com\vGPU-Licensing\
  2. Place your NVIDIA Licensing Client Configuration Token in this folderNVIDIA Licensing Token SYSVOL
  3. Open “Group Policy Management” and create a new GPO called “VDI-NVIDIA-LicensingToken”
  4. Navigate to: Computer Configuration -> Preferences -> Windows Settings -> Registry
  5. Right Click and select New -> Registry Item
  6. Under the New Registry Window Enter the following:
    • Action: Update
    • Hive: HKEY_LOCAL_MACHINE
    • Key Path: SYSTEM\CurrentControlSet\Services\nvlddmkm\Global\GridLicensing
    • Value Name: ClientConfigTokenPath
    • Value Type: REG_SZ
    • Value Data: \\Domain.com\SYSVOL\Domain.com\vGPU-Licensing
    • Change the network location to match your environment and your setup
  7. After populating the fields, it should be similar to the following example: NVIDIA GPO Registry Client Configuration Token
  8. Hit Apply, then Ok, then link the newly created GPO to the OU where your VDI VM guests are located with NVIDIA vGPU.

That’s it! All we did was created a GPO which configures the Registry key “ClientConfigTokenPath” inside of HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\nvlddmkm\Global\GridLicensing\ and set it to a network share that has the configuration tokens.

Please note, the NVIDIA licensing service accesses the network location using the services security context (not the user’s context), which is why I chose the SYSVOL share, as the computer accounts have read access to this location (example, reading the GPOs on boot and user logon).

Additionally, note that the registry key and location may vary if you’re using older versions of the NVIDIA vGPU Driver. The key used in this post is for versions 16.x and 17.x.

May 092024
 
NVIDIA vGPU

You may notice a frozen session or frozen screen with NVIDIA vGPU, Windows 11, and Omnissa Horizon (formerly VMware Horizon) in your VDI environment.

While I’ve mostly observed this issue using non-persistent Instant Clones with vGPU on Windows 11 23H2, I have also noticed issues and anomalies with persistent VMs as well.

I’ve noticed this issue across multiple customer environments, and was able to replicate it in my own environment. I’ll go over the problem and solution below.

The Problem

This issue occurs due to the combination of hardware being used, the VMware SVGA driver, a secondary “Virtual Display”, and the resolution being set during logon and initialization of the VMware Horizon VDI session.

When a user logs on, the resolutions are set across all virtual displays. There is an issue where due to a timeout (observed in log files), the resolution cannot be set, resulting in a session that either appears to be frozen, or if active, the interactive cursor is actually off-set from the visible display (your mouse is somewhere else, other than where it’s being displayed).

The Solution

In my troubleshooting, I’ve identified the following solutions:

Solution #1

To resolve this issue, disable the “VMware SVGA 3D” Display Adapter in the Windows Explorer (as shown below). Simply right-click on “VMware SVGA 3D” and set to Disabled.

After disabling this Display Adapter, you’ll noticed the issue will be resolved, and you’ll also notice your VDI sessions are established very quickly (including initializing the resolutions with vGPU).

If you’re using non-persistent VDI (VMware Horizon Instant Clones), you’ll need to perform this on your base image.

Note: By disabling this adapter, you will lose the ability to use the VMware Console on VMware vSphere vCenter. To gain console access, you’ll either need to enable the VMware SVGA 3D adapter in a VDI session, or remove the vGPU adapter.

Solution #2

Another solution is to force the VDI session to use the VMware Horizon Indirect Display Driver.

  1. Open Windows Registry and navigate to the following location: HKLM\Software\Policies\VMware, Inc.\VMware Blast\Config
  2. Create a new Registry String (REG_SZ) called “PixelProviderForceViddCapture” and set it to: 1

Note: If you force the use of the VMware Horizon Indirect Display Driver as your Primary Display Driver, you may run in to GPU issues with the VMware Horizon Indirect Display Driver where the capabilities of your NVIDIA vGPU may not be detected by your applications that require the features and capabilities that come from an NVIDIA GPU.

Jan 062024
 
vMotion with vGPU

Normally, any VMs that are NVIDIA vGPU enabled have to be manually migrated with manual vMotion if a host is placed in to maintenance mode, to evacuate the host. While we may have grown accustomed to this, there is a better way, with vGPU Enabled VM DRS Evacuation during Maintenance mode!

A new feature that was introduced with vSphere 7.0 U3f, was the ability to configure and allow automatic vMotion of VMs with vGPUs, meaning that DRS can now migrate your VDI and AI/ML vGPU enabled workloads when hosts are placed in to maintenance mode. This also allows you to streamline remediation with vLCM when updating vGPU enabled hosts running vGPU enabled VMs.

Additionally, as of vSphere 8.0 U2, DRS can now estimate the STUN times required for vMotion of vGPU enabled VMs, and control whether automatic DRS vMotion’s are allowed. This STUN time limit can be set buy an administrator.

Enable automatic vMotion evacuation of vGPU enabled VMs

To enable the automatic vMotion of vGPU enabled VMs on your vSphere Cluster:

  1. Navigate to your vSphere Cluster.vSphere Cluster Selected
  2. Click on the “Configure” Tab, and then select “vSphere DRS”, and click “Edit”.vSphere DRS Cluster - DRS Advanced Settings
  3. Navigate to the “Advanced Options” tab.
  4. Add “VgpuMMAutomationTimeoutSecs” and set to “-1”.vSphere DRS set VgpuMMAutomationTimeoutSecs

After performing the above, when you place a host with vGPU enabled Virtual Machines in to Maintenance Mode, vSphere DRS will evacuate and migrate the VMs to other hosts in the cluster that have the required hardware.

If you attempt to place a host in to Maintenance Mode without enabling automatic vMotion of vGPU enabled VMs, it will fail with the error: “DRS failed to generate a vMotion recommendation for a virtual machine on a host entering Maintenance Mode“.

Enable and Configure vGPU STUN Time Estimate and Limits

If you are running vSphere 8U2 or higher, you can enable vGPU STUN time estimation and limits for DRS on the vGPU enabled cluster. Similar to the instructions above, we can add and configure two variables to the vSphere DRS cluster “Advanced Options”.

To enable STUN time estimation, add PassthroughDrsAutomation and set to “1”.

To override the default vMotion STUN time limit of 100 seconds, add VmDevicesStunTimeTolerated and set it to your preferred maximum number of seconds. Alternatively, you can set this limit Per VM by navigating to the VM in vSphere and adding this variable under the “VM Options” “Advanced Settings” section.

Additional Documentation

Jan 052024
 
NVIDIA vGPU Installed in VMware ESXi Host

You may experience GPU issues with the VMware Horizon Indirect Display Driver in your environment when using 3rd party applications which incorrectly utilize the incorrect display adapter. This results with the inability to use and/or run GPU accelerated workloads including VDI, AI, and ML.

This issue effects NVIDIA vGPU (both vGPU and vDGA passthrough), AMD MxGPU, and Intel Data Center GPU Flex GPUs using SR-IOV, in any deployment where the VMware Indirect Display Driver is installed.

When this issue occurs, the application will incorrectly query the capabilities of the VMware Indirect Display Adapter instead of the GPU that is presented to the VM, resulting in a scenario where the application isn’t aware of the capabilities of the GPU you are utilizing, failing to utilize the GPU, and hardware acceleration, such as hardware encoding (NVENC) and hardware decoding.

What is the VMware Horizon Indirect Display Driver

The VMware Horizon Indirect Display Driver, also known as the VMware Indirect Display Driver, is a “virtual” display driver that isn’t bound to a specific hypervisor, and works with many deployments because of the lack of that limitation.

GPU Issues with the VMware Horizon Indirect Display Driver Enabled

This driver is installed with the VMware Horizon agent, and can work in conjunction with hardware acceleration, including GPUs (such as NVIDIA vGPU, AMD MxGPU, and Intel Data Center GPUs using SR-IOV).

Under normal circumstances, the VMware Horizon Indirect Display Driver is prioritized as a fallback driver for remoting protocols, except in environments where no hypervisor or GPU display drivers are available (like Horizon Cloud on Azure) in which case it would become the priority.

The Problem

Applications designed to use a GPU, may not be able to correctly identify which display adapter to use on the VM. While you may have a GPU, vGPU, or 3D acceleration in your environment, the application may be unaware of the device and/or its capabilities.

This is caused by the application either not correctly using the preferred primary display adapter (GPU and/or vGPU), or not being designed to handle multiple display adapters (and drivers).

Example Scenario:

When using CyberLink PowerDirector 360 in a VMware Horizon environment with an NVIDIA vGPU, the application will query the VM’s Windows instance for hardware acceleration capabilities, specifically hardware encoding, hardware decoding, and use of APIs like NVIDIA’s NVENC encoder. In this scenario, while the VM does have an NVIDIA vGPU workstation profile attached with a valid NVIDIA RTX Virtual Workstation (vWS) license, the application is only aware of the VMware Indirect Display Driver and it’s capabilities. This results in all hardware accelerated encoding and decoding capabilities to be disabled.

Example Symptoms

  • 3D Acceleration not detected by application
  • CUDA Cores not available for application
  • OpenCL not available
  • DirectX and Direct3D usage unavailable

In all scenarios, the VM will appear to have 3D acceleration, however one or multiple applications won’t have access.

The Solution

Thanks to the design of the VMware Indirect Display Driver, it should be prioritized in a fashion that it’s used only when other display drivers aren’t available (including NVIDIA vGPU), or system resources aren’t available; however, some 3rd party application may not be able to reference the prioritization, or support multi-GPU (multi display driver), resulting in the incorrect display adapter being used.

As a workaround, you can remove the VMware Indirect Display Driver from the Windows instance running in the VM.

NVIDIA vGPU with VMware Horizon Indirect Display Driver Removed

Please note that simply disabling the “VMware Horizon Indirect Display Driver” will not suffice. A full removal (Right Click, “Uninstall Device”) is required to workaround this issue. Additionally, upgrading or re-installing the VMware Horizon Agent will re-install the VMware Indirect Display Driver.

Jul 282023
 
NVIDIA GPU Manager

In May of 2023, NVIDIA released the NVIDIA GPU Manager for VMware vCenter. This appliance allows you to manage your NVIDIA vGPU Drivers for your VMware vSphere environment.

Since the release, I’ve had a chance to deploy it, test it, and use it, and want to share my findings.

In this post, I’ll cover the following (click to skip ahead):

  1. What is the NVIDIA GPU Manager for VMware vCenter
  2. How to deploy and configure the NVIDIA GPU Manager for VMware vCenter
    • Deployment of OVA
    • Configuration of Appliance
  3. Using the NVIDIA GPU Manager to manage, update, and deploy vGPU drivers to ESXi hosts

Let’s get to it!

What is the NVIDIA GPU Manager for VMware vCenter

The NVIDIA GPU Manager is an (OVA) appliance that you can deploy in your VMware vSphere infrastructure (using vCenter and ESXi) to act as a driver (and update) repository for vLCM (vSphere Lifecycle Manager).

In addition to acting as a repo for vLCM, it also installs a plugin on your vCenter that provides a GUI for browsing, selecting, and downloading NVIDIA vGPU host drivers to the local repo running on the appliance. These updates can then be deployed using LCM to your hosts.

In short, this allows you to easily select, download, and deploy specific NVIDIA vGPU drivers to your ESXi hosts using vLCM baselines or images, simplifying the entire process.

Supported vSphere Versions

The NVIDIA GPU Manager supports the following vSphere releases (vCenter and ESXi):

  • VMware vSphere 8.0 (and later)
  • VMware vSphere 7.0U2 (and later)

The NVIDIA GPU Manager supports vGPU driver releases 15.1 and later, including the new vGPU 16 release version.

How to deploy and configure the NVIDIA GPU Manager for VMware vCenter

To deploy the NVIDIA GPU Manager Appliance, we have to download an OVA (from NVIDIA’s website), then deploy and configure it.

See below for the step by step instructions:

Download the NVIDIA GPU Manager

  1. Log on to the NVIDIA Application Hub, and navigate to the “NVIDIA Licensing Portal” (https://nvid.nvidia.com).
  2. Navigate to “Software Downloads” and select “Non-Driver Downloads”
  3. Change Filter to “VMware vCenter” (there is both VMware vSphere, and VMware vCenter, pay attention to select the correct).
  4. To the right of “NVIDIA GPU Manager Plug-in 1.0.0 for VMware vCenter”, click “Download” (see below screenshot).
Screenshot of download link for NVIDIA GPU Manager for VMware vCenter
NVIDIA GPU Manager Download Page

After downloading the package and extracting, you should be left with the OVA, along with Release Notes, and the User Guide. I highly recommend reviewing the documentation at your leisure.

Deploy and Configure the NVIDIA GPU Manager

We will now deploy the NVIDIA GPU Manager OVA appliance:

  1. Deploy the OVA to either a cluster with DRS, or a specific ESXi host. In vCenter either right click a cluster or host, and select “Deploy OVF Template”. Choose the GPU Manager OVA file, and continue with the wizard.NVIDIA GPU Manager OVA Deploy
  2. Configure Networking for the Appliance
    • You’ll need to assign an IP Address, and relevant networking information.
    • I always recommend creating DNS (forward and reverse entries) for the IP.NVIDIA GPU Manager OVA Network Configuration
  3. Finally, power on Appliance.

We must now create a role and service account that the GPU Manager will use to connect to the vCenter server.

While the vCenter Administrator account will work, I highly recommend creating a service account specifically for the GPU Manager that only has the required permissions that are necessary for it to function.

  1. Log on to your vCenter Server
  2. Click on the hamburger menu item on the top left, and open “Administration”.
  3. Under “Access Control” select Roles. vCenter-Roles
  4. Select New to create a new role. We can call it “NVIDIA Update Services”.
  5. Assign the following permissions:
    • Extension Privileges
      • Register Extension
      • Unregister Extension
      • Update Extension
    • VMware vSphere Lifecycle Manager Configuration Priveleges
      • Configure Service
    • VMware vSphere Lifecycle Manager Settings Priveleges
      • Read
    • Certificate Management Privileges
      • Create/Delete (Admins priv)
      • Create/Delete (below Admins priv)
    • ***PLEASE NOTE: The above permissions were provided in the documentation and did not work for me (resulted in an insufficient privileges error). To resolve this, I chose “Select All” for “VMware vSphere Lifecycle Manager”, which resolved the issue.***
  6. Save the Role
  7. On the left hand side, navigate to “Users and Groups” under “Single Sign On”
  8. Change the domain to your local vSphere SSO domain (vsphere.local by default)
  9. Create a new user account for the NVIDIA appliance, as an example you could use “nvidia-svc”, and choose a secure password.
  10. Navigate to “Global Permissions” on the left hand side, and click “Add” to create a new permission.
  11. Set the domain, and choose the new “nvidia-svc” service account we created, and set the role to “NVIDIA Update Services”, and check “Propagate to Children”.
  12. You have now configured the service account.

Now, we will perform the initial configuration of the appliance. To configure the application, we must do the following:

  1. Access the appliance using your browser and the IP you configured above (or FQDN) GPU Manager Account Creation
  2. Create a new password for the administrative “vcp_admin” account. This account will be used to manage the appliance.
    • A secret key will be generated that will allow the password to be reset, if required. Save this key somewhere safe.
  3. We must now register the appliance (and plugin) with our vCenter Server. Click on “REGISTER”. NVIDIA GPU Manager Register
  4. Enter the FQDN or IP of your vCenter server, the NVIDIA Service account (“nvidia-svc” from example), and password.
  5. Once the GPU Manager is registered with your vCenter server, the remainder of the configuration will be completed from the vCenter GPU.
    • The registration process will install the GPU Manager Plugin in to VMware vCenter
    • The registration process will also configure a repository in LCM (this repo is being hosted on the GPU manager appliance).

We must now configure an API key on the NVIDIA Licensing portal, to allow your GPU Manager to download updates on your behalf.

  1. Open your browser and navigate to https://nvid.nvidia.com. Then select “NVIDIA LICENSING PORTAL”. Login using your credentials.
  2. On the left hand side, select “API Keys”.
  3. On the upper right hand, select “CREATE API KEY”.
  4. Give the key a name, and for access type choose “Software Downloads”. I would recommend extending the key validation time, or disabling key expiration. NVIDIA Download API Create Key
  5. The key should now be created.
  6. Click on “view api key”, and record the key. You’ll need to enter this in later in to the vCenter GPU Manager plugin.

And now we can finally log on to the vCenter interface, and perform the final configuration for the appliance.

  1. Log on to the vCenter client, click on the hamburger menu, and select “NVIDIA GPU Manager”.
  2. Enter the API key you created above in to the “NVIDIA Licensing Portal API Key” field, and select “Apply”.
  3. The appliance should now be fully configured and activated. GPU Manager Activated API Key
  4. Configuration is complete.

We have now fully deployed and completed the base configuration for the NVIDIA GPU Manager.

Using the NVIDIA GPU Manager to manage, update, and deploy vGPU drivers to ESXi hosts

In this section, I’ll be providing an overview of how to use the NVIDIA GPU Manager to manage, update, and deploy vGPU drivers to ESXi hosts. But first, lets go over the workflow…

The workflow is a simple one:

  1. Using the vCenter client plugin, you choose the drivers you want to deploy. These get downloaded to the repo on the GPU Manager appliance, and are made available to Lifecycle Manager.
  2. You then use Lifecycle Manager to deploy the vGPU Host Drivers to the applicable hosts, using baselines or images.

As you can see, there’s not much to it, despite all the configuration we had to do above. While it is very simple, it simplifies management quite a bit, especially if you’re using images with Lifecycle Manager.

To choose and download the drivers, load up the plugin, use the filters to filter the list, and select your driver to download.

GPU Manager downloading vGPU Driver
NVIDIA GPU Manager downloading vGPU Driver

As you can see in the example, I chose to download the vGPU 15.3 host driver. Once completed, it’ll be made available in the repo being hosted on the appliance.

Once LCM has a changed to sync with the updated repos, the driver is then made available to be deployed. You can then deploy using baselines or host images.

LCM Image Update with NVIDIA vGPU Driver from NVIDIA GPU Manager
LCM Image Update with NVIDIA vGPU Driver from NVIDIA GPU Manager

In the example above, I added the vGPU 16 (535.54.06) host driver to my clusters update image, which I will then remediate and deploy to all the hosts in that cluster. The vGPU driver was made available from the download using GPU Manager.