Nov 23, 2024
 

In some scenarios, you may encounter an issue where the Veeam WAN Accelerator service fails to start.

This causes backup and backup copy jobs that use the Veeam WAN Accelerator to fail, which is usually how the issue is first noticed.

In this post I’ll explain the problem, what can cause it, and how to resolve the issue.

The Problem

When this issue occurs and a Backup or Backup Copy job runs, the job will usually fail with the following error in the Veeam console:

Error: The RPC server is unavailable. RPC function call failed. Function name: [InvokerTestConnection]. Target machine: [IP.OF.WAN.ACC:6464].

Failed to process (VM Name).

See the screenshot below for an example:

Veeam Backup Copy Job Failing due to Veeam WAN Accelerator Service failing

From the error above, the next step is usually to check the logs to find out what’s happening. Because this Backup Copy job uses the WAN accelerator, we’ll look at the log for the Veeam WAN Accelerator Service.

Svc.VeeamWANSvc.log

[23.11.2024 11:46:24.251] <  3440> srv      | RootFolder = V:\VeeamWAN
[23.11.2024 11:46:24.251] <  3440> srv      | SendFilesPath = V:\VeeamWAN\Send
[23.11.2024 11:46:24.251] <  3440> srv      | RecvFilesPath = V:\VeeamWAN\Recv
[23.11.2024 11:46:24.251] <  3440> srv      | EnablePerformanceMode = true
[23.11.2024 11:46:24.255] <  3440> srv      | ERR |Fatal error
[23.11.2024 11:46:24.255] <  3440> srv      | >>  |boost::filesystem::create_directories: The system cannot find the path specified: "V:\"
[23.11.2024 11:46:24.255] <  3440> srv      | >>  |Unable to apply settings. See log for details.
[23.11.2024 11:46:24.255] <  3440> srv      | >>  |An exception was thrown from thread [3440].
[23.11.2024 11:46:24.255] <  3440> srv      | Stopping service...
[23.11.2024 11:46:24.256] <  3440> srv      | Stopping retention thread...
[23.11.2024 11:46:24.257] <  4576>          | Thread started.  Role: 'Retention thread', thread id: 4576, parent id: 3440.
[23.11.2024 11:46:24.258] <  4576>          | Thread finished. Role: 'Retention thread'.
[23.11.2024 11:46:24.258] <  3440> srv      | Waiting for a client('XXX-Veeam-WAN:6165')
[23.11.2024 11:46:24.258] <  3440> srv      | Stopping server listening thread.
[23.11.2024 11:46:24.258] <  3440> srv      |   Stopping command handler threads.
[23.11.2024 11:46:24.258] <  3440> srv      |   Command handler threads have stopped.
[23.11.2024 11:46:24.258] <  4580>          | Thread started.  Role: 'Server thread', thread id: 4580, parent id: 3440.

In the Veeam WAN Service log file above, you’ll note a fatal error where the service is unable to find the configured paths (in this case “V:\”), which causes the service to stop.

In some configurations, iSCSI is used to access Veeam backup repository storage hosted on iSCSI targets. Furthermore, some iSCSI configurations use special vendor plugins to access the iSCSI storage and configure items like MPIO (Multipath I/O), which can take additional time to initialize.

In this scenario, the Veeam WAN Accelerator Service was starting before the Windows iSCSI service, MPIO Service, and Nimble Windows Connection Manager plugin had time to initialize, resulting in the WAN accelerator failing because it couldn’t find the directories it was expecting.

The Solution

To resolve this issue, we want the Veeam WAN Accelerator Service to have a delayed start in the Windows Server boot sequence.

  1. Open Windows Services
  2. Select “Veeam WAN Accelerator Service”
  3. Change “Startup Type” to “Automatic (Delayed Start)”
  4. Click “Apply” to save, and then click “Start” to start the service.

As per the screenshot below:

Veeam WAN Accelerator Service Properties

The Veeam WAN Accelerator Service will now have a delayed start on system bootup, allowing the iSCSI initiator to establish and mount all iSCSI target block devices before the WAN Accelerator service starts.
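
If you prefer to script this change (for example, across multiple WAN accelerators), the same delayed start can be set from an elevated PowerShell or Command Prompt. This is a minimal sketch that assumes the service’s internal name is “VeeamWANSvc” (based on the Svc.VeeamWANSvc.log file name above); confirm the actual name on your server before applying it.

# Confirm the service name and current start type (name assumed from the log file name)
sc.exe qc VeeamWANSvc

# Set the service to Automatic (Delayed Start)
sc.exe config VeeamWANSvc start= delayed-auto

# Optional: also make the service wait for the Microsoft iSCSI Initiator service (MSiSCSI).
# Note: depend= overwrites the existing dependency list, so review the "sc.exe qc" output first.
sc.exe config VeeamWANSvc depend= MSiSCSI

Either way, the goal is the same: the WAN Accelerator service should not start until the underlying storage stack is available.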

Jun 26, 2024
 
vSphere 8U3 vGPU Mixed-Size Profiles

I’m happy to announce today that you can now deploy vGPU Mixed Size Virtual GPU types with VMware vSphere 8U3, also known as “Heterogeneous Time-Slice Sizes” or “Heterogeneous vGPU types”.

VMware vSphere 8U3 was released yesterday (June 26th, 2024), and brought with it numerous new features and functionality. However, mixed vGPU types deserve their own blog post, as they’re a major game-changer for those who use NVIDIA vGPU for AI and VDI workloads, including Omnissa Horizon.

NVIDIA vGPU (Virtual GPU) Types

When deploying NVIDIA vGPU, you configure Virtual GPU types that provide Workstation RTX (vWS Q-Series), Virtual PC (vPC B-Series), or Virtual Apps (vApps A-Series) class capabilities to virtual machines.

On top of the classifications above, you also need to configure the framebuffer memory size (VRAM/video RAM) allotted to the vGPU.

Historically, when you powered on the first VM, the physical GPU providing vGPU would then only be able to serve that Virtual GPU type (class and framebuffer size) to other VMs, locking all VMs running on that GPU to the same vGPU type. If you had multiple GPUs in a server, you could run different vGPU types on the different physical GPUs; however, each GPU would be locked to the vGPU type of the first VM started on it.

NVIDIA Mixed Size Virtual GPU Type functionality

Earlier this year, NVIDIA added the ability to deploy heterogeneous mixed vGPU types through the vGPU drivers, first with the ability to run different classifications (you could mix vWS and vPC), and later adding support for mixed-size framebuffers (for example, mixing a 4Q and 8Q profile on the same GPU).

While the NVIDIA vGPU solution supported this, VMware vSphere did not, so you couldn’t take advantage of it until the release of VMware vSphere 8U3 (VMware vCenter 8U3 and VMware ESXi 8U3).

Mixing different classifications (vWS with vPC) requires no configuration beyond a host driver and guest driver that support it; however, mixing different-sized framebuffers needs to be enabled on the host.

To Enable vGPU Mixed Size Virtual GPU types:

  1. Log on to VMware vCenter
  2. Confirm all vGPU enabled Virtual Machines are powered off
  3. Select the host in your inventory
  4. Select the “Configure” tab on the selected host
  5. Navigate to “Graphics” under “Hardware”
  6. Select the GPU from the list, click “Edit”, and change the “vGPU Mode” to “Mixed Size”
Screenshot showing the "Graphics Properties" for GPU adapters on VMware ESXi 8U3 with the "vGPU Mode" set to "Mixed Size"

Once you configure this, you can now deploy mixed-size vGPU profiles.

When you SSH into your host, you can run a query to confirm it’s configured:

[root@ESXi-HOST:~] nvidia-smi -q

    vGPU Device Capability
        Fractional Multi-vGPU             : Supported
        Heterogeneous Time-Slice Profiles : Supported
        Heterogeneous Time-Slice Sizes    : Supported
        vGPU Heterogeneous Mode           : Enabled

It’s supported, and enabled!

Additional Notes

Please note the following:

  • When restarting your hosts, resetting the GPU, and/or restarting the vGPU Manager daemon, the ESXi host will change back to its default “Same Size” mode. You will need to manually change it back to “Mixed Size”.
  • When enabling mixed-size vGPU types, the number of some vGPU profile types may be reduced compared to running the GPU in equal-size mode (to allow room for other profile types); see the query example after this list, and the “Virtual GPU Types for Supported GPUs” documentation for details on mixed-size vGPU types.
  • Only the “Best Effort” and “Equal Share” schedulers are supported with mixed-size vGPU. Fixed Share scheduling is not supported.
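
To see how mixed-size mode affects what can still be placed on a GPU, you can also list the vGPU types that are currently creatable from the ESXi shell. This is a quick sketch using standard nvidia-smi vgpu options; the exact output depends on your GPU model and vGPU Manager version.

# List the vGPU types that can currently be created on each physical GPU
nvidia-smi vgpu -c

# List all vGPU types supported by the installed vGPU Manager
nvidia-smi vgpu -s

Once VMs with different framebuffer sizes are running, the creatable list should shrink to the profiles that still fit alongside them.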
May 25, 2024
 
VDI Gaming Demo with NVIDIA vGPU and Omnissa Horizon

Here’s a fun quick VDI Gaming Demo with NVIDIA vGPU and Omnissa Horizon 8, using an NVIDIA L4 GPU and the L4-12Q Profile.

This video is just for fun, and shows some of the capabilities of the technology, hardware, and software, in this case with cloud gaming.

The NVIDIA vGPU solution provides the ability to “slice” a physical GPU and create multiple Virtual GPU (vGPU) devices for your Virtual Machines and virtual workloads.

In this video:

  • Quick Introduction to NVIDIA vGPU with Omnissa Horizon 8
  • Validating NVIDIA vGPU functionality (with DirectX Diagnostics, Horizon Performance Monitor Tracker)
  • MechWarrior 5 Cloud Gaming
  • Heaven Benchmark

Environment Details:

  • 2 x HPE DL360p Gen8 Servers (2 x 10 Core Procs, 384GB of RAM)
    • 1 Server with NVIDIA A2
    • 1 Server with NVIDIA L4
  • VMware vSphere 8U2
  • Omnissa Horizon 8

Hope you enjoy the video and demo!

Jan 06, 2024
 
vMotion with vGPU

Normally, any NVIDIA vGPU enabled VMs have to be migrated with a manual vMotion to evacuate the host when it is placed into maintenance mode. While we may have grown accustomed to this, there is a better way: vGPU enabled VM DRS evacuation during maintenance mode!

A new feature introduced with vSphere 7.0 U3f was the ability to configure and allow automatic vMotion of VMs with vGPUs, meaning that DRS can now migrate your VDI and AI/ML vGPU enabled workloads when hosts are placed into maintenance mode. This also allows you to streamline remediation with vLCM when updating vGPU enabled hosts running vGPU enabled VMs.

Additionally, as of vSphere 8.0 U2, DRS can estimate the STUN times required for vMotion of vGPU enabled VMs and control whether automatic DRS vMotions are allowed. This STUN time limit can be set by an administrator.

Enable automatic vMotion evacuation of vGPU enabled VMs

To enable the automatic vMotion of vGPU enabled VMs on your vSphere Cluster:

  1. Navigate to your vSphere Cluster.
     vSphere Cluster Selected
  2. Click on the “Configure” tab, select “vSphere DRS”, and click “Edit”.
     vSphere DRS Cluster - DRS Advanced Settings
  3. Navigate to the “Advanced Options” tab.
  4. Add “VgpuMMAutomationTimeoutSecs” and set it to “-1”.
     vSphere DRS set VgpuMMAutomationTimeoutSecs

After performing the above, when you place a host with vGPU enabled Virtual Machines into Maintenance Mode, vSphere DRS will evacuate and migrate the VMs to other hosts in the cluster that have the required hardware.
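
If you prefer PowerCLI over the vSphere Client, the same advanced option can be added to the cluster’s DRS configuration with something like the sketch below. The cluster name “Production-Cluster” is a placeholder, and this assumes a recent PowerCLI release where New-AdvancedSetting accepts the ClusterDRS setting type.

# Connect to vCenter (replace with your vCenter FQDN)
Connect-VIServer -Server vcenter.example.com

# Allow DRS to automatically vMotion vGPU enabled VMs during maintenance mode
$cluster = Get-Cluster -Name "Production-Cluster"
New-AdvancedSetting -Entity $cluster -Type ClusterDRS -Name "VgpuMMAutomationTimeoutSecs" -Value "-1" -Confirm:$false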

If you attempt to place a host into Maintenance Mode without enabling automatic vMotion of vGPU enabled VMs, it will fail with the error: “DRS failed to generate a vMotion recommendation for a virtual machine on a host entering Maintenance Mode”.

Enable and Configure vGPU STUN Time Estimate and Limits

If you are running vSphere 8U2 or higher, you can enable vGPU STUN time estimation and limits for DRS on the vGPU enabled cluster. Similar to the instructions above, we can add and configure two variables to the vSphere DRS cluster “Advanced Options”.

To enable STUN time estimation, add PassthroughDrsAutomation and set it to “1”.

To override the default vMotion STUN time limit of 100 seconds, add VmDevicesStunTimeTolerated and set it to your preferred maximum number of seconds. Alternatively, you can set this limit per VM by navigating to the VM in vSphere and adding this variable under the “VM Options” “Advanced Settings” section.
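
As a rough PowerCLI equivalent of the steps above (same placeholder cluster name, a hypothetical VM name for the per-VM override, and 300 seconds used purely as an example limit):

# Enable DRS STUN time estimation for vGPU/passthrough devices on the cluster
$cluster = Get-Cluster -Name "Production-Cluster"
New-AdvancedSetting -Entity $cluster -Type ClusterDRS -Name "PassthroughDrsAutomation" -Value "1" -Confirm:$false

# Raise the tolerated vMotion STUN time for the whole cluster (in seconds)
New-AdvancedSetting -Entity $cluster -Type ClusterDRS -Name "VmDevicesStunTimeTolerated" -Value "300" -Confirm:$false

# Or override the STUN time limit for a single VM instead
New-AdvancedSetting -Entity (Get-VM -Name "vGPU-VM01") -Name "VmDevicesStunTimeTolerated" -Value "300" -Confirm:$false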


Jan 05, 2024
 
NVIDIA vGPU Installed in VMware ESXi Host

You may experience GPU issues with the VMware Horizon Indirect Display Driver in your environment when using 3rd party applications that utilize the incorrect display adapter. This results in the inability to use and/or run GPU accelerated workloads, including VDI, AI, and ML.

This issue affects NVIDIA vGPU (both vGPU and vDGA passthrough), AMD MxGPU, and Intel Data Center GPU Flex GPUs using SR-IOV, in any deployment where the VMware Indirect Display Driver is installed.

When this issue occurs, the application will incorrectly query the capabilities of the VMware Indirect Display Adapter instead of the GPU that is presented to the VM, resulting in a scenario where the application isn’t aware of the capabilities of the GPU you are utilizing and fails to use the GPU for hardware acceleration, such as hardware encoding (NVENC) and hardware decoding.

What is the VMware Horizon Indirect Display Driver

The VMware Horizon Indirect Display Driver, also known as the VMware Indirect Display Driver, is a “virtual” display driver that isn’t bound to a specific hypervisor, and works with many deployments because of the lack of that limitation.

GPU Issues with the VMware Horizon Indirect Display Driver Enabled

This driver is installed with the VMware Horizon agent, and can work in conjunction with hardware acceleration, including GPUs (such as NVIDIA vGPU, AMD MxGPU, and Intel Data Center GPUs using SR-IOV).

Under normal circumstances, the VMware Horizon Indirect Display Driver acts as a lower-priority fallback driver for remoting protocols, except in environments where no hypervisor or GPU display drivers are available (like Horizon Cloud on Azure), in which case it becomes the priority.

The Problem

Applications designed to use a GPU may not be able to correctly identify which display adapter to use on the VM. While you may have a GPU, vGPU, or 3D acceleration in your environment, the application may be unaware of the device and/or its capabilities.

This is caused by the application either not correctly using the preferred primary display adapter (GPU and/or vGPU), or not being designed to handle multiple display adapters (and drivers).

Example Scenario:

When using CyberLink PowerDirector 360 in a VMware Horizon environment with an NVIDIA vGPU, the application will query the VM’s Windows instance for hardware acceleration capabilities, specifically hardware encoding, hardware decoding, and the use of APIs like NVIDIA’s NVENC encoder. In this scenario, while the VM does have an NVIDIA vGPU workstation profile attached with a valid NVIDIA RTX Virtual Workstation (vWS) license, the application is only aware of the VMware Indirect Display Driver and its capabilities. This results in all hardware accelerated encoding and decoding capabilities being disabled.

Example Symptoms

  • 3D Acceleration not detected by application
  • CUDA Cores not available for application
  • OpenCL not available
  • DirectX and Direct3D usage unavailable

In all scenarios, the VM will appear to have 3D acceleration; however, one or more applications won’t have access to it.

The Solution

Thanks to the design of the VMware Indirect Display Driver, it should be prioritized such that it’s used only when other display drivers (including NVIDIA vGPU) or system resources aren’t available; however, some 3rd party applications may not be able to respect that prioritization, or may not support multiple GPUs (multiple display drivers), resulting in the incorrect display adapter being used.

As a workaround, you can remove the VMware Indirect Display Driver from the Windows instance running in the VM.

NVIDIA vGPU with VMware Horizon Indirect Display Driver Removed

Please note that simply disabling the “VMware Horizon Indirect Display Driver” will not suffice. A full removal (right-click, “Uninstall Device”) is required to work around this issue. Additionally, upgrading or re-installing the VMware Horizon Agent will re-install the VMware Indirect Display Driver.
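
If you need to script the removal (for example, as part of an image build after the Horizon Agent is installed), a PowerShell sketch along these lines can locate and uninstall the device. The friendly-name filter is an assumption based on how the device typically appears in Device Manager, and pnputil /remove-device requires a reasonably recent Windows build, so verify both in your environment first.

# Find the VMware Horizon Indirect Display device (name filter is an assumption; verify in Device Manager)
$devices = Get-PnpDevice -Class Display |
    Where-Object { $_.FriendlyName -like "*Indirect Display*" }

# Remove each matching device (equivalent to right-click > "Uninstall Device")
foreach ($device in $devices) {
    pnputil.exe /remove-device $device.InstanceId
}

Keep in mind that, as noted above, upgrading or re-installing the VMware Horizon Agent will bring the driver back, so the removal would need to be repeated afterwards.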