If you're like me, you want to make sure that your environment is as optimized as possible. I recently noticed that my NVIDIA A2 vGPU was reporting a PCIe link speed and generation below what the card is supposed to run at on my VMware vSphere ESXi host.
I needed to find out whether this was being reported incorrectly, whether there was an issue, or whether something else was affecting it. The specific GPU I was using is supposed to support PCIe Gen4 and has a physical x8 connector; my host only has PCIe Gen3 slots, so I should at least be getting Gen3 speeds.
NVIDIA A2 vGPU
The Problem
When running the command “nvidia-smi -q”, the GPU was reporting that it was only running at PCIe Gen 1 speeds, as shown below:
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Device Current : 1
Device Max : 4
Host Max : N/A
Link Width
Max : 16x
Current : 8x
Something else worth noting is that the card reports a 16x link width, when it physically only has an x8 connector. I believe NVIDIA uses this same chip on another board that carries multiple GPUs and actually does use a 16x interface.
You could say I was quite puzzled. Why would the card only be running at PCIe Generation 1 speeds? I thought it could be any of the scenarios below:
Dynamic mode that alternates when required (possibly for power savings)
Hardware issue
Hardware Limitation (I’m using this in an older server)
Software issues
Configuration issue
Unfortunately, when searching the internet, I couldn't find many references to this metric. I did, however, find other users' copy/pastes of "nvidia-smi -q" output that showed the same values (running at PCIe Gen1), even with beefier, more high-end cards.
The Solution
After some more searching, I finally came across an NVIDIA technical document titled “Useful nvidia-smi Queries” that states that the current PCIe Generation Link speed “may be reduced when the GPU is not in use”. This confirms that it’s dynamic and adjusts when needed.
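That document also lists per-field queries you can use to watch the link state directly. For example, something along these lines polls the current and maximum PCIe generation and link width every couple of seconds (check "nvidia-smi --help-query-gpu" on your driver version for the exact field names it supports):
# Poll the PCIe link generation and width every 2 seconds
nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv -l 2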
Finally, I decided to run some games in a couple of the VMs, and to my surprise, while a game was running the "Device Current" and "Current" PCIe Generation changed to PCIe Gen3 (note that my server isn't capable of PCIe Gen4, which is the card's maximum), as shown below:
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Device Current : 3
Device Max : 4
Host Max : N/A
Link Width
Max : 16x
Current : 8x
In conclusion, if you notice this in your environment, do not be alarmed as this is completely normal and expected behavior.
When directly passing through a GPU, or attaching an NVIDIA vGPU, with more than 16GB of video memory to a Virtual Machine on VMware ESXi, you may run into a situation where the VM fails to boot with the error "Module 'DevicePowerOn' power on failed.". Special considerations are required when performing GPU or vGPU passthrough with 16GB+ of video memory.
This issue is specifically caused by memory mapping a GPU or vGPU device that has 16GB of memory or more, and can involve the host system (the ESXi host), the Virtual Machine configuration, or both.
In this post, I’ll address the considerations and requirements to passthrough these devices to virtual machines in your environment.
Most of the time the issue is VM configuration related; however, if the recommendations in the "VM GPU and vGPU Configuration Considerations" section below do not resolve the issue, please proceed to the "ESXi GPU and vGPU Host Considerations" section.
Please note that if the issue is host related, other errors may be present, or the device may not even be visible to ESXi.
VM GPU and vGPU Configuration Considerations
First and foremost, all new VMs should be created using the “EFI” Firmware type. EFI provides numerous advantages in device access and memory mapping versus the older style “BIOS” firmware types.
VM Firmware type EFI
To do this, create a new virtual machine, navigate to "VM Options", expand "Boot Options", and confirm/change the Firmware to "EFI". I recommend this for all new VMs, not only for VMs accessing GPUs or vGPUs with over 16GB of memory. Please note that you shouldn't change the firmware type on an existing VM; do this on a freshly created VM.
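For reference, if you look at the VM's .vmx file afterwards (or deploy VMs through automation), the EFI firmware type comes down to a single entry along these lines:
firmware = "efi"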
When performing GPU or vGPU passthrough with 16GB+ of video memory, you'll need to create a couple of entries under the "Advanced" settings to properly configure access to these PCIe devices and provide the proper environment for memory mapping. The lack of these settings is specifically what causes the "Module 'DevicePowerOn' power on failed." error.
Under the VM settings, head over to “VM Options”, expand “Advanced” and click on “Edit Configuration”, click on “Add Configuration Params”, and add the following entries:
VM GPU and vGPU Memory Settings for 16GB or higher memory mapping
You'll notice that while our GPU or vGPU profile may have 16GB of memory, we need to double that value and set it as the "pciPassthru.64bitMMIOSizeGB" parameter. If your card or vGPU profile had 32GB, you'd set it to "64".
Additionally, if you were passing through multiple GPUs or vGPU devices, you'd need to factor in all the memory being mapped and double the combined amount.
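For a 16GB GPU or vGPU profile like the one discussed here, the advanced configuration entries would look roughly like this (the "pciPassthru.use64bitMMIO" flag is the companion setting from VMware's 64-bit MMIO passthrough documentation; set the size to double your total mapped GPU memory):
pciPassthru.use64bitMMIO = "TRUE"
pciPassthru.64bitMMIOSizeGB = "32"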
ESXi GPU and vGPU Host Considerations
On most new and modern servers, no special configuration is required at the host level, as they are already designed to pass such devices through to the hypervisor properly. However, in some special cases and/or when using older servers, you may need to modify configuration and settings in the UEFI or BIOS.
If setting the VM configuration above still results in the same error (or possibly other errors), then you most likely need to make modifications to the ESXi host's BIOS/UEFI/RBSU to allow the proper memory mapping of the PCIe device, in our case the GPU.
This is where things get a bit tricky because every server manufacturer has different settings that will need to be configured.
Look for the following settings, or settings with similar terminology:
“Memory Mapping Above 4G”
“Above 4G Decoding”
“PCI Express 64-Bit BAR Support”
“64-Bit IOMMU Mapping”
Once you find the correct setting or settings, enable them.
Every vendor could be using different terminology, and there may be other settings that need to be configured that I don't have listed above. In my case, I had to go into a secret "SERVICE OPTIONS" menu on my HPE Proliant DL360p Gen8, as documented here.
After performing the recommendations in this guide, you should now be able to passthrough devices with over 16GB of memory.
With VMware ESXi 6.5 and 6.7 going End of Life on October 15th, 2022, many of you are looking for options to update the hosts in your homelab; in my case, that means putting ESXi 7.0 on HP Proliant DL360p Gen8 servers.
As far as support goes, HPE last provided a custom ESXi installer for version 6.5 U3, which was released in December 2019. This was the last "Pre-Gen9 custom image" released; ESXi 7.0 on the DL360p Gen8 is totally unsupported.
The jump from 6.5 to 6.7 was a little easier, as you could use the 6.5 custom installer, and then upgrade to 6.7. For the most part, as long as the hardware itself was supported, you were in pretty good shape.
Additionally, with the HPE vibsdepot loaded into VMware Update Manager (now known as Lifecycle Manager), you could also keep all the HPE drivers and agents up to date.
ESXi 7.0 on the Gen8 Servers
Some were lucky enough to upgrade their existing installs to 7 with no or limited problems, but the general consensus online was to expect problems. There were some major driver changes, which I believe at one point led to an advisory to perform a fresh install when upgrading from older versions, even on fully supported configurations with newer generation servers such as the Proliant Gen9 and Gen10.
In my setup, I have the following:
2 x HPE Proliant DL360p Gen8 Servers
Dual Intel Xeon E5-2660v2 Processors in each server
USB and/or SD for booting ESXi
No other internal storage
External SAN iSCSI Storage
Concerns and Considerations
My main concern is not only to have a stable and functioning ESXi 7 instance, but also, if possible, to keep the HPE drivers, agents, and iLO integrations.
Keep in mind that while this is completely unsupported, you still need to make sure that the components of your current configuration are supported, such as the processor and PCIe cards, even if the server as a whole is not.
In my case, my processors were supported, but my RAID controller was not. So theoretically, since I'm not using my RAID controllers, I should be fine.
The process – Installing ESXi 7.0
I was able to install ESXi 7.0 on my HPE Proliant Gen8 servers, by performing the following steps.
Download the Generic ESXi installer from VMware directly.
Boot server, install using the Generic Installer downloaded above.
Mount NFS or iSCSI datastore.
Copy HPE Custom Addon for ESXi zip file to datastore.
Enable SSH on host (or use console).
Place host in to maintenance mode.
Run “esxcli software vib install -d /vmfs/volumes/datastore-name/folder-name/HPE-703.0.0.10.9.1.5-Jul2022-Addon-depot.zip” from the command line.
The install will run and complete successfully.
Restart your server as needed. You'll now notice that not only were the HPE drivers installed, but also agents like the Agentless Management agent and the iLO integrations.
You’ll now have a functioning instance.
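If you want to quickly confirm that the HPE components made it onto the host, a simple check from an SSH session is to list the installed VIBs and filter for HPE packages, for example:
# List installed VIBs and show only the HPE ones
esxcli software vib list | grep -i hpe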
HP Proliant DL360p Gen8 running ESXi 7.0
In my case everything was working, except for the “Smart Array P420i” RAID Controller, which I don’t use anyways.
Additionally, if you have a vCenter instance running, make sure that you add the HPE vibsdepot repo to your Lifecycle Manager. After you add the repo, create a baseline and attach it to the host, then proceed to scan, stage, and remediate the server, which will further update all the HPE-specific drivers and software.
It’s been coming for a while: The requirement to deploy VMs with a TPM module… Today I’ll be showing you the easiest and quickest way to create and deploy Virtual Machines with vTPM with NKP (Native Key Provider) on VMware vSphere!
As most of you know, Windows 11 requires Secure Boot as well as a TPM module. There's little doubt that we'll also see this requirement with future Microsoft Windows Server operating systems.
While users struggle to deploy TPM modules on their own workstations to be eligible for the Windows 11 upgrade, ESXi administrators are also struggling with deploying Virtual TPM modules, or vTPM modules on their virtualized infrastructure.
With the Native Key Provider (NKP) on VMware vSphere, you can easily deploy a key provider, enabling vTPM (virtual Trusted Platform Module) enabled Virtual Machines.
What is a TPM Module?
TPM stands for Trusted Platform Module. A Trusted Platform Module is a piece of hardware (a chip inside or attached to your computer) that provides secure computing features to the computer, system, or server it's attached to.
The TPM module provides things like a random number generator, storage of encryption keys and cryptographic information, and aids in the secure authentication of the host system.
In a virtualization environment, we need to emulate this physical device with a Virtual TPM module, or vTPM.
What is a Virtual TPM (vTPM) Module?
A vTPM module is a virtualized software instance of a traditional physical TPM module. A vTPM can be attached to Virtual Machines and provide the same features and functionality that a physical TPM module would provide to a physical system.
vTPM modules can be deployed with VMware vSphere, and can be used to deploy Windows 11 on ESXi.
Deployment of vTPM modules requires a Key Provider on the vCenter Server.
Deploying vTPM (Virtual TPM Modules) on VMware vSphere with NKP
In order to deploy vTPM modules (and VM encryption, vSAN Encryption) on VMware vSphere, you need to configure a Key Provider on your vCenter Server.
Previously (though it's still an option), this would be accomplished with a Standard Key Provider utilizing a Key Management Server (KMS); however, this requires a 3rd party KMS server and is what I would consider a complex deployment.
VMware has made this easy as of vSphere 7 Update 2 (7U2), with the Native Key Provider (NKP) on the vCenter Server.
The Native Key Provider allows you to easily deploy technologies such as vTPM modules, VM encryption, and vSAN encryption, and the best part is that it's all built into the vCenter Server.
Enabling VMware Native Key Provider (NKP)
To enable NKP across your vSphere infrastructure:
Log on to your vCenter Server
Select your vCenter Server from the Inventory List
Select “Key Providers”
Click on “Add”, and select “Add Native Key Provider”
Give the new NKP a friendly name
De-select “Use key provider only with TPM protected ESXi hosts” to allow your ESXi hosts without a TPM to be able to use the native key provider.
In order to activate your new Native Key Provider, you need to click on "Backup" to make sure you have it backed up. Keep this backup in a safe place. After the backup is complete, your NKP will be active and usable by your ESXi hosts.
VMware vCenter with Native Key Provider (NKP) Configured
There are a few additional things to note:
Your ESXi hosts do NOT require a physical TPM module in order to use the Native Key Provider
Just make sure you disable the checkbox “Use key provider only with TPM protected ESXi hosts”
NKP can be used to enable vTPM modules on all editions of vSphere
If your ESXi hosts have a TPM module, using the Native Key Provider with your hosts' TPM modules can provide enhanced security
An onboard TPM module allows keys to be stored and used even if the vCenter Server goes offline
If you delete the Native Key Provider, you are also deleting all the keys stored with it.
Make sure you have it backed up
Make sure you don’t have any hosts/VMs using the NKP before deleting
You can now deploy vTPM modules to virtual machines in your VMware environment.
We all know that vMotion is awesome, but what is even more awesome? Optimizing VMware vMotion to make it redundant and faster!
vMotion allows us to migrate live Virtual Machines from one ESXi host to another without any downtime. This allows us to perform physical maintenance on the ESXi hosts, update and restart the hosts, and also load balance VMs across the hosts. We can even take this a step further and use DRS (Distributed Resource Scheduler) automation to intelligently place VMs on hosts at boot and to dynamically load balance the VMs as they run.
In this post, I'm hoping to provide information on how to fully optimize vMotion and use it to its full potential.
VMware vMotion
Most of you are probably running vMotion in your environment, whether it’s a homelab, dev environment, or production environment.
I typically see vMotion deployed on the existing data network in smaller environments, on its own dedicated network in larger environments, and, in very highly configured environments, with the vMotion TCP stack.
While you can perform a vMotion over 1Gb networking, you almost always want at least 10Gb networking for the vMotion network to avoid long-running migrations. Typically, most IT admins are happy with live vMotion migrations measured in seconds, not minutes.
VMware vMotion Optimization
So you might ask, if vMotion is working and you're satisfied, what is there to optimize? There are actually a few things, but first let's talk about what we can improve on.
We’re aiming for improvements with:
Throughput/Speed
  Faster vMotion
  Faster Speed
  Less Time
  Migrate more VMs
  Evacuate hosts faster
  Enable more aggressive DRS
  Migrate many VMs at once very quickly
Redundancy
  Redundant vMotion Interfaces (NICs and Uplinks)
More Complex vMotion Configurations
  vMotion over different subnets and VLANs
  vMotion routed over Layer 3 networks
To achieve the above, we can focus on the following optimizations:
Enable Jumbo Frames
Saturation of NIC/Uplink for vMotion
Multi-NIC/Uplink vMotion
Use of the vMotion TCP Stack
Let’s get to it!
Enable Jumbo Frames
I can’t stress enough how important it is to use Jumbo Frames for specialized network traffic on high speed network links. I highly recommend you enable Jumbo Frames on your vMotion network.
Note that you'll need a physical switch and NICs that support Jumbo Frames.
In my own high throughput testing on a 10Gb link, without using Jumbo frames I was only able to achieve transfer speeds of ~6.7Gbps, whereas enabling Jumbo Frames allowed me to achieve speeds of ~9.8Gbps.
When enabling this inside of vSphere and/or ESXi, you'll need to make sure you update the MTU on the applicable vmk adapters, vSwitch/vDS switches, and port groups. Additionally, as mentioned above, you'll need to enable it on your physical switches.
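If you prefer the command line over the UI, the same change can be made with esxcli. Here's a rough sketch assuming a standard vSwitch named "vSwitch1" and a vMotion vmkernel adapter "vmk1" (your names will differ, and on a vDS the switch MTU is set in vCenter instead):
# Set the MTU on the standard vSwitch
esxcli network vswitch standard set -v vSwitch1 -m 9000
# Set the MTU on the vMotion vmkernel adapter
esxcli network ip interface set -i vmk1 -m 9000
# Verify end to end with a don't-fragment ping sized just under 9000 bytes
vmkping -d -s 8972 <remote vMotion IP>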
Saturation of NIC/Uplink for vMotion
You may assume that once you configure a vMotion enabled NIC, that when performing migrations you will be able to fully saturate it. This is not necessarily the case!
When performing a vMotion, the vmk adapter is bound to a single thread (or CPU core). Depending on the power of your processor and the speed of the NIC, you may not actually be able to fully saturate a single 10Gb uplink.
In my own testing in my homelab, I needed to have a total of 2 VMK adapters to saturate a single 10Gb link.
If you’re running 40Gb or even 100Gb, you definitely want to look at adding multiple VMK adapters to your vMotion network to be able to fully saturate a single NIC or Uplink.
You can do this by simply configuring multiple VMK adapters per host with different IP addresses on the same subnet.
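As a rough esxcli example (the portgroup name, vmk number, and addressing below are just placeholders from my lab), adding a second vMotion vmkernel adapter on the same subnet might look like this:
# Create a second vmkernel adapter on the vMotion portgroup
esxcli network ip interface add -i vmk2 -p vMotion
# Give it a static IP on the same subnet as the existing vMotion vmk
esxcli network ip interface ipv4 set -i vmk2 -I 192.168.10.12 -N 255.255.255.0 -t static
# Enable the vMotion service on the new adapter
esxcli network ip interface tag add -i vmk2 -t VMotion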
One important thing to mention is that if you have multiple physical NICs and Uplinks connected to your vMotion switch, this change will not help you utilize multiple physical interfaces (NICs/Uplinks). See “Multi-NIC/Uplink vMotion”.
Please note: As of VMware vSphere 7 Update 2, the above is not required as vMotion has been optimized to use multiple streams to fully saturate the interface. See VMware’s blog post “Faster vMotion Makes Balancing Workloads Invisible” for more information.
Multi-NIC/Uplink vMotion
Another situation is where we may want to utilize multiple NICs and Uplinks for vMotion. When implemented correctly, this can provide load balancing (additional throughput) as well as redundancy on the vMotion network.
If you were to simply add additional NIC interfaces as Uplinks to your vMotion network, this would add redundancy in the event of a failover but it wouldn’t actually result in increased speed or throughput as special configuration is required.
To take advantage of the additional bandwidth made available by additional Uplinks, we need to specially configure multiple portgroups on the switch (vSwitch or vDS Distributed Switch), and configure each portgroup to only use one of the Uplinks as the “Active Uplink” with the rest of the uplinks under “Standby Uplink”.
Example Configuration
vSwitch or vDS Switch
Portgroup 1
Active Uplink: Uplink 1
Standby Uplinks: Uplink 2, Uplink 3, Uplink 4
Portgroup 2
Active Uplink: Uplink 2
Standby Uplinks: Uplink 1, Uplink 3, Uplink 4
Portgroup 3
Active Uplink: Uplink 3
Standby Uplinks: Uplink 1, Uplink 2, Uplink 4
Portgroup 4
Active Uplink: Uplink 4
Standby Uplinks: Uplink 1, Uplink 2, Uplink 3
You would then place a single or multiple vmk adapters on each of the portgroups per host, which would result in essentially mapping the vmk(s) to the specific uplink. This will allow you to utilize multiple NICs for vMotion.
And remember, you may not be able to fully saturate a NIC interface (as stated above) with a single vmk adapter, so I highly recommend creating multiple vmk adapters on each of the Port groups above to make sure that you’re not only using multiple NICs, but that you can also fully saturate each of the NICs.
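If you're configuring this on a standard vSwitch from the command line, the per-portgroup failover order can be set with esxcli. A sketch for the first two portgroups, assuming portgroups named "vMotion-PG1" and "vMotion-PG2" with uplinks vmnic0 through vmnic3 (on a vDS you'd set the same thing under each portgroup's Teaming and failover settings in vCenter):
# Portgroup 1: vmnic0 active, the rest standby
esxcli network vswitch standard portgroup policy failover set --portgroup-name=vMotion-PG1 --active-uplinks=vmnic0 --standby-uplinks=vmnic1,vmnic2,vmnic3
# Portgroup 2: vmnic1 active, the rest standby
esxcli network vswitch standard portgroup policy failover set --portgroup-name=vMotion-PG2 --active-uplinks=vmnic1 --standby-uplinks=vmnic0,vmnic2,vmnic3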
vMotion TCP Stack
VMware released the vMotion TCP stack to provide added security to vMotion capabilities, as well as to introduce vMotion over multiple subnets (routed vMotion over Layer 3).
Using the vMotion TCP stack, you can isolate vMotion traffic and give it its own gateway, separate from the other vmk adapters that use the default TCP stack on the ESXi host.
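As a rough sketch of what that looks like with esxcli (the portgroup name, addresses, and gateway are placeholders; the "vmotion" netstack itself ships with ESXi), you create a vmkernel adapter on the vMotion TCP/IP stack and give that stack its own default gateway:
# Create a vmkernel adapter bound to the dedicated vMotion TCP/IP stack
# (adapters on this stack carry vMotion traffic without needing the vMotion service tag)
esxcli network ip interface add -i vmk3 -p vMotion-PG1 --netstack=vmotion
# Assign it a static IP
esxcli network ip interface ipv4 set -i vmk3 -I 192.168.20.13 -N 255.255.255.0 -t static
# Give the vMotion stack its own default gateway for routed (Layer 3) vMotion
esxcli network ip route ipv4 add --netstack=vmotion --network=default --gateway=192.168.20.1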