Mar 182020
 
vSphere Logo Image

I’ve noticed in a few situations where an ESXi host is marked as “unresponsive” or “disconnected” inside of vCenter due to issues occurring on that host (or connected hardware). This recently happened again with a customer and is why I’m writing this article at this very moment.

In these situations, usually all normal means of managing, connecting, or troubleshooting the host are unavailable. Usually in cases like this ESXi administrators would simply reset the host.

However, I’ve found hosts can often be rescued without requiring an ungraceful restart or reset.

Observations

In these situations, it can be observed that:

  • The ESXi host is in a unresponsive to disconnected state to vCenter Server.
  • Connecting to the ESXi host directly does not work as it either doesn’t acknowledge HTTPS requests, or comes up with an error.
  • Accessing the console of the ESXi host isn’t possible as it appears frozen.
  • While the ESXi host is unresponsive, the virtual machines are still online and available on the network.

Troubleshooting

In the few situations I’ve noticed this occurring, troubleshooting is possible but requires patience. Consider the following:

  • When trying to access the ESXi console, give it time after hitting enter or selecting a value. If there’s issues on the host such as commands pending, tasks pending, or memory issues, the console may actually respond if you give it 30 seconds to 5 minutes after selecting an item.
  • With the above in mind, attempt to enable console access (preferably console and not SSH). The logins may take some time (30 seconds to 5 minutes after typing in the password), but you might be able to gain troubleshooting access.
  • Check the SAN, NAS, and any shared storage… In one instance, there were issues with a SAN and datastore that froze 2 VMs. The Queued commands to the SAN caused the ESXi host to become unresponsive.
  • There may be memory issues with the ESXi instance. The VMs are fine, however an agent, driver, or piece of software may be causing the hypervisor layer to become unresponsive.

If there are storage issues, do what you can. In one of the cases above, we had to access the ESXi console, issue a “kill -9” to the VM, and then restart the SAN. We later found out there was issues with the SAN and corrupted virtual machines. The moment the SAN was restarted, the ESXi host became responsive, connected to the vCenter server and could be managed.

In another instance, on an older version of ESXi there was an HPE agentless management driver/service that was consuming the ESXi hosts memory continuously causing the memory to overflow, the host to fill the swap and become unresponsive. Eventually after gracefully shutting down the VMs, I was able to access the console, kill the service, and the host become responsive.

May 082019
 
vSphere Logo Image

We’ve all been in the situation where we need to install a driver, vib file, or check “esxtop”. Many advanced administration tasks on ESXi need to be performed via shell access, and to do this you either need a console on the physical ESXi host, an SSH session, or use the Remote vCLI.

In this blog post, I’m going to be providing a quick “How to” enable SSH on an ESXi host in your VMware Infrastructure using the vCenter flash-based web administration interface. This will allow you to perform the tasks above, as well as use the “esxcli” command which is frequently needed.

This method should work on all vCenter versions up to 6.7, and ESXi versions up to 6.7.

How to Enable SSH on an ESXi Host Server

  1. Log on to your vCenter server.
    vCenter Server Login Window
  2. On the left hand “Navigator” pane, select the ESXi host.
    ESXi Navigator Pane on vCenter Web Interface
  3. On the right hand pane, select the “Configure” tab, then “Security Profile” under “System.
    ESXi Host Configuration under Configure Tab in Web Interface
  4. Scroll down and look for “Services” further to the right and select “Edit”.
    ESXi Host Services in Host Configuration
  5. In the “Edit Security Profile” window, select and highlight “SSH” and then click “Start”.
    ESXi Services List on vCenter web interface
  6. Click “Ok”.

This method can also be used to stop, restart, and change the startup policy to enable or disable SSH starting on boot.

Congratulations, you can now SSH in to your ESXi host!

May 022019
 
Nvidia GRID Logo

I can’t tell you how excited I am that after many years, I’ve finally gotten my hands on and purchased an Nvidia Quadro K1 GPU. This card will be used in my homelab to learn, and demo Nvidia GRID accelerated graphics on VMware Horizon View. In this post I’ll outline the details, installation, configuration, and thoughts. And of course I’ll have plenty of pictures below!

The focus will be to use this card both with vGPU, as well as 3D accelerated vSGA inside in an HPE server running ESXi 6.5 and VMware Horizon View 7.8.

Please Note: As of late (late 2020), hardware h.264 offloading no longer functions with VMware Horizon and VMware BLAST with NVidia Grid K1/K2 cards. More information can be found at https://www.stephenwagner.com/2020/10/10/nvidia-vgpu-grid-k1-k2-no-h264-session-encoding-offload/

Please Note: Some, most, or all of what I’m doing is not officially supported by Nvidia, HPE, and/or VMware. I am simply doing this to learn and demo, and there was a real possibility that it may not have worked since I’m not following the vendor HCL (Hardware Compatibility lists). If you attempt to do this, or something similar, you do so at your own risk.

Nvidia GRID K1 Image

For some time I’ve been trying to source either an Nvidia GRID K1/K2 or an AMD FirePro S7150 to get started with a simple homelab/demo environment. One of the reasons for the time it took was I didn’t want to spend too much on it, especially with the chances it may not even work.

Essentially, I have 3 Servers:

  1. HPE DL360p Gen8 (Dual Proc, 128GB RAM)
  2. HPE DL360p Gen8 (Dual Proc, 128GB RAM)
  3. HPE ML310e Gen8 v2 (Single Proc, 32GB RAM)

For the DL360p servers, while the servers are beefy enough, have enough power (dual redundant power supplies), and resources, unfortunately the PCIe slots are half-height. In order for me to use a dual-height card, I’d need to rig something up to have an eGPU (external GPU) outside of the server.

As for the ML310e, it’s an entry level tower server. While it does support dual-height (dual slot) PCIe cards, it only has a single 350W power supply, misses some fancy server technologies (I’ve had issues with VT-d, etc), and only a single processor. I should be able to install the card, however I’m worried about powering it (it has no 6pin PCIe power connector), and having ESXi be able to use it.

Finally, I was worried about cooling. The GRID K1 and GRID K2 are typically passively cooled and meant to be installed in to rack servers with fans running at jet engine speeds. If I used the DL360p with an external setup, this would cause issues. If I used the ML310e internally, I had significant doubts that cooling would be enough. The ML310e did have the plastic air baffles, but only had one fan for the expansion cards area, and of course not all the air would pass through the GRID K1 card.

The Purchase

Because of a limited budget, and the possibility I may not even be able to get it working, I didn’t want to spend too much. I found an eBay user local in my city who had a couple Grid K1 and Grid K2 cards, as well as a bunch of other cool stuff.

We spoke and he decided to give me a wicked deal on the Grid K1 card. I thought this was a fantastic idea as the power requirements were significantly less (more likely to work on the ML310e) on the K1 card at 130 W max power, versus the K2 card at 225 W max power.

NVIDIA GRID K1 and K2 Specifications
NVIDIA GRID K1 and K2 Specification Table

The above chart is a capture from:
https://www.nvidia.com/content/cloud-computing/pdf/nvidia-grid-datasheet-k1-k2.pdf

We set a time and a place to meet. Preemptively I ran out to a local supply store to purchase an LP4 power adapter splitter, as well as a LP4 to 6pin PCIe power adapter. There were no available power connectors inside of the ML310e server so this was needed. I still thought the chances of this working were slim…

These are the adapters I purchased:

Preparation and Software Installation

I also decided to go ahead and download the Nvidia GRID Software Package. This includes the release notes, user guide, ESXi vib driver (includes vSGA, vGPU), as well as guest drivers for vGPU and pass through. The package also includes the GRID vGPU Manager. The driver I used was from:
https://www.nvidia.com/Download/driverResults.aspx/144909/en-us

To install, I copied over the vib file “NVIDIA-vGPU-kepler-VMware_ESXi_6.5_Host_Driver_367.130-1OEM.650.0.0.4598673.vib” to a datastore, enabled SSH, and then ran the following command to install:

esxcli software vib install -v /path/to/file/NVIDIA-vGPU-kepler-VMware_ESXi_6.5_Host_Driver_367.130-1OEM.650.0.0.4598673.vib

The command completed successfully and I shut down the host. Now I waited to meet.

We finally met and the transaction went smooth in a parking lot (people were staring at us as I handed him cash, and he handed me a big brick of something folded inside of grey static wrap). The card looked like it was in beautiful shape, and we had a good but brief chat. I’ll definitely be purchasing some more hardware from him.

Hardware Installation

Installing the card in the ML310e was difficult and took some time with care. First I had to remove the plastic air baffle. Then I had issues getting it inside of the case as the back bracket was 1cm too long to be able to put the card in. I had to finesse and slide in on and angle but finally got it installed. The back bracket (front side of case) on the other side slid in to the blue plastic case bracket. This was nice as the ML310e was designed for extremely long PCIe expansion cards and has a bracket on the front side of the case to help support and hold the card up as well.

For power I disconnected the DVD-ROM (who uses those anyways, right?), and connected the LP5 splitter and the LP5 to 6pin power adapter. I finally hooked it up to the card.

I laid the cables out nicely and then re-installed the air baffle. Everything was snug and tight.

Please see below for pictures of the Nvidia GRID K1 installed in the ML310e Gen8 V2.

Host Configuration

Powering on the server was a tense moment for me. A few things could have happened:

  1. Server won’t power on
  2. Server would power on but hang & report health alert
  3. Nvidia GRID card could overheat
  4. Nvidia GRID card could overheat and become damaged
  5. Nvidia GRID card could overheat and catch fire
  6. Server would boot but not recognize the card
  7. Server would boot, recognize the card, but not work
  8. Server would boot, recognize the card, and work

With great suspense, the server powered on as per normal. No errors or health alerts were presented.

I logged in to iLo on the server, and watched the server perform a BIOS POST, and start it’s boot to ESXi. Everything was looking well and normal.

After ESXi booted, and the server came online in vCenter. I went to the server and confirmed the GRID K1 was detected. I went ahead and configured 2 GPUs for vGPU, and 2 GPUs for 3D vSGA.

ESXi Graphics Settings for Host Graphics and Graphics Devices
ESXi Host Graphics Devices Settings

VM Configuration

I restarted the X.org service (required when changing the options above), and proceeded to add a vGPU to a virtual machine I already had configured and was using for VDI. You do this by adding a “Shared PCI Device”, selecting “NVIDIA GRID vGPU”, and I chose to use the highest profile available on the K1 card called “grid_k180q”.

Virtual Machine Edit Settings with NVIDIA GRID vGPU and grid_k180q profile selected
VM Settings to add NVIDIA GRID vGPU

After adding and selecting ok, you should see a warning telling you that must allocate and reserve all resources for the virtual machine, click “ok” and continue.

Power On and Testing

I went ahead and powered on the VM. I used the vSphere VM console to install the Nvidia GRID driver package (included in the driver ZIP file downloaded earlier) on the guest. I then restarted the guest.

After restarting, I logged in via Horizon, and could instantly tell it was working. Next step was to disable the VMware vSGA Display Adapter in the “Device Manager” and restart the host again.

Upon restarting again, to see if I had full 3D acceleration, I opened DirectX diagnostics by clicking on “Start” -> “Run” -> “dxdiag”.

DirectX Diagnostic Tool (dxdiag) showing Nvidia Grid K1 on VMware Horizon using vGPU k180q profile
dxdiag on GRID K1 using k180q profile

It worked! Now it was time to check the temperature of the card to make sure nothing was overheating. I enabled SSH on the ESXi host, logged in, and ran the “nvidia-smi” command.

nvidia-smi command on ESXi host showing GRID K1 information, vGPU information, temperatures, and power usage
“nvidia-smi” command on ESXi Host

According to this, the different GPUs ranged from 33C to 50C which was PERFECT! Further testing under stress, and I haven’t gotten a core to go above 56. The ML310e still has an option in the BIOS to increase fan speed, which I may test in the future if the temps get higher.

With “nvidia-smi” you can see the 4 GPUs, power usage, temperatures, memory usage, GPU utilization, and processes. This is the main GPU manager for the card. There are some other flags you can use for relevant information.

nvidia-smi with vgpu flag for vgpu information
“nvidia-smi vgpu” for vGPU Information
nvidia-smi with vgpu -q flag
“nvidia-smi vgpu -q” to Query more vGPU Information

Final Thoughts

Overall I’m very impressed, and it’s working great. While I haven’t tested any games, it’s working perfect for videos, music, YouTube, and multi-monitor support on my 10ZiG 5948qv. I’m using 2 displays with both running at 1920×1080 for resolution.

I’m looking forward to doing some tests with this VM while continuing to use vGPU. I will also be doing some testing utilizing 3D Accelerated vSGA.

The two coolest parts of this project are:

  • 3D Acceleration and Hardware h.264 Encoding on VMware Horizon
  • Getting a GRID K1 working on an HPE ML310e Gen8 v2

Highly recommend getting a setup like this for your own homelab!

Uses and Projects

Well, I’m writing this “Uses and Projects” section after I wrote the original article (it’s now March 8th, 2020). I have to say I couldn’t be impressed more with this setup, using it as my daily driver.

Since I’ve set this up, I’ve used it remotely while on airplanes, working while travelling, even for video editing.

Some of the projects (and posts) I’ve done, can be found here:

Leave a comment and let me know what you think! Or leave a question!

Feb 182019
 
ESXi Fatal error: 8 (Device Error)

Unable to boot ESXi from USB or SD Card on HPE Proliant Server

After installing HPE iLO Amplifier on your network and updating iLO 4 firmware to 2.60 or 2.61, you may notice that your HPE Proliant Servers may fail to boot ESXi from a USB drive or SD-Card.

This was occuring on 2 ESXi Hosts. Both were HPE Proliant DL360p Gen8 Servers. One server was using an internal USB drive for ESXi, while the other was using an HPE branded SD Card.

The issue started occuring on both hosts after a planned InfoSight implementation. Both hosts iLO controllers firmware were upgraded to 2.61, iLO Amplifier was deployed (and the servers added), and the amplifier was connected to an HPE InfoSight account.

Update – May 24th 2019: As an HPE partner, I have been working with HPE, the product manager, and development team on this issue. HPE has provided me with a fix to test that I have been able to verify fully resolves this issue! Stay tuned for more information!

Update – June 5th 2019: Great news! As Bob Perugini (WW Product Manager at HPE) put it: “HPE is happy to announce that this issue has been fixed in latest version of iLO Amplifier Pack, v1.40. To download iLO Amplifier Pack v1.40, go to http://www.hpe.com/servers/iloamplifierpack and click “download”.” Scroll to the bottom of the post for more information!

Please see below for errors:

Errors

ESXi Fatal error: 8 (Device Error)
ESXi Fatal error: 8 (Device Error)
Error loading /s.v00
Compressed MD5: 00000000000000000000000000
Decompressed MD5: 00000000000000000000000000
Fatal error: 8 (Device Error)
Error mboot.c32 attempted DOS system call
Error mboot.c32 attempted DOS system call
mboot.c32: attempted DOS system call INT 21 0d00 E8004391
boot:

Symptoms

This issue may occur intermittently, on the majority of boots, or on all boots. Re-installing ESXi on the media, as well as replacing the USB/SD Card has no effect. Installation will be successful, however you the issue is still experiences on boot.

HPE technical support was unable to determine the root of the issue. We found the source of the issue and reported it to HPE technical support and are waiting for an update.

The Issue and Fix

This issue occurs because the HPE iLO Amplifier is running continuous server inventory scans while the hosts are booting. When one inventory completes, it restarts another scan.

The following can be noted:

  • iLO Amplifier inventory percentage resets back to 0% and starts again numerous times during the server boot
  • Inventory scan completes, only to restart again numerous times during the server boot
  • Inventory scan resets back to 0% during numerous different phases of BIOS initialization and POST.
HPe iLO Amplifier Inventory
HPE iLO Amplifier Inventory

We noticed that once the HPE iLO Amplifier Virtual Machine was powered off, not only did the servers boot faster, but they also booted 100% succesfully each time. Powering on the iLO Amplifier would cause the ESXi hosts to fail to boot once again.

I’d also like to note that on the host using the SD-Card, the failed boot would actually completely lock up iLO, and would require physical intervention to disconnect and reconnect the power to the server. We were unable to restart the server once it froze (this did not happen to the host using the USB drive).

There are some settings on the HPE iLO amplifier to control performance and intervals of inventory scans, however we noticed that modifying these settings did not alter or stop the issue, and had no effect.

As a temporary workaround, make sure your iLO amplifier is powered off during any maintenance to avoid hosts freezing/failing to boot.

To fully resolve this issue, upgrade your iLO Amplifier to the latest version (1.40 as of the time of this update). The latest version can be downloaded at: http://www.hpe.com/servers/iloamplifierpack.

Update – April 10th 2019

I’ve attempted to try downgrading to the earliest supported iLo version 2.54, and the issue still occurs.

I also upgraded to the newest version 2.62 which presented some new issues.

On the first boot, the BIOS reported memory access issues on Processor 1 socket 1, then another error reporting memory access issues on Processor 1 socket 4.

I disconnected the power cables, reconnected, and restarted the server. This boot, the server didn’t even detect the bootable USB stick.

Again, after shutting down the iLo Amplifier, the server booted properly and the issue disappeared.

Update – May 24th 2019

As an HPE partner, I have been working with HPE, the product manager, and development team on this issue. HPE has provided me with a fix to test that I have been able to verify fully resolves this issue! Stay tuned for more information!

Update – June 5th 2019 – ITS FIXED!!!

Great news as the issue is now fixed! As Bob Perugini (WorldWide Product Manager at HPE) said it:

HPE is happy to announce that this issue has been fixed in latest version of iLO Amplifier Pack, v1.40.

To download iLO Amplifier Pack v1.40, go to http://www.hpe.com/servers/iloamplifierpack and click “download”.


Here’s what’s new in iLO Amplifier Pack v1.40:
─ Available as a VMware ESXi appliance and as a Hyper-V appliance (Hyper-V is new)
─ VMware tools have been added to the ESXi appliance
─ Ability to schedule the time of the daily transmission of Active Health System (AHS) data to InfoSight
─ Ability to opt-in and allow the IP address and hostname of the server to be transmitted to InfoSight and displayed
─ Test connectivity button to help verify iLO Amplifier Pack has successfully connected to InfoSight
─ Allow user authentication credentials for the proxy server when connecting to InfoSight
─ Added ability to specify IP address or hostname for the HPE RDA connection when connection to InfoSight
─ Ability to send updated AHS data “now” for an individual server
─ Ability to stage firmware and driver updates to the iLO Repository and then deploy the staged updates at a later date or time (HPE Gen10 servers only)
─ Allow the firmware and driver updates of servers whose iLO has been configured in CNSA (Commercial National Security Algorithm) mode (HPE Gen10 servers only)

Oct 202018
 

On an ESXi host when performing a manual unmap on your storage datastore, you may notice a very large (hidden) file on the datastore root called “.asyncUnmapFile”. This file could be taking up terabytes of space, and you aren’t able to delete this file.

asyncUnmapFile

Typically the asyncUnmapFile is used by the UNMAP feature on ESXi hosts to deal with unmapping and unallocating storage blocks on the SAN. When you run a manual UNMAP, this file should be created and should appear to using “0” (no) space (even if it is). When an UNMAP completes, this file should disappear and be automatically removed by the function. If an UNMAP is interupted, this file will not be deleted, allowing you to restart the process and upon a full successful completion, it should then be deleted.

The Problem

Some time ago, I had an issue when performing a manual UNMAP, where the ESXi host became unresponsive (due to memory issues). The command appeared to be completed, however I believe it caused potential issues or corruption on the iSCSI datastore. In subsequent runs, the UNMAP appeared to be functioning and working, however I didn’t realize that the asyncUnmapFile had grown to around 1.5TB.

This was noticed during a SAN storage audit, where we saw that the virtual pool on the SAN was using up way more storage than it should be on the datastore.

When we identified the file was this large and causing issues, we attempted to perform 2 UNMAPs (different reclaim sizes) to see if it would be automatically cleared afterwards. It had no effect and the file was unchanged.

We also tried to modify the permissions on the file, however when trying to delete it, it would report that the file or folder was not found, or that it does not exist. This was concerning as we were worried about potential datastore corruption.

It was also noticed that in the hostd.log and vmkernel.log we saw some errors where the host believed that the blocks on the datastore had already been freed: “on volume labeled ‘iSCSIDatastore01’ already freed by another host: This may be a non-issue”

The Solution

Unfortunately with all the research we did, we couldn’t find a clear-cut solution. With worries that the datastore may be corrupted, we needed to do something.

A decision was ultimately made to Storage vMotion all the VMs (Virtual Machines) to another datastore on a separate storage pool, delete the now empty LUN, and recreate it from scratch. After this, we used Storage vMotion again to move the VMs back.

Instantly I noticed that the VMs on that datastore were running faster (it’s only been 12 hours, so I’ll be adding an update in a few days to confirm). We no longer have the file on any of our datastores.

If anyone has further insight in to this issue, please leave a comment!