Nov 052016
 

Yesterday, I had a reader (Nicolas) leave a comment on one of my previous blog posts bringing my attention to the MTU for Jumbo Frames on the HPE MSA 2040 SAN.

MSA 2040 MTU Comment

Since I first started working with the MSA 2040. Looking at numerous HPE documents outlining configuration and best practices, the documents did confirm that the unit supported Jumbo Frames. However, the documentation on the MTU was never clearly stated and can be confusing. I was under the assumption that the unit supported 9000 MTU, while reserving 100 bytes for overhead. This is not necessarily the case.

Nicolas chimed in and provided details on his tests which confirmed the HPE MSA 2040 does actually have a working MTU of 8900. In my configuration I did the tests (that Nicolas outlined), and confirmed that the MTU would cause packet fragmentation if the MTU was greater than 8900.

ESXi vmkping usage: https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1003728

This is a big discovery because packet fragmentation will not only degrade performance, but flood the links with lots of packet fragmentation.

I went ahead and re-configured my ESXi hosts to use an MTU of 8900 on the network used with my SAN. This immediately created a MASSIVE performance increase (both speed, and IOPS). I highly recommend that users of the MSA 2040 SAN confirm this on their own, and update the MTUs as they see fit.

Also, this brings up another consideration. Ideally, on a single network, you want all devices to be running the same MTU. If your MSA 2040 SAN is on a storage network with other SAN devices (or any other device), you may want to configure all of them to use the MTU of 8900 if possible (and of course, don’t forget your servers).

A big thank you to Nicolas for pointing this out!

Sep 232016
 

Well, recently one of the servers I monitor and maintain in a remote oil town recently started throwing out a Windows event log warning:

Event ID: 129

Source: HpCISSs2

Description: Reset to device, \Device\RaidPort0, was issued.

The server is an HP ML350p Gen8 (Windows Server 2008 R2) running latest firmware and management software. It has 2 RAID Arrays (RAID1, and RAID5), and a total of 6 disks.

Researching this error, I read that most people had this occur when running the latest HP WBEM providers, as well as anti-virus software. In our case, I actually tried to downgrade to an older version, but noticed the warning still occurs. While we do have anti-virus, it’s not actively scanning (only weekly scheduled scans).

In the process of troubleshooting, I noticed that under the HP Systems Management Homepage, one of the drives in the RAID1 array, had the following stats:

Hard Read Erros:  150
Recovery Read Errors:  7
Total Seeks:  0
Seek Errors:  0

I found these numbers to be very high in my experience. None of the other drives had anything close to this (in 4 years of running, only one other disk had a read error (a single one), this disk however had tons. For some reason the drive is still reporting as operational, when I’d expect it to be marked as a predicted failure, or failed.

While all online documentation was pointing towards at locks on the array by software, from my own experience I think it was actually the array waiting for a read operation on the array, and it was this single disk that was causing a threshold to be hit in the driver, that caused a retry to recover the read operation.

Called up HPE support, I mentioned I’d like to have the drive replaced. The support engineer consulted her senior engineer and reviewed the evidence I presented (along with ADU reports, and Active Monitoring health reports), the senior engineer concurred that the drive should be replaced.

Replacing the drive resolved the issue. I’m also noticing a performance increase on the array as well.

Make sure to always check the stats on the individual components of your RAID arrays, even if everything is operating sound.

Sep 102016
 

When initiating manual backups or occasionally when automatic/scheduled backups run, a user may notice that Windows Server Backup may appear to “hang” when the status is reporting: “Preparing media to store backups…”.

In some rare cases, it may actually be in a hang state, however most of the time, it’s actually consolidating and/or checking previous backups on the destination media.

To Confirm this:

Open the Task Manager as Administrator, then click on the “Performance” tab, click on “Open Resource Monitor”. Flip over to the “Disk” tab, expand “Disk Activity”, and sort by name. You should see the read requests on the destination media, you’ll also notice that it is slowly progressing consecutively through each backup set (increments of 1, accessing multiple at a time).

This confirms that the Windows Server Backup services are functioning and it is in fact running. In one case, I had 723 previous backups, and it took around 50 minutes to count from 1 to 723, and then the backup finally proceeded.

I have also seen this occur when a previous backup failed or was cancelled. This occurs with Windows Server Backup on Windows Server 2008, Windows Server 2008 R2, and Windows Server 2012 R2.

Jul 302016
 

I have identified and confirmed with 2 different HPE MSA 2040 SANs an issue with SMTP notifications. I’ve identified the issue with multiple firmware versions (even the latest version as of the date of this article being written). The issue stops e-mail notifications from being sent from the MSA 2040 when the SAN is configured with some SMTP relays. This issue also occurs on HPE MSA 2050 arrays, as well as HPE MSA 2052 arrays.

The main concern is that some administrators may configure the notification service believing it is working, when in fact it is not. This could cause problems if the SAN isn’t regularly monitored and if e-mail notifications alone are being used to monitor its health.

Configuration:

-MSA 2040 (2050/2052) Dual Controller SAN configured with SMTP notifications

-SMTP destination server configured as EXIM mail proxy (in my case a Sophos UTM firewall)

Symptoms:

-Test notifications are not received (even though the MSA confirms OK on transmission)

-Real notifications are not received

-Occasionally if numerous tests are sent in a short period of time (5+ tests within 3 seconds), one of the tests may actually go through.

Events and Logs observed:

/var/log/smtp/2016/06/smtp-2016-06-20.log.gz:2016:06:20-20:44:29 SERVERNAME exim-in[16539]: 2016-06-20 20:44:29 SMTP connection from [SAN:CONTROLLER:IP:ADDY]:36977 (TCP/IP connection count = 1)

/var/log/smtp/2016/06/smtp-2016-06-20.log.gz:2016:06:20-20:44:29 SERVERNAME exim-in[18615]: 2016-06-20 20:44:29 SMTP protocol synchronization error (input sent without waiting for greeting): rejected connection from H=[SAN:CONTEROLLER:IP:ADDY]:36977 input=”NOOP\r\n”

Resolution:

To resolve this issue, I tried numerous things however the only fix I could come up with, is configuring the SAN to relay SMTP notifications through a Exchange 2013 Server. To do this, you must create a special connector to allow SMTP relaying of anonymous messages (security must be configured on this connector to stop SPAM), and further modify security permissions on that send connector to allow transmission to external e-mail addresses. After doing this, e-mail notifications (and weekly SMTP reports) from the SAN are being received reliably.

Additional Notes:

-While in my case the issue was occurring with EXIM on a Sophos UTM firewall, I believe this issue may occur with other E-mail servers or SMTP relay servers.

-Tried configuring numerous exceptions on the SMTP relay with no effect.

-Rejected e-mail messages do not appear in the mail manager, only the SMTP relay log on the Sophos UTM.

-Always test SMTP notifications on a regular basis.

Mar 262016
 

An issue that’s been making me rip my hair apart for some time… And a fix for you experiencing the same.

 

Equipment:

HP Proliant DL360 G6 Server (with a P800 Controller) running Server 2012 R2 and Backup Exec 2014

HP MSL-2024 Tape Library with a single HP SAS LTO-6 Tape Drive

 

Symptoms:

-After a clean restart, a backup job completes successfully. Subsequent jobs fail until server or services restarted.

-While the initial backup does complete, errors/warnings can be seen in the adamm.log and the Event Viewer even when successful.

-Subsequent backups failing report that the device is offline. The Windows Device Manager reports everything is fine.

-Windows Server itself does not report any device errors whatsoever.

 

Observations:

[5648] 03/05/16 07:50:46 Adamm Mover Error: DeviceIo: 03:07:00:00 – Device error 1167 on “\\.\Tape0”, SCSI cmd 0a, 1 total errors
[5648] 03/05/16 07:55:46 Adamm Mover Error: DeviceIo: 03:07:00:00 – Refresh handle on “\\.\Tape0”, SCSI cmd 00, new handle 214, error 0
[5648] 03/05/16 07:55:46 Adamm Mover Error: DeviceIo: 00:00:00:00 – Retry Logic: Retry logic was engaged on device: HP       Ultrium 6-SCSI
[5648] 03/05/16 07:55:46 Adamm Mover Error: DeviceIo: 00:00:00:00 – Retry Logic: Original settings restored on device: HP       Ultrium 6-SCSI

Event ID 58053
Backup Exec Alert: Storage Error
(Server: “WhatsMySRVRname”) The device state has been set to offline because the device attached to the computer is not responding.

Ensure that the drive hardware is turned on and is properly cabled. After you correct the problem, right-click the device, and then click Offline to clear the check mark and bring the device online.

[09968] 03/05/16 01:42:08.426 DeviceIo: 03:07:00:00 – Refresh handle on “\\.\Tape0”, SCSI cmd 17, new handle ffffffff, error 32
[09968] 03/05/16 01:42:08.426 DeviceIo: 03:07:00:00 – Refresh handle on “\\.\Tape0”, SCSI cmd 1a, new handle ffffffff, error 32
[09968] 03/05/16 01:22:07.867 PvlSession::DismountMedia( 0, 0, 0 )
Job = {JOBHEXNUMBERZZZZZZ} “ServerBackup-Full”
Changer    = {CHANGERZZZZ} “Robotic library 0001”
Drive      = {MYBACKUPDRVXZZZZZ} “Tape drive 0001”
Slot       = 13
Media      = {MEDIAZIDZZZZ} “BARCODEID”
ERROR = 0xE0008114 (E_PVL_CHANGER_NOT_AVAILABLE)

[19812] 03/05/16 01:42:12.613 DeviceIo: 03:07:00:00 – Refresh handle on “\\.\Tape0”, SCSI cmd 1a, new handle ffffffff, error 32
[19812] 03/05/16 01:42:13.129 DeviceIo: 03:07:00:00 – Refresh handle on “\\.\Tape0”, SCSI cmd 00, new handle ffffffff, error 32
[19812] 03/05/16 01:42:13.645 PvlDrive::DisableAccess() – ReserveDevice failed, offline device
Drive = 1007 “Tape drive 0001”
ERROR = 0x0000001F (ERROR_GEN_FAILURE)

[19812] 03/05/16 01:42:13.645 PvlDrive::UpdateOnlineState()
Drive = 1007 “Tape drive 0001”
ERROR = The device is offline!

[19812] 03/05/16 01:42:12.613 DeviceIo: 03:07:00:00 – Refresh handle on “\\.\Tape0”, SCSI cmd 1a, new handle ffffffff, error 32
[19812] 03/05/16 01:42:13.129 DeviceIo: 03:07:00:00 – Refresh handle on “\\.\Tape0”, SCSI cmd 00, new handle ffffffff, error 32
[19812] 03/05/16 01:42:13.645 PvlDrive::DisableAccess() – ReserveDevice failed, offline device
Drive = 1007 “Tape drive 0001”
ERROR = 0x0000001F (ERROR_GEN_FAILURE)

[19812] 03/05/16 01:42:13.645 PvlDrive::UpdateOnlineState()
Drive = 1007 “Tape drive 0001”
ERROR = The device is offline!

Event ID 1000
Faulting application name: wmiprvse.exe, version: 6.3.9600.17415, time stamp: 0x54505614
Faulting module name: MSVCR110.dll, version: 11.0.51106.1, time stamp: 0x5098826e

 

Research:

I spent a ton of time researching this… Old support threads were pointing me in all different directions, most of the threads being old, mentioning drivers, etc… Initially I thought it was hardware related, until through testing I got the gut feeling it was software related. There was absolutely no articles covering Backup Exec 2014 running on Windows Server 2012 R2 with this specific issue.

Tried a bunch of stuff, including swapping the P800 controller, for another HP P212. While it didn’t fix the issue, I gained some backup speed! 🙂

Updating the HP software (agents, providers, HP SMH, WBEM) had no effect.

Disabling the HP providers, and disabling the HP Monitoring, Insight, Management services had no effect whatsoever. Tried different firmware versions, also tried different drivers for the Library and Tape drive, no effect. Tried factory resets, no effect. Tried Library and Tape tools, all tests passed.

Disabled other monitoring software we have in place to monitor software/hardware on clients servers, no effect.

 

Resolution:

-Uninstalled the HP WBEM Providers and Agents.

-Added a “BusyRetryCount” 32-bit DWORD value of 250 (decimal) to the “Storport” key under “Device Parameters” in all the Tape Library and Tape Drive Registry entries. Example:

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Enum\SCSI\Changer&Ven_HP&Prod_MSL_G3_Series\5&334e8424&0&000500\Device Parameters\Storport]
“BusyRetryCount”=dword:000000fa

This needs to be added to ONLY and ALL the tape device entries (under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Enum\SCSI\) for the Tape Library and Tape Drives. You probably will have to create “Storport” key under the devices “Device Parameters” key.

After doing this, the backups run consecutively with absolutely no issues. The event log is CLEAN, and Adamm.log is clean, and the “Faulting application name: wmiprvse.exe” errors in the event log no longer occur.

Fixed!

 

Additional Notes:

-Both “fixes” were applied at the same time. I believe the WBEM providers/agent caused the Event ID 1000 errors on WMIPRVSE.exe. While the registry keys alone may have possibly resolved the backup issues, I believe there still would have been an underlying issue with WMIPRVSE.exe faulting that could have other consequences.

-I do not believe the original installation of the HP WBEM providers caused the issue, I have a feeling a subsequent Windows Update, Backup Exec update, other module update, or an update to the HP software may have caused the issue to occur at a later time than original install. I do remember we didn’t have an issue with the backups for months, until one day it started occurring.

-I will be re-installing the HP providers and agents at a later time. I will be uninstalling all of them, and re-installing from scratch the latest versions. I will post an update with my results.

-There is a chance the registry key is needed for the HP software to co-exist with Backup Exec backups for this configuration.

-There is a chance that the registry key isn’t needed if you never load the HP software.