Nov 222012
 

Just something I wanted to share in case anyone else ran in to this issue…

At a specific client we have 2 X MSA60 units attached via Smart Array P800 controllers to 2 X DL360 G6 servers. These combo of server, controller, and storage units were purchased just after they were originally released from HP.

I’m writing about a specific condition in which after a drive fails in RAID 5, during rebuild, numerous (and I mean over 70,000) event log entries in the event viewer state: “Surface analysis has repaired an inconsistent stripe on logical drive 1 connected to array controller P800 located in server slot 2. This repair was conducted by updating the parity data to match the data drive contents.”

 

One one of these arrays, shortly after a successful rebuild while the event viewer was spitting these errors out, had another drive fail. At this point the RAID array went offline, and the entire RAID array and all it’s contents were unrecoverable. Keep in mind this occurred after the rebuild, while a surface scan was in progress. In this specific case we rebuilt the array, restored from backup and all was good. After mentioning this to HP support techs, they said it was safe to ignore these messages as they were fine and informational (I didn’t feel this was the case). After creating the new RAID array on this specific unit, we never saw these messages on that unit again.

On the other MSA60 unit however, we regularly received these messages (we always keep the firmware of the MSA60 unit, and the P800 controller up to date). Again numerous times asked HP support and they said we could safely ignore these. Recently, during a power outage, the P800 controller flagged it’s cache batteries as failed, at the same time a drive failed and we were yet again presented with these errors after the rebuild. After getting the drive replaced, I contacted HP again, and finally insisted that they investigate this issue regarding the event log errors. This specific time, new errors about parity were presenting themselves in the event viewer.

After being put on hold for some time, they came back and mentioned that these errors are probably caused because the RAID array was created with a very early firmware version. They recommended to delete the logical array, and re-create it with the latest firmware to avoid any data loss. I specifically asked if there was a chance that the array could fail due to these errors, and the fact it was created with an early firmware version, and they confirmed it. I went ahead, created backups, deleted the array and re-created it, restored the back and the errors are no longer present.

 

I just wanted to create this blog post, as I see numerous people are searching for the meaning of these errors, and wanted to shed some light and maybe help a few of you out, to help you avoid any future catastrophic problems!