A quiet afternoon doing my ‘Mid Year Review Check-In’ is interrupted by SQL mirroring “disconnected” email alerts telling me that my mirrors are down for some reason.
A quick check with my DBA team confirms that they are not doing any maintenance and are just as puzzled by the alerts. We then get hit by a flood of I/O errors:
Name: The operating system has reported I/O error
Description: The operating system returned error 21(The device is not ready.) to SQL Server during a read at offset 0x00000997bd0000 in file ‘K:\MSSQL\DATA\DR.MDF’. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
I realize from the server names that they all have their storage on one of our EMC SANs, so I immediately escalate to our SAN team for investigation.
In our primary data center we have 3 EMC CLARiiONs and 1 Hitachi USP V. The EMCs are mid-range and the Hitachi is enterprise class. Our primary (principal) servers are on the Hitachi and the mirrors are on the EMC. Luckily for us, today it was the EMC that went down.
The SAN engineers brought the EMC SAN back online, but SQL Server had to be restarted on every mirror server to bring the databases back. The mirroring sessions were left in ‘suspended’ mode after this, and we had to hit the ‘resume’ button on each mirror pair to re-establish them.
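For reference, the ‘resume’ button has a T-SQL equivalent, which is handy when there are a lot of mirror pairs to work through. This is a minimal sketch, assuming it is run on the principal instance; `[YourDatabase]` is a placeholder name, not one of our databases:

```sql
-- List mirrored databases whose mirroring session is currently suspended.
SELECT DB_NAME(database_id) AS database_name,
       mirroring_role_desc,
       mirroring_state_desc
FROM sys.database_mirroring
WHERE mirroring_guid IS NOT NULL            -- only databases that are mirrored
  AND mirroring_state_desc = 'SUSPENDED';

-- Resume one suspended session; repeat for each database returned above.
-- [YourDatabase] is a hypothetical placeholder.
ALTER DATABASE [YourDatabase] SET PARTNER RESUME;
```

After resuming, the same query with the state filter removed should show the sessions moving through SYNCHRONIZING as the mirrors catch up.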
On one large server we got the EMC LUN back, but 2 of the databases were ‘suspect’. A couple of SQL restarts later they were still suspect. The SQL error log confirmed corruption, so we had to restore a 4TB database from backup…
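A quick way to see which databases came back in a bad state is to query `sys.databases`. The sketch below uses only the standard catalog view and `DBCC CHECKDB`; `YourDatabase` is a placeholder, and the consistency check only applies to a database that is online and accessible (so not a mirror copy, and not one that is still suspect):

```sql
-- Databases that failed recovery after the storage came back.
SELECT name, state_desc
FROM sys.databases
WHERE state_desc IN ('SUSPECT', 'RECOVERY_PENDING');

-- Full consistency check on a database that did come back online,
-- as the I/O error message recommends. YourDatabase is hypothetical.
DBCC CHECKDB (N'YourDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;
```

`NO_INFOMSGS` keeps the output down to actual problems, which matters when you are checking many databases in a hurry.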
These old EMC SANs are out of warranty. Extending the warranties for a few months to tide us over until we migrate data centers would cost $250,000+. We recently made the decision not to extend. Instead, we have to get a confirmed Purchase Order before EMC will assist, and it will be on a strictly time-and-materials support basis.
Lessons learned and validated today:
- SANs are single points of failure – albeit highly redundant.
- Be aware of what else you lose when you let your SAN warranty lapse (monitoring, log analysis, fast response time)
- Mirroring rocks
- VLDBs take a long time to restore
- VLDBs take a long time to recover
- Check your backups
- Test your backups
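The last two lessons are different things: checking confirms the backup file is readable, while testing means actually restoring it. A rough sketch of both follows. The paths, the database name and the logical file names are all made up for illustration; `RESTORE FILELISTONLY` will tell you the real logical names in a given backup.

```sql
-- Check: confirm the backup set is complete and readable.
-- This does not prove the data inside is free of corruption.
RESTORE VERIFYONLY
FROM DISK = N'X:\Backups\YourDatabase.bak';        -- hypothetical path

-- Test: restore under a different name on a scratch server or drive.
-- Logical file names below are assumptions; list the real ones with
-- RESTORE FILELISTONLY FROM DISK = N'X:\Backups\YourDatabase.bak';
RESTORE DATABASE [YourDatabase_RestoreTest]
FROM DISK = N'X:\Backups\YourDatabase.bak'
WITH MOVE N'YourDatabase'     TO N'X:\RestoreTest\YourDatabase_RestoreTest.mdf',
     MOVE N'YourDatabase_log' TO N'X:\RestoreTest\YourDatabase_RestoreTest.ldf',
     STATS = 5;                                    -- progress every 5 percent

-- Then check the restored copy for corruption.
DBCC CHECKDB (N'YourDatabase_RestoreTest') WITH NO_INFOMSGS, ALL_ERRORMSGS;
```

Timing the test restore also gives you a realistic number for how long a VLDB restore will take when you need it for real.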