Monday morning, 9am, settling down to my first coffee of the day, I notice SQL connectivity alerts from a server in our risk cluster. My first thought: why the hell hasn’t the on-call person seen these alerts, since they would have been going to the pager? Anyway, I start investigating and see that I can ping the server but can’t connect to SQL. A terminal session hangs at login. I engage the data center ops team to investigate. Ten minutes later Sheldon from DCOPS tells me the server has hung and he can’t log in. Do I have a local admin account for him to use? I tell him no, so we then try to log in using an Emergency Repair Disk (ERD). Eight different ERD disks later and we still have no luck. Apparently this Mach 1 SKU is notorious for not working with ERDs. Mach 1 is like an HP DL360/380 server that has been specially customized just for us by HP and our hardware engineers.
Sheldon reckons the only option we have is to flatten and rebuild it! Hang on a minute – this server is a primary. We had already forced a failover of databases to its mirror, and we’re lucky our applications are clever enough to re-route calls to the secondaries and mirrors, but I was in no mood to rebuild a server just because it seemed to have fallen off the network.
Unfortunately, this server does not have Integrated Lights-Out (iLO) connected, but we do have the option of asking the DC team to hook up a temporary iLO so we can troubleshoot from the console. The data center is in Seattle but my ops team is in Shanghai, so iLO is wonderful. After an hour Sheldon manages to hook up the iLO and get me connected; he had to unrack the server to do it. My first attempt to log in from iLO tells me there are no logon servers available to authenticate me. Damn – but this is expected; after all, the server has dropped off the network, so we can’t reach AD. I don’t have any cached credentials on this server, but I know someone who might, because he patched these servers last week. I call up my trusted DBA apprentice and ask him to log in for me. Great! It works. I’m in.
Now, let’s see what the problem is. IPCONFIG tells me the server has auto-assigned IPs, which means it has lost all its NIC settings. Strange. I re-enter the correct IP, DNS and WINS settings. On saving, I’m told this IP address already exists and is bound to another NIC. I select OK to overwrite those settings. A quick ping from another server on the domain tells me the server is back up and SQL is running. Awesome. SQL mirroring synchs up and replication starts to catch up. QC shows no other issues. We still need a root cause analysis (RCA) to determine how and why this server dropped off the network and lost its settings. Perhaps the DC guys replaced the NIC by mistake, or replaced the NIC on the wrong server. I was also told of a ghost NIC issue where NIC settings disappear due to a BIOS issue.
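That “IP already bound to another NIC” warning usually means the old adapter is still hanging around as a hidden (ghost) device holding the address. A quick way to expose and clean it up on Windows – commands sketched from memory, so verify on your own build first:

```cmd
:: Make Device Manager show non-present (ghost) devices,
:: then launch it from the same environment
set devmgr_show_nonpresent_devices=1
start devmgmt.msc

:: In Device Manager: View > Show hidden devices, expand
:: Network adapters, and uninstall the greyed-out ghost NIC
:: so the static IP can be re-bound cleanly to the live one.
```

This is worth doing before accepting the “overwrite” prompt, so you know whether you are fighting a ghost adapter or a genuinely duplicated address on the network.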
Lessons learned: don’t just give up and go for a server rebuild; use iLO first. Emergency Repair Disks (ERDs) can be your friend but don’t always work. Keep a local admin account if your security policy allows.
SQL mirroring is quite resilient. It can survive a forced failover and then re-synch once the principal comes back. Make sure you have enough transaction log space for at least one day of downtime. Same goes for replication – make sure your distribution database can hold a couple of days of data to save you from any re-initializations.
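The mirroring and replication steps above look roughly like this in T-SQL (the database name is a placeholder, and the retention value is just an illustration; tune it to your own log and disk headroom):

```sql
-- On the mirror, when the principal is unreachable:
-- force service, accepting possible data loss
ALTER DATABASE RiskDB SET PARTNER FORCE_SERVICE_ALLOW_DATA_LOSS;

-- Once the old principal comes back, the session is suspended;
-- resume it so the databases re-synchronize
ALTER DATABASE RiskDB SET PARTNER RESUME;

-- Widen distribution retention so replication can ride out
-- a couple of days of downtime without re-initialization
EXEC sp_changedistributiondb
    @database = N'distribution',
    @property = N'max_distretention',
    @value    = 72;  -- hours
```

Note that a forced failover can lose transactions that never reached the mirror, which is exactly why having enough transaction log and distribution retention matters.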
Webstore (our internal scale-out and HA middleware for SQL) is able to automatically route our API calls from a primary that is down to the secondary, without any intervention or manual failover steps. This is what saved us today…