DBA diaries – ghost NICs and ILO

Monday morning, 9am, settling down to my first coffee of the day, I notice SQL connectivity alerts from a server in our risk cluster. My first thought is why the hell hasn’t the on-call person seen these alerts, as they would have been going to the pager. Anyway, I start investigating and see that I can ping the server but can’t connect to SQL. A terminal session hangs at login. I engage the Data Center ops team to investigate. 10 minutes later Sheldon from DCOPS tells me the server has hung and he can’t log in. Do I have a local admin account for him to use? I tell him no, so we then try to log in using the Emergency Repair Disk. 8 different ERD disks later and we still have no luck. Apparently this Mach 1 SKU is notorious for not working with ERDs. Mach 1 is like an HP DL360/380 server that has been specially customized just for us by HP and our hardware engineers.

Sheldon reckons the only option we have is to flatten and rebuild it! Hang on a minute – this server is a primary server. We had already forced a failover of the databases to its mirror, and we’re lucky our applications are clever enough to re-route the calls to the secondaries and mirrors, but I was in no mood to rebuild a server just because it seemed to have fallen off the network.

Unfortunately, this server does not have Integrated Lights Out (ILO) connected, but we do have the option of asking them to hook up a temporary ILO so we can troubleshoot from the console. The Data Center is in Seattle but my Ops team is in Shanghai, so ILO is wonderful. After an hour Sheldon manages to hook up the ILO and get me connected. He had to unrack the server to get it done. My first attempt to log in from ILO tells me there are no login servers to authenticate me. Damn – but this is expected. After all, the server has dropped off the network so we can’t connect to AD. I don’t have any cached credentials on this server but I know someone who might, because they patched these servers last week. I call up my trusted DBA apprentice and ask him to log in for me. Great! It works. I’m in.

 

Now, let’s see what the problem is. IPCONFIG tells me the server has auto-assigned IPs, which means it’s lost all its NIC settings. Strange. I re-enter the correct IP, DNS and WINS settings. On saving I’m told this IP address already exists and is bound to another NIC. I select OK to overwrite these settings. A quick ping from another server on the domain tells me the server is back up and SQL is running. Awesome. SQL mirroring synchs up and replication starts to catch up. QC shows no other issues. I need to do a root cause analysis (RCA) to determine how and why this server dropped off the network and lost its settings. Perhaps the DC guys replaced the NIC by mistake or replaced the NIC on the wrong server. I was also told of a ghost NIC issue where NIC settings disappear due to a BIOS issue.

Lessons learned. Don’t just give up and go for a server rebuild. Use ILO. Emergency Repair Disks (ERD) can be your friend but don’t always work. Keep a local admin account if your security policy allows.

SQL mirroring is quite resilient. It can survive a forced failover and then re-synch once the principal comes back. Make sure you have enough transaction log space for at least one day of downtime. Same goes for replication – make sure your distribution database can hold a couple of days’ data to save you from any re-initializations.
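
For what it's worth, the "one day of log" rule can be turned into a quick back-of-envelope sum. While mirroring is disconnected the log on the serving side can't be truncated past what hasn't been sent to its partner, so it just keeps growing. The sketch below is only that, a sketch: the function name, the 1.5x headroom factor and the 2 MB/s log rate are made up for illustration and are not numbers from our servers.

```python
# Rough sizing sketch: how much transaction log space an outage could need.
# The log rate and headroom used here are hypothetical illustration values.

def log_space_needed_gb(log_mb_per_sec: float,
                        outage_hours: float,
                        headroom: float = 1.5) -> float:
    """Estimate the log space (in GB) needed to ride out a mirroring outage.

    Assumes log is generated at a steady rate for the whole outage and
    multiplies by a safety factor (headroom) to cover bursts.
    """
    generated_mb = log_mb_per_sec * outage_hours * 3600  # MB written during the outage
    return generated_mb * headroom / 1024                # convert MB to GB


if __name__ == "__main__":
    # Hypothetical workload: 2 MB/s of log, partner unreachable for 24 hours.
    needed = log_space_needed_gb(2.0, 24)
    print(f"Plan for roughly {needed:.0f} GB of log space")
```

Swap in your own measured log generation rate (the log bytes flushed counters are a reasonable place to start) before trusting the answer.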

Webstore (our internal scale-out and HA middleware for SQL) is able to automatically route our API calls from a primary that is down to the secondary without any intervention or manual failover steps. This is what saved us today…

Posted in SQL Server

more swill oil

This is just gross…. I’m probably ingesting this cooking oil recycled from gutters every day, as all Chinese food is dripping in oil… swill oil… urgghh. Here’s a bloke collecting swill oil from a gutter for refinement.

A syndicate that made and sold cooking oil recycled from kitchen wastes has been busted by police in Chongqing in southwest China.
The syndicate refined swill oil in an underground factory and sold it as cooking oil to restaurants in Chongqing and the provinces of Sichuan, Yunnan, Henan, Hunan and Guizhou, the Beijing News reported today.
Its illegal production of recycled oil was enough for 2,600 households to consume in one year, the report said.
In the underground factory, which used to be a pig shed, rotten kitchen leftovers were stored in several cement pits. A large cauldron was used to boil food wastes, local police said.
Like cooking oil recycled from gutters, swill oil is also hard to detect in the marketplace because its lab test matches that of normal cooking oil, officials said.
Chongqing police said there is no law that bans swill oil as a toxic food product.
Investigation showed the syndicate enjoyed high profits in this business. It paid only a token fee to buy kitchen wastes from restaurants and, after processing, it could sell swill oil at 8,000 to 9,000 yuan (US$1,256-1,413) per ton.

Posted in China Food Safety

swill oil

ALL Shanghai restaurants must have oil-filtering machines installed in their kitchens by the end of the year, as the city attempts to eliminate the illegal “swill oil” trade.
Swill oil is produced from oil collected from drains and kitchens and resold for culinary use.
The filtering machines remove water and other residue from kitchen and meal waste. Water goes down the drain, while oil is collected by government-licensed companies to recycle for industrial use.
About 500 of the city’s 60,000 licensed eateries have completed installation and are using the filters on a trial basis, said the local food safety watchdog yesterday.
Licensed eateries cover everything from top-end restaurants through to work canteens and fast-food chains.
The cost of the machines for eateries taking part in the trial has been met by oil-recycling companies.
It has not yet been announced who will pay for the machines when the initiative is rolled out for all restaurants.
“The trial has proved successful and now we are turning it into a mandatory business rule,” said Yan Zuqiang, director with the Shanghai Food Safety Office.
He said medium and large establishments are required to be using oil-filtering systems by June, while smaller eateries should complete installation by the end of the year.
“And applications to open restaurants will not be passed unless this is fitted,” said Yan.
Currently, filtering systems are provided by oil-collecting companies. In return, restaurants give them collected oil.
Yan said the local government plans a scheme under which restaurants would receive fresh oil in exchange for waste oil.
“In future, restaurants will get fresh oil from recycling companies in return for the used oil. We are working out what a reasonable ratio would be,” Yan said.
“Using oil-filtering machines is the best solution we’ve found to eliminate sources of swill oil,” Yan added.
Without the devices, it is easy for kitchens to dump waste oil in gutters or sell it to underground dealers. Swill oil dealers ladle oil from drains and buy leftover supplies from restaurants.
“Restaurants that cheat and avoid processing their waste oil using the machine will be fined heavily,” Yan said.
Meanwhile, local food authorities also said they are creating an online traceability system covering rice and grain sold locally.
Consumers will be able to use this to trace production details of these foods.

http://www.shanghaidaily.com/article/?id=495194&type=Metro

 

Posted in China Food Safety

tainted bean sprouts

GROWERS of tainted bean sprouts in Shanghai’s Qingpu District have been detained, local authorities said yesterday.
Shanghai Food and Drug Administration said the bean sprouts found in unlicensed premises in the Xianghuaqiao residential community contained illegal additives.
Officials gave no further details of what kind of additives they were and it was not known whether they were toxic or added in excessive amounts.
All the contaminated bean sprouts have been destroyed and several suspects detained after local authorities acted on a tip-off from a resident.
Officials said that police were still investigating the case.
A thorough inspection is being launched into bean sprouts sold locally and efforts to crack down on illegal sales of bean sprouts and their production intensified.
The case is not the first one involving bean sprouts to have sparked a food safety scare in China.
Last year, nearly 2,000 kilograms of tainted bean sprouts were seized in Suzhou in Shanghai’s neighboring Jiangsu Province.
Those bean sprouts were said to have been soaked in illegal solutions to make them look fresh. Banned chemicals were also used by growers to whiten the bean sprouts and increase their appeal to buyers.
Also last year, six people in northeast China’s Liaoning Province were jailed for up to two years for producing and selling poisonous sprouts grown using a toxic fertilizer.
The six were found guilty of applying urea and enrofloxacin to bean sprouts to increase yields. Both chemicals are banned from use in agriculture in China.
Bean sprouts require no soil, only water and cool temperatures for growth, which makes them very easy to produce. A sprout emerges in two to seven days from the seed or bean, depending on the type.

http://www.shanghaidaily.com/article/?id=495325&type=Metro

Posted in China Food Safety

when you need 10,000 IOPS from direct attached storage

Our white lab coat engineers have provided a fixed range of optimised servers. In most cases we can find a server for our needs but just sometimes we need that extra capacity or IOPS.

Our priciest HP DL580 G7 server at $33,000 comes in a 4U chassis with 4 sockets (32 cores) and 128GB RAM. It has 6 x 600GB local storage + 2 D2700 external arrays with 49 x 600GB drives. Total rack space 8U!


This SKU peaks at 6,000 IOPS at 20ms latency. Not good enough for us! We need 10TB of storage and have a peak IOPS requirement of 10,000. This is how we do it:


Purchase an additional 2 drive arrays at a cost of $16,000 and fill them with 50 x 600GB SAS drives. Make allowance for the extra rack space and power and you’re away…
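
Until the real Iometer numbers are in, here is the spindle maths as a rough sanity check. It assumes IOPS scale linearly with the number of drives in the external arrays, which is a big simplification: it ignores controller limits, the RAID write penalty and the 6 local drives. The drive counts and IOPS figures come from the numbers above; the function names are just for illustration.

```python
# Back-of-envelope spindle maths, assuming IOPS scale linearly with the
# number of external-array drives (a simplification, not a measurement).
import math

MEASURED_IOPS = 6_000    # peak of the stock SKU at 20ms latency
MEASURED_DRIVES = 49     # drives in the 2 existing D2700 arrays
TARGET_IOPS = 10_000     # what we need at peak
ADDED_DRIVES = 50        # 2 more arrays, fully populated

# Implied contribution of a single drive under the linear assumption.
iops_per_drive = MEASURED_IOPS / MEASURED_DRIVES


def drives_needed(target_iops: float, per_drive: float) -> int:
    """Smallest whole number of drives that reaches the target IOPS."""
    return math.ceil(target_iops / per_drive)


def projected_iops(drives: int, per_drive: float) -> float:
    """Projected peak IOPS for a given drive count."""
    return drives * per_drive


if __name__ == "__main__":
    total_drives = MEASURED_DRIVES + ADDED_DRIVES
    print(f"IOPS per drive (implied): {iops_per_drive:.0f}")
    print(f"Drives needed for target: {drives_needed(TARGET_IOPS, iops_per_drive)}")
    print(f"Projected IOPS with {total_drives} drives: "
          f"{projected_iops(total_drives, iops_per_drive):.0f}")
```

If the linear assumption holds even roughly, doubling the spindle count leaves some headroom above the 10,000 target, which is the point of buying two arrays rather than one.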

Stand by for screenshots and Iometer graphs…

Posted in SQL Server

Zizhu Technology Park Shanghai

When I came over to Shanghai to be interviewed for my current role, the Microsoft offices were located in the swanky Grand Gateway Shopping tower in downtown Xujiahui.

I heard a rumour from a colleague that M$ was moving to a new campus on an industrial park south of the city but was assured by the hiring manager that this wouldn’t happen and my office would be in Grand Gateway.

Sure enough, 3 months later, when I finally made it out to Shanghai from London, Microsoft had moved to the new $100 million campus in southern Minhang district.

It’s not as trendy as working downtown and there is nowhere to eat but the cafeteria. Employees are bussed in daily from 30 locations around greater Shanghai. Unluckily for me, there was no shuttle bus from where I lived, so I had a nightmare journey of taxis, subways and shuttle bus to get to work, sometimes taking 2 hours. Needless to say, I hated my commute and considered leaving for that very reason.

But after a year and much nagging of the facilities team they eventually agreed to put a shuttle bus on my route. I suppose it helped that there were now some senior employees living in my compound that required transport.

Some more information about the Zizhu Technology Park – the location of Microsoft Campus…

MS has around 1200 people on this campus. The campus was built with expansion in mind and is actually only half complete. They will build the other half when they need the capacity I guess.

There is a canteen managed by Sodexho serving local and western tastes, as well as a minimart, gym, coffee shops etc.

Everything was brand new 2 years ago when I moved in but it’s starting to show its age already.

We had an incident where some heavy wall tiles fell from above the barista, seriously injuring her. All tiles were removed after that for safety.

There are games rooms on each floor with Xbox, pool and table tennis tables. Massage chairs are also available but seem to be used more for sleeping than actual massage. Sleeping at your desk or on the couches is considered OK in China. I guess it shows everybody how hard you work!

Some more tenants of Zizhu…

Intel. I have no idea what they do over there as I don’t know anyone working there but I guess it’s some R&D and software development.

My good friend, Didier, is the IT Manager at the SanDisk factory a couple of blocks away. He was kind enough to give me a lift to work during those dark days before the shuttle bus. This is the main SanDisk factory where they manufacture for the likes of Apple, HTC and co. They are currently building another factory right next door. My repeated requests for a factory floor tour have been met with refusal due to top secrets!

      

BorgWarner and Yamaha are here, as well as a very mysterious OMRON. It seems that nobody goes in or comes out of the OMRON plant. Weird. Sometimes I think that foreign companies have built these shell factories in China waiting for the time when they can bring them online. Perhaps they have been built as favours to the Chinese government to enter the Chinese market.

Coke has a huge plant and ExxonMobil have some very impressive looking offices that are empty.

 

Some Chinese firms are also here. Xinhua Control and a massive Solar Power factory whose entire building frontage is covered in solar panels.

 

Lastly, a couple more pictures of Wicresoft (?) across the road and the Digital Hub where Microsoft also has some staff.

    

Posted in Microsoft China

DBA Diaries – Losing the EMC SAN (again)

A quiet afternoon doing my ‘Mid Year Review Check-In’ is interrupted by SQL mirroring “disconnected” email alerts telling me that my mirrors are down for some reason.

Quick confirmation from my DBA team that they were not doing any maintenance and were also puzzled by the alerts. We then get hit by a flood of I/O errors:

Source:  MSSQLSERVER

Name:  The operating system has reported I/O error

Description:  The operating system returned error 21(The device is not ready.) to SQL Server during a read at offset 0x00000997bd0000 in file ‘K:\MSSQL\DATA\DR.MDF’. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

I realize from the server names that they all have their storage on one of our EMC SANs, so I immediately escalate to our SAN team for investigation.

In our primary data center we have 3 EMC CLARiiONs and 1 Hitachi USP V. The EMCs are mid-range and the Hitachi is enterprise class. Our primary (principal) servers are on the Hitachi and the mirrors are on the EMC. Lucky for us, today it was the EMC that went down.

SAN engineers brought the EMC SAN back online, but SQL Server had to be cycled on all the mirrors to bring the databases back. The mirroring sessions were in ‘suspended’ mode after this and required us to hit the ‘resume’ button on each mirror pair to re-establish the mirroring sessions.

On one large server we got the EMC LUN back but 2 of the databases were ‘suspect’. A couple of SQL restarts later and they were still suspect. The SQL error log confirmed corruption so we had to restore a 4TB database from backup…

suspect database

These old EMC SANs are out of warranty. Extending the warranties for a few months to tide us over till we migrate datacenters will cost $250,000+. We made the decision recently to not extend. Instead we have to get a confirmed Purchase Order before EMC will assist and it will be on a strictly time and materials support basis.

 

Lessons learned and validated today:

  • SANs are single points of failure – albeit highly redundant.
  • Be aware of what else you lose when you let your SAN warranty lapse (monitoring, log analysis, fast response time)
  • Mirroring rocks
  • VLDBs take a long time to restore
  • VLDBs take a long time to recover
  • Check your backups
  • Test your backups
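
To put a rough number on "VLDBs take a long time to restore", here is a small sketch. The 4TB size is from today's restore; the 200 MB/s throughput is purely a hypothetical figure for illustration, not something we measured, and the function name is made up. It also ignores the recovery phase that follows the data copy, which on a big database adds more time on top.

```python
# Rough restore-time estimate: database size divided by sustained throughput.
# The throughput used in the example is a hypothetical illustration value.

def restore_hours(db_size_tb: float, throughput_mb_per_sec: float) -> float:
    """Hours needed to copy a backup of the given size back to disk.

    Assumes a constant sustained throughput and counts only the data copy,
    not the recovery (redo/undo) that runs afterwards.
    """
    size_mb = db_size_tb * 1024 * 1024          # TB to MB
    return size_mb / throughput_mb_per_sec / 3600  # seconds to hours


if __name__ == "__main__":
    # 4TB database, hypothetical 200 MB/s sustained restore throughput.
    print(f"Estimated restore time: {restore_hours(4, 200):.1f} hours")
```

The only way to replace the guessed throughput with a real one is to time an actual test restore, which is one more reason to test your backups.
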
Posted in SQL Server