Excessive HDD failures in TX300 S8 and TX200 S7

voralyan
Posts: 1
Joined: Fri Jan 12, 2018 11:34
Product(s): TX200 S7
TX300 S8
TX200 S4

Post by voralyan » Fri Jan 12, 2018 12:56

Hello,

I’m experiencing severe HDD problems on both a TX200 S7 and a TX300 S8.
The TX200 S7 has a RAID 10 config with (currently) eight Seagate ST9600205SS drives. One of these was exchanged for an ST600MM0208 about six weeks ago due to bad block failures. The server had eight ST9300605SS drives in its original config before a capacity upgrade. The SAS controller is an LSI controller supplied by Fujitsu (see the attached ServerView screenshot for the exact model).

[Attachment: 2018-01-11 17_36_57-ps-backup-01 - Remotedesktopverbindung.png (ServerView RAID screenshot)]


The TX300 S8 has a similar config, of course with different HDD models (I would have to look up the exact model number, if that matters… However, the HDD models were changed here as well.)
The complete stack of HDDs has been replaced at least three times over the last three and a half years (approximately). All HDDs developed block errors over relatively short periods of time (ranging from a few days to around 6 months), on both servers. And it keeps going… An extract of the ServerView RAID log is shown in the screenshot.
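
To put a rough number on this, here is a back-of-the-envelope sketch. It assumes every drive slot was replaced exactly three times in 3.5 years, which is only the lower bound given above; the function name is made up for illustration.

```python
def replacements_per_slot_year(full_swaps: float, years: float) -> float:
    """Average number of drive replacements per drive slot per year."""
    if years <= 0:
        raise ValueError("years must be positive")
    return full_swaps / years


# Figures from the post: whole stack swapped at least 3 times in about 3.5 years.
# Both values are approximate, so the result is only a lower-bound estimate.
rate = replacements_per_slot_year(full_swaps=3, years=3.5)
mean_life_years = 1 / rate  # upper bound on the average drive lifetime

print(f"at least {rate:.2f} replacements per slot per year")
print(f"average drive lifetime of at most {mean_life_years:.2f} years")
```

The result does not depend on the number of drives per server, since a full stack swap replaces every slot once.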

The cause of these errors seems to be an absolute mystery.

Things that have been checked and can be excluded are:

- Room temperature: usually between 18 and 25 °C. HDD SMART monitoring reads about 30 °C on average, with a minimum of about 30 °C and a maximum of 40 °C.

- The server room is a basement room which is neither overly dusty nor humid (about 60% relative humidity).

- The servers aren’t subjected to any unusual mechanical stress.

- Server load is rather low most of the time, with no excessive I/O load on the RAID.

I would also exclude hardware problems with the controller, cabling or the HDD backplane, as these have already been checked. I would also call it very unlikely that the exact same hardware problem shows up on two different servers with different hardware. In addition, there are other machines running in the room (a TX200 S4 and some generic PCs) which haven’t had any problems for years.
Searching the net and talking to other admins, I’ve not read or heard of similar issues. Sure, HDDs fail every now and then, but not in such numbers within a comparatively short timeframe…

I have two theories left which sound somewhat reasonable to me:

- First, I’ve heard that intense sound sources near a server can lead to an increased HDD failure rate. There are no intense sound sources in the conventional sense in the server room, but a big A0 plotter is operated about 1 meter from the servers. I could imagine infrasound waves propagating through the concrete floor of the room and through the table on which the servers sit (the servers are the tower versions) into the servers themselves.

- Second, there may be ground loops or excessive noise in the power wiring of the room. The building has two independent connections to the power grid. Each power socket in the server room is connected to a single phase of one or the other building grid connection (both are three-phase with 250V each in a star configuration, as usual here in Germany). Both circuits are individually earthed as far as I can tell…
The servers have dual power supplies. The first power supply of each server is connected directly to the first building power circuit. The second power supply is connected to a UPS, which, in turn, is connected to the second building circuit. Each server has its own UPS. The TX200 S7 uses an APC SmartUPS 1000, the TX300 S8 an APC SmartUPS 2200. Schematics of the setup can be found here:

http://www.voralyan.de/inf/forums/badhhds/index.html

Out of curiosity I've done some measurements on the power system, which are also shown on the site linked above. There is some interesting stuff going on, but nothing that jumps out as an obvious problem. It may be relevant that the other machines in the room, which show no problems, have a single power supply and aren't connected to the UPSes mentioned above. That is actually the only clear difference in external setup between the machines in the room.

I'm aware that both scenarios outlined above may be far-fetched, but they are the two remaining explanations I can think of.
As a side note: I've tested some of the hard drives reported as defective on another controller in another machine at another location. They are actually physically defective, showing lots of bad blocks. So I think read and write errors caused by noise or bad connections can be ruled out...
Does anyone have more ideas on what's going on here?

Or maybe one should install shielding against cosmic rays or call the exorcist... ;-)
Am I overlooking something obvious?

Greetings,
T. Wehmeier
