Posted on February 10, 2009 at 11:03 am

Dynatron-o-mite?

So, as some of you might know, I built a new server.  I haven’t been able to put it into production, mostly for lack of time.  Then, when I did have time, I started noticing strange drive issues: my SATA drives were going online and offline somewhat randomly.  This went on for about two months while I took care of my day job and my life.  Then it finally came time to figure out what was going on.  Once I was on site with the box, I noticed a VERY loud noise, accompanied by enough vibration that you could feel it quite readily in the rest of the rack.  I began to wonder what was wrong, and how the box had managed to keep running with whatever was going on in there.  I was wondering how many of the drives had bad bearings.

None.  The drives were fine.  As soon as I got close I could tell the noise was deeper inside…

When I pulled the cover, I was pretty soon sure it was the CPU fans: a pair of Dynatron A86Gs with 60x60x25mm dual ball bearing fans, neither of which sounded good…so…I unplugged them (DO NOT DO THIS AT HOME, FOLKS).  The noise and vibration stopped.  I put the cover back on and was able to boot into SXCE (Solaris Express Community Edition) without a hitch, well, except that ZFS was making note of errors.  So with my system up I went ahead and started a scrub.

Keep in mind this machine had been going for a couple of months.  All in all, when the scrub completed (without the filesystems ever being *down*, mind you) I had corruption in 21 files.  Twenty of them were simply in my OS/Net Nevada source tree mirror, easily replaceable; the other was/is in the volume (virtual hard drive) associated with a VM that I just use to run BOINC.  So I ran fsck on the system.  No errors that fsck was able to detect, and the image runs, so I’m willing to let it go.  There is possibly some corruption in an area that the OS does not use.

Now, one MIGHT classify this as a ZFS failure.  But remember, I had random drives coming and going for TWO MONTHS SOLID.  There were *thousands* of other errors that ZFS successfully corrected.  The machine had not been brought into any sort of production, but I was running constant benchmarks and lots of I/O on it.  There were quite a few million files in the pools that were affected, and hundreds of gigs of data.

More important than ANY of that is that ZFS stayed online.  The repair was NOT an offline process, and while it would have affected a production system that was under high usage, it would at least have allowed you to get into the filesystem, check on things, and recover whatever was business critical.  Also, ZFS *KNOWS* about not only metadata corruption (or potential metadata corruption, of which there was none to report in this instance) but about YOUR DATA.  And that’s what a filesystem should do.  Protect Your Data.

And if it fails at that, it should at least KNOW it failed, and tell you.  Better than that, you know what ZFS does?  It says “Hey buddy, this sucks, I failed.  But here’s what I’ve got anyway, I just know it’s wrong.”  Does it panic?  No.  Does it SIGSEGV?  No.  Does it Oops?  No.  It just lets you get back to business.
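
For anyone wondering what "knowing it failed" looks like in practice, here is a toy sketch of the idea in Python.  To be clear, this is NOT how ZFS is actually implemented: the MirroredBlock class, its read() method and the status strings are all made up for illustration, and SHA-256 is just a convenient stand-in for a block checksum.  It only shows the three outcomes described above: the data is fine, the data was bad on one side and got quietly fixed from the good copy, or every copy is bad and you get told so.

```python
import hashlib


def checksum(data: bytes) -> str:
    # Stand-in block checksum; chosen for convenience, not fidelity to ZFS.
    return hashlib.sha256(data).hexdigest()


class MirroredBlock:
    """Hypothetical toy: one block of data kept on two mirrored 'drives'."""

    def __init__(self, data: bytes):
        self.copies = [bytearray(data), bytearray(data)]
        # The expected checksum is kept apart from the copies themselves.
        self.expected = checksum(data)

    def read(self):
        """Return (data, status) where status is 'ok', 'repaired' or 'corrupt'."""
        good = None
        for copy in self.copies:
            if checksum(bytes(copy)) == self.expected:
                good = bytes(copy)
                break

        if good is None:
            # Every copy fails verification: hand back what we have,
            # but flag it as known-bad instead of falling over.
            return bytes(self.copies[0]), "corrupt"

        status = "ok"
        for i, copy in enumerate(self.copies):
            if checksum(bytes(copy)) != self.expected:
                # Overwrite the bad copy with the verified one.
                self.copies[i] = bytearray(good)
                status = "repaired"
        return good, status
```

Flip a byte in one copy and read() hands back the right data with "repaired"; flip a byte in both and you get "corrupt" along with whatever is left.  Nothing crashes either way, which is the whole point.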

And after having experienced now, back to back, two very, VERY different failure modes of two very different filesystems, I really don’t know why you would NOT use Solaris and ZFS, even if you have to put it up over NFS, in as many scenarios as is humanly possible.  It (ZFS) WILL SAVE YOUR BACON.  And what’s more, it’ll do it without missing a single beat.  It is what filesystems NEED to be in this day and age.

And before you say ‘but you had less files/data/whatever’ on the ZFS system: I had as much, or MORE.  I also had/have similar size drives and I/O subsystem performance.  But I never had to go offline to fsck on my ZFS system.

Back to the fans.  I ordered new fans (Thermaltake A1097: lower RPM, somewhat lower CFM, but they seem to be very happy); the original fans went to /dev/trash.  I left the system mostly offline, except when I needed it, while I waited for the new fans to arrive.