2006

Status Update

A staffer requested that I create a new post for every status update, since that makes it much easier for people using RSS readers to stay informed. So I've taken all the previous status updates and created a new post for each one, and I'll continue doing so in the future.

Our second attempt at repairing the file system failed. Therefore, per the course of action I mentioned in my last status update, we've decided to wipe the disk array clean and rebuild the file system from our primary backup. This should help minimize downtime and get the OCF back to peak performance ASAP.

Once the dust settles, we (the Site Managers) will probably be sending out a more formal email message describing the failure, our response to it, and how we plan on avoiding such failures in the future.

Oh, and I should note that we're currently queuing incoming mail.

Status Update

A new file system was created on the disk array around 5 AM this morning, and we've been transferring data back to the array over NFS and regular Ethernet (silly endian issues prevent us from connecting the array directly to our backup system over U320 SCSI). At the current rate, we should finish transferring the files sometime tomorrow. A staffer might be able to drop by the lab today and figure out a better wiring method to improve transfer speeds and get data onto the array faster.

Also, here's a lesson for all system administrators out there: DO NOT BUY DISK ARRAYS FROM SHADY VENDORS. Since our budget is relatively limited, we've always been pretty conservative with our purchases and primarily relied upon donations to keep us going (thanks to Sun Microsystems for our super fast servers!). Consequently, when we needed to expand the OCF's disk offerings, we could only justify the purchase of a 'budget' disk array. That was 2 years ago. Now our disk array is failing, the company we bought it from went out of business and was acquired by another company, and that company will only perform service via an RMA process that might take 15 days.

ARE THEY FREAKING INSANE? So, we're supposed to send them our 3U form-factor disk array with 12 drives via postal mail and be down for 15 more days? Uh, no thanks.

OK, that's the end of my status update and rant.

Status Update

The fsck didn't go so well. We're restoring our secondary backup of the disk array and going for another attempt at fixing the file system. This will be our final attempt at repairing the file system; we don't want to prolong our downtime since the process of restoring the backup to the disk array takes upwards of 10 hours. If we are unable to restore the file system, we'll wipe the disk array clean, create a new UFS file system, and rebuild user data from the tar archives we created on Friday and Saturday.

That is, our first attempt at repairing the file system failed. We're going to try again, but we're trying to balance our recovery efforts against minimizing downtime. If we can't repair the file system, we're just going to wipe the slate clean and pull data from an archive we made, which may be missing a very small fraction of user data (basically the data that was damaged during the initial hardware failure). Our worst case estimate is around 1% data loss; most users won't be affected, and for the users who are, most of the files we were unable to recover seem to be unimportant (browser cache files, temporary lock files, etc.).

So, just to be clear: we're trying our best to get 100% data recovery, but doing so while minimizing downtime is difficult. Our worst case scenario is bringing back the OCF with about 99% of the data intact and working with users to recover any important data from the 1% that may be lost.
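For anyone curious what the repair-or-rebuild step looks like in practice, here's a rough sketch of the kind of Solaris commands involved; the device paths and archive name are made up for illustration, not our actual configuration.

```shell
# Attempt to repair the UFS file system (with it unmounted),
# answering "yes" to fsck's repair prompts:
fsck -y /dev/rdsk/c1t0d0s0

# If repair fails: wipe the array by creating a fresh UFS file
# system, then restore user data from the tar archives.
newfs /dev/rdsk/c1t0d0s0
mount /dev/dsk/c1t0d0s0 /export/home
cd /export/home && tar xpf /backup/home-saturday.tar
```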

Status Update

We're almost done backing up all user data. While the backup was going on, we were able to assemble a simple 1 TB ZFS array using our newly acquired 500 GB Seagate SATA drives from Fry's. Once the backup finishes, we'll do a raw dump of the disk array (where user data was stored) to our ZFS array (which we just built yesterday night). This will provide a secondary backup, just in case things go wrong -- we want to be extra careful with user data. Once that completes, we'll perform an fsck of the disk array, and, if everything goes well, most or all user data should be safe and accessible, and we'll start bringing OCF services back up. In other words, if everything does go well, some OCF services should be back up by the end of the weekend.
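The "assemble a ZFS array, then raw dump" step above boils down to two commands. This is a hypothetical sketch; the pool name and device paths are invented for illustration, not our actual setup.

```shell
# Build a simple striped ZFS pool across the new 500 GB SATA drives
# (a stripe maximizes capacity; it's a scratch backup, not long-term storage):
zpool create backup c2t0d0 c2t1d0

# Bit-for-bit dump of the failing array's block device into a file on
# the new pool, so we have an untouched copy before attempting repairs.
# conv=noerror,sync keeps dd going past unreadable blocks.
dd if=/dev/rdsk/c1t0d0s0 of=/backup/array.img bs=1024k conv=noerror,sync
```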

Now, if things go wrong and the disk array starts spitting out errors, we're going to attempt to recover data from our ZFS array (it's basically our backup's backup). If that would result in too much downtime, we'll dump our first backup onto the disk array (that is, the copy with about 99% of user data) and work on bringing the OCF back up as quickly as possible. That way, users will have access to their data as soon as possible, and we'll restore the remaining 1% or so from our secondary backup as we can, without too much time pressure.

Status Update

We're still backing up the rest of user data on our disk array to some spare space we have on our servers. Since we have upwards of 400 GB of data, and we're transferring most of it over NFS (regular Ethernet and not SCSI or Fibre Channel), it's taking a long time.
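For the curious, copying data to spare space on another server over NFS looks roughly like this. The host name and paths are hypothetical, and this assumes rsync is installed; it's a sketch of the approach, not our exact commands.

```shell
# Mount spare space from another server over NFS (Solaris syntax):
mount -F nfs backuphost:/export/spare /mnt/spare

# Copy user data across, preserving permissions and timestamps;
# rsync can also resume if the transfer is interrupted partway:
rsync -a /export/home/ /mnt/spare/home/
```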

Some users have asked about data loss during this recovery. Most mail daemons should be smart enough to retry delivery once service to the OCF is restored. If our downtime ends up becoming prolonged, we will try to figure out a way to queue mail so it doesn't end up getting bounced.

In regards to user data (i.e., anything other than mail), we're pulling the data off the disk array as quickly as possible. So far, it seems like most user data is intact; we're only seeing about 1-2% corruption. That's not to say that 1-2% of the data is lost; we're just pulling the good data off the disk array at the moment. We haven't even begun to run the Unix equivalent of Scandisk, so it seems like there's a good chance we'll have 100% data recovery. Keep your fingers crossed, though.

Beyond the fact that we're working with such a massive amount of data, one of the holdups in our recovery is acquiring an LSI Logic PCI-X SAS/SATA host controller that supports Sun Solaris on SPARC so we can set up a staging area to back up our disk array. If you don't understand what all those acronyms mean, let's just say you can't walk into any CompUSA or Best Buy and find that card. The only place that seems to carry the card is Newegg, but it's $300, and, even with overnight shipping, the earliest we'd get it is Monday.

Yury and I (the current Site Managers) have been taking long shifts in the OCF to get user data back, and most of the other staff have been around to provide assistance (thanks, sluo, for saving us when we don't know Solaris 10!). So, we're working on it!

Status Update

Some OCF staffers are making a trip to Fremont to pick up some 500 GB drives so we have more space to back up user data. They should be back in Berkeley by 5 PM today, and I'll be working through the night to set up the drives so that we can dump data to them.

Home directories have been successfully backed up, and we're currently going through web space. After web space, we'll have Microsoft Windows profiles and MySQL/PostgreSQL databases left to back up (I'm sure the other staffers will correct me if I'm missing something here).

Oh, and in regards to a user's comment, yes, mail in other directories should be restored (they're part of the home directories, which are almost done being dumped from the array).

Thanks for all your support!

(More) Unscheduled Downtime

The OCF is currently (mostly) down due to a hardware failure on one of our core servers (specifically, war appears to have lost its SCSI controller). We're working to restore service; check back here for updates.

Status Update

We're still working on the problem. It seems like the disk array that holds all user data is having some troubles. I'm currently backing up all user mail to a safe location, and we're working on doing the same with home directories (i.e., your regular data and web space).

Status Update

Mail is pretty much safe for the moment. We have two complete backups of mail stored on different systems. We're getting some errors while performing our initial backup of user data, but we're hoping that these errors are only temporary (i.e., they'll be solved when we fsck the file system).

Some users have requested an ETA, and, for the moment, it seems the earliest the OCF will be back up and in working order is this weekend. No guarantees, though. Although we know how important it is for our users to have access to their mail, data, and web services, we're trying to take our time and do everything right so no user data is lost. If there are any special circumstances or issues that you believe we should be aware of, please visit our IRC channel (#ocf on irc.ocf.berkeley.edu).