2006

Possible Downtime on Monday

We have been alerted to a failing memory module on our web, database, and printing server and have scheduled an on-site visit by a Sun Microsystems field engineer to replace it. Assuming everything goes according to plan, there will be a slight interruption in our web, database, and in-lab printing services around noon on Monday, November 6 while we repair the server. We anticipate a maximum downtime of two hours.

No other OCF services (e.g., mail and shell accounts) should be affected. If the repair takes less than two hours, I'll use the remaining time to upgrade the system software on our server. If the repair process extends beyond the allocated time, status updates will be posted to this blog.

makehttp works again

I just fixed makehttp. Users should be able to create their own web directories once again.

For those who didn't already know, every OCF account comes with its own web space. You can activate it by running 'makehttp' at a command prompt. Files you want accessible on the Internet go in your 'public_html' directory, and the URL to your web space will be www.ocf.berkeley.edu/~username.
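For example, a typical session on one of our shell servers might look something like this (the index.html file is just a placeholder):

    makehttp                       # sets up your personal web space
    cp index.html ~/public_html/   # anything in public_html is served at
                                   # www.ocf.berkeley.edu/~username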

Newegg's Automation Needs Work

On Sunday, we ordered another 500 GB SATA2 drive from Newegg, along with an LSI Logic SAS/SATA HBA, so we could build a temporary array while our current array was being serviced. The plan was to move all OCF data to this 'new' array temporarily while we sent our broken array back to the manufacturer. Our Newegg package finally arrived today.

Except there was one slight problem: Newegg sent us a PATA drive instead of a SATA drive. So, now we have three SATA drives and one PATA drive. You put the four things together, and you don't have a SATA array.

Newegg is closed for the weekend, so we're seeing if we can acquire a SATA drive elsewhere and start building our temporary disk array.

This past month has got to be an example of Murphy's Law in action.

Random Outages

Over the past day and a half, we have been experiencing random errors with our primary authentication system. These errors have made it difficult for some users to log in and have caused other problems in the physical lab (printer queue jams, frozen terminals, etc.). We're not quite sure what's causing the problems, but we have a pretty good idea it's related to our use of NIS+ (an old directory service developed by Sun Microsystems that has since been deprecated). Thanks to sluo, we were able to recover from these errors, so everything should be up and working now.
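(For background, NIS+ keeps account information in tables on a master server. On Solaris, a quick sanity check that the service is answering and serving the passwd table looks something like the sketch below; 'passwd.org_dir' is the standard table name.)

    nisping                 # check that the NIS+ directories are responding
    niscat passwd.org_dir   # dump the passwd table clients authenticate against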

As for our other services:

The mail queue is still being processed; a huge chunk of mail remains in the queue.

MySQL databases have been restored as well as we could restore them. Users whose data we've identified as problematic to recover will be contacted individually via email tomorrow evening (I'm consolidating a list of the errors we received so I can send it all in one pass).

PostgreSQL is currently being looked at and debugged.

We've found a way to get our disk array serviced, but it means sending back critical parts of the array. Since downtime is unacceptable, we're going to build a temporary disk array out of commodity parts and 'hot-swap' it in for our current array. I'm waiting for the parts in the mail, though...

Sorry about not updating this blog for the past couple of days, but I've been rather busy; I didn't get to leave the OCF until around 4 AM yesterday morning.

Mail and Database Status Update

We've begun delivering mail that was queued up during our array failure. We've also re-enabled IMAP and POP3 access, so users should be able to read their mail using their favorite mail client. It may take a couple of days for all mail to be delivered; hundreds of thousands of messages were queued up during the outage.
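For the curious, draining a backlogged queue on a sendmail-style MTA (an assumption on my part; the sketch below is illustrative rather than our exact setup) looks roughly like this:

    mailq          # list the messages still waiting in the queue
    sendmail -q    # ask the MTA to make a delivery pass over the queue

Repeated passes like this gradually empty the queue.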

MySQL databases have been restored as well as we can restore them. Once we re-import all the databases into our MySQL server, we'll bring MySQL back online. A small number of users have irrecoverable errors with their databases; we will be contacting each of the affected users individually to work with them on recovering their data.
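Re-importing a database from a dump is the easy part; a minimal sketch, assuming dumps produced by mysqldump (the database and file names here are hypothetical):

    # dump taken during recovery (hypothetical name)
    mysqldump username_db > username_db.sql
    # re-import into the rebuilt MySQL server
    mysql username_db < username_db.sql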

More Services Being Restored

We've re-enabled access to mail that was delivered before our array failure on Thursday. Users should be able to read and manage their old mail. Unfortunately, you'll need to log in to one of our shell servers to read it; we haven't brought POP or IMAP back up yet. At the same time, we're working on delivering mail that was queued during the downtime. Please have patience; we're working very hard on it.

Also, we're aware of the issues with databases and are trying to debug the problem.

I'm currently probing our disk image of the array to see if I can find any files that were lost during the restoration process. Since the image is about 1 TB, it's not going as quickly as I'd like, but it's still running.
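For reference, an image like this can be probed without risking the original data by mounting it read-only through a loopback device; a rough sketch, assuming a Linux host, an image containing a single mountable filesystem, and hypothetical paths throughout:

    mkdir -p /mnt/arrayimage
    mount -o ro,loop /backup/array.img /mnt/arrayimage
    # anything reported 'Only in /mnt/arrayimage' is a file
    # the restoration missed
    diff -rq /mnt/arrayimage /services/homes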

OCF Sorta Back Up

We've just finished restoring most OCF services. Home directories and web pages should work, and logins to all our general servers should also work. There might be some glitches here and there as we put the finishing touches on our restoration, though, so please bear with us.

Mail is still offline, but incoming mail is being queued. The main reason mail remains offline is that we need to run all the queued mail through our delivery system again, and we can only really do that once we're sure everything else works.

Status Update

Files are still being copied to the array, but at the current rate, we should definitely have all files back on our disk array by Tuesday morning. If all goes well, the OCF should be back up and running by Tuesday night or Wednesday morning. I hope.

Since we're already down, we've decided to migrate our mail service over to a much faster server. Hopefully that'll allow some good to come out of this entire mess...

Status Update

All user data has been restored to the disk array. I'm currently running an fsck on the array just to make sure it hasn't already corrupted the data. We'll be keeping very regular backups of user data until we can figure out what's wrong with the disk array or get it serviced, so there shouldn't be any more extended downtime like this (at least not caused by the disk array).
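(For those following along at home, a consistency check on an unmounted volume looks something like the sketch below; this assumes a Linux ext2/ext3 filesystem, and the device name is hypothetical.)

    umount /dev/md0    # the filesystem must not be mounted during the check
    fsck -f /dev/md0   # -f forces a full check even if the volume looks clean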

We're on track for the OCF coming back online sometime tonight or tomorrow morning. As an added safety precaution, I'm setting up a two-way RAID-1 mirror with a hot spare on our primary NFS server (basically, the computer that serves all user files) to make everything triply redundant.
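For the curious, here's roughly what that looks like; a sketch using Linux's mdadm (the actual volume manager on our NFS server may differ, and the device names are hypothetical):

    # two-way mirror across sda1 and sdb1, with sdc1 standing by as a hot spare
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        /dev/sda1 /dev/sdb1 --spare-devices=1 /dev/sdc1
    mkfs.ext3 /dev/md0   # then build a filesystem and export it over NFS

If either active disk fails, the spare is pulled in and the mirror rebuilds automatically.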

Thanks for all your support through this process!

Status Update

I'm about to head back to the OCF and swap the SCSI connectors on our disk array so we can continue with our fsck recovery efforts. In the meantime, the other staffers are working on bringing our mail server back online so we can queue incoming mail.