2010

OCF Mail Down

OCF mail is down; please stay tuned for updates.


UPDATE (10 December): The server which hosts /var/mail had a drive failure (one of the pins on the boot drive was bent, which short-circuited something and caused a system shutdown). We've restored the server using its other drive and are rebuilding RAID. All mail services should now be functional.

UPDATE (11 December): The mailspool server is down again, and its remote management system seems to be malfunctioning. Mail is down until an OCF staffer living in Berkeley can investigate; if the server is irreparably damaged, we'll have to wait until finals are over to rebuild it. Sorry for the inconvenience.

UPDATE (12 December): Some of you have expressed concern that your data may have been lost. Sorry for any alarm — rest assured, your data is safe, and we do have recent backups of /var/mail.

As part of our transition to a new mail infrastructure, we recently migrated /var/mail from our central disk array to a new machine with a better NFS server. We didn't realize that one of its drives was damaged until the server shut down two days ago, at which point we swapped out the drive and turned the machine on again. The server shut down again yesterday, and S.M.A.R.T. is predicting another drive failure, so we fear the motherboard or drive controller may have been damaged by the faulty drive. The server runs and your data is intact, but rather than just rebuilding RAID again and risking a major meltdown during finals week, we've decided to leave everything off until after finals, when we'll have time to take care of this properly.

UPDATE (14 December): As of early morning today, mail is tentatively operational. Last night we transitioned /var/mail to an alternate hard disk in a temporary server. Maintenance work will be scheduled soon, especially over break. Special thanks to the ASUC for their cooperation.

Unexpected Downtime

The OCF's DNS server just underwent spontaneous massive existence failure and we are trying to get the system to boot. Further bulletins as events warrant.


In other news, there was a hardware failure of unknown origin on the new mail server and /var/mail was down for a few minutes while we recovered the RAID array. Sorry for the inconvenience.

Update (10:30PM): DNS is back up. Sorry for the inconvenience again.

Update (11:00PM): The login servers had the blues during the DNS outage. They have been rebooted, and are fully functional again.

Scheduled downtime this weekend

Hey OCFers,

In order to give OCF mail a much-needed reliability boost, we plan to incrementally replace our current mail infrastructure with a centralized, redundant new system. The current system is scattered and full of single points of failure — as an example, our SpamAssassin server didn't survive last week's power outage, and our mail servers were forced to reject all incoming mail until we were able to build a replacement spam filter on a different machine. Our hope is that this new system will be significantly less volatile and more maintainable.

To that end, we'll be taking OCF mail down this Saturday. We'll be migrating mail storage, outgoing SMTP, IMAP/POP, and webmail to the new mail server. Incoming mail will not be accepted while we're working on the servers, but we'll try to keep read-only access to your old mail online for most of the day. As always, you can follow our progress here and on IRC.

UPDATE (20 November, 12:15PM): It begins! We've taken incoming mail offline and made /var/mail read-only for the mailspool migration.

UPDATE (20 Nov, 2:30PM): We've finished copying data from the old mailstore to the new one and are in the process of switching NFS servers. IMAP/POP access to mail is now down.

UPDATE (20 Nov, 5:30PM): We're having a bit of trouble with the new NFS server; IMAP/POP is still down, and our login servers hung and had to be rebooted a short while ago. tsunami and conquest should be working now, though.

UPDATE (20 Nov, 6:45PM): The NFS server is now up and running, now that we've squashed a pesky NFSv4-related bug (thanks to sluo and dwc for their help!). /var/mail is no longer read-only. Once we finish migrating disk quota information, we'll bring the mail servers back online. Thanks for your patience.

UPDATE (21 Nov, 12:15AM): /var/mail has been fully migrated to the new NFS server, and we've re-enabled the old mail servers. Incoming mail and POP/IMAP still live on the old infrastructure, but now that the NFS server migration is complete we should be able to set up the new infrastructure with minimal downtime.

To be clear, we have not yet migrated SMTP, IMAP/POP, or webmail to the new mail server, but we plan to do so in the near future. Watch this space for updates.

ASUC Building Power Outage

Hi,
Tonight from 2am-6am Eshleman Hall will have a power outage. Stay tuned for updates.

Update 8:44: Everything except the OCF webserver/disk array/authentication servers has now been shut down.

Update 9:00pm Shutting down mysql

Update: 9:15pm Webserver shutdown

Update: 9:20pm Disk array and authentication servers shut down. The only things still up are infrastructure-related servers, which allow us to manage machines remotely. The UPS says 2:15 of runtime and the outage is 4hrs; let's hope 4hrs is a conservative estimate?

Update 6:36am Marginally restored our infrastructure. We are running tests to make sure everything that is up is working. Login servers will be up shortly.

Update 7:04am A few issues have come up while bringing things back; we are looking to get them resolved ASAP.

Update 7:31am We had some disk array issues, which seem to be resolved for the time being. All the Windows machines work, printing works, and we will boot up the login servers once we are sure permissions and such are working properly.

Update 7:56 Our DNS server is being stubborn, and seems to be the root cause of recent issues. MySQL should be back up.

Update 8:04am FSCK time. What a fun way to start the morning, fsck'ing broken filesystems.

Update 8:12am The webserver should be working again

Update 8:22am DNS is plodding along. Expect a delay between 9 and 10:30 since I have class then.

Update 8:41am Running FSCK on the DNS server, so it will likely be down for a while. Login servers should be up and running; docs and webmail should work too.

Update 8:44am Spoke too soon; disregard the previous post.

Update 10:56am Still working on getting DNS up.

Update 11:11am DNS should be up now

Update 6:40pm We're reaching hour 30 of this adventure, and most of our services have been restored. Mail is a work in progress, but your stored email should be fully accessible now. apocalypse.ocf.berkeley.edu doesn't seem to turn on, so we will keep that off for now (while we straighten out everything else).

Other failures

In a rather unlucky streak of timing, here are the other known failures in the OCF (today was not the best day).

2 printers (1 critically)
3 infrastructure related servers

Will update you as we get things fixed.

MySQL Down

MySQL went down for a short period of time yesterday. We hosted the service from backups in read-only mode for the night.

UPDATE (15:41):
ETA of 8-10hrs before we get mysql up and running again.

UPDATE (19:41):
MySQL and PostgreSQL are back in business. Props to jaws.ocf.berkeley.edu for performing admirably.

Expected downtime

Hi all,
Yesterday's reboot left behind some nasty problems in our mail volume, so we are going to take some downtime to fix it. Expect system-wide downtime from 7pm to 11pm. I apologize for whatever inconvenience this causes. In hindsight we should have handled this better, but we learn from our mistakes.

Sorry for the trouble.

Update 4:56pm
We may start working a bit early at around 5:30-6:00 so don't be surprised if /var/mail disappears during that time.

Update 8:40pm
Stuff should be working again for the most part, thanks for your patience

Disk array reboot

Hi all,
We rebooted our disk array, resulting in about 30 minutes of system-wide downtime. Just notifying everyone in case there are any stray stale NFS handles hanging around, or if any scripts broke. Don't panic if you couldn't log into our machines during this time; it was nothing serious.
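If you want to check whether one of your own directories was left with a stale NFS handle, something like the sketch below may help. It is only a rough illustration, not an official OCF tool: the function names and the list of paths are made up for this example, and it assumes a stale handle shows up as an `ESTALE` error when the path is accessed.

```python
import errno
import os


def is_stale_handle(path):
    """Return True if accessing `path` fails with ESTALE (stale NFS file handle)."""
    try:
        os.stat(path)
    except OSError as exc:
        # Other errors (missing file, permission denied) are not stale handles.
        return exc.errno == errno.ESTALE
    return False


def find_stale_paths(paths):
    """Return the subset of `paths` that currently report a stale handle."""
    return [p for p in paths if is_stale_handle(p)]


if __name__ == "__main__":
    # Hypothetical list of NFS-backed paths; adjust for your own account.
    candidates = [os.path.expanduser("~"), "/var/mail"]
    for path in find_stale_paths(candidates):
        print(f"stale NFS handle: {path}")
```

If a path does turn up as stale, changing out of the directory and back in, or logging out and in again, is usually enough to clear it.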

Webmail Restored

We've restored webmail functionality. Please let us know if there are any problems or hiccups.