News from the staff team

Mail service downtime

Mail service, including SMTP, IMAP, and POP, was unavailable between 2:19am and 3:06am because of transient errors with LDAP (user database). Some mail sent or received during this time may have been incorrectly rejected.

Unexpected downtime, RAID failure

As a precaution after a RAID failure on our main disk array, we have taken down all services that depend on NFS, including web hosting, mail service, and home directory file access.

We are working overnight to restore service as soon as possible.

We apologize for the inconvenience.

Update Thurs Sep 22 05:45am


First, we want to apologize (again!) for the delay. We were hoping to restore service yesterday morning. Things didn't exactly work out...

While we don't want to make excuses for the delay, we would like to be straightforward about what is going on.

Last Sunday, our backup server began to fail, and by Tuesday, the largest volume, which contained backups of home and web directories, appeared to be unrecoverable. Data is stored in a RAID volume (meaning it is resilient to a certain number of hard drive failures), so the simultaneous hard drive failures/corruption that would have been required suggest hardware problems with the server itself and not (or not only) the hard drives. To be honest, hard drive or RAID controller failure is not completely unexpected for an aging machine with aging hard drives (the server easily predates all of the current staff, so we don't know exactly how old it is). In hindsight, we could have acquired and set up new hardware, but for a machine that hosts backup copies of data and is not directly accessible, the extra expense in time and money did not appear to be worthwhile.

On Tuesday at 10:30pm, the main disk array, which exports most data stored in NFS (home and web directories, and mail folders but not mail inboxes), suffered a RAID failure. Because RAID adds redundancy, no data was lost, but the redundancy itself was lost, meaning any further failure could result in data loss. This is why we then took down all services that can access or edit this data and modified others (like printing) to not depend on it. This will, however, prevent you from seeing your print quota; you will need to ask a staff member.
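To illustrate why a degraded array loses no data but cannot survive another failure, here is a minimal sketch of parity-based redundancy. It assumes a single-parity layout (as in RAID 5); the RAID level of our array is not stated here, and the block contents and the `xor_blocks` helper are made up for the example.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data blocks, one per drive, plus one parity block.
data = [b"home", b"web!", b"mail"]
parity = xor_blocks(data)

# Simulate losing the second drive: its block can be rebuilt from the
# surviving data blocks and the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
```

With one block missing, the XOR of the survivors reproduces it exactly. With two blocks missing there is no longer enough information to rebuild either one, which is the situation we are trying to avoid.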

The disk array is in its third year of service, so hardware problems with the server itself are not really expected but not improbable either. However, even the most reliable hard drives, accessed constantly 24/7, can fail. We did not have any hard drives other than those in the disk array with capacities greater than or equal to 1 TB, meaning we could not immediately begin rebuilding the RAID volume (again, in hindsight, this was probably a mistake on our part), let alone any (more expensive) "enterprise-class" 1 TB hard drives as would be proper for (and which are currently used in) the disk array.

On Wednesday morning, we bought a temporary "desktop-class" hard drive. When we mounted the new hard drive in the disk array, it was detected but unrecognized and marked as "bad" on reboot. We tried unsuccessfully to work around the problem. Other hard drives (of smaller capacity; they cannot be used to rebuild the RAID volume) were recognized and usable for other purposes without errors. It seems highly unlikely that a brand new hard drive would be bad, and we could not find any sign of errors when testing and running diagnostics on the hard drive in other machines, so the disk array is suspect, but since it may work with other hard drives, it is not clearly at fault either. (There appear to be firmware restrictions on intermixing "desktop-class" and "enterprise-class" drives.)

On Wednesday evening, as another precautionary measure, we planned out a procedure to replicate the data on other machines so that if another hard drive or the disk array were to fail, we would not have data loss or corruption. To prioritize, we are copying data in alphabetical order from enabled (i.e., not disabled) accounts to another hard drive on the disk array. We will remove this hard drive when done for safekeeping, and also copy the same data over our internal network to another server with a RAID 1 (mirror) setup.
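The copy ordering described above can be sketched in a few lines. This is only an illustration: the account names, the dictionary layout, and the `copy_order` helper are hypothetical and do not reflect the actual scripts we are running.

```python
def copy_order(accounts):
    """Return the names of enabled accounts, sorted alphabetically.

    `accounts` maps an account name to True (enabled) or False (disabled).
    """
    return sorted(name for name, enabled in accounts.items() if enabled)

# Made-up example data: one disabled account and two enabled ones.
accounts = {"mallory": False, "bob": True, "alice": True}
order = copy_order(accounts)
```

Disabled accounts are skipped in this pass, so the data most likely to be needed is protected first.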

We will try our best to restore service as soon as possible. We don't want to mislead anyone by suggesting a time earlier than what might end up happening, especially since we first need to make sure that existing data is safely backed up. Service downtime through the weekend is not acceptable, but it is possible, and depending on any obstacles encountered, the length of downtime could be longer or shorter.

Our Board of Directors (composed of interested OCF members, volunteer staff and "users" alike) currently meets weekly on Thursday at 6:45pm in the OCF lab. Our next meeting is today, and if you have any advice or comments, related or not to the downtime, we encourage you to attend.

Update Sat Sep 24 06:30pm


The ASUC Auxiliary is closed on weekends, and as a result we won't be able to obtain the package of hard drive replacements that we ordered until Monday, at which point we will be able to rebuild the array and bring services back online. We may be able to mount the disk array read-only before then if the local copy is complete.

Update Sun Sep 25 12:00am


The local copying of non-disabled accounts that was started on Thursday morning is about 90% finished. We're expecting it to be finished by the morning.

Update Sun Sep 25 10:30am


The login and mail servers are now mounting home directories read-only. SSH/SFTP will give you read-only access to your files, IMAP/POP/mutt/webmail will give you read-only access to your mail.

Unfortunately, the ASUC Auxiliary is closed on weekends, and as a result we won't be able to obtain the package of hard drive replacements that we ordered until Monday, at which point we will be able to repair the array and bring all services back up.

Update Sun Sep 25 11:30am


The web server is now serving web pages read-only. This may break some sites that require writing to the home or web directory. For the time being, you can optionally use our error message to give an HTTP 503 Service temporarily unavailable error with an explanation.
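If your site is a dynamic application, one way to return a 503 with an explanation is sketched below. It assumes a Python WSGI application, which may not match how your site is hosted; the `maintenance_app` name, the message text, and the `Retry-After` value are all placeholders, not our actual error page.

```python
# Placeholder explanation shown to visitors while the site is read-only.
MAINTENANCE_BODY = (
    b"Service temporarily unavailable: this site is read-only "
    b"during storage maintenance. Please try again later."
)

def maintenance_app(environ, start_response):
    """Minimal WSGI app that answers every request with HTTP 503."""
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(MAINTENANCE_BODY))),
        # Assumed value: hint to clients to retry in an hour.
        ("Retry-After", "3600"),
    ]
    start_response("503 Service Unavailable", headers)
    return [MAINTENANCE_BODY]
```

Returning 503 rather than a normal page tells browsers and search engines that the outage is temporary.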

Update Mon Sep 26 02:30pm


We obtained the new hard drives and added them to the array at 9am today. If there are no errors, we expect the resync to be complete by 5pm. We will then mount home directories with full read and write access, in their original state.


Update Mon Sep 26 05:15pm


Finally, all services are operational as before. This will hopefully be the last update...

Some incoming mail rejected

Between September 1 and September 15, some incoming mail was incorrectly rejected as spam. A blacklist incorrectly included all mail servers, raising the score of incoming messages, and causing some high-scoring messages to be rejected at the SMTP stage. See Debian bug #641227.
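A rough sketch of how a faulty blacklist can push legitimate mail over the rejection threshold follows. The threshold, the penalty, and the function names are invented for illustration and are not the values or code used by our mail filter.

```python
REJECT_THRESHOLD = 5.0  # hypothetical score at which mail is rejected

def spam_score(base_score, blacklist_hits, penalty_per_hit=3.0):
    """Add a fixed (hypothetical) penalty for every blacklist that lists the sender."""
    return base_score + blacklist_hits * penalty_per_hit

def is_rejected(score):
    """Mail is rejected at the SMTP stage once its score reaches the threshold."""
    return score >= REJECT_THRESHOLD

# A borderline but legitimate message, scored with and without the
# faulty blacklist that listed every mail server.
normal = spam_score(2.5, blacklist_hits=0)
with_faulty_list = spam_score(2.5, blacklist_hits=1)
```

In this example the message scores 2.5 and is accepted under normal conditions, but the extra hit from the faulty blacklist raises it to 5.5 and it is rejected.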

Since these messages were rejected before their contents could be accepted, they cannot be recovered.

Email senders should have received a non-delivery report (bounce message) that stated:

Client host rejected: Mail appeared to be SPAM or forged. Ask your Mail/DNS-Administrator to correct HELO and DNS MX settings or to get removed from DNSBLs;


or

Client host rejected: temporarily blocked because of previous errors - retrying too fast.


We apologize for the inconvenience.

Unanticipated Downtime

Due to unanticipated network issues on the NFS server, all servers that depend on NFS are down. We apologize for the inconvenience.

Update 1:40am: All services have been restored on all servers.
Update 11:20am: An unrelated issue that was denying IMAP/POP/SMTP authentication has been fixed.

Login server restarted

The primary login server (tsunami, ssh.OCF) was restarted at 4:00am PDT to apply a kernel security update as part of scheduled maintenance.

MySQL downtime

MySQL was unavailable between 9:08pm and 1:52am for unanticipated maintenance.

Service downtime

User information hosted in LDAP was unavailable from 4pm to 9pm PDT. Web hosting, email, and SSH/SFTP services were intermittently interrupted. Mail service was also taken down from 10pm to 11pm PDT for maintenance. We apologize for the inconvenience.

Although no user data was affected, some email messages sent to OCF accounts during the service interruption may have been rejected and bounced to the sender.

Login server restarted

The login server (tsunami) was restarted at 10:14pm for unanticipated maintenance. Zabbix, an uptime monitoring tool, was not communicating with tsunami properly. The iptables firewall rules were flushed, causing network connectivity to be disrupted briefly. We apologize for the inconvenience.

Service downtime

There was intermittent downtime between 11am and 12:30pm PDT due to software updates.

Login server and wiki downtime

Secure shell login and access to the wiki were unavailable Monday morning and part of the afternoon. These services are hosted on virtual machines whose hypervisor (once again) had problems accessing its SCSI drives. The machine was reset, and is running normally again.

Additionally, the switch connecting the alternate login machine (pileup) to the OCF had been switched off. This has also been remedied.

As was the case last time, no user data is stored on these machines, so your data is not affected. We apologize for the inconvenience.