
2011

Intermittent downtime for maintenance

On December 16, we rewired parts of our network between 5pm and 7pm, resulting in intermittent unavailability for OCF services. We believe that the end of the semester would be a less intrusive time to make changes, especially since the lab is closed until spring semester.

OCF, like other student groups in Eshleman Hall, will be moved out of the building in August before it is demolished as part of Bears Initiative. We are aggressively consolidating our hardware (reducing the number of physical servers), since space for our hardware after the move will be limited, and working to prevent these changes from having an adverse effect on performance and reliability. We apologize for the inconvenience.

See also: Lab and printing services uncertain after Spring 2012

Lab and printing services uncertain after Spring 2012

[Embedded comic: Dilbert.com]

From: OCF announcements
To: All OCF members
Subject: OCF lab and printing ending after Spring 2012
Date: Wed, Dec 7, 2011 01:31p PST

Hello OCF members,

Last night OCF's volunteer staff unanimously decided to close lab services, including free printing, after Spring 2012.

OCF, like other student groups in Eshleman Hall, will be moved out of the building in August before it is demolished as part of Bears Initiative. Until new space is obtained elsewhere on campus, we have been allocated temporary space in the Hearst Gymnasium basement for the duration of Lower Sproul construction, which is projected to last for at least 3 years.

This area is limited both in space and technical resources, and we feel that it would be a disservice to all OCF members to provide an unreliable and inadequate level of service, in both quality and quantity. We would not be able to fit much of our computing and printing equipment into the space allotted to us, and time constraints limit what can be set up before the fall semester begins.

We understand and share your concern for the OCF. We too are students and members of the OCF. This is a drastic change for the organization, which has had a lab since 1989.

We want to restore the lab and its services as soon as possible, but we must also keep in mind the process of our last two moves (to Heller Lounge in the MLK Student Union, and to our present location in Eshleman). It would be dishonest if we were not to give notice of lab downtime, which has been unavoidable and excessive during each move. Furthermore, the compressed space and needs of other student groups constrain us more than in the past.

We hope that by shifting resources, we will be able to compensate by expanding our other services, including web hosting, disk space, email, and shell accounts, and to ensure that they are not disrupted by a move as they have been before. We are looking at our options for negotiating server space, and we will continue hosting websites, including those of student groups. Given our close proximity in the temporary space, we hope to better assist other students and student groups as a technical resource, and we encourage you to make use of what we can offer.

We will be considering all options, including other space if available, and appreciate any feedback or advice you can offer. We don't want to let you down...

Good luck on finals guys,
OCF staff members,
Eshleman basement

P.S. One of the printers is broken, so printing might be slower than we'd hope, but please, enjoy our lab while it lasts. We'll be open 9am-9pm during dead week and finals; see our website for more info.

You have received this announcement to update you about important changes to the OCF. The Open Computing Facility is an all-volunteer student organization providing free shell accounts, disk space, web hosting, email, and printing from a lab, lounge, and server room in Eshleman Hall.

Update Dec 8: The Daily Cal has published an article: Computing facility in Eshleman Hall to close after next semester

Update Dec 17: To clarify, the OCF is not going away, but staff felt that, given our constraints, it would not be possible to guarantee a public lab next fall, and that it would be misleading not to inform members. The post's title has been updated to reflect that lab and printing services are uncertain, since no decision is set in stone, and the constraints which led to our decision can (and hopefully will) change. Sorry about the confusion.

Two years ago the OCF completed its 15-month move from Heller Lounge in MLK to the larger space in Eshleman, with excessive lab and server downtime (caused in part by the discovery of asbestos, later a failed pump flooding the new lab, then a pipe burst dropping raw sewage, and then a prolonged network and power outage). The current move leaves us two weeks (Aug 1 to Aug 15) to move and bring the lab up in a much smaller space before the semester begins. That also includes getting rid of desktops, servers, and equipment that would not fit in the space, since Moore's Law makes storing aging computer hardware uneconomical.

We also hoped that by not committing ourselves to the lab immediately, we would leave ourselves time to ensure that non-lab services, like web hosting, would not be disrupted by the move. During the last move our web hosting, used by many student groups, was down for several months, and the complaints were not pretty. To this day we hear about OCF "unreliability" based on that downtime.

And lastly, we have every intent to restore our lab when it is possible and when we have the logistics of the space worked out. To quote an ex-staff member: "The group's choice to focus on back-end makes a lot of sense given the uncertainty with construction and space."

And yes, this makes us very unhappy.

Networking maintenance downtime

Downtime is scheduled between 8pm and midnight tonight.

Update 4:00am


We replaced some wiring and networking equipment between 11:30pm and 1:15am. Most services were only briefly affected. Web, mail, SSH, and dependent services were temporarily moved to another switch during the rewiring.

We thank a generous donor for two new gigabit switches.

Networking maintenance

There will be downtime on one of the days November 18-20 for networking maintenance; the actual downtime should be much less than a full day. The exact date will be confirmed as the day approaches.

Mail service downtime

Mail service, including SMTP, IMAP, and POP, was unavailable between 2:19am and 3:06am because of transient errors with LDAP (user database). Some mail sent or received during this time may have been incorrectly rejected.

Unexpected downtime, RAID failure

Taking precautionary measures after RAID failure on our main disk array, we have taken down all services that depend on NFS, including web hosting, mail service, and home directory file access.

We are working overnight so that service will be restored as soon as possible.

We apologize for the inconvenience.

Update Thurs Sep 22 05:45am


First, we want to apologize (again!) for the delay. We were hoping to restore service yesterday morning. Things didn't exactly work out...

While we don't want to blame the delay on excuses, we would also like to be straightforward about what is going on.

Last Sunday, our backup server began to fail, and by Tuesday the largest volume, which contained backups of home and web directories, appeared to be unrecoverable. Data is stored in a RAID volume (meaning it is resilient to a certain number of hard drive failures), so the simultaneous hard drive failures or corruption that would have been required suggest hardware problems with the server itself and not (or not only) the hard drives. To be honest, hard drive or RAID controller failure is not completely unexpected for an aging machine with aging hard drives (the server easily predates all of the current staff, so we don't know exactly how old it is). In 20/20 hindsight, we could have acquired and set up new hardware, but for a machine that hosts backup copies of data and is not directly accessible, the extra expense in time and money did not appear to be worthwhile.

On Tuesday at 10:30pm, a hard drive failed in the main disk array, which exports most data stored over NFS (home and web directories, and mail folders, but not mail inboxes). Because RAID adds redundancy, no data was lost, but redundancy was lost, meaning further failures would result in data loss. This is why we then took down all services that can access or edit this data and modified others (like printing) not to depend on it (this will, however, prevent you from seeing your print quota; you will need to ask a staff member).
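The trade-off above — a degraded array loses no data, but a further failure would — can be sketched with the XOR parity scheme used by RAID levels such as RAID 5 (a simplified illustration only; the actual configuration of our array isn't described here):

```python
from functools import reduce

def parity(blocks):
    """XOR equal-length data blocks together to form the parity block."""
    return bytes(reduce(lambda a, b: a ^ b, chunk) for chunk in zip(*blocks))

# Three "disks" of data plus one parity disk.
data = [b"home", b"mail", b"webs"]
p = parity(data)

# One disk fails: the array is degraded, but the missing block is
# recoverable as the XOR of everything that survives -- no data lost.
recovered = parity([data[0], data[2], p])
assert recovered == b"mail"

# A second failure before the rebuild completes would leave the
# surviving blocks unable to determine the missing data, which is
# why a degraded array is taken offline until redundancy is restored.
```

The same arithmetic explains the rebuild described in the later updates: adding a replacement drive and resyncing amounts to recomputing each missing block from the survivors.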

The disk array is in its third year of service, so hardware problems with the server itself are not expected, though not improbable either. However, even the most reliable hard drives, accessed constantly 24/7, can fail. We did not have any spare hard drives of 1 TB or greater capacity outside the disk array, meaning we could not immediately begin rebuilding the RAID volume (again, in hindsight, this was probably a mistake on our part), let alone any of the (more expensive) "enterprise-class" 1 TB hard drives that would be proper for (and are currently used in) the disk array.

On Wednesday morning, we bought a temporary "desktop-class" hard drive. When we mounted the new hard drive in the disk array, it was detected but unrecognized and marked as "bad" on reboot. We tried unsuccessfully to work around the problem. Other hard drives (of smaller capacity; they cannot be used to rebuild the RAID volume) were recognized and usable for other purposes without errors. It seems highly unlikely that a brand-new hard drive would be bad, and we could not find any sign of errors when testing and running diagnostics on the hard drive in other machines, so the disk array is suspect; but since it works with other hard drives, it is not clearly at fault either. (There appear to be firmware restrictions on intermixing "desktop-class" and "enterprise-class" drives.)

On Wednesday evening, as another precautionary measure, we planned out a procedure to replicate the data on other machines so that if another hard drive or the disk array were to fail, we would not have data loss or corruption. To prioritize, we are copying data in alphabetical order from enabled (i.e., not disabled) accounts to another hard drive on the disk array. We will remove this hard drive when done for safekeeping, and also copy the same data over our internal network to another server with a RAID 1 (mirror) setup.

We will try our best to restore service as soon as possible. We don't want to be misleading by suggesting a time earlier than what might actually happen, especially since we first need to make sure that existing data is safely backed up. Service downtime through the weekend is not acceptable, but it is possible, and depending on any obstacles encountered, the downtime could be longer or shorter.

Our Board of Directors (composed of interested OCF members, volunteer staff and "users" alike) currently meets weekly on Thursdays at 6:45pm in the OCF lab. Our next meeting is today, and if you have any advice or comments, related to the downtime or not, we encourage you to attend.

Update Sat Sep 24 06:30pm


The ASUC Auxiliary is closed on weekends, and as a result we won't be able to obtain the package of hard drive replacements that we ordered until Monday, at which point we will be able to rebuild the array and bring services back online. We may be able to mount the disk array read-only before then if the local copy is complete.

Update Sun Sep 25 12:00am


The local copying of non-disabled accounts that was started on Thursday morning is about 90% finished. We're expecting it to be finished by the morning.

Update Sun Sep 25 10:30am


The login and mail servers are now mounting home directories read-only. SSH/SFTP will give you read-only access to your files, IMAP/POP/mutt/webmail will give you read-only access to your mail.

Unfortunately, the ASUC Auxiliary is closed on weekends, and as a result we won't be able to obtain the package of hard drive replacements that we ordered until Monday, at which point we will be able to repair the array and bring all services back up.

Update Sun Sep 25 11:30am


The web server is now serving web pages read-only. This may break some sites that require writing to the home or web directory. For the time being, you can optionally use our error page to return an HTTP 503 Service Temporarily Unavailable response with an explanation.

Update Mon Sep 26 02:30pm


We obtained and added the new hard drives to the array at 9am this morning. If there are no errors, we expect the resync to be complete by 5pm. We will then mount home directories with full read and write access, as they were originally.


Update Mon Sep 26 05:15pm


Finally, all services are operational as before. This will hopefully be the last update...

Some incoming mail rejected

Between September 1 and September 15, some incoming mail was incorrectly rejected as spam. A blacklist incorrectly included all mail servers, raising the score of incoming messages, and causing some high-scoring messages to be rejected at the SMTP stage. See Debian bug #641227.

Since these messages were rejected before their contents could be accepted, they cannot be recovered.
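The failure mode can be sketched with a toy score-based filter (hypothetical list names, weights, and threshold; not our actual mail configuration): each matching check adds its weight to a message's score, and messages at or above the threshold are rejected before their contents are accepted.

```python
REJECT_THRESHOLD = 5.0  # hypothetical rejection threshold

def message_score(sender_host, checks):
    """Sum the weights of all checks that match the sending host.
    checks is a list of (predicate, weight) pairs."""
    return sum(weight for matches, weight in checks if matches(sender_host))

# A legitimate sender that already carries some points from
# other heuristics (e.g. a minor HELO/DNS mismatch).
base_checks = [
    (lambda host: True, 2.0),
]

# Healthy DNSBL: lists only known spam sources.
good_dnsbl = (lambda host: host == "spammer.example", 3.5)
assert message_score("legit.example", base_checks + [good_dnsbl]) < REJECT_THRESHOLD

# Broken DNSBL that "includes all mail servers": every message gains
# its weight, pushing borderline-legitimate mail over the threshold
# and causing incorrect rejections at the SMTP stage.
bad_dnsbl = (lambda host: True, 3.5)
assert message_score("legit.example", base_checks + [bad_dnsbl]) >= REJECT_THRESHOLD
```

Because the rejection happens during the SMTP transaction, the message body is never stored, which is why the affected mail cannot be recovered and the sender only sees a bounce.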

Email senders should have received a non-delivery report (bounce message) that stated:

Client host rejected: Mail appeared to be SPAM or forged. Ask your Mail/DNS-Administrator to correct HELO and DNS MX settins or to get removed from DNSBLs;


or

Client host rejected: temporarily blocked because of previous errors - retrying too fast.


We apologize for the inconvenience.

Unanticipated downtime

Due to unanticipated network issues on the NFS server, all servers that depend on NFS are down. We apologize for the inconvenience.

Update 1:40am: All services have been restored on all servers.
Update 11:20am: An unrelated issue was denying IMAP/POP/SMTP authentication, fixed.

Login server restarted

The primary login server (tsunami, ssh.OCF) was restarted at 4:00am PDT to apply a kernel security update as part of scheduled maintenance.

MySQL downtime

MySQL was unavailable between 9:08pm and 1:52am for unanticipated maintenance.