
2016

MySQL and printing unavailable for 34 minutes (resolved)

MySQL and printing were unavailable today for about 34 minutes due to an unscheduled outage.

Why it was down
For background, all of the OCF's production infrastructure is supposed to live on two physical servers: jaws and pandemic. There's a third legacy physical server named hal, which hosts some testing machines and our backups.

On Tuesday, a problem removing a backup logical volume led to a deadlock and left many processes in uninterruptible sleep (believed to be a kernel bug). To try to fix the issue, a staff member gave 15 minutes' warning before restarting hal today (Thursday). Since hal isn't supposed to hold important services, restarting it should be totally safe, and this is considered an acceptable warning period; normally the only people who will even notice are other staff.

Unfortunately, two servers, pollution (the print server) and maelstrom (the MySQL server), were on hal due to some temporary migrations. They should have been moved back to jaws about a week ago, but weren't.

When hal went down, it took these production services with it, killing MySQL (which also took down many websites, the OCF's website, Request Tracker, ...) and printing in the lab. The problem was noticed as soon as monitoring triggered, and when hal didn't come back up, the staff member phoned another staffer who was in the lab.

Due to a misconfiguration, hal entered maintenance mode, and the other staffer had to enter the root password and fix the filesystem configuration before hal would boot. As soon as hal booted, MySQL and printing were started and service was restored.

Timeline

  • 6:35pm "15 minutes until hal restart" email goes out from staffer at home
  • 6:50pm hal is restarted remotely
  • 6:52pm staffer realizes hal had production VMs and isn't coming back online; phones another staffer in the lab
  • 7:04pm staffer in lab fixes boot config, hal is restarted; remote staffer leaves home for the OCF
  • 7:09pm hal is back on and services are available
  • 7:15pm original staffer arrives in lab to find everything already fixed

mirrors.ocf.berkeley.edu read-only for about 20 hours

We're moving mirrors.ocf.berkeley.edu, our free-and-open-source software mirror, from its current hardware (a recycled desktop with some extra hard drives) onto a new server (with server-grade hard drives, RAID, etc.).

To do this with the minimum amount of downtime, we're going to be copying the disk from our current mirror to the new server. To ensure consistency, we need to first make it read-only. We expect the copy to take about 8 hours, after which point we'll make the replacement server the main mirror. At this point, mirrors will be about 8 hours old, but will quickly catch back up when the cronjobs start running.

Update 8:30pm 2/28: This is starting now.

Update 5:33pm 2/29: Maintenance is complete.

Downtime Saturday 2/20 for NFS migration

On Friday, Feb 19 we'll be migrating our NFS and production servers onto newer hardware backed by SSDs and more RAM.

This will involve a couple hours of downtime (but hopefully not too many). We'll try to keep it quick.

It will take place in the evening (10pm or later).

Update 2/19: We're pushing this back until Saturday, Feb 20 evening.

Update 2/20 11:50pm: This is done after about 30 minutes in read-only mode and about two minutes of actual downtime. Some VMs will be moving over to jaws now to increase their performance, but the bulk of the work is done and you shouldn't notice any more issues.

MySQL downtime Feb 11 to fsck disks

MySQL will be offline for about 10 minutes tonight for emergency maintenance.

This is in response to unscheduled downtime about an hour ago due to a kernel deadlock which took down all MySQL services. We don't think it was caused by filesystem corruption, but because of recent corruption at the OCF which affected nearly all servers (caused by Debian bug #788062), we think it's worth checking.

Upgrading user-facing servers to jessie

In the past year we've upgraded our entire infrastructure to Debian jessie, with the exception of user-facing machines.

The time to upgrade them is now. We've prepared upgraded versions of each of these servers and will swap them out early morning on Wednesday, Feb. 10th.

The servers that will be upgraded are:

  • tsunami, the public login server
  • biohazard, the app-hosting server
  • death, the web server


Most users won't notice the update, except that most software will be a newer version. The one exception is users who have dynamically-linked binaries somewhere in their home directories.

Because many libraries will be upgraded, most of these programs will fail to run after the upgrade. The best solution is to recompile the binaries (or find newer, pre-compiled versions).
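If you're not sure whether you have any affected files, a rough sketch like the following can list dynamically-linked binaries under a directory. This is just an illustration, not an official OCF tool: it assumes a Linux system where ldd is available, and the example path is hypothetical.

```shell
# List dynamically linked executables under a directory.
# Hypothetical helper for illustration; relies on ldd(1).
list_dynamic() {
    find "$1" -type f 2>/dev/null | while read -r f; do
        # ldd succeeds only on dynamically linked ELF objects;
        # static binaries and plain data files make it fail.
        if ldd "$f" >/dev/null 2>&1; then
            echo "$f"
        fi
    done
}

# Example: list_dynamic "$HOME/bin"
```

Anything this prints links against system libraries and may need to be recompiled after the upgrade.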

One specific case is with environment managers like Python's virtualenv, Ruby's rbenv or rvm, and Node's nodeenv or nvm. These often put fully-compiled versions of the interpreter in your home directory, and in most cases, this will fail to work. After the upgrade, you'll need to rebuild these.
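For example, a Python virtualenv can be rebuilt against the new interpreter roughly like this. This is a sketch, not OCF-specific instructions; the project path is a hypothetical stand-in (the demo uses a throwaway temporary directory instead).

```shell
# Sketch: rebuilding a Python virtualenv after a distribution upgrade.
set -e

# If the old environment still runs, save its package list first:
#   ~/myproject/venv/bin/pip freeze > ~/myproject/requirements.txt

# Throwaway directory standing in for a real project path like ~/myproject:
PROJECT="$(mktemp -d)"

# Remove any old environment and build a fresh one against the new
# system Python:
rm -rf "$PROJECT/venv"
python3 -m venv "$PROJECT/venv"

# Confirm the fresh interpreter works:
"$PROJECT/venv/bin/python" -c 'import sys; print(sys.version_info[0])'

# Then reinstall the saved packages into the new environment:
#   "$PROJECT/venv/bin/pip" install -r ~/myproject/requirements.txt
```

The same pattern (save the dependency list, delete, recreate, reinstall) applies to rbenv/rvm and nodeenv/nvm environments with their own tooling.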

For application hosting, you can find instructions on our website:
https://www.ocf.berkeley.edu/docs/services/webapps/

During the server swap, you should expect a small amount of downtime (about 5 minutes).

If you have any questions or need assistance feel free to reach out to help@ocf.berkeley.edu.

Update Feb 07: We're going to push this back until early morning Wednesday (originally it was Monday) to give us a little more time to ensure a smooth upgrade.

Update Feb 09: For biohazard (app hosting), we'll be reaching out to individual groups using the server to coordinate a smooth upgrade. biohazard will continue to be available (and unupgraded); we'll be moving groups one-by-one to the new server (named werewolves).

Unexpected downtime Feb. 5

There was around half an hour of unexpected downtime for all OCF web services from 12:15-12:45am on February 5 while a server was recovered from an accidental misconfiguration. If you notice any issues in the next few days, please report them to help@ocf.berkeley.edu.

Thanks,
OCF Staff

Campus internet access degraded (resolved)

The UC Berkeley campus is currently experiencing highly degraded internet (50+ ms latency and 30+% packet loss).


This is affecting access to all OCF servers from outside of campus. It is also affecting AirBears, ResComp, EECS machines, and other campus resources.

Update 4:12pm: This appears to have resolved itself.

Degraded internet access (resolved)

The OCF is currently experiencing degraded internet. We are investigating.

Update: This has been resolved since 4:34pm.

Scheduled downtime Thursday night

We plan to perform maintenance on Thursday, January 21st around 9pm. All services will be unavailable during this time. Total downtime should be less than 20 minutes.