
News from the staff team

MySQL read-only Saturday 3/19

As part of our work to transition from Percona to MariaDB for our MySQL server, we'll be migrating user data tonight around 9pm.


To do this, we'll put the existing Percona server into read-only mode, then make a final import to the new MariaDB host. We believe this will take about an hour and don't anticipate any issues (we've already tested imports from our regular backups without problems).

Read-only mode is necessary during the import to ensure we get a consistent backup, and so that writes made during the transition are not lost.
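
For the curious, the cutover has roughly the shape sketched below. This is only an illustrative outline, not our actual migration script: the host names and paths are made up, and the real import uses our normal backup tooling. The point is the sequence described above: freeze writes, take a consistent dump, load it into the new server.

    #!/usr/bin/env python3
    """Illustrative sketch of a read-only MySQL cutover.

    Host names and paths here are hypothetical; the real migration uses
    the OCF's own backup tooling. The sequence is what matters: freeze
    writes, take a consistent dump, load it on the new host.
    """
    import subprocess

    OLD_HOST = "old-mysql.example.com"   # hypothetical Percona host
    NEW_HOST = "new-mysql.example.com"   # hypothetical MariaDB host
    DUMP = "/tmp/final-migration.sql"

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # 1. Stop accepting writes on the old server, so the dump is consistent
    #    and nothing written mid-transition can be lost.
    run(["mysql", "-h", OLD_HOST, "-e", "SET GLOBAL read_only = 1;"])

    # 2. Take the final dump. --single-transaction gives a consistent
    #    snapshot of InnoDB tables without holding long table locks.
    with open(DUMP, "w") as out:
        subprocess.run(
            ["mysqldump", "-h", OLD_HOST, "--all-databases",
             "--single-transaction", "--routines", "--events"],
            stdout=out, check=True,
        )

    # 3. Load the dump into the new server, then point production at it.
    with open(DUMP) as dump:
        subprocess.run(["mysql", "-h", NEW_HOST], stdin=dump, check=True)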

Some sites may experience downtime while the server is in read-only mode (if they require writing to the database to show pages). Most sites will experience some level of degradation (e.g. can't log in to admin or edit posts).

Update 10:01pm: We have entered read-only mode.

Update 10:10pm: The backup is complete and is being imported into MariaDB now.

Update 10:14pm: ETA 35 minutes.

Update 10:35pm: The import was interrupted when the new server ran out of memory. We're increasing memory / reducing memory use by mysqld and starting the import again. Still in read-only mode.

Update 10:45pm: ETA 38 minutes.

Update 11:11pm: Import has finished, we're now swapping out Percona for MariaDB (which will involve about 2 minutes of downtime).

Update 11:16pm: We've noticed some issues with the import (views were not correctly copied) so we'll need to re-do the import. Still in read-only mode, expect another hour or two in this state. Sorry for the trouble!

Update 11:44pm: The view problem is fixed, so we're proceeding to move MariaDB into production. Expect about 2-3 minutes of downtime now.

Update 11:55pm: All work is completed and we are now on MariaDB. Total downtime was about 3 minutes, with read-only mode lasting about two hours.

MySQL and printing unavailable for 34 minutes (resolved)

MySQL and printing were unavailable today for about 34 minutes due to an unscheduled outage.

Why it was down
For background, all of the OCF's production infrastructure is supposed to live on two physical servers: jaws and pandemic. There's a third legacy physical server named hal which hosts some testing machines and our backups.

A problem removing a backup logical volume (believed to be a kernel bug) led to a deadlock, and many processes have been stuck in uninterruptible sleep since Tuesday. To try to fix the issue, a staff member gave 15 minutes' warning before restarting hal today (Thursday). Since hal isn't supposed to hold important services, restarting it should be totally safe, and 15 minutes is considered an acceptable warning period, since normally the only people who would even notice are other staff.

Unfortunately, two servers, pollution (the print server) and maelstrom (the MySQL server), were on hal due to some temporary migrations. They should have been moved back to jaws about a week ago, but weren't.

When hal went down, it took these production services with it, killing MySQL (which also took down many websites, the OCF's website, Request Tracker, ...) and printing in the lab. The problem was noticed as soon as monitoring triggered, and when hal didn't come back up, the staff member phoned another staffer who was in the lab.

Due to a misconfiguration, hal entered maintenance mode, and the other staffer had to enter the root password and fix the filesystem configuration before hal would boot. As soon as hal booted, MySQL and printing were started and service was restored.

Timeline

  • 6:35pm "15 minutes until hal restart" email goes out from staffer at home
  • 6:50pm hal is restarted remotely
  • 6:52pm staffer realizes hal had production VMs and isn't coming back online; phones another staffer in the lab
  • 7:04pm staffer in lab fixes boot config, hal is restarted; remote staffer leaves home toward the OCF
  • 7:09pm hal is back on and services are available
  • 7:15pm original staffer arrives in lab to find everything already fixed

mirrors.ocf.berkeley.edu read-only for about 20 hours

We're moving mirrors.ocf.berkeley.edu, our free-and-open-source software mirror, from its current hardware (a recycled desktop with some extra hard drives) onto a new server (with server-grade hard drives, RAID, etc.).

To do this with the minimum amount of downtime, we're going to be copying the disk from our current mirror to the new server. To ensure consistency, we first need to make it read-only. We expect the copy to take about 8 hours, after which we'll make the replacement server the main mirror. At that point, the mirrored content will be about 8 hours out of date, but it will quickly catch back up once the sync cronjobs start running again.
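
The copy itself is essentially one big rsync once the sync cronjobs are stopped. The sketch below is only illustrative (the paths and host name are made up), but it's the general idea:

    #!/usr/bin/env python3
    """Illustrative sketch of the mirror copy.

    The paths and host name are hypothetical. The idea: with the sync
    cronjobs stopped (so the data is effectively read-only), copy the
    whole tree to the new server, then let its cronjobs catch it back up.
    """
    import subprocess

    SRC = "/opt/mirrors/ftp/"                            # hypothetical data root
    DEST = "new-mirrors.example.com:/opt/mirrors/ftp/"   # hypothetical new server

    # -a        preserve permissions, times, symlinks, etc.
    # -H        preserve hard links (common in distro mirrors to save space)
    # --delete  remove files on the destination that don't exist on the source
    subprocess.run(["rsync", "-aH", "--delete", SRC, DEST], check=True)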

Update 8:30pm 2/28: This is starting now.

Update 5:33pm 2/29: Maintenance is complete.

Downtime Saturday 2/20 for NFS migration

On Friday, Feb 19 we'll be migrating our NFS and production servers onto newer hardware backed by SSDs and more RAM.

This will involve a couple hours of downtime (but hopefully not too many). We'll try to keep it quick.

It will take place in the evening (10pm or later).

Update 2/19: We're pushing this back until Saturday, Feb 20 evening.

Update 2/20 11:50pm: This is done after about 30 minutes in read-only mode and about two minutes of actual downtime. Some VMs will be moving over to jaws now to increase their performance, but the bulk of the work is done and you shouldn't notice any more issues.

MySQL downtime Feb 11 to fsck disks

MySQL will be offline for about 10 minutes tonight for emergency maintenance.

This is in response to unscheduled downtime about an hour ago due to a kernel deadlock which took down all MySQL services. We don't think it was caused by filesystem corruption, but because of recent corruption at the OCF that affected nearly all servers (caused by Debian bug #788062), we think it's worth checking.
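
For reference, the maintenance window has roughly the shape below. This is just an illustrative sketch: the device, mount point, and service name are made up, and the real steps are done by hand.

    #!/usr/bin/env python3
    """Illustrative outline of the fsck window: stop MySQL, unmount its
    data filesystem, check it, and bring everything back up.

    The device path, mount point, and service name are hypothetical.
    """
    import subprocess

    DEVICE = "/dev/mapper/vg0-mysql"   # hypothetical LV backing the data dir
    MOUNTPOINT = "/var/lib/mysql"      # hypothetical mount point

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    run(["systemctl", "stop", "mysql"])   # stop writes to the filesystem
    run(["umount", MOUNTPOINT])           # fsck should run on an unmounted fs

    # fsck exits nonzero if it had to repair anything, so don't treat a
    # nonzero status as fatal here; -f forces a check even if marked clean.
    subprocess.run(["fsck", "-f", DEVICE])

    run(["mount", DEVICE, MOUNTPOINT])
    run(["systemctl", "start", "mysql"])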

Upgrading user-facing servers to jessie

In the past year we've upgraded our entire infrastructure to Debian jessie, with the exception of user-facing machines.

The time to upgrade them is now. We've prepared upgraded versions of each of these servers and will swap them out early morning on Wednesday, Feb. 10th.

The servers that will be upgraded are:

  • tsunami, the public login server
  • biohazard, the app-hosting server
  • death, the web server


Most users won't notice the update, except that most software will be at newer versions. The one exception is users who have dynamically-linked binaries somewhere in their home directories.

Because many libraries will be upgraded, most of these programs will fail to run after the upgrade. The best solution is to recompile the binaries (or find newer, pre-compiled versions).

One specific case is environment managers like Python's virtualenv, Ruby's rbenv or rvm, and Node's nodeenv or nvm. These often put fully-compiled versions of the interpreter in your home directory, and in most cases these will stop working after the upgrade, so you'll need to rebuild them.
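
If you're not sure whether you're affected, a rough check is to look for dynamically-linked binaries in your home directory whose libraries no longer resolve. The script below is only a heuristic sketch, not an official OCF tool, but it gives the idea:

    #!/usr/bin/env python3
    """Rough heuristic: find dynamically-linked binaries under your home
    directory that can no longer resolve all of their shared libraries.

    This is an illustrative sketch, not an official OCF tool.
    """
    import os
    import subprocess

    HOME = os.path.expanduser("~")

    def is_elf(path):
        """True if the file starts with the ELF magic bytes."""
        try:
            with open(path, "rb") as f:
                return f.read(4) == b"\x7fELF"
        except OSError:
            return False

    for root, _dirs, files in os.walk(HOME):
        for name in files:
            path = os.path.join(root, name)
            if not is_elf(path):
                continue
            # ldd prints "not found" for any library that no longer resolves.
            result = subprocess.run(["ldd", path], stdout=subprocess.PIPE,
                                    stderr=subprocess.DEVNULL,
                                    universal_newlines=True)
            if "not found" in result.stdout:
                print("probably needs rebuilding:", path)

For virtualenvs in particular, the simplest fix is usually to delete and re-create the environment (and reinstall your packages into it) rather than trying to patch up the copied interpreter.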

For application hosting, you can find instructions on our website:
https://www.ocf.berkeley.edu/docs/services/webapps/

During the server swap, you should expect a small amount of downtime (about 5 minutes).

If you have any questions or need assistance feel free to reach out to help@ocf.berkeley.edu.

Update Feb 07: We're going to push this back until early morning Wednesday (originally it was Monday) to give us a little more time to ensure a smooth upgrade.

Update Feb 09: For biohazard (app hosting), we'll be reaching out to individual groups using the server to coordinate a smooth upgrade. biohazard will continue to be available (and unupgraded); we'll be moving groups one-by-one to the new server (named werewolves).

Unexpected downtime Feb. 5

There was around half an hour of unexpected downtime for all OCF web services from 12:15-12:45am on February 5 while a server had to be recovered from an accidental misconfiguration. If you notice any issues in the next few days, please report them to help@ocf.berkeley.edu.

Thanks,
OCF Staff

Campus internet access degraded (resolved)

The UC Berkeley campus is currently experiencing highly degraded internet (50+ ms latency and 30+% packet loss).


This is affecting access to all OCF servers from outside of campus. It is also affecting AirBears, ResComp, EECS machines, and other campus resources.

Update 4:12pm This appears to have resolved itself.