Server maintenance 5/24
We will be updating our physical servers on Sunday, May 24th around 9pm PDT. All OCF services will be affected, though we expect downtime to be less than 15 minutes.
We will be performing updates on jaws on Thursday, May 20th. jaws is a testing machine that hosts no public services, though staff VMs (and any services they provide) will be unavailable.
Around 1:15am Tuesday morning, we started experiencing high latency on our internal network. The high latency resulted in NFS reads/writes blocking for periods of several seconds, causing a backlog of processes on the web server and other servers. This resulted in timeouts when trying to access web pages, and eventually complete downtime when we took the servers offline.
We had four different volunteer staff in the lab troubleshooting the issue around 1:30am. It was difficult to pin down because the actual cause was intermittent, so downtime was slightly more than 30 minutes. (We tried various steps such as searching for network loops, removing different servers from the network, disconnecting from campus, etc.)
The ultimate cause was a broken backup script run by one of the student groups we host. From what we can tell, a daily backup script they had scheduled exceeded their disk quota, then continued thrashing the network trying to write blocks (which failed after exceeding the disk quota).
We're monitoring the network now to ensure everything continues to operate normally, and will work on methods for limiting individual accounts' ability to cripple the network. We'll also improve our ability to monitor the network (our existing tools weren't granular enough for us to see the problem without directly witnessing it in iotop).
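The failure mode described above (a backup that keeps writing after it has hit its disk quota) can be avoided on the script side by treating "out of space" errors as fatal instead of retrying. The following is only a minimal sketch of that idea, not the script that caused the outage and not a tool the OCF provides; the names backup_file, is_fatal_write_error, and OutOfSpace are hypothetical, and it assumes the file system reports a quota failure as an OSError with errno EDQUOT or ENOSPC.

```python
import errno

# Error codes that retrying cannot fix: device full or quota exceeded.
# EDQUOT is not defined on every platform, so add it only if present.
FATAL_ERRNOS = {errno.ENOSPC}
if hasattr(errno, "EDQUOT"):
    FATAL_ERRNOS.add(errno.EDQUOT)


class OutOfSpace(Exception):
    """Hypothetical exception: the destination has no room left."""


def is_fatal_write_error(exc):
    """Return True if exc is an OSError caused by a full disk or quota."""
    return isinstance(exc, OSError) and exc.errno in FATAL_ERRNOS


def backup_file(src, dst, chunk_size=1024 * 1024):
    """Copy src to dst in chunks, stopping at the first out-of-space error."""
    try:
        # Closing the files happens inside the try block, because on NFS a
        # quota error may only be reported when buffered data is flushed.
        with open(src, "rb") as fin, open(dst, "wb") as fout:
            while True:
                chunk = fin.read(chunk_size)
                if not chunk:
                    break
                fout.write(chunk)
    except OSError as exc:
        if is_fatal_write_error(exc):
            # Do not retry: every further write would fail the same way.
            raise OutOfSpace(f"giving up on {dst}: {exc.strerror}") from exc
        raise
```

A scheduled job built this way would call backup_file for each file and exit on the first OutOfSpace, so a full quota produces one failed run rather than a sustained stream of failing writes.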
On Sunday night (February 15), we will be performing one-time maintenance on the OCF file server. Total downtime should be no more than two hours (and probably much less).
IST will be performing maintenance on OCF's firewall on Tuesday 2/17 from 5:30am to 7am. OCF services may be unavailable during this window. Update: IST rescheduled the maintenance to 2/17.
All servers will be restarted the night of Tuesday, Jan 27 to apply security updates. Sorry for the inconvenience.
The login server (SSH) will be restarted Monday (Jan 19) night to apply security and performance updates. Total downtime should be less than 10 minutes.
Edit: Originally, the downtime was only intended for the login (SSH) server. We're expanding it to cover all servers so that recent security updates can be applied. Total downtime should be less than 20 minutes.