Boston University

From: Glenn Bresnahan, IS&T/Director Scientific Computing and Visualization (glenn@bu.edu)
Subject: Shared Computing Cluster/MGHPCC Outage December 9, 2013
Date: November 20, 2013

Dear Colleague,

In order to address an exigent issue, there will be a full power outage at the MGHPCC on December 9. This outage affects the Shared Computing Cluster and ATLAS, as well as the on-campus Katana and LinGA clusters that use data stored on systems in Holyoke. The downtime is needed to repair a defective component in the main power path for the facility. If left unaddressed, this defect could result in extensive, unplanned downtime and other serious problems for the data center. Please see the note below from John Goodhue, the MGHPCC Executive Director, for more detail.

The MGHPCC will have a full power outage running from midnight on Sunday, December 8, through midnight on Monday, December 9. On the prior Wednesday, December 4, we will begin draining the batch queues on the Shared Computing Cluster (SCC) to prevent long-running jobs from starting. This process will continue until 10:00 PM Sunday, at which point all jobs will be stopped and we will shut down all the SCC computer systems, including all the login nodes. You may continue submitting jobs between Wednesday and Sunday, but any job that would not complete before the shutdown will remain pending until the computer systems are returned to normal operation. We anticipate having all systems returned to production status by 9:00 AM, December 10.

Although the Katana and LinGA clusters, which are located on campus, will not be powered off, user data that is stored on the SCC in Holyoke will not be accessible. On Katana, the batch system will be disabled for the duration of the outage.

Please email help@scc.bu.edu or me directly if you have any additional questions or concerns.


Message from John Goodhue, MGHPCC Executive Director

On December 9, the MGHPCC Data Center will be shutting down for 24 hours to replace a defective component in the main power feed. During that time we will also be performing annual maintenance of critical systems. Normally, a planned maintenance event such as this one would be scheduled with at least six months' notice. Unfortunately, we cannot let the defective component remain in place for that long in this case. The facility shutdown will start at midnight on Sunday, December 8, and normal facility operation will resume at midnight on December 9. There will be some time before and after the 24-hour period during which computer systems will be shut down and restarted.

We have been working with IT and networking groups at each of the MGHPCC universities on an hour-by-hour plan and timeline. If you have questions about the availability of a particular computer system, please speak with your regular point of contact for that system, or one of the people below: