Boston University

From: Glenn Bresnahan, IS&T/Director Scientific Computing and Visualization (glenn@bu.edu)
Subject: SCC Performance Issues
Date: August 14, 2013

Dear Colleague,

Over the last few weeks we have been experiencing a number of stability and performance problems on the Shared Computing Cluster. I am writing to inform you about these problems and the actions we are taking to address them, and to assure you that resolving them remains our highest priority. We sincerely apologize for any issues these problems have caused for you and your collaborators.

Nearly all of the problems that we have seen have been related to file system access. In an effort to get to the root cause, we instrumented the system and were able to identify two distinct problem scenarios.

In the first scenario, particular applications have swamped the file servers, causing a severe impact on file system performance for all users. In most of these cases we have developed mitigation plans either to work around the problem or to provide us with early warning signals so that we can intervene with corrective actions. We are also working with the individual researchers to prevent future occurrences of the problem. We will continue to monitor the system diligently for performance degradation and to investigate all such issues.

In the second scenario, one or more bugs in the Linux kernel have created instabilities in the file servers. The bug that we believe is primarily responsible has also been reported by others, but no patch is yet available. We are continuing to monitor the situation, collect and analyze trace data, and explore workarounds and other strategies for resolving the problem should a fix not be forthcoming. In no case have we lost any data on the file system. In most cases, the file system servers have detected an error, failed over, and rebooted to self-correct the problem. In these cases, the problem would have been seen as a pause, with the file server returning to service within approximately eight minutes. However, there have been a few cases in which manual intervention was required, resulting in the file system being unavailable for a more extended period. In either case, batch jobs generally did not terminate and continued to run once file access was restored.

In addition to tracing the root cause of the remaining problems, we are planning to add hardware to make the file system more resilient to failures, to increase the overall system capacity, and to reduce the impact of failures on interactive jobs. As soon as possible, we will acquire and deploy two additional redundant file servers, bringing the total number to five. This will allow the file system to withstand failures on two of the five servers. The additional file servers will also allow us to dedicate file servers specifically to interactive jobs, with the intent of reducing the impact on these jobs in the event of other failures.

Additionally, we are investigating a number of possibilities for improving the performance of interactive applications, particularly applications with graphical user interfaces (GUIs), whose performance problems have been affecting individual users. The options we are exploring range from changes to the base networking configuration to new application software that provides better-performing remote desktop capabilities. We hope to provide more information on these in the very near future.

We are also instituting a set of procedures to better communicate about any problems that the system may be experiencing. At the top of the SCC updates page (http://www.bu.edu/tech/about/research/computation/scc/updates/) is a colored indicator reflecting the current status of the system. Clicking on that indicator will provide additional details regarding any ongoing issues. In the future we will populate that page with additional real-time system performance measurements.

Again, we deeply apologize for any problems that you may be experiencing as we bring this new system into production. We sincerely appreciate your patience. We are confident that we will be able to resolve all of these problems and provide a stable and productive platform for your computational research.