IBM Blue Gene
Overview
On May 9, 2005, we took delivery of an IBM Blue Gene system, one of the first such systems installed in the world. The Blue Gene is based on an IBM Research project dedicated to exploring the frontiers in supercomputing, and a larger version of the same system is the current leader on the Top 500 Supercomputer sites list. Our own system was initially ranked at #59 on the Top 500 list and is currently #374. IBM's Blue Gene is part of a new family of supercomputers optimized for bandwidth, scalability and the ability to handle large amounts of data while consuming a fraction of the power and floor space required by previous systems.
The Boston University Blue Gene is a single rack system, containing 1024 compute nodes. Each compute node contains two 32-bit 700 MHz PowerPC 440 processors. Our Blue Gene has a peak performance of 5.7 Teraflops.
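The peak figure can be reproduced with simple arithmetic. The sketch below assumes that each PowerPC 440 core can complete 4 floating-point operations per clock cycle (two fused multiply-adds on its dual floating-point unit); that per-cycle figure is not stated on this page.

```python
# Back-of-the-envelope check of the quoted peak performance.
COMPUTE_NODES = 1024        # from the text
PROCESSORS_PER_NODE = 2     # from the text
CLOCK_HZ = 700e6            # 700 MHz, from the text
FLOPS_PER_CYCLE = 4         # assumption: 2 fused multiply-adds per cycle per core

peak_flops = COMPUTE_NODES * PROCESSORS_PER_NODE * CLOCK_HZ * FLOPS_PER_CYCLE
peak_tflops = peak_flops / 1e12
print(f"Peak performance: {peak_tflops:.1f} Teraflops")
```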
The Blue Gene is designed for codes that scale well on hundreds or even thousands of processors. The individual processors in the Blue Gene are actually slower than those on the pSeries machines or the Linux Cluster. Therefore, unless your code scales well and uses a lot of processors (generally at least 256), you should run on the pSeries machines or the Linux Cluster. As a very rough estimate, a node on the Blue Gene runs about half as fast as one on the pSeries machines or the Linux Cluster. In addition to the processor issue, some other restrictions apply to the Blue Gene as well. Please also note that general application software such as MATLAB and Mathematica is not available on the Blue Gene. If you are not sure which machine is right for you, email one of the SCV Scientific Programmers (Kadin Tseng, kadin@bu.edu, or Doug Sondak, sondak@bu.edu) to discuss it.
The Blue Gene went into general use on October 1, 2006. However, SCF users were not automatically given access to the machine, due to special federal export regulations that apply to this system. To get a Blue Gene account, all users need to fill out a web form to request access, and external users (those who are not current BU students, faculty, or staff) also need to submit identity documentation (such as a passport) to us. To apply for access, go here (accessing this page will require your SCF/Twister login and password) and then submit the Update Personal Information form. External users will also need to follow the identity information instructions on that web form.
The login nodes for the Blue Gene are the Linux machines levi.bu.edu and lee.bu.edu, and users must use ssh to log in. Passwords are shared across the Scientific Computing Facilities, so if you already have an account and password on our other systems and have access to the Blue Gene, you will have the same login and password on this system.
Help Information
This page gives a basic introduction to using the Blue Gene system. For more detailed information, follow the sidebar links, probably starting with the Programming page.
If you are experiencing system problems, please send email to help@twister.bu.edu.
For more information or help in using or porting applications to the Blue Gene system, please contact Doug Sondak (sondak@bu.edu) or Kadin Tseng (kadin@bu.edu).
If you have questions regarding your computer account or resource allocations, please send email to scfacct@bu.edu.
Allocations and Account Management
Our allocations policy is outlined in our SCF Users Information document. Requests for additional time can be made via the Scientific Computing Facilities Resource Requests pages.
Hardware Configuration
The Blue Gene rack contains 1024 compute nodes, 128 input/output nodes, and several internal networks used for inter-node communication. Both types of nodes consist of dual core 32-bit PPC440 processors (700 MHz) with 512 MB of main memory. Each node has a 32 KB L1 cache, 2 KB L2 cache, and a 4 MB L3 cache. See the IBM Journal of Research and Development issue on the Blue Gene for more system details.
The login nodes, levi.bu.edu and lee.bu.edu, are IBM eServer OpenPower 720s with 2-way 1.50 GHz 64-bit POWER5 processors. Each has a main memory of 4 GB and runs the SuSE Linux Enterprise Server 9 operating system. The configuration is similar to our other Linux systems.
File Systems
User home directories are the same as on the Linux cluster and all of the standard shared filesystems (e.g. /project, /project2, /projectnb1, and /projectnb2) are accessible. Note that there is no /scratch space available on the Blue Gene system.
If you need access to your twister (pSeries) home directory from your Blue Gene account, you can access it by prepending /ibm to your home directory name, for example, /ibm/usr2/faculty/your_login
Please also read the information on disk space in our SCF Users Information document.
Usage Policies
The login nodes are intended for program development, compilation, and for submitting jobs to be run on the Blue Gene. Do not run CPU intensive applications on these machines. Use one of our other systems for this purpose instead. See the Scientific Computing Facilities Technical Summary for more information.
Software
Since the Blue Gene is a highly specialized computing platform, few commercial or open-source packages are currently ported to it. For the most part, only standard Linux tools, compilers, and some math libraries are available. Packages that have been ported for use on the Blue Gene are listed here.
Programming Information
Due to the specialized nature of this machine, the process of compiling and running programs is more complex than on our other computing platforms. Instructions are below and we highly recommend that you read and follow them.
Note on Environment Variables
Your home directory is shared between the Linux Cluster and the Blue Gene login nodes. If you are using the default .cshrc and .login files your environment will automatically be set up properly for both systems. If you have problems finding compilers or running jobs make sure that you are not overriding the system settings of the PATH or LD_LIBRARY_PATH environment variables. These variables are set properly for each system by the global startup files. See here for help on adding your own directories to these variables.
Blue Gene Programming Restrictions
The Blue Gene is not a general purpose computer and thus there are many limits on what a Blue Gene application is allowed to do. Here are some:
- Must use MPI.
- No shared libraries.
- No threads.
- Most Unix system calls are not allowed, e.g. fork, exec, signal,...
- File IO is allowed but only a limited number of filesystems are available: (/project, /project2, /projectnb1, /projectnb2 or any of /usr1-/usr4)
- No /scratch or /tmp.
For more details on these programming restrictions please refer to the Blue Gene/L Application and Development Redbook (html|pdf)
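Because a job can only read and write files on the filesystems listed above, it can help to check paths before submitting. The following is a minimal sketch of such a check; the function name and the set name are hypothetical and are not part of any Blue Gene tool.

```python
import posixpath

# Top-level directories visible to Blue Gene applications, per the list above.
ACCESSIBLE_ROOTS = {
    "project", "project2", "projectnb1", "projectnb2",
    "usr1", "usr2", "usr3", "usr4",
}

def is_accessible_from_blue_gene(path):
    """Return True if an absolute path lies on one of the listed filesystems."""
    if not path.startswith("/"):
        return False  # relative paths cannot be checked reliably
    # Normalize so that components such as ".." are resolved first.
    parts = [p for p in posixpath.normpath(path).split("/") if p]
    return bool(parts) and parts[0] in ACCESSIBLE_ROOTS

print(is_accessible_from_blue_gene("/project/myproj/a.out"))
print(is_accessible_from_blue_gene("/scratch/myproj/a.out"))
```

The check is purely textual: it does not follow symbolic links or confirm that the file exists.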
Compiling for the Blue Gene
Compiling for the Blue Gene is relatively straightforward as long as you use the right compiler, include the right header files, and link with the right libraries.
We highly recommend that you refer to both the Blue Gene/L Application and Development Redbook (html|pdf) and our programming guide before compiling your programs for use on the Blue Gene.
Compilers:
The login nodes have two types of compilers: native compilers for building programs to run on the login nodes and cross compilers for building programs to run on the Blue Gene. The native compilers have standard names like "gcc" and "xlf". The Blue Gene cross compilers all have names beginning with "blrts_". Here is a list:
IBM compilers
- blrts_c89
- blrts_c99
- blrts_cc
- blrts_xlc
- blrts_xlC
- blrts_xlc++
- blrts_f77
- blrts_fort77
- blrts_f90
- blrts_f95
- blrts_xlf
- blrts_xlf90
- blrts_xlf95
GNU compilers
- blrts_g++
- blrts_gcc
- blrts_g77
Header Files
The MPI and various Blue Gene header files are in the directory /bgl/BlueLight/ppcfloor/bglsys/include/. You will therefore need to include -I/bgl/BlueLight/ppcfloor/bglsys/include with your compiler flags.
Libraries
Every Blue Gene program must be linked with at least four libraries, which are in the directory /bgl/BlueLight/ppcfloor/bglsys/lib/. The four required libraries are libmpich.rts.a, libdevices.rts.a, libmsglayer.rts.a, and librts.rts.a.
That is all you need for C and Fortran. The library libcxxmpich.rts.a is also required to link C++ code.
The compiler option -fno-underscoring is required by blrts_g77.
More detailed compiler and compiler options information is available here.
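The compiler, header, and library requirements above can be combined into a single command line. The sketch below only assembles and prints that command; it does not run a compiler. The helper name, the source and output file names, and the link order of the libraries are assumptions, so treat the sample makefile as the authoritative reference.

```python
BGLSYS = "/bgl/BlueLight/ppcfloor/bglsys"   # directory named in the text

# Libraries named in the text, written as linker flags (libNAME.a -> -lNAME).
# The order used here is an assumption.
REQUIRED_LIBS = ["-lmpich.rts", "-lmsglayer.rts", "-lrts.rts", "-ldevices.rts"]
CXX_COMPILERS = {"blrts_xlC", "blrts_xlc++", "blrts_g++"}

def build_command(compiler, source, output):
    """Return the argument list for a one-step compile and link."""
    cmd = [compiler, f"-I{BGLSYS}/include"]
    if compiler == "blrts_g77":
        cmd.append("-fno-underscoring")      # required by blrts_g77
    cmd += [source, "-o", output, f"-L{BGLSYS}/lib"]
    if compiler in CXX_COMPILERS:
        cmd.append("-lcxxmpich.rts")         # extra library for C++ code
    cmd += REQUIRED_LIBS
    return cmd

print(" ".join(build_command("blrts_xlc", "hello.c", "hello")))
```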
Makefile
A sample makefile which incorporates all of the above information is available.
Math Libraries, etc
There is limited math library support and it is detailed here.
Running Jobs on the Blue Gene
Preparing your job to run on the Blue Gene
The mpirun command is used to execute a job on the Blue Gene. Because we do not allow interactive use of the mpirun command on the login nodes (except by special arrangement), one must submit it as a job to the LoadLeveler batch system. Preparing your job for submission requires you to create a job command file (jcf) that contains the correct arguments to the mpirun command. This jcf file is then used by LoadLeveler to correctly dispatch your job to run on the Blue Gene.
A sample jcf file is available.
To run a Blue Gene executable via the mpirun command, you should compile an executable and place it on a file system that can be accessed by the Blue Gene (/project, /project2, /projectnb1, /projectnb2, or any of /usr1-/usr4). You should then modify the sample jcf file, particularly by changing the "@ arguments" line to reference your executable and set appropriate command line arguments to mpirun.
The main command line arguments to mpirun are:
| Argument Usage | Mandatory? | Description |
|---|---|---|
| -np N | Mandatory | N = the number of MPI tasks: 1-1024 (or 1-2048 in VN mode); see below |
| -cwd start_dir | Mandatory | start_dir = full pathname of the directory where the program runs |
| -exe path_to_executable | Mandatory | Full pathname of your executable |
| -verbose 0-4 | Optional | Controls diagnostic output; default is 0, 1 is recommended |
| -args "list_of_args" | Optional | list_of_args = list of arguments to the executable (enclosed in quotes) |
| -mode CO\|VN | Optional | CO = coprocessor mode (default), VN = virtual node mode |
| -connect MESH\|TORUS | Optional | Defaults to MESH; N must be a multiple of 512 to use TORUS |
See mpirun -h for a full list of mpirun options.
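As an illustration of how these options fit together, here is a minimal sketch that validates the values against the constraints in the table and assembles the string that would go on the "@ arguments" line of a jcf file. The function name and the example paths are hypothetical; only the option names and limits come from the table.

```python
def mpirun_arguments(np, cwd, exe, verbose=1, args=None, mode="CO", connect="MESH"):
    """Build an mpirun argument string, checking the limits from the table."""
    if mode not in ("CO", "VN"):
        raise ValueError("mode must be CO or VN")
    if connect not in ("MESH", "TORUS"):
        raise ValueError("connect must be MESH or TORUS")
    max_tasks = 2048 if mode == "VN" else 1024
    if not 1 <= np <= max_tasks:
        raise ValueError(f"np must be between 1 and {max_tasks} in {mode} mode")
    if connect == "TORUS" and np % 512 != 0:
        raise ValueError("np must be a multiple of 512 to use TORUS")
    if verbose not in range(5):
        raise ValueError("verbose must be between 0 and 4")
    if not (cwd.startswith("/") and exe.startswith("/")):
        raise ValueError("cwd and exe must be full pathnames")

    parts = ["-np", str(np), "-cwd", cwd, "-exe", exe,
             "-verbose", str(verbose), "-mode", mode, "-connect", connect]
    if args:
        parts += ["-args", f'"{args}"']   # program arguments go in quotes
    return " ".join(parts)

print(mpirun_arguments(64, "/project/demo", "/project/demo/a.out"))
```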
Important note on N above: although you can choose N to be any value in the above ranges, the system will only allocate 32, 128, 512, or 1024 physical nodes to a job. The system allocates the smallest number of allowed physical nodes necessary to run one task per node (or two per node in VN mode).
For complete documentation on mpirun, please refer to Chapter 3 of the Blue Gene/L System Administration Redbook (html|pdf).
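The allocation rule in this note can be written as a short function. This is a sketch of the rule as described here, not the scheduler's actual code, and the function name is hypothetical.

```python
ALLOWED_PARTITION_SIZES = (32, 128, 512, 1024)   # physical nodes, from the text

def allocated_nodes(ntasks, mode="CO"):
    """Return the number of physical nodes allocated for ntasks MPI tasks."""
    if mode not in ("CO", "VN"):
        raise ValueError("mode must be CO or VN")
    tasks_per_node = 2 if mode == "VN" else 1
    if ntasks < 1:
        raise ValueError("ntasks must be at least 1")
    needed = -(-ntasks // tasks_per_node)        # ceiling division
    for size in ALLOWED_PARTITION_SIZES:
        if size >= needed:
            return size
    raise ValueError("job does not fit on a single rack")

for n in (1, 33, 129, 600):
    print(n, "tasks ->", allocated_nodes(n), "nodes")
```

For example, a 33-task job in CO mode occupies 128 nodes, so choosing N to match one of the partition sizes avoids leaving allocated nodes idle.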
Submitting and Tracking your Job Using LoadLeveler
Once you have tailored your jcf file you can use LoadLeveler to submit your job to run on the Blue Gene. The command you use to submit your job is:
levi% llsubmit jcf_file
Once you have submitted your job you can use the following commands to monitor and change your job while it is queued and running on the Blue Gene:
- llq: Shows queued and running jobs
- llcancel: Deletes a queued or running job
- llprio: Changes the priority of your queued jobs
- qstat: Similar to llq but includes more information
- bglstat: Shows current allocation of the Blue Gene machine
Scheduling Policy
The scheduler implements the following usage limits:
| Limit | During Business Hours* | During Off Hours |
|---|---|---|
| Maximum Runtime per Job | 5 hours | 5 hours |
| Maximum Nodes used per User | 512 | 1024 |
(*Business Hours: 9am - 5pm Eastern Time, Monday - Friday.)
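The per-user node limit depends on the time of day, which can be expressed as a small lookup. In this sketch the function name is hypothetical, the timestamp is assumed to be already in Eastern Time, and 5pm itself is assumed to count as off hours; the page does not say how the boundary is treated.

```python
from datetime import datetime

MAX_RUNTIME_HOURS = 5   # same limit during business hours and off hours

def max_nodes_per_user(when):
    """Return the node limit in force at `when` (a datetime in Eastern Time)."""
    is_weekday = when.weekday() < 5          # Monday is 0, Friday is 4
    in_business_hours = 9 <= when.hour < 17  # 9am up to, but not including, 5pm
    return 512 if (is_weekday and in_business_hours) else 1024

print(max_nodes_per_user(datetime(2006, 10, 2, 10, 0)))   # a Monday morning
print(max_nodes_per_user(datetime(2006, 10, 1, 10, 0)))   # a Sunday morning
```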
After enforcing the above limits, the scheduler prioritizes the runnable jobs and runs the highest priority one if the necessary resources are available. If the necessary resources are not yet available the scheduler uses a backfilling strategy to run lower priority, short duration jobs on any available resources as long as doing so will not delay the starting of the highest priority job.
The primary ordering criterion used to prioritize jobs is the amount of recent runtime accumulated by the user. This quantity is displayed by the qstat command under the SYSPRI column as a negative value. Jobs with the same SYSPRI are ordered by submission time. Finally, a user can alter the relative ordering of their own jobs with the llprio command. This command modifies the "user priority" which is displayed in the PRI column of the llq and qstat commands. The qstat command lists the waiting jobs in the above described scheduling order.
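The first two ordering criteria can be pictured as a two-key sort. The sketch below is a simplification under stated assumptions: it assumes that a SYSPRI closer to zero (less recent runtime) ranks first, and it leaves out both the llprio user priority and backfilling. The Job record and its field names are hypothetical.

```python
from collections import namedtuple

# Hypothetical job record: syspri is the negative value shown by qstat,
# submit_time is any sortable timestamp (smaller means submitted earlier).
Job = namedtuple("Job", ["name", "syspri", "submit_time"])

def scheduling_order(jobs):
    """Order jobs by SYSPRI (closest to zero first), then by submission time."""
    return sorted(jobs, key=lambda job: (-job.syspri, job.submit_time))

queue = [
    Job("a", -500, 10),
    Job("b", -20, 30),
    Job("c", -20, 5),
]
print([job.name for job in scheduling_order(queue)])
```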
Interactive Mpirun Use
Sometimes it is convenient (e.g. during program development or debugging) to execute mpirun directly on the login node rather than through the batch system. This is normally not permitted, but it can be arranged by sending a request to help@twister.bu.edu. We will allocate a partition of the machine for your exclusive use and you will be able to use it by invoking mpirun in the following way:
levi% mpirun -noallocate -partition YOUR_PARTITION ...
where YOUR_PARTITION is the name of the partition assigned to you and "..." represents all the other flags you would normally pass to mpirun.
Totalview Debugger
The Totalview debugger is available for use on the Blue Gene. To use the debugger you must first compile your code with the -g flag. It will also be convenient to have your executable and source code in the same directory and to invoke the debugger from that directory.
You can start totalview through the batch system by using an appropriately modified version of the sample totalview jcf file.
You can also start it interactively on a login node if you have your own partition as described under Interactive Mpirun Use above:
levi% totalview mpirun -a -noallocate -partition YOUR_PARTITION ...
In either case, once it starts, two windows will appear on your screen. Click the GO button in the larger window titled mpirun.
A small window titled Question will pop up. Click YES in that window (you want to stop the job).
After a little while the large window will be retitled mpirun<your_program>.0 and you should see your source code displayed in it.
At this point all of your mpi tasks have been created and are under the control of the debugger. They are stopped before main has been called.
See the totalview documentation for more information.