Scientific Computing & Visualization

Running Jobs

The mpirun command is used to execute a job on the Blue Gene. Because we do not allow interactive use of the mpirun command on the login nodes (except by special arrangement), you must submit it as a job to the LoadLeveler batch system. There are two ways to submit batch jobs. The standard method is to set up a job command file (jcf) that specifies the mpirun command along with the executable name and other parameters, and then to submit this jcf to the batch system. A sample of a typical jcf is provided here. The alternative method is to run a script that accepts input parameters, such as the executable file name, on the command line. The script generates a jcf from those parameters and submits it to the batch queue.

Mpirun  

To run a Blue Gene executable via the mpirun command, you should compile an executable and place it on a file system that can be accessed by the Blue Gene (/project, /project2, /projectnb1, /projectnb2, or any of /usr1-/usr4). You should then modify the sample jcf, particularly by changing the "@ arguments" line to reference your executable and set appropriate command line arguments for mpirun.

The main command line arguments to mpirun are:
Argument Usage            Mandatory?  Description
-np N                     Mandatory   N = number of MPI tasks (see section below)
-cwd start_dir            Mandatory   start_dir = full pathname of the directory where the program runs
-exe path_to_executable   Mandatory   Full pathname of your executable
-verbose 0-4              Optional    Controls diagnostic output; default is 0, 1 is recommended
-args "list_of_args"      Optional    "list_of_args" = list of arguments to the executable (enclosed in quotes)
-mode CO|VN               Optional    CO|VN = COprocessor (default) or Virtual Node mode (see section below)
-connect MESH|TORUS       Optional    MESH|TORUS = defaults to MESH; N must be a multiple of 512 to use TORUS
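
As an illustration of how the arguments in the table combine, the following Python sketch assembles an mpirun command line. The helper name build_mpirun_command and its checks are hypothetical and are not part of any SCV tool; it uses only the flags listed in the table and applies the TORUS rule literally as the table states it.

```python
import shlex


def build_mpirun_command(np, cwd, exe, verbose=None, args=None,
                         mode=None, connect=None):
    """Assemble an mpirun command line from the arguments in the table.

    Hypothetical helper for illustration only. The three mandatory
    arguments always appear; optional ones appear only when given.
    """
    if np < 1:
        raise ValueError("np must be a positive number of MPI tasks")
    cmd = ["mpirun", "-np", str(np), "-cwd", cwd, "-exe", exe]

    if verbose is not None:
        if verbose not in range(5):
            raise ValueError("verbose must be between 0 and 4")
        cmd += ["-verbose", str(verbose)]

    if args is not None:
        # The executable's arguments travel as one quoted string.
        cmd += ["-args", args]

    if mode is not None:
        if mode not in ("CO", "VN"):
            raise ValueError("mode must be CO or VN")
        cmd += ["-mode", mode]

    if connect is not None:
        if connect not in ("MESH", "TORUS"):
            raise ValueError("connect must be MESH or TORUS")
        # Table rule, taken literally: N must be a multiple of 512 for TORUS.
        if connect == "TORUS" and np % 512 != 0:
            raise ValueError("N must be a multiple of 512 to use TORUS")
        cmd += ["-connect", connect]

    # shlex.join quotes any element that contains spaces.
    return shlex.join(cmd)


if __name__ == "__main__":
    print(build_mpirun_command(1000, "/project/xyz/abc",
                               "/project/xyz/exe/a.out", mode="VN"))
```

The resulting string is what you would place in the jcf; the sketch does not submit anything.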

MPI parallel tasks and tasks per node

In the above table, N is defined as the number of MPI parallel tasks instead of the traditional definition of number of processors to more appropriately represent the Blue Gene nodes' hardware configuration. A Blue Gene node consists of 2 processors. In the default COprocessor mode (CO), one processor is used for computation and the other is dedicated to communication. This results in 1 MPI task per node and the node's entire 512 MB of memory is available to the task. In the Virtual Node mode (VN), both processors are used for computations. In this case, there are 2 tasks per node and both tasks share the node's 512 MB of memory. Our Blue Gene has a total of 1024 nodes. In the CO mode (1 task per node), the maximum number of tasks you can request is 1024. In the VN mode (2 tasks per node), the maximum number of tasks is 2048.

Important note on the number of MPI tasks, N. Although you can choose N to be any value up to 1024 (CO) or 2048 (VN), the system will only allocate 32, 128, 512, or 1024 physical nodes to a job. The system allocates the smallest allowed number of physical nodes necessary to run one task per node (or two per node in VN mode).

The following examples demonstrate some typical arguments to mpirun that are specified in the jcf as well as how the CO|VN mode affects the number of nodes allocated.

Example 1.
mpirun -np 1000 -cwd /project/xyz/abc -exe /project/xyz/exe/a.out

In this case, the system allocates 1024 nodes for the job because the job runs under the default CO mode and 1024 is the smallest allowable number of nodes (among 32, 128, 512, 1024) necessary to accommodate the requested 1000 tasks.

Example 2.
mpirun -np 1000 -cwd /project/xyz/abc -exe /project/xyz/exe/a.out -mode VN

In this case, the system allocates 512 nodes for the job because the job runs under the VN mode, where 1000 tasks at two tasks per node need 500 nodes, and 512 is the smallest allowable number of nodes (among 32, 128, 512, 1024) necessary to accommodate them.
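
The allocation rule behind the two examples can be sketched in a few lines of Python. The function name allocated_nodes is hypothetical; the allowed node counts and the tasks-per-node values are the ones stated on this page.

```python
# Node counts the system will allocate, as listed on this page.
ALLOWED_NODE_COUNTS = (32, 128, 512, 1024)

# MPI tasks per node in each mode: CO = coprocessor, VN = virtual node.
TASKS_PER_NODE = {"CO": 1, "VN": 2}


def allocated_nodes(num_tasks, mode="CO"):
    """Return the number of physical nodes allocated for num_tasks MPI tasks.

    Hypothetical helper for illustration only.
    """
    if mode not in TASKS_PER_NODE:
        raise ValueError("mode must be CO or VN")
    if num_tasks < 1:
        raise ValueError("num_tasks must be positive")

    # Nodes needed, rounded up (integer ceiling division).
    per_node = TASKS_PER_NODE[mode]
    needed = -(-num_tasks // per_node)

    # Pick the smallest allowed allocation that covers the need.
    for size in ALLOWED_NODE_COUNTS:
        if size >= needed:
            return size
    raise ValueError("request exceeds the 1024-node machine")


if __name__ == "__main__":
    print(allocated_nodes(1000))        # Example 1, CO mode
    print(allocated_nodes(1000, "VN"))  # Example 2, VN mode
```

Note that a small request is still rounded up: under this rule a 33-task CO job occupies 128 nodes.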

For additional mpirun options, enter mpirun -h at the system prompt. For complete documentation, please refer to Chapter 3 of IBM System Blue Gene Solution: System Administration (html | pdf).

Batch scheduling policy 

The scheduler implements the following usage limits:
Limit                         During Business Hours*   During Off Hours
Maximum Runtime per Job       5 hours                  5 hours
Maximum Nodes used per User   512                      1024
(*Business Hours: 9am - 5pm Eastern Time, Monday - Friday.)
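
To make the limits concrete, here is a minimal Python sketch that checks a request against them. The function names are hypothetical, the timestamp is assumed to already be in Eastern Time, and treating the 9am - 5pm window as including 9:00 and excluding 17:00 is an assumption; the scheduler's actual boundary handling is not documented here.

```python
from datetime import datetime

MAX_RUNTIME_HOURS = 5  # same limit during business hours and off hours


def is_business_hours(when):
    """True for Monday-Friday, 9am up to (not including) 5pm.

    Assumes `when` is already expressed in Eastern Time.
    """
    return when.weekday() < 5 and 9 <= when.hour < 17


def max_nodes_per_user(when):
    """Per-user node limit in force at the given time."""
    return 512 if is_business_hours(when) else 1024


def within_limits(user_nodes, runtime_hours, when):
    """Check a request against both limits.

    user_nodes is the total number of nodes the user would have in use,
    since the node limit applies per user rather than per job.
    """
    return (runtime_hours <= MAX_RUNTIME_HOURS
            and user_nodes <= max_nodes_per_user(when))


if __name__ == "__main__":
    monday_morning = datetime(2008, 4, 28, 10, 0)
    print(max_nodes_per_user(monday_morning))
    print(within_limits(1024, 5, monday_morning))
```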

After enforcing the above limits, the scheduler prioritizes the runnable jobs and runs the highest priority one if the necessary resources are available. If the necessary resources are not yet available, the scheduler uses a backfilling strategy: it runs lower priority, short duration jobs on any available resources as long as doing so will not delay the start of the highest priority job.
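
The backfilling idea can be illustrated with a simplified Python sketch. This is not the LoadLeveler algorithm; the Job class and pick_backfill function are hypothetical, and the sketch assumes that a lower priority job is safe to start only if it fits in the currently free nodes and finishes before the highest priority job could start.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int    # nodes the job needs
    hours: float  # requested runtime


def pick_backfill(queue, free_nodes, hours_until_top_can_start):
    """Return names of lower priority jobs that can start now.

    queue is in priority order and queue[0] is the highest priority job,
    which is waiting for resources. A job is backfilled only if it fits in
    the free nodes and ends before the top job's earliest start, so it
    cannot delay that job (a deliberately conservative simplification).
    """
    started = []
    for job in queue[1:]:
        fits = job.nodes <= free_nodes
        ends_in_time = job.hours <= hours_until_top_can_start
        if fits and ends_in_time:
            started.append(job.name)
            free_nodes -= job.nodes
    return started


if __name__ == "__main__":
    queue = [
        Job("big", 1024, 5),   # top job, must wait for the whole machine
        Job("a", 128, 1),
        Job("b", 512, 4),
        Job("c", 32, 0.5),
    ]
    print(pick_backfill(queue, free_nodes=160, hours_until_top_can_start=2))
```

In practice this means that requesting a short, accurate runtime improves the chance that a small job is backfilled.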

The primary ordering criterion used to prioritize jobs is the amount of recent runtime accumulated by the user. This quantity is displayed by the qstat command under the SYSPRI column as a negative value. Jobs with the same SYSPRI are ordered by submission time. Finally, a user can alter the relative ordering of their own jobs with the llprio command. This command modifies the "user priority" which is displayed in the PRI column of the llq and qstat commands. The qstat command lists the waiting jobs in the above described scheduling order.
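
The first two ordering criteria can be sketched as a sort key in Python. The QueuedJob class and scheduling_order function are hypothetical. The sketch assumes that a SYSPRI closer to zero (less recent runtime) is scheduled first, and it leaves out the llprio user priority because this page does not specify exactly how that adjustment is combined with the other criteria.

```python
from dataclasses import dataclass


@dataclass
class QueuedJob:
    name: str
    syspri: int         # negative of the user's recent runtime, as in qstat
    submit_time: float  # e.g. seconds since some reference time


def scheduling_order(jobs):
    """Order jobs by SYSPRI (highest first), then by submission time.

    Assumption: a higher (less negative) SYSPRI means higher priority.
    """
    return sorted(jobs, key=lambda j: (-j.syspri, j.submit_time))


if __name__ == "__main__":
    jobs = [
        QueuedJob("a", -300, 10.0),
        QueuedJob("b", -50, 20.0),
        QueuedJob("c", -50, 5.0),
    ]
    print([j.name for j in scheduling_order(jobs)])
```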

Alternative method to submit a batch job -- without a jcf 

This method uses a wrapper script, bglsub, which accepts command line input from the user, such as the number of tasks and the executable name, and generates a jcf as required in the standard method. The last operation of bglsub is to submit this newly generated jcf to the batch queue. (More ...)

Interactive mpirun usage

Sometimes it is convenient (e.g. during program development or debugging) to execute mpirun directly on the login node rather than through the batch system. This is normally not permitted, but it can be arranged by sending a request to help@twister.bu.edu. We will allocate a partition of the machine for your exclusive use, and you will be able to use it by invoking mpirun in the following way:

levi% mpirun -noallocate -partition YOUR_PARTITION ...  
where YOUR_PARTITION is the name of the partition assigned to you and "..." represents all the other flags you would normally pass to mpirun.

Batch Job Management Commands

  • To submit a batch job
    Lee:~ % llsubmit . . .
  • To query the status of batch jobs
    Lee:~ % llq . . .
  • LoadLeveler's own command to query the machine status
    Lee:~ % llstatus . . .
  • To delete a batch job from the system
    Lee:~ % llcancel . . .
  • To hold or release a submitted job
    Lee:~ % llhold . . .
  • To change the job priority of a submitted job
    Lee:~ % llprio . . .
  • To charge a batch job to a project
    A batch job is normally charged to the user's default project. If the user works on a single project, or if the charge should be levied against the default project, no user action is required. Users working on multiple projects may, at times, need to charge a batch job to a non-default project. Note that the charging procedure varies among SCV machines (see FAQ, Project Accounting). Please consult the respective machine's Running Jobs web page for the correct procedure. The charging procedure for the Blue Gene is described below.
    1. Charging to the default project
      No action required.
    2. Charging to a non-default project
      Add the following line to your batch script
       . . . . .
      
      # @ group = project_name
       . . . . .
  • To find out the projects of which you are a member
    Lee:~ % groups
    my_default_project  my_second_project  my_third_project . . .
    
    The first project on the list is always the default project. The default project can be changed.
Boston University
OIT | CCS | April 28, 2008  