Question: My batch job exited with code ###. What does that mean?

When a program finishes executing, it returns an exit code to the system, and the batch system reports this exit code. There are three general ways for the exit code of a program to be set.

1) The program explicitly calls exit() (or returns from main(), which eventually calls exit()). In this case the exit code is the argument to exit(), and its meaning depends on the program. The call to exit() may actually occur in a library routine that your program uses. An example of this is the SGI FORTRAN I/O library: the FORTRAN I/O routines set an exit code in the range 100-185 when an error occurs. The specific meaning of these codes can be found in the appendix to the Fortran 77 Programmer's Guide (available online as an InSight book).

2) The program executes the last instruction in main() without calling exit() or executing a return statement. In this case the system sets the exit code to 0.

3) The program terminates because it received a signal. In this case the system sets the exit code to 128 + the signal number. (This assumes that the program does not have a signal handler that calls exit(); if it does, we are back in case 1.)

The following table lists the signals whose default action is to terminate a program. Note that some signal numbers, and therefore the corresponding exit codes, differ between the two platforms (SGI Origins and IBM pSeries). See /usr/include/sys/signal.h for more information.

Name       Number (SGI)  Number (IBM)
SIGHUP     1             1
SIGINT     2             2
SIGQUIT    3             3
SIGILL     4             4
SIGTRAP    5             5
SIGABRT    6             6
SIGEMT     7             7
SIGFPE     8             8
SIGKILL    9             9
SIGBUS     10            10
SIGSEGV    11            11
SIGSYS     12            12
SIGPIPE    13            13
SIGALRM    14            14
SIGTERM    15            15
SIGUSR1    16            30
SIGUSR2    17            31
SIGPOLL    22            23
SIGIO      22            23
SIGVTALRM  28            34
SIGPROF    29            32
SIGXCPU    30            24
SIGXFSZ    31            25
SIGRTMIN   49            888
SIGRTMAX   64            999

There are several reasons that your program might receive a signal.

a) You sent it a signal with kill, bkill, or bdel.
If you don't specify which signal to send, kill defaults to SIGTERM (exit code 143) and bkill defaults to SIGKILL (exit code 137). bdel sends SIGINT (exit code 130), then SIGTERM, then SIGKILL, until your job dies.

b) The system sent it a signal because an error occurred or a system resource limit was reached. In this case, in addition to the exit code, the batch system will usually report an error message. Examples of this case are:

Signal    Exit code (SGI / IBM)  Typical reason
SIGILL    132                    illegal instruction; binary probably corrupt
SIGTRAP   133                    integer divide-by-zero
SIGFPE    136                    floating-point exception or integer overflow (see note below)
SIGBUS    138                    unaligned memory access (e.g., loading a word that is not aligned on a word boundary)
SIGSEGV   139                    attempt to access a virtual address that is not in your address space
SIGXCPU   158 / 152              CPU time limit exceeded
SIGXFSZ   159 / 153              file size limit exceeded

Note: SIGFPE is not generated for these exceptions unless special action is taken; see man sigfpe for more information.

c) The job exceeded a limit of the queue it was running in. Three queue limits are enforced by sending the job a signal: the first two by the batch system, the third by the system itself.

i) CPU limit. When the total CPU usage of all processes in a batch job exceeds the queue CPU limit, the batch system kills the job by sending SIGXCPU, then SIGINT, then SIGTERM, then SIGKILL, until the job dies. In this case the exit message says "Exited with signal termination: Cputime limit exceeded, and core dumped." Under rare circumstances, the system (as opposed to the batch system) could kill a job by sending SIGXCPU. In that case the exit message would say "Exited with exit code 158." on the SGI systems (152 on the IBM systems).

ii) RUNTIME limit. Most batch queues have a limit on the actual (wall-clock) time that a job can run. When this limit is exceeded, the batch system kills the job by sending SIGUSR2, then SIGINT, then SIGTERM, then SIGKILL, until the job dies. Usually the SIGUSR2 kills the job, and the exit message says "Exited with exit code 145." on the SGI systems or "Exited with exit code 159." on the IBM systems, because SIGUSR2 is signal 17 on SGI and signal 31 on IBM. Note that on the SGI systems exit code 159 instead means SIGXFSZ, so check which platform the job ran on. (Since the batch system is killing the job, one could argue that it is a bug not to give a more informative message.)

iii) STACKSIZE limit. This limit is enforced by the system (rather than the batch system), which kills the job by sending SIGSEGV. The exit message says "Exited with exit code 139."

9/11/07 dugan
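The rule from case 3 (exit code = 128 + signal number) combined with the signal table can be turned into a small lookup. The following is a minimal Python sketch, not a tool provided by either batch system; the names SIGNAL_TABLE, signal_names, and describe_exit_code are hypothetical. It assumes that any code above 128 came from a signal, even though a program (or the SGI FORTRAN I/O library, codes 100-185) could also call exit() with such a value.

```python
# Hypothetical helper: interpret a batch job exit code using the signal
# table from this FAQ. ASSUMPTION: a code above 128 is treated as
# 128 + signal number, although exit() could also produce such a value.

# (name, SGI number, IBM number), copied from the signal table.
SIGNAL_TABLE = [
    ("SIGHUP", 1, 1), ("SIGINT", 2, 2), ("SIGQUIT", 3, 3),
    ("SIGILL", 4, 4), ("SIGTRAP", 5, 5), ("SIGABRT", 6, 6),
    ("SIGEMT", 7, 7), ("SIGFPE", 8, 8), ("SIGKILL", 9, 9),
    ("SIGBUS", 10, 10), ("SIGSEGV", 11, 11), ("SIGSYS", 12, 12),
    ("SIGPIPE", 13, 13), ("SIGALRM", 14, 14), ("SIGTERM", 15, 15),
    ("SIGUSR1", 16, 30), ("SIGUSR2", 17, 31), ("SIGPOLL", 22, 23),
    ("SIGIO", 22, 23), ("SIGVTALRM", 28, 34), ("SIGPROF", 29, 32),
    ("SIGXCPU", 30, 24), ("SIGXFSZ", 31, 25),
    ("SIGRTMIN", 49, 888), ("SIGRTMAX", 64, 999),
]

# Which tuple column holds the signal number for each platform.
PLATFORM_COLUMN = {"sgi": 1, "ibm": 2}


def signal_names(signum, platform):
    """Return the table names whose number on `platform` equals `signum`."""
    col = PLATFORM_COLUMN[platform]
    return [row[0] for row in SIGNAL_TABLE if row[col] == signum]


def describe_exit_code(code, platform):
    """Return a one-line explanation of a batch job exit code."""
    if code == 0:
        return "exit code 0: normal termination"
    if code > 128:
        signum = code - 128
        names = signal_names(signum, platform)
        if names:
            # Several names can share a number (e.g. SIGPOLL and SIGIO).
            return f"exit code {code}: signal {signum} ({'/'.join(names)})"
    # Not a known signal: assume the program or a library called exit().
    return f"exit code {code}: value passed to exit() by the program or a library"


if __name__ == "__main__":
    for code, platform in [(143, "sgi"), (145, "sgi"), (159, "sgi"), (159, "ibm")]:
        print(platform, describe_exit_code(code, platform))
```

For example, describe_exit_code(159, "sgi") reports SIGXFSZ, while describe_exit_code(159, "ibm") reports SIGUSR2, which is why the platform matters when reading a RUNTIME-limit kill.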