Scientific Computing & Visualization

Scientific Computing Facilities Frequently Asked Questions

General Questions 

1. What are the Scientific Computing Facilities (SCF)?
The Scientific Computing Facilities (SCF) currently include the IBM Blue Gene system, IBM pSeries machines (4 p690s and 9 p655s), an Intel Pentium III Linux Cluster, and our virtual reality/scientific visualization facilities. They are part of the more general SCV Computing and Visualization Facilities.

A general introduction to use of the SCF is available at Information for New SCF Users. The primary computational machines are listed under our Scientific Computing Facility Technical Summary.

2. What machines is my account good for?
Your SCF account gives you access to the IBM pSeries systems (twister.bu.edu) and to the Intel Pentium III Linux Cluster (skate.bu.edu and cootie.bu.edu). The IBM Blue Gene (levi.bu.edu and lee.bu.edu) is also in production use, but special restrictions apply to accessing it so users are not automatically given access. If you need access to the Blue Gene, fill out this web form.

The primary computation clusters each have one or two machines designated for interactive use, and you can log in only to those machines. On the pSeries systems the machine is twister.bu.edu; on the Linux Cluster they are skate.bu.edu and cootie.bu.edu; and on the Blue Gene, for those who have access to it, they are levi.bu.edu and lee.bu.edu. Your password is shared across all of these systems.

You should log into one of the above machines and do all your editing and compiling there as well. If your program runs on a single processor and requires less than ten minutes of CPU time, you can also execute your program on one of these machines (with the exception of the Blue Gene systems) interactively. Otherwise, you should submit your program as a batch job and it will automatically be parceled out to the appropriate machine in the facilities based on available resources and what queue you select (please also see later questions on batch system usage).

3. Where do I find documentation?
4. What debuggers are there?
On the IBM pSeries machines, the debugger is pdbx. pdbx is a command-line parallel debugger suitable for MPI.

On the Linux cluster, the debugger is idb, the Intel debugger.

There are also the standard debuggers dbx and gdb.

The debugger on the Blue Gene system is TotalView.

5. How do I change my password?
To change your SCF password, you must run passwd on twister.bu.edu.
6. What do I do if I forget my password?
Please send email to scfacct@bu.edu explaining your situation.
7. How do I retrieve lost files?
Please send email to help@twister.bu.edu, explaining exactly what files you deleted, what machine and filesystem they were on, and at what day and time you did it.
8. How do I get more resources (such as disk space)?
For home directory disk space, fill out this form. If you are a Principal Investigator for a project which needs more CPU time or /project space, try using the appropriate form linked to from http://scv.bu.edu/accounts/. Make sure to specify what machine you are requesting resources on, why you need them and what exactly you need. These requests, particularly large ones, can take several weeks to process and consider.

Filesystem/Disk Questions 

9. Which filesystems are shared?
You have one home directory on the IBM pSeries systems and one on the Linux Cluster/Blue Gene. Your Linux home directory is accessible from the IBM machines with the pathname /linux/$HOME. Similarly, your IBM home directory is accessible from the Linux machines with the pathname /ibm/$HOME.

Each machine (with the exception of the Blue Gene systems) has its own scratch partition. If necessary, you can access the /scratch partitions on other machines of the same architecture. On the pSeries systems you can access a remote scratch space via the pathname /hostname/scratch (for example, /frisbee/scratch). On the Linux Cluster machines, a remote scratch space can be accessed via the pathname /net/hostname/scratch (for example, /net/node006/scratch). On the Katana Cluster machines, use the pathname "/net/katana-aNN/scratch", where NN is a node number from 01, 02, ... 13.

There are several partitions of Project space. All of the /project file systems can be accessed from any of the SCF machines.

10. I need large amounts of temporary space for my jobs. What do I do?
Use /scratch and see the previous question.

If /scratch on a given machine is full, do one of the following: 1) Remove any files you no longer need to free up space. 2) Use /scratch on a different machine that has more space (see the previous question).

If this is a regular need and /scratch does not adequately take care of it, the Principal Investigator of your project can apply for /project disk space, backed up or not backed up as appropriate.
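Before choosing a scratch area, it can help to compare the free space on the candidates. The following is only a sketch: the helper name free_kb is made up, /frisbee/scratch is the example path from question 9 rather than a recommendation, and it assumes a df that accepts the POSIX -P and -k options.

```shell
#!/bin/sh
# free_kb DIR: print the free space, in kilobytes, on the filesystem holding DIR.
# (Hypothetical helper; relies on the POSIX "df -P" column layout.)
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Compare the local scratch area with a remote one of the same architecture.
# Directories that do not exist on this machine are skipped.
for dir in /scratch /frisbee/scratch; do
    if [ -d "$dir" ]; then
        echo "$dir: $(free_kb "$dir") KB free"
    fi
done
```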

11. Why do my recently unTARed files get immediately removed?
The /scratch reaper automatically removes files that are more than 10 days old. It determines a file's age from its "write date" (its modification time). By default, tar restores the modification time recorded in the archive, so an older file that is unTARed will be reaped at the next opportunity. The -m switch to the tar command overrides this behavior. The following is from the tar man page:
m      Do not restore the modification times.  
       The modification time will be the time of extraction.
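The effect of -m can be seen with a small experiment. This is a sketch only: the file and directory names are invented, it works in a temporary directory rather than /scratch, and it assumes a tar that accepts -C (GNU tar does; other tar implementations may differ).

```shell
#!/bin/sh
# Demonstrate that extracting with -m gives a file a fresh modification time.
work=$(mktemp -d)

echo "data" > "$work/results.dat"
touch -t 200001010000 "$work/results.dat"      # pretend the file is years old
tar -cf "$work/old.tar" -C "$work" results.dat

mkdir "$work/plain" "$work/fresh"
tar -xf  "$work/old.tar" -C "$work/plain"      # keeps the old modification time
tar -xmf "$work/old.tar" -C "$work/fresh"      # -m: modification time is "now"

# The copy in plain/ shows the year-2000 date; the copy in fresh/ shows today.
ls -l "$work/plain/results.dat" "$work/fresh/results.dat"
```

A file extracted as in the plain/ case would look older than 10 days to the reaper, while the one extracted with -m would not.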
12. Does the SCF have a long term storage facility?
Yes, it is possible to archive your files for long term storage using the IBM Distributed Storage Manager (Tape Robot).

Batch Job/System Questions 

13. How do I submit a batch job?
On the IBM pSeries systems, we use Platform Computing's LSF batch management system. Use bsub (see question 15 for an example) or xlsbatch, the Motif GUI for lsfbatch, the load sharing batch system installed on the SCF. Lsfbatch uses LSF (Load Sharing Facility) to distribute the load for both parallel and serial batch jobs over the system. See also the SCV help page on LSF.

The batch system on the Linux Cluster is OpenPBS.

On the Blue Gene, the batch system is LoadLeveler.

14. What limitations are there on jobs (# of nodes, runtime, etc...) on the various systems?
Our Scientific Computing Facility Technical Summary explains the job limitations on all of our systems.
15. How do I have one batch job wait for another to complete?
The bsub command in the LSF batch system (on the pSeries machines) has a wait option (-w) which allows you to specify the conditions which you wish to wait for before starting the job, including waiting for the termination of another job. For example,
bsub -w 'done("myjob1")&&done("myjob2")' myjob3
will cause myjob3 to wait until both myjob1 and myjob2 have completed. Another option, -b, lets you specify that a job should not be run before a certain time. Finally, the -E option provides a completely general mechanism to have a job wait until an arbitrary condition is true. With this option you specify a command that the batch system executes before running your job. If the command exits with status 0, the job is run. Otherwise the job is put back on the queue.
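As an illustration of the -E mechanism, the check below succeeds only when an input file exists and is non-empty. It is a sketch under assumptions: the name input_ready and the path /scratch/mydata.in are hypothetical, and the bsub lines are shown as comments because they need LSF to run.

```shell
#!/bin/sh
# input_ready FILE: exit status 0 only if FILE exists and is not empty.
# (Hypothetical check; any command that returns 0 on success would do.)
input_ready() {
    [ -s "$1" ]
}

# Possible use on the pSeries, with the check saved as an executable script
# named input_ready (not executed here):
#   bsub -E "./input_ready /scratch/mydata.in" myjob3
#
# Delaying a job until 8 pm with -b:
#   bsub -b 20:00 myjob3
```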
16. My batch job starts several other jobs but these other jobs get killed by the reaper. Why?
If the original job terminates before its children, the reaper cannot determine that the children were started by the batch job and so kills them. Make sure the parent job does not end before the children.
17. My batch job is expected to run longer than the queue's time limit. What can I do?
The answer depends on how your code is implemented:
  1. If your code is written as a serial (single processor) application, rewriting it as a parallel (multiprocessor) application could help, provided that the underlying algorithm is inherently parallelizable. Parallelization can be achieved with OpenMP or MPI. OpenMP is limited to shared-memory machines such as the IBM pSeries. MPI, on the other hand, works on shared-memory machines as well as distributed-memory machines (such as the SCV Linux Cluster). Please contact Kadin Tseng or Doug Sondak for more details.
  2. If your program is already parallelized using MPI, you might want to try running it on the Linux Cluster. The Linux Cluster usage policy allows 16-node parallel jobs to run for as long as 24 hours, compared to 5 hours for the pSeries multi-processor queues. So even though the Linux Cluster processors are slower than the pSeries processors, your job may be able to complete within the allowed time on the Linux Cluster.
  3. Modify your program so it periodically saves state and can be restarted where it left off. See Doug or Kadin for help doing this.
  4. If your code is already parallelized with MPI and scales to many (hundreds of) processors, you could port it to the IBM Blue Gene.
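The save-and-restart approach in item 3 can be sketched as follows. This is an illustration only: the state file name, the step counts, and the trivial "work" (a running sum) are placeholders for whatever your program actually needs to save.

```shell
#!/bin/sh
# Checkpoint/restart sketch. Each run does at most a fixed number of steps,
# saving its state after every step so the next run can resume.
state_file=${STATE_FILE:-checkpoint.txt}   # hypothetical file name
total_steps=${TOTAL_STEPS:-10}

# run_steps BUDGET: do up to BUDGET steps (a stand-in for the queue time
# limit), then print the step reached and the running sum.
run_steps() {
    budget=$1
    step=0
    sum=0
    # Resume from the last checkpoint if one exists.
    if [ -f "$state_file" ]; then
        read step sum < "$state_file"
    fi
    while [ "$step" -lt "$total_steps" ] && [ "$budget" -gt 0 ]; do
        step=$((step + 1))
        sum=$((sum + step))
        budget=$((budget - 1))
        # Write to a temporary file first so that an interrupted write
        # cannot corrupt the previous checkpoint.
        echo "$step $sum" > "$state_file.tmp" && mv "$state_file.tmp" "$state_file"
    done
    echo "$step $sum"
}
```

Calling run_steps 4 and then run_steps 100 finishes the ten steps across two runs, which mirrors submitting the same batch job twice.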
18. My batch job exited with code ###. What does that mean?
See this long explanation of the batch system exit codes.
19. How does the LSF batch system schedule jobs?
See this long explanation of the batch system scheduler.
20. My LSF (batch) run seemed to run to completion, but I never received the usual email message notifying me that the job had finished. What is the problem?
At the end of an LSF run the user is automatically sent email to indicate that the job has completed. This email contains everything that was written to standard out during the run. If a large amount of information (greater than 10MB) is written to standard out, the email becomes too large for the mail system to process, and the email is not sent. This sometimes occurs when the user forgets to delete a large number of diagnostic print statements from a run. The best solution is to always re-direct standard out to a file (e.g., myrun > myoutput).
21. I am a member of multiple project groups. How do I charge my usage to a project other than my default one?
On the pSeries, use the -P project_name option to bsub when you submit your job.

On the Katana Cluster, run the command newgrp project_name in your shell window before doing your run or submitting your job (from that window) to the batch system.

On the Linux Cluster, the option to qsub is -W group_list=project_name. Unfortunately, this works only for single processor jobs. Multiprocessor jobs on the Linux Cluster are currently always charged to your default project, so see below for instructions on changing it.

You can also change your default project by going here (requires your Twister login and password) and then completing and submitting the appropriate Web form. Your default project will then be changed the next time the system configuration files are updated, generally overnight.

Parallel Programming Paradigms - MPI, OpenMP, etc... 

22. How do I specify the number of processors my job will run on?
It depends on the parallel paradigm (OpenMP, MPI) and/or the computer platform. Please consult the appropriate link below:
23. How do I run MPI jobs?
Please read the Multiprocessing by Message Passing Tutorial where you will find instructions on what you need to do to use MPI.
24. How do I run PVM jobs?
PVM is not available on any of our current systems.

Miscellaneous Questions 

25. I think I have discovered a bug in F90, gcc, etc... What should I do?
Send email to help@twister.bu.edu with a description of the problem. If possible, tell us exactly how to reproduce the problem you are having. If we can reproduce your problem, we can probably fix it. If you don't know how to reproduce the problem, please provide as much information as possible including:
  • Hostname of machine.
  • The name and location of the program (with flags and input files).
  • Any error messages you get.
26. On the IBM/AIX machines my program fails with the error:
   twister:~> a.out    
     exec(): 0509-036 Cannot load program a.out because of the following errors:
             0509-026 System error: There is not enough memory available now.  
How do I deal with this error and get access to more memory on the IBM/AIX machines?
The error message is misleading; the system has plenty of memory. By default, a 32-bit AIX executable has a 256MB data segment limit. You need to use the -bmaxdata compiler flag to use more memory. For example:
twister:~> xlf -bmaxdata:0x40000000 prog.f 
produces an executable with a 1GB data segment limit. It is usually safe to compile with -bmaxdata:0x80000000 for a 2GB data limit. To go above 2GB, append /dsa to the value. The largest value you can specify is -bmaxdata:0xd0000000/dsa, for a 3.25GB data limit. However, this may or may not work, depending on the details of your program. If you need that much memory, consider compiling a 64-bit executable with the -q64 flag.
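The hexadecimal values above are simply byte counts. The sketch below converts between them and megabytes with shell arithmetic; the function names are made up for this example, and it only does the arithmetic, without invoking any compiler.

```shell
#!/bin/sh
# maxdata_mb HEX: convert a -bmaxdata value to megabytes (1MB = 1048576 bytes).
maxdata_mb() {
    echo $(( $1 / 1048576 ))
}

# mb_to_maxdata MB: produce the hex value to pass to -bmaxdata for MB megabytes.
mb_to_maxdata() {
    printf '0x%x\n' $(( $1 * 1048576 ))
}

maxdata_mb 0x40000000    # prints 1024 (1GB)
maxdata_mb 0x80000000    # prints 2048 (2GB)
maxdata_mb 0xd0000000    # prints 3328 (3.25GB)
mb_to_maxdata 512        # prints 0x20000000
```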

You can also use the ldedit command to "fix" the executable without recompiling.

Finally, if you are using the GNU compilers (gcc,g++,g77) you need an additional -Xlinker flag:

twister:~> g77 -Xlinker -bmaxdata:0x40000000 prog.f 
27. Is there a problem with Mathematica regarding fonts?
The X frontend for Mathematica requires some special Mathematica fonts. If the fonts are not available to your X-server, Mathematica won't work. To fix this, you need to install the fonts in a place where your X-server can find them. On all SCV maintained machines, the fonts are in the directory /usr/local/apps/mathematica-5.2/SystemFiles/Fonts/BDF. Copy them (about 20MB) to a directory that your X-server can access and, if necessary, add the directory to your font path by executing a command like:
xset fp+ <full path to mathematica font directory>

If you need help, please ask the administrator of your workstation for assistance.

28. Why can't the machine I am using read my datafile that I created on another computer?
If the file is a binary file, it may be a problem with endian-ness. Intel and DEC computers are usually "little-endian", while MIPS (SGI), SPARC (Sun), and PPC (IBM/AIX) are "big-endian". This means that the two kinds of machine store the bytes of a multi-byte value, such as an integer, in opposite order. The best solution is to use a portable data format, such as ASCII, for your data files.
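To see what the byte-order difference does to a value, the sketch below decodes the same four bytes both ways. The function name read_u32 is invented for this example; it uses only od and shell arithmetic, so it gives the same answers regardless of the byte order of the machine it runs on.

```shell
#!/bin/sh
# read_u32 ORDER FILE: decode the first 4 bytes of FILE as an unsigned
# 32-bit integer, with ORDER being "little" or "big".
read_u32() {
    # od prints the 4 bytes as decimal numbers; they become $2..$5.
    set -- "$1" $(od -An -tu1 -N4 "$2")
    case $1 in
        little) echo $(( $2 + ($3 << 8) + ($4 << 16) + ($5 << 24) )) ;;
        big)    echo $(( ($2 << 24) + ($3 << 16) + ($4 << 8) + $5 )) ;;
    esac
}

sample=$(mktemp)
printf '\001\000\000\000' > "$sample"   # the integer 1 as a little-endian machine writes it
read_u32 little "$sample"               # prints 1
read_u32 big "$sample"                  # prints 16777216
rm -f "$sample"
```

A big-endian machine reading this file as raw binary would see 16777216 instead of 1, which is the kind of corruption described above.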
29. None of this answered my questions. What should I do?
For other questions, send mail to help@twister.bu.edu or use the newsgroup bu.mail.scfug-l to post questions so other users can help you, or to see if your question has come up before. You can also subscribe to the scfug-l mailing list by sending mail to majordomo@bu.edu with the following line as the BODY of the message (the Subject line does not matter):
"subscribe scfug-l@bu.edu your_login_name@machine_name.bu.edu". Make sure to specify your BU login name and include a specific machine name to send mail to. Mail to this list also appears in the bu.mail.scfug-l newsgroup mentioned above.
Boston University
OIT | CCS | April 16, 2008  