The System Driver for the Boston University Blue Gene/L was upgraded on
Tuesday, June 28th, 2005 to Version 202. This driver contains many fixes to
the BG/L system as well as newer versions of the compilers. Application
profiling is the only major new functionality added by Driver 202; the
remaining bug fixes, enhancements and changes are listed below.

Send questions to help@twister.bu.edu

===========================================================
RECOMPILING APPLICATIONS
===========================================================

You do not have to recompile applications to run on the 202 driver, but you
will want to recompile to pick up improvements and fixes for application
execution.

===========================================================
APPLICATION PROFILING SUPPORT
===========================================================

* Application profiling support has been enabled in driver 202. This works
  as it normally does on Linux, with two differences regarding the gmon.out
  file that is produced when the profiled application is executed.

  1. On BG/L, a gmon data file is created for every compute node (by
     default). The gmon files are named gmon.out.N, where N is the MPI rank
     of the node.
  2. This can get out of hand for large partitions, so we've added a
     mondisable() routine (no parameters) that can be called to disable the
     creation of gmon.out.N files for particular nodes.

  Note: You must recompile/relink any code built with older driver
  libraries in order to get access to the driver 202 application profiling.
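Because each compute node writes its own gmon.out.N file, a run on a large partition can leave many profile files in the working directory. The following is a minimal sketch, in Python, of one way to collect them in rank order for post-processing on the front end. The helper name gmon_files_by_rank is hypothetical and not part of the driver; the only thing taken from the notes above is the gmon.out.N naming, where N is the MPI rank.

```python
import re
from pathlib import Path

# Matches the per-rank profile files described above: gmon.out.N, N = MPI rank.
_GMON_RE = re.compile(r"^gmon\.out\.(\d+)$")


def gmon_files_by_rank(directory):
    """Return {rank: path} for every gmon.out.N file in directory, sorted by rank.

    Hypothetical helper for illustration only. A plain "gmon.out" with no
    rank suffix is ignored, as is any unrelated file.
    """
    found = {}
    for path in Path(directory).iterdir():
        match = _GMON_RE.match(path.name)
        if match and path.is_file():
            found[int(match.group(1))] = path
    # Sort numerically so that rank 10 comes after rank 2, not before it.
    return dict(sorted(found.items()))
```

Sorting on the integer rank rather than the file name avoids the usual lexical ordering problem (gmon.out.10 before gmon.out.2).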
===========================================================
ISSUES ADDRESSED IN DRIVER 202
===========================================================

The following is a list of the bug fixes, enhancements and changes for
driver 202:

BUG FIXES
-----------------------------------------------------------
o RAS messages will be discontinued after 30 secs and core dumps will be
  discontinued after 5 minutes when a block is being deallocated.
o Data corruption on certain Read operations resolved
o Improved shutdown of IO node via {i} sysrq h
o Security enhancement - Authentication checking between ciodb and ciod.
  If authentication fails, a RAS event is written to the database and the
  connection is allowed to time out.
o Fixed: access attempts will fail when requesting R access on WX file or
  W access on RX file
o Fixed: ionode kernel panic was not generating RAS entries
o Removed length restrictions on mmcs command output
o Improvements and additions to web page function
o New trigger enforces that only FREE blocks can be deleted.
o TotalView now sees correct number of processors in virtual node mode
o Enhanced termination processing to avoid various system hangs when
  applications are killed
o Improved MMCS logging
o Diagnostics Version 3
o Fixed memory leaks in bridge APIs.
o Fixed failure of ciodb to generate core dumps.
o Enhancements to compute nodes to generate additional RAS events upon
  failures
o Fixed mmcs_db_console failure to reconnect after mmcs server restart
o Service Action improvements now support PrepareForService on Racks,
  Fans, Link Cards, Node Cards, Midplanes, Bulk Power Supply
o Fixed misleading error message sent when you try to free a block that
  has jobs running
o Fixed mpirun problem: Lots of output to stdout caused job to hang
o Enhancement: BGLMaster will now write a 'FAILURE' RAS entry with facility
  'BGLMASTER' when a server fails abnormally. It will say
  ' ended abnormally'.
  BGLMaster will also write an 'INFO' RAS entry when it starts, giving its
  startup parameters.
o Improved message: trying to kill a job on another block will now fail
  with the error "job is associated with another block"
o Improved boot time on multiple rack systems
o "list bgljob" will now show exitstatus of 0 for a running job
o Increased verbosity of I/O node boot
o Reduced mpirun orphaned processes. All the partition handling logic in
  mpirun has been moved to the backend (Service Node).
o Fixed EMAC performance degradation on IO node
o Improved MPI link level checksum support
o The time and gettimeofday syscalls now return the external time (the
  number of seconds since Jan 1, 1970 UTC). (They used to return the time
  since the machine was started).
o Suppressed repetition of partition name on llbgljob o x output
o Changed name of bgllinux rpm to bglmcp
o Fixed: restart (BGL_CHKPT_RESTART_SEQNO) zeros the initial segment of
  each file (all bytes up to the restart offset)
o Fixed mmcs_db_server memory leaks
o Fixed segmentation fault in MMCS_DB_SERVER when user allocated same
  block twice
o Improved RAS messages for midplanes in error
o Fixed ciod scaling issue with number of available file descriptors
o Fixed MPI Iprobe/send to self error
o Added an example bgl_perfctr file to
  /bgl/BlueLight/ppcdriver/examples/multichip/bgl_perfctr/
o clog/cerr will now be line buffered
o Improved security on CIOD debug port by only allowing connection from
  service node
o Fixed fcntl problem and problem where flow control is not cleared after
  a KILL
o New rm_modify_partition bridge API for enabling kernel verbose options.
o Fixed ciodb killjob hangs when mpirun not reading from stdout/err socket
o Startciodb now supports a --ciodrspto (ciod response timeout) argument,
  which defaults to 60 seconds.
o Fixed partition name collisions that caused rm_add_partition to fail
o Added readdir64 and getdents64 for CNK so that wordexp works (Fixed:
  Missing symbols in /bgl/BlueLight/ppcfloor/blrts)
o Fixed IONode out of memory

MPI ENHANCEMENTS
-----------------------------------------------------------
o Much improved virtual node mode alltoall/alltoallv. 2x faster on 512
  (1024 tasks) vn mode.
o Bcast up to 6x faster in virtual node mode on rectangular communicators
o Faster coprocessor mode broadcast bandwidth on rectangular communicators
  (5-10%)
o Improved memory copy for virtual node mode and unexpected messages
o Improved virtual node mode point to point performance
o Packet Pacing (flow control for performance). Improvement in codes that
  saturate the torus. Improves bisection bandwidth for off axis pairings.
  To control this, there are 2 environment variables:

    BGLMPI_PACING={y/n}
    BGLMPI_PACING_WIN=n {n=some number, 50 default}

  The pacing window is how often a receiver will ack the sender. For
  example, if you have the window set to 40, every 40 packets sent must be
  acked, or the sender will stop sending data. This relieves internal
  network congestion, giving a much higher throughput. Packet Pacing is
  **on** by default. Some codes may see slightly worse latency for p2p
  sends depending on the pacing settings.

MPIRUN ENHANCEMENTS
-----------------------------------------------------------
o Added "-nofree" command line argument and "MPIRUN_NOFREE" env. variable.
o The default bridge.config file is copied to the bglsys/bin directory
  from the bglsw/bglbridge directory
o For compatibility with the mpiexec standard, mpirun will now accept -np
  or -n, -cwd or -wdir, and instead of setting the MCS_SERVER_IP
  environment variable, the user can specify the service node with -host.
o Improved mpirun return code process.
  Added support in mpirun according to the following policy:

  + If any error in the job cycle (not in the BG/L job itself) occurs:
      o If the user has specifically asked to get the mpirun exit status
        (flag -nw), mpirun will exit with a nonzero error code.
      o If the user has not asked to get the mpirun exit status, mpirun
        will exit with 1.
      o In any other case, mpirun will exit with the BG/L job return code.

  + The exit status is any value from 0 through 255. This value, which is
    returned from MPIRUN on the front end node, reflects the composite
    exit status of the application as follows:
      o If all tasks terminate via exit(nn>=0) and nn is not equal to 1 and
        is <128 for all nodes, the exit status is the largest value of nn
        from any task in the parallel job (mod 256).
      o If any task terminates via exit(nn=1), the exit status will be 1
        and the job will terminate immediately.
      o If any task terminates via a signal (for example, a segmentation
        violation), the exit status is 128+signal and the entire job is
        immediately terminated.
      o If MPIRUN terminates before the start of the user's application,
        the exit status is 1.
      o If the user's application cannot be loaded or fails before the
        user's main() is called, the exit status is 255.
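The composite exit status rules above can be sketched as a small Python model. This is an illustration of the policy as written in these notes, not mpirun's actual implementation. The names TaskOutcome and composite_exit_status are hypothetical, the list of outcomes is assumed to be in termination order, and cases the notes do not cover (for example exit codes of 128 or more) raise an error rather than guess.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TaskOutcome:
    """How one task ended (hypothetical type for illustration)."""
    exit_code: Optional[int] = None  # value passed to exit(), if any
    signal: Optional[int] = None     # signal number, if killed by a signal


def composite_exit_status(outcomes: Sequence[TaskOutcome],
                          started: bool = True,
                          loaded: bool = True) -> int:
    """Model the composite exit status described in the release notes."""
    if not started:
        return 1    # MPIRUN terminated before the application started
    if not loaded:
        return 255  # application could not be loaded or failed before main()
    if not outcomes:
        raise ValueError("no task outcomes given")

    # Assumption: outcomes are in termination order, so the first signal
    # or exit(1) seen is the one that ends the job.
    for outcome in outcomes:
        if outcome.signal is not None:
            return 128 + outcome.signal
        if outcome.exit_code == 1:
            return 1

    codes = [o.exit_code for o in outcomes]
    if any(c is None or c < 0 or c >= 128 for c in codes):
        raise ValueError("case not covered by the release notes")
    return max(codes) % 256
```

For example, three tasks exiting with 0, 3 and 2 give a composite status of 3, while a task killed by signal 11 gives 139.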