NAME
libbandela.a, libbandelalight.a - libraries to collect Bandela performance data.

SYNOPSIS
On Altix there is no need to recompile, but the application must be relinked with the Bandela library placed before libmpi. (The LD_PRELOAD environment variable is supposed to be similar to the Irix _RLD64_LIST variable, but in practice the behavior of an application using this variable is unpredictable.)

DESCRIPTION
Bandela is a set of tools to model the MPI behavior of an application. Bandela records only the computation timings (outside MPI) and then "replays" these timings to predict the MPI communication times, taking an average BANdwidth and an average software LAtency as input.

libbandela.so is used by Bandela to capture execution data that is later used by the Bandela model "replaympi".

- libbandelalight.so can be used simply to time the MPI routines in a selected window of observation. It also gives insight into how much of the communication time can be attributed to the communication hardware. Below is an example (4 CPUs) of the kind of data libbandelalight prints at the end of stdout.

    >>>> Receive Matrix (Mb) <<<<
    Lines are receivers , columns are senders
    CPU        0        1        2        3
    0      0.000   57.243   25.693    0.517
    1     87.842    0.000    0.610   25.972
    2     61.848    0.495    0.000   37.043
    rank 0 Bytes buffered= 73.789 MBytes, internal send/recv requests=86091, total Barrier requests=2
    rank 1 Bytes buffered= 80.692 MBytes, internal send/recv requests=87735, total Barrier requests=2
    rank 2 Bytes buffered= 59.409 MBytes, internal send/recv requests=66148, total Barrier requests=2
    rank 3 Bytes buffered= 63.250 MBytes, internal send/recv requests=69943, total Barrier requests=2

    >>>> Request times in seconds <<<<
    Time transfering is computed based of 700.000000 MB/s Bandwidth.
    This time not only take into account the above recv but also includes time for buffering sends.
    Such buffering as well as local recv are computed using double Bandwidth
    Latency time is computed based of 2.000000 Micro second per internal send/recv request.
    Barrier Latency time is computed based of 10.000000 Micro second per MPI_Barrier request.
    CPU    Comput     Wait  Latency Transfer     send   ......  barrier     wait allreduc   reduce allgathe   gather
    0000 273.1785  65.9074   0.1722   0.1719   0.4147   ......   0.0048  20.2849  28.1839   0.9209   0.0007   1.4254
    0001 255.7327  83.3007   0.1755   0.2211   0.8133   ......   0.0049  21.7696  44.9627   0.0209   0.0006   0.0003
    0002 255.0448  84.0686   0.1323   0.1844   0.6910   ......   0.0044  22.7878  45.9924   0.6450   0.0007   0.0003
    0003 267.3547  71.7532   0.1399   0.1821   0.7509   ......   0.0046  21.2961  37.5431   0.0345   0.0002   0.0003

    >>>> Number of requests <<<<
    CPU    Comput     Wait  Latency Transfer     send   ......  barrier     wait allreduc   reduce allgathe   gather
    0000   ------     ----  ------- --------    23789   ......       10    10083     2033       88        4        3
    0001   ------     ----  ------- --------    39562   ......       10    10083     2033       88        4        3
    0002   ------     ----  ------- --------    25725   ......       10    10083     2033       88        4        3
    0003   ------     ----  ------- --------    35603   ......       10    10083     2033       88        4        3

Only MPI routines that were actually used are displayed.

The Receive Matrix array displays the data received. This includes what was received following an explicit MPI_IRECV or MPI_RECV call by the user, and it also includes the hidden calls issued by the MPI library for collective routines.

The following heuristic is used to compute Bytes buffered: the library assumes that all data is symmetric, so the buffering of sends depends only on the MPI_BUFFER_MAX value. Sends issued by some collective routines are considered to be sent single copy, whatever their size.

In the last two arrays, every column except Wait, Latency and Transfer is a strict measurement of the computations and communications within the observed window.

Transfer: the time during which the hardware was actually transferring data. It is computed from the information displayed in the first array, using the bandwidth defined by the BANDELA_BANDWIDTH variable.
Latency: computed from the number of internal send/recv and barrier requests and the values of the BANDELA_LATENCY and BANDELA_BARRIER_LATENCY variables.

Wait: the sum of the measured communication times minus the Transfer and Latency above. It approximates the time during which the application can do nothing but wait.

- Because the real capture of the computational timings may generate a huge amount of data, it is important to select a window of observation. libbandelalight.so can also be used to select such a window.

ENVIRONMENT VARIABLES
These are common to both libraries unless otherwise specified.

BANDELA_WATCH_ROUT: Specifies the MPI routine Bandela must watch. The value can be MPIBCAST | ALLREDUC | BARRIERS (default), for mpi_bcast, mpi_allreduce and mpi_barrier respectively. If it is set, the two following environment variables must also be set.

BANDELA_BARRIER_START: Integer value specifying the number of times the BANDELA_WATCH_ROUT routine must be called before data capture starts. (No default.)

BANDELA_BARRIER_END: Integer value specifying the number of times the BANDELA_WATCH_ROUT routine must be called before data capture stops. (No default.)

BANDELA_COMM_T_W: Integer value specifying which communicator (set 1 for MPI_COMM_WORLD) is taken into account for the above counters. If it is not set, every call is taken into account.

BANDELA_SHOW_W: Integer value. Every BANDELA_SHOW_W calls to the watched routine, a message is printed on stdout. By default nothing is printed. The purpose of this message is to let one select a window based on what the application itself prints on stdout. Example of such a message:

    Rank 0 2000 calls to mpi_allreduce with comm= 1

BANDELA_HEART_BEAT: (libbandelalight only) Integer value. Every BANDELA_HEART_BEAT calls to the watched routine, the elapsed computation time and the elapsed communication time are written for each rank to a fort.(mpi_rank + 177) file.
The utility format_trace, located in the same directory as the libbandela libraries, formats such a file. For example, format_trace 178 generates the following output:

    call  Computation        communication
    1000  207.1109768000027  52.29531520000086
    1500  207.0920743998927  54.68045520002019
    2000  207.1742191999331  51.77246160000016
    2500  207.1877143999957  51.36017680002058
    3000  207.2967560000152  51.97978720000788
    3500  207.3676232000030  51.44490240000459
    4000  207.4347855999872  51.15051600004426

This makes it possible to check whether the computation/communication ratio changes over time.

BANDELA_PARTIAL_EXPERIMENT: YES | NO. By default the instrumentation starts when mpi_init is called and stops with the call to mpi_finalize. Setting this variable to YES tells the Bandela library not to start capturing data when mpi_init is called, but when the Bandela_start_() routine is called. The Bandela_end_() routine must be called to stop the capture. If Bandela_end_() is not called, capturing stops at mpi_finalize. The two routines Bandela_start_() and Bandela_end_() are fully collective. Note that the mechanism enabled by BANDELA_PARTIAL_EXPERIMENT is incompatible with the window mechanism described above.

BANDELA_F_BASE: Integer value. By default, libbandela.so, and libbandelalight.so with BANDELA_HEART_BEAT, create fort.(mpi_rank + 177) files. This variable changes the base number "177".

BANDELA_FILE_INCREMENT: If merely set, this variable tells Bandela to buffer the traces in memory and to write the fort.xx files only at the end of the program. If it is not set, the fort.xx files are written directly. If it is set to an integer value, that value is interpreted as the size in Mbytes of the memory buffer used to keep the traces. The same value is used as the increment when the buffer overflows (default is 10 Mbytes). Most of the time the I/O is buffered by the system, so using this variable is not necessary.
When many CPUs are used (more than 128), too much I/O activity may cause trouble. This variable should be tried in such a case.

BANDELA_SYS_TIME: YES | NO. By default Bandela considers only the user CPU time. If set to YES, this variable instructs Bandela to take the system CPU time into account as well.

BANDELA_JUST_ELAPSE: YES | NO. A high number of calls to the timing routines that bracket the MPI functions may generate a substantial amount of system time. Setting this variable instructs Bandela to read only the elapsed time, which does not generate system time. (Default is YES.)

BANDELA_APPS_SIGNAL: String used to end the application. If set, Bandela passes this string to system() to end the application when BANDELA_BARRIER_END is reached, instead of using the Bandela mechanism. This allows the application to print its own termination report or do whatever else it needs. For example, Pam-crash checks for a file named signal at each cycle in order to interact with the user. Setting BANDELA_APPS_SIGNAL='echo QUIT>signal' leads Bandela to call system("echo QUIT>signal") when BANDELA_BARRIER_END is reached.

BANDELA_COST_EVAL: YES | NO. If NO, Bandelalight only measures the communication: neither the transfer matrix nor the "Wait Latency Transfer" columns are produced. Default is YES.

BANDELA_BANDWIDTH: The bandwidth, in MB/s, with which every transfer is evaluated. Default is 700 MB/s. This is a good average value for the Altix, but the value may have to be lowered when there are many collisions in the network. In the example above, however, even with a very pessimistic bandwidth (say 300 MB/s) it is clear that the communication hardware is of little importance. Such a case would work well on a cluster.

BANDELA_LATENCY: The latency in microseconds. Default is 2 microseconds.

BANDELA_BARRIER_LATENCY: The latency to apply for a barrier. Default is 10 microseconds.

BANDELA_DELAY_AT_INIT: All ranks sleep for this number of seconds just after calling MPI_Init. This gives time to attach some of the processes to a debugger.
BANDELA_PRINT_SIZ_IN_K: (Bandelalight only) Print sizes in Kbytes instead of the default Mb.

BANDELA_DO_NOT_FLUSH_PARALLEL: (Bandela only) If BANDELA_FILE_INCREMENT is set, the traces are maintained in memory and written to files at the end. By default the files are written in parallel. This may upset an NFS file system. In that case, set this variable and the trace files will be written one by one.

BANDELA_FAST_BARRIER_RECORD: (Bandela-Intel only) With the Intel MPI library or the MPICH2 library, barriers are handled by internal calls to send/recv/wait functions. To model a system with a fast barrier mechanism, such as the one (fetch-op) used by MPT, this variable must be used in conjunction with the INTRA_HOST_BARRIER_LATENCY input keyword.

BANDELA_RECORD_COMM_TIME: (libbandela only) If set, in addition to the bandela.xx trace, a corresponding set of bl_comm.xx files is created containing a measurement of the time taken by the internal transfers.

BANDELA_CALL_GUEST_ROUTINE: If set, Bandela calls the user routine bandela_guest_(void) at MPI_Init time. This routine must be inside a library called libbandelaguest.so. The path to this library must come before the paths to the other Bandela libraries.
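
EXAMPLES
The Transfer and Latency columns of the libbandelalight report in DESCRIPTION can be checked by hand. The Python sketch below is not part of Bandela; the function and variable names are hypothetical. It assumes that "double Bandwidth" means twice BANDELA_BANDWIDTH, that the diagonal of the Receive Matrix is the local recv, and that the barrier latency is applied to the "total Barrier requests" count. Under these assumptions it gives the same Transfer and Latency values as the sample report for ranks 0 to 2 (the sample shows no Receive Matrix row for rank 3).

```python
# Sketch only: recompute Transfer and Latency from the sample report.
# Assumption: buffered sends and local recv use 2 x the bandwidth.

BANDWIDTH_MB_S = 700.0       # BANDELA_BANDWIDTH default
LATENCY_US = 2.0             # BANDELA_LATENCY default
BARRIER_LATENCY_US = 10.0    # BANDELA_BARRIER_LATENCY default


def transfer_seconds(rank, recv_row_mb, buffered_mb, bandwidth=BANDWIDTH_MB_S):
    """Time spent moving data for one rank, in seconds."""
    local_mb = recv_row_mb[rank]              # diagonal entry: local recv
    remote_mb = sum(recv_row_mb) - local_mb   # data received from other ranks
    return remote_mb / bandwidth + (buffered_mb + local_mb) / (2.0 * bandwidth)


def latency_seconds(sendrecv_requests, barrier_requests,
                    latency_us=LATENCY_US, barrier_latency_us=BARRIER_LATENCY_US):
    """Software latency charged to one rank, in seconds."""
    total_us = sendrecv_requests * latency_us + barrier_requests * barrier_latency_us
    return total_us * 1e-6


def wait_seconds(measured_comm, transfer, latency):
    """Wait = measured communication time minus Transfer and Latency."""
    return measured_comm - transfer - latency


# Values copied from the sample report:
# rank -> (Receive Matrix row, Bytes buffered, send/recv requests, Barrier requests)
SAMPLE = {
    0: ([0.000, 57.243, 25.693, 0.517], 73.789, 86091, 2),
    1: ([87.842, 0.000, 0.610, 25.972], 80.692, 87735, 2),
    2: ([61.848, 0.495, 0.000, 37.043], 59.409, 66148, 2),
}

for rank, (row, buffered, requests, barriers) in SAMPLE.items():
    t = transfer_seconds(rank, row, buffered)
    lat = latency_seconds(requests, barriers)
    print(f"{rank:04d} Latency={lat:.4f} Transfer={t:.4f}")
```

Rerunning the same computation with a lower bandwidth (for example bandwidth=300.0) shows how little the Transfer column grows compared with the Wait column in this example.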