NAME
	libbandela.a, libbandelalight.a - libraries to collect Bandela performance data.

SYNOPSIS
	On Altix there is no need to recompile, but the application must be relinked with the Bandela library placed before libmpi. (The LD_PRELOAD environment variable is supposed to be similar to the Irix _RLD64_LIST variable, but in practice the behavior of an application using this variable is unpredictable.)

DESCRIPTION
        Bandela is a set of tools to model the MPI behavior of an application. Bandela records only the computation timings (outside MPI) and then "replays" these timings to predict the MPI communication times, using an average BANdwidth and an average software LAtency as input.
	libbandela.so is used by Bandela to capture execution data which will later be used by the Bandela model "replaympi".
	libbandelalight.so can be used to simply time the MPI routines in a selected window of observation. It also gives an insight into how much time can be attributed to the communication hardware. Below is an example (4 CPUs) of the kind of data libbandelalight produces at the end of stdout.

>>>> Receive Matrix (Mb) <<<<

Lines are receivers , columns are senders

CPU             0          1          2          3
     0      0.000     57.243     25.693      0.517
     1     87.842      0.000      0.610     25.972
     2     61.848      0.495      0.000     37.043
 rank 0 Bytes buffered=    73.789 MBytes, internal send/recv requests=86091, total Barrier requests=2
 rank 1 Bytes buffered=    80.692 MBytes, internal send/recv requests=87735, total Barrier requests=2
 rank 2 Bytes buffered=    59.409 MBytes, internal send/recv requests=66148, total Barrier requests=2
 rank 3 Bytes buffered=    63.250 MBytes, internal send/recv requests=69943, total Barrier requests=2

 >>>> Request times in seconds <<<<
    
  Time transfering is computed based of 700.000000 MB/s Bandwidth. This time not only
  take into account the above recv but also includes time for buffering sends.
  Such buffering as well as local recv are computed using double Bandwidth
  Latency time is computed based of 2.000000 Micro second per internal send/recv request.
  Barrier Latency time is  computed based of 10.000000 Micro second per MPI_Barrier request.
    
CPU   Comput   Wait   Latency Transfer  send     ......  barrier  wait     allreduc reduce   allgathe gather  
0000 273.1785  65.9074  0.1722  0.1719   0.4147  ......  0.0048  20.2849  28.1839   0.9209   0.0007   1.4254
0001 255.7327  83.3007  0.1755  0.2211   0.8133  ......  0.0049  21.7696  44.9627   0.0209   0.0006   0.0003
0002 255.0448  84.0686  0.1323  0.1844   0.6910  ......  0.0044  22.7878  45.9924   0.6450   0.0007   0.0003
0003 267.3547  71.7532  0.1399  0.1821   0.7509  ......  0.0046  21.2961  37.5431   0.0345   0.0002   0.0003
    
 >>>> Number of requests <<<<
CPU   Comput   Wait   Latency Transfer  send     ...... barrier  wait     allreduc reduce   allgathe gather  
0000  ------ ----   ------- --------     23789   ......    10    10083     2033       88        4        3 
0001  ------ ----   ------- --------     39562   ......    10    10083     2033       88        4        3 
0002  ------ ----   ------- --------     25725   ......    10    10083     2033       88        4        3 
0003  ------ ----   ------- --------     35603   ......    10    10083     2033       88        4        3 


Only activated MPI routines are displayed.


The Receive Matrix array displays the data received. This includes what was received following an explicit MPI_IRECV or MPI_RECV call by the user, but it also includes the hidden calls issued by the MPI library for collective routines.

The following heuristic is used to compute Bytes buffered: the library assumes that data exchange is symmetric, so the buffering of sends depends only on the MPI_BUFFER_MAX value. Sends issued by some collective routines are considered single-copy sends whatever their size.

In the last two arrays, except for Wait, Latency, and Transfer, the numbers displayed are strictly measurements of the computations and communications for the observed window.

Transfer is the time the hardware was actually transferring data. It is computed from the information displayed in the first array, using the bandwidth defined by the BANDELA_BANDWIDTH variable.

Latency is computed from the number of internal send/recv/barrier requests and the values of the BANDELA_LATENCY and BANDELA_BARRIER_LATENCY variables.

Wait is the sum of the measured communication times minus the above Transfer and Latency. It approximates the time during which the application can do nothing but wait.
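
The Transfer and Latency figures for CPU 0000 in the example above can be reproduced with a short calculation. The sketch below is not Bandela code. It assumes the formulas implied by the report text (remote receives at the stated bandwidth, buffered sends and local receives at double bandwidth, one latency per internal request, one barrier latency per barrier). The function and parameter names are illustrative only.

```python
# Illustrative sketch of the cost model described by the report text.
# Names are hypothetical; only the input numbers come from the example above.

def transfer_seconds(received_mb, local_mb, buffered_mb, bandwidth_mb_s=700.0):
    """Remote receives are charged at the given bandwidth; buffered sends
    and local receives are charged at double that bandwidth."""
    remote = received_mb / bandwidth_mb_s
    doubled = (buffered_mb + local_mb) / (2.0 * bandwidth_mb_s)
    return remote + doubled


def latency_seconds(requests, barriers, latency_us=2.0, barrier_latency_us=10.0):
    """One latency per internal send/recv request, one per barrier request."""
    return (requests * latency_us + barriers * barrier_latency_us) * 1e-6


# Rank 0: receives 57.243 + 25.693 + 0.517 MB from the other ranks,
# 0.000 MB locally, buffers 73.789 MB, issues 86091 requests and 2 barriers.
rank0_transfer = transfer_seconds(57.243 + 25.693 + 0.517, 0.0, 73.789)
rank0_latency = latency_seconds(86091, 2)

# Prints "0.1719 0.1722", the Transfer and Latency columns of CPU 0000.
print(f"{rank0_transfer:.4f} {rank0_latency:.4f}")
```

The same calculation applied to ranks 1 and 2 also matches the report (0.2211 and 0.1844 seconds of Transfer), which suggests the assumed formulas are close to what the library does.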

        As the real capture of the computational timings may generate a huge amount of data, it is important to select a window of observation. libbandelalight.so can also be used to select such a window.


ENVIRONMENT VARIABLES
        These are common to both libraries unless otherwise specified.

	BANDELA_WATCH_ROUT: Specifies the MPI routine Bandela must watch. The value can be MPIBCAST | ALLREDUC | BARRIERS (default), for mpi_bcast, mpi_allreduce, and mpi_barrier respectively. If set, the two following environment variables must also be set.

	BANDELA_BARRIER_START: Integer value which specifies the number of times the BANDELA_WATCH_ROUT routine must be called before data capture starts. (no default)
	
	BANDELA_BARRIER_END: Integer value which specifies the number of times the BANDELA_WATCH_ROUT routine must be called before data capture stops. (no default)

	BANDELA_COMM_T_W: Integer value which specifies the communicator (set 1 for MPI_COMM_WORLD) that must be taken into account for the above counter. If not set, every call is taken into account.
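
As an illustration of how these four variables fit together, the sketch below builds the environment for a windowed run. It is a hypothetical helper, not part of Bandela. The function name, the window bounds 1000 and 4000, and the choice of ALLREDUC are assumptions made for the example.

```python
import os

# Values accepted by BANDELA_WATCH_ROUT according to the text above.
WATCHED_ROUTINES = {"MPIBCAST", "ALLREDUC", "BARRIERS"}


def window_env(routine, start, end, comm=None, base_env=None):
    """Return a copy of the environment with the window variables set.

    BANDELA_BARRIER_START and BANDELA_BARRIER_END are always written,
    because they have no default and are required once
    BANDELA_WATCH_ROUT is set.
    """
    if routine not in WATCHED_ROUTINES:
        raise ValueError(f"unsupported BANDELA_WATCH_ROUT value: {routine}")
    env = dict(os.environ if base_env is None else base_env)
    env["BANDELA_WATCH_ROUT"] = routine
    env["BANDELA_BARRIER_START"] = str(int(start))
    env["BANDELA_BARRIER_END"] = str(int(end))
    if comm is not None:
        # 1 selects MPI_COMM_WORLD; leave unset to count every call.
        env["BANDELA_COMM_T_W"] = str(int(comm))
    return env


# Illustrative window: capture between the 1000th and 4000th
# mpi_allreduce calls issued on MPI_COMM_WORLD.
env = window_env("ALLREDUC", 1000, 4000, comm=1, base_env={})
for name in sorted(env):
    print(f"{name}={env[name]}")
```

The resulting dictionary can be passed as the env argument of subprocess.run together with whatever launch command the site uses for the relinked application.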

	BANDELA_SHOW_W: Integer value. Every BANDELA_SHOW_W calls to the watched routine, a message is printed on stdout. By default nothing is printed. The purpose of this message is to let one select a window based on what the application prints on stdout. Example of such a message:

         Rank 0         2000 calls to mpi_allreduce with comm=           1

	BANDELA_HEART_BEAT: (libbandelalight only). Integer value. Every BANDELA_HEART_BEAT calls to the watched routine, the computation elapsed time and the communication elapsed time are written for each rank to a fort.(mpi_rank + 177) file. The utility format_trace, located in the same directory as the libbandela libraries, formats these files. For example, format_trace 178 generates the following output:

   call         Computation             communication
         1000   207.1109768000027       52.29531520000086
         1500   207.0920743998927       54.68045520002019
         2000   207.1742191999331       51.77246160000016
         2500   207.1877143999957       51.36017680002058
         3000   207.2967560000152       51.97978720000788
         3500   207.3676232000030       51.44490240000459
         4000   207.4347855999872       51.15051600004426

         This makes it possible to check whether the computation/communication ratio changes over time.
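
The ratio check can be done by hand or with a few lines of script. The sketch below parses the format_trace output shown above and prints the computation/communication ratio per heart beat. It assumes the three-column layout of that sample. The function name is illustrative.

```python
# Sample copied from the format_trace output shown above.
SAMPLE = """\
   call         Computation             communication
         1000   207.1109768000027       52.29531520000086
         1500   207.0920743998927       54.68045520002019
         2000   207.1742191999331       51.77246160000016
         2500   207.1877143999957       51.36017680002058
         3000   207.2967560000152       51.97978720000788
         3500   207.3676232000030       51.44490240000459
         4000   207.4347855999872       51.15051600004426
"""


def heartbeat_ratios(text):
    """Return (call count, computation/communication ratio) pairs."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a 3-column data row.
        if len(fields) != 3 or not fields[0].isdigit():
            continue
        calls = int(fields[0])
        computation = float(fields[1])
        communication = float(fields[2])
        rows.append((calls, computation / communication))
    return rows


ratios = heartbeat_ratios(SAMPLE)
for calls, ratio in ratios:
    print(f"{calls:6d}  {ratio:.3f}")
```

For this sample the ratio stays close to 4 throughout the run, so any window inside it should be representative.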

	BANDELA_PARTIAL_EXPERIMENT: YES | NO. By default the instrumentation starts when mpi_init is called and stops with a call to mpi_finalize. Setting this variable to YES tells the Bandela library not to start capturing data when mpi_init is called, but when the Bandela_start_() routine is called. The Bandela_end_() routine must be called to stop the capture. If Bandela_end_() is not called, capturing stops at mpi_finalize time. The two routines Bandela_start_() and Bandela_end_() are fully collective routines. Note that the mechanism enabled by BANDELA_PARTIAL_EXPERIMENT is incompatible with the one described above.

	BANDELA_F_BASE: Integer value. By default, libbandela.so and libbandelalight.so (with BANDELA_HEART_BEAT) create fort.(mpi_rank + 177) files. This environment variable changes the base number "177".
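
The file naming rule is simple enough to state as code. The sketch below is only an illustration of the rule described above. The function name is hypothetical, and the environment is passed in as a plain dictionary so the example does not depend on the caller's shell.

```python
def trace_file_name(mpi_rank, env=None):
    """Name of the trace file for a rank: fort.(mpi_rank + base).

    The base is 177 unless BANDELA_F_BASE is present in env.
    """
    env = {} if env is None else env
    base = int(env.get("BANDELA_F_BASE", "177"))
    return f"fort.{mpi_rank + base}"


# Default base: rank 1 writes fort.178, the file read by
# "format_trace 178" in the example above.
print(trace_file_name(1))
# With an illustrative base of 300, rank 1 writes fort.301.
print(trace_file_name(1, {"BANDELA_F_BASE": "300"}))
```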

	BANDELA_FILE_INCREMENT: If this variable is set without a value, Bandela buffers the traces in memory and writes the fort.xx files only at the end of the program. If it is not set, the fort.xx files are written directly. If it is set to an integer value, this value is interpreted in Mbytes and used to size the memory buffer that holds the traces. The same value is also used to grow the buffer in case of overflow (default is 10 Mbytes). Most of the time I/O is buffered by the system, so using this variable is not necessary. When many CPUs are used (more than 128), too much I/O activity may cause trouble. This variable should be tried in such a case.
	
	BANDELA_SYS_TIME: YES | NO. By default Bandela considers only the user CPU time. If set to YES, this variable instructs Bandela to take the system CPU time into account.

	BANDELA_JUST_ELAPSE: YES | NO. A high number of calls to the timing routines that bracket the MPI functions may generate a substantial amount of system time. Setting this variable instructs Bandela to measure only the elapsed time, which does not generate system time. (YES is the default.)

	BANDELA_APPS_SIGNAL: String used to end the application. If set, Bandela will "system" this string to end the application when BANDELA_BARRIER_END is reached, instead of using the Bandela mechanism. This allows the application to print its own termination report or whatever else it needs. For example, Pam-crash looks for a file named signal each cycle to interact with the user. Setting BANDELA_APPS_SIGNAL='echo QUIT>signal' leads Bandela to call system("echo QUIT>signal") when BANDELA_BARRIER_END is reached.


	BANDELA_COST_EVAL: YES | NO. If NO, Bandelalight only measures the communication; neither the transfer matrix nor the "Wait Latency Transfer" columns are produced. Default is YES.

	BANDELA_BANDWIDTH: The bandwidth, in MB/s, at which any transfer will be evaluated. Default is 700 MB/s. This is a good average value for the Altix, but if there are many collisions in the network this value may have to be lowered. In the example above, however, even with a very pessimistic bandwidth (say 300 MB/s), it is clear that the communication hardware is of little importance. Such a case would work well on a cluster.
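
The claim about the pessimistic bandwidth can be checked on rank 1 of the example, which has the largest Transfer time. The sketch below is an illustration, not Bandela code. It assumes the same model as the report text (receives at the given bandwidth, buffered sends at double bandwidth), and the function name is hypothetical.

```python
def rank_transfer_seconds(received_mb, buffered_mb, bandwidth_mb_s):
    """Receives at the given bandwidth, buffered sends at twice that."""
    return received_mb / bandwidth_mb_s + buffered_mb / (2.0 * bandwidth_mb_s)


# Rank 1 in the example: row of the Receive Matrix and Bytes buffered.
RECEIVED_MB = 87.842 + 0.610 + 25.972
BUFFERED_MB = 80.692
COMPUT_SECONDS = 255.7327  # Comput column for CPU 0001

results = {}
for bandwidth in (700.0, 300.0):
    seconds = rank_transfer_seconds(RECEIVED_MB, BUFFERED_MB, bandwidth)
    results[bandwidth] = seconds
    share = 100.0 * seconds / COMPUT_SECONDS
    print(f"{bandwidth:5.0f} MB/s: {seconds:.4f} s ({share:.2f}% of Comput)")
```

At 300 MB/s the estimated transfer time is still about half a second against more than 250 seconds of computation, which is consistent with the statement above.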

        BANDELA_LATENCY: The latency in microseconds. Default is 2 microseconds.


	BANDELA_BARRIER_LATENCY: The latency to apply for a barrier, in microseconds. Default is 10 microseconds.

	BANDELA_DELAY_AT_INIT <sleep value in seconds>: All ranks will sleep for this number of seconds just after calling MPI_Init. This gives time to attach some processes to a debugger.

	BANDELA_PRINT_SIZ_IN_K: (Bandelalight only) Print sizes in Kbytes instead of the default Mbytes.


	BANDELA_DO_NOT_FLUSH_PARALLEL: (Bandela only) If BANDELA_FILE_INCREMENT is set, the traces are maintained in memory and written to files at the end. By default the files are written in parallel. This may upset an NFS file system. In that case, set this variable and the trace files will be written one by one.


	BANDELA_FAST_BARRIER_RECORD: (Bandela-Intel only). With the Intel MPI library or the MPICH2 library, barriers are handled by internal calls to the send/recv/wait functions. If the user would like to model a system with a fast barrier mechanism, such as the fetch-op mechanism used by MPT, this variable must be used in conjunction with the INTRA_HOST_BARRIER_LATENCY input keyword.


	BANDELA_RECORD_COMM_TIME: (libbandela only). If set, in addition to the bandela.xx trace, a set of corresponding bl_comm.xx files will be created containing a measurement of the time taken by the internal transfers.


	BANDELA_CALL_GUEST_ROUTINE: If set, Bandela will call the user routine bandela_guest_(void) at MPI_Init time. This routine must be inside a library called libbandelaguest.so. The path to this library must appear before the paths to the other Bandela libraries.