Debugging with Totalview
Objective
Background of this item: when the model does not run as expected, the calculations need to be checked step by step. Debuggers such as IDB help to identify the values taken by variables at a given point and to check that a subroutine of interest is actually called during the execution of the code (and how many times). This is very useful for quickly identifying the first variable to take aberrant values (particularly after a significant modification of the code) and the piece of code that causes the trouble. A basic alternative to a debugger is to add WRITE statements to the code. Totalview is a GUI debugger that works very well with MPI/OpenMP binaries. It is installed on large HPC systems such as Curie or ADA. (Note: Curie is no longer in service; it has been replaced by Irene at TGCC.)
Boundary conditions for using a debugger
Of course, getting a functional version of the code after modifying one of the routines of ORCHIDEE still requires a few steps, and the debugger presented here only helps to speed up the second one:
- Getting a version of the code that compiles. The first errors reported by the compiler should help to solve that issue.
- Once the modified code compiles properly, some variables may still take aberrant values, even for offline runs on a single point. This tutorial is likely to interest you if you are used to going through tedious cycles of:
  - adding lines such as PRINT *, 'MY_VAR=', my_var to the subroutine of interest
  - compiling the code
  - screening the standard output of the executed code
- Checking that the new feature does not lead to weird behaviour for runs at the global scale and/or coupled with the GCM.
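For comparison with the debugger workflow, the log-screening step of the PRINT cycle can be sketched in shell. This is only an illustration: the helper name first_bad_value, the sample log and the exact layout of the printed values are assumptions, not part of ORCHIDEE.

```shell
# Hypothetical helper: print the first log line (with its line number)
# where the tagged variable shows up as NaN or Infinity.
# Usage: first_bad_value TAG LOGFILE
first_bad_value() {
    grep -n -i -E "$1=.*(nan|inf)" "$2" | head -n 1
}

# Hand-written sample standing in for the model's standard output.
sample_log=$(mktemp)
cat > "$sample_log" << 'EOF'
 MY_VAR=   1.250000
 MY_VAR=   3.700000
 MY_VAR=             NaN
 MY_VAR=             NaN
EOF

first_bad_value MY_VAR "$sample_log"
```

Every new variable to inspect still means editing the source and recompiling, which is the cost the debugger removes.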
Totalview on Curie/IRENE
Authors: A. Jornet
Last revision: A. Jornet (2019/05/02)
SPMD
To use Totalview, first make it available by loading its module:
module load totalview
Then run your simulation in interactive mode so that you can use the debugger interface:
ccc_mprun -n 16 -p standard -A <your project id> -d tv ./orchidee_ol
This call requests 16 processes in the standard queue; -d tv selects Totalview as the debugger.
When the startup window shows up, select "Enable memory debugging".
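If you launch the debugger often with different process counts, the interactive command from this section can be wrapped in a small dry-run helper. The sketch below only prints the command line; the function name tv_launch_cmd and the project id myproject are placeholders, and only the options already shown in this section are used.

```shell
# Hypothetical helper: print (but do not execute) the interactive launch line.
# Usage: tv_launch_cmd NPROCS PROJECT_ID EXECUTABLE
tv_launch_cmd() {
    printf 'ccc_mprun -n %s -p standard -A %s -d tv %s\n' "$1" "$2" "$3"
}

# "myproject" is a placeholder for your own project id.
tv_launch_cmd 16 myproject ./orchidee_ol
```

Once the printed line looks right, copy it into the terminal where the totalview module is loaded.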
MPMD
The script below shows how to run an MPI-OpenMP program (e.g. coupled LMDZ + ORCHIDEE) with Totalview:
#!/bin/bash
#MSUB -r mictgrm_test     # Request name
#MSUB -n 64               # Number of tasks to use
#MSUB -c 2                # Number of cores (OpenMP threads) per task
#MSUB -T 4000             # Elapsed time limit in seconds
#MSUB -o orchid_%I.o      # Standard output. %I is the job id
#MSUB -e orchid_%I.e      # Error output. %I is the job id
#MSUB -Q normal
#MSUB -D
#MSUB -X
#MSUB -A gen6328
#MSUB -q standard
set -x
module unload netcdf hdf5
module load netcdf/4.3.3.1_hdf5_parallel
module load hdf5/1.8.9_parallel
module load totalview
#module load ddt
#unset SLURM_SPANK_AUKS
# enable core dump file in case of error
ulimit -c unlimited
export KMP_STACKSIZE=3g
export KMP_LIBRARY=turnaround
export MKL_SERIAL=YES
export OMP_NUM_THREADS=2
cat << END > pp.conf
4 ./xios.x
60 totalview ./lmdz.x
END
mpirun -tv -n 4 ./xios.x : -n 60 ./lmdz.x
In this specific case, the coupled LMDZ + ORCHIDEE model runs on 60 MPI processes alongside 4 XIOS processes, with 2 OpenMP threads per process. In total, (60 + 4) × 2 = 128 cores are required.
When the startup window shows up, select "Enable memory debugging".
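When adapting the MPMD layout to another configuration, the task and core counts to request can be recomputed with shell arithmetic. A minimal sketch, using the numbers of this example (the variable names are illustrative only):

```shell
# Layout of the example above.
xios_procs=4      # XIOS server processes
lmdz_procs=60     # LMDZ + ORCHIDEE MPI processes
omp_threads=2     # OpenMP threads per MPI process

# Value for "#MSUB -n": total number of MPI tasks.
mpi_tasks=$((xios_procs + lmdz_procs))
# Total cores: tasks times cores per task ("#MSUB -c").
total_cores=$((mpi_tasks * omp_threads))

echo "MPI tasks: $mpi_tasks, total cores: $total_cores"
```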
Notes
Totalview cannot debug a binary compiled with the -p flag (which is only needed for profiling), so this flag must be removed from the compilation options.
If you compile ORCHIDEE with the makeorchidee_fcm tool, make sure to remove -p from the arch.fcm file:
arch.fcm:
%DEBUG_FFLAGS -fpe0 -p -O0 -g -traceback -fp-stack-check -ftrapuv -check bounds -check all
to
%DEBUG_FFLAGS -fpe0 -O0 -g -traceback -fp-stack-check -ftrapuv -check bounds -check all
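The same change can be made without opening an editor. The sketch below is one possible way to do it with sed, shown on a temporary copy of the line; the helper name strip_profiling_flag is hypothetical, and the pattern only removes a standalone -p so that -fpe0 and -fp-stack-check are left untouched.

```shell
# Hypothetical helper: print FILE with a standalone "-p" removed
# from the %DEBUG_FFLAGS line. The file itself is not modified.
strip_profiling_flag() {
    sed -E '/^%DEBUG_FFLAGS/ s/[[:space:]]+-p([[:space:]]|$)/\1/' "$1"
}

# Temporary file standing in for arch.fcm.
sample_arch=$(mktemp)
printf '%s\n' '%DEBUG_FFLAGS -fpe0 -p -O0 -g -traceback -fp-stack-check -ftrapuv -check bounds -check all' > "$sample_arch"

strip_profiling_flag "$sample_arch"
```

Review the printed result before redirecting it over your real arch.fcm, and keep a backup of the original file.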
Make sure the gnu/4.8.1 module is not loaded, either when compiling the source code or when running the executable. Unload it with:
module unload gnu/4.8.1