wiki:Documentation/UserGuide/restartability

Version 1 (modified by luyssaert, 5 years ago)

--

One plus One or the challenge of restartability

In some rare cases, bugfixes or the implementation of new code may unintentionally introduce problems with reproducibility or 1+1=2. Often these are related to incorrect variable dimensions in different sub-routines, memory issues, or missing variables in the restart files. Such issues are easier to catch sooner rather than later. Thus, to minimize the time spent on debugging reproducibility and 1+1=2 issues, the following simple tests are suggested/required before each commit of substantial code changes*:

1+1=2

If you do not run these tests globally, make sure to use impose_veg=y. The standard F2 run.def settings have been tested and satisfy 1+1=2 from revision r6272 onwards. Thus, please always run the tests with the standard settings. If your developments use other run.def settings, run the same tests with those settings as well. More recent tests have shown that 1+1=2 holds for LCC with r6279 at the global scale.

The standard test

1) 1Y vs. 12*1M, i.e. do two simulations for a full year: one with a period length of 1 year, the other with a period length of 1 month. Afterwards, compare their final restart files from both stomate and sechiba.

Most issues should be caught with (1). In case of problems, debugging will be easier if you can track down the onset of the difference between the restart files (i.e. start of year, onset of growing season, end of year, etc.). Thus, continue with tests like:

2) 1D+1D=2D (compare the final restart files)

3) 1M+1M=2M (compare the final restart files)

How to compare netcdf files

The comparison is easiest if the two netcdf files contain the same variables in the same order. The differ100.sh script by Josefine Ghattas nicely takes care of this. Moreover, it uses cdo diffv to compare the files. However, 5-dim variables are ignored by the cdo diffv command, so not all variables in the restart files can be compared by differ100.sh.
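As a minimal sketch of what such a full comparison does (this is not the actual differ100.sh or cdo logic), the following hypothetical Python function compares every variable in two restart files regardless of dimensionality, including 5-dim variables. It assumes the variables have already been read into a dict of numpy arrays (e.g. with the netCDF4 library, not shown here); the function name and return format are illustrative only:

```python
import numpy as np

def compare_restarts(vars_a, vars_b):
    """Compare two dicts mapping variable name -> numpy array.

    Returns a list of (name, reason) tuples for every variable that
    differs; an empty list means the restarts are identical.
    Unlike cdo diffv, this works for any number of dimensions.
    """
    diffs = []
    for name in sorted(set(vars_a) | set(vars_b)):
        if name not in vars_a or name not in vars_b:
            diffs.append((name, "missing in one file"))
        elif vars_a[name].shape != vars_b[name].shape:
            diffs.append((name, "shape mismatch"))
        elif not np.array_equal(vars_a[name], vars_b[name]):
            diffs.append((name, "values differ"))
    return diffs
```

For 1+1=2 we want bitwise-identical restarts, hence the exact np.array_equal test with no tolerance.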

How to check for differences between two netcdf files that have variables with dimensions higher than 4

The matlab function nccmp is able to compare all variables contained within two netcdf files. The original version can be found here: https://fr.mathworks.com/matlabcentral/fileexchange/47857-comparing-two-netcdf-files. I have made some small modifications such that the information produced by the script is written to a file instead of being printed to the screen. The updated version can be found here on IRENE: /ccc/work/cont003/dofoco/dofoco/SCRIPTS/debug/nccmp.m or here on obelix: /home/data03/dofoco/SCRIPTS_obelix/debug/nccmp.m.

Sadly, matlab is not available on obelix, only on IRENE. To open matlab on IRENE, type matlab or, if you wish to run it from the terminal, type matlab -nodesktop.

Next run the function by typing:

NCCMP(ncfile1,ncfile2,tolerance,forceCompare)

tolerance specifies how much variation you allow in the variables between the two files. We want identical files, so put [] here.

forceCompare can be set to true or false.

  • True - writes all occurrences of differences in a variable (specifically, all the indices) to the file all_diff.txt.
  • False - writes only the first occurrence of a difference in each variable to the file first_diff.txt.

For global simulations the True option can produce a large file, and the information might be hard to process if there are many differences between the compared restart files. In addition, the True option makes the script much slower. However, for small simulations the True option is very useful.
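The two reporting modes can be sketched in Python (a hypothetical stand-in for the matlab function, assuming a single variable already loaded as a numpy array; the function name is illustrative only):

```python
import numpy as np

def report_differences(name, a, b, force_compare):
    """Sketch of nccmp's forceCompare switch for one variable.

    force_compare=True  -> return the indices of *all* differing
                           elements (can be huge for global runs).
    force_compare=False -> return only the first differing index.
    """
    mismatches = np.argwhere(a != b)  # one row of indices per differing element
    if mismatches.size == 0:
        return []
    if force_compare:
        return [(name, tuple(idx)) for idx in mismatches]
    return [(name, tuple(mismatches[0]))]
```

This also illustrates why False is the safer default for global runs: the output stays bounded at one entry per variable, whatever the number of differing grid cells.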

I recommend that you use the re-ordered files from the differ100.sh script as inputs to nccmp.

Debugging suggestions:

  • If possible limit the spatial scale (to maximize speed).
  • Track down the onset of the deviation between the restart files.
  • Track down the problem. Hopefully, the differences in the restart files will give you a clue about which variable to start the investigation from. The best approach depends on the source of the problem (memory issue, missing variable in the restart file, etc.). For a memory issue a debugger could be the best choice. For missing variables in the restart file it is best to run two identical runs with different period lengths – either manually or with Totalview – while tracking down which variables are causing the differences.
  • Once you have fixed the problem, verify that the fix is also valid at the global scale (i.e. run the global tests again, if you chose to zoom in on a smaller region).

Attachments (2)