#2492 closed Bug (fixed)
Out-of-bounds error in ORCA2-based mono-processor configuration
| Reported by: | smueller | Owned by: | smueller |
|---|---|---|---|
| Priority: | low | Milestone: | |
| Component: | ICB | Version: | v4.0 |
| Severity: | minor | Keywords: | ICB LBC non-MPP v4.0 |
| Cc: | smasson, pierre.mathiot@… | | |
Description
Context
A user of an ORCA2-based mono-processor configuration (key_mpp_mpi undefined) has reported an out-of-bounds error which occurs during the north-fold boundary exchange in subroutine lbc_nfd_2d_ext.
Analysis
This out-of-bounds error can readily be reproduced in reference configuration ORCA2_ICE_PISCES by removing CPP-keys key_mpp_mpi and key_iomput.
The error is caused by the initialisation of the array tmask_e(0:jpi+1,0:jpj+1) in subroutine icb_init; it does not occur when iceberg handling is disabled (ln_icebergs = .FALSE.). The initialisation of tmask_e is finalised by a call to subroutine mpp_lnk_2d_icb (interface lbc_lnk_icb), whose north-fold treatment depends on whether jpni is 1 (including the mono-processor case) or greater: if jpni = 1, subroutine lbc_nfd_2d_ext (implemented in source:/NEMO/releases/r4.0/r4.0-HEAD/src/OCE/LBC/lbc_nfd_ext_generic.h90) is called. Of the array tmask_e(0:jpi+1,0:jpj+1), subroutine mpp_lnk_2d_icb passes only the subset tmask_e(1:jpi,1:jpj+1) to subroutine lbc_nfd_2d_ext. While subroutine lbc_nfd_2d_ext refers to this subset as ptab(1:jpi,0:jpj), it accesses array elements with a dimension-2 subscript of nlcj+1. Since jpj == nlcj, this results in out-of-bounds array access and potentially incorrect array content.
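The subscript arithmetic can be illustrated with a small sketch. This is plain Python, not NEMO code; the values of jpi, jpj, and kextj are arbitrary toy choices, and ptab_get is a hypothetical helper standing in for the callee's view of the passed array section:

```python
# Hedged sketch (plain Python, not NEMO code): mimic how mpp_lnk_2d_icb
# passes the section tmask_e(1:jpi,1:jpj+kextj) to lbc_nfd_2d_ext, which
# views it with a dimension-2 lower bound of 1-kextj.
jpi, jpj, kextj = 4, 6, 1
nlcj = jpj                      # in this setting jpj == nlcj

# The passed section has only jpj + kextj elements in dimension 2:
section = [[0.0] * (jpj + kextj) for _ in range(jpi)]

def ptab_get(i, j):
    # Fortran subscript j (lower bound 1-kextj) -> Python index j - (1-kextj)
    return section[i - 1][j - (1 - kextj)]

# The callee uses a dimension-2 subscript of nlcj+1 == jpj+1, which maps to
# Python index jpj + kextj == 7 while only indices 0..6 exist:
try:
    ptab_get(1, nlcj + 1)
except IndexError:
    print("out-of-bounds access at j = nlcj + 1")
```

The same mismatch occurs in the Fortran code, except that without run-time bounds checking the access silently reads memory outside the array.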
The same error also affects arrays {u,v}mask_e initialised in subroutine icb_init (source:/NEMO/releases/r4.0/r4.0-HEAD/src/OCE/ICB/icbini.F90) and {uo,vo,ff,tt,fr,ua,va,hi,vi}_e initialised in subroutine icb_utl_copy (source:/NEMO/releases/r4.0/r4.0-HEAD/src/OCE/ICB/icbutl.F90). Unrelated to the out-of-bounds access, the lbc_lnk_icb calls used to initialise arrays {u,v}mask_e appear to specify an incorrect grid type ('T' instead of 'U'/'V'), which could result in further incorrect boundary exchanges that may negatively affect these two arrays.
Fix
Subroutine mpp_lnk_2d_icb could be adjusted to retain the bounds for dimension 2 of the array passed to subroutine lbc_nfd_2d_ext, i.e.,
src/OCE/LBC/lbclnk.F90:

```diff
       IF( npolj /= 0 ) THEN
          !
          SELECT CASE ( jpni )
-         CASE ( 1 )     ;   CALL lbc_nfd          ( pt2d(1:jpi,1      :jpj+kextj), cd_type, psgn, kextj )
+         CASE ( 1 )     ;   CALL lbc_nfd          ( pt2d(1:jpi,1-kextj:jpj+kextj), cd_type, psgn, kextj )
          CASE DEFAULT   ;   CALL mpp_lbc_north_icb( pt2d(1:jpi,1:jpj+kextj), cd_type, psgn, kextj )
          END SELECT
          !
```
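A quick sanity check of the corrected bounds, again as a hedged plain-Python sketch (toy values for jpi, jpj, kextj; ptab_get is a hypothetical helper, not a NEMO routine): with the widened section pt2d(1:jpi,1-kextj:jpj+kextj), the callee's dimension-2 extent becomes jpj + 2*kextj, so the subscript nlcj+1 stays in bounds.

```python
# Hedged sketch (plain Python, not NEMO code): check the corrected bounds.
jpi, jpj, kextj = 4, 6, 1
nlcj = jpj

# Section now spans Fortran subscripts 1-kextj .. jpj+kextj in dimension 2:
section = [[0.0] * (jpj + 2 * kextj) for _ in range(jpi)]

def ptab_get(i, j):
    # Fortran subscript j (lower bound 1-kextj) -> Python index j - (1-kextj)
    return section[i - 1][j - (1 - kextj)]

# nlcj+1 == jpj+1 maps to Python index jpj + 2*kextj - 1 == 7, which is
# the last valid index of an 8-element dimension -> no out-of-bounds access.
print(ptab_get(1, nlcj + 1))
```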
Further, the grid types specified in the lbc_lnk_icb calls for arrays {u,v}mask_e could be adjusted according to
src/OCE/ICB/icbini.F90:

```diff
       umask_e(:,:) = 0._wp   ;   umask_e(1:jpi,1:jpj) = umask(:,:,1)
       vmask_e(:,:) = 0._wp   ;   vmask_e(1:jpi,1:jpj) = vmask(:,:,1)
       CALL lbc_lnk_icb( 'icbini', tmask_e, 'T', +1._wp, 1, 1 )
-      CALL lbc_lnk_icb( 'icbini', umask_e, 'T', +1._wp, 1, 1 )
-      CALL lbc_lnk_icb( 'icbini', vmask_e, 'T', +1._wp, 1, 1 )
+      CALL lbc_lnk_icb( 'icbini', umask_e, 'U', +1._wp, 1, 1 )
+      CALL lbc_lnk_icb( 'icbini', vmask_e, 'V', +1._wp, 1, 1 )
       !
       ! assign each new iceberg with a unique number constructed from the processor number
       ! and incremented by the total number of processors
```
Commit History (2)
Changeset | Author | Time | ChangeLog |
---|---|---|---|
13350 | smueller | 2020-07-28T14:28:29+02:00 | Remedy for the bugs reported in ticket #2492 |
13276 | mathiot | 2020-07-09T09:47:18+02:00 | ticket #2494 and #2375: wrong point type inn lbc_lnk_icb for umask_e and vmask_e (see ticket #2492) |
Change History (12)
comment:1 Changed 4 years ago by smasson
- Cc smasson added
comment:2 Changed 4 years ago by mathiot
- Cc pierre.mathiot@… added
comment:3 Changed 4 years ago by mathiot
In 13276:
comment:4 Changed 4 years ago by smueller
- Version v4.0.* deleted
Subroutine mpp_lnk_2d_icb appears to be called only from within the ICB source code (subroutines icb_init and icb_utl_copy), so its modification should only affect model runs with ln_icebergs=.true.. Further, the proposed modification of mpp_lnk_2d_icb only affects a subroutine call when jpni=1 and, since runs using the model compiled with key_mpp_mpi, ln_icebergs=.true., and jpni=1 are explicitly prevented (source:/NEMO/releases/r4.0/r4.0-HEAD/src/OCE/ICB/icbini.F90:#L112), it should only affect mono-processor runs without key_mpp_mpi.
comment:5 Changed 4 years ago by smueller
- Version set to v4.0
comment:6 Changed 4 years ago by smueller
- Owner changed from systeam to smueller
- Status changed from new to assigned
In an email discussion it was proposed to test source:/NEMO/releases/r4.0/r4.0-HEAD with the above fix for module lbclnk by comparing the run.stat output files produced by LONG runs of the ORCA2_ICE_PISCES reference configuration i) as used by SETTE (with key_mpp_mpi, jpni=4, and jpnj=8), ii) with jpni=1 after disabling line source:/NEMO/releases/r4.0/r4.0-HEAD/src/OCE/ICB/icbini.F90@13346:#L112, and iii) without key_mpp_mpi. It was also suggested that the second of the bugs reported above, the specification of incorrect grid types in the initialisation of arrays {u,v}mask_e, be fixed as proposed (see also [13276]).
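The comparison step of this test can be sketched as follows. This is a hedged illustration only: the directory names (case_i, case_ii, case_iii) and the run.stat contents are mock stand-ins, not actual SETTE paths or output.

```shell
# Hedged sketch of the run.stat comparison described above.  The three case
# directories and their file contents are mock stand-ins created here so the
# snippet is self-contained; a real test would point at the run directories.
set -e
for d in case_i case_ii case_iii; do
    mkdir -p "$d"
    printf '    1  0.58032E+00\n    2  0.57991E+00\n' > "$d/run.stat"
done

ref=case_i/run.stat
for d in case_ii case_iii; do
    if cmp -s "$ref" "$d/run.stat"; then
        echo "$d: run.stat identical to case_i"
    else
        echo "$d: run.stat differs from case_i"
    fi
done
```

Byte-for-byte identity of run.stat across the three decompositions is the pass criterion, since the fixed code should be reproducible independent of jpni/jpnj.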
comment:7 Changed 4 years ago by smueller
After disabling line source:/NEMO/releases/r4.0/r4.0-HEAD/src/OCE/ICB/icbini.F90@13346:#L112, the model in ORCA2_ICE_PISCES reference configuration with jpni=1 and jpnj=32 crashes at time step 693; after including the proposed fixes for modules lbclnk and icbini, this model crash no longer occurs.
comment:8 Changed 4 years ago by smueller
The proposed test (see comment:6) has been successful: run.stat files produced using source:/NEMO/releases/r4.0/r4.0-HEAD/@13346 with the proposed fixes of modules lbclnk and icbini are identical across all three cases; further, in cases i and iii, the run.stat output files are also identical to the corresponding run.stat files produced using source:/NEMO/releases/r4.0/r4.0-HEAD@13346 without the proposed fixes (in case ii, one of the runs did not complete, see comment:7).
It has also been found that the tracer.stat output files differ between the MPP cases (i, ii) and the mono-processor case without key_mpp_mpi (iii); the tracer.stat output, however, remains unchanged after the proposed fixes are applied in both case i and case iii. This difference in tracer.stat output appears to be unrelated to the bugs detailed above and should be reported in a separate ticket.
comment:9 Changed 4 years ago by smueller
In 13350:
comment:10 Changed 4 years ago by smueller
- Resolution set to fixed
- Status changed from assigned to closed
source:/NEMO/releases/r4.0/r4.0-HEAD@13350 has passed the standard SETTE tests. Further, source:/NEMO/releases/r4.0/r4.0-HEAD@13350 compiled with debug options (incl. bounds checking) and without key_mpp_mpi and key_iomput runs successfully.
comment:11 Changed 4 years ago by mathiot
Should this fix also be added to the trunk? Is the plan to have a big push of bug fixes into the trunk based on the NEMO 4.0.3 release?
comment:12 Changed 2 years ago by nemo
- Keywords v4.0 added
the mpp part of the icebergs is quite mysterious to me... but:
BTW, how can we justify in 2020 that we keep (and maintain) the possibility to use NEMO without this key? We are no longer in fixed format, as we don't use punched cards any more... Same story for key_mpp_mpi. Who has a computer with only one core today? In addition, we need it for XIOS, and 1D configurations can still use 1 core even with MPI...