#2492 (Out-of-bounds error in ORCA2-based mono-processor configuration) – NEMO

Opened 4 years ago

Closed 4 years ago

Last modified 2 years ago

#2492 closed Bug (fixed)

Out-of-bounds error in ORCA2-based mono-processor configuration

Reported by: smueller
Owned by: smueller
Priority: low
Milestone:
Component: ICB
Version: v4.0
Severity: minor
Keywords: ICB LBC non-MPP v4.0
Cc: smasson, pierre.mathiot@…

Description

Context

A user of an ORCA2-based mono-processor configuration (key_mpp_mpi undefined) has reported an out-of-bounds error which occurs during the north-fold boundary exchange in subroutine lbc_nfd_2d_ext.

Analysis

This out-of-bounds error can readily be reproduced in reference configuration ORCA2_ICE_PISCES by removing CPP-keys key_mpp_mpi and key_iomput.

The error is caused by the initialisation of array tmask_e(0:jpi+1,0:jpj+1) in subroutine icb_init; it is absent when iceberg handling is disabled (ln_icebergs = .FALSE.). The initialisation of tmask_e is finalised by calling subroutine mpp_lnk_2d_icb (interface lbc_lnk_icb), which differs in the north-fold treatment depending on whether jpni is 1 (incl. mono-processor case) or greater: if jpni = 1, subroutine lbc_nfd_2d_ext (implemented in source:/NEMO/releases/r4.0/r4.0-HEAD/src/OCE/LBC/lbc_nfd_ext_generic.h90) is called. Of array tmask_e(0:jpi+1,0:jpj+1), the subset tmask_e(1:jpi,1:jpj+1) is passed to subroutine lbc_nfd_2d_ext by subroutine mpp_lnk_2d_icb. While subroutine lbc_nfd_2d_ext refers to this subset as ptab(1:jpi,0:jpj), it accesses array elements with a dimension-2 subscript of nlcj+1. Since jpj == nlcj, this results in out-of-bounds array access and potentially incorrect content of the array.

The same error also affects arrays {u,v}mask_e initialised in subroutine icb_init (source:/NEMO/releases/r4.0/r4.0-HEAD/src/OCE/ICB/icbini.F90) and {uo,vo,ff,tt,fr,ua,va,hi,vi}_e initialised in subroutine icb_utl_copy (source:/NEMO/releases/r4.0/r4.0-HEAD/src/OCE/ICB/icbutl.F90). Unrelated to out-of-bounds array access, it appears that the lbc_lnk_icb calls used to initialise arrays {u,v}mask_e specify incorrect grid types ('T'), which could result in further incorrect boundary exchanges that may negatively affect these two arrays.

Fix

Subroutine mpp_lnk_2d_icb could be adjusted to retain the bounds for dimension 2 of the array passed to subroutine lbc_nfd_2d_ext, i.e.,

  • src/OCE/LBC/lbclnk.F90

           IF( npolj /= 0 ) THEN
              !
              SELECT CASE ( jpni )
    -         CASE ( 1 )     ;   CALL lbc_nfd          ( pt2d(1:jpi,1:jpj+kextj), cd_type, psgn, kextj )
    +         CASE ( 1 )     ;   CALL lbc_nfd          ( pt2d(1:jpi,1-kextj:jpj+kextj), cd_type, psgn, kextj )
              CASE DEFAULT   ;   CALL mpp_lbc_north_icb( pt2d(1:jpi,1:jpj+kextj), cd_type, psgn, kextj )
              END SELECT
              !
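The effect of this change on the bounds seen inside the north-fold routine can be checked with a minimal stand-alone sketch. It is not NEMO code: the module nfd_sketch, the subroutine show_bounds and the array sizes are hypothetical, and the dummy-argument declaration ptab(:,1-kextj:) is an assumption chosen to match the bounds ptab(1:jpi,0:jpj) described in the analysis above.

```fortran
! Stand-alone illustration only; names and sizes are hypothetical, not taken from NEMO.
MODULE nfd_sketch
   IMPLICIT NONE
CONTAINS
   ! Assumed to mimic the dummy argument of lbc_nfd_2d_ext: dimension 2 is
   ! remapped to start at 1-kextj, whatever array section the caller passes.
   SUBROUTINE show_bounds( ptab, kextj, cd_label )
      INTEGER         , INTENT(in) ::   kextj
      REAL            , INTENT(in) ::   ptab(:,1-kextj:)
      CHARACTER(len=*), INTENT(in) ::   cd_label
      WRITE(*,'(a,a,i0,a,i0)') cd_label, ': ptab dim 2 = ', LBOUND(ptab,2), ':', UBOUND(ptab,2)
   END SUBROUTINE show_bounds
END MODULE nfd_sketch

PROGRAM demo
   USE nfd_sketch
   IMPLICIT NONE
   INTEGER, PARAMETER ::   jpi = 4, jpj = 3, kextj = 1   ! arbitrary illustrative sizes
   REAL ::   pt2d(0:jpi+1,0:jpj+1)                       ! extended array, like tmask_e
   pt2d(:,:) = 0.
   ! Section passed before the fix: jpj+1 rows, seen as 0:jpj inside the routine
   CALL show_bounds( pt2d(1:jpi,1:jpj+kextj)      , kextj, 'before fix' )
   ! Section passed after the fix: jpj+2 rows, seen as 0:jpj+1 inside the routine
   CALL show_bounds( pt2d(1:jpi,1-kextj:jpj+kextj), kextj, 'after fix ' )
END PROGRAM demo
```

With jpj = 3, the first call reports an upper bound of 3 (= jpj) for dimension 2, so a subscript of jpj+1 falls outside the array; the second call reports an upper bound of 4 (= jpj+1), so the same subscript is valid.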

Further, the grid types specified in the lbc_lnk_icb calls for arrays {u,v}mask_e could be adjusted according to

  • src/OCE/ICB/icbini.F90

           umask_e(:,:) = 0._wp   ;   umask_e(1:jpi,1:jpj) = umask(:,:,1)
           vmask_e(:,:) = 0._wp   ;   vmask_e(1:jpi,1:jpj) = vmask(:,:,1)
           CALL lbc_lnk_icb( 'icbini', tmask_e, 'T', +1._wp, 1, 1 )
    -      CALL lbc_lnk_icb( 'icbini', umask_e, 'T', +1._wp, 1, 1 )
    -      CALL lbc_lnk_icb( 'icbini', vmask_e, 'T', +1._wp, 1, 1 )
    +      CALL lbc_lnk_icb( 'icbini', umask_e, 'U', +1._wp, 1, 1 )
    +      CALL lbc_lnk_icb( 'icbini', vmask_e, 'V', +1._wp, 1, 1 )
           !
           ! assign each new iceberg with a unique number constructed from the processor number
           ! and incremented by the total number of processors

Commit History (2)

Changeset   Author     Time                        ChangeLog
13350       smueller   2020-07-28T14:28:29+02:00   Remedy for the bugs reported in ticket #2492
13276       mathiot    2020-07-09T09:47:18+02:00   ticket #2494 and #2375: wrong point type in lbc_lnk_icb for umask_e and vmask_e (see ticket #2492)

Change History (12)

comment:1 Changed 4 years ago by smasson

  • Cc smasson added

The MPP part of the icebergs is quite mysterious to me... but:

  • icbini states clearly that icebergs do not work with a single processor in x:
    IF( lk_mpp .AND. jpni == 1 )   CALL ctl_stop( 'icbinit: having ONE processor in x currently does not work' )
    
    So I have little hope it can work without key_mpp_mpi...
    BTW, how can we justify in 2020 that we keep (and maintain) the possibility to use NEMO without this key? We no longer write fixed-format source because we no longer use punched cards... Same story for key_mpp_mpi. Who has a computer with only one core today? In addition, we need it for XIOS, and 1D configurations can still use 1 core even with MPI...
  • Duplicating the MPP routines with extended halos (_ext) only for the icebergs was originally not the best idea. Duplication is a quick (and dirty) solution that is hard to maintain as the code changes. This solution is now obsolete (and is not really compatible) since we introduced nn_hls > 1.

=> To me, the MPI part of the icebergs should be entirely reviewed and rewritten using the default MPP routines.


comment:2 Changed 4 years ago by mathiot

  • Cc pierre.mathiot@… added

comment:3 Changed 4 years ago by mathiot

In 13276:

ticket #2494 and #2375: wrong point type in lbc_lnk_icb for umask_e and vmask_e (see ticket #2492)

comment:4 Changed 4 years ago by smueller

  • Version v4.0.* deleted

Subroutine mpp_lnk_2d_icb only appears to be called from within the ICB source code (subroutines icb_init and icb_utl_copy), so its modification should only affect model runs with ln_icebergs=.true. Further, the proposed modification of mpp_lnk_2d_icb only affects a subroutine call when jpni=1. Since runs using the model compiled with key_mpp_mpi, ln_icebergs=.true., and jpni=1 are explicitly prevented (source:/NEMO/releases/r4.0/r4.0-HEAD/src/OCE/ICB/icbini.F90:#L112), the modification should only affect mono-processor runs without key_mpp_mpi.

comment:5 Changed 4 years ago by smueller

  • Version set to v4.0

comment:6 Changed 4 years ago by smueller

  • Owner changed from systeam to smueller
  • Status changed from new to assigned

In an email discussion it was proposed to test source:/NEMO/releases/r4.0/r4.0-HEAD with the above fix for module lbclnk by comparing the run.stat output files produced by LONG runs with the ORCA2_ICE_PISCES reference configuration i) as used by SETTE (with key_mpp_mpi, jpni=4, and jpnj=8), ii) with jpni=1 after disabling line source:/NEMO/releases/r4.0/r4.0-HEAD/src/OCE/ICB/icbini.F90@13346:#L112, and iii) without key_mpp_mpi; it was also suggested that the second of the bugs reported above, the specification of incorrect grid types in the initialisation of arrays {u,v}mask_e, should be fixed as proposed (see also [13276]).

comment:7 Changed 4 years ago by smueller

After disabling line source:/NEMO/releases/r4.0/r4.0-HEAD/src/OCE/ICB/icbini.F90@13346:#L112, the model in ORCA2_ICE_PISCES reference configuration with jpni=1 and jpnj=32 crashes at time step 693; after including the proposed fixes for modules lbclnk and icbini, this model crash no longer occurs.

comment:8 Changed 4 years ago by smueller

The proposed test (see comment:6) has been successful: run.stat files produced using source:/NEMO/releases/r4.0/r4.0-HEAD/@13346 with the proposed fixes of modules lbclnk and icbini are identical across all three cases; further, in cases i and iii, the run.stat output files are also identical to the corresponding run.stat files produced using source:/NEMO/releases/r4.0/r4.0-HEAD@13346 without the proposed fixes (in case ii, one of the runs did not complete, see comment:7).

Further, it has also been found that output files tracer.stat differ between the MPP cases (i, ii) and the mono-processor case without key_mpp_mpi (iii); the tracer.stat output, however, has remained unchanged after the proposed fixes have been applied both in case i and iii. This difference in tracer.stat output appears to be unrelated to the bugs detailed above and should be reported in a different ticket.

comment:9 Changed 4 years ago by smueller

In 13350:

Remedy for the bugs reported in ticket #2492

comment:10 Changed 4 years ago by smueller

  • Resolution set to fixed
  • Status changed from assigned to closed

source:/NEMO/releases/r4.0/r4.0-HEAD@13350 has passed the standard SETTE tests. Further, source:/NEMO/releases/r4.0/r4.0-HEAD@13350 compiled with debug options (incl. bounds checking) and without key_mpp_mpi and key_iomput runs successfully.

comment:11 Changed 4 years ago by mathiot

Should this fix also be added to the trunk? Is the plan to have a big push of bug fixes into the trunk based on the NEMO 4.0.3 release?

comment:12 Changed 2 years ago by nemo

  • Keywords v4.0 added