Opened 6 years ago

Closed 6 years ago

#769 closed upgrade (Done)

VERY IMPORTANT Update: Everyone needs to read the provided information

Reported by: arango Owned by:
Priority: major Milestone: Release ROMS/TOMS 3.7
Component: Nonlinear Version: 3.7
Keywords: Cc:

Description

This update contains critical information, and I recommend that everybody consider updating their version:

  • While coding and debugging the tangent linear and adjoint of the nesting algorithms, we discovered a bug in the two-way exchange of the nonlinear model code in telescoping nested applications. Recall that in ROMS, a telescoping grid is a refined grid that contains another refined grid. For example:

https://www.myroms.org/trac/multi-refinement_nd.png

In the above diagram, grid 2 is the only telescoping grid in this configuration. For grid 2 to be considered a telescoping grid, there must be a two-way transfer of information from grid 4 to grid 2 and from grid 2 to grid 1. By definition, the coarser grid 1 is not a telescoping grid in ROMS.

In a three-grid refinement application for the US east coast, we noticed unnecessary two-way exchanges between telescoping grid 2 and coarser grid 1. See the lines marked with >>>> in the standard output below:

 NL ROMS/TOMS: started time-stepping: (Grid: 01 TimeSteps: 000000000001 - 000000007200)
 NL ROMS/TOMS: started time-stepping: (Grid: 02 TimeSteps: 000000000001 - 000000021600)
 NL ROMS/TOMS: started time-stepping: (Grid: 03 TimeSteps: 000000000001 - 000000043200)

 TIME-STEP YYYY-MM-DD hh:mm:ss.ss  KINETIC_ENRG   POTEN_ENRG    TOTAL_ENRG    NET_VOLUME  Grid
                     C => (i,j,k)       Cu            Cv            Cw         Max Speed

         0 2014-01-01 00:00:00.00  2.479705E-02  1.895958E+04  1.895961E+04  2.057379E+15  01
                     (128,010,40)  6.586887E-02  8.830755E-02  0.000000E+00  2.146392E+00
         0 2014-01-01 00:00:00.00  9.826892E-03  1.355795E+04  1.355796E+04  2.279106E+14  02
                     (067,001,40)  6.153040E-02  3.379189E-02  0.000000E+00  1.549213E+00
         0 2014-01-01 00:00:00.00  4.571369E-03  1.116663E+04  1.116663E+04  4.813844E+13  03
                     (170,008,29)  2.109439E-02  2.133386E-02  0.000000E+00  3.891363E-01
         1 2014-01-01 00:01:00.00  4.569719E-03  1.116663E+04  1.116663E+04  4.813848E+13  03
                     (224,073,01)  1.516512E-02  1.260101E-02  8.206179E-02  3.891982E-01
      FINE2COARSE - exchanging data between grids: dg = 03 and rg = 02  at cr = 04
         1 2014-01-01 00:02:00.00  9.831911E-03  1.355797E+04  1.355798E+04  2.279109E+14  02
                     (135,136,01)  5.762565E-03  1.577325E-04  2.528599E-01  1.547482E+00
         2 2014-01-01 00:02:00.00  4.574353E-03  1.116663E+04  1.116664E+04  4.813852E+13  03
                     (222,074,37)  2.382373E-02  1.525561E-02  1.121914E-01  3.892563E-01
         3 2014-01-01 00:03:00.00  4.576449E-03  1.116665E+04  1.116665E+04  4.813859E+13  03
                     (233,065,40)  1.694378E-02  6.053802E-03  1.822713E-01  3.886253E-01
      FINE2COARSE - exchanging data between grids: dg = 03 and rg = 02  at cr = 04
>>>>  FINE2COARSE - exchanging data between grids: dg = 02 and rg = 01  at cr = 02
         2 2014-01-01 00:04:00.00  9.848035E-03  1.355799E+04  1.355800E+04  2.279112E+14  02
                     (134,137,40)  2.929985E-03  7.659153E-04  6.330313E-01  1.548821E+00
>>>>  FINE2COARSE - exchanging data between grids: dg = 02 and rg = 01  at cr = 02
         4 2014-01-01 00:04:00.00  4.576183E-03  1.116666E+04  1.116667E+04  4.813865E+13  03
                     (237,066,40)  1.547985E-02  5.251661E-03  2.077931E-01  3.886372E-01
>>>>  FINE2COARSE - exchanging data between grids: dg = 02 and rg = 01  at cr = 02
         5 2014-01-01 00:05:00.00  4.578296E-03  1.116668E+04  1.116669E+04  4.813873E+13  03
                     (240,066,40)  1.518760E-02  4.659222E-03  1.958341E-01  3.885507E-01
      FINE2COARSE - exchanging data between grids: dg = 03 and rg = 02  at cr = 04
         1 2014-01-01 00:06:00.00  2.472506E-02  1.895960E+04  1.895962E+04  2.057381E+15  01
                     (031,036,01)  2.730114E-02  1.131899E-02  2.988898E-01  2.164003E+00
         3 2014-01-01 00:06:00.00  9.864558E-03  1.355800E+04  1.355801E+04  2.279115E+14  02
                     (136,140,40)  2.016993E-03  3.375310E-03  5.526394E-01  1.550218E+00
         6 2014-01-01 00:06:00.00  4.583192E-03  1.116670E+04  1.116670E+04  4.813881E+13  03
                     (242,068,40)  1.469385E-02  4.240571E-03  1.770060E-01  3.886908E-01
         7 2014-01-01 00:07:00.00  4.589692E-03  1.116671E+04  1.116672E+04  4.813888E+13  03
                     (214,082,40)  1.952714E-02  3.661358E-03  1.516865E-01  3.892088E-01
      FINE2COARSE - exchanging data between grids: dg = 03 and rg = 02  at cr = 04
         4 2014-01-01 00:08:00.00  9.875476E-03  1.355809E+04  1.355810E+04  2.279128E+14  02
                     (136,140,40)  2.392311E-03  2.585103E-03  5.107434E-01  1.552007E+00
         8 2014-01-01 00:08:00.00  4.596425E-03  1.116673E+04  1.116673E+04  4.813895E+13  03
                     (216,085,40)  1.761177E-02  3.653321E-03  1.553607E-01  3.887313E-01
         9 2014-01-01 00:09:00.00  4.603149E-03  1.116675E+04  1.116676E+04  4.813905E+13  03
                     (250,064,40)  1.535948E-02  1.897569E-03  1.583301E-01  3.885450E-01
      FINE2COARSE - exchanging data between grids: dg = 03 and rg = 02  at cr = 04
>>>>  FINE2COARSE - exchanging data between grids: dg = 02 and rg = 01  at cr = 02
         5 2014-01-01 00:10:00.00  9.884353E-03  1.355818E+04  1.355819E+04  2.279141E+14  02
                     (136,141,40)  3.595284E-03  2.194098E-03  4.981900E-01  1.553226E+00
>>>>  FINE2COARSE - exchanging data between grids: dg = 02 and rg = 01  at cr = 02
        10 2014-01-01 00:10:00.00  4.605400E-03  1.116677E+04  1.116678E+04  4.813915E+13  03
                     (253,064,40)  1.487150E-02  1.964936E-03  1.542588E-01  3.883007E-01
>>>>  FINE2COARSE - exchanging data between grids: dg = 02 and rg = 01  at cr = 02
        11 2014-01-01 00:11:00.00  4.605182E-03  1.116681E+04  1.116681E+04  4.813930E+13  03
                     (255,065,40)  1.409242E-02  1.498900E-03  1.489003E-01  3.873421E-01
      FINE2COARSE - exchanging data between grids: dg = 03 and rg = 02  at cr = 04
         2 2014-01-01 00:12:00.00  2.470726E-02  1.895960E+04  1.895962E+04  2.057385E+15  01
                     (026,035,40)  3.741860E-02  1.691325E-03  5.639773E-01  2.167463E+00

Therefore, for each timestep of grid 1 there are three two-way exchanges from grid 2 to grid 1 in the contact region cr = 2. These redundant exchanges are not a problem in the nonlinear model because grid 1 is waiting for grids 2 and 3 to reach its current time, but they are fatal in the adjoint model. The two-way exchanges are also expensive in distributed-memory parallel configurations because they involve MPI communications that slow the solution and degrade performance. However, the actual bug is that the two-way exchange between grids 2 and 1 is missing the last piece from grid 3: it needs to happen after the last exchange between grids 3 and 2, so that grid 1 can advance another timestep (see ++++ below). This bug does not seem to change the solution much, but it is still wrong. We need to have instead:

 NL ROMS/TOMS: started time-stepping: (Grid: 01 TimeSteps: 000000000001 - 000000007200)
 NL ROMS/TOMS: started time-stepping: (Grid: 02 TimeSteps: 000000000001 - 000000021600)
 NL ROMS/TOMS: started time-stepping: (Grid: 03 TimeSteps: 000000000001 - 000000043200)

 TIME-STEP YYYY-MM-DD hh:mm:ss.ss  KINETIC_ENRG   POTEN_ENRG    TOTAL_ENRG    NET_VOLUME  Grid
                     C => (i,j,k)       Cu            Cv            Cw         Max Speed

         0 2014-01-01 00:00:00.00  2.479705E-02  1.895958E+04  1.895961E+04  2.057379E+15  01
                     (128,010,40)  6.586887E-02  8.830755E-02  0.000000E+00  2.146392E+00
         0 2014-01-01 00:00:00.00  9.826892E-03  1.355795E+04  1.355796E+04  2.279106E+14  02
                     (067,001,40)  6.153040E-02  3.379189E-02  0.000000E+00  1.549213E+00
         0 2014-01-01 00:00:00.00  4.571369E-03  1.116663E+04  1.116663E+04  4.813844E+13  03
                     (170,008,29)  2.109439E-02  2.133386E-02  0.000000E+00  3.891363E-01
         1 2014-01-01 00:01:00.00  4.569719E-03  1.116663E+04  1.116663E+04  4.813848E+13  03
                     (224,073,01)  1.516512E-02  1.260101E-02  8.206179E-02  3.891982E-01
      FINE2COARSE - exchanging data between grids: dg = 03 and rg = 02  at cr = 04
         1 2014-01-01 00:02:00.00  9.831911E-03  1.355797E+04  1.355798E+04  2.279109E+14  02
                     (135,136,01)  5.762565E-03  1.577325E-04  2.528599E-01  1.547482E+00
         2 2014-01-01 00:02:00.00  4.574353E-03  1.116663E+04  1.116664E+04  4.813852E+13  03
                     (222,074,37)  2.382373E-02  1.525561E-02  1.121914E-01  3.892563E-01
         3 2014-01-01 00:03:00.00  4.576449E-03  1.116665E+04  1.116665E+04  4.813859E+13  03
                     (233,065,40)  1.694378E-02  6.053802E-03  1.822713E-01  3.886253E-01
      FINE2COARSE - exchanging data between grids: dg = 03 and rg = 02  at cr = 04
         2 2014-01-01 00:04:00.00  9.848035E-03  1.355799E+04  1.355800E+04  2.279112E+14  02
                     (134,137,40)  2.929985E-03  7.659153E-04  6.330313E-01  1.548821E+00
         4 2014-01-01 00:04:00.00  4.576183E-03  1.116666E+04  1.116667E+04  4.813865E+13  03
                     (237,066,40)  1.547985E-02  5.251661E-03  2.077931E-01  3.886372E-01
         5 2014-01-01 00:05:00.00  4.578296E-03  1.116668E+04  1.116669E+04  4.813873E+13  03
                     (240,066,40)  1.518760E-02  4.659222E-03  1.958341E-01  3.885507E-01
      FINE2COARSE - exchanging data between grids: dg = 03 and rg = 02  at cr = 04
++++  FINE2COARSE - exchanging data between grids: dg = 02 and rg = 01  at cr = 02
         1 2014-01-01 00:06:00.00  2.472503E-02  1.895960E+04  1.895962E+04  2.057381E+15  01
                     (031,036,01)  2.730114E-02  1.131899E-02  2.988898E-01  2.164003E+00
         3 2014-01-01 00:06:00.00  9.864558E-03  1.355800E+04  1.355801E+04  2.279115E+14  02
                     (136,140,40)  2.016993E-03  3.375310E-03  5.526394E-01  1.550218E+00
         6 2014-01-01 00:06:00.00  4.583192E-03  1.116670E+04  1.116670E+04  4.813881E+13  03
                     (242,068,40)  1.469385E-02  4.240571E-03  1.770060E-01  3.886908E-01
         7 2014-01-01 00:07:00.00  4.589692E-03  1.116671E+04  1.116672E+04  4.813888E+13  03
                     (214,082,40)  1.952714E-02  3.661358E-03  1.516865E-01  3.892088E-01
      FINE2COARSE - exchanging data between grids: dg = 03 and rg = 02  at cr = 04
         4 2014-01-01 00:08:00.00  9.875476E-03  1.355809E+04  1.355810E+04  2.279128E+14  02
                     (136,140,40)  2.392311E-03  2.585103E-03  5.107434E-01  1.552007E+00
         8 2014-01-01 00:08:00.00  4.596425E-03  1.116673E+04  1.116673E+04  4.813895E+13  03
                     (216,085,40)  1.761177E-02  3.653321E-03  1.553607E-01  3.887313E-01
         9 2014-01-01 00:09:00.00  4.603149E-03  1.116675E+04  1.116676E+04  4.813905E+13  03
                     (250,064,40)  1.535948E-02  1.897569E-03  1.583301E-01  3.885450E-01
      FINE2COARSE - exchanging data between grids: dg = 03 and rg = 02  at cr = 04
         5 2014-01-01 00:10:00.00  9.884353E-03  1.355818E+04  1.355819E+04  2.279141E+14  02
                     (136,141,40)  3.595284E-03  2.194098E-03  4.981900E-01  1.553226E+00
        10 2014-01-01 00:10:00.00  4.605400E-03  1.116677E+04  1.116678E+04  4.813915E+13  03
                     (253,064,40)  1.487150E-02  1.964936E-03  1.542588E-01  3.883007E-01
        11 2014-01-01 00:11:00.00  4.605182E-03  1.116681E+04  1.116681E+04  4.813930E+13  03
                     (255,065,40)  1.409242E-02  1.498900E-03  1.489003E-01  3.873421E-01
      FINE2COARSE - exchanging data between grids: dg = 03 and rg = 02  at cr = 04
++++  FINE2COARSE - exchanging data between grids: dg = 02 and rg = 01  at cr = 02
         2 2014-01-01 00:12:00.00  2.470724E-02  1.895960E+04  1.895962E+04  2.057385E+15  01

Once this bug is corrected, telescoping applications run faster. The gain in efficiency depends on the application and the number of processors used. We have observed improvements of 6 to 10 percent.

Therefore, this is a critical update for users running nested applications.
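
The exchange ordering in the corrected output above can be illustrated with a toy scheduler. This is only a sketch and not the ROMS nesting code: the Grid class, the advance function, and the steps_per_parent parameter are hypothetical names, and the step ratios (3 steps of grid 2 per step of grid 1, 2 steps of grid 3 per step of grid 2) are read off the time-step counts in the log. The names dg (donor grid) and rg (receiver grid) follow the FINE2COARSE messages.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Grid:
    ng: int                          # grid number as printed in the log
    steps_per_parent: int = 1        # time steps taken per parent time step
    child: Optional["Grid"] = None   # next finer grid, if any


def advance(grid: Grid, events: List[tuple]) -> None:
    """Advance `grid` by one time step and let the finer grids catch up."""
    events.append(("step", grid.ng))
    if grid.child is None:
        return
    for _ in range(grid.child.steps_per_parent):
        advance(grid.child, events)
    # Feed back to this grid only once, after the child (which has already
    # received data from its own child) has reached this grid's new time.
    events.append(("fine2coarse", grid.child.ng, grid.ng))


# Three-grid telescoping setup mirroring the log (hypothetical toy values).
grid3 = Grid(ng=3, steps_per_parent=2)
grid2 = Grid(ng=2, steps_per_parent=3, child=grid3)
grid1 = Grid(ng=1, child=grid2)

events: List[tuple] = []
advance(grid1, events)  # one time step of grid 1

# (dg, rg) pairs in the order the exchanges occur
exchanges = [e[1:] for e in events if e[0] == "fine2coarse"]
print(exchanges)  # [(3, 2), (3, 2), (3, 2), (2, 1)]
```

In this sketch, each grid 1 timestep produces three exchanges from grid 3 to grid 2 and a single exchange from grid 2 to grid 1, placed after the last 3-to-2 exchange, which is the pattern marked with ++++ in the corrected output.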

  • We also found additional performance improvements (around 8 percent) when ROMS is compiled with ifort and the option -heap_arrays is removed. However, we need to set the stack size to a large value on some computers:
               FFLAGS += -Wl,-stack_size,0x64000000
    
    or set the shell stacksize limit to unlimited in the login script. For example, I have the following command in my .tcshrc:
    limit stacksize unlimited
    
    If I type the limit UNIX command on a Linux cluster, I get:
    % limit
    cputime      unlimited
    filesize     unlimited
    datasize     unlimited
    stacksize    unlimited
    coredumpsize 0 kbytes
    memoryuse    unlimited
    vmemoryuse   unlimited
    descriptors  1024
    memorylocked unlimited
    maxproc      1024
    
    ROMS has lots of automatic arrays, so one has the option to allocate those arrays on the heap or the stack. Usually, the stack option is faster, but we need to have enough of it. Otherwise, ROMS will blow up because of memory corruption.
  • Corrected the reporting of the longitude and latitude ranges at RHO-points in get_grid.F. Many thanks to John Warner for bringing this to our attention.
  • Added MPI broadcasting of 2D and 3D string arrays in distribute.F. Now, we have the following interface for mp_bcasts:
          INTERFACE mp_bcasts
            MODULE PROCEDURE mp_bcasts_0d
            MODULE PROCEDURE mp_bcasts_1d
            MODULE PROCEDURE mp_bcasts_2d
            MODULE PROCEDURE mp_bcasts_3d
          END INTERFACE mp_bcasts
    
  • Added the reading and writing of generic 2D and 3D strings to NetCDF files. The mod_netcdf.F module now has the following updated interfaces for netcdf_get_svar and netcdf_put_svar:
          INTERFACE netcdf_get_svar
            MODULE PROCEDURE netcdf_get_svar_0d
            MODULE PROCEDURE netcdf_get_svar_1d
            MODULE PROCEDURE netcdf_get_svar_2d
            MODULE PROCEDURE netcdf_get_svar_3d
          END INTERFACE netcdf_get_svar
    
          INTERFACE netcdf_put_fvar
    
          INTERFACE netcdf_put_svar
            MODULE PROCEDURE netcdf_put_svar_0d
            MODULE PROCEDURE netcdf_put_svar_1d
            MODULE PROCEDURE netcdf_put_svar_2d
            MODULE PROCEDURE netcdf_put_svar_3d
          END INTERFACE netcdf_put_svar
    
    The reasons for this update will be obvious in the future.
  • Added a new C-preprocessing option IMPLICIT_NUDGING to the momentum radiation boundary conditions in u2dbc_im.F, v2dbc_im.F, u3dbc_im.F, and v3dbc_im.F. The implicit treatment of the nudging term in the radiation equation is more stable, but one needs to be sure that the land/sea masking does not have one-point bays. Many thanks to Alistar and Kate for suggesting this option.
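
Why an implicit nudging term tends to be more stable can be seen on a toy relaxation equation, d(phi)/dt = -(phi - phi_ext)/tau. The sketch below is an assumption-laden illustration and not the discretization used in the ROMS boundary condition routines: the function names are hypothetical, and r stands for the ratio dt/tau. The explicit (forward Euler) update amplifies errors when r > 2, whereas the implicit (backward Euler) update damps them for any r > 0.

```python
def nudge_explicit(phi: float, phi_ext: float, r: float) -> float:
    """Forward Euler: nudging term evaluated at the old time level."""
    return phi - r * (phi - phi_ext)


def nudge_implicit(phi: float, phi_ext: float, r: float) -> float:
    """Backward Euler: nudging term evaluated at the new time level."""
    return (phi + r * phi_ext) / (1.0 + r)


def run(update, phi0: float, phi_ext: float, r: float, nsteps: int) -> float:
    """Apply `update` repeatedly and return the final value."""
    phi = phi0
    for _ in range(nsteps):
        phi = update(phi, phi_ext, r)
    return phi


# Strong nudging: dt/tau = 2.5 exceeds the explicit stability limit of 2.
r = 2.5
phi_explicit = run(nudge_explicit, 1.0, 0.0, r, 20)
phi_implicit = run(nudge_implicit, 1.0, 0.0, r, 20)
print(abs(phi_explicit) > 1.0)   # True: the explicit solution grows
print(abs(phi_implicit) < 1.0)   # True: the implicit solution decays
```

With phi_ext = 0, the explicit update multiplies phi by (1 - r) each step and the implicit update multiplies it by 1/(1 + r), so only the implicit form relaxes toward the external value for every positive r.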

Change History (1)

comment:1 by arango, 6 years ago

Resolution: Done
Status: new → closed