Description |
This is an important update because I revised the entire profiling infrastructure before starting to look for ways to improve the computational efficiency. In particular, I have been experimenting with ways to accelerate the ROMS nesting algorithms. Currently, I am concentrating on routines mp_assemble and mp_aggregate of distribute.F.
This update also includes a correction to the management of ntend(ng) and a few changes to the arguments of routines wclock_on and wclock_off.
What is new?
- The routine mp_assemble is a multidimensional version of mp_collect; they are used in the nesting and 4D-Var algorithms, respectively. Both routines assemble/collect elements of arrays from all the MPI nodes. Each node processes the parts of these arrays computed from its tiled state variables.
The assembly/collection operation can be coded with high-level MPI functions like mpi_allgather or mpi_allreduce (summation, since all arrays are initialized to zero). Alternatively, one could use the lower-level routines mpi_irecv, mpi_isend, and mpi_bcast, similar to what is done in the tile-halo exchanges (mp_exchange.F). It turns out that the lower-level functions are actually more efficient than the higher-level ones in our case, which uses generic MPI libraries (like OpenMPI). The high-level functions are usually optimized by vendors for multi-million-dollar supercomputers and their compilers.
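The summation approach works precisely because the work arrays are zero-initialized. Here is a minimal sketch in plain Python (not ROMS code; the node layout and array values are made up for illustration) of why an elementwise summation reduction assembles the tiles correctly:

```python
# Conceptual sketch: each "node" owns a slice of a global array and
# contributes zeros everywhere else, so an elementwise sum across
# nodes (mimicking mpi_allreduce with MPI_SUM) merges the tiles.

def allreduce_sum(contributions):
    """Elementwise sum across nodes, mimicking mpi_allreduce(..., MPI_SUM)."""
    n = len(contributions[0])
    return [sum(c[k] for c in contributions) for k in range(n)]

global_size = 8
tiles = {0: range(0, 4), 1: range(4, 8)}   # node -> owned indices
data = [10, 11, 12, 13, 14, 15, 16, 17]    # the "true" global array

# Each node fills only its own tile; the rest stays zero.
contributions = []
for node, owned in tiles.items():
    work = [0] * global_size
    for k in owned:
        work[k] = data[k]
    contributions.append(work)

assembled = allreduce_sum(contributions)
print(assembled)   # [10, 11, 12, 13, 14, 15, 16, 17]
```

The same merge could be done with point-to-point sends and a broadcast, which is what the low-level alternative does; the numerical result is identical.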
Notice that at the top of distribute.F, we have the following internal CPP options to set the desired communication options. The default is to have:
# undef ASSEMBLE_ALLGATHER /* use mpi_allgather in mp_assemble */
# undef ASSEMBLE_ALLREDUCE /* use mpi_allreduce in mp_assemble */
# define BOUNDARY_ALLREDUCE /* use mpi_allreduce in mp_boundary */
# undef COLLECT_ALLGATHER /* use mpi_allgather in mp_collect */
# undef COLLECT_ALLREDUCE /* use mpi_allreduce in mp_collect */
# define REDUCE_ALLGATHER /* use mpi_allgather in mp_reduce */
# undef REDUCE_ALLREDUCE /* use mpi_allreduce in mp_reduce */
- The ROMS internal profiling was modified to include more regions (Pregion) in mod_strings.F:
character (len=50), dimension(Nregion) :: Pregion = &
& (/'Allocation and array initialization ..............', & !01
& 'Ocean state initialization .......................', & !02
& 'Reading of input data ............................', & !03
& 'Processing of input data .........................', & !04
& 'Processing of output time averaged data ..........', & !05
& 'Computation of vertical boundary conditions ......', & !06
& 'Computation of global information integrals ......', & !07
& 'Writing of output data ...........................', & !08
& 'Model 2D kernel ..................................', & !09
& 'Lagrangian floats trajectories ...................', & !10
& 'Tidal forcing ....................................', & !11
& '2D/3D coupling, vertical metrics .................', & !12
& 'Omega vertical velocity ..........................', & !13
& 'Equation of state for seawater ...................', & !14
& 'Biological module, source/sink terms .............', & !15
& 'Sediment transport module, source/sink terms .....', & !16
& 'Atmosphere-Ocean bulk flux parameterization ......', & !17
& 'KPP vertical mixing parameterization .............', & !18
& 'GLS vertical mixing parameterization .............', & !19
& 'My2.5 vertical mixing parameterization ...........', & !20
& '3D equations right-side terms ....................', & !21
& '3D equations predictor step ......................', & !22
& 'Pressure gradient ................................', & !23
& 'Harmonic mixing of tracers, S-surfaces ...........', & !24
& 'Harmonic mixing of tracers, geopotentials ........', & !25
& 'Harmonic mixing of tracers, isopycnals ...........', & !26
& 'Biharmonic mixing of tracers, S-surfaces .........', & !27
& 'Biharmonic mixing of tracers, geopotentials ......', & !28
& 'Biharmonic mixing of tracers, isopycnals .........', & !29
& 'Harmonic stress tensor, S-surfaces ...............', & !30
& 'Harmonic stress tensor, geopotentials ............', & !31
& 'Biharmonic stress tensor, S-surfaces .............', & !32
& 'Biharmonic stress tensor, geopotentials ..........', & !33
& 'Corrector time-step for 3D momentum ..............', & !34
& 'Corrector time-step for tracers ..................', & !35
& 'Nesting algorithm ................................', & !36
& 'Bottom boundary layer module .....................', & !37
& 'GST Analysis eigenproblem solution ...............', & !38
& 'Two-way coupling to Atmosphere Model .............', & !39
& 'Two-way coupling to Sea Ice Model ................', & !40
& 'Two-way coupling to Wave Model ...................', & !41
& 'Reading model state vector .......................', & !42
& '4D-Var minimization solver .......................', & !43
& 'Background error covariance matrix ...............', & !44
& 'Posterior error covariance matrix ................', & !45
& 'Unused 01 ........................................', & !46
& 'Unused 02 ........................................', & !47
& 'Unused 03 ........................................', & !48
& 'Unused 04 ........................................', & !49
& 'Unused 05 ........................................', & !50
& 'Unused 06 ........................................', & !51
& 'Unused 07 ........................................', & !52
& 'Unused 08 ........................................', & !53
& 'Unused 09 ........................................', & !54
& 'Unused 10 ........................................', & !55
& 'Unused 11 ........................................', & !56
& 'Unused 12 ........................................', & !57
& 'Unused 13 ........................................', & !58
& 'Unused 14 ........................................', & !59
& 'Message Passage: 2D halo exchanges ...............', & !60
& 'Message Passage: 3D halo exchanges ...............', & !61
& 'Message Passage: 4D halo exchanges ...............', & !62
& 'Message Passage: lateral boundary exchanges ......', & !63
& 'Message Passage: data broadcast ..................', & !64
& 'Message Passage: data reduction ..................', & !65
& 'Message Passage: data gathering ..................', & !66
& 'Message Passage: data scattering..................', & !67
& 'Message Passage: boundary data gathering .........', & !68
& 'Message Passage: point data gathering ............', & !69
& 'Message Passage: nesting point data gathering ....', & !70
& 'Message Passage: nesting array data gathering ....', & !71
& 'Message Passage: synchronization barrier .........', & !72
& 'Message Passage: multi-model coupling ............'/) !73
Notice that we now have 73 regions, including 14 unused regions reserved for later use. We need to separate the Message Passage (MPI) regions from the rest of the algorithms, and it was tedious to renumber all of them. The MPI regions need to be located at indices Mregion=60 to Nregion=72. In wclock_off, we have:
# ifdef DISTRIBUTE
DO imodel=1,4
DO iregion=Mregion,Nregion
...
END DO
END DO
# endif
to process all the MPI regions. Notice that region indices 36, 39, 40, 41, 42, 43, 44, 45, 70, 71, and 72 are the new regions introduced here to help the profiling and identify bottleneck areas.
- There are two additional arguments to routines wclock_on and wclock_off:
SUBROUTINE wclock_on (ng, model, region, line, routine)
SUBROUTINE wclock_off (ng, model, region, line, routine)
so in the calling routine, we have for example:
CALL wclock_on (ng, iNLM, 9, __LINE__, __FILE__)
CALL wclock_off (ng, iNLM, 9, __LINE__, __FILE__)
and the C-preprocessing code will yield:
CALL wclock_on (ng, iNLM, 9, 39, "ROMS/Nonlinear/step2d_LF_AM3.h")
CALL wclock_off (ng, iNLM, 9, 116, "ROMS/Nonlinear/step2d_LF_AM3.h")
The new arguments line and routine will be used in the future for more elaborate profiling with third-party libraries.
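The bookkeeping behind these calls can be sketched in a few lines of plain Python (not ROMS code; the variable names only loosely mirror the ROMS counters): each region index accumulates the wall-clock time elapsed between the matching on/off calls, and the line/routine arguments are simply carried along.

```python
# Conceptual sketch of the wclock_on / wclock_off region profiler:
# per-region start times and accumulated elapsed seconds.

import time

Cstr = {}   # region -> start time of the currently open interval
Csum = {}   # region -> accumulated elapsed seconds

def wclock_on(region, line=None, routine=None):
    Cstr[region] = time.perf_counter()

def wclock_off(region, line=None, routine=None):
    Csum[region] = Csum.get(region, 0.0) + time.perf_counter() - Cstr.pop(region)

wclock_on(9, line=39, routine="step2d.h")    # enter the 2D-kernel region
time.sleep(0.01)                             # stand-in for model work
wclock_off(9, line=116, routine="step2d.h")  # leave the region

print(f"region 9 charged {Csum[9]:.3f} s")
```

Summing Csum over all regions and comparing against the total elapsed time is what produces the "residual, non-profiled code" line in the report below.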
Below are the profiling statistics for a 5-day simulation with two nested grids using the low-level mpi_irecv/mpi_isend/mpi_bcast in routines mp_assemble and mp_collect. The simulation was run on my latest Mac on 4 CPUs.
Elapsed CPU time (seconds):
Node # 0 CPU: 4150.677
Node # 3 CPU: 4209.386
Node # 1 CPU: 4209.369
Node # 2 CPU: 4209.324
Total: 16778.756
Nonlinear model elapsed CPU time profile, Grid: 01
Allocation and array initialization .............. 1.185 ( 0.0071 %)
Ocean state initialization ....................... 0.837 ( 0.0050 %)
Reading of input data ............................ 63.005 ( 0.3755 %)
Processing of input data ......................... 27.495 ( 0.1639 %)
Processing of output time averaged data .......... 91.075 ( 0.5428 %)
Computation of vertical boundary conditions ...... 0.636 ( 0.0038 %)
Computation of global information integrals ...... 18.845 ( 0.1123 %)
Writing of output data ........................... 103.600 ( 0.6174 %)
Model 2D kernel .................................. 405.983 ( 2.4196 %)
Tidal forcing .................................... 23.336 ( 0.1391 %)
2D/3D coupling, vertical metrics ................. 58.658 ( 0.3496 %)
Omega vertical velocity .......................... 35.253 ( 0.2101 %)
Equation of state for seawater ................... 44.661 ( 0.2662 %)
Atmosphere-Ocean bulk flux parameterization ...... 53.266 ( 0.3175 %)
GLS vertical mixing parameterization ............. 851.479 ( 5.0747 %)
3D equations right-side terms .................... 62.280 ( 0.3712 %)
3D equations predictor step ...................... 148.728 ( 0.8864 %)
Pressure gradient ................................ 45.963 ( 0.2739 %)
Harmonic mixing of tracers, geopotentials ........ 96.407 ( 0.5746 %)
Harmonic stress tensor, S-surfaces ............... 38.824 ( 0.2314 %)
Corrector time-step for 3D momentum .............. 79.510 ( 0.4739 %)
Corrector time-step for tracers .................. 105.353 ( 0.6279 %)
Nesting algorithm ................................ 205.556 ( 1.2251 %)
Reading model state vector ....................... 0.785 ( 0.0047 %)
Total: 2562.721 15.2736
Nonlinear model message Passage profile, Grid: 01
Message Passage: 2D halo exchanges ............... 51.440 ( 0.3066 %)
Message Passage: 3D halo exchanges ............... 93.536 ( 0.5575 %)
Message Passage: 4D halo exchanges ............... 36.680 ( 0.2186 %)
Message Passage: data broadcast .................. 117.041 ( 0.6976 %)
Message Passage: data reduction .................. 1.395 ( 0.0083 %)
Message Passage: data gathering .................. 20.711 ( 0.1234 %)
Message Passage: data scattering.................. 0.912 ( 0.0054 %)
Message Passage: boundary data gathering ......... 0.904 ( 0.0054 %)
Message Passage: point data gathering ............ 0.573 ( 0.0034 %)
Message Passage: nesting point data gathering .... 708.861 ( 4.2248 %)
Total: 1032.054 6.1510
Nonlinear model elapsed CPU time profile, Grid: 02
Allocation and array initialization .............. 1.185 ( 0.0071 %)
Ocean state initialization ....................... 0.851 ( 0.0051 %)
Reading of input data ............................ 6.918 ( 0.0412 %)
Processing of input data ......................... 24.180 ( 0.1441 %)
Processing of output time averaged data .......... 610.139 ( 3.6364 %)
Computation of vertical boundary conditions ...... 3.645 ( 0.0217 %)
Computation of global information integrals ...... 93.566 ( 0.5576 %)
Writing of output data ........................... 187.852 ( 1.1196 %)
Model 2D kernel .................................. 2680.264 (15.9742 %)
Tidal forcing .................................... 0.038 ( 0.0002 %)
2D/3D coupling, vertical metrics ................. 175.925 ( 1.0485 %)
Omega vertical velocity .......................... 131.463 ( 0.7835 %)
Equation of state for seawater ................... 213.727 ( 1.2738 %)
Atmosphere-Ocean bulk flux parameterization ...... 274.219 ( 1.6343 %)
GLS vertical mixing parameterization ............. 4496.748 (26.8002 %)
3D equations right-side terms .................... 414.284 ( 2.4691 %)
3D equations predictor step ...................... 758.085 ( 4.5181 %)
Pressure gradient ................................ 242.797 ( 1.4471 %)
Harmonic mixing of tracers, geopotentials ........ 503.073 ( 2.9983 %)
Harmonic stress tensor, S-surfaces ............... 219.733 ( 1.3096 %)
Corrector time-step for 3D momentum .............. 362.818 ( 2.1624 %)
Corrector time-step for tracers .................. 418.174 ( 2.4923 %)
Nesting algorithm ................................ 842.104 ( 5.0189 %)
Reading model state vector ....................... 1.443 ( 0.0086 %)
Total: 12663.231 75.4718
Nonlinear model message Passage profile, Grid: 02
Message Passage: 2D halo exchanges ............... 223.769 ( 1.3336 %)
Message Passage: 3D halo exchanges ............... 254.173 ( 1.5149 %)
Message Passage: 4D halo exchanges ............... 113.998 ( 0.6794 %)
Message Passage: data broadcast .................. 115.124 ( 0.6861 %)
Message Passage: data reduction .................. 5.276 ( 0.0314 %)
Message Passage: data gathering .................. 34.649 ( 0.2065 %)
Message Passage: data scattering.................. 0.336 ( 0.0020 %)
Message Passage: point data gathering ............ 0.348 ( 0.0021 %)
Message Passage: nesting point data gathering .... 565.746 ( 3.3718 %)
Message Passage: nesting array data gathering .... 485.481 ( 2.8934 %)
Total: 1798.901 10.7213
Unique code regions profiled ..................... 15225.952 90.7454 %
Residual, non-profiled code ...................... 1552.804 9.2546 %
All percentages are with respect to total time = 16778.756
Notice that the most expensive algorithms in this particular profiling are the GLS vertical mixing parameterization (31.6%), the 2D kernel (19.2%), and nesting (6.2%). I don't think that there is much we can do about the GLS since it involves several fractional powers that are very expensive to compute.
On average, the low-level MPI functions yield code that is 1-3% faster than using mpi_allreduce in mp_assemble and mp_aggregate, and 6-9% faster than using mpi_allgather in the same routines.
One must be careful when examining these numbers: they depend on the computer hardware, compiler, number of parallel nodes, intra-node connectivity, node speed, and so on. We always need to investigate the optimal number of nodes for a particular ROMS application, since too many nodes may slow the computations because of the communication overhead.
Therefore, the default is to use the low-level MPI functions in routines mp_assemble, mp_aggregate, and mp_collect. See top of distribute.F.
|
Description |
WARNING:
The standard input script ocean.in was modified and includes new configuration parameters.
This ticket includes two critical updates to ROMS: tidal processing and an upper limiter on the vertical diffusion and viscosity computed from the mixing parameterizations.
For example, in a three-grid nesting application we just need:
NFFILES == 8 ! number of unique forcing files
FRCNAME == ../om/lwrad_down_nam_3hourly_MAB_and_GoM_2014.nc |
../om/lwrad_down_nam_3hourly_MAB_and_GoM_2015.nc \
../om/Pair_nam_3hourly_MAB_and_GoM_2014.nc |
../om/Pair_nam_3hourly_MAB_and_GoM_2015.nc \
../om/Qair_nam_3hourly_MAB_and_GoM_2014.nc |
../om/Qair_nam_3hourly_MAB_and_GoM_2015.nc \
../om/rain_nam_3hourly_MAB_and_GoM_2014.nc |
../om/rain_nam_3hourly_MAB_and_GoM_2015.nc \
../om/swrad_daily_nam_3hourly_MAB_and_GoM_2014.nc |
../om/swrad_daily_nam_3hourly_MAB_and_GoM_2015.nc \
../om/Tair_nam_3hourly_MAB_and_GoM_2014.nc |
../om/Tair_nam_3hourly_MAB_and_GoM_2015.nc \
../om/Uwind_nam_3hourly_MAB_and_GoM_2014.nc |
../om/Uwind_nam_3hourly_MAB_and_GoM_2015.nc \
../om/Vwind_nam_3hourly_MAB_and_GoM_2014.nc |
../om/Vwind_nam_3hourly_MAB_and_GoM_2015.nc
instead of three copies of this set of atmospheric forcing files.
Notice that some idealized applications may have a single file for all the forcing fields (say, atmospheric and tidal data). In that case, we just need to specify the same filename in both the TIDENAME and FRCNAME keywords. ROMS will read the fields that it needs in each case.
- Introduced switches LprocessTides and LprocessOBC to determine which nested grids need input tidal data and open boundary condition data. These switches are set in inp_par.F:
!
! Determine the switch to process input open boundary conditions data.
!
! In nesting applications, the lateral boundary conditions data is
! needed only by the main coarser grid (RefineScale(ng)=0).
!
DO ng=1,Ngrids
IF (.not.(RefinedGrid(ng).and.RefineScale(ng).gt.0)) THEN
LprocessOBC(ng)=.TRUE.
END IF
END DO
#if defined SSH_TIDES || defined UV_TIDES
!
! Determine the switch to process input tidal forcing data.
!
! In nesting applications, the tides are processed only by the main
! coarser grid (RefineScale(ng)=0) and the other grids get tidal
! forcing from the contact areas.
!
DO ng=1,Ngrids
IF (.not.(RefinedGrid(ng).and.RefineScale(ng).gt.0)) THEN
LprocessTides(ng)=.TRUE.
END IF
END DO
#endif
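The selection rule in both loops is the same and is easy to state: only grids that are not refined children process the input data. A small sketch in plain Python (not ROMS code; the grid layout and refinement ratios are made-up values) of that rule:

```python
# Conceptual sketch of the LprocessOBC / LprocessTides logic: a grid
# reads boundary/tidal input data only when it is NOT a refined child
# grid, i.e. unless RefinedGrid is true AND RefineScale > 0.

def process_switches(refined_grid, refine_scale):
    """Return per-grid switches; True means the grid reads the input data."""
    return [not (refined and scale > 0)
            for refined, scale in zip(refined_grid, refine_scale)]

# Three-grid example: grid 1 is the main coarse grid, grids 2-3 are
# refined children with hypothetical refinement ratios 3 and 5.
switches = process_switches([False, True, True], [0, 3, 5])
print(switches)   # [True, False, False]
```

Only the main coarse grid gets True; the refined children receive their boundary and tidal information through the contact areas instead.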
- Introduced C-preprocessing options LIMIT_VDIFF and LIMIT_VVISC to limit the upper values of vertical diffusion and vertical viscosity, respectively, computed from the vertical mixing parameterizations (GLS_MIXING, LMD_MIXING, and MY25_MIXING). Sometimes these parameterizations yield unrealistically high mixing values under stable water-column conditions, and the threshold variables Akt_limit and Akv_limit are used as upper limiters. However, if the water column is unstable, high values of vertical mixing are needed to eliminate vertical static instability numerically, so please use these limiters with caution. The standard input script ocean.in was modified to include these upper limiting values:
! Upper threshold values to limit vertical mixing coefficients computed
! from vertical mixing parameterizations. Although this is an engineering
! fix, the vertical mixing values inferred from ocean observations are
! rarely higher than this upper limit value.
AKT_LIMIT == 1.0d-3 1.0d-3 ! m2/s
AKV_LIMIT == 1.0d-3 ! m2/s
For example, in gls_corstep.F we have:
# if defined LIMIT_VDIFF || defined LIMIT_VVISC
!
! Limit vertical mixing coefficients with the upper threshold value.
! This is an engineering fix, justified by the fact that vertical
! mixing values inferred from indirect ocean observations are not
! higher than the threshold value.
!
DO k=0,N(ng)
# ifdef LIMIT_VDIFF
DO itrc=1,NAT
Akt(i,j,k,itrc)=MIN(Akt_limit(itrc,ng), Akt(i,j,k,itrc))
END DO
# endif
# ifdef LIMIT_VVISC
Akv(i,j,k)=MIN(Akv_limit(ng), Akv(i,j,k))
# endif
END DO
# endif
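The clamping itself is a simple elementwise MIN against the configured threshold. A minimal sketch in plain Python (not ROMS code; the column values are invented to show an overshoot):

```python
# Conceptual sketch of the LIMIT_VDIFF / LIMIT_VVISC clamping: mixing
# coefficients are capped at the configured upper threshold
# (AKT_LIMIT / AKV_LIMIT, here 1.0e-3 m2/s as in ocean.in).

AKV_LIMIT = 1.0e-3   # m2/s, from the standard input script

def limit_mixing(akv_column, limit=AKV_LIMIT):
    """Apply the upper limiter to a vertical column of coefficients."""
    return [min(limit, akv) for akv in akv_column]

# A column where the parameterization overshot at two interior levels:
column = [1.0e-5, 5.0e-4, 7.0e-2, 2.0e-1, 9.0e-4]
print(limit_mixing(column))
# [1e-05, 0.0005, 0.001, 0.001, 0.0009]
```

Values already below the threshold pass through unchanged, so the limiter only acts on the overshooting levels.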
My experience with the vertical mixing parameterizations is that they often yield high vertical diffusion/viscosity values in trouble spots. Usually, their values are 3 to 4 orders of magnitude greater than those inferred from ocean observations. I have talked with various observationalists at meetings, and the consensus is that vertical diffusion/viscosity in the ocean rarely exceeds a nominal value of 0.001 m2/s. I have been worried about this for years, and I mentioned it to Hans Burchard when he visited us several years ago. Apparently, the turbulence algorithm goes unstable. I suspected that this was due to the nonlinear 3D advection of the turbulent state variables. Hans mentioned that they rarely activate 3D advection in GOTM (the General Ocean Turbulence Model). I tried to turn off the 3D advection in the GLS with CPP options, but the model blew up. I didn't spend much time debugging; it is still an item on my extensive priority list.
|