Description |
John Warner brought to my attention that the nesting algorithms were computing the vertical interpolation weights (Vweight) at every timestep for each nested grid.
Currently, the vertical interpolation weights are only used in composite grids because their grids are usually not coincident. They are not needed in refinement grids because the donor and receiver grids have the same number of vertical levels and matching bathymetry. However, configurations that require vertical interpolation weights in refinement may appear in the future. The new switch get_Vweights controls whether such weights are computed. When false, it accelerates computations because fewer distributed-memory communications are needed.
Therefore, in nesting.F we now have:
      IF ((isection.eq.nzwgt).and.get_Vweights) THEN
        DO tile=last_tile(ng),first_tile(ng),-1
          CALL z_weights (ng, model, tile)
        END DO
        RETURN
      END IF
Here, get_Vweights is a new logical switch added to mod_nesting.F. It is initialized in set_contact.F as:
!
!  Set the switch to compute vertical interpolation weights. Currently,
!  they are only needed in non-coincident composite grids.
!
      IF (.not.ANY(Lcoincident).and.ANY(Lcomposite)) THEN
        get_Vweights=.TRUE.
      ELSE
        get_Vweights=.FALSE.
      END IF
where the coincident and composite variables are set in the nested grid configuration NetCDF file and computed by the Matlab script contact.m. A small, runnable sketch of this logic is included below.
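For illustration only, here is a self-contained sketch of how the new switch is set and then used as a guard. The Lcoincident and Lcomposite arrays below are stand-ins for the per-contact-region switches read from the contact NetCDF file; the actual code lives in set_contact.F and nesting.F.

      PROGRAM vweights_switch_sketch
      IMPLICIT NONE
      INTEGER, PARAMETER :: Ncontact = 2
      LOGICAL :: Lcoincident(Ncontact) = .FALSE.
      LOGICAL :: Lcomposite (Ncontact) = .TRUE.
      LOGICAL :: get_Vweights
!
!  Vertical interpolation weights are only needed for non-coincident
!  composite grids.
!
      IF (.not.ANY(Lcoincident).and.ANY(Lcomposite)) THEN
        get_Vweights=.TRUE.
      ELSE
        get_Vweights=.FALSE.
      END IF
!
!  Guard the expensive weight computation with the new switch.
!
      IF (get_Vweights) THEN
        PRINT '(a)', 'computing vertical interpolation weights (z_weights)'
      ELSE
        PRINT '(a)', 'skipping vertical interpolation weights'
      END IF
      END PROGRAM vweights_switch_sketch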
Profiling:
Several runs were made to measure the improvement in nested applications with refinement grids (no composite grids):
- The LAKE_JERSEY test case (grids a and d) with one refinement grid runs 21.11% faster on 12 processes.
- The LAKE_JERSEY test case (grids a, c, d, and e) with three refinement grids runs 36.83% faster on 12 processes.
- Our operational US East Coast application (grids DOPPIO, PIONEER, and ARRAY) with two telescoping refinement grids runs 24.77% faster on 12 processes.
That's quite an improvement for such a simple change in the code. Obviously, as the number of nested grids increases, the improvement is greater.
|
Description |
Several files are updated to facilitate the splitting of the 4D-Var algorithms into multiple executables. The split 4D-Var drivers will be released in the future, once they are thoroughly tested. However, the infrastructure needed for split 4D-Var data assimilation is released here to synchronize all the research repositories and avoid version conflicts.
The 4D-Var drivers are split into their elementary phase routines: prior_error, background, increment, analysis, posterior_analysis, and posterior_error.
For example, Executable A can compute the nonlinear trajectory (background and analysis phases) needed to linearize the tangent linear and adjoint models in the 4D-Var inner-loop iterations, during the minimization of the cost function. This allows the nonlinear trajectory to be part of a coupling system and/or include nested grids. Then, Executable B is used to compute the control state increment by minimizing the cost function over Ninner loops. In this phase, it is possible to use a coarser grid resolution to accelerate the computations. If so, the finer background trajectory needs to be interpolated onto the coarser grid and, at the end of the inner loops, the coarse-grid increment is interpolated back to the finer grid. The increment phase may also be run at a lower precision. A conceptual sketch of this phase sequencing is included below.
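The following self-contained sketch only illustrates the phase sequencing described above; the print-only stubs stand in for the actual ROMS phase routines, Nouter is illustrative, and in the split strategy the different executables exchange information through NetCDF files.

      PROGRAM split_4dvar_sketch
      IMPLICIT NONE
      INTEGER, PARAMETER :: Nouter = 2       ! illustrative number of outer loops
      INTEGER :: outer
      CALL phase ('prior_error')
      CALL phase ('background          (Executable A: nonlinear trajectory)')
      DO outer=1,Nouter
        CALL phase ('increment           (Executable B: minimize cost function over Ninner loops)')
        CALL phase ('analysis            (Executable A: update nonlinear trajectory)')
      END DO
      CALL phase ('posterior_analysis')
      CALL phase ('posterior_error')
      CONTAINS
        SUBROUTINE phase (msg)
        CHARACTER (len=*), INTENT(in) :: msg
        PRINT '(a)', TRIM(msg)
        END SUBROUTINE phase
      END PROGRAM split_4dvar_sketch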
Therefore, many changes are needed in ROMS to achieve this 4D-Var data assimilation strategy:
- A new subroutine set_grid.F is added to set up the application grid configuration during initialization. For example, in initial.F we have:
!
!-----------------------------------------------------------------------
!  Set application grid, metrics, and associated variables and
!  parameters.
!-----------------------------------------------------------------------
!
      DO ng=1,Ngrids
        IF (SetGridConfig(ng)) THEN
          CALL set_grid (ng, iNLM)
          SetGridConfig(ng)=.FALSE.
        END IF
      END DO
which includes calls to ana_grid or get_grid, set_scoord, set_weights, metrics, nesting, ana_wtype, and ana_nudgcoef or get_nudgcoef. The new switch SetGridConfig is used to avoid repetitive calls during iterative algorithms or drivers; set_grid only needs to be called once during execution. A rough outline of its call sequence is sketched below.
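The following self-contained outline only mirrors the call sequence listed above. The print-only stubs stand in for the actual ROMS routines, the one-line descriptions are approximate, and the relevant CPP guards (ANA_GRID, ANA_NUDGCOEF, etc.) are omitted.

      PROGRAM set_grid_sketch
      IMPLICIT NONE
      INTEGER, PARAMETER :: Ngrids = 1
      INTEGER :: ng
      LOGICAL :: SetGridConfig(Ngrids) = .TRUE.
      DO ng=1,Ngrids
        IF (SetGridConfig(ng)) THEN
          CALL set_grid (ng)
          SetGridConfig(ng)=.FALSE.          ! configure each grid only once
        END IF
      END DO
      CONTAINS
        SUBROUTINE set_grid (ng)
        INTEGER, INTENT(in) :: ng
        PRINT '(a,i2.2)', 'Configuring grid ', ng
        CALL phase ('get_grid or ana_grid:       grid arrays and bathymetry')
        CALL phase ('set_scoord:                 vertical terrain-following coordinates')
        CALL phase ('set_weights:                barotropic time-averaging weights')
        CALL phase ('metrics:                    grid metrics')
        CALL phase ('nesting:                    nested-grid contact regions')
        CALL phase ('ana_wtype:                  analytical water type (if activated)')
        CALL phase ('ana_nudgcoef or get_nudgcoef: nudging coefficients')
        END SUBROUTINE set_grid
        SUBROUTINE phase (msg)
        CHARACTER (len=*), INTENT(in) :: msg
        PRINT '(a)', TRIM(msg)
        END SUBROUTINE phase
      END PROGRAM set_grid_sketch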
- Added a new routine get_hash.F to compute the checksum during the processing of input data to facilitate comparison between ROMS solutions.
There are various algorithms to compute the checksum for character and integer data. For floating-point data, the values are interpreted as unsigned integers. The challenge is that Fortran does not support unsigned integers, so the intrinsic TRANSFER function is used instead to convert 32-bit reals to 32-bit integers. We could use the C-binding capabilities of Fortran 2003 to call a C-based checksum function; perhaps we can do that in the future. Currently, two methods are coded: the 32-bit Adler algorithm (adler32) and the 32-bit cyclic redundancy check (crc32) algorithm. The crc32 method is the default. The checksum is reported in ROMS standard output (a minimal sketch of the TRANSFER-based approach follows the samples below):
GET_GRID - bathymetry at RHO-points: h
(Grid = 01, File: ../../Data/wc13_grd.nc)
(Min = 1.00000000E+01 Max = 5.19952300E+03)
(CheckSum = -1818917438)
...
NLM: GET_STATE - Reading state initial conditions, 2004-01-03 00:00:00.00
(Grid 01, Outer=0000, t = 13008.0000, File: wc13_roms_ini_20040103.nc, Rec=0001, Index=1)
- free-surface
(Min = 5.17101820E+04 Max = 1.20459790E+05 CheckSum = 1424115731)
...
GET_2DFLD - surface u-wind component, 2004-01-03 12:30:00.00
(Grid=01, Rec=2, Index=2, File: coamps_wc13_wind.nc)
(Tmin= 13007.5208 Tmax= 13026.5208) t = 13008.5208
(Min = -5.22122431E+00 Max = 8.30806637E+00) regrid = F
(CheckSum = 14278099)
...
GET_3DFLDR - u-momentum component, 2004-01-06 16:00:00.00
(Grid=01, Rec=45, Index=1, File: wc13_fwd_20040103_outer0.nc)
(Tmin= 13008.0000 Tmax= 13012.0000) t = 13011.6667
(Min = -4.29381760E-01 Max = 5.30997271E-01)
(CheckSum = 1192629864)
This capability is activated with CPP option CHECKSUM.
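For illustration, here is a minimal, self-contained sketch of the TRANSFER-based strategy described above. All names are illustrative (this is not the get_hash.F interface), and the Adler-32 style loop operates word-by-word rather than byte-by-byte, so it will not reproduce the values reported by get_hash.F.

      PROGRAM checksum_sketch
      IMPLICIT NONE
      INTEGER, PARAMETER :: r4 = SELECTED_REAL_KIND(6,37)
      INTEGER, PARAMETER :: i4 = SELECTED_INT_KIND(9)
      INTEGER, PARAMETER :: base = 65521       ! largest prime below 2**16
      REAL(r4)    :: A(5) = (/ 1.0_r4, 2.5_r4, -3.0_r4, 4.25_r4, 0.0_r4 /)
      INTEGER(i4) :: iA(5)
      INTEGER(i4) :: s1, s2, hash
      INTEGER     :: i
!
!  Reinterpret the 32-bit floating-point bit patterns as 32-bit integers.
!
      iA=TRANSFER(A, iA)
!
!  Adler-32 style accumulation over the integer values.
!
      s1=1_i4
      s2=0_i4
      DO i=1,SIZE(iA)
        s1=MODULO(s1+MODULO(iA(i), base), base)
        s2=MODULO(s2+s1, base)
      END DO
      hash=IOR(ISHFT(s2, 16), s1)
      PRINT '(a,i0)', 'CheckSum = ', hash
      END PROGRAM checksum_sketch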
- Several input management routines (get_grid.F, get_2dfld.F, get_2dfldr.F, get_3dfld.F, get_3dfldr.F, get_nudgcoef.F, and get_state.F) were updated to report the checksum when CHECKSUM is activated. The get_ngfld.F and get_ngfldr.F routines were also modified, but the reported checksum is sometimes different even when the data are identical. It seems that the checksum value is sensitive to the array size. For now, their reporting is only available when CHECKSUM_NG is activated.
- The NetCDF reading routines (nf_fread2d.F, nf_fread2d_bry.F, nf_fread3d.F, nf_fread3d_bry.F, and nf_fread4d.F) now have an optional checksum argument to compute its value while reading the NetCDF data (a sketch of how such an optional argument can be handled follows the example):
status=nf_fread2d(ng, model, ncfile, ncid, &
& Vname(1,ifield), Vid, &
& Trec, Vtype, Vsize, &
& LBi, UBi, LBj, UBj, &
& Fscale(ifield,ng), Fmin, Fmax, &
#ifdef MASKING
& Fmask, &
#endif
& Fout(:,:,Tindex), &
#ifdef CHECKSUM
& checksum = Fhash, &
#endif
& Lregrid = Lregrid)
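The sketch below only illustrates the optional-argument pattern behind this interface. The module, routine, and hash are illustrative stand-ins, not the actual nf_fread2d/get_hash code: the "read" just fills the array and the hash is a trivial XOR rather than adler32/crc32.

      MODULE read_sketch_mod
      IMPLICIT NONE
      CONTAINS
        SUBROUTINE read_field (Fout, checksum)
        REAL,    INTENT(out)           :: Fout(:,:)
        INTEGER, INTENT(out), OPTIONAL :: checksum
        INTEGER :: iF(SIZE(Fout))
        INTEGER :: i
!
!  Stand-in for the actual NetCDF read.
!
        Fout=1.0
!
!  Compute the checksum only when the caller requests it.
!
        IF (PRESENT(checksum)) THEN
          iF=TRANSFER(Fout, iF)
          checksum=0
          DO i=1,SIZE(iF)
            checksum=IEOR(checksum, iF(i))   ! trivial stand-in for adler32/crc32
          END DO
        END IF
        END SUBROUTINE read_field
      END MODULE read_sketch_mod

      PROGRAM read_sketch
      USE read_sketch_mod
      IMPLICIT NONE
      REAL    :: F(4,3)
      INTEGER :: Fhash
      CALL read_field (F, checksum=Fhash)    ! checksum requested (CHECKSUM on)
      PRINT '(a,i0)', 'CheckSum = ', Fhash
      CALL read_field (F)                    ! checksum skipped (CHECKSUM off)
      END PROGRAM read_sketch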
- The modules i4dvar.F and rbl4dvar.F were modified to include variables that are needed for the scripted, split 4D-Var data assimilation strategy. In the unsplit algorithm, such variables are available in memory. It took a lot of time and testing to get identical solutions between the unsplit and split data assimilation applications. Also, i4dvar_ocean.h and rbl4dvar_ocean.h were updated to include the split 4D-Var logic.
- Removed shared-memory directives starting with !$OMP in several adjoint routines. As coded, the exact, discrete adjoint model cannot be run in shared memory because of memory collisions on array elements (i-1, i+1, j-1, j+1, and so on) between neighboring parallel tiles, as illustrated in the sketch below.
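The following small, non-ROMS example illustrates the adjoint scatter pattern behind this restriction: the adjoint of a simple averaging stencil accumulates contributions into neighboring elements, so two shared-memory tiles that meet at a boundary would both update the same locations.

      PROGRAM adjoint_scatter_sketch
      IMPLICIT NONE
      INTEGER, PARAMETER :: n = 8
      REAL :: ad_F(n), ad_A(n)
      INTEGER :: i
      ad_A=0.0
      ad_F=1.0
!
!  Adjoint of the forward stencil F(i)=0.5*(A(i-1)+A(i+1)): contributions
!  are scattered into A(i-1) and A(i+1).  Run serially (or in distributed
!  memory with halo exchanges) this is well defined; with OpenMP tiles it
!  is a race on the shared boundary points.
!
      DO i=2,n-1
        ad_A(i-1)=ad_A(i-1)+0.5*ad_F(i)
        ad_A(i+1)=ad_A(i+1)+0.5*ad_F(i)
      END DO
      PRINT '(a,8f5.1)', 'ad_A = ', ad_A
      END PROGRAM adjoint_scatter_sketch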
- Added new routine load_frc to frc_adjust.F to load the 4D-Var surface forcing adjustments that are written to the INI NetCDF file. A call is added to main2d.F and main3d.F:
# if defined ADJUST_STFLUX || defined ADJUST_WSTRESS
!
!-----------------------------------------------------------------------
!  Interpolate surface forcing increments and adjust surface forcing.
!  Load surface forcing into storage arrays.  Skip the last output
!  timestep.
!-----------------------------------------------------------------------
!
      DO ig=1,GridsInLayer(nl)
        ng=GridNumber(ig,nl)
        IF (iic(ng).lt.(ntend(ng)+1)) THEN
          DO tile=first_tile(ng),last_tile(ng),+1
            CALL frc_adjust (ng, tile, Lfinp(ng))
            CALL load_frc (ng, tile, Lfout(ng))
          END DO
!$OMP BARRIER
        END IF
      END DO
# endif
- Added new routines to the minimization solvers cgradient.F (cg_read) and rpcg_lanczos.F (cg_read_rpcg) to read several conjugate gradient parameters from the MOD NetCDF file in the split 4D-Var algorithms. In the unsplit case, the values of such variables are available in memory.
- Corrected a bug in obs_sen_rbl4dvar_forecast.h when computing the observation forecast impacts for HF radar radial data. The haveObsMeta switch is always false on the first call to obs_read, so ObsMeta is zero for the radials. This happens because the observation NetCDF file had already been opened somewhere prior to calling obs_initial in tl_initial (and ad_initial when appropriate). As a consequence, the observation arrays were not getting initialized properly by these routines. For now, the solution is to close the NetCDF file before each call to tl_initial and ad_initial. Many thanks to Andy Moore for tracking down and fixing this issue.
- Added the switch Lappend to append information to an existing ROMS standard output (log.roms) file in the split 4D-Var algorithms. Also, the CPP option SUPPRESS_REPORT can be activated to avoid reporting repetitive application configuration information for the split 4D-Var algorithms.
- Introduced the CPP option INITIALIZE_AUTOMATIC to initialize to zero all the local arrays in step2d_LF_AM3.h and its TLM, RPM, and ADM versions, to facilitate debugging in TotalView when comparing two different solutions side by side. Otherwise, the automatic arrays start with random garbage, as illustrated below.
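A tiny, standalone illustration of the motivation follows; the routine and array names are hypothetical, and the actual zeroing is done inside step2d_LF_AM3.h and its TLM/RPM/ADM versions.

      PROGRAM automatic_sketch
      IMPLICIT NONE
      CALL step_sketch (4)
      CONTAINS
        SUBROUTINE step_sketch (n)
        INTEGER, INTENT(in) :: n
        REAL :: work(n)                      ! automatic array: undefined on entry
#ifdef INITIALIZE_AUTOMATIC
        work=0.0                             ! deterministic contents when debugging
#endif
        work(1)=1.0                          ! the "kernel" only touches element 1
        PRINT '(a,4f12.4)', 'work = ', work
        END SUBROUTINE step_sketch
      END PROGRAM automatic_sketch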
|