Opened 4 years ago

Last modified 4 years ago

#4 closed enhancement

VERY IMPORTANT: Added Saddle-point 4D-Var — at Version 2

Reported by: arango Owned by:
Priority: major Milestone:
Component: component1 Version:
Keywords: Cc:

Description (last modified by arango)

A new 4D-Var algorithm is added based on the weak constraint, Saddle-point (SP4D-Var) formulation of Fisher and Furol (2017). The tangent linear model (TLM) and adjoint model (ADM) trajectories are divided into a Nsaddle set of short time integration windows, so the 4D-Var algorithm may be parallelized in time and accelerated. The TLM and ADM can be run concurrently when the ROMS distributed-memory (MPI) communicator object is split into Nsaddle disjointed subgroups. The saddle-point cost function is minimized using the Generalized Minimal Residual (GMRES) method.

Reference:

Fisher, M. and S. Gurol, 2017: Parallelization in the time dimension of four-dimensional variational data assimilation, Q. J. E. Meteorol. Soc., 142, 1136-1147, DOI:10.1002/qj.2997

Many thanks to Andy Moore for his immense help in coding and testing the SP4D-Var algorithm. Also, many thanks to Mike Fisher (ECMWF) and Selime Gürol (CERFACS) for developing the methodology. Many thanks to Anthony Weaver for hosting us at CERFACS, Toulouse. As always, Selime is awesome in explaining the algorithm and its preconditioning.


Technical Details:

In concurrent setup (MPI plus DISJOINTED activated), the ROMS full communicator (OCN_COMM_WORLD save as FULL_COMM_WORLD) object is split into several disjointed subgroups according to the standard input parameter Nsaddle. Notice that OCN_COMM_WORLD is the same object as MPI_COMM_WORLD if no multi-model coupling. The splitting into sub-communicators (FORK_COMM_WORLD) is based on the color and key parameters.

In routine split_communicator of module mod_parallel.F, we have:

!
!  Save full communicator (starting handle) and inquire about its size
!  and rank.
!
      FULL_COMM_WORLD=OCN_COMM_WORLD
      CALL mpi_comm_rank (FULL_COMM_WORLD, FullRank, MyError)
      CALL mpi_comm_size (FULL_COMM_WORLD, FullSize, MyError)
      FullMaster=FullRank.eq.MyMaster
!
!  Split the full communicator into sub-communicators based on color and
!  key. It is a collective operation.
!
!  For example, if FullSize=8 and Nsplit=2, we get GroupSize=4 and:
!
!  FullRank:    0    1    2    3    4    5    6    7    FULL_COMM_WORLD
!  MyColor:     0    0    0    0    1    1    1    1    split criteria
!  MyKey:       0    1    2    3    4    5    6    7    same as FullRank
!  MyRank:      0    1    2    3    0    1    2    3    FORK_COMM_WORLD
!
!
!  The color determines in which of the sub-communicators the current
!  process (spawn by FULL_COMM_WORLD) will fall.
!
!  The key argument determines the rank ordering within the split
!  communicator. The process that passes in the smallest value for
!  key will be rank 0, the next smallest will be rank 1, and so on.
!  Here, the key is set to the rank of the full communicator since we
!  want all of the processes in the split communicator to be in the
!  same order that they were in the full communicator.
!
!  The resulting communicator, FORK_COMM_WORLD, has the same handle
!  value (context ID) in all sub-communicators. However, each process
!  only have reference to the sub-communicator that it belongs. It is
!  an opaque object.
!
      GroupSize=FullSize/Nsplit
      MyColor=FullRank/GroupSize
      MyKey=FullRank
      CALL mpi_comm_split (FULL_COMM_WORLD, MyColor, MyKey,            &
     &                     FORK_COMM_WORLD, MyError)
      IF (MyError.ne.MPI_SUCCESS) THEN
        CALL mpi_error_string (MyError, string, Lstr, Serror)
        WRITE (stdout,10) 'MPI_COMM_SPLIT', FullRank, MyError,         &
     &                    TRIM(string)
        exit_flag=2
        RETURN
      END IF

In the saddle_point.h driver, we have concurrent regions or not. The routine assign_communicator is called once in ROMS_initialize:

#if defined DISTRIBUTE && defined DISJOINTED
!
!  Slip distributed-memory communicator, so the saddle-point 4D-Var
!  time windows are solved concurrently. Then, initialize parallel
!  control switches.
!
        CALL getpar_i (MyRank, 'Nsaddle', Nsaddle) 
        IF (FoundError(exit_flag, NoError, __LINE__,                    &
     &                 __FILE__)) RETURN

        CALL split_communicator (Nsaddle)
        IF (FoundError(exit_flag, NoError, __LINE__,                    &
     &                 __FILE__)) RETURN

        CALL assign_communicator ('CONCURRENT')
#else
!
!  Initialize parallel control switches. These scalars switches are
!  independent from standard input parameters.
!
        CALL initialize_parallel
#endif

Here, assign_communicator (choice) is set to concurrent:

!
!  Set communicator handle and get its size and rank.
!
      SELECT CASE (TRIM(choice))
        CASE ('FORK', 'SPLIT', 'CONCURRENT')
          OCN_COMM_WORLD=FORK_COMM_WORLD
          CALL mpi_comm_size (OCN_COMM_WORLD, numthreads, MyError)
          CALL mpi_comm_rank (OCN_COMM_WORLD, MyRank, MyError)
          Master=MyRank.eq.MyMaster
        CASE ('FULL')
          OCN_COMM_WORLD=FULL_COMM_WORLD
          CALL mpi_comm_size (OCN_COMM_WORLD, numthreads, MyError)
          CALL mpi_comm_rank (OCN_COMM_WORLD, MyRank, MyError)
          Master=MyRank.eq.MyMaster
      END SELECT
      CALL mpi_barrier (OCN_COMM_WORLD, MyError)
!
!  Initialize parallel control switches.
!
# ifdef PARALLEL_IO
      InpThread=.TRUE.
      OutThread=.TRUE.
# else
      IF (MyRank.eq.0) THEN
        InpThread=.TRUE.
        OutThread=.TRUE.
      ELSE
        InpThread=.FALSE.
        OutThread=.FALSE.
      END IF
# endif

One can determine which sections of the code are computed by a subset of communicator by using the FullRank parameter:

      IF (FullRank.le.(GroupSize-1)) THEN
#ifdef SOLVE3D
        CALL main3d (RunInterval)
#else
        CALL main2d (RunInterval)
#endif
      END IF
      DO ng=1,Ngrids
        IF ('''GlobalError'''(ng, iNLM, exit_flag, NoError, __LINE__,         &
     &                  __FILE__)) RETURN
      END DO

Here, only the processes in the first subset of communicators are running the nonlinear model (say, main3d) while the rest of the processes are waiting in the call to GlobalError, which have an MPI call that serves as a barrier:

#ifdef DISTRIBUTE
# ifdef DISJOINTED
      MyCOMM=FULL_COMM_WORLD
      MasterProcess=FullRank.eq.0
# else
      MyCOMM=OCN_COMM_WORLD
      MasterProcess=MyRank.eq.0
# endif
#else
      MasterProcess=Master
#endif 
#if defined DISTRIBUTE && defined DISJOINTED
!
      CALL mp_bcasti (ng, model, flag, MyCOMM)
#endif

As you can see, this is a very complex algorithm both in algebra and parallelism.

http://www.myroms.org/trac/SaddlePoint.png

As shown in the above figure, the algorithm, for example, is run for four days from Jan 3 to Jan 7. The time integration is divided into four intervals of 24 hours in length. The ROMS MPI communicator is split into four-time (Nsaddle=4) sections of different colors. Each disjointed color has the same number of processes NtileI(ng)*NtileJ(ng) and runs the tangent linear and adjoint models concurrently on each interval independent of each other. The nonlinear model before the outer loop is run in the first color for the full period of Jan 3 to Jan 7. However, the nonlinear model at the end of the inner loops is run concurrently by intervals by all colors.

If NtileI=2, NtileJ=4, and Nsaddle=4 the execution is started as:

mpirun -np 32 romsM roms.in > & err &

That is, we need 32 processes to run the application (NtileI*NtileJ*Nsaddle). It generates log00.roms, log01.roms, log02.roms, log03.roms standard output information files. Now, if we want to run the tangent linear and adjoint models simultaneously, we need 64 processors instead. This capability is under development and available soon.


Relevant C-Preprocessing Options:

  • SP4DVAR: Option to activate Weak Contrain, Saddle-point $D-Var.
  • DISJOINTED. Option to split the MPI communicator object (OCN_COMM_WORLD) into disjointed Nsaddle subgroups in concurrent computations.
  • GENERIC_DSTART: Avoid using dstart as initialization time in the ADM, TLM, and RPM. See src:ticket:825 for more information.
  • ROMS_STDOUT: Option to write ROMS standard output information into a file instead of a screen dump.

What Is New:

The SP4DVAR algorithm includes the following new files:

  • ROMS/Drivers/saddle_point.h: Saddle-point 4D-Var data assimilation main driver.
  • ROMS/Utility/ADfromTL.F: Initialized the adjoint model state arrays with the tangent linear model state values.
  • ROMS/Utility/atimesv.F: Assembles all of the components of the operation A*v where A is the saddle-point matrix and v is the previous Arnoldi vectors. Specifically:
                 (  D    0    L )                                        
             A = (  0    R    H )                                        
                 (  L'   H'   0 )                                        
                                                                         
     where   D = the background and model error covariance matrix        
             L = is the F operator in Fisher and Gurol (2017)            
             R = the observation error covariance matrix
             H = the TL model sampled the obs points
    
  • ROMS/Utility/def_state.F Creates an ocean state NetCDF file according to input derived (TYPE T_IO) structure S. It defines its dimensions, attributes, and variables.
  • ROMS/Utility/gmres.F: minimizes saddle-point 4D-Var split cost functions using a preconditioned Generalized Minimal Residual (GMRES) method. For more information about the weak constraint, SP4D-Var methodology check Fisher and Gurol (2017).
  • ROMS/Utility/precsp.F: Applies the preconditioning for the saddle-point 4D-Var following the approach of Fisher and Gurol (2017), Eq 37.
  • ROMS/Utility/state_join.F: reads and writes ROMS NLM background state solution for specified input and output records. Primarily, it is used to concatenate the NLM solution from concurrent interval window history NetCDF files. The resulting solution is used to linearize the tangent linear and adjoint kernels.
  • ROMS/Utility/state_read.F: Generic routine to read the ROMS ocean state from requested NetCDF file.
  • ROMS/Utility/tracing.F: Traces the sequence of ROMS calls for debugging purposes. It is activated when the new standard input parameter Ltracing is set to T.
    ! Debugging information.
    
        Ltracing  = F                          ! reports the sequence of routine calls
    
  • ROMS/Utility/wrt_state.F: Writes requested tangent linear or adjoint state fields into NetCDF file described by TYPE T_IO structure S. The NetCDF is opened and closed, and S(ng)%ncid is not updated

Change History (2)

comment:1 by arango, 4 years ago

Description: modified (diff)

comment:2 by arango, 4 years ago

Description: modified (diff)
Note: See TracTickets for help on using tickets.