Opened 6 years ago

Closed 6 years ago

#789 closed upgrade (Done)

VERY IMPORTANT: ROMS single precision numerical kernel

Reported by: arango Owned by:
Priority: major Milestone: Release ROMS/TOMS 3.7
Component: Nonlinear Version: 3.7
Keywords: Cc:

Description (last modified by arango)

ROMS uses double-precision arithmetic in its numerical kernel by default. I have been asked several times what is needed to run in single precision to accelerate the computations. I always answered that it is straightforward since ROMS defines the numerical precision with KIND type parameters in mod_kinds.F globally. That is, of course, very far from reality because we need to look at the accuracy in various computations like:

  • time managing variables,
  • bounded time interpolation between snapshots,
  • vertical stretching of terrain-following coordinates,
  • weighting coefficients for time averaging of barotropic fields,
  • area and volume conservation variables,
  • time variables representation in output NetCDF files, and
  • processing of ROMS unique namelist like parameters from input scripts

Therefore, implementing a single-precision numerical kernel was not as trivial as I thought since we need to keep/compute a few variables in double precision with lots of files being modified.

What Is New:

  • A new C-preprocessing SINGLE_PRECISION is introduced to convert the numerical kernel to single precision. In globaldef.h, we now have:
    /*
    ** Turn ON/OFF double precision arithmetic in numerical kernel (default)
    ** and floating-point type variables and associated intrinsic functions.
    */
    
    #ifdef SINGLE_PRECISION
    # ifdef OUT_DOUBLE
    #   undef OUT_DOUBLE
    # endif
    # ifndef RST_SINGLE
    #   define RST_SINGLE
    # endif
    #else
    # define DOUBLE_PRECISION
    #endif
    
    Notice that if OUT_DOUBLE and RST_SINGLE are activated, they are turned off for consistency in the numerical kernel.
  • In mod_kinds.F, we now have
            integer, parameter :: i1b= SELECTED_INT_KIND(1)        !  8-bit
            integer, parameter :: i2b= SELECTED_INT_KIND(2)        !  8-bit
            integer, parameter :: i4b= SELECTED_INT_KIND(4)        ! 16-bit
            integer, parameter :: i8b= SELECTED_INT_KIND(8)        ! 32-bit
            integer, parameter :: c8 = SELECTED_REAL_KIND(6,30)    ! 32-bit
            integer, parameter :: dp = SELECTED_REAL_KIND(12,300)  ! 64-bit
            integer, parameter :: r4 = SELECTED_REAL_KIND(6,30)    ! 32-bit
    # ifdef SINGLE_PRECISION
            integer, parameter :: r8 = SELECTED_REAL_KIND(6,30)    ! 32-bit
    # else
            integer, parameter :: r8 = SELECTED_REAL_KIND(12,300)  ! 64-bit
    # endif
    
    Notice that the r8 parameter is defined for 32-bit (4-bytes) floating-point variables in single precision computations. A new parameter, dp, is introduced to keep few variable in double precision. For example, in mod_scalars.F we have:
           real(dp), allocatable :: tdays(:)                ! days
           real(dp), allocatable :: time(:)                 ! seconds
    
           real(dp), allocatable :: dt(:)                   ! seconds
           real(dp), allocatable :: dtfast(:)               ! seconds
    
           real(dp), allocatable :: TimeEnd(:)              ! seconds
           real(dp), allocatable :: AVGtime(:)              ! seconds
           real(dp), allocatable :: DIAtime(:)              ! seconds
           real(dp), allocatable :: IMPtime(:)              ! seconds
           real(dp), allocatable :: ObsTime(:)              ! seconds
           real(dp), allocatable :: FrcTime(:)              ! seconds
    
           real(dp) :: dstart = 0.0_dp                      ! days
           real(dp) :: io_time = 0.0_dp                     ! seconds
           real(dp) :: run_time = 0.0_dp                    ! seconds
           real(dp) :: tide_start = 0.0_dp                  ! days
    
  • A new module, inp_decode.F, is introduced to process and decode ROMS KeyWord parameters from input script files. It has the following module procedures:
          INTERFACE load_i
            MODULE PROCEDURE load_1d_i       ! 1D integer array
            MODULE PROCEDURE load_2d_i       ! 2D integer array
            MODULE PROCEDURE load_3d_i       ! 3D integer array
          END INTERFACE load_i
    
          INTERFACE load_l
            MODULE PROCEDURE load_1d_l       ! 1D logical array
            MODULE PROCEDURE load_2d_l       ! 2D logical array
            MODULE PROCEDURE load_3d_l       ! 3D logical array
          END INTERFACE load_l
    
          INTERFACE load_r
    #ifdef SINGLE_PRECISION
            MODULE PROCEDURE load_1d_dp      ! 1D real(dp) array
            MODULE PROCEDURE load_2d_dp      ! 2D real(dp) array
            MODULE PROCEDURE load_3d_dp      ! 3D real(dp) array
    #endif
            MODULE PROCEDURE load_1d_r8      ! 1D real(r8) array
            MODULE PROCEDURE load_2d_r8      ! 2D real(r8) array
            MODULE PROCEDURE load_3d_r8      ! 3D real(r8) array
          END INTERFACE load_r
    
    to process 1D, 2D, and 3D variables since we are using the strong typing properties of modules. Before we didn't use modules because all the arrays arguments were reshaped to 1D arrays (F77 style). This part is very tricky but more robust since we need to consider single and double precision variables that are written to input scripts with 8-bytes double representation. I considered polymorphic objects (available in Fortran 2003, 2008, 2018), but I rejected it because of backward compatibility with older Fortran compilers.

WARNING: The input script processing routines (read_asspar.F, read_biopar.F, read_phypar.F, and read_stapar.F) were modified, and the functions load_i, load_l, and load_r have a different number of arguments depending if we are processing 1D, 2D, or 3D variables. For Example for load_r we can have:

           CASE ('DT')
             Npts=load_r(Nval, Rval, Ngrids, dt)
           CASE ('TNU2')
             Npts=load_r(Nval, Rval, NAT+NPT, Ngrids, Rtracer)
           CASE ('HdecayB(isTvar)')
             Npts=load_r(Nval, Rval, 4, MT, Ngrids, Rboundary)

for processing dt(:), Rtracer(:,:,:), and Rboundary(:,:,:) respectively.


NOTICE: on principle single-precision computations run twice as fast as double-precision computations. However, since some double precision computations are carried out for accuracy in SINGLE_PRECISION solutions the gain is less than 50 percent. For example, in a 3-nested grid application on 4 CPUs, I get

Elapsed CPU time (seconds):

Node   #    0 CPU:    4923.242
Node   #    1 CPU:    4953.730
Node   #    2 CPU:    4953.504
Node   #    3 CPU:    4953.539
Total:               19784.015    for double-precision computations

Elapsed CPU time (seconds):

Node   #    0 CPU:    2628.964
Node   #    1 CPU:    2683.894
Node   #    3 CPU:    2683.867
Node   #    2 CPU:    2683.818
Total:               10680.543    for single-precision computations

That is, the single-precision is 46 percent faster than the double-precision simulation.

In 4D-Var simulations, this number get much smaller because the computations are dominated by I/O.

Change History (1)

comment:1 by arango, 6 years ago

Description: modified (diff)
Resolution: Done
Status: newclosed
Note: See TracTickets for help on using tickets.