ROMS Parallel Framework:
- There are four different modes of computations in ROMS: (i) serial, (ii) serial with tile partitions, (iii) parallel with the shared-memory (OpenMP) protocol, and (iv) parallel with the distributed-memory (MPI, MPI2, OpenMPI) protocols. Mixed shared- and distributed-memory computations are not possible neither available in ROMS.
- ROMS has a coarse-grained parallelization design that allows the application domain to be partitioned into horizontal tiles. There is not vertical tiling in ROMS. Each tile is operated on by different parallel node or thread accelerating the computations substantially. The number of parallel tiles in the I- and J-directions are specified in NtileI and NtileJ standard input parameters. Each tile contains the physical points ranging from Istr:Iend, Jstr:Jend and Kstr:Kend plus the ghost (halo) points (usually, Nghost=2 but sometimes Nghost=3) that allow the computations of the horizontal gradients within each tile, privately. The value of Kstr is either 0 (W-points) or 1 (RHO-points) and Kend is N (number of vertical levels) always.
The extended bounds (labelled by suffix R) are designed to cover also the outer grid points (outlined above with :), if the subdomain tile is adjacent to the physical boundary (outlined above with +). Notice that IstrR, IendR, JstrR, JendR are only relevant for the tiles next to the domain boundary.
Code: Select all
Left/Top Tile Right/Top Tile JendR r..u..r..u..r..u..r..u * * * * u..r..u..r..u..r..u..r : Istr Iend Istr Iend : v p++v++p++v++p++v++p * * Jend * * p++v++p++v++p++v++p v : + | | | | | | + : r u r u r u r u * * * * u r u r u r u r : + | | | | | | + : v p--v--p--v--p--v--p * * * * p--v--p--v--p--v--p v : + | | | | | | + : r u r u r u r u * * * * u r u r u r u r : + | | | | | | + : v p--v--p--v--p--v--p * * Jstr * * p--v--p--v--p--v--p v * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * IstrR IstrU IendR * * * * * * * * * * * Ghost Points * * * * * * * * * * * * * p--v--p--v--p--v--p * * Jend IstrR=Istr | | | | IstrT=Istr Interior * * u r u r u r u * * IstrU=Istr Tile | | | | IendR=Iend * * p--v--p--v--p--v--p * * IendT=Iend | | | | JstrR=Jstr * * u r u r u r u * * JstrT=Jstr | | | | JstrV=Jstr * * p--v--p--v--p--v--p * * Jstr JendR=Jend JendT=Jend * * * * * * * * * * * * * * * * * * * * * * Istr Iend * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * Istr Iend v p--v--p--v--p--v--p * * Jend * * p--v--p--v--p--v--p v : + | | | | | | + : r u r u r u r u * * * * u r u r u r u r : + | | | | | | + : JstrV v p--v--p--v--p--v--p * * * * p--v--p--v--p--v--p v : + | | | | | | + : r u r u r u r u * * * * u r u r u r u r : + | | | | | | + : v p++v++p++v++p++v++p * * Jstr * * p++v++p++v++p++v++p v : : r..u..r..u..r..u..r..u * * * * u..r..u..r..u..r..u..r IstrR IstrU IendR Left/Bottom Tile Right/Bottom Tile
- All the global state arrays are allocated as (LBi:UBi, LBj:UBj, ...). The values of LBi, UBi, LBj, and UBj depend on the number of ghost points and if you are running in distributed-memory or not. In distributed-memory, their values are only bounded by the tile limits. For example, it is possible to allocate the bathymetry as h(48:102, 148:182) for a tile having two ghost points (Nghost=2) and Istr=50, Iend=100, Jstr=150, Jend=180. Contrarily, in serial and shared-memory applications, the allocation values correspond to those for the full grid: LBi=0, UBi=Im+1, LBj=0, and UBj=Jm+1. Check globaldefs.h to see how these values change in periodic applications. All these parameters are computed in several routines in get_bounds.F before the state arrays are allocated.
- The private automatic arrays are dimentioned as (IminS:ImaxS, JminS,JmaxS, ...) and include additional points to compute the spatial derivatives, locally, without the need to communicate with the neighbor tiles. This is a very important feature in ROMS and vital for parallelization.
- All the computations in ROMS need to be bounded by the tile limits of Istr:Iend and Jstr:Jend. Any assigment outside of these ranges is illegal and a serious parallel bug. Recall that mostly all the computations in ROMS are inside of parallel regions. For example:
In distributed-memory, the token TILE is replaced by MyRank during C-preprocessing. Otherwise, it is replaced with the tile variable. Users should never monkey around with this logic. This strategy allow us to have both shared- and distributed-memory cohexisting in the same code.
Code: Select all
!$OMP PARALLEL DO PRIVATE(thread,subs,tile) SHARED(ng,numthreads) DO thread=0,numthreads-1 subs=NtileX(ng)*NtileE(ng)/numthreads DO tile=subs*(thread+1)-1,subs*thread,-1 CALL step2d (ng, TILE) END DO END DO !$OMP END PARALLEL DO
- ROMS provides an extensive user interface to its kernel with the analytical options in the Functionals directory. Users should be extremely carefull when adding code to these routines. All these routines are called inside of a parallel region:
So all the required input fields are either interpolated from snapshots of data from input NetCDF files or specified with analytical functions. This is all done in parallel
Code: Select all
!$OMP PARALLEL DO PRIVATE(thread,subs,tile) SHARED(ng,numthreads) DO thread=0,numthreads-1 subs=NtileX(ng)*NtileE(ng)/numthreads DO tile=subs*thread,subs*(thread+1)-1,+1 CALL set_data (ng, TILE) END DO END DO !$OMP END PARALLEL DO
- If you do not understand this parallel design and what the coding rules are, you need to avoid using the analytical options and use input NetCDF files instead It is much easier and faster to create a NetCDF files than understand all the technical aspects of parallel coding which include false sharing in OpenMP, messages broadcasting, exchange, and global reductions in MPI. By the way, PRINT, READ, and WRITE statements need to be done correctly. Otherwise, you will introduce a parallel bug or get surprising results. We are usually very busy and cannot debug your customized code. We try very hard to give you a free and full working parallel model.
- I corrected recently the routine ana_psource.h. Applying the point source in this routine is extremely difficult, as you can see. I spent several hours in the debugger making sure that this work in all the four modes of computations in ROMS. If you do not understand the coding in this routine, I recommend to you to avoid modifying this or other routines in ROMS. For input forcing, use NetCDF files instead. Even if you understand or are familiar with parallel coding, you still need to do a lot of debugging to make sure that it is coded correctly There are no exceptions to this rule. You always need to debug and test.