This is one of the standard test cases in the ROMS model. ROMS 1 is a family of versions written in Fortran 77 and supporting shared-memory parallelism via OpenMP. The upwelling test case is a smallish 3D model run, occupying some 20 MiB of memory when compiled with 8-byte floating-point numbers (the default). To use this model as a benchmark I have taken the test case, reduced the number of time steps to 72, set the output frequency to the minimum and directed output to /tmp (which is normally a local disk or RAM disk). I therefore expect that disk I/O performance will not be a factor in the execution time. Ocean models like ROMS tend to be sensitive to memory performance because they cycle through 2D and 3D fields, doing relatively little computation on each value. However, ROMS has been specifically designed to be cache-friendly on modern RISC CPUs, so I think a smallish run like this will primarily test CPU performance, especially floating-point performance.
Machine | CPU | OS | Compiler & switches | Notes | CPU time |
---|---|---|---|---|---|
Hadfield (2001) | P3 800 MHz | Win 2000 | df (release) | | 170 |
Hadfield (2001) | P3 800 MHz | Win 2000 | df (debug) | | 630 |
Hadfield (2001) | P3 800 MHz | Win 2000 | g77 -O | | 340 |
Hadfield (2001) | P3 800 MHz | Linux | g77 -O | | 310 |
Hadfield (2003) | P4 2.67 GHz DDR 266 | Win 2000 | g77 -O3 | | 41 |
Hadfield (2003) | P4 2.67 GHz DDR 266 | Win 2000 | g95 -O3 | | 42 |
Hadfield (2003) | P4 2.67 GHz DDR 266 | Win 2000 | df /fast | | 40 |
Hadfield (2003) | P4 2.67 GHz DDR 266 | Win 2000 | df /check:bounds | | 45 |
Fargo | P3 600 MHz | Linux | g77 -O | | 300 |
Lebowski | P3 600 MHz | Linux | g77 -O | | 240 |
Duathlon | Athlon MP1800 × 2 | Linux | g77 -O | | 78 |
Duathlon | Athlon MP1800 × 2 | Linux | g77 -O | 2 concurrent runs | 110/110 |
Weinberg | P4 Xeon 2.? GHz | Linux | g77 -O | | 57 |
Weinberg | P4 Xeon 2.? GHz | Linux | g77 -O | REAL*4 | 38 |
Shuttle | Athlon XP2600 + 333FSB | Linux | g77 -O | | 49 |
Shuttle | Athlon XP2600 + 333FSB | Linux | g77 -O | REAL*4 | 36 |
Grass | P4 2.4 GHz DDR266 | Linux | g77 -O | | 41 |
Wetocean | P4 Xeon 2.4 GHz × 2 | Linux | g77 -O3 | | 36 |
Wetocean | P4 Xeon 2.4 GHz × 2 | Linux | g95 -O3 | | 37 |
Otter | P4 Xeon 2.8 GHz × 2 | Linux | g77 -O | | 41 |
Otter | P4 Xeon 2.8 GHz × 2 | Linux | f90 -O3 (Absoft) | | 41 |
Otter | P4 Xeon 2.8 GHz × 2 | Linux | g77 -O | 2 concurrent runs | 55–90 |
Otter | P4 Xeon 2.8 GHz × 2 | Linux | g77 -O | 4 concurrent runs | 140 |
Kupe | Alpha EV5 600 MHz | UNICOS/mk | f90 -O | | 220 |
Rangi | Alpha EV56 600 MHz | Digital Unix | f95 -O | | 110 |
Thor | Alpha EV67 667 MHz | Digital Unix | f95 -O | | 32 |
Thor | Alpha EV67 667 MHz | Digital Unix | f95 -O | REAL*4 | 25 |
Thor | Alpha EV67 667 MHz | Digital Unix | f95 -O -check bounds | | 52 |
ROMS 2 is a rewrite of the model in Fortran 95, allowing multiple nested grids (not yet fully implemented) and supporting distributed-memory parallelism (via MPI) as an alternative to the shared-memory parallelism of the earlier versions. ROMS 2 includes code to measure CPU time during a run; this is the source of the numbers in the table below. One is inclined to be suspicious of these numbers at first, as they imply that the upwelling test case runs approx. 60% faster in ROMS 2 than in ROMS 1. However, in several cases I have compared the CPU time reported by the model with the results of the "time" utility and found good agreement.
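
The cross-check described above, comparing a CPU time measured inside the program with one measured from outside, can be illustrated with a minimal Python sketch. It is not ROMS code: `busy_work` and `timed_run` are hypothetical names, and the loop is only a stand-in for a model run.

```python
import os
import time


def busy_work(n):
    """Stand-in workload (hypothetical): a simple floating-point loop."""
    total = 0.0
    for i in range(n):
        total += (i % 7) * 0.5
    return total


def timed_run(n):
    """Run the workload and return (result, CPU seconds measured in-program)."""
    start = time.process_time()  # user + system CPU time of this process
    result = busy_work(n)
    cpu_seconds = time.process_time() - start
    return result, cpu_seconds


if __name__ == "__main__":
    result, cpu_inside = timed_run(2_000_000)
    t = os.times()  # CPU time accumulated by the whole process so far
    cpu_process = t.user + t.system
    print(f"CPU time measured around the workload: {cpu_inside:.2f} s")
    print(f"CPU time for the whole process:        {cpu_process:.2f} s")
```

Running the script under the shell's `time` command should give a user + sys total close to the whole-process figure; the in-program figure is expected to be a little smaller because it excludes interpreter start-up.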
ROMS 2 raises some interesting issues about memory handling and performance. As mentioned in the previous paragraph (see also the table below), the upwelling test case runs significantly faster with many compilers in ROMS 2 than in ROMS 1. However, with some compilers (mostly older ones) early versions of ROMS 2 ran much slower than ROMS 1. This seems to be related to the way in which dummy arguments are declared in subprograms. Early versions of ROMS 2 used explicit-shape declarations; it seems that this causes some compilers to create (unnecessary) temporary copies of array data, which slows the run drastically. Later versions use assumed-shape declarations, which eliminates the copying.
Another issue relates to tiling. Like most parallel ocean models, ROMS 2 divides the domain horizontally into tiles. In MPI mode the number of tiles must equal the number of MPI nodes (processors). In OpenMP and serial mode the number of tiles must be an integer multiple of the number of threads. I haven't experimented with OpenMP but I have played around with varying the number of tiles in serial mode. For smaller cases like UPWELLING there is no benefit in running more than one tile on a single processor, but on larger runs like BENCHMARK1 (below) the multi-tile configurations run slightly faster. This presumably occurs because the data from each tile fit in the processor's cache.
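
To make the tiling argument concrete, here is a minimal Python sketch (not ROMS code) that splits a horizontal grid into tiles, checks the "integer multiple of the number of threads" rule, and estimates the size of one 3D field on one tile. The function names are hypothetical, halo (ghost) points are ignored, and reading a configuration such as "4 × 4" as tiles in the two horizontal directions is my assumption. The grid size is that of BENCHMARK1, quoted further down.

```python
def tile_bounds(n, ntiles):
    """Split n grid points into ntiles contiguous, nearly equal index ranges."""
    base, extra = divmod(n, ntiles)
    bounds = []
    start = 0
    for t in range(ntiles):
        size = base + (1 if t < extra else 0)  # spread the remainder over the first tiles
        bounds.append((start, start + size))
        start += size
    return bounds


def tiles_valid_for_threads(ntile_i, ntile_j, nthreads):
    """OpenMP/serial rule: tile count must be an integer multiple of the thread count."""
    return (ntile_i * ntile_j) % nthreads == 0


def field_bytes_per_tile(nx, ny, nz, ntile_i, ntile_j, bytes_per_value=8):
    """Bytes needed for one 3D field on the largest tile (halo points ignored)."""
    max_i = max(hi - lo for lo, hi in tile_bounds(nx, ntile_i))
    max_j = max(hi - lo for lo, hi in tile_bounds(ny, ntile_j))
    return max_i * max_j * nz * bytes_per_value


if __name__ == "__main__":
    nx, ny, nz = 512, 64, 30  # BENCHMARK1 grid
    for ntile_i, ntile_j in [(1, 1), (4, 4), (8, 2), (8, 8)]:
        kib = field_bytes_per_tile(nx, ny, nz, ntile_i, ntile_j) / 1024
        print(f"{ntile_i} x {ntile_j} tiles: {kib:.0f} KiB per 3D field per tile")
```

Under these assumptions a single REAL*8 3D field covering the whole BENCHMARK1 grid takes 7680 KiB, whereas one 4 × 4 tile of it takes 480 KiB, which is much more likely to stay in cache while a tile is being processed.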
Machine | CPU | OS | Compiler & switches | Notes | CPU time |
---|---|---|---|---|---|
Kupe | Alpha EV5 600 MHz | UNICOS/mk | f90 -O3 | | 140 |
Kupe | Alpha EV5 600 MHz | UNICOS/mk | f90 -R b | | 416 |
Kupe | Alpha EV5 600 MHz | UNICOS/mk | f90 -O3 | #undef ASSUMED_SHAPE | 119 |
Kupe | Alpha EV5 600 MHz | UNICOS/mk | f90 -R b | #undef ASSUMED_SHAPE | 173 |
Kupe | Alpha EV5 600 MHz | UNICOS/mk | f90 -O3 | MPI 2 × 2 | 38 |
Thor | Alpha EV67 667 MHz | Digital Unix | f90 -fast | | 22 |
Rickard (2002) | P4 1.8 GHz | Win 2000 | df /fast | | 32 |
Hadfield (2003) | P4 2.67 GHz DDR 266 | Win 2000 | df /fast | | 22 |
Hadfield (2003) | P4 2.67 GHz DDR 266 | Win 2000 | df /check:bounds | | 30 |
Hadfield (2003) | P4 2.67 GHz DDR 266 | Win 2000 | g95 -O3 | | 31 |
Otter | P4 Xeon 2.8 GHz × 2 | | f90 -O1 (Absoft) | | 40 |
Otter | P4 Xeon 2.8 GHz × 2 | | g95 -O3 | | 29 |
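
The "approx. 60% faster" figure quoted above can be checked against rows that appear in both upwelling tables with the same machine and compiler switches. The sketch below uses the three Hadfield (2003) rows; the choice of these rows and the function names are mine, not part of ROMS.

```python
# (ROMS 1 CPU time, ROMS 2 CPU time) for Hadfield (2003), taken from the two tables above
pairs = {
    "df /fast": (40, 22),
    "df /check:bounds": (45, 30),
    "g95 -O3": (42, 31),
}


def speedup(t_old, t_new):
    """Ratio of old to new run time; 2.0 means twice as fast."""
    return t_old / t_new


def percent_faster(t_old, t_new):
    """Speed increase in percent, e.g. 50.0 for a speedup of 1.5."""
    return (speedup(t_old, t_new) - 1.0) * 100.0


if __name__ == "__main__":
    for label, (t1, t2) in pairs.items():
        print(f"{label:18s} {speedup(t1, t2):.2f}x ({percent_faster(t1, t2):.0f}% faster)")
```

For these three rows the speed increase ranges from about 35% to about 80%, so the gain clearly depends on the compiler and switches.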
A set of three BENCHMARK runs is bundled in ROMS 2. They are all simulations of an idealised Southern Ocean, on grids of 512 × 64 × 30 (BENCHMARK1), 1024 × 128 × 30 (BENCHMARK2) and 2048 × 256 × 30 (BENCHMARK3). BENCHMARK1 takes approx. 300 MiB of RAM in REAL*8 precision, so it can be run on a number of machines at Greta Point. BENCHMARK2 takes approx. 1200 MiB of RAM in REAL*8 precision and I have not yet found a machine that will run it in serial mode. On Kupe it requires a minimum of between 8 and 16 processors.
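
The memory figures above are consistent with simple scaling by the number of grid points, as the following Python sketch shows. It assumes that memory use is proportional to grid size; the BENCHMARK3 estimate is an extrapolation under that assumption, not a measurement, and the function names are hypothetical.

```python
BYTES_PER_VALUE = 8  # REAL*8
MIB = 1024 ** 2

grids = {
    "BENCHMARK1": (512, 64, 30),
    "BENCHMARK2": (1024, 128, 30),
    "BENCHMARK3": (2048, 256, 30),
}


def grid_points(shape):
    """Total number of points in an nx x ny x nz grid."""
    nx, ny, nz = shape
    return nx * ny * nz


def field_mib(shape):
    """Size in MiB of a single 3D REAL*8 field on the grid."""
    return grid_points(shape) * BYTES_PER_VALUE / MIB


def scaled_footprint_mib(shape, ref_shape, ref_mib):
    """Scale a known footprint linearly with the number of grid points (assumption)."""
    return ref_mib * grid_points(shape) / grid_points(ref_shape)


if __name__ == "__main__":
    ref = grids["BENCHMARK1"]
    for name, shape in grids.items():
        est = scaled_footprint_mib(shape, ref, 300.0)  # 300 MiB is the BENCHMARK1 figure
        print(f"{name}: {grid_points(shape)} points, "
              f"{field_mib(shape):.1f} MiB per 3D field, approx. {est:.0f} MiB total")
```

One 3D field on the BENCHMARK1 grid is 7.5 MiB, so 300 MiB corresponds to roughly 40 such fields. Each step up quadruples the number of grid points, which reproduces the 1200 MiB quoted for BENCHMARK2 and suggests something like 4800 MiB for BENCHMARK3.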
I have ported the BENCHMARK cases back into the ROMS 1 source code. Since ROMS 1 supports only serial and OpenMP modes, BENCHMARK1 is the only one I can run. Here it is run for 20 time steps.
Machine | CPU | OS | Compiler & switches | Notes | CPU time |
---|---|---|---|---|---|
Hadfield (2003) | P4 2.67 GHz DDR 266 | Win 2000 | df /fast | | 185 |
Otter | P4 Xeon 2.8 GHz × 2 | Linux | f90 -O3 | | 205 |
Otter | P4 Xeon 2.8 GHz × 2 | Linux | g95 -O3 | | 195 |
Otter | P4 Xeon 2.8 GHz × 2 | Linux | g77 -O | | 195 |
Here are BENCHMARK1 results from ROMS 2. I originally ran most of these for 20 time steps but have been redoing them with the standard ROMS 2.1 input file, which runs the simulation for 200 steps.
Machine | CPU | OS | Compiler & switches | Notes | CPU time |
---|---|---|---|---|---|
Kupe | Alpha EV5 600 MHz | UNICOS/mk | f90 -O3 | Serial | 770 |
Kupe | Alpha EV5 600 MHz | UNICOS/mk | f90 -O3 | Serial #undef ASSUMED_SHAPE | 740 |
Kupe | Alpha EV5 600 MHz | UNICOS/mk | f90 -O3 | MPI 12 × 2 | 36 |
Nforce2 | Athlon XP2600 | Linux | ifc -O3 -tpp7 | | 198 |
Hadfield (2003) | P4 2.67 GHz DDR 266 | Win 2000 | df /fast | 200 steps | 1600 |
Hadfield (2003) | P4 2.67 GHz DDR 266 | Win 2000 | df /fast | Serial 4 × 4 | 136 |
Hadfield (2003) | P4 2.67 GHz DDR 266 | Win 2000 | df /fast | Serial 8 × 2 | 147 |
Hadfield (2003) | P4 2.67 GHz DDR 266 | Win 2000 | df /fast | Serial 8 × 8 | 137 |
Hadfield (2003) | P4 2.67 GHz DDR 266 | Win 2000 | g95 -O3 | Serial 4 × 4 | 230 |
Otter | P4 Xeon 2.8 GHz × 2 | Linux | f90 -O1 | Serial 4 × 4 | 240 |
Otter | P4 Xeon 2.8 GHz × 2 | Linux | g95 -O3 | Serial 4 × 4 | 200 |
Here are BENCHMARK2 results from ROMS 2 on Kupe:
Machine | CPU | OS | Compiler & switches | Notes | CPU time |
---|---|---|---|---|---|
Kupe | Alpha EV5 600 MHz | UNICOS/mk | f90 -O3 | MPI 16 × 1 | 2080 |
Kupe | Alpha EV5 600 MHz | UNICOS/mk | f90 -O3 | MPI 16 × 2 | 1340 |
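
The two Kupe runs give a rough idea of how BENCHMARK2 scales when the processor count is doubled. The sketch below rests on two assumptions of mine: that "MPI 16 × 1" and "MPI 16 × 2" mean 16 and 32 processes respectively, and that the reported times are per-run figures that can be compared directly (not totals summed over processors). The function names are hypothetical.

```python
def speedup(t_base, t_more):
    """Ratio of run time on the smaller configuration to run time on the larger one."""
    return t_base / t_more


def parallel_efficiency(t_base, p_base, t_more, p_more):
    """Speedup divided by the increase in processor count (1.0 would be ideal scaling)."""
    return speedup(t_base, t_more) / (p_more / p_base)


if __name__ == "__main__":
    # BENCHMARK2 on Kupe, from the table above
    p16, t16 = 16 * 1, 2080
    p32, t32 = 16 * 2, 1340
    print(f"speedup from {p16} to {p32} processes: {speedup(t16, t32):.2f}")
    print(f"parallel efficiency: {parallel_efficiency(t16, p16, t32, p32):.0%}")
```

Under these assumptions, doubling the processor count speeds the run up by a factor of about 1.55, i.e. a parallel efficiency of roughly 78% relative to the 16-process run.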