tiling and divide by zero error

Bug reports, work arounds and fixes

Moderators: arango, robertson

pmaccc
Posts: 74
Joined: Wed Oct 22, 2003 6:59 pm
Location: U. Wash., USA

tiling and divide by zero error

#1 Post by pmaccc

I am having a curious problem on a cluster I use to run ROMS. I have 10 nodes with 40 cores each, and have routinely been running ROMS as an MPI job on all 400 cores. However, when I try to run the same job with 200 or 40 cores I get an error like this:

[pmacc@klone1 driver]$ cat slurm-1501920.out
(in /gscratch/macc/parker/LO/driver)
[n3111:90571:0:90571] Caught signal 8 (Floating point exception: integer divide by zero)
==== backtrace (tid: 90571) ====
0 0x0000000000012b20 .annobin_sigaction.c() sigaction.c:0
1 0x00000000004acb41 output_() ???:0
2 0x0000000000479dab main3d_() ???:0
3 0x000000000040d57f ocean_control_mod_mp_roms_run_() ???:0
4 0x000000000040d35f MAIN__() ???:0
5 0x000000000040d162 main() ???:0
6 0x00000000000237b3 __libc_start_main() ???:0
7 0x000000000040d06e _start() ???:0

Any clues? Thanks, Parker

jcwarner
Posts: 1172
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

Re: tiling and divide by zero error

#2 Post by jcwarner

Is there any more relevant info? I assume you changed the NtileI and NtileJ values accordingly.
Also, it looks like the error might be in output. Did you change anything in wrt_his or any other place to write things out for your setup?
-j

pmaccc
Posts: 74
Joined: Wed Oct 22, 2003 6:59 pm
Location: U. Wash., USA

Re: tiling and divide by zero error

#3 Post by pmaccc

Thanks, John. I did adjust NtileI and NtileJ accordingly, and did not change wrt_his. I'll see if I can find any more info.

arango
Site Admin
Posts: 1347
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University

Re: tiling and divide by zero error

#4 Post by arango

What is the size of your grid? I assume that it is very large since you have been using 400 PETs. What type of I/O library are you using? Using 400 processes can add a substantial bottleneck to the simulation because of I/O. That's the reason I implemented the PIO library for these cases. What version of ROMS are you using?

The curious thing is that it runs on 400 processes but not on 200. I was expecting the opposite to be true. It may give us a clue about the memory requirements.

pmaccc
Posts: 74
Joined: Wed Oct 22, 2003 6:59 pm
Location: U. Wash., USA

Re: tiling and divide by zero error

#5 Post by pmaccc

Hernan, the grid size is (30, 1302, 663) (s_rho, eta_rho, xi_rho). ROMS version is revision 823, from 2016. I suppose I should try with a newer version! I don't know where to look for the I/O library but if you give me a clue I will find it. I am writing to NetCDF4 with compression: #define DEFLATE, #define HDF5. Thanks!

arango
Site Admin
Posts: 1347
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University

Re: tiling and divide by zero error

#6 Post by arango

That's a very old version of the code. It doesn't have PIO. I released that capability this year.

Then, what is your tile partition? It seems to me that it will run more efficiently with fewer processes. Are you activating an ecosystem model or nesting? The memory requirements increase as the number of tracers increases. In newer versions of ROMS, we estimate memory usage.

The issue here is that the size of your grid does not decompose into powers of 2, which facilitate all kinds of parallel partitions:

1302 = 2 * 3 * 7 * 31
663  = 3 * 13 * 17
 
Possibilities:
 
(1) NtileI = 7,  NtileJ = 17 (119 processes)
(2) NtileI = 14, NtileJ = 13 (182 processes)
(3) NtileI = 14, NtileJ = 17 (238 processes)
(4) NtileI = 7,  NtileJ = 39 (273 processes)
(5) NtileI = 21, NtileJ = 17 (357 processes)
(6) NtileI = 31, NtileJ = 13 (403 processes)
(7) NtileI = 31, NtileJ = 17 (527 processes)

The recommendation is to have the same number of points per tile to have a balanced computation. That is, all processes have the same amount of work. Otherwise, some will sit idle at synchronization points in the MPI library, which may be problematic with communications in clusters, and the job may hang. Recall that the master process is overworked because of the serial I/O.
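
The list of possibilities above can be reproduced with a short script. This is only a sketch: it takes the two sizes 1302 and 663 exactly as factored in this post, pairs every divisor of one with every divisor of the other, and keeps the pairs whose process count falls in a chosen range. The function names and the 100 to 550 range are assumptions made for illustration, not anything from ROMS.

```python
def divisors(n):
    """All positive integers that divide n exactly."""
    return [d for d in range(1, n + 1) if n % d == 0]


def exact_partitions(size_i, size_j, min_procs, max_procs):
    """(NtileI, NtileJ, processes) triples that split both sizes evenly."""
    found = []
    for ntile_i in divisors(size_i):
        for ntile_j in divisors(size_j):
            procs = ntile_i * ntile_j
            if min_procs <= procs <= max_procs:
                found.append((ntile_i, ntile_j, procs))
    # Sort by process count so the table reads like the list above.
    return sorted(found, key=lambda t: t[2])


# Sizes as used in the post; the process range is an arbitrary choice.
for ntile_i, ntile_j, procs in exact_partitions(1302, 663, 100, 550):
    print(f"NtileI = {ntile_i:3d}, NtileJ = {ntile_j:3d} ({procs} processes)")
```

The script prints more combinations than the seven listed, since it does not filter on tile shape or on how many tiles lie in each direction.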

pmaccc
Posts: 74
Joined: Wed Oct 22, 2003 6:59 pm
Location: U. Wash., USA

Re: tiling and divide by zero error

#7 Post by pmaccc

I'll look into updating the version. What always holds my group back on this topic is that we have a custom NPZD model. Mostly it is separate programs, but it requires a few edits to the ROMS source code as well. I'd like to design my system to absorb ROMS updates more easily. We are not using nesting.

The tiling partition choices are handled in some python code that automates the creation of the dot_in file:
    elif Ldir['np_num'] == 400: # klone
        ntilei = '20' # number of tiles in I-direction
        ntilej = '20' # number of tiles in J-direction
    elif Ldir['np_num'] == 200: # klone
        ntilei = '10' # number of tiles in I-direction
        ntilej = '20' # number of tiles in J-direction
    elif Ldir['np_num'] == 40: # klone
        ntilei = '5' # number of tiles in I-direction
        ntilej = '8' # number of tiles in J-direction
but the basic answer is that 20x20 works and 20x10 does not (and 8x5 works sporadically!)
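
One way to make a lookup like this harder to get wrong is to keep the tile counts in a table and check that their product matches the core count before writing the dot_in file. The sketch below mirrors the three cases in the excerpt; the names TILINGS and tiles_for are hypothetical and do not come from the original driver code. It only checks the product, not whether the tiles divide the grid evenly.

```python
# Hypothetical table mirroring the excerpt above:
# core count -> (tiles in I-direction, tiles in J-direction)
TILINGS = {
    400: (20, 20),
    200: (10, 20),
    40: (5, 8),
}


def tiles_for(np_num):
    """Return (ntilei, ntilej) as strings after checking their product."""
    if np_num not in TILINGS:
        raise ValueError(f"no tiling defined for {np_num} cores")
    ntile_i, ntile_j = TILINGS[np_num]
    if ntile_i * ntile_j != np_num:
        raise ValueError(
            f"{ntile_i} x {ntile_j} tiles does not equal {np_num} cores"
        )
    # Strings, because the excerpt stores the values as text.
    return str(ntile_i), str(ntile_j)


ntilei, ntilej = tiles_for(200)
print(ntilei, ntilej)
```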

Regarding the "powers of two": should I be designing grids with rho_grid sizes like 256, 512, 1024, 2048? If so, does this number apply to the actual size of the grid.nc fields I create, or to the INTERIOR points (smaller by 2)?

arango
Site Admin
Posts: 1347
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University

Re: tiling and divide by zero error

#8 Post by arango

That's not good. You need to estimate the tile partition by hand since we are dealing with some prime numbers. Try some of the options that I typed above.

wilkin
Posts: 875
Joined: Mon Apr 28, 2003 5:44 pm
Location: Rutgers University

Re: tiling and divide by zero error

#9 Post by wilkin

Hernan can correct me if I'm wrong, but ideally it is Lm and Mm that have many small prime factors so that you can have equal size tiles on many cores.
So, if your rho variables are 1302 x 663 (Lp by Mp), you are looking at the prime factors of 1300 by 661 (Lm by Mm). Unfortunately, 661 is prime.
John Wilkin: DMCS Rutgers University
71 Dudley Rd, New Brunswick, NJ 08901-8521, USA. ph: 609-630-0559 jwilkin@rutgers.edu
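
The factor check described in this post is easy to script. The following is a minimal sketch using plain trial division on the two interior sizes quoted above (1300 and 661); the function name prime_factors is an illustrative choice.

```python
def prime_factors(n):
    """Prime factorization of n as {prime: exponent}, by trial division."""
    factors = {}
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        # Whatever remains after dividing out small primes is itself prime.
        factors[n] = factors.get(n, 0) + 1
    return factors


for size in (1300, 661):
    print(size, prime_factors(size))
```

A size whose factorization has a single entry with exponent 1 is prime, so it can only be split evenly into 1 tile or into as many tiles as there are points.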

arango
Site Admin
Posts: 1347
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University

Re: tiling and divide by zero error

#10 Post by arango

The tile partition in ROMS is computed with Lm and Mm, since every tile has the same number of ghost points. In the possibilities I listed above for NtileI and NtileJ, I assumed that the values provided were Lm and Mm. I always prefer fewer tiles in the I-direction to allow vectorization to accelerate computations.
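
To get a rough feel for how uneven a partition is, one can compare the widest tile with the average tile width along one direction. The sketch below assumes a simple near-even split in which the remainder is spread one point at a time; that is an assumption for illustration and not necessarily how ROMS lays out its tiles. The "rho size minus 2" rule for interior points follows the thread, and both function names are hypothetical.

```python
def interior_points(n_rho):
    """Interior point count for a rho dimension (smaller by 2, per the thread)."""
    return n_rho - 2


def imbalance(n_interior, n_tiles):
    """Widest tile divided by mean tile width; 1.0 means perfectly balanced."""
    widest = -(-n_interior // n_tiles)  # ceiling division
    return widest * n_tiles / n_interior


# The two rho sizes from the thread, each tried with 10 and 20 tiles.
for n_rho in (1302, 663):
    n = interior_points(n_rho)
    for n_tiles in (10, 20):
        print(f"{n} points, {n_tiles} tiles: imbalance {imbalance(n, n_tiles):.4f}")
```

With 1300 interior points both 10 and 20 tiles split evenly, while 661 points cannot be split evenly by either.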

wilkin
Posts: 875
Joined: Mon Apr 28, 2003 5:44 pm
Location: Rutgers University

Re: tiling and divide by zero error

#11 Post by wilkin

Parker,

You can use svn to methodically bring your code up to date.

(1) Check out a clean version of release 823.

(2) Copy over your NPZD mods to that repo (check with svn diff how it differs from 823).

(3) Proceed in modest steps to update the whole repo to increasing version numbers ...

svn update -r 829

If you take modest steps in version number (you don't have to go one at a time) the conflicts should be easy to resolve. If not, you can always roll back to a smaller version number.

svn update -r 825

If your NPZD driven modifications touch only a few files, have a look at the history of their changes in "trac" browser https://www.myroms.org/projects/src/browser and you might be able to identify larger update steps because code mods were independent of anything you touched.

The prudent thing would be to compile and run at some of these intermediate steps.

Be sure to check for ocean.in/roms.in and varinfo.dat diffs while you do this. It is a common mistake to forget that code updates which add new features are often accompanied by new input parameters in roms.in to get them to work.

Once up to date it's worth keeping this branch up to date, if only to check for conflicts, even if it's not the branch you are routinely running.
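
The stepwise update can be planned with a small helper that lists the revisions to visit and the matching svn commands. This is a sketch only: 823 is the starting release from the thread, while the target revision 850 and the step of 6 are made-up values, and all function names are hypothetical. The loop at the bottom only prints the commands; run_step is defined for reference and is never called here.

```python
import subprocess


def update_plan(start_rev, target_rev, step):
    """Revisions to visit in order, ending exactly at target_rev."""
    if step <= 0 or target_rev <= start_rev:
        raise ValueError("need step > 0 and target_rev > start_rev")
    revs = list(range(start_rev + step, target_rev, step))
    revs.append(target_rev)
    return revs


def svn_update_command(rev):
    """Argument list for 'svn update -r REV'."""
    return ["svn", "update", "-r", str(rev)]


def run_step(rev, workdir):
    """Run one update inside the working copy; raises if svn exits non-zero."""
    subprocess.run(svn_update_command(rev), cwd=workdir, check=True)


# 823 comes from the thread; 850 and the step of 6 are illustrative only.
for rev in update_plan(823, 850, 6):
    print(" ".join(svn_update_command(rev)))
```

Between steps, resolve any conflicts, then compile and run a test case before moving on, as suggested above.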
