tiling and divide by zero error
I am having a curious problem on a cluster I use to run ROMS. It has 10 nodes with 40 cores each, and I have routinely been running ROMS as an MPI job on all 400 cores. However, when I try to run the same job on 200 or 40 cores I get an error like this:
[pmacc@klone1 driver]$ cat slurm-1501920.out
(in /gscratch/macc/parker/LO/driver)
[n3111:90571:0:90571] Caught signal 8 (Floating point exception: integer
divide by zero)
==== backtrace (tid: 90571) ====
0 0x0000000000012b20 .annobin_sigaction.c() sigaction.c:0
1 0x00000000004acb41 output_() ???:0
2 0x0000000000479dab main3d_() ???:0
3 0x000000000040d57f ocean_control_mod_mp_roms_run_() ???:0
4 0x000000000040d35f MAIN__() ???:0
5 0x000000000040d162 main() ???:0
6 0x00000000000237b3 __libc_start_main() ???:0
7 0x000000000040d06e _start() ???:0
Any clues? Thanks, Parker
Re: tiling and divide by zero error
Is there any more relevant info? I assume you changed the NtileI and NtileJ values accordingly.
Also, it looks like the error might be in output. Did you change anything in wrt_his or anywhere else that writes things out for your setup?
-j
Re: tiling and divide by zero error
Thanks, John. I did adjust NtileI and NtileJ accordingly, and did not change wrt_his. I'll see if I can find any more info.
- arango
Re: tiling and divide by zero error
What is the size of your grid? I assume that it is very large since you have been using 400 PETs. What type of I/O library are you using? Using 400 processes can add a substantial bottleneck to the simulation because of I/O. That's the reason I implemented the PIO library for these cases. What version of ROMS are you using?
The curious thing is that it runs on 400 but not on 200 processes. I was expecting the opposite to be true. That may give us a clue about the memory requirements.
Re: tiling and divide by zero error
Hernan, the grid size is (30, 1302, 663) (s_rho, eta_rho, xi_rho). The ROMS version is revision 823, from 2016. I suppose I should try a newer version! I don't know where to look for the I/O library, but if you give me a clue I will find it. I am writing to NetCDF4 with compression: #define DEFLATE, #define HDF5. Thanks!
- arango
Re: tiling and divide by zero error
That's a very old version of the code. It doesn't have PIO. I released that capability this year.
Then, what is your tile partition? It seems to me that it will run more efficiently with fewer processes. Are you activating an ecosystem model or nesting? The memory requirements increase as the number of tracers increases. In newer versions of ROMS, we estimate memory usage.
The recommendation is to have the same number of points per tile so that the computation is balanced, that is, all processes have the same amount of work. Otherwise, some will sit idle at synchronization points in the MPI library, which can be problematic for communications on clusters, and the job may hang. Recall that the master process is overworked because of the serial I/O.
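To see what "the same amount of work" means concretely, here is a small illustrative script (plain integer arithmetic, not ROMS's actual tile-distribution code) that shows how a dimension splits across a given number of tiles:
# Illustrative only, not ROMS's partitioning routine: split a dimension of
# n points into k tiles as evenly as integer division allows.
def tile_sizes(n, k):
    base, extra = divmod(n, k)
    return [base + 1] * extra + [base] * (k - extra)

print(tile_sizes(1302, 7))   # 7 tiles of 186 points each: perfectly balanced
print(tile_sizes(1302, 20))  # two 66-point tiles and eighteen 65-point tiles
When the split is uneven, the larger tiles set the pace and the smaller ones wait at each synchronization point.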
The issue here is that the size of your grid does not decompose into powers of 2, which would facilitate all kinds of parallel partitions:
Code:
1302 = 2 * 3 * 7 * 31
663 = 3 * 13 * 17
Possibilities:
(1) NtileI = 7, NtileJ = 17 (119 processes)
(2) NtileI = 14, NtileJ = 13 (182 processes)
(3) NtileI = 14, NtileJ = 17 (238 processes)
(4) NtileI = 7, NtileJ = 39 (273 processes)
(5) NtileI = 21, NtileJ = 17 (357 processes)
(6) NtileI = 31, NtileJ = 13 (403 processes)
(7) NtileI = 31, NtileJ = 17 (527 processes)
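For anyone who wants to enumerate these by hand, a throwaway script along these lines (not part of ROMS; it just uses the 1302 x 663 dimensions quoted above) reproduces the list of exact partitions:
# Throwaway helper, not part of ROMS: list tile partitions that divide the
# quoted dimensions exactly, so every process gets the same tile size.
def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

def balanced_partitions(ni, nj, min_procs=100, max_procs=600):
    options = []
    for ti in divisors(ni):
        for tj in divisors(nj):
            procs = ti * tj
            if min_procs <= procs <= max_procs:
                options.append((ti, tj, procs, ni // ti, nj // tj))
    return sorted(options, key=lambda t: t[2])

for ti, tj, procs, isz, jsz in balanced_partitions(1302, 663):
    print(f"NtileI={ti:3d}  NtileJ={tj:3d}  -> {procs:3d} processes, tile size {isz} x {jsz}")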
Re: tiling and divide by zero error
I'll look into updating the version. What always holds my group back on this topic is that we have a custom NPZD model - mostly it is separate programs, but it requires a few edits to the ROMS source code as well. I'd like to design my system to absorb ROMS updates more easily. We are not using nesting.
The tiling partition choices are handled in some python code that automates the creation of the dot_in file:
elif Ldir['np_num'] == 400: # klone
ntilei = '20' # number of tiles in I-direction
ntilej = '20' # number of tiles in J-direction
elif Ldir['np_num'] == 200: # klone
ntilei = '10' # number of tiles in I-direction
ntilej = '20' # number of tiles in J-direction
elif Ldir['np_num'] == 40: # klone
ntilei = '5' # number of tiles in I-direction
ntilej = '8' # number of tiles in J-direction
but the basic answer is that the 20x20 partition works and 10x20 does not (and 5x8 works sporadically!)
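A hypothetical guard in that driver code (the names np_num, Lm, and Mm are assumptions following the snippet above and the Lm/Mm discussion later in this thread, not the actual LO code) could catch a bad combination before the dot_in file is written:
# Hypothetical sanity check for the dot_in generator; Lm and Mm are the
# interior grid dimensions that ROMS actually partitions (Lp - 2, Mp - 2).
def check_partition(ntilei, ntilej, np_num, Lm, Mm):
    if ntilei * ntilej != np_num:
        raise ValueError(f"NtileI*NtileJ = {ntilei * ntilej} "
                         f"but the job requests {np_num} cores")
    if Lm % ntilei or Mm % ntilej:
        print(f"warning: {Lm} x {Mm} interior points do not divide evenly "
              f"into {ntilei} x {ntilej} tiles; tile sizes will be unequal")

# The 200-core case from this thread:
check_partition(10, 20, 200, Lm=1300, Mm=661)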
Regarding the "powers of two" should I be designing grids with rho_grid sizes like 256, 512, 1024, 2048? If so, does this number apply to the actual size of the grid.nc fields I create, or the the INTERIOR points (smaller by 2)?
- arango
Re: tiling and divide by zero error
That's not good. You need to estimate the tile partition by hand since we are dealing with some prime numbers. Try some of the options that I typed above.
Re: tiling and divide by zero error
Hernan can correct me if I'm wrong, but ideally it is Lm and Mm that have many small prime factors so that you can have equal size tiles on many cores.
So, if your rho variables are 1302 x 663 (Lp by Mp), you are looking at prime factors of 1300 by 661. Unfortunately, 661 is prime.
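A quick throwaway check of those interior dimensions (not from the thread, just ordinary trial division) confirms the point:
# Factor the interior dimensions Lm = Lp - 2 and Mm = Mp - 2 by trial division.
def prime_factors(n):
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

print(prime_factors(1300))  # [2, 2, 5, 5, 13] -> many small factors in I
print(prime_factors(661))   # [661] -> prime, so no exact multi-tile split in J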
John Wilkin: DMCS Rutgers University
71 Dudley Rd, New Brunswick, NJ 08901-8521, USA. ph: 609-630-0559 jwilkin@rutgers.edu
- arango
Re: tiling and divide by zero error
The tile partition in ROMS is computed with Lm and Mm, since every tile has the same number of ghost points. In my list of possibilities above for NtileI and NtileJ, I assumed that the values provided were Lm and Mm. I always prefer fewer tiles in the I-direction to allow vectorization to accelerate the computations.
Re: tiling and divide by zero error
Parker,
You can use svn to methodically bring your code up to date.
(1) Check out a clean version of release 823.
(2) copy over your NPZD mods to that repo (check with svn diff how it differs from 823).
(3) Proceed in modest steps to update the whole repo to increasing version numbers ...
svn update -r 829
If you take modest steps in version number (you don't have to go one at a time) the conflicts should be easy to resolve. If not, you can always roll back to a smaller version number.
svn update -r 825
If your NPZD-driven modifications touch only a few files, have a look at the history of their changes in the "trac" browser https://www.myroms.org/projects/src/browser and you might be able to identify larger update steps where the code changes were independent of anything you touched.
The prudent thing would be to compile and run at some of these intermediate steps.
Be sure to check for ocean.in/roms.in and varinfo.dat diffs while you do this. It is a common mistake to forget that code updates which add new features are often accompanied by new input parameters in roms.in to get them to work.
Once you are up to date, it's worth keeping this branch current, if only to check for conflicts, even if it's not the branch you routinely run.
John Wilkin: DMCS Rutgers University
71 Dudley Rd, New Brunswick, NJ 08901-8521, USA. ph: 609-630-0559 jwilkin@rutgers.edu