Run fails when using more than 1 node in a computing cluster


lcbernardo · Wed Oct 17, 2018 5:18 am

Dear ROMS users,

We've had this problem for around half a year and have tried working with the technical staff at our institution to resolve it. However, the problem persists, so I thought I'd try asking here in the forums.

We're running ROMS on a parallel computing cluster. When we use only 1 node (which in our case consists of 28 cores), the run completes successfully. However, whenever we try to use 2 or more nodes, the run fails near the start, and the failure seems to occur while reading the initial condition NetCDF file. Here is how it appears in the log file:

Metrics information for Grid 01:
===============================

Minimum X-grid spacing, DXmin = 1.50000000E+00 km
Maximum X-grid spacing, DXmax = 1.50000000E+00 km
Minimum Y-grid spacing, DYmin = 1.50000000E+00 km
Maximum Y-grid spacing, DYmax = 1.50000000E+00 km
Minimum Z-grid spacing, DZmin = -1.33120450E+01 m
Maximum Z-grid spacing, DZmax = 2.34310913E+03 m

Minimum barotropic Courant Number = 2.66422670E-02
Maximum barotropic Courant Number = 7.21447999E-01
Maximum Coriolis Courant Number = 3.96367952E-03

Minimum horizontal diffusion coefficient = 1.25000000E+01 m2/s
Maximum horizontal diffusion coefficient = 1.25000000E+01 m2/s

Minimum horizontal viscosity coefficient = 1.25000000E+01 m2/s
Maximum horizontal viscosity coefficient = 1.00000000E+20 m2/s

NLM: GET_STATE - Reading state initial conditions, 2016-04-30 00:00:00.00
(Grid 01, t = 5964.0000, File: CRSE_MB1_ini_160501.nc, Rec=0001, Index=1)
- free-surface
(Min = -2.04140008E-01 Max = 1.37334052E+00)
- vertically integrated u-momentum component
(Min = -3.23873087E-01 Max = 7.55435041E-01)
- vertically integrated v-momentum component
(Min = -2.82343629E-01 Max = 6.29832212E-01)

And when the run fails, a file with a *.btr extension is generated and contains the following lines:

oceanM:56948 terminated with signal 11 at PC=0 SP=7fffffff74a8. Backtrace:
/usr/lib64/libinfinipath.so.4(+0x45a8)[0x2aaac28bd5a8]
/lib64/libpthread.so.0(+0x10b20)[0x2aaaac957b20]

If anyone has experienced a similar issue and solved it or might have some thoughts on how to go about doing so, I would greatly appreciate any help.

Thanks,
Lawrence

jcwarner · Thu Oct 18, 2018 3:55 am

This looks like an architecture/library issue. This thread looks similar:
https://software.intel.com/en-us/forums ... pic/270080

-j

lcbernardo · Thu Oct 18, 2018 10:16 am

Thank you for the link, Dr. Warner. I'll see if I can use this when I get a chance to consult with our technical staff on the issue.

Lawrence
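
For anyone comparing notes with their own technical staff, it can help to pull the library names out of the *.btr file so it is clear which shared libraries the crash passed through. The following is only a minimal sketch in Python: the function name `parse_btr` and the two regular expressions are hypothetical helpers written for this thread, and they assume the .btr file always has the same layout as the two backtrace lines quoted above.

```python
import re

# Assumed frame layout, based only on the lines quoted in this thread, e.g.
#   /usr/lib64/libinfinipath.so.4(+0x45a8)[0x2aaac28bd5a8]
FRAME_RE = re.compile(
    r"^(?P<lib>/\S+?)\((?P<offset>[^)]*)\)\[(?P<addr>0x[0-9a-fA-F]+)\]$"
)

# Assumed header layout, e.g.
#   oceanM:56948 terminated with signal 11 at PC=0 SP=7fffffff74a8. Backtrace:
HEADER_RE = re.compile(
    r"^(?P<exe>\S+):(?P<pid>\d+) terminated with signal (?P<sig>\d+)"
)


def parse_btr(text):
    """Return (header, frames) parsed from the text of a .btr file.

    header is a dict with the executable name, process id and signal
    number, or None if no header line was found. frames is a list of
    dicts, one per backtrace line, in the order they appear.
    """
    header = None
    frames = []
    for line in text.splitlines():
        line = line.strip()
        m = HEADER_RE.match(line)
        if m:
            header = {
                "exe": m["exe"],
                "pid": int(m["pid"]),
                "signal": int(m["sig"]),
            }
            continue
        m = FRAME_RE.match(line)
        if m:
            frames.append(
                {"library": m["lib"], "offset": m["offset"], "address": m["addr"]}
            )
    return header, frames


# The backtrace exactly as posted above.
sample = """\
oceanM:56948 terminated with signal 11 at PC=0 SP=7fffffff74a8. Backtrace:
/usr/lib64/libinfinipath.so.4(+0x45a8)[0x2aaac28bd5a8]
/lib64/libpthread.so.0(+0x10b20)[0x2aaaac957b20]
"""

header, frames = parse_btr(sample)
print(header)
for frame in frames:
    print(frame["library"], frame["offset"])
```

On the posted backtrace this reports signal 11 (SIGSEGV on Linux) for `oceanM`, with the first frame in `/usr/lib64/libinfinipath.so.4` and the second in `/lib64/libpthread.so.0`. Running it over the .btr files from every failed rank would show whether they all stop in the same library.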

