Dear ROMS users,
We've had this problem for around half a year and have tried to resolve it with the technical staff at our institution. The problem persists, so I thought I'd ask here in the forum.
We're running ROMS on a parallel computing cluster. When we use only 1 node (28 cores in our case), the run completes successfully. However, whenever we try to use 2 or more nodes, the run fails near the start, apparently while reading the initial condition NetCDF file. Here's how it appears in the log file:
Metrics information for Grid 01:
===============================
Minimum X-grid spacing, DXmin = 1.50000000E+00 km
Maximum X-grid spacing, DXmax = 1.50000000E+00 km
Minimum Y-grid spacing, DYmin = 1.50000000E+00 km
Maximum Y-grid spacing, DYmax = 1.50000000E+00 km
Minimum Z-grid spacing, DZmin = -1.33120450E+01 m
Maximum Z-grid spacing, DZmax = 2.34310913E+03 m
Minimum barotropic Courant Number = 2.66422670E-02
Maximum barotropic Courant Number = 7.21447999E-01
Maximum Coriolis Courant Number = 3.96367952E-03
Minimum horizontal diffusion coefficient = 1.25000000E+01 m2/s
Maximum horizontal diffusion coefficient = 1.25000000E+01 m2/s
Minimum horizontal viscosity coefficient = 1.25000000E+01 m2/s
Maximum horizontal viscosity coefficient = 1.00000000E+20 m2/s
NLM: GET_STATE - Reading state initial conditions, 2016-04-30 00:00:00.00
(Grid 01, t = 5964.0000, File: CRSE_MB1_ini_160501.nc, Rec=0001, Index=1)
- free-surface
(Min = -2.04140008E-01 Max = 1.37334052E+00)
- vertically integrated u-momentum component
(Min = -3.23873087E-01 Max = 7.55435041E-01)
- vertically integrated v-momentum component
(Min = -2.82343629E-01 Max = 6.29832212E-01)
And when the run fails, a file with a *.btr extension is generated and contains the following lines:
oceanM:56948 terminated with signal 11 at PC=0 SP=7fffffff74a8. Backtrace:
/usr/lib64/libinfinipath.so.4(+0x45a8)[0x2aaac28bd5a8]
/lib64/libpthread.so.0(+0x10b20)[0x2aaaac957b20]
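A minimal Python sketch (not part of ROMS; the function name `parse_btr` and the regular expressions are assumptions based only on the three lines shown above) for pulling the signal name and the shared libraries out of a *.btr file like this one. Seeing which libraries sit at the top of the backtrace can help when describing the failure to cluster staff.

```python
import re
import signal

# Header line, assumed format: "<exe>:<pid> terminated with signal <n> at ..."
HEADER_RE = re.compile(r"^(?P<exe>\S+):(?P<pid>\d+) terminated with signal (?P<sig>\d+)")
# Frame line, assumed format: "/path/to/lib.so(+0xOFFSET)[0xADDRESS]"
FRAME_RE = re.compile(r"^(?P<lib>/\S+?)\((?P<sym>[^)]*)\)\[(?P<addr>0x[0-9a-fA-F]+)\]$")


def parse_btr(text):
    """Return (signal_name, [(library, symbol_or_offset, address), ...])."""
    sig_name = None
    frames = []
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER_RE.match(line)
        if header:
            # Translate the signal number into its symbolic name (11 -> SIGSEGV).
            sig_name = signal.Signals(int(header.group("sig"))).name
            continue
        frame = FRAME_RE.match(line)
        if frame:
            frames.append((frame.group("lib"), frame.group("sym"), frame.group("addr")))
    return sig_name, frames


# The backtrace lines copied from the post above.
sample = """oceanM:56948 terminated with signal 11 at PC=0 SP=7fffffff74a8. Backtrace:
/usr/lib64/libinfinipath.so.4(+0x45a8)[0x2aaac28bd5a8]
/lib64/libpthread.so.0(+0x10b20)[0x2aaaac957b20]"""

sig_name, frames = parse_btr(sample)
print("signal:", sig_name)
for lib, sym, addr in frames:
    print(lib, sym, addr)
```

For the lines above, this reports a segmentation fault (signal 11) with `libinfinipath.so.4` as the topmost frame, followed by `libpthread.so.0`.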
If anyone has experienced a similar issue and solved it, or has thoughts on how to go about doing so, I would greatly appreciate any help.
Thanks,
Lawrence
Run fails when using more than 1 node in a computing cluster
- Posts: 88
- Joined: Wed Oct 01, 2014 8:57 pm
- Location: International Coastal Research Center
Re: Run fails when using more than 1 node in a computing clu
This looks like an architecture/library issue. This thread looks similar:
https://software.intel.com/en-us/forums ... pic/270080
-j
Re: Run fails when using more than 1 node in a computing clu
Thank you for the link, Dr. Warner. I'll see if I can use this when I get a chance to consult with our technical staff on the issue.
Lawrence