Error about COAWST parallel mode

Report or discuss software problems and other woes

Moderators: arango, robertson

EdwardElric
Posts: 12
Joined: Sun Sep 27, 2020 10:52 pm
Location: College of Oceanic and Atmospheric Sciences, Ocean

Error about COAWST parallel mode

#1 Unread post by EdwardElric »

Hello everyone,
I use the COAWST model (so far, only the ROMS component) to simulate sediment transport.
Before this, I successfully ran several of the cases in the project. But when I run my real case with coawstM, output.log reports the error below:
NLM: GET_STATE - Reading state initial conditions, 2010-01-01 00:00:00.00
(Grid 01, t = 0.0000, File: mycase-ini.nc, Rec=0001, Index=1)
- free-surface
(Min = 0.00000000E+00 Max = 0.00000000E+00)
- vertically integrated u-momentum component
(Min = 0.00000000E+00 Max = 0.00000000E+00)
- vertically integrated v-momentum component
(Min = 0.00000000E+00 Max = 0.00000000E+00)
--------------------------------------------------------------------------
mpirun noticed that process rank 10 with PID 31701 on node 168 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
4 total processes killed (some possibly by mpirun during cleanup)
I ran it many times, and the process was always killed suddenly while it was reading the initial conditions.

I tried two ideas to solve this problem:
1. I thought the problem might be in the initial file, so I ran the model in debug mode.
(By the way, another difference is that I undefined TS_MPDATA, because the model reported that TS_MPDATA cannot be activated in serial with partitions or shared-memory.)
The log showed that my initial file was OK (it was read without problems):
NLM: GET_STATE - Reading state initial conditions, 2010-01-01 00:00:00.00
(Grid 01, t = 0.0000, File: mycase-ini.nc, Rec=0001, Index=1)
- free-surface
(Min = 0.00000000E+00 Max = 0.00000000E+00)
- vertically integrated u-momentum component
(Min = 0.00000000E+00 Max = 0.00000000E+00)
- vertically integrated v-momentum component
(Min = 0.00000000E+00 Max = 0.00000000E+00)
- u-momentum component
(Min = 0.00000000E+00 Max = 0.00000000E+00)
- v-momentum component
(Min = 0.00000000E+00 Max = 0.00000000E+00)
- potential temperature
(Min = 1.59813415E-02 Max = 2.52891937E+01)
- salinity
(Min = 2.90392214E+01 Max = 3.49541981E+01)
The forcing part is OK (these three lines were added by me)
The boundary part is OK
The climatology part is ok
···
···
···
INQUIRY - unable to find requested variable: sustr
in files:
test_lwrad_era.nc
test_Pair_era.nc
test_Qair_era.nc
test_rain_era.nc
test_swrad_era.nc
test_Tair_era.nc
test_wind_era.nc
Found Error: 02 Line: 404 Source: ROMS/Utility/inquiry.F
Found Error: 02 Line: 128 Source: ROMS/Utility/get_2dfld.F
Found Error: 02 Line: 337 Source: ROMS/Nonlinear/get_data.F
Found Error: 02 Line: 856 Source: ROMS/Nonlinear/initial.F
Found Error: 02 Line: 200 Source: ROMS/Drivers/nl_ocean.h
···
···
···

ROMS/TOMS - Output NetCDF summary for Grid 01:

Analytical header files used:

/home/COAWST/COAWST-2/Projects/mycase/ana_sediment.h
Found Error: 02 Line: 465 Source: ROMS/Utility/close_io.F

ROMS/TOMS - Input error ............. exit_flag: 2


ERROR: Abnormal termination: NetCDF INPUT.
REASON: No error
2. I thought the problem might come from the differences between serial and parallel mode, so I installed MPICH in place of the OpenMPI I had used before.
But that raised another question:
Resolution, Grid 01: 180x252x30, Parallel Nodes: 1, Tiling: 4x4

ROMS/TOMS: Wrong choice of grid 01 partition or number of parallel nodes.
NtileI * NtileJ must be equal to the number of parallel nodes.
Change -np value to mpirun or
change domain partition in input script.
Found Error: 06 Line: 162 Source: ROMS/Utility/inp_par.F
Found Error: 06 Line: 114 Source: ROMS/Drivers/nl_ocean.h
The command I entered is 'mpirun -np 16 ./coawstM rivertest.in >&output.log&'. Why is the number of parallel nodes 1? When I used OpenMPI, output.log did not show the Parallel Nodes number.
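
A minimal sketch of the rule quoted in the log above (NtileI * NtileJ must equal the number of parallel nodes). The function name `check_partition` is hypothetical and is not part of ROMS or COAWST. "Parallel Nodes: 1" with `-np 16` would be consistent with each launched process starting as its own single-rank run, which can happen when the `mpirun` launcher and the MPI library linked into coawstM come from different MPI installations; that is an assumption here, not something the log confirms.

```python
def check_partition(ntile_i: int, ntile_j: int, nprocs: int) -> str:
    """Illustrative version of the tiling check reported by inp_par.F.

    ntile_i, ntile_j: NtileI and NtileJ from the ROMS input script.
    nprocs: number of MPI processes the executable actually sees.
    """
    needed = ntile_i * ntile_j
    if needed == nprocs:
        return f"OK: {ntile_i}x{ntile_j} tiling matches {nprocs} MPI processes"
    return (
        f"Mismatch: {ntile_i}x{ntile_j} tiling needs {needed} processes, "
        f"but the executable sees {nprocs}"
    )


# Case the user intended: 4x4 tiling launched with -np 16.
print(check_partition(4, 4, 16))
# Case the log reports: 4x4 tiling, but each process sees only 1 node.
print(check_partition(4, 4, 1))
```

If the second case is what happens, changing the tiling would not help; the launcher and the executable would need to agree on the MPI installation.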


I have no idea what causes this. Can anyone give me some suggestions?
Thanks very much!
-Edward

jcwarner
Posts: 1172
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

Re: Error about COAWST parallel mode

#2 Unread post by jcwarner »

I think you are barking up the wrong tree.
It is most likely not an MPI issue. Now that you have changed to MPICH, you need to make sure that the mpirun call points to MPICH, and that all the libraries (HDF5, NetCDF, etc.) are built with that flavor of MPI.

I suggest you go back to OpenMPI and fix the message
"INQUIRY - unable to find requested variable: sustr
in files:
...."

Was that ever fixed?
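
One way to check which MPI flavor the `mpirun` on the PATH belongs to is to look at its `--version` output. The sketch below is only an illustration: the helper names `mpi_flavor` and `launcher_report` are hypothetical, and the keyword matching assumes that Open MPI's launcher mentions "Open MPI" and that MPICH's Hydra launcher mentions "HYDRA" or "MPICH" in that output.

```python
import shutil
import subprocess


def mpi_flavor(version_text: str) -> str:
    """Guess the MPI flavor from the text printed by `mpirun --version`."""
    text = version_text.lower()
    if "open mpi" in text:
        return "openmpi"
    if "hydra" in text or "mpich" in text:
        return "mpich"
    return "unknown"


def launcher_report(launcher: str = "mpirun") -> str:
    """Report where the launcher lives on PATH and which flavor it looks like."""
    path = shutil.which(launcher)
    if path is None:
        return f"{launcher}: not found on PATH"
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, timeout=30
    )
    return f"{path}: {mpi_flavor(result.stdout + result.stderr)}"


# Illustrative version strings (assumed formats, not taken from this thread).
print(mpi_flavor("mpirun (Open MPI) 4.0.3"))
print(mpi_flavor("HYDRA build details:\n    Version: 3.3.2"))
```

Calling `launcher_report()` on the machine in question shows which launcher is picked up first; the same flavor should then be the one used to build HDF5, NetCDF and coawstM.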

EdwardElric
Posts: 12
Joined: Sun Sep 27, 2020 10:52 pm
Location: College of Oceanic and Atmospheric Sciences, Ocean

Re: Error about COAWST parallel mode

#3 Unread post by EdwardElric »

jcwarner wrote: Thu Apr 22, 2021 12:43 pm I think you are barking up the wrong tree.
it is most likely not an mpi issue. Now that you changed to mpich you need to make sure that the mpirun call points to mpich, and that all the libraries (hdf5, netcdf ,etc ...) are all with that flavor of mpi.

Suggest you go back to openmpi and fix the message
"INQUIRY - unable to find requested variable: sustr
in files:
....

was that ever fixed?

Hello John,
The problem (INQUIRY - unable to find requested variable: sustr) happens ONLY when I run in debug mode.
When I run with OpenMPI, there is no such problem. Here is the log from the OpenMPI run:


 Output/Input Files:
             Output Restart File:  ocean_rst.nc
             Output History File:  ocean_his.nc
            Output Averages File:  ocean_avg.nc
                 Input Grid File:  mycase-grid.nc
    Input Nonlinear Initial File:  mycase-ini.nc
        Input Sources/Sinks File:  mycase-river.nc
              Tidal Forcing File:  mycase-tide.nc
           Input Forcing File 01:  test_lwrad_era.nc
           Input Forcing File 02:  test_Pair_era.nc
           Input Forcing File 03:  test_Qair_era.nc
           Input Forcing File 04:  test_rain_era.nc
           Input Forcing File 05:  test_swrad_era.nc
           Input Forcing File 06:  test_Tair_era.nc
           Input Forcing File 07:  test_wind_era.nc
         Input Boundary File 01:  mycase-bry.nc
         
          Tile partition information for Grid 01:  180x252x30  tiling: 4x4

     tile     Istr     Iend     Jstr     Jend     Npts

 Number of tracers:            2
        0        1       45        1       63    85050
        1       46       90        1       63    85050
        ···
        ···
        


The files (test_lwrad_era.nc, test_Pair_era.nc, etc.) mentioned in the INQUIRY message are the forcing files used for BULK_FLUX.
I compared this with other cases that define the bulk flux options, and they do not need to provide the sustr variable, right?
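
A rough sketch of why the INQUIRY message could appear in one build and not the other. It assumes that with BULK_FLUXES active ROMS computes the surface stress from the wind fields, and that without it (and without an analytical stress) ROMS reads sustr and svstr from the forcing files. The variable lists below are assumptions based on common ROMS naming, they are not complete, and the set of variables in `available` is only guessed from the file names in this thread.

```python
# Assumed, incomplete lists of forcing variables for the two configurations.
BULK_INPUTS = {"Uwind", "Vwind", "Pair", "Tair", "Qair", "rain", "swrad", "lwrad"}
STRESS_INPUTS = {"sustr", "svstr"}


def missing_forcing(available, bulk_fluxes: bool):
    """Return the required forcing variables that are not in `available`."""
    required = BULK_INPUTS if bulk_fluxes else STRESS_INPUTS
    return sorted(required - set(available))


# Variables guessed from the file names (test_wind_era.nc, test_Pair_era.nc, ...).
available = {"Uwind", "Vwind", "Pair", "Tair", "Qair", "rain", "swrad", "lwrad"}

print(missing_forcing(available, bulk_fluxes=True))   # build with BULK_FLUXES
print(missing_forcing(available, bulk_fluxes=False))  # build without it
```

Under these assumptions, a request for sustr would suggest that the debug executable was compiled without the bulk flux option, so comparing the CPP options of the two builds may be worthwhile.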

Now, back to the topic of OpenMPI.
Maybe I did not express myself clearly enough.
I mean that when I run with OpenMPI, the model breaks down while reading the initial conditions, but when I run in serial mode, it reads them successfully.
So I thought the problem might be related to parallel mode and tried running with MPICH, which led to the other question.
However, what concerns me most is why the model cannot read the initial conditions successfully with OpenMPI.
If it is neither an MPI issue nor an initial-file issue, how can I fix it?

thanks very much
-Edward
