Fatal error in MPI_Recv

agpc
Posts: 63
Joined: Mon Jul 27, 2020 7:44 pm
Location: Applied Geophysics Center (AGPC)

#1 Post by agpc

Hi everyone,
I ran into a problem while running the COAWST model. The error is shown below:

Code:

Fatal error in MPI_Recv: Other MPI error, error stack:
MPI_Recv(186).............: MPI_Recv(buf=0x7ffdaaeef83c, count=1, MPI_INTEGER, src=0, tag=0, comm=0x84000002, status=0x47a6940) failed
dequeue_and_set_error(596): Communication error with rank 0

This is my coawst.out log file:

Code:

 Nesting domain
 ids,ide,jds,jde            1        1129           1         757
 ims,ime,jms,jme          837         999         463         579
 ips,ipe,jps,jpe          847         987         473         567
 INTERMEDIATE domain
 ids,ide,jds,jde           59         440          58         315
 ims,ime,jms,jme          333         399         207         258
 ips,ipe,jps,jpe          343         389         217         248
 *************************************
d01 2014-07-05_00:00:00  alloc_space_field: domain            2 ,               34688000  bytes allocated
d01 2014-07-05_00:00:00  alloc_space_field: domain            2 ,               35056560  bytes allocated
d01 2014-07-05_00:00:00  alloc_space_field: domain            2 ,               35381760  bytes allocated
d01 2014-07-05_00:00:00  alloc_space_field: domain            2 ,               35587720  bytes allocated
d01 2014-07-05_00:00:00  alloc_space_field: domain            2 ,               35381760  bytes allocated
d01 2014-07-05_00:00:00  alloc_space_field: domain            2 ,               35587720  bytes allocated
d01 2014-07-05_00:00:00  alloc_space_field: domain            2 ,               35381760  bytes allocated
d01 2014-07-05_00:00:00  alloc_space_field: domain            2 ,               35587720  bytes allocated
d01 2014-07-05_00:00:00  alloc_space_field: domain            2 ,               35772000  bytes allocated

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 40 PID 103363 RUNNING AT compute-1-3
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

My coupling.in file is as follows:

Code:

! Their sum must be equal to the total number of processors.

   NnodesATM =  64                 ! atmospheric model
   NnodesWAV =  7                   ! wave model
   NnodesOCN =  49                  ! ocean model
   NnodesHYD =  0                    ! hydrology model

! Time interval (seconds) between coupling of models.

  TI_ATM2WAV =   1800.0d0              ! atmosphere to wave coupling interval
  TI_ATM2OCN =   1800.0d0              ! atmosphere to ocean coupling interval
  TI_WAV2ATM =   1800.0d0              ! wave to atmosphere coupling interval
  TI_WAV2OCN =   1800.0d0              ! wave to ocean coupling interval
  TI_OCN2WAV =   1800.0d0              ! ocean to wave coupling interval
  TI_OCN2ATM =   1800.0d0              ! ocean to atmosphere coupling interval
  TI_OCN2HYD =     0.0d0              ! ocean to hydro coupling interval
  TI_HYD2OCN =     0.0d0              ! hydro to ocean coupling interval

! Enter names of Atm, Wav, and Ocn input files.
! The Wav program needs multiple input files, one for each grid.

   ATM_name = Projects/Neoguri/namelist.input                            ! atmospheric model
   WAV_name = Projects/Neoguri/swan_neoguri.in
!              Projects/Sarika/swan_Sarika_ref3.in         ! wave model
!  WAV_name = ww3_grid.inp
   OCN_name = Projects/Neoguri/ocean_neoguri.in             ! ocean model
   HYD_name = hydro.namelist                            ! hydro model

! Sparse matrix interpolation weights files. You have 2 options:
! Enter "1" for option 1, or "2" for option 2, and then list the
! weight file(s) for that option.

   SCRIP_WEIGHT_OPTION = 1

Please help me resolve this.
Thank you all,
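
The comment in the coupling.in fragment above says the Nnodes* entries must sum to the total number of processors. A minimal sketch of that check follows. `node_counts` is a hypothetical helper, not part of COAWST, and the MPI launch command is not shown in this thread, so the total still has to be compared by hand against the process count given to the launcher.

```python
import re

# Fragment of coupling.in as quoted in the post above.
COUPLING_IN = """
   NnodesATM =  64                 ! atmospheric model
   NnodesWAV =  7                   ! wave model
   NnodesOCN =  49                  ! ocean model
   NnodesHYD =  0                    ! hydrology model
"""

def node_counts(text):
    """Return {model: count} for every 'NnodesXXX = n' line (hypothetical helper)."""
    counts = {}
    for line in text.splitlines():
        line = line.split("!", 1)[0]  # drop the trailing '!' comment
        m = re.match(r"\s*Nnodes(\w+)\s*=\s*(\d+)", line)
        if m:
            counts[m.group(1)] = int(m.group(2))
    return counts

counts = node_counts(COUPLING_IN)
total = sum(counts.values())
print(counts, "total =", total)
# {'ATM': 64, 'WAV': 7, 'OCN': 49, 'HYD': 0} total = 120
```

With the values posted here the run would need 120 MPI processes in total.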

jcwarner
Posts: 1223
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

#2 Post by jcwarner

There is not enough information here to figure out what the issue is.
Can you dig a bit deeper?

#3 Post by agpc

Hi John,
I've tried many times to fix it but it didn't work.
Please help me to solve this problem.
I have attached all of my namelist and log files to this post.
Thank you very much!
Attachments
swan_neoguri.in.txt (5.19 KiB)
ocean_neoguri.in.txt (153.66 KiB)
namelist.input.txt (4.71 KiB)
coupling_neoguri.in.txt (8.48 KiB)
coawst.out.txt (209.9 KiB)
COAWST.err.txt (154.73 KiB)

#4 Post by jcwarner

Please read all of the Err* files. This is halfway through:
"MCT::m_SparseMatrixPlus::initFromRoot_:: FATAL--length of vector y different from row count of sMat.Length of y = 184275 Number of rows in sMat = 183400
010.MCT(MPEU)::die.: from MCT::m_SparseMatrixPlus::initFromRoot_()"

So MCT is reading the scrip weights, and there is a difference between what MCT expects and what is in the weights file.
You have SCRIP_COAWST_NAME = Projects/Neoguri/scrip_neoguri_static.nc
So how did you compute that weights file?

My first guess is that you did not enter the correct size for the SWAN grid, because that is where most people have the problem.
-The SWAN INPUT file lists its grid size as the full size - 1 (SWAN Manual: "number of meshes in computational grid in .. ξ−direction for a curvilinear grid (this number is one less than the number of grid points in this domain!)").
524*350 = 183400
525*351 = 184275
-WRF lists the total full number of rho points.
-ROMS lists its grid size as the full size - 2 (Lm = L-1, but it starts at 0, so it is really 2 less than the total).

Every model is different. Not sure why, but that is how it goes.
Because the SWAN grid is an ASCII text file, you have to list the size of the grid in the scrip.in file so it knows how to set up the dimensions.

So in the scrip.in, you need to set
! -the size of the SWAN grids (full number of center points)
This is going to be 1 more than in the INPUT file. I did this because MCT wants to know the total number of points.
So try swan_numx=525 swan_numy=351,
rerun scrip, and see how that goes.
-j
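
The three size conventions described above can be written out as a small sketch. It assumes you already know the full number of rho points in each direction (525 x 351 is taken from the products quoted above). `grid_sizes` and its dictionary keys are illustrative names only, not COAWST or SWAN identifiers.

```python
def grid_sizes(full_nx, full_ny):
    """Derive each component's size setting from the full number of grid points.

    Hypothetical helper: the key names are labels for this sketch only.
    """
    return {
        # ROMS input file: Lm, Mm are 2 less than the full size
        "roms_Lm_Mm": (full_nx - 2, full_ny - 2),
        # SWAN INPUT file: number of meshes, 1 less than the number of points
        "swan_input_meshes": (full_nx - 1, full_ny - 1),
        # scrip.in SWAN_NUMX / SWAN_NUMY: full number of points
        "scrip_swan_numx_numy": (full_nx, full_ny),
        # total point count, the kind of number that shows up in the MCT message
        "total_points": full_nx * full_ny,
    }

sizes = grid_sizes(525, 351)
for name, value in sizes.items():
    print(name, value)
# roms_Lm_Mm (523, 349)
# swan_input_meshes (524, 350)
# scrip_swan_numx_numy (525, 351)
# total_points 184275
```

Note that 524*350 = 183400, the other number in the error message, is what you get if the mesh count is used where the full point count is expected.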

#5 Post by agpc

Hi John,
I changed the SWAN grid size to the total number of points (swan_numx=525 swan_numy=351), but it still failed.
The coawst.out file shows:

Code:

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 40 PID 104777 RUNNING AT compute-1-8
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

And the error log file shows:

Code:

Fatal error in PMPI_Wait: Other MPI error, error stack:
PMPI_Wait(183)............: MPI_Wait(request=0x7fff3416e01c, status=0x47a6900) failed
MPIR_Wait_impl(77)........:
dequeue_and_set_error(596): Communication error with rank 40

I have also attached all of my log files to this post.
Please help me solve this error.
Attachments
coawst.out.txt (230.59 KiB)
COAWST.err.txt (154.96 KiB)

#6 Post by jcwarner

If you look in the output, you will see that you basically have the same error with different values:
MCT::m_SparseMatrixPlus::initFromRoot_:: FATAL--length of vector y different from row count of sMat.Length of y = 185152 Number of rows in sMat = 184275
031.MCT(MPEU)::die.: from MCT::m_SparseMatrixPlus::initFromRoot_()

Please check the grid size settings for SWAN used for the scrip interpolation weights file.
These should be the total number of 'rho points' in the x and y directions.
For example, if you have ROMS and SWAN on the same grid, netcdf_load(roms_file) followed by size(h) will give you the values to use in the scrip command.
SWAN input grid files are in ASCII format, so you need to tell the system what the grid sizes are.
-j
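
One way to read the two numbers in the MCT message is to test them against the full grid size and its near neighbours. The sketch below assumes a full grid of 525 x 351 points (the 525*351 = 184275 product quoted earlier in the thread) and the same offset in both directions. `parse_mct_mismatch` and `offset_from_full` are hypothetical helpers written for this illustration, not MCT or COAWST routines.

```python
import re

def parse_mct_mismatch(message):
    """Extract (length_of_y, rows_in_sMat) from the MCT fatal message."""
    m = re.search(
        r"Length of y\s*=\s*(\d+)\s+Number of rows in sMat\s*=\s*(\d+)", message
    )
    if m is None:
        raise ValueError("not an MCT size-mismatch message")
    return int(m.group(1)), int(m.group(2))

def offset_from_full(count, full_nx, full_ny, max_offset=2):
    """Return k with (full_nx + k) * (full_ny + k) == count, or None if no small k fits."""
    for k in range(-max_offset, max_offset + 1):
        if (full_nx + k) * (full_ny + k) == count:
            return k
    return None

MSG = (
    "MCT::m_SparseMatrixPlus::initFromRoot_:: FATAL--length of vector y "
    "different from row count of sMat.Length of y = 185152 "
    "Number of rows in sMat = 184275"
)

len_y, rows = parse_mct_mismatch(MSG)
print(len_y, offset_from_full(len_y, 525, 351))  # 185152 1
print(rows, offset_from_full(rows, 525, 351))    # 184275 0
```

Here 185152 = 526*352 and 184275 = 525*351, so the two sides differ by exactly one point in each direction, the same pattern as the earlier 184275 versus 183400 (524*350) mismatch.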

#7 Post by agpc

Hi John,
I used MATLAB to display the size of my ROMS grid:

Code:

>>  netcdf_load('ROMS_COAWST_grd1.nc')
>> size(h)
ans =
   525   351

The result equals the SWAN grid size that I set in my namelist file.
However, the run still failed after I made the SWAN grid the same size as the ROMS grid.
In addition, I got other errors in the Errfile01-001 file:
Error : 3 comp. grid points on one line
Error : Grid angle <0 or >180 degrees

#8 Post by jcwarner

If your ROMS grid size(h) is 525 351,

then for the SWAN INPUT file you need to have
INPGRID BOTTOM CURVILINEAR 0 0 524 350 EXC 9.999000e+003

#9 Post by agpc

Hi John,
I've set up my namelist files like this:

Code:

 
 #With ROMS model
          Lm == 523        ! Number of I-direction INTERIOR RHO-points -2
          Mm == 349          ! Number of J-direction INTERIOR RHO-points -2
  #With SWAN model
          && KEYWORDS TO CREATE AND READ COMPUTATIONAL GRID &&
CGRID CURVILINEAR 524 350 EXC 9.999000e+003 9.999000e+003 CIRCLE 36 0.04 1.0 24
READGRID COORDINATES 1 'Projects/Mangkhut/swan_coord.grd' 4 0 0 FREE

&& KEYWORDS TO CREATE AND READ BATHYMETRY GRID &&
INPGRID BOTTOM CURVILINEAR 0 0 524 350 EXC 9.999000e+003
READINP BOTTOM 1 'Projects/Mangkhut/swan_bathy.bot' 4 0 FREE

&& KEYWORD TO CREATE CURRENT GRID &&
INPGRID CURRENT CURVILINEAR 0 0 524 350 EXC 9.999000e+003 &
       NONSTATIONARY 20180912.000000 5 DAY 20180917.000000

&& KEYWORD TO CREATE WATER LEVEL GRID &&
INPGRID WLEV CURVILINEAR 0 0 524 350 EXC 9.999000e+003 &
       NONSTATIONARY 20180912.000000 5 DAY 20180917.000000

&& KEYWORD TO CREATE BOTTOM FRIC GRID &&
INPGRID FRIC CURVILINEAR 0 0 524 350 EXC 9.999000e+003 &
       NONSTATIONARY 20180912.000000 5 DAY 20180917.000000

&& KEYWORD TO CREATE WIND GRID &&
INPGRID WIND REGULAR 96.9742 -2.3665 0 524 350 0.2 0.2 EXC 9.999000e+003 &
      NONSTATIONARY 20180912.000000 6 HR 20180917.000000
& Boundary files  ****************************************
& 2D Spec Boundary files  ****************************
BOUND SHAPESPEC JONSWAP MEAN DSPR DEGREES
BOUNDSPEC SEGMENT IJ    0    0    0  350  CONSTANT PAR  0.1  20.0    0.0  15
BOUNDSPEC SEGMENT IJ    0    0  524    0  CONSTANT PAR  0.1  20.0    0.0  15
BOUNDSPEC SEGMENT IJ  524    0  524  350  CONSTANT PAR  0.1  20.0    0.0  15
BOUNDSPEC SEGMENT IJ    0  350  524  350  CONSTANT PAR  0.1  20.0    0.0  15

Unfortunately, it still failed.

Code:

MCT::m_SparseMatrixPlus::initFromRoot_:: FATAL--length of vector y different from row count of sMat.Length of y =   184275 Number of rows in sMat =   183400
038.MCT(MPEU)::die.: from MCT::m_SparseMatrixPlus::initFromRoot_()
application called MPI_Abort(MPI_COMM_WORLD, 2) - process 56
d01 2018-09-12_00:00:00  alloc_space_field: domain            2 ,               25040400  bytes allocated
Fatal error in PMPI_Wait: Other MPI error, error stack:
PMPI_Wait(183)............: MPI_Wait(request=0x7ffe7b48211c, status=0x47a6900) failed
MPIR_Wait_impl(77)........: 
dequeue_and_set_error(596): Communication error with rank 40

Do you have any other suggestions for me?
Thank you!

#10 Post by jcwarner

The sparse matrix calls are reading the scrip interpolation weights.
What do you have in the scrip.in to compute the interpolation weights?
-j

#11 Post by agpc

Hi John,
Here is my scrip.in file:

Code:

$INPUTS
!
!  Input file for scrip_coawst.
!  The $INPUTS line is required at the top of this file. 
!  Edit this file to enter the correct information below.
!  Then run this program as "scrip_coawst scrip_coawst_sandy.in"
!
! 1) Enter name of output netcdf4 file
!
!OUTPUT_NCFILE='scrip_sandy_moving.nc'
OUTPUT_NCFILE='scrip_mangkhut_static.nc'
!OUTPUT_NCFILE='scrip_sandy_nowavenest.nc'

! 2) Enter total number of ROMS, SWAN, WW3, and WRF grids:
!
NGRIDS_ROMS=1,
NGRIDS_SWAN=1,
NGRIDS_WW3=0,
NGRIDS_WRF=2,
NGRIDS_HYD=0,

! 3) Enter name of the ROMS grid file(s):
!
ROMS_GRIDS(1)='../../Projects/Mangkhut/ROMS_COAWST_grd1.nc',
!ROMS_GRIDS(2)='../../Projects/Sarika/Sandy_roms_grid_ref3.nc',

! 4) Enter SWAN information:
!    -the name(s) of the SWAN grid file(s) for coords and bathy.
!    -the size of the SWAN grids (full number of center points), and 
!    -if the swan grids are Spherical(set cartesian=0) or
!                           Cartesian(set cartesian=1).
!
SWAN_COORD='../../Projects/Mangkhut/swan_coord.grd',
!SWAN_COORD(2)='../../Projects/Sarika/Sandy_swan_coord_ref3.grd',
SWAN_BATH='../../Projects/Mangkhut/swan_bathy.bot',
!SWAN_BATH(2)='../../Projects/TEST/Sandy_swan_bathy_ref3.bot',
SWAN_NUMX=524,
!SWAN_NUMX(2)=116,
SWAN_NUMY=350,
!SWAN_NUMY(2)=86,
CARTESIAN=0,
!CARTESIAN(2)=0,

! 5) Enter WW3 information
!    -the name(s) of the WW3 grid file(s) for x- y- coords and bathy.
!    -the size of the WW3 grids (full number of grid center points). 
!
!WW3_XCOORD(1)='../../Projects/Sandy/WW3/ww3_sandy_xcoord.dat',
!WW3_YCOORD(1)='../../Projects/Sandy/WW3/ww3_sandy_ycoord.dat',
!WW3_BATH(1)='../../Projects/Sandy/WW3/ww3_sandy_bathy.bot',
!WW3_NUMX(1)=87,
!WW3_NUMY(1)=65,

! 6) Enter the name of the WRF input grid(s). If the grid is a 
!    moving child nest then enter that grid name as 'moving'.
!    Also provide the grid ratio, this is used for a moving nest.
!
WRF_GRIDS(1)='../../Projects/Mangkhut/wrfinput_d01',
WRF_GRIDS(2)='../../Projects/Mangkhut/wrfinput_d02',
!WRF_GRIDS(2)='moving'
PARENT_GRID_RATIO(1)=1,
PARENT_GRID_RATIO(2)=3,
PARENT_ID(1)=0
PARENT_ID(2)=1

! 7) Enter the name of the WRF Hydro input grid(s).
!
!HYDRO_GRIDS(1)='../../WRF_hydro/forcings/WRF-Hydro/DOMAIN/Fulldom_hires.nc',

!
!  The $END statement below is required
!
$END 

I have also attached the coupling.in file to this post.
Attachments
coupling_mangkhut.in (8.51 KiB)

#12 Post by jcwarner

In your scrip file, use the (1) index. Instead of
SWAN_COORD='../../Projects/Mangkhut/swan_coord.grd',
SWAN_BATH='../../Projects/Mangkhut/swan_bathy.bot',
SWAN_NUMX=524,
SWAN_NUMY=350,
CARTESIAN=0,

use
SWAN_COORD(1)='../../Projects/Mangkhut/swan_coord.grd',
SWAN_BATH(1)='../../Projects/Mangkhut/swan_bathy.bot',
SWAN_NUMX(1)=524,
SWAN_NUMY(1)=350,
CARTESIAN(1)=0,

#13 Post by agpc

Hi John,
I changed my scrip.in following your suggestion, but it still failed.
The error from my last post still occurs.

Code:

MCT::m_SparseMatrixPlus::initFromRoot_:: FATAL--length of vector y different from row count of sMat.Length of y =   184275 Number of rows in sMat =   183400
038.MCT(MPEU)::die.: from MCT::m_SparseMatrixPlus::initFromRoot_()
application called MPI_Abort(MPI_COMM_WORLD, 2) - process 56
d01 2018-09-12_00:00:00  alloc_space_field: domain            2 ,               25040400  bytes allocated
Fatal error in PMPI_Wait: Other MPI error, error stack:
PMPI_Wait(183)............: MPI_Wait(request=0x7ffc2147929c, status=0x47a6900) failed
MPIR_Wait_impl(77)........:
dequeue_and_set_error(596): Communication error with rank 40

#14 Post by jcwarner

Not sure if you have figured this out yet, but maybe I misled you. For the scrip file, you need to list the full size of the SWAN grid.

In your scrip file, use
SWAN_COORD(1)='../../Projects/Mangkhut/swan_coord.grd',
SWAN_BATH(1)='../../Projects/Mangkhut/swan_bathy.bot',
SWAN_NUMX(1)=525,
SWAN_NUMY(1)=351,
CARTESIAN(1)=0,
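
To catch this kind of mismatch before running the coupled model, the scrip.in sizes can be compared against the SWAN INPUT file. The following is a rough sketch only: the helper names are made up for this example, it handles just the CURVILINEAR CGRID form seen in this thread, and it assumes an unindexed SWAN_NUMX refers to grid 1.

```python
import re

def scrip_swan_size(scrip_text, grid=1):
    """Read SWAN_NUMX / SWAN_NUMY for one grid from scrip.in text (hypothetical helper)."""
    size = {}
    for line in scrip_text.splitlines():
        line = line.strip()
        if line.startswith("!"):  # skip commented-out entries
            continue
        m = re.match(r"SWAN_NUM([XY])(?:\((\d+)\))?\s*=\s*(\d+)", line, re.IGNORECASE)
        # Assumption: an entry without an index is treated as grid 1.
        if m and int(m.group(2) or 1) == grid:
            size[m.group(1).upper()] = int(m.group(3))
    return size["X"], size["Y"]

def swan_cgrid_meshes(swan_text):
    """Read the two mesh counts from a 'CGRID CURVILINEAR mx my ...' line."""
    m = re.search(r"^\s*CGRID\s+CURVILINEAR\s+(\d+)\s+(\d+)", swan_text, re.MULTILINE)
    if m is None:
        raise ValueError("no CGRID CURVILINEAR line found")
    return int(m.group(1)), int(m.group(2))

def sizes_consistent(scrip_text, swan_text):
    """True when scrip.in lists one more point than SWAN lists meshes, in x and y."""
    nx, ny = scrip_swan_size(scrip_text)
    mx, my = swan_cgrid_meshes(swan_text)
    return (nx, ny) == (mx + 1, my + 1)

SWAN_IN = "CGRID CURVILINEAR 524 350 EXC 9.999000e+003 9.999000e+003 CIRCLE 36 0.04 1.0 24\n"
OLD_SCRIP = "SWAN_NUMX=524,\n!SWAN_NUMX(2)=116,\nSWAN_NUMY=350,\n"
NEW_SCRIP = "SWAN_NUMX(1)=525,\nSWAN_NUMY(1)=351,\n"

print(sizes_consistent(OLD_SCRIP, SWAN_IN))  # False
print(sizes_consistent(NEW_SCRIP, SWAN_IN))  # True
```

The first case reproduces the settings that gave 184275 versus 183400; the second uses the values suggested above.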