SWAN can't use more than one CPU
When I run the TESTHEAD example on a cluster with MPI, I can't assign more than one CPU to SWAN. ROMS can use more than one CPU, but SWAN can't. If I assign more than one CPU to SWAN, there is an error during the run. Could you give me some suggestions?
To change the number of CPUs for SWAN, you need to change two things and check one thing (a worked example follows the list):
1) coupling_test_head.in: modify NnodesWAV to equal the number of processors to allocate to SWAN.
2) When you run the job, specify the total number of processors, for example
mpirun -np X ./oceanM ROMS/External/coupling_test_head.in
where X = NnodesOCN + NnodesWAV.
3) Also check that NnodesOCN equals the number of ROMS tile partitions set in ocean_test_head.in (so NnodesOCN = NtileI * NtileJ).
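For example, the X = 6 case discussed below (four ROMS processes, two SWAN processes) would be set up roughly as follows; the file names follow the test_head case and the values are only illustrative:
  In ocean_test_head.in (a 2 x 2 ROMS tiling, i.e. 4 tiles):
    NtileI == 2
    NtileJ == 2
  In coupling_test_head.in:
    NnodesOCN = 4        ! must equal NtileI * NtileJ
    NnodesWAV = 2
  Launch with the total, X = NnodesOCN + NnodesWAV = 6:
    mpirun -np 6 ./oceanM ROMS/External/coupling_test_head.in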
I have done what you said, with X = NnodesOCN + NnodesWAV.
I can set X = 3 (NnodesOCN = 2, NnodesWAV = 1),
or X = 5 (NnodesOCN = 4, NnodesWAV = 1),
or any other combination with NnodesWAV = 1.
But I can't set X = 4 (NnodesOCN = 2, NnodesWAV = 2),
or X = 6 (NnodesOCN = 4, NnodesWAV = 2),
or any combination with NnodesWAV > 1: there is always an error.
For X = 6 (NnodesOCN = 4, NnodesWAV = 2), the error messages are shown below:
-------------------------------------------------------------------------------------------------------------------------------
NL ROMS/TOMS: started time-stepping:( TimeSteps: 00000001 - 00001440)
== SWAN sent wave fields and Myerror= 0
== SWAN recvd ocean fields and Myerror= 0
+time 20030101.000200 , step 1; iteration 1; sweep 1
STEP time[DAYS] KINETIC_ENRG POTEN_ENRG TOTAL_ENRG NET_VOLUME trd
0 0.00000 0.000000E+00 9.619779E+01 9.619779E+01 1.952863E+10 0
DEF_HIS - creating history file: ocean_his.nc
WRT_HIS - wrote history fields (Index=1,1) into time record = 0000001
1 0.00035 1.993010E-12 9.619779E+01 9.619779E+01 1.952863E+10 0
+time 20030101.000200 , step 1; iteration 1; sweep 2
2 0.00069 1.534867E-08 9.619779E+01 9.619779E+01 1.952863E+10 0
3 0.00104 3.498813E-09 9.619779E+01 9.619779E+01 1.952863E+10 0
4 0.00139 1.010260E-08 9.619779E+01 9.619779E+01 1.952863E+10 0
+time 20030101.000200 , step 1; iteration 1; sweep 3
5 0.00174 2.099356E-08 9.619779E+01 9.619779E+01 1.952863E+10 0
6 0.00208 4.158252E-08 9.619779E+01 9.619779E+01 1.952863E+10 0
7 0.00243 7.630712E-08 9.619778E+01 9.619778E+01 1.952863E+10 0
+time 20030101.000200 , step 1; iteration 1; sweep 4
p4_6891: p4_error: interrupt SIGSEGV: 11
p5_6905: p4_error: interrupt SIGSEGV: 11
rm_l_4_6902: (2.312500) net_send: could not write to fd=5, errno = 32
p1_6849: p4_error: net_recv read: probable EOF on socket: 1
p2_6863: p4_error: net_recv read: probable EOF on socket: 1
p3_6877: p4_error: net_recv read: probable EOF on socket: 1
rm_l_1_6860: (2.402344) net_send: could not write to fd=5, errno = 32
p4_6891: (2.312500) net_send: could not write to fd=5, errno = 32
rm_l_2_6874: (2.375000) net_send: could not write to fd=5, errno = 32
rm_l_3_6888: (2.343750) net_send: could not write to fd=5, errno = 32
rm_l_5_6916: (2.285156) net_send: could not write to fd=5, errno = 32
-- end MPICH run --
p2_6863: (6.375000) net_send: could not write to fd=5, errno = 32
p3_6877: (6.347656) net_send: could not write to fd=5, errno = 32
p1_6849: (6.406250) net_send: could not write to fd=5, errno = 32
p5_6905: (10.285156) net_send: could not write to fd=5, errno = 32
-----------------------------------------------------------------------------------------------------------------
If X = 5 (NnodesOCN = 4, NnodesWAV = 1), it runs normally.
- What kind of system is this: a Linux cluster, a PC, ...?
- Are you using the latest version?
- I remember an issue like this, but I thought I had fixed it along the way.
It looks like there is a write error:
"rm_l_4_6902: (2.312500) net_send: could not write to fd=5, errno = 32"
See what files have been created for SWAN.
Since you are trying to run SWAN with 2 processors, there should be 2 files for each type of output:
PRINT-001
PRINT-002
hsig.mat-001
hsig.mat-002
swan_restart.dat-001
swan_restart.dat-002
etc etc
There should also be a file called swaninit.
Remove swaninit and rerun it.
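A rough way to check from the run directory (using the file names listed above; adjust them to whatever your case actually writes):
    ls -l PRINT-0* hsig.mat-0* swan_restart.dat-0*    ! expect one of each per SWAN processor
    rm swaninit                                        ! SWAN should regenerate it on the next run
    mpirun -np 6 ./oceanM ROMS/External/coupling_test_head.in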
What's the difference?
I guess Haibo used the package checked out from the ROMS trunk rather than from CSTM. The code checked out from the CSTM trunk should work with a multi-processor setup for SWAN. I just tried the CSTM version I checked out several months ago. The only error message was a null communicator at the final stage, which I guess is caused by double calls to mpi_finalize (in SWAN and in the coupler). It doesn't affect the results. So what's the difference between the CSTM trunk and the ROMS trunk right now?
Right now, the ROMS trunk is identical to the CSTM trunk.
I did submit a fix for that mpi_finalize issue. I also just found that we have a Fortran STOP in the MCT router cleanup phase, and I want to remove that as well. But, as you said, that is all cleanup stuff that does not affect the run itself.
I just checked the release, and I can get test_head to run with multiple processors for SWAN, so I cannot recreate the problem.
Is he running it on a different system (a PC?)?
Can you send the entire output that is written to stdout, not just a short section of it? Send it to my email so we don't fill up this whole screen with it:
jcwarner@usgs.gov
Re: SWAN can't use more than one CPU
Hi All,
I am also unable to assign more than one processor to the SWAN model. I am trying to run the SANDY_Coupled test case and I am getting the following error:
Timing for main: time 2012-10-28_12:01:00 on domain 2: 2.76283 elapsed seconds
WRT_HIS - wrote history fields (Index=1,1) in record = 0000001 01
DEF_AVG - creating average file, Grid 01: Sandy_ocean_avg.nc
At line 5036 of file swanparll.f90
Fortran runtime error: Bad unit number in statement
In coupling_sandy.in I have specified the following:
NnodesATM = 4 ! atmospheric model
NnodesWAV = 4 ! wave model
NnodesOCN = 4 ! ocean model
and I am running it as shown below (a quick consistency check against the earlier advice follows the command):
mpirun -np 12 ./coawstM coupling_sandy.in
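For what it's worth, these numbers appear consistent with the rule given earlier in the thread; the ROMS tiling lines below only illustrate the remaining check and are not copied from my actual input:
    NnodesATM + NnodesWAV + NnodesOCN = 4 + 4 + 4 = 12   (matches mpirun -np 12)
    NtileI == 2    ! in the ROMS ocean input, NtileI * NtileJ must equal NnodesOCN = 4
    NtileJ == 2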
I am running on my workstation under Ubuntu. I am attaching the entire log. The change mentioned in Dr. Warner's earlier suggestion (in Waves/SWAN/Src/swanparll.F, line 1169, from REAL IOPTR to REAL IOPTR(ILEN)) is not present in the newer subversion of swanparll.F.
I would be grateful for any suggestions to solve this error.
Thanking You,
Attachment: Sandy_test.log
Re: SWAN can't use more than one CPU
Not sure. Are you writing to multiple NetCDF files? This part of the code is where SWAN opens files for each processor for writing. Comment out the SWAN output writing just to see if that makes it work; a sketch of that test follows.
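A minimal sketch of that test, assuming the Sandy case uses a SWAN command file along the lines of swan_sandy.in (the file name and the exact output requests are only examples; SWAN treats lines beginning with $ as comments):
    $ temporarily disable SWAN file output while testing the parallel run
    $ BLOCK 'COMPGRID' NOHEAD 'hsig.mat' LAY 4 HSIGN 1. OUTPUT 20121028.120000 1 HR
    $ TABLE 'COMPGRID' HEAD 'swan_table.txt' HSIGN DIR PER OUTPUT 20121028.120000 1 HR
If the coupled run then gets past the point where it crashed, the problem is in SWAN's parallel output handling rather than in the coupling itself.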