Imbalance in Parallel writing in ROMS

General scientific issues regarding ROMS

Moderators: arango, robertson

koushik
Posts: 12
Joined: Mon Aug 12, 2019 3:29 pm
Location: IISC

Imbalance in Parallel writing in ROMS

#1 Unread post by koushik »

Hello All,

Questions are at the end of this post (I have explained my observations first).


I have done an I/O analysis of writing data into NetCDF files and found a load imbalance while writing the data.

The configuration I have used is as below:
40*36 tiles = 1440 PEs used for the simulation on a Cray XC40 system [60 nodes -- 24 processors per node]
Data is written in parallel [PARALLEL_IO and HDF5 flags are turned ON]
1000 time steps [NTIMES == 1000]
Total 100 time steps saved --> 10 quick files generated --> 10 records saved per quick file [NQCK == 1 NDEFQCK == 100]

I have done the I/O analysis by breaking up the "Writing of output data..." timer further to find the imbalance per processor.

The results I found are as follows (with the setup attached):
FIGURE 1. -- Data distribution (land and water points) among different processors.
FIGURE 2. -- Corresponding CPU time taken per processor.

Observations and Conclusion:
No direct relation is observed between the data distribution and the CPU time taken by each processor.
a. When WRITE_WATER is OFF, the data is uniformly distributed.
b. When WRITE_WATER is ON, the water points are redistributed to keep the load balanced.
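For reference, the near-uniform distribution comes from the block decomposition: each of the NtileI x NtileJ tiles gets an almost equal block of interior points, so per-PE point counts can differ by at most one row or column. A quick sketch of that split (the grid dimensions below are hypothetical, not the ones from gbplume.h, and ROMS's exact remainder rule lives in its own tiling code):

```python
def tile_sizes(n_points, n_tiles):
    """Split n_points interior points over n_tiles tiles as evenly as
    possible (block decomposition; the remainder goes to the first tiles)."""
    base, extra = divmod(n_points, n_tiles)
    return [base + (1 if t < extra else 0) for t in range(n_tiles)]

# Hypothetical interior grid size (the real one comes from the grid file).
Lm, Mm = 1601, 1080          # points in xi and eta
NtileI, NtileJ = 40, 36      # 40*36 = 1440 tiles, one per PE
xi = tile_sizes(Lm, NtileI)
eta = tile_sizes(Mm, NtileJ)
# Largest vs smallest tile, in points:
print(max(xi) * max(eta), min(xi) * min(eta))  # 1230 1200
```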
FIGURE 3. -- The CPU time per PE (sorted in increasing order of CPU time per processor).

Observations and Conclusion:
There is significant load imbalance in the CPU time per processor in the parallel-writing case (the time taken by each processor varies widely).
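The imbalance can be quantified with the usual percent-imbalance metric, (max - mean)/mean, which tells how long the other ranks sit idle waiting for the slowest one. The timings below are illustrative only, not the measured data:

```python
def imbalance(times):
    """Percent load imbalance across PEs: how much the slowest rank
    exceeds the mean; 0% means perfectly balanced."""
    avg = sum(times) / len(times)
    return 100.0 * (max(times) - avg) / avg

# Illustrative per-PE I/O times in seconds (not the measured data):
print(round(imbalance([1.0, 1.1, 0.9, 4.0]), 1))  # 128.6
```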

So, a further breakdown of the CPU time is obtained as below:
FIGURE 4. -- CPU time = computation time + IO time

Observations and Conclusion:
The computation time is load balanced, whereas the I/O time is load imbalanced.
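A breakdown like this can be checked numerically: if the spread (max minus min) of the total CPU time is reproduced almost entirely by the spread of the I/O component, the compute side is balanced. The numbers below are illustrative, not the measured timings:

```python
# Per-PE timings (illustrative): computation is nearly uniform, I/O is not,
# so the spread of the total tracks the spread of the I/O component.
comp = [10.0, 10.1, 9.9, 10.0]
io = [0.5, 0.4, 3.0, 0.6]
total = [c + w for c, w in zip(comp, io)]

def spread(times):
    """Max-minus-min spread across PEs."""
    return max(times) - min(times)

print(round(spread(comp), 2), round(spread(io), 2), round(spread(total), 2))
# 0.2 2.6 2.4 -- the total spread comes almost entirely from I/O
```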


So, a further breakdown of the I/O time is obtained as below:
FIGURE 5. -- IO time = file-definition time + file-writing time

Observations and Conclusion:
The data-writing time is load balanced, whereas the file-definition time is load imbalanced.

So, a further breakdown of the file-definition time is obtained as below:
FIGURE 6. -- Define time = Write_Info + other calls (which take very insignificant time)

Observations and Conclusion:
The Write_Info calls are load imbalanced [/ROMS/Utility/wrt_info.F, called from /ROMS/Utility/def_quick.F].
During the file-definition phase, nf_fwrite2d is called inside wrt_info.F but nf_fwrite3d is not, whereas during the data-writing phase both nf_fwrite2d and nf_fwrite3d are called.
It is also observed that the nf_fwrite3d calls are load balanced, whereas the nf_fwrite2d calls are load imbalanced.
So, it can be concluded that the load imbalance comes from the file-definition phase.
FIGURE 7. -- Overall Imbalance analysis
Observations and Conclusion:
The imbalance flows as below:
Total CPU time ==> Output time ==> Define file ==> Wrt_Info ==> nf_fwrite2d.F ==> further investigation to be done.



Questions:

1. Is there any particular reason for such an imbalance?
2. How can we reduce this imbalance so that the overall I/O time is reduced?
3. Can we take a subset of processors out of the computation phase to do the I/O?

I will do similar studies for serial I/O and compare the imbalance obtained in the two cases.
Attachments
mspin.txt
(132.93 KiB) Downloaded 175 times
gbplume.h
(1.47 KiB) Downloaded 189 times
build.sh
(17.04 KiB) Downloaded 190 times

User avatar
arango
Site Admin
Posts: 1347
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University
Contact:

Re: Imbalance in Parallel writing in ROMS

#2 Unread post by arango »

Hi Koushik, great study, and thank you for the research that you continue doing about ROMS parallel I/O :!: Let me digest the information and come up with a strategy for more efficient parallel I/O. Are you aware that ROMS has the CPP option NO_WRITE_GRID, which suppresses the definition and writing of all the grid arrays in def_info.F and wrt_info.F? We had this capability for years. The output NetCDF files are no longer compliant because the grid information is not available in the output file, but is available in the input grid NetCDF file. It implies that the grid NetCDF file needs to be read for plotting the data or other post-processing. As a result, the output ROMS NetCDF files are smaller, and the parallel I/O imbalance is much less.
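To try it, simply add the option to your application header before building, e.g.:

```c
/* gbplume.h: suppress defining and writing the grid arrays in
   def_info.F and wrt_info.F; output files are smaller, but plotting
   and post-processing must then read the input grid NetCDF file. */
#define NO_WRITE_GRID
```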

Currently, I am playing with the splitting of ROMS MPI communicator to partition the computational tasks. We may have a specified number of PETs to perform specific tasks.
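The splitting itself is just MPI_Comm_split with a color per group. Here is a plain-Python toy of the color logic only (no actual MPI, and the number of dedicated I/O PETs is made up for illustration):

```python
# Toy illustration of the MPI_Comm_split "color" assignment (no real MPI):
# the last N_IO ranks form a dedicated I/O group, the rest compute.
WORLD_SIZE = 8
N_IO = 2  # hypothetical number of dedicated I/O PETs

def color(rank):
    """Color that MPI_Comm_split would use to partition the communicator."""
    return "io" if rank >= WORLD_SIZE - N_IO else "compute"

groups = {}
for rank in range(WORLD_SIZE):
    groups.setdefault(color(rank), []).append(rank)

print(groups)  # {'compute': [0, 1, 2, 3, 4, 5], 'io': [6, 7]}
```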

koushik
Posts: 12
Joined: Mon Aug 12, 2019 3:29 pm
Location: IISC

Re: Imbalance in Parallel writing in ROMS

#3 Unread post by koushik »

Hello Arango,

Thanks for your suggestions.

1. I was not aware of the NO_WRITE_GRID flag. I think the imbalance is in writing the grid arrays. I will repeat the analysis with this flag ON and let you know the results.

2. I am doing the analysis for serial NetCDF writing to compare with the parallel NetCDF writing timings.

3."Currently, I am playing with the splitting of ROMS MPI communicator to partition the computational tasks. We may have a specified number of PETs to perform specific tasks."
-- Will it allow the computation to continue with the next time steps before the writing of data into the NetCDF files is completed? Will we need additional buffer memory for that?
-- Does it split the processors between computational and I/O tasks?

Thanks,
Koushik

koushik
Posts: 12
Joined: Mon Aug 12, 2019 3:29 pm
Location: IISC

Re: Imbalance in Parallel writing in ROMS

#4 Unread post by koushik »

arango wrote: Sun Mar 22, 2020 3:56 pm Hi Koushik, great study, and thank you for the research that you continue doing about ROMS parallel I/O :!: Let me digest the information and come up with a strategy for more efficient parallel I/O. Are you aware that ROMS has the CPP option NO_WRITE_GRID, which suppresses the definition and writing of all the grid arrays in def_info.F and wrt_info.F? We had this capability for years. The output NetCDF files are no longer compliant because the grid information is not available in the output file, but is available in the input grid NetCDF file. It implies that the grid NetCDF file needs to be read for plotting the data or other post-processing. As a result, the output ROMS NetCDF files are smaller, and the parallel I/O imbalance is much less.

Currently, I am playing with the splitting of ROMS MPI communicator to partition the computational tasks. We may have a specified number of PETs to perform specific tasks.

I have investigated the imbalance further within nf_fwrite2d.F and here are the results ---
Figure 1: Further parallel-writing imbalance analysis


Observations and Conclusion:
1. The imbalance flows as below -- it is found in the nf90_put_var calls inside nf_fwrite2d.F during the define phase:

Total CPU time ==> Output time ==> Define file ==> Wrt_Info ==> nf_fwrite2d.F ==> nf90_put_var [Check Figure 1]

2. The nf90_put_var calls inside nf_fwrite2d.F during the define phase are imbalanced and take significant time [Check Figure 1], whereas the nf90_put_var calls inside nf_fwrite2d.F during the write phase are balanced and take insignificant time. [Figure not included as insignificant]

3. The nf90_put_var calls inside nf_fwrite3d.F during the write phase are balanced and take significant time. [Check Figure 2]
Figure 2. -- Balance in parallel NetCDF write phase
4. Across different runs with exactly the same setup, the imbalance pattern changes -- the overshoots occur on different PEs. [Check Figure 3]
Figure 3.
