Unable to write with PIO in restart runs

Bug reports, work arounds and fixes

Moderators: arango, robertson

chuning
Posts: 8
Joined: Thu Jan 21, 2016 6:06 pm
Location: Rutgers University

Unable to write with PIO in restart runs

#1 Post by chuning

I ran into an issue when using the PIO library for output. When I restart from a previous run and append the output to existing NetCDF files (LDEFOUT=F), the code reports a NetCDF error (not a ROMS error) and quits. The error message is:

Abort with message NetCDF: Attempt to extend dataset during NC_INDEPENDENT I/O operation. Use nc_var_par_access to set mode NC_COLLECTIVE before extending variable. in file pio_getput_int.c at line 1271
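When comparing several logs (for example log.txt and log_rst.txt), it can help to split the abort line into its message, source file, and line number. A minimal Python sketch, assuming the abort line always has the form shown above; `parse_pio_abort` is a hypothetical helper name and is not part of PIO or ROMS:

```python
import re

# Assumed layout: "Abort with message <text> in file <name> at line <number>"
ABORT_RE = re.compile(
    r"Abort with message (?P<message>.+) in file (?P<file>\S+) at line (?P<line>\d+)"
)


def parse_pio_abort(text):
    """Return (message, source_file, line_number) for the first abort line, or None."""
    match = ABORT_RE.search(text)
    if match is None:
        return None
    return match.group("message"), match.group("file"), int(match.group("line"))


# The abort line quoted in this post, joined into a single string.
log_line = (
    "Abort with message NetCDF: Attempt to extend dataset during "
    "NC_INDEPENDENT I/O operation. Use nc_var_par_access to set mode "
    "NC_COLLECTIVE before extending variable. in file pio_getput_int.c "
    "at line 1271"
)

print(parse_pio_abort(log_line))
```

The greedy `.+` makes the match split on the last " in file " in the line, so a message that itself contains those words is still kept whole.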

On closer inspection, the error occurs at line 2926 of wrt_his.F90, where wrt_his_pio attempts to extend the time dimension.

I did some research, but the PIO and NetCDF APIs and libraries are tangled together, and I couldn't identify the problem.

In def_var.F90, there is an extra step that sets parallel access when the old PARALLEL_IO option is used. I tried to add a similar function in mod_pio_netcdf.F90 to change the parallel access mode, but the function cannot find the right ncid and vid in the pioFile and pioVar structures. In the PIO source code, I also wasn't able to find a function that does what nf90_var_par_access() does.

Another issue I encountered, perhaps related to this one, is that when I use PIO for reading the grid and initial conditions (INP_LIB=2), ROMS freezes while reading the initial file. It does not report any errors or quit; it just hangs there. The issue doesn't occur when PIO reads the grid file. Since the grid file does not have unlimited dimensions, I suspect the PIO library is not properly set up for initializing or processing variables with unlimited dimensions, which would cause both the reading and the writing errors.

Perhaps it is a problem with my PIO settings, but I tried on two machines, a CentOS cluster with pre-built NetCDF and a Linux subsystem on a laptop, and the issue is consistent on both. I built pnetcdf and PIO on both systems for the tests. For now I create a new history file for each restart with LDEFOUT=T, which does not trigger the issue.
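Because both pnetcdf and NetCDF are involved, it may be worth checking which on-disk format the existing history file uses before restarting with LDEFOUT=F. The error text mentions nc_var_par_access, which belongs to the NetCDF-4 parallel interface, so the append presumably goes through the NetCDF-4/HDF5 path; that is an assumption and is not confirmed by the logs. A minimal Python sketch that reads only the file's leading magic bytes; `netcdf_kind` and the file name `ocean_his.nc` are hypothetical:

```python
def netcdf_kind(path):
    """Classify a file by its first bytes; no NetCDF library is needed."""
    with open(path, "rb") as f:
        head = f.read(8)

    # NetCDF-4 files are HDF5 files; the HDF5 signature normally sits at offset 0.
    # (An HDF5 file with a user block would not be detected by this simple check.)
    if head == b"\x89HDF\r\n\x1a\n":
        return "netCDF-4 (HDF5)"

    # Classic-family files start with "CDF" followed by a version byte.
    if head[:3] == b"CDF" and len(head) > 3:
        versions = {
            1: "classic (CDF-1)",
            2: "64-bit offset (CDF-2)",
            5: "64-bit data (CDF-5)",
        }
        return versions.get(head[3], "unknown CDF variant")

    return "not a NetCDF file"


# Example call (hypothetical file name):
# print(netcdf_kind("ocean_his.nc"))
```

A classic-family result would point toward the pnetcdf path, while an HDF5 result would be consistent with the NetCDF-4 error quoted above.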

Attached are the PIO build script, ROMS build script, input file, and log files for my test runs. The tests were performed with a modified version of ROMS; I also tested with the master branch, which has the same issue.

Chuning
Attachments
ross_sea.h
(1.79 KiB) Downloaded 179 times
roms_ross_sea_rst.in
(151.43 KiB) Downloaded 170 times
log_rst.txt
(35.54 KiB) Downloaded 184 times
log.txt
(41.99 KiB) Downloaded 183 times
build_roms.sh
(11.66 KiB) Downloaded 181 times
pio_build.sh
(854 Bytes) Downloaded 180 times

arango
Site Admin
Posts: 1347
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University

Re: Unable to write with PIO in restart runs

#2 Post by arango

I don't know what is going on here, but NC_INDEPENDENT and NC_COLLECTIVE are for the NetCDF library and NOT for the PIO library. You cannot add things to the PIO interface without knowing how it works. The nf90_var_par_access function is for the native NetCDF library with parallel I/O (option PARALLEL_IO and HDF5). It has nothing to do with the PIO library, which has its own module interface layer. See mod_pio_netcdf.F. They are totally different things.

Lastly, we specifically said the PIO library is intended for HPC systems with a Parallel File System (PFS) and applications running on many processes (~100 and up). If you don't know what a PFS is, you can google it to find more information. It is expensive hardware that is only available on supercomputers. There is no point in running PIO on your laptop. Use the regular NetCDF library. If you want to use the PIO library, read the papers; we provided links in WikiROMS. PIO cannot be used as a black box. It requires knowledge of computer technology to set the input parameters correctly for a specific supercomputer.

If you build the PIO library yourself, make sure that it compiles correctly and passes all the tests.
