Problems with restarting the model

gli353
Posts: 30
Joined: Tue May 14, 2019 1:39 pm
Location: The University of Auckland

Problems with restarting the model

#1 Post by gli353

Hi everyone,
We are currently working on a 3D tidal model. My colleague tried to use my restart file (with the PERFECT_RESTART variables stored in it) to continue the model run, yet something bizarre happened. The restarted model crashes about 600 steps after restarting (roughly an hour of model time); however, a continuous model run that extends beyond the ending time (time step 288000) of the previous run does not crash at all, at least up to time step 288600.
Below is a quick summary of the problem; log files or output files can also be shared via cloud or email if necessary. Thanks in advance, and any advice will be highly appreciated!

1. Key CPP flags:
#define SOLVE3D
#define PERFECT_RESTART

Mixing:
#define UV_SMAGORINSKY
#undef VISC_GRID
#define GLS_MIXING
#define CANUTO_A
#define LIMIT_BSTRESS

Wet-and-dry:
#define MASKING
#define WET_DRY

Tracers:
#define TS_FIXED

Tide:
#undef RAMP_TIDES
#define SSH_TIDES
#define ADD_FSOBC
#define ADD_M2OBC
(however, non-tidal zeta and momentum boundary values have not been added yet; I keep these flags switched on, and the ana_xx.h files set the corresponding values to 0)
#define UV_TIDES
#define ANA_FSOBC
#define ANA_M2OBC
(a tidal forcing file from the TPXO-9 dataset is used to provide the boundary data)


Based on what I have read on the forum, tracers, GLS mixing, and wetting and drying could all be potential culprits behind my problem. However, those posts are relatively old, and my ROMS revision (1053) is fairly recent, so I am not sure whether they are still relevant.

2. Information in the Input Parameter File
The grid is 968 x 1188 x 16 (16 vertical layers), with max(rx0) = 0.2 and max(rx1) = 6.87. It is horizontally stretched, with the smallest horizontal resolution being 25 m. Barotropic CFL number = 0.6 (dt = 6.0 s; NDTFAST = 20).
NRREC == -1 (and the initialization file is set to the rst file from the previous run)
K-OMEGA mixing parameters are used
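
For reference, the restart-related settings in my input script look roughly like this (an illustrative excerpt rather than the full roms.in; values not stated above are placeholders):

Code: Select all

        DT == 6.0d0
   NDTFAST == 20
     NRREC == -1                   ! initialize from the most recent restart record
   ININAME == previous_run_rst.nc  ! placeholder name; the rst file from the previous run
 LcycleRST == T                    ! placeholder; cycle records in the new rst file
      NRST == 600                  ! placeholder; time steps between restart records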

3. Comparison between the Restarted Model and Continuous Run (log file):

3.1 Log-file record of the first ~100 steps after restarting (step 288102 shown as an example):
Continuous run:
288102 2021-05-03 00:10:12.00 8.619749E-04 2.744692E+02 2.744701E+02 1.033459E+11
(673,0507,16) 2.350429E-01 8.301378E-02 6.773269E+00 2.328141E+00

Restarted run:
288102 2021-05-03 00:10:12.00 4.308471E-02 2.758133E+02 2.758564E+02 1.040719E+11
(673,0507,16) 2.350429E-01 8.301378E-02 6.773269E+00 2.352860E+00
(102 steps after restarting)


The maximum CFL number occurs at the same location, with the same Cu, Cv, and Cw; the maximum speed, however, is about 3 cm/s larger in the restarted model.
We are aware of the very large Cw numbers; however, since we need to resolve the tidal flats, it is not easy to reduce them further, though we are trying to. Nevertheless, the large Cw values have never caused the model to crash in "continuous-run" mode.

3.2 Step 288575, where the blow-up occurs in the restarted run
Continuous run:
288575 2021-05-03 00:57:30.00 3.689414E-03 2.745084E+02 2.745121E+02 1.031233E+11
(674,0505,01) 1.374180E-02 5.424304E-02 1.886988E+00 1.768348E+00
Restarted run:
288575 2021-05-03 00:57:30.00 9.824319E-02 2.756918E+02 2.757900E+02 1.073060E+11
(256,0930,16) 3.476779E+00 6.479183E+00 1.878256E+00 4.849472E+01

Between steps 288102 and 288575, the differences between the two runs gradually accumulated. The maximum speed in the restarted run grew exponentially in the last few steps before it crashed, which is not unexpected. The maximum CFL values also occur at different locations.

4. Snapshots of vbar (approx. east-west direction) velocity
Before the crash of the restarted model, the last snapshot stored in its history file is at time step 288400 (40 minutes after restarting).
Figure 1: vbar in the continuous run at time step 288400 (see attachment figure1_vbar_continuous.jpg).
Figure 2: vbar in the restarted run at time step 288400 (see attachment figure2_vbar_restarted.jpg).

The two runs appear very different (see Figures 1 and 2 in the attachments), and some large velocity values (higher than 1 m/s) are recorded in the restarted model in the area where the grid resolution is coarse (600 m in one or both directions).

Below is a comparison in the area where the models have high horizontal resolution.
Figure 3: vbar in the continuous run at time step 288400, zoomed in on the area with refined resolution (25 m in both horizontal directions; see attachment figure3_vbar_continuous_25m_Res.jpg).
Figure 4: vbar in the restarted run at time step 288400, zoomed in on the same area (see attachment figure4_vbar_restarted_25m_Res.jpg).

In the area where the grid resolution is fine (25 m), the discrepancy is also very significant. Some topographic eddies are well resolved in the continuous model (Figure 3) but absent from the restarted one (Figure 4).

The results simply look as if they came from two completely unrelated models. I'm somewhat at my wit's end, and the old posts on the forum have not completely solved my problem. Any suggestions will be appreciated! Thanks!

Kind Regards,
Gaoyang

kate
Posts: 4088
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: Problems with restarting the model

#2 Post by kate

The first thing to do is to track down the restart issue. You can run the model for two steps, restart it for two more, then compare that solution to a run that's four steps long. You have to restart after an even number of steps because the model doesn't save nstp, which alternates between 1 and 2.

You can do this procedure with and without each of the model components you suspect of being imperfect. When you confirm that you've got imperfect restarts, that's when the fun begins. What I do is run both the restarted run after 2 steps and the run from the beginning at step 2 in dueling debuggers. This is when it's best to have a modest sized domain, not the monster pan-Arctic. In MOM6, they have a switch for writing out checksums throughout the code and you can compare them instead of using a debugger, since not everyone has a debugger. In either case, you are looking for the first place you get a divergence between the solutions. You can also save history files every timestep for such short runs and see where those diverge.
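
A rough sketch of that last comparison step, assuming two history files written every time step (the file names below are placeholders) and the netCDF4 and numpy Python packages:

Code: Select all

# Rough sketch: compare two ROMS history files saved every time step and
# report, for each time-varying field, the first record where the continuous
# and restarted solutions diverge.  File names are placeholders.
import numpy as np
from netCDF4 import Dataset

cont = Dataset("ocean_his_continuous.nc")   # e.g. four steps from the start
rest = Dataset("ocean_his_restarted.nc")    # e.g. two steps + restart + two steps

for name, var in cont.variables.items():
    if name not in rest.variables or "ocean_time" not in var.dimensions:
        continue                            # skip grid variables, etc.
    if var.dtype.kind != "f":
        continue                            # only compare floating-point fields
    a = np.ma.filled(var[:], np.nan)
    b = np.ma.filled(rest.variables[name][:], np.nan)
    for rec in range(min(a.shape[0], b.shape[0])):
        diff = np.nanmax(np.abs(a[rec] - b[rec]))
        if diff > 0.0:
            print(f"{name}: first difference at record {rec}, max |diff| = {diff:.3e}")
            break
    else:
        print(f"{name}: compared records are identical")

cont.close()
rest.close()

Run it on the two short runs; the first field and record it flags is where to start looking.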

arango
Site Admin
Posts: 1347
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University

Re: Problems with restarting the model

#3 Post by arango

Yes, using checksum is a good idea. ROMS also has checksum for I/O processing. You need to activate the CPP option CHECKSUM.
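
For example, it is just one more line in your application's header file (a sketch; the header name depends on your application):

Code: Select all

#define PERFECT_RESTART
#define CHECKSUM        /* report order-invariant checksum (hash) when processing I/O */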

gli353
Posts: 30
Joined: Tue May 14, 2019 1:39 pm
Location: The University of Auckland

Re: Problems with restarting the model

#4 Post by gli353

kate wrote: Sun Oct 31, 2021 5:31 pm The first thing to do is to track down the restart issue. You can run the model for two steps, restart it for two more, then compare that solution to a run that's four steps long. You have to restart after an even number of steps because the model doesn't save nstp, which alternates between 1 and 2.

You can do this procedure with and without each of the model components you suspect of being imperfect. When you confirm that you've got imperfect restarts, that's when the fun begins. What I do is run both the restarted run after 2 steps and the run from the beginning at step 2 in dueling debuggers. This is when it's best to have a modest sized domain, not the monster pan-Arctic. In MOM6, they have a switch for writing out checksums throughout the code and you can compare them instead of using a debugger, since not everyone has a debugger. In either case, you are looking for the first place you get a divergence between the solutions. You can also save history files every timestep for such short runs and see where those diverge.
Hi Kate,
I will do that in the coming days and will let you know if it helps. I would nominate wetting-and-drying and GLS mixing (or the combination of the two) as potential causes of the problem, and will see how they affect the restart. However, I have looked at cppdefs.h, both in my own directory and in the most up-to-date trunk directory, and I cannot find the CHECKSUMS flag. I think it is only available for MOM6? In that case, I will contact our HPC centre about running the debugger.
Kind Regards,
Gaoyang

gli353
Posts: 30
Joined: Tue May 14, 2019 1:39 pm
Location: The University of Auckland

Re: Problems with restarting the model

#5 Post by gli353

arango wrote: Sun Oct 31, 2021 6:13 pm Yes, using checksum is a good idea. ROMS also has checksum for I/O processing. You need to activate the CPP option CHECKSUM.
Hi Arango,
Thanks for the suggestion. However, I have tried to find the CHECKSUMS flag in both my own directory and in the most up-to-date trunk directory, and it is not in either of them. I suppose it is only available in MOM6?

Kind Regards,
Gaoyang

arango
Site Admin
Posts: 1347
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University

Re: Problems with restarting the model

#6 Post by arango

The C-preprocessing option is CHECKSUM. It is there because I coded it:

Code: Select all

...

 ASSUMED_SHAPE            Using assumed-shape arrays
 CHECKSUM                 Report order-invariant checksum (hash) when processing I/O
 BOUNDARY_ALLREDUCE       Using mpi_allreduce in mp_boundary routine
 ...
 
 INITIAL: Configuring and initializing forward nonlinear model ...
 *******

  GET_STATE_NF90   - NLM: state initial conditions,                       2004-01-03 00:00:00.00
                      (Grid 01, t = 13008.0000, File: wc13_roms_ini_20040103.nc, Rec=0001, Index=1)
                   - free-surface
                      (Min = -2.79513705E-01 Max =  2.21655332E-01 CheckSum = 69554)
                   - vertically integrated u-momentum component
                      (Min = -1.31252241E-01 Max =  2.00719797E-01 CheckSum = 66936)
                   - vertically integrated v-momentum component
                      (Min = -2.22975398E-01 Max =  1.77798919E-01 CheckSum = 66863)
                   - u-momentum component
                      (Min = -4.46971292E-01 Max =  4.56028027E-01 CheckSum = 2014229)
                   - v-momentum component
                      (Min = -5.51383331E-01 Max =  3.29409303E-01 CheckSum = 2010288)
                   - potential temperature
                      (Min =  0.00000000E+00 Max =  1.83744519E+01 CheckSum = 1792741)
                   - salinity
                      (Min =  0.00000000E+00 Max =  3.46896648E+01 CheckSum = 1603834)
                   - vertical viscosity coefficient
                      (Min =  0.00000000E+00 Max =  0.00000000E+00 CheckSum = 0)
                   - temperature vertical diffusion coefficient
                      (Min =  0.00000000E+00 Max =  6.51978114E-01 CheckSum = 2081832)
                   - salinity vertical diffusion coefficient
                      (Min =  0.00000000E+00 Max =  6.51978114E-01 CheckSum = 2081832)
Perhaps, you didn't spell the option correctly. What version of ROMS are you using?

wilkin
Posts: 875
Joined: Mon Apr 28, 2003 5:44 pm
Location: Rutgers University

Re: Problems with restarting the model

#7 Post by wilkin

A few things ...

You have #define TS_FIXED so your tracer fields temp and salt are not going to evolve in time. I can see this might be a valid approach for your initial testing, especially if you have set them to constants so that you are effectively solving a homogeneous ocean problem with z-dependence arising only in the velocity profile.

But I am very unsure how TS_FIXED plays with GLS_MIXING. Any terms in the turbulence closure affected by stratification (Richardson number, for example) are going to be dynamically decoupled from the evolving solution. I don't think it will go well.

Regarding restart, are there other options you haven't told us about? For instance, did you #define RST_SINGLE? That might destroy perfect restart.

Examine the restart file to check everything is there that should be: all terms needed for GLS (see PERFECT_RESTART in checkvars.F), and the wet/dry masks. That could be your problem. For starters, I recommend turning off WET_DRY to see if that's the cause of the failed restart. Also, on restart, have the initial conditions written to the history file and compare them for differences (I recommend #define OUT_DOUBLE for this test).
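
A quick sketch of that first check, with a placeholder file name and an illustrative (not exhaustive) list of fields; checkvars.F remains the authoritative reference:

Code: Select all

# Rough sketch: list what actually made it into the restart file and flag
# anything missing from an illustrative set of state, GLS, and wet/dry fields.
# The file name is a placeholder and the list is not exhaustive; checkvars.F
# has the authoritative PERFECT_RESTART requirements.
from netCDF4 import Dataset

rst = Dataset("ocean_rst.nc")
wanted = ["zeta", "ubar", "vbar", "u", "v",                     # state variables
          "tke", "gls", "AKv", "AKt",                           # GLS closure fields
          "wetdry_mask_rho", "wetdry_mask_u", "wetdry_mask_v"]  # wet/dry masks
for name in wanted:
    if name in rst.variables:
        print(f"{name:18s} dims = {rst.variables[name].dimensions}")
    else:
        print(f"{name:18s} *** MISSING ***")
rst.close()

Anything flagged missing, or any state field without the extra time-level dimension, points at an imperfect-restart configuration.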
John Wilkin: DMCS Rutgers University
71 Dudley Rd, New Brunswick, NJ 08901-8521, USA. ph: 609-630-0559 jwilkin@rutgers.edu

gli353
Posts: 30
Joined: Tue May 14, 2019 1:39 pm
Location: The University of Auckland

Re: Problems with restarting the model

#8 Post by gli353

arango wrote: Mon Nov 01, 2021 1:14 pm The C-preprocessing option is CHECKSUM. It is there because I coded it:

Code: Select all

...

 ASSUMED_SHAPE            Using assumed-shape arrays
 CHECKSUM                 Report order-invariant checksum (hash) when processing I/O
 BOUNDARY_ALLREDUCE       Using mpi_allreduce in mp_boundary routine
 ...
 
 INITIAL: Configuring and initializing forward nonlinear model ...
 *******

  GET_STATE_NF90   - NLM: state initial conditions,                       2004-01-03 00:00:00.00
                      (Grid 01, t = 13008.0000, File: wc13_roms_ini_20040103.nc, Rec=0001, Index=1)
                   - free-surface
                      (Min = -2.79513705E-01 Max =  2.21655332E-01 CheckSum = 69554)
                   - vertically integrated u-momentum component
                      (Min = -1.31252241E-01 Max =  2.00719797E-01 CheckSum = 66936)
                   - vertically integrated v-momentum component
                      (Min = -2.22975398E-01 Max =  1.77798919E-01 CheckSum = 66863)
                   - u-momentum component
                      (Min = -4.46971292E-01 Max =  4.56028027E-01 CheckSum = 2014229)
                   - v-momentum component
                      (Min = -5.51383331E-01 Max =  3.29409303E-01 CheckSum = 2010288)
                   - potential temperature
                      (Min =  0.00000000E+00 Max =  1.83744519E+01 CheckSum = 1792741)
                   - salinity
                      (Min =  0.00000000E+00 Max =  3.46896648E+01 CheckSum = 1603834)
                   - vertical viscosity coefficient
                      (Min =  0.00000000E+00 Max =  0.00000000E+00 CheckSum = 0)
                   - temperature vertical diffusion coefficient
                      (Min =  0.00000000E+00 Max =  6.51978114E-01 CheckSum = 2081832)
                   - salinity vertical diffusion coefficient
                      (Min =  0.00000000E+00 Max =  6.51978114E-01 CheckSum = 2081832)
Perhaps, you didn't spell the option correctly. What version of ROMS are you using?
Hi Arango,
I did a test run and, yes, CHECKSUM is there in the standard output file (revision 1053). It is not listed in cppdefs.h, though (neither in mine nor in the cppdefs.h in the svn trunk), which led me to think it might only be available elsewhere. Thanks.

Kind Regards,
Gaoyang

gli353
Posts: 30
Joined: Tue May 14, 2019 1:39 pm
Location: The University of Auckland

Re: Problems with restarting the model

#9 Post by gli353

wilkin wrote: Mon Nov 01, 2021 7:34 pm A few things ...

You have #define TS_FIXED so your tracer fields temp and salt are not going to evolve in time. I can see this might be a valid approach for your initial testing, especially if you have set them to constants so that you are effectively solving a homogeneous ocean problem with z-dependence arising only in the velocity profile.

But I am very unsure how TS_FIXED plays with GLS_MIXING. Any terms in the turbulence closure affected by stratification (Richardson number, for example) are going to be dynamically decoupled from the evolving solution. I don't think it will go well.

Regarding restart, are there other options you haven't told us about? For instance, did you #define RST_SINGLE? That might destroy perfect restart.

Examine the restart file to check everything is there that should be: all terms needed for GLS (see PERFECT_RESTART in checkvars.F), and the wet/dry masks. That could be your problem. For starters, I recommend turning off WET_DRY to see if that's the cause of the failed restart. Also, on restart, have the initial conditions written to the history file and compare them for differences (I recommend #define OUT_DOUBLE for this test).
Hi Wilkin,
Yes, here we are doing some initial testing without any solar radiation or freshwater input (hence, if the initial field is homogeneous, the subsequent results should also stay homogeneous, I presume?). That's why I switched on TS_FIXED. I would expect that, in this idealised case, the density gradient in the Richardson number, or in the buoyancy flux in the TKE equations, will simply be zero, so I was not expecting it to cause trouble. I am not sure whether I am wrong or misunderstanding some aspect of ROMS' behaviour regarding mixing.
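
To spell out that reasoning with the standard textbook definitions (nothing ROMS-specific), for a homogeneous density field:

$$
N^{2} = -\frac{g}{\rho_{0}}\frac{\partial \rho}{\partial z} = 0
\;\;\Longrightarrow\;\;
Ri = \frac{N^{2}}{(\partial u/\partial z)^{2} + (\partial v/\partial z)^{2}} = 0
\quad\text{and}\quad
P_{b} = -K_{\rho}\,N^{2} = 0,
$$

so both the stratification term in the Richardson number and the buoyancy production/destruction term in the TKE budget should vanish.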

We have checked the variables in the rst output and confirmed that everything needed is there. The next thing on the list is switching off WET_DRY, and I will keep you and your colleagues updated on this. Many thanks.

Kind Regards,
Gaoyang

gli353
Posts: 30
Joined: Tue May 14, 2019 1:39 pm
Location: The University of Auckland

Re: Problems with restarting the model

#10 Post by gli353

Hi,
It seems we have addressed this problem, in a rather odd way. Neither wetting-and-drying nor GLS mixing was the culprit; switching off either of them did not help. However, my colleague is using a more recent version of the model (revision 1089 or 1096?) versus mine (1053), and when we ran the same restart experiment with his more up-to-date code, the issue simply disappeared, although the results in the restarted model still differ slightly from those of a continuous run (the biggest relative difference is less than 1%). Even though the PERFECT_RESTART flag was also switched on when running the experiment with this more recent version, the generated rst file turned out to be only about 1/6 the size of the rst file generated by my older model (1053), and some of the stored variables lack the three time-level dimension present in the older version's file. I presume there have been some updates to the model that fixed our problem, although in the end I do not quite understand what it was related to.

Kind Regards,
Gaoyang

wilkin
Posts: 875
Joined: Mon Apr 28, 2003 5:44 pm
Location: Rutgers University

Re: Problems with restarting the model

#11 Post by wilkin

Kia ora Gaoyang,

You can use the 'trac' source code browser https://www.myroms.org/projects/src/browser to explicitly compare two svn release numbers to see what might differ.

But if you don't have the three time-level fields in your output, that suggests it was not a perfect-restart configuration. Check the CPP defs logged in the output netcdf files - that tells you what ROMS actually did, as opposed to what you think you asked it to do.
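
If it helps, those logged options can be read straight from the file's global attributes; here is a rough sketch with a placeholder file name (the attribute is called CPP_options in the ROMS output files I have looked at; ncdump -h will show it if the name differs in your version):

Code: Select all

# Rough sketch: print the CPP options ROMS recorded in an output file.
# The file name is a placeholder; the attribute name may vary by version.
from netCDF4 import Dataset

nc = Dataset("ocean_his.nc")
print(nc.getncattr("CPP_options"))
nc.close()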

Your grid of Waitematā and Tīkapa Moana has spectacular detail. Would love to know more about what you are doing. PM me.
John Wilkin: DMCS Rutgers University
71 Dudley Rd, New Brunswick, NJ 08901-8521, USA. ph: 609-630-0559 jwilkin@rutgers.edu

gli353
Posts: 30
Joined: Tue May 14, 2019 1:39 pm
Location: The University of Auckland

Re: Problems with restarting the model

#12 Post by gli353

A belated update:
It is indeed a weird case. I did nothing to my CPP options or input file (at least, nothing significant enough that I would have noted it in my lab record). I simply upgraded to SVN revision 1098 (from 1053 or thereabouts), and the problem with restarting was gone. I don't quite know why, though...
