Hello,
I am running ROMS on a Unix cluster that has an 8-hour walltime limit. My ROMS simulation is 270000 time steps long, and about 26000 time steps complete within the 8-hour limit. So I wrote a script (given below) that moves the output files to another directory, copies ocean_rst.nc to ocean_ini.nc, and repeats the run 10 times. The script completes the first run; the model then exits on an interrupt when it reaches the 8-hour walltime limit. The script is terminated as well, with the following error:
p0_7176: p4_error: interrupt SIGx: 15
p5_7219: p4_error: net_recv read: probable EOF on socket: 1
rm_l_5_7221: (28843.296875) net_send: could not write to fd=10, errno = 32
p9_7228: p4_error: interrupt SIGx: 13
I can still manually copy ocean_rst.nc to ocean_ini.nc and restart the simulation, and that works fine. Is there a way to trap this interruption due to the walltime limit and still continue executing the loop in the script?
Thanks,
Sankar
#!/bin/bash
#$ -S /bin/bash
#PBS -N ROMS1
#PBS -V
#PBS -l nodes=8:ppn=8
# -o /home/subbayya/roms/projects/upwelling/upwelling.log
#PBS -e /home/subbayya/roms/projects/upwelling/upwelling.err
MAXRUNS=10
STARTRUN=1
JOBDIR=/home/subbayya/roms-3.0/projects/jet_obcs
#RESULTS_DIR=${JOBDIR}/Results
RUNNUMFILE=${JOBDIR}/runnum.txt
cd $PBS_O_WORKDIR
if [ ! -f ${RUNNUMFILE} ] ; then
    echo 1 > ${RUNNUMFILE}
fi
while true
do
    RUNNUM=`cat ${RUNNUMFILE}`
    if [ ${RUNNUM} -le ${MAXRUNS} ] ; then
        echo Starting job ${RUNNUM} of ${MAXRUNS} on `date`
        time mpirun -machine vapi ./oceanM ocean_jet_obcs.in > jet.log
        rundir=${JOBDIR}/RUN${RUNNUM}
        mkdir -p ${rundir}
        cp *.nc ${rundir}/.
        mv ocean_rst.nc ocean_ini.nc
        cp jet.log ${rundir}/.
        echo `expr ${RUNNUM} + 1` > ${RUNNUMFILE}
    else
        break
    fi
done
echo run ended at `date`
automatic restart in pbs script.
Re: automatic restart in pbs script.
I think you have to do what's known as job chaining - having the first script submit the second script, and so on. I'm sure it's been written up in the ARSC HPC newsletter.
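Roughly, job chaining could look like the sketch below: each job runs one segment and then submits the next job itself, instead of looping inside a single job. This is only an outline under some assumptions that are not in the original post: the script is saved as chain.sh (a made-up name), qsub can be called from inside a running job on your cluster, and each segment is set up in ocean_jet_obcs.in to stop on its own before the walltime limit. The QSUB and CHAIN_SCRIPT variables and the function names are just for illustration.

```bash
#!/bin/bash
#PBS -N ROMS1
#PBS -V
#PBS -l nodes=8:ppn=8
# Job-chaining sketch: one model segment per job, then submit the next job.
# Assumption: this file is saved as chain.sh (hypothetical name).

MAXRUNS=${MAXRUNS:-10}
RUNNUMFILE=${RUNNUMFILE:-runnum.txt}
CHAIN_SCRIPT=${CHAIN_SCRIPT:-chain.sh}   # hypothetical file name
QSUB=${QSUB:-qsub}                       # overridable so the logic can be dry-run

# Print the current segment number, creating the counter file if needed.
read_runnum() {
    [ -f "$RUNNUMFILE" ] || echo 1 > "$RUNNUMFILE"
    cat "$RUNNUMFILE"
}

# Advance the counter, then submit the next job unless the chain is finished.
submit_next() {
    local next=$(( $1 + 1 ))
    echo "$next" > "$RUNNUMFILE"
    if [ "$next" -le "$MAXRUNS" ]; then
        "$QSUB" "$CHAIN_SCRIPT"
    else
        echo "chain complete after $1 runs"
    fi
}

# Run one segment, archive its output, and stage the restart file.
# The return status is that of the final mv, so it fails if no
# ocean_rst.nc was written.
run_segment() {
    local rundir="RUN$1"
    mpirun -machine vapi ./oceanM ocean_jet_obcs.in > jet.log
    mkdir -p "$rundir"
    cp *.nc jet.log "$rundir"/
    mv ocean_rst.nc ocean_ini.nc
}

# Only do real work when running inside a PBS job.
if [ -n "${PBS_JOBID:-}" ]; then
    cd "$PBS_O_WORKDIR" || exit 1
    runnum=$(read_runnum)
    run_segment "$runnum" && submit_next "$runnum"
fi
```

You would start the chain once with "qsub chain.sh". Note that if the scheduler kills a job at the walltime limit, nothing after the mpirun line runs and the chain stops, which is why each segment has to finish by itself. On PBS/Torque installations that support it, submitting the follow-up job with "qsub -W depend=afterany:<jobid>" is a possible variant that does not rely on the current job reaching its last line.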