automatic restart in pbs script.


sankaras
Posts: 33
Joined: Mon Nov 27, 2006 6:02 pm
Location: Stanford University.

automatic restart in pbs script.

#1 Unread post by sankaras »

Hello,

I am running ROMS on a Unix cluster that has an 8-hour walltime limit. My ROMS simulation is 270000 time steps long, and about 26000 time steps complete within the 8-hour limit. So I wrote a script (given below) that moves the output files to another directory, copies ocean_rst.nc to ocean_ini.nc, and reruns the model, up to 10 times. The script completes the first run; the model then exits on an interrupt when it reaches the 8-hour walltime limit. The script is then terminated as well, giving the following error:

p0_7176: p4_error: interrupt SIGx: 15
p5_7219: p4_error: net_recv read: probable EOF on socket: 1
rm_l_5_7221: (28843.296875) net_send: could not write to fd=10, errno = 32
p9_7228: p4_error: interrupt SIGx: 13

I can still manually copy ocean_rst.nc to ocean_ini.nc and restart the simulation, and that works fine. Is there a way to trap the interruption caused by the walltime limit and still continue executing the loop in the script?

Thanks,

Sankar

#!/bin/bash
#$ -S /bin/bash
#PBS -N ROMS1
#PBS -V
#PBS -l nodes=8:ppn=8
# -o /home/subbayya/roms/projects/upwelling/upwelling.log
#PBS -e /home/subbayya/roms/projects/upwelling/upwelling.err
MAXRUNS=10
STARTRUN=1
JOBDIR=/home/subbayya/roms-3.0/projects/jet_obcs
#RESULTS_DIR=${JOBDIR}/Results
RUNNUMFILE=${JOBDIR}/runnum.txt
cd $PBS_O_WORKDIR


if [ ! -f ${RUNNUMFILE} ] ; then
    echo 1 > ${RUNNUMFILE}
fi


while(true)
do
    RUNNUM=`cat ${RUNNUMFILE}`
    if [ ${RUNNUM} -le ${MAXRUNS} ] ; then
        echo Starting job ${RUNNUM} of ${MAXRUNS} on `date`
        time mpirun -machine vapi ./oceanM ocean_jet_obcs.in > jet.log
        rundir=${JOBDIR}/RUN${RUNNUM}
        mkdir -p ${rundir}
        cp *.nc ${rundir}/.
        mv ocean_rst.nc ocean_ini.nc
        cp jet.log ${rundir}/.
        echo `expr ${RUNNUM} + 1` > ${RUNNUMFILE}
    else
        break;
    fi
done

echo run ended at `date`

kate
Posts: 4089
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: automatic restart in pbs script.

#2 Unread post by kate »

I think you have to do what's known as job chaining - having the first script submit the second script, and so on. I'm sure it's been written up in the ARSC HPC newsletter.
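
For what it's worth, here is a rough sketch of what such a chain could look like, reusing the file names from the script above. The walltime limit applies to the whole PBS job, so the loop is killed together with mpirun; in a chain, each job runs one segment and then submits the next one. This is only a sketch under assumptions: it assumes NTIMES in ocean_jet_obcs.in has been reduced so that one segment finishes on its own before the 8-hour limit, and the names chain.pbs, chain_step, CHAIN_SCRIPT, SUBMIT and MODEL_CMD are made up for illustration.

```shell
#!/bin/bash
#PBS -N ROMS_chain
#PBS -V
#PBS -l nodes=8:ppn=8
# Job-chaining sketch: each job runs ONE model segment, then submits the next job.
# Assumption: NTIMES in the .in file is small enough that a segment ends
# normally before the walltime limit, so ocean_rst.nc is complete.

MAXRUNS=${MAXRUNS:-10}
RUNNUMFILE=${RUNNUMFILE:-runnum.txt}
CHAIN_SCRIPT=${CHAIN_SCRIPT:-chain.pbs}   # hypothetical file name of this script
SUBMIT=${SUBMIT:-qsub}                    # submit command; overridable for dry runs
MODEL_CMD=${MODEL_CMD:-"mpirun -machine vapi ./oceanM ocean_jet_obcs.in"}

chain_step() {
    local runnum
    # Create the run counter on the first job of the chain.
    [ -f "$RUNNUMFILE" ] || echo 1 > "$RUNNUMFILE"
    runnum=$(cat "$RUNNUMFILE")
    if [ "$runnum" -gt "$MAXRUNS" ]; then
        echo "All $MAXRUNS runs are done"
        return 0
    fi
    echo "Starting job $runnum of $MAXRUNS on $(date)"
    # Stop the chain if the model fails, so a bad restart is not propagated.
    $MODEL_CMD > jet.log || return 1
    # Archive this segment, then turn the restart file into the next initial file.
    mkdir -p "RUN$runnum"
    cp ./*.nc jet.log "RUN$runnum"/
    mv ocean_rst.nc ocean_ini.nc
    echo $((runnum + 1)) > "$RUNNUMFILE"
    # Submit the next link of the chain unless this was the last run.
    if [ "$runnum" -lt "$MAXRUNS" ]; then
        $SUBMIT "$CHAIN_SCRIPT"
    fi
}

# Only run automatically when started by PBS (PBS_O_WORKDIR is set by the scheduler).
if [ -n "${PBS_O_WORKDIR:-}" ]; then
    cd "$PBS_O_WORKDIR" && chain_step
fi
```

You would start the chain once with qsub chain.pbs from the run directory. The next job is submitted only after the current segment has finished and the restart file has been renamed, so the jobs never overlap. Some PBS/Torque installations also let you queue all the jobs up front with qsub -W depend=afterany:<jobid>, which may be worth checking with your system administrators.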
