When I increase the resolution of my application from 10 km to 2 km, I have trouble getting the model to run with a sensible tiling. Below is a table of previous successful and failed runs; the heads of the error messages are at the end of the post:
Resolution (km) || Cells_i * Cells_j || Tile_i * Tile_j || n cpu || mem_req (GB) || mem_used (GB) || mem_req/cpu (GB) || ncpus/node || mem_req/node (GB) || status
10              || 630 * 530         || 16 * 16         || 256   || 96           || 44.26         || 0.38             || 16.00      || 6.00              || ok
4               || 1575 * 1325       || 48 * 48         || 2304  || 3072         || 2900.00       || 1.33             || 16.00      || 21.33             || ok
2               || 3150 * 2650       || 96 * 96         || 9216  || 18432        || ?             || 2.00             || 16.00      || 32.00             || error 1
2               || 3151 * 2650       || 96 * 96         || 9216  || 36864        || ?             || 4.00             || 16.00      || 64.00             || error 2
2               || 3151 * 2650       || 64 * 64         || 4096  || 16384        || ?             || 4.00             || 16.00      || 64.00             || error 2
2               || 3152 * 2650       || 56 * 56         || 3136  || 14336        || ?             || 4.57             || 28.00      || 128.00            || error 3
2               || 3153 * 2650       || 56 * 28         || 1568  || 14336        || 4630.00       || 9.14             || 28.00      || 256.00            || ok
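For reference, the derived columns in the table follow from the tiling and the memory request. Here is a minimal sketch of that arithmetic in Python; the `Run` class and its field names are just my own illustration, not anything from the model or the scheduler:

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One row of the table (field names are illustrative only)."""
    res_km: int
    tile_i: int
    tile_j: int
    mem_req_gb: float
    cpus_per_node: int

    @property
    def ncpu(self) -> int:
        # One MPI process per tile.
        return self.tile_i * self.tile_j

    @property
    def mem_per_cpu_gb(self) -> float:
        return self.mem_req_gb / self.ncpu

    @property
    def mem_per_node_gb(self) -> float:
        return self.mem_per_cpu_gb * self.cpus_per_node


# Tiling, requested memory and cpus per node, copied from the table.
runs = [
    Run(10, 16, 16, 96, 16),
    Run(4, 48, 48, 3072, 16),
    Run(2, 96, 96, 18432, 16),
    Run(2, 96, 96, 36864, 16),
    Run(2, 64, 64, 16384, 16),
    Run(2, 56, 56, 14336, 28),
    Run(2, 56, 28, 14336, 28),
]

for r in runs:
    print(f"{r.res_km:>2} km  {r.tile_i}x{r.tile_j}  ncpu={r.ncpu:<5} "
          f"mem/cpu={r.mem_per_cpu_gb:.2f} GB  "
          f"mem/node={r.mem_per_node_gb:.2f} GB")
```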
Thanks for any thoughts on this issue,
Ole
##########################################
error 1:
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.
HNP daemon : [[32568,0],0] on node r228
Remote daemon: [[32568,0],369] on node r1860
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
oceanM 00000000007B57CE for__signal_handl Unknown Unknown
libpthread-2.12.s 00002B3C8697B7E0 Unknown Unknown Unknown
libpthread-2.12.s 00002B3C86978490 pthread_spin_init Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
oceanM 00000000007B57CE for__signal_handl Unknown Unknown
libpthread-2.12.s 00002B2AB175F7E0 Unknown Unknown Unknown
libmlx4-rdmav2.so 00002B2AC888EB18 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
oceanM 00000000007B57CE for__signal_handl Unknown Unknown
libpthread-2.12.s 00002AC8D98087E0 Unknown Unknown Unknown
...
###################################################
error 2:
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
oceanM 00000000007B57CE for__signal_handl Unknown Unknown
libpthread-2.12.s 00002B8BEC4F77E0 Unknown Unknown Unknown
mca_btl_openib.so 00002B8C066437C2 Unknown Unknown Unknown
mca_btl_openib.so 00002B8C06643AFE Unknown Unknown Unknown
libopen-pal.so.40 00002B8BF0D9450C opal_progress Unknown Unknown
libmpi.so.40.10.0 00002B8BEBF52CD6 Unknown Unknown Unknown
libmpi.so.40.10.0 00002B8BEBF52D19 ompi_request_defa Unknown Unknown
libmpi.so.40.10.0 00002B8BEBFD5FBF ompi_coll_base_bc Unknown Unknown
libmpi.so.40.10.0 00002B8BEBFD67E2 ompi_coll_base_bc Unknown Unknown
mca_coll_tuned.so 00002B8C090DACB6 ompi_coll_tuned_b Unknown Unknown
libmpi.so.40.10.0 00002B8BEBF6FDAD MPI_Bcast Unknown Unknown
libmpi_mpifh.so.4 00002B8BEBCE5F0C Unknown Unknown Unknown
oceanM 000000000045AF65 Unknown Unknown Unknown
oceanM 00000000006D5BFA Unknown Unknown Unknown
oceanM 000000000068D83D Unknown Unknown Unknown
oceanM 000000000047A94B Unknown Unknown Unknown
oceanM 000000000040CE29 Unknown Unknown Unknown
oceanM 000000000040C478 Unknown Unknown Unknown
oceanM 000000000040C25E Unknown Unknown Unknown
libc-2.12.so 00002B8BEC723D1D __libc_start_main Unknown Unknown
oceanM 000000000040C169 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
oceanM 00000000007B57CE for__signal_handl Unknown Unknown
libpthread-2.12.s 00002AB99828B7E0 Unknown Unknown Unknown
libpthread-2.12.s 00002AB998288453 pthread_spin_lock Unknown Unknown
libmlx4-rdmav2.so 00002AB9AB5005D0 Unknown Unknown Unknown
...
##############################################
error 3:
------------------------------------------------------------------------
Job 8680424.r-man2 has exceeded memory allocation on node r3760
Process "orted", pid 22075, rss 28557312, vmem 401702912
Process "oceanM", pid 22119, rss 4919689216, vmem 5650546688
Process "oceanM", pid 22120, rss 4908826624, vmem 5648678912
Process "oceanM", pid 22121, rss 4914126848, vmem 5648465920
Process "oceanM", pid 22122, rss 4921999360, vmem 5648588800
Process "oceanM", pid 22123, rss 4911214592, vmem 5648367616
Process "oceanM", pid 22124, rss 4918145024, vmem 5648261120
Process "oceanM", pid 22125, rss 4916121600, vmem 5648224256
Process "oceanM", pid 22126, rss 4908433408, vmem 5648265216
Process "oceanM", pid 22127, rss 4912648192, vmem 5647982592
Process "oceanM", pid 22128, rss 4919373824, vmem 5648101376
Process "oceanM", pid 22129, rss 4914343936, vmem 5647953920
Process "oceanM", pid 22130, rss 4916404224, vmem 5648113664
Process "oceanM", pid 22131, rss 4930711552, vmem 5648183296
Process "oceanM", pid 22132, rss 4917096448, vmem 5648289792
Process "oceanM", pid 22133, rss 4921593856, vmem 5648162816
Process "oceanM", pid 22134, rss 4922822656, vmem 5648314368
Process "oceanM", pid 22135, rss 4921352192, vmem 5648011264
Process "oceanM", pid 22136, rss 4918321152, vmem 5648138240
Process "oceanM", pid 22137, rss 4927127552, vmem 5647982592
Process "oceanM", pid 22138, rss 4932517888, vmem 5648523264
Process "oceanM", pid 22139, rss 4924923904, vmem 5648207872
Process "oceanM", pid 22140, rss 4925964288, vmem 5648740352
Process "oceanM", pid 22141, rss 4935540736, vmem 5648572416
Process "oceanM", pid 22142, rss 4930662400, vmem 5648318464
Process "oceanM", pid 22143, rss 4917063680, vmem 5648359424
Process "oceanM", pid 22144, rss 4928651264, vmem 5648490496
Process "oceanM", pid 22145, rss 4922134528, vmem 5648486400
Process "oceanM", pid 22146, rss 4912766976, vmem 5635805184
------------------------------------------------------------------------
For more information visit https://opus.nci.org.au/x/SwGRAQ
------------------------------------------------------------------------
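As a rough cross-check of the report for node r3760, the resident memory of the 28 oceanM processes can be summed and compared with the per-node request. This is only a sketch under two assumptions of mine: that the rss values are in bytes, and that the 128 GB per-node request is counted as 128 GiB. The variable names are my own.

```python
# rss values (assumed to be bytes) of the 28 oceanM processes on r3760,
# pids 22119 to 22146, copied from the scheduler message.
OCEANM_RSS = [
    4919689216, 4908826624, 4914126848, 4921999360,
    4911214592, 4918145024, 4916121600, 4908433408,
    4912648192, 4919373824, 4914343936, 4916404224,
    4930711552, 4917096448, 4921593856, 4922822656,
    4921352192, 4918321152, 4927127552, 4932517888,
    4924923904, 4925964288, 4935540736, 4930662400,
    4917063680, 4928651264, 4922134528, 4912766976,
]
ORTED_RSS = 28557312    # the "orted" daemon on the same node
NODE_LIMIT_GIB = 128    # mem_req/node for the 56 * 56 run (assumed GiB)

GIB = 2**30
total = sum(OCEANM_RSS) + ORTED_RSS
mean_gib = sum(OCEANM_RSS) / len(OCEANM_RSS) / GIB

print(f"{len(OCEANM_RSS)} oceanM processes, mean rss {mean_gib:.2f} GiB")
print(f"total rss on node: {total / GIB:.2f} GiB (limit {NODE_LIMIT_GIB} GiB)")
print("over limit" if total > NODE_LIMIT_GIB * GIB else "within limit")
```

Under those assumptions the total comes out only slightly above the 128 GiB limit, so the margin by which this node exceeded its allocation appears to be small.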
------------------------------------------------------------------------
Job 8680424.r-man2 has exceeded memory allocation on node r3758
Process "orted", pid 3379, rss 28434432, vmem 401707008
Process "oceanM", pid 3423, rss 4895670272, vmem 5635981312
Process "oceanM", pid 3424, rss 4907446272, vmem 5648654336
Process "oceanM", pid 3425, rss 4902199296, vmem 5648379904