Why does the floats solution behave with anti-parallel speed-up?

Bug reports, work arounds and fixes

AlexisEspinosa
Posts: 23
Joined: Fri May 24, 2013 3:05 am
Location: UWA

Why does the floats solution behave with anti-parallel speed-up?

#1 Unread post by AlexisEspinosa »

Hello ROMS experts,

I suspect that the behaviour I am seeing when solving floats in parallel is not normal, which is why I'm posting it as a bug. Below is a detailed explanation that I hope is clear enough; my questions are numbered 1) and 2) in the text.

My starting point is the simulation I want to run: I want to follow a huge number of particles (floats) within a simulation of a bay. But the floats solution is extremely slow!

Without floats, the execution time for the hydrodynamic fields speeds up properly in parallel. An optimal number of 104 partitions (NtileI == 8, NtileJ == 13) lets me complete a 21 ocean-day simulation in around 7 hours of execution time. This means that my history files, which are saved every hour of ocean time, are written roughly every minute of execution time. (The mesh size is Lm == 160, Mm == 256, N == 40.)

But when solving for 1,024,000 floats to be released at the free surface (25 floats per horizontal cell), the execution time increases dramatically. And, unexpectedly, it increases further when more partitions are used. I measured the following execution times:

PARTITIONS.......EXECUTION TIME FOR 1 HOUR OF OCEAN TIME
104.....................~1 minute (hydrodynamics only, no floats)
104.....................~37 minutes (with floats to be released)
48......................~35 minutes (with floats to be released)
12......................~33 minutes (with floats to be released)
6.......................~22 minutes (with floats to be released)
2.......................~24 minutes (with floats to be released)
1.......................~27 minutes (with floats to be released)

While some increase in execution time is natural because the Lagrangian particle trajectories have to be solved, I would expect this overhead to shrink in parallel; instead, it grows when running in parallel! The flow-field solution alone should still be speeding up with more partitions. I suspect the small reduction in execution time from 1 to 6 partitions comes from the speed-up of the hydrodynamics alone, while the floats already add overhead from 2 partitions onwards. Beyond 6 partitions, the anti-parallel overhead of the floats solution kills any speed-up from the hydrodynamics and dominates the execution time.

1) So my first question is: why does this anti-parallel behaviour occur for the solution of the float trajectories? Is it a bug or the expected behaviour? Is there anything that can be done to improve the execution time, perhaps some compilation options? I really do not know what to do! I thought it might be due to excessive writing of float data to the floats file, but I tested a run with NFLT (the float writing interval) increased by a factor of 1000 and the execution time stayed the same. A take-what-you-can-get approach would be to keep running with 6 partitions, but I would like to understand what is happening and what can be done to improve the execution time while solving for the particle trajectories.

Another weird behaviour of this simulation is that the increase in execution time occurs from the very beginning of the run. I found this strange, because the particles are not released from the very beginning: they are released only after 5 hours of ocean time. Yet the first 5 ocean-hours take the same execution time as the following ocean-hours with particles already released. Moreover, I am releasing the 25 batches of particles (160x256 each) at a rate of 1 batch per ocean-hour, and the execution time is the same with no particles in the ocean, with the first batch in the ocean, with two batches, or with all 25 batches. So the degradation of the execution time does not depend on actually solving the particle trajectories, but on the total number of particles that are GOING TO BE released. Even if the particles were to be released a year of ocean time into the run, the solution would degrade from the first second of ocean time. The run can spend an eternity of ocean time without any particles, yet still pay an execution-time overhead just because of the number of particles that will be released sometime in the future, possibly a long, long time away! This is not a problem when no particles are involved, because the hydrodynamics alone can be solved in a first stage; the problem arises when the particles are released gradually and you suffer the overhead of the particles already in the ocean plus all the particles that are yet to be released!

2) So my second question is: what is happening here?! Is this overhead, which appears even before any particles have been released, expected behaviour or a bug? What can be done to improve the execution time?

My set-up has worked properly in the past with a smaller number of particles (~10,000). In that case the execution time was decent enough that I never measured the degradation due to the floats. But now, with ~1 million particles TO BE RELEASED, the "anti-parallel" and "overhead-without-particles-in-the-ocean" problems are very obvious. I hope this is not a major bug that is too difficult to correct; hopefully it can be fixed easily. Please tell me how if you figure it out!

Thanks a lot for your feedback,
Alexis Espinosa

kate
Posts: 4088
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: Why does the floats solution behave with anti-parallel speed-up?

#2 Unread post by kate »

Frankly, I'm not surprised by this. The float code is designed to be parallel with domain decomposition, just as the hydrodynamics is. Each tile has to go through all the floats and ask each one "are you mine?" and "are you active?". It zeroes out all the floats that don't live on the tile, and then a global sum reduction takes place at the end of each timestep component to share the float location information.
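For what it's worth, here is a rough free-form sketch of that pattern. It is illustrative only, not the actual ROMS source: the routine and variable names (step_floats_sketch, Xgrd, Ygrd, track) are made up.

Code: Select all

subroutine step_floats_sketch (Istr, Iend, Jstr, Jend, Nfloats,        &
                               Xgrd, Ygrd, track)
  implicit none
  integer, intent(in)    :: Istr, Iend, Jstr, Jend, Nfloats
  real(8), intent(in)    :: Xgrd(Nfloats), Ygrd(Nfloats)
  real(8), intent(inout) :: track(Nfloats)
  integer :: l
  logical :: mine

  do l = 1, Nfloats
    ! "are you mine?" -- keep only floats inside this tile's bounds
    mine = (Xgrd(l) >= real(Istr,8)) .and. (Xgrd(l) < real(Iend+1,8))  &
     .and. (Ygrd(l) >= real(Jstr,8)) .and. (Ygrd(l) < real(Jend+1,8))
    if (mine) then
      continue              ! advance this float (time stepping omitted)
    else
      track(l) = 0.0d0      ! zero out floats owned by other tiles
    end if
  end do
  ! After every tile has done this O(Nfloats) pass, a global SUM
  ! reduction recombines the nonzero entries on all processes, so the
  ! per-step cost grows with the float count regardless of the tile count.
end subroutine step_floats_sketch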

If you are saving hourly, can you run the floats offline using the hourly stored fields? A better parallel method for that many floats would be to assign 1,000,000/N of the floats to each of the N processes. Then each process would have to read the full grid of fields and be able to track its floats all over the domain. Anyway, we have been running large numbers of floats offline with the (serial) tracmass code.
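As a minimal sketch of that float-based decomposition (the program and variable names below are hypothetical, and the numbers simply echo the 104-process, ~1 million float case above):

Code: Select all

program float_partition_sketch
  implicit none
  integer :: nprocs, myrank, Nfloats, chunk, Lstr, Lend

  nprocs  = 104          ! number of processes (example)
  Nfloats = 1024000      ! total number of floats
  chunk   = (Nfloats + nprocs - 1)/nprocs   ! floats per process (ceiling)
  do myrank = 0, nprocs-1
    Lstr = 1 + myrank*chunk
    Lend = min(Nfloats, Lstr + chunk - 1)
    ! rank "myrank" would step only floats Lstr..Lend, but would read
    ! the full velocity fields so it can follow them anywhere in the domain
    if (myrank < 3) then
      print '(a,i4,a,i8,a,i8)', 'rank ', myrank, ' tracks floats ',    &
            Lstr, ' to ', Lend
    end if
  end do
end program float_partition_sketch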

m.hadfield
Posts: 521
Joined: Tue Jul 01, 2003 4:12 am
Location: NIWA

Re: Why does the floats solution behave with anti-parallel speed-up?

#3 Unread post by m.hadfield »

Is OpenMP an option for you? In my experience, this is much better for parallel runs with floats than MPI. I presume this is because the overhead that Kate describes is absent or greatly reduced.

AlexisEspinosa
Posts: 23
Joined: Fri May 24, 2013 3:05 am
Location: UWA

Re: Why does the floats solution behave with anti-parallel speed-up?

#4 Unread post by AlexisEspinosa »

So this sweep over all the floats is done by every tile at every time step? Clearly that is not optimal and is hurting the parallel execution!

I would have thought this would be done only once (at the first time step) in order to 'sieve' the particle information into the corresponding partition: all the particles already floating in a partition's domain, or due to be released there, would then be managed by that partition's processor. If a particle leaves one partition for another, its information would be sent to the receiving partition.
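Just to make the idea concrete, here is a purely hypothetical sketch of such a 'sieve' (none of this is existing ROMS code; the tiling arithmetic is simplified, and the grid and tile sizes are just the ones from my first post):

Code: Select all

program float_owner_sketch
  implicit none
  integer, parameter :: Lm = 160, Mm = 256        ! interior grid size
  integer, parameter :: NtileI = 8, NtileJ = 13   ! tile partition
  real(8) :: Xgrd, Ygrd
  integer :: isize, jsize, itile, jtile, owner

  Xgrd = 37.4d0        ! example fractional I-position of one float
  Ygrd = 120.9d0       ! example fractional J-position of one float

  isize = (Lm + NtileI - 1)/NtileI   ! tile width in I (ceiling division)
  jsize = (Mm + NtileJ - 1)/NtileJ   ! tile width in J (ceiling division)
  itile = int(Xgrd)/isize            ! tile column holding the float
  jtile = int(Ygrd)/jsize            ! tile row holding the float
  owner = jtile*NtileI + itile       ! process that would own this float

  print '(a,i4)', 'float owned by tile/rank ', owner
  ! When the float drifts across a tile boundary, its state (position,
  ! release time, etc.) would have to be handed to the new owner, e.g.
  ! with point-to-point messages -- which is the tricky part.
end program float_owner_sketch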

At every output write, the data would need to be gathered and reordered, but that overhead would only occur at writing times.

I would also have thought that some kind of flag would mark the YET TO BE RELEASED particles, setting them aside so that no resources are wasted on them until it is really time to release them into the ocean.

I know a mechanical-engineering CFD code (OpenFOAM) that uses this kind of approach, although unfortunately it is not really set up for ocean hydrodynamics. Still, some ideas could be borrowed from its strategy. It also uses fractional time steps, inserting an intermediate step whenever a particle crosses a cell boundary. This avoids errors in the velocity interpolation (the interpolation is always done with the data of the cell the particle is actually in) and also controls when a particle crosses into another partition (processor).

The only difficulty with this approach is that when a large number of particles is concentrated in a single partition, none of the other processors can help with the particle trajectories. I guess it depends on the application, but the strategy above makes more sense to me, especially the flag for the YET TO BE RELEASED particles, which is independent of the parallel strategy and would save a huge amount of resources.

Anyway, it's easy to think up, but I do not know how difficult it would be to implement in ROMS!

But with ROMS as it is right now, I'll need to think of 'clever' ways to solve my particle trajectories in less time, even if that means running many simulations with fewer particles each and stitching the results back together in postprocessing. I'll also look into OpenMP and offline particle tracking. The problem with offline solvers is the volume of velocity fields that must be saved: right now I write results every hour, but the trajectories are computed every time step. I don't think hourly velocity fields would be enough for my purposes, so I may need to save fields more frequently, which may create other problems. Ha ha, there is no free lunch, is there?

Thanks a lot for your feedback,
Alexis Espinosa-Gayosso
UWA

m.hadfield
Posts: 521
Joined: Tue Jul 01, 2003 4:12 am
Location: NIWA

Re: Why does the floats solution behave with anti-parallel speed-up?

#5 Unread post by m.hadfield »

AlexisEspinosa wrote:If a particle leaves one partition for another, its information would be sent to the receiving partition.
I think this is the tricky bit. :wink:

hugobastos
Posts: 15
Joined: Tue May 06, 2008 8:46 pm
Location: FURG

Re: Why does the floats solution behave with anti-parallel speed-up?

#6 Unread post by hugobastos »

Hi,

I think your problem of slowness, even before the floats are released, can be worked around if you have the patience to test a small modification to the code.

The idea is to call the float routines only when close to release time, right!? You can edit main2d.F and main3d.F and, just after the #ifdef FLOATS, set up an IF test on the release time of your floats (assuming the setup is simple).

In the case below there is only one release time (the same for all floats) and no restart.

I define two new variables in modules/mod_scalars.F so you can set them in your ocean.in file :

Code: Select all

      integer, dimension(Ngrids) :: float_ndt          ! time (seconds) of the first float release
      integer, dimension(Ngrids) :: float_ini_twindow  ! number of dt steps of margin before the release
In main2d.F, after #ifdef FLOATS, you can add:

Code: Select all

!
      Lfloats(ng)=.FALSE.
      IF ((FIRST_2D_STEP).or.                                           &
     &    (time(ng).ge.(float_ndt(ng)-float_ini_twindow(ng)*dt(ng)))) THEN
        Lfloats(ng)=.TRUE.
      END IF

      IF (Lfloats(ng)) THEN
!$OMP PARALLEL DO PRIVATE(thread,chunk_size,Lstr,Lend)                  &
!$OMP&            SHARED(ng,numthreads,Nfloats)
        DO thread=0,numthreads-1
          chunk_size=(Nfloats(ng)+numthreads-1)/numthreads
          Lstr=1+thread*chunk_size
          Lend=MIN(Nfloats(ng),Lstr+chunk_size-1)
          CALL step_floats (ng, Lstr, Lend)
          ...
And in main3d.F you can add the same, but with the logic based on the 3D time steps:

Code: Select all

!
      Lfloats(ng)=.FALSE.
      IF ((iic(ng).eq.ntstart(ng)).or.                                  &
     &    (time(ng).ge.(float_ndt(ng)-float_ini_twindow(ng)*dt(ng)))) THEN
        Lfloats(ng)=.TRUE.
      END IF

      IF (Lfloats(ng)) THEN
!$OMP PARALLEL DO PRIVATE(thread,chunk_size,Lstr,Lend)                  &
!$OMP&            SHARED(ng,numthreads,Nfloats)
        DO thread=0,numthreads-1
          chunk_size=(Nfloats(ng)+numthreads-1)/numthreads
          Lstr=1+thread*chunk_size
          Lend=MIN(Nfloats(ng),Lstr+chunk_size-1)
          CALL step_floats (ng, Lstr, Lend)
          ...

These lines belong inside main2d.F and main3d.F because, as far as my grep shows, those are the only top-level places where ROMS calls the float routines (for the nonlinear model, at least!).

Note that it may be necessary to call the routine at the initial time step to avoid initialization problems. The subtraction of the window is there so that, just before the time stepping that releases the floats, we again avoid any problems related to initialization or earlier data gathering that could affect the release at the desired time step. You can set a large value for the window to make sure you are not caught out by the time stepping or by fractional 2D time steps (it is preferable to use a 2D time step that is an integer, not fractional, since the idea is to avoid the 3D kernel calling the float routine while the 2D kernel misses the call).

Yes, I force Lfloats to .FALSE. for debugging purposes, to make sure floats are only active after the time you need them (this checks the complexity of the problem: whether it goes wrong or right, we will know whether other calls need checking). If this works you can remove that line and apply the time IF clause directly, avoiding the set/reset of the variable, and we will know this is the right place for it.

Of course you can implement this properly (working in integer time steps), but you will need to look up the time variables in mod_stepping.F; I think simple arithmetic based on the time-step counters iif/iic (2D/3D) will do. You will need to think through the logic of the time stepping. For simplicity, I use the approach above in seconds, with an input from ocean.in. If you want, look at how the time step is matched in output.F (ndefHIS, for example).
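For example, a possible integer-step version of the test could look like the snippet below. Here float_release_step(ng) and an integer (steps, not seconds) float_ini_twindow(ng) are hypothetical additions, not existing ROMS variables; iic(ng), ntstart(ng) and dt(ng) are the counters already used above.

Code: Select all

      ! convert the release time (seconds) into a 3D step count, once
      float_release_step(ng)=ntstart(ng)+INT(float_ndt(ng)/dt(ng))

      Lfloats(ng)=.FALSE.
      IF ((iic(ng).eq.ntstart(ng)).or.                                  &
     &    (iic(ng).ge.float_release_step(ng)-float_ini_twindow(ng))) THEN
        Lfloats(ng)=.TRUE.
      END IF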

If you have floats with several different release times, or restarts, the approach above still applies, but you would also loop through the FLT object (I think FLT%info holds the information, but I am not sure) to find the minimum release time step required. Again, you will need to match the right values in the IF clause.

Good luck!
