Overview

  • What's in a NAME?
    • A schematic overview
    • Particles and sub-grid-scale (Lagrangian) motion
    • (Eulerian) Chemistry
  • Parallelism:
    • Existing threaded parallelism (particle decomposition)
    • Distributed particle decomposition
    • Distributed domain decomposition
  • Development/Test Strategy
    • A "staged de-replication"
  • Results
    • Volcanic Ash
    • Chemistry
    • Greenhouse Gas
  • The future...

Schematically...

Schematic

Mixed Lagrangian...

Eyjafjallajökull Eruption April 2010 (Photo: Árni Friðriksson)
  • Particle motion
    • Advected by background wind field
    • Other processes including convection/plume effects
    • Random motion to represent unresolved turbulent scales (see the sketch below)
  • Statistics
    • Accuracy requires sufficient number of (pseudo-)particles
    • Higher resolution requires more particles
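
A minimal sketch of such a particle step (not NAME's actual scheme; the routine and variable names are illustrative), combining advection by the interpolated wind with a simple random displacement for the unresolved scales:

    ! Sketch only: one particle step = advection + random walk.
    ! u is the background wind already interpolated to the particle position;
    ! sigma is an illustrative turbulent velocity scale.

    subroutine particleStep(x, u, dt, sigma)

      use, intrinsic :: iso_fortran_env, only : real64

      real (real64), intent(inout) :: x(3)    ! particle position
      real (real64), intent(in)    :: u(3)    ! background wind at x
      real (real64), intent(in)    :: dt      ! time step
      real (real64), intent(in)    :: sigma   ! turbulent velocity scale

      real (real64) :: r(3)                   ! uniform random numbers in [0,1)

      call random_number(r)

      ! Advection by the wind plus a random displacement for unresolved turbulence
      x(:) = x(:) + dt*u(:) + sigma*sqrt(dt)*(2.0_real64*r(:) - 1.0_real64)

    end subroutine particleStep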

...Eulerian

Schematic of Lagrangian to Eulerian conversion for Chemistry
  • Particles
    • Each carries a payload of chemical constituents
    • Require a sum to give a concentration at a given location (see the sketch below)
  • Chemistry
    • A "black box" which is entirely local to a single Eulerian cell
    • Concentrations must be "re-distributed" to particles
    • A separate Eulerian advection facility is not considered here
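
A hedged sketch of the particle-to-grid sum described above (cell indexing, array shapes and names are illustrative; NAME's own output machinery is more general):

    ! Sketch only: accumulate each particle's chemical payload into the
    ! Eulerian cell that contains it. A unit origin and uniform cell size
    ! dx(3) are assumed; divide by cell volume afterwards for concentrations.

    subroutine particlesToGrid(nParticles, x, mass, dx, conc)

      use, intrinsic :: iso_fortran_env, only : real64

      integer,       intent(in)    :: nParticles
      real (real64), intent(in)    :: x(3, nParticles)   ! particle positions
      real (real64), intent(in)    :: mass(:, :)         ! (species, particle)
      real (real64), intent(in)    :: dx(3)              ! Eulerian cell size
      real (real64), intent(inout) :: conc(:, :, :, :)   ! (species, i, j, k)

      integer :: n, i, j, k

      do n = 1, nParticles
        i = 1 + int(x(1, n)/dx(1))
        j = 1 + int(x(2, n)/dx(2))
        k = 1 + int(x(3, n)/dx(3))
        conc(:, i, j, k) = conc(:, i, j, k) + mass(:, n)
      end do

    end subroutine particlesToGrid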

Thread parallelism

Source generates particles

An array of particles in memory

Update particles...


    

    !$omp parallel do
    do n = 1, nParticles
      call particleUpdateMe(n, metData, ...)
    end do
Particles distributed between threads

Particle data structure


       type (Particle_)
         integer (int64)        :: iUP      ! unique identifier
         real    (pos_t)        :: x(3)     ! position
         integer (int32)        :: iCoord   ! coordinate system
         integer (int64)        :: rng(2)   ! RNG state
         ...
         type (Extra_), pointer :: extra    ! optional extra data
         real    (real64)       :: mass(:)  ! chemical species
       end type Particle_

Memory requirement


    Particle_  ! c. 128 bytes
    Extra_     ! c. 400 bytes (if required)
    mass(:)    ! typically 42 species (if required)
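
For scale, and combining the figures above with the 48 million particles of the greenhouse gas benchmark below: 48 million × 128 bytes is roughly 6 GB for the base Particle_ array alone, and a 42-species mass(:) payload adds about 42 × 8 = 336 bytes per particle, a further 16 GB or so. This is the memory pressure that motivates distributing particles across tasks.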
  

Output

Particles may contribute to output...

Particles contribute to output

Distributed particle decomposition

A copy of the source generates all particles...

Particles distributed between tasks

Particles assigned to MPI task based on unique index iUP
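
One simple way to realise such an assignment (a sketch; NAME's actual rule may differ) is a modulo mapping of the unique identifier, which spreads particles roughly evenly across tasks:

    ! Sketch only: deterministic particle-to-task mapping from iUP.

    pure function particleOwner(iUP, nTasks) result(iTask)

      use, intrinsic :: iso_fortran_env, only : int64

      integer (int64), intent(in) :: iUP      ! unique particle identifier
      integer,         intent(in) :: nTasks   ! number of MPI tasks
      integer                     :: iTask    ! owning rank in [0, nTasks-1]

      iTask = int(mod(iUP, int(nTasks, int64)))

    end function particleOwner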

Update particles...


    do n = 1, nParticlesLocal
      call particleUpdateMe(n, metData, ...)
    end do
    

Greenhouse Gas Benchmark

  • UK plus surrounding region (backwards in time)
    • NWP data: 1.5km, 57 vertical levels at 1 hour
    • Original 480,000 particles (c. 30 minutes on 1 node); 48 million particles in under 4 hours
Efficiency for different problem sizes

Output (again)

Particles may contribute to output...

Particles contribute to output
Solution
  • Each task forms its own contribution to (replicated) output
  • Communicate to add contributions
  • Single MPI task performs actual output to file
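
A minimal sketch of the reduction step, assuming each task holds a copy of the (replicated) output field and rank 0 performs the file output; the names and the flat field shape are illustrative:

    ! Sketch only: sum per-task contributions onto rank 0 before output.

    subroutine reduceOutput(nField, field, comm)

      use, intrinsic :: iso_fortran_env, only : real64
      use mpi

      integer,       intent(in)    :: nField
      real (real64), intent(inout) :: field(nField)   ! this task's contribution
      integer,       intent(in)    :: comm

      real (real64), allocatable :: fieldSum(:)
      integer :: rank, ierr

      call MPI_Comm_rank(comm, rank, ierr)

      allocate(fieldSum(nField))
      call MPI_Reduce(field, fieldSum, nField, MPI_DOUBLE_PRECISION, MPI_SUM, &
                      0, comm, ierr)

      ! Rank 0 now holds the global sum and can write it to file
      if (rank == 0) field(:) = fieldSum(:)

      deallocate(fieldSum)

    end subroutine reduceOutput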

Volcanic Ash Benchmark

  • Askja (Iceland)
    • NWP data: global UM at 3 hours
    • Smaller: 750,000 particles; Larger: 7,500,000 particles
Thread and task efficiency

Chemistry

Chemistry schematic
Problem
  • Chemistry performed on an Eulerian grid, so ...
  • All particles in a given Eulerian cell must be present on same MPI task
James Gillray cartoon showing Pitt and Napoleon

Reference: https://upload.wikimedia.org/wikipedia/commons/9/97/Caricature_gillray_plumpudding.jpg

Some machinery

Particle memory management
  • Complete re-write
  • Separate from application concerns
MPI data types
  • For Particle_ structure
  • Also some components therein
  • Used to form messages (see the sketch below)
Patches
  • Auxiliary list of particles for decomposition
  • Helps to sort particles in horizontal and vertical
  • (Uses Peano-Hilbert order in horizontal for potential load balance)
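
As an illustration of the MPI data type item above (a sketch for a cut-down Particle_ with three components; the real type has more, and the committed type can be resized to the true extent of Particle_ so that all particles bound for a given neighbour travel in one message):

    ! Sketch only: build an MPI derived datatype for a simplified Particle_.

    subroutine buildParticleType(newtype)

      use, intrinsic :: iso_fortran_env, only : int32, int64, real64
      use mpi

      integer, intent(out) :: newtype

      type :: Particle_
        integer (int64)  :: iUP
        real    (real64) :: x(3)
        integer (int32)  :: iCoord
      end type Particle_

      type (Particle_) :: p
      integer :: blocklen(3), types(3), ierr
      integer (MPI_ADDRESS_KIND) :: disp(3), base

      ! Displacements of each component relative to the start of a particle
      call MPI_Get_address(p%iUP,    disp(1), ierr)
      call MPI_Get_address(p%x,      disp(2), ierr)
      call MPI_Get_address(p%iCoord, disp(3), ierr)
      base    = disp(1)
      disp(:) = disp(:) - base

      blocklen(:) = [1, 3, 1]
      types(:)    = [MPI_INTEGER8, MPI_DOUBLE_PRECISION, MPI_INTEGER4]

      call MPI_Type_create_struct(3, blocklen, disp, types, newtype, ierr)
      call MPI_Type_commit(newtype, ierr)

    end subroutine buildParticleType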

Recall...

Schematic

Domain decomposition

A copy of the source generates all particles...


    ! Based on initial particle position (x,y,z)

    if (isEulerian) then
      ! Assign to horizontal patch based on (x,y)
    else
      ! Don't care...
      ! Assign to horizontal patch based on iUP 
    end if
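
To make the Eulerian branch concrete (a sketch with illustrative names; the real geometry lives in the Patches_ structure used later), the horizontal patch indices might be obtained from the release position as:

    ! Sketch only: horizontal patch indices from an (x, y) position.
    ! x0, y0 (domain origin) and dxPH, dyPH (patch widths) are illustrative.

    pure subroutine patchFromPosition(x, y, x0, y0, dxPH, dyPH, iXPH, iYPH)

      use, intrinsic :: iso_fortran_env, only : real64

      real (real64), intent(in)  :: x, y          ! particle position
      real (real64), intent(in)  :: x0, y0        ! domain origin
      real (real64), intent(in)  :: dxPH, dyPH    ! patch width in x and y
      integer,       intent(out) :: iXPH, iYPH    ! patch indices (1-based)

      iXPH = 1 + int((x - x0)/dxPH)
      iYPH = 1 + int((y - y0)/dyPH)

    end subroutine patchFromPosition

The pair (iXPH, iYPH) is then mapped to a single global patch index, and hence an owning task, by patchXYIndex() shown under Assertions below.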

    

Particle movement ...

Particle changes MPI task
In fact:
  • Group all particles crossing a particular boundary into a single message

Chemistry parallelisation

Eulerian concentrations
  • Storage remains replicated
  • Work (domain-) distributed
Threaded parallelisation
  • Implementation completely rewritten
  • Based on patch structure (e.g., sorting in vertical); see the sketch below
  • Scheduling/synchronisation issues removed (identified via profile)
Updated concentrations
  • Gather to rank 0 for output
  • Eulerian physical processes/advection not considered
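
A hedged sketch of the threaded chemistry over the patch structure (names such as chemistryUpdateCell are placeholders for the "black box" cell-local chemistry, not NAME's own routines):

    ! Sketch only: chemistry is local to each Eulerian cell, so cells within
    ! the locally owned patches can be updated in parallel by threads.

    subroutine chemistryAllPatches(nPatchesLocal, nCellsInPatch, conc, dt)

      use, intrinsic :: iso_fortran_env, only : real64

      integer,       intent(in)    :: nPatchesLocal
      integer,       intent(in)    :: nCellsInPatch(nPatchesLocal)
      real (real64), intent(inout) :: conc(:, :)    ! (species, cell) storage
      real (real64), intent(in)    :: dt

      integer :: iPatch, iCell

      !$omp parallel do schedule(dynamic) private(iCell)
      do iPatch = 1, nPatchesLocal
        do iCell = 1, nCellsInPatch(iPatch)
          call chemistryUpdateCell(iPatch, iCell, conc, dt)
        end do
      end do
      !$omp end parallel do

    end subroutine chemistryAllPatches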

Chemistry Benchmark

  • UK plus north-west Europe (Eulerian grid at c. 10km)
    • NWP data: c. 0.5° in horizontal, 50 vertical levels at 4 hours
    • STOCHEM with 42 species
Thread and task efficiency

Development

Strategy

  • "Staged de-replication"
    • Refactored output/errors (everything else replicated)
    • Run multiple redundant copies (single output)

Reproducibility

  • Code gives same result independent of threads/tasks
  • But not the same result as previous versions ...
    • Random number generator initialisation per particle
    • Serial algorithms replaced by parallel algorithms

Testing

No tests existed in the original (!)

  • A number of levels introduced
    • Assertions to enforce design-by-contract
    • Unit tests for new functionality
    • A number of standard (performance) benchmarks for repeatability
  • Quality
    • No warnings with gfortran -Wall (new/refactored code)
    • Portability: GNU, NAG, Intel, Cray, (Windows?)
    • Review by Met Office before introduction in trunk

Assertions

    
  pure function patchXYIndex(patches, iXPH, iYPH) result(iPatchGlobal)

    ! Map patch (iXPH, iYPH) to one-dimensional patch index.

    type (Patches_), intent(in) :: patches
    integer,         intent(in) :: iXPH
    integer,         intent(in) :: iYPH
    integer                     :: iPatchGlobal

    assert(1 <= iXPH .and. iXPH <= patches%nxPH)
    assert(1 <= iYPH .and. iYPH <= patches%nyPH)

    iPatchGlobal = peanoHilbert2DTo1D(patches%nPH, iXPH, iYPH)

    assert(1 <= iPatchGlobal .and. iPatchGlobal <= patches%nGlobal)

  end function patchXYIndex
    

Implementation via pre-processor macro
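
One possible form of such a macro (a sketch; the NAME macro may differ in detail), with the C pre-processor removing the checks entirely when NDEBUG is defined:

    ! Sketch only: assert() via cpp; expands to nothing when NDEBUG is set.

    #if defined(NDEBUG)
    #define assert(cond)
    #else
    #define assert(cond) if (.not. (cond)) call assertFail(__FILE__, __LINE__)
    #endif

    ! A minimal failure handler (illustrative)
    subroutine assertFail(file, line)
      character (len=*), intent(in) :: file
      integer,           intent(in) :: line

      write (*, '(3a, i0)') 'Assertion failed at ', file, ':', line
      stop 1
    end subroutine assertFail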

Unit tests

Stand-alone unit tests run for new functionality (MPI/threads)

  • Coverage assessed via gcov
  • e.g., gfortran --coverage
  • lcov produces a test report

Summary

Aim of the work:

  • Produce a functional distributed memory version
  • More memory allows more particles:
    • Allow increased size/resolution while maintaining time-to-solution
    • Maintain statistical accuracy

Exposed a number of bottlenecks:

  • Output
    • ASCII could be replaced by netCDF
  • Release of particles
    • Requires a parallel implementation
  • Reading large NWP meteorological data sets
    • Improved threaded parallelism should help

Credits

At UK Met Office

  • Dave Thomson (Science lead)
  • Ben Devenish (Benchmarks/Volcanos)
  • Andrew Jones (Software/tests)
  • Ben Evans (Project lead)

Credits

Photo: Árni Friðriksson

https://commons.wikimedia.org/wiki/File:Eyjafjallajokull-April-17.JPG

Improving performance of Met Office NAME code

Kevin Stratford (Proposal: IP07)

Eyjafjallajökull Eruption April 2010 (Photo: Árni Friðriksson)
  • Background
    • Run operationally and for research
    • Based on particles
  • What's needed
    • Fortran / OpenMP
    • Running tests and benchmarks / assessing performance

Improving performance of Met Office NAME code

    
       type (Particle_)
         integer (int64) :: iUP
         real    (pos_t) :: x(3)
         integer (int32) :: iCoord
         .
         .
         .
       end type Particle_
    
An array of particles in memory
    
       type (ParticleArray_)
         integer (int64) :: iUP(:)
         real    (pos_t) :: x(3,:)
         integer (int32) :: iCoord(:)
         .
         .
         .
       end type ParticleArray_
    
  • What's in it for me?
    • Experience working on a real code
    • Met Office are engaged and interested in the work

Performance portability ...

Kevin Stratford (Proposal: EP28)

  • Problem
    • Want to run on both CPU and GPU
    • Can we do it without writing two versions of the source code?
  • Solution
    • Need abstraction of the thread model / memory layout
    • Various ways are available... (we have "targetDP")

... simulation of particle suspensions

Kevin Stratford (Proposal: EP28)

  • CFD code "Ludwig"
    • Some areas of code abstracted
    • Want to extend to full coverage of moving particles
  • Skills
    • Must be happy with C / pointers / data structures
    • OpenMP / GPU knowledge a plus
  • What's in it for me?
    • A key growth area in scientific computing
    • Practical experience with world-class research code