TERRESTRIAL SYSTEMS MODELLING PLATFORM
J. BENKE, D. CAVIEDES VOULLIEME, S. POLL, G. TASHAKOR, I. ZHUKOV
JÜLICH SUPERCOMPUTING CENTRE (JSC)
PORTING A COUPLED MULTISCALE AND MULTIPHYSICS EARTH SYSTEM MODEL TO
HETEROGENEOUS ARCHITECTURES (BENCHMARKING AND PERFORMANCE
ANALYSIS)
j.benke@fz-juelich.de, d.caviedes.voullieme@fz-juelich.de, g.tashakor@fz-juelich.de, s.poll@fz-juelich.de, i.zhukov@fz-juelich.de
http://www.fz-juelich.de/ias/jsc/slts
http://www.hpsc-terrsys.de
@HPSCTerrSys
HPSC TerrSys
TERRESTRIAL SYSTEMS
MODELLING PLATFORM (TSMP)
A QUICK OVERVIEW
TERRESTRIAL SYSTEM MODELLING PLATFORM (TSMP)
• Represents processes for soil, land, vegetation and
atmosphere
• Numerical modelling system coupling COSMO / ICON,
Community Land Model (CLM) and ParFlow
 Fully modular
• Physically-based representation of transport processes
of mass, energy and momentum
• Component models can have different spatio-temporal
resolution; explicit feedbacks between compartments
• Parallel Data Assimilation Framework (TSMP-PDAF)
• Multiple Program Multiple Data (MPMD) execution model; OASIS provides a common MPI_COMM_WORLD (illustrated in the sketch below)
• https://www.terrsysmp.org/
Solving the terrestrial water and energy cycle from groundwater to the atmosphere (component diagram: ICON / COSMO, CLM, ParFlow, PDAF)
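To illustrate the MPMD idea mentioned above, here is a minimal sketch (not the OASIS3-MCT API): all component executables start in one shared MPI_COMM_WORLD, and each derives its own component communicator. The component names, the COLORS mapping and the way a component identifies itself are assumptions made only for this illustration; in TSMP this splitting is handled internally by OASIS.

    # Minimal MPMD illustration with mpi4py (conceptual only; OASIS3-MCT does
    # the equivalent splitting internally for COSMO, CLM and ParFlow).
    from mpi4py import MPI

    COMPONENT = "cosmo"                      # would differ per executable (assumption)
    COLORS = {"cosmo": 0, "clm": 1, "parflow": 2}

    world = MPI.COMM_WORLD                   # common to all MPMD executables
    local = world.Split(color=COLORS[COMPONENT], key=world.Get_rank())

    print(f"{COMPONENT}: world rank {world.Get_rank()}/{world.Get_size()}, "
          f"component rank {local.Get_rank()}/{local.Get_size()}")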
TERRESTRIAL SYSTEM MODELLING PLATFORM (TSMP)
Software features
Source codes used:
 TSMP (v1.3.3): Interface and modelling framework
 COSMO (v5.01): Atmospheric model
 CLM (v3.5): Land surface model (1D column model)
 ParFlow (v3.7): Surface and subsurface hydrological model
 OASIS3-MCT: MPI-based coupler for all submodels
 PDAF: Parallel Data Assimilation Framework (not used here)
Parallelism:
 Hybrid (MPI, MPI-CUDA, MPI-OpenMP), depending on the component model
 Heterogeneous computing enabled (ParFlow on GPUs + COSMO/CLM on CPUs)
 Performance analysis (profiling/tracing): first results exist (see the performance analysis slides)
TSMP COUPLING SCHEME
One coupling step (sketched in the toy example below):
1) COSMO and ParFlow send their coupling fields to CLM
2) COSMO and ParFlow are idle while CLM is running
3) CLM sends the coupling fields back to COSMO and ParFlow
4) COSMO and ParFlow run simultaneously (CLM is idle)
Communication pattern between the components (programs)
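The sequence above can be summarized in a small, self-contained toy sketch. The function names and return values below are placeholders invented for this illustration; in TSMP the three models are separate programs and the exchange happens through OASIS3-MCT, not through direct function calls.

    # Toy stand-in for one TSMP coupling step (sequential here; in reality the
    # models are separate MPMD programs and COSMO/ParFlow run concurrently in step 4).
    def run_clm(cosmo_fields, parflow_fields):
        return {"to_cosmo": 1.0, "to_parflow": 2.0}    # placeholder surface fluxes

    def run_cosmo(clm_fluxes):
        return {"t2m": 287.0}                          # placeholder atmospheric state

    def run_parflow(clm_fluxes):
        return {"pressure_head": -5.0}                 # placeholder subsurface state

    def coupling_step(cosmo_fields, parflow_fields):
        # 1) COSMO and ParFlow send coupling fields to CLM
        # 2) both are idle while CLM runs
        clm_out = run_clm(cosmo_fields, parflow_fields)
        # 3) CLM sends coupling fields back to COSMO and ParFlow
        # 4) COSMO and ParFlow run (simultaneously in TSMP), CLM is idle
        return run_cosmo(clm_out["to_cosmo"]), run_parflow(clm_out["to_parflow"])

    cosmo, parflow = {"t2m": 287.0}, {"pressure_head": -5.0}
    for _ in range(3):                                 # a few coupling steps
        cosmo, parflow = coupling_step(cosmo, parflow)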
SCALING TESTS
The Benchmark Scenario (SimDiCyPBL)
• Motivation:
 Systematic scaling tests and performance analyses were already performed (and published) more than 8 years ago (on JUQUEEN)
 But TSMP and its submodels have evolved since then
 The same is true for HPC architectures (especially accelerator architectures)
 Consequently, new scaling tests and performance analysis results are needed on the new machines
• SimDiCyPBL: Simulation of the Diurnal Cycle of the Planetary Boundary Layer
 A synthetic scenario of limited complexity, of interest only for checking computational correctness, for computational benchmarks and for performance analysis tests
 Adaptation of the TSMP Fall School 2019 scenario
 Predefined scenario in the TSMP Data GitHub repository (idealRTD)
 Advantage: a very adaptable problem that can be run as a very small configuration or scaled up arbitrarily for performance and scalability studies
 Among other things, the geometry, mesh size and step widths (in time and space) can easily be adapted
The Benchmark Scenario (SimDiCyPBL)
• The setup of the SimDiCyPBL scenario
 All three models are used (but not PDAF)
 Atmosphere (COSMO 5.01)
 Surface/Land/Vegetation (CLM 3.5)
 Hydrology/Hydrogeology (ParFlow 3.7)
 Area size (example case): 600 x 600 km
 Atmosphere height: 22 km, ground depth: 30 m
 Flat ground at 0 m elevation (ParFlow)
 Constant initial aquifer head: 5 m below sea level
(see blue line with arrow)
 Spatially homogeneous and constant
unsaturated zone
 Homogeneous ground/soil (with initial constant
temperature of 287 K), radiative forcing
 Periodic Boundary Conditions in x and y
direction for COSMO
Graphic adapted from TSMP FallSchool 2019, Day 2, p 6
The Benchmark Scenario (SimDiCyPBL) (contd)
• All job runs were performed on JUWELS Cluster
 CPU and GPU partition (for heterogeneous runs)
 Nodes are not shared (exclusive allocation)
 Cluster-Booster runs are planned
• Every model component gets a predefined number
of processes (nodes)
 In our test cases every node is fully utilized, i.e. 48 processes on 48 cores
 Example (CPU only):
 8 COSMO nodes, 1 CLM node, 2 ParFlow
nodes (8-1-2 scenario)
 Results in COSMO = 384, CLM = 48 and ParFlow = 96 processes (see the sketch below)
Graphic adapted from TSMP FallSchool 2019, Day 2, p 6
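The mapping from an X-Y-Z node scenario to MPI processes can be written down explicitly. The small helper below is a sketch that assumes 48 ranks per fully used CPU node (as stated above) and 4 ranks per node when ParFlow runs on the node's 4 GPUs (as in the heterogeneous runs described later).

    # Sketch: MPI ranks for an "X-Y-Z" scenario (COSMO-CLM-ParFlow node counts).
    RANKS_PER_CPU_NODE = 48    # 48 processes on 48 cores (fully utilized node)
    RANKS_PER_GPU_NODE = 4     # one rank per GPU when ParFlow runs on GPUs

    def mpi_ranks(cosmo_nodes, clm_nodes, parflow_nodes, parflow_on_gpu=False):
        parflow_rpn = RANKS_PER_GPU_NODE if parflow_on_gpu else RANKS_PER_CPU_NODE
        return {"COSMO": cosmo_nodes * RANKS_PER_CPU_NODE,
                "CLM": clm_nodes * RANKS_PER_CPU_NODE,
                "ParFlow": parflow_nodes * parflow_rpn}

    # 8-1-2 CPU-only example from this slide: COSMO=384, CLM=48, ParFlow=96
    print(mpi_ranks(8, 1, 2))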
The Machine (JUWELS Cluster and Booster)
• JUWELS (Cluster-Booster-System; batch system Slurm)
 Cluster
 2271 standard compute nodes, 56 accelerated compute nodes
 2x Intel Xeon Platinum 8168 CPUs, 2x 24 cores, 2.7 GHz, 2 hardware threads/core (CPU partition)
 96 GB DDR4-2666
 Infiniband EDR (ConnectX-4)
 Additional 4x NVIDIA V100 GPUs, 16 GB HBM (GPU partition of JUWELS Cluster)
 10.6 (CPU) + 1.7 (GPU) PetaFlops peak performance (System)
 Booster
 936 accelerated compute nodes
 2x AMD EPYC Rome 7402 CPUs, 2x 24 Cores, 2.8 GHz
 512 GB DDR4-3200
 4x NVIDIA A100 GPUs, 4x 40 GB HBM2e
 Infiniband HDR200 (ConnectX-6)
 73 PetaFlops peak performance (System)
JUWELS (CLUSTER AND BOOSTER)
STRONG SCALING
EXPERIMENTS
Strong Scaling Experiment 300x300 surface mesh (config)
• Model domain: 300 x 300 km
• Model mesh size (nx x ny x nz):
 COSMO: 306 x 306 x 50 (approx. 4.7 × 10^6 nodes)
 CLM: 300 x 300 x 10 (9.0 × 10^5 nodes)
 ParFlow: 300 x 300 x 30 (2.7 × 10^6 nodes)
 Step width (space): Δx = Δy = 1 km
• Model simulation time: 6 hours
• Model time step (all models): Δt = 18 seconds (200 time steps per model hour; cross-checked in the sketch below)
 Different time steps per model can be taken
• Coupling frequency OASIS-MCT: 18 seconds
• I/O interval: 6 hours (1 output of every model at the end of the benchmark)
• Used range of number of nodes in experiments: COSMO=1-16, CLM=1, ParFlow=1-8
• Process pinning and distribution (MPI only)
 Pinning: by core; Distribution: block : cyclic : cyclic
• The runtime measurement interval is from the start to the end of the job
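The numbers in this configuration can be cross-checked with a few lines of Python (a sketch; all values are taken from the list above).

    # Cross-check of the 300x300 configuration listed above.
    meshes = {"COSMO": (306, 306, 50),       # (nx, ny, nz)
              "CLM": (300, 300, 10),
              "ParFlow": (300, 300, 30)}
    for name, (nx, ny, nz) in meshes.items():
        print(f"{name}: {nx * ny * nz:.1e} mesh nodes")   # ~4.7e6, 9.0e5, 2.7e6

    dt = 18                                  # time step in seconds (all models)
    steps_per_hour = 3600 // dt              # = 200 time steps per model hour
    total_steps = steps_per_hour * 6         # 6 simulated hours -> 1200 steps
    coupling_steps = total_steps             # coupling every 18 s -> 1200 exchanges
    print(steps_per_hour, total_steps, coupling_steps)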
Strong Scaling Experiment 300x300 surface mesh (results)
(Figure: runtime in minutes vs. number of COSMO nodes, CPU-only runs on the left and CPU-GPU runs on the right)
Strong Scaling Experiment 300x300 surface mesh (results)
• Explanation of the function graphs on the previous slide:
 Left-hand side: runtime measurements for the CPU-only runs (in minutes)
 Right-hand side: runtime measurements for the CPU-GPU runs (in minutes)
 For every measurement: #CLM nodes = 1
• Every node is fully utilized with 48 processes (CPU only) or with 4 processes (for ParFlow running on GPUs)
• Every function graph of one colour shows a discrete runtime function of the number of COSMO nodes (with
#ParFlow nodes = constant)
 E.g. the red discrete function is the runtime graph of the runs with 1 CLM node, 1 ParFlow node and 1, 2, 4, 8 and 16 COSMO nodes (in both the CPU-only and the CPU-GPU case)
 Every dot shows one measurement point
 x axis: Number of COSMO nodes (#Nodes COSMO)
 y axis: Runtime of a measurement (job run) in minutes
 How to read it: to find the runtime of an 8-1-2 job (8 COSMO, 1 CLM, 2 ParFlow nodes), locate 8 on the x axis and take the intersection of the (virtual) vertical line through 8 with the green curve (#CLM nodes is always 1)
Strong Scaling Experiment 300x300 surface mesh (results)
• Both families of curves (CPU only and CPU-GPU) are strictly monotonically decreasing
 Except for #Nodes COSMO > 8
• COSMO-limited parts of a curve
 The curve is nearly constant or increasing
 Interpretation: COSMO is waiting for ParFlow
• ParFlow-limited parts of a curve
 The curve is (strictly) monotonically decreasing
 Interpretation: ParFlow is waiting for COSMO
• A (quasi) load-balanced state is the “elbow” of a curve
 Example, CPU-only case: 8 COSMO, 1 CLM, 2 ParFlow nodes (8-1-2 scenario)
• In all cases the runtime remains constant or increases at 16 COSMO nodes (no further speedup)
 The reason is the small mesh size of this scenario
• Interesting to observe: in the CPU-GPU case the fastest runs are those with #Nodes ParFlow = 1
 Regarding runtime and resource usage/energy efficiency the best choice would be 8-1-1 (CPU-GPU)
• Regarding runtime only, the optimal point would be 16-1-2 (CPU only); a small speedup/efficiency sketch follows below
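Identifying the "elbow" can be made quantitative by computing speedup and parallel efficiency along each curve. The helper below is a generic sketch; the example call uses placeholder runtimes only, not the measured values from the plots.

    # Sketch: speedup and efficiency along one curve (#ParFlow and #CLM nodes fixed),
    # relative to the smallest COSMO node count on that curve.
    def scaling_metrics(runtime_by_cosmo_nodes):
        """runtime_by_cosmo_nodes: {number of COSMO nodes: runtime in minutes}."""
        base_nodes = min(runtime_by_cosmo_nodes)
        base_time = runtime_by_cosmo_nodes[base_nodes]
        metrics = {}
        for nodes in sorted(runtime_by_cosmo_nodes):
            speedup = base_time / runtime_by_cosmo_nodes[nodes]
            metrics[nodes] = (speedup, speedup / (nodes / base_nodes))  # (speedup, efficiency)
        return metrics

    # Placeholder runtimes only (replace with the measured values from the plots):
    print(scaling_metrics({1: 60.0, 2: 35.0, 4: 22.0, 8: 16.0, 16: 16.5}))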
Strong Scaling Experiment 1200x600 surface mesh (config)
• Model domain: 1200 x 600 km
• Model mesh size (nx x ny x nz):
 COSMO: 1206 x 606 x 50 (approx. 3.7 × 10^7 nodes)
 CLM: 1200 x 600 x 10 (7.2 × 10^6 nodes)
 ParFlow: 1200 x 600 x 30 (approx. 2.2 × 10^7 nodes)
 Step width (space): Δx = Δy = 1 km
• Model simulation time: 6 hours
• Model time step (all models): Δt = 18 seconds (200 time steps per model hour)
 Different time steps per model can be taken
• Coupling frequency OASIS-MCT: 18 seconds
• I/O interval: 6 hours (1 output of every model at the end of the benchmark)
• Used range of number of nodes in experiments: COSMO=1-16, CLM=1, ParFlow=1-8
• Process pinning and distribution (MPI only)
 Pinning: by core; Distribution: block : cyclic : cyclic
• The runtime measurement interval is from the start to the end of the job
Strong Scaling Experiment 1200x600 surface mesh (results)
(Figure: runtime in minutes vs. number of COSMO nodes for the CPU-only and CPU-GPU runs)
Strong Scaling Experiment 1200x600 surface mesh (results)
• COSMO-limited parts of a curve
 The curve is nearly constant or increasing
 Interpretation: COSMO is waiting for ParFlow
• ParFlow-limited parts of a curve
 The curve is (strictly) monotonically decreasing
 Interpretation: ParFlow is waiting for COSMO
• A (quasi) load-balanced state is the “elbow” of a curve
 Example, CPU-only case: 4 COSMO, 1 CLM, 1 ParFlow nodes (4-1-1 scenario)
• CPU only case:
 #Nodes ParFlow = 1 is COSMO limited with #Nodes COSMO = 8,16
 #Nodes ParFlow = 2 is COSMO limited with #Nodes COSMO = 16
• CPU-GPU case:
 All runs are ParFlow limited (waiting for COSMO)
• In the CPU-GPU case the fastest runs are again those with #Nodes ParFlow = 1 (16-1-1)
 Regarding runtime and resource consumption/energy efficiency this is the best choice
Strong Scaling Experiments (Summary)
• CPU only case:
 In some cases (especially #Nodes ParFlow = 1,2) the runtime is COSMO limited
• CPU-GPU case:
 Most of the runs are ParFlow limited
 In all CPU-GPU cases (scenarios) the best performance is reached in the #Nodes ParFlow = 1 case
 With only a minor runtime loss compared to the best CPU only case
 Regarding energy efficiency and runtime the best CPU-GPU version should be used
• To investigate this further, larger meshes are needed
 Work in progress, but problems arose when creating the appropriate OASIS3 rmp* (remapping) files
• CPU-only and CPU-GPU runs do not differ much in runtime, because the models are coupled: after each time step they synchronize by sending their data to CLM and waiting for the scattered data
 Possible solutions to this problem (under investigation):
 Different time step sizes for the individual models (first results are available)
 A larger coupling period, i.e. less frequent coupling (first results are available, but there is no large runtime gain and in some cases the correctness of the results becomes problematic); see the time-step/coupling-period check below
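For per-model time steps combined with a longer coupling period, one practical consistency check (an assumption based on the usual coupling constraint, not a statement about the tested setups) is that the coupling period should be an integer multiple of every component time step, so that all models reach the exchange instant together. A sketch with placeholder values:

    # Sketch: check whether a candidate coupling period fits all model time steps.
    # Assumption: the coupling period must be a multiple of each component's step.
    def coupling_period_ok(period_s, timesteps_s):
        return all(period_s % dt == 0 for dt in timesteps_s.values())

    timesteps = {"COSMO": 18, "CLM": 90, "ParFlow": 90}   # placeholder values only
    for period in (18, 90, 180, 200):
        print(period, coupling_period_ok(period, timesteps))
    # -> 18 False (shorter than the CLM/ParFlow step), 90 True, 180 True, 200 False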
PERFORMANCE ANALYSIS
Performance Analysis: Used Tools
• Score-P (Version 7.1; www.score-p.org)
 Performance analysis tool infrastructure
 Instrumentation and measurement system
 Supports (call-path) profiling and event tracing
• Scalasca (Version 2.6; www.scalasca.org)
 Automatic trace analysis of parallel programs
 Automatic search for patterns and inefficient behaviour
 Classification and quantification of significance
• Cube (Version 4.6; https://www.scalasca.org/scalasca/software/cube-4.x)
 Performance report explorer for Scalasca and Score-P
 Includes libraries, algebra utilities and a GUI
 The GUI supports interactive performance analysis and metrics exploration
• Vampir (Version 9.11; https://vampir.eu/)
 Parallel performance analysis framework
 Graphical representation of performance metrics and dynamic processes
(An illustrative measurement-setup sketch follows below.)
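The deck does not show the measurement setup itself; purely as an illustration, the sketch below shows how a Score-P measurement of an already instrumented MPMD run could be configured from a small launcher. The environment variables are standard Score-P settings; the srun call with --multi-prog and the file name tsmp.conf are assumptions about the job setup, not taken from the actual TSMP scripts.

    # Illustrative launcher sketch (assumes Score-P-instrumented binaries and a
    # Slurm --multi-prog file describing the MPMD rank-to-executable mapping).
    import os
    import subprocess

    env = os.environ.copy()
    env["SCOREP_ENABLE_PROFILING"] = "true"          # call-path profile (read with Cube)
    env["SCOREP_ENABLE_TRACING"] = "true"            # OTF2 trace (read with Vampir/Scalasca)
    env["SCOREP_EXPERIMENT_DIRECTORY"] = "scorep_tsmp_1-1-1"
    env["SCOREP_TOTAL_MEMORY"] = "500M"              # per-process measurement buffer

    # tsmp.conf (hypothetical name) would map rank ranges to cosmo/clm/parflow.
    subprocess.run(["srun", "-n", "144", "--multi-prog", "tsmp.conf"],
                   env=env, check=True)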
PERFORMANCE ANALYSIS
(TRACING)
Performance Analysis (Tracing)
• The following slides illustrate the load-balancing problems of TSMP with an example
• Instrumentation for tracing was done with Score-P, visualization with Vampir
• Explanation of the next 3 slides:
 Presentation of traces and profiling of a 1-1-1 CPU-only job run of TSMP
 300x300 surface mesh
 Structure of the tracing pictures (slides 1 and 2)
 COSMO: upper block (48 processes)
 ParFlow: middle block (48 processes)
 CLM: lower block (48 processes)
 This trace structure shows the MPMD nature of TSMP, in particular the concurrent execution of the three component models
 Every row shows the runtime behaviour of one process (for a better overview all processes are shown,
but it’s possible to zoom in)
Performance Analysis (Tracing) (cont’d)
• Explanation of the next 3 slides (cont’d):
 Coloring:
 Red bars are MPI operations (P2P communication, collective communication, MPI requests, MPI
initialization, etc)
 Vertical black lines indicate MPI communication; black dots are MPI bursts resulting in MPI communication
 Green areas mark user functions
 Slide 3 shows the Accumulated Exclusive Time per Function (a form of profiling)
Performance Analysis TSMP using ScoreP/Vampir (CPU only): slide 1, overview of the full run
Performance Analysis TSMP using ScoreP/Vampir (CPU only): slide 2, zoom into the first time steps
Performance Analysis TSMP using ScoreP/Vampir (CPU only): slide 3, Accumulated Exclusive Time per Function
Performance Analysis (Tracing) (results)
• CPU only (slides 1 and 2)
 Slide 1 shows an overview of all 144 processes of the 3 component models over the full runtime (approx. 750 s)
 Most of the area is red (MPI), and a closer look shows that most of this time is spent in MPI requests (MPI_Waitall) (see also slide 3)
 CLM spends most of its time waiting for COSMO and ParFlow, since it is the fastest component (it has, among other things, the smallest mesh)
 Slide 2 shows a zoom into the first time steps (0 – 33 seconds)
 Seconds 0 to 14 cover the initialization interval
 Computation begins at second 14
 CLM starts computing (at approx. second 14.2), while ParFlow and COSMO are waiting
 After finishing, CLM scatters its results to COSMO and ParFlow (black vertical lines) and then waits (second 14.2 to 15.5)
 ParFlow computes from second 14.3 to 14.5, sends its data to CLM and waits
 COSMO computes from second 14.3 to 15.5 and then sends its data to CLM
 After sending their data, COSMO and ParFlow wait for CLM, which has started to compute
 The cycle starts again
Performance Analysis (Tracing) (results; cont’d)
• CPU – GPU (no slide)
 The same mechanism as on the previous slides
 CLM waits for almost all of its runtime
 Load-balancing issues due to the different model complexities!
• Profiling (Slide 3)
 The slide shows the Accumulated Exclusive Time per Function
 MPI_Waitall dominates all other functions
 But:
 Much of the MPI_Waitall time is spent in CLM, because it is by far the smallest and fastest model
 COSMO is the most complex model (and has the largest mesh)
 Important: reduction of the load imbalance to an “optimal” point
 The share of MPI communication (in per cent) relative to other regions decreases with increasing mesh size and with load-balanced models
Performance Analysis of TSMP with Scalasca
Goergen et al. | JSC SAC 2019 | 16/17 Sep 2019 | Jülich
 Scalasca allows trace analysis to detect bottlenecks and
patterns/regions of poor performance
 Results can be shown with Cube
 An example scenario can be seen on the left hand side
 300x300 TSMP CPU only scenario (1-1-1)
 Only the metric tree of Cube is shown here
 The accumulated total time of the program can be seen in
line 1 and the number of visits in line 2
 Additionally, Scalasca detects events such as ...
 “Late Sender” or “Late Receiver” (lines 7 and 8)
 This appears to be a problem here, since it accounts for a significant share of the runtime
 To locate the problems more precisely it is possible to select different objects
 This gives a better overview of the part of TSMP in which the problems occur
 Further bottlenecks can be detected with Scalasca (see box
on the left hand side)
Parallel Performance Analysis with Scalasca (Late Sender)
Goergen et al. | JSC SAC 2019 | 16/17 Sep 2019 | Jülich
 Scalasca allows trace analysis to detect bottlenecks and
patterns/regions of poor performance
ONGOING DEVELOPMENTS (SELECTION)
• ICON-CLM5 coupling via OASIS3-MCT
• (ICON + ParFlow) on GPUs + (CLM) CPUs
• Flexible/adaptive grids for handling streams/rivers
• Enlarging the mesh size and further strong and weak scaling
experiments on JUWELS and DEEP Cluster and Booster
 Both are pure MSA (Modular Supercomputing Architecture) systems
• Continuing performance analysis with Scalasca/ScoreP and
Vampir/Cube
• In-depth performance analysis of all model components
• Best practices performance analysis for TSMP (MPMD)
• Best practices optimization for TSMP
 Especially regarding load balancing (heterogeneous and MSA)
 Different time steps of models and coupling frequencies
(systematic tests)
THANK YOU VERY MUCH
FOR YOUR ATTENTION!
TERRESTRIAL SYSTEMS MODELLING PLATFORM
J. BENKE, D. CAVIEDES VOULLIEME, S. POLL, G. TASHAKOR, I. ZHUKOV
JÜLICH SUPERCOMPUTING CENTRE (JSC)
PORTING A COUPLED MULTISCALE AND MULTIPHYSICS EARTH SYSTEM MODEL TO
HETEROGENEOUS ARCHITECTURES (BENCHMARKING AND PERFORMANCE
ANALYSIS)
j.benke@fz-juelich.de, d.caviedes.voullieme@fz-juelich.de, g.tashakor@fz-juelich.de, s.poll@fz-juelich.de, i.zhukov@fz-juelich.de
http://www.fz-juelich.de/ias/jsc/slts
http://www.hpsc-terrsys.de
@HPSCTerrSys
HPSC TerrSys