Performance Analysis of a Molecular Dynamics Code
Thomas William
Center for Information Services and High Performance Computing (ZIH)
Dresden, 5/27/13
Overview
•  IUmd – Introduction
•  First Look with Vampir
•  Comparison of Code Versions
•  Hardware Counters
•  Source Code Analysis
•  Source Code Optimization
•  Porting IUmd to GPGPU
Initial Motivation
•  Code had been in development for many years
•  Lots of different code versions doing “similar science”
•  Developers wanted to compare these versions
•  Search for performance optimization possibilities
•  They (IU) provide domain specific expertise
•  We (TUD) provide performance optimization expertise
Introduction: IUmd Code
•  Classical molecular dynamics simulation of dense nuclear matter consisting of either fully ionized atoms or free neutrons and protons
•  Main targets are studies of the dense matter in white dwarf and neutron stars
•  Interaction potentials between particles are treated as classical two-particle central potentials
•  No complicated particle geometry or orientation
•  Electrons can be treated as a uniform background charge (not explicitly modeled)
Visualization of Sample Data (VTK)
“Oh my god – it’s FORTRAN”
DIGGING INTO THE CODE
IUmd Code Details
•  Particle-particle interactions have been implemented in multiple ways (nucleon, pure-ion, ion-mixture)
•  Located in different files: PP01, PP02, and PP03
•  The code blocks are selectable using preprocessor macros
•  PP01 is the original implementation with no division into the Ax, Bx, or NBS blocks
•  PP02 implements the versions in use by the physicists today
   •  3 different implementations for the Ax block
   •  3 implementations for the Bx block
   •  No manual blocking (NBS)
•  PP03 includes new routines
   •  3 Ax blocks
   •  8 Bx blocks
   •  Can be blocked using the NBS value
IUmd Code Details
•  Two sections of code each have two or more variations
•  One section is labelled A and the other B; the variations are numbered
•  IUmd can be compiled as a serial, OpenMP, MPI, or MPI+OpenMP (hybrid) program

   make MDEF=XRay md_mpi ALG=PP02 BLKA=A0 BLKB=B2 NBS="NBSX=0"
IUmd Workflow
The structure of the program is simple:
•  First reads in a parameter file “runmd.in”
•  And an initial particle configuration file “md.in”
•  This allows checkpointing
•  Program then calculates the initial set of accelerations
•  Enters a time-stepping loop, a triply nested ”do” loop
Runtime Parameters

   ! Parameters:
     sim_type   = 'ion mixture',  ! simulation type
     tstart     =  0.00,          ! start time
     dt         = 25.00,          ! time step (fm/c)
   ! Warmup:
     nwgroup    =  2,             ! groups
     nwsteps    = 50,             ! steps per group
   ! Measurement:
     ngroup     =  2,             ! groups
     ntot       =  2,             ! per group
     nind       = 25,             ! steps between
     tnormalize = 50,             ! temp normal.
     ncom       = 50,             ! center of mass
                                  ! motion cancel.

Figure: Runtime parameters (excerpt from runmd.in)
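A minimal sketch of how such a file could be read with a Fortran namelist; the group name params and the exact variable list are assumptions for illustration, not taken from the IUmd source:

   program read_runmd
      implicit none
      character(len=32) :: sim_type
      double precision  :: tstart, dt
      integer :: nwgroup, nwsteps, ngroup, ntot, nind, tnormalize, ncom
      namelist /params/ sim_type, tstart, dt, nwgroup, nwsteps, &
                        ngroup, ntot, nind, tnormalize, ncom
      open(unit=10, file='runmd.in', status='old')
      read(10, nml=params)   ! expects the usual &params ... / namelist group syntax
      close(10)
      print *, 'simulation type: ', trim(sim_type), '  dt =', dt
   end program read_runmd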
The Main Loop

   do 100 ig=1,ngroup
      ! initialize group ig statistics
      do 40 j=1,ntot
         do i=1,nind
            call newton     ! computes forces, updates x and v
         enddo
         call vtot
40    continue
      ! compute group ig statistics
100 continue

Figure: Simplified version of the main loop
MD Implementation Details - newton module
•  Forces are calculated in the newton module in a pair of nested do-loops
•  The outer loop runs over target particles
•  The inner loop runs over source particles
•  Targets are assigned to MPI processes in a round-robin fashion
•  Within each MPI process, the work is shared among OpenMP threads (see the sketch below)
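A minimal structural sketch of this decomposition (illustrative names and a placeholder 1/r^2 force law; not the actual newton routine): each MPI rank takes every nprocs-th target starting at its own rank, and OpenMP shares the source loop.

   subroutine newton_sketch(n, myrank, nprocs, x, f)
      implicit none
      integer, intent(in) :: n, myrank, nprocs
      double precision, intent(in)    :: x(3, 0:n-1)
      double precision, intent(inout) :: f(3, 0:n-1)
      integer :: i, j
      double precision :: fi(3), dx(3), r2
      do i = myrank, n-1, nprocs          ! round-robin target distribution over MPI ranks
         fi = 0.0d0
!$omp parallel do private(j, dx, r2) reduction(+:fi)
         do j = 0, n-1                    ! source particles shared among OpenMP threads
            if (j == i) cycle
            dx = x(:, i) - x(:, j)
            r2 = dx(1)**2 + dx(2)**2 + dx(3)**2
            fi = fi + dx / r2             ! placeholder pair force
         end do
!$omp end parallel do
         f(:, i) = fi
      end do
   end subroutine newton_sketch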
The cobbler should stick to his last
FIRST LOOK WITH VAMPIR
Pure MPI
HUGE OpenMP Overhead in Vampir
“burnin CPU hours for good”
COMPARISON OF CODE VERSIONS
Cray XT5m™
•  The XT5m is a 2D mesh of nodes
•  Each node has two sockets, each with four cores
•  AMD Opteron 23xx “Shanghai” (45 nm)
•  Running at 2.4 GHz
•  84 compute nodes
•  672 cores in total
•  pgi/9.0.4
•  Using the XT PE driver “xtpe-shanghai”
Time Constraints
First runs showed that:
•  A 5k-particle measurement takes 5 minutes
•  A 27k-particle measurement takes 1 hour
•  A 55k-particle measurement takes 10 hours
Scientifically useful data requires 55k particles or more.
Overview PP01, PP02, and PP03
Figure: Overview of the runtimes for all (143) code-block combinations with an ion-mix input dataset of 55k particles. The naming scheme is “source-code-file A-block B-block blocking-factor”.
PP02 Code Blocks
Livin in a hardware meritocracy
HARDWARE COUNTERS
Floating Point Instructions
The end of the world is nigh!
SOURCE CODE ANALYSIS
Block A
Block B
The cut-off sphere reduces the interactions to 19%.
Recap: PP02 Code Blocks
Why is -O2 as fast as -O3?
… and why is -fast only marginally better?
Finally, Progress . . .
SOURCE CODE OPTIMIZATION
Recap: OpenMP Overhead in Vampir
Code Changes
•  The omp parallel do for the j-loop is removed
•  An omp parallel region is placed around the i-loop
•  The j-loop is parallelized explicitly
•  The entire ij-loop section sits in a third loop over threads
•  Each iteration of this outer loop is assigned to an OpenMP thread

Original:

   allocate(fj(3,0:n-1))
   do 100 i=myrank,n-2,nprocs
      fi(:) = 0.0d0
!$omp parallel do private(r2,k,xx,r,fc), reduction(+:fi), schedule(runtime)
      do 90 j=i+1,n-1
         ! A Block
         ...

Modified:

!$omp parallel private(nthrd,nj,nr,j0,j1,xx,r2,r,fc,fi)
   nthrd = omp_get_num_threads()
   allocate(fi(3,0:n-1))
!$omp do schedule(static,1)
   do 110 ithrd=0,nthrd-1
      do 100 i=myrank,n-2,nprocs
         fi(:,i) = 0.0d0
         nj = (ni-1)/nthrd
         nr = mod(ni-1,nthrd)
         j0 = (i+1) + ithrd*nj + min(ithrd,nr)
         j1 = (i+1) + (ithrd+1)*nj + min(ithrd+1,nr) - 1
         do 90 j=j0,j1
New MPI/OpenMP Version in Vampir
Parallel Efficiency of the OpenMP Application
AMD Interlagos OpenMP Efficiency
97.6% Parallel Efficiency
“The next level”
PORTING IUMD TO GPGPU
3 Different Approaches
•  HMPP
•  Only a proof of concept so far
•  OpenACC
•  CUDA
•  PGI CUDA Fortran
•  Both CUDA and OpenACC can be combined with MPI
Code Changes
Goal: combine the MPI version with CUDA
New parameters are needed to define:
•  The dimensions of the MPI process grid
•  The number of source chunks
•  The thread block dimensions for the CUDA kernel
Constraints:
•  [1] must exactly divide the number of particles N
•  [1] must be a multiple of the warp size (32)
•  ([2]*[1]/4) must exactly divide N/nschunk

   !MPI and CUDA parameters
   procdim    = 1, 1,   !MPI process grid dimensions
   nschunk    = 4,      !number of source chunks
   thrdBlkDim = 128, 1  !CUDA thread block and block grid dim
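A small illustrative check of these divisibility constraints, under the assumption (not stated explicitly on the slide) that [1] and [2] refer to the first and second entries of thrdBlkDim; the particle count is an example value:

   program check_params
      implicit none
      integer, parameter :: nparticles = 27648       ! example particle count N
      integer, parameter :: nschunk    = 4
      integer, parameter :: blkdim(2)  = (/ 128, 1 /)
      if (mod(nparticles, blkdim(1)) /= 0) stop 'thrdBlkDim(1) must divide N'
      if (mod(blkdim(1), 32) /= 0)         stop 'thrdBlkDim(1) must be a multiple of the warp size'
      if (mod(nparticles/nschunk, blkdim(2)*blkdim(1)/4) /= 0) &
         stop 'thrdBlkDim(2)*thrdBlkDim(1)/4 must divide N/nschunk'
      print *, 'parameters are consistent'
   end program check_params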
Changes to MPI
Split MPI_COMM_WORLD into a Cartesian grid
Example: calculating the kinetic energy (ttot)
•  The master column adds up the kinetic energies (Allreduce)
•  Broadcast the new values along each row (see the sketch below)
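A minimal sketch of this communicator layout (illustrative, not the IUmd source; communicator and variable names are assumptions): the world communicator is split into a 2D Cartesian grid, column and row sub-communicators are derived, the master column reduces the kinetic energy, and the result is broadcast along each row.

   program cart_sketch
      use mpi
      implicit none
      integer :: ierr, rank, comm_cart, comm_row, comm_col
      integer :: dims(2), coords(2)
      logical :: periods(2), remain(2)
      double precision :: ttot_local, ttot

      call MPI_Init(ierr)
      dims    = (/ 2, 2 /)                 ! process grid (cf. procdim); run with 4 ranks
      periods = .false.
      call MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, .true., comm_cart, ierr)
      call MPI_Comm_rank(comm_cart, rank, ierr)
      call MPI_Cart_coords(comm_cart, rank, 2, coords, ierr)

      remain = (/ .true., .false. /)       ! vary along dimension 1 -> column communicator
      call MPI_Cart_sub(comm_cart, remain, comm_col, ierr)
      remain = (/ .false., .true. /)       ! vary along dimension 2 -> row communicator
      call MPI_Cart_sub(comm_cart, remain, comm_row, ierr)

      ttot_local = 1.0d0                   ! placeholder for the local kinetic energy
      ttot = 0.0d0
      if (coords(2) == 0) then             ! master column sums the kinetic energies
         call MPI_Allreduce(ttot_local, ttot, 1, MPI_DOUBLE_PRECISION, MPI_SUM, comm_col, ierr)
      end if
      ! broadcast the new value along each row (root 0 is the master-column rank)
      call MPI_Bcast(ttot, 1, MPI_DOUBLE_PRECISION, 0, comm_row, ierr)

      call MPI_Finalize(ierr)
   end program cart_sketch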
Example: int_ion_mix - Original
OpenACC
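For reference, a minimal illustrative sketch (not the IUmd int_ion_mix routine) of how a pair-force loop can be offloaded with OpenACC directives; the array names and the placeholder 1/r^2 force law are assumptions:

   subroutine forces_acc(n, x, f)
      implicit none
      integer, intent(in) :: n
      double precision, intent(in)  :: x(3, n)
      double precision, intent(out) :: f(3, n)
      integer :: i, j
      double precision :: dx, dy, dz, r2
!$acc parallel loop copyin(x) copyout(f) private(dx, dy, dz, r2)
      do i = 1, n                    ! targets mapped to GPU gangs/vectors
         f(1, i) = 0.0d0
         f(2, i) = 0.0d0
         f(3, i) = 0.0d0
         do j = 1, n                 ! sources accumulated per target
            if (j == i) cycle
            dx = x(1, i) - x(1, j)
            dy = x(2, i) - x(2, j)
            dz = x(3, i) - x(3, j)
            r2 = dx*dx + dy*dy + dz*dz
            f(1, i) = f(1, i) + dx / r2
            f(2, i) = f(2, i) + dy / r2
            f(3, i) = f(3, i) + dz / r2
         end do
      end do
!$acc end parallel loop
   end subroutine forces_acc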
CUDA
Step A:
•  Declare arrays in shared memory (xss, ffs)
•  Define grid, block and thread parameters
•  Copy target data from global memory to registers
•  Initialize force
•  Copy a chunk of source data from global to shared memory
CUDA
Step B:
•  Thread (i,j) applies its sources (xss) to its target
Step C:
•  Sum the forces (ffs) from all threads of my block
Step D:
•  Copy shared memory array (ffs) to global memory
•  Each block copies to its own array section
•  Thread columns share the work
CUDA
Step E:
•  The last block in a row to copy its results then sums the contributions from all blocks
•  Threads share the work, accumulating results into registers
•  Threads take turns adding their contributions to shared memory
•  Thread columns write the final results to global memory
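A condensed, illustrative PGI CUDA Fortran sketch of the tiling idea behind Steps A and B (not the actual IUmd kernel): source positions are staged chunk-wise in shared memory, each thread keeps its target position and force in registers, and a placeholder 1/r^2 force law stands in for the real interaction. The block-wise reductions of Steps C to E are simplified away here because each thread owns exactly one target.

   module force_kernel
      use cudafor
      implicit none
      integer, parameter :: BLK = 128                 ! thread block size (cf. thrdBlkDim)
   contains
      attributes(global) subroutine pair_force(x, f, n)
         real(8), device :: x(3, n), f(3, n)
         integer, value  :: n
         real(8), shared :: xss(3, BLK)               ! Step A: source chunk in shared memory
         real(8) :: xi1, xi2, xi3, fi1, fi2, fi3, dx, dy, dz, r2
         integer :: i, j, jt, tile, ntile

         i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
         if (i <= n) then                             ! Step A: target data into registers
            xi1 = x(1, i); xi2 = x(2, i); xi3 = x(3, i)
         end if
         fi1 = 0.0d0; fi2 = 0.0d0; fi3 = 0.0d0        ! Step A: initialize force
         ntile = (n + BLK - 1) / BLK
         do tile = 0, ntile - 1
            jt = tile * BLK + threadIdx%x
            if (jt <= n) then                         ! Step A: copy source chunk to shared memory
               xss(1, threadIdx%x) = x(1, jt)
               xss(2, threadIdx%x) = x(2, jt)
               xss(3, threadIdx%x) = x(3, jt)
            end if
            call syncthreads()
            if (i <= n) then
               do j = 1, min(BLK, n - tile * BLK)     ! Step B: apply the sources to the target
                  dx = xi1 - xss(1, j); dy = xi2 - xss(2, j); dz = xi3 - xss(3, j)
                  r2 = dx*dx + dy*dy + dz*dz
                  if (r2 > 0.0d0) then                ! skip self-interaction
                     fi1 = fi1 + dx / r2; fi2 = fi2 + dy / r2; fi3 = fi3 + dz / r2
                  end if
               end do
            end if
            call syncthreads()
         end do
         if (i <= n) then                             ! write the accumulated force to global memory
            f(1, i) = fi1; f(2, i) = fi2; f(3, i) = fi3
         end if
      end subroutine pair_force
   end module force_kernel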
Fermi vs Kepler (1 GPU, pgfortran CUDA)
Baseline:
•  8 OpenMP Threads (icc)
•  Pure_ion simulation
•  Weak scaling (increasing the number of ions)
Figure: Speedup over the 8-thread OpenMP baseline for the Fermi and Kepler GPUs as a function of the number of ions (20000, 40000, 80000).
Comparison OpenACC vs CUDA
•  OpenMP version as baseline:
   •  2 Intel(R) Xeon(R) CPUs (E5620, 2.4 GHz)
   •  Intel 12.1 compiler
   •  8 OpenMP threads
•  1 Kepler K20
   •  CUDA version: 5.0
   •  CAPS 3.3.1 for OpenACC
   •  PGI 13.1 for OpenACC/CUDA
•  IUmd parameters:
   •  nschunk: 4
   •  blkDim: 128
   •  grdDim: 216
Comparison OpenACC vs CUDA
•  Simulation type: ion-mixture
•  Input data set changed!
Figure: Speedup over the OpenMP baseline for the OpenACC (CAPS), OpenACC (PGI), and CUDA versions.
Combined Benchmark – first results
Figure: Runtime in seconds of the IUmd 6.3.0 benchmark on a log scale, comparing OpenMP (1 to 32 threads), MPI+OpenMP (2x16 up to 4x32), and CUDA (up to 6x1) configurations; runtimes range from 74.05 s down to about 0.05 s.
Note of Thanks to Contributors
(Listing Names in Alphabetical Order)
Robert Dietrich
Jonas Stolle
Frank Winkler
Donald Berry
Robert Henschel
Joseph Schuchart
General approach to optimization
FPU idle times
Figure: FPU idle times in percent of the PAPI-measured total cycles
Branch Mispredictions
