Performance Analysis of a Molecular Dynamics Code
Thomas William
Center for Information Services and High Performance Computing (ZIH)
IUmd – Performance Analysis of a Molecular Dynamics Code
Thomas William, Dresden, 5/27/13

Overview
• IUmd – Introduction
• First Look with Vampir
• Comparison of Code Versions
• Hardware Counters
• Source Code Analysis
• Source Code Optimization
• Porting IUmd to GPGPU

Initial Motivation
• The code had been in development for many years
• Lots of different code versions doing "similar science"
• The developers wanted to compare these versions
• Search for performance optimization possibilities
• They (IU) provide domain-specific expertise
• We (TUD) provide performance optimization expertise

Introduction: IUmd Code
• Classical molecular dynamics simulation of dense nuclear matter consisting of either fully ionized atoms or free neutrons and protons
• Main targets are studies of the dense matter in white dwarf and neutron stars
• Interaction potentials between particles are treated as classical two-particle central potentials
• No complicated particle geometry or orientation
• Electrons can be treated as a uniform background charge (not explicitly modeled)

Visualization of Sample Data (VTK)

"Oh my god – it's FORTRAN"
DIGGING INTO THE CODE

IUmd Code Details
• Particle-particle interactions have been implemented in multiple ways (nucleon, pure-ion, ion-mixture)
• Located in different files: PP01, PP02 and PP03
• The code blocks are selectable using preprocessor macros
• PP01 is the original implementation with no division into the Ax, Bx or NBS blocks
• PP02 implements the versions in use by the physicists today
  – 3 different implementations for the Ax block
  – 3 implementations for the Bx block
  – No manual blocking (NBS)
• PP03 includes new routines
  – 3 Ax blocks
  – 8 Bx blocks
  – Can be blocked using the NBS value

IUmd Code Details (continued)
• Two sections of code each have two or more variations
• One section is labelled A and the other B; the variations are numbered
• IUmd can be compiled as a serial, OpenMP, MPI, or MPI+OpenMP (hybrid) application

    make MDEF=XRay md_mpi ALG=PP02 BLKA=A0 BLKB=B2 NBS="NBSX=0"

IUmd Workflow
The structure of the program is simple:
• It first reads in a parameter file, "runmd.in"
• And an initial particle configuration file, "md.in"
• This allows checkpointing
• The program then calculates the initial set of accelerations
• It then enters a time-stepping loop, a triply nested do loop

Runtime Parameters

    ! Parameters:
    sim_type   = 'ion mixture', ! simulation type
    tstart     = 0.00,          ! start time
    dt         = 25.00,         ! time step (fm/c)
    ! Warmup:
    nwgroup    = 2,             ! groups
    nwsteps    = 50,            ! steps per group
    ! Measurement:
    ngroup     = 2,             ! groups
    ntot       = 2,             ! per group
    nind       = 25,            ! steps between
    tnormalize = 50,            ! temp. normalization
    ncom       = 50,            ! center-of-mass motion cancellation

Figure: Runtime parameters (excerpt from runmd.in)

The Main Loop

    do 100 ig = 1, ngroup
       ! initialize group ig statistics
       do 40 j = 1, ntot
          do i = 1, nind
             call newton   ! computes forces, updates x and v
          enddo
          call vtot
    40 continue
       ! compute group ig statistics
    100 continue

Figure: Simplified version of the main loop

MD Implementation Details – the newton module
• Forces are calculated in the newton module in a pair of nested do loops
• The outer loop goes over target particles
• The inner loop goes over source particles
• Targets are assigned to MPI processes in a round-robin fashion
• Within each MPI process, the work is shared among OpenMP threads (see the sketch below)
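The newton routine itself appears only as a screenshot in the slides. As a rough orientation, the following is a minimal sketch of the loop structure just described: round-robin distribution of target particles over MPI ranks and OpenMP work sharing over the source particles. The subroutine name, the array names and the placeholder pair force are illustrative assumptions, not IUmd's actual implementation.

    ! Minimal sketch (NOT the actual newton routine): round-robin target
    ! distribution over MPI ranks, OpenMP work sharing over sources.
    ! Names (forces_sketch, x, f, fi) are placeholders.
    subroutine forces_sketch(n, nprocs, myrank, x, f)
       implicit none
       integer, intent(in)  :: n, nprocs, myrank
       real(8), intent(in)  :: x(3,0:n-1)
       real(8), intent(out) :: f(3,0:n-1)
       integer :: i, j
       real(8) :: fi(3), dx(3), r2, fc

       f = 0.0d0
       ! Targets i are dealt out round-robin: rank 0 gets 0, nprocs, 2*nprocs, ...
       do i = myrank, n-2, nprocs
          fi = 0.0d0
          ! Sources j > i are shared among the OpenMP threads of this rank
    !$omp parallel do private(dx, r2, fc) reduction(+:fi) schedule(runtime)
          do j = i+1, n-1
             dx = x(:,i) - x(:,j)
             r2 = sum(dx*dx)
             fc = 1.0d0 / (r2 * sqrt(r2))   ! placeholder pair force, Coulomb-like
             fi = fi + fc*dx
          end do
    !$omp end parallel do
          f(:,i) = fi
       end do
       ! (The real code also accumulates the reaction forces on the sources
       !  and reduces the partial forces across MPI ranks.)
    end subroutine forces_sketch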
"The cobbler should stick to his last"
FIRST LOOK WITH VAMPIR

Pure MPI

HUGE OpenMP Overhead in Vampir

"Burnin' CPU hours for good"
COMPARISON OF CODE VERSIONS

Cray XT5m™
• The XT5m is a 2D mesh of nodes
• Each node has two sockets, each having four cores
• AMD Opteron 23 "Shanghai" (45 nm)
• Running at 2.4 GHz
• 84 compute nodes, a total of 672 cores
• pgi/9.0.4
• Using the XT PE driver "xtpe-shanghai"

Time Constraints
First runs showed that:
• a 5k-particle measurement takes about 5 minutes
• a 27k-particle measurement takes about 1 hour
• a 55k-particle measurement takes about 10 hours
Scientifically useful data requires 55k particles or more.

Overview of PP01, PP02, and PP03
Figure: Overview of the runtimes for all (143) code-block combinations with an ion-mix input dataset of 55k particles. The naming scheme is "source-code-file A-block B-block blocking-factor".

PP02 Code Blocks

"Livin' in a hardware meritocracy"
HARDWARE COUNTERS

Floating Point Instructions

"The end of the world is nigh!"
SOURCE CODE ANALYSIS

Block A

Block B
• The cut-off sphere reduces interactions to 19%

Recap: PP02 Code Blocks

Why is -O2 as fast as -O3 ... and why is -fast only marginally better?

Finally, Progress ...
SOURCE CODE OPTIMIZATION

Recap: OpenMP Overhead in Vampir

Code Changes
• The omp parallel do for the j-loop is removed
• An omp parallel region is opened around the i-loop
• The j-loop is parallelized explicitly
• The entire ij-loop section is placed in a third loop over the threads
• Each iteration of this outer loop is assigned to an OpenMP thread
(A small worked example of the index partitioning follows below.)

Original:

    allocate(fj(3,0:n-1))
    do 100 i = myrank, n-2, nprocs
       fi(:) = 0.0d0
    !$omp parallel do private(r2,k,xx,r,fc), reduction(+:fi), schedule(runtime)
       do 90 j = i+1, n-1
          ! A Block
          ...

Modified:

    !$omp parallel private(nthrd,nj,nr,j0,j1,xx,r2,r,fc,fi)
    nthrd = omp_get_num_threads()
    allocate(fi(3,0:n-1))
    !$omp do schedule(static,1)
    do 110 ithrd = 0, nthrd-1
       do 100 i = myrank, n-2, nprocs
          fi(:,i) = 0.0d0
          nj = (n-i-1)/nthrd
          nr = mod(n-i-1, nthrd)
          j0 = (i+1) + ithrd*nj     + min(ithrd, nr)
          j1 = (i+1) + (ithrd+1)*nj + min(ithrd+1, nr) - 1
          do 90 j = j0, j1
             ! A Block
             ...
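To make the explicit j-range partitioning easier to follow, here is a small self-contained demo that reproduces the j0/j1 arithmetic for one target particle. It only illustrates the index calculation (the loop body is omitted); the chosen values of n, i and the thread count are arbitrary assumptions for the demo.

    ! Standalone illustration of the explicit j-range partitioning used in
    ! the modified loop above: the source indices i+1 .. n-1 are split into
    ! contiguous, near-equal chunks, one per thread.
    program partition_demo
       implicit none
       integer :: n, i, nthrd, ithrd, nj, nr, j0, j1

       n     = 17        ! example particle count (assumption, demo only)
       i     = 3         ! example target index
       nthrd = 4         ! example thread count

       nj = (n-i-1) / nthrd        ! base chunk size
       nr = mod(n-i-1, nthrd)      ! remainder spread over the first nr threads
       do ithrd = 0, nthrd-1
          j0 = (i+1) + ithrd*nj     + min(ithrd, nr)
          j1 = (i+1) + (ithrd+1)*nj + min(ithrd+1, nr) - 1
          print '(a,i0,a,i0,a,i0)', 'thread ', ithrd, ': j = ', j0, ' .. ', j1
       end do
    end program partition_demo

For n = 17 and i = 3, the 13 source indices 4..16 are split as 4..7, 8..10, 11..13 and 14..16 over four threads, so the first nr threads get one extra iteration each and the chunks stay contiguous.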
New MPI/OpenMP Version in Vampir

Parallel Efficiency of the OpenMP Application

AMD Interlagos OpenMP Efficiency
• 97.6% parallel efficiency

"The next level"
PORTING IUMD TO GPGPU

Three Different Approaches
• HMPP (only a proof of concept so far)
• OpenACC
• CUDA (PGI CUDA Fortran)
• Both CUDA and OpenACC can be combined with MPI

Code Changes
Goal: combine the MPI version with CUDA.
New parameters are needed to define:
• the dimensions of the MPI process grid
• the number of source chunks
• the thread block dimensions for the CUDA kernel
Constraints:
• [1] must exactly divide the number of particles N
• [1] must be a multiple of the warp size (32)
• [2]*[1]/4 must exactly divide N/nschunk

    !MPI and CUDA parameters
    procdim    = 1, 1,   ! MPI process grid dimensions
    nschunk    = 4,      ! number of source chunks
    thrdBlkDim = 128, 1  ! CUDA thread block and block grid dimensions

Changes to MPI
• Split MPI_COMM_WORLD into a Cartesian grid
Example: calculating the kinetic energy (ttot)
• The master column adds up the kinetic energies (Allreduce)
• The new values are then broadcast along each row
(A minimal sketch of such a communicator layout follows below.)
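The slides do not show the communicator code itself, so the following is only a minimal sketch of how a 2D Cartesian process grid with row and column sub-communicators could be set up and used for the kinetic-energy example. The grid dimensions, variable names, and the assumption that rank 0 of each row communicator sits in the "master column" are illustrative, not taken from IUmd.

    ! Minimal sketch (assumed names and layout, not the IUmd implementation):
    ! build a 2D Cartesian grid over MPI_COMM_WORLD, derive column and row
    ! communicators, sum a local kinetic energy within a column, and
    ! broadcast the result along each row.
    program cart_sketch
       use mpi
       implicit none
       integer :: ierr, comm_cart, comm_row, comm_col
       integer :: dims(2), coords(2), rank
       logical :: periods(2), remain(2)
       real(8) :: t_local, ttot

       call MPI_Init(ierr)
       dims    = (/ 2, 2 /)            ! example 2x2 process grid (assumption)
       periods = .false.
       call MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, .false., comm_cart, ierr)
       call MPI_Comm_rank(comm_cart, rank, ierr)
       call MPI_Cart_coords(comm_cart, rank, 2, coords, ierr)

       remain = (/ .true., .false. /)  ! keep the first dimension -> column communicator
       call MPI_Cart_sub(comm_cart, remain, comm_col, ierr)
       remain = (/ .false., .true. /)  ! keep the second dimension -> row communicator
       call MPI_Cart_sub(comm_cart, remain, comm_row, ierr)

       t_local = real(rank+1, 8)       ! stand-in for this rank's kinetic energy

       ! Sum the contributions within the column ...
       call MPI_Allreduce(t_local, ttot, 1, MPI_DOUBLE_PRECISION, MPI_SUM, comm_col, ierr)
       ! ... then broadcast along each row from the master column (row-rank 0)
       call MPI_Bcast(ttot, 1, MPI_DOUBLE_PRECISION, 0, comm_row, ierr)

       call MPI_Finalize(ierr)
    end program cart_sketch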
Example: int_ion_mix – Original

OpenACC

CUDA
Step A:
• Declare arrays in shared memory (xss, ffs)
• Define grid, block and thread parameters
• Copy target data from global memory to registers
• Initialize the force
• Copy a chunk of source data from global to shared memory

CUDA
Step B:
• Thread (i,j) applies its sources (xss) to its target
Step C:
• Sum the forces (ffs) from all threads of my block
Step D:
• Copy the shared-memory array (ffs) to global memory
• Each block copies to its own array section
• Thread columns share the work

CUDA
Step E:
• The last block in a row to copy its results then sums the contributions from each block
• Threads share the work, accumulating results into registers
• Threads take turns adding their contributions to shared memory
• Thread columns write the final results to global memory

Fermi vs Kepler (1 GPU, pgfortran CUDA)
Baseline:
• 8 OpenMP threads (icc)
• pure_ion simulation
• Weak scaling (increasing the number of ions)
Figure: Speedup over the baseline for Fermi and Kepler GPUs at 20,000, 40,000 and 80,000 ions.

Comparison OpenACC vs CUDA
• OpenMP version as baseline:
  – 2 Intel Xeon CPUs (E5620, 2.4 GHz)
  – Intel 12.1 compiler
  – 8 OpenMP threads
• 1 Kepler K20
  – CUDA version: 5.0
  – CAPS 3.3.1 for OpenACC
  – PGI 13.1 for OpenACC/CUDA
• IUmd parameters:
  – nschunk: 4
  – blkDim: 128
  – grdDim: 216

Comparison OpenACC vs CUDA
• Simulation type: ion-mixture
• Input data set changed!
Figure: Speedup over the OpenMP baseline for OpenACC (CAPS), OpenACC (PGI) and CUDA.

Combined Benchmark – First Results
Figure: Runtime in seconds of the IUmd 6.3.0 benchmark for OpenMP (1–32 threads), MPI+OpenMP (2x16 up to 4x32) and CUDA (1, 2x1, 4x1, 6x1 GPUs) configurations; runtimes range from 74.05 s down to 0.05 s.

Note of Thanks to Contributors (in alphabetical order)
Donald Berry, Robert Dietrich, Robert Henschel, Joseph Schuchart, Jonas Stolle, Frank Winkler

General Approach to Optimization

FPU Idle Times
Figure: FPU idle times in percent of PAPI-measured total cycles

Branch Mispredictions