FFT Program Generation for Shared Memory: SMP and Multicore
Transcrição
FFT Program Generation for Shared Memory: SMP and Multicore
FFT Program Generation for Shared Memory: SMP and Multicore F. Franchetti, Y. Voronenko, M. Püschel Electrical and Computer Engineering Carnegie Mellon University Nicola Marcacci Rossi, 31. Oktober 2011 Overview First Part: From the algorithm to the code Second Part: Spiral extension for Shared Memory Discussion Discrete Fourier Transform The problem: Given: Compute: x ∈ CN y = DFTN x DFTN = [wNkl ]0≤k,l<N wn = e−2πi/N Discrete Fourier Transform The problem: Given: Compute: x ∈ CN y = DFTN x DFTN = [wNkl ]0≤k,l<N wn = e −2πi/N Algorithms: I I Direct Matrix-Vector Multiplication: O(N 2 ) Fast Fourier Transforms: O(N log N ) Discrete Fourier Transform The problem: Given: Compute: x ∈ CN y = DFTN x I DFTN = [wNkl ]0≤k,l<N wn = e Algorithms: I Direct Matrix-Vector Multiplication: O(N 2 ) Fast Fourier Transforms: O(N log N ) −2πi/N Cooley-Tukey FFT I Divide and Conquer I Matrix factorization: I Assume N = 2k I Base case: DFT2 DFTmn x = (DFTm ⊗ In )Dm,n (Im ⊗ DFTn )Lmn m x Discrete Fourier Transform The problem: Given: Compute: x ∈ CN y = DFTN x I DFTN = [wNkl ]0≤k,l<N wn = e Algorithms: I Direct Matrix-Vector Multiplication: O(N 2 ) Fast Fourier Transforms: O(N log N ) −2πi/N Cooley-Tukey FFT I Divide and Conquer I Matrix factorization: I Assume N = 2k I Base case: DFT2 DFTmn x = (DFTm ⊗ In )Dm,n (Im ⊗ DFTn )Lmn m x Spiral: generating optimized implementations 1 1 source: FFT Program Generation for Shared Memory: SMP and Multicore DFTN Start formula: Cooley-Tukey Rule: DFTmn → (DFTm ⊗ In)Dm,n (Im ⊗ DFTn )Lmn m Spiral: generating optimized implementations 1 1 source: FFT Program Generation for Shared Memory: SMP and Multicore DFTN Start formula: Cooley-Tukey Rule: Example: DFTmn → (DFTm ⊗ In)Dm,n (Im ⊗ DFTn )Lmn m DFT8 = (DFT2 ⊗ I4 )D8,4 (I2 ⊗ (DFT2 ⊗ I2 )D4,2 (I2 ⊗ DFT2 )L42 )L92 Spiral: generating optimized implementations Start formula: Cooley-Tukey Rule: SPL to code translation table: DFTmn → (DFTm ⊗ In)Dm,n (Im ⊗ DFTn )Lmn m y = (An Bn )x y = (Im ⊗ An )x ... 1 1 source: FFT Program Generation for Shared Memory: SMP and Multicore DFTN t[0:1:n-1] = B(x[0:1:n-1]); y[0:1:n-1] = A(t([0:1:n-1]); for (i=0;i<m;i++) y[i*n:1:i*n+n-1] = A(x[i*n:1:i*n+n-1]); ... Spiral: generating optimized implementations Start formula: Cooley-Tukey Rule: SPL to code translation table: DFTmn → (DFTm ⊗ In)Dm,n (Im ⊗ DFTn )Lmn m y = (An Bn )x y = (Im ⊗ An )x ... 1 1 source: FFT Program Generation for Shared Memory: SMP and Multicore DFTN t[0:1:n-1] = B(x[0:1:n-1]); y[0:1:n-1] = A(t([0:1:n-1]); for (i=0;i<m;i++) y[i*n:1:i*n+n-1] = A(x[i*n:1:i*n+n-1]); ... Search space: factorizations and base cases Extending Spiral for Shared Memory New tag: DFT | {z N} smp(p,µ) I Number of processors: p I Cache line size: µ Find parallel Cooley-Tukey: DFT → ··· | {zmn} smp(p,µ) Extending Spiral for Shared Memory New tag: DFT | {z N} Steps: I smp(p,µ) I Number of processors: p I Cache line size: µ Find parallel Cooley-Tukey: DFT → ··· | {zmn} smp(p,µ) Identify parallel constructs (and their implementation) I Dene rewriting rules I Derive a parallel Cooley-Tukey Extending Spiral for Shared Memory New tag: DFT | {z N} Steps: I smp(p,µ) I Number of processors: p I Cache line size: µ Find parallel Cooley-Tukey: DFT → ··· | {zmn} smp(p,µ) Identify parallel constructs (and their implementation) I Dene rewriting rules I Derive a parallel Cooley-Tukey Issues: I Load Balancing I Synchronization overhead I False Sharing False sharing Circumstances: I Dierent data I Same cache line I Consecutive accesses Leads to cache line thrashing Solution: one processor per cache line Parallel Constructs: Block-Diagonal Products A ∈ Cnµ×nµ : A A (I ⊗ A)x = Given .. x . A Parallel Constructs: Block-Diagonal Products A ∈ Cnµ×nµ : A A (I ⊗ A)x = Given .. x . A Multiplication with a Block-Diagonal Matrix Embarassingly parallel Parallel Constructs: Block-Diagonal Products A ∈ Cnµ×nµ : A A (I ⊗ A)x = Given .. x . A Multiplication with a Block-Diagonal Matrix Embarassingly parallel Implementation: OMP for #pragma omp parallel for schedule(static) shared(x, y) for (i=0; i<p; i++) y[i*n:1:i*n-1] = A(x[i*n:1:i*n+n-1]); Parallel Constructs: Block-Diagonal Products Ai ∈ Cnµ×nµ : A1 p−1 A2 L Ai x = i=0 Given .. x . An Parallel Constructs: Block-Diagonal Products Ai ∈ Cnµ×nµ : A1 p−1 A2 L Ai x = i=0 Given .. x . An Multiplication with a Block-Diagonal Matrix Embarassingly parallel Implementation: OMP for Parallel Constructs: Reordering Cache Lines (P ⊗ Iµ ) P a permutation matrix Parallel Constructs: Reordering Cache Lines (P ⊗ Iµ ) P a permutation matrix Example: 0 0 P = 1 0 1 0 0 0 (P ⊗ Iµ ) = 0 1 0 0 0 0 0 1 Iµ Iµ Iµ Iµ Parallel Constructs: Reordering Cache Lines (P ⊗ Iµ ) P a permutation matrix Example: 0 0 P = 1 0 1 0 0 0 (P ⊗ Iµ ) = 0 1 0 0 0 0 0 1 Iµ Iµ Iµ Iµ Implementation (no data permutation!): Parallel Constructs: Reordering Cache Lines (P ⊗ Iµ ) P a permutation matrix Example: 0 0 P = 1 0 1 0 0 0 (P ⊗ Iµ ) = 0 1 0 0 0 0 0 1 Iµ Iµ Iµ Iµ Implementation (no data permutation!): I Modify access pattern of next construct Rewriting Rules Identify appropriate Rewriting Rules I I I Transform Cooley-Tukey FFT to a Multicore version Don't exist for all Transforms They exist for FFT: a major contribution of the paper Rewriting Rules Identify appropriate Rewriting Rules I I I Transform Cooley-Tukey FFT to a Multicore version Don't exist for all Transforms They exist for FFT: a major contribution of the paper AB |{z} → A |{z} (1) B |{z} smp(p,µ) smp(p,µ) smp(p,µ) mp mp ⊗ In/p ) Am ⊗ In → (Lm ⊗ In/p )(Ip ⊗ (Am ⊗ In/p ))(Lp {z } | | {z } (p,µ) (p,µ) (2) mn/p pn ) (Lp ⊗ Im/p ) (Ip ⊗ L m/p {z } {z }| | (p,µ) mn (p,µ) Lm → (3) pm mn/p | {z } ) (Lm ⊗ In/p ) (Ip ⊗ Lm | {z } {z } | (p,µ) (p,µ) (p,µ) smp smp smp smp smp smp smp Im ⊗ An → Ip ⊗ (Im/p ⊗ An ) | {z } (4) (P ⊗ In ) → (P ⊗ In/µ ) ⊗ Iµ | {z } (5) smp(p,µ) smp(p,µ) D |{z} smp(p,µ) → p−1 M i=0 Di (6) The Result: A Multicore Cooley-Tukey FFT Recall the Cooley-Tukey FFT rule: DFTmn → (DFTm ⊗ In )Dm,n (Im ⊗ DFTn )Lmn m The Result: A Multicore Cooley-Tukey FFT Recall the Cooley-Tukey FFT rule: DFTmn → (DFTm ⊗ In )Dm,n (Im ⊗ DFTn )Lmn m Cooley-Tukey FFT adapted for Shared Memory: DFT → | {zmn} smp(p,µ) mp ((Lmp m ⊗ In/pµ ) ⊗ Iµ )(Ip ⊗ (DFTm ⊗ In/p ))((Lp ⊗ In/pµ ) ⊗ Iµ ) p−1 L i Dm,n i=0 (Ip ⊗ (Im/p ⊗ DFTn ))(Ip ⊗ Lm/p ) ((Lpn p ⊗ Im/pµ ) ⊗ Iµ ) mn/p Discussion 2 2 source: FFT Program Generation for Shared Memory: SMP and Multicore Discussion Peculiarities of FFTW I I I 2 2 source: FFT Program Generation for Shared Memory: SMP and Multicore State-of-the-art multithreading DFT implementation Optimized for big problem sizes and many processors (overhead) Does not use µ, p explicitly Discussion Peculiarities of FFTW I I I 2 2 source: FFT Program Generation for Shared Memory: SMP and Multicore State-of-the-art multithreading DFT implementation Optimized for big problem sizes and many processors (overhead) Does not use µ, p explicitly Advantages of Spiral I Enables high-level optimizations and reasoning I No need for loop analysis I Automatizes implementation