FFT Program Generation for Shared Memory: SMP and Multicore

Transcrição

FFT Program Generation for Shared Memory: SMP and Multicore
FFT Program Generation for Shared Memory:
SMP and Multicore
F. Franchetti, Y. Voronenko, M. Püschel
Electrical and Computer Engineering
Carnegie Mellon University
Nicola Marcacci Rossi, 31. Oktober 2011
Overview
First Part: From the algorithm to the code
Second Part: Spiral extension for Shared Memory
Discussion
Discrete Fourier Transform
The problem:
Given:
Compute:
x ∈ CN
y = DFTN x
DFTN = [wNkl ]0≤k,l<N
wn = e−2πi/N
Discrete Fourier Transform
The problem:
Given:
Compute:
x ∈ CN
y = DFTN x
DFTN = [wNkl ]0≤k,l<N
wn = e
−2πi/N
Algorithms:
I
I
Direct Matrix-Vector
Multiplication: O(N 2 )
Fast Fourier Transforms:
O(N log N )
Discrete Fourier Transform
The problem:
Given:
Compute:
x ∈ CN
y = DFTN x
I
DFTN = [wNkl ]0≤k,l<N
wn = e
Algorithms:
I
Direct Matrix-Vector
Multiplication: O(N 2 )
Fast Fourier Transforms:
O(N log N )
−2πi/N
Cooley-Tukey FFT
I
Divide and Conquer
I
Matrix factorization:
I
Assume N = 2k
I
Base case: DFT2
DFTmn x =
(DFTm ⊗ In )Dm,n (Im ⊗ DFTn )Lmn
m x
Discrete Fourier Transform
The problem:
Given:
Compute:
x ∈ CN
y = DFTN x
I
DFTN = [wNkl ]0≤k,l<N
wn = e
Algorithms:
I
Direct Matrix-Vector
Multiplication: O(N 2 )
Fast Fourier Transforms:
O(N log N )
−2πi/N
Cooley-Tukey FFT
I
Divide and Conquer
I
Matrix factorization:
I
Assume N = 2k
I
Base case: DFT2
DFTmn x =
(DFTm ⊗ In )Dm,n (Im ⊗ DFTn )Lmn
m x
Spiral: generating optimized implementations
1
1 source: FFT Program Generation for Shared Memory:
SMP and Multicore
DFTN
Start formula:
Cooley-Tukey Rule:
DFTmn →
(DFTm ⊗ In)Dm,n
(Im ⊗ DFTn )Lmn
m
Spiral: generating optimized implementations
1
1 source: FFT Program Generation for Shared Memory:
SMP and Multicore
DFTN
Start formula:
Cooley-Tukey Rule:
Example:
DFTmn →
(DFTm ⊗ In)Dm,n
(Im ⊗ DFTn )Lmn
m
DFT8 =
(DFT2 ⊗ I4 )D8,4
(I2 ⊗ (DFT2 ⊗ I2 )D4,2
(I2 ⊗ DFT2 )L42 )L92
Spiral: generating optimized implementations
Start formula:
Cooley-Tukey Rule:
SPL to code translation table:
DFTmn →
(DFTm ⊗ In)Dm,n
(Im ⊗ DFTn )Lmn
m
y = (An Bn )x
y = (Im ⊗ An )x
...
1
1 source: FFT Program Generation for Shared Memory:
SMP and Multicore
DFTN
t[0:1:n-1] =
B(x[0:1:n-1]);
y[0:1:n-1] =
A(t([0:1:n-1]);
for (i=0;i<m;i++)
y[i*n:1:i*n+n-1] =
A(x[i*n:1:i*n+n-1]);
...
Spiral: generating optimized implementations
Start formula:
Cooley-Tukey Rule:
SPL to code translation table:
DFTmn →
(DFTm ⊗ In)Dm,n
(Im ⊗ DFTn )Lmn
m
y = (An Bn )x
y = (Im ⊗ An )x
...
1
1 source: FFT Program Generation for Shared Memory:
SMP and Multicore
DFTN
t[0:1:n-1] =
B(x[0:1:n-1]);
y[0:1:n-1] =
A(t([0:1:n-1]);
for (i=0;i<m;i++)
y[i*n:1:i*n+n-1] =
A(x[i*n:1:i*n+n-1]);
...
Search space:
factorizations and base cases
Extending Spiral for Shared Memory
New tag:
DFT
| {z N}
smp(p,µ)
I
Number of processors: p
I
Cache line size: µ
Find parallel Cooley-Tukey:
DFT
→ ···
| {zmn}
smp(p,µ)
Extending Spiral for Shared Memory
New tag:
DFT
| {z N}
Steps:
I
smp(p,µ)
I
Number of processors: p
I
Cache line size: µ
Find parallel Cooley-Tukey:
DFT
→ ···
| {zmn}
smp(p,µ)
Identify parallel constructs
(and their implementation)
I
Dene rewriting rules
I
Derive a parallel Cooley-Tukey
Extending Spiral for Shared Memory
New tag:
DFT
| {z N}
Steps:
I
smp(p,µ)
I
Number of processors: p
I
Cache line size: µ
Find parallel Cooley-Tukey:
DFT
→ ···
| {zmn}
smp(p,µ)
Identify parallel constructs
(and their implementation)
I
Dene rewriting rules
I
Derive a parallel Cooley-Tukey
Issues:
I
Load Balancing
I
Synchronization overhead
I
False Sharing
False sharing
Circumstances:
I
Dierent data
I
Same cache line
I
Consecutive accesses
Leads to cache line thrashing
Solution: one processor per cache
line
Parallel Constructs: Block-Diagonal Products
A ∈ Cnµ×nµ :

A

A

(I ⊗ A)x = 

Given

..


x

.
A
Parallel Constructs: Block-Diagonal Products
A ∈ Cnµ×nµ :

A

A

(I ⊗ A)x = 

Given

..


x

.
A
Multiplication with a Block-Diagonal
Matrix
Embarassingly parallel
Parallel Constructs: Block-Diagonal Products
A ∈ Cnµ×nµ :

A

A

(I ⊗ A)x = 

Given

..


x

.
A
Multiplication with a Block-Diagonal
Matrix
Embarassingly parallel
Implementation: OMP for
#pragma omp parallel for schedule(static)
shared(x, y)
for (i=0; i<p; i++)
y[i*n:1:i*n-1] = A(x[i*n:1:i*n+n-1]);
Parallel Constructs: Block-Diagonal Products
Ai ∈ Cnµ×nµ :

A1
p−1 
A2
L

Ai x = 

i=0
Given

..


x

.
An
Parallel Constructs: Block-Diagonal Products
Ai ∈ Cnµ×nµ :

A1
p−1 
A2
L

Ai x = 

i=0
Given

..


x

.
An
Multiplication with a Block-Diagonal
Matrix
Embarassingly parallel
Implementation: OMP for
Parallel Constructs: Reordering Cache Lines
(P ⊗ Iµ )
P a permutation
matrix
Parallel Constructs: Reordering Cache Lines
(P ⊗ Iµ )
P a permutation
matrix
Example:

0
 0
P =
 1
0
1
0
0
0


(P ⊗ Iµ ) = 

0
1
0
0

0
0 

0 
1
Iµ

Iµ



Iµ
Iµ
Parallel Constructs: Reordering Cache Lines
(P ⊗ Iµ )
P a permutation
matrix
Example:

0
 0
P =
 1
0
1
0
0
0


(P ⊗ Iµ ) = 

0
1
0
0

0
0 

0 
1
Iµ

Iµ



Iµ
Iµ
Implementation (no data permutation!):
Parallel Constructs: Reordering Cache Lines
(P ⊗ Iµ )
P a permutation
matrix
Example:

0
 0
P =
 1
0
1
0
0
0


(P ⊗ Iµ ) = 

0
1
0
0

0
0 

0 
1
Iµ

Iµ



Iµ
Iµ
Implementation (no data permutation!):
I
Modify access pattern of next construct
Rewriting Rules
Identify appropriate
Rewriting Rules
I
I
I
Transform Cooley-Tukey
FFT to a Multicore
version
Don't exist for all
Transforms
They exist for FFT: a
major contribution of the
paper
Rewriting Rules
Identify appropriate
Rewriting Rules
I
I
I
Transform Cooley-Tukey
FFT to a Multicore
version
Don't exist for all
Transforms
They exist for FFT: a
major contribution of the
paper
AB
|{z}
→
A
|{z}
(1)
B
|{z}
smp(p,µ) smp(p,µ) smp(p,µ)
mp
mp
⊗ In/p )
Am ⊗ In → (Lm ⊗ In/p )(Ip ⊗ (Am ⊗ In/p ))(Lp
{z
}
|
|
{z
}
(p,µ)
(p,µ)
(2)

mn/p
pn

) (Lp ⊗ Im/p )
 (Ip ⊗ L

m/p



{z
}

{z
}|
 |


(p,µ)
mn
(p,µ)
Lm
→
(3)
pm
mn/p


| {z }
)
 (Lm ⊗ In/p ) (Ip ⊗ Lm



|
{z
}

{z
}
|
(p,µ)



(p,µ)
(p,µ)
smp
smp
smp
smp
smp
smp
smp
Im ⊗ An → Ip ⊗ (Im/p ⊗ An )
|
{z
}
(4)
(P ⊗ In ) → (P ⊗ In/µ ) ⊗ Iµ
|
{z
}
(5)
smp(p,µ)
smp(p,µ)
D
|{z}
smp(p,µ)
→
p−1
M
i=0
Di
(6)
The Result: A Multicore Cooley-Tukey FFT
Recall the Cooley-Tukey FFT rule:
DFTmn → (DFTm ⊗ In )Dm,n (Im ⊗ DFTn )Lmn
m
The Result: A Multicore Cooley-Tukey FFT
Recall the Cooley-Tukey FFT rule:
DFTmn → (DFTm ⊗ In )Dm,n (Im ⊗ DFTn )Lmn
m
Cooley-Tukey FFT adapted for Shared Memory:
DFT
→
| {zmn}
smp(p,µ)
mp
((Lmp
m ⊗ In/pµ ) ⊗ Iµ )(Ip ⊗ (DFTm ⊗ In/p ))((Lp ⊗ In/pµ ) ⊗ Iµ )
p−1
L i
Dm,n
i=0
(Ip ⊗ (Im/p ⊗ DFTn ))(Ip ⊗ Lm/p )
((Lpn
p ⊗ Im/pµ ) ⊗ Iµ )
mn/p
Discussion
2
2 source: FFT Program Generation for Shared Memory:
SMP and Multicore
Discussion
Peculiarities of FFTW
I
I
I
2
2 source: FFT Program Generation for Shared Memory:
SMP and Multicore
State-of-the-art multithreading
DFT implementation
Optimized for big problem sizes
and many processors (overhead)
Does not use µ, p explicitly
Discussion
Peculiarities of FFTW
I
I
I
2
2 source: FFT Program Generation for Shared Memory:
SMP and Multicore
State-of-the-art multithreading
DFT implementation
Optimized for big problem sizes
and many processors (overhead)
Does not use µ, p explicitly
Advantages of Spiral
I
Enables high-level optimizations
and reasoning
I
No need for loop analysis
I
Automatizes implementation

Documentos relacionados