Parallel Computing I SS 2014 - Numerical linear algebra II

Parallel Computing
Numerical linear algebra II
Thorsten Grahs, 23.06.2014
Overview
Sparse matrix example – PageRank
Efficient storage formats
Parallel matrix-matrix-multiplication
Virtual topologies in MPI
Matrices are extremely important in HPC
While it’s certainly not the case that high performance
computing is all about computing with matrices, matrix
operations are key to many important HPC applications.
Many important applications can be reduced to operations
on matrices, including (but not limited to)
1. searching and sorting
2. numerical simulation of physical processes
3. optimization
The list of the top 500 supercomputers in the world
(http://www.top500.org) is determined by a benchmark
program that performs matrix operations.
Dense vs. sparse matrices
Matrix A ∈ Rm×n is dense if all or most of its entries are
nonzero.
Storing a dense matrix (sometimes called a full matrix)
requires storing all m × n elements of the matrix.
Usually an array data structure is used to store a dense
matrix.
A matrix is sparse if most of its entries are zero.
Here we expect the number of zeros to far exceed the
number of nonzeros.
It is often most efficient to store only the nonzero entries
of a sparse matrix, but this requires that location
information also be stored.
Arrays and lists are most commonly used to store sparse
matrices.
Dense matrix examples
Find a matrix to represent this complete graph, where the
ij entry contains the weight of the edge connecting the node
corresponding to row i with the node corresponding to column j.
Use the value 0 if a connection is missing.
Sparse matrix examples
Find a matrix to represent this graph, where the ij entry
contains the weight of the edge connecting the node
corresponding to row i with the node corresponding to column j.
As before, use the value 0 if a connection is missing.
Example | PageRank
Google’s search algorithm is based on PageRank.
For each web-page a weight (PageRank) is computed.
The weight is higher when more pages link to this page,
and even higher when those linking pages themselves have
high weights.
Thus, the weight PR_i of page i is computed from the weights
PR_j of the web-pages j that link to it.
If page j links to c_j web-pages, its weight PR_j is
distributed over these linked pages.
Example | PageRank
from Wikipedia http://de.wikipedia.org/wiki/PageRank
Example | PageRank
This gives rise to the following recursive formula

  PR_i = (1 − d)/n + d · Σ_{j→i} PR_j / c_j

where the sum runs over all pages j that link to page i,
n is the total number of pages and
d is a damping factor between 0 and 1.
This can be written as a linear system

  M · PR = ((1 − d)/n) · 1

with M_ij = δ_ij − d·L_ij, where L is the (sparse) link matrix

  L_ij = 1/c_j, if page j links to page i
         0,     otherwise.
Example | PageRank
This system leads to the eigenvalue problem

  P^T · PR = PR

with P the so-called Google matrix.
For d < 1 the solution of the system is unique
(theorem of Perron-Frobenius).
The system is solved with an iterative algorithm.
Figure: BiCGStab vs. reverse GS for a matrix with n = 2 million.
G. M. Del Corso, A. Gullì, F. Romani, Comparison of Krylov subspace methods on the
PageRank problem, J. Comput. Appl. Math., Vol. 210, Issues 1–2, 2007
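The simplest iterative scheme is to apply the recursive formula above directly as a fixed-point (power) iteration. Below is a minimal sketch in C, assuming the link structure is given as an edge list from[k] -> to[k] together with the out-degrees out[j] (all names are illustrative, and every page is assumed to have at least one outgoing link); this is not the Krylov solver compared in the reference above.

#include <stdlib.h>

/* Fixed-point iteration PR_i = (1-d)/n + d * sum_{j->i} PR_j / c_j */
void pagerank(int n, int nnz, const int *from, const int *to,
              const int *out, double d, int iters, double *pr)
{
    double *next = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++)
        pr[i] = 1.0 / n;                      /* uniform start vector */

    for (int it = 0; it < iters; it++) {
        for (int i = 0; i < n; i++)
            next[i] = (1.0 - d) / n;
        for (int k = 0; k < nnz; k++)         /* PR_j is spread over the c_j links of page j */
            next[to[k]] += d * pr[from[k]] / out[from[k]];
        for (int i = 0; i < n; i++)
            pr[i] = next[i];
    }
    free(next);
}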
Storing in two dim. arrays
Dense or banded matrix
It is natural to use a 2D array to store a dense or banded
matrix. Unfortunately, there are a couple of significant issues
that complicate this seemingly simple approach.
Row-major vs. column-major storage pattern is language
dependent.
It is not possible to dynamically allocate two-dimensional
arrays in C and C++.
(at least not without pointer storage and manipulation
overhead).
Row-major storage
Languages like C and C++ use what is often called a
row-major storage pattern for 2D matrices.
In C and C++, the last index in a multidimensional array
indexes contiguous memory locations. Thus a[i][j] and
a[i][j + 1] are adjacent in memory.
Example (an array with rows of length 5):
The stride between adjacent elements in the same row is 1.
The stride between adjacent elements in the same column
is the row length (5 in this example).
Column-major storage
In Fortran 2D matrices are stored in column-major form.
The first index in a multidimensional array indexes
contiguous memory locations. Thus a(i, j) and a(i + 1, j)
are adjacent in memory.
Example (an array with 2 rows):
The stride between adjacent elements in the same row is
the column length (2 in this example), while the stride
between adjacent elements in the same column is 1.
Notice that if C is used to read a matrix stored by Fortran
(or vice versa), the transposed matrix will be read.
Significance of array ordering
There are two main reasons why HPC programmers need to
be aware of this issue:
Memory access patterns can have a dramatic impact on
performance, especially on modern systems with a
complicated memory hierarchy.
The two code segments sketched below access the same
elements of an array, but the order of accesses is different.
Many important numerical libraries (e.g. LAPACK) are
written in Fortran. To use them from C or C++ the
programmer must often work with a transposed matrix.
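A minimal sketch of two such code segments (the array name a and the size N are illustrative): both functions visit the same N × N elements, but only the first traverses the row-major C array with stride 1 in the inner loop.

#define N 1000
static double a[N][N];

double sum_row_major(void)
{
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)   /* inner index j: contiguous memory, stride 1 */
            sum += a[i][j];
    return sum;
}

double sum_column_major(void)
{
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)   /* inner index i: jumps a whole row (stride N) per step */
            sum += a[i][j];
    return sum;
}

On most cached memory hierarchies the second version is noticeably slower, although it computes exactly the same result.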
Sparse storage formats
Efficient storage formats for sparse matrices
Obviously, there is no need to store the zeros of a
sparse matrix.
There are several ways to store a sparse matrix efficiently
Common sparse storage formats:
Coordinate
Diagonal
Compressed Sparse Row (CSR)
ELLPACK
Hybrid
Storage format COO
Coordinate (COO)
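As a minimal illustration (the 2 × 3 example matrix and the array names are not from the slides), COO stores one (row, column, value) triple per nonzero in three parallel arrays:

/* Example matrix   | 1 0 2 |
                    | 0 3 0 |          nnz = 3 nonzeros           */
int    coo_row[] = { 0, 0, 1 };        /* row index of each nonzero    */
int    coo_col[] = { 0, 2, 1 };        /* column index of each nonzero */
double coo_val[] = { 1.0, 2.0, 3.0 };  /* the nonzero values           */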
Storage format CSR
Compressed Sparse Row (CSR)
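A minimal CSR sketch (array names are illustrative): column indices and values are stored row by row, and row_ptr marks where each row starts, so a sparse matrix-vector product touches only the nonzeros. For the small 2 × 3 example above, row_ptr = {0, 2, 3}, col_idx = {0, 2, 1}, val = {1, 2, 3}.

/* y = A*x for an n-row matrix in CSR format:
   val[]     - nnz nonzero values, stored row by row
   col_idx[] - column index of each stored value
   row_ptr[] - start of each row in val[]; row_ptr[n] == nnz */
void spmv_csr(int n, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}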
Storage format DIA
Diagonal (DIA)
Storage format ELL
ELLPACK (ELL)
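A sketch of a matrix-vector product in ELL format, under an illustrative convention: every row is padded to K = maximum number of nonzeros per row, the arrays are laid out so that entry k of all rows is contiguous, and padded slots carry the column index −1.

/* y = A*x for an n-row matrix in ELLPACK format with row width K */
void spmv_ell(int n, int K, const int *col_idx, const double *val,
              const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = 0; k < K; k++) {
            int j = col_idx[k * n + i];      /* entry k of row i */
            if (j >= 0)                      /* skip padding */
                sum += val[k * n + i] * x[j];
        }
        y[i] = sum;
    }
}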
Storage format Hyb
Hybrid (HYB) ELL+COO
(CUSP only)
Use of formats
Formats comparison | structured

Structured matrix – bytes per non-zero entry

  Format   float    double
  COO      12.00    16.00
  CSR       8.45    12.45
  DIA       4.05     8.10
  ELL       8.11    12.16
  HYB       8.11    12.16
Formats comparison | unstructured

Unstructured matrix – bytes per non-zero entry

  Format   float    double
  COO       12.00    16.00
  CSR        8.62    12.62
  DIA      164.11   328.22
  ELL       11.06    16.60
  HYB        9.00    13.44
Formats comparison | random

Random matrix – bytes per non-zero entry

  Format   float    double
  COO       12.00    16.00
  CSR        8.42    12.42
  DIA       76.83   153.65
  ELL       14.20    21.29
  HYB        9.60    14.20
Formats comparison | power-law

Power-law matrix – bytes per non-zero entry

  Format   float    double
  COO       12.00    16.00
  CSR        8.74    12.73
  DIA      118.83   237.66
  ELL       49.88    74.82
  HYB       13.50    19.46
Matrix-matrix multiplication
Series of inner product (dot product) operations
Iterative row-oriented algorithm
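A minimal sketch of the row-oriented algorithm in C (square row-major matrices stored as 1D arrays; names are illustrative):

/* C = A * B for n x n row-major matrices */
void matmul(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double dot = 0.0;                /* inner product of row i of A ... */
            for (int k = 0; k < n; k++)      /* ... with column j of B          */
                dot += A[i * n + k] * B[k * n + j];
            C[i * n + j] = dot;
        }
}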
Block-matrix computation
Replace scalar multiplication with matrix multiplication
Replace scalar addition with matrix addition
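A sketch of one such block operation (block size bs, block indices bi, bj, bk; names are illustrative and n is assumed to be a multiple of bs): the scalar update c_ij += a_ik * b_kj becomes the block update C_ij += A_ik * B_kj.

/* C_(bi,bj) += A_(bi,bk) * B_(bk,bj) for bs x bs blocks inside
   n x n row-major matrices                                      */
void block_update(int n, int bs, int bi, int bj, int bk,
                  const double *A, const double *B, double *C)
{
    for (int i = bi * bs; i < (bi + 1) * bs; i++)
        for (int j = bj * bs; j < (bj + 1) * bs; j++) {
            double sum = C[i * n + j];
            for (int k = bk * bs; k < (bk + 1) * bs; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}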
Block-matrix computation
Recurse until B is small enough
First parallel algorithm
Partitioning
Divide matrices into rows
Each primitive task has corresponding rows of three
matrices
Communication
Each task must eventually see every row of B
Organize tasks into a ring
Agglomeration and mapping
Fixed number of tasks, each requiring the same amount of
computation
Regular communication among tasks
Strategy: Assign each process a contiguous group of rows
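A minimal MPI sketch of one step of this ring communication, assuming each process owns rows_local rows of B of length n (names are illustrative):

#include <mpi.h>

/* Pass the local rows of B one step around the ring:
   send to the right neighbour, receive from the left one. */
void rotate_B(double *B_local, int rows_local, int n, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int left  = (rank + size - 1) % size;
    int right = (rank + 1) % size;
    MPI_Status status;
    MPI_Sendrecv_replace(B_local, rows_local * n, MPI_DOUBLE,
                         right, 0, left, 0, comm, &status);
}

Calling this p − 1 times, with a local multiplication before each call, lets every process see every row of B.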
Communication of B
Complexity
The algorithm has p iterations.
During each iteration a process multiplies an
(n/p) × (n/p) block of A by an (n/p) × n block of B, i.e.
multiplications per process and iteration: O(n^3/p^2)
Total computation time: O(n^3/p)
Each process ends up passing
(p − 1)·n^2/p = O(n^2) elements of B
Weakness
Each process must access every element of matrix B
The ratio of computation to communication is poor:
only 2n/p
Cannon’s algorithm
Compute C = A × B with A, B ∈ R^(N×N),
distributed block-wise over a 2D grid of √P × √P processors.
C is stored in the same block-wise manner.
Processor (i,j) computes

  C_i,j = Σ_k A_i,k · B_k,j

i.e. it needs block-row i of A and block-column j of B.
Prepare with an alignment of the blocks,
then iterate over a
  rotation phase
  computation phase
Simple vs. Cannon’s Algorithm
Alignment
Matrix A
The block-rows of matrix A are shifted to the left.
This has to be done for each row, until the diagonal
blocks are placed in the first column.
Matrix B
The block-columns of matrix B are shifted upwards.
This has to be done for each column, until the diagonal
blocks are placed in the first row.
After this step processor (i,j) holds the blocks
  A_i,(j+i) mod √P   (row i has been shifted i times to the left)
  B_(j+i) mod √P,j   (column j has been shifted j times upwards)
Alignment phase
Figures: block distribution before and after the alignment.
Computation & Rotation
Phase I – Computation
After the alignment, every processor holds two
corresponding blocks.
In the computation phase the matrix-matrix multiplication
is carried out block-wise.
Phase II – Rotation
The blocks of A are shifted one position further to the left,
the blocks of B one position further upwards.
Example
Starting point
The blocks are distributed over P = 9 processors.
Processor (i,j) holds block (i,j).
Example
Rotation
Alignment has taken place
Example | Iteration
Complexity analysis| Cannon’s
The algorithm has √p iterations.
During each iteration a process multiplies two
(n/√p) × (n/√p) blocks: O(n^3/p^(3/2)) operations per iteration.
Total computation time: O(n^3/p)
During each iteration a process sends and receives two
blocks of size (n/√p) × (n/√p).
Communication complexity: O(n^2/√p)
The row-wise algorithm had:
  p iterations
  communication complexity O(n^2)
MPI-Code | Cannon’s alg.
/* Create grid and get coordinates */
int dims[2], periods[2] = {1, 1};
int mycoords[2];
dims[0] = (int)sqrt(num_procs);
dims[1] = num_procs / dims[0];
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &comm_2d);
MPI_Comm_rank(comm_2d, &my2drank);
MPI_Cart_coords(comm_2d, my2drank, 2, mycoords);

/* Local blocks of the matrices */
double *a, *b, *c;
/* Load a, b and c corresponding to the coordinates */
MPI-Code | Cannon’s alg.
/* Matrix rotation A */
MPI_Cart_shift(comm_2d, 0, -mycoords[0], &shiftsource, &shiftdest);
MPI_Sendrecv_replace(a, nlocal*nlocal, MPI_DOUBLE,
                     shiftdest, 77, shiftsource, 77, comm_2d, &status);

/* Matrix rotation B */
MPI_Cart_shift(comm_2d, 1, -mycoords[1], &shiftsource, &shiftdest);
MPI_Sendrecv_replace(b, nlocal*nlocal, MPI_DOUBLE,
                     shiftdest, 77, shiftsource, 77, comm_2d, &status);
MPI-Code | Cannon’s alg.
/* Find left and upper neighbour */
MPI_Cart_shift(comm_2d, 0, -1, &rightrank, &leftrank);
MPI_Cart_shift(comm_2d, 1, -1, &downrank, &uprank);

for (i = 0; i < dims[0]; i++)
{
    dgemm(nlocal, a, b, c);   /* c = c + a * b */

    /* Shift matrix A to the left */
    MPI_Sendrecv_replace(a, nlocal*nlocal, MPI_DOUBLE,
                         leftrank, 77, rightrank, 77, comm_2d, &status);

    /* Shift matrix B upwards */
    MPI_Sendrecv_replace(b, nlocal*nlocal, MPI_DOUBLE,
                         uprank, 77, downrank, 77, comm_2d, &status);
}
/* A and B are back in their original placement */
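The call dgemm(nlocal, a, b, c) above stands for a local block multiply-accumulate; a minimal sketch, assuming row-major nlocal × nlocal blocks (this helper is not the BLAS routine of the same name):

/* c = c + a * b for row-major nlocal x nlocal blocks */
void dgemm(int nlocal, const double *a, const double *b, double *c)
{
    for (int i = 0; i < nlocal; i++)
        for (int j = 0; j < nlocal; j++) {
            double sum = c[i * nlocal + j];
            for (int k = 0; k < nlocal; k++)
                sum += a[i * nlocal + k] * b[k * nlocal + j];
            c[i * nlocal + j] = sum;
        }
}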
Virtual topologies in MPI
MPI allows one to associate virtual topologies with a communicator, for instance a
Cartesian topology
Graph topology
Creating a topology produces a new communicator
MPI provides “mapping functions”
Mapping functions compute processor ranks, based on
the topology naming scheme
Cartesian topology
Each process is connected to its neighbours in a virtual
grid
Boundaries can be cyclic
Processes can be identified by Cartesian coordinates
Creating a Cartesian topology
int MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims,
                    int *periods, int reorder, MPI_Comm *comm_cart)

comm_old   existing communicator
ndims      number of dimensions
dims       array of size ndims giving the number of processes per dimension
periods    logical array indicating whether a dimension is
           cyclic (true ⇒ cyclic boundary conditions)
reorder    logical flag
           false ⇒ ranks preserved
           true ⇒ possible rank reordering
comm_cart  new Cartesian communicator
Cartesian example
MPI_Comm vu;
int dim[2], period[2], reorder;

dim[0] = 4;  dim[1] = 3;

period[0] = 1;   /* first dimension is cyclic      */
period[1] = 0;   /* second dimension is not cyclic */
reorder = 1;

MPI_Cart_create(MPI_COMM_WORLD,
                2, dim, period, reorder, &vu);
Mapping coordinates to ranks
int MPI_Cart_rank(MPI_Comm comm, int *coords, int *rank)

Mapping ranks to coordinates

int MPI_Cart_coords(MPI_Comm comm, int rank,
                    int maxdims, int *coords)
Cartesian example
#include <mpi.h>     /* Run with 12 processes */
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank;
  MPI_Comm vu;
  int dim[2], period[2], reorder, coord[2], id;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  dim[0] = 4;  dim[1] = 3;
  period[0] = 1;  period[1] = 0;  reorder = 1;
  MPI_Cart_create(MPI_COMM_WORLD, 2, dim, period, reorder, &vu);

  if (rank == 5) {
    MPI_Cart_coords(vu, rank, 2, coord);
    printf("P:%d My coordinates are %d %d\n", rank, coord[0], coord[1]);
  }
  if (rank == 0) {
    coord[0] = 3;  coord[1] = 1;
    MPI_Cart_rank(vu, coord, &id);
    printf("Proc. at position (%d, %d) has rank %d\n", coord[0], coord[1], id);
  }
  MPI_Finalize();
  return 0;
}
Neighbouring ranks
output
Proc. at position (3,1) has rank 10
P:5 My coordinates are 1 2
Computing ranks of neighbouring processors
int MPI_Cart_shift (MPI_Comm comm, int direction,
int disp, int *rank_source, int *rank_dest)
Does not actually shift data: returns the correct ranks for a
shift that can be used in subsequent communication calls
Neighbouring ranks
Arguments
direction    dimension in which the shift should be made
disp         length of the shift in processor coordinates (positive or negative)
rank_source  rank from which the calling process receives a
             message during the shift
rank_dest    rank to which the calling process sends a
             message during the shift
If we shift off the edge of a non-periodic topology,
MPI_PROC_NULL is returned.
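A minimal usage sketch (the communicator cart is assumed to have been created with MPI_Cart_create); sends and receives addressed to MPI_PROC_NULL are simply ignored:

#include <mpi.h>

/* Query both neighbours of the calling process along dimension 0.
   rank_source lies one step before us, rank_dest one step after us;
   on a non-periodic edge the returned rank is MPI_PROC_NULL.        */
void neighbours_dim0(MPI_Comm cart, int *rank_source, int *rank_dest)
{
    MPI_Cart_shift(cart, 0, 1, rank_source, rank_dest);
}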