Parallel Computing I SS 2014 - Numerical linear algebra II

Parallel Computing
Numerical linear algebra II
Thorsten Grahs, 23.06.2014
Overview
Sparse matrix example – PageRank
Efficient storage formats
Parallel matrix-matrix-multiplication
Virtual topologies in MPI
Matrices are extremely important in HPC
While it’s certainly not the case that high performance
computing is all about computing with matrices, matrix
operations are key to many important HPC applications.
Many important applications can be reduced to operations
on matrices, including (but not limited to)
1. searching and sorting
2. numerical simulation of physical processes
3. optimization
The list of the top 500 supercomputers in the world
(http://www.top500.org) is determined by a benchmark
program that performs matrix operations.
Dense vs. sparse matrices
Matrix A ∈ Rm×n is dense if all or most of its entries are
nonzero.
Storing a dense matrix (sometimes called a full matrix)
requires storing all m × n elements of the matrix.
Usually an array data structure is used to store a dense
matrix.
A matrix is sparse if most of its entries are zero.
Here we expect the number of zeros to far exceed the
number of nonzeros.
It is often most efficient to store only the nonzero entries
of a sparse matrix, but this requires that location
information also be stored.
Arrays and lists are most commonly used to store sparse
matrices.
Dense matrix examples
Find a matrix to represent this complete graph, where the
ij entry contains the weight of the edge connecting the node
corresponding to row i with the node corresponding to column j.
Use the value 0 if a connection is missing.
Sparse matrix examples
Find a matrix to represent this graph, where the ij entry
contains the weight of the edge connecting the node
corresponding to row i with the node corresponding to column j.
As before, use the value 0 if a connection is missing.
Example | PageRank
Google’s search algorithm is based on PageRank.
For each web-page a weight (PageRank) is computed.
The weight is higher when more pages link to this page,
and even higher when those linking pages themselves have
high weights.
Thus, the weight PR_i of page i is computed from the weights
PR_j of the web-pages j that link to it.
If page j links to c_j web-pages, its weight PR_j is
distributed over these linked pages.
Example | PageRank
from Wikipedia http://de.wikipedia.org/wiki/PageRank
Example | PageRank
This gives rise to the following recursive formula

  PR_i = (1 − d)/n + d · Σ_{j→i} PR_j / c_j

where the sum runs over all pages j that link to page i,
n is the total number of pages and
d is a damping factor between 0 and 1.
This can be written as a linear system

  M · PR = ((1 − d)/n) · 1

with M_ij = δ_ij − d·L_ij, where L is the (sparse) link matrix

  L_ij = 1/c_j, if page j links to page i
         0,     otherwise.
Example | PageRank
This system leads to the eigenvalue problem

  P^T · PR = PR

with P the so-called Google matrix.
For d < 1 the solution of the system is unique
(theorem of Perron-Frobenius).
The system is solved with an iterative algorithm.
Figure: BiCGStab vs. reverse GS for a matrix with n = 2 million.
G. M. Del Corso, A. Gullì, F. Romani, Comparison of Krylov subspace methods on the
PageRank problem, J. Comput. Appl. Math., Vol. 210, Issues 1–2, 2007
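The simplest iterative scheme is to apply the recursive formula above directly as a fixed-point (power) iteration. Below is a minimal sketch in C, assuming the link structure is given as an edge list from[k] -> to[k] together with the out-degrees out[j] (all names are illustrative, and every page is assumed to have at least one outgoing link); this is not the Krylov solver compared in the reference above.

#include <stdlib.h>

/* Fixed-point iteration PR_i = (1-d)/n + d * sum_{j->i} PR_j / c_j */
void pagerank(int n, int nnz, const int *from, const int *to,
              const int *out, double d, int iters, double *pr)
{
    double *next = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++)
        pr[i] = 1.0 / n;                      /* uniform start vector */

    for (int it = 0; it < iters; it++) {
        for (int i = 0; i < n; i++)
            next[i] = (1.0 - d) / n;
        for (int k = 0; k < nnz; k++)         /* PR_j is spread over the c_j links of page j */
            next[to[k]] += d * pr[from[k]] / out[from[k]];
        for (int i = 0; i < n; i++)
            pr[i] = next[i];
    }
    free(next);
}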
Storing in two dim. arrays
Dense or banded matrix
It is natural to use a 2D array to store a dense or banded
matrix. Unfortunately, there are a couple of significant issues
that complicate this seemingly simple approach.
Row-major vs. column-major storage pattern is language
dependent.
It is not possible to dynamically allocate two-dimensional
arrays in C and C++.
(at least not without pointer storage and manipulation
overhead).
Row-major storage
Languages like C and C++ use what is often called a
row-major storage pattern for 2D matrices.
In C and C++, the last index in a multidimensional array
indexes contiguous memory locations. Thus a[i][j] and
a[i][j + 1] are adjacent in memory.
Example (an array with rows of length 5):
The stride between adjacent elements in the same row is 1.
The stride between adjacent elements in the same column
is the row length (5 in this example).
Column-major storage
In Fortran 2D matrices are stored in column-major form.
The first index in a multidimensional array indexes
contiguous memory locations. Thus a(i, j) and a(i + 1, j)
are adjacent in memory.
Example (an array with 2 rows):
The stride between adjacent elements in the same row is
the column length (2 in this example), while the stride
between adjacent elements in the same column is 1.
Notice that if C is used to read a matrix stored by Fortran
(or vice versa), the transposed matrix will be read.
Significance of array ordering
There are two main reasons why HPC programmers need to
be aware of this issue:
Memory access patterns can have a dramatic impact on
performance, especially on modern systems with a
complicated memory hierarchy.
The two code segments sketched below access the same
elements of an array, but the order of accesses is different.
Many important numerical libraries (e.g. LAPACK) are
written in Fortran. To use them from C or C++ the
programmer must often work with a transposed matrix.
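A minimal sketch of two such code segments (the array name a and the size N are illustrative): both functions visit the same N × N elements, but only the first traverses the row-major C array with stride 1 in the inner loop.

#define N 1000
static double a[N][N];

double sum_row_major(void)
{
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)   /* inner index j: contiguous memory, stride 1 */
            sum += a[i][j];
    return sum;
}

double sum_column_major(void)
{
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)   /* inner index i: jumps a whole row (stride N) per step */
            sum += a[i][j];
    return sum;
}

On most cached memory hierarchies the second version is noticeably slower, although it computes exactly the same result.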
Sparse storage formats
Efficient storage formats for sparse matrices
Obviously, there is no need to store the zeros of a
sparse matrix.
There are several ways to store a sparse matrix efficiently
Common sparse storage formats:
Coordinate
Diagonal
Compressed Sparse Row (CSR)
ELLPACK
Hybrid
Storage format COO
Coordinate (COO)
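As a minimal illustration (the 2 × 3 example matrix and the array names are not from the slides), COO stores one (row, column, value) triple per nonzero in three parallel arrays:

/* Example matrix   | 1 0 2 |
                    | 0 3 0 |          nnz = 3 nonzeros           */
int    coo_row[] = { 0, 0, 1 };        /* row index of each nonzero    */
int    coo_col[] = { 0, 2, 1 };        /* column index of each nonzero */
double coo_val[] = { 1.0, 2.0, 3.0 };  /* the nonzero values           */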
Storage format CSR
Compressed Sparse Row (CSR)
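A minimal CSR sketch (array names are illustrative): column indices and values are stored row by row, and row_ptr marks where each row starts, so a sparse matrix-vector product touches only the nonzeros. For the small 2 × 3 example above, row_ptr = {0, 2, 3}, col_idx = {0, 2, 1}, val = {1, 2, 3}.

/* y = A*x for an n-row matrix in CSR format:
   val[]     - nnz nonzero values, stored row by row
   col_idx[] - column index of each stored value
   row_ptr[] - start of each row in val[]; row_ptr[n] == nnz */
void spmv_csr(int n, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}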
Storage format DIA
Diagonal (DIA)
Storage format ELL
ELLPACK (ELL)
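A sketch of a matrix-vector product in ELL format, under an illustrative convention: every row is padded to K = maximum number of nonzeros per row, the arrays are laid out so that entry k of all rows is contiguous, and padded slots carry the column index −1.

/* y = A*x for an n-row matrix in ELLPACK format with row width K */
void spmv_ell(int n, int K, const int *col_idx, const double *val,
              const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = 0; k < K; k++) {
            int j = col_idx[k * n + i];      /* entry k of row i */
            if (j >= 0)                      /* skip padding */
                sum += val[k * n + i] * x[j];
        }
        y[i] = sum;
    }
}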
Storage format Hyb
Hybrid (HYB) ELL+COO
(CUSP only)
Use of formats
Formats comparison | structured

Structured matrix – bytes per non-zero entry

  Format   float    double
  COO      12.00    16.00
  CSR       8.45    12.45
  DIA       4.05     8.10
  ELL       8.11    12.16
  HYB       8.11    12.16
Formats comparison | unstructured

Unstructured matrix – bytes per non-zero entry

  Format   float    double
  COO       12.00    16.00
  CSR        8.62    12.62
  DIA      164.11   328.22
  ELL       11.06    16.60
  HYB        9.00    13.44
Formats comparison | random

Random matrix – bytes per non-zero entry

  Format   float    double
  COO       12.00    16.00
  CSR        8.42    12.42
  DIA       76.83   153.65
  ELL       14.20    21.29
  HYB        9.60    14.20
Formats comparison | power-law

Power-law matrix – bytes per non-zero entry

  Format   float    double
  COO       12.00    16.00
  CSR        8.74    12.73
  DIA      118.83   237.66
  ELL       49.88    74.82
  HYB       13.50    19.46
Matrix-matrix multiplication
Series of inner product (dot product) operations
Iterative row-oriented algorithm
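A minimal sketch of the row-oriented algorithm in C (square row-major matrices stored as 1D arrays; names are illustrative):

/* C = A * B for n x n row-major matrices */
void matmul(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double dot = 0.0;                /* inner product of row i of A ... */
            for (int k = 0; k < n; k++)      /* ... with column j of B          */
                dot += A[i * n + k] * B[k * n + j];
            C[i * n + j] = dot;
        }
}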
Block-matrix computation
Replace scalar multiplication with matrix multiplication
Replace scalar addition with matrix addition
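A sketch of one such block operation (block size bs, block indices bi, bj, bk; names are illustrative and n is assumed to be a multiple of bs): the scalar update c_ij += a_ik * b_kj becomes the block update C_ij += A_ik * B_kj.

/* C_(bi,bj) += A_(bi,bk) * B_(bk,bj) for bs x bs blocks inside
   n x n row-major matrices                                      */
void block_update(int n, int bs, int bi, int bj, int bk,
                  const double *A, const double *B, double *C)
{
    for (int i = bi * bs; i < (bi + 1) * bs; i++)
        for (int j = bj * bs; j < (bj + 1) * bs; j++) {
            double sum = C[i * n + j];
            for (int k = bk * bs; k < (bk + 1) * bs; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}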
Block-matrix computation
Recurse until B is small enough
First parallel algorithm
Partitioning
Divide matrices into rows
Each primitive task has corresponding rows of three
matrices
Communication
Each task must eventually see every row of B
Organize tasks into a ring
Agglomeration and mapping
Fixed number of tasks, each requiring the same amount of
computation
Regular communication among tasks
Strategy: Assign each process a contiguous group of rows
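A minimal MPI sketch of one step of this ring communication, assuming each process owns rows_local rows of B of length n (names are illustrative):

#include <mpi.h>

/* Pass the local rows of B one step around the ring:
   send to the right neighbour, receive from the left one. */
void rotate_B(double *B_local, int rows_local, int n, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int left  = (rank + size - 1) % size;
    int right = (rank + 1) % size;
    MPI_Status status;
    MPI_Sendrecv_replace(B_local, rows_local * n, MPI_DOUBLE,
                         right, 0, left, 0, comm, &status);
}

Calling this p − 1 times, with a local multiplication before each call, lets every process see every row of B.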
Communication of B
Complexity
The algorithm has p iterations.
During each iteration a process multiplies an
(n/p) × (n/p) block of A by an (n/p) × n block of B, i.e.
multiplications per process and iteration: O(n^3/p^2)
Total computation time: O(n^3/p)
Each process ends up passing
(p − 1)·n^2/p = O(n^2) elements of B
Weakness
Each process must access every element of matrix B
The ratio of computation to communication is poor:
only 2n/p
Cannon’s algorithm
Compute C = A × B with A, B ∈ R^(N×N),
distributed block-wise over a 2D grid of √P × √P processors.
C is stored in the same block-wise manner.
Processor (i,j) computes

  C_i,j = Σ_k A_i,k · B_k,j

i.e. it needs block-row i of A and block-column j of B.
Prepare with an alignment of the blocks,
then iterate over a
  rotation phase
  computation phase
Simple vs. Cannon’s Algorithm
Alignment
Matrix A
The block-rows of matrix A are shifted to the left.
This has to be done for each row, until the diagonal
blocks are placed in the first column.
Matrix B
The block-columns of matrix B are shifted upwards.
This has to be done for each column, until the diagonal
blocks are placed in the first row.
After this step processor (i,j) holds the blocks
  A_i,(j+i) mod √P   (row i has been shifted i times to the left)
  B_(j+i) mod √P,j   (column j has been shifted j times upwards)
Alignment phase
Figures: block distribution before and after the alignment.
Computation & Rotation
Phase I – Computation
After the alignment, every processor holds two
corresponding blocks.
In the computation phase the matrix-matrix multiplication
is carried out block-wise.
Phase II – Rotation
The blocks of A are shifted one position further to the left,
the blocks of B one position further upwards.
Example
Starting point
The blocks are distributed over P = 9 processors.
Processor (i,j) holds block (i,j).
Example
Rotation
Alignment has taken place
Example | Iteration
Complexity analysis| Cannon’s
The algorithm has √p iterations.
During each iteration a process multiplies two
(n/√p) × (n/√p) blocks: O(n^3/p^(3/2)) operations per iteration.
Total computation time: O(n^3/p)
During each iteration a process sends and receives two
blocks of size (n/√p) × (n/√p).
Communication complexity: O(n^2/√p)
The row-wise algorithm had:
  p iterations
  communication complexity O(n^2)
MPI-Code | Cannon’s alg.
/* Create grid and get coordinates */
int dims[2], periods[2] = {1, 1};
int mycoords[2];
dims[0] = (int)sqrt(num_procs);
dims[1] = num_procs / dims[0];
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &comm_2d);
MPI_Comm_rank(comm_2d, &my2drank);
MPI_Cart_coords(comm_2d, my2drank, 2, mycoords);

/* Local blocks of the matrices */
double *a, *b, *c;
/* Load a, b and c corresponding to the coordinates */
MPI-Code | Cannon’s alg.
/* Matrix rotation A */
MPI_Cart_shift(comm_2d, 0, -mycoords[0], &shiftsource, &shiftdest);
MPI_Sendrecv_replace(a, nlocal*nlocal, MPI_DOUBLE,
                     shiftdest, 77, shiftsource, 77, comm_2d, &status);

/* Matrix rotation B */
MPI_Cart_shift(comm_2d, 1, -mycoords[1], &shiftsource, &shiftdest);
MPI_Sendrecv_replace(b, nlocal*nlocal, MPI_DOUBLE,
                     shiftdest, 77, shiftsource, 77, comm_2d, &status);
MPI-Code | Cannon’s alg.
/* Find left and upper neighbour */
MPI_Cart_shift(comm_2d, 0, -1, &rightrank, &leftrank);
MPI_Cart_shift(comm_2d, 1, -1, &downrank, &uprank);

for (i = 0; i < dims[0]; i++)
{
    dgemm(nlocal, a, b, c);   /* c = c + a * b */

    /* Shift matrix A to the left */
    MPI_Sendrecv_replace(a, nlocal*nlocal, MPI_DOUBLE,
                         leftrank, 77, rightrank, 77, comm_2d, &status);

    /* Shift matrix B upwards */
    MPI_Sendrecv_replace(b, nlocal*nlocal, MPI_DOUBLE,
                         uprank, 77, downrank, 77, comm_2d, &status);
}
/* A and B are back in their original placement */
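The call dgemm(nlocal, a, b, c) above stands for a local block multiply-accumulate; a minimal sketch, assuming row-major nlocal × nlocal blocks (this helper is not the BLAS routine of the same name):

/* c = c + a * b for row-major nlocal x nlocal blocks */
void dgemm(int nlocal, const double *a, const double *b, double *c)
{
    for (int i = 0; i < nlocal; i++)
        for (int j = 0; j < nlocal; j++) {
            double sum = c[i * nlocal + j];
            for (int k = 0; k < nlocal; k++)
                sum += a[i * nlocal + k] * b[k * nlocal + j];
            c[i * nlocal + j] = sum;
        }
}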
Virtual topologies in MPI
MPI allows one to associate virtual topologies with a communicator, for instance a
Cartesian topology
Graph topology
Creating a topology produces a new communicator
MPI provides “mapping functions”
Mapping functions compute processor ranks, based on
the topology naming scheme
Cartesian topology
Each process is connected to its neighbours in a virtual
grid
Boundaries can be cyclic
Processes can be identified by Cartesian coordinates
Creating a Cartesian topology
int MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims,
                    int *periods, int reorder, MPI_Comm *comm_cart)

comm_old   existing communicator
ndims      number of dimensions
dims       array of size ndims giving the number of processes per dimension
periods    logical array indicating whether a dimension is
           cyclic (true ⇒ cyclic boundary conditions)
reorder    logical flag
           false ⇒ ranks preserved
           true ⇒ possible rank reordering
comm_cart  new Cartesian communicator
Cartesian example
MPI_Comm vu;
int dim[2], period[2], reorder;

dim[0] = 4;  dim[1] = 3;

period[0] = 1;   /* first dimension is cyclic      */
period[1] = 0;   /* second dimension is not cyclic */
reorder = 1;

MPI_Cart_create(MPI_COMM_WORLD,
                2, dim, period, reorder, &vu);
Mapping coordinates to ranks
int MPI_Cart_rank(MPI_Comm comm, int *coords, int *rank)

Mapping ranks to coordinates

int MPI_Cart_coords(MPI_Comm comm, int rank,
                    int maxdims, int *coords)
Cartesian example
#include <mpi.h>     /* Run with 12 processes */
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank;
  MPI_Comm vu;
  int dim[2], period[2], reorder, coord[2], id;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  dim[0] = 4;  dim[1] = 3;
  period[0] = 1;  period[1] = 0;  reorder = 1;
  MPI_Cart_create(MPI_COMM_WORLD, 2, dim, period, reorder, &vu);

  if (rank == 5) {
    MPI_Cart_coords(vu, rank, 2, coord);
    printf("P:%d My coordinates are %d %d\n", rank, coord[0], coord[1]);
  }
  if (rank == 0) {
    coord[0] = 3;  coord[1] = 1;
    MPI_Cart_rank(vu, coord, &id);
    printf("Proc. at position (%d, %d) has rank %d\n", coord[0], coord[1], id);
  }
  MPI_Finalize();
  return 0;
}
Neighbouring ranks
output
Proc. at position (3,1) has rank 10
P:5 My coordinates are 1 2
Computing ranks of neighbouring processors
int MPI_Cart_shift (MPI_Comm comm, int direction,
int disp, int *rank_source, int *rank_dest)
Does not actually shift data: returns the correct ranks for a
shift that can be used in subsequent communication calls
Neighbouring ranks
Arguments
direction    dimension in which the shift should be made
disp         length of the shift in processor coordinates (positive or negative)
rank_source  rank from which the calling process receives a
             message during the shift
rank_dest    rank to which the calling process sends a
             message during the shift
If we shift off the edge of a non-periodic topology,
MPI_PROC_NULL is returned.
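A minimal usage sketch (the communicator cart is assumed to have been created with MPI_Cart_create); sends and receives addressed to MPI_PROC_NULL are simply ignored:

#include <mpi.h>

/* Query both neighbours of the calling process along dimension 0.
   rank_source lies one step before us, rank_dest one step after us;
   on a non-periodic edge the returned rank is MPI_PROC_NULL.        */
void neighbours_dim0(MPI_Comm cart, int *rank_source, int *rank_dest)
{
    MPI_Cart_shift(cart, 0, 1, rank_source, rank_dest);
}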