GPU computing

Jochen Gerhard
Institut für Informatik
Frankfurt Institute for Advanced Studies
Overview
• How is a GPU structured? (Roughly)
• How does manycore programming work compared to multicore?
• How can one access the GPU from Python?
• Some details about the structure of OpenCL programs.
• How to make the GPU do what you want?
Hardware
• A modern computer has more than just a CPU:
• More than one socket and more than one core anyway.
• But also graphics cards, sometimes even more than one.
• Wouldn't it be nice to harvest all that computing power?
HPC for the poor man: GPUs
• As an example, let's take my MacBook Pro 5.1:
• 1x Intel Core 2 Duo @ 2.4 GHz: Galaxy Benchmark ~12 GFLOPS
• NVIDIA GeForce 9600M GT: Galaxy Benchmark ~40 GFLOPS
• Why not harvest both of them?
• By writing just one program?
The CPU
• n cores, n rather small (fewer than 32)
• each with a private L1 cache
• pairwise/quadwise shared L2 cache
• shared L3 cache
• Slow access to system memory
• Good at many different things
Multicore
• Compute different (rather complicated) tasks.
• Each core may even run a different program.
• Complicated hardware.
• Sometimes share information. (messages)
The GPU
• Lots of cores.
• Not so much memory.
• Pretty simple.
• Good at number crunching.
Manycore
• Compute (almost) always the same task.
• Different groups can work on slightly different branches.
• Arrange within groups only.
• Simpler hardware.
• Hardware overview from AMD's Programming Guide (*).
• To use the former picture:
• 5 chickens always stay together. (processing elements)
• Each coop contains 16 of these cliques. (compute units)
From the guide: stream cores "in turn, contain numerous processing elements, which are the fundamental, programmable computational units that perform integer, single-precision floating-point, double-precision floating-point, and transcendental operations. All stream cores within a compute unit execute the same instruction sequence; different compute units can execute different instructions."
[Figure 1.2: Simplified block diagram of the GPU compute device, showing the ultra-threaded dispatch processor, compute units, stream cores with (T-)processing elements, a branch execution unit, and general-purpose registers. Much of this is transparent to the programmer.]
The GPU’s hierarchy
• Lots of cores (e.g. ATI Radeon 5870):
• 20 compute units,
• each with 16 stream cores,
• each with (4+1) processing elements.
• In total: 1600 single-precision (SP) units, 320 double-precision (DP) units, and 320 special-function (SF) units (20 x 16 x 5 = 1600).
GPU / CPU
• GPU ↔ Mainboard
• Compute Unit ↔ CPU / Socket
• Stream Core ↔ Core
• Processing Element ↔ FPU
GPU Performance
• Processing at 850 MHz => theoretical peak performance of 1.36 TFLOPS (for $299).
• Not so much memory. (1/2 GB accessible without tricks)
• Fast global memory on the GPU.
• 8 kB L1 (read-only) cache per compute unit.
• Very fast local memory. (32 kB with almost no latency for each compute unit)
The OpenCL Platform
• One host.
• Various compute devices: GPUs and CPUs.
• Each consisting of compute units: cores / SIMD engines.
• These are again divided into processing elements: processing elements / FPUs.
• Platform overview from AMD Programming Guide (*).
[Figure 3.1: Platform model: one host plus one or more compute devices, each with one or more compute units, each with one or more processing elements.]
Organizing OpenCL
• First a platform has to be chosen:
• platform = implementation of OpenCL
• #platforms ≥ 1 (like Apple + Nvidia on my laptop)
• Then you have to query the devices that can be accessed by means of this platform.
• In a context, devices are tied together. It is used to manage buffers, programs, and kernels.
• You perform actions on these objects in queues. (See the sketch below.)
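A minimal PyOpenCL sketch of these steps (assuming at least one installed platform with a GPU device):

import pyopencl as cl

platform = cl.get_platforms()[0]                                  # choose a platform
devices = platform.get_devices(device_type=cl.device_type.GPU)    # query its devices
context = cl.Context(devices)                                     # tie the devices together
queue = cl.CommandQueue(context, device=devices[0])               # actions are enqueued here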
Common usage
• Take the first platform you get!
• Put your GPU as the only DEVICE into the CONTEXT.
• Have one command QUEUE connected with your GPU.
Organizing OpenCL II
• Though possible, I would not recommend using more than one platform.
• If you want more than one GPU to work on the same memory, they have to share a context!
• When different devices share a context, the buffers share the device constraints.
Memory
• Memory is managed in so-called Buffers.
• Buffers are bound to a context.
• They have to be declared (size and specifiers).
• You get your data in and out via a copy command in the queue.
• You may also just give a pointer to host memory. (See the sketch below.)
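A minimal sketch of both variants (context and queue as set up before; the array size is illustrative):

import numpy as np
import pyopencl as cl

host_data = np.arange(1024, dtype=np.float32)                      # data on the host
mf = cl.mem_flags

# declare a buffer by size and specifiers, then copy in and out via the queue
dev_buf = cl.Buffer(context, mf.READ_WRITE, size=host_data.nbytes)
cl.enqueue_copy(queue, dev_buf, host_data)                         # host -> device
cl.enqueue_copy(queue, host_data, dev_buf).wait()                  # device -> host

# or just hand the buffer a pointer to host memory
host_buf = cl.Buffer(context, mf.READ_ONLY | mf.USE_HOST_PTR, hostbuf=host_data)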
Execution model
• OpenCL programs are sets of functions written in a C99 derivative.
• The functions that can be executed directly from the queue are called kernels.
• Kernels operate on every element of an input stream independently.
• This is orchestrated by the NDRange argument.
Kernels
• Kernels are functions you put into the command queue.
• Essentially, within a kernel you explain what each "chicken" (work unit) has to do!
• All work units will do the same thing, as written in the kernel!
Orchestrating the kernels
• Kernels are put into the command queue.
• Before enqueueing a kernel, one has to specify where the kernel parameters point to.
• Kernels are enqueued with an NDRange argument:
• it gives an N-dimensional range.
The NDRange
• Gives the number of work items for the kernel.
• Can be organized geometrically:
• e.g. 1024x1024 work items, suited to the problem size.
• Can be subdivided into workgroups:
• e.g. 128x128 workgroups, each having 8x8 work items. (See the sketch below.)
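For instance (the kernel and buffer names are hypothetical), such a 2-D NDRange is enqueued in PyOpenCL like this:

global_size = (1024, 1024)   # all work items, suited to the problem size
local_size = (8, 8)          # work items per workgroup -> 128x128 workgroups
program.my_kernel(queue, global_size, local_size, some_buffer)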
The NDRange: Chicken Version
• NDRange specifies how many chickens you want to work.
• You can organize them geometrically. (16 = 4x4)
• You can also group them together.
Why workgroups?
• All work items of a workgroup are executed on the same compute unit.
• They share the local memory, which is tremendously fast. ("chickens within a coop")
• Only within a workgroup can you synchronize.
• The next finer granularity is the wavefront. The execution stream within a wavefront is uniform, so branching within it is extremely expensive.
Why workgroups: Chicken version
• All chickens within the same group reside in the same coop.
• They share the same bowl, which is much nearer than the global bowl for everyone.
• They wait for each other when going to the local or global bowl. (synchronization)
• Next finer granularity: chickens will all do the same! So if, in the same wavefront, one chicken has to add and another has to subtract, they all will do both!
Hands on
import pyopencl as cl
import numpy as np

src = """__kernel void set_it(__global int* data) {
    int GID = get_global_id(0);
    data[GID] = GID;
}"""

if __name__ == "__main__":
    context = cl.create_some_context()
    queue = cl.CommandQueue(context)
    GLOBAL_SIZE = (32,)
    LOCAL_SIZE = (1,)  # can also say None here
    data = np.zeros(GLOBAL_SIZE[0], dtype=np.int32)  # data on host
    data_Buffer = cl.Buffer(context, cl.mem_flags.READ_WRITE, size=data.nbytes)  # data on device
    print(data)
    program = cl.Program(context, src).build()
    program.set_it(queue, GLOBAL_SIZE, LOCAL_SIZE, data_Buffer).wait()
    cl.enqueue_copy(queue, data, data_Buffer).wait()
    print(data)
The OpenCL part

src = """__kernel void set_it(__global int* data) {
    int GID = get_global_id(0);
    data[GID] = GID;
}"""

• Is a Python string.
• Contains only one function, which is a kernel: __kernel.
• Has one parameter, data, which is a __global reachable array of int.
• Each work unit first gets its global id in the x-direction (0),
• and sets "its" entry to its GID.
The Python part I

context = cl.create_some_context()
queue = cl.CommandQueue(context)
GLOBAL_SIZE = (32,)
LOCAL_SIZE = (1,)
data = np.zeros(GLOBAL_SIZE[0], dtype=np.int32)
data_Buffer = cl.Buffer(context, cl.mem_flags.READ_WRITE, size=data.nbytes)

• platform / device / context is all managed by magic: create_some_context().
• The queue is to be initialized with the given context.
• Declare how many work units you want.
• Here we use 32 x 1 work units.
• We need representations of the data on host and on device.
The Python part II

program = cl.Program(context, src).build()
program.set_it(queue, GLOBAL_SIZE, LOCAL_SIZE, data_Buffer).wait()
cl.enqueue_copy(queue, data, data_Buffer).wait()

• First we build the program from source and according to its context.
• From the context, the compiler knows the device architecture.
• One can also pass compiler options here. (e.g. include files!)
• Every kernel becomes a method of the program object.
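For example, build options could be passed like this (the include directory and the LDIM define are placeholders):

program = cl.Program(context, src).build(options="-I ./include -D LDIM=16")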
The Python part III

program = cl.Program(context, src).build()
program.set_it(queue, GLOBAL_SIZE, LOCAL_SIZE, data_Buffer).wait()
cl.enqueue_copy(queue, data, data_Buffer).wait()

• We pass the queue, the NDRange, and the kernel parameters.
• The .wait() ensures we wait for completion.
• The last step is getting the data out of data_Buffer into the Numpy array data.
• We .wait() till this is finished too.
Backup
Synchronization
• Within a workgroup:
• barrier(CLK_LOCAL_MEM_FENCE)
• barrier(CLK_GLOBAL_MEM_FENCE)
• In a queue:
• .wait() waits for the event to be computed.
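As an illustrative sketch (kernel name invented, workgroup size fixed to 8), a kernel in which every work item writes to local memory and, only after a barrier, reads its neighbours' entries:

sync_src = """
__kernel void reverse_in_group(__global float* data) {
    __local float tmp[8];                  // one slot per work item in the group
    int lid = get_local_id(0);
    tmp[lid] = data[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);          // the whole group has written before anyone reads
    data[get_global_id(0)] = tmp[get_local_size(0) - 1 - lid];
}
"""
# enqueue with a local size of (8,): each workgroup then reverses its own block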
Synchronization
• There is no global synchronization between work units.
• Chickens never wait for chickens in other groups.
A template
import pyopencl as cl
import numpy as np
from os import putenv

# For AMD GPUs one has to set the environment variable DISPLAY to ":0"
putenv("DISPLAY", ":0")

def loadProgram(filename):
    """ Gives the src-code of a given file """
    srcFile = open(filename, 'r')
    src = "".join(srcFile.readlines())
    return src

platform = cl.get_platforms()[0]
try:
    mydevices = platform.get_devices(device_type=cl.device_type.GPU)
except:
    mydevices = platform.get_devices(device_type=cl.device_type.ALL)
mydevice = mydevices[0]
ctx = cl.Context(mydevices)
queue = cl.CommandQueue(ctx, device=mydevice)
src = loadProgram("./mycode.cl")
A practical example
__kernel void naive_mul(__global float* A, __global float* B, __global float* C) {
    const int xid = get_global_id(0);
    const int yid = get_global_id(1);
    const int dim = get_global_size(0);
    __private float c = 0.f;
    for (int k = 0; k < dim; ++k)
        c += A[k + yid * dim] * B[k * dim + xid];
    C[xid + yid * dim] = c;
}

• Naive matrix multiplication (using only global memory).
• Still approximately 300 times faster than numpy.dot(A, B) for A, B 1024 x 1024 single-precision matrices. (On an ATI Radeon 5870.)
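A sketch of a host-side driver for naive_mul, reusing the loadProgram helper from the template; the file name ./matmul.cl and the matrix size are placeholders:

import numpy as np
import pyopencl as cl

N = 1024
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)
C = np.empty_like(A)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags
A_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=A)
B_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=B)
C_buf = cl.Buffer(ctx, mf.WRITE_ONLY, size=C.nbytes)

prg = cl.Program(ctx, loadProgram("./matmul.cl")).build()        # file contains naive_mul
prg.naive_mul(queue, (N, N), None, A_buf, B_buf, C_buf).wait()   # one work item per entry of C
cl.enqueue_copy(queue, C, C_buf).wait()                          # C should now match np.dot(A, B)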
[Figure slides: Global Matrix Multiplication]
#define LDIM 16
__kernel void local_mul(__global float* A, __global float* B, __global float* C) {
    const int LX = get_local_id(0);
    const int LY = get_local_id(1);
    const int WX = get_group_id(0);
    const int WY = get_group_id(1);
    const int DIM = get_global_size(0);
    const int TILES = get_num_groups(0);
    __local float Al[LDIM][LDIM], Bl[LDIM][LDIM];
    __private float cl;
    cl = 0.f; // make sure it's zero!
    for (int k = 0; k < TILES; ++k) {
        Al[LY][LX] = A[LX + LY * DIM + k * LDIM + WY * LDIM * DIM];
        Bl[LX][LY] = B[LX + LY * DIM + k * LDIM * DIM + WX * LDIM]; // transpose here
        barrier(CLK_LOCAL_MEM_FENCE);
        for (int kk = 0; kk < LDIM; ++kk)
            cl += Al[LY][kk] * Bl[LX][kk]; // transpose here
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    C[get_global_id(0) + get_global_id(1) * DIM] = cl;
}
[Figure slides: Local Matrix Multiplication]
1st step
• Copy data from __global memory to __local memory.
• Each work item copies one entry per matrix (A, B) per round (k++) from global to local memory.
2nd step
• Now all memory accesses are within local memory.
• Each work item in the workgroup computes like in the global example.
Metaprogramming
• We can use Python to modify the OpenCL source before compiling:
• src = "#define LDIM 16\n"
• src += loadFile("matmul.cl")
• Or: src = "#define LDIM %i\n" % ldim
• where ldim is set in Python before... (see the sketch below)
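Put together (a sketch: ldim is chosen on the Python side, and loadFile stands in for any helper that returns the kernel source without its own #define line):

ldim = 16
src = "#define LDIM %i\n" % ldim       # tile size chosen in Python
src += loadFile("matmul.cl")           # append the kernel source
program = cl.Program(context, src).build()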
Resumé
• Accessing the GPU from Python is quite easy.
• PyOpenCL works perfectly with Numpy.
• If you consider porting some slow routines to C (e.g. using Cython), you should probably consider OpenCL.
• First (even practical!) routines are easily implemented.
Introductory Documents
• (*) Programming Guide: AMD Accelerated Parallel Processing OpenCL
• http://www.khronos.org/developers/library/overview/opencl_overview.pdf
• http://mathema.tician.de/software/pyopencl
• http://www.khronos.org/registry/cl/