GPU computing
Jochen Gerhard, Institut für Informatik, Frankfurt Institute for Advanced Studies
Wednesday, June 29, 2011

Overview
• How is a GPU structured? (Roughly.)
• How does manycore programming compare to multicore programming?
• How can one access the GPU from Python?
• Some details about the structure of OpenCL programs.
• How do you make the GPU do what you want?

Hardware
• A modern computer has more than just a CPU:
• more than one socket and more than one core anyway,
• but also graphics cards, sometimes even more than one.
• Wouldn't it be nice to harvest all that computing power?

HPC for the poor man: GPUs
• As an example, take my MacBook Pro 5.1:
• 1x Intel Core 2 Duo @ 2.4 GHz — Galaxy benchmark ~12 GFLOPS
• NVIDIA GeForce 9600M GT — Galaxy benchmark ~40 GFLOPS
• Why not harvest both of them?
• With just one program to write?

The CPU
• n cores, with n rather small (fewer than 32),
• each with a private L1 cache,
• pairwise/quadwise shared L2 caches,
• a shared L3 cache,
• slow access to system memory.
• Good at many different things.

Multicore
• Cores compute different (rather complicated) tasks.
• Each core may even run a different program.
• Cores sometimes share information (messages).
• Complicated hardware.

The GPU
• Lots of cores.
• Not so much memory.
• Pretty simple.
• Good at number crunching.

Manycore
• (Almost) always computes the same task.
• Different groups can work on slightly different branches.
• Arrangements happen within groups only.
• Simpler hardware.

Hardware overview (from AMD's Programming Guide (*))
• Stream cores, in turn, contain numerous processing elements, which are the fundamental, programmable computational units that perform integer, single-precision floating-point, double-precision floating-point, and transcendental operations.
• All stream cores within a compute unit execute the same instruction sequence; different compute units can execute different instructions.
• To use the chicken picture: 5 chickens always stay together (the processing elements of a stream core).
• Each coop contains 16 of these cliques (a compute unit).
[Figure 1.2, Simplified Block Diagram of the GPU Compute Device: an ultra-threaded dispatch processor feeds the compute units; each stream core holds processing elements, a T-processing element, general-purpose registers, and a branch execution unit. Much of this is transparent to the programmer.]

The GPU's hierarchy
• Lots of cores, e.g. on an ATI Radeon 5870:
• 20 compute units,
• each with 16 stream cores,
• each with (4+1) processing elements.
• In total: 1600 SP units, 320 DP units, 320 SF units.

GPU / CPU
• GPU ↔ mainboard
• compute unit ↔ CPU / socket
• stream core ↔ core
• processing element ↔ FPU

The GPU performance
• Processing at 850 MHz => a theoretical peak performance of 1.36 TFLOPS (for $299).
• Not so much memory (1/2 GB accessible without tricks).
• Fast global memory on the GPU.
• 8 kB L1 (read-only) cache per compute unit.
• Very fast local memory (32 kB with almost no latency per compute unit).
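To check these numbers on your own machine, you can query the device attributes that PyOpenCL exposes. A minimal sketch — the reported values will of course differ from the Radeon 5870 figures above, depending on your hardware and driver:

    import pyopencl as cl

    # List every device of the first platform together with the limits
    # that matter for the discussion above.
    platform = cl.get_platforms()[0]
    for device in platform.get_devices():
        print(device.name)
        print("  compute units :", device.max_compute_units)
        print("  clock         :", device.max_clock_frequency, "MHz")
        print("  global memory :", device.global_mem_size // 1024**2, "MB")
        print("  local memory  :", device.local_mem_size // 1024, "kB")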
The OpenCL Platform
• One host.
• Various compute devices (GPUs and CPUs),
• each consisting of compute units (cores / SIMD engines),
• which are again divided into processing elements (FPUs).
• Platform overview from the AMD Programming Guide (*).
[Figure 3.1, Platform model: one host plus one or more compute devices, each with one or more compute units, each with one or more processing elements.]

Organizing OpenCL
• First a platform has to be chosen:
• platform = implementation of OpenCL, #platforms ≥ 1 (like Apple + NVIDIA on my laptop).
• Then you query the devices that can be accessed by means of this platform.
• In a context, devices are tied together. It is used to manage buffers, programs, and kernels.
• You perform actions on these objects in queues.

Common usage
• Take the first platform you get!
• Put your GPU as the only DEVICE into the CONTEXT.
• Have one command QUEUE connected to your GPU.

Organizing OpenCL II
• Though possible, I would not recommend using more than one platform.
• If you want more than one GPU to work on the same memory, they have to share a context!
• When different devices share a context, the buffers share the device constraints.

Memory
• Memory is managed in so-called buffers.
• Buffers are bound to a context.
• They have to be declared (size and specifiers).
• You get your data in and out via a copy command in the queue.
• You may also just pass a pointer to host memory.

Execution model
• OpenCL programs are sets of functions written in a C99 derivative.
• The functions that are executed directly from the queue are called kernels.
• Kernels operate on every element of an input stream independently.
• This is orchestrated by the NDRange argument.

Kernels
• Kernels are the functions you put into the command queue.
• Essentially, within a kernel you explain what each "chicken" (work item) has to do!
• All work items will do the same thing, as written in the kernel!

Orchestrating the kernels
• Kernels are put into the command queue.
• Before enqueueing a kernel, one has to specify where the kernel parameters point to.
• Kernels are enqueued with an NDRange argument:
• it gives an N-dimensional range.

The NDRange
• Gives the number of work items for the kernel.
• Can be organized geometrically: e.g. 1024x1024 work items, suited to the problem size.
• Can be subdivided into workgroups: e.g. 128x128 workgroups, each having 8x8 work items.
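To make the index space concrete, here is a small sketch of a kernel (the name who_am_i and the sizes are made up for illustration) that records each work item's global, group, and local id; in every dimension, global id = group id × local size + local id:

    import numpy as np
    import pyopencl as cl

    src = """
    __kernel void who_am_i(__global int* gid, __global int* grp, __global int* lid)
    {
        int i = get_global_id(0);
        gid[i] = i;                  // position within the whole NDRange
        grp[i] = get_group_id(0);    // which workgroup ("coop") the item belongs to
        lid[i] = get_local_id(0);    // position inside that workgroup
    }
    """

    context = cl.create_some_context()
    queue = cl.CommandQueue(context)
    program = cl.Program(context, src).build()

    n = 32
    bufs = [cl.Buffer(context, cl.mem_flags.WRITE_ONLY, size=n * 4) for _ in range(3)]
    program.who_am_i(queue, (n,), (8,), *bufs)   # 32 work items in workgroups of 8

    results = [np.empty(n, dtype=np.int32) for _ in range(3)]
    for host, dev in zip(results, bufs):
        cl.enqueue_copy(queue, host, dev)
    queue.finish()

    print(results[0])   # 0 1 2 ... 31
    print(results[1])   # eight 0s, eight 1s, eight 2s, eight 3s
    print(results[2])   # 0..7, repeated four times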
The NDRange: chicken version
• The NDRange specifies how many chickens you want to put to work.
• You can organize them geometrically (16 = 4x4).
• You can also group them together.

Why workgroups?
• All work items of a workgroup are executed on the same compute unit.
• They share the local memory, which is tremendously fast ("chickens within a coop").
• Only within a workgroup can you synchronize.
• The next finer granularity is the wavefront. The execution stream within a wavefront is uniform, so branching within it is extremely expensive.

Why workgroups: chicken version
• All chickens of the same group reside in the same coop.
• They share the same bowl, which is much nearer than the global bowl for everyone.
• They wait for each other when going to the local or global bowl (synchronization).
• Next finer granularity: those chickens will all do the same! So if, within the same wavefront, one chicken has to add and another has to subtract, they all will do both!

Hands on

    import pyopencl as cl
    import numpy as np

    src = """
    __kernel void set_it(__global int* data)
    {
        int GID = get_global_id(0);
        data[GID] = GID;
    }
    """

    if __name__ == "__main__":
        context = cl.create_some_context()
        queue = cl.CommandQueue(context)

        GLOBAL_SIZE = (32,)
        LOCAL_SIZE = (1,)   # can also say None here

        data = np.zeros(GLOBAL_SIZE[0], dtype=np.int32)  # data on the host
        data_Buffer = cl.Buffer(context, cl.mem_flags.READ_WRITE,
                                size=data.nbytes)        # data on the device
        print(data)

        program = cl.Program(context, src).build()
        program.set_it(queue, GLOBAL_SIZE, LOCAL_SIZE, data_Buffer).wait()
        cl.enqueue_copy(queue, data, data_Buffer).wait()
        print(data)

The OpenCL part

    src = """
    __kernel void set_it(__global int* data)
    {
        int GID = get_global_id(0);
        data[GID] = GID;
    }
    """

• src is a Python string.
• It contains only one function, which is a kernel: __kernel.
• It has one parameter, data, a __global reachable array of int.
• Each work item first gets its global id in the x-direction (dimension 0),
• then sets "its" entry to its GID.

The Python part I

    context = cl.create_some_context()
    queue = cl.CommandQueue(context)
    GLOBAL_SIZE = (32,)
    LOCAL_SIZE = (1,)
    data = np.zeros(GLOBAL_SIZE[0], dtype=np.int32)
    data_Buffer = cl.Buffer(context, cl.mem_flags.READ_WRITE, size=data.nbytes)

• Platform / device / context is all managed by magic: create_some_context().
• The queue is initialized with the given context.
• Declare how many work items you want: here 32 work items in workgroups of size 1.
• We need representations of the data on the host and on the device.

The Python part II

    program = cl.Program(context, src).build()
    program.set_it(queue, GLOBAL_SIZE, LOCAL_SIZE, data_Buffer).wait()
    cl.enqueue_copy(queue, data, data_Buffer).wait()

• First we build the program from source and according to its context.
• From the context the compiler knows the device architecture.
• One can also pass compiler options here (e.g. include files!).
• Every kernel becomes a method of the program object.

The Python part III

    program = cl.Program(context, src).build()
    program.set_it(queue, GLOBAL_SIZE, LOCAL_SIZE, data_Buffer).wait()
    cl.enqueue_copy(queue, data, data_Buffer).wait()

• We pass the queue, the NDRange, and the kernel parameters.
• The .wait() ensures we wait for completion.
• The last step copies the data out of data_Buffer into the NumPy array data.
• We .wait() until this is finished, too.
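How long does such a kernel actually run on the device? PyOpenCL exposes OpenCL's event profiling for that. A rough sketch, assuming the same set_it kernel as above, here on a larger array; profiling has to be requested when the queue is created:

    import numpy as np
    import pyopencl as cl

    src = """
    __kernel void set_it(__global int* data)
    {
        data[get_global_id(0)] = get_global_id(0);
    }
    """

    context = cl.create_some_context()
    # Profiling must be enabled when the command queue is created.
    queue = cl.CommandQueue(
        context, properties=cl.command_queue_properties.PROFILING_ENABLE)

    n = 1 << 20
    data_Buffer = cl.Buffer(context, cl.mem_flags.READ_WRITE,
                            size=n * np.dtype(np.int32).itemsize)

    program = cl.Program(context, src).build()
    event = program.set_it(queue, (n,), None, data_Buffer)
    event.wait()

    # Device-side timestamps are reported in nanoseconds.
    elapsed_ns = event.profile.end - event.profile.start
    print("kernel time: %.3f ms" % (elapsed_ns * 1e-6))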
Backup

Synchronization
• Within a workgroup:
• barrier(CLK_LOCAL_MEM_FENCE)
• barrier(CLK_GLOBAL_MEM_FENCE)
• In a queue:
• .wait() waits for the event being computed.

Synchronization II
• There is no global synchronization between work items.
• Chickens never wait for chickens in other groups.

A template

    import pyopencl as cl
    import numpy as np
    from os import putenv

    # For AMD GPUs one has to set the environment variable DISPLAY to :0
    putenv("DISPLAY", ":0")

    def loadProgram(filename):
        """Gives the source code of a given file."""
        with open(filename, 'r') as srcFile:
            return srcFile.read()

    platform = cl.get_platforms()[0]
    try:
        mydevices = platform.get_devices(device_type=cl.device_type.GPU)
    except:
        mydevices = platform.get_devices(device_type=cl.device_type.ALL)
    mydevice = mydevices[0]

    ctx = cl.Context(mydevices)
    queue = cl.CommandQueue(ctx, device=mydevice)

    src = loadProgram("./mycode.cl")

A practical example

    __kernel void naive_mul(__global float* A,
                            __global float* B,
                            __global float* C)
    {
        const int xid = get_global_id(0);
        const int yid = get_global_id(1);
        const int dim = get_global_size(0);

        __private float c = 0.f;
        for (int k = 0; k < dim; ++k)
            c += A[k + yid * dim] * B[k * dim + xid];
        C[xid + yid * dim] = c;
    }

• Naive matrix multiplication (using only global memory).
• Still approximately 300 times faster than numpy.dot(A, B) for 1024 x 1024 single-precision matrices A, B (on an ATI Radeon 5870).

[Illustration slides: Global Matrix Multiplication.]

    #define LDIM 16

    __kernel void local_mul(__global float* A,
                            __global float* B,
                            __global float* C)
    {
        const int LX = get_local_id(0);
        const int LY = get_local_id(1);
        const int WX = get_group_id(0);
        const int WY = get_group_id(1);
        const int DIM = get_global_size(0);
        const int TILES = get_num_groups(0);

        __local float Al[LDIM][LDIM], Bl[LDIM][LDIM];
        __private float cl;
        cl = 0.f;  // make sure it's zero!

        for (int k = 0; k < TILES; ++k)
        {
            Al[LY][LX] = A[LX + LY * DIM + k * LDIM + WY * LDIM * DIM];
            Bl[LX][LY] = B[LX + LY * DIM + k * LDIM * DIM + WX * LDIM];  // transpose here
            barrier(CLK_LOCAL_MEM_FENCE);

            for (int kk = 0; kk < LDIM; ++kk)
                cl += Al[LY][kk] * Bl[LX][kk];  // transposed here
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        C[get_global_id(0) + get_global_id(1) * DIM] = cl;
    }

[Illustration slides: Local Matrix Multiplication.]

1st step
• Copy data from __global memory to __local memory.
• Each work item copies one entry per matrix (A, B) per round (k++) from global to local memory.

2nd step
• Now all memory accesses are within local memory.
• Each work item in the workgroup computes as in the global example.

Metaprogramming
• We can use Python to modify the OpenCL source before compiling:
• src = "#define LDIM 16\n"
• src += loadFile("matmul.cl")
• or src = "#define LDIM %i\n" % ldim,
• where ldim is set in Python beforehand.
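Put together, the idea looks roughly like this — a sketch in which matmul.cl is assumed to contain the local_mul kernel from above without a #define line of its own:

    import pyopencl as cl

    ldim = 16   # tile size chosen in Python; must match the workgroup size

    # matmul.cl is assumed to hold the local_mul kernel, without its own #define.
    with open("matmul.cl") as srcFile:
        kernel_src = srcFile.read()

    src = "#define LDIM %i\n" % ldim + kernel_src

    context = cl.create_some_context()
    program = cl.Program(context, src).build()
    # local_mul can now be enqueued with LOCAL_SIZE = (ldim, ldim).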
Resumé
• Accessing the GPU from Python is quite easy.
• PyOpenCL works perfectly with NumPy.
• If you consider porting some slow routines to C (e.g. using Cython), you should probably consider OpenCL instead.
• First (even practical!) routines are easily implemented.

Introductory Documents
• (*) Programming Guide: AMD Accelerated Parallel Processing OpenCL
• http://www.khronos.org/developers/library/overview/opencl_overview.pdf
• http://mathema.tician.de/software/pyopencl
• http://www.khronos.org/registry/cl/