Institute for Program Structures and Data Organization
Universität Karlsruhe (TH) — research university, founded 1825
OpenMP Programming: Performance Analysis
Multicore Lab, Winter Semester 2006/07
Fakultät für Informatik, Lehrstuhl für Programmiersysteme — Ali Jannesari

Contents
• What is OpenMP?
• Parallel regions
• Work-sharing constructs
• Visibility / protection of private data
• Synchronization constructs
• Loop scheduling
• Other useful constructs
• Performance considerations
• Summary of clauses / directives
• Environment variables

Performance Considerations
• Waiting threads do no useful work.
• Work should be divided among the threads as evenly as possible.
• The threads should all finish their parallel tasks at the same time.
• Synchronization may be necessary.
• However: the time a thread spends waiting for a protected resource must be minimized.

Uneven Load Distribution (1)
• An uneven distribution of tasks leads to underutilized threads and thus to "lost" compute time.
[Figure: timeline of a #pragma omp parallel region with a #pragma omp for loop, showing busy vs. idle threads]

Uneven Load Distribution (2)
• Load balancing
• Static scheduling
  • Equal-sized blocks of iterations
  • Based on the loop bounds at runtime
  • Fully parallel scheduling
  • OpenMP* default
• Dynamic and guided scheduling
  • Threads finish their work and move on to the next block of work.
  • Overhead for the scheduling algorithm.

Synchronization Overhead (1)
• "Lost" time spent waiting for resources
[Figure: timeline of threads in a #pragma omp parallel region with a #pragma omp critical section, showing busy, idle, and in-critical-section phases]
Synchronization Overhead (2)
• Ways to refine synchronization:
  • Less lock contention
    • Different names for critical sections
    • Use of domain-specific locks
  • Merge parallel loops and remove the barriers
  • Merge small critical sections
  • Move critical sections out of loops
• Explicit synchronization:
  • void omp_init_lock(omp_lock_t *lock)
  • void omp_set_lock(omp_lock_t *lock)
  • void omp_unset_lock(omp_lock_t *lock)
  • int omp_test_lock(omp_lock_t *lock)
  • void omp_destroy_lock(omp_lock_t *lock)

Shared Cache on Multi-Cores
• The cache is one of the most important system resources.
• Software challenges for shared-cache multi-core systems:
  • Design and fine-tune applications for better performance.
• Dual-core, dual-processor micro-architecture:
[Figure: dual-core, dual-processor micro-architecture diagram]

Shared Cache on Multi-Cores
• Benefits of the shared-cache architecture compared to individual caches:
• Efficient use of the last-level cache
  • If one core idles, the other core can take the entire shared cache.
  • Reduces resource underutilization.
• Flexibility for programmers
  • Allows more data-sharing opportunities for threads running on separate cores that share the cache.
  • One core can pre-/post-process data for the other core.
  • Alternative communication mechanisms between cores.
• Reduced cache-coherency complexity
  • Less false sharing because of the shared cache.
  • Less work to maintain coherency than in a private-cache architecture.

Shared Cache on Multi-Cores
• Reduced data-storage redundancy
  • The same data only needs to be stored once.
• Reduced front-side bus traffic
  • Data requests can be resolved at the shared-cache level instead of going all the way to system memory.
• Some hardware-specific benefits, such as cache power savings and prefetch logic features.
• Without proper use of certain techniques, the benefits of the shared-cache architecture may not be fully exploited.

Shared-Cache Techniques
• Software techniques for shared-cache multi-core systems:
  • Use processor affinity to avoid data contention and to facilitate data sharing
  • Cache blocking (data tiling)
  • Hold approach
  • Avoid false sharing
  • Delayed approach

Processor Affinity
• Control over which core a specific thread or application runs on is essential.
• Without this control, threads/applications may get assigned to the wrong processors or cores and cause unnecessary performance degradation.
• Separate contention-only threads (threads that share no data but need the cache) from each other and assign them to cores that do not share a cache.
• Assign data-sharing threads to cores that share a cache, or even to the same core if they are closely tied to each other.
• On Linux, use sched_getaffinity() or sched_setaffinity() to get or set the CPU affinity of a task.

Processor Affinity
• Example: effect of processor affinity in a multi-core / multiprocessor environment:
[Figure: effect of processor affinity in a multi-core/multiprocessor environment]

Cache Blocking (Data Tiling)
• Try to let data stay in the cache while it is being processed by data loops.
• Reduces unnecessary cache traffic (fetching and evicting the same data throughout the loops), giving a better cache hit rate.
• Possible candidates for the cache blocking technique include large data sets that are operated on many times.
• If the entire data set does not fit into the cache, the first elements in the cache are evicted to make room for the last ones.
• Choose a flow or logic that is not the most obvious or natural implementation, but is more beneficial to cache utilization.

Cache Blocking (Data Tiling)
• Improving the cache hit rate with the cache blocking technique:
• The normal flow is broken down into four data loops over smaller chunks of the data set, each chunk smaller than the cache size, so the data stays in the cache longer.
• Less traffic into and out of the cache, and fewer cache-request misses.

Hold Approach
• Reduce the frequency of access to shared data by letting each thread maintain its own private copy, updating the shared copy only when necessary.
• A practical example is the OpenMP* reduction clause: each thread keeps a private copy, and the results are consolidated only when all threads are complete.

Delayed Approach
• Introduce a small delay between one core, which is generating data, and the other core, which is requesting data.
• As a result, the overall cache hit rate may improve.
• Takes advantage of natural L1 cache eviction by inserting a wait until the data has been pushed to the shared L2 cache, before the other core requests access to it.
• Useful for fine-tuning performance when two threads run on two cores that share the L2 cache and the threads share data.
• It may be difficult to control the required delay precisely.

Delayed Approach
• Figure (A), without delay:
  • Steps 1 and 2: thread 0 pulls data into its L1 cache and modifies it.
  • Step 3: thread 0 signals thread 1 that the data is ready.
  • Step 4: thread 1 checks whether the data is in the L2 cache and gets a miss.
  • Step 5: the data is evicted into the L2 cache.
  • Step 6: thread 1 gets the data.

Delayed Approach
• Figure (B), with delay:
  • Steps 1 and 2: thread 0 pulls data into its L1 cache and modifies it.
  • Step 3: instead of signaling immediately, thread 0 waits until a portion of the processed data has been evicted into the L2 cache. The amount of delay needed is determined by the L1 cache usage and the data size.
  • Step 4: thread 0 signals thread 1 to start processing.
  • Step 5: thread 1 asks for the data.
  • Step 6: at least a portion of the data is already in the L2 cache, which results in a better L2 hit rate.

Avoid False Sharing
• A well-known problem that occurs when threads on different processors write to a shared cache line, but not to the same location.
• There is no real coherency problem.
• But the cache-coherency protocol marks the cache line dirty, and when the other location gets a read/write request, the hardware logic forces a reload of the updated cache line from memory.
• Frequent updates of the data in the shared cache line can cause severe performance degradation of up to 50% (particularly in areas of heavy processing).

Avoid False Sharing
• Cache-coherency issues at the shared-cache level:
  • Thread 0 on core 0 brings a cache line into L1 and modifies a portion of it.
  • Thread 1 on core 1 wants to modify a different portion of the same cache line.
  • The hardware coherency logic detects that the cache line has been modified by core 0.
  • As a result, the L1 cache line from core 0 must be evicted to the L2 cache and then loaded into core 1's L1 cache before the update can be completed.

Avoid False Sharing
• In multiprocessor systems, false sharing can also occur between cores that do not share a cache. For example, threads running on core 0 may run into false sharing issues with threads running on core 2.
• Tips to fix false sharing:
  • Allocate non-shared data to different cache lines (on the Sun T1: an L1 cache line is 8 bytes, an L2 cache line is 64 bytes).
  • Pad each thread's data element so that elements owned by different threads all lie on separate cache lines.
  • Copy the global variable to a local function variable, then copy the data back before the function exits.
  • In the process, group the data structures together and keep the size of the data elements at or just below a multiple of the cache-line size, to avoid wasting too much cache.
  • Move the threads to the same core.
System Design Phase: Cache-Friendly Design
• Thread placement based on data sharing/contention:
• If two threads are completely independent of each other:
  • Assign them to cores that do not share the last-level cache (applicable to a private-cache architecture in this scenario).
• If two threads share data intimately and need to communicate with each other very frequently:
  • Assign them to the same core so that they share the L1 cache.
• If it is somewhere in between (the two threads share some data, but not very frequently):
  • Assign them to two cores that share the L2 cache (for example, core 0 and core 1 in the figure above).
  • The shared L2 cache saves time by avoiding trips all the way to memory for the shared data.

System Diagnostic Phase
• Identify bottlenecks and monitor cache events using performance diagnostic tools:
  • Trace call graphs
  • Sample targeted events
  • Deadlocks or data contention
  • Cache-miss events
• Tools:
  • Intel compiler: Intel® VTune™ technologies (Intel® Thread Checker, Intel® Thread Profiler, etc.)
  • Sun compiler: the Analyzer and Collect commands

System Tuning Phase
• Fine-tune system performance:
  • Check processor affinity again to see whether cache contention can be avoided or cache sharing can be facilitated by relocating threads.
  • Check for a high concentration of cache events to see whether there is false sharing, and fix it by relocating a data portion or assigning the threads to the same core.
  • Check data loops to see whether cache blocking can be used.
  • Check data loops to see whether a hold approach can be used to be more disciplined when updating the shared copy.
  • Check whether cheaper synchronization primitives can be used.
  • Check whether a delayed approach may be suitable.