- Institute for Program Structures and Data Organization
Universität Karlsruhe (TH)
Forschungsuniversität · gegründet 1825
OpenMP Programming
Performance Analysis
Multicore Lab Course, Winter Semester 06/07
Fakultät für Informatik
Lehrstuhl für Programmiersysteme
Contents
• What is OpenMP?
• Parallel regions
• Work-sharing constructs
• Visibility / protection of private data
• Synchronization constructs
• Loop scheduling
• Other useful constructs
• Performance considerations
• Clauses / directives summary
• Environment variables
Performance considerations
• Waiting threads do no useful work.
• The work should be divided among the threads as evenly as possible.
• All threads should finish their parallel tasks at the same time.
• Synchronization may be necessary.
• However: the time a thread spends waiting for a protected resource must be minimized.
Load imbalance (1)
• An uneven distribution of tasks leads to underutilized threads and thus to "wasted" compute time (a concrete sketch follows below).
#pragma omp parallel
{
    #pragma omp for
    for( ; ; ){
        ...
    }
}
[Figure: thread timeline showing busy vs. idle time]
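A minimal sketch (not from the slides) of how such an imbalance arises: the cost of each iteration grows with the loop index, so with a static schedule the threads that receive the cheap early iterations finish first and sit idle at the implicit barrier. The function work_on is a hypothetical stand-in for real per-iteration work.
#include <omp.h>
#include <stdio.h>

#define N 1000

/* Hypothetical work whose cost grows with i: iteration i loops i times. */
static double work_on(int i) {
    double sum = 0.0;
    for (int j = 0; j < i; j++)
        sum += (double)j / (i + 1);
    return sum;
}

int main(void) {
    double total = 0.0;
    /* Static partitioning (the usual default): each thread gets an equal
     * number of iterations, but the later (and costlier) ones all land on
     * the last thread, so the other threads wait at the implicit barrier. */
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < N; i++)
        total += work_on(i);
    printf("total = %f\n", total);
    return 0;
}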
Load imbalance (2)
• Load balancing
• Static scheduling
• Equal-sized blocks of iterations per thread
• Based on the loop bounds, determined at run time
• Fully parallel scheduling
• OpenMP* default
• Dynamic and guided scheduling (see the sketch after this list)
• Threads finish their chunk of work and move on to the next one.
• Overhead for the scheduling algorithm.
Synchronization overhead (1)
• "Lost" time spent waiting for resources
#pragma omp parallel
{
    ...
    #pragma omp critical
    {
        ...
    }
    ...
}
[Figure: thread timeline showing busy time, idle time, and time spent in the critical section]
Synchronization overhead (2)
• Methods for refining synchronization:
• Less lock contention
• Different names for critical sections
• Use of domain-specific locks
• Merging parallel loops and removing barriers
• Merging small critical sections
• Moving critical sections out of loops
• Explicit synchronization (see the lock sketch after this list):
• void omp_init_lock(omp_lock_t *lock)
• void omp_set_lock(omp_lock_t *lock)
• void omp_unset_lock(omp_lock_t *lock)
• int omp_test_lock(omp_lock_t *lock)
• void omp_destroy_lock(omp_lock_t *lock)
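A minimal sketch of explicit synchronization with these routines, assuming a shared histogram protected by a single lock; in real code, per-bucket locks or named critical sections would reduce contention further.
#include <omp.h>

#define BUCKETS 64

static int histogram[BUCKETS];
static omp_lock_t hist_lock;   /* one explicit lock guarding the histogram */

void count(const int *data, int n) {   /* data values assumed non-negative */
    omp_init_lock(&hist_lock);
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        int b = data[i] % BUCKETS;
        omp_set_lock(&hist_lock);     /* enter the protected region */
        histogram[b]++;               /* keep the protected work as short as possible */
        omp_unset_lock(&hist_lock);
    }
    omp_destroy_lock(&hist_lock);
}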
Shared-Cache on Multi-Cores
• The cache is one of the most important system resources.
• Software-technique challenges for shared-cache multi-core systems:
• design and fine-tune applications for better performance
• Dual-core, dual-processor micro-architecture:
Shared-Cache on Multi-Cores
• Benefits of the shared-cache architecture compared to individual caches:
• Efficient usage of the last-level cache
• If one core idles, the other core can use the entire shared cache
• Reduces resource underutilization
• Flexibility for programmers
• Allows more data-sharing opportunities for threads running on separate cores that share the cache
• One core can pre-/post-process data for the other core
• Alternative communication mechanisms between cores
• Reduced cache-coherency complexity
• Reduced false sharing because of the shared cache
• Less work to maintain coherency, compared to the private-cache architecture
Shared-Cache on Multi-Cores
• Reduced data-storage redundancy
• The same data only needs to be stored once
• Reduced front-side-bus traffic
• allows data requests to be resolved at the shared-cache level instead of going all the way to system memory
• Some hardware-specific benefits, such as cache power savings and prefetch-logic features
• Without proper use of certain software techniques, the benefits of the shared-cache architecture may not be fully exploited.
Shared-Cache techniques
• Software techniques for shared-cache multi-core systems:
• Use processor affinity to avoid data contention and facilitate data sharing
• Cache blocking (or data tiling)
• Hold approach
• Avoid false sharing
• Delayed approach
Processor Affinity
• Control over which core a specific thread or application runs on is essential.
• Without this control, threads/applications may get assigned to the wrong processors or cores and cause unnecessary performance degradation.
• Separate contention-only threads (threads that do not share any data but need the cache) from each other, and assign them to cores that do not share the cache.
• Assign data-sharing threads to cores that share the cache, or even to the same core if they are closely tied to each other.
• On Linux, use sched_getaffinity() or sched_setaffinity() to get or set the CPU affinity of a task (see the sketch below).
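A minimal Linux sketch of setting processor affinity (assuming glibc's CPU_SET macros; pinning to core 0 is an arbitrary illustrative choice):
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);   /* allow this task to run only on core 0 */

    /* pid 0 means "the calling thread". */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* From here on the scheduler keeps the thread on core 0, e.g. to
     * co-locate it with a data-sharing partner on the same shared cache. */
    printf("pinned to core 0\n");
    return 0;
}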
Processor Affinity
• Example: Effect of processor affinity in a multi-core/multiprocessor
environment:
Cache blocking (data tiling)
• Try to let data stay in the cache while it is being processed by the data loops.
• Reduces unnecessary cache traffic (fetching and evicting the same data throughout the loops) and yields a better cache hit rate.
• Possible candidates for the cache-blocking technique include large data sets that are operated on many times.
• If the entire data set does not fit into the cache, the first elements in the cache will be evicted to make room for the last elements.
• Choose the flow or logic that is not necessarily the most obvious or natural implementation, but is more beneficial to cache utilization.
Cache blocking (data tiling)
• Improve the cache hit rate with the cache-blocking technique (see the sketch below):
• The normal flow is broken down into four data loops over smaller chunks of the data set, each chunk smaller than the cache size (so it stays in the cache longer).
• Less traffic into and out of the cache, and fewer cache-request misses.
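A minimal sketch of cache blocking (the block size and the two processing passes are illustrative assumptions): instead of streaming each pass over the whole array, all passes are applied to one cache-sized chunk before moving on, so the chunk is reused while it is still resident.
#define N     (1 << 22)   /* large data set, much bigger than the cache */
#define BLOCK (1 << 14)   /* illustrative block size chosen to fit in the cache */

static void pass1(double *x, int n) { for (int i = 0; i < n; i++) x[i] *= 1.5; }
static void pass2(double *x, int n) { for (int i = 0; i < n; i++) x[i] += 1.0; }

/* Naive flow: pass2 revisits data that pass1 fetched long ago and
 * that has already been evicted. */
void process_naive(double *data) {
    pass1(data, N);
    pass2(data, N);
}

/* Blocked flow: both passes run on one cache-sized chunk at a time. */
void process_blocked(double *data) {
    for (int i = 0; i < N; i += BLOCK) {
        int len = (N - i < BLOCK) ? (N - i) : BLOCK;
        pass1(&data[i], len);
        pass2(&data[i], len);
    }
}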
Hold approach
• Reduce the frequency of access to shared data by letting each thread maintain its own private copy of the data and only update the shared copy when necessary.
• A practical example is the OpenMP* reduction clause: each thread keeps a private copy and the results are consolidated only when all threads are complete (see the sketch below).
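A minimal sketch of the hold approach via the reduction clause: each thread accumulates into a private partial sum, and the shared variable is updated only once per thread at the end of the loop.
#include <omp.h>

double sum_of_squares(const double *v, int n) {
    double sum = 0.0;
    /* Each thread holds a private copy of 'sum'; the shared copy is
     * touched only when the threads combine their results. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += v[i] * v[i];
    return sum;
}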
Delayed approach
• Introduce a small amount of delay between the core that is generating the data and the core that is requesting it.
• As a result, the overall cache hit rate may improve.
• Takes advantage of natural L1 cache eviction by inserting a wait until the data has been pushed to the shared L2 cache, before the other core requests access to it.
• Useful for fine-tuning performance when two threads run on two cores that share the L2 cache and the threads share the data.
• It may be difficult to get precise control over the amount of delay needed.
Delayed approach
• Figure (A), without delay:
• Steps 1 and 2: thread 0 brings the data into its L1 cache and modifies it
• Step 3: thread 0 sends a signal to thread 1 to tell it the data is ready
• Step 4: thread 1 checks whether the data is in the L2 cache and gets a miss
• Step 5: the data is evicted into the L2 cache
• Step 6: thread 1 gets the data
Delayed approach
• Figure (B), with delay:
• Steps 1 and 2: thread 0 brings the data into its L1 cache and modifies it
• Step 3: instead of sending a signal immediately, thread 0 waits until a portion of the processed data has been evicted into the L2 cache. The amount of delay needed is determined by the L1 cache usage and the data size
• Step 4: thread 0 sends a signal to thread 1 for processing
• Step 5: thread 1 asks for the data
• Step 6: at least a portion of the data is already in the L2 cache, which results in a better L2 hit rate
Avoid false sharing
• A well-known problem that occurs when threads on different processors write to the same cache line, but not to the same location.
• There is no real coherency problem,
• but the cache-coherency protocol marks the cache line dirty, and when the other location gets a read/write request the hardware logic forces a reload of the cache line from memory.
• Frequent updates of the data in the shared cache line can cause severe performance degradation, up to 50% (particularly in areas of heavy processing).
Avoid false sharing
• Cache-coherency issues at the shared-cache level:
• Thread 0 on core 0 brings a cache line into L1 and modifies a portion of it.
• Thread 1 on core 1 wants to modify a different portion of the same cache line.
• The hardware coherency logic detects that the cache line has been modified by core 0.
• As a result, the L1 cache line from core 0 must be evicted to the L2 cache and then loaded into core 1's L1 cache before the update can be completed.
Avoid false sharing
• In multiprocessor systems, there can also be false sharing between cores that do not share a cache.
• For example, threads running on core 0 can run into false-sharing issues with threads running on core 2.
• Tips to fix false sharing (padding is sketched after this list):
• Allocate non-shared data to different cache lines (on the Sun T1: the L1 cache line is 8 bytes and the L2 cache line is 64 bytes).
• Pad each thread's data element to ensure that elements owned by different threads all lie on separate cache lines.
• Copy the global variable to a local function variable, then copy the data back before the function exits.
• While doing so, group the data structures together and keep the size of the data elements at or just below a multiple of the cache-line size, in order to avoid wasting too much cache.
• Move the threads to the same core.
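A minimal sketch of the padding tip (the 64-byte line size and the thread limit are illustrative assumptions; match them to the target hardware): each thread's counter occupies its own cache line, so concurrent updates do not falsely share a line.
#include <omp.h>

#define CACHE_LINE  64   /* assumed cache-line size in bytes */
#define MAX_THREADS 32   /* assumed upper bound on the number of threads */

/* Pad each per-thread counter out to a full cache line. */
typedef struct {
    long count;
    char pad[CACHE_LINE - sizeof(long)];
} padded_counter_t;

static padded_counter_t counters[MAX_THREADS];

long count_even(const int *data, int n) {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < n; i++)
            if (data[i] % 2 == 0)
                counters[tid].count++;   /* each thread touches only its own line */
    }
    long total = 0;
    for (int t = 0; t < MAX_THREADS; t++)
        total += counters[t].count;
    return total;
}
Combining this with the copy-to-local tip (accumulating in a local variable and writing counters[tid].count once at the end) avoids even the per-thread cache-line traffic.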
System design phase: Cache-friendly design
• Thread placement based on data sharing/contention:
• If two threads are completely independent of each other,
• assign them to cores that do not share the last-level cache (applicable to a private-cache architecture in this scenario).
• If two threads share data intimately and need to communicate with each other very frequently,
• assign them to the same core so that they share the L1 cache.
• If it is somewhere in between (the two threads need to share some data, but not very frequently),
• assign them to two cores that share the L2 cache (for example, core 0 and core 1 in the figure above).
• Using the shared L2 cache saves time by avoiding a trip all the way to memory for the shared data.
System diagnostic phase
• Identify bottlenecks and monitor cache events using performance diagnostic tools:
• trace call graphs,
• sample targeted events,
• detect deadlocks or data contention,
• detect cache-miss events
• Intel compiler: Intel® VTune™ technologies (Intel® Thread Checker, Intel® Thread Profiler, etc.)
• Sun compiler: Analyzer and collect commands
System tuning phase
• Fine-tune system performance:
• Check processor affinity again to see whether cache contention can be avoided, or cache sharing facilitated, by relocating the threads
• Check for a high concentration of cache events to see whether there is false sharing, and fix it by relocating a data portion or assigning the threads to the same core
• Check data loops to see whether cache blocking can be used
• Check data loops to see whether a hold approach can be used to be more disciplined when updating the shared copy
• Check whether cheaper synchronization primitives can be used
• Check whether a delayed approach may be suitable
