The Effect of Network Total Order, Broadcast, and Remote-Write Capability on Network-Based Shared Memory Computing

Robert Stets, Sandhya Dwarkadas, Leonidas Kontothanassis†, Michael L. Scott

Department of Computer Science, University of Rochester, Rochester, NY 14627-0226
†Compaq Cambridge Research Lab, One Kendall Sq., Bldg. 700, Cambridge, MA 02139

DRAFT COPY - Please do not redistribute.
Abstract
Emerging system-area networks provide a variety of features that can dramatically reduce network communication overhead. Such features include reduced latency, protected remote memory access, cheap broadcasting, and ordering guarantees for network packets. Some of these features come at the expense of scalability for the network fabric or at a significant implementation cost. In this paper we evaluate the impact of these features on the implementation of Software Distributed Shared Memory (SDSM) systems, in particular, on the Cashmere protocol. Cashmere has been implemented for the Compaq Memory Channel network, which has support for write access to remote memory, inexpensive broadcast, and total ordering of network packets. Our evaluation framework divides SDSM protocol communication into three areas: shared data propagation, protocol meta-data maintenance, and synchronization; it demonstrates the performance impact of exploiting Memory Channel features in each of these three areas.
We found that eight of eleven well-known benchmark applications perform better on the base version of Cashmere, which maximizes its leverage of the Memory Channel features, than on a comparable version that uses explicit messages (and no broadcast) for all protocol communication. The performance difference is 37% in one application, which dynamically distributes its work, and 11% or less in the other seven applications. The remaining three applications show no performance differences. In general, the differences are due to reduced protocol-induced application perturbation and more efficient meta-data maintenance. Reduced perturbation accounts for the large 37% improvement by decreasing interference with the application's dynamic work distribution mechanism.
In addition, we have investigated home node migration as a means of reducing shared data propagation. Our results show that this optimization recoups the performance lost by abandoning the use of special Memory Channel features. In fact, the optimization is so effective that three of the applications perform 18% to 34% better on a protocol with migration and explicit messages than on our base protocol that fully leverages the Memory Channel. The message-based protocol has the additional advantage of allowing shared memory to grow beyond the amount that can be mapped through the network interface.
This work was supported in part by NSF grants CDA-9401142, CCR-9702466, and CCR-9705594; and an external research grant from Compaq.
1 Introduction
Clusters of workstations connected by commodity networks have long provided aggregate power comparable to special-purpose parallel machines. In practice, however, performance has been limited by the relatively high cost of inter-processor communication. Recent trends, particularly the introduction of commodity-priced symmetric multiprocessors (SMPs) and low-latency (in the microseconds) system area networks (SANs), have improved the potential performance of clusters. In addition to low messaging latency, many SANs provide other overhead-reducing features such as remote memory access, inexpensive broadcast, and total ordering in the network [10, 15, 16]. On SMP clusters connected by SANs, communication overhead can be greatly reduced. Communication within a node can occur through hardware shared memory, while communication across nodes can be ameliorated by the high-performance network.
Since shared memory is available in hardware within SMP nodes, perhaps the most natural programming paradigm for these clusters is Software Distributed Shared Memory (SDSM), since it uses the hardware within a node efficiently. Several studies have already demonstrated the positive impact of SMP-based clusters on SDSM performance [12, 14, 20, 21, 22, 25]. Many of these same studies utilized low-latency networks. However, the benefits of advanced network features (for example, remote memory access) have not been directly quantified.
In this paper, we examine the impact of advanced networking features on the performance of the state-of-the-art Cashmere-2L [25] protocol. The Cashmere protocol uses the virtual memory subsystem to track data accesses, allows multiple concurrent writers, employs home nodes (i.e., maintains one master copy of each shared data page), and leverages shared memory within SMPs to reduce protocol overhead. In practice, Cashmere-2L has been shown to have very good performance [12, 17, 25].

Cashmere was originally designed for a cluster consisting of AlphaServer SMPs connected by a Compaq Memory Channel network, which offers low messaging latencies, write access to remote memory, inexpensive broadcast, and total ordering. Cashmere therefore attempted to maximize performance by placing shared data directly in remotely accessible memory, using broadcast to replicate the directory among the nodes, and relying on network total order and reliability to avoid acknowledging the receipt of meta-data information.
The purpose of this paper is to evaluate the performance impact of each of these design decisions. We have structured our evaluation to determine not only the overall impact of the special Memory Channel features, but also their impact on protocol communication and related design. In general, an SDSM protocol incurs communication in three areas: the propagation of shared data, the maintenance of internal protocol data structures (called protocol meta-data), and synchronization. To evaluate the impact of network support in these terms, we have constructed six Cashmere variants. Four of the variants are used to isolate the impact of the Memory Channel features on protocol communication in the above areas. The final two variants employ a protocol optimization that allows home nodes to migrate to active writers, thereby reducing remote propagation of shared data. This optimization is only possible when shared data is not in remotely accessible memory, since migration of remotely accessible memory is an expensive operation involving synchronization of all the nodes mapping the data.
Our results show that the Memory Channel features improve performance by an average of 8% across eleven standard benchmarks. The largest improvement is 37% and occurs in a program with dynamic work distribution. This application benefits from reduced protocol-induced overhead imbalances, resulting in more effective work distribution. Importantly, the use of the Memory Channel features never degrades performance in any of the applications. In terms of protocol design, meta-data maintenance benefits the most from the network support. In addition, we found that home node migration can recover most of the benefits lost by not using remote write access to propagate shared data, and, importantly, allows shared data size to scale beyond our network's limited remotely-accessible memory space.¹ Three of the applications actually obtain their best performance (by 18-34%) on a protocol with migration and explicit messages.
The next section discusses the Memory Channel and its special features, along with the Cashmere protocol. Section 3 evaluates the impact of the Memory Channel features and the home node migration optimization. Section 4 covers related work, and Section 5 outlines our conclusions.

¹Most current commodity remote-access networks have a limited remotely-accessible memory space. Methods to eliminate this restriction are a focus of ongoing research [6, 26].
2 Protocol Variants and Implementation
Cashmere was designed for SMP clusters connected by a high-performance network, specifically Compaq's Memory Channel network [15]. Earlier work [12, 14, 20, 21, 22, 25] on Cashmere and other systems has quantified the benefits of SMP nodes to SDSM performance. In this paper, we examine the performance impact of the network features exploited.

We begin by providing an overview of the Memory Channel network and its programming interface. Following this overview is a description of the Cashmere protocol. In keeping with the focus of this paper, the design discussion will primarily focus on the aspects of network communication. A discussion of the design decisions related to the SMP nodes can be found in an earlier paper [25].
2.1 Memory Channel
The Memory Channel is a reliable, low-latency network. The hardware provides remote-write capability, which allows processors to modify remote memory without remote processor intervention. The Memory Channel uses a memory-mapped, programmed I/O interface. To work with remotely-accessible memory, a processor must attach to regions in the Memory Channel's address space. The regions can be mapped for either transmit or receive. The physical addresses of transmit regions map to I/O space, in particular to addresses on the Memory Channel's network adapter. I/O space is uncacheable, but writes can be coalesced in the processor's write buffer. Receive regions map directly to physical memory. After initial connection setup, the network can be accessed directly from user level. Writes to transmit regions are routed to the network adapter (on the PCI bus), which automatically constructs and launches a data message. Upon message reception, a node's adapter performs a DMA access to main memory if the region is mapped for receive. Otherwise, the message is dropped.

Normally, a write to a transmit region is not reflected to a corresponding receive region on the source node. By placing a region in loopback mode, however, we can arrange for the source adapter to include itself in the outgoing message's destination. The message will move through the hub and arrive back at the source, where it will be processed as a normal incoming message. The adapter will place the data in the appropriate receive region.
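As a minimal illustration of this interface, the C sketch below maps one region for both transmit and receive. The mc_attach_tx/mc_attach_rx names are hypothetical stand-ins, simulated here with plain allocations so the sketch compiles; they are not the actual Memory Channel mapping API.

#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the vendor's user-level mapping calls.
 * On the real hardware, a transmit mapping points at uncacheable I/O
 * space on the adapter and a receive mapping at DMA-target memory;
 * here both are simulated with ordinary allocations so this compiles. */
static volatile uint64_t *mc_attach_tx(size_t words) {
    return calloc(words, sizeof(uint64_t));
}
static volatile uint64_t *mc_attach_rx(size_t words) {
    return calloc(words, sizeof(uint64_t));
}

typedef struct {
    volatile uint64_t *tx;  /* stores here become outgoing packets      */
    volatile uint64_t *rx;  /* filled by incoming DMA (or loopback)     */
} mc_region_t;

/* Map one region for both transmit and receive.  With loopback enabled
 * on the transmit mapping, a local store to tx[i] also arrives back in
 * the local rx[i] after the packet passes through the ordering hub.    */
static int mc_map_region(mc_region_t *r, size_t words) {
    r->tx = mc_attach_tx(words);
    r->rx = mc_attach_rx(words);
    return (r->tx && r->rx) ? 0 : -1;
}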
The Memory Channel guarantees total order: all writes to the network are observed in the same order by all receivers. This guarantee is provided by a serializing hub that connects all the machines in the cluster. The hub is bus-based, which ensures serialization and also accounts for the network's inexpensive broadcast support. (The second generation of the Memory Channel has a crossbar hub. Broadcast support is still available, but at a higher cost.)
2.2 Protocol Overview
Cashmere uses the virtual memory (VM) subsystem to track accesses to shared data, and so, naturally, the unit of coherence is a virtual memory page (8K on our system). To use Cashmere, an application must be data-race-free [1]. Simply stated, one process must synchronize with another in order to see its modifications. Also, all synchronization primitives must be visible to the system. These primitives can be constructed from basic acquire and release operations. The former is used to enter a critical section; the latter is used to exit it.
Cashmere implements a variant of delayed consistency [9]. In this variant, data modifications become visible at a processor at the time of its next acquire operation. This model lies in between the consistency models implemented by Munin [5] and TreadMarks [2]. In the former, modifications become visible at the time of the modifier's release operation. In the latter, modifications become visible at the time of the next causally related acquire.
In Cashmere, each page of shared memory has a single, distinguished home node. Home nodes are initially assigned using a first-touch policy. The home node collects all modifications into a master copy of the page. Sharing set information and home node locations are maintained in a directory containing one entry per page.
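For concreteness, a per-page directory entry along these lines might look as follows. The field names and sizes are illustrative assumptions; the paper does not give Cashmere's actual layout.

#include <stdint.h>

#define MAX_NODES 8        /* assumed cluster size (one bit per node below) */

/* Hypothetical per-page directory entry capturing what the text says the
 * directory holds: the home node and the sharing set.  In the base
 * protocol this array is replicated in Memory Channel space on every node. */
typedef struct {
    uint8_t  home_node;    /* node holding the master copy of the page   */
    uint8_t  exclusive;    /* nonzero if only one node is sharing it     */
    uint16_t sharers;      /* bitmap of nodes with the page mapped       */
    uint32_t lock;         /* serializes concurrent directory updates    */
} dir_entry_t;

extern dir_entry_t directory[];   /* one entry per shared page */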
The main protocol entry points are page faults and synchronization operations. On a page fault, the protocol updates the sharing set information in the directory and obtains an up-to-date copy of the page from the home node. If the fault is due to a write access, the protocol also creates a pristine copy (called a twin) of the page and adds the page to the dirty list. As an optimization in the write fault handler, a page that is shared by only one node is moved into exclusive mode. In this case, the twin and dirty list operations are skipped, and the page incurs no protocol overhead until another sharer emerges.
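A simplified write-fault path, reusing the dir_entry_t sketched above, might look like the following. The helper routines (fetch_page_from_home, dirty_list_add) and the twin table are hypothetical names for illustration, not Cashmere's actual internals.

#include <stdint.h>
#include <string.h>
#include <stdlib.h>

#define PAGE_SIZE   8192          /* VM page size on the platform described */
#define MAX_PAGES   (1 << 16)     /* assumed bound on shared pages          */

extern int my_node;

/* Hypothetical helpers: pull the master copy from the home node, and
 * remember the page so that it is diffed at the next release. */
extern void fetch_page_from_home(int page, void *dest);
extern void dirty_list_add(int page);

static void *twin[MAX_PAGES];     /* pristine copies, created lazily */

/* Sketch of the write-fault handler: join the sharing set, fetch the
 * up-to-date copy, then either enter exclusive mode or make a twin. */
void handle_write_fault(int page, void *page_addr)
{
    dir_entry_t *e = &directory[page];

    e->sharers |= (uint16_t)(1u << my_node);   /* update sharing set      */
    fetch_page_from_home(page, page_addr);     /* current master contents */

    if (e->sharers == (1u << my_node)) {
        e->exclusive = 1;                      /* lone sharer: skip twin  */
    } else {
        twin[page] = malloc(PAGE_SIZE);        /* pristine copy for diffs */
        memcpy(twin[page], page_addr, PAGE_SIZE);
        dirty_list_add(page);                  /* diff against twin later */
    }
    /* The handler would then re-enable write access with mprotect().    */
}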
Protocol Name   Data      Meta-data   Synchronization   Home Migration
CSM-DMS         MC        MC          MC                No
CSM-MS          Explicit  MC          MC                No
CSM-S           Explicit  Explicit    MC                No
CSM-None        Explicit  Explicit    Explicit          No
CSM-MS-Mg       Explicit  MC          MC                Yes
CSM-None-Mg     Explicit  Explicit    Explicit          Yes

Table 1: These protocol variants have been chosen to isolate the performance impact of special network features on the areas of SDSM communication. Use of special Memory Channel features is denoted by "MC" under the area of communication. Otherwise, explicit messages are used. The use of Memory Channel features is also denoted in the protocol suffix (D, M, and/or S), as is the use of home node migration (Mg).
At the next release operation, the protocol examines each page in the dirty list and compares the page to its twin in order to uncover the modifications. These modifications are placed in a diff message and sent to the home node to be incorporated into the master copy of the page. Upon completion of the diff message, the protocol downgrades permissions on the dirty pages and sends write notices to all nodes in the sharing set. These write notices are accumulated into a list at the destination and processed at the node's next acquire operation. All pages named by write notices are invalidated as part of the acquire.
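A word-granularity diff of a dirty page against its twin, as used at a release, might be computed as in the sketch below; the (index, value) encoding is an illustrative assumption rather than Cashmere's actual wire format.

#include <stdint.h>

#define PAGE_WORDS (8192 / 8)     /* 8K pages, compared 64 bits at a time */

/* One changed word: where it is in the page and its new contents. */
typedef struct {
    uint32_t word_index;
    uint64_t value;
} diff_entry_t;

/* Compare the current page contents against the twin and collect the
 * changed words.  The resulting entries are what gets shipped to the
 * home node (as remote writes or as an explicit diff message). */
int make_diff(const uint64_t *cur, const uint64_t *old, diff_entry_t *out)
{
    int n = 0;
    for (int w = 0; w < PAGE_WORDS; w++) {
        if (cur[w] != old[w]) {
            out[n].word_index = (uint32_t)w;
            out[n].value = cur[w];
            n++;
        }
    }
    return n;    /* number of changed words in the diff */
}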
2.3 Protocol Communication
As described earlier, protocol communication can be broken down into three areas: shared data propagation, protocol meta-data maintenance, and synchronization. In order to isolate the effects of the special Memory Channel features on these three areas, we have prepared six variants of the Cashmere protocol. Table 1 lists the variants and characterizes their use of the Memory Channel. For each of the areas of protocol communication, the protocols either leverage the full Memory Channel capabilities (i.e., remote write access, total ordering, and inexpensive broadcast) or instead send explicit messages between processors. We assume a reliable network (as is common in current SANs). If we wish to establish ordering with explicit messages, however, an acknowledgement is required.
2.3.1 CSM-DMS: Data, Meta-data, and Synchronization using Memory Channel
The base protocol, denoted CSM-DMS, is the same Cashmere-2L protocol described in our study of the effects of SMP clusters [25]. This protocol fully exploits the Memory Channel for all SDSM communication: to propagate shared data, to maintain protocol meta-data, and for synchronization. The following paragraphs describe how the features are leveraged.
Data: Shared data is fetched from the home node, and modifications are written back to the home node in the form of diffs.² The fetch operation could be optimized by a remote read operation or by allowing the home node to write the data directly to the working address on the requesting node. Unfortunately, the first optimization is not available on the Memory Channel. The second optimization requires shared data to be mapped at distinct Memory Channel addresses on each node. With only 128M of Memory Channel address space, this significantly limits the maximum dataset size. (For eight nodes, the maximum dataset would be only about 16M.) For this reason, CSM-DMS does not use the second optimization either.

²An earlier Cashmere study [17] investigated using write-through to propagate data modifications. Diffs were found to use bandwidth more efficiently than write-through, and to provide better performance.
Instead of using Memory Channel address space for all shared data copies, CSM-DMS uses it for home node copies only. This still limits dataset size, but the limit is much higher. With home node copies in Memory Channel space, a processor can use remote writes to apply diffs at release time. This usage avoids the need to interrupt a home node processor.

To avoid race conditions, Cashmere must be sure all diffs are completed before exiting a critical section. Rather than requiring home nodes to return diff acknowledgements, CSM-DMS instead relies on the Memory Channel's total ordering. CSM-DMS performs all diff operations and then completes the release operation by resetting the corresponding synchronization location in Memory Channel space. Since the network is totally ordered, the diffs are guaranteed to be complete by the time other processors observe the completion of the release operation.
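A sketch of that release path appears below: diffs are applied with remote writes into the home copies, and the lock word is then cleared with another remote write. The helper name and the lock representation are assumptions; the essential point is that total ordering makes the clearing store visible only after the diff stores.

#include <stdint.h>

/* Hypothetical: write one dirty page's diff directly into the home
 * node's master copy through the Memory Channel transmit mapping. */
extern void write_diff_to_home_copy(int page);

/* CSM-DMS release completion (sketch).  Because the network delivers
 * all writes in one total order, no processor can observe the cleared
 * lock word before it observes the diff writes issued ahead of it, so
 * no per-diff acknowledgement is needed. */
void csm_dms_release(const int *dirty_list, int n_dirty,
                     volatile uint64_t *mc_lock_word)
{
    for (int i = 0; i < n_dirty; i++)
        write_diff_to_home_copy(dirty_list[i]);

    /* Memory barrier: make sure the diff stores have left the write
     * buffer before the store that completes the release is issued. */
    __sync_synchronize();

    *mc_lock_word = 0;   /* remote write: completes the release */
}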
Meta-data: System-wide meta-data in CSM-DMS consists of the page directory and write notices. CSM-DMS replicates the page directory on each node and then uses a remote write to broadcast all changes. Cashmere also uses remote writes to deliver write notices to a well-known location on each node. At an acquire, the node simply reads the write notices from that location. As with diffs, Cashmere takes advantage of the guaranteed network ordering to avoid write notice acknowledgements.
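The acquire-side handling of those remotely written notices might look roughly like this; the buffer layout (a count plus a page list in a receive region) and the helper names are assumptions for illustration.

#include <stdint.h>

/* Hypothetical per-processor notice buffer living in a Memory Channel
 * receive region: remote nodes append page numbers and bump the count. */
typedef struct {
    volatile uint32_t count;          /* notices delivered so far         */
    volatile uint32_t pages[1024];    /* pages to invalidate              */
} notice_buf_t;

extern notice_buf_t *my_notices;      /* mapped at a well-known location  */
extern void invalidate_page(uint32_t page);   /* e.g. mprotect(PROT_NONE) */

static uint32_t consumed;             /* how many notices we have handled */

/* At an acquire: invalidate every page named by a not-yet-seen notice.
 * Total ordering guarantees the corresponding diffs reached the home
 * copies before the notices became visible here. */
void process_write_notices(void)
{
    uint32_t n = my_notices->count;
    for (uint32_t i = consumed; i < n; i++)
        invalidate_page(my_notices->pages[i]);
    consumed = n;
}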
Synchronization: Application locks, barriers, and flags all leverage the Memory Channel's broadcast and write-ordering capabilities. Locks are represented by an 8-entry array in Memory Channel space, and by a test-and-set flag on each node. A process begins a global lock acquire operation by first acquiring the local test-and-set lock. Then the process asserts its node's entry in the 8-entry array, waits for the write to appear via loopback, and then reads the entire array. If any of the other entries are set, the process resets its entry, backs off, and tries again. If no other entries are set, the lock has been acquired. Barriers are represented by an 8-entry array and a "sense" variable in Memory Channel space, and a local counter on each node. A processor atomically reads and increments the local node counter to determine whether it is the last processor on the node to enter the barrier. If so, the processor updates the node's entry in the 8-entry array. A single master processor waits for all nodes to arrive and then toggles the sense variable. This releases all the nodes, which are spinning on the sense variable. Flags simply use the Memory Channel's remote write and broadcast.
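The cluster-level lock acquire described above can be sketched as follows; the array views, the local test-and-set helper, and the back-off routine are hypothetical names, and the 8-entry array matches the node count assumed in the text.

#include <stdint.h>

#define NODES 8

extern volatile uint64_t *lock_tx;   /* 8-entry lock array, transmit view */
extern volatile uint64_t *lock_rx;   /* same array, loopback/receive view */
extern int my_node;

extern void acquire_local_test_and_set(void);  /* per-node arbitration */
extern void backoff(void);                     /* hypothetical back-off */

void global_lock_acquire(void)
{
    acquire_local_test_and_set();        /* one contender per SMP node   */

    for (;;) {
        lock_tx[my_node] = 1;            /* assert our node's entry      */
        while (lock_rx[my_node] != 1)    /* wait for it to loop back:    */
            ;                            /* our write is now ordered     */

        int contended = 0;
        for (int i = 0; i < NODES; i++)  /* read the entire array        */
            if (i != my_node && lock_rx[i])
                contended = 1;

        if (!contended)
            return;                      /* no other entry set: acquired */

        lock_tx[my_node] = 0;            /* retreat, back off, retry     */
        backoff();
    }
}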
2.3.2 CSM-MS: Meta-data and Synchronization using Memory Channel
As mentioned above, CSM-DMS places home node page copies in the Memory Channel address space, which limits the maximum dataset size. CSM-MS does not place shared data in Memory Channel space and so avoids network-induced limitations on dataset size. The tradeoff is that CSM-MS cannot leverage the Memory Channel to optimize diff communication. Instead, diffs are sent as explicit messages, which must interrupt the home node and which require explicit acknowledgements. (Interrupts are achieved via Memory Channel writes to a shared flag, coupled with receive-side polling on loopback edges.) In CSM-MS, meta-data and synchronization still leverage all Memory Channel features.
2.3.3 CSM-S: Synchronization using Memory Channel
The third protocol variant, CSM-S, only leverages the Memory Channel for synchronization. Explicit messages are used both to propagate shared data and to maintain meta-data. Instead of broadcasting a directory change, a process must send the change to the home node in an explicit message. The home node updates the entry and acknowledges the request. The home node is the only node guaranteed to have an up-to-date directory entry.

In most cases, a separate directory update (or read) message can be avoided. Instead, the update can be piggybacked onto an existing message. For example, a directory update is implicit in a page fetch request and so can be piggybacked. Also, write notices always follow diff operations, so the home node can simply piggyback the sharing set (needed to identify where to send write notices) onto the diff acknowledgment. In fact, an explicit directory message is needed only when a page is invalidated.³

³The protocol could be designed to lazily downgrade the directory entry in this case. However, the directory entry would often over-estimate the number of sharers and compromise the effectiveness of Cashmere's exclusive mode optimization.
2.3.4 CSM-None: No Use of Special Memory Channel Features
The fourth protocol, CSM-None, uses explicit messages (and acknowledgments) for all communication. This protocol variant relies only on low-latency messaging, and so could easily be ported to other low-latency network architectures. Rather than inter-processor interrupts, our low-latency messaging relies on the efficient polling mechanism described above. Earlier Cashmere work [17] found that the expensive kernel transition incurred by inter-processor interrupts limited the benefits of the low-latency network. In our implementation, we poll a well-known message arrival flag that is updated through remote write. This mechanism should be considered independent of our other uses of remote write, since efficient polling can be implemented on other network interfaces [10, 26] that lack the ability to write to arbitrary, user-defined locations.
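The polling mechanism amounts to checking a remotely written flag at protocol entry points instead of fielding an interrupt, roughly as below; the flag word, the queue, and the dispatch helper are assumed names.

#include <stdint.h>

extern volatile uint64_t *msg_arrival_flag;  /* receive-region word set by
                                                a sender's remote write   */
extern int dequeue_and_dispatch(void);       /* hypothetical: run the
                                                handler for one queued
                                                message, 0 if queue empty */

/* Called at protocol entry points (and in idle loops).  Queue entries
 * carry their own valid bits, so a message that races with clearing the
 * flag is simply picked up on the next poll. */
void poll_for_messages(void)
{
    if (*msg_arrival_flag == 0)
        return;                   /* common case: nothing pending */

    *msg_arrival_flag = 0;        /* clear, then drain the queue  */
    while (dequeue_and_dispatch())
        ;
}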
2.3.5 CSM-MS-Mg and CSM-None-Mg: Home Node Migration
All of the above protocol variants use first-touch home node assignment [18]. Home assignment is extremely important because processors on the home node write directly to the master copy and so do not incur costly twin and diff overheads. If a page has multiple writers during the course of execution, protocol overhead can potentially be reduced by migrating the home node to the active writers.

Due to the high cost of remapping Memory Channel addresses, home node migration cannot be used when data is remotely accessible. Hence, CSM-MS-Mg and CSM-None-Mg both keep shared data in private memory and allow the home to migrate during execution. When a processor incurs a write fault, the protocol checks the local copy of the directory to see if the home is actively writing the page. If not, a migration request is sent to the home. The request is granted if it is received when the home is not writing the page. If granted, the home simply changes the directory entry to point to the new home.
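The migration check on a write fault can be sketched as follows, reusing the dir_entry_t from Section 2.2; the "actively writing" test and the request helper are hypothetical names, and the denied case falls back to the normal twin/diff path.

#include <stdint.h>

extern int my_node;

/* Hypothetical helpers: is the home currently writing the page, and
 * send a migration request (returns nonzero if the home granted it). */
extern int home_is_writing(const dir_entry_t *e);
extern int send_migration_request(int home_node, int page);

/* CSM-*-Mg write-fault hook: try to pull the home role to this node so
 * that later writes go straight into the master copy without twin/diff. */
void maybe_migrate_home(int page)
{
    dir_entry_t *e = &directory[page];     /* local (possibly stale) copy */

    if (e->home_node == my_node)
        return;                            /* already home: write in place */

    if (!home_is_writing(e) &&
        send_migration_request(e->home_node, page)) {
        e->home_node = (uint8_t)my_node;   /* granted: record the new home */
    }
    /* If denied, the usual twin/diff machinery handles the fault instead. */
}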
CSM-MS-Mg uses Memory Channel features for meta-data maintenance and for synchronization, while CSM-None-Mg uses only explicit messages. The latter protocol can suffer from unnecessary migration requests since its cached directory entries may be out of date. We do not present CSM-S-Mg since the results of using the Memory Channel for synchronization are qualitatively the same regardless of whether the home node is fixed or migrating.
3 Results
We begin this section with a brief description of our hardware platform and our application suite.
Next, we discuss the results of our investigation of the impact of Memory Channel features and the
home node migration optimization.
3.1 Platform and Basic Operation Costs
Our experimental environment consists of four DEC AlphaServer 2100 4/233 computers. Each AlphaServer is equipped with four 21064A processors operating at 233 MHz and with 256MB of shared memory, as well as a Memory Channel network interface. The 21064A has two on-chip caches: a 16K instruction cache and a 16K data cache. The off-chip secondary cache size is 1 Mbyte. A cache line is 64 bytes. Each AlphaServer runs Digital UNIX 4.0D with TruCluster v. 1.5 (Memory Channel) extensions. The systems execute in multi-user mode, but with the exception of normal Unix daemons no other processes were active during the tests. In order to increase cache efficiency, application processes are pinned to a processor at startup. No other processors are connected to the Memory Channel. Execution times represent the median values of three runs.
On our platform, the Memory Channel has a point-to-point bandwidth of approximately 33 MBytes/sec. One-way latency for a 64-bit remote-write operation is 4.3 μsecs. In practice, the round-trip latency for a null message in Cashmere is 39 μsecs. This time includes the transfer of the message header and the invocation of a null handler function.
Operation              Memory Channel Features   Explicit Messages
Diff (μsecs)           290-363                   485-760
Lock Acquire (μsecs)   46                        103
Barrier (μsecs)        90                        158

Table 2: Basic operation costs at 16 processors. Diff cost varies according to the size of the diff.
Program          Problem Size              Time (sec.)
Barnes           128K bodies (26 Mbytes)   469.4
CLU              2048x2048 (33 Mbytes)     294.7
LU               2500x2500 (50 Mbytes)     254.8
EM3D             64000 nodes (52 Mbytes)   137.3
Gauss            2048x2048 (33 Mbytes)     948.1
Ilink            CLP (15 Mbytes)           755.9
SOR              3072x4096 (50 Mbytes)     194.8
TSP              17 cities (1 Mbyte)       4036.24
Volrend          Head (23 Mbytes)          12.01
Water-nsquared   9261 mols. (6 Mbytes)     1120.6
Water-spatial    9261 mols. (16 Mbytes)    74.0

Table 3: Data set sizes and sequential execution time of applications.
As described earlier, Memory Channel features can be used to significantly reduce the cost of diffs, directory updates, write notice propagation, and synchronization. Table 2 shows the costs for diff operations, lock acquires, and barriers, both when leveraging and when not leveraging the Memory Channel features. The cost of diff operations varies according to the size of the diff. Directory updates, write notices, and flag synchronization all use the Memory Channel's remote-write and total ordering features. (Directory updates and flag synchronization also rely on the inexpensive broadcast support.) Without these features, these operations are accomplished via explicit messages. Directory updates are small messages with simple handlers, so their cost is only slightly more than the cost of a null message. The cost of write notices depends greatly on the write notice count and destinations. Write notices sent to different destinations can be overlapped, thus reducing the operation's overall latency. Flags are inherently broadcast operations, but again the flag update messages to the processors can be overlapped, so the perceived latency should not be much more than that of a null message.
3.2 Application Suite
Our applications are well-known benchmarks that have not been modified from their original distributions.
Barnes: an N-body simulation from the TreadMarks [2] distribution (and based on the same application
in the SPLASH-1 [23] suite), using the hierarchical Barnes-Hut method. Bodies in the simulation space
are placed into nodes in a tree structure based on their physical locations, and this tree structure is
used to control the computation. Synchronization consists of barriers between phases.
CLU: from the SPLASH-2 [27] benchmark suite. The kernel factors a matrix into the product of a lower-triangular and an upper-triangular matrix. Work is distributed by splitting the matrix into blocks and assigning each block to a processor. Blocks modified by a single processor are allocated contiguously in order to increase spatial locality. Barriers are used for synchronization.
LU: also from SPLASH-2. The implementation is identical to CLU except that blocks are not allocated contiguously, resulting in multiple write sharers per coherence block.
EM3D: a program to simulate electromagnetic wave propagation through 3D objects [8]. The primary
computational element is a set of magnetic and electric nodes that are equally distributed among the
processors. These nodes are only shared amongst neighboring processors. Phases of the simulation are
synchronized through barriers.
Gauss: a locally-developed solver for a system of linear equations AX = B using Gaussian elimination and back-substitution. Rows are distributed among processors cyclically. Synchronization flags are used to signal when a pivot row becomes available.
Ilink: a widely used genetic linkage analysis program from the FASTLINK 2.3P [11] package that locates
disease genes on chromosomes. A master processor performs a round-robin assignment of elements in a
sparse array to a pool of slave processors. The slaves perform calculations on the assigned probabilities
and report the results to the master. Barriers are used to synchronize between the master and slaves
and between iterations in the program.
SOR: a Red-Black Successive Over-Relaxation program from the TreadMarks distribution. The program solves partial differential equations. The red and black arrays are divided into roughly equal-size bands of rows, with each band assigned to a different processor. Processors synchronize using barriers.
12
TSP: a branch-and-bound solution to the traveling salesman problem. The program, also from the TreadMarks distribution, distributes work through a task queue. It is non-deterministic, in that parts of the search space can be pruned, depending on when short paths are found. The task queues are protected by locks.
Volrend: a SPLASH-2 application that renders a three-dimensional volume using a ray casting technique. The image plane is partitioned among processors in contiguous blocks, which are further partitioned into small tiles. These tiles serve as the basic unit of work and are distributed through a set of
task queues. Again, the task queues are protected by locks.
Water-nsquared: a fluid flow simulation from the SPLASH-2 benchmark suite. The molecule structures are kept in a shared array that is divided into contiguous chunks and assigned to processors. The bulk of the interprocessor communication occurs during a phase that updates intermolecular forces (from within a radius of n/2 molecules, where n is the number of molecules), using per-molecule locks, resulting in a migratory sharing pattern.
Water-spatial: another SPLASH-2 fluid flow simulation that solves the same problem as Water-nsquared. The simulation space is placed under a uniform 3-D grid of cells, with each cell assigned to a processor. Sharing occurs when molecules move from one cell to another. In comparison with Water-nsquared, this application also uses a more efficient, linear algorithm. The application uses barriers and locks to synchronize.
The data set sizes and uniprocessor execution times for these applications are presented in Table 3.
The size of shared memory space is listed in parentheses. Execution times were measured by running
each uninstrumented application sequentially without linking it to the protocol library.
3.3 Performance
This subsection begins by discussing the impact of Memory Channel support, in particular remote-write capability, inexpensive broadcast, and total-ordering properties, on the three types of protocol communication: shared data propagation, protocol meta-data maintenance, and synchronization. All protocols described in this subsection use a first-touch home node assignment.⁴ We found that eight of our eleven benchmark applications benefited from the special Memory Channel features. The improvement can be especially large (up to 37% over an explicit messaging protocol) in an application that dynamically distributes work. In this case, the special Memory Channel features serve to reduce protocol-induced overhead, thereby reducing load imbalance and costly work re-distributions.

⁴In the case of multiple sharers per page, the timing differences between protocol variants can lead to first-touch differences. To eliminate these differences and isolate Memory Channel impact, we captured the first-touch assignments from CSM-DMS and used them to explicitly assign home nodes in the other protocols.
Throughout this section, we will refer to Figure 1 and Table 4. Figure 1 shows a breakdown of execution time, normalized to that of the CSM-DMS protocol, for the six protocol variants. Execution time is broken down to show the time spent executing application code (User), executing protocol code (Protocol), waiting on synchronization operations (Synchronization), and sending or receiving messages (Message). Table 4 lists the speedups and statistics on protocol communication for each of the applications running on 16 processors. The statistics include the number of page transfers, invalidations, and diff operations. The table also lists the number of home migrations, along with the number of migration attempts (listed in parentheses).
3.3.1 The Impact of Memory Channel Features
Eight of our eleven applications show measurable performance improvements running on CSM-DMS (fully leveraging Memory Channel features) as opposed to CSM-None (using explicit messages). Volrend runs 37% faster on CSM-DMS than it does on CSM-None. Barnes, EM3D, LU, and Water-nsquared run 7-11% faster. Gauss and SOR run less than 4% faster. Three applications, CLU, Ilink, and TSP, are not sensitive to the use of Memory Channel features and do not show any significant performance differences across our protocols.
Volrend's performance is very sensitive to its workload distribution. Perturbation introduced by the protocol induces load imbalance and triggers expensive task stealing. As can be seen from Table 4, the number of page transfers and diffs increases as shared data propagation and protocol meta-data maintenance no longer leverage Memory Channel features. Despite performing all protocol communication with explicit messages, CSM-None performs better than CSM-S. On CSM-None, the application has better load balance and incurs less task stealing. CSM-None performs fewer diff operations, and instrumentation shows it also performs fewer accesses to the task queue lock. Regardless, Volrend performs poorly overall on all protocol versions: the best achieved speedup is only two on 16 processors.
Barnes exhibits a high degree of sharing and incurs a large amount of protocol and synchronization
overhead. Performance slowly degrades across the protocols. In this application, remote writes and
total ordering have the biggest impact on the cost of meta-data maintenance. These features permit
an approximate 5% reduction in the large number of invalidations. Without the use of Memory Channel features, the invalidations require explicit messages to update the master directory entry. These
messages result in higher protocol overhead and poorer synchronization characteristics (see Figure 1).
Water-nsquared has a large number of synchronization operations due to its use of per-molecule locks. However, across protocols, the application does not show a large synchronization time relative to some of the other applications. The large number of locks reduces per-lock contention, and largely limits the synchronization overhead to the synchronization mechanism and associated protocol overhead. This application benefits most from using the Memory Channel features to optimize lock synchronization. Figure 1 shows that the synchronization cost is highest in CSM-None. The application is written to encourage lock handoffs between neighboring processors, which incur negligible protocol overhead if within the same SMP node. However, as protocol-induced overhead increases across the protocols, more lock handoffs occur between processors on different nodes. These inter-node handoffs lead to more diffs and higher protocol and synchronization times. As in Volrend, CSM-DMS incurs the least amount of perturbation (imbalance) due to the protocol, which helps keep lock accesses inside nodes, thereby avoiding diff operations.
At the given matrix size, LU incurs a large amount of protocol communication due to the write-write sharing at row boundaries. Remote writes and total ordering are very effective at reducing the overhead of the diff and invalidation operations. The protocols using explicit messages show higher protocol and synchronization overhead, due to more expensive diffs and invalidations.
EM3D, Gauss, SOR, and Water-spatial all benefit from protocols that leverage the special Memory Channel support. In these applications, our instrumentation shows that most diffs are handled by an idle processor. For these applications, meta-data maintenance is again the area that benefits most from special Memory Channel support.
Of the remaining applications, CLU, Ilink, and TSP are not noticeably affected by the underlying Memory Channel support. CLU and TSP have little communication that can be optimized. Ilink, however, performs a large number of diffs, and might be expected to benefit significantly from remote-write support. However, 90% of the diffs are applied at the home node by idle processors, so the extra overhead is somewhat hidden from application computation.
3.3.2 Home Node Migration: Optimization for a Scalable Data Space
Home node migration can reduce the number of remote memory accesses by moving the home node to active writers. Our results show that this optimization is very effective. Six of our eleven applications perform better using home node migration and explicit data propagation (CSM-MS-Mg) than using first-touch assignment and remote-write data propagation (CSM-DMS).⁵ Home node migration can reduce protocol overhead by reducing the number of twin/diff operations and invalidations. In fact, this reduction can be so great that three of our applications obtain the best overall performance when using migration and explicit messages for all protocol communication.

⁵Migration cannot be used when data is placed in remotely-accessible network address space, because of the high cost of remapping.
Volrend, Water-spatial, and LU all benefit greatly from migration because the number of diff (and attendant twin) operations is significantly reduced (see Table 4). In fact, for these applications, CSM-None-Mg, which does not leverage the special Memory Channel features at all, outperforms the full Memory Channel protocol, CSM-DMS, by a range of 18% to 34%. Figure 1 shows that the protocol component of execution time is significantly decreased for these applications. In Volrend, this decrease is especially important since the reduced protocol overhead leads to better load balance and less task stealing.
On EM3D and Water-nsquared, the migration protocols CSM-MS-Mg and CSM-None-Mg perform better than their first-touch counterparts that use explicit messages for at least some protocol communication (CSM-MS, CSM-S, and CSM-None). The migration optimization again reduces the number of diff operations. However, this gain is offset by the increased overhead of migration requests. The two migration protocols perform basically the same as the full MC protocol, CSM-DMS.

Barnes and Gauss are the only two applications to suffer under the migration optimization.
[Figure 1: eleven stacked-bar panels, one per application (Barnes, CLU, LU, EM3D, Gauss, Ilink, SOR, TSP, Volrend, Water-NSQ, Water-SP); bar segments show Message, Synchronization, Protocol, and User time.]

Figure 1: Normalized execution time breakdown for the applications on the protocols at 16 processors. The suffix on the protocol name represents the areas of communication using Memory Channel features (D: shared Data propagation, M: protocol Meta-data maintenance, S: Synchronization, None: no use of Memory Channel features). Mg denotes a migrating home node policy.
Table 4: Application speedups and statistics at 16 processors. The suffix on the protocol name represents the areas of communication using Memory Channel features (D: shared Data propagation, M: protocol Meta-data maintenance, S: Synchronization, None: no use of Memory Channel features). Mg denotes a migrating home node policy.
In Barnes, the degree of sharing is very high and there is a large number of migration requests. The extra overhead of these requests balances the reduction in diff operations in CSM-MS-Mg. CSM-None-Mg loses performance since directory state is no longer kept consistent globally. As a result, CSM-None-Mg sends approximately 580K unsuccessful migration requests. As shown in Table 4, Gauss performs many more invalidations when using migration. These invalidations result in increased protocol and messaging overhead with respect to the first-touch protocols.

CLU, Ilink, and TSP again are relatively insensitive to the underlying Memory Channel support or to the migration mechanism. In Ilink the number of diff operations is significantly reduced, but again the benefits are offset by the increased overhead of migration requests.
4 Related Work
In a technical report, Bilas et al. [4] also examine the impact of special network features on SDSM performance. Their network has both remote-write and remote-read capabilities, but no broadcast or total ordering. Their results show that advanced network features provide large improvements in SDSM performance. However, their base protocol uses inter-processor interrupts to signal message delivery. Interrupts on commodity machines are typically on the order of hundreds of microseconds, and so largely erase the benefits of a low-latency network. Our evaluation here assumes that messages can be detected through a much more efficient polling mechanism, as is found with other SANs [10, 13], and so each of our protocols benefits from the same low messaging latency.
Amza et al. [3] describe adaptive extensions to the TreadMarks [2] protocol that avoid twin/diff operations on shared pages with only a single writer. (Pages with multiple writers still use twins and diffs.) Our home node migration scheme is similar in principle. If a page has only a single writer, the home always migrates to that writer, and so twin/diff operations are avoided. In the presence of multiple concurrent writers, our scheme will always migrate to one of the writers, thereby avoiding twin/diff overhead at one node. Cashmere is also able to take advantage of the replicated directory when making migration decisions (i.e., to determine whether the home is currently writing the page).
There has also been much work in adapting coherence protocol operations to migratory access patterns. In a migratory access pattern, a piece of data is read and written by a succession of processors in a lockstep manner. This pattern results in the transfer of data from one processor to another, and usually involves two coherence operations (each with multiple messages), one for the read and one for the write. Recent work [24, 7, 19] in both hardware and software coherent systems discusses methods to classify migratory data and then collapse the two coherence operations into one. This technique could be built into our system, and may be very helpful in reducing the overhead due to unnecessary migration requests.
5 Conclusions
In this paper, we have studied the effect of advanced network features, in particular remote writes, inexpensive broadcast, and total ordering, on SDSM. Our evaluation used the state-of-the-art Cashmere protocol, which was designed with these network features specifically in mind.

We have found that these features never hurt performance and do indeed lead to modest performance improvements (up to 11%) for most applications. The improvements are due to a decrease in communication, and correspondingly protocol, overhead. One application, however, improves dramatically, by 37%. This application uses a dynamic work distribution scheme, which operates more effectively with the reduced protocol overhead. Unfortunately, even after the improvement, the application only obtains an extremely poor speedup of two on 16 processors.
Virtually all of the performance differences we have seen are due to optimized meta-data maintenance. The use of remote writes to propagate data modifications has little impact. In barrier-based programs, this can be expected: instrumentation shows that most diff messages are handled by idle processors. The network features have little effect on the operational cost of synchronization primitives, so optimization in this area has little effect on overall performance.
Finally, we also found that home node migration is a very effective mechanism for reducing the number of twin/diff operations and the resulting protocol overhead. The mechanism is so effective that the benefits outweigh those from using the network features for shared data propagation. Shared data can thus safely be placed in the node's private memory. The pressure on remotely accessible memory is thereby greatly reduced, providing more flexibility and scalability for the system.
References
[1] S. V. Adve and M. D. Hill. A Unified Formulation of Four Shared-Memory Models. IEEE Transactions on Parallel and Distributed Systems, 4(6):613-624, June 1993.

[2] C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, and W. Zwaenepoel. TreadMarks: Shared Memory Computing on Networks of Workstations. Computer, 29(2):18-28, February 1996.

[3] C. Amza, A. Cox, S. Dwarkadas, and W. Zwaenepoel. Software DSM Protocols that Adapt between Single Writer and Multiple Writer. In Proceedings of the Third International Symposium on High Performance Computer Architecture, San Antonio, TX, February 1997.

[4] A. Bilas, C. Liao, and J. P. Singh. Network Interface Support for Shared Virtual Memory on Clusters. Technical Report TR-579-98, Department of Computer Science, Princeton University, March 1998.

[5] J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Implementation and Performance of Munin. In Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, pages 152-164, Pacific Grove, CA, October 1991.

[6] Y. Chen, A. Bilas, S. N. Damianakis, C. Dubnicki, and K. Li. UTLB: A Mechanism for Address Translation on Network Interfaces. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, CA, October 1998.

[7] A. L. Cox and R. J. Fowler. Adaptive Cache Coherency for Detecting Migratory Shared Data. In Proceedings of the Twentieth International Symposium on Computer Architecture, San Diego, CA, May 1993.

[8] D. Culler, A. Dusseau, S. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick. Parallel Programming in Split-C. In Proceedings, Supercomputing '93, pages 262-273, Portland, OR, November 1993.

[9] M. Dubois, J. C. Wang, L. A. Barroso, K. L. Lee, and Y.-S. Chen. Delayed Consistency and its Effect on the Miss Rate of Parallel Programs. In Supercomputing '91 Proceedings, pages 197-206, Albuquerque, NM, November 1991.

[10] D. Dunning, G. Regnier, G. McAlpine, D. Cameron, B. Shubert, F. Berry, A. M. Meritt, E. Gronke, and C. Dodd. The Virtual Interface Architecture. In IEEE Micro, pages 66-76, March 1998.

[11] S. Dwarkadas, A. A. Schaffer, R. W. Cottingham Jr., A. L. Cox, P. Keleher, and W. Zwaenepoel. Parallelization of General Linkage Analysis Problems. Human Heredity, 44:127-141, 1994.

[12] S. Dwarkadas, K. Gharachorloo, L. Kontothanassis, D. J. Scales, M. L. Scott, and R. Stets. Comparative Evaluation of Fine- and Coarse-Grain Approaches for Software Distributed Shared Memory. In Proceedings of the Fifth International Symposium on High Performance Computer Architecture, Orlando, FL, January 1999.

[13] T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, Copper Mountain, CO, December 1995.

[14] A. Erlichson, N. Nuckolls, G. Chesson, and J. Hennessy. SoftFLASH: Analyzing the Performance of Clustered Distributed Virtual Shared Memory. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 210-220, Cambridge, MA, October 1996.

[15] R. Gillett. Memory Channel: An Optimized Cluster Interconnect. IEEE Micro, 16(2):12-18, February 1996.

[16] R. W. Horst and D. Garcia. ServerNet SAN I/O Architecture. In Proceedings of the Hot Interconnects V Symposium, Palo Alto, CA, August 1997.
[17] L. Kontothanassis, G. Hunt, R. Stets, N. Hardavellas, M. Cierniak, S. Parthasarathy, W. Meira, S. Dwarkadas, and M. L. Scott. VM-Based Shared Memory on Low-Latency, Remote-Memory-Access Networks. In Proceedings of the Twenty-Fourth International Symposium on Computer Architecture, pages 157-169, Denver, CO, June 1997.

[18] M. Marchetti, L. Kontothanassis, R. Bianchini, and M. L. Scott. Using Simple Page Placement Policies to Reduce the Cost of Cache Fills in Coherent Shared-Memory Systems. In Proceedings of the Ninth International Parallel Processing Symposium, Santa Barbara, CA, April 1995.

[19] L. R. Monnerat and R. Bianchini. Efficiently Adapting to Sharing Patterns in Software DSMs. In Proceedings of the Fourth International Symposium on High Performance Computer Architecture, Las Vegas, NV, February 1998.

[20] R. Samanta, A. Bilas, L. Iftode, and J. Singh. Home-Based SVM Protocols for SMP Clusters: Design and Performance. In Proceedings of the Fourth International Symposium on High Performance Computer Architecture, pages 113-124, February 1998.

[21] D. J. Scales and K. Gharachorloo. Towards Transparent and Efficient Software Distributed Shared Memory. In Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, St. Malo, France, October 1997.

[22] D. J. Scales, K. Gharachorloo, and A. Aggarwal. Fine-Grain Software Distributed Shared Memory on SMP Clusters. In Proceedings of the Fourth International Symposium on High Performance Computer Architecture, Las Vegas, NV, February 1998.

[23] J. P. Singh, W.-D. Weber, and A. Gupta. SPLASH: Stanford Parallel Applications for Shared-Memory. ACM SIGARCH Computer Architecture News, 20(1):5-44, March 1992.

[24] P. Stenstrom, M. Brorsson, and L. Sandberg. An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. In Proceedings of the Twentieth International Symposium on Computer Architecture, San Diego, CA, May 1993.

[25] R. Stets, S. Dwarkadas, N. Hardavellas, G. Hunt, L. Kontothanassis, S. Parthasarathy, and M. Scott. Cashmere-2L: Software Coherent Shared Memory on a Clustered Remote-Write Network. In Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, St. Malo, France, October 1997.

[26] M. Welsh, A. Basu, and T. von Eicken. A Comparison of ATM and Fast Ethernet Network Interfaces for User-level Communication. In Proceedings of the Third International Symposium on High Performance Computer Architecture, San Antonio, TX, February 1997.

[27] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. Methodological Considerations and Characterization of the SPLASH-2 Parallel Application Suite. In Proceedings of the Twenty-Second International Symposium on Computer Architecture, Santa Margherita Ligure, Italy, June 1995.