Load-Based Rerouting of Advanced Reservations

A Distributed Load-Based Failure Recovery Mechanism for Advance Reservation Environments
Lars-Olof Burchard
Barry Linnert
Jörg Schneider
KBS, TU Berlin
Motivation: SLA and Advance Reservations
• SLA provisioning for compute jobs and complex workflows in Grid environments
  • Resource planning
  • Co-allocation
  • Deadline guarantees
  • Runtime responsibility
• Possible solution: Advance Reservations
  • Start and stop time given by user
  • Needs support by all local RMS
Jörg Schneider, KBS, TU Berlin
Failure Recovery for Advance Reservations
• Idea: remap all jobs admitted to the broken resource
[Figure: timeline marking the failure duration]
• How to determine the affected jobs?
  • Estimate the failure duration
  • Underestimation: higher termination ratio
  • Overestimation: higher blocking ratio and remapping overhead
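The estimate-based selection of affected jobs can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Reservation` record, the `affected_jobs` helper, and the rule that only jobs starting inside the estimated downtime window are selected are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Reservation:
    """Hypothetical record for one admitted advance reservation."""
    name: str
    resource: str
    start: int  # first time slot (inclusive)
    stop: int   # end time slot (exclusive)


def affected_jobs(reservations: List[Reservation], failed_resource: str,
                  failure_start: int, estimated_duration: int) -> List[Reservation]:
    """Select admitted jobs on the failed resource that are due to start
    during the estimated downtime (assumed selection rule)."""
    failure_end = failure_start + estimated_duration
    return [
        r for r in reservations
        if r.resource == failed_resource
        and failure_start <= r.start < failure_end
    ]


jobs = [
    Reservation("J1", "A", 5, 10),
    Reservation("J2", "A", 20, 30),
    Reservation("J3", "B", 6, 9),
]
# A short estimate selects fewer jobs than a long one.
short_estimate = affected_jobs(jobs, "A", failure_start=0, estimated_duration=10)
long_estimate = affected_jobs(jobs, "A", failure_start=0, estimated_duration=30)
```

With the short estimate only J1 is selected. If the real downtime is longer, J2 is not remapped in time, which is the underestimation case. The long estimate also selects J2, which costs extra remapping if the resource recovers early.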
Load-Based Remapping
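• Remap independently of the failure duration
• Keep the remapping interval clean
[Figure: animated schedule of jobs J1 to J13 plotted as nodes over time, marking the current time, the downtime, and the remapping interval; later frames show only jobs J1, J2, J3, J4, J6, J7, J8, J11, and J12]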
Determine the Remapping Interval
• Remapping interval
  • Longer: probability of success increases
  • Shorter: fewer jobs are remapped unnecessarily
• Optimum: remap a job only if remapping it would become impossible in the next time slot
• Calculate the remapping interval using the currently booked load and a prediction of the incoming load
Load-Based Remapping
• Remapping interval: $i_\eta(t_0)$
• For each $t > i_\eta(t_0)$, the expected load (booked load plus predicted bookings) must satisfy $\tilde{l}_{t_0}(t) < \eta$
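The condition above can be turned into a small search over the expected load. The sketch below assumes the expected load is available as a list indexed by the offset t from the current time slot; the function name `remapping_interval` and the convention of returning 0 when the load never reaches the threshold are choices made for this example only.

```python
from typing import Sequence


def remapping_interval(expected_load: Sequence[float], eta: float) -> int:
    """Return the smallest offset i such that expected_load[t] < eta
    holds for every offset t > i."""
    interval = 0
    for offset, load in enumerate(expected_load):
        if load >= eta:
            # Load is still too high here, so the interval must reach
            # at least this offset.
            interval = offset
    return interval


# Expected load per future offset (fraction of total capacity, made-up values).
profile = [0.9, 0.8, 0.5, 0.7, 0.3, 0.2]
interval = remapping_interval(profile, eta=0.6)
```

In this example the last offset with load at or above 0.6 is offset 3, so the interval is 3. A lower threshold lengthens the interval and a higher one shortens it.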
Distributed Approach
• The Grid is separated into domains
• No central instance holds and calculates load and booking profiles
  • Avoids communication overhead
  • Information hiding between domains
• No global knowledge
  • "How well does the load-based approach perform with reduced knowledge?"
  • Worst case: one resource per domain
  • Inhomogeneous load distribution
Evaluation
• Measured termination ratio and remapping overhead
• Simulated Grid
  • Resources: 32, 96, 128, 256, 512 nodes
• Reservations
  • Simple, synthetic jobs
  • Different book-ahead times / load situations
  • Job size: 2, 4, 8, ..., 128 nodes
  • Length: 250-750 time slots
• Failures
  • Constant failure duration (200, 500, 1000 time slots)
  • Same failure probability for each resource
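A synthetic workload with these parameters could be generated roughly as follows. The slide does not state the distributions, so uniform sampling, the power-of-two reading of the job size list, and the names `SyntheticJob` and `generate_jobs` are all assumptions of this sketch.

```python
import random
from dataclasses import dataclass
from typing import List

# Job sizes in nodes, read as powers of two from "2, 4, 8, ..., 128" (assumption).
JOB_SIZES = [2, 4, 8, 16, 32, 64, 128]
# Job length bounds in time slots, taken from the slide.
MIN_LENGTH, MAX_LENGTH = 250, 750


@dataclass(frozen=True)
class SyntheticJob:
    nodes: int
    start: int   # requested start slot, relative to submission
    length: int  # duration in time slots


def generate_jobs(count: int, max_book_ahead: int, seed: int = 0) -> List[SyntheticJob]:
    """Draw jobs uniformly at random (assumed distribution)."""
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    return [
        SyntheticJob(
            nodes=rng.choice(JOB_SIZES),
            start=rng.randint(0, max_book_ahead),
            length=rng.randint(MIN_LENGTH, MAX_LENGTH),
        )
        for _ in range(count)
    ]


workload = generate_jobs(count=50, max_book_ahead=1000, seed=1)
```

Varying `max_book_ahead` is one way to approximate the different book-ahead times mentioned above.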
Termination Ratio
[Figure: termination ratio (%) over failure duration (200, 500, 1000) in two panels, load_global > load_local and load_global < load_local; series: central, distributed, exact knowledge, 50 % underestimation]
Remapping Overhead
[Figure: remapping overhead (%) over failure duration (200, 500, 1000) in two panels, load_global > load_local and load_global < load_local; series: central, distributed]
Conclusion and Future Work
• Conclusion:
  • Remapping in advance is needed for systems using advance reservation.
  • The load-based approach reaches the performance achieved with exact knowledge.
  • The load-based approach can be used in a distributed setting.
• Future Work:
  • Checkpointing and migration to split jobs
  • Adapt for other types of resources
Jörg Schneider, TU Berlin
http://kbs.cs.tu-berlin.de/
Termination Ratio
[Figure: termination ratio (%) over η (0.10 to 1.00) in two panels, load_global > load_local and load_global < load_local; series: central, distributed, exact knowledge, 50 % underestimation]
Remapping Overhead
[Figure: remapping overhead (%) over η (0.10 to 1.00) in two panels; series: central, distributed]
Different Load Differences
[Figure: termination ratio (%) as a function of η (0.1 to 1.0) and load difference (-650 to 650) in two panels, central and distributed]
Load-Based Remapping
• Load profile: $l_{t_0}(t)$
  • At time slot $t_0$: the total load already booked for each future time slot $t_0 + t$
• Booking profile: $b_{t_x}(t)$
  • The load booked during time slot $t_x$ for the future time slot $t_x + t$
• Average booking profile: $b(t)$
• Combined profile: $\tilde{l}_{t_0}(t) = l_{t_0}(t) + b(t)$
  • Expected load for future time slots
• Load threshold: $\eta$
  • Maximum load at which remapping still succeeds with high probability
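The profile definitions can be illustrated with plain lists indexed by the offset t. This is only a sketch: the function names `average_booking_profile` and `combined_profile` are hypothetical, and averaging over a fixed set of recorded booking profiles is an assumption, since the slide does not say how many past time slots enter the average.

```python
from typing import List, Sequence


def average_booking_profile(booking_profiles: Sequence[Sequence[float]]) -> List[float]:
    """Average, per offset t, the load that was booked during each
    recorded time slot t_x for the slot t_x + t."""
    count = len(booking_profiles)
    horizon = len(booking_profiles[0])
    return [
        sum(profile[t] for profile in booking_profiles) / count
        for t in range(horizon)
    ]


def combined_profile(load_profile: Sequence[float],
                     avg_booking: Sequence[float]) -> List[float]:
    """Expected load per offset: load already booked plus the average
    load expected to arrive, following the sum given on the slide."""
    return [booked + incoming for booked, incoming in zip(load_profile, avg_booking)]


# Two recorded booking profiles and the current load profile (made-up values).
recorded = [[0.25, 0.5], [0.75, 0.0]]
avg = average_booking_profile(recorded)
expected = combined_profile([0.25, 0.5], avg)
```

The resulting expected load can then be compared against the threshold η for each offset.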
Virtual Resource Manager: Main Idea
• Joint project of KBS (TU Berlin) and PC² (Paderborn University)
• Layer between the "Grid" and the local RMS ("close the gap")
• Implement SLA negotiation, monitoring, allocation, and planning within this layer
• Few modules with well-defined standard interfaces (>GGF)
• Management entity providing enhanced functionality for domains of various size
• Integration of existing RMS
VRM Architecture
[Figure: architecture diagram with the labels "The Grid" (other VRMs, resources, ...), VRM, ADC, three AIs, and three local RMS (cluster, network, cluster)]
Application Environment
• Advance Reservations
  • Time slots of fixed size
  • Book-ahead interval
• Reservations
  • Booked in advance
  • Start and stop time given at admission
[Figure: resource usage of admitted requests plotted over time slots against the available resources, showing an incoming request, the current time, the book-ahead interval, and reservations labeled "active, admitted" and "inactive, admitted"]
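Admission control in such a slotted environment can be sketched as a per-slot capacity check. The class below is an illustrative assumption rather than the VRM's actual interface: `SlotTable`, its `admit` method, and the half-open interval convention for start and stop are invented for this example.

```python
from typing import Dict


class SlotTable:
    """Tracks booked nodes per fixed-size time slot for one resource."""

    def __init__(self, capacity: int, book_ahead: int) -> None:
        self.capacity = capacity       # nodes available in every slot
        self.book_ahead = book_ahead   # how far ahead reservations may reach
        self.used: Dict[int, int] = {} # slot index -> nodes already booked

    def admit(self, now: int, start: int, stop: int, nodes: int) -> bool:
        """Admit a reservation for slots [start, stop) if it lies within
        the book-ahead interval and fits into every requested slot."""
        if not (now <= start < stop <= now + self.book_ahead):
            return False
        if any(self.used.get(s, 0) + nodes > self.capacity
               for s in range(start, stop)):
            return False
        for s in range(start, stop):
            self.used[s] = self.used.get(s, 0) + nodes
        return True


table = SlotTable(capacity=4, book_ahead=10)
first = table.admit(now=0, start=2, stop=5, nodes=3)       # fits
overlap = table.admit(now=0, start=4, stop=6, nodes=2)     # slot 4 would exceed capacity
later = table.admit(now=0, start=5, stop=7, nodes=2)       # fits after the first job
too_far = table.admit(now=0, start=8, stop=12, nodes=1)    # beyond the book-ahead interval
```

A rejected request leaves the table unchanged, so admitted reservations keep their guaranteed slots.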