Load-Based Rerouting of Advanced Reservations
Transcript
A Distributed Load-Based Failure Recovery Mechanism for Advance Reservation Environments
Lars-Olof Burchard, Barry Linnert, Jörg Schneider
KBS, TU Berlin

Motivation: SLA and Advance Reservations (1-2)
- SLA provisioning for compute jobs and complex workflows in Grid environments:
  resource planning, co-allocation, deadline guarantees, runtime responsibility
- Possible solution: advance reservations
  - Start and stop time given by the user
  - Needs support by all local RMS

Failure Recovery for Advance Reservations (1-3)
- Idea: remap all jobs admitted on the broken resource for the failure duration
- How to determine the affected jobs? Estimate the failure duration:
  - Underestimation: higher termination ratio
  - Overestimation: higher blocking ratio and remapping overhead

Load-Based Remapping (1-4 to 1-11)
- Remap independently of the failure duration
- Keep the remapping interval clean
[Figure, animated over slides 1-4 to 1-11: Gantt chart of jobs J1-J13 over
nodes and time; starting from the current time, the jobs inside the
remapping interval are moved off the failed resource, whose downtime is of
unknown length.]
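The "keep the remapping interval clean" step can be sketched as selecting every admitted job on the failed resource that touches the interval. This is an illustrative sketch, not the authors' implementation; the `Job` type, the overlap test, and inclusive slot bounds are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    start: int   # first time slot of the reservation
    stop: int    # last time slot of the reservation (inclusive, assumed)

def jobs_to_remap(jobs, now, interval):
    """Select the jobs on a failed resource that overlap the remapping
    interval [now, now + interval] and therefore must be moved away,
    regardless of how long the failure actually lasts."""
    horizon = now + interval
    return [j for j in jobs if j.start <= horizon and j.stop >= now]
```

Jobs entirely beyond the horizon stay in place; presumably the interval is re-evaluated in later slots if the resource is still down.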
Determine the Remapping Interval (1-13)
- Longer interval: the probability of successful remapping increases
- Shorter interval: fewer jobs are remapped unnecessarily
- Optimum: remap a job if remapping it will be impossible in the next time slot
- Calculate the remapping interval from the currently booked load and a
  prediction of the incoming load

Load-Based Remapping (1-14)
- Remapping interval i_η(t0): for each t > i_η(t0), the expected load must
  satisfy l̂_t0(t) < η

Distributed Approach (1-15)
- The Grid is separated into domains
- No central instance holds and calculates the load and booking profiles:
  avoids communication overhead, preserves information hiding between
  domains, no global knowledge
- "How well does the load-based approach perform with reduced knowledge?"
- Worst case: one resource per domain, inhomogeneous load distribution

Evaluation (1-16)
- Measured: termination ratio and remapping overhead
- Simulated Grid; resources: 32, 96, 128, 256, 512 nodes
- Reservations: simple synthetic jobs with different book-ahead times and
  load situations; job size: 2, 4, 8, ..., 128 nodes; length: 250-750
  time slots
- Failures: constant failure duration (200, 500, or 1000 time slots), same
  failure probability for each resource

Termination Ratio (1-17)
[Figure: termination ratio (%) over failure duration (200, 500, 1000 time
slots) for the cases load_global > load_local and load_global < load_local;
series: central, distributed, exact knowledge, 50% underestimation.]
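The interval rule on slide 1-14 can be sketched as a scan over a per-slot load prediction. This is a sketch under assumptions: the expected load l̂_t0(t) is given as a list, and i_η(t0) is taken to be the last future slot whose expected load still reaches the threshold η (every slot beyond it stays below η, so jobs out there can still be remapped later).

```python
def remapping_interval(expected_load, eta):
    """Sketch of i_eta(t0): index (1-based, in slots after t0) of the
    last future slot whose expected load reaches the threshold eta.
    expected_load[t] is the predicted utilisation of slot t0 + t + 1."""
    i = 0
    for t, load in enumerate(expected_load, start=1):
        if load >= eta:
            i = t
    return i
```

With a load prediction of [0.2, 0.8, 0.6, 0.9, 0.3] and η = 0.7, the interval covers the first four slots: the last slot at or above the threshold is slot 4, and everything after it stays below η.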
Remapping Overhead (1-18)
[Figure: remapping overhead (%) over failure duration (200, 500, 1000 time
slots) for the cases load_global > load_local and load_global < load_local;
series: central, distributed.]

Conclusion and Future Work (1-19)
- Conclusion:
  - Remapping in advance is needed for systems using advance reservations
  - The load-based approach reaches the performance achieved with exact
    knowledge of the failure duration
  - The load-based approach can be used in a distributed setting
- Future work:
  - Checkpointing and migration to split jobs
  - Adaptation for other types of resources

A Distributed Load-Based Failure Recovery Mechanism for Advance Reservation Environments
Jörg Schneider, TU Berlin
[email protected]
http://kbs.cs.tu-berlin.de/

Termination Ratio (1-21)
[Figure: termination ratio (%) over the load threshold η (0.10 to 1.00) for
load_global > load_local and load_global < load_local; series: central,
distributed, exact knowledge, 50% underestimation.]

Remapping Overhead (1-22)
[Figure: remapping overhead (%) over the load threshold η (0.10 to 1.00);
series: central, distributed.]

Different Load Differences (1-23)
[Figure: termination ratio (%) over η (0.1 to 1.0) and load difference
(-650 to 650), one surface plot each for the central and the distributed
approach.]

Load-Based Remapping (1-24)
- Load profile l_t0(t): at time slot t0, the load already booked for each
  future time slot t0 + t
- Booking profile b_tx(t): the load booked during time slot tx for the
  future time slot tx + t
- Average booking profile b̄(t)
- Combined profile l̂_t0(t) = l_t0(t) + b̄(t): the expected load for future
  time slots
- Load threshold η: maximum load for a high remapping probability

Virtual Resource Manager: Main Idea (1-25)
- Joint project of KBS (TU Berlin) and PC² (Paderborn University)
- Layer between the "Grid" and the local RMS ("close the gap")
- Implements SLA negotiation, monitoring, allocation, and planning within
  this layer
- Few modules with well-defined standard interfaces (→ GGF)
- Management entity providing enhanced functionality for domains of
  various sizes
- Integration of existing RMS

VRM Architecture (1-26)
[Figure: "The Grid" (other VRMs, resources, ...) above the VRM with its
ADC; AI components connect the VRM to the local RMS of a cluster, a
network, and another cluster.]

Virtual Resource Manager (1-27)
[Figure]

Application Environment (1-28)
- The application environment requests available resources
- Advance reservations:
  - Time slots of fixed size
  - Book-ahead interval: the resource usage of admitted requests is planned
    per slot within this interval
  - Reservations are booked in advance; start and stop time are given at
    admission
[Figure: timeline of slots within the book-ahead interval, showing active
admitted and inactive admitted reservations relative to the current time.]
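The profile definitions on slide 1-24 translate directly into a short sketch: the average booking profile b̄(t) is the mean of the booking profiles observed in past slots, and the combined profile l̂_t0(t) = l_t0(t) + b̄(t) adds the predicted incoming load to the already booked load. List-based profiles and plain arithmetic averaging are assumptions of this sketch.

```python
def average_booking_profile(booking_profiles):
    """bar b(t): average of the booking profiles b_tx(t) observed in
    past time slots; the profiles may have different lengths."""
    n = len(booking_profiles)
    length = max(len(b) for b in booking_profiles)
    avg = [0.0] * length
    for b in booking_profiles:
        for t, v in enumerate(b):
            avg[t] += v / n
    return avg

def combined_profile(load_profile, avg_booking):
    """hat l_t0(t) = l_t0(t) + bar b(t): booked load plus predicted
    incoming load, i.e. the expected load for each future slot."""
    length = max(len(load_profile), len(avg_booking))
    def at(xs, t):
        return xs[t] if t < len(xs) else 0.0
    return [at(load_profile, t) + at(avg_booking, t) for t in range(length)]
```

The resulting combined profile is what the remapping decision compares against the threshold η.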