Carnegie Mellon University
CARNEGIE INSTITUTE OF TECHNOLOGY

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF Doctor of Philosophy

TITLE Power-Aware CPU Management in QoS-Guaranteed Systems

PRESENTED BY Saowanee Saewong

ACCEPTED BY THE DEPARTMENT OF Electrical and Computer Engineering

___Prof. Ragunathan Rajkumar________________________
ADVISOR, MAJOR PROFESSOR          DATE

___Prof. T.E. Schlesinger________________________
DEPARTMENT HEAD          DATE

APPROVED BY THE COLLEGE COUNCIL

___Prof. Pradeep Khosla________________________
DEAN          DATE

Power-Aware CPU Management in QoS-Guaranteed Systems

A Dissertation Submitted to the Graduate Education Committee at the Department of Electrical and Computer Engineering, Carnegie Mellon University, in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Electrical and Computer Engineering

by

Saowanee Saewong

COMMITTEE MEMBERS

Advisor: Prof. Ragunathan Rajkumar
Prof. John Lehoczky
Prof. Daniel Mosse
Prof. Hyong Kim

Pittsburgh, Pennsylvania
May, 2007

Copyright © 2007 Saowanee Saewong

This dissertation is dedicated to my parents, Wong Chung Ho and Suwimon Saetung, and my brother, Plertichai Saewong.


Abstract

The need for power optimization with QoS support is increasingly driving new CPU management schemes in embedded, mobile and server systems, due to the direct impact of limited energy sources and heat dissipation. In addition to traditional low-power design, which aims for the highest performance delivery, the new concept of "power-aware" design, which enables hardware-software collaboration to scale down the power and performance of hardware whenever the system performance can be relaxed, is gaining importance.

In this dissertation, we design and develop "power-aware and reservation-based" CPU scheduling algorithms implemented in the OS layer. Using knowledge of the workload, the framework controls the power state (active, sleep, etc.) and scales the supply voltage and operating frequency of the processors at context switches. The main goal is to minimize power usage while maintaining task-QoS guarantees.

Two major frameworks are proposed. First, we develop a variety of dynamic voltage scaling (DVS) algorithms for systems using a fixed-priority preemptive scheduling policy such as Deadline Monotonic. We develop different DVS algorithms for different task characteristics such as hard real-time tasks, multimedia applications and batch/interactive tasks. Second, we design and analyze power-aware hierarchical reservations for systems with hierarchical scheduler support. This framework allows the coexistence of heterogeneous scheduling policies for systems that have concurrent applications with diverse criticality and timing constraints, such that a single scheduling policy is not sufficient to satisfy those requirements.


Acknowledgements

At the beginning of my Ph.D. studies, I asked myself what the most important factor for successfully completing my degree would be; the answer was hard work! After a long journey to pursue my degree, I now realize there are several other crucial factors. One of those is guidance from your thesis advisor! I consider myself a very fortunate person to have Professor Raj Rajkumar as my advisor. It is not easy for me to express all my heartfelt gratitude to him in a few sentences. Raj not only guides me in my research work and gives me insightful comments but also inspires and teaches me a great deal about how to be a great researcher. With his kindness and generosity, Raj always shares his wisdom and experience with his students.

I am grateful to all of my thesis committee members, Professor John Lehoczky, Professor Daniel Mosse and Professor Hyong Kim, for giving me useful comments on my thesis. I had a great opportunity to work with Prof. Lehoczky and Mark Klein from the SEI on an enduring but worthwhile scheduling analysis of hierarchical reservations. During the process, they both convinced me of the advantage of tackling problems from different angles.

I would like to thank my dad and my brother for being supportive of my studies and career path. I would like to thank my mom for encouraging me every time I think of her. I would like to express my sincere gratitude to all my uncles and aunts of the Tantipoj family for their kind financial support for my graduate study at INI, CMU before my Ph.D. studies. Without them, I would not have had the opportunity to accomplish my goal today.

I would like to thank all my colleagues at the Real-Time and Multimedia Systems Laboratory: Dionisio de Niz, Sourav Ghosh, Rahul Mangharam, Akihiko Miyoshi, Haifeng Zhu, Anand Eswaran, Anthony Rowe, Gaurav Bhatia and Luca Abeni. Our lengthy discussions in RTML meetings were always enjoyable and useful. I hope our friendship remains strong and that we will have the opportunity to work together again.

I would like to thank my special friends at CMU and the University of Pittsburgh: Piyanun Harnpicharnchai, Saowaphak Sasanus, Panita Pongpaiboon, Poj Tangamchit, Peerapon Siripongwutikorn, Wiklom Teerapabkajorndet, Chatree Sangpachatanark, Ratchata Peechavanish, Tanapat Anusas-amornkul, Jumpol Polvichai, Nutthanon Leelathakul and Pasin Suriyentrakorn, for giving me warm friendship, parties and fun sports activities. These have kept me feeling alive, fresh, relaxed and ready to continue concentrating on my hard work. Special thanks to my long-time buddies, Udomchai Techavipoo, Anchaya Daothong and Chotiros Prasarn, who have been good friends and always encourage me to fulfill my goal even though we are apart.

Finally, I would like to thank my bosses at Texas Instruments, Xiaolin Lu and Don Shaver, for giving me the opportunity to complete this dissertation.

This research was supported by the DARPA PowerTap Project, administered by the Defense Advanced Research Projects Agency.


Table of Contents

Chapter 1 Introduction .......... 1
1.1 Proposed Approach .......... 3
1.2 Main Contributions .......... 6
1.3 Power-Aware Linux/RK Supported Platforms .......... 7
1.4 Organization of the Thesis .......... 10
Chapter 2 Literature Review .......... 11
2.1 Resource Kernel Architecture Background .......... 11
2.1.1 Resource Kernel Architecture .......... 11
2.1.2 Resource Guarantee Mechanisms .......... 13
2.2 Scope of the Related Work .......... 15
2.3 Related Work on DVS for One-level Scheduler .......... 15
2.3.1 DVS for General Systems .......... 16
2.3.2 DVS for Hard Real-Time Tasks .......... 16
2.3.3 DVS for Multimedia Applications .......... 17
2.3.4 DVS for Interactive and Batch Tasks .......... 22
2.4 Related Work on DVS for Hierarchical Scheduler .......... 23
Chapter 3 DVS for Hard Real-Time Periodic Tasks .......... 24
3.1 Energy-Efficient Operating Frequencies .......... 25
3.2 The Effect of Finite Operating Frequency Granularity .......... 30
3.3 DVS Algorithms Overview .......... 34
3.4 DVS System Model and Terminology .......... 36
3.5 Assumptions .......... 36
3.6 Sys-Clock Algorithm .......... 37
3.7 PM-Clock Algorithm .......... 42
3.8 DPM-Clock Algorithm .......... 45
3.9 Progressive Algorithm .......... 46
3.9.1 Terminology .......... 46
3.9.2 Progressive Slack Distribution .......... 47
3.9.3 Progressive Slack Accounting .......... 50
3.10 Opt-Clock Algorithm .......... 51
3.11 Algorithm Comparison and Experiment Results .......... 54
3.12 Summary .......... 59
Chapter 4 MultiRSV for Multimedia Applications .......... 60
4.1 Multi-Granularity Reservation Design Goal .......... 61
4.2 The Multi-Granularity Reservation Model .......... 62
4.3 Multi-Granularity Reserve Schedulability Analysis .......... 63
4.3.1 Critical Instant .......... 64
4.3.2 The Worst-Case Response Time Test .......... 67
4.3.3 The Utilization Bound Test .......... 71
4.4 Replenishment and Enforcement Mechanism .......... 73
4.5 Robust Multi-Granularity Reservation .......... 74
4.5.1 Possible Thrashing Condition in MultiRSV .......... 74
4.5.2 Thrashing Prevention Scheme .......... 75
4.6 Using Multi-Granular Reserves .......... 77
4.6.1 Multi-RSV vs. (m, k)-Firm Guarantees .......... 78
4.6.2 Multi-RSV vs. Multiframe Model .......... 78
4.6.3 Multi-RSV vs. CBS .......... 79
4.6.4 Multi-RSV vs. Buffer Approach .......... 80
4.7 Performance Evaluation of MultiRSV .......... 80
4.7.1 Evaluation of a Single Stream .......... 82
4.7.2 Evaluation of Multiple Streams .......... 88
4.7.3 System Response Time Comparison .......... 91
4.7.4 Performance Predictability .......... 91
4.7.5 Ramp-up Index Effect on MultiRSV .......... 92
4.8 Summary .......... 95
Chapter 5 DVS for Multimedia Applications .......... 96
5.1 Multimedia DVS Algorithms Overview .......... 97
5.2 Multi-Granular DVS System Model .......... 97
5.3 Multi-Granular DVS Assumptions .......... 98
5.4 Multi-Granular PM-Clock Algorithm .......... 99
5.5 Multi-Granular Progressive Algorithm .......... 104
5.6 Experiment Results .......... 105
5.6.1 Evaluation without Additional Non-Real-Time Workload .......... 106
5.6.2 Evaluation with Non-Real-Time Workload .......... 114
5.7 Summary .......... 119
Chapter 6 DVS for Interactive and Batch Tasks .......... 120
6.1 DVS System Architecture for Heterogeneous Tasks .......... 121
6.2 Background-Preserving Algorithms (BG-PRSV) .......... 122
6.2.1 Progressive Background-Preserving (PRO-PRSV) .......... 123
6.3 Background-On-Demand Algorithms (BG-OND) .......... 123
6.3.1 Progressive Background-On-Demand (PRO-OND) .......... 125
6.4 Experiment Results .......... 127
6.5 Summary .......... 135
Chapter 7 Hierarchical Reservations .......... 137
7.1 Hierarchical Reservation Overview .......... 138
7.1.1 Design Goals .......... 138
7.1.2 Architecture Overview and Terminology .......... 140
7.1.3 The Functionality of Hierarchical Reservation .......... 142
7.2 The Hierarchical Deadline-Monotonic (HDM) Model .......... 144
7.2.1 System Model and Notation .......... 144
7.2.2 Assumptions .......... 145
7.2.3 Server Replenishment .......... 146
7.2.4 Jitter Effect under Hierarchical Reservation .......... 147
7.3 Sporadic Server – HDM Schedulability Analysis .......... 150
7.3.1 Critical Instant of SS-HDM .......... 150
7.3.2 The Worst-Case Response Time of SS-HDM .......... 152
7.4 Deferrable Server – HDM Schedulability Analysis .......... 155
7.4.1 Critical Instant of DS-HDM .......... 155
7.4.2 The Worst-Case Response Time of DS-HDM .......... 157
7.5 Polling Server – HDM Schedulability Analysis .......... 158
7.5.1 Critical Instant of PS-HDM .......... 158
7.5.2 The Worst-Case Response Time of PS-HDM .......... 159
7.6 Utilization Bound of Hierarchical RM Schedulers .......... 161
7.7 System Implementation and Evaluation .......... 171
7.7.1 Resource Sharing among Default Reserves .......... 172
7.7.2 HRM API .......... 173
7.7.3 Performance Evaluation .......... 174
7.8 Resource Synchronization in HRSV .......... 177
7.9 HRSV with Multi-Granularity Reservation Support .......... 178
7.10 Summary .......... 179
Chapter 8 DVS for Hierarchical Reservations .......... 181
8.1 HRSV DVS System Architecture .......... 182
8.2 HRSV DVS System Model and Terminology .......... 184
8.2.1 Assumptions .......... 185
8.3 Frequency-Cascading Property .......... 185
8.4 Hierarchical PM-Clock .......... 187
8.5 Hierarchical Progressive .......... 192
8.6 Performance Evaluation of HRSV with DVS .......... 194
8.7 Summary .......... 201
Chapter 9 Conclusion and Future Work .......... 202
9.1 Research Contributions .......... 202
9.1.1 Resource Reservation for Soft Real-Time Applications .......... 203
9.1.2 Resource Reservation for Hierarchical Schedulers .......... 203
9.1.3 DVFS-support Software Development Framework .......... 204
9.2 Future Work .......... 209
9.2.1 DVS with Sporadic Task Support .......... 210
9.2.2 DVS with Peripheral Considerations .......... 210
9.2.3 DVS Algorithms with Power-Aware QoS Resource Manager .......... 211

List of Figures

Figure 1-1: Power of Crusoe processor using DVFS (LongRun) and PSC (Normal/Sleep) .......... 2
Figure 1-2: Scope of the dissertation .......... 6
Figure 2-1: Resource Kernel (RK) reservation architecture .......... 11
Figure 3-1: A convex non-decreasing function of power vs. clock frequency .......... 26
Figure 3-2: A scheme to determine energy-inefficient operating frequencies .......... 27
Figure 3-3: A processor with a limited number of operating points .......... 30
Figure 3-4: Optimal operating frequencies for a processor with 10 operating points .......... 33
Figure 3-5: Possible speed settings of a processor .......... 38
Figure 3-6: Workload vs. completion time .......... 40
Figure 3-7: Sys-Clock algorithm .......... 42
Figure 3-8: An example of PM-Clock .......... 42
Figure 3-9: PM-Clock algorithm .......... 44
Figure 3-10: Possible scenarios in Progressive slack distribution .......... 47
Figure 3-11: Progressive's slack-usage update algorithm .......... 50
Figure 3-12: A workload example for Opt-Clock algorithm .......... 52
Figure 3-13: Energy vs. System Utilization at BCET/WCET=0.5 .......... 56
Figure 3-14: Energy vs. System Utilization at BCET/WCET=1.0 .......... 57
Figure 3-15: Energy vs. BCET/WCET at U=0.5 .......... 57
Figure 3-16: Comparison with discrete operating points at BCET/WCET=0.5 .......... 58
Figure 3-17: Comparison with discrete operating points at BCET/WCET=1.0 .......... 59
Figure 3-18: Energy-aware Linux/RK supported platforms
Figure 4-1: Cascading water tanks concept in multi-granularity reservation .......... 62
Figure 4-2: The Worst-Case Burst Requests for MultiRSV Preemption .......... 66
Figure 4-3: The worst-case preemption by a MultiRSV task .......... 67
Figure 4-4: Multi-granular preemption computation algorithm .......... 71
Figure 4-5: An example of a thrashing incident in MultiRSV .......... 74
Figure 4-6: An example of MultiRSV with Budget-Preserving scheme .......... 76
Figure 4-7: Budget-Preserving algorithm .......... 76
Figure 4-8: Jurassic Miss Ratio vs. Utilization (Ramp-up = 0.1) .......... 83
Figure 4-9: Jurassic MissD Ratio vs. Utilization (Ramp-up = 0.1) .......... 83
Figure 4-10: Jurassic MissI Ratio vs. Utilization (Ramp-up = 0.1) .......... 84
Figure 4-11: Jurassic Dynamic Ratio vs. Utilization (Ramp-up = 0.1) .......... 84
Figure 4-12: Video Streams Comparison on Miss Ratio (Ramp-up = 0.1) .......... 86
Figure 4-13: Video Streams Comparison on MissD Ratio (Ramp-up = 0.1) .......... 86
Figure 4-14: Video Streams Comparison on MissI Ratio (Ramp-up = 0.1) .......... 87
Figure 4-15: Video Streams Comparison on Dynamic Ratio (Ramp-up = 0.1) .......... 87
Figure 4-16: Miss Ratio of Multiplex Video Streams .......... 89
Figure 4-17: MissD Ratio of Multiplex Video Streams .......... 89
Figure 4-18: MissI Ratio of Multiple Video Streams .......... 90
Figure 4-19: Dynamic Ratio of Multiplex Video Streams .......... 90
Figure 4-20: Ramp-up index vs. MULTI-RSV performance on Jurassic and Lecture .......... 94
Figure 4-21: Zoom view of ramp-up index statistics of Lecture video stream .......... 94
Figure 4-22: Ramp-up index statistics of MPEG-4 streams .......... 94
Figure 5-1: Energy-minimizing completion time candidates .......... 99
Figure 5-2: Inflated frequency effect on energy-minimizing frequency algorithm .......... 100
Figure 5-3: Multi-granular PM-Clock's FindPreemption procedure .......... 101
Figure 5-4: Multi-Granular PM-Clock's Energy-Minimizing-Freq procedure .......... 102
Figure 5-5: Multi-granular PM-Clock's NextBusyTime procedure .......... 102
Figure 5-6: Multi-Granular PM-Clock algorithm .......... 103
Figure 5-7: Energy vs. RT Workload Utilization at BCET/WCET = 0.5 .......... 107
Figure 5-8: Jurassic Miss Ratio vs. RT Workload Utilization at BCET/WCET = 0.5 .......... 108
Figure 5-9: Jurassic MissD Ratio vs. RT Workload Utilization at BCET/WCET = 0.5 .......... 108
Figure 5-10: Energy vs. BCET/WCET at RT_U = 0.5 .......... 111
Figure 5-11: Jurassic Miss Ratio vs. BCET/WCET at RT_U = 0.5 .......... 111
Figure 5-12: Jurassic MissD Ratio vs. BCET/WCET at RT_U = 0.5 .......... 112
Figure 5-13: Jurassic MissI Ratio vs. BCET/WCET at RT_U = 0.5 .......... 112
Figure 5-14: Jurassic Dyn Ratio vs. BCET/WCET at RT_U = 0.5 .......... 113
Figure 5-15: Energy vs. NRT Workload at BWET = 0.5 .......... 116
Figure 5-16: Jurassic Miss Ratio vs. NRT Workload at BWET = 0.5 .......... 117
Figure 5-17: Jurassic MissD Ratio vs. NRT Workload at BWET = 0.5 .......... 117
Figure 5-18: Jurassic MissI Ratio vs. NRT Workload at BWET = 0.5 .......... 118
Figure 5-19: Jurassic Dyn Ratio vs. NRT Workload at BWET = 0.5 .......... 118
Figure 6-1: Slack adjustment during turbo mode transition .......... 126
Figure 6-2: Energy vs. Background Workload at BG_U=0.2 and BWET=1.0 .......... 130
Figure 6-3: Short Request vs. Background Workload at BG_U=0.2 and BWET=1.0 .......... 130
Figure 6-4: Long Request vs. Background Workload at BG_U=0.2 and BWET=1.0 .......... 131
Figure 6-5: Energy vs. BWET at RT_U=0.7, BG_U=0.2 and maxReq=2 .......... 132
Figure 6-6: Short Request vs. BWET at RT_U=0.7, BG_U=0.2 and maxReq=2 .......... 132
Figure 6-7: Long Request vs. BWET at RT_U=0.7, BG_U=0.2 and maxReq=2 .......... 133
Figure 6-8: Energy vs. BG_U at RT_U=0.7, BWET=0.6 and maxReq=2 .......... 134
Figure 6-9: Short Request vs. BG_U at RT_U=0.7, BWET=0.6 and maxReq=2 .......... 134
Figure 6-10: Long Request vs. BG_U at RT_U=0.7, BWET=0.6 and maxReq=2 .......... 135
Figure 7-1: A sample of reservation hierarchy .......... 141
Figure 7-2: The critical instant with multiple deferrable servers .......... 148
Figure 7-3: The critical zone of task τi for SS-HDM scheduler .......... 153
Figure 7-4: Unsynchronized phasing effect .......... 156
Figure 7-5: Critical instant of DS-HDM .......... 157
Figure 7-6: The critical zone of task τi for PS-HDM scheduler .......... 160
Figure 7-7: The critical zone of hierarchical rate-monotonic schedulers .......... 161
Figure 7-8: Reserve capacities for the minimum utilization .......... 164
Figure 7-9: The core reservation software architecture of HDM .......... 172
Figure 7-10: The hierarchical reserves used in the experiments .......... 175
Figure 7-11: A snapshot of task scheduling .......... 176
Figure 7-12: The CPU utilization of tasks with regular reserves .......... 176
Figure 7-13: The CPU utilization of tasks with default reserves .......... 177
Figure 8-1: A sample of HRSV with DVS support .......... 183
Figure 8-2: The critical zone of a task for SS/DS-HDM scheduler with DVS .......... 186
Figure 8-3: The critical zone of a task under SS/DS-HDM at assigned frequency of f_p .......... 188
Figure 8-4: Energy-Min-freq procedure for SS/DS-HDM .......... 190
Figure 8-5: The Hierarchical PM-Clock algorithm .......... 191
Figure 8-6: A snapshot of run-time frequency assignment by Hierarchical Progressive .......... 192
Figure 8-7: Three hierarchical reservation structures in the experiments .......... 195
Figure 8-8: Energy vs. Different hierarchy setups .......... 199
Figure 8-9: Video MissD ratio vs. Different hierarchy setups .......... 199
Figure 8-10: Response time vs. Different hierarchy setups .......... 200

List of Tables

Table 3-1: Clock frequency vs. Supply Voltage of Transmeta's Crusoe Processor .......... 25
Table 3-2: Summary of proposed DVS algorithms for hard real-time periodic tasks ...... 35
Table 4-1: Video frame statistics used in experiments ..................................................... 81
Table 4-2: CBS vs. MULTI-RSV Comparison ................................................................ 91
Table 4-3: The Performance of CBS at High Load .......................................................... 92
Table 5-1: List of algorithms being evaluated in the experiments ................................. 105
Table 7-1: Samples of the least upper bound of HRM utilization .................................. 171
Table 7-2: The reservation parameters used in the experiments .................................... 175
Table 8-1: Six hierarchical-scheduler setups in the experiments ................................... 194
Table 8-2: Reserve parameters in the experiments ......................................................... 196
Table 8-3: Four workload scenarios ............................................................................... 197


Abbreviations

ACPI        Advanced Configuration and Power Interface
DM          Deadline Monotonic
DPM-Clock   Dynamic PM-Clock
DS          Deferrable Server
DVS         Dynamic Voltage Scaling
DVFS        Dynamic Voltage and Frequency Scaling
HRSV        Hierarchical Reservation
GUI         Graphic User Interface
MIPS        Million Instructions per Second
MultiRSV    Multi-granularity Reservation
PDA         Personal Digital Assistant
PM-Clock    Priority-Monotonic Clock Frequency Assignment Algorithm
PS          Polling Server
PSC         Power-saving State Control
SS          Sporadic Server


Chapter 1
Introduction

The tremendous increase in demand for battery-operated computing devices evidences the need for power-aware computing. As a new dimension of CPU computing, the goal of power-aware CPU scheduling is to dynamically adjust hardware to match the expected performance of the current workload so that a system can efficiently save power, lengthening its battery life for useful work in the future. In addition to traditional "low-power" designs for the highest performance delivery, the new concept focuses on enabling hardware-software collaboration to scale down power and performance in hardware whenever the system performance can be relaxed.

Power-saving State Control (PSC) and Dynamic Voltage Scaling (DVS) are promising examples of power-aware techniques developed in hardware. Power-saving states are like operating knobs of a device, as they consume much less power and support only partial functionality compared to the regular active operating mode. With the prevailing power-saving trend, power states have been ubiquitously supported in processors, disks, RDRAM memory chips, wireless network cards, LCD displays, etc. Common power-saving states include standby (clock-gating), retention (clock-gating with just enough reduced supply voltage to preserve the logic contents of circuits), and power-down (power-gating) modes. Some devices provide specific power states that offer more energy-efficient operating options, such as the control of the backlight in LCD


displays and the control of the modulation scheme and transmission rate in wireless network cards. DVS techniques are deployed in many commercial processors such as Transmeta’s Crusoe, Intel’s XScale processors and Texas Instruments’ OMAP3430. Due to the fact that dynamic power in CMOS circuits has a quadratic dependency on the supply voltage, lowering the supply voltage is an effective way to reduce power. However, this voltage reduction also adversely affects the system performance through increasing delay. Therefore, efficient DVS algorithms must maintain the performance delivery required by applications.

Figure 1-1: Power of Crusoe processor using DVFS (LongRun) and PSC (Normal/Sleep)¹

No matter how advanced power-aware strategies in circuit designs and hardware drivers may become, they must be integrated with applications and the operating system, and knowledge of the applications' intents is essential. Advanced Configuration and Power Interface (ACPI) [14] is an open industry specification that establishes interfaces for the OS to directly configure power states and supply voltages on each individual device. ACPI was developed by Hewlett-Packard, Intel, Microsoft, Phoenix and Toshiba. The operating system thus can customize energy policies to maintain the quality of service required by applications and assign proper operating modes to each device at run-time.

¹ Excerpted from "Issue Logic and Power/Performance Tradeoff" by Olson and Menard, MIT.

In this dissertation, we will focus on power management for processors using DVFS techniques for tasks with mixed characteristics. Even though a significant amount of research work on CPU DVS algorithms has been done in the past decade, all of it has been dedicated either to real-time tasks, where only the temporal guarantee of resource access matters, or to best-effort tasks, where fast response time and high throughput are more favored. Unfortunately, in many practical systems such as PDAs, cellular phones, laptops and robots, a combination of hard and soft real-time periodic tasks, aperiodic real-time tasks, interactive tasks and batch tasks is likely to occur. To date, power-aware CPU scheduling for such mixed tasks has not been studied to any significant extent.

1.1 Proposed Approach

Our power-aware computing framework is based on a "resource kernel" [43, 44] with the integration of the inter-task DVS concept. The resource kernel is a resource-centric approach based on resource reservations and strict enforcement to provide timely and guaranteed access to resources. We assume that applications can specify resource requirements of individual tasks independent of the scheduling policies inside the kernel. The reserve specification lists the MIPS and deadline constraints of tasks using the {C, T, D} model to represent the requirement of C units of resource every recurring time interval


T before a deadline D. In this dissertation, C represents the CPU cycles, T represents the task period or the minimum interval of aperiodic task execution, and D represents the maximum task completion time which is tolerable by applications. We extend this resource kernel concept to be power-aware such that tasks are scheduled efficiently with optimized CPU power using a combination of DVFS and PSC techniques. Power-aware scheduling policies and admission controls are deployed in the framework to assure that the final system does not conflict with task schedulability, guarantees of task reservations or temporal isolation among tasks. With this approach, the system can fully utilize DVFS techniques in a systematic way. The CPU operating mode is dynamically controlled by the power-aware resource kernel, transparently to the applications. This allows the decoupling of CPU power optimization from application functionality design and implementation, reducing the time-to-market of low-power software application development.

Due to the different timing overheads for scaling supply voltage and frequency across platforms and the different timing constraints in software applications, we design, analyze and implement a variety of DVS algorithms. All of these use the {C, T, D} reservation model and integrate with the traditional real-time deadline monotonic (DM) scheduling policy to provide power-aware CPU management for hard real-time applications.

We also design and develop multi-granularity reservations, a new reservation paradigm for soft real-time multimedia applications that can tolerate occasional missed deadlines. The new scheme uses a {{C, T, D}, …, {C^x, ε_x*T, ε_x*T}, …, {C^y, ε_y*T, ε_y*T}} specification to provide hard guarantees for the hotspot requirement of {C, T, D} as long as the average resource consumption is maintained over a longer time interval, such as C^x units of resource over the time interval ε_x*T. In addition, it can also provide best-effort

services for the excessive CPU requests. We also study the integration of DVS algorithms with this new reservation scheme.

The responsiveness of systems to interactive tasks is very critical for GUIs in many embedded systems such as PDAs and cell phones. Additionally, the completion-time requirements of batch tasks vary solely by application and user preference, and their CPU requirements are difficult to predict in advance. Traditionally, those interactive and batch tasks can be acceptably scheduled in a best-effort fashion (with lower priority than real-time tasks). However, systems with DVS algorithms that schedule those tasks in that fashion can exhibit dramatic and unacceptable degradations in responsiveness, due to increased waiting time behind real-time tasks that have been slowed down. We, therefore, have developed two kinds of background reservation algorithms: Background-Preserving and Background-on-Demand. These are dedicated reservation algorithms which distribute background cycles among batch and interactive tasks. For sporadic tasks, which are aperiodic requests with certain deadlines, we believe that the concept of aperiodic servers such as sporadic servers [67] and deferrable servers [68] can be easily adapted to our DVS algorithms.

Some systems have concurrent applications with diverse criticality and timing constraints such that a single scheduling policy is not sufficient to satisfy those requirements. For those systems, the concept of hierarchical schedulers [55, 56], which allows the co-existence of heterogeneous scheduling policies among applications, has been developed. This concept also allows for easy system integration of software applications with QoS requirements in large-scale systems. We have developed and analyzed the


hierarchical reservation model, its admission control and its power-aware version which integrates our DVS algorithms into the framework.
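The {C, T, D} reserve model above can be made concrete with a small sketch: a reserve record plus a utilization-based admission check. This is an illustration only; the actual resource kernel runs an exact schedulability analysis, and the names below are hypothetical, not the Linux/RK API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Reserve:
    """A {C, T, D} CPU reserve: C units of resource every period T, due by D."""
    C: float  # CPU cycles (or time units) reserved per period
    T: float  # replenishment period
    D: float  # relative deadline, D <= T

    @property
    def utilization(self) -> float:
        # Reserving C every T consumes C/T of the processor on average.
        return self.C / self.T

def admit(existing: list, new: Reserve, bound: float = 1.0) -> bool:
    """Grant the new reserve only if total utilization stays within the bound.

    A real resource kernel would run an exact response-time test instead;
    this simple utilization check is sufficient for illustration.
    """
    total = sum(r.utilization for r in existing) + new.utilization
    return total <= bound

video = Reserve(C=10, T=33, D=33)   # e.g., a 30 fps decoder reserve
audio = Reserve(C=2, T=10, D=10)
print(admit([video], audio))        # True: 10/33 + 2/10 is about 0.503
```

The same record carries through the rest of the framework: the DVS layer reads C and T to pick an operating frequency, and enforcement charges consumed cycles against C.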

1.2 Main Contributions

Figure 1-2 shows the main contributions of the dissertation. We investigate and

develop power-aware scheduling policies for a single processor using DVFS techniques for mixed task sets of hard real-time periodic tasks, multimedia applications, and interactive and batch tasks. The dashed boxes in the figure list the algorithms and analyses we have accomplished, while the solid boxes list the target research domains of those algorithms. We also analyze the dramatically increasing effect of leakage energy on DVS algorithms running on commercial processors. To enable the coexistence of heterogeneous scheduling policies in a single system, we also propose the hierarchical reservation concept on the system resource capacity and perform its schedulability analyses, including the analyses of its integrated version with our DVS algorithms.

Figure 1-2: Scope of the dissertation. (One-level fixed-priority preemptive scheduler: DVS algorithms for mixed tasks, with Sys-Clock, PM-Clock, DPM-Clock, Progressive and Opt-Clock for hard real-time periodic tasks, MultiRSV+DVS for multimedia applications, and Background-Preserving and Background-On-Demand reservations for interactive and batch tasks. Hierarchical schedulers: hierarchical reservations with HDM analysis for sporadic-server, deferrable-server and polling-server replenishment, and power-aware hierarchical reservations via frequency-cascading, Hierarchical PM-Clock and Hierarchical Progressive.)

In order to illustrate and demonstrate the practicality of our DVS algorithms, our proposed DVS algorithms have been integrated with Linux/RK. We also have modified some platforms to enable voltage and frequency scaling. Section 1.3 briefly describes our supported platforms for Power-Aware Linux/RK.

1.3 Power-Aware Linux/RK Supported Platforms

Figure 1-3: Energy-aware Linux/RK supported platforms

We have implemented our DVS schemes on CMU's Linux/RK, extensions of which are commercially available from TimeSys Corporation. We refer to the resulting kernel as "Power-Aware Linux/RK". We currently support four different hardware targets. The first supported target is the 206MHz Compaq iPAQ H3700 PDA. On this target, frequency scaling is possible but not voltage scaling. For this platform, frequency scaling requires re-synchronization between the CPU and SDRAM timing, which in turn


causes a delay of 20ms. In other words, once frequency scaling is initiated, the processor becomes unavailable for 20ms! Such a delay is unacceptable for many real-time systems. Linux/RK therefore uses the Sys-Clock algorithm, one of our DVS algorithms for hard real-time tasks, to scale the frequency based on the taskset workload only at admission-control time or at task deletion. The changes to the Linux/RK kernel are therefore confined to the admission control and task exit modules. These changes constitute less than 100 lines of code. Another quirk of the iPAQ hardware is that the entire range of operating frequencies (about 70MHz-206MHz) is not usable. The processor can operate within only one of two non-overlapping ranges (~70MHz-140MHz or ~155MHz-206MHz). The frequency cannot be scaled from a value within one range to a value in the other range. Multimedia applications, including a music player, have been ported to demonstrate the functionality of our kernel. This kernel with its support for frequency scaling can be downloaded from http://www.cs.cmu.edu/~rtml. We have also ported the software to the Compaq iPAQ H5550, a newer version of the iPAQ, which uses an XScale processor.

More significant energy savings can be obtained by voltage scaling instead of scaling only the frequency. We have successfully modified the XScale BRH single-board computer, which boasts a 733MHz XScale processor and 128MB of memory, to support voltage scaling. The Maxim 1855 evaluation kit [42], which is a high-power dynamically adjustable notebook CPU power supply application circuit, was used to provide the programmable supply voltage for the XScale processor card. The kit provides a digitally adjustable voltage from 0.6 to 1.75 V. Eleven different voltage settings (1.0 to 1.5 V with a step of 0.05 V) are available to the OS. A particular value is chosen by simply
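Because the iPAQ only admits two disjoint frequency ranges, the admission-time frequency choice reduces to picking the smallest usable frequency that still covers the taskset demand. The sketch below assumes an illustrative set of discrete frequency steps drawn from the two ranges; the real Sys-Clock analysis (Chapter 3) is more precise than this utilization-based stand-in.

```python
# Usable discrete frequencies (MHz) drawn from the two non-overlapping iPAQ
# ranges (~70-140 MHz and ~155-206 MHz); the exact steps are assumptions.
FREQS_MHZ = [74, 103, 118, 133, 162, 177, 192, 206]

def sysclock_frequency(utilization_at_max: float, f_max: float = 206.0) -> float:
    """Smallest usable frequency such that the scaled demand still fits."""
    demand = utilization_at_max * f_max      # required MHz-worth of work
    for f in FREQS_MHZ:                      # list is in ascending order
        if f >= demand:
            return f
    raise ValueError("taskset infeasible even at the maximum frequency")

print(sysclock_frequency(0.55))  # demand = 113.3 MHz, so it picks 118 MHz
```

Since this choice happens only at reserve creation or deletion, the 20ms resynchronization penalty is paid rarely and never inside a task's critical timing window.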


writing a 4-bit value to a memory-mapped address. Voltage and resulting frequency changes take a negligible amount of time (on the order of a few microseconds). We have ported Power-Aware Linux/RK with the Sys-Clock, PM-Clock and DPM-Clock algorithms to this modified BRH board. PM-Clock and DPM-Clock are our proposed DVS algorithms for hard real-time tasks which aggressively change CPU clock frequencies at context switches to save more power than Sys-Clock. The Linux kernel has successfully been loaded and tested. With the low voltage-scaling overhead, the modified XScale BRH board can efficiently save more energy using the PM-Clock and DPM-Clock schemes. The details of the Sys-Clock, PM-Clock and DPM-Clock algorithms are explained in Chapter 3.
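The memory-mapped interface can be illustrated with a helper that maps a requested supply voltage to a 4-bit register code. The linear encoding below (code 0 = 1.0 V, one step = 0.05 V) is an assumption chosen to match the eleven settings; the actual Maxim 1855 register encoding may differ.

```python
V_MIN, V_STEP, N_SETTINGS = 1.0, 0.05, 11   # 1.0 V .. 1.5 V in 0.05 V steps

def voltage_code(volts: float) -> int:
    """Map a target supply voltage to a 4-bit code (hypothetical encoding)."""
    code = round((volts - V_MIN) / V_STEP)
    if not 0 <= code < N_SETTINGS:
        raise ValueError(f"{volts} V is outside the supported 1.0-1.5 V range")
    return code  # the kernel would write this to the memory-mapped address

print(voltage_code(1.0))   # 0
print(voltage_code(1.5))   # 10, which fits in 4 bits
```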

We have also implemented our Energy-Aware Linux/RK on the BitsyX platform (a joint project with Vitronics). The BitsyX is a full-featured single-board computer using Intel's PXA255 RISC microprocessor with an SA-1111 StrongARM companion chip. Since the BitsyX uses the PXA255, which requires 20 ms to resynchronize the LCD during frequency scaling, we set the Sys-Clock algorithm by default. The software can be downloaded at http://www-2.cs.cmu.edu/~rtml/bitsyx/bitsyx-linux-2.4.19.tgz. The QT voltage and frequency monitoring tool is also available at http://www-2.cs.cmu.edu/~rtml/bitsyx/bitsyxmon.tgz.

For a joint project with BAE and ISI, we have implemented our Energy-Aware Linux/RK on the PowerPC RAD750 platform. This platform has much less timing overhead in voltage and frequency scaling compared to the BitsyX and BRH. There are seven operating frequencies available: 4.125, 8.25, 16.5, 33, 99, 115.5 and 132 MHz. The minimum voltage supply needed for each operating frequency varies across hardware from 2.1 V to 2.5 V. The software can be downloaded at http://www-2.cs.cmu.edu/~rtml/bae/bae_linux_2.4.22_rk.tgz.

1.4 Organization of the Thesis

The remainder of the dissertation is organized as follows. Chapter 2 provides the literature review of related work. Chapter 3 discusses our DVS solutions on commercial processors for hard real-time periodic tasks. Chapter 4 and Chapter 5 summarize the energy-aware CPU management for multimedia applications whose deadline-miss constraints can be relaxed. Chapter 4 presents a new reserve paradigm, multi-granularity reservation, which delivers deterministic guarantees to multimedia applications in non-DVS systems. The integration of multi-granularity reservations with our DVS algorithms is presented in Chapter 5. Chapter 6 presents two novel energy-management techniques for interactive and batch tasks whose execution patterns are not deterministic and are difficult to model using offline profiling; two algorithms are proposed, Background-Preserving and Background-On-Demand. Chapter 7 describes the concept of hierarchical reservation in order to provide real-time guarantees for systems which require heterogeneous scheduling policies among applications. The support of DVS algorithms for the hierarchical reservation system is discussed in Chapter 8. Finally, a summary of our research contributions and possible future research directions is presented in Chapter 9.

Chapter 2
Literature Review

In this chapter, we first provide a brief overview of the resource kernel that our framework is built upon, in order to provide the background for the schedulability analyses of the proposed DVS algorithms and of hierarchical reservations with DVS integration. We then discuss the differences between our proposed schemes and other schemes proposed in the literature.

2.1 Resource Kernel Architecture Background

2.1.1 Resource Kernel Architecture

Figure 2-1: Resource Kernel (RK) reservation architecture. (Processes are bound to resource sets; each resource set groups reserves such as Rsv_CPU, Rsv_Disk and Rsv_Net, which are served by per-resource schedulers for the CPU, network and disk inside the portable RK.)

Figure 2-1 depicts the reservation architecture inside the Resource Kernel (RK).

The system guarantees resource accesses by means of resource reservation. One or more reservations can be grouped together and reside in a resource set, which will be bound to one or more tasks. As a result, tasks are allowed to use their reserved amount of resources exclusively inside their resource set. A reserve is created based on the {C, T, D} reserve specification. The kernel performs the schedulability analysis whenever a new reserve is created. Appropriate scheduling and enforcement of a reserve by the resource kernel guarantee that the reserved amount is always allocated for all granted reserves and the system achieves high resource utilization. The resource enforcement in the Resource Kernel ensures that a misbehaving task cannot jeopardize the timing requirements of other tasks by overusing its reserve. Whenever a task uses up its reserved resource for a period, the kernel marks the task as a depleted task and the corresponding reserve as a depleted reserve. At the end of each period, a reserve will obtain a new quota and is said to be replenished. Three types of reserves have been defined: hard reserve, firm reserve and soft reserve. Each has different behavior with respect to resource enforcement and replenishment after depletion, defined as follows.

• Hard Reservation: The reserve will not obtain any more resources until it is replenished. Basically, a task attached to this reserve will always be halted until its next period.

• Firm Reservation: The reserve will be allowed to obtain more resources only if no other tasks are ready to access the resource. In other words, a task attached to this reserve will be executed only if no other undepleted and unreserved tasks are ready to run.

• Soft Reservation: The reserve will be allowed to share the remaining resources from the undepleted reserves. Therefore, a task attached to this reserve will be executed along with other unreserved tasks and depleted tasks.
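The three reserve types differ only in what happens after depletion, so the dispatch rule can be summarized in a tiny decision function. This is a behavioral sketch of the rules above, not Linux/RK code.

```python
def may_run(reserve_type: str, depleted: bool, others_ready: bool) -> bool:
    """Can a task attached to this reserve execute right now?

    others_ready says whether any undepleted reserved task or unreserved
    task is currently ready to use the CPU.
    """
    if not depleted:
        return True                 # within budget: always eligible
    if reserve_type == "hard":
        return False                # halted until the next replenishment
    if reserve_type == "firm":
        return not others_ready     # runs only if the CPU would otherwise idle
    if reserve_type == "soft":
        return True                 # competes with unreserved/depleted tasks
    raise ValueError(f"unknown reserve type: {reserve_type}")
```

For example, a depleted firm reserve runs when nothing else is ready, while a depleted hard reserve never does; this is exactly the temporal-isolation property enforcement is meant to preserve.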

2.1.2 Resource Guarantee Mechanisms

Four main mechanisms contribute to providing fine-grained timing guarantees:

• Admission Control: RK uses mathematical analysis to determine whether a new reserve can be granted without jeopardizing the timing guarantees of other tasks. The current implementation of Linux/RK supports Rate-Monotonic (RM) and Deadline-Monotonic (DM) scheduling policies. The response time test (a.k.a. the exact completion time test) is performed during admission control to provide more precise analysis and achieve high system utilization under both schemes. Note that the rest of the dissertation will focus only on the DM scheme.

• Reserve Accounting and Enforcement: After a reserve is granted by the admission controller, the priorities of all reserves are reassigned according to their resource's scheduling policy. For example, assuming that the deadline-monotonic scheduling policy is used, a reserve with a shorter deadline will be given a higher real-time priority. A task without a reservation will be executed with a user-given normal priority which is always lower than any real-time priority assigned to reservations. In other words, an unreserved task will be scheduled only if no task with an undepleted reserve is active. RK keeps track of how much of the resources each reserve has used. One enforcement timer per resource is used to account for the remaining reserved amount of the current task. Once the current task uses up its reserve (i.e., the enforcement timer expires), it will be removed from the ready queue or assigned a lower priority, depending on the type of its corresponding reserve.

• Reserve Replenishment: RK implements one replenishment timer per reserve to resume a task at the beginning of reserve periods. Its corresponding reserve priority and resource quota will be restored at this time.

• High-Resolution Timer: RK enables fine-grained resource control by setting a hardware timer to single-shot timer mode. This setting replaces the periodic 10 ms interval timer mode, the usual configuration in a Linux kernel. A software timer queue is used to keep track of all required timer interrupts. Depending on the availability of hardware timers, the software timer queue may also have to maintain the kernel jiffy interrupts (10-ms interval interrupts) if RK and the Linux kernel must share the same interrupt line.
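The response time test named above can be sketched as the standard fixed-point iteration: a task's worst-case response time is its own demand plus the preemption from all higher-priority reserves. This is a textbook rendering for illustration, not the Linux/RK implementation.

```python
import math
from typing import Optional

def response_time(C: float, D: float, higher_prio: list) -> Optional[float]:
    """Exact completion time test for one task under fixed-priority scheduling.

    higher_prio is a list of (C_j, T_j) pairs for the higher-priority tasks.
    Returns the worst-case response time, or None if the deadline D is missed.
    """
    R = C
    while True:
        # Each higher-priority task j preempts ceil(R / T_j) times within R.
        interference = sum(math.ceil(R / Tj) * Cj for Cj, Tj in higher_prio)
        R_next = C + interference
        if R_next > D:
            return None          # unschedulable: response time exceeds D
        if R_next == R:
            return R             # fixed point reached
        R = R_next

# Two higher-priority reserves (C, T) = (1, 4) and (2, 6); task under test (3, 13).
print(response_time(3, 13, [(1, 4), (2, 6)]))  # 10, which meets D = 13
```

Under DM, admission control simply runs this test for every reserve, from the shortest deadline down, and grants the new reserve only if all tests pass.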

In this dissertation, we design and implement an energy-aware extension of the Linux/RK (a Linux-based resource kernel). We focus on the integration of the voltage scaling technique with the deadline-monotonic scheduling policy on processors. Tasks are allowed to create a reservation of CPU cycles with the specification {C, T, D} where C is the number of processor cycles needed in each interval period T with the deadline D. During the admission control operation, the kernel verifies if the new taskset is schedulable. If the new reserve is granted, an appropriate voltage-scaling algorithm is chosen based on the platform and taskset characteristics. The DVS algorithm then determines an energy-efficient clock frequency (CPU speed) for executing each


individual task and the minimum voltage supply needed to deliver the desired clock frequency in order to minimize energy consumption. The resource accounting, replenishment and enforcement are managed such that the temporal isolation among tasks is maintained.

2.2 Scope of the Related Work This dissertation deals with two major research areas: (1) Dynamic voltage-scaling of processors for tasksets with a mixture of characteristics, and (2) Hierarchical scheduling for heterogeneous scheduling policies including its power-aware extension. To the best of my knowledge, this work is the first attempt to integrate DVS algorithms to support tasksets with a mixture of characteristics and to investigate DVS algorithms for the hierarchical scheduling framework. The following sections present more details of the related work in these research areas.

2.3 Related Work on DVS for One-level Scheduler

The field of dynamic voltage scaling (DVS) is currently the focus of a great deal of power-aware research. This is due to the fact that the dynamic power consumption of CMOS circuits [20, 60] is given by P = a*C_L*V_DD^2*f, where P is the power, a is the average activity factor, C_L is the average load capacitance, V_DD is the supply voltage and f is the

operating frequency. Since the power has a quadratic dependency on the supply voltage, scaling the voltage down is the most effective way to minimize energy. However, lowering the supply voltage can also adversely affect the system performance due to


increasing delay. The maximum operating speed is a direct consequence of the supply voltage, given by f = K*(V_DD − V_th)^α / V_DD, where K is a constant specific to a given technology, V_th is the threshold voltage and α is the velocity saturation index, 1 ≤ α ≤ 2.
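These two relations can be put into code to see why voltage scaling pays off: halving the frequency alone halves dynamic power, but lowering the voltage along with it gives a roughly cubic reduction. The constants used below are arbitrary illustrative values.

```python
def dynamic_power(a: float, C_L: float, V_dd: float, f: float) -> float:
    """P = a * C_L * V_dd^2 * f  (dynamic CMOS power)."""
    return a * C_L * V_dd ** 2 * f

def max_frequency(K: float, V_dd: float, V_th: float, alpha: float = 2.0) -> float:
    """f = K * (V_dd - V_th)^alpha / V_dd  (maximum speed at a given voltage)."""
    return K * (V_dd - V_th) ** alpha / V_dd

full    = dynamic_power(a=0.5, C_L=1e-9, V_dd=1.5, f=200e6)
half_f  = dynamic_power(a=0.5, C_L=1e-9, V_dd=1.5, f=100e6)  # scale f only
half_fv = dynamic_power(a=0.5, C_L=1e-9, V_dd=1.0, f=100e6)  # scale f and V_dd
print(half_f / full)    # 0.5   -- linear in f
print(half_fv / full)   # ~0.22 -- the quadratic V_dd gain on top
```

Note that a task at half frequency runs twice as long, so the energy per task also shrinks only when the voltage drops with the frequency; this is the fact every DVS algorithm in this chapter exploits.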

2.3.1 DVS for General Systems

A variety of DVS techniques [5, 29, 50, 65, 72] have addressed the tradeoffs between system performance and energy efficiency in the context of either general or real-time systems. These techniques make use of operating-system information about the workload in order to reduce the processor voltage when full system performance is not necessary. For general systems, most of the proposed solutions adjust the processor supply voltage and operating frequency roughly based on the observed current workload. For example, Weiser et al. [69] proposed an interval-based DVS approach which schedules the CPU speed of a fixed-length interval based on the CPU utilization of the past intervals. Unfortunately, these techniques are not well-suited to real-time systems, where tasks may have different deadline requirements.
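An interval-based policy in the spirit of Weiser et al. can be sketched in a few lines: at the end of each fixed-length interval, the next CPU speed is nudged up or down according to the last interval's utilization. This is a schematic reconstruction under assumed thresholds, not the published algorithm's exact policy.

```python
def next_speed(current: float, busy_fraction: float,
               step: float = 0.2, floor: float = 0.2) -> float:
    """Interval-based speed policy: speed up when busy, slow down when idle.

    current and the return value are fractions of the maximum clock;
    busy_fraction is the CPU utilization observed over the last interval.
    """
    if busy_fraction > 0.7:        # mostly busy: raise the clock
        current += step
    elif busy_fraction < 0.5:      # mostly idle: lower the clock
        current -= step
    return min(1.0, max(floor, current))

speed = 1.0
for u in [0.3, 0.3, 0.6, 0.9]:     # a trace of per-interval utilizations
    speed = next_speed(speed, u)
print(round(speed, 2))  # 0.8: the clock settled lower, then the busy interval raised it
```

The weakness for real-time work is visible in the sketch: the speed reacts to the past, so a deadline falling inside a slow interval can be missed before the policy catches up.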

2.3.2 DVS for Hard Real-Time Tasks

DVS algorithms that target hard real-time systems instead assume knowledge of the timing constraints of real-time tasks, which are specified by users or application developers as a {C, T, D} tuple. Pillai and Shin [30] proposed a wide-ranging class of voltage scaling algorithms for real-time systems. In their static voltage scaling algorithm, instead of running tasks at different speeds, only one system frequency is determined and used. Their operating frequency is determined by equally scaling all tasks to complete the lowest-priority task as late as possible. This turns out to be pessimistic.


Since the amount of preemption by high-priority tasks is not uniformly distributed when there are multiple task periods, a task can encounter less preemption relative to its own computation and save more energy if it completes earlier than its deadline. Aydin et al. [5] proposed the optimal static voltage scaling algorithm using a solution from reward-based scheduling. Their approach assigns a single operating frequency for each task and focuses on the EDF scheduling policy. The DVS algorithms proposed in this dissertation instead focus on fixed-priority preemptive scheduling policies and search for not just optimal but also suboptimal, practical schemes for online usage. In addition to the inter-task voltage scaling schemes mentioned above, several intra-task compiler-assisted voltage scaling algorithms have been proposed in [3, 23, 35, 63]. Intra-task voltage scaling algorithms use additional deterministic or stochastic information about application execution paths to control the supply voltage within the application boundary. This dissertation, however, will focus only on inter-task DVS techniques since they are more critical to maintaining taskset schedulability. Moreover, hybrid schemes which combine both inter- and intra-task DVS techniques are applicable and have already been investigated in [64].
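For EDF with deadlines equal to periods, the optimal single static speed has a simple closed form: run at the taskset utilization (clamped to the hardware's minimum speed), since EDF keeps such a taskset schedulable whenever its utilization at the chosen speed is at most 1. A minimal sketch of that observation:

```python
def edf_static_speed(tasks: list, f_min: float = 0.1) -> float:
    """Single normalized speed for EDF (implicit deadlines, D = T).

    tasks is a list of (C, T) pairs with C measured at maximum frequency.
    At speed s, every C inflates to C/s, so utilization U/s <= 1 iff s >= U.
    """
    U = sum(C / T for C, T in tasks)
    if U > 1.0:
        raise ValueError("taskset not schedulable even at full speed")
    return max(U, f_min)

print(edf_static_speed([(1, 4), (1, 8), (1, 8)]))  # 0.5: run at half speed
```

No such closed form exists for fixed-priority scheduling, which is why the Sys-Clock family in Chapter 3 has to reason about each task's critical zone instead.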

2.3.3 DVS for Multimedia Applications

Many researchers have developed DVS algorithms for multimedia applications, which usually have high-variance resource demands. Their average demand is much less than their worst-case demand. Three main approaches have been proposed in the literature:

(1) Best-effort service with DVS adjustment: This approach usually performs DVS adjustment based on the incoming workload on a request-by-request (frame-by-frame


for MPEG4 streams) basis. Knowledge of the incoming workload relies on prediction from the workload history [66], system feedback [41] or information embedded in the contents [12]. Due to its lack of reservation, this approach does not guarantee the application's QoS under high system workload.

(2) DVS with worst-case CPU reservation and run-time slack adjustment: This approach [5, 32, 50] determines the default clock frequencies of tasks based on their worst-case resource demand and further reduces their frequencies at run-time whenever there is additional slack from underused reserves. Even though this approach provides guarantees on application QoS delivery, its worst-case reservation dramatically decreases system utilization, which may not be tolerable for some embedded applications. In addition, its hard real-time guarantees deliver unnecessarily high QoS to multimedia applications that can actually tolerate some deadline misses.

(3) DVS with average-case or statistical CPU reservation: The main goal of this approach is to create reserves based on the average or statistical resource demand in order to reduce unnecessarily high resource reservation and improve system utilization. Yuan and Nahrstedt [73, 74] integrate DVS schemes into soft real-time scheduling based on the probability distribution of application cycle demands. Even though their schemes provide statistical guarantees on the upper bound of deadline misses, they can still yield a long run of failures due to the nature of very long bursts in MPEG streams.

The solutions proposed in this dissertation aim for the capability to provide deterministic guarantees that a certain number of requests meet their deadlines in a given time interval, even under high system workload. Two types of solutions are proposed. One uses the second approach (DPM-Clock and Progressive), which guarantees no deadline


misses. The other uses DVS with a new reserve paradigm (MultiRSV), which requires a smaller reservation and promises lower QoS in the form of some deadline misses, but assures that there will not be a long continuing failure of requests.

2.3.3.1 Comparison of DPM-Clock and Progressive with other algorithms

Pillai et al. developed the cycle-conserving RT-DVS for RM and EDF and the look-ahead RT-DVS for EDF. The cycle-conserving schemes determine the default clock frequency using their proposed static voltage scaling algorithm by assuming the worst-case demand. When the actual demand is less, they further reduce the frequency to cover all possible idle periods so that the same amount of work is delivered by the next deadline of the system. Unlike the cycle-conserving schemes, the look-ahead scheme tries to defer the workload as much as possible by determining the minimum frequency that ensures that all future deadlines are met. Kim et al. [32] developed slack estimation algorithms for EDF. Aydin et al. proposed a dynamic reclaiming algorithm for EDF. Since the online slack computation for DM is much more complex, O(mn^2), than that for EDF [36], we develop a simple dynamic scheme called DPM-Clock that gives the slack generated by the early completion of a task only to the next ready task in the system. This may not be an optimal solution, but the O(1) computational complexity of slack computation in DPM-Clock is much simpler, compared to the O(n^2) complexity of the cycle-conserving scheme for DM. In addition, DPM-Clock performs this slack computation only when there is an early completion, while the cycle-conserving scheme performs the computation at every arrival of a task. We also develop a more aggressive dynamic scheme called Progressive, which looks ahead to the next two scheduling points and distributes not only high-priority slacks but also low-


but-eligible-priority slacks to the next ready task. Progressive outperforms all algorithms under our consideration, although at a small increase in complexity, O(n).

2.3.3.2 Comparison of MultiRSV with other algorithms

A number of techniques for multimedia guarantees with no DVS consideration have been proposed and studied to some extent. A variety of techniques has been specifically developed to relax the QoS specification for multimedia streams to include their deadline tolerability and reduce the over-provisioning of their resource needs. Those techniques can be categorized into two main approaches: deterministic reduced-QoS specification and statistical approaches. Most deterministic reduced-QoS specification approaches, such as the skip-over algorithm [33] and the (m, k)-firm guarantee [26, 51, 53], allow users to specify the tolerance level of deadline misses and ensure the guarantees regardless of system workload. Their schedulability analyses pessimistically assume that all frames require the worst-case amount of resources. This results in resource overprovision per frame. The multiframe and generalized multiframe models [6, 46] allow users to provide more information about the variation of frame-to-frame execution time. While the former model requires the execution time of frames to be bounded in a fixed pattern, the latter requires a bound on the sum of the execution times of consecutive frames and therefore is more tolerant of tasks with dynamic execution patterns. Unfortunately, for some multimedia applications, determining such patterns of either the execution time or the sum of execution times on a frame-by-frame basis is a challenging task. Moreover, differentiating the worst-case demand of I, P and B frames typically does not necessarily diminish the extent of overprovision due to the long-range dependence found in MPEG4 streams.
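An (m, k)-firm constraint can be monitored with a sliding window over the last k jobs: the stream is healthy as long as at least m of any k consecutive jobs meet their deadlines. A minimal checker, written only to make the specification concrete:

```python
from collections import deque

class MKFirmMonitor:
    """Track whether at least m of the last k jobs met their deadlines."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.window = deque(maxlen=k)   # True means the deadline was met

    def record(self, met_deadline: bool) -> bool:
        """Record one job; return False if the (m, k) constraint is violated."""
        self.window.append(met_deadline)
        if len(self.window) < self.k:
            return True                 # not enough history to violate yet
        return sum(self.window) >= self.m

mon = MKFirmMonitor(m=3, k=4)
print([mon.record(ok) for ok in [True, True, False, True, False]])
# [True, True, True, True, False]: the second miss leaves only 2 of the last 4 met
```

Guaranteeing such a constraint offline, however, still requires the pessimistic worst-case-per-frame analysis criticized above, which is the gap MultiRSV targets.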


Statistical approaches, such as average resource reservation and the Constant Bandwidth Server (CBS) [2], reserve resources based on the average demand. The delivery of high-demand frames is therefore not guaranteed and depends on the instantaneous demand of other reservations and non-real-time tasks. Many statistical guarantees, such as the Distance Based Priority (DBP) assignment [25] and MPBP [31] schemes, use best-effort heuristic methods. Some methods [1, 26] provide a probabilistic guarantee assuming knowledge of the probability distributions of the multimedia stream's interarrival and execution times. Unfortunately, obtaining such information for MPEG video streams [15] typically requires substantial effort and is perhaps even impossible for live video applications. In view of this fact, our proposed scheme instead makes use of coarse-grained frame statistics such as the maximum frame size (or decoding time), the mean frame size and the size of the group of pictures (GOP).

Elastic scheduling, proposed by Buttazzo et al. [10], treats tasks as springs with different elastic coefficients. The model allows periodic tasks to intentionally change their execution rates to provide different QoS levels. This can be considered a QoS negotiation technique which adjusts the QoS of existing tasks to accommodate a new task during overload. However, the model does not efficiently handle varying-resource requests, and it relies on the uncontrollable task set characteristics at negotiation time.

Our multi-granularity reservation (MultiRSV) uses a deterministic lower-QoS specification approach. Its specification is given by {{C, T, D}, {Cx, εx·T, εx·T}, …, {Cy, εy·T, εy·T}} in order to provide hard guarantees for the hotspot requirements of {C, T, D} as long as the average resource consumption is maintained over the longer time intervals, e.g., Cx units of resource over each time interval of εx·T. Meanwhile, the scheme provides best-effort service for the excess CPU requests. Unlike other approaches, MultiRSV manages the resource budget efficiently on an inter-frame basis, such that left-over resources are automatically transferred to succeeding requests as long as the budgets of the lower-granularity reserves are maintained; as a result, more requests can be guaranteed.

2.3.4 DVS for Interactive and Batch Tasks

Power-aware computing for interactive and batch tasks has been studied only to a small extent, especially for systems whose task sets have mixed characteristics. The interval-based voltage schedulers proposed in [20, 49, 69] use different methodologies but share the same principle: predict the current global workload from the history of CPU utilization in past intervals and adjust the speed to match the predicted workload. However, Yan et al. [71] observed that this simple prediction of CPU utilization is inadequate for modern interactive systems with high CPU utilization, since tasks have different timing constraints. They instead propose injecting hooks into the X server to track user-perceived latency and using this information to drive voltage scaling. Nevertheless, all of these global prediction schemes based on CPU utilization or user-perceived latency still lack timing guarantees for real-time applications. Meanwhile, many DVS algorithms for hard real-time systems [5, 32, 50, 59] consider only the timing constraints of real-time tasks and minimize clock frequencies so as to consume all possible idle periods. Unfortunately, this method unacceptably increases the tardiness of coexisting interactive and batch tasks. We therefore propose two novel algorithms, background-preserving and background-on-demand, which enable flexible tardiness-energy tradeoff control for interactive and batch tasks in the presence of real-time applications.


2.4 Related Work on DVS for Hierarchical Schedulers

The concept of hierarchical schedulers [16, 34, 55, 56] has been investigated to allow the coexistence of heterogeneous scheduling algorithms (real-time and generic) among applications. The Open System architecture developed by Liu et al. [16] proposes a two-level hierarchical scheduling scheme that requires EDF at the lower-level (OS-level) scheduler; the initial phasings of tasks served by the higher-level schedulers are assumed to be known in advance. Kuo and Li [34] extend the Open System architecture to use RM as the lower-level scheduler. Their architecture has two restrictions: first, the scheduler period must be the greatest common divisor (GCD), or a divisor of the GCD, of all the periods of tasks served by the scheduler; second, the initial phase of each task must occur at a time-point that is a multiple of the server period. Lipari and Baruah [39] have recently analyzed a hierarchical extension to the Constant Bandwidth Server (CBS) framework [2]; their framework uses EDF. Our hierarchical scheduling architecture assumes that the root scheduler uses DM. The proposed architecture supports an arbitrary depth of hierarchy, and the schedulability analysis performed by each scheduler is local: it only needs to know its own resource reservation and its child reservations. To the best of my knowledge, there is no prior framework integrating hierarchical schedulers with DVS algorithms.


Chapter 3

DVS for Hard Real-Time Periodic Tasks

Due to the quadratic dependence of the dynamic power of CMOS circuits on their supply voltage, a variety of dynamic voltage scaling techniques [5, 29, 50, 65, 72] have been proposed to trade system performance for energy efficiency. Some of these techniques target hard real-time systems, where the timing constraints of tasks must be satisfied without exception. They typically assume that (1) energy consumption is minimized whenever the supply voltage is scaled down and (2) the voltage-scaling overhead is negligible. Unfortunately, these assumptions are not always correct in practice. For some practical processors, Akihiko et al. [45] showed that there are energy-inefficient operating frequencies, in the sense that running the same workload at a higher frequency and then putting the CPU into standby (idle) mode afterward counterintuitively consumes less energy. Such energy-inefficient operating frequencies must therefore be avoided under all circumstances. Moreover, the large variation across platforms in the overhead of changing the processor frequency and supply voltage strongly affects the performance of DVS algorithms. We therefore propose four voltage-scaling algorithms: Sys-Clock, PM-Clock, DPM-Clock and Progressive DPM-Clock (or Progressive, for short). Each algorithm suits different system characteristics


(hardware overhead and the timing constraints of the software taskset), and the proper algorithm must be chosen when the reservation-based resource kernel is deployed on the platform. In this chapter, we first present a study of the energy-inefficient operating points commonly found in commercial processors and discuss how to determine these operating points and account for them in DVS algorithms. We then compare the performance of DVS algorithms on practical processors, which usually have a small number of discrete operating points, and on ideal processors whose operating frequencies are assumed to vary continuously. Finally, we present the details of our algorithms.

3.1 Energy-Efficient Operating Frequencies

In this section, we show that scaling the voltage down does not always save the CPU power consumed by applications. As an example, we consider the commercially available Crusoe processor from Transmeta. It can operate at six different frequency/voltage combinations, as shown in Table 3-1 [19]. We show that, contrary to intuition, operating at 225 MHz is actually inefficient in terms of energy consumption.

Frequency f (MHz)    Voltage VDD (V)    Relative power (%)
600                  1.60               100.00
525                  1.50                70.00
450                  1.35                45.00
375                  1.22                33.33
300                  1.20                26.67
225                  1.10                23.33

Table 3-1: Clock frequency vs. Supply Voltage of Transmeta's Crusoe Processor

The processor's power consumption in the idle mode, Pidle, is about 0.3 watts at a clock frequency of 300 MHz, which is about 5% of the maximum power consumption at the maximum clock frequency, Pmax. Assume that we can complete a job J within t225MHz units of execution time at the operating frequency of 225 MHz. The energy consumed by this job is given by

E225MHz = P225MHz × t225MHz = 0.233 × t225MHz × Pmax

If we increase the processor frequency to 300 MHz, the energy consumption for completing the same job J is given by

E300MHz = P300MHz × t300MHz + Pidle × (t225MHz − t300MHz)
        = 0.267 × (225/300) × t225MHz × Pmax + 0.05 × (75/300) × t225MHz × Pmax
        = 0.21275 × t225MHz × Pmax < E225MHz

Hence, the operating point at 225 MHz is an example of an energy-inefficient operating frequency. It delivers less performance (meaning a longer response time) and, even worse, consumes more energy than one of its higher operating frequencies, 300 MHz in this case. Obviously, operating the system at these energy-inefficient operating frequencies must be avoided under all circumstances.

[Figure 3-1 plots power versus clock frequency as a convex non-decreasing curve, marking the operating points fx, fy and fz with powers Px and Pz indicated.]

Figure 3-1: A convex non-decreasing function of power vs. clock frequency
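As a quick numerical check of the derivation above, the two energies can be computed directly from the relative-power figures of Table 3-1; the helper below is a sketch, with the 5% idle power taken from the text:

```python
# Energy (relative to Pmax, with t_225MHz = 1) for finishing the same job
# at 225 MHz versus at 300 MHz followed by idling, per the derivation above.
def energy_run_then_idle(f_run, f_ref, p_run, p_idle, t_ref=1.0):
    t_run = t_ref * f_ref / f_run              # same cycle count, faster clock
    return p_run * t_run + p_idle * (t_ref - t_run)

e_225 = energy_run_then_idle(225, 225, 0.233, 0.05)   # 0.233
e_300 = energy_run_then_idle(300, 225, 0.267, 0.05)   # 0.21275
assert e_300 < e_225                                  # 225 MHz is inefficient
```

Running the job faster and idling the remainder comes out ahead, exactly as computed above.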


To formalize the concept, consider Figure 3-1. The clock frequency fx is said to be energy-efficient if and only if the energy consumed by the same task operating at fx is lower than that consumed operating at any higher frequency fy, fz, etc. That is,

Px < Py × (fx/fy) + P(idle,y) × (1 − fx/fy)

For simplicity, we assume P(idle,y) = P(idle,x). Let mxy be the power-frequency slope from fx to fy, so that Py = Px + mxy (fy − fx). Therefore,

∀ fy > fx :  mxy > (Px − P(idle,x)) / fx                  (3.1)

Equation (3.1) is a necessary condition for an operating frequency fx to be an energy-efficient operating point. In other words, fx is considered an energy-inefficient operating point if there is at least one higher frequency which performs better yet consumes less energy. Accordingly, such an fx should be excluded from the set of operating frequencies managed by DVS algorithms.

[Figure 3-2 depicts the pair-wise test: (1) compute the threshold slope (Px − P(idle,x))/fx; (2) compute the slope mxy to each higher frequency; (3) if mxy ≤ (Px − P(idle,x))/fx, then fx is an energy-inefficient operating point.]

Figure 3-2: A scheme to determine energy-inefficient operating frequencies
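The pair-wise test of the figure can be sketched directly in code. This is the plain O(n²) form (not the O(lg n) refinement discussed next); the operating points are those of Table 3-1, and `p_idle = 0.05` is the assumed idle power relative to Pmax:

```python
# Keep frequency f_x only if it beats running the same work at every higher
# frequency f_y and idling afterward (the direct form of the Eq. 3.1 test).
def efficient_points(points, p_idle):
    """points: (freq_MHz, power relative to Pmax) pairs."""
    return [fx for fx, px in points
            if all(px < py * fx / fy + p_idle * (1 - fx / fy)
                   for fy, py in points if fy > fx)]

crusoe = [(600, 1.00), (525, 0.70), (450, 0.45),
          (375, 0.3333), (300, 0.2667), (225, 0.2333)]
print(efficient_points(crusoe, p_idle=0.05))
# [600, 525, 450, 375, 300] -- 225 MHz is pruned
```

On the Crusoe data the filter discards exactly the 225 MHz point, matching the worked example above.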


Figure 3-2 illustrates a pair-wise frequency comparison scheme to determine energy-inefficient operating frequencies. For each candidate frequency fx, Equation (3.1) must be tested. The complexity of this algorithm is O(n²), where n is the number of available operating frequencies. Next, we will prove that this complexity can be reduced to O(lg n) for processors that exhibit a convex non-decreasing power-frequency function. This property usually holds for practical voltage-scaling processors even when their power-versus-frequency function is not perfectly quadratic.

Theorem 3.1

For a convex non-decreasing power-frequency function P(f), if there is an operating frequency f which is energy-inefficient relative to a higher operating frequency, then all operating frequencies lower than f are also energy-inefficient.

Proof.

We prove the theorem by contradiction. Consider Figure 3-1. Let Ex, Ey and Ez be the energy consumption of a task at frequencies fx, fy and fz, respectively. Let us assume that Pidle is constant over all operating frequencies and that fy is an energy-inefficient point relative to fz. That is,

myz ≤ (Py − Pidle) / fy                  (3.2)

If we assume that fx is energy-efficient, then we obtain


tS, the task can exploit not only slack from all tasks but also slack from the idle period between tS and tN.

In summary, the task's eligible slack is given by S = tN − t − (Ci − aci)/νi. Therefore, with the remaining execution eci = Ci − aci, the scaled execution rate is

ν′i = eci / (S + eci/νi) = eci / (tN − t)

3.9.3 Progressive Slack Accounting

Progressive tracks the availability of slack from each individual task. At each context switch, let τp be the task that the processor executed most recently. The algorithm first adds to τp's slack the execution time that τp would need under the worst-case PM-Clock schedule to reach the same progress (the same number of executed processor cycles). Let cp be the actual number of processor cycles that τp obtained in its last execution. Therefore,

Sp = Sp + cp/νp              if τp's request is not completed yet
Sp = Sp + (Cp − acp)/νp      otherwise                                  (3.8)

The accumulated execution of the task is then updated as acp = acp + cp. The above procedures are skipped if the CPU was previously idle. Finally, the algorithm updates the slack usage: it deducts the time that has passed since the last context switch from the slacks in descending order of their priorities. Note that the elapsed time may have been occupied by τp's execution or by an idle period. Figure 3-11 summarizes the slack-usage update algorithm.

slack_usage_update(elapsedTime):
  For j = 1 to n
    usej = MIN(Sj, elapsedTime)
    Sj = Sj − usej
    elapsedTime = elapsedTime − usej
    If (elapsedTime == 0) then return
    End if
  End for

Figure 3-11: Progressive’s slack-usage update algorithm
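The pseudocode of Figure 3-11 translates directly into a few lines; here is a Python sketch using 0-based indices, with slacks[0] as the highest-priority slack:

```python
def slack_usage_update(slacks, elapsed):
    """Deduct `elapsed` time from per-task slacks in descending priority
    order, as in Figure 3-11."""
    for j in range(len(slacks)):
        use = min(slacks[j], elapsed)
        slacks[j] -= use
        elapsed -= use
        if elapsed == 0:
            return

slacks = [3.0, 2.0]
slack_usage_update(slacks, 4.0)
print(slacks)   # [0.0, 1.0]
```

The highest-priority slack is consumed first; the remainder spills into the next slack down, mirroring the loop in the figure.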


Progressive always maintains schedulability, since at each context switch its frequency reduction ensures that the preemption time seen by any active task at the next scheduling point is the same as would be seen under the worst-case PM-Clock schedule. In addition, the slack detection collects only the reserved but unused execution times of tasks.

3.10 Opt-Clock Algorithm

We will first show that finding the optimal task clock frequencies is a large-scale non-linear optimization problem. Then we present pruning techniques which significantly decrease the problem size.

Theorem 3.4

Consider a voltage-scaling processor with a convex non-decreasing power-frequency function given by P(f) = cf^x, x ≥ 3, where c is a constant. Assume that a fixed-priority preemptive scheduling policy is used. The optimal task-clock-frequency assignment problem is a convex non-linear minimization problem.

Proof.

The goal of the Opt-Clock scheme is to minimize the energy consumption of a taskset over one hyper-period (see footnote 6). Each instance of a task τi takes Ci/(fi × fmax) time units, where fi is its assigned clock frequency. The total energy consumed over one hyper-period, E, is given by

E = Σ(i=1..n) c (H/Ti) (Ci/(fi fmax)) (fi fmax)^x,   x ≥ 3
  = cH Σ(i=1..n) Ui fmax^y (1/δi)^y,   y = x − 1 ≥ 2

For a task τi with utilization Ui, H/Ti denotes the number of its instances in a hyper-period, and δi denotes the reciprocal of fi. From [22], if a function has a second derivative on [a, b], then a necessary and sufficient condition for it to be convex on that interval is that the second derivative be positive everywhere on [a, b]. Combining this theorem with the property that sums and positive scalings of convex functions are also convex, it can be shown that this problem has a convex objective function. We now consider the constraints of the problem.

Footnote 6: The hyper-period of n periodic tasks with τi = {Ci, Ti} is given by LCM{Ti | i ∈ [1, n]}.

[Figure 3-12 shows the example workload, run at fmax = 1 cycle/time unit: τ1 = {3, 10, 10}, τ2 = {4, 23, 23} and τ3 = {2, 32, 32}.]
Figure 3-12: A workload example for Opt-Clock algorithm
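The hyper-period of footnote 6 is easy to compute for the Figure 3-12 taskset; a minimal sketch:

```python
from math import lcm  # Python 3.9+

# Hyper-period H = LCM of the task periods (footnote 6), here for the
# Figure 3-12 workload with periods 10, 23 and 32.
periods = [10, 23, 32]
H = lcm(*periods)
print(H)   # 3680
```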

Consider the example task set shown in Figure 3-12. As described in Section 3.6, a task's workload varies with its completion time, i.e., with the clock frequency assigned to the task. Consequently, each task has multiple alternative constraints, one for each time at which it could complete its workload before its deadline. The constraint of task τ2 in this example can be either one of the following conditions:

⌈10/T1⌉ (C1 δ1 / fmax) + ⌈10/T2⌉ (C2 δ2 / fmax) ≤ 10
⌈20/T1⌉ (C1 δ1 / fmax) + ⌈20/T2⌉ (C2 δ2 / fmax) ≤ 20

The first and second equations are necessary conditions for completing task τ2 before time 10 and time 20, respectively. These constraints are linear and therefore convex. Hence, the optimal task clock frequency assignment is a convex non-linear minimization problem, since its non-linear objective function and its constraints are convex.

†

Theorem 3.4 shows that Opt-Clock needs to solve several non-linear optimization problems sharing one objective function but with alternative constraint sets. The problem size grows dramatically not only with the number of tasks but also with the number of constraint choices per task. The latter depends on the number of idle periods in the task's critical zone and can be very large if the ratio of the task's period to its highest-priority task's period is large. We now present three pruning techniques which can substantially decrease the problem size:

1. Pruning of High-Workload Constraints: A constraint is said to be a high-workload constraint if it requires a shorter completion time βS and imposes higher average processor demands from all higher-priority tasks than another constraint with a longer completion time βL. This technique eliminates such redundant high-workload constraints. Since the average demands from all tasks decrease if the task completes at βL, the optimal frequency set that satisfies the constraint for βS will definitely generate some slack; consequently, some tasks' clock frequencies could always be reduced further to save more energy. Therefore, the constraint set for βS cannot yield the optimal solution and can be pruned.

2. Pruning of Conflicting Constraints: From the constraints left over from the previous step, this technique eliminates infeasible combinations in which a lower-priority task requires a shorter completion time than a higher-priority task. Such combinations are not schedulable by fixed-priority preemptive scheduling.

3. Pruning of Inactive Constraints: Unlike the first two techniques, this technique uses information from the solution of one problem to prune others. It first determines the inactive constraints of the current solution. Note that an inequality constraint g(δ) ≤ D is said to be inactive at δx if g(δx) < D, and active at δx if g(δx) = D. By the Karush-Kuhn-Tucker (KKT) theorem [54], the optimal solution subject to a set of active and inactive constraints is the same as that subject to only the active constraints. Since adding more constraints to the problem cannot improve the result, it is sufficient to eliminate the other constraint sets that differ only in the inactive constraints.

Our simulation experiments indicate that these pruning techniques reduce the number of constraint sets for scheduling 10 tasks from about 2000 to 1000 on average, a significant drop. However, other experiments indicate that PM-Clock delivers comparable average energy savings (EPM-Clock/EOpt-Clock ≈ 101%) at significantly lower complexity.
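Pruning technique 2 is straightforward to express in code; the sketch below enumerates combinations of candidate completion times (highest priority first) and drops the conflicting ones. The function name and its example inputs are illustrative, not taken from the simulator:

```python
from itertools import product

def feasible_combos(choices):
    """choices[i]: candidate completion times of the i-th task, listed from
    highest to lowest priority. Keep only combinations in which no
    lower-priority task completes before a higher-priority one."""
    return [c for c in product(*choices)
            if all(c[i] <= c[i + 1] for i in range(len(c) - 1))]

print(feasible_combos([[10, 20], [10, 20]]))
# [(10, 10), (10, 20), (20, 20)] -- (20, 10) is pruned
```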

3.11 Algorithm Comparison and Experiment Results

We compare the energy consumption of our DVS algorithms with that of two other well-known algorithms: the static voltage scaling algorithm (SVS) and the cycle-conserving RT-DVS scheme (CYCLE) proposed in [50]. The processor is modeled with a P = kf^3 power-frequency relation, zero idle energy and ten available operating points. Real-time tasksets are generated randomly. Each task has an equal probability of having a short (0.1-1 ms), medium (1-10 ms) or long (10-100 ms) period, with the period uniformly distributed within each range. The task computation time is randomly selected and then adjusted based on the system utilization and the ratio of best-case (BCET) to worst-case execution time (WCET). For the first experiment, we randomly generated tasks and scaled their utilizations equally to achieve the desired total utilization. For the second experiment, we randomly generated tasks such that each task instance requests a random number of processor cycles uniformly distributed between BCET and WCET (see footnote 7).
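The taskset generation just described can be sketched as follows; the helper and its parameter names are illustrative, not from the simulator used in the experiments:

```python
import random

def gen_taskset(n, total_util):
    """Random taskset per the setup above: periods drawn from a short,
    medium or long decade, utilizations scaled equally (experiment 1)."""
    tasks = []
    for _ in range(n):
        lo, hi = random.choice([(0.1, 1.0), (1.0, 10.0), (10.0, 100.0)])  # ms
        tasks.append({"T": random.uniform(lo, hi)})
    for t in tasks:
        t["C"] = (total_util / n) * t["T"]     # WCET from per-task utilization
    return tasks

ts = gen_taskset(10, 0.5)
print(round(sum(t["C"] / t["T"] for t in ts), 6))   # 0.5
```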

Figure 3-13 and Figure 3-14 show the average energy consumption from the first experiment, normalized to the energy of a no-DVS system. The performance of five algorithms, SVS, Sys-Clock, PM-Clock, CYCLE and DPM-Clock, is investigated. Figure 3-15 shows the effect of the BCET/WCET ratio at a system utilization of 0.5 on the same set of DVS algorithms, including Progressive; in this figure, the energy shown is normalized to that of the SVS algorithm.

As can be seen from both experiments, Sys-Clock and PM-Clock always outperform SVS. This is because the SVS algorithm assumes that the energy consumed by a workload is smallest when the processor completes it exactly at its deadline; as pointed out earlier, this assumption is not always true. PM-Clock performs better than Sys-Clock, as expected, at the cost of additional hardware voltage-scaling overhead at each context switch. With varying BCET/WCET, the cycle-conserving algorithm performs very well when BCET/WCET is low but somewhat poorly when it is high, because the scheme always executes tasks at low speed in the hope of saving energy later if a task uses fewer resources. Even though DPM-Clock has much lower complexity than CYCLE, the results show that its performance is very close to CYCLE when BCET/WCET is low and better otherwise. Additionally, with a little more complexity at each context switch, O(n), Progressive outperforms all DVS schemes under all circumstances. Note that the context-switch complexities of DPM-Clock and CYCLE are O(1) and O(n²), respectively. It must also be noted that the PM-Clock, DPM-Clock and Progressive algorithms have a relatively high computational complexity of O(mn²) for the initial frequency assignment compared to SVS and CYCLE. However, this is acceptable since the frequency assignment is needed only at admission control; moreover, it can replace a real-time schedulability test of similar complexity, which is already implemented in a resource kernel (Linux/RK) as well as in some other real-time operating systems.

Footnote 7: These execution times are subject to fmax.

Figure 3-13: Energy vs. System Utilization at BCET/WCET=0.5


Figure 3-14: Energy vs. System Utilization at BCET/WCET=1.0

Figure 3-15: Energy vs. BCET/WCET at U=0.5

In the third experiment, we repeat the analysis on a processor with only two operating points and compare the energy consumption, normalized to that of a no-DVS system, with the previous results. As can be seen in Figure 3-16 and Figure 3-17, all voltage-scaling algorithms perform worse on the processor with fewer available operating points. In Section 3.2, we showed that the minimum energy loss due to finite operating frequencies is given by EnoDVS/N, where N is the number of available frequencies; this matches the results we observed. In the fourth experiment, we vary the size of the taskset while fixing the system utilization and the BCET/WCET ratio. Simulation results show that the taskset size has a minimal effect on the performance of the voltage-scaling schemes.

[Bar chart comparing the normalized energy of SVS, Sys-Clock, PM-Clock, CYCLE and DPM-Clock with 2 and 10 operating points at utilizations 0.2, 0.4, 0.6 and 0.7.]

Figure 3-16: Comparison with discrete operating points at BCET/WCET=0.5


[Bar chart comparing the normalized energy of SVS, Sys-Clock, PM-Clock, CYCLE and DPM-Clock with 2 and 10 operating points at utilizations 0.2, 0.4, 0.6 and 0.7.]

Figure 3-17: Comparison with discrete operating points at BCET/WCET=1.0

3.12 Summary

We proposed four practical task-based DVS algorithms for hard real-time tasks: Sys-Clock, PM-Clock, DPM-Clock and Progressive. These algorithms exhibit acceptable complexity and deliver significant energy savings (60% at 50% system utilization). We have proved that Sys-Clock is optimal among fixed-priority preemptive scheduling policies that use one single clock frequency. PM-Clock assigns different clock frequencies to tasks, and its power saving is comparable to the optimal algorithm (EPM-Clock/EOpt-Clock ≈ 101%) despite its much lower complexity. DPM-Clock adds O(1) complexity at context switches, and its improvement is relatively minor (< 5% over PM-Clock). Progressive adds O(n) complexity at context switches but adapts very well to fluctuating run-time workloads; it saves the most energy (up to 35% over DPM-Clock). We also derived the optimal operating-frequency grid that minimizes the worst-case energy-quantization error, which is inversely proportional to the limited number of operating points.

Chapter 4

MultiRSV for Multimedia Applications

The worst-case CPU reservation specification {C, T, D} in the resource kernel has been used effectively for hard real-time periodic tasks. However, for soft real-time multimedia applications, such as VBR MPEG-4 decoders, highly varying resource requirements generally lead to very low average resource usage compared to the worst case. The worst-case reservation therefore turns out to be too pessimistic and results in unnecessarily low system capacity. A straightforward solution is to create a statistical reservation using the average resource demand (see footnote 8) and to serve overrun demand with background cycles and unused reserved processor cycles. Unfortunately, this approach can yield long stretches of failed bursty requests when the system is overloaded, especially when DVS algorithms have been applied to save energy: DVS schemes usually diminish the background cycles on which statistical reservation schemes rely to serve bursty requests. In this chapter, we propose the new concept of multi-granularity reservation (MultiRSV), study its effectiveness in handling MPEG-4 traffic and compare it with other statistical reservation schemes. In the next chapter, we present the integration of MultiRSV with our proposed DVS algorithms and investigate its energy efficiency.

Footnote 8: Any statistical data representing the workload may be used here as well.

4.1 Multi-Granularity Reservation Design Goal

The resource demands of soft real-time multimedia applications typically have the following characteristics:

(1) highly variable resource consumption rates;
(2) a large peak-to-average ratio of resource demands;
(3) long bursts of large requests;
(4) occasional deadline misses can be tolerated.

To satisfy these demand characteristics, the design of our multi-granularity reservation is influenced by the following important considerations.

• Flexibility of QoS Specification: Since multimedia applications typically can tolerate some deadline misses, using the worst-case reservation is pessimistic and leads to resource over-allocation, which can reduce realized system utilization. A model with a more flexible QoS specification is hence needed.

• QoS Isolation and Deterministic Guarantee: The ability to manage resources to maximize the QoS return with lower-bound guarantees, regardless of the variable system workload, is crucial for real-time and QoS-guaranteed systems. A new model should not rely on background cycles to deliver the desired lower-bound QoS.

• Efficient Resource Utilization: To achieve high system utilization, a new model should reserve just enough resources to satisfy the QoS requirements.

• Varying-Demand Tolerance: Since multimedia applications generally exhibit very high demand fluctuation, a new model should adapt its resource allocation algorithms to these characteristics.


4.2 The Multi-Granularity Reservation Model

Instead of using one reserve rate {C, T, D} as in the classical Liu and Layland reservation model, the specification of a multi-granularity reserve (MultiRSV) is given by {{C, T, D}, {Cx, εxT}, …, {Cy, εyT}}, where εx …

… and

mx = ⌊(t − m(x+1)) / (εix Ti)⌋ εix Ti    for 1 ≤ x ≤ gi;    mx = 0 otherwise

With the collaborative resource budget across granularities, the preemption must satisfy all of the inequalities above. Note that κ1 ≤ κa, ∀a ∈ Z+. The maximum preemption by τi on any lower- or same-priority task in the time interval [t0, t0 + t] is therefore given by


Pi(t0, t0 + t) = MIN(κ1, γ1, …, γgi)                  (4.1)

where

κ1 = ⌊t/(εi^gi Ti)⌋ Ci^gi + ⌊(t − m^gi)/(εi^(gi−1) Ti)⌋ Ci^(gi−1) + … + ⌊(t − Σ(j=2..gi) m^j)/(εi^1 Ti)⌋ Ci^1 + (t − Σ(j=1..gi) m^j)

γ1 = ⌊t/(εi^gi Ti)⌋ Ci^gi + ⌊(t − m^gi)/(εi^(gi−1) Ti)⌋ Ci^(gi−1) + … + ⌈(t − Σ(j=2..gi) m^j)/(εi^1 Ti)⌉ Ci^1

⋮

γ(gi−1) = ⌊t/(εi^gi Ti)⌋ Ci^gi + ⌈(t − m^gi)/(εi^(gi−1) Ti)⌉ Ci^(gi−1)

γgi = ⌈t/(εi^gi Ti)⌉ Ci^gi

Figure 4-4 shows our polynomial-time algorithm for determining the worst-case preemption time by a multi-granularity reserve, as given by Equation (4.1).

Theorem 4.2

For a multi-granularity resource reservation system, the worst-case response time of a task τj is the smallest solution to the following equation:

ω^(k+1) = Cj + Σ(i<j) Pi(0, ω^(k))                  (4.2)

This formula must be solved recursively, starting with ω^(0) = Cj and terminating when ω^(k+1) = ω^(k) on success or when ω^(k) > Dj on failure.

Proof.

We have proved earlier that the worst-case preemption time by a higher-priority task τi is given by Equation (4.1). In order for τj to obtain its entire resource requirement of Cj, there must be enough time for its own execution plus all possible preemption by its higher-priority tasks.                  †


Note that an appropriate blocking term must be added in the same way as in the classical reservation framework [47, 62] if synchronization needs introduce the potential for priority inversion.

Figure 4-4: Multi-granular preemption computation algorithm
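The recursion of Equation (4.2) can be sketched generically. Here the multi-granular preemption Pi is replaced, for illustration only, by the classical periodic bound ⌈t/Ti⌉Ci; any preemption function of the same shape could be plugged in:

```python
from math import ceil

def response_time(Cj, Dj, higher_prio):
    """higher_prio: (C_i, T_i) pairs. Iterate omega^(k+1) = C_j + sum P_i
    until a fixed point (success) or the deadline is exceeded (failure)."""
    w = Cj                                   # omega^(0) = C_j
    while True:
        w_next = Cj + sum(ceil(w / T) * C for C, T in higher_prio)
        if w_next == w:
            return w                         # worst-case response time
        if w_next > Dj:
            return None                      # not schedulable
        w = w_next

print(response_time(2, 6, [(1, 4)]))   # 3
```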

4.3.3 The Utilization Bound Test

We now derive a simple utilization bound test for a multi-granular reserve when the deadline of a task equals its period (i.e., Di = Ti). We define the effective (processor) utilization of a high-priority task τi observed by a low-priority task τj, denoted Ueff(i→j), as the processor utilization of τi at its lowest possible reserve granularity whose granular period is not larger than τj's highest-granular reserve period. Therefore,

Ueff(i→j) = Ci^(ψij) / (εi^(ψij) Ti)
ψij = MAX(k ∈ Z+ | (1 ≤ k ≤ gi) ∩ (εi^k Ti ≤ Tj))

Theorem 4.3

For a set of n tasks with multi-granularity resource reservations, the taskset will be schedulable only if the total effective utilization observed by each individual task is less than x(21 x − 1) where x is the number of its higher and same priority tasks. Proof.

Consider two tasks, τ_i and τ_j, and assume that τ_i has the higher priority. Let us first combine every ψ_ij consecutive requests of τ_i into one single request per time interval of ε_i^(ψ_ij)·T_i. Based on the reserve specification, the demand of each aggregated request is not larger than C_i^(ψ_ij). The new request pattern can only make τ_j's completion time the same or greater. Therefore, in the utilization bound test of τ_j, we can simply regard the multi-granular task τ_i as a periodic task τ′_i with the reserve specification {C_i^(ψ_ij), ε_i^(ψ_ij)·T_i}. Applying this argument to all higher-priority tasks of τ_j, we obtain a set of Liu and Layland periodic tasks. From [40], the least upper bound for such a taskset is x(2^(1/x) − 1), where x is the number of tasks under consideration (i.e., τ_j's higher- or same-priority tasks). Therefore,

    Σ_{k | T_k ≤ T_j} U_eff(k→j) ≤ x(2^(1/x) − 1)

To maintain the schedulability of the taskset, this utilization bound test must be performed for all tasks in the system. †
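Theorem 4.3's test can be sketched as follows. The data layout and helper names are our own, not the dissertation's: each reserve is a pair `(T_i, levels)` where `levels` lists `(C, eps)` budgets sorted by increasing granular multiplier ε, and reserves appear in DM priority order.

```python
def effective_utilization(reserves, j):
    """Total U_eff(i->j) over all tasks i with priority >= tau_j (i <= j)."""
    Tj = reserves[j][0]
    total = 0.0
    for i in range(j + 1):
        Ti, levels = reserves[i]
        # psi_ij: the largest eps with eps * Ti <= Tj; the eps = 1 level
        # always qualifies because Ti <= Tj under DM with D = T.
        feasible = [(C, eps) for (C, eps) in levels if eps * Ti <= Tj]
        C, eps = feasible[-1]
        total += C / (eps * Ti)
    return total

def bound_test(reserves):
    """For every tau_j, the effective utilization must stay within the
    Liu-and-Layland bound x(2^(1/x) - 1)."""
    for j in range(len(reserves)):
        x = j + 1                     # tau_j plus its higher-priority tasks
        if effective_utilization(reserves, j) > x * (2 ** (1 / x) - 1):
            return False
    return True
```

A passing example: two tasks, T = 5 with budget 1 per period, and T = 10 with budget 2 per period plus a 4-period granular budget of 6; the long-granularity level is ignored when observed by the shorter-period task.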

4.4 Replenishment and Enforcement Mechanism

The schedulability analyses in Section 4.3 assume that tasks never exceed their reserve quota constraints. We now describe the implementation of the replenishment and enforcement mechanisms. These mechanisms ensure that, in the multi-granularity reservation framework, a misbehaving task that overuses its resource quota cannot compromise the reservation guarantees of well-behaved tasks, achieving QoS isolation among tasks.

Consider a multi-granularity reserve with k levels of granularity. Let c_1, c_2, …, c_k denote the run-time granular allowances of the 1st-, 2nd-, …, kth-granular reserves, respectively. Each allowance c_i is initialized to its corresponding maximum budget C_i specified in the reserve specification. At the end of each task's execution, its execution time since the last scheduling point is deducted from all granular allowances. The same amount of time is refilled to each allowance at the beginning of its next granular period interval (e.g., ε_i·T). Each reserve therefore needs k timers for resource replenishment. Only one enforcement timer is needed for the system: at the beginning of each task's execution, if the task is associated with an undepleted reserve, the maximum amount of its granted resources is determined, and the enforcement timer is set to the time instant at which the task's execution will exceed the granted amount, which is given by min(c_1, c_2, …, c_k). When a task is depleted (i.e., the enforcement timer expires), the task is either halted or placed at background priority, depending on its reservation type as described in Section 2.1.
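A minimal sketch of this bookkeeping follows; the class and method names are ours, and the k replenishment timers and scheduler integration are omitted.

```python
class MultiGranularReserve:
    """Run-time allowances c_1..c_k for a k-level multi-granular reserve."""

    def __init__(self, budgets):
        self.budgets = list(budgets)      # maximum budgets C_1 .. C_k
        self.allow = list(budgets)        # allowances start full

    def grant(self):
        """Maximum execution allowed before enforcement: the enforcement
        timer fires after MIN(c_1, ..., c_k)."""
        return min(self.allow)

    def consume(self, exec_time):
        """Deduct the last execution slice from all granular allowances."""
        assert exec_time <= self.grant()
        self.allow = [c - exec_time for c in self.allow]

    def replenish(self, level, amount):
        """Refill `amount` at this level's next granular period boundary,
        capped at that level's maximum budget."""
        self.allow[level] = min(self.allow[level] + amount,
                                self.budgets[level])

r = MultiGranularReserve([3, 7])   # e.g., {C = 3 per T, C^x = 7 per eps_x * T}
r.consume(2)                       # the task ran for 2 time units
# allowances are now [1, 5]; grant() == 1 until the next replenishment
```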

4.5 Robust Multi-Granularity Reservation

The multi-granularity schedulability analyses presented in Section 4.3 assume that tasks do not violate their resource usage constraints. Section 4.4 presents the replenishment and enforcement schemes that prevent tasks which overuse their resources from compromising other tasks' resource guarantees; the impairment of resource guarantees is therefore self-contained. This section shows that, under some conditions, an over-budget task can cause a thrashing condition for its own resource guarantees. A thrashing prevention scheme called Budget-Preserving (BP) is then presented. With BP, the guarantees of multi-granularity reservations are robust and sustainable even for over-budget tasks.

4.5.1 Possible Thrashing Condition in MultiRSV

[Figure 4-5 plots the run-time allowance c^x across requests π_1 … π_10 between times t and (t + 2ε^x T); once c^x reaches 0, subsequent requests are marked as potential deadline misses.]

Figure 4-5: An example of a thrashing incident in MultiRSV

Consider a MultiRSV task with reserve specification {{C, T}, {C^x, ε^x T}}. Figure 4-5 shows an example of its requests between time t and (t + 2ε^x T).¹⁰ Let π_i denote the amount of demand of the i-th request in the considered time interval, and let c^x denote the run-time allowance for any time interval of ε^x T; its maximum value is C^x.

We first assume that the task does not consume any reserved resource for at least ε^x T time units before time t. Therefore, at time t, with full allowances (i.e., c^x = C^x), any incoming request whose demand and accumulated demand are less than C and C^x will obtain the guaranteed resources successfully. Otherwise, the resources are delivered as best-effort service. (The guaranteed and best-effort resources are labeled in green and gray in the figure, respectively.) The consumed guaranteed resource is replenished at the next time interval of ε^x T (e.g., π_1 time units of resources are replenished at time t + ε^x T). Therefore, if the xth-granular allowance is depleted before time (t + ε^x T) and the request arriving at time (t + ε^x T) has demand larger than π_1, the full demand of the request will not be guaranteed and it can hence miss its deadline. The same condition recurs for all subsequent requests. Consequently, under stress conditions where very few processor cycles are available for best-effort service, the task will not be able to make any progress until its incoming demand decreases.

In summary, the thrashing condition arises when an over-budget task issues a series of ramping-up, high-demand requests for a long period of time. The condition recovers to normal only when the demand of its subsequent requests becomes small enough to give the low-granular reserve an opportunity to collect its replenished allowances.

¹⁰ Note that the figure does not represent the execution pattern of the task.

4.5.2 Thrashing Prevention Scheme

In this section, we present a thrashing prevention algorithm called Budget-Preserving (BP). The algorithm preserves allowances against a possible increase in the demand of requests in the next granular period, leaving some room for demand growth. We define the ramp-up index of a request as the ratio of the additional demand of an incoming request at the next granular period to the demand of the request. BP assumes that an approximate ramp-up index characterizing a task is given, and uses this parameter to calibrate the preservation as follows. Let π(t) and δ be the task's demand at time t and the ramp-up index of its multi-granularity reserve, respectively. This implies that there is a high chance that the task's demand at time t + ε^x T is given by π(t + ε^x T) = (1 + δ)π(t). The algorithm then deducts this amount from the current allowances for preservation, as shown in Figure 4-6. However, since allowances are bounded by their maxima, the preserved resource is limited to the remaining budget in the current granular period. Figure 4-7 summarizes the algorithm. Note that the ramp-up index must be large enough to match the demand characteristics of tasks; however, too high an index value causes low reserve utilization.

[Figure 4-6 repeats the request sequence of Figure 4-5 under the Budget-Preserving scheme: after each request π_i, an additional (1+δ)π_i is withheld, so preserved allowances remain available at (t + ε^x T).]

Figure 4-6: An example of MultiRSV with Budget-Preserving scheme

After a task's execution (i.e., the task is completed, preempted or enforced):
    Let ec be the last execution time of the task
    cx = cx − ec;
    P = MIN(δ·ec, cx);    ← the amount of preserved resource
    cx = cx − P;
    Let t be the time instant at which this granular period starts.
    Add the replenished amount of (ec + P) at time (t + ε^x T)

Figure 4-7: Budget-Preserving algorithm

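The rule in Figure 4-7 can be written compactly as follows; the function name and return convention are ours, and the scheduling of the replenishment timer itself is omitted.

```python
def budget_preserve(cx, ec, delta):
    """One step of the Budget-Preserving rule of Figure 4-7.

    cx:    current granular allowance
    ec:    the task's last execution time
    delta: ramp-up index of the reserve
    Returns (new_cx, replenish_amount); replenish_amount = ec + P is added
    back at the next granular period boundary (t + eps^x * T).
    """
    cx -= ec                      # charge the execution just finished
    P = min(delta * ec, cx)       # preserve for (1 + delta) demand growth,
                                  # capped by what is left this period
    cx -= P
    return cx, ec + P

cx, refill = budget_preserve(cx=10.0, ec=4.0, delta=0.1)
# cx becomes 10 - 4 - min(0.4, 6) = 5.6, and 4.4 is replenished later
```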

4.6 Using Multi-Granular Reserves

In this section, we show how to use multi-granularity reservations efficiently for MPEG-4 video decoders. MPEG-4 video streams [17] are encoded into I-, P- and B-type frames. An I-frame is encoded as a single image with no reference to other frames and is typically larger than the other frame types. A P-frame is encoded relative to a past reference P- or I-frame. A B-frame is encoded relative to a past reference frame, a future frame, or both (the closest I- or P-frames). Each video sequence is composed of a series of Groups of Pictures (GOPs). A GOP is an independently decodable unit that can be of any size as long as it begins with an I-frame.

One effective and simple example of a multi-granularity reserve for an MPEG-4 video decoder reserves (1) the maximum decoding time at frame-period granularity and (2) the average decoding time for frames at GOP granularity, as given by

    R = {{dt_max, T, T}, {(gop_size · dt_avg), (gop_size · T)}}        (4.3)

where gop_size, T, dt_max and dt_avg denote the GOP size, the frame period, and the maximum and average per-frame decoding time, respectively. Since a multi-granular reserve always delivers resources as long as budget is available, the above reserve has a very high chance of successfully decoding the I-frame in every GOP, and also all frames in any GOP whose total decoding time is not larger than (gop_size · dt_avg). Note that a proper ramp-up index must be set to prevent the thrashing condition; in our experiments, we found that an efficient ramp-up index for MPEG-4 streams is typically less than 0.1.

In practice, dt_avg need not be the exact average value; it can instead be set sufficiently large to guarantee the desired frame-processing deadline miss rate. The maximum decoding time for frames per GOP can also be used to obtain 100% guarantees. More granular levels of reserves could also be used to handle different sub-sequences of a GOP, yielding more stringent or relaxed QoS specifications and probabilistic guarantees. Less pessimism results if a detailed profile of the video stream is available.
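Assuming a per-frame decoding-time profile is available, Equation (4.3) can be instantiated as in the following sketch; the helper name and the dictionary layout are ours.

```python
def mpeg4_reserve(frame_decode_times, frame_period, gop_size):
    """Build the two-level reserve of Equation (4.3) from a decoding-time
    profile: worst case per frame period, average case per GOP.
    Times are in the same unit as frame_period."""
    dt_max = max(frame_decode_times)
    dt_avg = sum(frame_decode_times) / len(frame_decode_times)
    return {
        "hotspot": (dt_max, frame_period, frame_period),       # {C, T, D}
        "gop": (gop_size * dt_avg, gop_size * frame_period),   # {C^x, eps^x T}
    }

# A 25 fps stream (T = 40 ms) with GOP pattern IBBPBBPBBPBB (gop_size = 12);
# the decode-time list here is a made-up illustrative profile.
spec = mpeg4_reserve([9.0, 2.0, 2.0, 4.0, 2.0, 2.0], 40.0, 12)
```

The hotspot level uses D = T, matching the frame-period deadline in Equation (4.3).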

4.6.1 Multi-RSV vs. (m, k)-Firm Guarantees

All (m, k)-firm guarantee schemes mark m out of every k consecutive requests as mandatory; only these requests are considered as preemption of lower-priority tasks, assuming worst-case demand. Some (m, k) schemes (e.g., [51]) subtly assign mandatory requests so as to reduce interference among tasks; however, these schemes need knowledge of the phasing among applications, which is difficult to obtain in practice. Our scheme does not assume such knowledge and enables the budget to be shared across periods. Consider an MPEG-4 multi-granular reserve as given in Equation (4.3). If we assume that all requests require the maximum demand, the least number of guaranteed frames is approximately given by m_mpeg4 ≈ (dt_avg · gop_size) / dt_max. In Section 4.7, we will compare our approach with (m, k)-firm guarantees using the (m_mpeg4, gop_size) parameters and show that our scheme always performs better despite using an equal amount of allocated resources. Note that a Pfair schedule [7] of a multi-granularity reserve can also be modeled by adding one more mid-granular reserve to distribute the budget equally over its low-granular period.
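The (m, k)-equivalent parameter above is direct arithmetic; for instance (the helper name is ours, and the inputs reuse the illustrative profile values of the previous sketch):

```python
import math

def mk_equivalent(dt_avg, dt_max, gop_size):
    """Least number of frames per GOP guaranteed under worst-case demand:
    m_mpeg4 ~ (dt_avg * gop_size) / dt_max (Section 4.6.1)."""
    return math.floor((dt_avg * gop_size) / dt_max)

m = mk_equivalent(dt_avg=3.5, dt_max=9.0, gop_size=12)
# 3.5 * 12 / 9.0 = 4.67, so at least 4 of every 12 frames are guaranteed
```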

4.6.2 Multi-RSV vs. Multiframe Model

A multi-granularity reserve can be considered a special case of the generalized multiframe model: any multi-granularity reserve can be conservatively transformed into a generalized multiframe task model. Fundamentally, the worst-case execution-time patterns of consecutive frames are conservatively assumed by our worst-case response-time test; for instance, a reserve given by {{3,5,5},{7,25}} can be converted to a corresponding generalized multiframe task. Note that our model allows tasks to have deadlines different from their periods and provides a simple method for users to specify the required QoS in coarse-grained fashion. We also provide mechanisms that deliver more tolerant guarantees and maintain temporal isolation even when tasks violate their reservation specifications.

4.6.3 Multi-RSV vs. CBS

The Constant-Bandwidth Server (CBS) uses EDF scheduling as opposed to fixed-priority scheduling. Our scheme shares with CBS the idea of tolerating varying instantaneous demand while, in the long run, shaping the multimedia workload into its average-case specification. Through the dynamic adjustment of request deadlines, CBS handles high-demand requests as a higher-priority class in EDF fashion, starving all other non-real-time tasks. Moreover, its guarantees for the multimedia stream are not necessarily deterministic; in other words, a highly dynamic real-time workload can cause an unacceptably high failure rate. CBS with resource reclaiming [11] also has this property. In contrast, in the same scenario, multi-granularity reservations can detect this possible failure rate and reject those requests in advance, triggering renegotiation of the desired QoS. This predictability is desirable, and perhaps even crucial, for real-time and QoS-guaranteed systems.


4.6.4 Multi-RSV vs. Buffer Approach

Even though buffering relaxes timing constraints and reduces the variation of resource demands, the long-range-dependence property of MPEG-4 streams means that a very large buffer is required to smooth the bursts effectively, which results in long delays. This solution is impractical for handheld devices. However, a reasonable amount of buffering can be combined with a multi-granularity reservation model to minimize delay while maintaining high system utilization.

4.7 Performance Evaluation of MultiRSV

In this section, we evaluate the performance of multi-granularity reservation schemes and compare them against average-case single-granularity reservation, (m, k)-firm guarantees, and CBS. We refer to these schemes as MULTI-RSV, AVG-RSV, MK-RSV and CBS, respectively. Simulations are used to study QoS and temporal isolation, the response time of non-real-time tasks, and the system utilization under these schemes. We chose four MPEG-4 video traces from [27] (Jurassic Park, Silence of the Lambs, News and Lecture Room Cam) to represent four different kinds of movies. The trace lengths are 60, 60, 15 and 60 minutes, respectively. All video streams are encoded using the GOP pattern IBBPBBPBBPBB, while data in the video streams typically occur as IPBBPBPBBPBB.


Frame Statistics        Unit         Jurassic   Silence    News       Lecture
Compression ratio       YUV:MP4      9.92       13.22      10.52      36.27
Frame rate              Frames/sec   25         25         25         25
Video run time          msec         3.6e+06    3.6e+06    9e+05      3.6e+06
Mean frame size         Bytes        3.8e+03    2.9e+03    3.6e+03    1e+03
Var frame size          -            5.1e+06    5.2e+06    6.3e+06    8.3e+05
CoV frame size          -            0.59       0.80       0.70       0.87
Min frame size          Bytes        72         158        123        344
Max frame size          Bytes        16745      22239      17055      7447
Avg. CPU utilization    -            0.103      0.0905     0.1025     0.064

Table 4-1: Video frame statistics used in experiments

Unless explicitly stated otherwise, five real-time tasks are randomly generated to achieve the given total real-time utilization. Each task has a uniform probability of having a short (1-10 ms), medium (10-100 ms) or long (100-1000 ms) period; period values are uniformly distributed within each range. Five non-real-time tasks are randomly generated in the same fashion but without reservations. We estimate the video decoding time of each frame using the linear relationship with its corresponding frame size, as suggested in [8]. Four metrics are observed in the experimental results:

• Miss: the ratio of deadline-missing frames to total frames received.
• MissI: the ratio of deadline-missing I-frames to total frames (all types) received.
• MissD: the ratio of un-decodable frames (due to a deadline miss or the failure of reference frames) to total frames received.
• Dyn: the ratio of dynamic errors¹¹ to total possible k-consecutive frame sets.

¹¹ Dynamic error is defined by Hamdaoui et al. as the failure of a system to satisfy the timing constraints of at least m frames out of any k consecutive frames.
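These four metrics can be computed from a per-frame result trace as in the following sketch; the data layout is our own illustration, not the simulator's, and references are assumed to point at earlier-decoded frames.

```python
def qos_metrics(on_time, frame_types, refs, m, k):
    """Compute (Miss, MissI, MissD, Dyn) for a decoded stream.

    on_time[i]    : True if frame i met its deadline
    frame_types[i]: 'I', 'P' or 'B'
    refs[i]       : indices of earlier frames that frame i depends on
    (m, k)        : dynamic-error constraint (Hamdaoui et al.)
    """
    n = len(on_time)
    miss = sum(not ok for ok in on_time) / n
    miss_i = sum(1 for i in range(n)
                 if frame_types[i] == 'I' and not on_time[i]) / n

    # A frame is un-decodable if it missed its deadline or any reference did.
    decodable = [False] * n
    for i in range(n):
        decodable[i] = on_time[i] and all(decodable[r] for r in refs[i])
    miss_d = sum(not d for d in decodable) / n

    # Dynamic error: a window of k consecutive frames with fewer than m on time.
    windows = [on_time[i:i + k] for i in range(n - k + 1)]
    dyn = sum(sum(w) < m for w in windows) / len(windows)
    return miss, miss_i, miss_d, dyn
```

Note how MissD compounds: one late reference frame renders every dependent frame un-decodable, which is why I-frame losses dominate MissD in the experiments below.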

4.7.1 Evaluation of a Single Stream

We first evaluate the performance of a single video stream competing with real-time and non-real-time workloads. We believe this scenario is the most likely one on handheld devices and personal computers. We create a multi-granularity reserve as given by Equation (4.3); an (m, k)-reserve is created as explained in Section 4.6.1. For CBS and the average-case reserve, we allocate resources based on their average demand. For all schemes, in order to avoid queueing effects, we drop video requests immediately if they miss their deadlines. Four real-time and total system utilization scenarios are investigated:

(1) Low workload (RT_U = 0.4, U = 0.5)
(2) Low workload, high background (RT_U = 0.4, U = 1.5)
(3) High workload (RT_U = 0.65, U = 0.75)
(4) Full system load (RT_U = 0.65, U = 1.5)

RT_U and U denote the real-time task utilization and the total system utilization, both excluding the video stream utilization.

[Figures 4-8 through 4-11 are bar charts comparing AVG-RSV, MK-RSV, CBS and MULTI-RSV under the four utilization scenarios.]

Figure 4-8: Jurassic Miss Ratio vs. Utilization (Ramp-up = 0.1)

Figure 4-9: Jurassic MissD Ratio vs. Utilization (Ramp-up = 0.1)

Figure 4-10: Jurassic MissI Ratio vs. Utilization (Ramp-up = 0.1)

Figure 4-11: Jurassic Dynamic Ratio vs. Utilization (Ramp-up = 0.1)

Figure 4-8 to Figure 4-11 show the performance for the Jurassic movie alone; MULTI-RSV's ramp-up index is set to 0.1. As expected, AVG-RSV and MK-RSV perform poorly at high system utilization: at high workload and full system load, more than 20% of frames miss their deadlines and more than 50% of frames are un-decodable under both schemes. MK-RSV delivers ~2-4% more I-frames than AVG-RSV due to its worst-case resource allocation and hence achieves a somewhat better MissD ratio; it also satisfies the (m, k) constraint, as expected. Meanwhile, MULTI-RSV and CBS have a dynamic ratio of less than 2%. Even though MULTI-RSV allocates approximately the same amount of resources as MK-RSV, it always achieves lower failure rates and yields better system utilization (~4-20% and ~15-45% lower Miss and MissD ratios, respectively). Unlike the other schemes, the performance of CBS does not suffer from the presence of non-real-time tasks; this is due to its preference for high-demand workload over non-real-time tasks. In addition, MULTI-RSV has substantially fewer missing I-frames, which results in a better MissD ratio than CBS when the non-real-time workload is low, and a comparable ratio otherwise.

Figure 4-12 to Figure 4-15 show the results of the same experiment for the other three video streams. Overall, Lecture obtains better performance than the others due to its smaller variance in frame size; however, the large loss of its I-frames, which are major reference frames, results in poor MissD ratios. Again, performance under CBS is independent of the non-real-time workload, and the graphs for low and high background workload are identical. MULTI-RSV's performance is somewhat worse in Lecture's case due to the use of a mismatched ramp-up index; we present a study of the ramp-up index effect later. As expected, the number of deadline-missing I-frames under MULTI-RSV is lower than under the other schemes.

[Figures 4-12 through 4-15 are bar charts comparing the four schemes for each video stream at RT_U = 0.65, with U = 0.75 and U = 1.5.]

Figure 4-12: Video Streams Comparison on Miss Ratio (Ramp-up = 0.1)

Figure 4-13: Video Streams Comparison on MissD Ratio (Ramp-up = 0.1)

Figure 4-14: Video Streams Comparison on MissI Ratio (Ramp-up = 0.1)

Figure 4-15: Video Streams Comparison on Dynamic Ratio (Ramp-up = 0.1)


4.7.2 Evaluation of Multiple Streams

We now simultaneously run three streams (Jurassic, News and Lecture) with RT_U = 0.35 and U = 1.50; the parameters are chosen to fully utilize the system's reserve capacity. Figure 4-16 to Figure 4-19 compare the different miss ratios among the algorithms. Compared with the single-stream experiments, multiplexing video streams yields better performance overall, especially in the reduction of the MissI ratio under MK-RSV. There are two main reasons for this. First, statistically independent video streams have a low probability of requesting resources at their peak rates simultaneously. Second, the worst-case per-frame resource allocation used by MK-RSV manages priorities more properly for a complete I-frame execution. Again, Lecture obtains better performance than the others due to its smaller fluctuation in demand, and MULTI-RSV again successfully delivers more I-frames and consequently achieves a low MissD ratio. In summary, this experiment shows that despite the benefit of video multiplexing in a server, where multiple streams can share resources, efficient deterministic guarantees remain crucial for protecting against unpredictably high failure rates during overload.

[Figures 4-16 through 4-19 are bar charts comparing the four schemes for each of the three multiplexed streams.]

Figure 4-16: Miss Ratio of Multiplexed Video Streams

Figure 4-17: MissD Ratio of Multiplexed Video Streams

Figure 4-18: MissI Ratio of Multiplexed Video Streams

Figure 4-19: Dynamic Ratio of Multiplexed Video Streams

4.7.3 System Response Time Comparison

This section shows the effects of the CBS and MULTI-RSV approaches on system response time. A random number of short non-real-time requests (0 to 3) is generated every 40 ms to compete for CPU resources along with one video stream, Jurassic. The sizes of the requests are uniformly distributed between 10 and 20 milliseconds. In addition to the average CPU utilization of ~0.11 by the video stream, a set of real-time tasks with a total utilization of 0.5 is generated in the same fashion as in the first experiment. We measure the normalized response time, defined as:

    normalized_response_time = completion_time / required_execution_time

As can be seen in Table 4-2, even though both schemes deliver comparable performance on the multimedia stream (MULTI-RSV's MissD metric is ~2% worse), CBS delays the non-real-time responses by approximately 59%.

Algorithm     Miss     MissI    MissD    Dyn      NRT Response Time
CBS           0.0623   0.0070   0.1574   0.0000   36.5859
MULTI-RSV     0.1218   0.0019   0.1760   0.0086   22.9831

Table 4-2: CBS vs. MULTI-RSV Comparison

4.7.4 Performance Predictability

Unlike CBS, multi-granularity reservation trades some system capacity for better provision of deterministic guarantees. In this section, we confirm the necessity of predictability in scheduling policies. We experiment with multiple video streams under CBS. A real-time task set is again randomly generated as in previous experiments, with the highest utilization allowed by CBS.¹² As can be seen in Table 4-3, the performance of the video streams depends entirely on the competing real-time task sets, despite identical utilization. Therefore, the resource guarantees provided by CBS are not deterministic. The worst-case performance, defined as the highest average ratio among all performance ratios, may have as many as 27% of frames missing deadlines, 10% dynamic errors, and 64% of decodable frames lost. Such unpredictability is generally unacceptable in QoS-guaranteed systems.

Performance Statistics       Miss     MissI    MissD    Dyn
Best performance
  Jurassic                   0.0939   0.0089   0.1975   0.0013
  Lecture                    0.0210   0.0023   0.0479   0.0006
  News                       0.1001   0.0106   0.1980   0.0017
Mean performance
  Jurassic                   0.1190   0.0204   0.2880   0.0050
  Lecture                    0.0424   0.0119   0.1310   0.0032
  News                       0.1473   0.0297   0.3422   0.0183
Worst performance
  Jurassic                   0.1688   0.0445   0.4413   0.0231
  Lecture                    0.0816   0.0306   0.2617   0.0136
  News                       0.2743   0.0676   0.6419   0.1019

Table 4-3: The Performance of CBS at High Load

4.7.5 Ramp-up Index Effect on MultiRSV

In this section, we confirm the need for a proper ramp-up index in multi-granularity reservations for efficient reserve utilization. The same set of single-video-stream experiments was conducted with varying ramp-up index setups. Figure 4-20 shows MULTI-RSV's performance for the Jurassic and Lecture cases. As can be seen, sufficient budget must be preserved to prevent thrashing conditions; however, a too-large ramp-up index leads to over-preservation, where the preserved resources are unlikely to be used in the future. Figure 4-21 and Figure 4-22 show histograms of the ramp-up index statistics of the four video streams. The y-axes show the percentage of frames whose ramp-up index falls between the values listed on the x-axes. As can be seen, most of the video requests have ramp-up indexes less than 0.1. However, the Lecture stream has much less fluctuation in demand than the others, and therefore a multi-granularity reserve with a smaller ramp-up index (e.g., below 0.1) suffices.

¹² No non-real-time task is created, since it does not affect CBS performance, as previously discussed.

Figure 4-21: Zoom view of ramp-up index statistics of the Lecture video stream

[Figure 4-22 shows per-stream histograms (Jurassic, Silence, News, Lecture) of the percentage of frames in each ramp-up index bin: ≤ 0, 0.1, 0.2, 0.3, 0.4, 0.5, > 0.5.]

Figure 4-22: Ramp-up index statistics of MPEG-4 streams


4.8 Summary

Reservation-based real-time operating systems provide applications with guaranteed, timely and enforced access to resources. Classical reservation schemes are well suited to hard real-time tasks whose worst-case demands are acceptable; however, they are not suitable for multimedia applications whose resource demands fluctuate widely and which can tolerate occasional deadline misses. In this chapter, we proposed multi-granularity reservation, a new reserve paradigm that provides flexibility in QoS specification while still delivering core deterministic guarantees. The predictability of the guarantees can be achieved conservatively in the form of a firm guarantee provision, or in fine-grained fashion through statistical profiling of multimedia streams. We derived the schedulability analysis in two forms: a worst-case response-time test and a simple utilization bound test. In addition to offering predictability, the scheme outperforms deterministic, stochastic and heuristic approaches under fixed-priority scheduling policies, such as average-case reservation as well as (m, k)-firm guarantees. Compared to the Constant Bandwidth Server (CBS), which uses EDF, the optimal dynamic scheduling policy, the performance of our scheme is comparable, but its admission control gives more deterministic guarantees. In other words, the scheme can accurately predict the possibility of guarantee failures, reject tasks at admission control, and possibly initiate a QoS renegotiation with tasks.


Chapter 5

DVS for Multimedia Applications

Most studies of DVS schemes for multimedia applications in the literature focus on either (1) worst-case reservations with a run-time speed adjustment or (2) average-case or statistical reservations with or without a run-time speed adjustment. Both approaches have disadvantages. Because of the worst-case resource allocation in the first approach, all task instances are guaranteed to meet their deadlines; but since most multimedia applications are soft real-time tasks that can tolerate occasional deadline misses, this approach overprovisions QoS and results in unnecessarily low system utilization. The second approach relaxes this pessimism by reserving fewer resources and relying on available background (unreserved) processor cycles to service excess requests. However, this approach can fail long bursts of requests, since most DVS algorithms exploit slack to adjust the CPU speed and therefore leave limited background cycles, which must be shared between excess multimedia requests and non-real-time requests. In this chapter, we focus on integrating our DVS algorithms with our new reserve paradigm for multimedia applications, the multi-granularity reservation. The new algorithms, Multi-granular PM-Clock and Multi-granular Progressive, achieve high system utilization, provide flexible but deterministic QoS guarantees even when the system is overloaded, and significantly reduce power consumption.


5.1 Multimedia DVS Algorithms Overview

We have proposed two practical DVS algorithms for soft real-time periodic tasks.

• Multi-granular PM-Clock: This scheme assigns the optimal processor clock frequency to each task at admission control. It scales voltage and frequency at every context switch based on the assigned clock frequency of the next task to run, and consequently has a low run-time complexity of O(1). This scheme is suitable for systems that have hard real-time tasks with small fluctuations in demand and multimedia tasks served by multi-granularity reserves.

• Multi-granular Progressive: This scheme is suitable for systems that may also have hard real-time tasks with highly fluctuating demands. In addition to the same initial clock frequency assignment as in Multi-granular PM-Clock, the scheme detects run-time slack and further reduces the CPU clock frequency. Run-time slack may arise from underutilization of the reserves, and from regular idle periods that occur when a taskset does not exhibit the worst-case task phasing assumed by admission control.

Note that for systems whose overhead is too high to perform DVFS at context switches, and which therefore need a single system clock frequency, the combination of multi-granularity reserves and the Sys-Clock algorithm is applicable. We can apply the same integration technique used in Multi-granular PM-Clock without including the InflatedF procedure.

5.2 Multi-Granular DVS System Model

We assume the use of the DM scheduling policy and the existence of n reserves, denoted R_1, R_2, …, R_n. One task is associated with each reserve, denoted by τ_1, τ_2, …, τ_n, respectively. Each reserve can use either the classical reserve specification {C, T, D} or the multi-granular reserve specification {{C, T, D}, {C^x, ε^x T}, …, {C^y, ε^y T}}, with ε^x < … < ε^y and ε^i ∈ Z⁺ for all i, where C denotes a task's worst-case required processor cycles in every hotspot (highest-granular) time period, T denotes its hotspot period, and D denotes its relative hotspot deadline from arrival time. Additionally, C^x denotes the processor cycles the task requires in every (longer) time interval of ε^x T. Note that this definition of C differs from that of Chapter 4, where C denotes required execution time rather than processor cycles. Each multi-granular reserve R_i is assumed to have g_i levels of granularity. By convention, we assume that D_1 ≤ D_2 ≤ … ≤ D_n.

B

B

B

B

B

B

B

5.3 Multi-Granular DVS Assumptions The following assumptions have been made to obtain the analytical results. (A1) Processors have a convex non-decreasing power-frequency relation. (A2) Tasks with reservations are independent of one another and ready to run on arrival without blocking. Additionally, they can be preempted at any time. (A3) The next request of each task occurs only after the previous request is completed and arrives periodically every time interval specified in its hotspot period. (A4) For multi-granular reserves, the reserved rate at a low-granular reserve is always smaller than that of a high-granular one (i.e. (A5) The cost of preemption is negligibly small.

98

Cy Cx < ). ε yT ε xT

5.4 Multi-Granular PM-Clock Algorithm

Multi-granular PM-Clock is an extended version of the PM-Clock algorithm that supports multi-granularity reserves. It applies Theorem 3.3 to determine the energy-minimizing clock frequency of a task by finding the lowest possible constant clock frequency that allows the task's workload to complete before its deadline. A task's workload is composed of its own execution and the preemption by its higher-priority tasks, which can be determined by Equation (4.1) or the polynomial algorithm listed in Figure 4-4. Since the preemption from higher-priority tasks is not uniformly distributed over the task's critical zone, the algorithm needs to consider all possible completion times of the task. We now illustrate the algorithm with the following example.

Figure 5-1: Energy-minimizing completion time candidates (timeline over 0-75 time units of τ1 = {{3,5,5},{7,20},{13,50}} and τ2 = {{6,72,72},{10,144}} running at fmax = 1 cycle/time unit, marking the earliest and latest possible completion times)

Consider a set of two tasks, τ1 and τ2, with reserve specifications {{3, 5, 5}, {7, 20}, {13, 50}} and {{6, 72, 72}, {10, 144}}, respectively. Figure 5-1 illustrates the critical time zone of τ2's peak request at the maximum clock frequency, fmax, using the critical instant under the multi-granular reservation model derived in Section 4.3.1. A feasible constant clock frequency that satisfies τ2's timing constraint must allow the task's workload to complete at some time between its earliest possible completion time (i.e., the completion time assuming that all tasks execute at fmax) and its latest possible completion time. Since the task's workload changes only at the beginning of each busy period, the algorithm needs to consider only those time instants (marked by the red dashed lines in the figure). Therefore, the energy-minimizing clock frequency of τ2, denoted α2, is given by the following equation.

α2 = MIN(13/20, 16/25, 19/50, 22/55, 25/60, 26/70) = 26/70 = 0.3714

This implies that executing both tasks at a clock frequency of 0.3714fmax will consume minimum energy while satisfying τ2's timing constraint. However, the energy-minimizing clock frequency of a task may be unacceptable for some of its higher-priority tasks, which have more stringent timing constraints. In the same example, τ1's peak request requires 3 processor cycles to complete within 5 time units, so the system must execute τ1 at (3/5)fmax = 0.6fmax. The system can then save more power by fixing the inflated frequency needed by τ1 and recalculating τ2's energy-minimizing clock frequency. Figure 5-2 shows the effect of the inflated frequency on the critical zone of τ2. The clock frequency for τ2 can be reduced to α2·fmax, where α2 is given by

α2 = MIN(6/(20 − 7/0.6), 6/(50 − 13/0.6), 6/(70 − 20/0.6)) = 0.164
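Both frequency selections above can be reproduced with a small script. This is a sketch of the selection rule only, with the completion-time candidates taken from the worked example rather than derived by the full busy-period analysis:

```python
# Energy-minimizing frequency selection from the worked example.
# Each candidate pairs the scalable workload (cycles) that must finish
# by a scheduling point with that point's available time, normalized
# to fmax = 1 cycle/time unit.

def energy_min_freq(candidates):
    """candidates: list of (cycles, available_time) pairs.
    Returns the lowest constant relative frequency that still meets
    at least one completion-time candidate."""
    return min(c / t for c, t in candidates)

# tau2 with all tasks scalable: (workload, busy-period start time)
alpha2 = energy_min_freq([(13, 20), (16, 25), (19, 50),
                          (22, 55), (25, 60), (26, 70)])
print(round(alpha2, 4))  # 0.3714

# After fixing tau1 at 0.6 fmax, only tau2's 6 cycles remain scalable;
# tau1's preemption (executed at 0.6 fmax) is subtracted from each window.
alpha2_inflated = energy_min_freq([(6, 20 - 7 / 0.6),
                                   (6, 50 - 13 / 0.6),
                                   (6, 70 - 20 / 0.6)])
print(round(alpha2_inflated, 4))  # 0.1636 (reported as 0.164 in the text)
```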

Figure 5-2: Inflated frequency effect on energy-minimizing frequency algorithm (timeline over 0-75 time units with τ1 = {{3,5,5},{7,20},{13,50}} running at 0.6fmax and τ2 = {{6,72,72},{10,144}} at fmax, fmax = 1 cycle/time unit; the scalable workload lies between the earliest and latest possible completion times)

Figure 5-3 to Figure 5-6 summarize the Multi-granular PM-Clock algorithm. Note that the Energy-Min-Freq and InflatedF procedures are similar to those listed in Section 3.6 and Section 3.7; they only replace the calculation of the preemption time and the determination of the beginning time instants of busy periods with respect to multi-granular reserves. The FindPreemption procedure is also similar to that listed in Section 4.3.2; it serves the new reservation specification, which is given in terms of processor cycles that can be executed at different frequencies.

FindPreemption(τi, t, υ):
    // p = preemption, σ = granularity level, enf = enforcement bound
    // υ is τi's assigned relative clock frequency, normalized to fmax
    If Ri is a classical reserve then
        p = (⌊t/Ti⌋ + 1) · Ci/(υ·fmax)
        return p
    Else  // Ri is a multi-granular reserve with gi levels of granularity
        // The following step forces the Energy-Min-Freq and InflatedF
        // procedures to move forward across idle-period boundaries.
        If (t % Ti == 0) then t = t + Ti End if
        p = tmp = 0, enf = t + Ci^gi
        For σ = gi down to 1 then
            tmp = tmp + ⌊t/(εi^σ·Ti)⌋ · Ci^σ/(υ·fmax)
            If (tmp ≥ enf) then tmp = enf, p = p + tmp, return p End if
            t = t − ⌊t/(εi^σ·Ti)⌋ · εi^σ·Ti
            If (t == 0) then p = p + tmp, return p End if
            If (σ == 1) then
                tmp = tmp + Ci/(υ·fmax)
                If (tmp > enf) then tmp = enf End if
                p = p + tmp
                return p
            End if
            If (tmp + Ci^σ/(υ·fmax) < enf) then
                p = p + tmp, tmp = 0, enf = Ci^σ/(υ·fmax)
            End if
        End for
        return p
    End if

Figure 5-3: Multi-granular PM-Clock's FindPreemption procedure
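For the classical-reserve branch, the preemption computation reduces to a one-line formula; the sketch below illustrates only that branch (the multi-granular branch is omitted, and υ is the assigned relative frequency):

```python
import math

# Sketch of the classical-reserve branch of FindPreemption: the time
# consumed by all releases of reserve Ri in [0, t], converting cycles
# to time at the task's assigned relative frequency upsilon.

def find_preemption_classical(C, T, t, upsilon, f_max=1.0):
    """(number of releases up to t) * per-release execution time."""
    return (math.floor(t / T) + 1) * C / (upsilon * f_max)

# A reserve {C=3, T=5} at fmax: by t = 12 it has released 3 times,
# consuming 9 time units in total.
print(find_preemption_classical(3, 5, 12, 1.0))  # 9.0
```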

Energy-Min-Freq(τi):
    // S: slack, I: idle duration, α is normalized to fmax
    // t: the end of an idle period, β: workload, IN_BZP: busy-period flag
    // hp(τi): tasks with priorities ≥ τi's
    S = I = β = Δ = 0, α = 1, IN_BZP = TRUE
    ω = Ci/fmax, ω′ = 0
    Do while (ω < Di)
        If (IN_BZP == TRUE) then
            Δ = Di − ω
            Do while (ω < Di) and (Δ > 0)
                ω′ = S + Σ_{∀τk ∈ hp(τi)} FindPreemption(τk, ω, 1)
                Δ = ω′ − ω, ω = ω′
            End while
            IN_BZP = FALSE
        Else
            I = MIN_{∀τk ∈ hp(τi)}(NextBusyTime(τk, ω), Di − ω)
            S = S + I, ω = ω + I, t = ω, β = ω − S
            If (β/t < α) then α = β/t End if
            IN_BZP = TRUE
        End if
    End while
    return α

Figure 5-4: Multi-Granular PM-Clock's Energy-Minimizing-Freq procedure

NextBusyTime(τi, t):
    // nextB = next busy time, p = preemption, nextP = next preemption
    If Ri is a classical reserve then
        nextB = ⌈t/Ti⌉ · Ti
        return nextB
    Else  // Ri is a multi-granular reserve with gi levels of granularity
        nextB = ⌊t/Ti⌋ · Ti
        p = FindPreemption(τi, nextB, 1)
        nextP = p
        While (nextB ≤ t + εi^gi · Ti)
            nextP = FindPreemption(τi, nextB, 1)
            If (nextP != p) then return nextB End if
            nextB = nextB + Ti
        End while
    End if

Figure 5-5: Multi-granular PM-Clock's NextBusyTime procedure
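The classical-reserve branch of NextBusyTime is likewise a one-liner, sketched here for illustration (the multi-granular branch is omitted):

```python
import math

# Sketch of the classical-reserve branch of NextBusyTime: the next
# release (start of a busy interval) of reserve Ri at or after time t.

def next_busy_time_classical(T, t):
    return math.ceil(t / T) * T

print(next_busy_time_classical(5, 12))  # 15
print(next_busy_time_classical(5, 15))  # 15
```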

// Assume a taskset has n tasks and D1 ≤ D2 ≤ … ≤ Dn.
// υi and αi are relative task clock frequencies normalized to fmax.
// hp(τi) and lp(τi) are tasks with priorities ≥ and ≤ τi's priority.
During Admission Control:
    For each task τi then
        υi = 0, αi = Energy-Min-Freq(τi)
    End for
    For i = 1 to n then
        υi = lowest f such that f/fmax ≥ MAX_{∀j ∈ lp(τi)}(αj)
        If (i != 1) and (υ(i−1) > υi) then
            υi = 0
            For j = i to n then αj = InflatedF(τj) End for
        End if
        υi = lowest f such that f/fmax ≥ MAX_{∀j ∈ lp(τi)}(αj)
    End for

InflatedF(τi):
    // i_β and β are the inflated and scalable workloads for time t
    // δ = scalable time, IN_BZP is the busy-period flag
    S = I = β = Δ = δ = 0, α = 1, IN_BZP = TRUE
    ω = Ci/fmax, ω′ = 0
    Do while (ω < Di)
        If (IN_BZP == TRUE) then
            Δ = Di − ω
            Do while (ω < Di) and (Δ > 0)
                i_β = 0, β = 0
                For j = 1 to n then
                    If (Dj ≤ Di) and (υj != 0) then
                        i_β = i_β + FindPreemption(τj, ω, υj)
                    Else  // this task is scalable
                        β = β + FindPreemption(τj, ω, 1)
                    End if
                End for
                ω′ = i_β + β + S, Δ = ω′ − ω, ω = ω′
            End while
            IN_BZP = FALSE
        Else
            I = MIN_{∀τj ∈ hp(τi)}(NextBusyTime(τj, ω), Di − ω)
            S = S + I, ω = ω + I, t = ω, δ = t − i_β
            If (β/δ < α) then α = β/δ End if
            IN_BZP = TRUE
        End if
    End while
    return α

Figure 5-6: Multi-Granular PM-Clock algorithm

5.5 Multi-Granular Progressive Algorithm

Multi-granular Progressive is an extended version of the Progressive algorithm that supports multi-granularity reserves. During admission control, the algorithm uses the Multi-granular PM-Clock algorithm to determine the initial clock frequencies for all tasks. At run-time, it detects and claims slack to further reduce the CPU speed. By investigating two scheduling points in advance, it allows a task to exploit slack not only from higher-priority tasks but also from some eligible lower-priority tasks and from idle periods. Three types of slack are detected at run-time.

• Early slack: slack from tasks that are ahead of their original schedule, assuming the use of their initial frequencies.

• Underused slack: slack from tasks that consume fewer resources than their reserved amount.

• Unreserved slack: slack from idle periods.

Unlike classical reserves, multi-granular reserves support an inter-period mutual budget to tolerate varying demand. Since their underused resources are saved for future demand, Multi-Granular Progressive never claims underused slack from requests (i.e., requests whose demand is less than Ci) of a MultiRSV task. In summary, the algorithm is the same as the Progressive algorithm presented in Section 3.9, except that Equation (3.8) is changed as follows.

Sp = Sp + cp/νp,              if τp's request is not completed or τp is a MultiRSV task;
Sp = Sp + (Cp − acp)/νp,      otherwise.                                          (5.1)
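The modified slack update of Equation (5.1) can be sketched in code as follows (a minimal rendering; the variable names mirror the equation, and the `completed`/`is_multi_rsv` flags are stand-ins for the run-time state described in the text):

```python
# Sketch of Equation (5.1). Names mirror the equation: c_p is the
# cycles consumed so far by tau_p's request, C_p the reserved cycles,
# ac_p the cycles already accounted, nu_p the task's relative frequency.

def update_slack(S_p, c_p, C_p, ac_p, nu_p, completed, is_multi_rsv):
    """Return the new slack S_p after tau_p's scheduling event."""
    if not completed or is_multi_rsv:
        # A MultiRSV task's underused budget is banked for future
        # bursts, so only the consumed portion contributes.
        return S_p + c_p / nu_p
    # Classical reserve: the unused remainder becomes claimable slack.
    return S_p + (C_p - ac_p) / nu_p
```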

5.6 Experiment Results

We evaluate the energy consumption and the resource-guarantee performance of eight algorithms, which can be categorized into three main approaches. Table 5-1 lists the algorithms by category.

Category                                         | Without run-time speed adjustment  | With run-time speed adjustment
Worst-case resource allocation                   | PM-Clock (PM)                      | CYCLE-CONSERVE (CYCLE) by Pillai et al.; Progressive (PROG)
Average-case or statistical resource allocation  | NODVS-AVG; PM-Clock-AVG (PM-AVG)   | Progressive-AVG (PROG-AVG)
Multi-granular deterministic resource allocation | Multi-granular PM-Clock (Multi-PM) | Multi-granular Progressive (Multi-PROG)

Table 5-1: List of algorithms being evaluated in the experiments

Simulations are generated by assuming an ideal processor with ten operating points, zero idle energy, and a power-frequency relation given by P = k·f^3. We chose the Jurassic MPEG-4 video trace from [28] to represent a typical video workload. Five real-time tasks are generated randomly. For each real-time task, we generate a classical reserve based on its worst-case resource demand. Each task has a uniform probability of having a short (0.1-1 ms), medium (1-10 ms) or long (10-100 ms) period, and the task period is uniformly distributed within each range. The task computation is randomly selected and then adjusted based on the system utilization. Each task instance requests a random number of processor cycles, uniformly distributed to achieve the desired ratio of best-case (BCET) to worst-case execution time (WCET); we refer to this ratio as BWET. We also denote RT_U and U as the real-time task utilization and the total system utilization, both excluding the video stream utilization. For all experiments, in order to avoid queueing effects, we drop video requests immediately if they miss their deadlines.

We create all reserves for the Jurassic video stream as follows. Reserves using the worst-case and the average-case resource allocations are given by {dtmax, T} and {dtavg, T}, respectively. The multi-granular reserves are given by Equation (5.2) with a ramp-up index of 0.1. Let gop_size, T, dtmax and dtavg denote the GOP size, the frame period, and the maximum and average decoding times, respectively.

R = {{dtmax, T, T}, {gop_size · dtavg, gop_size · T}}                          (5.2)
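As a sketch, the reserve construction of Equation (5.2) maps directly to code (the numeric values below are illustrative placeholders, not the measured Jurassic trace parameters):

```python
# Build the two-level multi-granular reserve of Equation (5.2) for a
# video stream: worst case per frame at the hotspot level, average
# case per group of pictures at the coarser level.

def video_reserve(dt_max, dt_avg, T, gop_size):
    """dt_max/dt_avg: max and average frame decoding times,
    T: frame period, gop_size: frames per GOP."""
    hotspot = (dt_max, T, T)                        # {C, T, D}
    gop_level = (gop_size * dt_avg, gop_size * T)   # {Cx, eps_x * T}
    return [hotspot, gop_level]

# Illustrative values only.
R = video_reserve(dt_max=8.0, dt_avg=3.0, T=33.0, gop_size=12)
print(R)  # [(8.0, 33.0, 33.0), (36.0, 396.0)]
```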

Five metrics of experimental results are presented:

Energy: the percentage of total CPU energy, normalized to that of NODVS-AVG.13
Miss: the ratio of deadline-missing frames to total frames received.
MissI: the ratio of deadline-missing I-frames to total frames (all types) received.
MissD: the ratio of un-decodable frames (due to a deadline miss or the failure of their reference frames) to total frames received.
Dyn: the ratio of dynamic errors14 to the total number of possible k-consecutive frame sets.

13 Without DVS, we assume that the CPU always executes at fmax, even during idle periods.
14 A dynamic error is defined by Hamdaoui et al. as the failure of a system to satisfy the timing constraints of at least m frames out of any k consecutive frames.

5.6.1 Evaluation without Additional Non-Real-Time Workload

In this section, we evaluate the performance of the Jurassic video stream competing with only a real-time workload. Two scenarios will be evaluated.

(1) Varying the total utilization of the real-time workload with a fixed BWET of 0.5
(2) Varying the BWET of the real-time workload with a fixed total utilization of 0.5

In the first experiment, we ran the Jurassic video stream with varying real-time workloads (RT_U = 0.1, 0.25 and 0.5). All workloads are generated to have a fixed BWET of 0.5. Since the worst-case resource utilization of the Jurassic stream is about 0.3, the last RT_U fully utilizes the system reserve capacity allowed by the PM, CYCLE and PROG algorithms. For the other algorithms, whose reserved resources are less than the worst-case demand, the excessive portion of multimedia requests (i.e., the pending demands when their reserves are depleted) is executed at the system's default frequency. Figure 5-7 to Figure 5-9 show the total CPU energy consumption and Jurassic's frame-miss statistics for all algorithms. Since all algorithms in this experiment deliver MissI and Dyn ratios of less than 5%, we do not show these results in the graphs.

Figure 5-7: Energy vs. RT Workload Utilization at BCET/WCET = 0.5 (bar chart of Energy (%) relative to NODVS-AVG for each algorithm at RT_U = 0.5 with low and high default frequency, and RT_U = 0.25 and 0.1 with high default frequency)

Figure 5-8: Jurassic Miss Ratio vs. RT Workload Utilization at BCET/WCET = 0.5 (bar chart of Miss Ratio for each algorithm under the same configurations)

Figure 5-9: Jurassic MissD Ratio vs. RT Workload Utilization at BCET/WCET = 0.5 (bar chart of MissD Ratio for each algorithm under the same configurations)
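The Energy metric in these figures is normalized to NODVS-AVG, which always runs at fmax, even when idle. Under the simulated power model P = k·f^3 with zero idle energy, that normalization can be sketched as follows (the workload numbers are illustrative, not the simulation's):

```python
# Rough sketch of the energy normalization under P = k * f**3.
# Running W cycles at relative frequency f takes W/f time, so
# E = k * f**3 * (W / f) = k * W * f**2: halving the frequency
# quarters the energy for the same work.

def dvs_energy(cycles, f, k=1.0):
    return k * cycles * f ** 2

def nodvs_energy(total_time, k=1.0):
    # NODVS runs at f = 1 (fmax) for the whole interval, idle included.
    return k * total_time

work = 40.0  # cycles demanded over a 100-time-unit window (illustrative)
pct = 100 * dvs_energy(work, 0.5) / nodvs_energy(100.0)
print(pct)  # 10.0
```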

We observed that under all algorithms, requests from the real-time tasks with classical reserves always meet their deadlines, as expected. We now focus on the resource guarantees obtained by the Jurassic video stream and the total CPU energy consumption under the different algorithms. As expected, all schemes using the worst-case resource allocation approach (PM, CYCLE and PROG) always guarantee zero frame misses and save more than 75% of the energy consumed by NODVS-AVG. However, due to their worst-case resource allocation, their system reserve capacity is much smaller than that of the other schemes. CYCLE and PROG consume less energy than PM because of the run-time slack adjustment. PROG always consumes less energy than CYCLE, since the algorithm can claim slack not only from high-priority tasks but also from low-priority tasks and idle periods.

Even though PM-AVG and PROG-AVG consume much less energy than the other algorithms, their video performance is unacceptable (about 40-50% of frames are un-decodable at RT_U = 0.5 with the lowest default frequency). As expected, their performance is much worse than that of the NODVS-AVG scheme, whose MissD is below 10%. This is due to their dependency on background cycles to deliver good performance, and these cycles are diminished by DVS schemes. In addition, using the highest clock frequency as the default frequency reduces the number of un-decodable frames from 49% to 21% and from 46% to 39% under the PM-AVG and PROG-AVG algorithms, respectively. This performance, however, is not deterministic and can be much worse when BWET is high.

Our new algorithms using the multi-granular approach, Multi-PM and Multi-PROG, deliver good performance in terms of energy consumption, video quality and system utilization. Even with a high workload (RT_U = 0.5), their MissD ratios are still below 10%. More importantly, they also save more energy than the algorithms using the worst-case approach. Significantly, Multi-PM consumes less energy than CYCLE even though it is a static algorithm without any run-time speed adjustment. This is because the worst-case approach assumes much heavier preemption from the video stream and hence assigns higher frequencies to the other tasks. However, when RT_U is low, this effect is lessened and the multi-granular approach may even consume slightly more energy, for two main reasons. First, with a limited number of CPU operating points, the preemption difference between the two approaches is not large enough, so both assign the same initial frequencies to tasks. Second, the multi-granular approach executes some portion of the excessive requests at a higher frequency. Nevertheless, this effect is negligible, and it occurs only when the default frequency is set to a high frequency, which is unlikely in practice.

In the second experiment, we fix the real-time workload at RT_U = 0.5 and vary its BWET ratio over 0.2, 0.5 and 1.0. Figure 5-10 to Figure 5-14 show the performance of all schemes in terms of energy and video frame-miss ratios.

Figure 5-10: Energy vs. BCET/WCET at RT_U = 0.5 (bar chart of Energy (%) normalized to NODVS-AVG for each algorithm at BWET = 0.2, 0.5 and 1.0 with high default frequency, and BWET = 0.5 with low default frequency)

Figure 5-11: Jurassic Miss Ratio vs. BCET/WCET at RT_U = 0.5

Figure 5-12: Jurassic MissD Ratio vs. BCET/WCET at RT_U = 0.5

Figure 5-13: Jurassic MissI Ratio vs. BCET/WCET at RT_U = 0.5

Figure 5-14: Jurassic Dyn Ratio vs. BCET/WCET at RT_U = 0.5

Overall, the experimental results show the same relationship between energy and performance for the different approaches as that observed in the first experiment. For each approach, when BCET is lower, algorithms with run-time speed adjustment usually save more energy; however, they may incur more frame misses if they do not use worst-case resource allocation. This is because these algorithms exploit run-time slack from all unreserved cycles and possible idle periods for speed adjustment, and consequently have fewer background cycles left to service excessive requests. Nevertheless, MULTI-PROG still delivers acceptable video performance in all cases (MissD < 15%, Miss < 10%, Dyn < 0.6% and MissI < 0.2%), since the algorithm ensures that sufficient resources for the desired lower-bound guarantees are reserved.

However, it is possible for PROG-AVG, which is an integrated version of PM-AVG with Progressive, to deliver fewer frame misses than PM-AVG, as we see in the case of RT_U = 0.5 with the lowest default frequency. This is because the run-time speed adjustment must sometimes choose a higher frequency than the optimal value due to the limited availability of CPU operating points. As a result, some real-time requests run ahead of their original PM-AVG schedule and cause less preemption of future video frames. This experiment also shows that, even when using the highest clock frequency to service excessive demands, PM-AVG and PROG-AVG can still encounter very high MissD ratios (66% and 68%, respectively) when BWET is high. In summary, the MULTI-PROG and PROG algorithms deliver the best power-saving results with deterministic resource guarantees, while MULTI-PROG trades some deadline guarantees that can be tolerated by soft real-time tasks for higher system reserve capacity.

5.6.2 Evaluation with Non-Real-Time Workload

In this section, we run the Jurassic video stream with a high real-time workload (RT_U = 0.65 and BWET = 0.5). The given RT_U is chosen to fully utilize the system reserve capacity of the multi-granular algorithms, yet is unschedulable for algorithms using worst-case resource allocation. Five additional non-real-time tasks are randomly generated in the same fashion as the real-time tasks, but without reservations, to achieve total utilizations of 0.65, 0.75 and 1.5, respectively. All algorithms schedule non-real-time tasks using a round-robin scheduling policy. The system default frequency is set to the lowest frequency, since that saves energy in a more controllable fashion.

Figure 5-15 to Figure 5-19 show the performance of all algorithms in this experiment. Overall, in the presence of a non-real-time workload, the total energy of all algorithms increases slightly,15 and their video performance worsens. The energy increase is limited because only a small portion of the non-real-time workload can be executed, due to the idle-period-diminishment phenomenon under DVS algorithms. With the same effect on excessive video demands, the PM-AVG and PROG-AVG algorithms hence perform poorly even without a non-real-time workload (MissD > 50% for all cases).

Even though the NODVS-AVG algorithm has a lower Miss ratio than the multi-granular algorithms, its larger number of missed I-frames results in a higher MissD ratio. In addition, it starts to deliver abruptly degraded performance when the non-real-time workload is large (MissD = 29% at U = 1.5), such that not enough CPU resources are left for excessive video requests. In contrast, since sufficient resources for the desired lower-bound resource guarantees are reserved under MULTI-PM and MULTI-PROG, their performance does not depend much on background cycles, i.e., on the non-real-time workload. Consequently, under the same scenario, their performance decreases are confined such that the lower-bound performance specified in their multi-granular reserve specifications is still obtained (MissD ≈ 20%).

In summary, the MULTI-PM and MULTI-PROG algorithms save more than 70-80% of the power with deterministic and acceptable performance guarantees (20% of frames un-decodable). Moreover, the lower-bound performance guarantees of both algorithms are flexible and controllable through the multi-granular reserve specifications. Additionally, for a system that processes hard real-time tasks with highly fluctuating demands, which require classical reserves for 100% guarantees, MULTI-PROG, with the additional complexity of a run-time speed adjustment, saves more power than MULTI-PM.

15 NODVS-AVG, however, consumes the same energy.

Figure 5-15: Energy vs. NRT Workload at BWET = 0.5 (bar chart of Energy (%) normalized to NODVS-AVG for each algorithm at RT_U = 0.65 with U = 0.65, 0.75 and 1.5)

Figure 5-16: Jurassic Miss Ratio vs. NRT Workload at BWET = 0.5

Figure 5-17: Jurassic MissD Ratio vs. NRT Workload at BWET = 0.5

Figure 5-18: Jurassic MissI Ratio vs. NRT Workload at BWET = 0.5

Figure 5-19: Jurassic Dyn Ratio vs. NRT Workload at BWET = 0.5

5.7 Summary

Using typical statistical reservation schemes for multimedia applications can cause prolonged failures to service bursts when the system is overloaded, due to those schemes' dependency on background cycles to service the excess demands of multimedia applications. In this chapter, we showed that by simply integrating DVS algorithms with those schemes, the system reaches this service-breakdown point much more often, regardless of its workload. We proposed two new algorithms, Multi-granular PM-Clock and Multi-granular Progressive. Both algorithms save significant CPU energy (70% at 60% system utilization). They also provide deterministic guarantees (5-22% of frames un-decodable) even when the system is overloaded, and applications can tune the multi-granular reservation parameters to achieve better QoS. In addition, Multi-granular Progressive saves more energy (up to 33% over Multi-granular PM-Clock at 75% system utilization with BCET/WCET of 0.5) in the presence of hard real-time tasks, which require classical (worst-case) reservations but have widely fluctuating demands.

Chapter 6

DVS for Interactive and Batch Tasks

Interactive and batch tasks typically have aperiodic, random demands and arrival patterns. Interactive requests typically have short computation times and require fast response times, no longer than the user-perception threshold of 50-100 milliseconds [18]; in general-purpose systems, these tasks are assigned high priority for high responsiveness. Batch tasks are compute-intensive, less timing-critical, and typically scheduled in the background. Unfortunately, most real-time DVS algorithms focus only on the real-time task workload and its timing constraints when determining the power-optimized clock frequency. This approach often leaves insufficient background cycles for servicing interactive and batch tasks, leading to high stall rates or even starvation of conventional applications.

General-purpose DVS systems address interactive and batch tasks to some extent. Most solutions propose varieties of prediction mechanisms to estimate the future workload of interactive, periodic (soft real-time) and background (batch) tasks; a single clock frequency for the whole system is then adjusted in proportion to the estimated workload. Given the possibility of prediction errors, especially in a multi-task environment, this approach does not provide deterministic guarantees to soft real-time tasks and cannot support hard real-time tasks.

In this chapter, we develop a new DVS framework for embedded systems. The main goal is to allow the coexistence of hard real-time, soft real-time and conventional applications. We assume that a system uses any of the techniques offered in [18, 21, 24, 70] to predict the interactive and batch workload of conventional applications. Our DVS framework then combines this estimated workload with the classical reserve specifications of hard real-time tasks and the multi-granularity reserve specifications of soft real-time tasks. The framework ensures that conventional applications obtain acceptable response times and workload throughput without breaking the temporal constraints of real-time tasks.

We propose two solutions: the Background-Preserving and Background-On-Demand algorithms. The first algorithm straightforwardly increases the clock frequencies of all tasks to accommodate the possible future non-real-time workload. The second algorithm, on the other hand, assigns two frequency modes to each task, normal and turbo. The normal mode assumes the absence of a non-real-time workload, so the real-time tasks can execute at lower frequencies; the turbo mode is triggered when at least one non-real-time task is pending in the system. We also provide integrated versions of both schemes with the Progressive algorithm; the integrated versions exploit the slack time generated by early-completing tasks and idle durations to further reduce the frequency for more power saving.
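The normal/turbo distinction of the second solution can be sketched as follows (the two frequency tables are illustrative values, not ones derived by the algorithm):

```python
# Sketch of the mode switch described above: turbo-mode frequencies
# apply while any non-real-time task is pending; otherwise each task
# runs at its lower normal-mode frequency. Values are illustrative.

normal_freq = {"tau1": 0.4, "tau2": 0.3}
turbo_freq  = {"tau1": 0.7, "tau2": 0.6}

def task_frequency(task, bg_queue):
    table = turbo_freq if bg_queue else normal_freq
    return table[task]

print(task_frequency("tau1", bg_queue=[]))         # 0.4
print(task_frequency("tau1", bg_queue=["batch"]))  # 0.7
```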

6.1 DVS System Architecture for Heterogeneous Tasks

As in typical real-time resource kernels [48, 52], our system assumes two schedulers, one for real-time tasks and the other for non-real-time (background) tasks. We assume the use of the DM and round-robin scheduling policies in the real-time and non-real-time schedulers, respectively. Background tasks are always assigned lower priority than real-time tasks; they are executed only when all real-time requests are completed.

We assume that users or application developers provide the resource requirements of real-time tasks through reserve specifications. The requirements include their worst-case or statistical resource demands and timing constraints; these data can be obtained from offline profiling. On the other hand, we assume that the demand of interactive and batch tasks is obtained dynamically, either from online prediction mechanisms or from direct hints from users. The total demand is given in terms of a desired processor utilization, Ubg.

In this chapter, we simply service interactive tasks in the background like other batch tasks. Alternatively, a polling server with a short period of 25 ms and a utilization of 5% can be created for interactive tasks for faster response.16 With this approach, however, the system must distinguish interactive requests, which are initiated by a GUI event, using techniques shown in [18].

16 This estimation is based on data collected from four interactive benchmarks in [18].

6.2 Background-Preserving Algorithms (BG-PRSV)

The Background-Preserving algorithm creates one "pseudo" reserve for background tasks with the reserve specification {Cbg, Tbg, Dbg = Tbg}. The background period, Tbg, is chosen to be larger than the (highest-granularity) periods of all classical and multi-granularity reserves to ensure that it has the lowest priority. The background computation, Cbg, is chosen to satisfy the desired processor utilization; therefore, Ubg = Cbg/Tbg. The algorithm takes this pseudo background reserve into consideration and uses Sys-Clock to determine one optimal clock frequency for all reserves. The details of the Sys-Clock algorithm are explained in Section 3.6 and summarized in Figure 3-7. The resulting frequency ensures that all reserves, including the background reserve, satisfy their resource constraints. The new frequency is assigned to all reserves and becomes the system default frequency. Note that this pseudo reserve is used only during admission control; no resource replenishment, enforcement or accounting is managed for it. In other words, at run time the system works as if there were no background reserve. Tasks without reserves, or with depleted reserves, are simply executed at the system default frequency.

16 This estimation is based on the collected data from four interactive benchmarks in [18].
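The admission step of BG-PRSV can be sketched as follows. This is a simplified illustration rather than the dissertation's Sys-Clock implementation: dm_schedulable is a standard deadline-monotonic response-time test with every execution demand stretched by a normalized frequency alpha, and the helper name bg_prsv_frequency, the factor-of-two choice of Tbg and the discrete frequency list are our own assumptions.

```python
from math import ceil

def dm_schedulable(reserves, alpha):
    """Deadline-monotonic response-time test when every demand C
    (cycles at f_max) is stretched to C/alpha by running at the
    normalized frequency alpha (0 < alpha <= 1)."""
    reserves = sorted(reserves, key=lambda r: r[2])   # sort by deadline D
    for i, (C, T, D) in enumerate(reserves):
        R = C / alpha
        while R <= D:                                 # iterate to a fixed point
            R_next = C / alpha + sum(ceil(R / Tj) * (Cj / alpha)
                                     for Cj, Tj, _ in reserves[:i])
            if R_next == R:
                break
            R = R_next
        if R > D:
            return False
    return True

def bg_prsv_frequency(reserves, U_bg, freqs):
    """Pick the lowest operating point that admits all real reserves plus
    one pseudo background reserve {C_bg, T_bg, D_bg = T_bg}."""
    T_bg = 2 * max(T for _, T, _ in reserves)         # larger than every period
    pseudo = (U_bg * T_bg, T_bg, T_bg)
    for alpha in sorted(freqs):                       # try the slowest first
        if dm_schedulable(reserves + [pseudo], alpha):
            return alpha
    return max(freqs)                                 # fall back to f_max

# Two reserves (C at f_max, T, D) plus 20% desired background utilization.
taskset = [(10, 50, 50), (20, 100, 100)]
alpha = bg_prsv_frequency(taskset, U_bg=0.2, freqs=[k / 10 for k in range(1, 11)])
```

Because the pseudo reserve is included only in this admission test, nothing needs to be enforced for it at run time, matching the description above.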

6.2.1 Progressive Background-Preserving (PRO-PRSV)

To detect slack from unused reserves and idle periods, Progressive Background-Preserving adopts the run-time speed adjustment mechanism of the Progressive algorithm. All background tasks are always executed at the system default frequency without any slack reclaiming. This maintains a fast response time and a short completion time for interactive and batch tasks.

6.3 Background-On-Demand Algorithms (BG-OND)

Interactive and batch tasks regularly have intervals of no activity (e.g., user think time). Their duration can be milliseconds, minutes or even longer. Background-On-Demand therefore divides system operation into two modes, normal and turbo. The system is set to turbo mode whenever a ready background task is pending in the queue; otherwise the system is in normal mode. The algorithm creates one pseudo background reserve and determines one new system clock frequency, using the same method as the Background-Preserving algorithm. The new frequency is recognized as the turbo-mode frequency. The originally assigned frequencies of the tasks (disregarding the background reserve) are recognized as the normal-mode frequencies. In normal mode, tasks with reserves are executed at their normal-mode frequencies. In turbo mode, all tasks are executed at the same turbo-mode frequency.

Theorem 6.1

The Background-On-Demand algorithm maintains the schedulability of real-time tasks and a resource throughput of Ubg (Cbg cycles in every interval of Tbg) for background tasks.

Proof.

We divide the proof into two steps. First, we prove that all real-time tasks are guaranteed to meet their deadlines in both modes of operation. Second, we prove that background requests always obtain an average processor utilization of Ubg.

Since background tasks have the lowest priority, their presence does not affect other reserved tasks' critical zones. Whether the system is in normal or turbo mode, executing reserved tasks at their normal-mode frequencies satisfies their constraints. Moreover, executing those tasks at the (higher) turbo-mode frequency can only make their completion times shorter. Now assume that a new background request with demand Cbg arrives at time t, triggering the system to enter turbo mode. We first assume the critical instant of the background request, where all reserved tasks arrive at the same time as the request. Clearly, executing all tasks at the turbo-mode frequency satisfies the pseudo background reserve's specification. If a reserved task arrives and is executed earlier, its preemptive effect on the background request is the same or less.


Applying this to all reserved tasks, the request's completion is the same as or earlier than Tbg. If the background request has a demand larger than Cbg, each additional chunk of demand of size Cbg is also guaranteed to complete before the end of the next Tbg. The same resource pattern is obtained for a smaller request or for a burst of requests. Therefore, the algorithm ensures that background tasks receive a resource throughput of at least Ubg. †
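The mode rule at the heart of BG-OND, per-task normal frequencies when the background queue is empty and one shared turbo frequency otherwise, can be illustrated with a minimal controller. The class name, the dictionary of per-task frequencies and the FIFO queue are illustrative choices, not the thesis implementation.

```python
class BgOnDemand:
    """Minimal BG-OND mode controller (an illustrative sketch).

    normal_freq maps each reserved task to its normal-mode frequency;
    turbo_freq is the single frequency computed with the pseudo
    background reserve included (as in BG-PRSV)."""

    def __init__(self, normal_freq, turbo_freq):
        self.normal_freq = dict(normal_freq)
        self.turbo_freq = turbo_freq
        self.bg_queue = []                       # pending background requests

    def submit_background(self, request):
        self.bg_queue.append(request)            # system enters turbo mode

    def complete_background(self):
        self.bg_queue.pop(0)                     # may leave turbo mode

    def frequency_for(self, task):
        # Turbo mode whenever any background request is pending;
        # otherwise each task runs at its own normal-mode frequency.
        return self.turbo_freq if self.bg_queue else self.normal_freq[task]

ctl = BgOnDemand({"t1": 0.5, "t2": 0.6}, turbo_freq=0.8)
```

Submitting a background request immediately raises every reserved task to the turbo-mode frequency; completing the last pending request drops them back to their individual normal-mode frequencies.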

6.3.1 Progressive Background-On-Demand (PRO-OND)

The integration of the Progressive algorithm with Background-On-Demand is more complex. A new background request may arrive in the middle of executing a task that has already claimed slack based on the normal-mode frequencies. The task might also falsely claim slack from idle periods due to the incorrect assumption that no background workload is present. As a result, the background request's completion can take much longer than Tbg. The algorithm solves this problem by adjusting the remaining slack with respect to the turbo-mode frequency and then recalculating the new, higher frequency at which to execute the task. Unfortunately, this adjustment does not always guarantee the background processor utilization Ubg, because the task might already have been executed at too low a frequency for too long a time for the system to catch up to the new desired schedule.

Figure 6-1: Slack adjustment during turbo mode transition

Figure 6-1 illustrates the slack adjustment during the system transition to turbo mode. Let τp, SH and SL be the current task, its higher-priority slack and its eligible lower-priority slack, respectively. The top portion of the figure shows the slack usable by τp if no background request arrives before the next scheduling point. This slack pattern is used to calculate the current clock frequency of τp; the frequency is chosen so that the task completes exactly at the next scheduling point. The bottom portion of the figure shows the ideal usable slack needed to maintain the schedule of the Background-On-Demand algorithm if a background task arrives at time t. After time t, tasks should be executed at the turbo-mode frequency. The algorithm therefore converts the remaining slack to correspond to the turbo-mode frequency and discards idle slack. Since the algorithm has already executed the task at too low a frequency, it must now execute the task faster to meet the desired schedule. The algorithm may be unable to complete the task as expected if the newly required frequency is higher than fmax. However, the new completion time will be no later than the next scheduling point, which is the completion time under the normal-mode frequencies. Consequently, the algorithm always maintains the schedulability of all real-time tasks; on the few occasions when the faulty reclaiming of idle periods cannot be recovered, it compromises only the guarantees to background tasks.
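The recalculation at the normal-to-turbo transition can be sketched numerically. This is a simplified model under our own assumptions (remaining work measured in cycles at fmax, a single completion target and idle slack simply handed back to the background workload); the thesis's slack bookkeeping with SH and SL is richer.

```python
def adjust_frequency_on_turbo(work_cycles, now, next_sched_point,
                              idle_slack, f_max, freqs):
    """Recompute a task's frequency when a background request arrives:
    reclaimed idle slack is given back to the background workload, and
    the remaining work must finish within the shrunken time budget."""
    budget = (next_sched_point - now) - idle_slack
    if budget <= 0:
        return f_max                   # cannot recover; run flat out
    f_req = work_cycles / budget       # cycles per time unit
    if f_req > f_max:
        # Desired schedule unreachable, but running at f_max still
        # completes no later than the normal-mode scheduling point.
        return f_max
    # Smallest discrete operating point meeting the requirement.
    return min(f for f in freqs if f >= f_req)

freqs = [0.2, 0.4, 0.6, 0.8, 1.0]
# 30 cycles left, 100 time units to the scheduling point, of which 40
# were idle slack the task had (wrongly) claimed for itself.
f = adjust_frequency_on_turbo(30, 0, 100, 40, 1.0, freqs)
```

The fmax clamp mirrors the argument in the text: even when the desired schedule is unreachable, running at fmax keeps the completion no later than the normal-mode scheduling point.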

6.4 Experiment Results

We evaluate the performance of six algorithms: BG-PRSV, PRO-PRSV, BG-OND, PRO-OND, Progressive with the highest default frequency (PRO-HIGH), and NODVS, where the CPU always executes at fmax even during idle periods. Simulations assume an ideal processor with ten operating points, zero idle energy and the power-frequency relation P = kf^3. A set of real-time tasks is generated randomly. Each task has a uniform probability of having a short (1-10 ms), medium (10-100 ms) or long (100-1000 ms) period; period values are uniformly distributed within each range. Task computation times are randomly selected and adjusted based on the desired total utilization and the ratio of best-case to worst-case execution time. A random number of short and long background requests (0 to maxReq of each type) are generated every 200 ms and 500 ms, respectively. Each short request has a fixed demand of 1 ms at fmax. Each long request has a random demand uniformly distributed between 20 and 80 ms at fmax.

Let RT_U, BG_U and BWET be the real-time tasks' total utilization, the background tasks' reserved utilization and the ratio of best-case to worst-case execution time, respectively. These are the input parameters. For output, two metrics are observed: the percentage of total CPU energy normalized to that of the NODVS scheme, and the average response time index (RTI) of the short and long background requests, defined as follows:

response_time_index (RTI) = completion_time / required_execution_time_at_fmax
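The workload setup above can be reproduced with a small generator. The task count, the utilization-normalization step and the seed are our own choices; the thesis specifies only the period ranges, the BWET ratio and the RTI metric.

```python
import random

def generate_taskset(rt_u, bwet, n_tasks=8, seed=0):
    """Random task set: each task draws a short/medium/long period range
    with equal probability; WCETs are scaled so total utilization is rt_u."""
    rng = random.Random(seed)
    ranges = [(1, 10), (10, 100), (100, 1000)]          # period ranges in ms
    periods = [rng.uniform(*rng.choice(ranges)) for _ in range(n_tasks)]
    raw_u = [rng.random() for _ in range(n_tasks)]
    scale = rt_u / sum(raw_u)                           # hit the target utilization
    tasks = []
    for T, u in zip(periods, raw_u):
        wcet = u * scale * T
        tasks.append({"T": T, "wcet": wcet, "bcet": bwet * wcet})
    return tasks

def rti(completion_time, demand_at_fmax):
    """Response time index: completion time over execution time at f_max."""
    return completion_time / demand_at_fmax

tasks = generate_taskset(rt_u=0.7, bwet=0.6)
total_u = sum(t["wcet"] / t["T"] for t in tasks)        # matches rt_u by construction
```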

In the first experiment, we reserve a processor utilization of 0.2 for background tasks and vary their incoming rate (maxReq = 0, 2 and 4). Two systems are investigated: low (RT_U = 0.3) and high (RT_U = 0.7) system workload. All real-time tasks always request their worst-case amount of resources (i.e., BWET = 1.0). Figure 6-2 shows the total energy consumed by all algorithms. Figure 6-3 and Figure 6-4 show the average response time indices of short and long background requests. Note that for readability, these graphs are truncated at 250 and 20, respectively. As expected, PRO-HIGH always delivers a large average response time for both short and long background requests (RTI > 1000 and > 100, respectively). This is due to its diminished-idle-period phenomenon; the algorithm should therefore be avoided in systems with a considerable amount of background work (e.g., utilization > 10%). Consider cases with the same real-time workload. For long background requests, all algorithms except NODVS and PRO-HIGH deliver response times that grow in proportion to the increasing arrival rate. This is due to the fixed processor utilization they deliver to background tasks. Among the static DVS algorithms, BG-OND delivers comparable response time but consumes less power than BG-PRSV. This is the benefit of operating at the normal-mode frequencies when no background workload is present. The power-saving advantage, however, decreases when the arrival rate of background requests is high enough that the system is in turbo mode most of the time.


Due to its run-time speed adjustment, PRO-PRSV always saves more power than BG-PRSV, but it can also deliver longer response times. The problem occurs when a background request arrives in the middle of a real-time task's execution after the task has already claimed the slack from idle periods, forcing the background request to wait longer. As the results show, this problem is dramatically reduced in the PRO-OND algorithm by means of the slack adjustment at the normal-to-turbo mode transition. We observe a high response time index for short requests, because a short background request frequently must wait for a large real-time request. The situation is worse with DVS algorithms, because real-time tasks' executions are slowed down. As a result, interactive tasks can exceed the user-perception threshold. An additional polling server with a short reserve period dedicated to interactive tasks, allowing them to preempt other real-time tasks, can solve this problem. This approach, however, needs a mechanism to identify interactive tasks. In summary, BG-OND and PRO-OND outperform the other algorithms in terms of both energy and response time. We will show in the next experiment that PRO-OND saves more power than BG-OND when BWET is lower.

Figure 6-2: Energy vs. Background Workload at BG_U=0.2 and BWET=1.0

Figure 6-3: Short Request vs. Background Workload at BG_U=0.2 and BWET=1.0

Figure 6-4: Long Request vs. Background Workload at BG_U=0.2 and BWET=1.0

In the second experiment, we focus on a high system workload (RT_U = 0.7) with a varying best-case to worst-case execution time ratio (BWET = 0.2, 0.4, 0.6, 0.8 and 1.0). We fix the reserved processor utilization for background tasks and their incoming rate at BG_U = 0.2 and maxReq = 2. Figure 6-5 shows the total CPU energy normalized to that consumed by BG-PRSV for each scenario. Figure 6-6 and Figure 6-7 show the average RTI for short and long background requests, respectively. Since PRO-HIGH performs very poorly, as in the first experiment, we omit its results for graph readability. As expected, the power saving of PRO-OND over BG-OND increases as BWET decreases, with only a very small increase in response time. In addition, when BWET is low, all algorithms deliver smaller response times for both short and long requests because of less preemption by real-time tasks.

Figure 6-5: Energy vs. BWET at RT_U=0.7, BG_U=0.2 and maxReq=2

Figure 6-6: Short Request vs. BWET at RT_U=0.7, BG_U=0.2 and maxReq=2

Figure 6-7: Long Request vs. BWET at RT_U=0.7, BG_U=0.2 and maxReq=2

We now show the effect of different reserved processor utilizations for background tasks on their response time. In this experiment, we fix the real-time workload at RT_U = 0.7 with BWET = 0.6. The background tasks are generated with maxReq = 2. We run this same system workload with BG_U varied over 0.05, 0.1 and 0.2. Figure 6-8 to Figure 6-10 show the results.

When BG_U is high, all DVS algorithms reserve higher frequencies for real-time and background tasks. These inflated frequencies cause less preemption by real-time tasks and make a background request's execution time shorter. The graphs show the expected result: the average response time falls as BG_U increases. Unfortunately, these high frequencies also result in more CPU energy consumption. However, PRO-OND and BG-OND show a smaller increase in energy than BG-PRSV and PRO-PRSV, since both algorithms use high frequencies only when necessary. Again, PRO-OND saves more energy than BG-OND because of its ability to further reduce frequencies using run-time slack from the varying real-time workload. The algorithm also has the smallest power penalty from an excessive BG_U setting.

Figure 6-8: Energy vs. BG_U at RT_U=0.7, BWET=0.6 and maxReq=2

Figure 6-9: Short Request vs. BG_U at RT_U=0.7, BWET=0.6 and maxReq=2

Figure 6-10: Long Request vs. BG_U at RT_U=0.7, BWET=0.6 and maxReq=2

In summary, across all experiments, PRO-OND delivers the most energy saving together with fast response for background tasks. The algorithm adapts very well to workload variance. Moreover, it has the smallest power penalty from an excessive BG_U setting. This property allows the system to use a pessimistic prediction scheme to obtain better response times for background tasks, at a small cost in power efficiency.

6.5 Summary

Most real-time DVS systems consider only the timing constraints of the real-time workload, causing a diminished-idle-period phenomenon for non-real-time tasks. As a result, conventional applications in those systems usually have high stall rates, which may be unacceptable to some users. General-purpose DVS systems instead periodically predict the interactive and batch loads together with the real-time load, based on past workload, and adjust one system clock frequency to serve all tasks. Due to possible prediction errors, especially in a multi-task environment, these systems can jeopardize the performance of soft and hard real-time applications. In this chapter, we proposed a framework which allows deterministic guarantees for hard and soft real-time applications to coexist with adjustable, efficient response times for interactive and batch tasks. We proposed two algorithms, Background-Preserving (BG-PRSV) and Background-On-Demand (BG-OND). We also integrated both algorithms with the run-time frequency adjustment mechanism of the Progressive algorithm (PRO-PRSV and PRO-OND, respectively). Simulation results show that BG-OND and PRO-OND save more energy than BG-PRSV and PRO-PRSV. They also deliver comparable response times to interactive and batch tasks (better than PRO-PRSV). For systems with hard real-time tasks and widely fluctuating requests, PRO-OND can save more power than BG-OND (17% at 70% system utilization and BCET/WCET = 0.2). Moreover, its adaptability to the run-time workload decreases the energy penalty arising from workload over-estimation. This allows the system to tolerate more conservative prediction schemes in order to achieve better response times for interactive and batch tasks.


Chapter 7

Hierarchical Reservations

Most real-time systems divide tasks into two main groups, real-time and non-real-time tasks. Tasks in each group are prioritized and dispatched under only one specific scheduling policy. This framework limits the flexibility of resource sharing among tasks. For systems whose applications have mixed QoS requirements, the single-scheduling-policy restriction may be unacceptable. We propose a hierarchical reservation model that can be applied in the hierarchical scheduler frameworks proposed in [16, 34, 55, 56]. The model allows applications or groups of tasks to have different scheduling policies tailored to their specific needs, and it supports an arbitrarily deep hierarchy. Our hierarchical reservation structure is designed to guarantee schedulability across hierarchies. In other words, it ensures that the coexistence of heterogeneous scheduling policies does not break the timing-guarantee services provided by real-time schedulers. The model supports locality of admission control within each scheduler. Its hierarchical enforcement and protection also facilitate the implementation of virtual machines, including the ability to isolate applications from one another and to guarantee their resource quotas even in a low-trust environment. The abstraction of resource isolation in this model is extensible: it covers not just a task's boundary but also collections of tasks, applications, users or other high-level resource management entities.


Besides systems that require different scheduling policies among applications, the model can also be applied to large, complex systems such as radar and automation systems, which generally consist of many components, e.g., processors, sensors and actuators. Components can be multi-threaded applications with both real-time and non-real-time properties. In such systems, the locality of admission control provided by hierarchical reservations is crucial to maintaining system scalability. In this chapter, we focus on the schedulability analyses of hierarchical reservations for CPU schedulers using DM scheduling policies. Each scheduler can be considered a server. Three types of servers are analyzed: sporadic, deferrable and polling servers. We derive the critical instant, the critical zone and the exact completion-time test for each type of server. A least upper bound on schedulable processor utilization is derived for hierarchical rate-monotonic (RM) schedulers. We then discuss how the multi-reserve priority-ceiling protocol (multi-reserve PCP) can be used to handle non-preemptable resource sharing in the hierarchy. Finally, we discuss how to extend the schedulability analyses of hierarchical reservations to support multi-granularity reservations.

7.1 Hierarchical Reservation Overview

7.1.1 Design Goals

The hierarchical reservation model assumes that each individual CPU scheduler has a reservation for CPU resources using the classical {C, T, D} reservation model. The design of the model is influenced by the following important considerations.




• Heterogeneity of resource scheduling policies: An application should be able to select scheduling policies that are real-time, non-real-time, or a combination of both within its own resource partition. However, it should avoid a scheduling mismatch among schedulers. For example, it may be inefficient or even impossible to provide a real-time guarantee under a non-real-time scheduler.

• Hierarchical enforcement and protection: A parent scheduler must enforce the resource usage of its child schedulers and tasks, ensuring that their total resource consumption does not exceed the parent's resource quota and timing constraint. Consequently, resource-usage misbehavior in one hierarchical group cannot hurt components outside it.

• Hierarchical management of unused resources: If one or more child schedulers or tasks under-utilize resources, the parent scheduler should be able to recover the remaining resources and distribute them to other child members.

• Locality of admission control: The parent scheduler is responsible for determining the schedulability of all of its children's reserves. The analysis should require only the parent's reserve specification, its scheduling policy and its children's reserve specifications. In other words, admission control should not depend on reserves outside the hierarchical group.

• Uniform reserve specification: To maintain schedulability analysis locally in each layer of the hierarchy, a uniform reserve specification for both real-time and non-real-time schedulers is needed. This also abstracts away the details of the scheduling policies.


7.1.2 Architecture Overview and Terminology

In our hierarchical reservation model, any resource management entity, such as a task, an application or a group of users, is able to create a reservation to obtain resource and/or timing guarantees. Resource requests are granted only if the new request and all current allocations can be scheduled on a timely basis. Each reserve (scheduler) can then recursively create child reserves and become a parent reserve. Different parent reserves can specify different scheduling policies to suit the needs of their respective descendants. For example, one node in the hierarchy may use a DM scheduler, a proportional fair-share scheduler or an EDF scheduler. The resource-isolation mechanism ensures that each child reserve cannot use more resources than its allocation. However, if a child reserve under-utilizes its allocation, the unclaimed resources can be assigned to its siblings. The key challenge in such a system is the capability to grant throughput and latency guarantees to each node in the hierarchy based on its scheduling policy. With run-time efficiency in mind, we require that admission control can be done locally at each level of the hierarchy. To abstract the heterogeneous scheduling policies in the hierarchy, we use a uniform reserve specification {C, T, D}, where C is the amount of CPU execution time granted in every period T with deadline D. Figure 7-1 illustrates a simple example of a reservation hierarchy.

Figure 7-1: A sample of reservation hierarchy

We use the terms "parent reserve" and "child reserve" to refer to the parent-child relationship in the hierarchy. Each reserve can act as a scheduler with its own scheduling policy or as a passive resource (in the case of leaf nodes in the hierarchy). Default and idle reserves are special reserves used for managing unused resources at each reserve level. More details on special reserves and some terminology are given below.

• A default reserve is a reservation for the background cycles of a given reserve level. Non-real-time tasks can be bound (by default) to a default reserve to obtain the residual parent resources not used by other sibling reserves. The default reserve can utilize any scheduling policy to distribute resources among its tasks. The weighted round-robin scheduler integrated with a "goodness" value, as employed by the Linux kernel, is one good example of providing both fairness and starvation protection among non-real-time tasks.




• An idle reserve is used to account for any unused resource of its parent reserve. This is useful for detecting over-reservation scenarios and can be used as feedback for adaptive resource management.

• The term depleted reserve denotes a reserve that has used up its resource allocation C within its current period T.

• The term reserve domain of a reserve denotes all immediate child reserves of that reserve, including the reserve itself. Every reserve domain may have one or more child reserves, a default reserve and an idle reserve (see Figure 7-1).

Each reserve must be guaranteed by its parent reserve to obtain timely resource access as specified in its reserve specification. The parent reserve may have its own scheduling policy, admission control, replenishment and enforcement mechanisms. It distributes its own resource budget to its child reserves based on its scheduling policy. To maintain resource isolation, its enforcement mechanism ensures that the child reserves cannot overuse their budgets and that the total resource usage does not exceed the parent's resource quota. In practice, non-real-time tasks attach directly to parent reserves with non-real-time scheduling policies; such parents have no child schedulers. We therefore focus primarily on the schedulability analyses for real-time tasks under the hierarchical DM scheduling policy.
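The parent-child structure, the per-period budget and the reserve-domain notion can be captured in a small data structure. This is a simplified sketch with hypothetical names; replenishment timing and enforcement actions are elided.

```python
class Reserve:
    """Node in a reservation hierarchy: a {C, T, D} specification plus a
    per-period budget that the parent replenishes and enforces."""

    def __init__(self, C, T, D=None, policy="DM"):
        self.C, self.T = C, T
        self.D = T if D is None else D        # D defaults to T for parents
        self.policy = policy
        self.children = []
        self.budget = C                       # cycles left in this period

    def create_child(self, C, T, D=None, policy="DM"):
        child = Reserve(C, T, D, policy)
        self.children.append(child)
        return child

    def depleted(self):
        # A depleted reserve has used up its allocation C within T.
        return self.budget <= 0

    def reserve_domain(self):
        # The reserve itself plus its immediate child reserves.
        return [self] + self.children

root = Reserve(C=100, T=100, policy="DM")     # 100% of the CPU
parent = root.create_child(C=10, T=50)        # {10, 50, 50}
leaf = parent.create_child(C=5, T=250)        # {5, 250, 250}
```

Note that a reserve domain is deliberately shallow (one level), which is what allows schedulability analysis to be done one domain at a time.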

7.1.3 The Functionality of Hierarchical Reservations

The additional functionality of hierarchical reservations relative to "flat" resource reservations can be summarized as follows.

• Each reserve can create child reserves to form a sub-tree.




• Parent reserves can use different scheduling policies set by resource principals.

• Each reserve uses the classic reserve specification, {C, T, D}, for resource usage and timeliness guarantees. The same model can also represent a bandwidth of C/T with delay D for soft real-time tasks, or time-slices for non-real-time tasks.17

• The guaranteed resource allocation of a reserve is allocated and checked by its parent reserve.

• Each parent reserve (scheduler) ensures timeliness by dynamically monitoring and enforcing the actual resource usage of its children, recursively. Even if some resource principals deploy a scheduler that lacks this support, temporal isolation is still maintained in the parent layers.

• A corresponding default reserve is activated when a parent reserve has available resources but no child reserve is claiming them. This default reserve is meant for soft reserves and non-real-time applications.

• A corresponding idle reserve is activated when the parent reserve has available resources but no application (not even a non-real-time one) is trying to use them. This information is useful for detecting over-reservation and can be used as feedback control for adaptive resource management.

• To maintain composability of the analysis, a hierarchical priority-assignment scheme is used to avoid scheduling mismatches among schedulers. Strictly speaking, a child reserve of a higher-priority parent is always considered higher priority than any child reserve of a lower-priority parent.

• Starting at the root scheduler, the highest-priority child reserve eligible to execute is scheduled to run. This decision is then applied recursively.

• For admission control, the parent reserve is responsible for determining the schedulability of its children based on the following information:
1. its own reserve specification: how much and how frequently it obtains resources;
2. its scheduling policy: how it manages its resources; and
3. its children's requirements: how much and how frequently each child needs resources.
This admission control does not depend on the upper scheduling policies (i.e., how its grandparent manages resources) or on the lower scheduling policies of its children (i.e., how the children manage their resource shares).

17 We will extend the model to a multi-granularity reservation model later in this chapter.
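The recursive dispatch rule, picking the highest-priority eligible child at each level starting from the root, can be sketched over a plain dictionary tree. Eligibility is reduced here to "has remaining budget" and priority to DM order (smaller deadline wins); both are simplifications of the model described above.

```python
def pick_next(node):
    """Recursively select the highest-priority eligible child at each
    level; if no child is eligible, the reserve itself (e.g., a leaf or
    its default/idle reserve) runs."""
    eligible = [c for c in node.get("children", []) if c["budget"] > 0]
    if not eligible:
        return node
    # Deadline-monotonic order: smaller D means higher priority.
    return pick_next(min(eligible, key=lambda c: c["D"]))

root = {"name": "root", "D": 0, "budget": 100, "children": [
    {"name": "A", "D": 50, "budget": 10, "children": [
        {"name": "A1", "D": 250, "budget": 5, "children": []},
        {"name": "A2", "D": 500, "budget": 0, "children": []},   # depleted
    ]},
    {"name": "B", "D": 100, "budget": 0, "children": []},        # depleted
]}
```

Because the decision is made level by level, a child of a higher-priority parent always runs before any child of a lower-priority parent, matching the hierarchical priority-assignment rule.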

7.2 The Hierarchical Deadline-Monotonic (HDM) Model

Our analysis follows an approach similar to that of Liu and Layland [40]. However, the hierarchical reservation model has some subtle assumptions and constraints that make the analysis considerably more complex.

7.2.1 System Model and Notation

In the hierarchical deadline-monotonic scheduler model, a child reserve Ri can be created only if its parent reserve Rp has been guaranteed {Cp, Tp, Dp} for resource access. In our schedulability analyses for child reserves, we assume that the parent reserve has n child reserves, R1, R2, ..., Rn, corresponding to the requests {C1, T1, D1} through {Cn, Tn, Dn}, respectively. By convention, we assume D1 ≤ D2 ≤ ... ≤ Dn.

Tasks must be bound to leaf (child) reserves to obtain a resource guarantee. In other words, parent reserves (non-leaf reserves in the hierarchy) have no tasks attached to them directly and, in practice, usually have their periods equal to their deadlines. For the sake of the analysis, we assume that a task τi is associated with a reserve Ri. Schedulability analysis is done on one reserve domain at a time; it makes worst-case assumptions about all tasks outside the domain to achieve locality of schedulability analysis. This analysis of a reserve domain can be applied to any node in the hierarchy. Throughout this chapter, we refer to sibling tasks associated with reserves sharing the same parent in a reserve domain simply as tasks. The parent reserve and other reserves outside the reserve domain will be specified explicitly.

7.2.2 Assumptions

The following assumptions have been made to obtain the analytical results. Some of these are similar to the standard Liu and Layland assumptions [40].

(A1) Tasks with hard deadlines are periodic and independent of other tasks.
(A2) The next request of a task occurs only after the previous request is completed.
(A3) After sending a request, a task is always ready to run without blocking itself or waiting for another resource.18
(A4) Tasks are preemptable at any time. The cost of preemption is negligible.
(A5) Each child reserve is associated with one real-time task. A reserve can also be shared among non-real-time tasks whose timing constraints are not critical.
(A6) A child reserve has an arbitrary phasing and may therefore be synchronized or unsynchronized with its parent reserve.
(A7) The period and deadline of a child reserve are greater than those of its parent reserve, respectively.
(A8) The resources of the parent reserve are already guaranteed and given by {Cp, Tp = Dp}.

18 This assumption will be removed later in this chapter.

7.2.3 Server Replenishment

To analyze the schedulability of HDM scheduling, we consider a parent reserve to be a server that serves requests from its child reserves. Three kinds of server replenishment schemes will be analyzed: the deferrable server [68], the sporadic server [67] and the polling server [37].

The deferrable server with {Csp, Tsp, Dsp} allows any of its clients to use its resource at any time within the current period Tsp until its budget Csp is exhausted. The server budget is filled up periodically with period Tsp. The budget cannot be saved for future use, which means that any unclaimed budget remaining from the previous replenishment is always discarded at the next replenishment. Despite their periodicity, child reserves can have different initial phasings. Different execution phasings can produce a back-to-back execution phenomenon [68], as in the Deferrable Server (DS) that serves aperiodic requests [38]. This happens when a child reserve's request arrives near the end of the server period, fully consumes the server budget and then continues to consume the new budget when the next replenishment is made at the end of the period. This causes a double preemption time (2Csp) to lower-priority tasks in the system. This back-to-back execution can be avoided using the Sporadic Server (SS) technique [67], but with higher implementation complexity. Instead of providing periodic replenishments to the server budget, the sporadic server replenishes the budget based on the actual time of prior resource usage, such that in the worst case a child reserve behaves like a classical Liu and Layland periodic task. More details on both server types are given in [38, 67, 68]. As in the deferrable server, the polling server budget is filled up periodically with period Tsp. However, to keep its worst-case resource consumption from exceeding that of a classical Liu and Layland periodic task, only requests arriving before the current replenishment will be served; late requests are delayed to the next replenishment. In this chapter, we will present the schedulability analyses for all three servers.
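The budget-accounting rules that distinguish the three servers can be sketched as follows. This is an illustrative model only; the class names and structure are ours and not those of the implemented system, but the sketch captures the replenishment semantics described above.

```python
# Simplified budget accounting for the three server types discussed above
# (illustrative sketch only; not the actual kernel implementation).

class Server:
    def __init__(self, budget, period):
        self.capacity = budget   # C_sp
        self.period = period     # T_sp
        self.budget = budget     # currently available budget

class DeferrableServer(Server):
    def replenish(self):
        # Budget is refilled to full capacity every period; any unclaimed
        # budget from the previous period is discarded (no carry-over).
        self.budget = self.capacity

class PollingServer(Server):
    def __init__(self, budget, period):
        super().__init__(budget, period)
        self.pending = []        # requests that arrived after the last polling time

    def replenish(self):
        # Budget is refilled periodically, but only requests that arrived
        # before this polling instant are served; later requests wait for
        # the next replenishment.
        self.budget = self.capacity
        ready, self.pending = self.pending, []
        return ready

class SporadicServer(Server):
    def __init__(self, budget, period):
        super().__init__(budget, period)
        self.replenishments = []  # scheduled (time, amount) refills

    def consume(self, now, amount):
        # Each consumed chunk is returned one full period after the time it
        # started being used, which rules out the deferrable server's
        # back-to-back (2*C_sp) preemption.
        self.budget -= amount
        self.replenishments.append((now + self.period, amount))
```

The deferrable server's "refill regardless of usage" rule is exactly what permits the back-to-back phenomenon; the sporadic server's usage-based refills remove it at the cost of tracking a list of pending replenishments.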

7.2.4 Jitter Effect under Hierarchical Reservation

The introduction of a multi-level hierarchy using deferrable-server replenishment may introduce a potential source of jitter. Bernat and Burns [9] prove a general form of the worst-case response-time formulation under jitter, as shown in Equation (7.1). Due to possible back-to-back execution, a parent reserve using deferrable-server replenishment behaves like a periodic task with a jitter given by Jj = Tj − Cj. In the case of regular reserves (reserves that do not have any child) and parent reserves with sporadic- and polling-server replenishments, this jitter is zero.

\omega^{k+1} = C_i + \sum_{j<i} \left\lceil \frac{\omega^k + J_j}{T_j} \right\rceil C_j \qquad (7.1)

This equation must be solved recursively, starting with ω0 = Ci and finishing when ωk+1 = ωk on success or ωk+1 > Di on failure.
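The fixed-point iteration for this jitter-aware response time can be sketched as follows. This is an illustrative sketch, not the dissertation's implementation, and the task parameters in the usage note are hypothetical.

```python
from math import ceil

def response_time_with_jitter(C, T, J, D, i):
    """Worst-case response time of task i under fixed-priority scheduling
    with release jitter (the recursion of Equation 7.1). Tasks are indexed
    by priority: indices j < i have higher priority (shorter deadlines).
    Returns the response time, or None if the recursion exceeds D[i]."""
    omega = C[i]                                    # omega^0 = C_i
    while True:
        nxt = C[i] + sum(ceil((omega + J[j]) / T[j]) * C[j]
                         for j in range(i))
        if nxt == omega:                            # fixed point reached
            return omega
        if nxt > D[i]:                              # deadline exceeded
            return None
        omega = nxt
```

For example, with C = [1, 2, 3], T = D = [5, 10, 20] and zero jitter, the lowest-priority task converges to a response time of 7.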

7.3.1 Critical Instant of SS-HDM

We now present the critical instant of a hierarchical deadline-monotonic scheduler with sporadic-server replenishment in Theorem 7.2.

Theorem 7.2

Under the HDM scheduler with SS replenishment, a critical instant for a task occurs whenever the task's request arrives simultaneously with requests from all higher-priority tasks, and lower-priority tasks have used the parent reserve's resource quota of the current replenishment for PR(τi) time units.

Proof.

We first determine the phasing of a task τi that will make its response time the longest with respect to higher-priority tasks. Consider any task τj with higher priority than τi's priority. Suppose τj arrived earlier than τi. There are two cases. First, τj could have executed for some time since its parent reserve was available. In this case, moving τj's arrival time towards τi's arrival time can only make τi's response time longer or keep it the same. Secondly, τj could not execute since the parent reserve was depleted. In this case, moving τi's arrival time towards τj's can only make τi's response time longer. Now, suppose τj instead arrived later than τi. Moving task τj's arrival time towards that of τi cannot make the response time for τi shorter. Using this argument across all higher-priority tasks, τi's longest response time occurs when it arrives together with its higher-priority tasks.

Now, consider any task τk with lower priority than τi. In addition to the preemption by higher-priority tasks, unlike the analysis in the classical Liu and Layland model, τk can cause τi's response time (and that of other higher-priority tasks) to be longer by consuming the parent reserve's resource before τi and its higher-priority tasks arrive. The maximum amount of time that can be consumed by all lower-priority tasks is given by PR(τi) in Equation (7.2).

We now analyze when the worst-case pre-used resource should occur to produce the critical instant. With the SS replenishment, the resource is always replenished Tp units from when it is used. Consequently, if we advance the pre-used resource consumption to be earlier, the next replenishment will be advanced too and, hence, task τi can obtain its resource earlier. Therefore, we can conclude that the critical instant of task τi occurs when lower-priority tasks consume the pre-used resource just before τi's request arrives, which is the same time at which the other higher-priority tasks' requests arrive. †

7.3.2 The Worst-Case Response Time of SS-HDM

We first illustrate our notation with an example. Let us assume that we have a parent reserve RP with three child reserves, Rj, Ri and Rk, where Rj has the highest priority and Rk has the lowest priority among these three child reserves. Let τj, τi and τk be the tasks associated with reserves Rj, Ri and Rk, respectively. The worst-case completion time of task τi can be obtained by the combination of the following:

(1) τi arrives at the worst-case phasing with respect to its parent reserve and its sibling reserves. This corresponds to the critical instant defined in Theorem 7.2.

(2) τi's parent reserve obtains its guaranteed Cp at the latest possible moment, i.e., during the last Cp time units of its guaranteed time-frame Tp. Note that this is a pessimistic assumption, since the parent reserve may have the highest priority in the system and can obtain the resource immediately after its replenishment. However, this assumption is a necessary cost within our model to obtain the locality of schedulability analysis. In other words, the analysis can be performed without the need for specific information from the higher levels of the hierarchy (for example, what scheduling policy the upper scheduler uses or what the parent reserve's priority is).

Figure 7-3: The critical zone of task τi for the SS-HDM scheduler

Figure 7-3 shows the critical zone of task τi. The first time-line shows the currently available resource in the parent reserve. The colored boxes show the execution time of tasks τj, τi and τk, which produce the worst-case scenario. The last line shows the critical zone of τi; the execution time of each task from the time-lines at the top is projected onto the last line. We will generalize and formalize this execution time-pattern in the proof of the next theorem.

Theorem 7.3

For a HDM scheduler with SS replenishment, the worst-case response time of a task τi, which shares the parent reserve with the reserve specification given by {Cp, Tp = Dp} with tasks τ1, τ2, ..., τn where D1 ≤ D2 ≤ … ≤ Dn, is the smallest solution to the following equation:

\omega^{k+1} = C_i + \sum_{j<i} \left\lceil \frac{\omega^k + J_j}{T_j} \right\rceil C_j + \left\lceil \frac{\omega^k + PR(\tau_i)}{T^p} \right\rceil (T^p - C^p) \qquad (7.3)

This equation must be solved recursively, starting with ω0 = Ci and finishing when ωk+1 = ωk on success or ωk+1 > Di on failure.
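The iteration in Equation (7.3) can be sketched as follows, with PR(τi) taken as an input. This is a sketch with hypothetical parameters, not the dissertation's implementation.

```python
from math import ceil

def ss_hdm_response_time(C, T, J, D, i, Cp, Tp, PR):
    """Worst-case response time of task i under SS-HDM (Equation 7.3).
    The parent reserve {Cp, Tp} appears as a pseudo-task of capacity
    (Tp - Cp) and period Tp, offset by the pre-used resource PR(tau_i).
    Tasks with index j < i have higher priority."""
    omega = C[i]                                    # omega^0 = C_i
    while True:
        nxt = (C[i]
               + sum(ceil((omega + J[j]) / T[j]) * C[j] for j in range(i))
               + ceil((omega + PR) / Tp) * (Tp - Cp))
        if nxt == omega:
            return omega                            # converged
        if nxt > D[i]:
            return None                             # unschedulable
        omega = nxt
```

With a single task (C = 2, T = D = 10) under a parent reserve {Cp = 5, Tp = 8}, the response time is 5 when PR(τi) = 0 and grows to 8 when PR(τi) = 5, showing how resource pre-usage by lower-priority tasks lengthens the critical zone.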

Proof.

Consider a task τi and assume one of its requests arrives at time t1. Its critical instant occurs when its parent reserve has been consumed by PR(τi), as shown in Figure 7-3. The first replenishment in the critical zone begins at time t1 − PR(τi) + Tp. Before that time instant, τi and its higher-priority tasks may obtain the resource if there is any resource left from PR(τi). Afterward, the task and its higher-priority tasks continuously obtain the resource from their parent reserve, which is shaped into a periodic form with parameter Cp every Tp. We can think of the waiting time to obtain the resource, shown as brick boxes in the figure, as a pseudo-task having capacity (Tp − Cp) and period Tp. Let ω be the time that it takes τi to finish its job. The preemption from the pseudo-task is given by

\left\lceil \frac{\omega + PR(\tau_i)}{T^p} \right\rceil (T^p - C^p). \qquad (7.4)

The remainder of the resource, after excluding the preemption of the pseudo-task, is shared by τi and its higher-priority tasks. The maximum preemption by higher-priority tasks, where each task τj may have a jitter Jj, is given by the following equation:

\sum_{j<i} \left\lceil \frac{\omega + J_j}{T_j} \right\rceil C_j.

7.4.1 Critical Instant of DS-HDM

… Cp, will be given to τi before t0 + φ + Tp.

Figure 7-4: Unsynchronized phasing effect

Two cases arise:



Case 1. If φ ≥ Cp, increasing φ reduces the waiting time of τi to get the resource at the next replenishment (at t0 + 2Tp) and hence reduces its completion time. Therefore, task τi has the largest completion time when φ = Cp.

Case 2. If φ < Cp, the amount of resource which τi can obtain before time t0 + 2Tp is always larger than that in Case 1.

Combining the two cases, the maximum completion time of τi occurs when its request arrives Cp time units after a replenishment instant. Suppose there is a task τj that has higher priority than τi. It can cause the maximum preemption to task τi when both tasks request the resource simultaneously. The detail of this analysis is the same as shown in Section 7.3.1.

Figure 7-5: Critical instant of DS-HDM

Consider the effect of resource pre-usage by lower-priority tasks. Let the task arrive at time Cp after the replenishment, as shown in the top time-line in Figure 7-5. In the worst case, it obtains the resource after the next replenishment time regardless of whether the resource is pre-used. However, if lower-priority tasks arrive and obtain the resource for PR(τi) time units right before τi arrives and PR(τi) = Cp, the parent reserve is depleted, τi will have to wait for the next replenishment, and it may obtain the resource at the end of the guaranteed time-frame, as shown in the bottom time-line in Figure 7-5. †

As can be seen from Figure 7-5, the completion time of τi due to the pre-used resource (the bottom time-line) and the Cp phase offset (the top time-line) will be the same if τi and its higher-priority reserves request more than Cp of resource and τi's period is larger than 2Tp − Cp. Consequently, we can use either case to find the worst-case completion time of τi. Due to its mathematical simplicity, we will use the bottom time-line to represent the critical zone.

7.4.2 The Worst-Case Response Time of DS-HDM

Theorem 7.5

For the HDM scheduler with DS replenishment, if Ti > 2Tp − Cp for all i, the schedulability of task τi that shares the parent reserve {Cp, Tp = Dp} with tasks τ1, τ2, ..., τn, where D1 ≤ D2 ≤ … ≤ Dn, can be determined by finding the smallest solution of the following equation:

\omega^{k+1} = C_i + \sum_{j<i} \left\lceil \frac{\omega^k + J_j}{T_j} \right\rceil C_j + \left\lceil \frac{\omega^k + C^p}{T^p} \right\rceil (T^p - C^p) \qquad (7.6)

This equation must be solved recursively, starting with ω0 = Ci and finishing when ωk+1 = ωk on success or ωk+1 > Di on failure.

Proof.

The critical zone of the task (the bottom time-line in Figure 7-5) has the same pattern as the one in the sporadic-server case when PR(τi) = Cp. Substituting this value of PR(τi) in Equation (7.3), we obtain the theorem. †

For the case that Dp < Tp, a child reserve and its higher-priority reserves will obtain the first Cp time units of resource earlier by Tp − Dp, but the period of the replenishment is still the same (Tp). Therefore, the completion time under both SS-HDM and DS-HDM can be obtained by subtracting Tp − Dp from Equations (7.3) and (7.6).
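Putting Theorem 7.5 and the Dp < Tp remark together, a DS-HDM response-time computation can be sketched as follows. The parameters are hypothetical, and the Tp − Dp adjustment is applied here to the converged completion time, as the remark above describes.

```python
from math import ceil

def ds_hdm_response_time(C, T, J, D, i, Cp, Tp, Dp=None):
    """Worst-case response time of task i under DS-HDM (Equation 7.6):
    the SS-HDM recursion with the pre-used resource fixed at PR = Cp.
    If Dp < Tp, the parent delivers its first Cp units earlier by
    (Tp - Dp), so that amount is subtracted from the result."""
    if Dp is None:
        Dp = Tp
    assert T[i] > 2 * Tp - Cp, "precondition of Theorem 7.5"
    omega = C[i]
    while True:
        nxt = (C[i]
               + sum(ceil((omega + J[j]) / T[j]) * C[j] for j in range(i))
               + ceil((omega + Cp) / Tp) * (Tp - Cp))
        if nxt == omega:
            return omega - (Tp - Dp)   # Dp < Tp adjustment (zero if Dp == Tp)
        if nxt > D[i]:
            return None
        omega = nxt
```

For a single task (C = 2, T = D = 20) under {Cp = 5, Tp = Dp = 8}, the recursion converges to 8; with Dp = 7 the completion time improves to 7.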

7.5 Polling Server – HDM Schedulability Analysis

We now analyze the schedulability of a HDM scheduler using the polling server.

7.5.1 Critical Instant of PS-HDM

Theorem 7.6

Under the HDM scheduler with PS replenishment, a critical instant for a task occurs whenever the task's request arrives simultaneously with requests from all higher-priority tasks at the time instant just after the polling time of the server.

Proof.

We can use the same methodology as in Theorem 7.2 to show that the task's completion time is longest if all higher-priority tasks and the task arrive at the same time. Therefore, to avoid redundancy, we omit the detail of this part of the proof. Now, if all tasks arrive before or at the server's polling time, some of the higher-priority tasks will get executed earlier, making the task's completion time shorter. If all tasks arrive after the polling time, their executions will be delayed at least until the next polling time. Therefore, the task's longest completion occurs when all tasks arrive just after the polling time, delaying their execution as much as possible. The worst-case delay to the next polling time is Tp. †

7.5.2 The Worst-Case Response Time of PS-HDM

Theorem 7.7

For the HDM scheduler with PS replenishment, if Ti > 2Tp − Cp for all i, the schedulability of task τi that shares the parent reserve {Cp, Tp = Dp} with tasks τ1, τ2, ..., τn, where D1 ≤ D2 ≤ … ≤ Dn, can be determined by finding the smallest solution of the following equation:

\omega^{k+1} = C_i + \sum_{j<i} \left\lceil \frac{\omega^k + J_j}{T_j} \right\rceil C_j + \left( T^p + \left\lceil \frac{\omega^k - T^p}{T^p} \right\rceil (T^p - C^p) \right) \qquad (7.7)

This equation must be solved recursively, starting with ω0 = Ci and finishing when ωk+1 = ωk on success or ωk+1 > Di on failure.
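The only change from the SS/DS recursions is the pseudo-task term, which now includes the full polling delay Tp up front. A sketch with hypothetical parameters:

```python
from math import ceil

def ps_hdm_response_time(C, T, J, D, i, Cp, Tp):
    """Worst-case response time of task i under PS-HDM (Equation 7.7).
    Every request first waits out a full polling period Tp; afterward
    the pseudo-task steals (Tp - Cp) once every period Tp."""
    omega = C[i]
    while True:
        nxt = (C[i]
               + sum(ceil((omega + J[j]) / T[j]) * C[j] for j in range(i))
               + Tp + ceil((omega - Tp) / Tp) * (Tp - Cp))
        if nxt == omega:
            return omega
        if nxt > D[i]:
            return None
        omega = nxt
```

For a single task (C = 2, T = D = 30) under {Cp = 5, Tp = 8}, the recursion converges to 13, whereas the DS recursion (Equation 7.6) with the same parameters converges to 8, illustrating the polling server's extra up-front delay.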

Proof.

Figure 7-6: The critical zone of task τi for the PS-HDM scheduler

Figure 7-6 shows the critical zone of task τi. All tasks' executions will be delayed until the next polling time. In addition, those tasks are executed only when their parent obtains its resource as late as possible, which is at the end of its period.

The first term Tp in Equation (7.7) represents the delay until the next polling time. After the delay, the preemption of the pseudo-task continues to be (Tp − Cp) every period Tp, which can be formalized by the following equation:

\left\lceil \frac{\omega - T^p}{T^p} \right\rceil (T^p - C^p).

The remainder of the resource, after excluding the first polling-server delay and the preemption of the pseudo-task, is shared by τi and its higher-priority tasks. The maximum preemption by higher-priority tasks, which may have a jitter as defined in Equation (7.1), is given by the following equation:

\sum_{j<i} \left\lceil \frac{\omega + J_j}{T_j} \right\rceil C_j.

2. We can modify Ti to T′i = kTi, where Tn = kTi + r, 0 ≤ r ≤ Ti and k ≥ 2. To maintain full utilization of the processor, we also need to modify Cn to C′n = Cn + (k−1)Ci. Then, the utilization of the modified task set decreases by (k−1)Ci(1/(kTi) − 1/Tn) ≥ 0, and Tn/T′i < 2. Consequently, Tn/Ti < 2 is a necessary condition to minimize the least upper bound of the processor utilization.

Let R1, R2, …, Rn denote the n child reserves whose period ratio is less than 2 but whose capacity ratio is greater than 2. Let C1, C2, …, Cn denote the capacities of the n child reserves that fully utilize the processor. Let T*1 be the minimum time that satisfies κ(T*1) = 2κ(T1). Assume that T1 ≤ … ≤ Tj ≤ T*1 ≤ Tj+1 ≤ … ≤ Tn. We will show that C1 must be equal to C*1 = T2 − T1 − Preemption(T1, T2) in order to minimize the utilization bound. Suppose that C1 = C*1 − Δ. Transform this reserve set as follows:

C′1 = C*1,
C′2 = C2,
⋮
C′n = Cn − 2Δ.

The new reserve set still fully utilizes the processor. Let U and U′ be the utilization factors of the original reserve set and the modified reserve set, respectively. We have U′ − U = Δ/T1 − 2Δ/Tn.
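The two transformations used in this argument can be checked numerically. The values below are hypothetical and only illustrate the arithmetic, with k = 2 for the period-scaling step.

```python
# Numeric check of the two transformation arguments above (hypothetical values).

# Step "2.": replacing Ti by T'i = k*Ti (k = 2) and Cn by C'n = Cn + (k-1)*Ci
# keeps the set fully utilizing the processor while not increasing total
# utilization, so Tn/Ti < 2 can be assumed when minimizing the bound.
Ci, Ti, Cn, Tn, k = 1.0, 4.0, 2.0, 9.0, 2           # Tn = k*Ti + r with r = 1
U_before = Ci / Ti + Cn / Tn
U_after = Ci / (k * Ti) + (Cn + (k - 1) * Ci) / Tn
drop = (k - 1) * Ci * (1 / (k * Ti) - 1 / Tn)       # = U_before - U_after >= 0
assert abs((U_before - U_after) - drop) < 1e-12 and drop >= 0

# Delta transformation: C'1 = C1 + Delta and C'n = Cn - 2*Delta change the
# utilization by Delta/T1 - 2*Delta/Tn, which is negative whenever Tn < 2*T1.
T1, Delta = 5.0, 0.5
dU = Delta / T1 - 2 * Delta / Tn
assert dU < 0                                        # here Tn = 9 < 2*T1 = 10
```

The second check shows why C1 < C*1 cannot minimize the bound when the period ratio is below 2: the transformed set fully utilizes the processor at strictly lower utilization.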
