Published by Addison-Wesley (September 1, 2014) © 2015
Thomas Limoncelli | Strata Chalup | Christina Hogan
“There’s an incredible amount of depth and thinking in the practices described here, and it’s impressive to see it all in one place.”
—Win Treese, coauthor of Designing Systems for Internet Commerce
The Practice of Cloud System Administration, Volume 2, focuses on “distributed” or “cloud” computing and brings a DevOps/SRE sensibility to the practice of system administration. Unsatisfied with books that cover either design or operations in isolation, the authors created this authoritative reference centered on a comprehensive approach.
Case studies and examples from Google, Etsy, Twitter, Facebook, Netflix, Amazon, and other industry giants are explained in practical ways that are useful to all enterprises. The new companion to the best-selling first volume, The Practice of System and Network Administration, Second Edition, this guide offers expert coverage of the following and many other crucial topics:
Designing and building modern web and distributed systems
- Fundamentals of large system design
- Understand the new software engineering implications of cloud administration
- Make systems that are resilient to failure and grow and scale dynamically
- Implement DevOps principles and cultural changes
- IaaS/PaaS/SaaS and virtual platform selection
Operating and running systems using the latest DevOps/SRE strategies
- Upgrade production systems with zero downtime
- What and how to automate; how to decide what not to automate
- On-call best practices that improve uptime
- Why distributed systems require fundamentally different system administration techniques
- Identify and resolve resiliency problems before they surprise you
Assessing and evaluating your team’s operational effectiveness
- Manage the scientific process of continuous improvement
- A forty-page, pain-free assessment system you can start using today
Preface
About the Authors
Introduction
Part I: Design: Building It
Chapter 1: Designing in a Distributed World
1.1 Visibility at Scale
1.2 The Importance of Simplicity
1.3 Composition
1.4 Distributed State
1.5 The CAP Principle
1.6 Loosely Coupled Systems
1.7 Speed
1.8 Summary
Exercises
Chapter 2: Designing for Operations
2.1 Operational Requirements
2.2 Implementing Design for Operations
2.3 Improving the Model
2.4 Summary
Exercises
Chapter 3: Selecting a Service Platform
3.1 Level of Service Abstraction
3.2 Type of Machine
3.3 Level of Resource Sharing
3.4 Colocation
3.5 Selection Strategies
3.6 Summary
Exercises
Chapter 4: Application Architectures
4.1 Single-Machine Web Server
4.2 Three-Tier Web Service
4.3 Four-Tier Web Service
4.4 Reverse Proxy Service
4.5 Cloud-Scale Service
4.6 Message Bus Architectures
4.7 Service-Oriented Architecture
4.8 Summary
Exercises
Chapter 5: Design Patterns for Scaling
5.1 General Strategy
5.2 Scaling Up
5.3 The AKF Scaling Cube
5.4 Caching
5.5 Data Sharding
5.6 Threading
5.7 Queueing
5.8 Content Delivery Networks
5.9 Summary
Exercises
Chapter 6: Design Patterns for Resiliency
6.1 Software Resiliency Beats Hardware Reliability
6.2 Everything Malfunctions Eventually
6.3 Resiliency through Spare Capacity
6.4 Failure Domains
6.5 Software Failures
6.6 Physical Failures
6.7 Overload Failures
6.8 Human Error
6.9 Summary
Exercises
Part II: Operations: Running It
Chapter 7: Operations in a Distributed World
7.1 Distributed Systems Operations
7.2 Service Life Cycle
7.3 Organizing Strategy for Operational Teams
7.4 Virtual Office
7.5 Summary
Exercises
Chapter 8: DevOps Culture
8.1 What Is DevOps?
8.2 The Three Ways of DevOps
8.3 History of DevOps
8.4 DevOps Values and Principles
8.5 Converting to DevOps
8.6 Agile and Continuous Delivery
8.7 Summary
Exercises
Chapter 9: Service Delivery: The Build Phase
9.1 Service Delivery Strategies
9.2 The Virtuous Cycle of Quality
9.3 Build-Phase Steps
9.4 Build Console
9.5 Continuous Integration
9.6 Packages as Handoff Interface
9.7 Summary
Exercises
Chapter 10: Service Delivery: The Deployment Phase
10.1 Deployment-Phase Steps
10.2 Testing and Approval
10.3 Operations Console
10.4 Infrastructure Automation Strategies
10.5 Continuous Delivery
10.6 Infrastructure as Code
10.7 Other Platform Services
10.8 Summary
Exercises
Chapter 11: Upgrading Live Services
11.1 Taking the Service Down for Upgrading
11.2 Rolling Upgrades
11.3 Canary
11.4 Phased Roll-outs
11.5 Proportional Shedding
11.6 Blue-Green Deployment
11.7 Toggling Features
11.8 Live Schema Changes
11.9 Live Code Changes
11.10 Continuous Deployment
11.11 Dealing with Failed Code Pushes
11.12 Release Atomicity
11.13 Summary
Exercises
Chapter 12: Automation
12.1 Approaches to Automation
12.2 Tool Building versus Automation
12.3 Goals of Automation
12.4 Creating Automation
12.5 How to Automate
12.6 Language Tools
12.7 Software Engineering Tools and Techniques
12.8 Multitenant Systems
12.9 Summary
Exercises
Chapter 13: Design Documents
13.1 Design Documents Overview
13.2 Design Document Anatomy
13.3 Template
13.4 Document Archive
13.5 Review Workflows
13.6 Adopting Design Documents
13.7 Summary
Exercises
Chapter 14: Oncall
14.1 Designing Oncall
14.2 Being Oncall
14.3 Between Oncall Shifts
14.4 Periodic Review of Alerts
14.5 Being Paged Too Much
14.6 Summary
Exercises
Chapter 15: Disaster Preparedness
15.1 Mindset
15.2 Individual Training: Wheel of Misfortune
15.3 Team Training: Fire Drills
15.4 Training for Organizations: Game Day/DiRT
15.5 Incident Command System
15.6 Summary
Exercises
Chapter 16: Monitoring Fundamentals
16.1 Overview
16.2 Consumers of Monitoring Information
16.3 What to Monitor
16.4 Retention
16.5 Meta-monitoring
16.6 Logs
16.7 Summary
Exercises
Chapter 17: Monitoring Architecture and Practice
17.1 Sensing and Measurement
17.2 Collection
17.3 Analysis and Computation
17.4 Alerting and Escalation Manager
17.5 Visualization
17.6 Storage
17.7 Configuration
17.8 Summary
Exercises
Chapter 18: Capacity Planning
18.1 Standard Capacity Planning
18.2 Advanced Capacity Planning
18.3 Resource Regression
18.4 Launching New Services
18.5 Reduce Provisioning Time
18.6 Summary
Exercises
Chapter 19: Creating KPIs
19.1 What Is a KPI?
19.2 Creating KPIs
19.3 Example KPI: Machine Allocation
19.4 Case Study: Error Budget
19.5 Summary
Exercises
Chapter 20: Operational Excellence
20.1 What Does Operational Excellence Look Like?
20.2 How to Measure Greatness
20.3 Assessment Methodology
20.4 Service Assessments
20.5 Organizational Assessments
20.6 Levels of Improvement
20.7 Getting Started
20.8 Summary
Exercises
Epilogue
Part III: Appendices
Appendix A: Assessments
A.1 Regular Tasks (RT)
A.2 Emergency Response (ER)
A.3 Monitoring and Metrics (MM)
A.4 Capacity Planning (CP)
A.5 Change Management (CM)
A.6 New Product Introduction and Removal (NPI/NPR)
A.7 Service Deployment and Decommissioning (SDD)
A.8 Performance and Efficiency (PE)
A.9 Service Delivery: The Build Phase
A.10 Service Delivery: The Deployment Phase
A.11 Toil Reduction
A.12 Disaster Preparedness
Appendix B: The Origins and Future of Distributed Computing and Clouds
B.1 The Pre-Web Era (1985–1994)
B.2 The First Web Era: The Bubble (1995–2000)
B.3 The Dot-Bomb Era (2000–2003)
B.4 The Second Web Era (2003–2010)
B.5 The Cloud Computing Era (2010–present)
B.6 Conclusion
Exercises
Appendix C: Scaling Terminology and Concepts
C.1 Constant, Linear, and Exponential Scaling
C.2 Big O Notation
C.3 Limitations of Big O Notation
Appendix D: Templates and Examples
D.1 Design Document Template
D.2 Design Document Example
D.3 Sample Postmortem Template
Appendix E: Recommended Reading
Bibliography
Index