Published by Addison-Wesley (September 1, 2014) © 2015

Thomas Limoncelli | Strata Chalup | Christina Hogan
    ISBN-13: 9780133478532

    The Practice of Cloud System Administration: DevOps and SRE Practices for Web Services, Volume 2, 1st edition

    Language: English

    “There’s an incredible amount of depth and thinking in the practices described here, and it’s impressive to see it all in one place.”

    —Win Treese, coauthor of Designing Systems for Internet Commerce

     

    The Practice of Cloud System Administration, Volume 2, focuses on “distributed” or “cloud” computing and brings a DevOps/SRE sensibility to the practice of system administration. Unsatisfied with books that cover either design or operations in isolation, the authors created this authoritative reference centered on a comprehensive approach.

     

    Case studies and examples from Google, Etsy, Twitter, Facebook, Netflix, Amazon, and other industry giants are explained in practical ways that are useful to all enterprises. The new companion to the best-selling first volume, The Practice of System and Network Administration, Second Edition, this guide offers expert coverage of the following and many other crucial topics:

     

    Designing and building modern web and distributed systems

    • Fundamentals of large system design
    • Understand the new software engineering implications of cloud administration
    • Make systems that are resilient to failure and grow and scale dynamically
    • Implement DevOps principles and cultural changes
    • IaaS/PaaS/SaaS and virtual platform selection

    Operating and running systems using the latest DevOps/SRE strategies

    • Upgrade production systems with zero down-time
    • What and how to automate; how to decide what not to automate
    • On-call best practices that improve uptime
    • Why distributed systems require fundamentally different system administration techniques
    • Identify and resolve resiliency problems before they surprise you

    Assessing and evaluating your team’s operational effectiveness

    • Manage the scientific process of continuous improvement
    • A forty-page, pain-free assessment system you can start using today

     

    Preface

    About the Authors

     

    Introduction

     

    Part I: Design: Building It

     

    Chapter 1: Designing in a Distributed World

    1.1 Visibility at Scale

    1.2 The Importance of Simplicity

    1.3 Composition

    1.4 Distributed State

    1.5 The CAP Principle

    1.6 Loosely Coupled Systems

    1.7 Speed

    1.8 Summary

    Exercises

     

    Chapter 2: Designing for Operations

    2.1 Operational Requirements

    2.2 Implementing Design for Operations

    2.3 Improving the Model

    2.4 Summary

    Exercises

     

    Chapter 3: Selecting a Service Platform

    3.1 Level of Service Abstraction

    3.2 Type of Machine

    3.3 Level of Resource Sharing

    3.4 Colocation

    3.5 Selection Strategies

    3.6 Summary

    Exercises

     

    Chapter 4: Application Architectures

    4.1 Single-Machine Web Server

    4.2 Three-Tier Web Service

    4.3 Four-Tier Web Service

    4.4 Reverse Proxy Service

    4.5 Cloud-Scale Service

    4.6 Message Bus Architectures

    4.7 Service-Oriented Architecture

    4.8 Summary

    Exercises

     

    Chapter 5: Design Patterns for Scaling

    5.1 General Strategy

    5.2 Scaling Up

    5.3 The AKF Scaling Cube

    5.4 Caching

    5.5 Data Sharding

    5.6 Threading

    5.7 Queueing

    5.8 Content Delivery Networks

    5.9 Summary

    Exercises

     

    Chapter 6: Design Patterns for Resiliency

    6.1 Software Resiliency Beats Hardware Reliability

    6.2 Everything Malfunctions Eventually

    6.3 Resiliency through Spare Capacity

    6.4 Failure Domains

    6.5 Software Failures

    6.6 Physical Failures

    6.7 Overload Failures

    6.8 Human Error

    6.9 Summary

    Exercises

     

    Part II: Operations: Running It

     

    Chapter 7: Operations in a Distributed World

    7.1 Distributed Systems Operations

    7.2 Service Life Cycle

    7.3 Organizing Strategy for Operational Teams

    7.4 Virtual Office

    7.5 Summary

    Exercises

     

    Chapter 8: DevOps Culture

    8.1 What Is DevOps?

    8.2 The Three Ways of DevOps

    8.3 History of DevOps

    8.4 DevOps Values and Principles

    8.5 Converting to DevOps

    8.6 Agile and Continuous Delivery

    8.7 Summary

    Exercises

     

    Chapter 9: Service Delivery: The Build Phase

    9.1 Service Delivery Strategies

    9.2 The Virtuous Cycle of Quality

    9.3 Build-Phase Steps

    9.4 Build Console

    9.5 Continuous Integration

    9.6 Packages as Handoff Interface

    9.7 Summary

    Exercises

     

    Chapter 10: Service Delivery: The Deployment Phase

    10.1 Deployment-Phase Steps

    10.2 Testing and Approval

    10.3 Operations Console

    10.4 Infrastructure Automation Strategies

    10.5 Continuous Delivery

    10.6 Infrastructure as Code

    10.7 Other Platform Services

    10.8 Summary

    Exercises

     

    Chapter 11: Upgrading Live Services

    11.1 Taking the Service Down for Upgrading

    11.2 Rolling Upgrades

    11.3 Canary

    11.4 Phased Roll-outs

    11.5 Proportional Shedding

    11.6 Blue-Green Deployment

    11.7 Toggling Features

    11.8 Live Schema Changes

    11.9 Live Code Changes

    11.10 Continuous Deployment

    11.11 Dealing with Failed Code Pushes

    11.12 Release Atomicity

    11.13 Summary

    Exercises

     

    Chapter 12: Automation

    12.1 Approaches to Automation

    12.2 Tool Building versus Automation

    12.3 Goals of Automation

    12.4 Creating Automation

    12.5 How to Automate

    12.6 Language Tools

    12.7 Software Engineering Tools and Techniques

    12.8 Multitenant Systems

    12.9 Summary

    Exercises

     

    Chapter 13: Design Documents

    13.1 Design Documents Overview

    13.2 Design Document Anatomy

    13.3 Template

    13.4 Document Archive

    13.5 Review Workflows

    13.6 Adopting Design Documents

    13.7 Summary

    Exercises

     

    Chapter 14: Oncall

    14.1 Designing Oncall

    14.2 Being Oncall

    14.3 Between Oncall Shifts

    14.4 Periodic Review of Alerts

    14.5 Being Paged Too Much

    14.6 Summary

    Exercises

     

    Chapter 15: Disaster Preparedness

    15.1 Mindset

    15.2 Individual Training: Wheel of Misfortune

    15.3 Team Training: Fire Drills

    15.4 Training for Organizations: Game Day/DiRT

    15.5 Incident Command System

    15.6 Summary

    Exercises

     

    Chapter 16: Monitoring Fundamentals

    16.1 Overview

    16.2 Consumers of Monitoring Information

    16.3 What to Monitor

    16.4 Retention

    16.5 Meta-monitoring

    16.6 Logs

    16.7 Summary

    Exercises

     

    Chapter 17: Monitoring Architecture and Practice

    17.1 Sensing and Measurement

    17.2 Collection

    17.3 Analysis and Computation

    17.4 Alerting and Escalation Manager

    17.5 Visualization

    17.6 Storage

    17.7 Configuration

    17.8 Summary

    Exercises

     

    Chapter 18: Capacity Planning

    18.1 Standard Capacity Planning

    18.2 Advanced Capacity Planning

    18.3 Resource Regression

    18.4 Launching New Services

    18.5 Reduce Provisioning Time

    18.6 Summary

    Exercises

     

    Chapter 19: Creating KPIs

    19.1 What Is a KPI?

    19.2 Creating KPIs

    19.3 Example KPI: Machine Allocation

    19.4 Case Study: Error Budget

    19.5 Summary

    Exercises

     

    Chapter 20: Operational Excellence

    20.1 What Does Operational Excellence Look Like?

    20.2 How to Measure Greatness

    20.3 Assessment Methodology

    20.4 Service Assessments

    20.5 Organizational Assessments

    20.6 Levels of Improvement

    20.7 Getting Started

    20.8 Summary

    Exercises

     

    Epilogue

     

    Part III: Appendices

     

    Appendix A: Assessments

    A.1 Regular Tasks (RT)

    A.2 Emergency Response (ER)

    A.3 Monitoring and Metrics (MM)

    A.4 Capacity Planning (CP)

    A.5 Change Management (CM)

    A.6 New Product Introduction and Removal (NPI/NPR)

    A.7 Service Deployment and Decommissioning (SDD)

    A.8 Performance and Efficiency (PE)

    A.9 Service Delivery: The Build Phase

    A.10 Service Delivery: The Deployment Phase

    A.11 Toil Reduction

    A.12 Disaster Preparedness

     

    Appendix B: The Origins and Future of Distributed Computing and Clouds

    B.1 The Pre-Web Era (1985–1994)

    B.2 The First Web Era: The Bubble (1995–2000)

    B.3 The Dot-Bomb Era (2000–2003)

    B.4 The Second Web Era (2003–2010)

    B.5 The Cloud Computing Era (2010–present)

    B.6 Conclusion

    Exercises

     

    Appendix C: Scaling Terminology and Concepts

    C.1 Constant, Linear, and Exponential Scaling

    C.2 Big O Notation

    C.3 Limitations of Big O Notation

     

    Appendix D: Templates and Examples

    D.1 Design Document Template

    D.2 Design Document Example

    D.3 Sample Postmortem Template

     

    Appendix E: Recommended Reading

     

    Bibliography

    Index