Published by Addison-Wesley Professional (February 20, 2023) © 2023

Daniel Chen
    VitalSource eTextbook (Lifetime access)
    €37,99
    ISBN-13: 9780137891054

    Pandas for Everyone: Python Data Analysis, 2nd edition

    Language: English

    Manage and Automate Data Analysis with Pandas in Python

    Today, analysts must manage data characterized by extraordinary variety, velocity, and volume. With the open source Pandas library, you can use Python to rapidly automate and perform virtually any data analysis task, no matter how large or complex. Pandas can help you ensure the veracity of your data, visualize it for effective decision-making, and reliably reproduce analyses across multiple data sets.

    Pandas for Everyone, 2nd Edition, brings together practical knowledge and insight for solving real problems with Pandas, even if you’re new to Python data analysis. Daniel Y. Chen introduces key concepts through simple but practical examples, incrementally building on them to solve more difficult, real-world data science problems, such as using regularization to prevent overfitting or deciding when to use unsupervised machine learning methods to find the underlying structure in a data set.

    New features in the second edition include:

    • Extended coverage of plotting and the seaborn data visualization library
    • Expanded examples and resources
    • Updated Python 3.9 code and package coverage, including the statsmodels and scikit-learn libraries
    • Online bonus material on geopandas, Dask, and creating interactive graphics with Altair


    Chen gives you a jumpstart on using Pandas with a realistic data set and covers combining data sets, handling missing data, and structuring data sets for easier analysis and visualization. He demonstrates powerful data cleaning techniques, from basic string manipulation to applying functions simultaneously across dataframes.

    Once your data is ready, Chen guides you through fitting models for prediction, clustering, inference, and exploration. He provides tips on performance and scalability and introduces you to the wider Python data analysis ecosystem. 

    • Work with DataFrames and Series, and import or export data
    • Create plots with matplotlib, seaborn, and pandas
    • Combine data sets and handle missing data
    • Reshape, tidy, and clean data sets so they’re easier to work with
    • Convert data types and manipulate text strings
    • Apply functions to scale data manipulations
    • Aggregate, transform, and filter large data sets with groupby
    • Leverage Pandas’ advanced date and time capabilities
    • Fit linear models using the statsmodels and scikit-learn libraries
    • Use generalized linear modeling to fit models with different response variables
    • Compare multiple models to select the “best” one
    • Regularize to overcome overfitting and improve performance
    • Use clustering in unsupervised machine learning

    Foreword by Anne M. Brown     xxiii

    Foreword by Jared Lander     xxv

    Preface     xxvii

    Changes in the Second Edition     xxxix

     

    Part I: Introduction    1

    Chapter 1. Pandas DataFrame Basics     3

           Learning Objectives      3

           1.1 Introduction      3

           1.2 Load Your First Data Set      4

           1.3 Look at Columns, Rows, and Cells      6

           1.4 Grouped and Aggregated Calculations      23

           1.5 Basic Plot      27

           Conclusion      28

     

    Chapter 2. Pandas Data Structures Basics      31

           Learning Objectives      31

           2.1 Create Your Own Data      31

           2.2 The Series      33

           2.3 The DataFrame      42

           2.4 Making Changes to Series and DataFrames      45

           2.5 Exporting and Importing Data      52

           Conclusion      63

     

    Chapter 3. Plotting Basics      65

           Learning Objectives      65

           3.1 Why Visualize Data?       65

           3.2 Matplotlib Basics      66

           3.3 Statistical Graphics Using matplotlib      72

           3.4 Seaborn      78

           3.5 Pandas Plotting Method      111

           Conclusion      115

     

    Chapter 4. Tidy Data      117

           Learning Objectives      117

           Note About This Chapter       117

           4.1 Columns Contain Values, Not Variables      118

           4.2 Columns Contain Multiple Variables      122

           4.3 Variables in Both Rows and Columns      126

           Conclusion      129

     

    Chapter 5. Apply Functions      131

           Learning Objectives      131

           Note About This Chapter      131

           5.1 Primer on Functions      131

           5.2 Apply (Basics)       133

           5.3 Vectorized Functions      138

           5.4 Lambda Functions (Anonymous Functions)       141

           Conclusion      142

     

    Part II: Data Processing     143

    Chapter 6. Data Assembly      145

           Learning Objectives      145

           6.1 Combine Data Sets      145

           6.2 Concatenation      146

           6.3 Observational Units Across Multiple Tables      154

           6.4 Merge Multiple Data Sets      160

           Conclusion      167

     

    Chapter 7. Data Normalization      169

           Learning Objectives      169

           7.1 Multiple Observational Units in a Table (Normalization)     169

           Conclusion      173

     

    Chapter 8. Groupby Operations: Split-Apply-Combine      175

           Learning Objectives      175

           8.1 Aggregate      176

           8.2 Transform      184

           8.3 Filter      188

           8.4 The pandas.core.groupby.DataFrameGroupBy object      190

           8.5 Working with a MultiIndex      195

           Conclusion      199

     

    Part III: Data Types    203

    Chapter 9. Missing Data      203

           Learning Objectives      203

           9.1 What Is a NaN Value?       203

           9.2 Where Do Missing Values Come From?       205

           9.3 Working with Missing Data      210

           9.4 Pandas Built-In NA Missing      216

           Conclusion      218

     

    Chapter 10. Data Types      219

           Learning Objectives      219

           10.1 Data Types      219

           10.2 Converting Types      220

           10.3 Categorical Data      225

           Conclusion      227

     

    Chapter 11. Strings and Text Data      229

           Introduction      229

           Learning Objectives      229

           11.1 Strings      229

           11.2 String Methods      233

           11.3 More String Methods      234

           11.4 String Formatting (F-Strings)       236

           11.5 Regular Expressions (RegEx)      239

           11.6 The regex Library      247

           Conclusion      247

     

    Chapter 12. Dates and Times      249

           Learning Objectives      249

           12.1 Python's datetime Object      249

           12.2 Converting to datetime      250

           12.3 Loading Data That Include Dates      253

           12.4 Extracting Date Components      254

           12.5 Date Calculations and Timedeltas      257

           12.6 Datetime Methods      259

           12.7 Getting Stock Data      261

           12.8 Subsetting Data Based on Dates      263

           12.9 Date Ranges      266

           12.10 Shifting Values      270

           12.11 Resampling      276

           12.12 Time Zones      278

           12.13 Arrow for Better Dates and Times      280

           Conclusion      280

     

    Part IV: Data Modeling    281

    Chapter 13. Linear Regression (Continuous Outcome Variable)      283

           13.1 Simple Linear Regression      283

           13.2 Multiple Regression      287

           13.3 Models with Categorical Variables      289

           13.4 One-Hot Encoding in scikit-learn with Transformer Pipelines      294

           Conclusion      296

     

    Chapter 14. Generalized Linear Models      297

           About This Chapter      297

           14.1 Logistic Regression (Binary Outcome Variable)       297

           14.2 Poisson Regression (Count Outcome Variable)       304

           14.3 More Generalized Linear Models      308

           Conclusion      309

     

    Chapter 15. Survival Analysis      311

           15.1 Survival Data      311

           15.2 Kaplan Meier Curves      312

           15.3 Cox Proportional Hazard Model      314

           Conclusion      317

     

    Chapter 16. Model Diagnostics      319

           16.1 Residuals      319

           16.2 Comparing Multiple Models      324

           16.3 k-Fold Cross-Validation      329

           Conclusion      334

     

    Chapter 17. Regularization      335

           17.1 Why Regularize?       335

           17.2 LASSO Regression      337

           17.3 Ridge Regression      338

           17.4 Elastic Net      340

           17.5 Cross-Validation      341

           Conclusion      343

     

    Chapter 18. Clustering      345

           18.1 k-Means      345

           18.2 Hierarchical Clustering      351

           Conclusion     356

     

    Part V: Conclusion    357

    Chapter 19. Life Outside of Pandas      359

           19.1 The (Scientific) Computing Stack      359

           19.2 Performance      360

           19.3 Dask      360

           19.4 Siuba      360

           19.5 Ibis      361

           19.6 Polars      361

           19.7 PyJanitor      361

           19.8 Pandera      361

           19.9 Machine Learning      361

           19.10 Publishing      362

           19.11 Dashboards      362

           Conclusion      362

     

    Chapter 20. It's Dangerous To Go Alone!      363

           20.1 Local Meetups      363

           20.2 Conferences      363

           20.3 The Carpentries      364

           20.4 Podcasts      364

           20.5 Other Resources      365

           Conclusion      365

     

    Appendices      367

    A. Concept Maps      369
    B. Installation and Setup      373
    C. Command Line      377
    D. Project Templates      379
    E. Using Python      381
    F. Working Directories      383
    G. Environments      385
    H. Install Packages      389
    I. Importing Libraries      391
    J. Code Style      393
    K. Containers: Lists, Tuples, and Dictionaries      395
    L. Slice Values      399
    M. Loops      401
    N. Comprehensions      403
    O. Functions      405
    P. Ranges and Generators      409
    Q. Multiple Assignment      413
    R. Numpy ndarray      415
    S. Classes      417
    T. SettingWithCopyWarning      419
    U. Method Chaining      423
    V. Timing Code      427
    W. String Formatting      429
    X. Conditionals (if-elif-else)      433
    Y. New York ACS Logistic Regression Example      435
    Z. Replicating Results in R      443

    Index      451