Cover
Copyright
Table of Contents
Preface
Section 1. Conventions Used in This Book
Section 2. Using Code Examples
Section 3. O’Reilly Online Learning
Section 4. How to Contact Us
Section 5. Acknowledgments
In Memoriam: John D. Hunter (1968–2012)
Acknowledgments for the Third Edition (2022)
Acknowledgments for the Second Edition (2017)
Acknowledgments for the First Edition (2012)
Chapter 1. Preliminaries
1.1 What Is This Book About?
What Kinds of Data?
1.2 Why Python for Data Analysis?
Python as Glue
Solving the “Two-Language” Problem
Why Not Python?
1.3 Essential Python Libraries
NumPy
pandas
matplotlib
IPython and Jupyter
SciPy
scikit-learn
statsmodels
Other Packages
1.4 Installation and Setup
Miniconda on Windows
GNU/Linux
Miniconda on macOS
Installing Necessary Packages
Integrated Development Environments and Text Editors
1.5 Community and Conferences
1.6 Navigating This Book
Code Examples
Data for Examples
Import Conventions
Chapter 2. Python Language Basics, IPython, and Jupyter Notebooks
2.1 The Python Interpreter
2.2 IPython Basics
Running the IPython Shell
Running the Jupyter Notebook
Tab Completion
Introspection
2.3 Python Language Basics
Language Semantics
Scalar Types
Control Flow
2.4 Conclusion
Chapter 3. Built-In Data Structures, Functions, and Files
3.1 Data Structures and Sequences
Tuple
List
Dictionary
Set
Built-In Sequence Functions
List, Set, and Dictionary Comprehensions
3.2 Functions
Namespaces, Scope, and Local Functions
Returning Multiple Values
Functions Are Objects
Anonymous (Lambda) Functions
Generators
Errors and Exception Handling
3.3 Files and the Operating System
Bytes and Unicode with Files
3.4 Conclusion
Chapter 4. NumPy Basics: Arrays and Vectorized Computation
4.1 The NumPy ndarray: A Multidimensional Array Object
Creating ndarrays
Data Types for ndarrays
Arithmetic with NumPy Arrays
Basic Indexing and Slicing
Boolean Indexing
Fancy Indexing
Transposing Arrays and Swapping Axes
4.2 Pseudorandom Number Generation
4.3 Universal Functions: Fast Element-Wise Array Functions
4.4 Array-Oriented Programming with Arrays
Expressing Conditional Logic as Array Operations
Mathematical and Statistical Methods
Methods for Boolean Arrays
Sorting
Unique and Other Set Logic
4.5 File Input and Output with Arrays
4.6 Linear Algebra
4.7 Example: Random Walks
Simulating Many Random Walks at Once
4.8 Conclusion
Chapter 5. Getting Started with pandas
5.1 Introduction to pandas Data Structures
Series
DataFrame
Index Objects
5.2 Essential Functionality
Reindexing
Dropping Entries from an Axis
Indexing, Selection, and Filtering
Arithmetic and Data Alignment
Function Application and Mapping
Sorting and Ranking
Axis Indexes with Duplicate Labels
5.3 Summarizing and Computing Descriptive Statistics
Correlation and Covariance
Unique Values, Value Counts, and Membership
5.4 Conclusion
Chapter 6. Data Loading, Storage, and File Formats
6.1 Reading and Writing Data in Text Format
Reading Text Files in Pieces
Writing Data to Text Format
Working with Other Delimited Formats
JSON Data
XML and HTML: Web Scraping
6.2 Binary Data Formats
Reading Microsoft Excel Files
Using HDF5 Format
6.3 Interacting with Web APIs
6.4 Interacting with Databases
6.5 Conclusion
Chapter 7. Data Cleaning and Preparation
7.1 Handling Missing Data
Filtering Out Missing Data
Filling In Missing Data
7.2 Data Transformation
Removing Duplicates
Transforming Data Using a Function or Mapping
Replacing Values
Renaming Axis Indexes
Discretization and Binning
Detecting and Filtering Outliers
Permutation and Random Sampling
Computing Indicator/Dummy Variables
7.3 Extension Data Types
7.4 String Manipulation
Python Built-In String Object Methods
Regular Expressions
String Functions in pandas
7.5 Categorical Data
Background and Motivation
Categorical Extension Type in pandas
Computations with Categoricals
Categorical Methods
7.6 Conclusion
Chapter 8. Data Wrangling: Join, Combine, and Reshape
8.1 Hierarchical Indexing
Reordering and Sorting Levels
Summary Statistics by Level
Indexing with a DataFrame’s columns
8.2 Combining and Merging Datasets
Database-Style DataFrame Joins
Merging on Index
Concatenating Along an Axis
Combining Data with Overlap
8.3 Reshaping and Pivoting
Reshaping with Hierarchical Indexing
Pivoting “Long” to “Wide” Format
Pivoting “Wide” to “Long” Format
8.4 Conclusion
Chapter 9. Plotting and Visualization
9.1 A Brief matplotlib API Primer
Figures and Subplots
Colors, Markers, and Line Styles
Ticks, Labels, and Legends
Annotations and Drawing on a Subplot
Saving Plots to File
matplotlib Configuration
9.2 Plotting with pandas and seaborn
Line Plots
Bar Plots
Histograms and Density Plots
Scatter or Point Plots
Facet Grids and Categorical Data
9.3 Other Python Visualization Tools
9.4 Conclusion
Chapter 10. Data Aggregation and Group Operations
10.1 How to Think About Group Operations
Iterating over Groups
Selecting a Column or Subset of Columns
Grouping with Dictionaries and Series
Grouping with Functions
Grouping by Index Levels
10.2 Data Aggregation
Column-Wise and Multiple Function Application
Returning Aggregated Data Without Row Indexes
10.3 Apply: General split-apply-combine
Suppressing the Group Keys
Quantile and Bucket Analysis
Example: Filling Missing Values with Group-Specific Values
Example: Random Sampling and Permutation
Example: Group Weighted Average and Correlation
Example: Group-Wise Linear Regression
10.4 Group Transforms and “Unwrapped” GroupBys
10.5 Pivot Tables and Cross-Tabulation
Cross-Tabulations: Crosstab
10.6 Conclusion
Chapter 11. Time Series
11.1 Date and Time Data Types and Tools
Converting Between String and Datetime
11.2 Time Series Basics
Indexing, Selection, Subsetting
Time Series with Duplicate Indices
11.3 Date Ranges, Frequencies, and Shifting
Generating Date Ranges
Frequencies and Date Offsets
Shifting (Leading and Lagging) Data
11.4 Time Zone Handling
Time Zone Localization and Conversion
Operations with Time Zone-Aware Timestamp Objects
Operations Between Different Time Zones
11.5 Periods and Period Arithmetic
Period Frequency Conversion
Quarterly Period Frequencies
Converting Timestamps to Periods (and Back)
Creating a PeriodIndex from Arrays
11.6 Resampling and Frequency Conversion
Downsampling
Upsampling and Interpolation
Resampling with Periods
Grouped Time Resampling
11.7 Moving Window Functions
Exponentially Weighted Functions
Binary Moving Window Functions
User-Defined Moving Window Functions
11.8 Conclusion
Chapter 12. Introduction to Modeling Libraries in Python
12.1 Interfacing Between pandas and Model Code
12.2 Creating Model Descriptions with Patsy
Data Transformations in Patsy Formulas
Categorical Data and Patsy
12.3 Introduction to statsmodels
Estimating Linear Models
Estimating Time Series Processes
12.4 Introduction to scikit-learn
12.5 Conclusion
Chapter 13. Data Analysis Examples
13.1 Bitly Data from 1.USA.gov
Counting Time Zones in Pure Python
Counting Time Zones with pandas
13.2 MovieLens 1M Dataset
Measuring Rating Disagreement
13.3 US Baby Names 1880–2010
Analyzing Naming Trends
13.4 USDA Food Database
13.5 2012 Federal Election Commission Database
Donation Statistics by Occupation and Employer
Bucketing Donation Amounts
Donation Statistics by State
13.6 Conclusion
Appendix A. Advanced NumPy
A.1 ndarray Object Internals
NumPy Data Type Hierarchy
A.2 Advanced Array Manipulation
Reshaping Arrays
C Versus FORTRAN Order
Concatenating and Splitting Arrays
Repeating Elements: tile and repeat
Fancy Indexing Equivalents: take and put
A.3 Broadcasting
Broadcasting over Other Axes
Setting Array Values by Broadcasting
A.4 Advanced ufunc Usage
ufunc Instance Methods
Writing New ufuncs in Python
A.5 Structured and Record Arrays
Nested Data Types and Multidimensional Fields
Why Use Structured Arrays?
A.6 More About Sorting
Indirect Sorts: argsort and lexsort
Alternative Sort Algorithms
Partially Sorting Arrays
numpy.searchsorted: Finding Elements in a Sorted Array
A.7 Writing Fast NumPy Functions with Numba
Creating Custom numpy.ufunc Objects with Numba
A.8 Advanced Array Input and Output
Memory-Mapped Files
HDF5 and Other Array Storage Options
A.9 Performance Tips
The Importance of Contiguous Memory
Appendix B. More on the IPython System
B.1 Terminal Keyboard Shortcuts
B.2 About Magic Commands
The %run Command
Executing Code from the Clipboard
B.3 Using the Command History
Searching and Reusing the Command History
Input and Output Variables
B.4 Interacting with the Operating System
Shell Commands and Aliases
Directory Bookmark System
B.5 Software Development Tools
Interactive Debugger
Timing Code: %time and %timeit
Basic Profiling: %prun and %run -p
Profiling a Function Line by Line
B.6 Tips for Productive Code Development Using IPython
Reloading Module Dependencies
Code Design Tips
B.7 Advanced IPython Features
Profiles and Configuration
B.8 Conclusion
Index
About the Author
Colophon
Get the definitive handbook for manipulating, processing, cleaning, and crunching datasets in Python. Updated for Python 3.10 and pandas 1.4, the third edition of this hands-on guide is packed with practical case studies that show you how to solve a broad set of data analysis problems effectively. Along the way, you'll learn to use the latest versions of pandas, NumPy, and Jupyter.
Written by Wes McKinney, the creator of the Python pandas project, this book is a practical, modern introduction to data science tools in Python. It's ideal for analysts new to Python and for Python programmers new to data science and scientific computing. Data files and related material are available on GitHub.