Preface
What Is Data Science?
Who Is This Book For?
Why Python?
Outline of the Book
Installation Considerations
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
I. Jupyter: Beyond Normal Python
1. Getting Started in IPython and Jupyter
Launching the IPython Shell
Launching the Jupyter Notebook
Help and Documentation in IPython
Accessing Documentation with ?
Accessing Source Code with ??
Exploring Modules with Tab Completion
Keyboard Shortcuts in the IPython Shell
Navigation Shortcuts
Text Entry Shortcuts
Command History Shortcuts
Miscellaneous Shortcuts
2. Enhanced Interactive Features
IPython Magic Commands
Running External Code: %run
Timing Code Execution: %timeit
Help on Magic Functions: ?, %magic, and %lsmagic
Input and Output History
IPython’s In and Out Objects
Underscore Shortcuts and Previous Outputs
Suppressing Output
Related Magic Commands
IPython and Shell Commands
Quick Introduction to the Shell
Shell Commands in IPython
Passing Values to and from the Shell
Shell-Related Magic Commands
3. Debugging and Profiling
Errors and Debugging
Controlling Exceptions: %xmode
Debugging: When Reading Tracebacks Is Not Enough
Profiling and Timing Code
Timing Code Snippets: %timeit and %time
Profiling Full Scripts: %prun
Line-by-Line Profiling with %lprun
Profiling Memory Use: %memit and %mprun
More IPython Resources
Web Resources
Books
II. Introduction to NumPy
4. Understanding Data Types in Python
A Python Integer Is More Than Just an Integer
A Python List Is More Than Just a List
Fixed-Type Arrays in Python
Creating Arrays from Python Lists
Creating Arrays from Scratch
NumPy Standard Data Types
5. The Basics of NumPy Arrays
NumPy Array Attributes
Array Indexing: Accessing Single Elements
Array Slicing: Accessing Subarrays
One-Dimensional Subarrays
Multidimensional Subarrays
Subarrays as No-Copy Views
Creating Copies of Arrays
Reshaping of Arrays
Array Concatenation and Splitting
Concatenation of Arrays
Splitting of Arrays
6. Computation on NumPy Arrays: Universal Functions
The Slowness of Loops
Introducing Ufuncs
Exploring NumPy’s Ufuncs
Array Arithmetic
Absolute Value
Trigonometric Functions
Exponents and Logarithms
Specialized Ufuncs
Advanced Ufunc Features
Specifying Output
Aggregations
Outer Products
Ufuncs: Learning More
7. Aggregations: min, max, and Everything in Between
Summing the Values in an Array
Minimum and Maximum
Multidimensional Aggregates
Other Aggregation Functions
Example: What Is the Average Height of US Presidents?
8. Computation on Arrays: Broadcasting
Introducing Broadcasting
Rules of Broadcasting
Broadcasting Example 1
Broadcasting Example 2
Broadcasting Example 3
Broadcasting in Practice
Centering an Array
Plotting a Two-Dimensional Function
9. Comparisons, Masks, and Boolean Logic
Example: Counting Rainy Days
Comparison Operators as Ufuncs
Working with Boolean Arrays
Counting Entries
Boolean Operators
Boolean Arrays as Masks
Using the Keywords and/or Versus the Operators &/|
10. Fancy Indexing
Exploring Fancy Indexing
Combined Indexing
Example: Selecting Random Points
Modifying Values with Fancy Indexing
Example: Binning Data
11. Sorting Arrays
Fast Sorting in NumPy: np.sort and np.argsort
Sorting Along Rows or Columns
Partial Sorts: Partitioning
Example: k-Nearest Neighbors
12. Structured Data: NumPy’s Structured Arrays
Exploring Structured Array Creation
More Advanced Compound Types
Record Arrays: Structured Arrays with a Twist
On to Pandas
III. Data Manipulation with Pandas
13. Introducing Pandas Objects
The Pandas Series Object
Series as Generalized NumPy Array
Series as Specialized Dictionary
Constructing Series Objects
The Pandas DataFrame Object
DataFrame as Generalized NumPy Array
DataFrame as Specialized Dictionary
Constructing DataFrame Objects
The Pandas Index Object
Index as Immutable Array
Index as Ordered Set
14. Data Indexing and Selection
Data Selection in Series
Series as Dictionary
Series as One-Dimensional Array
Indexers: loc and iloc
Data Selection in DataFrames
DataFrame as Dictionary
DataFrame as Two-Dimensional Array
Additional Indexing Conventions
15. Operating on Data in Pandas
Ufuncs: Index Preservation
Ufuncs: Index Alignment
Index Alignment in Series
Index Alignment in DataFrames
Ufuncs: Operations Between DataFrames and Series
16. Handling Missing Data
Trade-offs in Missing Data Conventions
Missing Data in Pandas
None as a Sentinel Value
NaN: Missing Numerical Data
NaN and None in Pandas
Pandas Nullable Dtypes
Operating on Null Values
Detecting Null Values
Dropping Null Values
Filling Null Values
17. Hierarchical Indexing
A Multiply Indexed Series
The Bad Way
The Better Way: The Pandas MultiIndex
MultiIndex as Extra Dimension
Methods of MultiIndex Creation
Explicit MultiIndex Constructors
MultiIndex Level Names
MultiIndex for Columns
Indexing and Slicing a MultiIndex
Multiply Indexed Series
Multiply Indexed DataFrames
Rearranging Multi-Indexes
Sorted and Unsorted Indices
Stacking and Unstacking Indices
Index Setting and Resetting
18. Combining Datasets: concat and append
Recall: Concatenation of NumPy Arrays
Simple Concatenation with pd.concat
Duplicate Indices
Concatenation with Joins
The append Method
19. Combining Datasets: merge and join
Relational Algebra
Categories of Joins
One-to-One Joins
Many-to-One Joins
Many-to-Many Joins
Specification of the Merge Key
The on Keyword
The left_on and right_on Keywords
The left_index and right_index Keywords
Specifying Set Arithmetic for Joins
Overlapping Column Names: The suffixes Keyword
Example: US States Data
20. Aggregation and Grouping
Planets Data
Simple Aggregation in Pandas
groupby: Split, Apply, Combine
Split, Apply, Combine
The GroupBy Object
Aggregate, Filter, Transform, Apply
Specifying the Split Key
Grouping Example
21. Pivot Tables
Motivating Pivot Tables
Pivot Tables by Hand
Pivot Table Syntax
Multilevel Pivot Tables
Additional Pivot Table Options
Example: Birthrate Data
22. Vectorized String Operations
Introducing Pandas String Operations
Tables of Pandas String Methods
Methods Similar to Python String Methods
Methods Using Regular Expressions
Miscellaneous Methods
Example: Recipe Database
A Simple Recipe Recommender
Going Further with Recipes
23. Working with Time Series
Dates and Times in Python
Native Python Dates and Times: datetime and dateutil
Typed Arrays of Times: NumPy’s datetime64
Dates and Times in Pandas: The Best of Both Worlds
Pandas Time Series: Indexing by Time
Pandas Time Series Data Structures
Regular Sequences: pd.date_range
Frequencies and Offsets
Resampling, Shifting, and Windowing
Resampling and Converting Frequencies
Time Shifts
Rolling Windows
Example: Visualizing Seattle Bicycle Counts
Visualizing the Data
Digging into the Data
24. High-Performance Pandas: eval and query
Motivating query and eval: Compound Expressions
pandas.eval for Efficient Operations
DataFrame.eval for Column-Wise Operations
Assignment in DataFrame.eval
Local Variables in DataFrame.eval
The DataFrame.query Method
Performance: When to Use These Functions
Further Resources
IV. Visualization with Matplotlib
25. General Matplotlib Tips
Importing Matplotlib
Setting Styles
show or No show? How to Display Your Plots
Plotting from a Script
Plotting from an IPython Shell
Plotting from a Jupyter Notebook
Saving Figures to File
Two Interfaces for the Price of One
26. Simple Line Plots
Adjusting the Plot: Line Colors and Styles
Adjusting the Plot: Axes Limits
Labeling Plots
Matplotlib Gotchas
27. Simple Scatter Plots
Scatter Plots with plt.plot
Scatter Plots with plt.scatter
plot Versus scatter: A Note on Efficiency
Visualizing Uncertainties
Basic Errorbars
Continuous Errors
28. Density and Contour Plots
Visualizing a Three-Dimensional Function
Histograms, Binnings, and Density
Two-Dimensional Histograms and Binnings
plt.hist2d: Two-Dimensional Histogram
plt.hexbin: Hexagonal Binnings
Kernel Density Estimation
29. Customizing Plot Legends
Choosing Elements for the Legend
Legend for Size of Points
Multiple Legends
30. Customizing Colorbars
Customizing Colorbars
Choosing the Colormap
Color Limits and Extensions
Discrete Colorbars
Example: Handwritten Digits
31. Multiple Subplots
plt.axes: Subplots by Hand
plt.subplot: Simple Grids of Subplots
plt.subplots: The Whole Grid in One Go
plt.GridSpec: More Complicated Arrangements
32. Text and Annotation
Example: Effect of Holidays on US Births
Transforms and Text Position
Arrows and Annotation
33. Customizing Ticks
Major and Minor Ticks
Hiding Ticks or Labels
Reducing or Increasing the Number of Ticks
Fancy Tick Formats
Summary of Formatters and Locators
34. Customizing Matplotlib: Configurations and Stylesheets
Plot Customization by Hand
Changing the Defaults: rcParams
Stylesheets
Default Style
FiveThirtyEight Style
ggplot Style
Bayesian Methods for Hackers Style
Dark Background Style
Grayscale Style
Seaborn Style
35. Three-Dimensional Plotting in Matplotlib
Three-Dimensional Points and Lines
Three-Dimensional Contour Plots
Wireframes and Surface Plots
Surface Triangulations
Example: Visualizing a Möbius Strip
36. Visualization with Seaborn
Exploring Seaborn Plots
Histograms, KDE, and Densities
Pair Plots
Faceted Histograms
Categorical Plots
Joint Distributions
Bar Plots
Example: Exploring Marathon Finishing Times
Further Resources
Other Python Visualization Libraries
V. Machine Learning
37. What Is Machine Learning?
Categories of Machine Learning
Qualitative Examples of Machine Learning Applications
Classification: Predicting Discrete Labels
Regression: Predicting Continuous Labels
Clustering: Inferring Labels on Unlabeled Data
Dimensionality Reduction: Inferring Structure of Unlabeled Data
Summary
38. Introducing Scikit-Learn
Data Representation in Scikit-Learn
The Features Matrix
The Target Array
The Estimator API
Basics of the API
Supervised Learning Example: Simple Linear Regression
Supervised Learning Example: Iris Classification
Unsupervised Learning Example: Iris Dimensionality
Unsupervised Learning Example: Iris Clustering
Application: Exploring Handwritten Digits
Loading and Visualizing the Digits Data
Unsupervised Learning Example: Dimensionality Reduction
Classification on Digits
Summary
39. Hyperparameters and Model Validation
Thinking About Model Validation
Model Validation the Wrong Way
Model Validation the Right Way: Holdout Sets
Model Validation via Cross-Validation
Selecting the Best Model
The Bias-Variance Trade-off
Validation Curves in Scikit-Learn
Learning Curves
Validation in Practice: Grid Search
Summary
40. Feature Engineering
Categorical Features
Text Features
Image Features
Derived Features
Imputation of Missing Data
Feature Pipelines
41. In Depth: Naive Bayes Classification
Bayesian Classification
Gaussian Naive Bayes
Multinomial Naive Bayes
Example: Classifying Text
When to Use Naive Bayes
42. In Depth: Linear Regression
Simple Linear Regression
Basis Function Regression
Polynomial Basis Functions
Gaussian Basis Functions
Regularization
Ridge Regression (L2 Regularization)
Lasso Regression (L1 Regularization)
Example: Predicting Bicycle Traffic
43. In Depth: Support Vector Machines
Motivating Support Vector Machines
Support Vector Machines: Maximizing the Margin
Fitting a Support Vector Machine
Beyond Linear Boundaries: Kernel SVM
Tuning the SVM: Softening Margins
Example: Face Recognition
Summary
44. In Depth: Decision Trees and Random Forests
Motivating Random Forests: Decision Trees
Creating a Decision Tree
Decision Trees and Overfitting
Ensembles of Estimators: Random Forests
Random Forest Regression
Example: Random Forest for Classifying Digits
Summary
45. In Depth: Principal Component Analysis
Introducing Principal Component Analysis
PCA as Dimensionality Reduction
PCA for Visualization: Handwritten Digits
What Do the Components Mean?
Choosing the Number of Components
PCA as Noise Filtering
Example: Eigenfaces
Summary
46. In Depth: Manifold Learning
Manifold Learning: “HELLO”
Multidimensional Scaling
MDS as Manifold Learning
Nonlinear Embeddings: Where MDS Fails
Nonlinear Manifolds: Locally Linear Embedding
Some Thoughts on Manifold Methods
Example: Isomap on Faces
Example: Visualizing Structure in Digits
47. In Depth: k-Means Clustering
Introducing k-Means
Expectation–Maximization
Examples
Example 1: k-Means on Digits
Example 2: k-Means for Color Compression
48. In Depth: Gaussian Mixture Models
Motivating Gaussian Mixtures: Weaknesses of k-Means
Generalizing E–M: Gaussian Mixture Models
Choosing the Covariance Type
Gaussian Mixture Models as Density Estimation
Example: GMMs for Generating New Data
49. In Depth: Kernel Density Estimation
Motivating Kernel Density Estimation: Histograms
Kernel Density Estimation in Practice
Selecting the Bandwidth via Cross-Validation
Example: Not-so-Naive Bayes
Anatomy of a Custom Estimator
Using Our Custom Estimator
50. Application: A Face Detection Pipeline
HOG Features
HOG in Action: A Simple Face Detector
1. Obtain a Set of Positive Training Samples
2. Obtain a Set of Negative Training Samples
3. Combine Sets and Extract HOG Features
4. Train a Support Vector Machine
5. Find Faces in a New Image
Caveats and Improvements
Further Machine Learning Resources
Index
About the Author
Python is a first-class tool for many researchers, primarily because of its libraries for storing, manipulating, and gaining insight from data. Several resources exist for individual pieces of this data science stack, but only with the new edition of Python Data Science Handbook do you get them all: IPython, NumPy, pandas, Matplotlib, scikit-learn, and other related tools.
Working scientists and data crunchers familiar with reading and writing Python code will find the second edition of this comprehensive desk reference ideal for tackling day-to-day issues: manipulating, transforming, and cleaning data; visualizing different types of data; and using data to build statistical or machine learning models. Quite simply, this is the must-have reference for scientific computing in Python.