Preface
The art of data wrangling
Aims, scope, and design philosophy
We need maths
We need some computing environment
We need data and domain knowledge
Structure
The Rules
About the author
Acknowledgements
You can make this book better
I Introducing Python
Getting started with Python
Installing Python
Working with Jupyter notebooks
Launching JupyterLab
First notebook
More cells
Edit vs command mode
Markdown cells
The best note-taking app
Initialising each session and getting example data
Exercises
Scalar types and control structures in Python
Scalar types
Logical values
Numeric values
Arithmetic operators
Creating named variables
Character strings
F-strings (formatted string literals)
Calling built-in functions
Positional and keyword arguments
Modules and packages
Slots and methods
Controlling program flow
Relational and logical operators
The if statement
The while loop
Defining functions
Exercises
Sequential and other types in Python
Sequential types
Lists
Tuples
Ranges
Strings (again)
Working with sequences
Extracting elements
Slicing
Modifying elements of mutable sequences
Searching for specific elements
Arithmetic operators
Dictionaries
Iterable types
The for loop
Tuple assignment
Argument unpacking (*)
Variadic arguments: *args and **kwargs (*)
Object references and copying (*)
Copying references
Pass by assignment
Object copies
Modify in place or return a modified copy?
Further reading
Exercises
II Unidimensional data
Unidimensional numeric data and their empirical distribution
Creating vectors in numpy
Enumerating elements
Arithmetic progressions
Repeating values
numpy.r_ (*)
Generating pseudorandom variates
Loading data from files
Some mathematical notation
Inspecting the data distribution with histograms
heights: A bell-shaped distribution
income: A right-skewed distribution
How many bins?
peds: A bimodal distribution (already binned)
matura: A bell-shaped distribution (almost)
marathon (truncated – fastest runners): A left-skewed distribution
Log-scale and heavy-tailed distributions
Cumulative probabilities and the empirical cumulative distribution function
Exercises
Processing unidimensional data
Aggregating numeric data
Measures of location
Arithmetic mean and median
Sensitive to outliers vs robust
Sample quantiles
Measures of dispersion
Standard deviation (and variance)
Interquartile range
Measures of shape
Box (and whisker) plots
Other aggregation methods (*)
Vectorised mathematical functions
Logarithms and exponential functions
Trigonometric functions
Arithmetic operators
Vector-scalar case
Application: Feature scaling
Standardisation and z-scores
Min-max scaling and clipping
Normalisation (l2; dividing by magnitude)
Normalisation (l1; dividing by sum)
Vector-vector case
Indexing vectors
Integer indexing
Logical indexing
Slicing
Other operations
Cumulative sums and iterated differences
Sorting
Dealing with tied observations
Determining the ordering permutation and ranking
Searching for certain indexes (argmin, argmax)
Dealing with round-off and measurement errors
Vectorising scalar operations with list comprehensions
Exercises
Continuous probability distributions
Normal distribution
Estimating parameters
Data models are useful
Assessing goodness-of-fit
Comparing cumulative distribution functions
Comparing quantiles
Kolmogorov–Smirnov test (*)
Other noteworthy distributions
Log-normal distribution
Pareto distribution
Uniform distribution
Distribution mixtures (*)
Generating pseudorandom numbers
Uniform distribution
Not exactly random
Sampling from other distributions
Natural variability
Adding jitter (white noise)
Independence assumption
Further reading
Exercises
III Multidimensional data
From uni- to multidimensional numeric data
Creating matrices
Reading CSV files
Enumerating elements
Repeating arrays
Stacking arrays
Other functions
Reshaping matrices
Mathematical notation
Transpose
Row and column vectors
Identity and other diagonal matrices
Visualising multidimensional data
2D Data
3D data and beyond
Scatter plot matrix (pairs plot)
Exercises
Processing multidimensional data
Extending vectorised operations to matrices
Vectorised mathematical functions
Componentwise aggregation
Arithmetic, logical, and relational operations
Matrix vs scalar
Matrix vs matrix
Matrix vs any vector
Row vector vs column vector (*)
Other row and column transforms (*)
Indexing matrices
Slice-based indexing
Scalar-based indexing
Mixed logical/integer vector and scalar/slice indexers
Two vectors as indexers (*)
Views of existing arrays (*)
Adding and modifying rows and columns
Matrix multiplication, dot products, and Euclidean norm (*)
Pairwise distances and related methods (*)
Euclidean metric (*)
Centroids (*)
Multidimensional dispersion and other aggregates (**)
Fixed-radius and k-nearest neighbour search (**)
Spatial search with K-d trees (**)
Exercises
Exploring relationships between variables
Measuring correlation
Pearson linear correlation coefficient
Perfect linear correlation
Strong linear correlation
No linear correlation does not imply independence
False linear correlations
Correlation is not causation
Correlation heat map
Linear correlation coefficients on transformed data
Spearman rank correlation coefficient
Regression tasks (*)
K-nearest neighbour regression (*)
From data to (linear) models (*)
Least squares method (*)
Analysis of residuals (*)
Multiple regression (*)
Variable transformation and linearisable models (**)
Descriptive vs predictive power (**)
Fitting regression models with scikit-learn (*)
Ill-conditioned model matrices (**)
Finding interesting combinations of variables (*)
Dot products, angles, collinearity, and orthogonality (*)
Geometric transformations of points (*)
Matrix inverse (*)
Singular value decomposition (*)
Dimensionality reduction with SVD (*)
Principal component analysis (*)
Further reading
Exercises
IV Heterogeneous data
Introducing data frames
Creating data frames
Data frames are matrix-like
Series
Index
Aggregating data frames
Transforming data frames
Indexing Series objects
Do not use [...] directly (in the current version of pandas)
loc[...]
iloc[...]
Logical indexing
Indexing data frames
loc[...] and iloc[...]
Adding rows and columns
Modifying items
Pseudorandom sampling and splitting
Hierarchical indexes (*)
Further operations on data frames
Sorting
Stacking and unstacking (long/tall and wide forms)
Joining (merging)
Set-theoretic operations and removing duplicates
…and (too) many more
Exercises
Handling categorical data
Representing and generating categorical data
Encoding and decoding factors
Binary data as logical and probability vectors
One-hot encoding (*)
Binning numeric data (revisited)
Generating pseudorandom labels
Frequency distributions
Counting
Two-way contingency tables: Factor combinations
Combinations of even more factors
Visualising factors
Bar plots
Political marketing and statistics
.
Pareto charts (*)
Heat maps
Aggregating and comparing factors
Mode
Binary data as logical vectors
Pearson chi-squared test (*)
Two-sample Pearson chi-squared test (*)
Measuring association (*)
Binned numeric data
Ordinal data (*)
Exercises
Processing data in groups
Basic methods
Aggregating data in groups
Transforming data in groups
Manual splitting into subgroups (*)
Plotting data in groups
Series of box plots
Series of bar plots
Semitransparent histograms
Scatter plots with group information
Grid (trellis) plots
Kolmogorov–Smirnov test for comparing ECDFs (*)
Comparing quantiles
Classification tasks (*)
K-nearest neighbour classification (*)
Assessing prediction quality (*)
Splitting into training and test sets (*)
Validating many models (parameter selection) (**)
Clustering tasks (*)
K-means method (*)
Solving k-means is hard (*)
Lloyd algorithm (*)
Local minima (*)
Random restarts (*)
Further reading
Exercises
Accessing databases
Example database
Exporting data to a database
Exercises on SQL vs pandas
Filtering
Ordering
Removing duplicates
Grouping and aggregating
Joining
Solutions to exercises
Closing the database connection
Common data serialisation formats for the Web
Working with many files
File paths
File search
Exception handling
File connections (*)
Further reading
Exercises
V Other data types
Text data
Basic string operations
Unicode as the universal encoding
Normalising strings
Substring searching and replacing
Locale-aware services in ICU (*)
String operations in pandas
String operations in numpy (*)
Working with string lists
Formatted outputs for reproducible report generation
Formatting strings
str and repr
Aligning strings
Direct Markdown output in Jupyter
Manual Markdown file output (*)
Regular expressions (*)
Regex matching with re (*)
Regex matching with pandas (*)
Matching individual characters (*)
Matching anything (almost) (*)
Defining character sets (*)
Complementing sets (*)
Defining code point ranges (*)
Using predefined character sets (*)
Alternating and grouping subexpressions (*)
Alternation operator (*)
Grouping subexpressions (*)
Non-grouping parentheses (*)
Quantifiers (*)
Capture groups and references thereto (**)
Extracting capture group matches (**)
Replacing with capture group matches (**)
Back-referencing (**)
Anchoring (*)
Matching at the beginning or end of a string (*)
Matching at word boundaries (*)
Looking behind and ahead (**)
Exercises
Missing, censored, and questionable data
Missing data
Representing and detecting missing values
Computing with missing values
Missing at random or not?
Discarding missing values
Mean imputation
Imputation by classification and regression (*)
Censored and interval data (*)
Incorrect data
Outliers
The 3/2 IQR rule for normally-distributed data
Unidimensional density estimation (*)
Multidimensional density estimation (*)
Exercises
Time series
Temporal ordering and line charts
Working with date-times and time-deltas
Representation: The UNIX epoch
Time differences
Date-times in data frames
Basic operations
Iterated differences and cumulative sums revisited
Smoothing with moving averages
Detecting trends and seasonal patterns
Imputing missing values
Plotting multidimensional time series
Candlestick plots (*)
Further reading
Exercises
Changelog
References
Minimalist Data Wrangling with Python is envisaged as a student's first introduction to data science, providing a high-level overview as well as discussing key concepts in detail. We explore methods for cleaning data gathered from different sources, transforming, selecting, and extracting features, performing exploratory data analysis and dimensionality reduction, identifying naturally occurring data clusters, modelling patterns in data, comparing data between groups, and reporting the results.