Minimalist Data Wrangling with Python

Author: Gagolewski Marek
Release date: 2022
Publisher: Independent publishing
Number of pages: 437
File size: 2.8 MB
File type: PDF
Added by: codelibs

Preface

The art of data wrangling

Aims, scope, and design philosophy

We need maths

We need some computing environment

We need data and domain knowledge

Structure

The Rules

About the author

Acknowledgements

You can make this book better

I Introducing Python

Getting started with Python

Installing Python

Working with Jupyter notebooks

Launching JupyterLab

First notebook

More cells

Edit vs command mode

Markdown cells

The best note-taking app

Initialising each session and getting example data

Exercises

Scalar types and control structures in Python

Scalar types

Logical values

Numeric values

Arithmetic operators

Creating named variables

Character strings

F-strings (formatted string literals)

Calling built-in functions

Positional and keyword arguments

Modules and packages

Slots and methods

Controlling program flow

Relational and logical operators

The if statement

The while loop

Defining functions

Exercises

Sequential and other types in Python

Sequential types

Lists

Tuples

Ranges

Strings (again)

Working with sequences

Extracting elements

Slicing

Modifying elements of mutable sequences

Searching for specific elements

Arithmetic operators

Dictionaries

Iterable types

The for loop

Tuple assignment

Argument unpacking (*)

Variadic arguments: *args and **kwargs (*)

Object references and copying (*)

Copying references

Pass by assignment

Object copies

Modify in place or return a modified copy?

Further reading

Exercises

II Unidimensional data

Unidimensional numeric data and their empirical distribution

Creating vectors in numpy

Enumerating elements

Arithmetic progressions

Repeating values

numpy.r_ (*)

Generating pseudorandom variates

Loading data from files

Some mathematical notation

Inspecting the data distribution with histograms

heights: A bell-shaped distribution

income: A right-skewed distribution

How many bins?

peds: A bimodal distribution (already binned)

matura: A bell-shaped distribution (almost)

marathon (truncated – fastest runners): A left-skewed distribution

Log-scale and heavy-tailed distributions

Cumulative probabilities and the empirical cumulative distribution function

Exercises

Processing unidimensional data

Aggregating numeric data

Measures of location

Arithmetic mean and median

Sensitive to outliers vs robust

Sample quantiles

Measures of dispersion

Standard deviation (and variance)

Interquartile range

Measures of shape

Box (and whisker) plots

Other aggregation methods (*)

Vectorised mathematical functions

Logarithms and exponential functions

Trigonometric functions

Arithmetic operators

Vector-scalar case

Application: Feature scaling

Standardisation and z-scores

Min-max scaling and clipping

Normalisation (l2; dividing by magnitude)

Normalisation (l1; dividing by sum)

Vector-vector case

Indexing vectors

Integer indexing

Logical indexing

Slicing

Other operations

Cumulative sums and iterated differences

Sorting

Dealing with tied observations

Determining the ordering permutation and ranking

Searching for certain indexes (argmin, argmax)

Dealing with round-off and measurement errors

Vectorising scalar operations with list comprehensions

Exercises

Continuous probability distributions

Normal distribution

Estimating parameters

Data models are useful

Assessing goodness-of-fit

Comparing cumulative distribution functions

Comparing quantiles

Kolmogorov–Smirnov test (*)

Other noteworthy distributions

Log-normal distribution

Pareto distribution

Uniform distribution

Distribution mixtures (*)

Generating pseudorandom numbers

Uniform distribution

Not exactly random

Sampling from other distributions

Natural variability

Adding jitter (white noise)

Independence assumption

Further reading

Exercises

III Multidimensional data

From uni- to multidimensional numeric data

Creating matrices

Reading CSV files

Enumerating elements

Repeating arrays

Stacking arrays

Other functions

Reshaping matrices

Mathematical notation

Transpose

Row and column vectors

Identity and other diagonal matrices

Visualising multidimensional data

2D Data

3D data and beyond

Scatter plot matrix (pairs plot)

Exercises

Processing multidimensional data

Extending vectorised operations to matrices

Vectorised mathematical functions

Componentwise aggregation

Arithmetic, logical, and relational operations

Matrix vs scalar

Matrix vs matrix

Matrix vs any vector

Row vector vs column vector (*)

Other row and column transforms (*)

Indexing matrices

Slice-based indexing

Scalar-based indexing

Mixed logical/integer vector and scalar/slice indexers

Two vectors as indexers (*)

Views of existing arrays (*)

Adding and modifying rows and columns

Matrix multiplication, dot products, and Euclidean norm (*)

Pairwise distances and related methods (*)

Euclidean metric (*)

Centroids (*)

Multidimensional dispersion and other aggregates (**)

Fixed-radius and k-nearest neighbour search (**)

Spatial search with K-d trees (**)

Exercises

Exploring relationships between variables

Measuring correlation

Pearson linear correlation coefficient

Perfect linear correlation

Strong linear correlation

No linear correlation does not imply independence

False linear correlations

Correlation is not causation

Correlation heat map

Linear correlation coefficients on transformed data

Spearman rank correlation coefficient

Regression tasks (*)

K-nearest neighbour regression (*)

From data to (linear) models (*)

Least squares method (*)

Analysis of residuals (*)

Multiple regression (*)

Variable transformation and linearisable models (**)

Descriptive vs predictive power (**)

Fitting regression models with scikit-learn (*)

Ill-conditioned model matrices (**)

Finding interesting combinations of variables (*)

Dot products, angles, collinearity, and orthogonality (*)

Geometric transformations of points (*)

Matrix inverse (*)

Singular value decomposition (*)

Dimensionality reduction with SVD (*)

Principal component analysis (*)

Further reading

Exercises

IV Heterogeneous data

Introducing data frames

Creating data frames

Data frames are matrix-like

Series

Index

Aggregating data frames

Transforming data frames

Indexing Series objects

Do not use [...] directly (in the current version of pandas)

loc[...]

iloc[...]

Logical indexing

Indexing data frames

loc[...] and iloc[...]

Adding rows and columns

Modifying items

Pseudorandom sampling and splitting

Hierarchical indexes (*)

Further operations on data frames

Sorting

Stacking and unstacking (long/tall and wide forms)

Joining (merging)

Set-theoretic operations and removing duplicates

…and (too) many more

Exercises

Handling categorical data

Representing and generating categorical data

Encoding and decoding factors

Binary data as logical and probability vectors

One-hot encoding (*)

Binning numeric data (revisited)

Generating pseudorandom labels

Frequency distributions

Counting

Two-way contingency tables: Factor combinations

Combinations of even more factors

Visualising factors

Bar plots

Political marketing and statistics

.....292

Pareto charts (*)

Heat maps

Aggregating and comparing factors

Mode

Binary data as logical vectors

Pearson chi-squared test (*)

Two-sample Pearson chi-squared test (*)

Measuring association (*)

Binned numeric data

Ordinal data (*)

Exercises

Processing data in groups

Basic methods

Aggregating data in groups

Transforming data in groups

Manual splitting into subgroups (*)

Plotting data in groups

Series of box plots

Series of bar plots

Semitransparent histograms

Scatter plots with group information

Grid (trellis) plots

Kolmogorov–Smirnov test for comparing ECDFs (*)

Comparing quantiles

Classification tasks (*)

K-nearest neighbour classification (*)

Assessing prediction quality (*)

Splitting into training and test sets (*)

Validating many models (parameter selection) (**)

Clustering tasks (*)

K-means method (*)

Solving k-means is hard (*)

Lloyd algorithm (*)

Local minima (*)

Random restarts (*)

Further reading

Exercises

Accessing databases

Example database

Exporting data to a database

Exercises on SQL vs pandas

Filtering

Ordering

Removing duplicates

Grouping and aggregating

Joining

Solutions to exercises

Closing the database connection

Common data serialisation formats for the Web

Working with many files

File paths

File search

Exception handling

File connections (*)

Further reading

Exercises

V Other data types

Text data

Basic string operations

Unicode as the universal encoding

Normalising strings

Substring searching and replacing

Locale-aware services in ICU (*)

String operations in pandas

String operations in numpy (*)

Working with string lists

Formatted outputs for reproducible report generation

Formatting strings

str and repr

Aligning strings

Direct Markdown output in Jupyter

Manual Markdown file output (*)

Regular expressions (*)

Regex matching with re (*)

Regex matching with pandas (*)

Matching individual characters (*)

Matching anything (almost) (*)

Defining character sets (*)

Complementing sets (*)

Defining code point ranges (*)

Using predefined character sets (*)

Alternating and grouping subexpressions (*)

Alternation operator (*)

Grouping subexpressions (*)

Non-grouping parentheses (*)

Quantifiers (*)

Capture groups and references thereto (**)

Extracting capture group matches (**)

Replacing with capture group matches (**)

Back-referencing (**)

Anchoring (*)

Matching at the beginning or end of a string (*)

Matching at word boundaries (*)

Looking behind and ahead (**)

Exercises

Missing, censored, and questionable data

Missing data

Representing and detecting missing values

Computing with missing values

Missing at random or not?

Discarding missing values

Mean imputation

Imputation by classification and regression (*)

Censored and interval data (*)

Incorrect data

Outliers

The 3/2 IQR rule for normally-distributed data

Unidimensional density estimation (*)

Multidimensional density estimation (*)

Exercises

Time series

Temporal ordering and line charts

Working with date-times and time-deltas

Representation: The UNIX epoch

Time differences

Date-times in data frames

Basic operations

Iterated differences and cumulative sums revisited

Smoothing with moving averages

Detecting trends and seasonal patterns

Imputing missing values

Plotting multidimensional time series

Candlestick plots (*)

Further reading

Exercises

Changelog

References

Minimalist Data Wrangling with Python is envisaged as a student's first introduction to data science, providing a high-level overview as well as discussing key concepts in detail. We explore methods for cleaning data gathered from different sources, transforming, selecting, and extracting features, performing exploratory data analysis and dimensionality reduction, identifying naturally occurring data clusters, modelling patterns in data, comparing data between groups, and reporting the results.