Preface
Acknowledgements
Introduction
Case studies
Who will find this book useful?
What does this book cover?
What is not covered by this book?
I R
Installing R and RStudio
Installing R
Installing RStudio
Getting Started with R and RStudio
Why R?
The R console
Scripts
RStudio
Installing R packages
R Basics
Case study: US Gun Murders
The very basics
Exercises
Data types
Data frames
Exercises
Vectors
Coercion
Exercises
Sorting
Exercise
Vector arithmetics
Exercises
Indexing
Exercises
Basic plots
Exercises
Programming basics
Conditional expressions
Defining functions
Namespaces
For-loops
Vectorization and functionals
Exercises
The tidyverse
Tidy data
Exercises
Manipulating data frames
Exercises
The pipe: %>%
Exercises
Summarizing data
Sorting data frames
Exercises
Tibbles
The dot operator
do
The purrr package
Tidyverse conditionals
Exercises
Importing data
Paths and the working directory
The readr and readxl packages
Exercises
Downloading files
R-base importing functions
Text versus binary files
Unicode versus ASCII
Organizing Data with Spreadsheets
Exercises
II Data Visualization
Introduction to data visualization
ggplot2
The components of a graph
ggplot objects
Geometries
Aesthetic mappings
Layers
Global versus local aesthetic mappings
Scales
Labels and titles
Categories as colors
Annotation, shapes, and adjustments
Add-on packages
Putting it all together
Quick plots with qplot
Grids of plots
Exercises
Visualizing data distributions
Variable types
Case study: describing student heights
Distribution function
Cumulative distribution functions
Histograms
Smoothed density
Exercises
The normal distribution
Standard units
Quantile-quantile plots
Percentiles
Boxplots
Stratification
Case study: describing student heights (continued)
Exercises
ggplot2 geometries
Exercises
Data visualization in practice
Case study: new insights on poverty
Scatterplots
Faceting
Time series plots
Data transformations
Visualizing multimodal distributions
Comparing multiple distributions with boxplots and ridge plots
The ecological fallacy and importance of showing the data
Data visualization principles
Encoding data using visual cues
Know when to include 0
Do not distort quantities
Order categories by a meaningful value
Show the data
Ease comparisons
Think of the color blind
Plots for two variables
Encoding a third variable
Avoid pseudo-three-dimensional plots
Avoid too many significant digits
Know your audience
Exercises
Case study: impact of vaccines on battling infectious diseases
Exercises
Robust summaries
Outliers
Median
The inter quartile range (IQR)
Tukey's definition of an outlier
Median absolute deviation
Exercises
Case study: self-reported student heights
III Statistics with R
Introduction to Statistics with R
Probability
Discrete probability
Monte Carlo simulations for categorical data
Independence
Conditional probabilities
Addition and multiplication rules
Combinations and permutations
Examples
Infinity in practice
Exercises
Continuous probability
Theoretical continuous distributions
Monte Carlo simulations for continuous variables
Continuous distributions
Exercises
Random variables
Random variables
Sampling models
The probability distribution of a random variable
Distributions versus probability distributions
Notation for random variables
The expected value and standard error
Central Limit Theorem
Statistical properties of averages
Law of large numbers
Exercises
Case study: The Big Short
Exercises
Statistical Inference
Polls
Populations, samples, parameters and estimates
Exercises
Central Limit Theorem in practice
Exercises
Confidence intervals
Exercises
Power
p-values
Association Tests
Exercises
Statistical models
Poll aggregators
Data driven models
Exercises
Bayesian statistics
Bayes Theorem simulation
Hierarchical models
Exercises
Case study: Election forecasting
Exercise
The t-distribution
Regression
Case study: is height hereditary?
The correlation coefficient
Conditional expectations
The regression line
Exercises
Linear Models
Case Study: Moneyball
Confounding
Least Squared Estimates
Exercises
Linear regression in the tidyverse
Exercises
Case study: Moneyball (continued)
The regression fallacy
Measurement error models
Exercises
Association is not causation
Spurious correlation
Outliers
Reversing cause and effect
Confounders
Simpson's paradox
Exercises
IV Data Wrangling
Introduction to Data Wrangling
Reshaping data
gather
spread
separate
unite
Exercises
Joining tables
Joins
Binding
Set operators
Exercises
Web Scraping
HTML
The rvest package
CSS selectors
JSON
Exercises
String Processing
The stringr package
Case study 1: US murders data
Case study 2: self reported heights
How to escape when defining strings
Regular expressions
Search and replace with regex
Testing and improving
Trimming
Changing lettercase
Case study 2: self reported heights (continued)
String splitting
Case study 3: extracting tables from a PDF
Recoding
Exercises
Parsing Dates and Times
The date data type
The lubridate package
Exercises
Text mining
Case study: Trump tweets
Text as data
Sentiment analysis
Exercises
V Machine Learning
Introduction to Machine Learning
Notation
An example
Exercises
Evaluation Metrics
Exercises
Conditional probabilities and expectations
Exercises
Case study: is it a 2 or a 7?
Smoothing
Bin smoothing
Kernels
Local weighted regression (loess)
Connecting smoothing to machine learning
Exercises
Cross validation
Motivation with k-nearest neighbors
Mathematical description of cross validation
K-fold cross validation
Exercises
Bootstrap
Exercises
The caret package
The caret train function
Cross validation
Example: fitting with loess
Examples of algorithms
Linear regression
Exercises
Logistic regression
Exercises
k-nearest neighbors
Exercises
Generative models
Exercises
Classification and Regression Trees (CART)
Random Forests
Exercises
Machine learning in practice
Preprocessing
k-Nearest Neighbor and Random Forest
Variable importance
Visual assessments
Ensembles
Exercises
Large datasets
Matrix algebra
Exercises
Distance
Exercises
Dimension reduction
Exercises
Recommendation systems
Exercises
Regularization
Exercises
Matrix factorization
Exercises
Clustering
Hierarchical clustering
k-means
Heatmaps
Filtering features
Exercises
VI Productivity tools
Introduction to productivity tools
Accessing the terminal and installing Git
Accessing the terminal on a Mac
Installing Git on the Mac
Installing Git and Git Bash on Windows
Accessing the terminal on Windows
Organizing with Unix
Naming convention
The terminal
The filesystem
Unix commands
Some examples
More Unix commands
Preparing for a data science project
Advanced Unix
Git and GitHub
Why use Git and GitHub?
GitHub accounts
GitHub repositories
Overview of Git
Initializing a Git directory
Using Git and GitHub in RStudio
Reproducible projects with RStudio and R markdown
RStudio projects
R markdown
Organizing a data science project
Introduction to Data Science: Data Analysis and Prediction Algorithms with R introduces concepts and skills that can help you tackle real-world data analysis challenges. It covers concepts from probability, statistical inference, linear regression, and machine learning. It also helps you develop skills such as R programming, data wrangling, data visualization, predictive algorithm building, file organization with the UNIX/Linux shell, version control with Git and GitHub, and reproducible document preparation.
This book is a textbook for a first course in data science. No previous knowledge of R is necessary, although some experience with programming may be helpful. The book is divided into six parts: R, data visualization, statistics with R, data wrangling, machine learning, and productivity tools. Each part has several chapters, each meant to be presented as one lecture.
The author uses motivating case studies that realistically mimic a data scientist’s experience. He starts by asking specific questions and answers them through data analysis, so concepts are learned as a means of answering the questions. The case studies include US murder rates by state, self-reported student heights, trends in world health and economics, the impact of vaccines on infectious disease rates, the financial crisis of 2007-2008, election forecasting, building a baseball team, image processing of hand-written digits, and movie recommendation systems.
The statistical concepts used to answer the case study questions are only briefly introduced, so complementing this book with a probability and statistics textbook is highly recommended for an in-depth understanding of these concepts. If you read and understand the chapters and complete the exercises, you will be prepared to learn the more advanced concepts and skills needed to become an expert.
A complete solutions manual is available to registered instructors who require the text for a course.