Introduction to Data Science: Data Analysis and Prediction Algorithms with R

Author: Rafael A. Irizarry
Published: 2019
Publisher: CRC Press, an imprint of Taylor & Francis Group, LLC
Pages: 708
File size: 18.2 MB
File type: PDF

Preface

Acknowledgements

Introduction

Case studies

Who will find this book useful?

What does this book cover?

What is not covered by this book?

I R

Installing R and RStudio

Installing R

Installing RStudio

Getting Started with R and RStudio

Why R?

The R console

Scripts

RStudio

Installing R packages

R Basics

Case study: US Gun Murders

The very basics

Exercises

Data types

Data frames

Exercises

Vectors

Coercion

Exercises

Sorting

Exercise

Vector arithmetics

Exercises

Indexing

Exercises

Basic plots

Exercises

Programming basics

Conditional expressions

Defining functions

Namespaces

For-loops

Vectorization and functionals

Exercises

The tidyverse

Tidy data

Exercises

Manipulating data frames

Exercises

The pipe: %>%

Exercises

Summarizing data

Sorting data frames

Exercises

Tibbles

The dot operator

do

The purrr package

Tidyverse conditionals

Exercises

Importing data

Paths and the working directory

The readr and readxl packages

Exercises

Downloading files

R-base importing functions

Text versus binary files

Unicode versus ASCII

Organizing Data with Spreadsheets

Exercises

II Data Visualization

Introduction to data visualization

ggplot2

The components of a graph

ggplot objects

Geometries

Aesthetic mappings

Layers

Global versus local aesthetic mappings

Scales

Labels and titles

Categories as colors

Annotation, shapes, and adjustments

Add-on packages

Putting it all together

Quick plots with qplot

Grids of plots

Exercises

Visualizing data distributions

Variable types

Case study: describing student heights

Distribution function

Cumulative distribution functions

Histograms

Smoothed density

Exercises

The normal distribution

Standard units

Quantile-quantile plots

Percentiles

Boxplots

Stratification

Case study: describing student heights (continued)

Exercises

ggplot2 geometries

Exercises

Data visualization in practice

Case study: new insights on poverty

Scatterplots

Faceting

Time series plots

Data transformations

Visualizing multimodal distributions

Comparing multiple distributions with boxplots and ridge plots

The ecological fallacy and importance of showing the data

Data visualization principles

Encoding data using visual cues

Know when to include 0

Do not distort quantities

Order categories by a meaningful value

Show the data

Ease comparisons

Think of the color blind

Plots for two variables

Encoding a third variable

Avoid pseudo-three-dimensional plots

Avoid too many significant digits

Know your audience

Exercises

Case study: impact of vaccines on battling infectious diseases

Exercises

Robust summaries

Outliers

Median

The inter quartile range (IQR)

Tukey's definition of an outlier

Median absolute deviation

Exercises

Case study: self-reported student heights

III Statistics with R

Introduction to Statistics with R

Probability

Discrete probability

Monte Carlo simulations for categorical data

Independence

Conditional probabilities

Addition and multiplication rules

Combinations and permutations

Examples

Infinity in practice

Exercises

Continuous probability

Theoretical continuous distributions

Monte Carlo simulations for continuous variables

Continuous distributions

Exercises

Random variables

Random variables

Sampling models

The probability distribution of a random variable

Distributions versus probability distributions

Notation for random variables

The expected value and standard error

Central Limit Theorem

Statistical properties of averages

Law of large numbers

Exercises

Case study: The Big Short

Exercises

Statistical Inference

Polls

Populations, samples, parameters and estimates

Exercises

Central Limit Theorem in practice

Exercises

Confidence intervals

Exercises

Power

p-values

Association Tests

Exercises

Statistical models

Poll aggregators

Data driven models

Exercises

Bayesian statistics

Bayes Theorem simulation

Hierarchical models

Exercises

Case study: Election forecasting

Exercise

The t-distribution

Regression

Case study: is height hereditary?

The correlation coefficient

Conditional expectations

The regression line

Exercises

Linear Models

Case Study: Moneyball

Confounding

Least Squares Estimates

Exercises

Linear regression in the tidyverse

Exercises

Case study: Moneyball (continued)

The regression fallacy

Measurement error models

Exercises

Association is not causation

Spurious correlation

Outliers

Reversing cause and effect

Confounders

Simpson's paradox

Exercises

IV Data Wrangling

Introduction to Data Wrangling

Reshaping data

gather

spread

separate

unite

Exercises

Joining tables

Joins

Binding

Set operators

Exercises

Web Scraping

HTML

The rvest package

CSS selectors

JSON

Exercises

String Processing

The stringr package

Case study 1: US murders data

Case study 2: self reported heights

How to escape when defining strings

Regular expressions

Search and replace with regex

Testing and improving

Trimming

Changing lettercase

Case study 2: self reported heights (continued)

String splitting

Case study 3: extracting tables from a PDF

Recoding

Exercises

Parsing Dates and Times

The date data type

The lubridate package

Exercises

Text mining

Case study: Trump tweets

Text as data

Sentiment analysis

Exercises

V Machine Learning

Introduction to Machine Learning

Notation

An example

Exercises

Evaluation Metrics

Exercises

Conditional probabilities and expectations

Exercises

Case study: is it a 2 or a 7?

Smoothing

Bin smoothing

Kernels

Local weighted regression (loess)

Connecting smoothing to machine learning

Exercises

Cross validation

Motivation with k-nearest neighbors

Mathematical description of cross validation

K-fold cross validation

Exercises

Bootstrap

Exercises

The caret package

The caret train function

Cross validation

Example: fitting with loess

Examples of algorithms

Linear regression

Exercises

Logistic regression

Exercises

k-nearest neighbors

Exercises

Generative models

Exercises

Classification and Regression Trees (CART)

Random Forests

Exercises

Machine learning in practice

Preprocessing

k-Nearest Neighbor and Random Forest

Variable importance

Visual assessments

Ensembles

Exercises

Large datasets

Matrix algebra

Exercises

Distance

Exercises

Dimension reduction

Exercises

Recommendation systems

Exercises

Regularization

Exercises

Matrix factorization

Exercises

Clustering

Hierarchical clustering

k-means

Heatmaps

Filtering features

Exercises

VI Productivity tools

Introduction to productivity tools

Accessing the terminal and installing Git

Accessing the terminal on a Mac

Installing Git on the Mac

Installing Git and Git Bash on Windows

Accessing the terminal on Windows

Organizing with Unix

Naming convention

The terminal

The filesystem

Unix commands

Some examples

More Unix commands

Preparing for a data science project

Advanced Unix

Git and GitHub

Why use Git and GitHub?

GitHub accounts

GitHub repositories

Overview of Git

Initializing a Git directory

Using Git and GitHub in RStudio

Reproducible projects with RStudio and R markdown

RStudio projects

R markdown

Organizing a data science project

Introduction to Data Science: Data Analysis and Prediction Algorithms with R introduces concepts and skills that can help you tackle real-world data analysis challenges. It covers concepts from probability, statistical inference, linear regression, and machine learning. It also helps you develop skills such as R programming, data wrangling, data visualization, predictive algorithm building, file organization with UNIX/Linux shell, version control with Git and GitHub, and reproducible document preparation.

This book is a textbook for a first course in data science. No previous knowledge of R is necessary, although some experience with programming may be helpful. The book is divided into six parts: R, data visualization, statistics with R, data wrangling, machine learning, and productivity tools. Each part has several chapters, each meant to be presented as one lecture.

The author uses motivating case studies that realistically mimic a data scientist’s experience. He starts by asking specific questions and answers them through data analysis, so that concepts are learned as a means of answering the questions. The case studies include: US murder rates by state, self-reported student heights, trends in world health and economics, the impact of vaccines on infectious disease rates, the financial crisis of 2007-2008, election forecasting, building a baseball team, image processing of hand-written digits, and movie recommendation systems.

The statistical concepts used to answer the case study questions are only briefly introduced, so complementing with a probability and statistics textbook is highly recommended for in-depth understanding of these concepts. If you read and understand the chapters and complete the exercises, you will be prepared to learn the more advanced concepts and skills needed to become an expert.

A complete solutions manual is available to registered instructors who require the text for a course.