R in Action: Data analysis and graphics with R and Tidyverse. 3 Ed

Author: Kabacoff Robert I.
Release date: 2022
Publisher: Manning Publications Co.
Number of pages: 1094
File size: 7.4 MB
File type: PDF
Added by: codelibs

R in Action

Copyright

Praise for the previous edition of R in Action

brief contents

contents

Front matter

preface

acknowledgments

about this book

What's new in the third edition

Who should read this book

How this book is organized: A road map

Advice for data miners

About the code

liveBook discussion forum

about the author

about the cover illustration

Part 1. Getting started

1 Introduction to R

1.1 Why use R?

1.2 Obtaining and installing R

1.3 Working with R

1.3.1 Getting started

1.3.2 Using RStudio

1.3.3 Getting help

1.3.4 The workspace

1.3.5 Projects

1.4 Packages

1.4.1 What are packages?

1.4.2 Installing a package

1.4.3 Loading a package

1.4.4 Learning about a package

1.5 Using output as input: Reusing results

1.6 Working with large datasets

1.7 Working through an example

Summary

2 Creating a dataset

2.1 Understanding datasets

2.2 Data structures

2.2.1 Vectors

2.2.2 Matrices

2.2.3 Arrays

2.2.4 Data frames

2.2.5 Factors

2.2.6 Lists

2.2.7 Tibbles

2.3 Data input

2.3.1 Entering data from the keyboard

2.3.2 Importing data from a delimited text file

2.3.3 Importing data from Excel

2.3.4 Importing data from JSON

2.3.5 Importing data from the web

2.3.6 Importing data from SPSS

2.3.7 Importing data from SAS

2.3.8 Importing data from Stata

2.3.9 Accessing database management systems

2.3.10 Importing data via StatTransfer

2.4 Annotating datasets

2.4.1 Variable labels

2.4.2 Value labels

2.5 Useful functions for working with data objects

Summary

3 Basic data management

3.1 A working example

3.2 Creating new variables

3.3 Recoding variables

3.4 Renaming variables

3.5 Missing values

3.5.1 Recoding values to missing

3.5.2 Excluding missing values from analyses

3.6 Date values

3.6.1 Converting dates to character variables

3.6.2 Going further

3.7 Type conversions

3.8 Sorting data

3.9 Merging datasets

3.9.1 Adding columns to a data frame

3.9.2 Adding rows to a data frame

3.10 Subsetting datasets

3.10.1 Selecting variables

3.10.2 Dropping variables

3.10.3 Selecting observations

3.10.4 The subset() function

3.10.5 Random samples

3.11 Using dplyr to manipulate data frames

3.11.1 Basic dplyr functions

3.11.2 Using pipe operators to chain statements

3.12 Using SQL statements to manipulate data frames

Summary

4 Getting started with graphs

4.1 Creating a graph with ggplot2

4.1.1 ggplot

4.1.2 Geoms

4.1.3 Grouping

4.1.4 Scales

4.1.5 Facets

4.1.6 Labels

4.1.7 Themes

4.2 ggplot2 details

4.2.1 Placing the data and mapping options

4.2.2 Graphs as objects

4.2.3 Saving graphs

4.2.4 Common mistakes

Summary

5 Advanced data management

5.1 A data management challenge

5.2 Numerical and character functions

5.2.1 Mathematical functions

5.2.2 Statistical functions

5.2.3 Probability functions

5.2.4 Character functions

5.2.5 Other useful functions

5.2.6 Applying functions to matrices and data frames

5.2.7 A solution for the data management challenge

5.3 Control flow

5.3.1 Repetition and looping

5.3.2 Conditional execution

5.4 User-written functions

5.5 Reshaping data

5.5.1 Transposing

5.5.2 Converting from wide to long dataset formats

5.6 Aggregating data

Summary

Part 2. Basic methods

6 Basic graphs

6.1 Bar charts

6.1.1 Simple bar charts

6.1.2 Stacked, grouped, and filled bar charts

6.1.3 Mean bar charts

6.1.4 Tweaking bar charts

6.2 Pie charts

6.3 Tree maps

6.4 Histograms

6.5 Kernel density plots

6.6 Box plots

6.6.1 Using parallel box plots to compare groups

6.6.2 Violin plots

6.7 Dot plots

Summary

7 Basic statistics

7.1 Descriptive statistics

7.1.1 A menagerie of methods

7.1.2 Even more methods

7.1.3 Descriptive statistics by group

7.1.4 Summarizing data interactively with dplyr

7.1.5 Visualizing results

7.2 Frequency and contingency tables

7.2.1 Generating frequency tables

7.2.2 Tests of independence

7.2.3 Measures of association

7.2.4 Visualizing results

7.3 Correlations

7.3.1 Types of correlations

7.3.2 Testing correlations for significance

7.3.3 Visualizing correlations

7.4 T-tests

7.4.1 Independent t-test

7.4.2 Dependent t-test

7.4.3 When there are more than two groups

7.5 Nonparametric tests of group differences

7.5.1 Comparing two groups

7.5.2 Comparing more than two groups

7.6 Visualizing group differences

Summary

Part 3. Intermediate methods

8 Regression

8.1 The many faces of regression

8.1.1 Scenarios for using OLS regression

8.1.2 What you need to know

8.2 OLS regression

8.2.1 Fitting regression models with lm()

8.2.2 Simple linear regression

8.2.3 Polynomial regression

8.2.4 Multiple linear regression

8.2.5 Multiple linear regression with interactions

8.3 Regression diagnostics

8.3.1 A typical approach

8.3.2 An enhanced approach

8.3.3 Multicollinearity

8.4 Unusual observations

8.4.1 Outliers

8.4.2 High-leverage points

8.4.3 Influential observations

8.5 Corrective measures

8.5.1 Deleting observations

8.5.2 Transforming variables

8.5.3 Adding or deleting variables

8.5.4 Trying a different approach

8.6 Selecting the best regression model

8.6.1 Comparing models

8.6.2 Variable selection

8.7 Taking the analysis further

8.7.1 Cross-validation

8.7.2 Relative importance

Summary

9 Analysis of variance

9.1 A crash course on terminology

9.2 Fitting ANOVA models

9.2.1 The aov() function

9.2.2 The order of formula terms

9.3 One-way ANOVA

9.3.1 Multiple comparisons

9.3.2 Assessing test assumptions

9.4 One-way ANCOVA

9.4.1 Assessing test assumptions

9.4.2 Visualizing the results

9.5 Two-way factorial ANOVA

9.6 Repeated measures ANOVA

9.7 Multivariate analysis of variance (MANOVA)

9.7.1 Assessing test assumptions

9.7.2 Robust MANOVA

9.8 ANOVA as regression

Summary

10 Power analysis

10.1 A quick review of hypothesis testing

10.2 Implementing power analysis with the pwr package

10.2.1 T-tests

10.2.2 ANOVA

10.2.3 Correlations

10.2.4 Linear models

10.2.5 Tests of proportions

10.2.6 Chi-square tests

10.2.7 Choosing an appropriate effect size in novel situations

10.3 Creating power analysis plots

10.4 Other packages

Summary

11 Intermediate graphs

11.1 Scatter plots

11.1.1 Scatter plot matrices

11.1.2 High-density scatter plots

11.1.3 3D scatter plots

11.1.4 Spinning 3D scatter plots

11.1.5 Bubble plots

11.2 Line charts

11.3 Corrgrams

11.4 Mosaic plots

Summary

12 Resampling statistics and bootstrapping

12.1 Permutation tests

12.2 Permutation tests with the coin package

12.2.1 Independent two-sample and k-sample tests

12.2.2 Independence in contingency tables

12.2.3 Independence between numeric variables

12.2.4 Dependent two-sample and k-sample tests

12.2.5 Going further

12.3 Permutation tests with the lmPerm package

12.3.1 Simple and polynomial regression

12.3.2 Multiple regression

12.3.3 One-way ANOVA and ANCOVA

12.3.4 Two-way ANOVA

12.4 Additional comments on permutation tests

12.5 Bootstrapping

12.6 Bootstrapping with the boot package

12.6.1 Bootstrapping a single statistic

12.6.2 Bootstrapping several statistics

Summary

Part 4. Advanced methods

13 Generalized linear models

13.1 Generalized linear models and the glm() function

13.1.1 The glm() function

13.1.2 Supporting functions

13.1.3 Model fit and regression diagnostics

13.2 Logistic regression

13.2.1 Interpreting the model parameters

13.2.2 Assessing the impact of predictors on the probability of an outcome

13.2.3 Overdispersion

13.2.4 Extensions

13.3 Poisson regression

13.3.1 Interpreting the model parameters

13.3.2 Overdispersion

13.3.3 Extensions

Summary

14 Principal components and factor analysis

14.1 Principal components and factor analysis in R

14.2 Principal components

14.2.1 Selecting the number of components to extract

14.2.2 Extracting principal components

14.2.3 Rotating principal components

14.2.4 Obtaining principal component scores

14.3 Exploratory factor analysis

14.3.1 Deciding how many common factors to extract

14.3.2 Extracting common factors

14.3.3 Rotating factors

14.3.4 Factor scores

14.3.5 Other EFA-related packages

14.4 Other latent variable models

Summary

15 Time series

15.1 Creating a time-series object in R

15.2 Smoothing and seasonal decomposition

15.2.1 Smoothing with simple moving averages

15.2.2 Seasonal decomposition

15.3 Exponential forecasting models

15.3.1 Simple exponential smoothing

15.3.2 Holt and Holt–Winters exponential smoothing

15.3.3 The ets() function and automated forecasting

15.4 ARIMA forecasting models

15.4.1 Prerequisite concepts

15.4.2 ARMA and ARIMA models

15.4.3 Automated ARIMA forecasting

15.5 Going further

Summary

16 Cluster analysis

16.1 Common steps in cluster analysis

16.2 Calculating distances

16.3 Hierarchical cluster analysis

16.4 Partitioning-cluster analysis

16.4.1 K-means clustering

16.4.2 Partitioning around medoids

16.5 Avoiding nonexistent clusters

16.6 Going further

Summary

17 Classification

17.1 Preparing the data

17.2 Logistic regression

17.3 Decision trees

17.3.1 Classical decision trees

17.3.2 Conditional inference trees

17.4 Random forests

17.5 Support vector machines

17.5.1 Tuning an SVM

17.6 Choosing a best predictive solution

17.7 Understanding black box predictions

17.7.1 Break-down plots

17.7.2 Plotting Shapley values

17.8 Going further

Summary

18 Advanced methods for missing data

18.1 Steps in dealing with missing data

18.2 Identifying missing values

18.3 Exploring missing-values patterns

18.3.1 Visualizing missing values

18.3.2 Using correlations to explore missing values

18.4 Understanding the sources and impact of missing data

18.5 Rational approaches for dealing with incomplete data

18.6 Deleting missing data

18.6.1 Complete-case analysis (listwise deletion)

18.6.2 Available case analysis (pairwise deletion)

18.7 Single imputation

18.7.1 Simple imputation

18.7.2 K-nearest neighbor imputation

18.7.3 missForest

18.8 Multiple imputation

18.9 Other approaches to missing data

Summary

Part 5. Expanding your skills

19 Advanced graphs

19.1 Modifying scales

19.1.1 Customizing axes

19.1.2 Customizing colors

19.2 Modifying themes

19.2.1 Prepackaged themes

19.2.2 Customizing fonts

19.2.3 Customizing legends

19.2.4 Customizing the plot area

19.3 Adding annotations

19.4 Combining graphs

19.5 Making graphs interactive

Summary

20 Advanced programming

20.1 A review of the language

20.1.1 Data types

20.1.2 Control structures

20.1.3 Creating functions

20.2 Working with environments

20.3 Non-standard evaluation

20.4 Object-oriented programming

20.4.1 Generic functions

20.4.2 Limitations of the S3 model

20.5 Writing efficient code

20.5.1 Efficient data input

20.5.2 Vectorization

20.5.3 Correctly sizing objects

20.5.4 Parallelization

20.6 Debugging

20.6.1 Common sources of errors

20.6.2 Debugging tools

20.6.3 Session options that support debugging

20.6.4 Using RStudio's visual debugger

20.7 Going further

Summary

21 Creating dynamic reports

21.1 A template approach to reports

21.2 Creating a report with R and R Markdown

21.3 Creating a report with R and LaTeX

21.3.1 Creating a parameterized report

21.4 Avoiding common R Markdown problems

21.5 Going further

Summary

22 Creating a package

22.1 The edatools package

22.2 Creating a package

22.2.1 Installing development tools

22.2.2 Creating a package project

22.2.3 Writing the package functions

22.2.4 Adding function documentation

22.2.5 Adding a general help file (optional)

22.2.6 Adding sample data to the package (optional)

22.2.7 Adding a vignette (optional)

22.2.8 Editing the DESCRIPTION file

22.2.9 Building and installing the package

22.3 Sharing your package

22.3.1 Distributing a source package file

22.3.2 Submitting to CRAN

22.3.3 Hosting on GitHub

22.3.4 Creating a package website

22.4 Going further

Summary

Afterword. Into the rabbit hole

Appendix A. Graphical user interfaces

Appendix B. Customizing the startup environment

Appendix C. Exporting data from R

C.1 Delimited text file

C.2 Excel spreadsheet

C.3 Statistical applications

Appendix D. Matrix algebra in R

Appendix E. Packages used in this book

Appendix F. Working with large datasets

F.1 Efficient programming

F.2 Storing data outside of RAM

F.3 Analytic packages for out-of-memory data

F.4 Comprehensive solutions for working with enormous datasets

Appendix G. Updating an R installation

G.1 Automated installation (Windows only)

G.2 Manual installation (Windows and macOS)

G.3 Updating an R installation (Linux)

References

index

R in Action, Third Edition makes learning R quick and easy. That’s why thousands of data scientists have chosen this guide to help them master the powerful language. Far from being a dry academic tome, this book keeps every example relevant to scientific and business developers and helps you solve common data challenges. R expert Rob Kabacoff takes you on a crash course in statistics, from dealing with messy and incomplete data to creating stunning visualizations. This revised and expanded third edition contains fresh coverage of the new tidyverse approach to data analysis and R’s state-of-the-art graphing capabilities with the ggplot2 package.

About the technology

Used daily by data scientists, researchers, and quants of all types, R is the gold standard for statistical data analysis. This free and open source language includes packages for everything from advanced data visualization to deep learning. Instantly comfortable for mathematically minded users, R easily handles practical problems without forcing you to think like a software engineer.

About the book

R in Action, Third Edition teaches you how to do statistical analysis and data visualization using R and its popular tidyverse packages. In it, you’ll investigate real-world data challenges, including forecasting, data mining, and dynamic report writing. This revised third edition adds new coverage for graphing with ggplot2, along with examples for machine learning topics like clustering, classification, and time series analysis.

What's inside

  • Clean, manage, and analyze data
  • Use the ggplot2 package for graphs and visualizations
  • Techniques for debugging programs and creating packages
  • A complete learning resource for R and tidyverse

About the reader

Requires basic math and statistics. No prior experience with R needed.


Reviews:

  • A thoroughly updated, foundational guide to R that covers the full data analysis cycle: from importing and cleaning data to visualization, statistical modeling, machine learning, and creating dynamic reports. The key difference in the third edition is its extensive use of the tidyverse (dplyr, ggplot2, tidyr) and modern approaches.

    Strengths:

    • Enormous breadth of topics: from basic statistics (t-tests, ANOVA, regression) to advanced methods (bootstrapping, missing-data analysis, random forests, PCA, time series).
    • A detailed treatment of ggplot2 with customization examples (from the basics to interactive graphs via plotly).
    • Dedicated chapters on R programming (debugging, environments, efficient code) and on creating packages.
    • A modern approach to reporting: R Markdown and parameterized reports.
    • Many practical examples using real data.

    Weaknesses:

    • Not suitable for absolute beginners in statistics; requires at least one statistics course.
    • The length (1,000+ pages) makes it more of a reference book to be read selectively.

    Verdict: A desk reference for anyone who works seriously with R, from students to practicing analysts. It lets you move quickly from theory to solving applied problems.