Cover....1
Copyright....3
Contributors....4
Table of Contents....6
Preface....18
Chapter 1: Introducing Machine Learning....24
The origins of machine learning....25
Uses and abuses of machine learning....28
Machine learning successes....30
The limits of machine learning....31
Machine learning ethics....33
How machines learn....37
Data storage....39
Abstraction....39
Generalization....43
Evaluation....45
Machine learning in practice....46
Types of input data....47
Types of machine learning algorithms....49
Matching input data to algorithms....54
Machine learning with R....55
Installing R packages....56
Loading and unloading R packages....57
Installing RStudio....57
Why R and why R now?....59
Summary....61
Chapter 2: Managing and Understanding Data....62
R data structures....63
Vectors....63
Factors....66
Lists....67
Data frames....70
Matrices and arrays....74
Managing data with R....75
Saving, loading, and removing R data structures....75
Importing and saving datasets from CSV files....77
Importing common dataset formats using RStudio....79
Exploring and understanding data....82
Exploring the structure of data....83
Exploring numeric features....84
Measuring the central tendency – mean and median....85
Measuring spread – quartiles and the five-number summary....87
Visualizing numeric features – boxplots....89
Visualizing numeric features – histograms....91
Understanding numeric data – uniform and normal distributions....93
Measuring spread – variance and standard deviation....94
Exploring categorical features....96
Measuring the central tendency – the mode....97
Exploring relationships between features....99
Visualizing relationships – scatterplots....99
Examining relationships – two-way cross-tabulations....101
Summary....105
Chapter 3: Lazy Learning – Classification Using Nearest Neighbors....106
Understanding nearest neighbor classification....107
The k-NN algorithm....107
Measuring similarity with distance....111
Choosing an appropriate k....113
Preparing data for use with k-NN....114
Why is the k-NN algorithm lazy?....117
Example – diagnosing breast cancer with the k-NN algorithm....118
Step 1 – collecting data....119
Step 2 – exploring and preparing the data....119
Transformation – normalizing numeric data....121
Data preparation – creating training and test datasets....123
Step 3 – training a model on the data....124
Step 4 – evaluating model performance....126
Step 5 – improving model performance....127
Transformation – z-score standardization....127
Testing alternative values of k....129
Summary....130
Chapter 4: Probabilistic Learning – Classification Using Naive Bayes....132
Understanding Naive Bayes....133
Basic concepts of Bayesian methods....133
Understanding probability....134
Understanding joint probability....135
Computing conditional probability with Bayes’ theorem....137
The Naive Bayes algorithm....140
Classification with Naive Bayes....141
The Laplace estimator....143
Using numeric features with Naive Bayes....145
Example – filtering mobile phone spam with the Naive Bayes algorithm....146
Step 1 – collecting data....147
Step 2 – exploring and preparing the data....148
Data preparation – cleaning and standardizing text data....149
Data preparation – splitting text documents into words....155
Data preparation – creating training and test datasets....158
Visualizing text data – word clouds....159
Data preparation – creating indicator features for frequent words....162
Step 3 – training a model on the data....164
Step 4 – evaluating model performance....166
Step 5 – improving model performance....167
Summary....168
Chapter 5: Divide and Conquer – Classification Using Decision Trees and Rules....170
Understanding decision trees....171
Divide and conquer....172
The C5.0 decision tree algorithm....176
Choosing the best split....177
Pruning the decision tree....180
Example – identifying risky bank loans using C5.0 decision trees....181
Step 1 – collecting data....182
Step 2 – exploring and preparing the data....182
Data preparation – creating random training and test datasets....184
Step 3 – training a model on the data....186
Step 4 – evaluating model performance....192
Step 5 – improving model performance....193
Boosting the accuracy of decision trees....193
Making some mistakes cost more than others....196
Understanding classification rules....198
Separate and conquer....199
The 1R algorithm....201
The RIPPER algorithm....203
Rules from decision trees....205
What makes trees and rules greedy?....206
Example – identifying poisonous mushrooms with rule learners....208
Step 1 – collecting data....209
Step 2 – exploring and preparing the data....209
Step 3 – training a model on the data....210
Step 4 – evaluating model performance....212
Step 5 – improving model performance....213
Summary....217
Chapter 6: Forecasting Numeric Data – Regression Methods....220
Understanding regression....221
Simple linear regression....223
Ordinary least squares estimation....226
Correlations....228
Multiple linear regression....230
Generalized linear models and logistic regression....235
Example – predicting auto insurance claims costs using linear regression....241
Step 1 – collecting data....242
Step 2 – exploring and preparing the data....243
Exploring relationships between features – the correlation matrix....246
Visualizing relationships between features – the scatterplot matrix....247
Step 3 – training a model on the data....250
Step 4 – evaluating model performance....253
Step 5 – improving model performance....255
Model specification – adding nonlinear relationships....255
Model specification – adding interaction effects....256
Putting it all together – an improved regression model....256
Making predictions with a regression model....258
Going further – predicting insurance policyholder churn with logistic regression....261
Understanding regression trees and model trees....268
Adding regression to trees....269
Example – estimating the quality of wines with regression trees and model trees....271
Step 1 – collecting data....272
Step 2 – exploring and preparing the data....273
Step 3 – training a model on the data....275
Visualizing decision trees....278
Step 4 – evaluating model performance....280
Measuring performance with the mean absolute error....280
Step 5 – improving model performance....282
Summary....285
Chapter 7: Black-Box Methods – Neural Networks and Support Vector Machines....288
Understanding neural networks....289
From biological to artificial neurons....290
Activation functions....292
Network topology....296
The number of layers....296
The direction of information travel....298
The number of nodes in each layer....300
Training neural networks with backpropagation....301
Example – modeling the strength of concrete with ANNs....304
Step 1 – collecting data....304
Step 2 – exploring and preparing the data....305
Step 3 – training a model on the data....307
Step 4 – evaluating model performance....310
Step 5 – improving model performance....311
Understanding support vector machines....317
Classification with hyperplanes....318
The case of linearly separable data....320
The case of nonlinearly separable data....322
Using kernels for nonlinear spaces....323
Example – performing OCR with SVMs....325
Step 1 – collecting data....326
Step 2 – exploring and preparing the data....327
Step 3 – training a model on the data....328
Step 4 – evaluating model performance....331
Step 5 – improving model performance....333
Changing the SVM kernel function....333
Identifying the best SVM cost parameter....334
Summary....336
Chapter 8: Finding Patterns – Market Basket Analysis Using Association Rules....338
Understanding association rules....339
The Apriori algorithm for association rule learning....340
Measuring rule interest – support and confidence....342
Building a set of rules with the Apriori principle....343
Example – identifying frequently purchased groceries with association rules....344
Step 1 – collecting data....345
Step 2 – exploring and preparing the data....346
Data preparation – creating a sparse matrix for transaction data....347
Visualizing item support – item frequency plots....351
Visualizing the transaction data – plotting the sparse matrix....353
Step 3 – training a model on the data....354
Step 4 – evaluating model performance....358
Step 5 – improving model performance....362
Sorting the set of association rules....363
Taking subsets of association rules....364
Saving association rules to a file or data frame....365
Using the Eclat algorithm for greater efficiency....366
Summary....368
Chapter 9: Finding Groups of Data – Clustering with k-means....370
Understanding clustering....371
Clustering as a machine learning task....371
Clusters of clustering algorithms....374
The k-means clustering algorithm....379
Using distance to assign and update clusters....380
Choosing the appropriate number of clusters....385
Finding teen market segments using k-means clustering....387
Step 1 – collecting data....387
Step 2 – exploring and preparing the data....388
Data preparation – dummy coding missing values....390
Data preparation – imputing the missing values....391
Step 3 – training a model on the data....393
Step 4 – evaluating model performance....396
Step 5 – improving model performance....400
Summary....402
Chapter 10: Evaluating Model Performance....404
Measuring performance for classification....405
Understanding a classifier’s predictions....406
A closer look at confusion matrices....410
Using confusion matrices to measure performance....412
Beyond accuracy – other measures of performance....414
The kappa statistic....416
The Matthews correlation coefficient....420
Sensitivity and specificity....423
Precision and recall....424
The F-measure....426
Visualizing performance tradeoffs with ROC curves....427
Comparing ROC curves....432
The area under the ROC curve....435
Creating ROC curves and computing AUC in R....436
Estimating future performance....439
The holdout method....440
Cross-validation....444
Bootstrap sampling....448
Summary....450
Chapter 11: Being Successful with Machine Learning....452
What makes a successful machine learning practitioner?....453
What makes a successful machine learning model?....455
Avoiding obvious predictions....459
Conducting fair evaluations....462
Considering real-world impacts....466
Building trust in the model....471
Putting the “science” in data science....475
Using R Notebooks and R Markdown....479
Performing advanced data exploration....483
Constructing a data exploration roadmap....484
Encountering outliers: a real-world pitfall....487
Example – using ggplot2 for visual data exploration....490
Summary....503
Chapter 12: Advanced Data Preparation....506
Performing feature engineering....507
The role of human and machine....508
The impact of big data and deep learning....512
Feature engineering in practice....519
Hint 1: Brainstorm new features....520
Hint 2: Find insights hidden in text....521
Hint 3: Transform numeric ranges....523
Hint 4: Observe neighbors’ behavior....524
Hint 5: Utilize related rows....526
Hint 6: Decompose time series....527
Hint 7: Append external data....532
Exploring R’s tidyverse....534
Making tidy table structures with tibbles....535
Reading rectangular files faster with readr and readxl....536
Preparing and piping data with dplyr....538
Transforming text with stringr....543
Cleaning dates with lubridate....549
Summary....554
Chapter 13: Challenging Data – Too Much, Too Little, Too Complex....556
The challenge of high-dimension data....557
Applying feature selection....559
Filter methods....561
Wrapper methods and embedded methods....562
Example – Using stepwise regression for feature selection....564
Example – Using Boruta for feature selection....568
Performing feature extraction....571
Understanding principal component analysis....571
Example – Using PCA to reduce highly dimensional social media data....576
Making use of sparse data....585
Identifying sparse data....585
Example – Remapping sparse categorical data....586
Example – Binning sparse numeric data....590
Handling missing data....595
Understanding types of missing data....596
Performing missing value imputation....598
Simple imputation with missing value indicators....599
Missing value patterns....600
The problem of imbalanced data....602
Simple strategies for rebalancing data....603
Generating a synthetic balanced dataset with SMOTE....606
Example – Applying the SMOTE algorithm in R....607
Considering whether balanced is always better....610
Summary....612
Chapter 14: Building Better Learners....614
Tuning stock models for better performance....615
Determining the scope of hyperparameter tuning....616
Example – using caret for automated tuning....621
Creating a simple tuned model....624
Customizing the tuning process....627
Improving model performance with ensembles....632
Understanding ensemble learning....633
Popular ensemble-based algorithms....636
Bagging....636
Boosting....638
Random forests....641
Gradient boosting....647
Extreme gradient boosting with XGBoost....652
Why are tree-based ensembles so popular?....659
Stacking models for meta-learning....661
Understanding model stacking and blending....663
Practical methods for blending and stacking in R....665
Summary....668
Chapter 15: Making Use of Big Data....670
Practical applications of deep learning....671
Beginning with deep learning....672
Choosing appropriate tasks for deep learning....673
The TensorFlow and Keras deep learning frameworks....676
Understanding convolutional neural networks....678
Transfer learning and fine tuning....681
Example – classifying images using a pre-trained CNN in R....682
Unsupervised learning and big data....689
Representing highly dimensional concepts as embeddings....690
Understanding word embeddings....692
Example – using word2vec for understanding text in R....694
Visualizing highly dimensional data....703
The limitations of using PCA for big data visualization....704
Understanding the t-SNE algorithm....706
Example – visualizing data’s natural clusters with t-SNE....709
Adapting R to handle large datasets....714
Querying data in SQL databases....715
The tidy approach to managing database connections....715
Using a database backend for dplyr with dbplyr....718
Doing work faster with parallel processing....720
Measuring R’s execution time....722
Enabling parallel processing in R....722
Taking advantage of parallel with foreach and doParallel....725
Training and evaluating models in parallel with caret....727
Utilizing specialized hardware and algorithms....728
Parallel computing with MapReduce concepts via Apache Spark....729
Learning via distributed and scalable algorithms with H2O....731
GPU computing....733
Summary....735
Other Books You May Enjoy....738
Index....744
Dive into R with this data science guide to machine learning (ML). Machine Learning with R, Fourth Edition, takes you through classification methods such as nearest neighbor and Naive Bayes, and through regression modeling, from simple linear to logistic.
Explore practical deep learning with neural networks and support vector machines, and unearth valuable insights from complex datasets with market basket analysis. Learn how to unlock hidden patterns within your data using k-means clustering.
With three new chapters on data, you'll hone your skills in advanced data preparation, master feature engineering, and tackle challenging data scenarios. This book helps you handle high dimensionality, sparsity, and class imbalance with confidence, and navigate the complexities of big data with ease, harnessing parallel computing and GPU resources for faster insights.
Elevate your understanding of model performance evaluation, moving beyond accuracy metrics. With a new chapter on building better learners, you'll pick up the techniques that top teams use to improve model performance through ensemble methods, model stacking, and blending.
Machine Learning with R, Fourth Edition, equips you with the tools and knowledge to tackle even the most formidable data challenges. Unlock the full potential of machine learning and become a true master of the craft.
This book is designed to help data scientists, actuaries, data analysts, financial analysts, social scientists, business and machine learning students, and any other practitioners who want a clear, accessible guide to machine learning with R. No R experience is required, although prior exposure to statistics and programming is helpful.