Cover....1
Copyright....3
Contributors....4
Table of Contents....6
Preface....18
Chapter 1: Introducing Machine Learning....24
The origins of machine learning....25
Uses and abuses of machine learning....28
Machine learning successes....30
The limits of machine learning....31
Machine learning ethics....33
How machines learn....37
Data storage....39
Abstraction....39
Generalization....43
Evaluation....45
Machine learning in practice....46
Types of input data....47
Types of machine learning algorithms....49
Matching input data to algorithms....54
Machine learning with R....55
Installing R packages....56
Loading and unloading R packages....57
Installing RStudio....57
Why R and why R now?....59
Summary....61
Chapter 2: Managing and Understanding Data....62
R data structures....63
Vectors....63
Factors....66
Lists....67
Data frames....70
Matrices and arrays....74
Managing data with R....75
Saving, loading, and removing R data structures....75
Importing and saving datasets from CSV files....77
Importing common dataset formats using RStudio....79
Exploring and understanding data....82
Exploring the structure of data....83
Exploring numeric features....84
Measuring the central tendency – mean and median....85
Measuring spread – quartiles and the five-number summary....87
Visualizing numeric features – boxplots....89
Visualizing numeric features – histograms....91
Understanding numeric data – uniform and normal distributions....93
Measuring spread – variance and standard deviation....94
Exploring categorical features....96
Measuring the central tendency – the mode....97
Exploring relationships between features....99
Visualizing relationships – scatterplots....99
Examining relationships – two-way cross-tabulations....101
Summary....105
Chapter 3: Lazy Learning – Classification Using Nearest Neighbors....106
Understanding nearest neighbor classification....107
The k-NN algorithm....107
Measuring similarity with distance....111
Choosing an appropriate k....113
Preparing data for use with k-NN....114
Why is the k-NN algorithm lazy?....117
Example – diagnosing breast cancer with the k-NN algorithm....118
Step 1 – collecting data....119
Step 2 – exploring and preparing the data....119
Transformation – normalizing numeric data....121
Data preparation – creating training and test datasets....123
Step 3 – training a model on the data....124
Step 4 – evaluating model performance....126
Step 5 – improving model performance....127
Transformation – z-score standardization....127
Testing alternative values of k....129
Summary....130
Chapter 4: Probabilistic Learning – Classification Using Naive Bayes....132
Understanding Naive Bayes....133
Basic concepts of Bayesian methods....133
Understanding probability....134
Understanding joint probability....135
Computing conditional probability with Bayes’ theorem....137
The Naive Bayes algorithm....140
Classification with Naive Bayes....141
The Laplace estimator....143
Using numeric features with Naive Bayes....145
Example – filtering mobile phone spam with the Naive Bayes algorithm....146
Step 1 – collecting data....147
Step 2 – exploring and preparing the data....148
Data preparation – cleaning and standardizing text data....149
Data preparation – splitting text documents into words....155
Data preparation – creating training and test datasets....158
Visualizing text data – word clouds....159
Data preparation – creating indicator features for frequent words....162
Step 3 – training a model on the data....164
Step 4 – evaluating model performance....166
Step 5 – improving model performance....167
Summary....168
Chapter 5: Divide and Conquer – Classification Using Decision Trees and Rules....170
Understanding decision trees....171
Divide and conquer....172
The C5.0 decision tree algorithm....176
Choosing the best split....177
Pruning the decision tree....180
Example – identifying risky bank loans using C5.0 decision trees....181
Step 1 – collecting data....182
Step 2 – exploring and preparing the data....182
Data preparation – creating random training and test datasets....184
Step 3 – training a model on the data....186
Step 4 – evaluating model performance....192
Step 5 – improving model performance....193
Boosting the accuracy of decision trees....193
Making some mistakes cost more than others....196
Understanding classification rules....198
Separate and conquer....199
The 1R algorithm....201
The RIPPER algorithm....203
Rules from decision trees....205
What makes trees and rules greedy?....206
Example – identifying poisonous mushrooms with rule learners....208
Step 1 – collecting data....209
Step 2 – exploring and preparing the data....209
Step 3 – training a model on the data....210
Step 4 – evaluating model performance....212
Step 5 – improving model performance....213
Summary....217
Chapter 6: Forecasting Numeric Data – Regression Methods....220
Understanding regression....221
Simple linear regression....223
Ordinary least squares estimation....226
Correlations....228
Multiple linear regression....230
Generalized linear models and logistic regression....235
Example – predicting auto insurance claims costs using linear regression....241
Step 1 – collecting data....242
Step 2 – exploring and preparing the data....243
Exploring relationships between features – the correlation matrix....246
Visualizing relationships between features – the scatterplot matrix....247
Step 3 – training a model on the data....250
Step 4 – evaluating model performance....253
Step 5 – improving model performance....255
Model specification – adding nonlinear relationships....255
Model specification – adding interaction effects....256
Putting it all together – an improved regression model....256
Making predictions with a regression model....258
Going further – predicting insurance policyholder churn with logistic regression....261
Understanding regression trees and model trees....268
Adding regression to trees....269
Example – estimating the quality of wines with regression trees and model trees....271
Step 1 – collecting data....272
Step 2 – exploring and preparing the data....273
Step 3 – training a model on the data....275
Visualizing decision trees....278
Step 4 – evaluating model performance....280
Measuring performance with the mean absolute error....280
Step 5 – improving model performance....282
Summary....285
Chapter 7: Black-Box Methods – Neural Networks and Support Vector Machines....288
Understanding neural networks....289
From biological to artificial neurons....290
Activation functions....292
Network topology....296
The number of layers....296
The direction of information travel....298
The number of nodes in each layer....300
Training neural networks with backpropagation....301
Example – modeling the strength of concrete with ANNs....304
Step 1 – collecting data....304
Step 2 – exploring and preparing the data....305
Step 3 – training a model on the data....307
Step 4 – evaluating model performance....310
Step 5 – improving model performance....311
Understanding support vector machines....317
Classification with hyperplanes....318
The case of linearly separable data....320
The case of nonlinearly separable data....322
Using kernels for nonlinear spaces....323
Example – performing OCR with SVMs....325
Step 1 – collecting data....326
Step 2 – exploring and preparing the data....327
Step 3 – training a model on the data....328
Step 4 – evaluating model performance....331
Step 5 – improving model performance....333
Changing the SVM kernel function....333
Identifying the best SVM cost parameter....334
Summary....336
Chapter 8: Finding Patterns – Market Basket Analysis Using Association Rules....338
Understanding association rules....339
The Apriori algorithm for association rule learning....340
Measuring rule interest – support and confidence....342
Building a set of rules with the Apriori principle....343
Example – identifying frequently purchased groceries with association rules....344
Step 1 – collecting data....345
Step 2 – exploring and preparing the data....346
Data preparation – creating a sparse matrix for transaction data....347
Visualizing item support – item frequency plots....351
Visualizing the transaction data – plotting the sparse matrix....353
Step 3 – training a model on the data....354
Step 4 – evaluating model performance....358
Step 5 – improving model performance....362
Sorting the set of association rules....363
Taking subsets of association rules....364
Saving association rules to a file or data frame....365
Using the Eclat algorithm for greater efficiency....366
Summary....368
Chapter 9: Finding Groups of Data – Clustering with k-means....370
Understanding clustering....371
Clustering as a machine learning task....371
Clusters of clustering algorithms....374
The k-means clustering algorithm....379
Using distance to assign and update clusters....380
Choosing the appropriate number of clusters....385
Finding teen market segments using k-means clustering....387
Step 1 – collecting data....387
Step 2 – exploring and preparing the data....388
Data preparation – dummy coding missing values....390
Data preparation – imputing the missing values....391
Step 3 – training a model on the data....393
Step 4 – evaluating model performance....396
Step 5 – improving model performance....400
Summary....402
Chapter 10: Evaluating Model Performance....404
Measuring performance for classification....405
Understanding a classifier’s predictions....406
A closer look at confusion matrices....410
Using confusion matrices to measure performance....412
Beyond accuracy – other measures of performance....414
The kappa statistic....416
The Matthews correlation coefficient....420
Sensitivity and specificity....423
Precision and recall....424
The F-measure....426
Visualizing performance tradeoffs with ROC curves....427
Comparing ROC curves....432
The area under the ROC curve....435
Creating ROC curves and computing AUC in R....436
Estimating future performance....439
The holdout method....440
Cross-validation....444
Bootstrap sampling....448
Summary....450
Chapter 11: Being Successful with Machine Learning....452
What makes a successful machine learning practitioner?....453
What makes a successful machine learning model?....455
Avoiding obvious predictions....459
Conducting fair evaluations....462
Considering real-world impacts....466
Building trust in the model....471
Putting the “science” in data science....475
Using R Notebooks and R Markdown....479
Performing advanced data exploration....483
Constructing a data exploration roadmap....484
Encountering outliers: a real-world pitfall....487
Example – using ggplot2 for visual data exploration....490
Summary....503
Chapter 12: Advanced Data Preparation....506
Performing feature engineering....507
The role of human and machine....508
The impact of big data and deep learning....512
Feature engineering in practice....519
Hint 1: Brainstorm new features....520
Hint 2: Find insights hidden in text....521
Hint 3: Transform numeric ranges....523
Hint 4: Observe neighbors’ behavior....524
Hint 5: Utilize related rows....526
Hint 6: Decompose time series....527
Hint 7: Append external data....532
Exploring R’s tidyverse....534
Making tidy table structures with tibbles....535
Reading rectangular files faster with readr and readxl....536
Preparing and piping data with dplyr....538
Transforming text with stringr....543
Cleaning dates with lubridate....549
Summary....554
Chapter 13: Challenging Data – Too Much, Too Little, Too Complex....556
The challenge of high-dimension data....557
Applying feature selection....559
Filter methods....561
Wrapper methods and embedded methods....562
Example – Using stepwise regression for feature selection....564
Example – Using Boruta for feature selection....568
Performing feature extraction....571
Understanding principal component analysis....571
Example – Using PCA to reduce highly dimensional social media data....576
Making use of sparse data....585
Identifying sparse data....585
Example – Remapping sparse categorical data....586
Example – Binning sparse numeric data....590
Handling missing data....595
Understanding types of missing data....596
Performing missing value imputation....598
Simple imputation with missing value indicators....599
Missing value patterns....600
The problem of imbalanced data....602
Simple strategies for rebalancing data....603
Generating a synthetic balanced dataset with SMOTE....606
Example – Applying the SMOTE algorithm in R....607
Considering whether balanced is always better....610
Summary....612
Chapter 14: Building Better Learners....614
Tuning stock models for better performance....615
Determining the scope of hyperparameter tuning....616
Example – using caret for automated tuning....621
Creating a simple tuned model....624
Customizing the tuning process....627
Improving model performance with ensembles....632
Understanding ensemble learning....633
Popular ensemble-based algorithms....636
Bagging....636
Boosting....638
Random forests....641
Gradient boosting....647
Extreme gradient boosting with XGBoost....652
Why are tree-based ensembles so popular?....659
Stacking models for meta-learning....661
Understanding model stacking and blending....663
Practical methods for blending and stacking in R....665
Summary....668
Chapter 15: Making Use of Big Data....670
Practical applications of deep learning....671
Beginning with deep learning....672
Choosing appropriate tasks for deep learning....673
The TensorFlow and Keras deep learning frameworks....676
Understanding convolutional neural networks....678
Transfer learning and fine tuning....681
Example – classifying images using a pre-trained CNN in R....682
Unsupervised learning and big data....689
Representing highly dimensional concepts as embeddings....690
Understanding word embeddings....692
Example – using word2vec for understanding text in R....694
Visualizing highly dimensional data....703
The limitations of using PCA for big data visualization....704
Understanding the t-SNE algorithm....706
Example – visualizing data’s natural clusters with t-SNE....709
Adapting R to handle large datasets....714
Querying data in SQL databases....715
The tidy approach to managing database connections....715
Using a database backend for dplyr with dbplyr....718
Doing work faster with parallel processing....720
Measuring R’s execution time....722
Enabling parallel processing in R....722
Taking advantage of parallel with foreach and doParallel....725
Training and evaluating models in parallel with caret....727
Utilizing specialized hardware and algorithms....728
Parallel computing with MapReduce concepts via Apache Spark....729
Learning via distributed and scalable algorithms with H2O....731
GPU computing....733
Summary....735
Other Books You May Enjoy....738
Index....744
Dive into R with this data science guide to machine learning (ML). Machine Learning with R, Fourth Edition, takes you through classification methods such as nearest neighbor and Naive Bayes, and through regression modeling, from simple linear to logistic.
Explore practical deep learning with neural networks and support vector machines, and unearth valuable insights from complex datasets with market basket analysis. Learn how to unlock hidden patterns within your data using k-means clustering.
With three new chapters on data, you'll hone your skills in advanced data preparation, master feature engineering, and tackle challenging data scenarios. This book helps you handle high dimensionality, sparsity, and class imbalance with confidence, and navigate the complexities of big data with ease, harnessing parallel computing and GPU resources for faster insights.
Elevate your understanding of model performance evaluation, moving beyond accuracy metrics. With a new chapter on building better learners, you'll pick up the techniques that top teams use to improve model performance through ensemble methods, model stacking, and blending.
Machine Learning with R, Fourth Edition, equips you with the tools and knowledge to tackle even the most formidable data challenges. Unlock the full potential of machine learning and become a true master of the craft.
This book is designed to help data scientists, actuaries, data analysts, financial analysts, social scientists, business and machine learning students, and any other practitioners who want a clear, accessible guide to machine learning with R. No R experience is required, although prior exposure to statistics and programming is helpful.