Machine Learning for Tabular Data....2
Copyright....4
dedication....7
contents....8
front matter....17
foreword....17
preface....18
acknowledgments....19
about this book....20
Who should read this book?....21
How this book is organized: A roadmap....21
About the code....24
liveBook discussion forum....25
about the authors....26
about the cover illustration....27
Part 1. Introducing machine learning for tabular data....29
1 Understanding tabular data....31
1.1 What is tabular data?....31
1.2 The world runs on tabular data....35
1.3 Machine learning vs. deep learning....36
1.4 What makes tabular data different?....40
1.5 Generative AI and tabular data....43
Summary....47
2 Exploring tabular datasets....49
2.1 Row and column characteristics....49
2.1.1 The ideal criteria for tabular rows....51
2.1.2 The ideal criteria for tabular columns....59
2.1.3 Representing rows and columns....64
2.2 Pathologies and remedies....66
2.2.1 Constant or quasi-constant columns....68
2.2.2 Duplicated and highly collinear features....69
2.2.3 Irrelevant features....73
2.2.4 Missing data....74
2.2.5 Rare categories....75
2.2.6 Errors in data....76
2.2.7 Leakage features....76
2.3 Finding external and internal data....78
2.3.1 Using pandas to access data stores....80
2.3.2 Internet data....86
2.3.3 Synthetic data....92
2.4 Exploratory data analysis....97
2.4.1 Loading the Auto MPG example dataset....99
2.4.2 Examining labels, values, distributions....103
2.4.3 Exploring bivariate and multivariate relationships....118
Summary....127
3 Machine learning vs. deep learning....129
3.1 Predicting Airbnb prices in New York City....130
3.1.1 The Airbnb NYC dataset....130
3.1.2 Introduction to the code....134
3.1.3 A deep learning solution using Keras....136
3.1.4 Training features....137
3.1.5 Comparing gradient boosting and deep learning solutions....140
3.1.6 Conclusions....150
3.2 Transparency....153
3.2.1 Explainability....154
3.2.2 Feature importance....159
3.2.3 Conclusions....162
3.3 Efficacy....163
3.3.1 Evaluating performance....163
3.4 Digging deeper....165
Summary....173
Part 2. Machine learning and gradient boosting for tabular data....175
4 Classical algorithms for tabular data....177
4.1 Introducing Scikit-learn....178
4.1.1 Common features of Scikit-learn packages....180
4.1.2 Common Scikit-learn interface....182
4.1.3 Introduction to Scikit-learn pipelines....188
4.2 Exploring and processing features of the Airbnb NYC dataset....190
4.2.1 Dataset exploration....191
4.2.2 Pipelines preparation....204
4.3 Classical machine learning....207
4.3.1 Linear and logistic regression....211
4.3.2 Regularized methods....219
4.3.3 Logistic regression....226
4.3.4 Generalized linear methods....231
4.3.5 Handling large datasets with stochastic gradient descent....236
4.3.6 Choosing your algorithm....243
Summary....245
5 Decision trees and gradient boosting....249
5.1 Introduction to tree-based methods....249
5.1.1 Bagging and sampling....261
5.1.2 Predicting with random forests....268
5.1.3 Resorting to extremely randomized trees....273
5.2 Gradient boosting....276
5.2.1 How gradient boosting works....279
5.2.2 Extrapolating with gradient boosting....285
5.2.3 Explaining gradient boosting effectiveness....292
5.3 Boosting in Scikit-learn....296
5.3.1 Applying early stopping to avoid overfitting....300
5.4 Using XGBoost....305
5.4.1 XGBoost’s key parameters....308
5.4.2 How XGBoost works....315
5.4.3 Accelerating with histogram splitting....319
5.4.4 Applying early stopping to XGBoost....324
5.5 Introduction to LightGBM....327
5.5.1 How LightGBM grows trees....332
5.5.2 Gaining speed with exclusive feature bundling and gradient-based one-side sampling....334
5.5.3 Applying early stopping to LightGBM....337
5.5.4 Making XGBoost imitate LightGBM....340
5.5.5 How LightGBM inspired Scikit-learn....341
Summary....344
6 Advanced feature processing methods....348
6.1 Processing features....349
6.1.1 Multivariate missing data imputation....351
6.1.2 Handling missing data with GBDTs....357
6.1.3 Target encoding....359
6.1.4 Transforming numerical data....367
6.2 Selecting features....379
6.2.1 Stability selection for linear models....381
6.2.2 Shadow features and Boruta....384
6.2.3 Forward and backward selection....388
6.3 Optimizing hyperparameters....391
6.3.1 Searching systematically....393
6.3.2 Using random trials....396
6.3.3 Reducing the computational burden....399
6.3.4 Extending your search by Bayesian methods....401
6.3.5 Manually setting hyperparameters....408
6.4 Mastering gradient boosting....411
6.4.1 Deciding between XGBoost and LightGBM....411
6.4.2 Exploring tree structures....413
6.4.3 Speeding up GBDTs by compiling....420
Summary....425
7 An end-to-end example using XGBoost....429
7.1 Preparing and exploring your data....429
7.1.1 Using generative AI to help prepare data....430
7.1.2 Getting and preparing your data....432
7.1.3 Engineering more complex features....445
7.1.4 Finalizing your data....457
7.1.5 Exploring and fixing your data....459
7.1.6 Exploring your target....466
7.2 Building and optimizing your model....470
7.2.1 Preparing a cross-validation strategy....470
7.2.2 Preparing your pipeline....473
7.2.3 Building a baseline model....478
7.2.4 Building a first tentative model....486
7.2.5 Optimizing your model....490
7.2.6 Training the final model....496
7.3 Explaining your model with SHAP....499
Summary....513
Part 3. Deep learning for tabular data....516
8 Getting started with deep learning with tabular data....518
8.1 The deep learning with tabular data stack....519
8.2 PyTorch with fastai....526
8.2.1 Reviewing the key code aspects of the fastai solution....526
8.2.2 Comparing the fastai solution with the Keras solution....537
8.3 PyTorch with TabNet....540
8.3.1 Key code aspects of the TabNet solution....540
8.3.2 Comparing the TabNet solution with the Keras solution....544
8.4 PyTorch with Lightning Flash....545
8.4.1 The key code aspects of the Lightning Flash solution....546
8.4.2 Comparing the Lightning Flash solution with the Keras solution....551
8.5 Overall comparison of the stacks....552
8.6 The stacks we didn’t explore....555
Summary....560
9 Deep learning best practices....562
9.1 Introduction to the Kuala Lumpur real estate dataset....563
9.2 Processing the dataset....570
9.2.1 Processing Bathrooms, Car Parks, Furnishing, Property Type, and Location columns....571
9.2.2 Processing the Price column....573
9.2.3 Processing the Rooms column....575
9.2.4 Processing the Size column....580
9.3 Defining the deep learning model....589
9.3.1 Contrasting the custom layer and Keras preprocessing layer approaches....589
9.3.2 Examining the code for model definition using Keras preprocessing layers....595
9.4 Training the deep learning model....603
9.4.1 Cross-validation in the training process....605
9.4.2 Regularization in the training process....606
9.4.3 Normalization in the training process....607
9.5 Exercising the deep learning model....608
9.5.1 Rationale for exercising the trained model on some new data points....609
9.5.2 Exercising the trained model on some new data points....612
Summary....615
10 Model deployment....617
10.1 A simple web deployment....617
10.1.1 Overview of web deployment....618
10.1.2 The Flask server module....620
10.1.3 The home.html page....624
10.1.4 The show-prediction.html page....630
10.1.5 Exercising the web deployment....631
10.2 Public clouds and machine learning operations....632
10.3 Getting started with Google Cloud....634
10.3.1 Accessing Google Cloud for the first time....634
10.3.2 Creating a Google Cloud project....636
10.3.3 Creating a Google Cloud Storage bucket....638
10.4 Deploying a model in Vertex AI....640
10.4.1 Uploading the model to a Cloud Storage bucket....641
10.4.2 Importing the model to Vertex AI....643
10.4.3 Deploying the model to an endpoint....648
10.4.4 Initial test of the model deployment....652
10.5 Using the Vertex AI deployment with Flask....656
10.5.1 Setting up the Vertex AI SDK....657
10.5.2 Updating the Flask server module to call the endpoint....658
10.5.3 Benefits of deploying a model to an endpoint....661
10.6 Gemini for Google Cloud: Generative AI assistance in Google Cloud....663
10.6.1 Setting up Gemini for Google Cloud....664
10.6.2 Using Gemini for Google Cloud to answer questions about Google Cloud....665
Summary....671
11 Building a machine learning pipeline....672
11.1 Introduction to ML pipelines....672
11.1.1 Three kinds of pipelines....673
11.1.2 Overview of Vertex AI ML pipelines....676
11.2 ML pipeline preparation steps....677
11.2.1 Creating a service account for the ML pipeline....677
11.2.2 Creating a service account key....680
11.2.3 Granting the service account access to the Compute Engine default service account....683
11.2.4 Introduction to Cloud Shell....687
11.2.5 Uploading the service account key....689
11.2.6 Uploading the cleaned-up dataset to a Google Cloud Storage bucket....692
11.2.7 Creating a Vertex AI managed dataset....695
11.3 Defining the ML pipeline....700
11.3.1 Local implementation vs. ML pipeline....701
11.3.2 Introduction to containers....704
11.3.3 Benefits of using containers in an ML pipeline....705
11.3.4 Introduction to adapting code to run in a container....706
11.3.5 Updating the training code to work in a container....708
11.3.6 The pipeline script....711
11.3.7 Testing the model trained in the pipeline....715
11.4 Using generative AI to help create the ML pipeline....718
11.4.1 Using Gemini for Google Cloud to answer questions about the ML pipeline....718
11.4.2 Using Gemini for Google Cloud to generate code for the ML pipeline....723
11.4.3 Using Gemini for Google Cloud to explain code for the ML pipeline....726
11.4.4 Using Gemini for Google Cloud to summarize log entries....729
11.4.5 Tuning a foundation model in Vertex AI....736
Summary....741
12 Blending gradient boosting and deep learning....743
12.1 Review of the gradient boosting solution from chapter 7....746
12.2 Selecting a deep learning solution....754
12.3 Selected deep learning solution to the Tokyo Airbnb problem....756
12.4 Comparing the XGBoost and fastai solutions to the Tokyo Airbnb problem....760
12.5 Ensembling the two solutions to the Tokyo Airbnb problem....765
12.6 Overall comparison of gradient boosting and deep learning....768
Summary....770
Appendix A. Hyperparameters for classical machine learning models....771
Appendix B. K-nearest neighbors and support vector machines....777
B.1 k-NN....778
B.2 SVMs....785
B.3 Using GPUs for machine learning....788
index....793
Business runs on tabular data in databases, spreadsheets, and logs. Crunch that data using deep learning, gradient boosting, and other machine learning techniques.
Machine Learning for Tabular Data teaches you to train insightful machine learning models on common tabular business data sources such as spreadsheets, databases, and logs. You’ll discover how to use XGBoost and LightGBM on tabular data, optimize deep learning libraries like TensorFlow and PyTorch for tabular data, and use cloud tools like Vertex AI to create an automated MLOps pipeline.
Machine learning can accelerate everyday business chores like account reconciliation, demand forecasting, and customer service automation—not to mention more exotic challenges like fraud detection, predictive maintenance, and personalized marketing. This book shows you how to unlock the vital information stored in spreadsheets, ledgers, databases, and other tabular data sources using gradient boosting, deep learning, and generative AI.
Machine Learning for Tabular Data delivers practical ML techniques to upgrade every stage of the business data analysis pipeline. In it, you’ll explore examples like using XGBoost and Keras to predict short-term rental prices, deploying a local ML model with Python and Flask, and streamlining workflows using large language models (LLMs). Along the way, you’ll learn to make your models both more powerful and more explainable.
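To give a flavor of the deployment workflow mentioned above, here is a minimal sketch of serving a model prediction with Flask. It is not code from the book: the route name and the `predict_price` stub are illustrative placeholders standing in for a real trained model's `predict()` call.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def predict_price(features):
    # Placeholder: a real deployment would load a trained model
    # (e.g., an XGBoost or Keras model) and call its predict() here.
    return 42.0

@app.route("/predict", methods=["POST"])
def predict():
    # Read the JSON payload of feature values sent by the client
    payload = request.get_json()
    price = predict_price(payload)
    # Return the prediction as a JSON response
    return jsonify({"predicted_price": price})
```

A client would POST a JSON object of feature values to `/predict` and receive the predicted price back as JSON; the book's chapters 10 and 11 extend this local pattern to a cloud endpoint on Vertex AI.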
For readers experienced with Python and the basics of machine learning.