Cover
Half Title
Series Page
Title Page
Copyright Page
Dedication
Contents
Preface
1. What Is This Book About?
1.1. Machine Learning
1.2. Data Science
1.3. Software Engineering
1.4. How Do They Go Together?
I. Foundations of Scientific Computing
2. Hardware Architectures
2.1. Types of Hardware
2.1.1. Compute
2.1.2. Memory
2.1.3. Connections
2.2. Making Hardware Live Up to Expectations
2.3. Local and Remote Hardware
2.4. Choosing the Right Hardware for the Job
3. Variable Types and Data Structures
3.1. Variable Types
3.1.1. Integers
3.1.2. Floating Point
3.1.3. Strings
3.2. Data Structures
3.2.1. Vectors and Lists
3.2.2. Representing Data with Data Frames
3.2.3. Dense and Sparse Matrices
3.3. Choosing the Right Variable Types for the Job
3.4. Choosing the Right Data Structures for the Job
4. Analysis of Algorithms
4.1. Writing Pseudocode
4.2. Computational Complexity and Big-O Notation
4.3. Big-O Notation and Benchmarking
4.4. Algorithm Analysis for Machine Learning
4.5. Some Examples of Algorithm Analysis
4.5.1. Estimating Linear Regression Models
4.5.2. Sparse Matrices Representation
4.5.3. Uniform Simulations of Directed Acyclic Graphs
4.6. Big-O Notation and Real-World Performance
II. Best Practices for Machine Learning Pipelines
5. Designing and Structuring Pipelines
5.1. Data as Code
5.2. Technical Debt
5.2.1. At the Data Level
5.2.2. At the Model Level
5.2.3. At the Architecture (Design) Level
5.2.4. At the Code Level
5.3. Machine Learning Pipeline
5.3.1. Project Scoping
5.3.2. Producing a Baseline Implementation
5.3.3. Data Ingestion and Preparation
5.3.4. Model Training, Evaluation and Validation
5.3.5. Deployment, Serving and Inference
5.3.6. Monitoring, Logging and Reporting
6. Writing Machine Learning Code
6.1. Choosing Languages and Libraries
6.2. Naming Things
6.3. Coding Styles and Coding Standards
6.4. Filesystem Structure
6.5. Effective Versioning
6.6. Code Review
6.7. Refactoring
6.8. Reworking Academic Code: An Example
7. Packaging and Deploying Pipelines
7.1. Model Packaging
7.1.1. Standalone Packaging
7.1.2. Programming Language Package Managers
7.1.3. Virtual Machines
7.1.4. Containers
7.2. Model Deployment: Strategies
7.3. Model Deployment: Infrastructure
7.4. Model Deployment: Monitoring and Logging
7.5. What Can Possibly Go Wrong?
7.6. Rolling Back
8. Documenting Pipelines
8.1. Comments
8.2. Documenting Public Interfaces
8.3. Documenting Architecture and Design
8.4. Documenting Algorithms and Business Cases
8.5. Illustrating Practical Use Cases
9. Troubleshooting and Testing Pipelines
9.1. Data Are the Problem
9.1.1. Large Data
9.1.2. Heterogeneous Data
9.1.3. Dynamic Data
9.2. Models Are the Problem
9.2.1. Large Models
9.2.2. Black-Box Models
9.2.3. Costly Models
9.2.4. Many Models
9.3. Common Signs That Something Is Up
9.4. Tests Are the Solution
9.4.1. What Do We Want to Achieve?
9.4.2. What Should We Test?
9.4.3. Offline and Online Data
9.4.4. Testing Local and Testing Global
9.4.5. Conceptual and Implementation Errors
9.4.6. Code Coverage and Test Prioritisation
III. Tools and Technologies
10. Tools for Developing Pipelines
10.1. Data Exploration and Experiment Tracking
10.2. Code Development
10.2.1. Code Editors and IDEs
10.2.2. Notebooks
10.2.3. Accessing Data and Documentation
10.3. Build, Test and Documentation Tools
11. Tools to Manage Pipelines in Production
11.1. Infrastructure Management
11.2. Machine Learning Software Management
11.3. Dashboards, Visualisation and Reporting
IV. A Case Study
12. Recommending Recommendations: A Recommender System Using Natural Language Understanding
12.1. The Domain Problem
12.2. The Machine Learning Model
12.3. The Infrastructure
12.4. The Architecture of the Pipeline
12.4.1. Data Ingestion and Data Preparation
12.4.2. Data Tracking and Versioning
12.4.3. Training and Experiment Tracking
12.4.4. Model Packaging
12.4.5. Deployment and Inference
Bibliography
Index
Machine learning has redefined the way we work with data and is increasingly becoming an indispensable part of everyday life. The Pragmatic Programmer for Machine Learning: Engineering Analytics and Data Science Solutions discusses how modern software engineering practices are part of this revolution, both conceptually and in practical applications.
Comprising a broad overview of how to design machine learning pipelines as well as the state-of-the-art tools we use to make them, this book provides a multi-disciplinary view of how traditional software engineering can be adapted to and integrated with the workflows of domain experts and probabilistic models.
From choosing the right hardware to designing effective pipeline architectures and adopting software development best practices, this guide will appeal to machine learning and data science specialists, whilst also laying out key high-level principles in a way that is approachable for students of computer science and aspiring programmers.