Preface
Who Is This Book For?
Why Python?
What Is Not in This Book
Guide to This Book
Reading Order
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
1. What Is Good Code?
Why Good Code Matters
Adapting to Changing Requirements
Simplicity
Don’t Repeat Yourself (DRY)
Avoid Verbose Code
Modularity
Readability
Standards and Conventions
Names
Cleaning up
Documentation
Performance
Robustness
Errors and Logging
Testing
Key Takeaways
2. Analyzing Code Performance
Methods to Improve Performance
Timing Your Code
Profiling Your Code
cProfile
line_profiler
Memory Profiling with Memray
Time Complexity
How to Estimate Time Complexity
Big O Notation
Key Takeaways
3. Using Data Structures Effectively
Native Python Data Structures
Lists
Tuples
Dictionaries
Sets
NumPy Arrays
NumPy Array Functionality
NumPy Array Performance Considerations
Array Operations Using Dask
Arrays in Machine Learning
pandas DataFrames
DataFrame Functionality
DataFrame Performance Considerations
Key Takeaways
4. Object-Oriented Programming and Functional Programming
Object-Oriented Programming
Classes, Methods, and Attributes
Defining Your Own Classes
OOP Principles
Functional Programming
Lambda Functions and map()
Applying Functions to DataFrames
Which Paradigm Should I Use?
Key Takeaways
5. Errors, Logging, and Debugging
Errors in Python
Reading Python Error Messages
Handling Errors
Raising Errors
Logging
What to Log
Logging Configuration
How to Log
Debugging
Strategies for Debugging
Tools for Debugging
Key Takeaways
6. Code Formatting, Linting, and Type Checking
Code Formatting and Style Guides
PEP8
Import Formatting
Automatic Code Formatting with Black
Linting
Linting Tools
Linting in Your IDE
Type Checking
Type Annotations
Type Checking with mypy
Key Takeaways
7. Testing Your Code
Why You Should Write Tests
When to Test
How to Write and Run Tests
A Basic Test
Testing Unexpected Inputs
Running Automated Tests with Pytest
Types of Tests
Unit Tests
Integration Tests
Data Validation
Data Validation Examples
Using Pandera for Data Validation
Data Validation with Pydantic
Testing for Machine Learning
Testing Model Training
Testing Model Inference
Key Takeaways
8. Design and Refactoring
Project Design and Structure
Project Design Considerations
An Example Machine Learning Project
Code Design
Modular Code
A Code Design Framework
Interfaces and Contracts
Coupling
From Notebooks to Scalable Scripts
Why Use Scripts Instead of Notebooks?
Creating Scripts from Notebooks
Refactoring
Strategies for Refactoring
An Example Refactoring Workflow
Key Takeaways
9. Documentation
Documentation Within the Codebase
Names
Comments
Docstrings
Readmes, Tutorials, and Other Longer Documents
Documentation in Jupyter Notebooks
Documenting Machine Learning Experiments
Key Takeaways
10. Sharing Your Code: Version Control, Dependencies, and Packaging
Version Control Using Git
How Does Git Work?
Tracking Changes and Committing
Remote and Local
Branches and Pull Requests
Dependencies and Virtual Environments
Virtual Environments
Managing Dependencies with pip
Managing Dependencies with Poetry
Python Packaging
Packaging Basics
pyproject.toml
Building and Uploading Packages
Key Takeaways
11. APIs
Calling an API
HTTP Methods and Status Codes
Getting Data from the SDG API
Creating Your Own API Using FastAPI
Setting Up the API
Adding Functionality to Your API
Making Requests to Your API
Key Takeaways
12. Automation and Deployment
Deploying Code
Automation Examples
Pre-Commit Hooks
GitHub Actions
Cloud Deployments
Containers and Docker
Building a Docker Container
Deploying an API on Google Cloud
Deploying an API on Other Cloud Providers
Key Takeaways
13. Security
What Is Security?
Security Risks
Credentials, Physical Security, and Social Engineering
Third-Party Packages
The Python Pickle Module
Version Control Risks
API Security Risks
Security Practices
Security Reviews and Policies
Secure Coding Tools
Simple Code Scanning
Security for Machine Learning
Attacks on ML Systems
Security Practices for ML Systems
Key Takeaways
14. Working in Software
Development Principles and Practices
The Software Development Lifecycle
Waterfall Software Development
Agile Software Development
Agile Data Science
Roles in the Software Industry
Software Engineer
QA or Test Engineer
Data Engineer
Data Analyst
Product Manager
UX Researcher
Designer
Community
Open Source
Speaking at Events
The Python Community
Key Takeaways
15. Next Steps
The Future of Code
Your Future in Code
Thank You
Index
About the Author
Data science happens in code. The ability to write reproducible, robust, scalable code is key to a data science project's success, and it is absolutely essential for those working with production code. This practical book bridges the gap between data science and software engineering, and clearly explains how to apply the best practices from software engineering to data science.
Examples are provided in Python, drawn from popular packages such as NumPy and pandas. If you want to write better data science code, this guide covers the essential topics that are often missing from introductory data science or coding classes, including how to: