Software Engineering for Data Scientists: From Notebooks to Scalable Systems

Author: Catherine Nelson
Release date: 2024
Publisher: O’Reilly Media, Inc.
Pages: 336
File size: 3.5 MB
File type: PDF

Preface
  Who Is This Book For?
  Why Python?
  What Is Not in This Book
  Guide to This Book
  Reading Order
  Conventions Used in This Book
  Using Code Examples
  O’Reilly Online Learning
  How to Contact Us
  Acknowledgments

1. What Is Good Code?
  Why Good Code Matters
  Adapting to Changing Requirements
  Simplicity
  Don’t Repeat Yourself (DRY)
  Avoid Verbose Code
  Modularity
  Readability
  Standards and Conventions
  Names
  Cleaning up
  Documentation
  Performance
  Robustness
  Errors and Logging
  Testing
  Key Takeaways

2. Analyzing Code Performance
  Methods to Improve Performance
  Timing Your Code
  Profiling Your Code
  cProfile
  line_profiler
  Memory Profiling with Memray
  Time Complexity
  How to Estimate Time Complexity
  Big O Notation
  Key Takeaways

3. Using Data Structures Effectively
  Native Python Data Structures
  Lists
  Tuples
  Dictionaries
  Sets
  NumPy Arrays
  NumPy Array Functionality
  NumPy Array Performance Considerations
  Array Operations Using Dask
  Arrays in Machine Learning
  pandas DataFrames
  DataFrame Functionality
  DataFrame Performance Considerations
  Key Takeaways

4. Object-Oriented Programming and Functional Programming
  Object-Oriented Programming
  Classes, Methods, and Attributes
  Defining Your Own Classes
  OOP Principles
  Functional Programming
  Lambda Functions and map()
  Applying Functions to DataFrames
  Which Paradigm Should I Use?
  Key Takeaways

5. Errors, Logging, and Debugging
  Errors in Python
  Reading Python Error Messages
  Handling Errors
  Raising Errors
  Logging
  What to Log
  Logging Configuration
  How to Log
  Debugging
  Strategies for Debugging
  Tools for Debugging
  Key Takeaways

6. Code Formatting, Linting, and Type Checking
  Code Formatting and Style Guides
  PEP8
  Import Formatting
  Automatic Code Formatting with Black
  Linting
  Linting Tools
  Linting in Your IDE
  Type Checking
  Type Annotations
  Type Checking with mypy
  Key Takeaways

7. Testing Your Code
  Why You Should Write Tests
  When to Test
  How to Write and Run Tests
  A Basic Test
  Testing Unexpected Inputs
  Running Automated Tests with Pytest
  Types of Tests
  Unit Tests
  Integration Tests
  Data Validation
  Data Validation Examples
  Using Pandera for Data Validation
  Data Validation with Pydantic
  Testing for Machine Learning
  Testing Model Training
  Testing Model Inference
  Key Takeaways

8. Design and Refactoring
  Project Design and Structure
  Project Design Considerations
  An Example Machine Learning Project
  Code Design
  Modular Code
  A Code Design Framework
  Interfaces and Contracts
  Coupling
  From Notebooks to Scalable Scripts
  Why Use Scripts Instead of Notebooks?
  Creating Scripts from Notebooks
  Refactoring
  Strategies for Refactoring
  An Example Refactoring Workflow
  Key Takeaways

9. Documentation
  Documentation Within the Codebase
  Names
  Comments
  Docstrings
  Readmes, Tutorials, and Other Longer Documents
  Documentation in Jupyter Notebooks
  Documenting Machine Learning Experiments
  Key Takeaways

10. Sharing Your Code: Version Control, Dependencies, and Packaging
  Version Control Using Git
  How Does Git Work?
  Tracking Changes and Committing
  Remote and Local
  Branches and Pull Requests
  Dependencies and Virtual Environments
  Virtual Environments
  Managing Dependencies with pip
  Managing Dependencies with Poetry
  Python Packaging
  Packaging Basics
  pyproject.toml
  Building and Uploading Packages
  Key Takeaways

11. APIs
  Calling an API
  HTTP Methods and Status Codes
  Getting Data from the SDG API
  Creating Your Own API Using FastAPI
  Setting Up the API
  Adding Functionality to Your API
  Making Requests to Your API
  Key Takeaways

12. Automation and Deployment
  Deploying Code
  Automation Examples
  Pre-Commit Hooks
  GitHub Actions
  Cloud Deployments
  Containers and Docker
  Building a Docker Container
  Deploying an API on Google Cloud
  Deploying an API on Other Cloud Providers
  Key Takeaways

13. Security
  What Is Security?
  Security Risks
  Credentials, Physical Security, and Social Engineering
  Third-Party Packages
  The Python Pickle Module
  Version Control Risks
  API Security Risks
  Security Practices
  Security Reviews and Policies
  Secure Coding Tools
  Simple Code Scanning
  Security for Machine Learning
  Attacks on ML Systems
  Security Practices for ML Systems
  Key Takeaways

14. Working in Software
  Development Principles and Practices
  The Software Development Lifecycle
  Waterfall Software Development
  Agile Software Development
  Agile Data Science
  Roles in the Software Industry
  Software Engineer
  QA or Test Engineer
  Data Engineer
  Data Analyst
  Product Manager
  UX Researcher
  Designer
  Community
  Open Source
  Speaking at Events
  The Python Community
  Key Takeaways

15. Next Steps
  The Future of Code
  Your Future in Code
  Thank You

Index

About the Author

Data science happens in code. The ability to write reproducible, robust, scalable code is key to a data science project's success, and it is absolutely essential for those working with production code. This practical book bridges the gap between data science and software engineering, and clearly explains how to apply the best practices from software engineering to data science.

Examples are provided in Python, drawn from popular packages such as NumPy and pandas. If you want to write better data science code, this guide covers the essential topics that are often missing from introductory data science or coding classes, including how to:

  • Understand data structures and object-oriented programming
  • Clearly and skillfully document your code
  • Package and share your code
  • Integrate data science code with a larger code base
  • Write APIs
  • Create secure code
  • Apply best practices to common tasks such as testing, error handling, and logging
  • Work more effectively with software engineers
  • Write more efficient, maintainable, and robust code in Python
  • Put your data science projects into production
  • And more
