Preface
What This Book Isn’t
What This Book Is About
Who Should Read This Book
Prerequisites
What You’ll Learn and How It Will Improve Your Abilities
The Book Outline
Conventions Used in This Book
How to Contact Us
Acknowledgments
I. Foundation and Building Blocks
1. Data Engineering Described
What Is Data Engineering?
Data Engineering Defined
The Data Engineering Lifecycle
Evolution of the Data Engineer
Data Engineering and Data Science
Data Engineering Skills and Activities
Data Maturity and the Data Engineer
The Background and Skills of a Data Engineer
Business Responsibilities
Technical Responsibilities
The Continuum of Data Engineering Roles, from A to B
Data Engineers Inside an Organization
Internal-Facing Versus External-Facing Data Engineers
Data Engineers and Other Technical Roles
Data Engineers and Business Leadership
Conclusion
Additional Resources
2. The Data Engineering Lifecycle
What Is the Data Engineering Lifecycle?
The Data Lifecycle Versus the Data Engineering Lifecycle
Generation: Source Systems
Storage
Ingestion
Transformation
Serving Data
Major Undercurrents Across the Data Engineering Lifecycle
Security
Data Management
Orchestration
DataOps
Data Architecture
Software Engineering
Conclusion
Additional Resources
3. Designing Good Data Architecture
What Is Data Architecture?
Enterprise Architecture, Defined
Data Architecture Defined
“Good” Data Architecture
Principles of Good Data Architecture
Principle 1: Choose Common Components Wisely
Principle 2: Plan for Failure
Principle 3: Architect for Scalability
Principle 4: Architecture Is Leadership
Principle 5: Always Be Architecting
Principle 6: Build Loosely Coupled Systems
Principle 7: Make Reversible Decisions
Principle 8: Prioritize Security
Principle 9: Embrace FinOps
Major Architecture Concepts
Domains and Services
Distributed Systems, Scalability, and Designing for Failure
Tight Versus Loose Coupling: Tiers, Monoliths, and Microservices
User Access: Single Versus Multitenant
Event-Driven Architecture
Brownfield Versus Greenfield Projects
Examples and Types of Data Architecture
Data Warehouse
Data Lake
Convergence, Next-Generation Data Lakes, and the Data Platform
Modern Data Stack
Lambda Architecture
Kappa Architecture
The Dataflow Model and Unified Batch and Streaming
Architecture for IoT
Data Mesh
Other Data Architecture Examples
Who’s Involved with Designing a Data Architecture?
Conclusion
Additional Resources
4. Choosing Technologies Across the Data Engineering Lifecycle
Team Size and Capabilities
Speed to Market
Interoperability
Cost Optimization and Business Value
Total Cost of Ownership
Total Opportunity Cost of Ownership
FinOps
Today Versus the Future: Immutable Versus Transitory Technologies
Our Advice
Location
On Premises
Cloud
Hybrid Cloud
Multicloud
Decentralized: Blockchain and the Edge
Our Advice
Cloud Repatriation Arguments
Build Versus Buy
Open Source Software
Proprietary Walled Gardens
Our Advice
Monolith Versus Modular
Monolith
Modularity
The Distributed Monolith Pattern
Our Advice
Serverless Versus Servers
Serverless
Containers
When Infrastructure Makes Sense
Our Advice
Optimization, Performance, and the Benchmark Wars
Big Data...for the 1990s
Nonsensical Cost Comparisons
Asymmetric Optimization
Caveat Emptor
Undercurrents and Their Impacts on Choosing Technologies
Data Management
DataOps
Data Architecture
Orchestration Example: Airflow
Software Engineering
Conclusion
II. The Data Engineering Lifecycle in Depth
5. Data Generation in Source Systems
Sources of Data: How Is Data Created?
Source Systems: Main Ideas
Files and Unstructured Data
APIs
Application Databases (OLTP systems)
Online Analytical Processing System
Change Data Capture
Logs
Database Logs
CRUD
Insert-Only
Messages and Streams
Types of Time
Source System Practical Details
Databases
APIs
Data Sharing
Third-Party Data Sources
Message Queues and Event-Streaming Platforms
Whom You’ll Work With
Undercurrents and Their Impact on Source Systems
Security
Data Management
DataOps
Data Architecture
Orchestration
Software Engineering
Conclusion
Additional Resources
6. Storage
Raw Ingredients of Data Storage
Magnetic Disk Drive
Solid-State Drive
Random Access Memory
Networking and CPU
Serialization
Compression
Caching
Data Storage Systems
Single Machine Versus Distributed Storage
Eventual Versus Strong Consistency
File Storage
Block Storage
Object Storage
Cache and Memory-Based Storage Systems
The Hadoop Distributed File System
Streaming Storage
Indexes, Partitioning, and Clustering
Data Engineering Storage Abstractions
The Data Warehouse
The Data Lake
The Data Lakehouse
Data Platforms
Stream-to-Batch Storage Architecture
Big Ideas and Trends in Storage
Data Catalog
Data Sharing
Schema
Separation of Compute from Storage
Data Storage Lifecycle and Data Retention
Single-Tenant Versus Multitenant Storage
Whom You’ll Work With
Undercurrents
Security
Data Management
DataOps
Data Architecture
Orchestration
Software Engineering
Conclusion
Additional Resources
7. Ingestion
What Is Data Ingestion?
Key Engineering Considerations for the Ingestion Phase
Bounded Versus Unbounded
Frequency
Synchronous Versus Asynchronous Ingestion
Serialization and Deserialization
Throughput and Scalability
Reliability and Durability
Payload
Push Versus Pull Versus Poll Patterns
Batch Ingestion Considerations
Snapshot or Differential Extraction
File-Based Export and Ingestion
ETL Versus ELT
Inserts, Updates, and Batch Size
Data Migration
Message and Stream Ingestion Considerations
Schema Evolution
Late-Arriving Data
Ordering and Multiple Delivery
Replay
Time to Live
Message Size
Error Handling and Dead-Letter Queues
Consumer Pull and Push
Location
Ways to Ingest Data
Direct Database Connection
Change Data Capture
APIs
Message Queues and Event-Streaming Platforms
Managed Data Connectors
Moving Data with Object Storage
EDI
Databases and File Export
Practical Issues with Common File Formats
Shell
SSH
SFTP and SCP
Webhooks
Web Interface
Web Scraping
Transfer Appliances for Data Migration
Data Sharing
Whom You’ll Work With
Upstream Stakeholders
Downstream Stakeholders
Undercurrents
Security
Data Management
DataOps
Orchestration
Software Engineering
Conclusion
Additional Resources
8. Queries, Modeling, and Transformation
Queries
What Is a Query?
The Life of a Query
The Query Optimizer
Improving Query Performance
Queries on Streaming Data
Data Modeling
What Is a Data Model?
Conceptual, Logical, and Physical Data Models
Normalization
Techniques for Modeling Batch Analytical Data
Modeling Streaming Data
Transformations
Batch Transformations
Materialized Views, Federation, and Query Virtualization
Streaming Transformations and Processing
Whom You’ll Work With
Upstream Stakeholders
Downstream Stakeholders
Undercurrents
Security
Data Management
DataOps
Data Architecture
Orchestration
Software Engineering
Conclusion
Additional Resources
9. Serving Data for Analytics, Machine Learning, and Reverse ETL
General Considerations for Serving Data
Trust
What’s the Use Case, and Who’s the User?
Data Products
Self-Service or Not?
Data Definitions and Logic
Data Mesh
Analytics
Business Analytics
Operational Analytics
Embedded Analytics
Machine Learning
What a Data Engineer Should Know About ML
Ways to Serve Data for Analytics and ML
File Exchange
Databases
Streaming Systems
Query Federation
Data Sharing
Semantic and Metrics Layers
Serving Data in Notebooks
Reverse ETL
Ways to Serve Data with Reverse ETL
Whom You’ll Work With
Undercurrents
Security
Data Management
DataOps
Data Architecture
Orchestration
Software Engineering
Conclusion
Additional Resources
III. Security, Privacy, and the Future of Data Engineering
10. Security and Privacy
People
The Power of Negative Thinking
Always Be Paranoid
Processes
Security Theater Versus Security Habit
Active Security
The Principle of Least Privilege
Shared Responsibility in the Cloud
Always Back Up Your Data
An Example Security Policy
Technology
Patch and Update Systems
Encryption
Logging, Monitoring, and Alerting
Network Access
Security for Low-Level Data Engineering
Conclusion
Additional Resources
11. The Future of Data Engineering
The Data Engineering Lifecycle Isn’t Going Away
The Decline of Complexity and the Rise of Easy-to-Use Data Tools
The Cloud-Scale Data OS and Improved Interoperability
“Enterprisey” Data Engineering
Titles and Responsibilities Will Morph...
Moving Beyond the Modern Data Stack, Toward the Live Data Stack
The Live Data Stack
Streaming Pipelines and Real-Time Analytical Databases
The Fusion of Data with Applications
The Tight Feedback Between Applications and ML
Dark Matter Data and the Rise of...Spreadsheets?!
Conclusion
A. Serialization and Compression Technical Details
Serialization Formats
Row-Based Serialization
Columnar Serialization
Hybrid Serialization
Database Storage Engines
Compression: gzip, bzip2, Snappy, etc.
B. Cloud Networking
Cloud Network Topology
Data Egress Charges
Availability Zones
Regions
GCP-Specific Networking and Multiregional Redundancy
Direct Network Connections to the Clouds
CDNs
The Future of Data Egress Fees
Index
About the Authors
Data engineering has grown rapidly in the past decade, leaving many software engineers, data scientists, and analysts looking for a comprehensive view of this practice. With this practical book, you'll learn how to plan and build systems to serve the needs of your organization and customers by evaluating the best technologies available through the framework of the data engineering lifecycle.
Authors Joe Reis and Matt Housley walk you through the data engineering lifecycle and show you how to stitch together a variety of cloud technologies to serve the needs of downstream data consumers. You'll understand how to apply the concepts of data generation, ingestion, orchestration, transformation, storage, and governance that are critical in any data environment regardless of the underlying technology.