Fundamentals of Data Engineering: Plan and Build Robust Data Systems

Authors: Matt Housley, Joe Reis
Published: 2022
Publisher: O’Reilly Media, Inc.
Pages: 544
File size: 3.2 MB
File type: PDF

Preface....5

What This Book Isn’t....5

What This Book Is About....6

Who Should Read This Book....8

Prerequisites....8

What You’ll Learn and How It Will Improve Your Abilities....9

The Book Outline....10

Conventions Used in This Book....12

How to Contact Us....13

Acknowledgments....14

I. Foundation and Building Blocks....17

1. Data Engineering Described....18

What Is Data Engineering?....18

Data Engineering Defined....20

The Data Engineering Lifecycle....20

Evolution of the Data Engineer....21

Data Engineering and Data Science....28

Data Engineering Skills and Activities....30

Data Maturity and the Data Engineer....31

The Background and Skills of a Data Engineer....36

Business Responsibilities....37

Technical Responsibilities....38

The Continuum of Data Engineering Roles, from A to B....43

Data Engineers Inside an Organization....44

Internal-Facing Versus External-Facing Data Engineers....45

Data Engineers and Other Technical Roles....47

Data Engineers and Business Leadership....52

Conclusion....56

Additional Resources....57

2. The Data Engineering Lifecycle....60

What Is the Data Engineering Lifecycle?....60

The Data Lifecycle Versus the Data Engineering Lifecycle....62

Generation: Source Systems....62

Storage....66

Ingestion....69

Transformation....73

Serving Data....75

Major Undercurrents Across the Data Engineering Lifecycle....82

Security....82

Data Management....84

Orchestration....96

DataOps....98

Data Architecture....104

Software Engineering....104

Conclusion....107

Additional Resources....108

3. Designing Good Data Architecture....111

What Is Data Architecture?....111

Enterprise Architecture, Defined....111

Data Architecture Defined....115

“Good” Data Architecture....117

Principles of Good Data Architecture....118

Principle 1: Choose Common Components Wisely....120

Principle 2: Plan for Failure....121

Principle 3: Architect for Scalability....122

Principle 4: Architecture Is Leadership....122

Principle 5: Always Be Architecting....123

Principle 6: Build Loosely Coupled Systems....124

Principle 7: Make Reversible Decisions....126

Principle 8: Prioritize Security....127

Principle 9: Embrace FinOps....129

Major Architecture Concepts....131

Domains and Services....131

Distributed Systems, Scalability, and Designing for Failure....132

Tight Versus Loose Coupling: Tiers, Monoliths, and Microservices....135

User Access: Single Versus Multitenant....142

Event-Driven Architecture....143

Brownfield Versus Greenfield Projects....144

Examples and Types of Data Architecture....146

Data Warehouse....146

Data Lake....151

Convergence, Next-Generation Data Lakes, and the Data Platform....152

Modern Data Stack....153

Lambda Architecture....154

Kappa Architecture....156

The Dataflow Model and Unified Batch and Streaming....156

Architecture for IoT....157

Data Mesh....161

Other Data Architecture Examples....163

Who’s Involved with Designing a Data Architecture?....164

Conclusion....165

Additional Resources....165

4. Choosing Technologies Across the Data Engineering Lifecycle....171

Team Size and Capabilities....172

Speed to Market....173

Interoperability....174

Cost Optimization and Business Value....175

Total Cost of Ownership....175

Total Opportunity Cost of Ownership....176

FinOps....177

Today Versus the Future: Immutable Versus Transitory Technologies....178

Our Advice....180

Location....181

On Premises....182

Cloud....183

Hybrid Cloud....188

Multicloud....189

Decentralized: Blockchain and the Edge....190

Our Advice....191

Cloud Repatriation Arguments....192

Build Versus Buy....195

Open Source Software....196

Proprietary Walled Gardens....201

Our Advice....204

Monolith Versus Modular....204

Monolith....205

Modularity....206

The Distributed Monolith Pattern....208

Our Advice....209

Serverless Versus Servers....210

Serverless....210

Containers....211

When Infrastructure Makes Sense....212

Our Advice....214

Optimization, Performance, and the Benchmark Wars....216

Big Data...for the 1990s....217

Nonsensical Cost Comparisons....217

Asymmetric Optimization....218

Caveat Emptor....218

Undercurrents and Their Impacts on Choosing Technologies....218

Data Management....218

DataOps....219

Data Architecture....220

Orchestration Example: Airflow....220

Software Engineering....221

Conclusion....222

II. The Data Engineering Lifecycle in Depth....224

5. Data Generation in Source Systems....225

Sources of Data: How Is Data Created?....226

Source Systems: Main Ideas....227

Files and Unstructured Data....227

APIs....227

Application Databases (OLTP systems)....228

Online Analytical Processing System....230

Change Data Capture....231

Logs....231

Database Logs....234

CRUD....234

Insert-Only....235

Messages and Streams....238

Types of Time....240

Source System Practical Details....241

Databases....241

APIs....255

Data Sharing....258

Third-Party Data Sources....258

Message Queues and Event-Streaming Platforms....259

Whom You’ll Work With....264

Undercurrents and Their Impact on Source Systems....266

Security....267

Data Management....267

DataOps....268

Data Architecture....270

Orchestration....271

Software Engineering....272

Conclusion....273

Additional Resources....274

6. Storage....276

Raw Ingredients of Data Storage....279

Magnetic Disk Drive....279

Solid-State Drive....283

Random Access Memory....284

Networking and CPU....286

Serialization....286

Compression....287

Caching....288

Data Storage Systems....292

Single Machine Versus Distributed Storage....292

Eventual Versus Strong Consistency....293

File Storage....295

Block Storage....298

Object Storage....303

Cache and Memory-Based Storage Systems....311

The Hadoop Distributed File System....312

Streaming Storage....313

Indexes, Partitioning, and Clustering....314

Data Engineering Storage Abstractions....316

The Data Warehouse....318

The Data Lake....318

The Data Lakehouse....319

Data Platforms....320

Stream-to-Batch Storage Architecture....320

Big Ideas and Trends in Storage....321

Data Catalog....321

Data Sharing....322

Schema....323

Separation of Compute from Storage....323

Data Storage Lifecycle and Data Retention....328

Single-Tenant Versus Multitenant Storage....332

Whom You’ll Work With....334

Undercurrents....335

Security....335

Data Management....335

DataOps....336

Data Architecture....337

Orchestration....337

Software Engineering....338

Conclusion....338

Additional Resources....338

7. Ingestion....340

What Is Data Ingestion?....340

Key Engineering Considerations for the Ingestion Phase....343

Bounded Versus Unbounded....344

Frequency....346

Synchronous Versus Asynchronous Ingestion....347

Serialization and Deserialization....349

Throughput and Scalability....349

Reliability and Durability....350

Payload....351

Push Versus Pull Versus Poll Patterns....355

Batch Ingestion Considerations....356

Snapshot or Differential Extraction....357

File-Based Export and Ingestion....358

ETL Versus ELT....358

Inserts, Updates, and Batch Size....359

Data Migration....359

Message and Stream Ingestion Considerations....360

Schema Evolution....360

Late-Arriving Data....361

Ordering and Multiple Delivery....361

Replay....361

Time to Live....362

Message Size....362

Error Handling and Dead-Letter Queues....363

Consumer Pull and Push....363

Location....364

Ways to Ingest Data....364

Direct Database Connection....364

Change Data Capture....366

APIs....369

Message Queues and Event-Streaming Platforms....370

Managed Data Connectors....371

Moving Data with Object Storage....372

EDI....372

Databases and File Export....373

Practical Issues with Common File Formats....373

Shell....374

SSH....375

SFTP and SCP....375

Webhooks....376

Web Interface....377

Web Scraping....377

Transfer Appliances for Data Migration....378

Data Sharing....379

Whom You’ll Work With....379

Upstream Stakeholders....379

Downstream Stakeholders....380

Undercurrents....381

Security....381

Data Management....382

DataOps....384

Orchestration....387

Software Engineering....387

Conclusion....388

Additional Resources....388

8. Queries, Modeling, and Transformation....390

Queries....390

What Is a Query?....391

The Life of a Query....392

The Query Optimizer....392

Improving Query Performance....392

Queries on Streaming Data....396

Data Modeling....400

What Is a Data Model?....401

Conceptual, Logical, and Physical Data Models....401

Normalization....402

Techniques for Modeling Batch Analytical Data....412

Modeling Streaming Data....437

Transformations....438

Batch Transformations....438

Materialized Views, Federation, and Query Virtualization....448

Streaming Transformations and Processing....450

Whom You’ll Work With....452

Upstream Stakeholders....452

Downstream Stakeholders....452

Undercurrents....452

Security....452

Data Management....453

DataOps....453

Data Architecture....454

Orchestration....454

Software Engineering....454

Conclusion....455

Additional Resources....455

9. Serving Data for Analytics, Machine Learning, and Reverse ETL....458

General Considerations for Serving Data....459

Trust....459

What’s the Use Case, and Who’s the User?....461

Data Products....462

Self-Service or Not?....463

Data Definitions and Logic....465

Data Mesh....466

Analytics....466

Business Analytics....467

Operational Analytics....469

Embedded Analytics....472

Machine Learning....473

What a Data Engineer Should Know About ML....474

Ways to Serve Data for Analytics and ML....476

File Exchange....476

Databases....478

Streaming Systems....479

Query Federation....480

Data Sharing....481

Semantic and Metrics Layers....482

Serving Data in Notebooks....483

Reverse ETL....486

Ways to Serve Data with Reverse ETL....488

Whom You’ll Work With....488

Undercurrents....489

Security....490

Data Management....491

DataOps....492

Data Architecture....493

Orchestration....493

Software Engineering....494

Conclusion....495

Additional Resources....496

III. Security, Privacy, and the Future of Data Engineering....498

10. Security and Privacy....499

People....500

The Power of Negative Thinking....500

Always Be Paranoid....501

Processes....501

Security Theater Versus Security Habit....501

Active Security....502

The Principle of Least Privilege....502

Shared Responsibility in the Cloud....503

Always Back Up Your Data....503

An Example Security Policy....504

Technology....506

Patch and Update Systems....507

Encryption....507

Logging, Monitoring, and Alerting....508

Network Access....509

Security for Low-Level Data Engineering....510

Conclusion....511

Additional Resources....512

11. The Future of Data Engineering....513

The Data Engineering Lifecycle Isn’t Going Away....514

The Decline of Complexity and the Rise of Easy-to-Use Data Tools....514

The Cloud-Scale Data OS and Improved Interoperability....516

“Enterprisey” Data Engineering....518

Titles and Responsibilities Will Morph.......519

Moving Beyond the Modern Data Stack, Toward the Live Data Stack....520

The Live Data Stack....521

Streaming Pipelines and Real-Time Analytical Databases....522

The Fusion of Data with Applications....524

The Tight Feedback Between Applications and ML....525

Dark Matter Data and the Rise of...Spreadsheets?!....525

Conclusion....526

A. Serialization and Compression Technical Details....529

Serialization Formats....529

Row-Based Serialization....529

Columnar Serialization....531

Hybrid Serialization....534

Database Storage Engines....535

Compression: gzip, bzip2, Snappy, etc.....535

B. Cloud Networking....537

Cloud Network Topology....537

Data Egress Charges....537

Availability Zones....537

Regions....538

GCP-Specific Networking and Multiregional Redundancy....539

Direct Network Connections to the Clouds....540

CDNs....540

The Future of Data Egress Fees....541

Index....542

About the Authors....543

Data engineering has grown rapidly in the past decade, leaving many software engineers, data scientists, and analysts looking for a comprehensive view of this practice. With this practical book, you'll learn how to plan and build systems to serve the needs of your organization and customers by evaluating the best technologies available through the framework of the data engineering lifecycle.

Authors Joe Reis and Matt Housley walk you through the data engineering lifecycle and show you how to stitch together a variety of cloud technologies to serve the needs of downstream data consumers. You'll understand how to apply the concepts of data generation, ingestion, orchestration, transformation, storage, and governance that are critical in any data environment regardless of the underlying technology.

This book will help you:

  • Get a concise overview of the entire data engineering landscape
  • Assess data engineering problems using an end-to-end framework of best practices
  • Cut through marketing hype when choosing data technologies, architecture, and processes
  • Use the data engineering lifecycle to design and build a robust architecture
  • Incorporate data governance and security across the data engineering lifecycle
