Copyright....6
Table of Contents....7
Preface....11
Why I Wrote This Book....12
Who Should Read This Book?....13
Introducing Klodars Corporation....13
Navigating the Book....14
Conventions Used in This Book....15
O'Reilly Online Learning....16
How to Contact Us....16
Acknowledgments....17
Chapter 1. Big Data: Beyond the Buzz....19
What Is Big Data?....20
Elastic Data Infrastructure: The Challenge....26
Cloud Computing Fundamentals....26
Cloud Computing Terminology....26
Value Proposition of the Cloud....28
Cloud Data Lake Architecture....30
Limitations of On-Premises Data Warehouse Solutions....31
What Is a Cloud Data Lake Architecture?....32
Benefits of a Cloud Data Lake Architecture....33
Defining Your Cloud Data Lake Journey....34
Summary....37
Chapter 2. Big Data Architectures on the Cloud....39
Why Klodars Corporation Moves to the Cloud....40
Fundamentals of Cloud Data Lake Architectures....41
A Word on Variety of Data....41
Cloud Data Lake Storage....44
Big Data Analytics Engines....46
Cloud Data Warehouses....52
Modern Data Warehouse Architecture....54
Reference Architecture....54
Sample Use Case for a Modern Data Warehouse Architecture....56
Benefits and Challenges of Modern Data Warehouse Architecture....58
Data Lakehouse Architecture....58
Reference Architecture for the Data Lakehouse....59
Sample Use Case for Data Lakehouse Architecture....66
Benefits and Challenges of the Data Lakehouse Architecture....67
Data Warehouses and Unstructured Data....69
Data Mesh....69
Reference Architecture....71
Sample Use Case for a Data Mesh Architecture....72
Challenges and Benefits of a Data Mesh Architecture....73
What Is the Right Architecture for Me?....74
Know Your Customers....74
Know Your Business Drivers....75
Consider Your Growth and Future Scenarios....76
Design Considerations....76
Hybrid Approaches....78
Summary....79
Chapter 3. Design Considerations for Your Data Lake....81
Setting Up the Cloud Data Lake Infrastructure....81
Identify Your Goals....82
Plan Your Architecture and Deliverables....85
Implement the Cloud Data Lake....89
Release and Operationalize....90
Organizing Data in Your Data Lake....90
A Day in the Life of Data....91
Data Lake Zones....91
Organization Mechanisms....95
Introduction to Data Governance....96
Actors Involved in Data Governance....97
Data Classification....99
Metadata Management, Data Catalog, and Data Sharing....100
Data Access Management....101
Data Quality and Observability....103
Data Governance at Klodars Corporation....105
Data Governance Wrap-Up....106
Manage Data Lake Costs....107
Demystifying Data Lake Costs on the Cloud....108
Data Lake Cost Strategy....110
Summary....113
Chapter 4. Scalable Data Lakes....115
A Sneak Peek into Scalability....115
What Is Scalability?....116
Scale in Our Day-to-Day Life....116
Scalability in Data Lake Architectures....119
Internals of Data Lake Processing Systems....122
Data Copy Internals....124
ELT/ETL Processing Internals....126
A Note on Other Interactive Queries....129
Considerations for Scalable Data Lake Solutions....129
Pick the Right Cloud Offerings....129
Plan for Peak Capacity....133
Data Formats and Job Profile....135
Summary....135
Chapter 5. Optimizing Cloud Data Lake Architectures for Performance....137
Basics of Measuring Performance....137
Goals and Metrics for Performance....139
Measuring Performance....140
Optimizing for Faster Performance....141
Cloud Data Lake Performance....143
SLAs, SLOs, and SLIs....143
Example: How Klodars Corporation Managed Its SLAs, SLOs, and SLIs....144
Drivers of Performance....146
Performance Drivers for a Copy Job....146
Performance Drivers for a Spark Job....148
Optimization Principles and Techniques for Performance Tuning....152
Data Formats....152
Data Organization and Partitioning....158
Choosing the Right Configurations on Apache Spark....160
Minimize Overheads with Data Transfer....163
Premium Offerings and Performance....164
The Case of Bigger Virtual Machines....164
The Case of Flash Storage....164
Summary....165
Chapter 6. Deep Dive on Data Formats....167
Why Do We Need These Open Data Formats?....167
Why Do We Need to Store Tabular Data?....168
Why Is It a Problem to Store Tabular Data in a Cloud Data Lake Storage?....169
Delta Lake....170
Why Was Delta Lake Founded?....170
How Does Delta Lake Work?....173
When Do You Use Delta Lake?....175
Apache Iceberg....175
Why Was Apache Iceberg Founded?....175
How Does Apache Iceberg Work?....177
When Do You Use Apache Iceberg?....179
Apache Hudi....180
Why Was Apache Hudi Founded?....181
How Does Apache Hudi Work?....182
When Do You Use Apache Hudi?....185
Summary....186
Chapter 7. Decision Framework for Your Architecture....187
Cloud Data Lake Assessment....188
Cloud Data Lake Assessment Questionnaire....188
Analysis for Your Cloud Data Lake Assessment....190
Starting from Scratch....191
Migrating an Existing Data Lake or Data Warehouse to the Cloud....191
Improving an Existing Cloud Data Lake....192
Phase 1 of Decision Framework: Assess....193
Understand Customer Requirements....194
Understand Opportunities for Improvement....195
Know Your Business Drivers....196
Complete the Assess Phase by Prioritizing the Requirements....197
Phase 2 of Decision Framework: Define....198
Finalize the Design Choices for the Cloud Data Lake....200
Plan Your Cloud Data Lake Project Deliverables....204
Phase 3 of Decision Framework: Implement....205
Phase 4 of Decision Framework: Operationalize....208
Summary....208
Chapter 8. Six Lessons for a Data Informed Future....209
Lesson 1: Focus on the How and When, Not the If and Why, When It Comes to Cloud Data Lakes....210
Lesson 2: With Great Power Comes Great Responsibility: Data Is No Exception....211
Lesson 3: Customers Lead Technology, Not the Other Way Around....213
Lesson 4: Change Is Inevitable, so Be Prepared....214
Lesson 5: Build Empathy and Prioritize Ruthlessly....215
Lesson 6: Big Impact Does Not Happen Overnight....216
Summary....217
Appendix A. Cloud Data Lake Decision Framework Template....219
Phase 1: Assess Framework....219
Phase 2: Define Framework....221
Planning the Cloud Data Lake Deliverables....222
Phase 3: Implement Framework....225
Index....231
About the Author....246
Colophon....246
More organizations than ever understand the importance of data lake architectures for deriving value from their data. Building a robust, scalable, and performant data lake remains a complex proposition, however, with a buffet of tools and options that need to work together to provide a seamless end-to-end pipeline from data to insights.
This book provides a concise yet comprehensive overview of the setup, management, and governance of a cloud data lake. Author Rukmani Gopalan, a product management leader and data enthusiast, guides data architects and engineers through the major aspects of working with a cloud data lake, from design considerations and best practices to data format optimization, performance tuning, cost management, and governance.