The Cloud Data Lake: A Guide to Building Robust Cloud Data Architecture

Author: Rukmani Gopalan
Publication date: 2023
Publisher: O’Reilly Media, Inc.
Pages: 247
File size: 2.9 MB
File type: PDF

Copyright

Table of Contents

Preface

Why I Wrote This Book

Who Should Read This Book?

Introducing Klodars Corporation

Navigating the Book

Conventions Used in This Book

O’Reilly Online Learning

How to Contact Us

Acknowledgments

Chapter 1. Big Data - Beyond the Buzz

What Is Big Data?

Elastic Data Infrastructure - The Challenge

Cloud Computing Fundamentals

Cloud Computing Terminology

Value Proposition of the Cloud

Cloud Data Lake Architecture

Limitations of On-Premises Data Warehouse Solutions

What Is a Cloud Data Lake Architecture?

Benefits of a Cloud Data Lake Architecture

Defining Your Cloud Data Lake Journey

Summary

Chapter 2. Big Data Architectures on the Cloud

Why Klodars Corporation Moves to the Cloud

Fundamentals of Cloud Data Lake Architectures

A Word on Variety of Data

Cloud Data Lake Storage

Big Data Analytics Engines

Cloud Data Warehouses

Modern Data Warehouse Architecture

Reference Architecture

Sample Use Case for a Modern Data Warehouse Architecture

Benefits and Challenges of Modern Data Warehouse Architecture

Data Lakehouse Architecture

Reference Architecture for the Data Lakehouse

Sample Use Case for Data Lakehouse Architecture

Benefits and Challenges of the Data Lakehouse Architecture

Data Warehouses and Unstructured Data

Data Mesh

Reference Architecture

Sample Use Case for a Data Mesh Architecture

Challenges and Benefits of a Data Mesh Architecture

What Is the Right Architecture for Me?

Know Your Customers

Know Your Business Drivers

Consider Your Growth and Future Scenarios

Design Considerations

Hybrid Approaches

Summary

Chapter 3. Design Considerations for Your Data Lake

Setting Up the Cloud Data Lake Infrastructure

Identify Your Goals

Plan Your Architecture and Deliverables

Implement the Cloud Data Lake

Release and Operationalize

Organizing Data in Your Data Lake

A Day in the Life of Data

Data Lake Zones

Organization Mechanisms

Introduction to Data Governance

Actors Involved in Data Governance

Data Classification

Metadata Management, Data Catalog, and Data Sharing

Data Access Management

Data Quality and Observability

Data Governance at Klodars Corporation

Data Governance Wrap-Up

Manage Data Lake Costs

Demystifying Data Lake Costs on the Cloud

Data Lake Cost Strategy

Summary

Chapter 4. Scalable Data Lakes

A Sneak Peek into Scalability

What Is Scalability?

Scale in Our Day-to-Day Life

Scalability in Data Lake Architectures

Internals of Data Lake Processing Systems

Data Copy Internals

ELT/ETL Processing Internals

A Note on Other Interactive Queries

Considerations for Scalable Data Lake Solutions

Pick the Right Cloud Offerings

Plan for Peak Capacity

Data Formats and Job Profile

Summary

Chapter 5. Optimizing Cloud Data Lake Architectures for Performance

Basics of Measuring Performance

Goals and Metrics for Performance

Measuring Performance

Optimizing for Faster Performance

Cloud Data Lake Performance

SLAs, SLOs, and SLIs

Example: How Klodars Corporation Managed Its SLAs, SLOs, and SLIs

Drivers of Performance

Performance Drivers for a Copy Job

Performance Drivers for a Spark Job

Optimization Principles and Techniques for Performance Tuning

Data Formats

Data Organization and Partitioning

Choosing the Right Configurations on Apache Spark

Minimize Overheads with Data Transfer

Premium Offerings and Performance

The Case of Bigger Virtual Machines

The Case of Flash Storage

Summary

Chapter 6. Deep Dive on Data Formats

Why Do We Need These Open Data Formats?

Why Do We Need to Store Tabular Data?

Why Is It a Problem to Store Tabular Data in a Cloud Data Lake Storage?

Delta Lake

Why Was Delta Lake Founded?

How Does Delta Lake Work?

When Do You Use Delta Lake?

Apache Iceberg

Why Was Apache Iceberg Founded?

How Does Apache Iceberg Work?

When Do You Use Apache Iceberg?

Apache Hudi

Why Was Apache Hudi Founded?

How Does Apache Hudi Work?

When Do You Use Apache Hudi?

Summary

Chapter 7. Decision Framework for Your Architecture

Cloud Data Lake Assessment

Cloud Data Lake Assessment Questionnaire

Analysis for Your Cloud Data Lake Assessment

Starting from Scratch

Migrating an Existing Data Lake or Data Warehouse to the Cloud

Improving an Existing Cloud Data Lake

Phase 1 of Decision Framework: Assess

Understand Customer Requirements

Understand Opportunities for Improvement

Know Your Business Drivers

Complete the Assess Phase by Prioritizing the Requirements

Phase 2 of Decision Framework: Define

Finalize the Design Choices for the Cloud Data Lake

Plan Your Cloud Data Lake Project Deliverables

Phase 3 of Decision Framework: Implement

Phase 4 of Decision Framework: Operationalize

Summary

Chapter 8. Six Lessons for a Data Informed Future

Lesson 1: Focus on the How and When, Not the If and Why, When It Comes to Cloud Data Lakes

Lesson 2: With Great Power Comes Great Responsibility - Data Is No Exception

Lesson 3: Customers Lead Technology, Not the Other Way Around

Lesson 4: Change Is Inevitable, so Be Prepared

Lesson 5: Build Empathy and Prioritize Ruthlessly

Lesson 6: Big Impact Does Not Happen Overnight

Summary

Appendix A. Cloud Data Lake Decision Framework Template

Phase 1: Assess Framework

Phase 2: Define Framework

Planning the Cloud Data Lake Deliverables

Phase 3: Implement Framework

Index

About the Author

Colophon

More organizations than ever understand the importance of data lake architectures for deriving value from their data. Building a robust, scalable, and performant data lake remains a complex proposition, however, with a buffet of tools and options that need to work together to provide a seamless end-to-end pipeline from data to insights.

This book provides a concise yet comprehensive overview of the setup, management, and governance of a cloud data lake. Author Rukmani Gopalan, a product management leader and data enthusiast, guides data architects and engineers through the major aspects of working with a cloud data lake, from design considerations and best practices to data format optimizations, performance optimization, cost management, and governance.

  • Learn the benefits of a cloud-based big data strategy for your organization
  • Get guidance and best practices for designing performant and scalable data lakes
  • Examine architecture and design choices, and data governance principles and strategies
  • Build a data strategy that scales as your organizational and business needs increase
  • Implement a scalable data lake in the cloud
  • Use cloud-based advanced analytics to gain more value from your data
