The Cloud Data Lake: A Guide to Building Robust Cloud Data Architecture

Author: Rukmani Gopalan
Publication date: 2023
Publisher: O’Reilly Media, Inc.
Pages: 247
File size: 7.1 MB
File type: PDF
Added by: codelibs

Preface
1. Big Data—Beyond the Buzz
What Is Big Data?
Elastic Data Infrastructure—The Challenge
Cloud Computing Fundamentals
Cloud Computing Terminology
Value Proposition of the Cloud
Cloud Data Lake Architecture
Limitations of On-Premises Data Warehouse Solutions
What Is a Cloud Data Lake Architecture?
Benefits of a Cloud Data Lake Architecture
Defining Your Cloud Data Lake Journey
Summary
2. Big Data Architectures on the Cloud
Why Klodars Corporation Moves to the Cloud
Fundamentals of Cloud Data Lake Architectures
A Word on Variety of Data
Cloud Data Lake Storage
Big Data Analytics Engines
Cloud Data Warehouses
Modern Data Warehouse Architecture
Reference Architecture
Sample Use Case for a Modern Data Warehouse Architecture
Benefits and Challenges of Modern Data Warehouse Architecture
Data Lakehouse Architecture
Reference Architecture for the Data Lakehouse
Sample Use Case for Data Lakehouse Architecture
Benefits and Challenges of the Data Lakehouse Architecture
Data Warehouses and Unstructured Data
Data Mesh
Reference Architecture
Sample Use Case for a Data Mesh Architecture
Challenges and Benefits of a Data Mesh Architecture
What Is the Right Architecture for Me?
Know Your Customers
Know Your Business Drivers
Consider Your Growth and Future Scenarios
Design Considerations
Hybrid Approaches
Summary
3. Design Considerations for Your Data Lake
Setting Up the Cloud Data Lake Infrastructure
Identify Your Goals
Plan Your Architecture and Deliverables
Implement the Cloud Data Lake
Release and Operationalize
Organizing Data in Your Data Lake
A Day in the Life of Data
Data Lake Zones
Organization Mechanisms
Introduction to Data Governance
Actors Involved in Data Governance
Data Classification
Metadata Management, Data Catalog, and Data Sharing
Data Access Management
Data Quality and Observability
Data Governance at Klodars Corporation
Data Governance Wrap-Up
Manage Data Lake Costs
Demystifying Data Lake Costs on the Cloud
Data Lake Cost Strategy
Summary
4. Scalable Data Lakes
A Sneak Peek into Scalability
What Is Scalability?
Scale in Our Day-to-Day Life
Scalability in Data Lake Architectures
Internals of Data Lake Processing Systems
Data Copy Internals
ELT/ETL Processing Internals
A Note on Other Interactive Queries
Considerations for Scalable Data Lake Solutions
Pick the Right Cloud Offerings
Plan for Peak Capacity
Data Formats and Job Profile
Summary
5. Optimizing Cloud Data Lake Architectures for Performance
Basics of Measuring Performance
Goals and Metrics for Performance
Measuring Performance
Optimizing for Faster Performance
Cloud Data Lake Performance
SLAs, SLOs, and SLIs
Example: How Klodars Corporation Managed Its SLAs, SLOs, and SLIs
Drivers of Performance
Performance Drivers for a Copy Job
Performance Drivers for a Spark Job
Optimization Principles and Techniques for Performance Tuning
Data Formats
Data Organization and Partitioning
Choosing the Right Configurations on Apache Spark
Minimize Overheads with Data Transfer
Premium Offerings and Performance
The Case of Bigger Virtual Machines
The Case of Flash Storage
Summary
6. Deep Dive on Data Formats
Why Do We Need These Open Data Formats?
Why Do We Need to Store Tabular Data?
Why Is It a Problem to Store Tabular Data in a Cloud Data Lake Storage?
Delta Lake
Why Was Delta Lake Founded?
How Does Delta Lake Work?
When Do You Use Delta Lake?
Apache Iceberg
Why Was Apache Iceberg Founded?
How Does Apache Iceberg Work?
When Do You Use Apache Iceberg?
Apache Hudi
Why Was Apache Hudi Founded?
How Does Apache Hudi Work?
When Do You Use Apache Hudi?
Summary
7. Decision Framework for Your Architecture
Cloud Data Lake Assessment
Cloud Data Lake Assessment Questionnaire
Analysis for Your Cloud Data Lake Assessment
Starting from Scratch
Migrating an Existing Data Lake or Data Warehouse to the Cloud
Improving an Existing Cloud Data Lake
Phase 1 of Decision Framework: Assess
Understand Customer Requirements
Understand Opportunities for Improvement
Know Your Business Drivers
Complete the Assess Phase by Prioritizing the Requirements
Phase 2 of Decision Framework: Define
Finalize the Design Choices for the Cloud Data Lake
Plan Your Cloud Data Lake Project Deliverables
Phase 3 of Decision Framework: Implement
Phase 4 of Decision Framework: Operationalize
Summary
8. Six Lessons for a Data Informed Future
Lesson 1: Focus on the How and When, Not the If and Why, When It Comes to Cloud Data Lakes
Lesson 2: With Great Power Comes Great Responsibility—Data Is No Exception
Lesson 3: Customers Lead Technology, Not the Other Way Around
Lesson 4: Change Is Inevitable, so Be Prepared
Lesson 5: Build Empathy and Prioritize Ruthlessly
Lesson 6: Big Impact Does Not Happen Overnight
Summary
Appendix. Cloud Data Lake Decision Framework Template
Index

 More organizations than ever understand the importance of data lake architectures for deriving value from their data. Building a robust, scalable, and performant data lake remains a complex proposition, however, with a buffet of tools and options that need to work together to provide a seamless end-to-end pipeline from data to insights.

 This book provides a concise yet comprehensive overview of the setup, management, and governance of a cloud data lake. Author Rukmani Gopalan, a product management leader and data enthusiast, guides data architects and engineers through the major aspects of working with a cloud data lake, from design considerations and best practices to data format optimizations, performance optimization, cost management, and governance.

  • Learn the benefits of a cloud-based big data strategy for your organization

  • Get guidance and best practices for designing performant and scalable data lakes

  • Examine architecture and design choices, and data governance principles and strategies

  • Build a data strategy that scales as your organizational and business needs increase

  • Implement a scalable data lake in the cloud

  • Use cloud-based advanced analytics to gain more value from your data

