Cover....1
Copyright....6
Table of Contents....11
Preface....19
Who Should Read This Book?....20
What's New in the Second Edition?....21
References and Further Reading....21
Conventions Used in This Book....22
O'Reilly Online Learning....22
How to Contact Us....23
Acknowledgments....23
Chapter 1. Trade-Offs in Data Systems Architecture....25
Operational Versus Analytical Systems....27
Characterizing Transaction Processing and Analytics....29
Data Warehousing....31
Systems of Record and Derived Data....34
Cloud Versus Self-Hosting....36
Pros and Cons of Cloud Services....37
Cloud Native System Architecture....38
Operations in the Cloud Era....41
Distributed Versus Single-Node Systems....43
Problems with Distributed Systems....44
Microservices and Serverless....45
Cloud Computing Versus Supercomputing....47
Data Systems, Law, and Society....48
Summary....49
Chapter 2. Defining Nonfunctional Requirements....57
Case Study: Social Network Home Timelines....58
Representing Users, Posts, and Follows....58
Materializing and Updating Timelines....59
Describing Performance....61
Latency and Response Time....62
Average, Median, and Percentiles....64
Use of Response Time Metrics....65
Reliability and Fault Tolerance....67
Fault Tolerance....67
Hardware and Software Faults....68
Humans and Reliability....71
Scalability....73
Understanding Load....74
Shared-Memory, Shared-Disk, and Shared-Nothing Architectures....75
Principles for Scalability....76
Maintainability....76
Operability: Making Life Easy for Operations....77
Simplicity: Managing Complexity....78
Evolvability: Making Change Easy....79
Summary....80
Chapter 3. Data Models and Query Languages....89
Relational Versus Document Models....91
The Object-Relational Mismatch....92
Normalization, Denormalization, and Joins....96
Many-to-One and Many-to-Many Relationships....99
Stars and Snowflakes: Schemas for Analytics....101
When to Use Which Model....104
Graph-Like Data Models....108
Property Graphs....110
The Cypher Query Language....112
Graph Queries in SQL....114
Triple Stores and SPARQL....116
Datalog: Recursive Relational Queries....120
GraphQL....122
Event Sourcing and CQRS....125
DataFrames, Matrices, and Arrays....129
Summary....131
Chapter 4. Storage and Retrieval....139
Storage and Indexing for OLTP....140
Log-Structured Storage....142
B-Trees....149
Comparing B-Trees and LSM-Trees....153
Multicolumn and Secondary Indexes....156
Storing Values Within the Index....157
Keeping Everything in Memory....157
Data Storage for Analytics....158
Cloud Data Warehouses....159
Column-Oriented Storage....160
Query Execution: Compilation and Vectorization....166
Materialized Views and Data Cubes....167
Multidimensional and Full-Text Indexes....169
Full-Text Search....170
Vector Embeddings....171
Summary....174
Chapter 5. Encoding and Evolution....185
Formats for Encoding Data....187
Language-Specific Formats....188
JSON, XML, and Binary Variants....189
Protocol Buffers....193
Avro....196
The Merits of Schemas....201
Modes of Dataflow....202
Dataflow Through Databases....202
Dataflow Through Services: REST and RPC....204
Durable Execution and Workflows....211
Event-Driven Architectures....213
Summary....215
Chapter 6. Replication....221
Single-Leader Replication....222
Synchronous Versus Asynchronous Replication....224
Setting Up New Followers....225
Handling Node Outages....228
Implementation of Replication Logs....230
Problems with Replication Lag....233
Solutions for Replication Lag....238
Multi-Leader Replication....239
Geographically Distributed Operation....240
Sync Engines and Local-First Software....244
Dealing with Conflicting Writes....246
Leaderless Replication....253
Writing to the Database When a Node Is Down....253
Single-Leader Versus Leaderless Replication Performance....259
Multi-Region Operation....260
Detecting Concurrent Writes....261
Summary....267
Chapter 7. Sharding....275
Pros and Cons of Sharding....277
Sharding for Multitenancy....278
Sharding of Key-Value Data....279
Sharding by Key Range....280
Sharding by Hash of Key....282
Skewed Workloads and Relieving Hot Spots....287
Operations: Automatic Versus Manual Rebalancing....288
Request Routing....289
Sharding and Secondary Indexes....292
Local Secondary Indexes....292
Global Secondary Indexes....294
Summary....295
Chapter 8. Transactions....301
What Exactly Is a Transaction?....302
The Meaning of ACID....303
Single-Object and Multi-Object Operations....308
Weak Isolation Levels....312
Read Committed....314
Snapshot Isolation and Repeatable Read....317
Preventing Lost Updates....323
Write Skew and Phantoms....327
Serializability....332
Actual Serial Execution....333
Two-Phase Locking....337
Serializable Snapshot Isolation....341
Distributed Transactions....347
Two-Phase Commit....348
Distributed Transactions Across Different Systems....352
Database-Internal Distributed Transactions....357
Exactly-Once Message Processing Revisited....358
Summary....359
Chapter 9. The Trouble with Distributed Systems....369
Faults and Partial Failures....370
Unreliable Networks....371
The Limitations of TCP....372
Network Faults in Practice....374
Fault Detection....375
Timeouts and Unbounded Delays....376
Synchronous Versus Asynchronous Networks....379
Unreliable Clocks....382
Monotonic Versus Time-of-Day Clocks....383
Clock Synchronization and Accuracy....384
Relying on Synchronized Clocks....386
Process Pauses....390
Knowledge, Truth, and Lies....395
The Majority Rules....396
Distributed Locks and Leases....397
Byzantine Faults....401
System Model and Reality....404
Formal Methods and Randomized Testing....408
Summary....412
Chapter 10. Consistency and Consensus....425
Linearizability....426
What Makes a System Linearizable?....428
Relying on Linearizability....432
Implementing Linearizable Systems....435
The Cost of Linearizability....437
ID Generators and Logical Clocks....441
Logical Clocks....444
Linearizable ID Generators....447
Consensus....449
The Many Faces of Consensus....451
Consensus in Practice....457
Coordination Services....461
Summary....464
Chapter 11. Batch Processing....475
Batch Processing with Unix Tools....478
Simple Log Analysis....478
Chain of Commands Versus Custom Program....480
Sorting Versus In-Memory Aggregation....480
Batch Processing in Distributed Systems....481
Distributed Filesystems....482
Object Stores....484
Distributed Job Orchestration....485
Batch Processing Models....490
MapReduce....490
Dataflow Engines....492
Shuffling Data....493
Joins and Grouping....495
Query Languages....497
DataFrames....499
Batch Use Cases....500
Extract–Transform–Load....500
Analytics....501
Machine Learning....502
Serving Derived Data....503
Summary....505
Chapter 12. Stream Processing....511
Transmitting Event Streams....512
Messaging Systems....513
Log-Based Message Brokers....519
Databases and Streams....524
Keeping Systems in Sync....525
Change Data Capture....527
State, Streams, and Immutability....532
Processing Streams....537
Uses of Stream Processing....538
Reasoning About Time....542
Stream Joins....547
Fault Tolerance....550
Summary....553
Chapter 13. A Philosophy of Streaming Systems....563
Data Integration....563
Combining Specialized Tools by Deriving Data....564
Batch and Stream Processing....568
Unbundling Databases....570
Composing Data Storage Technologies....571
Designing Applications Around Dataflow....575
Observing Derived State....579
Aiming for Correctness....585
The End-to-End Argument for Databases....586
Enforcing Constraints....590
Timeliness and Integrity....595
Trust, but Verify....599
Summary....603
Chapter 14. Doing the Right Thing....609
Predictive Analytics....610
Bias and Discrimination....610
Responsibility and Accountability....611
Feedback Loops....612
Privacy and Tracking....613
Surveillance....614
Consent and Freedom of Choice....615
Privacy and Use of Data....616
Data as Assets and Power....618
Remembering the Industrial Revolution....619
Legislation and Self-Regulation....620
Summary....621
Glossary....627
Index....633
About the Authors....672
Colophon....672
Data is at the center of many challenges in system design today. Difficult issues such as scalability, consistency, reliability, efficiency, and maintainability need to be resolved. In addition, there's an overwhelming variety of systems, including relational databases, NoSQL datastores, data warehouses, and data lakes. There are cloud services, on-premises services, and embedded databases. What are the right choices for your application? How do you make sense of all these buzzwords?
In this second edition, authors Martin Kleppmann and Chris Riccomini build on the foundation laid in the acclaimed first edition, integrating new technologies and emerging trends. You'll be guided through the maze of decisions and trade-offs involved in building a modern data system, learn how to choose the right tools for your needs, and understand the fundamentals of distributed systems.