foreword xvii
preface xxi
acknowledgments xxiii
about this book xxv
about the author xxviii
about the cover illustration xxix
Part 1 1
1 A walkthrough of system design concepts 3
1.1 It is a discussion about tradeoffs 4
1.2 Should you read this book? 4
1.3 Overview of this book 5
1.4 Prelude—A brief discussion of scaling the various services of a system 6
The beginning—A small initial deployment of our app 6
Scaling with GeoDNS 7
Adding a caching service 8
Content Distribution Network (CDN) 9
A brief discussion of horizontal scalability and cluster management, continuous integration (CI) and continuous deployment (CD) 10
Functional partitioning and centralization of cross-cutting concerns 13
Batch and streaming extract, transform, and load (ETL) 17
Other common services 18
Cloud vs. bare metal 19
Serverless—Function as a Service (FaaS) 22
Conclusion—Scaling backend services 23
2 A typical system design interview flow 24
2.1 Clarify requirements and discuss tradeoffs 26
2.2 Draft the API specification 28
Common API endpoints 28
2.3 Connections and processing between users and data 28
2.4 Design the data model 29
Example of the disadvantages of multiple services sharing databases 30
A possible technique to prevent concurrent user update conflicts 31
2.5 Logging, monitoring, and alerting 34
The importance of monitoring 34
Observability 34
Responding to alerts 36
Application-level logging tools 37
Streaming and batch audit of data quality 39
Anomaly detection to detect data anomalies 39
Silent errors and auditing 40
Further reading on observability 40
2.6 Search bar 40
Introduction 40
Search bar implementation with Elasticsearch 41
Elasticsearch index and ingestion 42
Using Elasticsearch in place of SQL 43
Implementing search in our services 44
Further reading on search 44
2.7 Other discussions 44
Maintaining and extending the application 44
Supporting other types of users 45
Alternative architectural decisions 45
Usability and feedback 45
Edge cases and new constraints 46
Cloud native concepts 47
2.8 Post-interview reflection and assessment 47
Write your reflection as soon as possible after the interview 47
Writing your assessment 49
Details you didn’t mention 49
Interview feedback 50
2.9 Interviewing the company 51
3 Non-functional requirements 54
3.1 Scalability 56
Stateless and stateful services 57
Basic load balancer concepts 57
3.2 Availability 59
3.3 Fault-tolerance 60
Replication and redundancy 60
Forward error correction (FEC) and error correction code (ECC) 61
Circuit breaker 61
Exponential backoff and retry 62
Caching responses of other services 62
Checkpointing 62
Dead letter queue 62
Logging and periodic auditing 63
Bulkhead 63
Fallback pattern 64
3.4 Performance/latency and throughput 65
3.5 Consistency 66
Full mesh 67
Coordination service 68
Distributed cache 69
Gossip protocol 70
Random Leader Selection 70
3.6 Accuracy 70
3.7 Complexity and maintainability 71
Continuous deployment (CD) 72
3.8 Cost 72
3.9 Security 73
3.10 Privacy 73
External vs. internal services 74
3.11 Cloud native 75
3.12 Further reading 75
4 Scaling databases 77
4.1 Brief prelude on storage services 77
4.2 When to use vs. avoid databases 79
4.3 Replication 79
Distributing replicas 80
Single-leader replication 80
Multi-leader replication 84
Leaderless replication 85
HDFS replication 85
Further reading 87
4.4 Scaling storage capacity with sharded databases 87
Sharded RDBMS 88
4.5 Aggregating events 88
Single-tier aggregation 89
Multi-tier aggregation 89
Partitioning 90
Handling a large key space 91
Replication and fault-tolerance 92
4.6 Batch and streaming ETL 93
A simple batch ETL pipeline 93
Messaging terminology 95
Kafka vs. RabbitMQ 96
Lambda architecture 98
4.7 Denormalization 98
4.8 Caching 99
Read strategies 100
Write strategies 101
4.9 Caching as a separate service 103
4.10 Examples of different kinds of data to cache and how to cache them 103
4.11 Cache invalidation 104
Browser cache invalidation 105
Cache invalidation in caching services 105
4.12 Cache warming 106
4.13 Further reading 107
Caching references 107
5 Distributed transactions 109
5.1 Event Driven Architecture (EDA) 110
5.2 Event sourcing 111
5.3 Change Data Capture (CDC) 112
5.4 Comparison of event sourcing and CDC 113
5.5 Transaction supervisor 114
5.6 Saga 115
Choreography 115
Orchestration 117
Comparison 119
5.7 Other transaction types 120
5.8 Further reading 120
6 Common services for functional partitioning 122
6.1 Common functionalities of various services 123
Security 123
Error-checking 124
Performance and availability 124
Logging and analytics 124
6.2 Service mesh / sidecar pattern 125
6.3 Metadata service 126
6.4 Service discovery 127
6.5 Functional partitioning and various frameworks 128
Basic system design of an app 128
Purposes of a web server app 129
Web and mobile frameworks 130
6.6 Library vs. service 134
Language specific vs. technology-agnostic 135
Predictability of latency 136
Predictability and reproducibility of behavior 136
Scaling considerations for libraries 136
Other considerations 137
6.7 Common API paradigms 137
The Open Systems Interconnection (OSI) model 137
REST 138
RPC (Remote Procedure Call) 140
GraphQL 141
WebSocket 142
Comparison 142
Part 2 145
7 Design Craigslist 147
7.1 User stories and requirements 148
7.2 API 149
7.3 SQL database schema 150
7.4 Initial high-level architecture 150
7.5 A monolith architecture 151
7.6 Using a SQL database and object store 153
7.7 Migrations are troublesome 153
7.8 Writing and reading posts 156
7.9 Functional partitioning 158
7.10 Caching 159
7.11 CDN 160
7.12 Scaling reads with a SQL cluster 160
7.13 Scaling write throughput 160
7.14 Email service 161
7.15 Search 162
7.16 Removing old posts 162
7.17 Monitoring and alerting 163
7.18 Summary of our architecture discussion so far 163
7.19 Other possible discussion topics 164
Reporting posts 164
Graceful degradation 164
Complexity 164
Item categories/tags 166
Analytics and recommendations 166
A/B testing 167
Subscriptions and saved searches 167
Allow duplicate requests to the search service 168
Avoid duplicate requests to the search service 168
Rate limiting 169
Large number of posts 169
Local regulations 169
8 Design a rate-limiting service 171
8.1 Alternatives to a rate-limiting service, and why they are infeasible 172
8.2 When not to do rate limiting 174
8.3 Functional requirements 174
8.4 Non-functional requirements 175
Scalability 175
Performance 175
Complexity 175
Security and privacy 176
Availability and fault-tolerance 176
Accuracy 176
Consistency 176
8.5 Discuss user stories and required service components 177
8.6 High-level architecture 177
8.7 Stateful approach/sharding 180
8.8 Storing all counts in every host 182
High-level architecture 182
Synchronizing counts 185
8.9 Rate-limiting algorithms 187
Token bucket 188
Leaky bucket 189
Fixed window counter 190
Sliding window log 192
Sliding window counter 193
8.10 Employing a sidecar pattern 193
8.11 Logging, monitoring, and alerting 193
8.12 Providing functionality in a client library 194
8.13 Further reading 195
9 Design a notification/alerting service 196
9.1 Functional requirements 196
Not for uptime monitoring 197
Users and data 197
Recipient channels 198
Templates 198
Trigger conditions 199
Manage subscribers, sender groups, and recipient groups 199
User features 199
Analytics 200
9.2 Non-functional requirements 200
9.3 Initial high-level architecture 200
9.4 Object store: Configuring and sending notifications 205
9.5 Notification templates 207
Notification template service 207
Additional features 209
9.6 Scheduled notifications 210
9.7 Notification addressee groups 212
9.8 Unsubscribe requests 215
9.9 Handling failed deliveries 216
9.10 Client-side considerations regarding duplicate notifications 218
9.11 Priority 218
9.12 Search 219
9.13 Monitoring and alerting 219
9.14 Availability monitoring and alerting on the notification/alerting service 220
9.15 Other possible discussion topics 220
9.16 Final notes 221
10 Design a database batch auditing service 223
10.1 Why is auditing necessary? 224
10.2 Defining a validation with a conditional statement on a SQL query’s result 226
10.3 A simple SQL batch auditing service 229
An audit script 229
An audit service 230
10.4 Requirements 232
10.5 High-level architecture 233
Running a batch auditing job 234
Handling alerts 235
10.6 Constraints on database queries 237
Limit query execution time 238
Check the query strings before submission 238
Users should be trained early 239
10.7 Prevent too many simultaneous queries 239
10.8 Other users of database schema metadata 240
10.9 Auditing a data pipeline 241
10.10 Logging, monitoring, and alerting 242
10.11 Other possible types of audits 242
Cross data center consistency audits 242
Compare upstream and downstream data 243
10.12 Other possible discussion topics 243
10.13 References 243
11 Autocomplete/typeahead 245
11.1 Possible uses of autocomplete 246
11.2 Search vs. autocomplete 246
11.3 Functional requirements 248
Scope of our autocomplete service 248
Some UX (user experience) details 248
Considering search history 249
Content moderation and fairness 250
11.4 Non-functional requirements 250
11.5 Planning the high-level architecture 251
11.6 Weighted trie approach and initial high-level architecture 252
11.7 Detailed implementation 253
Each step should be an independent task 255
Fetch relevant logs from Elasticsearch to HDFS 255
Split the search strings into words, and other simple operations 255
Filter out inappropriate words 256
Fuzzy matching and spelling correction 258
Count the words 259
Filter for appropriate words 259
Managing new popular unknown words 259
Generate and deliver the weighted trie 259
11.8 Sampling approach 260
11.9 Handling storage requirements 261
11.10 Handling phrases instead of single words 263
Maximum length of autocomplete suggestions 263
Preventing inappropriate suggestions 263
11.11 Logging, monitoring, and alerting 264
11.12 Other considerations and further discussion 264
12 Design Flickr 266
12.1 User stories and functional requirements 267
12.2 Non-functional requirements 267
12.3 High-level architecture 269
12.4 SQL schema 270
12.5 Organizing directories and files on the CDN 271
12.6 Uploading a photo 272
Generate thumbnails on the client 272
Generate thumbnails on the backend 276
Implementing both server-side and client-side generation 281
12.7 Downloading images and data 282
Downloading pages of thumbnails 282
12.8 Monitoring and alerting 283
12.9 Some other services 283
Premium features 283
Payments and taxes service 283
Censorship/content moderation 283
Advertising 284
Personalization 284
12.10 Other possible discussions 284
13 Design a Content Distribution Network (CDN) 287
13.1 Advantages and disadvantages of a CDN 288
Advantages of using a CDN 288
Disadvantages of using a CDN 289
Example of an unexpected problem from using a CDN to serve images 290
13.2 Requirements 291
13.3 CDN authentication and authorization 291
Steps in CDN authentication and authorization 292
Key rotation 294
13.4 High-level architecture 294
13.5 Storage service 295
In-cluster 296
Out-cluster 296
Evaluation 296
13.6 Common operations 297
Reads–Downloads 297
Writes–Directory creation, file upload, and file deletion 301
13.7 Cache invalidation 306
13.8 Logging, monitoring, and alerting 306
13.9 Other possible discussions on downloading media files 306
14 Design a text messaging app 308
14.1 Requirements 309
14.2 Initial thoughts 310
14.3 Initial high-level design 310
14.4 Connection service 312
Making connections 312
Sender blocking 312
14.5 Sender service 316
Sending a message 316
Other discussions 319
14.6 Message service 320
14.7 Message sending service 321
Introduction 321
High-level architecture 322
Steps in sending a message 324
Some questions 325
Improving availability 325
14.8 Search 326
14.9 Logging, monitoring, and alerting 326
14.10 Other possible discussion points 327
15 Design Airbnb 329
15.1 Requirements 330
15.2 Design decisions 333
Replication 334
Data models for room availability 334
Handling overlapping bookings 335
Randomize search results 335
Lock rooms during booking flow 335
15.3 High-level architecture 335
15.4 Functional partitioning 337
15.5 Create or update a listing 337
15.6 Approval service 339
15.7 Booking service 345
15.8 Availability service 349
15.9 Logging, monitoring, and alerting 350
15.10 Other possible discussion points 351
Handling regulations 352
16 Design a news feed 354
16.1 Requirements 355
16.2 High-level architecture 356
16.3 Prepare feed in advance 360
16.4 Validation and content moderation 364
Changing posts on users’ devices 365
Tagging posts 365
Moderation service 367
16.5 Logging, monitoring, and alerting 368
Serving images as well as text 368
High-level architecture 369
16.6 Other possible discussion points 372
17 Design a dashboard of top 10 products on Amazon by sales volume 374
17.1 Requirements 375
17.2 Initial thoughts 376
17.3 Initial high-level architecture 377
17.4 Aggregation service 378
Aggregating by product ID 379
Matching host IDs and product IDs 379
Storing timestamps 380
Aggregation process on a host 380
17.5 Batch pipeline 381
17.6 Streaming pipeline 383
Hash table and max-heap with a single host 383
Horizontal scaling to multiple hosts and multi-tier aggregation 385
17.7 Approximation 386
Count-min sketch 388
17.8 Dashboard with Lambda architecture 390
17.9 Kappa architecture approach 390
Lambda vs. Kappa architecture 391
Kappa architecture for our dashboard 392
17.10 Logging, monitoring, and alerting 393
17.11 Other possible discussion points 393
17.12 References 394
A Monoliths vs. microservices 395
A.1 Disadvantages of monoliths 395
A.2 Advantages of monoliths 396
A.3 Advantages of services 396
Agile and rapid development and scaling of product requirements and business functionalities 397
Modularity and replaceability 397
Failure isolation and fault-tolerance 397
Ownership and organizational structure 398
A.4 Disadvantages of services 398
Duplicate components 398
Development and maintenance costs of additional components 399
Distributed transactions 400
Referential integrity 400
Coordinating feature development and deployments that span multiple services 400
Interfaces 401
A.5 References 402
B OAuth 2.0 authorization and OpenID Connect authentication 403
B.1 Authorization vs. authentication 403
B.2 Prelude: Simple login, cookie-based authentication 404
B.3 Single sign-on (SSO) 404
B.4 Disadvantages of simple login 404
Complexity and lack of maintainability 405
No partial authorization 405
B.5 OAuth 2.0 flow 406
OAuth 2.0 terminology 407
Initial client setup 407
Back channel and front channel 409
B.6 Other OAuth 2.0 flows 410
B.7 OpenID Connect authentication 411
C C4 Model 413
D Two-phase commit (2PC) 418
index 422
The system design interview is one of the hardest challenges you’ll face in the software engineering hiring process. This practical book gives you the insights, the skills, and the hands-on practice you need to ace the toughest system design interview questions and land the job and salary you want.
In Acing the System Design Interview you will master a structured and organized approach to present system design ideas like:
Scaling applications to support heavy traffic
Distributed transaction techniques to ensure data consistency
Services for functional partitioning such as API gateway and service mesh
Common API paradigms including REST, RPC, and GraphQL
Caching strategies, including their tradeoffs
Logging, monitoring, and alerting concepts that are critical in any system design
Communication skills that demonstrate your engineering maturity
Don’t be daunted by the complex, open-ended nature of system design interviews! In this in-depth guide, author Zhiyong Tan shares what he’s learned on both sides of the interview table. You’ll dive deep into the common technical topics that arise during interviews and learn how to apply them to mentally perfect different kinds of systems.
The system design interview is daunting even for seasoned software engineers. Fortunately, with a little careful prep work you can turn those open-ended questions and whiteboard sessions into your competitive advantage! In this powerful book, Zhiyong Tan reveals practical interview techniques and insights about system design that have earned developers job offers from Amazon, Apple, ByteDance, PayPal, and Uber.
Acing the System Design Interview is a masterclass in how to confidently nail your next interview. Following these easy-to-remember techniques, you’ll learn to quickly assess a question, identify an advantageous approach, and then communicate your ideas clearly to an interviewer. As you work through this book, you’ll gain the skills not only to interview successfully but also to do the actual work of great system design.
Insights on scaling, transactions, logging, and more
Practice questions for core system design concepts
How to demonstrate your engineering maturity
Great questions to ask your interviewer
For software engineers, software architects, and engineering managers looking to advance their careers.