contents....6
foreword....19
preface....23
acknowledgments....25
about this book....27
Who should read this book....28
How this book is organized: A roadmap....28
liveBook discussion forum....28
Other online resources....29
about the author....30
about the cover illustration....31
Part 1....33
1 A walkthrough of system design concepts....35
A discussion about tradeoffs....36
Should you read this book?....36
Overview of this book....37
Prelude: A brief discussion of scaling the various services of a system....38
The beginning: A small initial deployment of our app....38
Scaling with GeoDNS....39
Adding a caching service....40
Content distribution network....41
A brief discussion of horizontal scalability and cluster management, continuous integration, and continuous deployment....42
Functional partitioning and centralization of cross-cutting concerns....45
Batch and streaming extract, transform, and load (ETL)....49
Other common services....50
Cloud vs. bare metal....51
Serverless: Function as a Service (FaaS)....54
Conclusion: Scaling backend services....55
2 A typical system design interview flow....56
Clarify requirements and discuss tradeoffs....58
Draft the API specification....60
Common API endpoints....60
Connections and processing between users and data....60
Design the data model....61
Example of the disadvantages of multiple services sharing databases....62
A possible technique to prevent concurrent user update conflicts....63
Logging, monitoring, and alerting....66
The importance of monitoring....66
Observability....66
Responding to alerts....68
Application-level logging tools....69
Streaming and batch audit of data quality....71
Anomaly detection to detect data anomalies....71
Silent errors and auditing....72
Further reading on observability....72
Search bar....72
Introduction....72
Search bar implementation with Elasticsearch....73
Elasticsearch index and ingestion....74
Using Elasticsearch in place of SQL....75
Implementing search in our services....76
Further reading on search....76
Other discussions....76
Maintaining and extending the application....76
Supporting other types of users....77
Alternative architectural decisions....77
Usability and feedback....77
Edge cases and new constraints....78
Cloud-native concepts....79
Post-interview reflection and assessment....79
Write your reflection as soon as possible after the interview....79
Writing your assessment....81
Details you didn't mention....81
Interview feedback....82
Interviewing the company....83
3 Non-functional requirements....86
Scalability....88
Stateless and stateful services....89
Basic load balancer concepts....89
Availability....91
Fault-tolerance....92
Replication and redundancy....92
Forward error correction and error correction code....93
Circuit breaker....93
Exponential backoff and retry....94
Caching responses of other services....94
Checkpointing....94
Dead letter queue....94
Logging and periodic auditing....95
Bulkhead....95
Fallback pattern....96
Performance/latency and throughput....97
Consistency....98
Full mesh....99
Coordination service....100
Distributed cache....101
Gossip protocol....102
Random Leader Selection....102
Accuracy....102
Complexity and maintainability....103
Continuous deployment (CD)....104
Cost....104
Security....105
Privacy....105
External vs. internal services....106
Cloud native....107
Further reading....107
4 Scaling databases....109
Brief prelude on storage services....109
When to use vs. avoid databases....111
Replication....111
Distributing replicas....112
Single-leader replication....112
Multi-leader replication....116
Leaderless replication....117
HDFS replication....117
Further reading....119
Scaling storage capacity with sharded databases....119
Sharded RDBMS....120
Aggregating events....120
Single-tier aggregation....121
Multi-tier aggregation....121
Partitioning....122
Handling a large key space....123
Replication and fault-tolerance....124
Batch and streaming ETL....125
A simple batch ETL pipeline....125
Messaging terminology....127
Kafka vs. RabbitMQ....128
Lambda architecture....129
Denormalization....130
Caching....131
Read strategies....132
Write strategies....133
Caching as a separate service....135
Examples of different kinds of data to cache and how to cache them....135
Cache invalidation....137
Browser cache invalidation....137
Cache invalidation in caching services....137
Cache warming....138
Further reading....139
Caching references....139
5 Distributed transactions....141
Event Driven Architecture (EDA)....142
Event sourcing....143
Change Data Capture (CDC)....144
Comparison of event sourcing and CDC....145
Transaction supervisor....146
Saga....147
Choreography....147
Orchestration....149
Comparison....151
Other transaction types....152
Further reading....152
6 Common services for functional partitioning....154
Common functionalities of various services....155
Security....155
Error-checking....156
Performance and availability....156
Logging and analytics....156
Service mesh/sidecar pattern....157
Metadata service....158
Service discovery....159
Functional partitioning and various frameworks....160
Basic system design of an app....160
Purposes of a web server app....161
Web and mobile frameworks....162
Library vs. service....166
Language specific vs. technology-agnostic....167
Predictability of latency....168
Predictability and reproducibility of behavior....168
Scaling considerations for libraries....168
Other considerations....169
Common API paradigms....169
The Open Systems Interconnection (OSI) model....169
REST....170
RPC (Remote Procedure Call)....172
GraphQL....173
WebSocket....174
Comparison....174
Part 2....177
7 Design Craigslist....179
User stories and requirements....180
API....181
SQL database schema....182
Initial high-level architecture....182
A monolith architecture....183
Using an SQL database and object store....185
Migrations are troublesome....186
Writing and reading posts....188
Functional partitioning....190
Caching....191
CDN....192
Scaling reads with a SQL cluster....192
Scaling write throughput....192
Email service....193
Search....194
Removing old posts....194
Monitoring and alerting....195
Summary of our architecture discussion so far....195
Other possible discussion topics....196
Reporting posts....196
Graceful degradation....196
Complexity....196
Item categories/tags....198
Analytics and recommendations....198
A/B testing....199
Subscriptions and saved searches....199
Allow duplicate requests to the search service....200
Avoid duplicate requests to the search service....200
Rate limiting....201
Large number of posts....201
Local regulations....201
8 Design a rate-limiting service....203
Alternatives to a rate-limiting service and why they are infeasible....204
When not to do rate limiting....206
Functional requirements....206
Non-functional requirements....207
Scalability....207
Performance....207
Complexity....207
Security and privacy....208
Availability and fault-tolerance....208
Accuracy....208
Consistency....208
Discuss user stories and required service components....209
High-level architecture....209
Stateful approach/sharding....212
Storing all counts in every host....214
High-level architecture....214
Synchronizing counts....217
Rate-limiting algorithms....219
Token bucket....220
Leaky bucket....221
Fixed window counter....222
Sliding window log....224
Sliding window counter....225
Employing a sidecar pattern....225
Logging, monitoring, and alerting....225
Providing functionality in a client library....226
Further reading....227
9 Design a notification/alerting service....228
Functional requirements....228
Not for uptime monitoring....229
Users and data....229
Recipient channels....230
Templates....230
Trigger conditions....231
Manage subscribers, sender groups, and recipient groups....231
User features....231
Analytics....232
Non-functional requirements....232
Initial high-level architecture....232
Object store: Configuring and sending notifications....237
Notification templates....239
Notification template service....239
Additional features....241
Scheduled notifications....242
Notification addressee groups....244
Unsubscribe requests....247
Handling failed deliveries....248
Client-side considerations regarding duplicate notifications....250
Priority....250
Search....251
Monitoring and alerting....251
Availability monitoring and alerting on the notification/alerting service....252
Other possible discussion topics....252
Final notes....253
10 Design a database batch auditing service....255
Why is auditing necessary?....256
Defining a validation with a conditional statement on a SQL query's result....258
A simple SQL batch auditing service....261
An audit script....261
An audit service....262
Requirements....264
High-level architecture....265
Running a batch auditing job....266
Handling alerts....267
Constraints on database queries....269
Limit query execution time....270
Check the query strings before submission....270
Users should be trained early....271
Prevent too many simultaneous queries....271
Other users of database schema metadata....272
Auditing a data pipeline....273
Logging, monitoring, and alerting....274
Other possible types of audits....274
Cross data center consistency audits....274
Compare upstream and downstream data....274
Other possible discussion topics....275
References....275
11 Autocomplete/typeahead....277
Possible uses of autocomplete....278
Search vs. autocomplete....278
Functional requirements....280
Scope of our autocomplete service....280
Some UX details....280
Considering search history....281
Content moderation and fairness....282
Non-functional requirements....282
Planning the high-level architecture....283
Weighted trie approach and initial high-level architecture....283
Detailed implementation....285
Each step should be an independent task....287
Fetch relevant logs from Elasticsearch to HDFS....287
Split the search strings into words and other simple operations....287
Filter out inappropriate words....288
Fuzzy matching and spelling correction....290
Count the words....291
Filter for appropriate words....291
Managing new popular unknown words....291
Generate and deliver the weighted trie....291
Sampling approach....292
Handling storage requirements....293
Handling phrases instead of single words....295
Maximum length of autocomplete suggestions....295
Preventing inappropriate suggestions....295
Logging, monitoring, and alerting....296
Other considerations and further discussion....296
12 Design Flickr....298
User stories and functional requirements....299
Non-functional requirements....300
High-level architecture....301
SQL schema....302
Organizing directories and files on the CDN....303
Uploading a photo....304
Generate thumbnails on the client....304
Generate thumbnails on the backend....308
Implementing both server-side and client-side generation....313
Downloading images and data....314
Downloading pages of thumbnails....314
Monitoring and alerting....315
Some other services....315
Premium features....315
Payments and taxes service....315
Censorship/content moderation....315
Advertising....316
Personalization....316
Other possible discussion topics....316
13 Design a Content Distribution Network....319
Advantages and disadvantages of a CDN....320
Advantages of using a CDN....320
Disadvantages of using a CDN....321
Example of an unexpected problem from using a CDN to serve images....322
Requirements....323
CDN authentication and authorization....323
Steps in CDN authentication and authorization....324
Key rotation....326
High-level architecture....326
Storage service....327
In-cluster....328
Out-cluster....328
Evaluation....328
Common operations....329
Reads: Downloads....329
Writes: Directory creation, file upload, and file deletion....333
Logging, monitoring, and alerting....338
Other possible discussions on downloading media files....338
Cache invalidation....338
14 Design a text messaging app....340
Requirements....341
Initial thoughts....342
Initial high-level design....342
Connection service....344
Making connections....344
Sender blocking....344
Sender service....348
Sending a message....348
Other discussions....351
Message service....352
Message-sending service....353
Introduction....353
High-level architecture....354
Steps in sending a message....356
Some questions....357
Improving availability....357
Search....358
Logging, monitoring, and alerting....358
Other possible discussion topics....359
15 Design Airbnb....361
Requirements....361
Design decisions....365
Replication....365
Data models for room availability....366
Handling overlapping bookings....367
Randomize search results....367
Lock rooms during booking flow....367
High-level architecture....367
Functional partitioning....369
Create or update a listing....369
Approval service....371
Booking service....377
Availability service....381
Logging, monitoring, and alerting....382
Other possible discussion topics....383
Handling regulations....384
16 Design a news feed....386
Requirements....387
High-level architecture....388
Prepare feed in advance....392
Validation and content moderation....396
Changing posts on users' devices....397
Tagging posts....397
Moderation service....399
Logging, monitoring, and alerting....400
Serving images as well as text....400
High-level architecture....401
Other possible discussion topics....404
17 Design a dashboard of top 10 products on Amazon by sales volume....406
Requirements....407
Initial thoughts....408
Initial high-level architecture....409
Aggregation service....410
Aggregating by product ID....411
Matching host IDs and product IDs....411
Storing timestamps....412
Aggregation process on a host....412
Batch pipeline....413
Streaming pipeline....415
Hash table and max-heap with a single host....415
Horizontal scaling to multiple hosts and multi-tier aggregation....417
Approximation....418
Count-min sketch....420
Dashboard with Lambda architecture....422
Kappa architecture approach....422
Lambda vs. Kappa architecture....422
Kappa architecture for our dashboard....424
Logging, monitoring, and alerting....425
Other possible discussion topics....425
References....426
A Monoliths vs. microservices....427
Advantages of monoliths....428
Disadvantages of monoliths....428
Advantages of services....429
Agile and rapid development and scaling of product requirements and business functionalities....429
Modularity and replaceability....429
Failure isolation and fault-tolerance....429
Ownership and organizational structure....430
Disadvantages of services....430
Duplicate components....430
Development and maintenance costs of additional components....431
Distributed transactions....432
Referential integrity....432
Coordinating feature development and deployments that span multiple services....433
Interfaces....434
References....434
B OAuth 2.0 authorization and OpenID Connect authentication....435
Authorization vs. authentication....435
Prelude: Simple login, cookie-based authentication....436
Single sign-on....436
Disadvantages of simple login....436
Complexity and lack of maintainability....437
No partial authorization....437
OAuth 2.0 flow....438
OAuth 2.0 terminology....439
Initial client setup....439
Back channel and front channel....441
Other OAuth 2.0 flows....442
OpenID Connect authentication....443
C C4 Model....445
D Two-phase commit (2PC)....450
index....454
The system design interview is one of the hardest challenges you’ll face in the software engineering hiring process. This practical book gives you the insights, the skills, and the hands-on practice you need to ace the toughest system design interview questions and land the job and salary you want.
Don’t be daunted by the complex, open-ended nature of system design interviews! In this in-depth guide, author Zhiyong Tan shares what he’s learned on both sides of the interview table. You’ll dive deep into the common technical topics that arise during interviews and learn how to apply them to mentally perfect different kinds of systems.
The system design interview is daunting even for seasoned software engineers. Fortunately, with a little careful prep work you can turn those open-ended questions and whiteboard sessions into your competitive advantage! In this powerful book, Zhiyong Tan reveals practical interview techniques and insights about system design that have earned developers job offers from Amazon, Apple, ByteDance, PayPal, and Uber.
Acing the System Design Interview is a masterclass in how to confidently nail your next interview. Following these easy-to-remember techniques, you’ll learn to quickly assess a question, identify an advantageous approach, and then communicate your ideas clearly to an interviewer. As you work through this book, you’ll gain the skills not only to interview successfully but also to do the actual work of great system design.
For software engineers, software architects, and engineering managers looking to advance their careers.