LLMs in Production....1
brief contents....8
contents....9
foreword....14
preface....15
acknowledgments....17
about the book....19
Who should read this book....19
How this book is organized....20
About the code....21
liveBook Discussion Forum....21
about the authors....22
about the cover illustration....23
1 Words’ awakening: Why large language models have captured attention....24
1.1 Large language models accelerating communication....26
1.2 Navigating the build-and-buy decision with LLMs....30
1.2.1 Buying: The beaten path....31
1.2.2 Building: The path less traveled....32
1.2.3 A word of warning: Embrace the future now....38
1.3 Debunking myths....39
Summary....42
2 Large language models: A deep dive into language modeling....43
2.1 Language modeling....44
2.1.1 Linguistic features....46
2.1.2 Semiotics....52
2.1.3 Multilingual NLP....55
2.2 Language modeling techniques....56
2.2.1 N-gram and corpus-based techniques....57
2.2.2 Bayesian techniques....59
2.2.3 Markov chains....63
2.2.4 Continuous language modeling....66
2.2.5 Embeddings....70
2.2.6 Multilayer perceptrons....72
2.2.7 Recurrent neural networks and long short-term memory networks....74
2.2.8 Attention....81
2.3 Attention is all you need....83
2.3.1 Encoders....84
2.3.2 Decoders....85
2.3.3 Transformers....87
2.4 Really big transformers....89
Summary....94
3 Large language model operations: Building a platform for LLMs....96
3.1 Introduction to large language model operations....96
3.2 Operations challenges with large language models....97
3.2.1 Long download times....97
3.2.2 Longer deploy times....98
3.2.3 Latency....99
3.2.4 Managing GPUs....100
3.2.5 Peculiarities of text data....100
3.2.6 Token limits create bottlenecks....101
3.2.7 Hallucinations cause confusion....103
3.2.8 Bias and ethical considerations....104
3.2.9 Security concerns....104
3.2.10 Controlling costs....107
3.3 LLMOps essentials....107
3.3.1 Compression....107
3.3.2 Distributed computing....116
3.4 LLM operations infrastructure....122
3.4.1 Data infrastructure....124
3.4.2 Experiment trackers....125
3.4.3 Model registry....126
3.4.4 Feature stores....127
3.4.5 Vector databases....128
3.4.6 Monitoring system....129
3.4.7 GPU-enabled workstations....130
3.4.8 Deployment service....131
Summary....132
4 Data engineering for large language models: Setting up for success....134
4.1 Models are the foundation....135
4.1.1 GPT....136
4.1.2 BLOOM....137
4.1.3 LLaMA....138
4.1.4 Wizard....138
4.1.5 Falcon....139
4.1.6 Vicuna....139
4.1.7 Dolly....139
4.1.8 OpenChat....140
4.2 Evaluating LLMs....141
4.2.1 Metrics for evaluating text....141
4.2.2 Industry benchmarks....144
4.2.3 Responsible AI benchmarks....149
4.2.4 Developing your own benchmark....151
4.2.5 Evaluating code generators....153
4.2.6 Evaluating model parameters....154
4.3 Data for LLMs....156
4.3.1 Datasets you should know....157
4.3.2 Data cleaning and preparation....161
4.4 Text processors....167
4.4.1 Tokenization....167
4.4.2 Embeddings....172
4.5 Preparing a Slack dataset....175
Summary....176
5 Training large language models: How to generate the generator....177
5.1 Multi-GPU environments....178
5.1.1 Setting up....178
5.1.2 Libraries....182
5.2 Basic training techniques....184
5.2.1 From scratch....185
5.2.2 Transfer learning (finetuning)....192
5.2.3 Prompting....197
5.3 Advanced training techniques....198
5.3.1 Prompt tuning....198
5.3.2 Finetuning with knowledge distillation....204
5.3.3 Reinforcement learning with human feedback....208
5.3.4 Mixture of experts....211
5.3.5 LoRA and PEFT....214
5.4 Training tips and tricks....219
5.4.1 Training data size notes....219
5.4.2 Efficient training....220
5.4.3 Local minima traps....221
5.4.4 Hyperparameter tuning tips....221
5.4.5 A note on operating systems....222
5.4.6 Activation function advice....222
Summary....223
6 Large language model services: A practical guide....224
6.1 Creating an LLM service....225
6.1.1 Model compilation....226
6.1.2 LLM storage strategies....232
6.1.3 Adaptive request batching....235
6.1.4 Flow control....235
6.1.5 Streaming responses....238
6.1.6 Feature store....239
6.1.7 Retrieval-augmented generation....242
6.1.8 LLM service libraries....246
6.2 Setting up infrastructure....247
6.2.1 Provisioning clusters....248
6.2.2 Autoscaling....250
6.2.3 Rolling updates....255
6.2.4 Inference graphs....257
6.2.5 Monitoring....260
6.3 Production challenges....263
6.3.1 Model updates and retraining....264
6.3.2 Load testing....264
6.3.3 Troubleshooting poor latency....268
6.3.4 Resource management....270
6.3.5 Cost engineering....271
6.3.6 Security....272
6.4 Deploying to the edge....274
Summary....276
7 Prompt engineering: Becoming an LLM whisperer....277
7.1 Prompting your model....278
7.1.1 Few-shot prompting....278
7.1.2 One-shot prompting....280
7.1.3 Zero-shot prompting....281
7.2 Prompt engineering basics....283
7.2.1 Anatomy of a prompt....284
7.2.2 Prompting hyperparameters....286
7.2.3 Scrounging the training data....288
7.3 Prompt engineering tooling....289
7.3.1 LangChain....289
7.3.2 Guidance....290
7.3.3 DSPy....293
7.3.4 Other tooling is available but . . .....294
7.4 Advanced prompt engineering techniques....294
7.4.1 Giving LLMs tools....294
7.4.2 ReAct....297
Summary....300
8 Large language model applications: Building an interactive experience....302
8.1 Building an application....303
8.1.1 Streaming on the frontend....304
8.1.2 Keeping a history....307
8.1.3 Chatbot interaction features....310
8.1.4 Token counting....313
8.1.5 RAG applied....314
8.2 Edge applications....316
8.3 LLM agents....319
Summary....327
9 Creating an LLM project: Reimplementing Llama 3....328
9.1 Implementing Meta’s Llama....329
9.1.1 Tokenization and configuration....329
9.1.2 Dataset, data loading, evaluation, and generation....332
9.1.3 Network architecture....337
9.2 Simple Llama....340
9.3 Making it better....344
9.3.1 Quantization....345
9.3.2 LoRA....346
9.3.3 Fully sharded data parallel–quantized LoRA....349
9.4 Deploy to a Hugging Face Hub Space....351
Summary....354
10 Creating a coding copilot project: This would have helped you earlier....355
10.1 Our model....356
10.2 Data is king....359
10.2.1 Our VectorDB....359
10.2.2 Our dataset....360
10.2.3 Using RAG....364
10.3 Build the VS Code extension....367
10.4 Lessons learned and next steps....374
Summary....377
11 Deploying an LLM on a Raspberry Pi: How low can you go?....378
11.1 Setting up your Raspberry Pi....379
11.1.1 Pi Imager....380
11.1.2 Connecting to Pi....382
11.1.3 Software installations and updates....386
11.2 Preparing the model....387
11.3 Serving the model....389
11.4 Improvements....391
11.4.1 Using a better interface....391
11.4.2 Changing quantization....392
11.4.3 Adding multimodality....393
11.4.4 Serving the model on Google Colab....397
Summary....400
12 Production, an ever-changing landscape: Things are just getting started....402
12.1 A thousand-foot view....403
12.2 The future of LLMs....404
12.2.1 Government and regulation....404
12.2.2 LLMs are getting bigger....409
12.2.3 Multimodal spaces....415
12.2.4 Datasets....416
12.2.5 Solving hallucination....417
12.2.6 New hardware....424
12.2.7 Agents will become useful....425
12.3 Final thoughts....429
Summary....430
appendix A History of linguistics....431
A.1 Ancient linguistics....431
A.2 Medieval linguistics....432
A.3 Renaissance and early modern linguistics....433
A.4 Early 20th-century linguistics....435
A.5 Mid-20th century and modern linguistics....437
appendix B Reinforcement learning with human feedback....439
appendix C Multimodal latent spaces....443
index....450
This practical book offers clear, example-rich explanations of how LLMs work, how you can interact with them, and how to integrate LLMs into your own applications. Find out what makes LLMs so different from traditional software and ML, discover best practices for working with them outside the lab, and avoid common pitfalls with advice from experienced practitioners.
Most business software is developed and improved iteratively, and can change significantly even after deployment. By contrast, because LLMs are expensive to create and difficult to modify, they require meticulous upfront planning, exacting data standards, and carefully executed technical implementation. Integrating LLMs into production applications affects every aspect of your operations plan, including the application lifecycle, data pipeline, compute cost, security, and more. Get it wrong, and you may have a costly failure on your hands.
LLMs in Production teaches you how to develop an LLMOps plan that can take an AI app smoothly from design to delivery. You’ll learn techniques for preparing an LLM dataset, cost-efficient training hacks like LoRA and RLHF, and industry benchmarks for model evaluation. Along the way, you’ll put your new skills to use in three exciting example projects: creating and training a custom LLM, building a VS Code AI coding extension, and deploying a small model to a Raspberry Pi.
For data scientists and ML engineers who know Python and the basics of cloud deployment.