Domain-Specific Small Language Models....1
brief contents....6
contents....8
foreword....13
preface....15
acknowledgments....16
about this book....17
Who should read this book....17
How this book is organized: A roadmap....18
About the code....19
liveBook discussion forum....19
about the author....20
about the cover illustration....21
Part 1 First Steps....22
1 Small language models....24
1.1 What are small language models?....25
1.2 The Transformer architecture....26
1.3 Areas of application....27
1.4 The open source revolution....27
1.5 Risks and challenges with generalist LLMs....31
1.6 Domain-specific vs. generalist LLMs for business value....32
Part 2 Core domain-specific LLMs....36
2 Tuning for a specific domain....38
2.1 Data preparation....38
2.1.1 Data preparation for BERT fine-tuning....39
2.1.2 Data preparation for GPT fine-tuning....41
2.1.3 Data preparation for RAG....43
2.2 Retrieval-augmented generation....45
2.3 Fine-tuning....46
2.4 LoRA....50
2.5 RAG or fine-tuning?....55
3 End-to-end transformer fine-tuning....57
3.1 Data preparation....57
3.2 Fine-tuning....59
3.3 Testing the fine-tuned model....62
3.4 Domain-specific evaluation....65
4 Running inference....72
4.1 How to generate content....72
4.1.1 Text completion....73
4.1.2 Few-shot learning....76
4.1.3 Code generation....77
4.1.4 Evaluating the generated content....78
4.2 Calculating inference cost....80
4.3 Areas for improvement (cost savings and performance)....81
4.3.1 Getting the most from your GPU....81
4.3.2 Batching....85
4.3.3 Estimating the generation time....86
4.3.4 Optimizing GPU use with DeepSpeed....87
5 Exploring ONNX....92
5.1 The ONNX format....92
5.2 ONNX operators and types....95
5.3 The ONNX runtime....102
5.4 ONNX runtime providers....103
5.5 ONNX for LLMs on CPU....105
5.6 ONNX for LLMs on GPU....109
5.6.1 ONNX for GPT on GPU....110
5.6.2 IO binding....113
6 Quantizing for your production environment....117
6.1 Transformers precision formats....117
6.2 8-bit quantization....120
6.2.1 Hands-on 8-bit quantization....121
6.2.2 LLM.int8() and quantization....125
6.3 8-bit quantization with ONNX....128
6.4 4-bit quantization....133
6.4.1 4-bit quantization with GPTQ....133
6.4.2 4-bit quantization with ggml....135
Part 3 Real-world use cases....140
7 Generating Python code....142
7.1 Using Transformers to generate code....142
7.2 Generating Python code with a Transformer architecture....144
7.2.1 Python code generation with CodeGen....144
7.2.2 Using ONNX with models not supported by Optimum....154
7.2.3 Model evaluation....155
7.2.4 Python code generation with better models....158
7.3 Coding assistance on commodity hardware....161
8 Generating protein structures....166
8.1 Applying Transformers in chemistry....166
8.2 From natural language to protein structures....168
8.3 Antibody generation with an SLM....170
8.4 From CIF files to crystal structures....174
Part 4 Advanced concepts....184
9 Advanced quantization techniques....186
9.1 FlexGen....187
9.2 SmoothQuant....192
9.3 BitNet....197
9.4 BitNet and Python....200
10 Profiling insights....205
10.1 Profiling ONNX-ported LLMs....205
10.2 Transforming raw ONNX profiling data into insights....209
10.3 Optimization of ONNX graphs for LLMs....218
11 Deployment and serving....227
11.1 vLLM....227
11.1.1 Offline serving....229
11.1.2 Online serving....233
11.2 FastAPI....236
11.2.1 Benchmarking various models....239
11.2.2 Deploying the best-performing model with FastAPI....243
11.3 MLC LLM....244
11.4 Deployment and inference on Android devices....250
11.4.1 MLC LLM framework....250
11.4.2 MLLM framework....250
11.4.3 Hugging Face Transformers....252
12 Running on your laptop....256
12.1 Why use a personal local assistant....257
12.2 Running an LLM locally with Ollama....257
12.2.1 Importing a custom model into Ollama....261
12.2.2 User privacy in Ollama....264
12.3 Running an LLM locally with LM Studio....265
12.4 The LM Studio Python SDK....269
12.5 Running an LLM locally with Jan....272
12.6 The Cortex local LLM engine....274
13 Creating end-to-end LLM applications....279
13.1 Why LLMs alone aren't enough....280
13.2 Combining a domain-specific SLM with RAG....282
13.3 Using a vector database....296
13.4 Building an agent....300
14 Advanced components for LLM applications....313
14.1 GraphRAG....313
14.1.1 Microsoft's open source GraphRAG capabilities....326
14.1.2 Evaluation metrics....327
14.2 RAG Agentic AI....328
14.3 Long- and short-term memory management....341
15 Test-time compute and small language models....347
15.1 Test-time compute....347
15.2 The OptiLLM inference proxy....349
15.3 SLMs with embedded test-time compute....358
15.4 Building a reasoning domain-specific SLM....359
index....368
Bigger isn’t always better. Train and tune highly focused language models optimized for domain-specific tasks.
When you need a language model to respond accurately and quickly about a specific field of knowledge, the sprawling capacity of an LLM may hurt more than it helps. Domain-Specific Small Language Models teaches you to build generative AI models optimized for specific fields.
Perfect for cost- or hardware-constrained environments, Small Language Models (SLMs) train on domain-specific data to deliver high-quality results on specific tasks. In Domain-Specific Small Language Models you’ll develop SLMs that can generate everything from Python code to protein structures and antibody sequences—all on commodity hardware.
Small-footprint language models trained on custom data sets and hosted locally can match large generalist models in speed and accuracy, often at a fraction of the cost. Domain-Specific Small Language Models shows you how to build privacy-preserving and regulation-compliant SLMs for agentic systems, specialist applications, and deployment at the edge.
This is a practical book that shows you how to adapt pretrained open source models to your domain using transfer learning and parameter-efficient fine-tuning. You’ll learn to minimize cost through optimization and quantization, develop secure APIs to serve your models, and deploy SLMs on commodity hardware—including small devices. The hands-on examples include integrating SLMs into RAG systems and agentic workflows.
For AI engineers familiar with Python.