Domain-Specific Small Language Models....1
brief contents....6
contents....8
foreword....13
preface....15
acknowledgments....16
about this book....17
Who should read this book....17
How this book is organized: A roadmap....18
About the code....19
liveBook discussion forum....19
about the author....20
about the cover illustration....21
Part 1 First Steps....22
1 Small language models....24
1.1 What are small language models?....25
1.2 The Transformer architecture....26
1.3 Areas of application....27
1.4 The open source revolution....27
1.5 Risks and challenges with generalist LLMs....31
1.6 Domain-specific vs. generalist LLMs for business value....32
Part 2 Core domain-specific LLMs....36
2 Tuning for a specific domain....38
2.1 Data preparation....38
2.1.1 Data preparation for BERT fine-tuning....39
2.1.2 Data preparation for GPT fine-tuning....41
2.1.3 Data preparation for RAG....43
2.2 Retrieval-augmented generation....45
2.3 Fine-tuning....46
2.4 LoRA....50
2.5 RAG or fine-tuning?....55
3 End-to-end transformer fine-tuning....57
3.1 Data preparation....57
3.2 Fine-tuning....59
3.3 Testing the fine-tuned model....62
3.4 Domain-specific evaluation....65
4 Running inference....72
4.1 How to generate content....72
4.1.1 Text completion....73
4.1.2 Few-shot learning....76
4.1.3 Code generation....77
4.1.4 Evaluating the generated content....78
4.2 Calculating inference cost....80
4.3 Areas for improvement (cost savings and performance)....81
4.3.1 Getting the most from your GPU....81
4.3.2 Batching....85
4.3.3 Estimating the generation time....86
4.3.4 Optimizing GPU use with DeepSpeed....87
5 Exploring ONNX....92
5.1 The ONNX format....92
5.2 ONNX operators and types....95
5.3 The ONNX runtime....102
5.4 ONNX runtime providers....103
5.5 ONNX for LLMs on CPU....105
5.6 ONNX for LLMs on GPU....109
5.6.1 ONNX for GPT on GPU....110
5.6.2 IO binding....113
6 Quantizing for your production environment....117
6.1 Transformers precision formats....117
6.2 8-bit quantization....120
6.2.1 Hands-on 8-bit quantization....121
6.2.2 LLM.int8() and quantization....125
6.3 8-bit quantization with ONNX....128
6.4 4-bit quantization....133
6.4.1 4-bit quantization with GPTQ....133
6.4.2 4-bit quantization with ggml....135
Part 3 Real-world use cases....140
7 Generating Python code....142
7.1 Using Transformers to generate code....142
7.2 Generating Python code with a Transformer architecture....144
7.2.1 Python code generation with CodeGen....144
7.2.2 Using ONNX with models not supported by Optimum....154
7.2.3 Model evaluation....155
7.2.4 Python code generation with better models....158
7.3 Coding assistance on commodity hardware....161
8 Generating protein structures....166
8.1 Applying Transformers in chemistry....166
8.2 From natural language to protein structures....168
8.3 Antibody generation with an SLM....170
8.4 From CIF files to crystal structures....174
Part 4 Advanced concepts....184
9 Advanced quantization techniques....186
9.1 FlexGen....187
9.2 SmoothQuant....192
9.3 BitNet....197
9.4 BitNet and Python....200
10 Profiling insights....205
10.1 Profiling ONNX-ported LLMs....205
10.2 Transforming raw ONNX profiling data into insights....209
10.3 Optimization of ONNX graphs for LLMs....218
11 Deployment and serving....227
11.1 vLLM....227
11.1.1 Offline serving....229
11.1.2 Online serving....233
11.2 FastAPI....236
11.2.1 Benchmarking various models....239
11.2.2 Deploying the best-performing model with FastAPI....243
11.3 MLC LLM....244
11.4 Deployment and inference on Android devices....250
11.4.1 MLC LLM framework....250
11.4.2 MLLM framework....250
11.4.3 Hugging Face Transformers....252
12 Running on your laptop....256
12.1 Why use a personal local assistant....257
12.2 Running an LLM locally with Ollama....257
12.2.1 Importing a custom model into Ollama....261
12.2.2 User privacy in Ollama....264
12.3 Running an LLM locally with LM Studio....265
12.4 The LM Studio Python SDK....269
12.5 Running an LLM locally with Jan....272
12.6 The Cortex local LLM engine....274
13 Creating end-to-end LLM applications....279
13.1 Why LLMs alone aren't enough....280
13.2 Combining a domain-specific SLM with RAG....282
13.3 Using a vector database....296
13.4 Building an agent....300
14 Advanced components for LLM applications....313
14.1 GraphRAG....313
14.1.1 Microsoft's open source GraphRAG capabilities....326
14.1.2 Evaluation metrics....327
14.2 RAG Agentic AI....328
14.3 Long- and short-term memory management....341
15 Test-time compute and small language models....347
15.1 Test-time compute....347
15.2 The OptiLLM inference proxy....349
15.3 SLMs with embedded test-time compute....358
15.4 Building a reasoning domain-specific SLM....359
index....368
Bigger isn’t always better. Train and tune highly focused language models optimized for domain-specific tasks.
When you need a language model to respond accurately and quickly about a specific field of knowledge, the sprawling capacity of an LLM may hurt more than it helps. Domain-Specific Small Language Models teaches you to build generative AI models optimized for specific fields.
Perfect for cost- or hardware-constrained environments, Small Language Models (SLMs) train on domain-specific data to deliver high-quality results on specific tasks. In Domain-Specific Small Language Models you’ll develop SLMs that can generate everything from Python code to protein structures and antibody sequences—all on commodity hardware.
Small-footprint language models trained on custom data sets and hosted locally can match large generalist models in speed and accuracy, often at a fraction of the cost. Domain-Specific Small Language Models shows you how to build privacy-preserving and regulation-compliant SLMs for agentic systems, specialist applications, and deployment at the edge.
This is a practical book that shows you how to adapt pretrained open source models to your domain using transfer learning and parameter-efficient fine-tuning. You’ll learn to minimize cost through optimization and quantization, develop secure APIs to serve your models, and deploy SLMs on commodity hardware—including small devices. The hands-on examples include integrating SLMs into RAG systems and agentic workflows.
For AI engineers familiar with Python.