Building Large Language Models from Scratch: Design, Train, and Deploy LLMs with PyTorch

Author: Dilyan Grigorov
Publication date: 2026
Publisher: Apress Media, LLC.
Number of pages: 547
File size: 5.0 MB
File type: PDF

Table of Contents....5

About the Author....21

About the Technical Reviewer....22

Introduction....23

Chapter 1: What Is a Large Language Model? Getting Started with Libraries and Environment Setup for Building an LLM from Scratch....24

Getting Started with the Foundations of Language Modeling....24

What Is a Large Language Model (LLM)?....26

The Attention Mechanism: Cornerstone of Modern Large Language Models (LLMs)....29

What Are Attention Mechanisms?....30

Detailed Explanation About How Self-Attention Works in Transformers....30

Why Attention Is Central to LLMs....32

Variants and Innovations in Attention (As of May 2025)....33

Challenges and Limitations of Attention....34

Practical Significance in LLMs....35

The Tools We Will Use to Build a Large Language Model from Scratch....35

Why Python for Building Large Language Models?....37

Readable, Simple, and Accessible for All Skill Levels....37

Extensive Machine Learning Ecosystem and Strong Community Support....37

Integrated Tools for Visualization, Experimentation, and Full ML Pipelines....38

High Performance Through GPU Acceleration, Scalability, and Low-Level Interoperability....38

Production-Ready Deployment and Access to Pretrained Models for Transfer Learning....38

Why Jupyter Notebook for Building Large Language Models?....39

Interactive, Iterative Coding with Immediate Feedback....39

Unified Environment for Code, Documentation, and Visualization....40

Deep Debugging Transparency and Real-Time Inspection....40

Optimized for Learning, Collaboration, and Reproducible Research....40

Seamless Integration with Scientific Libraries and Scalable Environments....40

Why PyTorch for Building Large Language Models?....41

Dynamic, Flexible Execution for Fast Experimentation and Debugging....41

Pythonic Design and Deep Integration with the Scientific Ecosystem....41

High-Performance GPU Acceleration and Parallel Computing at Scale....41

Modular Architecture and a Rich, Research-Friendly Ecosystem....42

Production-Ready Deployment, Interoperability, and Future-Proof Development....42

Why CUDA for Building Large Language Models?....43

Massive Parallelism for Compute-Intensive Operations....43

Seamless Integration with Deep Learning Frameworks....44

Support for Mixed Precision Training....45

Optimized Libraries for Deep Learning....45

Scalability for Distributed Training....46

Support for Advanced Hardware Features....47

Flexibility for Custom Kernels and Optimizations....47

Rapid Prototyping and Experimentation....48

Support for Inference Optimization....48

Extensive Ecosystem and Community Support....49

Energy Efficiency and Cost Savings....50

Future-Readiness and Innovation....50

Accessibility for Diverse Developers....51

Support for Research and Custom Architectures....51

Why NumPy and Matplotlib for Building Large Language Models?....52

NumPy: Efficient Numerical Computations for Data Processing....53

NumPy: Memory Efficiency for Large-Scale Data....54

NumPy: Foundation for Preprocessing and Feature Engineering....55

NumPy: Support for Custom Algorithms and Research....55

Matplotlib: Visualization for Model Understanding....56

Matplotlib: Debugging and Interpretability....56

Matplotlib: Customizable and Publication-Ready Visuals....57

NumPy and Matplotlib: Seamless Integration with Python Ecosystem....57

NumPy and Matplotlib: Accessibility for Beginners....58

NumPy: Support for Dataset Creation and Evaluation....58

Matplotlib: Support for Experiment Tracking....59

NumPy and Matplotlib: Support for Research and Innovation....59

NumPy and Matplotlib: Lightweight and Local-First....60

NumPy and Matplotlib: Community and Ecosystem Support....60

TinyStories, Shakespeare, The Wizard of Oz, and OpenWebText....61

Four Key Datasets for Building and Understanding LLMs....61

TinyStories: Learning LLM Fundamentals Through Simple Narratives....62

Shakespeare: Stylized, Poetic, and Ideal for Demonstrations....62

The Wizard of Oz: A Compact, Coherent Narrative for Prototyping....62

OpenWebText: Large-Scale, Diverse, and Suitable for Real LLM Training....63

How These Datasets Work Together....63

Ethical and Practical Considerations....63

Why PyLZMA for Building Large Language Models?....64

Why PyLZMA Matters for LLM Development....64

Advanced Concepts....67

In Summary....69

Chapter 2: Foundational Concepts in LLM Development....70

Large Language Models' Common Architecture....70

Transformer Architecture: Conceptual and Mathematical Deep Dive....71

Input Embedding....72

Positional Encoding....73

Attention Mechanism....75

Scaled Dot-Product Attention....75

Multihead Attention....76

Masked Attention (Decoder)....77

Feed-Forward Neural Network....78

Residual Connections and Layer Normalization....78

Output Layer....79

Loss Function....80

Encoder-Decoder Interaction....80

Complexity Analysis....81

Advantages of the Transformer....81

Limitations....82

Applications....83

Variants and Improvements....83

The Comprehensive Mathematics Behind Training Large Language Models....84

Explaining the Purpose and Scope....84

Defining the Goal of Language Modeling....84

Next-Token Prediction....85

Modeling Sequential Probability....85

Defining the Loss Function....85

Softmax for Probability Distribution....85

Practical Considerations....86

Masked Language Modeling....86

Bidirectional Context Modeling....86

Comparison with Next-Token Prediction....86

Optimization: Gradient-Based Learning....87

Gradient Descent....87

Iterative Parameter Updates....87

Stochastic Gradient Descent....87

Adam Optimizer....88

Adaptive Optimization....88

Why Adam Works for LLMs....88

Backpropagation....89

Computing Gradients....89

Role of Automatic Differentiation....89

Regularization Techniques....89

Preventing Overfitting....89

Dropout....90

Randomly Deactivating Neurons....90

Why Dropout Helps....90

Weight Decay....90

Penalizing Large Weights....90

Implementation in Adam....90

Label Smoothing....91

Softening Target Distributions....91

Impact on Training....91

Fine-Tuning and RLHF....91

Adapting Pretrained Models....91

Supervised Fine-Tuning (SFT)....91

Task-Specific Optimization....91

Differences from Pretraining....92

Reinforcement Learning from Human Feedback (RLHF)....92

Aligning with Human Preferences....92

Why RLHF Is Effective....93

Summarizing the Mathematical Framework....93

Modern LLM Architecture Characteristics As of the End of 2024 and During 2025....93

In Summary....96

Chapter 3: Building a Tokenizer for the Transformers Architecture Model....97

What Is Tokenization?....98

Why Is Tokenization Important?....99

Types of Tokenizers....100

Word-Based Tokenizers....100

Character-Based Tokenizers....100

Subword-Based Tokenizers....101

Rule-Based Tokenizers....101

Byte-Pair Encoding (BPE): A Deep Dive....102

Origins of BPE....102

BPE Algorithm Outline and Mathematical Formulation....103

Identify Frequent Pairs....103

Replace and Record....103

Repeat Until No Gains....104

Decompression (Decoding)....104

BPE Algorithm Example....104

Concrete Example of the Encoding Part....104

Iteration 1....105

Iteration 2....106

Iteration 3....106

Encoding New Text....108

Concrete Example of the Decoding Part....108

Optimization Objective....109

Vocabulary Size and Hyperparameters....109

Step-by-Step Process of Building a BPE Tokenizer....110

Initialize the Vocabulary....110

Count Pair Frequencies....110

Merge the Most Frequent Pair....111

Repeat Merging....111

Tokenize New Text....111

Map Tokens to IDs....112

Applications of BPE....112

Advantages of BPE....112

Limitations of BPE....113

Comparison with Other Tokenization Methods....113

Practical Considerations....114

BPE Implementation Walkthrough....115

Train Method Code Breakdown: Step by Step....120

Step 1: Preprocess the Corpus....120

Step 2: Collect Base Characters....121

Step 3: Build Initial Vocabulary....121

Step 4: Map Text to Initial Token IDs....122

Step 5: Learn Merges....122

Step 6: Build GPT-2-Style Merge Ranks....123

The Encode Method....124

The Decode Method: One of the Most Important Ones....125

Step 1: Reconstruct Raw String from Tokens....127

Step 2: Normalize Unicode and Remove Control Characters....128

Step 3: Clean Up Underscore Markup....128

Step 4: Clean Up Whitespace and Punctuation....129

Why This Method Is One of the Most Important....129

Save and Load Methods....130

Helping Functions....132

Testing Our Tokenizer for Accuracy....139

Tokenizer Output....145

Analysis of the Output....149

Training the Tokenizer....149

Learned Merges....150

Encoding and Decoding Examples....151

Example A: Simple Sentence....151

Example B: Sentence with Newline....151

Example C: Special Token ....152

Example D: With BOS and EOS....152

Step-by-Step BPE Trace....152

Saving and Loading....153

Chapter 4: RMS Normalization and Model Configuration....155

Model Parameters Configuration and Mathematical Foundations....156

Mathematical Notation....157

Global Tensor Shapes and Consistency....157

num_hidden_layers: The Depth of Abstraction....158

vocab_size: The Universe of Symbols....158

hidden_size: The Resolution of Thought....159

intermediate_size: The Breathing Room of the Network....160

head_dim: The Grain of Attention....161

num_attention_heads: The Many Eyes of the Model....161

num_key_value_heads: Sharing the Burden....162

sliding_window: Constraining Attention....163

initial_context_length: How Far the Model Can See....164

The Geometry of Position: rope_theta, rope_scaling_factor, rope_ntk_alpha, rope_ntk_beta....164

rope_theta: The Base Frequency....165

rope_scaling_factor: Stretching the Map....165

rope_ntk_alpha: Balancing Long-Context Stability....165

rope_ntk_beta: Controlling the Rate of Growth....166

swiglu_limit: Taming the Activations....166

What Is SwiGLU?....166

Background on Related Concepts....167

How SwiGLU Works....167

Advantages and Usage....167

SwiGLU Enhances FFN Expressiveness in GPT Architectures with Gated Activations....168

RMS Normalization in GPT Architectures....169

What Is RMS Normalization?....169

Why Is RMSNorm Used in GPT Architectures?....170

How Is RMSNorm Used in GPT Architectures?....171

Step-by-Step Implementation in a Transformer Block....171

RMS Normalization for Our Custom Large Language Model....172

Comprehensive Explanation of RMSNorm Code....172

Detailed Code Explanation....173

1. Import and Class Definition....173

2. Initialization (__init__)....174

3. Forward Pass (forward)....176

Mathematical Summary....179

Implementation Details and Design Choices....179

Summary: RMS Normalization and Model Configuration....180

Chapter 5: Rotary Positional Embeddings: Integrating NTK and YaRN Scaling....183

Rotary Positional Embeddings: An In-Depth and Comprehensive Exploration....183

The Fundamental Need for Positional Embeddings in Transformers....185

Historical Evolution of Positional Encodings....185

Pretransformer Approaches: Sequential Processing....185

Early Transformer Positional Encodings....186

Sinusoidal Positional Embeddings....187

Learned Absolute Positional Embeddings....188

Relative Positional Encodings....188

Transformer-XL Relative Embeddings....189

T5 Relative Bias....189

Emergence of Rotary Positional Embeddings (RoPE)....190

Summary of Traditional Positional Embedding Approaches....191

Absolute Positional Embeddings....191

Relative Positional Embeddings....192

Absolute vs. Relative Positional Embeddings....193

Mathematical Formulation of RoPE....193

2D Intuition and Derivation....193

Starting in Two Dimensions....194

The Complex Number Perspective....194

Generalizing to High Dimensions....195

Attention Decay and Frequency Selection....195

Efficient Implementation....196

Attention Matrix Heatmap....197

Connection to Linguistic Interpretability....199

Advantages of RoPE....200

Applications in Modern Models....201

Variants and Extensions....202

Why Rotary Positional Embeddings Are Essential to Large Language Models....202

Solving Permutation Invariance....202

Superior Length Extrapolation....202

Balancing Absolute and Relative Information....203

Efficiency and Scalability....203

Natural Attention Decay....203

Training Benefits....204

Industry Adoption....204

RoPE Embeddings Implementation....204

Function Signature....205

Step-by-Step Explanation....205

1. Splitting the Embedding Tensor....206

2. Preparing Cosine and Sine Tensors....206

3. Applying Rotations....207

Output....208

Key Features and Safety Considerations....208

Example Workflow....208

Geometric Interpretation....209

RotaryEmbedding Class Definition and Methods....209

Class Structure....212

Detailed Explanation....213

1. Initialization (__init__)....213

2. Computing Concentration and Inverse Frequencies (_compute_concentration_and_inv_freq)....214

3. Computing Cosine and Sine Tensors (_compute_cos_sin)....217

4. Forward Pass (forward)....218

Key Features and Safety Considerations....219

Example Workflow....220

Geometric Interpretation....220

What Is NTK (Neural Tangent Kernel) in the Context of RoPE....221

Implementation in RotaryEmbedding....222

Key Features....223

What Is YaRN (Yet Another RoPE Extension)....223

YaRN's Scaling Approach....224

Implementation in RotaryEmbedding....225

Key Features....226

Summary....226

Chapter 6: Scaled Dot-Product Attention Core, Sliding Window, and Grouped Query Attention: The Core Behind All Transformer Models....228

What Is Scaled Dot-Product Attention (SDPA)?....228

Historical Evolution and Contextual Foundations....229

Intuitive and Conceptual Underpinnings, Math Formulations....230

Masking Mechanisms....231

Causal Masking....232

Padding Masking....234

Custom Masks for Structured Data....235

Sparse Attention Masks....237

Learned and Adaptive Masks....239

Bidirectional and Cross-Attention Masks....240

Specialized Attention Forms....241

Hierarchical Attention....241

Adaptive Attention Span....243

Challenges and Limitations of Masking Mechanisms....244

Complexity of Designing Masks....245

The Risk of Over-Masking....245

Computational Overhead of Generating and Applying Masks....246

Future Directions in Masking....249

The Enduring Legacy of SDPA....250

Custom Implementation of SDPA: Sliding Window and Grouped Query Attention for Our LLM....250

Understanding the Inputs and Outputs....253

Step 1: Shape Inference, Broadcasting, and Reshaping....254

Step 2: Computing Raw Attention Scores....255

Step 3: Constructing and Applying the Attention Mask....256

Step 4: Incorporating Sink Logits....257

Step 5: Softmax Normalization and Output Computation....258

Summary....259

Chapter 7: AttentionBlock with Rotary Embedding, GQA, Sliding Window, and Sink Tokens....261

Fundamentals of Attention....261

Self-Attention in Transformers....262

Operational Properties....263

Causal Self-Attention....264

Implementation in Transformers....265

Role in LLMs....265

Limitations and Challenges....266

Optimizations and Variants....266

Multihead Attention in Transformers....267

Operational Principles....268

Integration in Transformers....269

Positional Considerations....269

Role in LLMs....270

Limitations and Challenges....271

Optimizations and Variants....271

Advanced Considerations (up to 2025)....272

What Is Grouped Query Attention, Sliding Window Attention, and Sink Tokens?....273

Grouped Query Attention (GQA)....273

Sliding Window Attention (SWA)....274

Sink Tokens....274

Integration in the Attention Block....275

Integration of Attention Mechanism in Our Custom Large Language Model....275

Exhaustive Explanation of the AttentionBlock PyTorch Module....278

Class Definition and PyTorch Integration....279

Constructor (__init__) Method....279

Parameter Extraction and GQA Configuration....280

Validation Checks....280

Sliding Window and GQA Grouping....281

Normalization Layer....282

QKV Projection Layer....282

Output Projection Layer....283

Rotary Embedding Initialization....283

Sink Logits Parameter....284

Softmax Scaling Factor....285

Forward Pass (forward Method)....285

Input Unpacking and Residual Connection....286

Pre-attention Normalization....286

QKV Projection and Slicing....286

GQA Reshaping....287

RoPE Application....287

Core Attention Computation....288

Output Projection and Residual Addition....289

Training and Inference Behaviors....290

Performance Optimizations....290

Interpretability....292

Comparison to Other Attention Mechanisms....293

Practical Implementation Notes....294

Summary....294

Chapter 8: Multilayer Perceptron Block with Mixture of Experts (MoE) and SwiGLU....296

Mixture of Experts: A Comprehensive Overview....297

Architecture of Mixture of Experts....298

Components....299

Sparse Activation....300

Variants of MoE....300

Mathematical Foundations....301

Training Mechanisms....302

Loss Function....302

Backpropagation in MoE....302

Load Balancing....303

Optimization Enhancements....304

Challenges in Training....304

Applications of Mixture of Experts....305

Advantages of MoE....305

Limitations of MoE....306

SwiGLU: An In-Depth Exploration....307

Historical Context....308

The Basics: From Activation Functions to Gated Units....309

Activation Functions....309

Gated Linear Units (GLU)....309

SwiGLU: Definition and Structure....310

Mathematical Representation....311

Parameter Count....312

Training Mechanisms....312

Optimization Considerations....312

Challenges....313

Integration in Transformers....313

Applications of SwiGLU....314

Advantages of SwiGLU....314

Limitations of SwiGLU....315

Practical Considerations....315

Advanced Variants and Comparisons....316

Variants of SwiGLU....316

Comparisons with Other Activations....316

Multilayer Perceptron Blocks with Mixture of Experts (MoE) and SwiGLU: A Comprehensive Integration....317

Integrated Architecture: MLP with MoE and SwiGLU....318

Architecture Overview....318

Mathematical Foundations....320

Training Mechanisms....321

Forward and Backward Pass....321

Load Balancing....322

Optimization Enhancements....322

Challenges....322

Why They Work Together: Rationale....323

Scalability and Efficiency....323

Specialization and Expressiveness....323

Gradient Flow and Training Stability....323

Empirical Performance....323

Flexibility Across Domains....324

Advantages....324

High Performance....324

Computational Efficiency....324

Scalability....324

Robust Training....324

Limitations....324

Training Complexity....325

Hardware Dependency....325

Interpretability....325

Memory Requirements....325

Practical Implementation Considerations....325

Architecture Design....325

Framework Support....325

Training Workflow....326

Hardware Optimization....326

Evaluation....326

Future Directions....326

Multilayer Perceptron Block with Mixture of Experts (MoE) and SwiGLU for Our Large Language Model....327

In-Depth Analysis of the MLPBlock with Mixture of Experts (MoE) and SwiGLU....330

Overview of the MLPBlock....330

Initialization Components....331

Parameter Sharding and Memory Considerations....333

Forward Pass: Step-by-Step Execution....334

Step 1: Input and Residual Connection....334

Step 2: RMS Normalization....335

Step 3: Gating Network and Top-k Selection....335

Step 4: Gathering Expert Parameters....336

Step 5: First Projection (H → 2·I_local)....336

Step 6: SwiGLU Activation....337

Step 7: Second Projection (I_local → H)....338

Step 8: Tensor Parallelism....339

Step 9: Weighted Sum Across Experts....339

Step 10: Residual Connection and Output....339

Training Dynamics....340

Backpropagation....340

Load Balancing....341

Optimization....341

Numerical Stability....341

Edge Cases and Robustness....342

Summary....342

Chapter 9: Transformer Block and Full Transformer Model: It's Time to Put the Puzzle Together....344

The Role of the Transformer Block in Sequence Processing....344

The Architecture of the Transformers Block....345

Query-Key-Value Projections and Multihead Decomposition....348

Rotary Position Embeddings: Geometric Foundations....349

Scaled Dot-Product Attention: The Core Mechanism....349

The Feed-Forward Block: Position-Wise Transformations....351

Alternative Activations: SwiGLU and Gated Variants....352

Block Composition and Information Flow....353

Full Model Architecture and Training Dynamics....354

Building the Transformers Block for Our LLM from Scratch....354

The TransformerBlock Class....356

Initialization Method (__init__)....356

Forward Pass Method....358

Transformer Class: The Complete Language Model....360

Initialization Method: Building the Full Architecture....360

Forward Pass: From Tokens to Predictions....362

Step 1: Embedding Lookup....363

Step 2: Processing Through Transformer Layers....363

Step 3: Final Normalization....363

Step 4: Output Projection....364

from_checkpoint Class Method: Loading Pretrained Models....364

Step-by-Step Loading Process....365

Import Necessary Modules....365

Device Handling....365

Load Configuration from JSON....366

Initialize Model Architecture....366

Load Model Weights....367

Load State Dictionary into Model....367

Error Handling and Validation....368

Set to Evaluation Mode....369

Architecture Design Choices and Modern Practices....369

Advantages, Challenges, and Broader Impact....370

Parallelism and Scalability....370

Transfer Learning and Few-Shot Generalization....371

Interpretability and Mechanistic Understanding....372

Fundamental Challenges....374

Quadratic Complexity and Context Length Limitations....374

Data Efficiency and Sample Complexity....375

Catastrophic Forgetting and Continual Learning....376

Alignment and Control....376

Robustness and Adversarial Vulnerabilities....377

Future Trajectories....377

Architectural Innovations on the Horizon....377

Toward Artificial General Intelligence....380

Summary....384

Advantages and Challenges....384

Chapter 10: Dataset Preparation, Model Training, Token Generator for Inference and Prompting: The BIG Moment....386

Dataset Preparation for LLM Training....386

The Importance of Dataset Quality....386

Data Collection and Sourcing....387

Data Cleaning and Filtering....388

Deduplication....389

Data Formatting and Tokenization....389

Dataset Composition and Mixing....390

Best Practices and Recommendations....391

Preparing Our Dataset....393

Text Characteristics and Classification of the Dataset....401

Genre and Style Identification....401

Data Preparation Pipeline for This Text....401

Dataset Balancing Considerations....402

Training Large Language Models....403

Model Architecture....404

Training Objectives....404

Optimization....404

Distributed Training....404

Infrastructure....404

Training Stages....405

Challenges....405

Hyperparameter Selection....405

Compute and Efficiency....405

Advanced Techniques....406

Post-training....406

Ethical Considerations....406

Source Code for Training Our LLM....406

Understanding the Code Step-by-Step....413

Part 1: Environment Setup and Configuration....413

Part 2: Dataset Preparation....415

Part 3: Tokenizer Training or Loading....416

Part 4: Dataset Encoding and Caching....417

Part 5: DataLoader Setup....418

Part 6: Model Configuration and Initialization....419

Part 7: Optimizer Configuration....421

Part 8: Learning Rate Scheduling....422

Part 9: Model Compilation (Optional)....423

Part 10: The Training Loop....423

Part 11: Saving the Model....426

Understanding Training Dynamics....427

What the Model Learns....427

TokenGenerator for Inference....429

What Is TokenGenerator?....429

What Is TokenGenerator?....437

Core Functionality....437

The Generation Process....438

Sophisticated Anti-repetition System....438

1. Windowed Repetition and Frequency Penalties....438

2. No-Repeat N-gram Ban....439

3. Self N-gram Ban (Bigram/Trigram Blocking)....439

4. Dataset Anti-Copy Ban....440

5. Self Anti-Copy Ban....440

6. Variable-Period Loop Detection....441

Advanced Sampling Strategy....441

Typical Sampling....441

Top-K Sampling....441

Top-P (Nucleus) Sampling....442

Robust Fallbacks....442

Temperature Control....442

Why So Many Guardrails?....443

Practical Usage....443

Key Differences from Training....444

The Big Moment: Prompting Our Model....445

Understanding Inference Code and Model Behavior....449

Setup and Initialization....449

Building Anti-copy Indices....449

Text Cleaning Utilities....450

Loop Detection....450

Generation Function with Sophisticated Controls....451

Interactive Interface....451

Understanding the Output: An Educational Demonstration....452

The User Query....452

The Generated Output: Three Revealing Behaviors....452

Phase 1: Accessing Learned Knowledge....452

Phase 2: Pattern Repetition Under Constraints....453

Phase 3: Exploring Low-Probability Space....454

What This Demonstrates About Language Model Design....455

Principle 1: Model Scale and Generalization....455

Principle 2: Training Data Diversity....456

Principle 3: Training Duration and Compute....457

Principle 4: Instruction Fine-Tuning and Alignment....458

The Value of This Demonstration....458

What Pretraining Alone Achieves....459

What Requires Additional Scale and Training....459

How Inference Mechanisms Work....459

Practical Lessons from This Example....460

Lesson 1: Match Model Scale to Task Complexity....460

Lesson 2: Inference Constraints Have Limits....460

Lesson 3: Training Data Determines Capabilities....460

Lesson 4: Hyperparameter Tuning Matters....460

Lesson 5: Multistage Training Is Essential....461

Exploring Even Further with a Real Trained Model: TinyStories GPT-4 Version Implementation....461

TinyStories Dataset and Model....461

What Makes This Implementation Different....461

Learning from the Implementation....462

Key Technical Differences from the Book's Code....462

Architecture and Scale....463

Tokenization....463

Dataset Management....463

Distributed Training Infrastructure....463

Advanced Training Features....464

Configuration Management....464

Generation and Inference....464

Data Loading Strategy....464

Training Metrics and Monitoring....465

Optimization Details....465

Checkpoint Management....465

Web Deployment....465

Why This Implementation Wasn't Included in the Book....466

The Reality of Training Costs....466

Accessibility and Learning Goals....466

The Scaling Gradient....467

The Value Proposition....467

When to Make the Investment....468

Appreciating What Was Achieved....468

Looking Forward....469

Chapter 11: Advanced Training and CUDA Kernels....471

The Journey from Raw Text to Intelligent Assistant: The Art and Science of LLM Training....471

Pretraining: Building the Foundation....472

Pretraining Data....473

Next-Token Prediction....473

Architecture and Training Setup....473

Compute Budget, Duration, and Evaluation....474

Mid-Training: Targeted Capability Development....474

What Is Mid-Training?....474

Why Mid-Training Matters....474

Limitations of Pure Pretraining....474

Benefits of Mid-Training....475

Types of Mid-Training....475

Domain-Specific Mid-Training....475

Capability-Specific Mid-Training....475

Data Quality Enhancement....476

Mid-Training Methodology....476

Data Curation....476

Training Approach....476

Preventing Catastrophic Forgetting....477

Examples of Successful Mid-Training....477

Mid-Training vs. Fine-Tuning....478

Supervised Fine-Tuning (SFT): Teaching Instructions....478

The Transition from Base to Assistant....478

SFT Data: Instructions and Demonstrations....478

Data Format....478

Data Sources....479

Dataset Composition....479

SFT Training Process....479

Objective Function....479

Training Hyperparameters....480

Data Quality in SFT....480

The Importance of Quality over Quantity....480

Quality Indicators....480

Balancing the SFT Dataset....480

Task Distribution....480

Avoiding Overfitting....481

Multiturn Dialog Training....481

The Result: An Instruction-Following Model....481

Reinforcement Learning from Human Feedback (RLHF)....482

The Alignment Problem....482

The Three Stages of RLHF....482

Reward Model Training....482

Purpose....482

Data Collection Process....483

Training the Reward Model....483

Reward Model Outputs....483

Reinforcement Learning Optimization....484

The Setup....484

The Algorithm: Proximal Policy Optimization (PPO)....484

Key Innovation: KL Penalty....484

Challenges in RLHF....484

Reward Hacking....484

Reward Model Limitations....485

Training Instability....485

Practical Considerations....486

Computational Cost....486

Human Labeling....486

Results of RLHF....486

Alternative Alignment Approaches....487

Direct Preference Optimization (DPO)....487

The Innovation....487

How DPO Works....487

Advantages of DPO....487

Limitations....487

Constitutional AI (CAI)....488

Philosophy....488

Two-Stage Process....488

Constitutional Principles Example....488

Advantages....488

Reinforcement Learning from AI Feedback (RLAIF)....489

Core Idea....489

When RLAIF Works Well....489

Limitations....489

Iterative Approaches....489

Iterative RLHF....489

Online Learning....490

Hybrid Approaches....490

Post-Training Techniques and Refinements....490

Context Distillation....490

Self-Improvement Techniques....491

Self-Critique....491

Iterative Refinement....491

Red Teaming and Adversarial Training....491

Red Teaming....491

Adversarial Training....491

Capability-Specific Fine-Tuning....492

Multiobjective Optimization....492

Evaluation and Benchmarking....492

Evaluation During Pretraining....492

Intrinsic Metrics....492

Downstream Tasks....493

Evaluation During Supervised Fine-Tuning....493

Instruction Following....493

Task Performance....493

Style and Format....493

Evaluation During Alignment Training....494

Preference Modeling....494

Safety Evaluations....494

Human Evaluation....494

Comprehensive Benchmarks....494

Knowledge and Reasoning....494

Coding....495

Safety and Alignment....495

Multilingual....495

The Limitations of Benchmarks....495

Benchmark Saturation....495

Gap Between Benchmarks and Real-World Use....495

Solutions....496

Practical Considerations and Best Practices....496

Data Is King....496

Computational Resource Management....496

Cost-Benefit Analysis....496

Strategic Decisions....497

Preventing Degradation....497

Common Pitfalls....497

Prevention Strategies....497

Scaling Laws and Efficiency....498

Compute-Optimal Training....498

Efficiency Techniques....498

Responsible AI Considerations....498

Throughout the Training Pipeline....498

Red Lines....499

The Future of LLM Training....499

Emerging Trends....499

Multimodal Training....499

Longer Context Windows....499

Continuous Learning....500

More Efficient Training Methods....500

Few-Shot and Zero-Shot Alignment....500

Self-Supervised Alignment....500

Mixture of Experts....500

Better Evaluation....501

More Robust Benchmarks....501

Automatic Evaluation....501

Democratization....501

Smaller, More Efficient Models....501

Open-Source Progress....501

Better Tools and Infrastructure....502

Theoretical Understanding....502

Why Does It Work?....502

Controllability....502

Training Neural Networks with CUDA Kernels and Modern Frameworks....502

Understanding CUDA and GPU Computing....503

What Is CUDA?....503

Why GPUs for Deep Learning?....503

The CPU vs. GPU Paradigm....503

CUDA Kernels Explained....504

What Is a CUDA Kernel?....504

Thread Hierarchy....504

Kernel Syntax....504

Memory Hierarchy....505

CUDA Kernels in Neural Network Training....505

Where Kernels Are Used....505

The Training Loop at Kernel Level....506

Why Custom Kernels Matter....507

Practical CUDA Kernel Examples....507

Example 1: Element-Wise ReLU Activation....507

Example 2: Matrix Multiplication (Naive Implementation)....509

Example 3: Optimized Matrix Multiplication with Shared Memory....510

Example 4: Softmax with Numerical Stability....511

Example 5: Custom Fused Kernel: LayerNorm + GELU....513

Integration with Deep Learning Frameworks....516

PyTorch Custom CUDA Extensions....516

Triton: High-Level GPU Programming....518

Performance Optimization Strategies....521

Memory Coalescing....521

Occupancy Optimization....521

Kernel Fusion....521

Asynchronous Operations....522

Using Tensor Cores....522

Real-World Training Performance....523

Profiling and Bottleneck Identification....523

FlashAttention Example....524

Training at Scale....524

Practical Tips for Working with CUDA Kernels....525

When to Write Custom Kernels....525

Development Workflow....525

Debugging CUDA Kernels....526

Common Pitfalls....526

Appendix: Glossary of Terms....528

Index....530

This book is a complete, hands-on guide to designing, training, and deploying your own Large Language Models (LLMs)—from the foundations of tokenization to the advanced stages of fine-tuning and reinforcement learning. Written for developers, data scientists, and AI practitioners, it bridges core principles and state-of-the-art techniques, offering a rare, transparent look at how modern transformers truly work beneath the surface.

Starting from the essentials, you’ll learn how to set up your environment with Python and PyTorch, manage datasets, and implement critical fundamentals such as tensors, embeddings, and gradient descent. You’ll then progress through the architectural heart of modern models, covering RMS normalization, rotary positional embeddings (RoPE), scaled dot-product attention, Grouped Query Attention (GQA), Mixture of Experts (MoE), and SwiGLU activations, each explored in depth and built step by step in code. As you advance, the book introduces custom CUDA kernel integration, teaching you how to optimize key components for speed and memory efficiency at the GPU level—an essential skill for scaling real-world LLMs. You’ll also gain mastery over the phases of training that define today’s leading models:
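
To give a flavor of the kind of component the book builds step by step, here is a minimal RMS normalization layer in PyTorch. This is an illustrative sketch only, not the book's exact class; the class name, hidden_size argument, and eps default are our assumptions:

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # Minimal RMS normalization sketch: rescale by the root mean square of the
    # features with a learnable per-feature gain; unlike LayerNorm, no mean is subtracted.
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

# Example: a batch of token embeddings of shape (batch, sequence, hidden_size)
x = torch.randn(2, 4, 8)
print(RMSNorm(8)(x).shape)  # torch.Size([2, 4, 8])

Each of the listed components is derived and implemented in a similarly small, self-contained module before being assembled into the full model.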

  • Pretraining - Building general linguistic and semantic understanding.
  • Midtraining - Expanding domain-specific capabilities and adaptability.
  • Supervised Fine-Tuning (SFT) - Aligning behavior with curated, task-driven data.
  • Reinforcement Learning from Human Feedback (RLHF) - Refining responses through reward-based optimization for human alignment.

The final chapters guide you through dataset preparation, filtering, deduplication, and training optimization, culminating in model evaluation and real-world prompting with a custom TokenGenerator for text generation and inference.
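
As a rough illustration of what token generation involves (not the book's actual TokenGenerator, which layers repetition penalties, n-gram bans, nucleus sampling, and loop detection on top of this), a bare-bones sampling loop with temperature scaling and top-k filtering might look like the following sketch. It assumes a model whose call model(input_ids) returns logits of shape (batch, sequence, vocab_size):

import torch

@torch.no_grad()
def sample_tokens(model, input_ids, max_new_tokens=50, temperature=0.8, top_k=40):
    # Hypothetical minimal loop: append one sampled token at a time.
    for _ in range(max_new_tokens):
        logits = model(input_ids)[:, -1, :]           # logits for the last position
        logits = logits / max(temperature, 1e-5)      # temperature scaling
        top_vals, _ = torch.topk(logits, top_k)
        logits[logits < top_vals[:, [-1]]] = float("-inf")  # keep only the top-k logits
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)  # sample one token per sequence
        input_ids = torch.cat([input_ids, next_token], dim=1)
    return input_ids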

By the end of this book, you’ll have the knowledge and confidence to architect, train, and deploy your own transformer-based models, equipped with both the theoretical depth and practical expertise to innovate in the rapidly evolving world of AI.

What You’ll Learn

  • How to configure and optimize your development environment using PyTorch.
  • The mechanics of tokenization, embeddings, normalization, and attention mechanisms.
  • How to implement transformer components like RMSNorm, RoPE, GQA, MoE, and SwiGLU from scratch.
  • How to integrate custom CUDA kernels to accelerate transformer computations.
  • The full LLM training pipeline: pretraining, midtraining, supervised fine-tuning, and RLHF.
  • Techniques for dataset preparation, deduplication, model debugging, and GPU memory management.
  • How to train, evaluate, and deploy a complete GPT-like architecture for real-world tasks.

Who This Book Is For:

Software developers, data scientists, machine learning engineers, and AI enthusiasts looking to build their own models from scratch.

