Table of Contents....5
About the Author....21
About the Technical Reviewer....22
Introduction....23
Chapter 1: What Is a Large Language Model? Getting Started with Libraries and Environment Setup for Building an LLM from Scratch....24
Getting Started with the Foundations of Language Modeling....24
What Is a Large Language Model (LLM)?....26
The Attention Mechanism: Cornerstone of Modern Large Language Models (LLMs)....29
What Are Attention Mechanisms?....30
Detailed Explanation About How Self-Attention Works in Transformers....30
Why Attention Is Central to LLMs....32
Variants and Innovations in Attention (As of May 2025)....33
Challenges and Limitations of Attention....34
Practical Significance in LLMs....35
The Tools We Will Use to Build a Large Language Model from Scratch....35
Why Python for Building Large Language Models?....37
Readable, Simple, and Accessible for All Skill Levels....37
Extensive Machine Learning Ecosystem and Strong Community Support....37
Integrated Tools for Visualization, Experimentation, and Full ML Pipelines....38
High Performance Through GPU Acceleration, Scalability, and Low-Level Interoperability....38
Production-Ready Deployment and Access to Pretrained Models for Transfer Learning....38
Why Jupyter Notebook for Building Large Language Models?....39
Interactive, Iterative Coding with Immediate Feedback....39
Unified Environment for Code, Documentation, and Visualization....40
Deep Debugging Transparency and Real-Time Inspection....40
Optimized for Learning, Collaboration, and Reproducible Research....40
Seamless Integration with Scientific Libraries and Scalable Environments....40
Why PyTorch for Building Large Language Models?....41
Dynamic, Flexible Execution for Fast Experimentation and Debugging....41
Pythonic Design and Deep Integration with the Scientific Ecosystem....41
High-Performance GPU Acceleration and Parallel Computing at Scale....41
Modular Architecture and a Rich, Research-Friendly Ecosystem....42
Production-Ready Deployment, Interoperability, and Future-Proof Development....42
Why CUDA for Building Large Language Models?....43
Massive Parallelism for Compute-Intensive Operations....43
Seamless Integration with Deep Learning Frameworks....44
Support for Mixed Precision Training....45
Optimized Libraries for Deep Learning....45
Scalability for Distributed Training....46
Support for Advanced Hardware Features....47
Flexibility for Custom Kernels and Optimizations....47
Rapid Prototyping and Experimentation....48
Support for Inference Optimization....48
Extensive Ecosystem and Community Support....49
Energy Efficiency and Cost Savings....50
Future-Readiness and Innovation....50
Accessibility for Diverse Developers....51
Support for Research and Custom Architectures....51
Why NumPy and Matplotlib for Building Large Language Models?....52
NumPy: Efficient Numerical Computations for Data Processing....53
NumPy: Memory Efficiency for Large-Scale Data....54
NumPy: Foundation for Preprocessing and Feature Engineering....55
NumPy: Support for Custom Algorithms and Research....55
Matplotlib: Visualization for Model Understanding....56
Matplotlib: Debugging and Interpretability....56
Matplotlib: Customizable and Publication-Ready Visuals....57
NumPy and Matplotlib: Seamless Integration with Python Ecosystem....57
NumPy and Matplotlib: Accessibility for Beginners....58
NumPy: Support for Dataset Creation and Evaluation....58
Matplotlib: Support for Experiment Tracking....59
NumPy and Matplotlib: Support for Research and Innovation....59
NumPy and Matplotlib: Lightweight and Local-First....60
NumPy and Matplotlib: Community and Ecosystem Support....60
TinyStories, Shakespeare, The Wizard of Oz, and OpenWebText....61
Four Key Datasets for Building and Understanding LLMs....61
TinyStories: Learning LLM Fundamentals Through Simple Narratives....62
Shakespeare: Stylized, Poetic, and Ideal for Demonstrations....62
The Wizard of Oz: A Compact, Coherent Narrative for Prototyping....62
OpenWebText: Large-Scale, Diverse, and Suitable for Real LLM Training....63
How These Datasets Work Together....63
Ethical and Practical Considerations....63
Why PyLZMA for Building Large Language Models?....64
Why PyLZMA Matters for LLM Development....64
Advanced Concepts....67
In Summary....69
Chapter 2: Foundational Concepts in LLM Development....70
Large Language Models' Common Architecture....70
Transformer Architecture: Conceptual and Mathematical Deep Dive....71
Input Embedding....72
Positional Encoding....73
Attention Mechanism....75
Scaled Dot-Product Attention....75
Multihead Attention....76
Masked Attention (Decoder)....77
Feed-Forward Neural Network....78
Residual Connections and Layer Normalization....78
Output Layer....79
Loss Function....80
Encoder-Decoder Interaction....80
Complexity Analysis....81
Advantages of the Transformer....81
Limitations....82
Applications....83
Variants and Improvements....83
The Comprehensive Mathematics Behind Training Large Language Models....84
Explaining the Purpose and Scope....84
Defining the Goal of Language Modeling....84
Next-Token Prediction....85
Modeling Sequential Probability....85
Defining the Loss Function....85
Softmax for Probability Distribution....85
Practical Considerations....86
Masked Language Modeling....86
Bidirectional Context Modeling....86
Comparison with Next-Token Prediction....86
Optimization: Gradient-Based Learning....87
Gradient Descent....87
Iterative Parameter Updates....87
Stochastic Gradient Descent....87
Adam Optimizer....88
Adaptive Optimization....88
Why Adam Works for LLMs....88
Backpropagation....89
Computing Gradients....89
Role of Automatic Differentiation....89
Regularization Techniques....89
Preventing Overfitting....89
Dropout....90
Randomly Deactivating Neurons....90
Why Dropout Helps....90
Weight Decay....90
Penalizing Large Weights....90
Implementation in Adam....90
Label Smoothing....91
Softening Target Distributions....91
Impact on Training....91
Fine-Tuning and RLHF....91
Adapting Pretrained Models....91
Supervised Fine-Tuning (SFT)....91
Task-Specific Optimization....91
Differences from Pretraining....92
Reinforcement Learning from Human Feedback (RLHF)....92
Aligning with Human Preferences....92
Why RLHF Is Effective....93
Summarizing the Mathematical Framework....93
Modern LLM Architecture Characteristics As of the End of 2024 and During 2025....93
In Summary....96
Chapter 3: Building a Tokenizer for the Transformers Architecture Model....97
What Is Tokenization?....98
Why Is Tokenization Important?....99
Types of Tokenizers....100
Word-Based Tokenizers....100
Character-Based Tokenizers....100
Subword-Based Tokenizers....101
Rule-Based Tokenizers....101
Byte-Pair Encoding (BPE): A Deep Dive....102
Origins of BPE....102
BPE Algorithm Outline and Mathematical Formulation....103
Identify Frequent Pairs....103
Replace and Record....103
Repeat Until No Gains....104
Decompression (Decoding)....104
BPE Algorithm Example....104
Concrete Example of the Encoding Part....104
Iteration 1....105
Iteration 2....106
Iteration 3....106
Encoding New Text....108
Concrete Example of the Decoding Part....108
Optimization Objective....109
Vocabulary Size and Hyperparameters....109
Step-by-Step Process of Building a BPE Tokenizer....110
Initialize the Vocabulary....110
Count Pair Frequencies....110
Merge the Most Frequent Pair....111
Repeat Merging....111
Tokenize New Text....111
Map Tokens to IDs....112
Applications of BPE....112
Advantages of BPE....112
Limitations of BPE....113
Comparison with Other Tokenization Methods....113
Practical Considerations....114
BPE Implementation Walkthrough....115
Train Method Code Breakdown: Step by Step....120
Step 1: Preprocess the Corpus....120
Step 2: Collect Base Characters....121
Step 3: Build Initial Vocabulary....121
Step 4: Map Text to Initial Token IDs....122
Step 5: Learn Merges....122
Step 6: Build GPT-2-Style Merge Ranks....123
The Encode Method....124
The Decode Method: One of the Most Important Ones....125
Step 1: Reconstruct Raw String from Tokens....127
Step 2: Normalize Unicode and Remove Control Characters....128
Step 3: Clean Up Underscore Markup....128
Step 4: Clean Up Whitespace and Punctuation....129
Why This Method Is One of the Most Important....129
Save and Load Methods....130
Helping Functions....132
Testing Our Tokenizer for Accuracy....139
Tokenizer Output....145
Analysis of the Output....149
Training the Tokenizer....149
Learned Merges....150
Encoding and Decoding Examples....151
Example A: Simple Sentence....151
Example B: Sentence with Newline....151
Example C: Special Token ....152
Example D: With BOS and EOS....152
Step-by-Step BPE Trace....152
Saving and Loading....153
Chapter 4: RMS Normalization and Model Configuration....155
Model Parameters Configuration and Mathematical Foundations....156
Mathematical Notation....157
Global Tensor Shapes and Consistency....157
num_hidden_layers: The Depth of Abstraction....158
vocab_size: The Universe of Symbols....158
hidden_size: The Resolution of Thought....159
intermediate_size: The Breathing Room of the Network....160
head_dim: The Grain of Attention....161
num_attention_heads: The Many Eyes of the Model....161
num_key_value_heads: Sharing the Burden....162
sliding_window: Constraining Attention....163
initial_context_length: How Far the Model Can See....164
The Geometry of Position: rope_theta, rope_scaling_factor, rope_ntk_alpha, rope_ntk_beta....164
rope_theta: The Base Frequency....165
rope_scaling_factor: Stretching the Map....165
rope_ntk_alpha: Balancing Long-Context Stability....165
rope_ntk_beta: Controlling the Rate of Growth....166
swiglu_limit: Taming the Activations....166
What Is SwiGLU?....166
Background on Related Concepts....167
How SwiGLU Works....167
Advantages and Usage....167
SwiGLU Enhances FFN Expressiveness in GPT Architectures with Gated Activations....168
RMS Normalization in GPT Architectures....169
What Is RMS Normalization?....169
Why Is RMSNorm Used in GPT Architectures?....170
How Is RMSNorm Used in GPT Architectures?....171
Step-by-Step Implementation in a Transformer Block....171
RMS Normalization for Our Custom Large Language Model....172
Comprehensive Explanation of RMSNorm Code....172
Detailed Code Explanation....173
1. Import and Class Definition....173
2. Initialization (__init__)....174
3. Forward Pass (forward)....176
Mathematical Summary....179
Implementation Details and Design Choices....179
Summary: RMS Normalization and Model Configuration....180
Chapter 5: Rotary Positional Embeddings: Integrating NTK and YaRN Scaling....183
Rotary Positional Embeddings: An In-Depth and Comprehensive Exploration....183
The Fundamental Need for Positional Embeddings in Transformers....185
Historical Evolution of Positional Encodings....185
Pretransformer Approaches: Sequential Processing....185
Early Transformer Positional Encodings....186
Sinusoidal Positional Embeddings....187
Learned Absolute Positional Embeddings....188
Relative Positional Encodings....188
Transformer-XL Relative Embeddings....189
T5 Relative Bias....189
Emergence of Rotary Positional Embeddings (RoPE)....190
Summary of Traditional Positional Embedding Approaches....191
Absolute Positional Embeddings....191
Relative Positional Embeddings....192
Absolute vs. Relative Positional Embeddings....193
Mathematical Formulation of RoPE....193
2D Intuition and Derivation....193
Starting in Two Dimensions....194
The Complex Number Perspective....194
Generalizing to High Dimensions....195
Attention Decay and Frequency Selection....195
Efficient Implementation....196
Attention Matrix Heatmap....197
Connection to Linguistic Interpretability....199
Advantages of RoPE....200
Applications in Modern Models....201
Variants and Extensions....202
Why Rotary Positional Embeddings Are Essential to Large Language Models....202
Solving Permutation Invariance....202
Superior Length Extrapolation....202
Balancing Absolute and Relative Information....203
Efficiency and Scalability....203
Natural Attention Decay....203
Training Benefits....204
Industry Adoption....204
RoPE Embeddings Implementation....204
Function Signature....205
Step-by-Step Explanation....205
1. Splitting the Embedding Tensor....206
2. Preparing Cosine and Sine Tensors....206
3. Applying Rotations....207
Output....208
Key Features and Safety Considerations....208
Example Workflow....208
Geometric Interpretation....209
RotaryEmbedding Class Definition and Methods....209
Class Structure....212
Detailed Explanation....213
1. Initialization (__init__)....213
2. Computing Concentration and Inverse Frequencies (_compute_concentration_and_inv_freq)....214
3. Computing Cosine and Sine Tensors (_compute_cos_sin)....217
4. Forward Pass (forward)....218
Key Features and Safety Considerations....219
Example Workflow....220
Geometric Interpretation....220
What Is NTK (Neural Tangent Kernel) in the Context of RoPE....221
Implementation in RotaryEmbedding....222
Key Features....223
What Is YaRN (Yet Another RoPE Extension)....223
YaRN's Scaling Approach....224
Implementation in RotaryEmbedding....225
Key Features....226
Summary....226
Chapter 6: Scaled Dot-Product Attention Core with Sliding Window and Grouped Query Attention: The Core Behind All Transformer Models....228
What Is Scaled Dot-Product Attention (SDPA)?....228
Historical Evolution and Contextual Foundations....229
Intuitive and Conceptual Underpinnings, Math Formulations....230
Masking Mechanisms....231
Causal Masking....232
Padding Masking....234
Custom Masks for Structured Data....235
Sparse Attention Masks....237
Learned and Adaptive Masks....239
Bidirectional and Cross-Attention Masks....240
Specialized Attention Forms....241
Hierarchical Attention....241
Adaptive Attention Span....243
Challenges and Limitations of Masking Mechanisms....244
Complexity of Designing Masks....245
The Risk of Over-Masking....245
Computational Overhead of Generating and Applying Masks....246
Future Directions in Masking....249
The Enduring Legacy of SDPA....250
Custom Implementation of SDPA with Sliding Window and Grouped Query Attention for Our LLM....250
Understanding the Inputs and Outputs....253
Step 1: Shape Inference, Broadcasting, and Reshaping....254
Step 2: Computing Raw Attention Scores....255
Step 3: Constructing and Applying the Attention Mask....256
Step 4: Incorporating Sink Logits....257
Step 5: Softmax Normalization and Output Computation....258
Summary....259
Chapter 7: AttentionBlock with Rotary Embedding, GQA, Sliding Window, and Sink Tokens....261
Fundamentals of Attention....261
Self-Attention in Transformers....262
Operational Properties....263
Causal Self-Attention....264
Implementation in Transformers....265
Role in LLMs....265
Limitations and Challenges....266
Optimizations and Variants....266
Multihead Attention in Transformers....267
Operational Principles....268
Integration in Transformers....269
Positional Considerations....269
Role in LLMs....270
Limitations and Challenges....271
Optimizations and Variants....271
Advanced Considerations (up to 2025)....272
What Are Grouped Query Attention, Sliding Window Attention, and Sink Tokens?....273
Grouped Query Attention (GQA)....273
Sliding Window Attention (SWA)....274
Sink Tokens....274
Integration in the Attention Block....275
Integration of Attention Mechanism in Our Custom Large Language Model....275
Exhaustive Explanation of the AttentionBlock PyTorch Module....278
Class Definition and PyTorch Integration....279
Constructor (__init__) Method....279
Parameter Extraction and GQA Configuration....280
Validation Checks....280
Sliding Window and GQA Grouping....281
Normalization Layer....282
QKV Projection Layer....282
Output Projection Layer....283
Rotary Embedding Initialization....283
Sink Logits Parameter....284
Softmax Scaling Factor....285
Forward Pass (forward Method)....285
Input Unpacking and Residual Connection....286
Pre-attention Normalization....286
QKV Projection and Slicing....286
GQA Reshaping....287
RoPE Application....287
Core Attention Computation....288
Output Projection and Residual Addition....289
Training and Inference Behaviors....290
Performance Optimizations....290
Interpretability....292
Comparison to Other Attention Mechanisms....293
Practical Implementation Notes....294
Summary....294
Chapter 8: Multilayer Perceptron Block with Mixture of Experts (MoE) and SwiGLU....296
Mixture of Experts: A Comprehensive Overview....297
Architecture of Mixture of Experts....298
Components....299
Sparse Activation....300
Variants of MoE....300
Mathematical Foundations....301
Training Mechanisms....302
Loss Function....302
Backpropagation in MoE....302
Load Balancing....303
Optimization Enhancements....304
Challenges in Training....304
Applications of Mixture of Experts....305
Advantages of MoE....305
Limitations of MoE....306
SwiGLU: An In-Depth Exploration....307
Historical Context....308
The Basics: From Activation Functions to Gated Units....309
Activation Functions....309
Gated Linear Units (GLU)....309
SwiGLU: Definition and Structure....310
Mathematical Representation....311
Parameter Count....312
Training Mechanisms....312
Optimization Considerations....312
Challenges....313
Integration in Transformers....313
Applications of SwiGLU....314
Advantages of SwiGLU....314
Limitations of SwiGLU....315
Practical Considerations....315
Advanced Variants and Comparisons....316
Variants of SwiGLU....316
Comparisons with Other Activations....316
Multilayer Perceptron Blocks with Mixture of Experts (MoE) and SwiGLU: A Comprehensive Integration....317
Integrated Architecture: MLP with MoE and SwiGLU....318
Architecture Overview....318
Mathematical Foundations....320
Training Mechanisms....321
Forward and Backward Pass....321
Load Balancing....322
Optimization Enhancements....322
Challenges....322
Why They Work Together: Rationale....323
Scalability and Efficiency....323
Specialization and Expressiveness....323
Gradient Flow and Training Stability....323
Empirical Performance....323
Flexibility Across Domains....324
Advantages....324
High Performance....324
Computational Efficiency....324
Scalability....324
Robust Training....324
Limitations....324
Training Complexity....325
Hardware Dependency....325
Interpretability....325
Memory Requirements....325
Practical Implementation Considerations....325
Architecture Design....325
Framework Support....325
Training Workflow....326
Hardware Optimization....326
Evaluation....326
Future Directions....326
Multilayer Perceptron Block with Mixture of Experts (MoE) and SwiGLU for Our Large Language Model....327
In-Depth Analysis of the MLPBlock with Mixture of Experts (MoE) and SwiGLU....330
Overview of the MLPBlock....330
Initialization Components....331
Parameter Sharding and Memory Considerations....333
Forward Pass: Step-by-Step Execution....334
Step 1: Input and Residual Connection....334
Step 2: RMS Normalization....335
Step 3: Gating Network and Top-k Selection....335
Step 4: Gathering Expert Parameters....336
Step 5: First Projection (H → 2 × I_local)....336
Step 6: SwiGLU Activation....337
Step 7: Second Projection (I_local → H)....338
Step 8: Tensor Parallelism....339
Step 9: Weighted Sum Across Experts....339
Step 10: Residual Connection and Output....339
Training Dynamics....340
Backpropagation....340
Load Balancing....341
Optimization....341
Numerical Stability....341
Edge Cases and Robustness....342
Summary....342
Chapter 9: Transformer Block and Full Transformer Model: It's Time to Put the Puzzle Together....344
The Role of the Transformer Block in Sequence Processing....344
The Architecture of the Transformers Block....345
Query-Key-Value Projections and Multihead Decomposition....348
Rotary Position Embeddings: Geometric Foundations....349
Scaled Dot-Product Attention: The Core Mechanism....349
The Feed-Forward Block: Position-Wise Transformations....351
Alternative Activations: SwiGLU and Gated Variants....352
Block Composition and Information Flow....353
Full Model Architecture and Training Dynamics....354
Building the Transformers Block for Our LLM from Scratch....354
The TransformerBlock Class....356
Initialization Method (__init__)....356
Forward Pass Method....358
Transformer Class: The Complete Language Model....360
Initialization Method: Building the Full Architecture....360
Forward Pass: From Tokens to Predictions....362
Step 1: Embedding Lookup....363
Step 2: Processing Through Transformer Layers....363
Step 3: Final Normalization....363
Step 4: Output Projection....364
from_checkpoint Class Method: Loading Pretrained Models....364
Step-by-Step Loading Process....365
Import Necessary Modules....365
Device Handling....365
Load Configuration from JSON....366
Initialize Model Architecture....366
Load Model Weights....367
Load State Dictionary into Model....367
Error Handling and Validation....368
Set to Evaluation Mode....369
Architecture Design Choices and Modern Practices....369
Advantages, Challenges, and Broader Impact....370
Parallelism and Scalability....370
Transfer Learning and Few-Shot Generalization....371
Interpretability and Mechanistic Understanding....372
Fundamental Challenges....374
Quadratic Complexity and Context Length Limitations....374
Data Efficiency and Sample Complexity....375
Catastrophic Forgetting and Continual Learning....376
Alignment and Control....376
Robustness and Adversarial Vulnerabilities....377
Future Trajectories....377
Architectural Innovations on the Horizon....377
Toward Artificial General Intelligence....380
Summary....384
Advantages and Challenges....384
Chapter 10: Dataset Preparation, Model Training, Token Generator for Inference and Prompting: The BIG Moment....386
Dataset Preparation for LLM Training....386
The Importance of Dataset Quality....386
Data Collection and Sourcing....387
Data Cleaning and Filtering....388
Deduplication....389
Data Formatting and Tokenization....389
Dataset Composition and Mixing....390
Best Practices and Recommendations....391
Preparing Our Dataset....393
Text Characteristics and Classification of the Dataset....401
Genre and Style Identification....401
Data Preparation Pipeline for This Text....401
Dataset Balancing Considerations....402
Training Large Language Models....403
Model Architecture....404
Training Objectives....404
Optimization....404
Distributed Training....404
Infrastructure....404
Training Stages....405
Challenges....405
Hyperparameter Selection....405
Compute and Efficiency....405
Advanced Techniques....406
Post-training....406
Ethical Considerations....406
Source Code for Training Our LLM....406
Understanding the Code Step-by-Step....413
Part 1: Environment Setup and Configuration....413
Part 2: Dataset Preparation....415
Part 3: Tokenizer Training or Loading....416
Part 4: Dataset Encoding and Caching....417
Part 5: DataLoader Setup....418
Part 6: Model Configuration and Initialization....419
Part 7: Optimizer Configuration....421
Part 8: Learning Rate Scheduling....422
Part 9: Model Compilation (Optional)....423
Part 10: The Training Loop....423
Part 11: Saving the Model....426
Understanding Training Dynamics....427
What the Model Learns....427
TokenGenerator for Inference....429
What Is TokenGenerator?....429
What Is TokenGenerator?....437
Core Functionality....437
The Generation Process....438
Sophisticated Anti-repetition System....438
1. Windowed Repetition and Frequency Penalties....438
2. No-Repeat N-gram Ban....439
3. Self N-gram Ban (Bigram/Trigram Blocking)....439
4. Dataset Anti-Copy Ban....440
5. Self Anti-Copy Ban....440
6. Variable-Period Loop Detection....441
Advanced Sampling Strategy....441
Typical Sampling....441
Top-K Sampling....441
Top-P (Nucleus) Sampling....442
Robust Fallbacks....442
Temperature Control....442
Why So Many Guardrails?....443
Practical Usage....443
Key Differences from Training....444
The Big Moment: Prompting Our Model....445
Understanding Inference Code and Model Behavior....449
Setup and Initialization....449
Building Anti-copy Indices....449
Text Cleaning Utilities....450
Loop Detection....450
Generation Function with Sophisticated Controls....451
Interactive Interface....451
Understanding the Output: An Educational Demonstration....452
The User Query....452
The Generated Output: Three Revealing Behaviors....452
Phase 1: Accessing Learned Knowledge....452
Phase 2: Pattern Repetition Under Constraints....453
Phase 3: Exploring Low-Probability Space....454
What This Demonstrates About Language Model Design....455
Principle 1: Model Scale and Generalization....455
Principle 2: Training Data Diversity....456
Principle 3: Training Duration and Compute....457
Principle 4: Instruction Fine-Tuning and Alignment....458
The Value of This Demonstration....458
What Pretraining Alone Achieves....459
What Requires Additional Scale and Training....459
How Inference Mechanisms Work....459
Practical Lessons from This Example....460
Lesson 1: Match Model Scale to Task Complexity....460
Lesson 2: Inference Constraints Have Limits....460
Lesson 3: Training Data Determines Capabilities....460
Lesson 4: Hyperparameter Tuning Matters....460
Lesson 5: Multistage Training Is Essential....461
Exploring Even Further: A Real Trained Model: TinyStories GPT-4 Version Implementation....461
TinyStories Dataset and Model....461
What Makes This Implementation Different....461
Learning from the Implementation....462
Key Technical Differences from the Book's Code....462
Architecture and Scale....463
Tokenization....463
Dataset Management....463
Distributed Training Infrastructure....463
Advanced Training Features....464
Configuration Management....464
Generation and Inference....464
Data Loading Strategy....464
Training Metrics and Monitoring....465
Optimization Details....465
Checkpoint Management....465
Web Deployment....465
Why This Implementation Wasn't Included in the Book....466
The Reality of Training Costs....466
Accessibility and Learning Goals....466
The Scaling Gradient....467
The Value Proposition....467
When to Make the Investment....468
Appreciating What Was Achieved....468
Looking Forward....469
Chapter 11: Advanced Training and CUDA Kernels....471
The Journey from Raw Text to Intelligent Assistant: The Art and Science of LLM Training....471
Pretraining: Building the Foundation....472
Pretraining Data....473
Next-Token Prediction....473
Architecture and Training Setup....473
Compute Budget, Duration, and Evaluation....474
Mid-Training: Targeted Capability Development....474
What Is Mid-Training?....474
Why Mid-Training Matters....474
Limitations of Pure Pretraining....474
Benefits of Mid-Training....475
Types of Mid-Training....475
Domain-Specific Mid-Training....475
Capability-Specific Mid-Training....475
Data Quality Enhancement....476
Mid-Training Methodology....476
Data Curation....476
Training Approach....476
Preventing Catastrophic Forgetting....477
Examples of Successful Mid-Training....477
Mid-Training vs. Fine-Tuning....478
Supervised Fine-Tuning (SFT): Teaching Instructions....478
The Transition from Base to Assistant....478
SFT Data: Instructions and Demonstrations....478
Data Format....478
Data Sources....479
Dataset Composition....479
SFT Training Process....479
Objective Function....479
Training Hyperparameters....480
Data Quality in SFT....480
The Importance of Quality over Quantity....480
Quality Indicators....480
Balancing the SFT Dataset....480
Task Distribution....480
Avoiding Overfitting....481
Multiturn Dialog Training....481
The Result: An Instruction-Following Model....481
Reinforcement Learning from Human Feedback (RLHF)....482
The Alignment Problem....482
The Three Stages of RLHF....482
Reward Model Training....482
Purpose....482
Data Collection Process....483
Training the Reward Model....483
Reward Model Outputs....483
Reinforcement Learning Optimization....484
The Setup....484
The Algorithm: Proximal Policy Optimization (PPO)....484
Key Innovation: KL Penalty....484
Challenges in RLHF....484
Reward Hacking....484
Reward Model Limitations....485
Training Instability....485
Practical Considerations....486
Computational Cost....486
Human Labeling....486
Results of RLHF....486
Alternative Alignment Approaches....487
Direct Preference Optimization (DPO)....487
The Innovation....487
How DPO Works....487
Advantages of DPO....487
Limitations....487
Constitutional AI (CAI)....488
Philosophy....488
Two-Stage Process....488
Constitutional Principles Example....488
Advantages....488
Reinforcement Learning from AI Feedback (RLAIF)....489
Core Idea....489
When RLAIF Works Well....489
Limitations....489
Iterative Approaches....489
Iterative RLHF....489
Online Learning....490
Hybrid Approaches....490
Post-Training Techniques and Refinements....490
Context Distillation....490
Self-Improvement Techniques....491
Self-Critique....491
Iterative Refinement....491
Red Teaming and Adversarial Training....491
Red Teaming....491
Adversarial Training....491
Capability-Specific Fine-Tuning....492
Multiobjective Optimization....492
Evaluation and Benchmarking....492
Evaluation During Pretraining....492
Intrinsic Metrics....492
Downstream Tasks....493
Evaluation During Supervised Fine-Tuning....493
Instruction Following....493
Task Performance....493
Style and Format....493
Evaluation During Alignment Training....494
Preference Modeling....494
Safety Evaluations....494
Human Evaluation....494
Comprehensive Benchmarks....494
Knowledge and Reasoning....494
Coding....495
Safety and Alignment....495
Multilingual....495
The Limitations of Benchmarks....495
Benchmark Saturation....495
Gap Between Benchmarks and Real-World Use....495
Solutions....496
Practical Considerations and Best Practices....496
Data Is King....496
Computational Resource Management....496
Cost-Benefit Analysis....496
Strategic Decisions....497
Preventing Degradation....497
Common Pitfalls....497
Prevention Strategies....497
Scaling Laws and Efficiency....498
Compute-Optimal Training....498
Efficiency Techniques....498
Responsible AI Considerations....498
Throughout the Training Pipeline....498
Red Lines....499
The Future of LLM Training....499
Emerging Trends....499
Multimodal Training....499
Longer Context Windows....499
Continuous Learning....500
More Efficient Training Methods....500
Few-Shot and Zero-Shot Alignment....500
Self-Supervised Alignment....500
Mixture of Experts....500
Better Evaluation....501
More Robust Benchmarks....501
Automatic Evaluation....501
Democratization....501
Smaller, More Efficient Models....501
Open-Source Progress....501
Better Tools and Infrastructure....502
Theoretical Understanding....502
Why Does It Work?....502
Controllability....502
Training Neural Networks with CUDA Kernels and Modern Frameworks....502
Understanding CUDA and GPU Computing....503
What Is CUDA?....503
Why GPUs for Deep Learning?....503
The CPU vs. GPU Paradigm....503
CUDA Kernels Explained....504
What Is a CUDA Kernel?....504
Thread Hierarchy....504
Kernel Syntax....504
Memory Hierarchy....505
CUDA Kernels in Neural Network Training....505
Where Kernels Are Used....505
The Training Loop at Kernel Level....506
Why Custom Kernels Matter....507
Practical CUDA Kernel Examples....507
Example 1: Element-Wise ReLU Activation....507
Example 2: Matrix Multiplication (Naive Implementation)....509
Example 3: Optimized Matrix Multiplication with Shared Memory....510
Example 4: Softmax with Numerical Stability....511
Example 5: Custom Fused Kernel: LayerNorm + GELU....513
Integration with Deep Learning Frameworks....516
PyTorch Custom CUDA Extensions....516
Triton: High-Level GPU Programming....518
Performance Optimization Strategies....521
Memory Coalescing....521
Occupancy Optimization....521
Kernel Fusion....521
Asynchronous Operations....522
Using Tensor Cores....522
Real-World Training Performance....523
Profiling and Bottleneck Identification....523
FlashAttention Example....524
Training at Scale....524
Practical Tips for Working with CUDA Kernels....525
When to Write Custom Kernels....525
Development Workflow....525
Debugging CUDA Kernels....526
Common Pitfalls....526
Appendix: Glossary of Terms....528
Index....530
This book is a complete, hands-on guide to designing, training, and deploying your own Large Language Models (LLMs)—from the foundations of tokenization to the advanced stages of fine-tuning and reinforcement learning. Written for developers, data scientists, and AI practitioners, it bridges core principles and state-of-the-art techniques, offering a rare, transparent look at how modern transformers truly work beneath the surface.
Starting from the essentials, you’ll learn how to set up your environment with Python and PyTorch, manage datasets, and implement critical fundamentals such as tensors, embeddings, and gradient descent. You’ll then progress through the architectural heart of modern models, covering RMS normalization, rotary positional embeddings (RoPE), scaled dot-product attention, Grouped Query Attention (GQA), Mixture of Experts (MoE), and SwiGLU activations, each explored in depth and built step by step in code. As you advance, the book introduces custom CUDA kernel integration, teaching you how to optimize key components for speed and memory efficiency at the GPU level, an essential skill for scaling real-world LLMs. You’ll also gain mastery over the phases of training that define today’s leading models: pretraining, mid-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF).
The final chapters guide you through dataset preparation, filtering, deduplication, and training optimization, culminating in model evaluation and real-world prompting with a custom TokenGenerator for text generation and inference.
By the end of this book, you’ll have the knowledge and confidence to architect, train, and deploy your own transformer-based models, equipped with both the theoretical depth and practical expertise to innovate in the rapidly evolving world of AI.
Software developers, data scientists, machine learning engineers, and AI enthusiasts looking to build their own models from scratch.