Table of Contents....5
About the Author....15
About the Technical Reviewer....16
Acknowledgments....17
Introduction....18
Chapter 1: Introduction to Reinforcement Learning....23
Reinforcement Learning....23
Machine Learning Branches....25
Supervised Learning....26
Unsupervised Learning....27
Reinforcement Learning....28
Emerging Sub-branches....29
Self-Supervised Learning....29
Generative AI....31
Generative AI vs Other Learning Paradigms....33
Core Elements of RL....34
Deep Learning with Reinforcement Learning....35
Examples and Case Studies....36
Autonomous Vehicles....36
Robots....37
Recommendation Systems....37
Finance and Trading....37
Healthcare....38
Large Language Models and Generative AI....38
Game Playing....38
Libraries and Environment Setup....39
Local Install (Recommended for a Local Option)....39
Local Install with VS Code....45
Running on Google Colab (Recommended for a Cloud Option)....48
Running on Kaggle....52
Using devcontainer-Based Environments....52
Running devcontainer Locally....53
Running on GitHub Codespaces....56
Running on AWS Studio Lab....60
Running Using Lightning.ai....61
Other Options to Run Code....63
Summary....63
Chapter 2: The Foundation: Markov Decision Processes....65
Definition of Reinforcement Learning....65
Agent and Environment....71
Rewards....73
Markov Processes....76
Markov Chains....77
Markov Reward Processes....81
Markov Decision Processes....83
Policies and Value Functions....86
Bellman Equations....89
Optimality Bellman Equations....92
Train Your First Agent....96
First Agent....96
Walkthrough of Common Libraries Used....100
Environments: Gymnasium and OpenAI Gym....100
Stable Baselines3 (SB3)....101
RL Baselines3 Zoo....102
Hugging Face....102
Second Agent....103
RL Baselines3 Zoo....105
Solution Approaches with a Mind Map....106
Summary....108
Chapter 3: Model-Based Approaches....110
Grid World Environment....111
Dynamic Programming....114
Policy Evaluation/Prediction....117
Policy Improvement and Iterations....123
Value Iteration....129
Generalized Policy Iteration....132
Asynchronous Backups....135
Summary....138
Chapter 4: Model-Free Approaches....139
Estimation/Prediction with Monte Carlo....140
Bias and Variance of MC Prediction Methods....149
Control with Monte Carlo....152
Off-Policy MC Control....157
Importance Sampling....158
Temporal Difference Learning Methods....163
Temporal Difference Control....165
Cliff Walking....166
Taxi....167
Cart Pole....169
On-Policy SARSA....169
Q-Learning: An Off-Policy TD Control....176
Maximization Bias and Double Learning....181
Expected SARSA Control....182
Replay Buffer and Off-Policy Learning....185
Q-Learning for Continuous State Spaces....189
n-Step Returns....191
Eligibility Traces and TD(λ)....193
Relationships Between DP, MC, and TD....195
Summary....195
Chapter 5: Function Approximation and Deep Learning....197
Introduction....198
Theory of Approximation....200
Coarse Coding....202
Tile Encoding....204
Challenges in Approximation....205
Incremental Prediction: MC, TD, TD(λ)....207
Incremental Control....212
Semi-gradient n-step SARSA Control....213
Semi-gradient SARSA(λ) Control....220
Convergence in Function Approximation....224
Gradient Temporal Difference Learning....226
Batch Methods (DQN)....227
Linear Least Squares Method....230
Deep Learning Libraries....232
PyTorch....232
What Are Neural Networks?....232
Training with Back-Propagation....236
PyTorch Lightning....238
TensorFlow....241
Summary....243
Chapter 6: Deep Q-Learning (DQN)....245
Deep Q Networks....246
OpenAI Gym vs Farama Gymnasium....259
Recording Videos of Trained Agents....261
End-to-End Training with SB3....263
End to End Training with SB3 Zoo....264
Hyperparameter Optimization....266
Integration with the Rliable Library....271
Atari Game-Playing Agent Using DQN....273
Atari Environment in Gymnasium....273
Preprocessing and Training....276
Overview of Various RL Environments and Libraries....282
PyGame....283
MuJoCo....283
Unity ML Agents....285
PettingZoo....286
Bullet Physics Engine and Related Environments....286
CleanRL....287
MineRL....289
FinRL....289
FlappyBird Environment....290
Summary....291
Chapter 7: Improvements to DQN....292
Prioritized Replay....292
Double DQN (DDQN)....298
Dueling DQN....302
NoisyNets DQN....307
Categorical 51-Atom DQN (C51)....316
Quantile Regression DQN....319
Hindsight Experience Replay....321
Summary....326
Chapter 8: Policy Gradient Algorithms....328
Introduction....328
Pros and Cons of Policy-Based Methods....329
Policy Representation....332
Discrete Cases....333
Continuous Cases....333
Policy Gradient Derivation....334
Objective Function....334
Derivative Update Rule....336
Intuition Behind the Update Rule....339
The REINFORCE Algorithm....341
Variance Reduction with Rewards-to-Go....345
Further Variance Reduction with Baselines....355
Actor-Critic Methods....359
Defining Advantage....360
Advantage Actor-Critic (A2C)....361
Implementation of the A2C Algorithm....365
Asynchronous Advantage Actor-Critic....369
Trust Region Policy Optimization Algorithm....371
Proximal Policy Optimization Algorithm (PPO)....375
Curiosity-Driven Learning....378
Summary....382
Chapter 9: Combining Policy Gradient and Q-Learning....383
Tradeoffs in Policy Gradient and Q-Learning....384
General Framework to Combine Policy Gradient with Q-Learning....387
Deep Deterministic Policy Gradient....388
Q-Learning in DDPG (Critic)....390
Policy Learning in DDPG (Actor)....391
Pseudocode and Implementation....392
Gymnasium Environments Used in Code....393
Code Listing....394
Policy Network Actor....394
Q-Network Critic Implementation....395
Combined Model-Actor-Critic Implementation....396
Experience Replay....397
Q-Loss Implementation....397
Policy Loss Implementation....399
One-Step Update Implementation....399
DDPG: Main Loop....401
Twin Delayed DDPG....404
Target-Policy Smoothing....405
Q-Loss (Critic)....405
Policy Loss (Actor)....406
Delayed Update....406
Pseudocode and Implementation....406
Code Implementation....408
Combined Model-Actor-Critic Implementation....408
Q-Loss Implementation....409
Policy-Loss Implementation....410
One-Step Update Implementation....410
TD3 Main Loop....412
Reparameterization Trick....413
Score/Reinforce Way....413
Reparameterization Trick and Pathwise Derivatives....414
Experiment....416
Entropy Explained....422
Soft Actor-Critic....423
SAC vs. TD3....424
Q-Loss with Entropy-Regularization....425
Policy Loss with the Reparameterization Trick....426
Pseudocode and Implementation....427
Policy Network-Actor Implementation....429
Q-Network, Combined Model, and Experience Replay....430
Q-Loss and Policy-Loss Implementation....431
One-Step Update and SAC Main Loop....431
Summary....432
Chapter 10: Integrated Planning and Learning....433
Model-Based Reinforcement Learning....434
Planning with a Learned Model....438
Integrating Learning and Planning (Dyna)....439
Dyna Q and Changing Environments....445
Dyna Q+....446
Expected vs. Sample Updates....447
Exploration vs. Exploitation....451
Multi-Armed Bandit....452
Regret: Measure the Quality of Exploration....453
Epsilon Greedy Exploration....455
Upper Confidence Bound Exploration....457
Thompson Sampling Exploration....458
Comparing Different Exploration Strategies....459
Planning at Decision Time and Monte Carlo Tree Search....461
Example Uses of MCTS....469
AlphaGo....469
AlphaGo Zero and AlphaZero....472
AlphaFold with MCTS....475
Use of MCTS in Other Domains....475
Summary....476
Chapter 11: Proximal Policy Optimization (PPO) and RLHF....478
Theoretical Foundations of PPO....481
Score Function and MLE Estimator....482
Fisher Information Matrix (FIM) and Hessian....485
Natural Gradient Method....486
Trust Region Policy Optimization (TRPO)....490
PPO Deep Dive....491
PPO CLIP Objective....491
Advantage Calculation....493
Value and Entropy Loss Objectives....494
Implementation Details of PPO....494
1. Vectorized Environment....495
2. Parameter Initialization....496
3. Adam Optimizer’s Epsilon Parameter....496
4. Adam Learning Rate Annealing....497
5. Generalized Advantage Estimation....497
6. Mini-Batch Updates....497
7. Normalization of Advantages....498
8. Clipped Surrogate Objective....498
9. Value Function Loss Clipping....499
10. Overall Loss and Entropy Bonus....499
11. Global Gradient Clipping....499
12. Debug Variables....499
13. Shared and Separate MLP Networks for Policy and Value Functions....500
Running CleanRL PPO....501
Asynchronous PPO....501
Large Language Models....504
Prompt Engineering....509
Prompting Techniques....510
RAG and Chatbots....513
LLMs as Operating Systems....516
Fine-Tuning....516
Parameter Efficient Fine-Tuning (PEFT)....518
Chaining LLMs Together....522
Auto Agents....524
Multimodal Generative AI....526
RL with Human Feedback....527
Latest Advances in LLM Alignment....530
Libraries and Frameworks for RLHF....531
VertexAI from Google....532
SageMaker from AWS Using Trlx....532
TRL Library from HuggingFace....532
Walkthrough of RLHF Tuning....533
Summary....539
Chapter 12: Multi-Agent RL (MARL)....540
Key Challenges in MARL....543
MARL Taxonomy....545
Communication Between Agents....548
Mapping with Game Theory....548
Solutions in MARL....549
MARL and Core Algorithms....552
Value Iteration....552
TD Approach with Joint Action Learning....553
Minimax Q-Learning....555
Nash Q-Learning....555
Correlated Q-Learning....555
Assumptions on Agents....555
Policy-Based Learning....556
No-Regret Learning....558
Deep MARL....560
PettingZoo Library....562
Sample Training....565
Summary....568
Chapter 13: Additional Topics and Recent Advances....569
Other Interesting RL Environments....570
MineRL....570
Donkey Car RL....571
FinRL....573
StarCraft II: PySC2....579
Godot RL Agents....580
Model-Based RL: Additional Approaches....581
World Models....581
Imagination-Augmented Agents (I2A)....585
Model-Based RL with Model-Free Fine-Tuning (MBMF)....590
Model-Based Value Expansion (MBVE)....593
IRIS: Transformers as World Models....596
Causal World Models....599
Offline RL....600
Decision Transformers....605
Automatic Curriculum Learning....610
Imitation Learning and Inverse Reinforcement Learning....612
Derivative-Free Methods....617
Transfer Learning and Multitask Learning....621
Meta-Learning....626
Unsupervised Zero-Shot Reinforcement Learning....627
REINFORCE Learning from Human Feedback in LLMs....629
How to Continue Studying....630
Summary....631
Index....633
Gain a theoretical understanding of the most popular libraries in deep reinforcement learning (deep RL). This new edition focuses on the latest advances in deep RL using a learn-by-coding approach, allowing readers to assimilate and replicate the latest research in this field.
New agent environments, ranging from games and robotics to finance, are explained to help you try different ways to apply reinforcement learning. A chapter on multi-agent reinforcement learning covers how multiple agents compete, while another chapter focuses on the widely used deep RL algorithm, proximal policy optimization (PPO). You'll see how reinforcement learning with human feedback (RLHF) is used by chatbots built on large language models, such as ChatGPT, to improve their conversational capabilities.
You'll also review the steps for running the code on multiple cloud systems and deploying models on platforms such as the Hugging Face Hub. The code is provided as Jupyter Notebooks, which can be run on Google Colab and other similar deep learning cloud platforms, allowing you to tailor the code to your own needs.
Whether it’s for applications in gaming, robotics, or Generative AI, Deep Reinforcement Learning with Python will help keep you ahead of the curve.
Software engineers and machine learning developers eager to sharpen their understanding of deep RL and acquire practical skills in implementing RL algorithms from scratch.