Preface....5
Synopsis....5
Features....6
Errata, Codes, and Exercise Answers....6
Acknowledgements....7
Contents....8
Notations....17
Introduction to Reinforcement Learning (RL)....21
What is RL?....21
Applications of RL....23
Agent–Environment Interface....25
Taxonomy of RL....27
Task-based Taxonomy....27
Algorithm-based Taxonomy....30
Performance Metrics....31
Case Study: Agent–Environment Interface in Gym....32
Install Gym....33
Use Gym....34
Example: MountainCar....36
Summary....40
Exercises....41
Multiple Choices....41
Programming....42
Mock Interview....42
MDP: Markov Decision Process....43
MDP Model....44
DTMDP: Discrete-Time MDP....44
Environment and Dynamics....48
Policy....52
Discounted Return....55
Value....57
Definition of Value....57
Properties of Value....58
Calculate Value....62
Calculate Initial Expected Returns using Values....67
Partial Order of Policy and Policy Improvement....67
Visitation Frequency....70
Definition of Visitation Frequency....71
Properties of Visitation Frequency....73
Calculate Visitation Frequency....76
Equivalence between Visitation Frequency and Policy....78
Expectation over Visitation Frequency....79
Optimal Policy and Optimal Value....81
From Optimal Policy to Optimal Value....81
Existence and Uniqueness of Optimal Policy....82
Properties of Optimal Values....83
Calculate Optimal Values....86
Use Optimal Values to Find Optimal Strategy....91
Case Study: CliffWalking....92
Use Environment....93
Policy Evaluation....93
Solve Optimal Values....95
Solve Optimal Policy....96
Summary....96
Exercises....99
Multiple Choices....99
Programming....100
Mock Interview....100
Model-Based Numerical Iteration....101
Bellman Operators and Their Properties....101
Model-Based Policy Iteration....107
Policy Evaluation....108
Policy Iteration....111
VI: Value Iteration....112
Bootstrapping and Dynamic Programming....114
Case Study: FrozenLake....116
Use Environment....117
Use Model-Based Policy Iteration....119
Use VI....121
Summary....122
Exercises....123
Multiple Choices....123
Programming....123
Mock Interview....123
MC: Monte Carlo Learning....125
On-Policy MC Learning....126
On-Policy MC Policy Evaluation....126
MC Learning with Exploration Start....132
MC Learning on Soft Policy....135
Off-Policy MC Learning....138
Importance Sampling....138
Off-Policy MC Policy Evaluation....141
Off-Policy MC Policy Optimization....142
Case Study: Blackjack....143
Use Environment....144
On-Policy Policy Evaluation....146
On-Policy Policy Optimization....147
Off-Policy Policy Evaluation....151
Off-Policy Policy Optimization....151
Summary....152
Exercises....153
Multiple Choices....153
Programming....154
Mock Interview....154
TD: Temporal Difference Learning....155
TD Return....156
On-Policy TD Learning....158
TD Policy Evaluation....158
SARSA....164
Expected SARSA....167
Off-Policy TD Learning....169
Off-Policy Algorithm based on Importance Sampling....169
Q Learning....171
Double Q Learning....173
Eligibility Trace....175
Return....175
TD(λ)....177
Case Study: Taxi....180
Use Environment....180
On-Policy TD....182
Off-Policy TD....185
Eligibility Trace....186
Summary....188
Exercises....189
Multiple Choices....189
Programming....189
Mock Interview....190
Function Approximation....191
Basics of Function Approximation....192
Parameter Update using Gradient....195
SGD: Stochastic Gradient Descent....195
Semi-Gradient Descent....198
Semi-Gradient Descent with Eligibility Trace....200
Convergence of Function Approximation....202
Condition of Convergence....202
Baird's Counterexample....203
DQN: Deep Q Network....206
Experience Replay....207
Deep Q Learning with Target Network....210
Double DQN....212
Dueling DQN....213
Case Study: MountainCar....214
Use Environment....215
Linear Approximation....216
DQN and its Variants....221
Summary....230
Exercises....230
Multiple Choices....230
Programming....231
Mock Interview....231
PG: Policy Gradient....232
Theory of PG....232
Function Approximation for Policy....233
PG Theorem....234
Relationship between PG and Maximum Likelihood Estimate....238
On-Policy PG....239
VPG: Vanilla Policy Gradient....239
PG with Baseline....240
Off-Policy PG....242
Case Study: CartPole....243
On-Policy PG....244
Off-Policy PG....249
Summary....254
Exercises....254
Multiple Choices....254
Programming....255
Mock Interview....255
AC: Actor–Critic....256
Intuition of AC....256
On-Policy AC....257
Action-Value AC....257
Advantage AC....258
Eligibility Trace AC....260
On-Policy AC with Surrogate Objective....261
Performance Difference Lemma....261
Surrogate Advantage....262
PPO: Proximal Policy Optimization....264
Natural PG and Trust Region Algorithm....266
Kullback–Leibler Divergence and Fisher Information Matrix....267
Trust Region of Surrogate Objective....270
NPG: Natural Policy Gradient....271
TRPO: Trust Region Policy Optimization....275
Importance Sampling Off-Policy AC....276
Case Study: Acrobot....277
On-Policy AC....279
On-Policy AC with Surrogate Objective....287
NPG and TRPO....291
Importance Sampling Off-Policy AC....302
Summary....304
Exercises....305
Multiple Choices....305
Programming....306
Mock Interview....306
DPG: Deterministic Policy Gradient....307
DPG Theorem....307
On-Policy DPG....310
Off-Policy DPG....311
OPDAC: Off-Policy Deterministic Actor–Critic....311
DDPG: Deep Deterministic Policy Gradient....313
TD3: Twin Delayed Deep Deterministic Policy Gradient....314
Exploration Process....316
Case Study: Pendulum....317
DDPG....319
TD3....323
Summary....327
Exercises....328
Multiple Choices....328
Programming....328
Mock Interview....328
Maximum-Entropy RL....330
Maximum-Entropy RL and Soft RL....330
Reward Engineering and Reward with Entropy....330
Soft Values....332
Soft Policy Improvement Theorem and Numerical Iteration Algorithm....334
Optimal Values....337
Soft Policy Gradient Theorem....338
Soft RL Algorithms....342
SQL: Soft Q Learning....342
SAC: Soft Actor–Critic....344
Automatic Entropy Adjustment....347
Case Study: Lunar Lander....349
Install Environment....350
Use Environment....350
Use SQL to Solve LunarLander....352
Use SAC to Solve LunarLander....355
Use Automatic Entropy Adjustment to Solve LunarLander....359
Solve LunarLanderContinuous....364
Summary....369
Exercises....370
Multiple Choices....370
Programming....370
Mock Interview....370
Policy-Based Gradient-Free Algorithms....372
Gradient-Free Algorithms....372
ES: Evolution Strategy....372
ARS: Augmented Random Search....374
Compare Gradient-Free Algorithms and Policy Gradient Algorithms....375
Case Study: BipedalWalker....376
Reward Shaping and Reward Clipping....378
ES....379
ARS....380
Summary....381
Exercises....382
Multiple Choices....382
Programming....383
Mock Interview....383
Distributional RL....384
Value Distribution and its Properties....384
Maximum Utility RL....388
Probability-Based Algorithm....391
C51: Categorical DQN....392
Categorical DQN with Utility....395
Quantile-Based RL....397
QR-DQN: Quantile Regression Deep Q Network....398
IQN: Implicit Quantile Networks....401
QR Algorithms with Utility....403
Compare Categorical DQN and QR Algorithms....405
Case Study: Atari Game Pong....406
Atari Game Environment....406
The Game Pong....408
Wrapper Class of Atari Environment....410
Use Categorical DQN to Solve Pong....410
Use QR-DQN to Solve Pong....415
Use IQN to Solve Pong....419
Summary....425
Exercises....425
Multiple Choices....425
Programming....426
Mock Interview....426
Minimize Regret....427
Regret....427
MAB: Multi-Armed Bandit....429
MAB Problem....429
ε-Greedy Algorithm....430
UCB: Upper Confidence Bound....431
Bayesian UCB....436
Thompson Sampling....438
UCBVI: Upper Confidence Bound Value Iteration....439
Case Study: Bernoulli-Reward MAB....441
Create Custom Environment....441
ε-Greedy Solver....442
UCB1 Solver....444
Bayesian UCB Solver....444
Thompson Sampling Solver....445
Summary....446
Exercises....447
Multiple Choices....447
Programming....447
Mock Interview....448
Tree Search....449
MCTS: Monte Carlo Tree Search....450
Select....452
Expand and Evaluate....454
Backup....455
Decide....456
Train Networks in MCTS....456
Application in Board Games....459
Board Games....460
Self-Play....465
Neural Networks for Board Games....467
From AlphaGo to MuZero....469
Case Study: Tic-Tac-Toe....472
boardgame2: Board Game Environment....472
Exhaustive Search....477
Heuristic Search....479
Summary....486
Exercises....487
Multiple Choices....487
Programming....488
Mock Interview....488
More Agent–Environment Interfaces....489
Average Reward DTMDP....490
Average Reward....490
Differential Values....494
Optimal Policy....498
CTMDP: Continuous-Time MDP....502
Non-Homogeneous MDP....506
Representation of Non-Stationary States....506
Bounded Time Index....507
Unbounded Time Index....508
SMDP: Semi-MDP....510
SMDP and its Values....510
Find Optimal Policy....513
HRL: Hierarchical Reinforcement Learning....514
POMDP: Partially Observable Markov Decision Process....515
DTPOMDP: Discrete-Time POMDP....515
Belief....516
Belief MDP....521
Belief Values....524
Belief Values for Finite POMDP....527
Use Memory....530
Case Study: Tiger....531
Compare Discounted Return Expectation and Average Reward....531
Belief MDP....533
Non-Stationary Belief State Values....534
Summary....536
Exercises....538
Multiple Choices....538
Programming....539
Mock Interview....539
Learn from Feedback and Imitation Learning....540
Learn from Feedback....540
Reward Model....541
PbRL: Preference-based RL....542
RLHF: Reinforcement Learning from Human Feedback....543
IL: Imitation Learning....546
f-Divergences and their Properties....547
BC: Behavior Cloning....554
GAIL: Generative Adversarial Imitation Learning....556
Application in Training GPT....559
Case Study: Humanoid....560
Use PyBullet....561
Use BC for Imitation Learning....564
Use GAIL for Imitation Learning....566
Summary....572
Exercises....573
Multiple Choices....573
Programming....574
Mock Interview....574
Reinforcement Learning: Theory and Python Implementation is a tutorial book on reinforcement learning that explains both theory and applications. Starting from a unified mathematical framework, the book systematically derives the theory of modern reinforcement learning and introduces all mainstream reinforcement learning algorithms, such as PPO, SAC, and MuZero. It also covers key technologies of GPT training, such as RLHF, IRL, and PbRL. Every chapter is accompanied by high-quality implementations, and all deep reinforcement learning algorithms are implemented in both TensorFlow and PyTorch. The code, along with its results, is available on GitHub and runs on a conventional laptop under Windows, macOS, or Linux.
This book is intended for readers who want to learn reinforcement learning systematically and apply it to practical problems. It is also ideal for academic researchers seeking a theoretical foundation or algorithmic enhancements for their cutting-edge AI research.