Table of Contents....5
About the Author....15
About the Technical Reviewer....16
Acknowledgments....17
Introduction....18
Chapter 1: Introduction to Reinforcement Learning....23
Reinforcement Learning....23
Machine Learning Branches....25
Supervised Learning....26
Unsupervised Learning....27
Reinforcement Learning....28
Emerging Sub-branches....29
Self-Supervised Learning....29
Generative AI....31
Generative AI vs Other Learning Paradigms....33
Core Elements of RL....34
Deep Learning with Reinforcement Learning....35
Examples and Case Studies....36
Autonomous Vehicles....36
Robots....37
Recommendation Systems....37
Finance and Trading....37
Healthcare....38
Large Language Models and Generative AI....38
Game Playing....38
Libraries and Environment Setup....39
Local Install (Recommended for a Local Option)....39
Local Install with VS Code....45
Running on Google Colab (Recommended for a Cloud Option)....48
Running on Kaggle....52
Using devcontainer-Based Environments....52
Running devcontainer Locally....53
Running on GitHub Codespaces....56
Running on AWS Studio Lab....60
Running Using Lightning.ai....61
Other Options to Run Code....63
Summary....63
Chapter 2: The Foundation: Markov Decision Processes....65
Definition of Reinforcement Learning....65
Agent and Environment....71
Rewards....73
Markov Processes....76
Markov Chains....77
Markov Reward Processes....81
Markov Decision Processes....83
Policies and Value Functions....86
Bellman Equations....89
Optimality Bellman Equations....92
Train Your First Agent....96
First Agent....96
Walkthrough of Common Libraries Used....100
Environments: Gymnasium and OpenAI Gym....100
Stable Baselines3 (SB3)....101
RL Baselines3 Zoo....102
Hugging Face....102
Second Agent....103
RL Baselines3 Zoo....105
Solution Approaches with a Mind Map....106
Summary....108
Chapter 3: Model-Based Approaches....110
Grid World Environment....111
Dynamic Programming....114
Policy Evaluation/Prediction....117
Policy Improvement and Iterations....123
Value Iteration....129
Generalized Policy Iteration....132
Asynchronous Backups....135
Summary....138
Chapter 4: Model-Free Approaches....139
Estimation/Prediction with Monte Carlo....140
Bias and Variance of MC Prediction Methods....149
Control with Monte Carlo....152
Off-Policy MC Control....157
Importance Sampling....158
Temporal Difference Learning Methods....163
Temporal Difference Control....165
Cliff Walking....166
Taxi....167
Cart Pole....169
On-Policy SARSA....169
Q-Learning: An Off-Policy TD Control....176
Maximization Bias and Double Learning....181
Expected SARSA Control....182
Replay Buffer and Off-Policy Learning....185
Q-Learning for Continuous State Spaces....189
n-Step Returns....191
Eligibility Traces and TD(λ)....193
Relationships Between DP, MC, and TD....195
Summary....195
Chapter 5: Function Approximation and Deep Learning....197
Introduction....198
Theory of Approximation....200
Coarse Coding....202
Tile Encoding....204
Challenges in Approximation....205
Incremental Prediction: MC, TD, TD(λ)....207
Incremental Control....212
Semi-gradient n-step SARSA Control....213
Semi-gradient SARSA(λ) Control....220
Convergence in Function Approximation....224
Gradient Temporal Difference Learning....226
Batch Methods (DQN)....227
Linear Least Squares Method....230
Deep Learning Libraries....232
PyTorch....232
What Are Neural Networks?....232
Training with Back-Propagation....236
PyTorch Lightning....238
TensorFlow....241
Summary....243
Chapter 6: Deep Q-Learning (DQN)....245
Deep Q Networks....246
OpenAI Gym vs Farama Gymnasium....259
Recording Videos of Trained Agents....261
End-to-End Training with SB3....263
End to End Training with SB3 Zoo....264
Hyperparameter Optimization....266
Integration with the Rliable Library....271
Atari Game-Playing Agent Using DQN....273
Atari Environment in Gymnasium....273
Preprocessing and Training....276
Overview of Various RL Environments and Libraries....282
PyGame....283
MuJoCo....283
Unity ML Agents....285
PettingZoo....286
Bullet Physics Engine and Related Environments....286
CleanRL....287
MineRL....289
FinRL....289
FlappyBird Environment....290
Summary....291
Chapter 7: Improvements to DQN....292
Prioritized Replay....292
Double DQN (DDQN)....298
Dueling DQN....302
NoisyNets DQN....307
Categorical 51-Atom DQN (C51)....316
Quantile Regression DQN....319
Hindsight Experience Replay....321
Summary....326
Chapter 8: Policy Gradient Algorithms....328
Introduction....328
Pros and Cons of Policy-Based Methods....329
Policy Representation....332
Discrete Cases....333
Continuous Cases....333
Policy Gradient Derivation....334
Objective Function....334
Derivative Update Rule....336
Intuition Behind the Update Rule....339
The REINFORCE Algorithm....341
Variance Reduction with Rewards-to-Go....345
Further Variance Reduction with Baselines....355
Actor-Critic Methods....359
Defining Advantage....360
Advantage Actor-Critic (A2C)....361
Implementation of the A2C Algorithm....365
Asynchronous Advantage Actor-Critic....369
Trust Region Policy Optimization Algorithm....371
Proximal Policy Optimization Algorithm (PPO)....375
Curiosity-Driven Learning....378
Summary....382
Chapter 9: Combining Policy Gradient and Q-Learning....383
Tradeoffs in Policy Gradient and Q-Learning....384
General Framework to Combine Policy Gradient with Q-Learning....387
Deep Deterministic Policy Gradient....388
Q-Learning in DDPG (Critic)....390
Policy Learning in DDPG (Actor)....391
Pseudocode and Implementation....392
Gymnasium Environments Used in Code....393
Code Listing....394
Policy Network Actor....394
Q-Network Critic Implementation....395
Combined Model-Actor-Critic Implementation....396
Experience Replay....397
Q-Loss Implementation....397
Policy Loss Implementation....399
One-Step Update Implementation....399
DDPG: Main Loop....401
Twin Delayed DDPG....404
Target-Policy Smoothing....405
Q-Loss (Critic)....405
Policy Loss (Actor)....406
Delayed Update....406
Pseudocode and Implementation....406
Code Implementation....408
Combined Model-Actor-Critic Implementation....408
Q-Loss Implementation....409
Policy-Loss Implementation....410
One-Step Update Implementation....410
TD3 Main Loop....412
Reparameterization Trick....413
Score/Reinforce Way....413
Reparameterization Trick and Pathwise Derivatives....414
Experiment....416
Entropy Explained....422
Soft Actor-Critic....423
SAC vs. TD3....424
Q-Loss with Entropy-Regularization....425
Policy Loss with the Reparameterization Trick....426
Pseudocode and Implementation....427
Policy Network-Actor Implementation....429
Q-Network, Combined Model, and Experience Replay....430
Q-Loss and Policy-Loss Implementation....431
One-Step Update and SAC Main Loop....431
Summary....432
Chapter 10: Integrated Planning and Learning....433
Model-Based Reinforcement Learning....434
Planning with a Learned Model....438
Integrating Learning and Planning (Dyna)....439
Dyna Q and Changing Environments....445
Dyna Q+....446
Expected vs. Sample Updates....447
Exploration vs. Exploitation....451
Multi-Armed Bandit....452
Regret: Measure the Quality of Exploration....453
Epsilon Greedy Exploration....455
Upper Confidence Bound Exploration....457
Thompson Sampling Exploration....458
Comparing Different Exploration Strategies....459
Planning at Decision Time and Monte Carlo Tree Search....461
Example Uses of MCTS....469
AlphaGo....469
AlphaGo Zero and AlphaZero....472
AlphaFold with MCTS....475
Use of MCTS in Other Domains....475
Summary....476
Chapter 11: Proximal Policy Optimization (PPO) and RLHF....478
Theoretical Foundations of PPO....481
Score Function and MLE Estimator....482
Fisher Information Matrix (FIM) and Hessian....485
Natural Gradient Method....486
Trust Region Policy Optimization (TRPO)....490
PPO Deep Dive....491
PPO CLIP Objective....491
Advantage Calculation....493
Value and Entropy Loss Objectives....494
Implementation Details of PPO....494
1. Vectorized Environment....495
2. Parameter Initialization....496
3. Adam Optimizer’s Epsilon Parameter....496
4. Adam Learning Rate Annealing....497
5. Generalized Advantage Estimation....497
6. Mini-Batch Updates....497
7. Normalization of Advantages....498
8. Clipped Surrogate Objective....498
9. Value Function Loss Clipping....499
10. Overall Loss and Entropy Bonus....499
11. Global Gradient Clipping....499
12. Debug Variables....499
13. Shared and Separate MLP Networks for Policy and Value Functions....500
Running CleanRL PPO....501
Asynchronous PPO....501
Large Language Models....504
Prompt Engineering....509
Prompting Techniques....510
RAG and Chatbots....513
LLMs as Operating Systems....516
Fine-Tuning....516
Parameter Efficient Fine-Tuning (PEFT)....518
Chaining LLMs Together....522
Auto Agents....524
Multimodal Generative AI....526
RL with Human Feedback....527
Latest Advances in LLM Alignment....530
Libraries and Frameworks for RLHF....531
VertexAI from Google....532
SageMaker from AWS Using Trlx....532
TRL Library from HuggingFace....532
Walkthrough of RLHF Tuning....533
Summary....539
Chapter 12: Multi-Agent RL (MARL)....540
Key Challenges in MARL....543
MARL Taxonomy....545
Communication Between Agents....548
Mapping with Game Theory....548
Solutions in MARL....549
MARL and Core Algorithms....552
Value Iteration....552
TD Approach with Joint Action Learning....553
Minimax Q-Learning....555
Nash Q-Learning....555
Correlated Q-Learning....555
Assumptions on Agents....555
Policy-Based Learning....556
No-Regret Learning....558
Deep MARL....560
PettingZoo Library....562
Sample Training....565
Summary....568
Chapter 13: Additional Topics and Recent Advances....569
Other Interesting RL Environments....570
MineRL....570
Donkey Car RL....571
FinRL....573
StarCraft II: PySC2....579
Godot RL Agents....580
Model-Based RL: Additional Approaches....581
World Models....581
Imagination-Augmented Agents (I2A)....585
Model-Based RL with Model-Free Fine-Tuning (MBMF)....590
Model-Based Value Expansion (MBVE)....593
IRIS: Transformers as World Models....596
Causal World Models....599
Offline RL....600
Decision Transformers....605
Automatic Curriculum Learning....610
Imitation Learning and Inverse Reinforcement Learning....612
Derivative-Free Methods....617
Transfer Learning and Multitask Learning....621
Meta-Learning....626
Unsupervised Zero-Shot Reinforcement Learning....627
REINFORCE Learning from Human Feedback in LLMs....629
How to Continue Studying....630
Summary....631
Index....633
Gain a theoretical understanding of the most popular libraries in deep reinforcement learning (deep RL). This new edition focuses on the latest advances in deep RL using a learn-by-coding approach, allowing readers to assimilate and replicate the latest research in this field.
New agent environments, ranging from games and robotics to finance, are explained to help you try different ways to apply reinforcement learning. A chapter on multi-agent reinforcement learning covers how multiple agents compete, while another chapter focuses on the widely used deep RL algorithm, proximal policy optimization (PPO). You'll see how reinforcement learning with human feedback (RLHF) is used by chatbots built on large language models, such as ChatGPT, to improve their conversational capabilities.
You'll also review the steps for running the code on multiple cloud systems and deploying models on platforms such as the Hugging Face Hub. The code is provided as Jupyter Notebooks, which can be run on Google Colab and other similar deep learning cloud platforms, allowing you to tailor the code to your own needs.
Whether it’s for applications in gaming, robotics, or Generative AI, Deep Reinforcement Learning with Python will help keep you ahead of the curve.
Software engineers and machine learning developers eager to sharpen their understanding of deep RL and acquire practical skills in implementing RL algorithms from scratch.