Deep Learning with PyTorch Step-by-Step: A Beginner’s Guide....2
Table of Contents....5
Preface....23
Acknowledgements....24
About the Author....25
Frequently Asked Questions (FAQ)....26
Why PyTorch?....26
Why This Book?....27
Who Should Read This Book?....28
What Do I Need to Know?....29
How to Read This Book....29
What’s Next?....32
Setup Guide....33
Official Repository....33
Environment....33
Google Colab....33
Binder....34
Local Installation....35
1. Anaconda....36
2. Conda (Virtual) Environments....36
3. PyTorch....38
4. TensorBoard....40
5. GraphViz and Torchviz (optional)....41
6. Git....42
7. Jupyter....44
Moving On....44
Part I: Fundamentals....46
Chapter 0: Visualizing Gradient Descent....47
Spoilers....47
Jupyter Notebook....47
Imports....48
Visualizing Gradient Descent....48
Model....49
Data Generation....50
Synthetic Data Generation....50
Train-Validation-Test Split....52
Step 0 - Random Initialization....53
Step 1 - Compute Model’s Predictions....54
Step 2 - Compute the Loss....55
Loss Surface....57
Cross-Sections....61
Step 3 - Compute the Gradients....62
Visualizing Gradients....64
Backpropagation....65
Step 4 - Update the Parameters....66
Learning Rate....68
Low Learning Rate....69
High Learning Rate....71
Very High Learning Rate....72
"Bad" Feature....73
Scaling / Standardizing / Normalizing....76
Step 5 - Rinse and Repeat!....80
The Path of Gradient Descent....81
Recap....83
Chapter 1: A Simple Regression Problem....85
Spoilers....85
Jupyter Notebook....85
Imports....86
A Simple Regression Problem....86
Data Generation....87
Synthetic Data Generation....87
Gradient Descent....88
Step 0 - Random Initialization....89
Step 1 - Compute Model’s Predictions....89
Step 2 - Compute the Loss....89
Step 3 - Compute the Gradients....90
Step 4 - Update the Parameters....91
Step 5 - Rinse and Repeat!....92
Linear Regression in NumPy....92
PyTorch....96
Tensor....96
Loading Data, Devices, and CUDA....101
Creating Parameters....106
Autograd....110
backward....110
grad....112
zero_....113
Updating Parameters....114
no_grad....117
Dynamic Computation Graph....117
Optimizer....121
step / zero_grad....122
Loss....124
Model....128
Parameters....130
state_dict....131
Device....132
Forward Pass....132
train....134
Nested Models....134
Sequential Models....137
Layers....138
Putting It All Together....140
Data Preparation....141
Model Configuration....142
Model Training....143
Recap....146
Chapter 2: Rethinking the Training Loop....148
Spoilers....148
Jupyter Notebook....148
Imports....148
Rethinking the Training Loop....149
Training Step....155
Dataset....159
TensorDataset....161
DataLoader....161
Mini-Batch Inner Loop....167
Random Split....170
Evaluation....172
Plotting Losses....176
TensorBoard....177
Running It Inside a Notebook....177
Running It Separately (Local Installation)....179
Running It Separately (Binder)....180
SummaryWriter....180
add_graph....182
add_scalars....183
Saving and Loading Models....189
Model State....189
Saving....189
Resuming Training....190
Deploying / Making Predictions....193
Setting the Model’s Mode....194
Putting It All Together....195
Recap....198
Chapter 2.1: Going Classy....200
Spoilers....200
Jupyter Notebook....200
Imports....200
Going Classy....201
The Class....201
The Constructor....202
Arguments....202
Placeholders....203
Variables....205
Functions....205
Training Methods....212
Saving and Loading Models....216
Visualization Methods....217
The Full Code....218
Classy Pipeline....219
Model Training....222
Making Predictions....224
Checkpointing....224
Resuming Training....225
Putting It All Together....227
Recap....229
Chapter 3: A Simple Classification Problem....231
Spoilers....231
Jupyter Notebook....231
Imports....231
A Simple Classification Problem....232
Data Generation....233
Data Preparation....234
Model....235
Logits....236
Probabilities....237
Odds Ratio....237
Log Odds Ratio....239
From Logits to Probabilities....240
Sigmoid....242
Logistic Regression....243
Loss....246
BCELoss....248
BCEWithLogitsLoss....250
Imbalanced Dataset....253
Model Configuration....256
Model Training....257
Decision Boundary....261
Classification Threshold....266
Confusion Matrix....268
Metrics....270
True and False Positive Rates....270
Precision and Recall....273
Accuracy....274
Trade-offs and Curves....275
Low Threshold....275
High Threshold....277
ROC and PR Curves....278
The Precision Quirk....280
Best and Worst Curves....281
Comparing Models....282
Putting It All Together....284
Recap....286
Part II: Computer Vision....289
Chapter 4: Classifying Images....290
Spoilers....290
Jupyter Notebook....290
Imports....290
Classifying Images....291
Data Generation....292
Shape (NCHW vs NHWC)....296
Torchvision....299
Datasets....299
Models....299
Transforms....299
Transforms on Images....303
Transforms on Tensor....303
Normalize Transform....304
Composing Transforms....306
Data Preparation....308
Dataset Transforms....308
SubsetRandomSampler....310
Data Augmentation Transforms....313
WeightedRandomSampler....314
Seeds and more (seeds)....318
Putting It Together....320
Pixels as Features....321
Shallow Model....323
Notation....324
Model Configuration....325
Model Training....326
Deep-ish Model....326
Model Configuration....329
Model Training....329
Show Me the Math!....331
Show Me the Code!....333
Weights as Pixels....336
Activation Functions....337
Sigmoid....337
Hyperbolic Tangent (TanH)....339
Rectified Linear Unit (ReLU)....340
Leaky ReLU....342
Parametric ReLU (PReLU)....344
Deep Model....345
Model Configuration....346
Model Training....347
Show Me the Math Again!....349
Putting It All Together....351
Recap....355
Bonus Chapter: Feature Space....356
Two-Dimensional Feature Space....356
Transformations....357
A Two-Dimensional Model....358
Decision Boundary, Activation Style!....360
More Functions, More Boundaries....363
More Layers, More Boundaries....365
More Dimensions, More Boundaries....366
Recap....368
Chapter 5: Convolutions....369
Spoilers....369
Jupyter Notebook....369
Imports....369
Convolutions....370
Filter / Kernel....370
Convolving....372
Moving Around....373
Shape....376
Convolving in PyTorch....377
Striding....381
Padding....383
A REAL Filter....387
Pooling....389
Flattening....391
Dimensions....392
Typical Architecture....392
LeNet-5....393
A Multiclass Classification Problem....396
Data Generation....396
Data Preparation....397
Loss....400
Logits....400
Softmax....400
LogSoftmax....403
Negative Log-Likelihood Loss....403
Cross-Entropy Loss....407
Classification Losses Showdown!....409
Model Configuration....409
Model Training....412
Visualizing Filters and More!....413
Visualizing Filters....416
Hooks....419
Visualizing Feature Maps....428
Visualizing Classifier Layers....431
Accuracy....432
Loader Apply....434
Putting It All Together....435
Recap....439
Chapter 6: Rock, Paper, Scissors....441
Spoilers....441
Jupyter Notebook....441
Imports....441
Rock, Paper, Scissors…....442
Rock Paper Scissors Dataset....443
Data Preparation....444
ImageFolder....444
Standardization....445
The Real Datasets....449
Three-Channel Convolutions....450
Fancier Model....453
Dropout....456
Two-Dimensional Dropout....462
Model Configuration....463
Optimizer....463
Learning Rate....464
Model Training....464
Accuracy....465
Regularizing Effect....465
Visualizing Filters....467
Learning Rates....469
Finding LR....471
Adaptive Learning Rate....479
Moving Average (MA)....479
EWMA....480
EWMA Meets Gradients....486
Adam....487
Visualizing Adapted Gradients....488
Stochastic Gradient Descent (SGD)....495
Momentum....496
Nesterov....499
Flavors of SGD....500
Learning Rate Schedulers....503
Epoch Schedulers....504
Validation Loss Scheduler....505
Schedulers in StepByStep — Part I....507
Mini-Batch Schedulers....510
Schedulers in StepByStep — Part II....512
Scheduler Paths....514
Adaptive vs Cycling....517
Putting It All Together....517
Recap....520
Chapter 7: Transfer Learning....523
Spoilers....523
Jupyter Notebook....523
Imports....524
Transfer Learning....524
ImageNet....525
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)....525
ILSVRC-2012....526
AlexNet (SuperVision Team)....526
ILSVRC-2014....526
VGG....527
Inception (GoogLeNet Team)....527
ILSVRC-2015....527
ResNet (MSRA Team)....528
Comparing Architectures....528
Transfer Learning in Practice....530
Pre-Trained Model....530
Adaptive Pooling....532
Loading Weights....533
Model Freezing....534
Top of the Model....535
Model Configuration....538
Data Preparation....538
Model Training....540
Generating a Dataset of Features....541
Top Model....544
Auxiliary Classifiers (Side-Heads)....546
1x1 Convolutions....549
Inception Modules....552
Batch Normalization....557
Running Statistics....560
Evaluation Phase....566
Momentum....567
BatchNorm2d....569
Other Normalizations....570
Small Summary....570
Residual Connections....571
Learning the Identity....571
The Power of Shortcuts....575
Residual Blocks....576
Putting It All Together....579
Fine-Tuning....580
Feature Extraction....581
Recap....583
Extra Chapter: Vanishing and Exploding Gradients....586
Spoilers....586
Jupyter Notebook....586
Imports....586
Vanishing and Exploding Gradients....587
Vanishing Gradients....587
Ball Dataset and Block Model....588
Weights, Activations, and Gradients....590
Initialization Schemes....592
Batch Normalization....595
Exploding Gradients....596
Data Generation & Preparation....596
Model Configuration & Training....597
Gradient Clipping....599
Value Clipping....600
Norm Clipping (or Gradient Scaling)....601
Model Configuration & Training....605
Clipping with Hooks....608
Recap....609
Part III: Sequences....611
Chapter 8: Sequences....612
Spoilers....612
Jupyter Notebook....612
Imports....613
Sequences....613
Data Generation....614
Recurrent Neural Networks (RNNs)....616
RNN Cell....619
RNN Layer....626
Shapes....629
Stacked RNN....632
Bidirectional RNN....636
Square Model....640
Data Generation....640
Data Preparation....641
Model Configuration....641
Model Training....643
Visualizing the Model....644
Transformed Inputs....644
Hidden States....645
The Journey of a Hidden State....647
Can We Do Better?....649
Gated Recurrent Units (GRUs)....650
GRU Cell....651
GRU Layer....659
Square Model II — The Quickening....660
Model Configuration & Training....661
Visualizing the Model....662
Hidden States....662
The Journey of a Gated Hidden State....663
Can We Do Better?....665
Long Short-Term Memory (LSTM)....665
LSTM Cell....666
LSTM Layer....674
Square Model III — The Sorcerer....675
Model Configuration & Training....676
Visualizing the Hidden States....677
Variable-Length Sequences....678
Padding....679
Packing....682
Unpacking (to padded)....686
Packing (from padded)....688
Variable-Length Dataset....689
Data Preparation....689
Collate Function....691
Square Model IV — Packed....692
Model Configuration & Training....694
1D Convolutions....695
Shapes....696
Multiple Features or Channels....697
Dilation....699
Data Preparation....701
Model Configuration & Training....701
Visualizing the Model....703
Putting It All Together....704
Fixed-Length Dataset....704
Variable-Length Dataset....705
There Can Be Only ONE … Model....706
Model Configuration & Training....707
Recap....708
Chapter 9 — Part I: Sequence-to-Sequence....711
Spoilers....711
Jupyter Notebook....711
Imports....711
Sequence-to-Sequence....712
Data Generation....712
Encoder-Decoder Architecture....714
Encoder....714
Decoder....716
Teacher Forcing....721
Encoder + Decoder....723
Data Preparation....726
Model Configuration & Training....728
Visualizing Predictions....729
Can We Do Better?....729
Attention....730
"Values"....733
"Keys" and "Queries"....733
Computing the Context Vector....735
Scoring Method....739
Attention Scores....741
Scaled Dot Product....742
Attention Mechanism....748
Source Mask....751
Decoder....753
Encoder + Decoder + Attention....755
Model Configuration & Training....757
Visualizing Predictions....758
Visualizing Attention....759
Multi-Headed Attention....760
Chapter 9 — Part II: Sequence-to-Sequence....765
Spoilers....765
Self-Attention....765
Encoder....766
Cross-Attention....771
Decoder....773
Subsequent Inputs and Teacher Forcing....775
Attention Scores....776
Target Mask (Training)....777
Target Mask (Evaluation/Prediction)....779
Encoder + Decoder + Self-Attention....783
Model Configuration & Training....787
Visualizing Predictions....788
Sequential No More....789
Positional Encoding (PE)....790
Encoder + Decoder + PE....803
Model Configuration & Training....805
Visualizing Predictions....806
Visualizing Attention....807
Putting It All Together....809
Data Preparation....809
Model Assembly....810
Encoder + Decoder + Positional Encoding....812
Self-Attention "Layers"....813
Attention Heads....815
Model Configuration & Training....817
Recap....818
Chapter 10: Transform and Roll Out....821
Spoilers....821
Jupyter Notebook....821
Imports....821
Transform and Roll Out....822
Narrow Attention....822
Chunking....823
Multi-Headed Attention....826
Stacking Encoders and Decoders....832
Wrapping "Sub-Layers"....833
Transformer Encoder....836
Transformer Decoder....841
Layer Normalization....846
Batch vs Layer....851
Our Seq2Seq Problem....853
Projections or Embeddings....854
The Transformer....856
Data Preparation....859
Model Configuration & Training....860
Visualizing Predictions....863
The PyTorch Transformer....863
Model Configuration & Training....869
Visualizing Predictions....870
Vision Transformer....871
Data Generation & Preparation....871
Patches....874
Rearranging....874
Embeddings....876
Special Classifier Token....878
The Model....882
Model Configuration & Training....884
Putting It All Together....886
Data Preparation....886
Model Assembly....886
1. Encoder-Decoder....888
2. Encoder....891
3. Decoder....892
4. Positional Encoding....893
5. Encoder "Layer"....894
6. Decoder "Layer"....895
7. "Sub-Layer" Wrapper....896
8. Multi-Headed Attention....898
Model Configuration & Training....900
Recap....901
Part IV: Natural Language Processing....904
Chapter 11: Down the Yellow Brick Rabbit Hole....905
Spoilers....905
Jupyter Notebook....905
Additional Setup....906
Imports....906
"Down the Yellow Brick Rabbit Hole"....908
Building a Dataset....908
Sentence Tokenization....910
HuggingFace’s Dataset....916
Loading a Dataset....917
Attributes....918
Methods....919
Word Tokenization....921
Vocabulary....925
HuggingFace’s Tokenizer....931
Before Word Embeddings....939
One-Hot Encoding (OHE)....939
Bag-of-Words (BoW)....940
Language Models....941
N-grams....943
Continuous Bag-of-Words (CBoW)....944
Word Embeddings....944
Word2Vec....944
What Is an Embedding Anyway?....949
Pre-trained Word2Vec....952
Global Vectors (GloVe)....953
Using Word Embeddings....956
Vocabulary Coverage....956
Tokenizer....959
Special Tokens' Embeddings....960
Model I — GloVe + Classifier....962
Data Preparation....962
Pre-trained PyTorch Embeddings....964
Model Configuration & Training....966
Model II — GloVe + Transformer....967
Visualizing Attention....970
Contextual Word Embeddings....973
ELMo....974
BERT....982
Document Embeddings....984
Model III — Preprocessed Embeddings....987
Data Preparation....987
Model Configuration & Training....989
BERT....990
Tokenization....993
Input Embeddings....995
Pre-training Tasks....1000
Masked Language Model (MLM)....1000
Next Sentence Prediction (NSP)....1003
Outputs....1004
Model IV — Classifying Using BERT....1009
Data Preparation....1011
Model Configuration & Training....1013
Fine-Tuning with HuggingFace....1014
Sequence Classification (or Regression)....1014
Tokenized Dataset....1017
Trainer....1019
Predictions....1024
Pipelines....1026
More Pipelines....1027
GPT-2....1029
Putting It All Together....1033
Data Preparation....1033
"Packed" Dataset....1034
Model Configuration & Training....1037
Generating Text....1039
Recap....1041
Thank You!....1043
In 2019, I published a PyTorch tutorial on Towards Data Science, and I was amazed by the reaction from the readers! Their feedback motivated me to write this book to help beginners start their journey into Deep Learning and PyTorch. I hope you enjoy reading this book as much as I enjoyed writing it.
UPDATE (July 19th, 2022): The Spanish version of Part I, Fundamentals, was published today: https://leanpub.com/pytorch_ES
UPDATE (February 23rd, 2022): The paperback edition is available now (the book had to be split into 3 volumes for printing). For more details, please check pytorchstepbystep.com.
UPDATE (February 13th, 2022): The latest revised edition (v1.1.1) was published today to address small changes to Chapters 9 and 10 that weren't included in the previous revision.
UPDATE (January 23rd, 2022): The revised edition (v1.1) was published today - better graphics, improved formatting, larger page size (thus reducing page count from 1187 to 1045 pages - no content was removed!). If you already bought the book, you can download the new version at any time!
If you're looking for a book where you can learn about Deep Learning and PyTorch without having to spend hours deciphering cryptic text and code, and that's easy and enjoyable to read, this is it :-)
The book covers everything from the basics of gradient descent all the way up to fine-tuning large NLP models (BERT and GPT-2) using HuggingFace. It is divided into four parts: Fundamentals, Computer Vision, Sequences, and Natural Language Processing.