Cover
Copyright
Table of Contents
Preface
What Is Web Scraping?
Why Web Scraping?
About This Book
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Part I. Building Scrapers
Chapter 1. How the Internet Works
Networking
Physical Layer
Data Link Layer
Network Layer
Transport Layer
Session Layer
Presentation Layer
Application Layer
HTML
CSS
JavaScript
Watching Websites with Developer Tools
Chapter 2. The Legalities and Ethics of Web Scraping
Trademarks, Copyrights, Patents, Oh My!
Copyright Law
Trespass to Chattels
The Computer Fraud and Abuse Act
robots.txt and Terms of Service
Three Web Scrapers
eBay v. Bidder’s Edge and Trespass to Chattels
United States v. Auernheimer and the Computer Fraud and Abuse Act
Field v. Google: Copyright and robots.txt
Chapter 3. Applications of Web Scraping
Classifying Projects
E-commerce
Marketing
Academic Research
Product Building
Travel
Sales
SERP Scraping
Chapter 4. Writing Your First Web Scraper
Installing and Using Jupyter
Connecting
An Introduction to BeautifulSoup
Installing BeautifulSoup
Running BeautifulSoup
Connecting Reliably and Handling Exceptions
Chapter 5. Advanced HTML Parsing
Another Serving of BeautifulSoup
find() and find_all() with BeautifulSoup
Other BeautifulSoup Objects
Navigating Trees
Regular Expressions
Regular Expressions and BeautifulSoup
Accessing Attributes
Lambda Expressions
You Don’t Always Need a Hammer
Chapter 6. Writing Web Crawlers
Traversing a Single Domain
Crawling an Entire Site
Collecting Data Across an Entire Site
Crawling Across the Internet
Chapter 7. Web Crawling Models
Planning and Defining Objects
Dealing with Different Website Layouts
Structuring Crawlers
Crawling Sites Through Search
Crawling Sites Through Links
Crawling Multiple Page Types
Thinking About Web Crawler Models
Chapter 8. Scrapy
Installing Scrapy
Initializing a New Spider
Writing a Simple Scraper
Spidering with Rules
Creating Items
Outputting Items
The Item Pipeline
Logging with Scrapy
More Resources
Chapter 9. Storing Data
Media Files
Storing Data to CSV
MySQL
Installing MySQL
Some Basic Commands
Integrating with Python
Database Techniques and Good Practice
“Six Degrees” in MySQL
Email
Part II. Advanced Scraping
Chapter 10. Reading Documents
Document Encoding
Text
Text Encoding and the Global Internet
CSV
Reading CSV Files
PDF
Microsoft Word and .docx
Chapter 11. Working with Dirty Data
Cleaning Text
Working with Normalized Text
Cleaning Data with Pandas
Cleaning
Indexing, Sorting, and Filtering
More About Pandas
Chapter 12. Reading and Writing Natural Languages
Summarizing Data
Markov Models
Six Degrees of Wikipedia: Conclusion
Natural Language Toolkit
Installation and Setup
Statistical Analysis with NLTK
Lexicographical Analysis with NLTK
Additional Resources
Chapter 13. Crawling Through Forms and Logins
Python Requests Library
Submitting a Basic Form
Radio Buttons, Checkboxes, and Other Inputs
Submitting Files and Images
Handling Logins and Cookies
HTTP Basic Access Authentication
Other Form Problems
Chapter 14. Scraping JavaScript
A Brief Introduction to JavaScript
Common JavaScript Libraries
Ajax and Dynamic HTML
Executing JavaScript in Python with Selenium
Installing and Running Selenium
Selenium Selectors
Waiting to Load
XPath
Additional Selenium WebDrivers
Handling Redirects
A Final Note on JavaScript
Chapter 15. Crawling Through APIs
A Brief Introduction to APIs
HTTP Methods and APIs
More About API Responses
Parsing JSON
Undocumented APIs
Finding Undocumented APIs
Documenting Undocumented APIs
Combining APIs with Other Data Sources
More About APIs
Chapter 16. Image Processing and Text Recognition
Overview of Libraries
Pillow
Tesseract
NumPy
Processing Well-Formatted Text
Adjusting Images Automatically
Scraping Text from Images on Websites
Reading CAPTCHAs and Training Tesseract
Training Tesseract
Retrieving CAPTCHAs and Submitting Solutions
Chapter 17. Avoiding Scraping Traps
A Note on Ethics
Looking Like a Human
Adjust Your Headers
Handling Cookies with JavaScript
TLS Fingerprinting
Timing Is Everything
Common Form Security Features
Hidden Input Field Values
Avoiding Honeypots
The Human Checklist
Chapter 18. Testing Your Website with Scrapers
An Introduction to Testing
What Are Unit Tests?
Python unittest
Testing Wikipedia
Testing with Selenium
Interacting with the Site
Chapter 19. Web Scraping in Parallel
Processes Versus Threads
Multithreaded Crawling
Race Conditions and Queues
More Features of the Threading Module
Multiple Processes
Multiprocess Crawling
Communicating Between Processes
Multiprocess Crawling—Another Approach
Chapter 20. Web Scraping Proxies
Why Use Remote Servers?
Avoiding IP Address Blocking
Portability and Extensibility
Tor
PySocks
Remote Hosting
Running from a Website-Hosting Account
Running from the Cloud
Moving Forward
Web Scraping Proxies
ScrapingBee
ScraperAPI
Oxylabs
Zyte
Additional Resources
Index
Colophon
If programming is magic, then web scraping is surely a form of wizardry. By writing a simple automated program, you can query web servers, request data, and parse it to extract the information you need. This thoroughly updated third edition not only introduces you to web scraping but also serves as a comprehensive guide to scraping almost every type of data from the modern web.
Part I focuses on web scraping mechanics: using Python to request information from a web server, performing basic handling of the server's response, and interacting with sites in an automated fashion. Part II explores a variety of more specific tools and applications to fit any web scraping scenario you're likely to encounter.