Web Scraping with Python: Data Extraction from the Modern Web, 3rd Edition

Author: Ryan Mitchell
Release date: 2024
Publisher: O’Reilly Media, Inc.
Pages: 352
File size: 2.1 MB
File type: PDF
Added by: codelibs

Cover....1

Copyright....3

Table of Contents....4

Preface....10

What Is Web Scraping?....10

Why Web Scraping?....11

About This Book....12

Conventions Used in This Book....13

Using Code Examples....14

O’Reilly Online Learning....15

How to Contact Us....15

Acknowledgments....16

Part I. Building Scrapers....18

Chapter 1. How the Internet Works....20

Networking....21

Physical Layer....22

Data Link Layer....22

Network Layer....23

Transport Layer....23

Session Layer....24

Presentation Layer....24

Application Layer....24

HTML....24

CSS....26

JavaScript....28

Watching Websites with Developer Tools....30

Chapter 2. The Legalities and Ethics of Web Scraping....34

Trademarks, Copyrights, Patents, Oh My!....34

Copyright Law....35

Trespass to Chattels....38

The Computer Fraud and Abuse Act....40

robots.txt and Terms of Service....41

Three Web Scrapers....45

eBay v. Bidder’s Edge and Trespass to Chattels....45

United States v. Auernheimer and the Computer Fraud and Abuse Act....46

Field v. Google: Copyright and robots.txt....48

Chapter 3. Applications of Web Scraping....50

Classifying Projects....50

E-commerce....51

Marketing....52

Academic Research....53

Product Building....54

Travel....55

Sales....56

SERP Scraping....57

Chapter 4. Writing Your First Web Scraper....58

Installing and Using Jupyter....58

Connecting....60

An Introduction to BeautifulSoup....61

Installing BeautifulSoup....61

Running BeautifulSoup....63

Connecting Reliably and Handling Exceptions....66

Chapter 5. Advanced HTML Parsing....70

Another Serving of BeautifulSoup....70

find() and find_all() with BeautifulSoup....72

Other BeautifulSoup Objects....74

Navigating Trees....75

Regular Expressions....79

Regular Expressions and BeautifulSoup....83

Accessing Attributes....84

Lambda Expressions....85

You Don’t Always Need a Hammer....86

Chapter 6. Writing Web Crawlers....88

Traversing a Single Domain....88

Crawling an Entire Site....92

Collecting Data Across an Entire Site....95

Crawling Across the Internet....98

Chapter 7. Web Crawling Models....104

Planning and Defining Objects....105

Dealing with Different Website Layouts....108

Structuring Crawlers....113

Crawling Sites Through Search....113

Crawling Sites Through Links....116

Crawling Multiple Page Types....118

Thinking About Web Crawler Models....120

Chapter 8. Scrapy....122

Installing Scrapy....122

Initializing a New Spider....123

Writing a Simple Scraper....124

Spidering with Rules....125

Creating Items....130

Outputting Items....132

The Item Pipeline....133

Logging with Scrapy....136

More Resources....136

Chapter 9. Storing Data....138

Media Files....138

Storing Data to CSV....141

MySQL....143

Installing MySQL....144

Some Basic Commands....146

Integrating with Python....149

Database Techniques and Good Practice....152

“Six Degrees” in MySQL....154

Email....157

Part II. Advanced Scraping....160

Chapter 10. Reading Documents....162

Document Encoding....162

Text....163

Text Encoding and the Global Internet....164

CSV....168

Reading CSV Files....168

PDF....170

Microsoft Word and .docx....172

Chapter 11. Working with Dirty Data....176

Cleaning Text....177

Working with Normalized Text....181

Cleaning Data with Pandas....183

Cleaning....185

Indexing, Sorting, and Filtering....188

More About Pandas....189

Chapter 12. Reading and Writing Natural Languages....190

Summarizing Data....191

Markov Models....195

Six Degrees of Wikipedia: Conclusion....198

Natural Language Toolkit....201

Installation and Setup....201

Statistical Analysis with NLTK....202

Lexicographical Analysis with NLTK....205

Additional Resources....208

Chapter 13. Crawling Through Forms and Logins....210

Python Requests Library....210

Submitting a Basic Form....211

Radio Buttons, Checkboxes, and Other Inputs....214

Submitting Files and Images....215

Handling Logins and Cookies....216

HTTP Basic Access Authentication....217

Other Form Problems....219

Chapter 14. Scraping JavaScript....220

A Brief Introduction to JavaScript....221

Common JavaScript Libraries....222

Ajax and Dynamic HTML....225

Executing JavaScript in Python with Selenium....226

Installing and Running Selenium....226

Selenium Selectors....229

Waiting to Load....230

XPath....232

Additional Selenium WebDrivers....233

Handling Redirects....233

A Final Note on JavaScript....235

Chapter 15. Crawling Through APIs....238

A Brief Introduction to APIs....238

HTTP Methods and APIs....240

More About API Responses....241

Parsing JSON....243

Undocumented APIs....244

Finding Undocumented APIs....245

Documenting Undocumented APIs....247

Combining APIs with Other Data Sources....247

More About APIs....251

Chapter 16. Image Processing and Text Recognition....252

Overview of Libraries....253

Pillow....253

Tesseract....254

NumPy....256

Processing Well-Formatted Text....256

Adjusting Images Automatically....259

Scraping Text from Images on Websites....262

Reading CAPTCHAs and Training Tesseract....265

Training Tesseract....266

Retrieving CAPTCHAs and Submitting Solutions....273

Chapter 17. Avoiding Scraping Traps....276

A Note on Ethics....276

Looking Like a Human....277

Adjust Your Headers....278

Handling Cookies with JavaScript....279

TLS Fingerprinting....281

Timing Is Everything....284

Common Form Security Features....284

Hidden Input Field Values....285

Avoiding Honeypots....286

The Human Checklist....288

Chapter 18. Testing Your Website with Scrapers....290

An Introduction to Testing....291

What Are Unit Tests?....291

Python unittest....292

Testing Wikipedia....294

Testing with Selenium....296

Interacting with the Site....297

Chapter 19. Web Scraping in Parallel....302

Processes Versus Threads....302

Multithreaded Crawling....303

Race Conditions and Queues....306

More Features of the Threading Module....309

Multiple Processes....311

Multiprocess Crawling....313

Communicating Between Processes....314

Multiprocess Crawling—Another Approach....316

Chapter 20. Web Scraping Proxies....318

Why Use Remote Servers?....318

Avoiding IP Address Blocking....319

Portability and Extensibility....320

Tor....320

PySocks....322

Remote Hosting....323

Running from a Website-Hosting Account....323

Running from the Cloud....324

Moving Forward....325

Web Scraping Proxies....326

ScrapingBee....327

ScraperAPI....329

Oxylabs....331

Zyte....335

Additional Resources....338

Index....340

Colophon....349

If programming is magic, then web scraping is surely a form of wizardry. By writing a simple automated program, you can query web servers, request data, and parse it to extract the information you need. This thoroughly updated third edition not only introduces you to web scraping but also serves as a comprehensive guide to scraping almost every type of data from the modern web.

Part I focuses on web scraping mechanics: using Python to request information from a web server, performing basic handling of the server's response, and interacting with sites in an automated fashion. Part II explores a variety of more specific tools and applications to fit any web scraping scenario you're likely to encounter.

  • Parse complicated HTML pages
  • Develop crawlers with the Scrapy framework
  • Learn methods to store the data you scrape
  • Read and extract data from documents
  • Clean and normalize badly formatted data
  • Read and write natural languages
  • Crawl through forms and logins
  • Scrape JavaScript and crawl through APIs
  • Use and write image-to-text software
  • Avoid scraping traps and bot blockers
  • Use scrapers to test your website